Ensemble Kalman Inversion: mean-field limit and convergence analysis
Zhiyan Ding · Qin Li

Abstract
Ensemble Kalman Inversion (EKI) has been a very popular algorithm used in Bayesian inverse problems [22]. It samples particles from a prior distribution and introduces a motion to move the particles around in pseudo-time. As the pseudo-time goes to infinity, the method finds the minimizer of the objective function, and when the pseudo-time stops at 1, the ensemble distribution of the particles resembles, in some sense, the posterior distribution in the linear setting. The ideas trace back further to the Ensemble Kalman Filter and the associated analysis [14,30], but to this day, when viewed as a sampling method, why EKI works, and in what sense and with what rate the method converges, is still largely unknown. In this paper, we analyze the continuous version of EKI, a coupled SDE system, and prove the mean-field limit of this SDE system. In particular, we will show that: 1. as the number of particles goes to infinity, the empirical measure of the particles following the SDE converges to the solution of a Fokker-Planck equation in Wasserstein-2 distance with an optimal rate, for both the linear and the weakly nonlinear case; 2. the solution to the Fokker-Planck equation reconstructs the target distribution in finite time in the linear case, as suggested in [22].
Keywords
Ensemble Kalman Inversion, Wasserstein metric, mean-field limit, Fokker-Planck equation
Zhiyan Ding
Mathematics Department, University of Wisconsin-Madison, 480 Lincoln Dr., Madison, WI 53705 USA.
E-mail: [email protected]

Qin Li
Mathematics Department and Wisconsin Institutes of Discoveries, University of Wisconsin-Madison, 480 Lincoln Dr., Madison, WI 53705 USA.
E-mail: [email protected]
1 Introduction

How to sample from a target distribution is a central challenge in Bayesian inverse problems, especially when the to-be-reconstructed parameter lives in a high-dimensional space. Suppose a 1000-dimensional parameter needs to be reconstructed and we have a budget of 10,000 samples: how do we design algorithms so that these 10,000 samples look like they are i.i.d. samples from the posterior distribution?

There are abundant studies in this direction. Traditional methods such as Markov chain Monte Carlo (MCMC), including Metropolis-Hastings type algorithms, and sequential Monte Carlo (SMC) have garnered a large amount of investigation on both the theoretical and numerical sides [12,32,9]. Newer methods such as Stein variational gradient descent (SVGD), based on kernelized Stein discrepancy [26], Ensemble Kalman Inversion (EKI), and the Ensemble Kalman Sampler (EKS) [18,10] quickly drew attention from many related areas. There are advantages and disadvantages associated with each method.

In this paper, we study the Ensemble Kalman Inversion (EKI) method in depth [22]. The method can be viewed as one step of the popular Ensemble Kalman Filter (EnKF). EnKF was introduced initially for dynamical systems in [16,14,19,15,21]: one sequentially mixes in newly available data and evolves the probability distribution of the to-be-reconstructed parameters along the evolution of the dynamical system [25,24]. Each step of EnKF consists of a forecast stage, which amounts to evolving the underlying dynamical system, and an analysis stage, which amounts to adjusting the distribution of states. EKI only studies static problems: one is given a fixed set of data to reconstruct a fixed set of unknown parameters, and it is thus comparable to the analysis stage of EnKF. Such a connection was first documented in the beautiful paper [30] (and the references therein, e.g. [1,2]), and was discussed in depth in [22], where the authors fully developed the idea into an algorithm. The procedure is rather easy to understand: one samples a fixed number of particles i.i.d. according to the prior distribution and labels them the initial data at $t = 0$. The particles are then pushed around according to certain dynamics in (pseudo-)time, hoping that at $t = 1$ the particles look like they are i.i.d. samples from the posterior distribution.

The algorithm was designed on the discrete level, with $J$ particles moved around using step size $h$; the number of time steps ($N$ in our paper) is naturally $N = 1/h$, to ensure the pseudo-time stops at 1. The continuous version of the algorithm (with $h \to 0$) is a system of $J$ coupled SDEs, for which there are already a number of theoretical studies [33,34,3]. However, to the authors' understanding, despite some heuristic arguments [33,34], there has been no result discussing the $J \to \infty$ limit of the coupled SDE system, and in particular, for practical reasons, how this limit connects with the target distribution.

In this paper we give two results concerning this convergence.

– We prove, in both the linear and the weakly nonlinear case, that the coupled SDE system converges to a Fokker-Planck equation with an optimal rate in the Wasserstein-2 metric. The relevant results are Theorems 1 and 2, and the optimality is discussed after the statement of Theorem 1.
– We prove that the Fokker-Planck equation connects the prior distribution with the target posterior distribution only in the linear case. This is presented in Corollary 1. The nonlinear case can be vastly more complicated, as discussed in Section 4.2; see also [24].

On the technical level, the first result amounts to showing the mean-field limit of the SDE system. Indeed, we largely rely on the classical Dobrushin argument, which consists of constructing a "bridging SDE" and comparing the distance between the PDE and the bridging SDE, and the distance between the two SDE systems. The former is an established result in [17], and the latter amounts to bounding the flux and Brownian motion coefficients and then looping back for the Grönwall inequality. This argument, despite being very popular in the mean-field community [6,7,5,36] for dealing with particle systems in chemistry and biology, has rarely been applied to investigate sampling methods. The only exception known to us is [27], in which the authors proved that the continuous version of SVGD is the weak solution to a transport-type equation whose equilibrium state at infinite time is the target posterior distribution. However, due to the Grönwall nature of the argument, the constant blows up in infinite time, while the convergence to the equilibrium requires infinite time. EKI, however, stops at the finite time $t = 1$, and thus the constant is finite. Compared to other mean-field problems emerging in chemistry/biology (such as the Cucker-Smale model), the difficulty here mainly comes from the fact that the flux and diffusion coefficients rely on higher moments of the PDE solution, and thus we do not have properties such as Lipschitz continuity for the Grönwall inequality to apply directly.

The way to overcome these technical difficulties is to employ a bootstrapping argument: we assume the convergence is of a certain rate, and a lemma (Lemma 8) is then derived to show that this rate can be tightened. One continues this tightening process until the maximum rate is achieved (Proposition 2). The initial convergence rate can be as low as 0, meaning one only needs boundedness. This boundedness is shown in Lemma 3, Lemma 5, and Corollary 2. Theorems 1 and 2 are then direct consequences of Proposition 2, combined with Proposition 1, which itself is a simple application of the celebrated theorem from [17] (cited as Theorem 3 in this paper).

The second result amounts to a direct derivation.
The argument was hinted at in multiple papers [30,15,22], but we have not found an explicit derivation in the literature.

We would like to mention that in [20] the authors investigated the convergence of the moments using kinetic tools, a relevant class of methods for investigating the convergence of sampling methods; in [29], the authors drew the connection with the Schrödinger bridge problem; and in [31] the authors discussed the transition kernel's dependence in conjunction with dynamics versus analysis. These papers are not directly related to the results presented in this paper, but they shed light on the understanding of sampling in depth.

In Section 2, we give a quick overview of the method and present the continuous version, the SDE, of the algorithm. In Section 3 we summarize our own results, Theorem 1 and Theorem 2, and present the mean-field limit. In Section 4 we discuss the meaning of the results in the linear and nonlinear settings. Sections 5 and 6 are dedicated to proving the main theorems. Some calculations are rather technical and we leave them to the appendix.

2 Ensemble Kalman Inversion

Ensemble Kalman Inversion (EKI) was initially proposed as a gradient-free optimization method [22], but it has been widely used to find samples that are approximately drawn i.i.d. from the target posterior distribution if one stops the method in finite time. Getting i.i.d. (or approximately i.i.d.) samples from an arbitrarily given target distribution is a challenging task, and obtaining them in finite time makes it even harder. We briefly review the process of the method.

Suppose $u \in \mathbb{R}^L$ is the to-be-reconstructed parameter vector, and let $\mathcal{G}: \mathbb{R}^L \to \mathbb{R}^K$ be the parameter-to-observable map, namely
\[
y = \mathcal{G}(u) + \eta\,,
\]
where $y \in \mathbb{R}^K$ collects the observed data and $\eta$ denotes the noise in the measurement-taking. The general inverse problem amounts to reconstructing $u$ from $y$. The Bayesian inverse problem amounts to reconstructing the distribution of $u$ given $y$, under an assumption on the distribution of $\eta$. In this article we let $\eta \sim \mathcal{N}(0,\Gamma)$ be Gaussian noise independent of $u$.

Denote the loss functional $\Phi(\cdot\,; y): \mathbb{R}^L \to \mathbb{R}$ by
\[
\Phi(u; y) = \tfrac{1}{2}\,\left|y - \mathcal{G}(u)\right|^2_{\Gamma}\,,\qquad\text{where } |\cdot|_{\Gamma} := \left|\Gamma^{-1/2}\,\cdot\,\right|\,.
\]
Bayes' theorem states that the posterior distribution is the (normalized) product of the prior distribution and the likelihood function:
\[
\mu_{\mathrm{pos}}(u) = \frac{1}{Z}\exp\left(-\Phi(u; y)\right)\mu_0(u)\,, \tag{1}
\]
where
\[
Z := \int_{\mathbb{R}^L}\exp\left(-\Phi(u; y)\right)\mu_0(u)\,\mathrm{d}u\,.
\]
Here $Z$ is the normalization factor, $\exp(-\Phi(u;y))$ is the likelihood function, and $\mu_0$ is the prior density function that collects one's prior knowledge about the distribution of $u$ (we suppose for now that it is absolutely continuous with respect to the Lebesgue measure). This so-called posterior distribution represents the probability measure of the to-be-reconstructed parameter $u$, blending the prior knowledge and the collected data $y$, and taking the measurement error $\eta$ into account. See more details in [8,35].

2.1 Ensemble Kalman Inversion

The solution of the Bayesian inverse problem is given by (1), but in practice one still needs to generate a number of samples that represent this target distribution.
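Before turning to sampling algorithms, it may help to make (1) concrete. The following minimal sketch (our own illustration, not part of the paper's algorithm) evaluates the unnormalized log-posterior $-\Phi(u;y) + \log\mu_0(u)$ for a toy linear forward map and a Gaussian prior; all dimensions and variable names here are illustrative assumptions.

```python
# A minimal sketch of the posterior density (1), up to the constant -log Z,
# for an assumed toy linear forward map G(u) = A u and a Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
L, K = 2, 3                                  # parameter / data dimensions
A = rng.standard_normal((K, L))              # forward map G(u) = A u
Gamma = np.eye(K)                            # noise covariance of eta
u_true = np.array([1.0, -0.5])
y = A @ u_true + rng.multivariate_normal(np.zeros(K), Gamma)

def log_posterior_unnormalized(u, prior_mean, prior_cov):
    """-Phi(u; y) + log mu_0(u), dropping the constant -log Z."""
    misfit = y - A @ u
    Phi = 0.5 * misfit @ np.linalg.solve(Gamma, misfit)   # 0.5*|y-G(u)|_Gamma^2
    d = u - prior_mean
    log_prior = -0.5 * d @ np.linalg.solve(prior_cov, d)  # Gaussian prior
    return -Phi + log_prior

print(log_posterior_unnormalized(u_true, np.zeros(L), np.eye(L)))
```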
The samples representing (1) can later be used to estimate quantities such as moments.

There is a large number of algorithms developed toward this end, including the classical MCMC (Markov chain Monte Carlo) method, the sequential Monte Carlo method, and the newly developed SVGD (Stein variational gradient descent), birth-death Langevin, and Ensemble Kalman Sampling, among many others [26,18,28]. It is not our intention to compare these different methods. In this paper, we focus on Ensemble Kalman Inversion and give a sharp estimate of the convergence rate of the method. We emphasize that EKI was developed as an optimization method and is widely used as a sampling method; we mainly discuss its performance as a sampling method in this article.

In the setup of EKI, a fixed number of particles are first sampled according to the prior distribution, call them $\{u_0^j\}_{j=1}^J$ (with $0$ in the subscript standing for the initial time), and these particles are then propagated in pseudo-time according to a certain flow defined by the ensemble mean and covariance. Hopefully, by the time the pseudo-time reaches 1, the particles can be seen as i.i.d. drawn from the posterior distribution. The method is summarized in Algorithm 1; a runnable sketch is given right after the algorithm.

Algorithm 1 Ensemble Kalman Inversion

Preparation:
1. Input: $J$ (number of particles); $h$ (time step); $N = 1/h$ (stopping index); $\Gamma$; and $y$ (data).
2. Initial: $\{u_0^j\}$ sampled from the initial distribution induced by the density function $\mu_0$.

Run: Set time step $n = 0$. While $n < N$:
1. Define empirical means and covariances:
\[
\bar{u}_n = \frac{1}{J}\sum_{j=1}^J u_n^j\,,\qquad \bar{\mathcal{G}}_n = \frac{1}{J}\sum_{j=1}^J \mathcal{G}(u_n^j)\,,
\]
\[
\mathrm{C}^{pp}_n(u) = \frac{1}{J}\sum_{j=1}^J \left(\mathcal{G}(u_n^j) - \bar{\mathcal{G}}_n\right)\otimes\left(\mathcal{G}(u_n^j) - \bar{\mathcal{G}}_n\right)\,,\qquad
\mathrm{C}^{up}_n(u) = \frac{1}{J}\sum_{j=1}^J \left(u_n^j - \bar{u}_n\right)\otimes\left(\mathcal{G}(u_n^j) - \bar{\mathcal{G}}_n\right)\,. \tag{2}
\]
2. Artificially perturb the data (with $\xi^j_{n+1}$ drawn i.i.d. from $\mathcal{N}(0, h^{-1}\Gamma)$):
\[
y^j_{n+1} = y + \xi^j_{n+1}\,,\qquad \forall\, 1\le j\le J\,.
\]
3. Update (and set $n \to n+1$):
\[
r^j_n = y^j_{n+1} - \mathcal{G}(u^j_n)\,,\qquad
u^j_{n+1} = u^j_n + \mathrm{C}^{up}_n(u_n)\left(\mathrm{C}^{pp}_n(u_n) + h^{-1}\Gamma\right)^{-1} r^j_n\,, \tag{3}
\]
for all $1\le j\le J$.
end
Output: $\{u^j_N\}$.

Prior to running the algorithm, one first specifies the number of samples needed (denoted by $J$) and the number of steps one can take (denoted by $N$). The time-step size is then simply $h = 1/N$; this ensures that $t = 1$ is the final time. So in total there are two parameters in the algorithm:
1. the pseudo-time step $h$;
2. the number of particles $J$.

Along the evolution, at each time step one computes the sample mean and covariance in (2) and uses them to move the samples around according to (3). Upon finishing the algorithm in $N$ steps, one obtains a list of particles $\{u_N^j\}_{j=1}^J$ and defines the ensemble distribution
\[
M_{u_N} = \frac{1}{J}\sum_{j=1}^J \delta_{u_N^j}\,. \tag{4}
\]
It is our goal in this article to show, in both the linear and the nonlinear setup, when and how $M_{u_N}$ approximates the target posterior distribution induced by the posterior density function $\mu_{\mathrm{pos}}$. Since there are two parameters in the algorithm, the convergence of the algorithm to the posterior distribution should be established in the $h \to 0$, $J \to \infty$ limit.
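The following is a minimal Python sketch of Algorithm 1 (our own illustration, with no attempt at numerical efficiency); the forward map and the problem data `A`, `Gamma`, `y` are the assumed toy quantities from the previous snippet.

```python
# A minimal sketch of Algorithm 1 on an ensemble u0 of shape (J, L).
import numpy as np

def eki(G, y, Gamma, u0, h, rng):
    """Run EKI from pseudo-time t=0 to t=1 with step h."""
    J = u0.shape[0]
    N = int(round(1.0 / h))                      # N*h = 1: stop at t = 1
    u = u0.copy()
    for _ in range(N):
        g = np.array([G(uj) for uj in u])        # (J, K) forward evaluations
        du = u - u.mean(axis=0)                  # centered particles
        dg = g - g.mean(axis=0)                  # centered observables
        Cpp = dg.T @ dg / J                      # C^pp_n in (2)
        Cup = du.T @ dg / J                      # C^up_n in (2)
        xi = rng.multivariate_normal(np.zeros(len(y)), Gamma / h, size=J)
        r = (y + xi) - g                         # perturbed residuals, (3)
        K_n = Cup @ np.linalg.inv(Cpp + Gamma / h)   # Kalman-type gain
        u = u + r @ K_n.T                        # update (3)
    return u

# usage with the linear map of the previous sketch:
# rng = np.random.default_rng(1); u0 = rng.standard_normal((500, L))
# ensemble = eki(lambda v: A @ v, y, Gamma, u0, h=0.02, rng=rng)
```

Note that $h^{-1}\Gamma$ dominates $\mathrm{C}^{pp}_n$ for small $h$, which is what makes each update an $O(h)$ move and links (3) to the SDE (5) below.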
Remark 1 Four comments are in order.

1. We emphasize that $N$ and $h$ satisfy the relation $Nh = 1$, and thus $N$ is not a free parameter. This fact is easily overlooked. In fact, in all the previous theoretical studies that we found [33,3], people have been looking for convergence results where $h \to 0$ first and $N \to \infty$ afterwards; namely, it is
\[
\lim_{N\to\infty}\ \lim_{h\to 0}\qquad\text{instead of}\qquad \lim_{Nh = 1,\; h\to 0}
\]
that has been studied. These works lay the theoretical foundation for ours and build the wellposedness theory for the underlying SDE; we would like to emphasize, however, that the two limits do not commute. Exactly for this reason, when one considers $\lim_{Nh=1,\,h\to 0}$, a posterior distribution is obtained, but when the two limits are taken separately, the "collapsing" phenomenon is observed [33,22]. In this article, we stick to the finite-time $t = Nh = 1$ regime.

2. We do not aim at comparing different methods, but one immediate advantage of this method over MCMC or other classical sampling methods is worth mentioning: in this method, the number of samples is fixed, and the number of steps is also fixed. So instead of tracing the error in time and terminating the process on-the-fly whenever a tolerance is met, the number of particles is pre-set, and thus the numerical cost is known ahead of the computation. Indeed, exactly because of this, the error analysis is rather crucial: based on the error analysis, one can pre-determine the proper values of $J$ and $h$.

3. EKI shares some similarity with a very famous data assimilation method called the Ensemble Kalman Filter [14], which was itself derived from the Kalman filter with the mean and the covariance replaced by their ensemble versions. One main difference between EKI and EnKF is that EKI looks for the solution to a static problem, with the dynamics built in pseudo-time. EnKF, however, tries to blend information from the underlying dynamics, characterized by an ODE/PDE/SDE, with the collected data, using the Bayesian formulation; the time in EnKF is real. A beautiful set of analysis can be found in [25,24,13]. These works provide theoretical studies in the ensemble Kalman framework; however, they consider the discrete case where the time step size is $h = 1$. On the contrary, we study the continuum limit with $h \to 0$, and many technicalities are associated with the SDE's mean-field limit analysis, making the previous results not particularly useful in our setting.

4. Similar to EnKF, EKI also tries to translate particles from one distribution to another, and it records only the first two moments (mean and covariance). If the distribution fails to be Gaussian along the evolution, the information carried by the higher moments is simply removed from the system, unavoidably leading to numerical error. If the nonlinearity is weak, the higher moments can potentially be bounded, and there is still hope to control EKI's mean-field limit. We will explain this in more detail in Section 3, when we present the weakly nonlinear assumption in (6).

2.2 Continuum limit and dynamical system of $\{u_t^j\}$

EKI is an algorithm with discrete-in-time updates. Formally letting the time step $h \to 0$,
equation (3) becomes
\[
\mathrm{d}u_t^j = \mathrm{C}^{up}(u_t)\,\Gamma^{-1}\left(y - \mathcal{G}(u_t^j)\right)\mathrm{d}t + \mathrm{C}^{up}(u_t)\,\Gamma^{-1/2}\,\mathrm{d}W_t^j\,, \tag{5}
\]
where
\[
\mathrm{C}^{up}(u) = \frac{1}{J}\sum_{j=1}^J\left(u^j - \bar{u}\right)\otimes\left(\mathcal{G}(u^j) - \bar{\mathcal{G}}\right)\quad\text{with}\quad \bar{u} = \frac{1}{J}\sum_{j=1}^J u^j\,,\ \ \bar{\mathcal{G}} = \frac{1}{J}\sum_{j=1}^J \mathcal{G}(u^j)\,.
\]
Here $\otimes$ means that the first argument is viewed as a column vector while the second is viewed as a row vector. Indeed, as shown in [33,4], the method (3) can be viewed as the Euler-Maruyama discretization of this SDE. Let $\Omega$ be the sample space and $\mathcal{F}$ the $\sigma$-algebra $\sigma\left(u^j(t=0),\,1\le j\le J\right)$; the filtration is then introduced by the dynamics:
\[
\mathcal{F}_t = \sigma\left(u^j(t=0),\ W_s^j,\ 1\le j\le J,\ s\le t\right)\,.
\]
In [3], the authors showed the wellposedness of this SDE system under the linear assumption ($\mathcal{G} = Au$). The techniques, when combined with boundedness of moments, should work even when $\mathcal{G}$ is nonlinear. In a later section (in particular, Lemma 2), we prove the boundedness of the moments; however, how to explicitly incorporate these into the techniques of [3] for the wellposedness is beyond the focus of the current paper. In [33,4], the authors formally derived the continuum limit of the method and arrived at the SDE; the proof has not been made rigorous. Indeed, for the convergence of the Euler-Maruyama discretization, strong assumptions are imposed on the coefficients (transport and Brownian motion), and the nonlinearity induced by the covariance matrix makes the proof highly nontrivial. We believe that under certain conditions on the target distribution this could be made possible, but it is not directly related to deriving and proving the mean-field limit, and it is omitted from the current paper. A similar result under the EnKF framework [23] could potentially be useful in this direction.

In this paper, we start with the SDE, and we analyze its mean-field limit as $J \to \infty$ in the Wasserstein-2 metric. The limit is characterized by a Fokker-Planck (FP) type equation, and we will show that, in the linear setting, this FP equation recovers the posterior distribution, while in the nonlinear setting it deviates from the posterior distribution by a weight factor.
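For concreteness, the following minimal Euler-Maruyama sketch (our own illustration; the names `G`, `y`, `Gamma` are the assumed toy data from the earlier snippets) simulates the coupled SDE system (5). Replacing the Brownian increments by the perturbed data of Algorithm 1 recovers, to leading order in $h$, the discrete update (3).

```python
# A minimal Euler-Maruyama sketch of the coupled SDE system (5).
import numpy as np

def eki_sde(G, y, Gamma, u0, h, rng, T=1.0):
    """Euler-Maruyama for du^j = C^up(u) Gamma^{-1} (y - G(u^j)) dt
                               + C^up(u) Gamma^{-1/2} dW^j."""
    u = u0.copy()
    J = u.shape[0]
    Gi = np.linalg.inv(Gamma)
    S = np.linalg.cholesky(Gi)                   # S @ S.T = Gamma^{-1}
    for _ in range(int(round(T / h))):
        g = np.array([G(uj) for uj in u])        # (J, K)
        Cup = (u - u.mean(0)).T @ (g - g.mean(0)) / J
        drift = (y - g) @ Gi.T @ Cup.T           # row j: Cup Gi (y - G(u^j))
        dW = np.sqrt(h) * rng.standard_normal(g.shape)
        u = u + h * drift + dW @ S.T @ Cup.T     # row j: Cup S dW^j
    return u
```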
3 Main results

We present our main theorems in this section. To do so, we first unify the notation. In the paper we denote by $\mathbb{E}$ the expectation in the probability space $(\Omega, \mathcal{F}_t, \mathbb{P})$, and we often use $\rho_t$ as a short notation for $\rho(t,u)$. For any vectors $\{m^j\}_{j=1}^J$ and $\{n^j\}_{j=1}^J$, we denote
\[
\bar{m} = \frac{1}{J}\sum_{j=1}^J m^j\qquad\text{and}\qquad \mathrm{Cov}_{m,n} = \frac{1}{J}\sum_{j=1}^J\left(m^j - \bar{m}\right)\otimes\left(n^j - \bar{n}\right)\,,
\]
and we denote $\mathrm{Cov}_m = \mathrm{Cov}_{m,m}$. Here $\otimes$ means the first argument is viewed as a column vector while the second is viewed as a row vector. Similarly, for any probability density function $\rho$ and function $g$, we denote
\[
\mathbb{E}_\rho = \int_{\mathbb{R}^L} u\,\rho(u)\,\mathrm{d}u\,,\qquad \mathbb{E}_{g,\rho} = \int_{\mathbb{R}^L} g(u)\,\rho(u)\,\mathrm{d}u\,,
\]
\[
\mathrm{Cov}_\rho = \int_{\mathbb{R}^L}\left(u - \mathbb{E}_\rho\right)\otimes\left(u - \mathbb{E}_\rho\right)\rho(u)\,\mathrm{d}u\,,\qquad
\mathrm{Cov}_{\rho,g} = \int_{\mathbb{R}^L}\left(u - \mathbb{E}_\rho\right)\otimes\left(g(u) - \mathbb{E}_{g,\rho}\right)\rho(u)\,\mathrm{d}u\,.
\]
Apparently $\mathrm{Cov}_{g,\rho} = \mathrm{Cov}_{\rho,g}^\top$.

The distance we use to quantify the "smallness" is the Wasserstein-2 metric:

Definition 1 Let $\upsilon_1, \upsilon_2$ be two probability measures on $\left(\mathbb{R}^L, \mathcal{B}(\mathbb{R}^L)\right)$. The $W_2$-Wasserstein distance between $\upsilon_1$ and $\upsilon_2$ is defined as
\[
W_2(\upsilon_1,\upsilon_2) := \left(\inf_{\gamma\in\Gamma(\upsilon_1,\upsilon_2)}\int_{\mathbb{R}^L\times\mathbb{R}^L}|x-y|^2\,\mathrm{d}\gamma(x,y)\right)^{1/2}\,,
\]
where $\Gamma(\upsilon_1,\upsilon_2)$ denotes the collection of all measures on $\mathbb{R}^L\times\mathbb{R}^L$ with marginals $\upsilon_1$ and $\upsilon_2$ for $x$ and $y$, respectively. Here the $\upsilon_i$ can be either general probability measures or the measures induced by probability density functions $\upsilon_i$.

We also assume weak nonlinearity, meaning there is a matrix $A \in \mathcal{L}(\mathbb{R}^L,\mathbb{R}^K)$ such that
\[
\mathcal{G}(u) = Au + \mathrm{m}(u)\,, \tag{6}
\]
where $\mathrm{m}(u): \mathbb{R}^L \to \mathbb{R}^K$ is a smooth bounded function satisfying
\[
\mathrm{Range}(\mathrm{m}) \perp_{\Gamma^{-1}} \mathrm{Range}(A)\,,\qquad |\mathrm{m}(u)| + |\nabla_u\mathrm{m}(u)| \le M\,,
\]
for some constant $M > 0$ and all $u \in \mathbb{R}^L$. Here $a \perp_{\Gamma^{-1}} b$ means $a^\top\Gamma^{-1}b = 0$, where $a^\top$ is the transpose of $a$. This assumption plays a crucial role in the later proofs: it eliminates cross-terms such as $\mathrm{m}^\top\Gamma^{-1}A$ in the posterior distribution, and thus puts $\mathrm{m}$ entirely in the perpendicular direction of $\mathrm{Range}(A)$. The $\mathrm{m}^\top\Gamma^{-1}\mathrm{m}$ terms are then controlled using the boundedness condition, boiling the analysis down to the linear situation.
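Such maps $\mathrm{m}$ are easy to manufacture. The following minimal sketch (our own construction, not from the paper) projects an arbitrary smooth bounded map onto the $\Gamma^{-1}$-orthogonal complement of $\mathrm{Range}(A)$, producing an $\mathrm{m}$ satisfying (6); all numerical values are illustrative assumptions.

```python
# Building a weakly nonlinear perturbation m with Range(m) ⊥_{Γ^{-1}} Range(A).
import numpy as np

rng = np.random.default_rng(2)
L, K = 2, 4
A = rng.standard_normal((K, L))
Gamma = np.diag(rng.uniform(0.5, 2.0, K))
Gi = np.linalg.inv(Gamma)

# P removes the Range(A) component in the Gamma^{-1} inner product:
P = np.eye(K) - A @ np.linalg.inv(A.T @ Gi @ A) @ A.T @ Gi

def m(u):
    v = np.tanh(u.sum()) * np.ones(K)   # any smooth, bounded map R^L -> R^K
    return P @ v                        # projected: m(u)^T Gamma^{-1} A = 0

u = rng.standard_normal(L)
print(np.abs(m(u) @ Gi @ A).max())      # ~1e-16, confirming perpendicularity
```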
We further denote by $u^\dagger$ the "closest" solution of the linear component, and by $r$ the corresponding noise:
\[
y = Au^\dagger + r\,,\qquad\text{with}\quad r^\top\Gamma^{-1}A = 0\,. \tag{7}
\]
Then the loss functional is also explicit:
\[
\Phi(u;y) = \frac{1}{2}\left(u^\dagger - u\right)^\top A^\top\Gamma^{-1}A\left(u^\dagger - u\right) + \frac{1}{2}\left(r - \mathrm{m}(u)\right)^\top\Gamma^{-1}\left(r - \mathrm{m}(u)\right)\,,
\]
where we used the facts that $\mathrm{m} \perp_{\Gamma^{-1}} \mathrm{Range}(A)$ and $r \perp_{\Gamma^{-1}} \mathrm{Range}(A)$.

Under the weakly nonlinear assumption (6), the dynamical system for $\{u_t^j\}$, written in (5), can be expanded:
\[
\begin{aligned}
\mathrm{d}u_t^j = {}& \mathrm{Cov}_{u_t}A^\top\Gamma^{-1}A\left(u^\dagger - u_t^j\right)\mathrm{d}t + \mathrm{Cov}_{u_t}A^\top\Gamma^{-1/2}\,\mathrm{d}W_t^j \\
&+ \mathrm{Cov}_{u_t,\mathrm{m}}\,\Gamma^{-1}\left(r - \mathrm{m}(u_t^j)\right)\mathrm{d}t + \mathrm{Cov}_{u_t,\mathrm{m}}\,\Gamma^{-1/2}\,\mathrm{d}W_t^j\,.
\end{aligned} \tag{8}
\]
Our main theorem states as follows.

Theorem 1 (Main result 1: mean-field limit)
Under the weakly nonlinear assumption (6), the mean-field limit of $M_{u_t}$ is the probability distribution induced by $\rho(t,u)$. Here $M_{u_t}$ is the ensemble distribution of $\{u_t^j\}$, defined as in (4), and $\rho(t,u)$ is the strong solution to the following Fokker-Planck equation:
\[
\begin{cases}
\partial_t\rho = -\nabla_u\cdot\left(\left(y - \mathcal{G}(u)\right)^\top\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\,\rho\right) + \dfrac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\,H_u(\rho)\right)\,,\\[4pt]
\rho(0,u) = \mu_0(u)\,,
\end{cases} \tag{9}
\]
where $\mu_0$ is the prior density function and $H_u(\rho)$ is the Hessian of $\rho$.

More specifically, assume $\mu_0$ is $C^2$ and, for any $p > 0$, $\mu_0$ satisfies
\[
\int_{\mathbb{R}^L}|u|^p\,\mu_0(u)\,\mathrm{d}u = C_p < \infty\,.
\]
If $\{u_0^j\}$ are i.i.d. sampled from the measure induced by $\mu_0$, then for any $t < \infty$ and any $\epsilon > 0$, there is a constant $C_\epsilon(t)$ independent of $J$ such that
\[
\mathbb{E}\left(W_2^2\left(M_{u_t},\rho(t)\right)\right) \le C_\epsilon(t)
\begin{cases}
J^{-1/2+\epsilon}\,, & L \le 4\,,\\
J^{-2/L}\,, & L > 4\,.
\end{cases}
\]

The significance of the result is apparent. 1. When the number of samples $J$ is big enough, the ensemble distribution of $\{u_t^j\}$, the continuous version of EKI, can be viewed approximately as the solution to the Fokker-Planck equation (9). So analyzing the long-time, large-sample properties of EKI boils down to analyzing the Fokker-Planck equation (9). The analysis for the latter is very rich, and the literature encompasses the wellposedness, the existence of the equilibrium, and the convergence rate in time. All of these can direct us to a better understanding of the algorithm. 2. We give the specific rate of convergence. For $L \le 4$, the rate is essentially $J^{-1/2}$; this is the optimal rate one can hope for from a Monte Carlo sampling method. For the case $L > 4$, we believe the result is also optimal. Indeed, as will be shown in Section 5, by setting up a dynamical system $\{v_t^j\}$ that strictly follows the flow of the PDE, one expects the best representation of the PDE on the particle level; but even then, $W_2^2(M_{v_t},\rho)$ is at best of order $J^{-2/L}$, according to [17]. So the theorem above essentially says that $\{u_t^j\}$, while being accessible, is no worse than $\{v_t^j\}$, and thus achieves the best possible convergence rate.

We do have to mention, however, that the theorem quantifies the Wasserstein distance, which is a very strong measure. In practice, it is often sufficient to have a number of particles that characterize weak convergence. For this practical purpose, we also show the following theorem:

Theorem 2 (Main result 2: weak convergence)
Under the weakly nonlinear assumption (6), $M_{u_t}$ converges weakly to the probability distribution induced by $\rho(t,u)$ with the optimal rate; namely, given any $l$-Lipschitz function $f$ and any $\epsilon > 0$, there is a constant $C_\epsilon(l, f(0), t)$ independent of $J$ such that, for any $t < \infty$,
\[
\mathbb{E}\left(\left|\int f(u)\left[M_{u_t} - \rho(t,u)\right]\mathrm{d}u\right|\right) \le C_\epsilon(l,f(0),t)\,J^{-1/2+\epsilon}\,. \tag{10}
\]
Here $M_{u_t}$ is the ensemble distribution (4) and $\rho$ solves (9).

This result significantly strengthens the convergence rate and eliminates the dependence on the dimension $L$.

4 Interpretation in the linear and nonlinear setups

Before proving the two theorems, we present here how to interpret them in the linear and nonlinear setups.

4.1 Linear setup

This is the setup in which we consider $\mathrm{m} = 0$, meaning $\mathcal{G}(u) = Au$, and the initial condition $\mu_0$ is a Gaussian density function. When this happens, on the one hand, the entire FP evolution is a Gaussian process, and on the other hand, the posterior distribution is also Gaussian; one would thus expect a complete reconstruction.

Indeed, let us follow [33] and define
\[
\mu(t,u) = \frac{1}{Z(t)}\exp\left(-t\,\Phi(u;y)\right)\mu_0(u)\,, \tag{11}
\]
where $Z(t) := \int_{\mathbb{R}^L}\exp(-t\,\Phi(u;y))\,\mu_0(u)\,\mathrm{d}u$ is the normalization factor. It is then clear that
\[
\mu(t=0,u) = \mu_0\,,\qquad \mu(t=1,u) = \mu_{\mathrm{pos}}\,,
\]
meaning this new definition (11) provides a smooth transition that moves the prior distribution to the posterior, and it exactly reconstructs our target distribution at precisely $t = 1$. In the Gaussian case this transition is fully explicit, as the following sketch illustrates.
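Since precisions are additive for Gaussian densities, $\mu(t,\cdot)$ in (11) is Gaussian with precision $\Gamma_0^{-1} + tA^\top\Gamma^{-1}A$ when the prior is $\mathcal{N}(u_0,\Gamma_0)$. A minimal numerical check (our own sketch; the prior mean/covariance names are assumptions):

```python
# Moments of the interpolation (11) in the linear Gaussian case:
# t = 0 gives the prior, t = 1 the usual Bayesian posterior.
import numpy as np

rng = np.random.default_rng(3)
L, K = 2, 3
A = rng.standard_normal((K, L))
Gamma, Gamma0 = np.eye(K), np.eye(L)     # noise / prior covariances (assumed)
u0_mean = np.zeros(L)                    # prior mean (assumed)
y = A @ np.array([1.0, -0.5])

def mu_t_moments(t):
    """Mean and covariance of mu(t,.) ~ exp(-t*Phi(u;y)) * mu_0(u)."""
    P = np.linalg.inv(Gamma0) + t * A.T @ np.linalg.solve(Gamma, A)
    C = np.linalg.inv(P)
    mean = C @ (np.linalg.solve(Gamma0, u0_mean)
                + t * A.T @ np.linalg.solve(Gamma, y))
    return mean, C

print(mu_t_moments(0.0))   # recovers the prior N(u0_mean, Gamma0)
print(mu_t_moments(1.0))   # the Gaussian posterior, reached at t = 1
```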
With more derivation, one can actually show that $\mu(t,u)$ is a strong solution to the Fokker-Planck equation, meaning $\rho(t,u) = \mu(t,u)$ satisfies (9), and $\rho(t=1,u)$ is the posterior density function under the linear assumption. This quickly leads to a corollary of the main theorem:

Corollary 1 Under assumption (6) with $\mathrm{m}(u) = 0$, if $\{u_0^j\}$ are i.i.d. sampled from the Gaussian distribution induced by the density function $\mu_0(u)$, then for any $\epsilon > 0$ there exists $J(\epsilon) > 0$ such that for any $J > J(\epsilon)$,
\[
\mathbb{E}\left(W_2^2\left(\mu_{\mathrm{pos}}(u),\,M_{u_1}\right)\right) \le \epsilon\,,
\]
where $M_{u_1}$, defined as in (4), is the ensemble distribution of $\{u_1^j\}$, the solution to the SDE (8) at $t = 1$, and $\mu_{\mathrm{pos}}$ is the posterior density function that induces the posterior distribution.

The corollary is a direct consequence of Theorem 1 and we omit the proof. Showing that $\mu(t,u)$ is the solution to the PDE (9) amounts to computing its time derivative and its first two derivatives in $u$, plugging them into (9), and balancing the terms. For the completeness of the paper, we present the derivation briefly below. Without loss of generality, we assume $y = Au^\dagger$ with $r = 0$; we write $B := A^\top\Gamma^{-1}A$ and let $\mu_0 \sim \mathcal{N}(u_0,\Gamma_0)$.

Taking the time derivative, we have
\[
\partial_t\mu(t,u) = -\Phi(u;y)\,\mu(t,u) - \frac{\partial_t Z(t)}{Z(t)}\,\mu(t,u)\,, \tag{12}
\]
where, under the linearity assumption, $\Phi(u;y) = \left(u^\dagger - u\right)^\top B\left(u^\dagger - u\right)/2$ and
\[
\begin{aligned}
\frac{\partial_t Z}{Z} &= -\int \left(u - \mathbb{E}_{\mu_t}\right)^\top B\left(u - \mathbb{E}_{\mu_t}\right)/2\ \mu\,\mathrm{d}u - \int \left(\mathbb{E}_{\mu_t} - u^\dagger\right)^\top B\left(\mathbb{E}_{\mu_t} - u^\dagger\right)/2\ \mu\,\mathrm{d}u\\
&= -\mathrm{Tr}\left[\mathrm{Cov}_{\mu_t}B\right]/2 - \left(u^\dagger - \mathbb{E}_{\mu_t}\right)^\top B\left(u^\dagger - \mathbb{E}_{\mu_t}\right)/2\,.
\end{aligned}
\]
Similarly, since $\mu(t,\cdot)$ is Gaussian with precision matrix $\mathrm{Cov}_{\mu_t}^{-1} = tB + \Gamma_0^{-1}$, the gradient in $u$ is
\[
\nabla_u\mu(t,u) = \left[t\,B\left(u^\dagger - u\right) - \Gamma_0^{-1}\left(u - u_0\right)\right]\mu(t,u) = -\mathrm{Cov}_{\mu_t}^{-1}\left(u - \mathbb{E}_{\mu_t}\right)\mu(t,u)\,,
\]
and the Hessian is
\[
H_u\mu = \mathrm{Cov}_{\mu_t}^{-1}\left(-I + \left(u - \mathbb{E}_{\mu_t}\right)\left(u - \mathbb{E}_{\mu_t}\right)^\top\mathrm{Cov}_{\mu_t}^{-1}\right)\mu\,.
\]
Putting these back into (9) (with $\mathrm{Cov}_{\mu_t,\mathcal{G}} = \mathrm{Cov}_{\mu_t}A^\top$ in the linear case), one has
\[
\begin{aligned}
&\partial_t\mu + \nabla_u\cdot\left(\left(u^\dagger - u\right)^\top B\,\mathrm{Cov}_{\mu_t}\,\mu\right) - \frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}_{\mu_t}B\,\mathrm{Cov}_{\mu_t}H_u(\mu)\right)\\
&\quad= \partial_t\mu + \left(u^\dagger - u\right)^\top B\,\mathrm{Cov}_{\mu_t}\nabla_u\mu + \nabla_u\cdot\left(\left(u^\dagger - u\right)^\top B\,\mathrm{Cov}_{\mu_t}\right)\mu - \frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}_{\mu_t}B\,\mathrm{Cov}_{\mu_t}H_u(\mu)\right)\\
&\quad=: \text{term I} + \text{term II} + \text{term III} + \text{term IV}\,.
\end{aligned}
\]
Term III becomes $-\mathrm{Tr}\left[\mathrm{Cov}_{\mu_t}B\right]\mu$, and term IV turns into
\[
-\frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}_{\mu_t}B\,\mathrm{Cov}_{\mu_t}H_u(\mu)\right) = \frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}_{\mu_t}B\right)\mu - \frac{1}{2}\left|A\left(u - \mathbb{E}_{\mu_t}\right)\right|_\Gamma^2\,\mu\,.
\]
To handle term II, we use the Gaussian form of $\nabla_u\mu$:
\[
\left(u^\dagger - u\right)^\top B\,\mathrm{Cov}_{\mu_t}\nabla_u\mu = -\left(u^\dagger - u\right)^\top B\,\mathrm{Cov}_{\mu_t}\mathrm{Cov}_{\mu_t}^{-1}\left(u - \mathbb{E}_{\mu_t}\right)\mu = -\left(u^\dagger - u\right)^\top B\left(u - \mathbb{E}_{\mu_t}\right)\mu\,.
\]
Adding all the terms up (and writing $u^\dagger - \mathbb{E}_{\mu_t} = (u^\dagger - u) + (u - \mathbb{E}_{\mu_t})$ in term I), we find that the summation is 0, making $\mu$ the strong solution to the PDE (9).

4.2 Nonlinear setup

In the weakly nonlinear situation, Theorem 1 still holds true; however, $\mu(t,u)$ as defined in (11), despite smoothly connecting the prior and the target distribution, is no longer the solution to the PDE. Indeed, if we plug it in, defining the operator
\[
\mathcal{L}[\mu] = \partial_t\mu(t,u) + \nabla_u\cdot\left(\left(y - \mathcal{G}(u)\right)^\top\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\,\mu\right) - \frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}_{\mu_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\,H_u(\mu)\right)\,,
\]
we no longer have $\mathcal{L}[\mu] = 0$ as in the linear case, but rather
\[
\mathcal{L}[\mu] = \left[R_1(t,u) + R_2(t,u) + R_3(t,u)\right]\mu(t,u)\,.
\]
The remainder terms are
\[
\begin{aligned}
R_1(t,u) &= \frac{1}{2}\mathrm{Tr}\left\{\mathrm{Cov}_{\mathcal{G},\mathcal{G}}\,\Gamma^{-1}\right\} - \mathrm{Tr}\left\{\left(\nabla_u\mathcal{G}(u)\right)^\top\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\right\} + \frac{1}{2}\mathrm{Tr}\left\{\mathrm{Cov}_{\mu_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\,\nabla_u V(u)\right\}\,,\\
R_2(t,u) &= \frac{1}{2}\left(y - \bar{\mathcal{G}}\right)^\top\Gamma^{-1}\left(y - \bar{\mathcal{G}}\right) - \frac{1}{2}\left(y - \mathcal{G}(u)\right)^\top\Gamma^{-1}\left(y - \mathcal{G}(u)\right)\\
&\quad + \left(y - \mathcal{G}(u)\right)^\top\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\,V(u) - \frac{1}{2}V^\top(u)\,\mathrm{Cov}_{\mu_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\,V(u)\,,\\
R_3(t,u) &= -t\,\mathrm{Tr}\left\{\mathrm{Cov}_{\mu_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\mu_t}\,W(u)\right\}\,,
\end{aligned}
\]
with
\[
V(u) = t\left(\nabla_u\mathcal{G}(u)\right)^\top\Gamma^{-1}\left(y - \mathcal{G}(u)\right) - \Gamma_0^{-1}(u - u_0)\,,\qquad W(u)\in\mathbb{R}^{L\times L}\,,\ \ (W(u))_{:,i} = \partial_i\left(\nabla_u\mathcal{G}\right)^\top\Gamma^{-1}\left(y - \mathcal{G}(u)\right)\,,
\]
where $\bar{\mathcal{G}}$ stands for $\mathbb{E}_{\mathcal{G},\mu_t}$ and $\mathrm{Cov}_{\mathcal{G},\mathcal{G}} := \int\left(\mathcal{G} - \mathbb{E}_{\mathcal{G},\mu_t}\right)\otimes\left(\mathcal{G} - \mathbb{E}_{\mathcal{G},\mu_t}\right)\mu_t\,\mathrm{d}u$. This equation differs from the PDE by the three weight terms $R_i$. In some sense, this is a negative result: it suggests that the density of the mean-field limit of $M_{u_t}$, proved to be $\rho(t,u)$, differs from $\mu(t,u)$ by the weight terms $R_i$, which could potentially bring in $O(1)$ effects. The question then comes down to bounding the effects of the $R_i$ and showing them to be small in certain scenarios. This, however, is not within the realm of deriving and proving the mean-field limit, and it is beyond the focus of this paper. More discussion can be found in [13,11,24].
5 Closeness of $\{v_t^j\}$ and $\rho_t$

We now start proving the theorems. For notational simplicity, we consider $0 \le t \le 1$; all proofs can be easily extended to $1 < t < \infty$. To a large extent, we rely on a "bridge" to connect $\rho$, the solution to the PDE (9), and the $\{u_t^j\}$ system, the solution to the SDE (8). The "bridge" is another dynamical system, termed $\{v_t^j\}$, that follows exactly the flow defined by (9), meaning the coefficients in the $\{v_t^j\}$ system are defined by $\rho(t,u)$ and regarded as given a priori.

Intuitively, since $\{v_t^j\}$ follows the flow of the PDE, it carries the PDE information, and thus its ensemble distribution should be close to the measure induced by $\rho$. This is discussed in Proposition 1. $\{v_t^j\}$ inherits properties of $\rho$, such as boundedness of moments, as will be presented in Lemma 3. Since $\{v_t^j\}$ and $\{u_t^j\}$ are both dynamical systems, the comparison boils down to a stability analysis for SDE systems, and this part of the result is presented in Proposition 2.

The proof of the theorems is thereby divided into two sections, this one and the subsequent one: in this section, we show the closeness of $\{v_t^j\}$ and $\rho_t$, and in the following one we show the closeness of $\{v_t^j\}$ and $\{u_t^j\}$. Both results are characterized in the $W_2$-metric, and the combination of the two naturally leads to the proofs of Theorems 1 and 2.

In this section in particular, we discuss the properties of the Fokker-Planck equation and give some estimates of the moments in Section 5.1. We then discuss the $\{v_t^j\}$ system in Section 5.2.

5.1 Properties of the Fokker-Planck equation

We would like to show the boundedness of the moments of $\rho(t,u)$, the solution to (9). We start with the covariance first.

Lemma 1
Under the weakly nonlinear assumption (6), we have, for $0 \le t \le 1$,
\[
\left\|\mathrm{Cov}_{\rho_t}\right\|_2 \le C\,,\qquad \left\|\mathrm{Cov}_{\rho_t,\mathcal{G}}\right\|_2 \le C\,, \tag{13}
\]
where $C$ is a constant independent of $t$ and $\rho(t,u)$ is the solution to (9).

Proof First, by the weakly nonlinear assumption (6), there is an $M > 0$ such that
\[
\left|\mathcal{G}(u_1) - \mathcal{G}(u_2)\right| \le \max\left(\|A\|_2, M\right)\left|u_1 - u_2\right|\,.
\]
Multiplying both sides of (9) by $|u - \mathbb{E}_{\rho_t}|^2$ and integrating, we have
\[
\begin{aligned}
\partial_t\int_{\mathbb{R}^L}\left|u - \mathbb{E}_{\rho_t}\right|^2\rho(t,u)\,\mathrm{d}u
&= \int_{\mathbb{R}^L}2\left(y - \mathcal{G}(u)\right)^\top\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\left(u - \mathbb{E}_{\rho_t}\right)\rho + \mathrm{Tr}\left(\mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\right)\rho\,\mathrm{d}u\\
&= \int_{\mathbb{R}^L}-2\left(\mathcal{G}(u) - \mathbb{E}_{\mathcal{G},\rho_t}\right)^\top\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\left(u - \mathbb{E}_{\rho_t}\right)\rho + \mathrm{Tr}\left(\mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\right)\rho\,\mathrm{d}u\\
&= -\mathrm{Tr}\left(\mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1}\mathrm{Cov}_{\mathcal{G},\rho_t}\right) \le 0\,,
\end{aligned}
\]
which implies $\|\mathrm{Cov}_{\rho_t}\|_2 \le \mathrm{Tr}(\mathrm{Cov}_{\rho_t}) \le \mathrm{Tr}(\mathrm{Cov}_{\rho_0}) \le C$. Furthermore, we also have
\[
\begin{aligned}
\left\|\mathrm{Cov}_{\rho_t,\mathcal{G}}\right\|_2 &\le \int_{\mathbb{R}^L}\left\|\left(u - \mathbb{E}_{\rho_t}\right)\otimes\left(\mathcal{G}(u) - \mathbb{E}_{\mathcal{G},\rho_t}\right)\right\|_2\rho\,\mathrm{d}u \le \int_{\mathbb{R}^L}\left|u - \mathbb{E}_{\rho_t}\right|\left|\mathcal{G}(u) - \mathbb{E}_{\mathcal{G},\rho_t}\right|\rho\,\mathrm{d}u\\
&\le \left(\int_{\mathbb{R}^L}\left|u - \mathbb{E}_{\rho_t}\right|^2\rho\,\mathrm{d}u\right)^{1/2}\left(\int_{\mathbb{R}^L}\left|\mathcal{G}(u) - \mathbb{E}_{\mathcal{G},\rho_t}\right|^2\rho\,\mathrm{d}u\right)^{1/2} \le \max\left(\|A\|_2, M\right)C\,,
\end{aligned}
\]
which proves (13). ⊓⊔

Such boundedness can be extended to higher moments:
Lemma 2
Let $\rho$ solve (9) with initial condition $\mu_0$. If $\mu_0 \in C^2$ and has finite high moments, meaning for any $2 \le p < \infty$ there is a $C_{p,0} < \infty$ such that
\[
\int_{\mathbb{R}^L}|u|^p\mu_0(u)\,\mathrm{d}u = C_{p,0} < \infty\,,
\]
then under the weakly nonlinear assumption (6), for any $2 \le p < \infty$ there is a constant $C_p < \infty$ such that
\[
\int_{\mathbb{R}^L}\left|u - \mathbb{E}_{\rho_t}\right|^p\rho(t,u)\,\mathrm{d}u < C_p\,,\qquad \int_{\mathbb{R}^L}\left|u - u^\dagger\right|^p\rho(t,u)\,\mathrm{d}u < C_p\,, \tag{14}
\]
for all $0 \le t \le 1$.

Proof We first rewrite (9) in the following form:
\[
\partial_t\rho = -\nabla_u\cdot\left(F(t,u)\,\rho\right) + \frac{1}{2}\mathrm{Tr}\left(D(t,u)D^\top(t,u)\,H_u(\rho)\right)\,,
\]
where the flux term is $F(t,u) = \mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1}\left(y - \mathcal{G}(u)\right)$ and the diffusion term is $D(t,u) = \mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1/2}$. According to these definitions and Lemma 1, $F(t,u)$ is Lipschitz and $D(t,u)$ is bounded:
\[
\left|F(t,u_1) - F(t,u_2)\right| \le C\left|u_1 - u_2\right|\,,\qquad \left|F(t,0)\right| \le C\,, \tag{15}
\]
and
\[
\left\|D(t,u)\right\|_2 \le C\,, \tag{16}
\]
where $C$ is a constant independent of $t$, $u_1$, $u_2$.

Consider the SDE corresponding to (9):
\[
\mathrm{d}z_t = F(t,z_t)\,\mathrm{d}t + D(t,z_t)\,\mathrm{d}W_t\,,\qquad z_0 \sim \mu_0\,.
\]
Then $\int_{\mathbb{R}^L}|u|^p\rho(t,u)\,\mathrm{d}u = \mathbb{E}|z_t|^p$, and it suffices to prove the boundedness of $\mathbb{E}|z_t|^p$:
\[
\int_{\mathbb{R}^L}|u|^p\rho(t,u)\,\mathrm{d}u = \mathbb{E}|z_t|^p \le C_p\,. \tag{17}
\]
Using Itô's formula, for even $k$,
\[
\begin{aligned}
\frac{\mathrm{d}\,\mathbb{E}|z_t|^k}{\mathrm{d}t} &\le k\,\mathbb{E}\left(|z_t|^{k-2}\left\langle z_t, F(t,z_t)\right\rangle\right) + \frac{k}{2}\,\mathbb{E}\left(|z_t|^{k-2}\mathrm{Tr}\left(D^\top(t,z_t)D(t,z_t)\right)\right)\\
&\quad + \frac{k(k-2)}{2}\,\mathbb{E}\left(|z_t|^{k-4}\left\langle z_t, D(t,z_t)D^\top(t,z_t)\,z_t\right\rangle\right)
\le C_{1,k}\,\mathbb{E}|z_t|^k + C_{2,k}\,,
\end{aligned}
\]
where $C_{1,k}, C_{2,k}$ are constants depending only on $k$, and where we used (15)-(16) and Young's inequality in the second inequality. For example,
\[
\mathbb{E}\left(|z_t|^{k-1}\left|F(t,z_t)\right|\right) \le \mathbb{E}\left(|z_t|^{k-1}\left(C|z_t| + |F(t,0)|\right)\right) \le C\,\mathbb{E}|z_t|^k + C\,\mathbb{E}|z_t|^{k-1} \le \left(C + \frac{k-1}{k}\right)\mathbb{E}|z_t|^k + \frac{C^k}{k}\,,
\]
where the last inequality comes from Young's inequality:
\[
C\,\mathbb{E}|z_t|^{k-1} \le \frac{k-1}{k}\,\mathbb{E}|z_t|^k + \frac{1}{k}\,C^k\,.
\]
Since
\[
\mathbb{E}|z_0|^k = \int_{\mathbb{R}^L}|u|^k\mu_0(u)\,\mathrm{d}u < \infty\,,
\]
Grönwall's inequality finally gives
\[
\mathbb{E}|z_t|^k \le C'_k\,,\qquad \forall\, 0\le t\le 1\,,
\]
which implies (17). Finally, (14) follows from (17) and the boundedness of $u^\dagger$ and $\mathbb{E}_{\rho_t}$. ⊓⊔

5.2 $\{v_t^j\}$ and the Fokker-Planck-like equation

The $\{v_t^j\}$ system is the "bridge" we build to connect $\{u_t^j\}$ with the PDE. It follows the flow of the PDE:
\[
\mathrm{d}v_t^j = \mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1}\left(y - \mathcal{G}(v_t^j)\right)\mathrm{d}t + \mathrm{Cov}_{\rho_t,\mathcal{G}}\,\Gamma^{-1/2}\,\mathrm{d}W_t^j\,, \tag{18}
\]
with $\mathrm{Cov}_{\rho_t,\mathcal{G}}$ determined by the solution to (9). We denote its ensemble distribution by
\[
M_{v_t} = \frac{1}{J}\sum_{j=1}^J\delta_{v_t^j}\,.
\]
It is a classical result that $W_2(M_{v_t},\rho_t) \to 0$ in the $J \to \infty$ limit in the expectation sense. Indeed, if the initial condition for this SDE system is consistent with $\mu_0$, meaning $\{v_0^j\}$ are drawn i.i.d. from the measure induced by $\mu_0$, then the ensemble distribution of $\{v_t^j\}$ is close to the measure induced by $\rho_t$ for all finite time.

Proposition 1 (Linking $\{v_t^j\}$ with the Fokker-Planck-like PDE) Let $\{v_t^j\}$ solve (18) with $\{v_0^j\}$ drawn i.i.d. from the measure induced by $\mu_0$, and let $\rho(t,u)$ solve (9) with initial condition $\mu_0$. If $\mu_0 \in C^2$ and has finite high moments, then under the weakly nonlinear assumption (6), there is a constant $C(t)$ independent of $J$ such that
\[
\mathbb{E}\left(W_2^2(M_{v_t},\rho_t)\right) \le C(t)
\begin{cases}
J^{-1/2}\,, & L < 4\,,\\
J^{-1/2}\log(1+J)\,, & L = 4\,,\\
J^{-2/L}\,, & L > 4\,,
\end{cases} \tag{19}
\]
for all $t < \infty$. Here $M_{v_t}$ is the ensemble distribution of $\{v_t^j\}$.

This is a straightforward consequence of the famous result by [17], and for completeness we cite the theorem here:
Theorem 3 (Theorem 1 in [17])
Let $\rho(u)$ be a probability density on $\mathbb{R}^L$ and let $p > 0$. Assume that
\[
M_q(\rho) := \int_{\mathbb{R}^L}|x|^q\rho(\mathrm{d}x) < \infty\qquad\text{for some } q > p\,.
\]
Consider an i.i.d. sequence $(X_k)_{k\ge 1}$ of $\rho$-distributed random variables and, for $N \ge 1$, define the empirical measure
\[
\rho^N := \frac{1}{N}\sum_{k=1}^N\delta_{X_k}\,.
\]
There is a constant $C$ depending only on $p, q, L$ such that, for all $N \ge 1$:

1. If $p > L/2$ and $q \ne 2p$:
\[
\mathbb{E}\left(W_p^p\left(\rho^N,\rho\right)\right) \le C\left(N^{-1/2} + N^{-(q-p)/q}\right)\,.
\]
2. If $p = L/2$ and $q \ne 2p$:
\[
\mathbb{E}\left(W_p^p\left(\rho^N,\rho\right)\right) \le C\left(N^{-1/2}\log(1+N) + N^{-(q-p)/q}\right)\,.
\]
3. If $p \in (0, L/2)$ and $q \ne L/(L-p)$:
\[
\mathbb{E}\left(W_p^p\left(\rho^N,\rho\right)\right) \le C\left(N^{-p/L} + N^{-(q-p)/q}\right)\,.
\]

To show Proposition 1, one essentially only needs to show the boundedness of all moments of the particle system; this is given by Lemma 3 below. We then simply choose $q$ large enough so that the first terms in Theorem 3 dominate and eliminate the second terms. As a sanity check, the dimension-dependent scaling in Theorem 3 can also be observed numerically, as in the following sketch.
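The following is our own minimal numerical illustration (a heuristic proxy, not part of the proofs) of the slow regime $L > 4$, $p = 2$, where $\mathbb{E}\,W_2^2(\rho^N,\rho)$ scales like $N^{-2/L}$. We replace the true measure $\rho$ by a second independent empirical measure of the same size, which has the same order of matching cost, and compute $W_2^2$ exactly as an assignment problem.

```python
# Empirical illustration of the N^{-2/L} rate in Theorem 3 for L > 4.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(5)
Ldim = 8                                        # L > 4: slow regime

def w22(x, y):
    """Exact W_2^2 between two equally weighted empirical measures."""
    cost = cdist(x, y, metric="sqeuclidean")
    r, c = linear_sum_assignment(cost)          # optimal matching
    return cost[r, c].mean()

for N in [50, 100, 200, 400]:
    vals = [w22(rng.standard_normal((N, Ldim)),
                rng.standard_normal((N, Ldim))) for _ in range(20)]
    print(N, np.mean(vals) * N ** (2 / Ldim))   # roughly constant for large N
```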
As a result of Lemma 2, we can also bound the high moments of $\{v_t^j\}$. This is indeed what we plan to do: in the lemma below we show the boundedness of the moments of $\{v_t^j\}$, derived as a consequence of Lemma 2. Before stating the lemma, we first define $q_t^j = v_t^j - \bar{v}_t$; then we have:

Lemma 3 Under the conditions of Proposition 1, for any fixed even number $2 \le p < \infty$ and large enough $J$, there exists a constant $C_p$ independent of $J$ such that, for all $0 \le t \le 1$,
\[
\mathbb{E}|v_t^j|^p \le C_p\,,\qquad \mathbb{E}\left|q_t^j\right|^p \le C_p\,,\qquad \forall\, 1\le j\le J\,, \tag{20}
\]
and
\[
\left(\mathbb{E}\left|\bar{v}_t - \mathbb{E}_{\rho_t}\right|^p\right)^{1/p} \lesssim J^{-1/2}\,,\qquad
\left(\mathbb{E}\left|\frac{1}{J}\sum_{j=1}^J|q_t^j|^2 - \mathrm{Tr}(\mathrm{Cov}_{\rho_t})\right|^p\right)^{1/p} \lesssim J^{-1/2}\,, \tag{21}
\]
\[
\left(\mathbb{E}\left\|\mathrm{Cov}_{v_t} - \mathrm{Cov}_{\rho_t}\right\|_2^p\right)^{1/p} \lesssim J^{-1/2}\,. \tag{22}
\]

Proof Since the $\{v_t^k\}$ are i.i.d. samples from the measure induced by $\rho(t,u)$, (20) is a direct result of (14). We now prove the first inequality in (21). Using Minkowski's inequality, we have
\[
\left(\mathbb{E}\left|\bar{v}_t - \mathbb{E}_{\rho_t}\right|^p\right)^{1/p} \le \sum_{n=1}^L\left(\mathbb{E}|\alpha_n|^p\right)^{1/p}\,, \tag{23}
\]
where we denote
\[
\alpha_n = \left(\bar{v}_t - \mathbb{E}_{\rho(t)}\right)_n = \frac{1}{J}\sum_{j=1}^J\left(v_t^j - \mathbb{E}_{\rho(t)}\right)_n = \frac{1}{J}\sum_{j=1}^J\alpha_n^j\,.
\]
The subscript $n$ denotes the $n$-th entry of the vector. It is easy to show, due to the fact that the $\{v_t^j\}$ are i.i.d., that
\[
\mathbb{E}(\alpha_n^j) = 0\,,\qquad \mathbb{E}|\alpha_n^j|^p < \infty\,. \tag{24}
\]
We also show in Appendix A, Lemma 9, that
\[
\mathbb{E}\left|\sum_{j=1}^J\alpha_n^j\right|^p \lesssim J^{p/2}\,, \tag{25}
\]
which implies
\[
\mathbb{E}|\alpha_n|^p = \mathbb{E}\left|\frac{1}{J}\sum_{j=1}^J\alpha_n^j\right|^p \lesssim O\left(J^{-p/2}\right)\,. \tag{26}
\]
Plugging (26) into (23) proves the first inequality of (21). To show the second inequality in (21), we note
\[
\left|\frac{1}{J}\sum_{j=1}^J|q_t^j|^2 - \mathrm{Tr}(\mathrm{Cov}_{\rho_t})\right| = \left|\mathrm{Tr}\left(\mathrm{Cov}_{v_t} - \mathrm{Cov}_{\rho_t}\right)\right| \le L\left\|\mathrm{Cov}_{v_t} - \mathrm{Cov}_{\rho_t}\right\|_2\,,
\]
so it is a direct consequence of (22).

To show (22), we write $\mathrm{Cov}_{v_t}$ as
\[
\mathrm{Cov}_{v_t} = \frac{1}{J}\sum_{j=1}^J v_t^j\otimes v_t^j - \bar{v}_t\otimes\bar{v}_t\,,
\]
meaning
\[
\left(\mathbb{E}\left\|\mathrm{Cov}_{v_t} - \mathrm{Cov}_{\rho_t}\right\|_2^p\right)^{1/p} \le \left(\mathbb{E}\left\|\frac{1}{J}\sum_{j=1}^J v_t^j\otimes v_t^j - \mathbb{E}_{\rho(t)}\left(v\otimes v\right)\right\|_2^p\right)^{1/p} + \left(\mathbb{E}\left\|\bar{v}_t\otimes\bar{v}_t - \mathbb{E}_{\rho_t}\otimes\mathbb{E}_{\rho_t}\right\|_2^p\right)^{1/p}\,. \tag{27}
\]
We show below that both terms are of order $J^{-1/2}$. For the first term, let
\[
W = \sum_{j=1}^J\left(v_t^j\otimes v_t^j - \mathbb{E}_{\rho(t)}\left(v\otimes v\right)\right) = \sum_j w^j\,;
\]
then the first term becomes
\[
\left(\mathbb{E}\left\|\frac{1}{J}W\right\|_2^p\right)^{1/p} \le \left(\mathbb{E}\left\|\frac{1}{J}W\right\|_F^p\right)^{1/p} \lesssim \sum_{m,n=1}^L\left(\mathbb{E}\left|W_{m,n}/J\right|^p\right)^{1/p} = \sum_{m,n=1}^L J^{-1/2}\left(\mathbb{E}\left|W_{m,n}/\sqrt{J}\right|^p\right)^{1/p}\,,
\]
where $W_{m,n}$ denotes the $(m,n)$-th entry of the matrix. Similarly to before, for each $m,n$ we have
\[
\mathbb{E}\left(w^j_{m,n}\right) = 0\,,\qquad \mathbb{E}\left|w^j_{m,n}\right|^p < \infty\,, \tag{28}
\]
and by Appendix A, Lemma 9, we have
\[
\mathbb{E}\left|\sum_{j=1}^J w_{m,n}^j\right|^p \lesssim J^{p/2}\,, \tag{29}
\]
which implies
\[
\mathbb{E}\left|W_{m,n}/\sqrt{J}\right|^p = \mathbb{E}\left|\frac{\sum_{j=1}^J w^j_{m,n}}{\sqrt{J}}\right|^p \sim O(1)\,,
\]
which makes the first term $O(J^{-1/2})$. For the second term in (27), we have
\[
\left(\mathbb{E}\left\|\bar{v}_t\otimes\bar{v}_t - \mathbb{E}_{\rho_t}\otimes\mathbb{E}_{\rho_t}\right\|_2^p\right)^{1/p} \le \left(\mathbb{E}\left\|\left(\bar{v}_t - \mathbb{E}_{\rho_t}\right)\otimes\bar{v}_t\right\|_2^p\right)^{1/p} + \left(\mathbb{E}\left\|\mathbb{E}_{\rho_t}\otimes\left(\bar{v}_t - \mathbb{E}_{\rho_t}\right)\right\|_2^p\right)^{1/p}\,. \tag{30}
\]
The first term of (30) can be bounded by
\[
\left(\mathbb{E}\left\|\left(\bar{v}_t - \mathbb{E}_{\rho_t}\right)\otimes\bar{v}_t\right\|_2^p\right)^{1/p} \le \left(\mathbb{E}\,\left|\bar{v}_t - \mathbb{E}_{\rho_t}\right|^p\left|\bar{v}_t\right|^p\right)^{1/p} \overset{(I)}{\le} \left(\mathbb{E}\left|\bar{v}_t - \mathbb{E}_{\rho_t}\right|^{2p}\right)^{1/2p}\left(\mathbb{E}\left|\bar{v}_t\right|^{2p}\right)^{1/2p} \overset{(II)}{\lesssim} J^{-1/2}\,,
\]
where we use Hölder's inequality in $(I)$, and (20) together with the first inequality in (21) in $(II)$. Similarly, the second term of (30) can also be bounded by
\[
\left(\mathbb{E}\left\|\mathbb{E}_{\rho_t}\otimes\left(\bar{v}_t - \mathbb{E}_{\rho_t}\right)\right\|_2^p\right)^{1/p} \lesssim J^{-1/2}\,.
\]
Plugging these two inequalities into (30), we have
\[
\left(\mathbb{E}\left\|\bar{v}_t\otimes\bar{v}_t - \mathbb{E}_{\rho_t}\otimes\mathbb{E}_{\rho_t}\right\|_2^p\right)^{1/p} \lesssim J^{-1/2}\,.
\]
In conclusion, we obtain (22). ⊓⊔
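In the linear Gaussian case of Section 4.1, $\rho_t = \mu(t,\cdot)$ is Gaussian with covariance $\left(\Gamma_0^{-1} + tA^\top\Gamma^{-1}A\right)^{-1}$, so the coefficient $\mathrm{Cov}_{\rho_t,\mathcal{G}} = \mathrm{Cov}_{\rho_t}A^\top$ of the bridge (18) is available in closed form, and $\{v_t^j\}$ can be simulated without any ensemble coupling. A minimal sketch (our own illustration; all problem data are assumptions):

```python
# Simulating the bridge system (18) in the linear Gaussian case.
import numpy as np

rng = np.random.default_rng(6)
L, K, J, h = 2, 3, 500, 1e-3
A = rng.standard_normal((K, L))
Gamma, Gamma0 = np.eye(K), np.eye(L)
y = A @ np.array([1.0, -0.5])
Gi = np.linalg.inv(Gamma)

def cov_rho(t):                    # covariance of mu(t,.) in (11)
    return np.linalg.inv(np.linalg.inv(Gamma0) + t * A.T @ Gi @ A)

v = rng.multivariate_normal(np.zeros(L), Gamma0, size=J)   # v_0^j ~ mu_0
t = 0.0
while t < 1.0:
    CA = cov_rho(t) @ A.T                   # Cov_{rho_t, G} for G(u) = A u
    drift = (y - v @ A.T) @ Gi.T @ CA.T     # row j: CA Gi (y - A v^j)
    dW = np.sqrt(h) * rng.standard_normal((J, K))
    v = v + h * drift + dW @ CA.T           # Gamma^{-1/2} = I here
    t += h

print(np.cov(v.T))     # approximately cov_rho(1), the posterior covariance
```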
We are now left with the task of showing the closeness of $\{u_t^j\}$ and $\{v_t^j\}$. The two systems are governed by the SDEs (8) and (18). The precise statement is the following:

Proposition 2 (Linking $\{u_t^j\}$ with $\{v_t^j\}$) Let $\{v_t^j\}_{j=1}^J$ solve (18) and $\{u_t^j\}_{j=1}^J$ solve (8), with the same initial data drawn i.i.d. from the measure induced by $\mu_0$. If $\mu_0 \in C^2$ and has finite high moments, then under the weakly nonlinear assumption (6), the two SDE systems are close in the following sense: for any $0 < \epsilon < 1$, there is a constant $0 < C_\epsilon < \infty$ independent of $J$ and $t$ such that, for any $0 \le t \le 1$,
\[
\frac{1}{J}\sum_{j=1}^J\mathbb{E}\left|u_t^j - v_t^j\right|^2 \le C_\epsilon\,J^{-1+\epsilon}\,. \tag{31}
\]
Furthermore, denoting by $M_{v_t}$ and $M_{u_t}$ the ensemble distributions of $\{v_t^j\}$ and $\{u_t^j\}$ respectively, we have
\[
\mathbb{E}\left(W_2^2\left(M_{v_t}, M_{u_t}\right)\right) \le \frac{1}{J}\sum_{j=1}^J\mathbb{E}\left|u_t^j - v_t^j\right|^2 \le C_\epsilon\,J^{-1+\epsilon}\,. \tag{32}
\]

This proposition states that the two particle systems are close for big $J$. Combined with Proposition 1, it is straightforward to show Theorem 1.

Proof (Proof of Theorem 1)
Considering (19) and (32), and using the triangle inequality for $W_2$ (together with Young's inequality), for any $0 \le t \le 1$ one has
\[
\mathbb{E}\left(W_2^2\left(M_{u_t},\rho(t,u)\right)\right) \le 2\,\mathbb{E}\left(W_2^2\left(M_{u_t},M_{v_t}\right)\right) + 2\,\mathbb{E}\left(W_2^2\left(M_{v_t},\rho(t,u)\right)\right) \le C_\epsilon
\begin{cases}
J^{-1/2+\epsilon}\,, & L \le 4\,,\\
J^{-2/L}\,, & L > 4\,,
\end{cases}
\]
which finishes the proof. ⊓⊔

The proof of Theorem 2 is also straightforward.
Proof (Proof of Theorem 2)
Applying the triangle inequality to the left-hand side of (10), we have
\[
\mathbb{E}\left(\left|\int f(u)\left[M_{u_t} - \rho(t,u)\right]\mathrm{d}u\right|\right) \le \mathbb{E}\left(\left|\int f(u)\left[M_{u_t} - M_{v_t}\right]\mathrm{d}u\right|\right) + \mathbb{E}\left(\left|\int f(u)\left[M_{v_t} - \rho(t,u)\right]\mathrm{d}u\right|\right)\,. \tag{33}
\]
We bound both terms.

– Expanding the first term, we have
\[
\mathbb{E}\left(\left|\int f(u)\left[M_{u_t} - M_{v_t}\right]\mathrm{d}u\right|\right) = \mathbb{E}\left(\left|\frac{1}{J}\sum_{j=1}^J\left(f(u_t^j) - f(v_t^j)\right)\right|\right) \le \frac{l}{J}\,\mathbb{E}\sum_{j=1}^J\left|u_t^j - v_t^j\right| \le C_\epsilon\,l\,J^{-1/2+\epsilon}\,, \tag{34}
\]
where in the first inequality we used that $f$ is $l$-Lipschitz, and in the second inequality we used Hölder's inequality and Proposition 2, equation (31).
– For the second term, since the $v_t^j \sim \rho(t,u)$ are independent,
\[
\mathbb{E}\left(\left|\int f(u)\left[M_{v_t} - \rho(t,u)\right]\mathrm{d}u\right|\right) = \mathbb{E}\left(\left|\frac{1}{J}\sum_{j=1}^J f(v_t^j) - \mathbb{E}_{\rho_t}(f)\right|\right) \le \left(\frac{\mathrm{Var}_{\rho_t}(f)}{J}\right)^{1/2}\,,
\]
where $\mathrm{Var}_{\rho_t}(f)$ is the variance of $f$ under $\rho_t$. Since $f$ is $l$-Lipschitz and $\rho$ has finite second moment, there is a constant $C(l,f(0))$ such that $\mathrm{Var}_{\rho_t}(f) \le C(l,f(0))$. Therefore,
\[
\mathbb{E}\left(\left|\int f(u)\left[M_{v_t} - \rho(t,u)\right]\mathrm{d}u\right|\right) \le C(l,f(0))\,J^{-1/2}\,. \tag{35}
\]

Combining the two terms in (33) proves (10), with the constant depending on $\epsilon$, $l$, and $f(0)$. ⊓⊔

6 Closeness of $\{u_t^j\}$ and $\{v_t^j\}$

In the following subsections, we first provide some a-priori estimates, and then prove Proposition 2 using the bootstrapping method.

6.1 Some a-priori estimates

We mainly show that the higher moments of $\{u_t^j\}$ are bounded. First, we present a lemma similar to the proof of Theorem 4.5 in [3]. For convenience, denote
\[
e^j(t) = u^j(t) - \bar{u}(t)\,,\qquad \tilde{e}^j(t) = \Gamma^{-1/2}A\,e^j(t)\,,\qquad \tilde{u}^j(t) = \Gamma^{-1/2}A\,u^j(t)\,,
\]
\[
\tilde{r}^j(t) = \Gamma^{-1/2}\left(\mathrm{m}(u^j(t)) - \frac{1}{J}\sum_{k=1}^J\mathrm{m}(u^k(t))\right)\,.
\]
Then:

Lemma 4
Denote
\[
V_p(\tilde{e}(t)) := \mathbb{E}\left(\sum_{m=1}^K\frac{1}{J}\sum_{j=1}^J\left|\tilde{e}^j_m(t)\right|^2\right)^{p/2} \tag{36}
\]
for some $p \ge 2$. Then, under the conditions of Proposition 2, for every $p$ there is a constant $J_p$ such that for any $J > J_p$ and $0 \le t \le 1$,
\[
V_p(\tilde{e}(t)) \le C_p\,, \tag{37}
\]
where $C_p$ is a constant independent of $J$ and $t$. Moreover, $J_2 = 0$. Here $\tilde{e}^j_m$ is the $m$-th component of $\tilde{e}^j$.

Proof Without loss of generality, assume $u^\dagger = 0$. When $t = 0$, since $\mu_0$ has finite high moments, we can find a bound for $V_p(\tilde{e}(0))$ independent of $J$. Let
\[
\mathcal{W}_p(\tilde{e}(t)) = \left(\sum_{m=1}^K\frac{1}{J}\sum_{j=1}^J\left|\tilde{e}^j_m(t)\right|^2\right)^{p/2}\,;
\]
then we have
\[
\begin{aligned}
\mathrm{d}\tilde{e}^j_m = {}&-\frac{1}{J}\sum_{k=1}^J\tilde{e}^k_m\left\langle\tilde{e}^k,\tilde{e}^j\right\rangle\mathrm{d}t - \frac{1}{J}\sum_{k=1}^J\tilde{e}^k_m\left\langle\tilde{r}^k,\tilde{r}^j\right\rangle\mathrm{d}t\\
&+ \frac{1}{J}\sum_{k=1}^J\tilde{e}^k_m\left\langle\tilde{e}^k,\mathrm{d}\left(W^j - \bar{W}\right)\right\rangle + \frac{1}{J}\sum_{k=1}^J\tilde{e}^k_m\left\langle\tilde{r}^k,\mathrm{d}\left(W^j - \bar{W}\right)\right\rangle\,,
\end{aligned}
\]
and
\[
\mathrm{d}\mathcal{W}_p(\tilde{e}) = \sum_{m=1}^K\sum_{j=1}^J\frac{\partial\mathcal{W}_p}{\partial\tilde{e}^j_m}\,\mathrm{d}\tilde{e}^j_m + \frac{1}{2}\sum_{m=1}^K\sum_{j,j'=1}^J\mathrm{d}\tilde{e}^j_m\,\frac{\partial^2\mathcal{W}_p}{\partial\tilde{e}^j_m\,\partial\tilde{e}^{j'}_m}\,\mathrm{d}\tilde{e}^{j'}_m\,. \tag{38}
\]
Let
\[
\mathcal{E} = \left(\sum_{m=1}^K\frac{1}{J}\sum_{j=1}^J|\tilde{e}^j_m|^2\right)^{(p-2)/2}\frac{1}{J^2}\sum_{m,n=1}^K\left(\sum_{k=1}^J\tilde{e}^k_m\tilde{e}^k_n\right)^2\,, \tag{39}
\]
\[
\mathcal{R} = \left(\sum_{m=1}^K\frac{1}{J}\sum_{j=1}^J|\tilde{e}^j_m|^2\right)^{(p-2)/2}\frac{1}{J^2}\sum_{m,n=1}^K\left(\sum_{k=1}^J\tilde{e}^k_m\tilde{r}^k_n\right)^2\,, \tag{40}
\]
\[
\mathcal{F} = \left(\sum_{m=1}^K\frac{1}{J}\sum_{j=1}^J|\tilde{e}^j_m|^2\right)^{(p-2)/2}\frac{1}{J^2}\sum_{m,n=1}^K\left(\sum_{k=1}^J\tilde{e}^k_m\left(\tilde{e}^k_n + \tilde{r}^k_n\right)\right)^2\,.
\]
Using Young's inequality, $(a+b)^2 \le (1+\epsilon)a^2 + (1+1/\epsilon)b^2$ for any $\epsilon > 0$, we have
\[
\mathcal{F} \le (1+\epsilon)\,\mathcal{E} + (1+1/\epsilon)\,\mathcal{R}\,. \tag{41}
\]
Similarly to [3] (B.1), taking the expectation of the first part of (38) gives us
\[
\mathbb{E}\sum_{m=1}^K\sum_{j=1}^J\frac{\partial\mathcal{W}_p}{\partial\tilde{e}^j_m}\,\mathrm{d}\tilde{e}^j_m = -\frac{p}{J}\,\mathbb{E}\left(\mathcal{E} + \mathcal{R}\right)\mathrm{d}t\,, \tag{42}
\]
and the second part of (38) gives us
\[
\mathbb{E}\sum_{m=1}^K\sum_{j,j'=1}^J\mathrm{d}\tilde{e}^j_m\,\frac{\partial^2\mathcal{W}_p}{\partial\tilde{e}^j_m\,\partial\tilde{e}^{j'}_m}\,\mathrm{d}\tilde{e}^{j'}_m \le C_0\,\mathbb{E}(\mathcal{F})\,\mathrm{d}t \le C_0(1+\epsilon)\,\mathbb{E}(\mathcal{E})\,\mathrm{d}t + C_0(1+1/\epsilon)\,\mathbb{E}(\mathcal{R})\,\mathrm{d}t\,, \tag{43}
\]
where $C_0 = \frac{p}{J}\left(\frac{(p-2)(J-1)}{J^2} + \frac{p-2}{J}\right)$ and in the last inequality we used (41). Choosing $\epsilon$ appropriately, the expectation of $\mathcal{W}_p$ then satisfies
\[
\frac{\mathrm{d}V_p(\tilde{e})}{\mathrm{d}t} = \frac{\mathrm{d}\,\mathbb{E}\mathcal{W}_p(\tilde{e})}{\mathrm{d}t} \le -C_1\,\mathbb{E}(\mathcal{E}) + C_2\,\mathbb{E}(\mathcal{R}) \le C_3\,\mathbb{E}\left(\sum_{m=1}^K\frac{1}{J}\sum_{j=1}^J|\tilde{e}^j_m|^2\right)^{p/2} = C_3\,V_p(\tilde{e})\,, \tag{44}
\]
where $C_1, C_2, C_3$ are constants depending only on $p$ and $J$. From the second to the third expression we dropped the $\mathcal{E}$-term, since it always enters with a negative sign, and we used
\[
\sum_{n=1}^K\left(\sum_{k=1}^J\tilde{e}^k_m\tilde{r}^k_n\right)^2 \le \left(\sum_{k=1}^J|\tilde{e}^k_m|^2\right)\left(\sum_{n=1}^K\sum_{k=1}^J|\tilde{r}^k_n|^2\right) \le 4\left\|\Gamma^{-1/2}\right\|_2^2 M^2\,J\left(\sum_{k=1}^J|\tilde{e}^k_m|^2\right)
\]
to bound $\mathcal{R}$, so that $C_3 \sim O(1)$. There is a number $J_p$ such that when $J > J_p$, the constants are all positive; according to the explicit formulas of the constants, $J_2 = 0$. Since $V_p(\tilde{e}(0))$ is bounded, by Grönwall's inequality, (44) implies (37). ⊓⊔

Lemma 5
Under the conditions of Proposition 2, for any $2 \le p < \infty$ and large enough $J$ (larger than $J_p$ as defined in Lemma 4), the $p$-th moments of the particles $\{u_t^j\}_{j=1}^J$ are uniformly bounded in finite time; namely, there is a constant $C_p > 0$ independent of $J$ and $t$ such that for all $0 \le t \le 1$ and $1 \le j \le J$,
\[
\mathbb{E}|u_t^j|^p \le C_p\,,\qquad \left(\mathbb{E}\left\|\mathrm{Cov}_{u_t} - \mathrm{Cov}_{\rho_t}\right\|_2^p\right)^{1/p} \le C_p\,. \tag{45}
\]
Furthermore,
\[
\mathbb{E}\left|u_t^j - \bar{u}_t\right|^p \le C_p\,,\qquad \mathbb{E}\left|u_t^j - u^\dagger\right|^p \le C_p\,.
\]

We note that the linear case with $p = 2$ was studied in [3] (Propositions 4.11 and 5.1). This will not be enough for our use in the later section, since our analysis crucially depends on the boundedness of higher moments. We leave the proof to Appendix B. Combining Lemma 3 and Lemma 5 and using the triangle inequality, we have:

Corollary 2
Under the conditions of Proposition 2, for any $2 \le p < \infty$ and large enough $J$ (larger than $J_p$ as defined in Lemma 4), there is a constant $C_p$ independent of $J$ such that for all $1 \le j \le J$ and $0 \le t \le 1$,
\[
\mathbb{E}\left|u_t^j - v_t^j\right|^p = \mathbb{E}\left|u_t^1 - v_t^1\right|^p \le C_p\,. \tag{46}
\]

6.2 Proof of Proposition 2

To prove Proposition 2, we first unify the notation. Without loss of generality, we let $u^\dagger = 0$. We further use the following notation for conciseness. Let
\[
x_t^j = u_t^j - v_t^j\,,\qquad p_t^j = x_t^j - \bar{x}_t\,,
\]
and denote (call them observables)
\[
\tilde{x}_t^j = \Gamma^{-1/2}A\,x_t^j\,,\quad \tilde{u}_t^j = \Gamma^{-1/2}A\,u_t^j\,,\quad \tilde{v}_t^j = \Gamma^{-1/2}A\,v_t^j\,,\quad \tilde{p}_t^j = \Gamma^{-1/2}A\left(x_t^j - \bar{x}_t\right)\,,\quad \tilde{q}_t^j = \Gamma^{-1/2}A\left(v_t^j - \bar{v}_t\right)\,.
\]
We also use the notation $A \lesssim O(J^\alpha)$ to mean that there is a constant $C$ independent of $J$ such that $A \le C J^\alpha$.

Proving the proposition amounts to tracing the evolution of $\mathbb{E}|x_t^j|^2$ as a function of time and $J$. For that we use the bootstrapping argument: we assume $\mathbb{E}|x_t^j|^2$ decays in $J$ with a certain rate (which could be 0, as already suggested by Lemma 5 and Corollary 2); then, by following the flow of the SDE, we show that the rate can be tightened until a threshold is achieved. This threshold is exactly the rate one needs to prove Proposition 2. The tightening procedure is discussed in Lemma 7 and Lemma 8, for the observables $\tilde{x}_t^j$ and the true error $x_t^j$ respectively. The proof of the proposition is an immediate consequence.

In the proofs we will constantly use the fact that
\[
\mathbb{E}|\tilde{p}_t^j|^2 = \mathbb{E}|\tilde{p}_t^1|^2\,,\qquad \mathbb{E}|x_t^j|^2 = \mathbb{E}|x_t^1|^2
\]
for all $0 \le t \le 1$ and $1 \le j \le J$. When the context is clear, we also omit the subscript $t$ for simplicity of notation.

We first show that $\tilde{x}^j$, $p^j$, $\tilde{p}^j$ (and the means $\bar{x}$, $\bar{\tilde{x}}$) can be bounded by $x^j$:

Lemma 6
For any $0 \le \alpha < 1$ and $0 \le t \le 1$, with the definitions above, if one has
\[
\mathbb{E}|x^j|^2 \lesssim O\left(J^{-\alpha}\right) \tag{47}
\]
for all $1 \le j \le J$, then
\[
\mathbb{E}|\tilde{x}^j|^2 \lesssim O\left(J^{-\alpha}\right) \tag{48}
\]
and
\[
\mathbb{E}|p^j|^2 \lesssim O\left(J^{-\alpha}\right)\,,\qquad \mathbb{E}|\tilde{p}^j|^2 \lesssim O\left(J^{-\alpha}\right) \tag{49}
\]
for all $1 \le j \le J$.

Proof Due to (47), we first have, for all $j$,
\[
\left(\mathbb{E}|p^j|^2\right)^{1/2} = \left(\mathbb{E}\left|\frac{J-1}{J}x^j - \frac{1}{J}\sum_{k\ne j}x^k\right|^2\right)^{1/2} \le 2\left(\mathbb{E}|x^1|^2\right)^{1/2} \lesssim O\left(J^{-\alpha/2}\right)
\]
and
\[
\left(\mathbb{E}|\bar{x}|^2\right)^{1/2} \le \frac{1}{J}\sum_{j=1}^J\left(\mathbb{E}|x^j|^2\right)^{1/2} \lesssim O\left(J^{-\alpha/2}\right)\,,
\]
which implies the first inequality in (49). We then also have an estimate for $\tilde{x}^j$:
\[
\mathbb{E}|\tilde{x}^j|^2 \lesssim \left\|\Gamma^{-1/2}A\right\|_2^2\,\mathbb{E}|x^j|^2 \lesssim O\left(J^{-\alpha}\right)\,,
\]
which implies (48), and it also leads to
\[
\left(\mathbb{E}|\tilde{p}^j|^2\right)^{1/2} = \left(\mathbb{E}\left|\frac{J-1}{J}\tilde{x}^j - \frac{1}{J}\sum_{k\ne j}\tilde{x}^k\right|^2\right)^{1/2} \le 2\left(\mathbb{E}|\tilde{x}^1|^2\right)^{1/2} \lesssim O\left(J^{-\alpha/2}\right)
\]
and
\[
\left(\mathbb{E}|\bar{\tilde{x}}|^2\right)^{1/2} \le \frac{1}{J}\sum_{j=1}^J\left(\mathbb{E}|\tilde{x}^j|^2\right)^{1/2} \lesssim O\left(J^{-\alpha/2}\right)\,.
\]
This finishes the proof. ⊓⊔

Next we show that, if we already have an a-priori estimate for $\{x^j\}$, we can obtain better control of $\{\tilde{x}^j\}$:

Lemma 7
For any $0 \le \alpha < 1$ and $0 \le t \le 1$, if one has
\[
\mathbb{E}|x^j|^2 \lesssim O\left(J^{-\alpha}\right) \tag{50}
\]
for all $j$, then for any $0 < \epsilon < 1/2$ there is a $C_\epsilon < \infty$ independent of $J$ and $t$ such that
\[
\mathbb{E}|\tilde{p}^j|^2 = \mathbb{E}\left|\tilde{x}^j - \frac{1}{J}\sum_k\tilde{x}^k\right|^2 \le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}\,,\qquad
\mathbb{E}|\tilde{x}^j|^2 \le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}
\]
for all $j$. Note that for any $\alpha < 1$, we can choose $\epsilon < \frac{1-\alpha}{2}$ to make $\frac{1+\alpha}{2} - \epsilon > \alpha$.

Proof First, by Lemma 6, equations (48)-(49), we have the rough estimates
\[
\mathbb{E}|\tilde{x}^j|^2 \lesssim O\left(J^{-\alpha}\right)\,,\qquad \mathbb{E}|\tilde{p}^j|^2 \lesssim O\left(J^{-\alpha}\right)\,,\qquad \mathbb{E}|\bar{\tilde{x}}|^2 \lesssim O\left(J^{-\alpha}\right)\,. \tag{51}
\]
Applying $\Gamma^{-1/2}A$ to both sides of (8) and (18), we obtain the evolution of the observables:
\[
\mathrm{d}\tilde{u}^j = -\mathrm{Cov}_{\tilde{u}_t}\,\tilde{u}^j\,\mathrm{d}t + \mathrm{Cov}_{\tilde{u}_t}\,\mathrm{d}W_t^j + \mathrm{Cov}_{\tilde{u}_t,\mathrm{m}}\,\Gamma^{-1}\left(r - \mathrm{m}(u^j)\right)\mathrm{d}t + \mathrm{Cov}_{\tilde{u}_t,\mathrm{m}}\,\Gamma^{-1/2}\,\mathrm{d}W_t^j \tag{52}
\]
and
\[
\begin{aligned}
\mathrm{d}\tilde{v}^j = {}&-\Gamma^{-1/2}A\,\mathrm{Cov}_{\rho_t}A^\top\Gamma^{-1/2}\,\tilde{v}^j\,\mathrm{d}t + \Gamma^{-1/2}A\,\mathrm{Cov}_{\rho_t}A^\top\Gamma^{-1/2}\,\mathrm{d}W_t^j\\
&+ \Gamma^{-1/2}A\,\mathrm{Cov}_{\rho_t,\mathrm{m}}\,\Gamma^{-1}\left(r - \mathrm{m}(v^j)\right)\mathrm{d}t + \Gamma^{-1/2}A\,\mathrm{Cov}_{\rho_t,\mathrm{m}}\,\Gamma^{-1/2}\,\mathrm{d}W_t^j\,.
\end{aligned} \tag{53}
\]
Subtracting the two equations, we can derive the evolution of $\tilde{x}^j$. With the calculation shown in Supp. A, for any $0 < \epsilon < 1/2$ there is a $J^*_\epsilon > 0$ such that for $J > J^*_\epsilon$ and $0 \le t \le 1$,
\[
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t}\,\frac{1}{J}\sum_{j=1}^J\mathbb{E}|\tilde{x}^j|^2 \le {}& C_\epsilon\,J^{-1/2}\left(\left(\mathbb{E}|\tilde{x}^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|x^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\tilde{p}^1|^2\right)^{1-\epsilon}\right) + C\left(\mathbb{E}|\tilde{x}^1|^2 + \mathbb{E}|\tilde{p}^1|^2\right)\\
&+ C_\epsilon\,J^{-1/2}\left(\left(\mathbb{E}|\bar{\tilde{x}}|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\bar{\tilde{p}}|^2\right)^{1-\epsilon}\right) + C_\epsilon\,J^{-1}\,\mathbb{E}|\tilde{x}^1|^2 + C_\epsilon\,J^{-1}\,,
\end{aligned} \tag{54}
\]
where $C_\epsilon$ is a constant independent of $J$ and $t$. Plugging in (50) and (51), this leads to
\[
\frac{\mathrm{d}\,\mathbb{E}|\tilde{x}^1|^2}{\mathrm{d}t} = \frac{1}{J}\sum_{j=1}^J\frac{\mathrm{d}\,\mathbb{E}|\tilde{x}^j|^2}{\mathrm{d}t} \le C_\epsilon\,\mathbb{E}|\tilde{x}^1|^2 + C_\epsilon\,J^{-1/2}\left(\mathbb{E}|\tilde{x}^1|^2\right)^{1-\epsilon} + C_\epsilon\,J^{-\frac{1}{2}-\alpha+\alpha\epsilon}\,.
\]
Defining $X^\beta = \mathbb{E}\,J^\beta|\tilde{x}^1|^2$, the equation rewrites as
\[
\frac{\mathrm{d}X^\beta}{\mathrm{d}t} \le C_\epsilon\,X^\beta + C_\epsilon\,J^{-\frac{1}{2}+\epsilon\beta}\left(X^\beta\right)^{1-\epsilon} + C_\epsilon\,J^{-\frac{1}{2}-\alpha+\alpha\epsilon+\beta}\,.
\]
Because $X^\beta(0) = 0$, this implies
\[
\|X^\beta\|_{L^\infty} \lesssim \max\left\{O(1)\,,\ J^{-\frac{1}{2}+\epsilon\beta}\,,\ J^{-\frac{1}{2}-\alpha+\alpha\epsilon+\beta}\right\} \tag{55}
\]
for $J > J^*_\epsilon$. For $J \le J^*_\epsilon$, according to Corollary 2, one still has
\[
\|X^\beta\|_{L^\infty} \le \left(J^*_\epsilon\right)^\beta\sup_{0\le t\le 1}\mathbb{E}|\tilde{x}^1_t|^2 \le \left(J^*_\epsilon\right)^\beta C \lesssim O(1)\,.
\]
This can be absorbed into (55), so (55) is true for any $J > 0$. Therefore, we can choose $\beta = \frac{1+\alpha}{2} - \alpha\epsilon$; then
\[
\mathbb{E}|\tilde{x}^j|^2 = \mathbb{E}|\tilde{x}^1|^2 \lesssim O\left(J^{-\frac{1+\alpha}{2}+\alpha\epsilon}\right)\,,\qquad
\mathbb{E}|\tilde{p}^j|^2 \le 4\,\mathbb{E}|\tilde{x}^1|^2 \lesssim O\left(J^{-\frac{1+\alpha}{2}+\alpha\epsilon}\right)\,,
\]
for any $0 < \epsilon < 1/2$ and $1 \le j \le J$. The $O$ notation includes a constant $C_\epsilon$ that has $\epsilon$-dependence. ⊓⊔

This allows us to give a tighter bound for $\mathbb{E}|x^j|^2$; before stating it, we record the tightening map itself, which will be iterated in the proof of Proposition 2 below.
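The following tiny sketch (ours, purely illustrative) iterates the tightening map $\alpha \mapsto \frac{1}{2} + \frac{\alpha}{2} - \epsilon$ from Lemma 8 and confirms its fixed point $1 - 2\epsilon$, which is the rate claimed in (31):

```python
# Fixed point of the bootstrapping recursion alpha_n = 1/2 + alpha_{n-1}/2 - eps.
eps = 0.01
alpha = 0.0                       # start from mere boundedness (Corollary 2)
for n in range(60):
    alpha = 0.5 + alpha / 2 - eps
print(alpha)                      # -> 1 - 2*eps = 0.98
```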
Lemma 8 For any $0 \le \alpha < 1$ and $0 \le t \le 1$, if we have an estimate
\[
\mathbb{E}|x^j|^2 \lesssim O\left(J^{-\alpha}\right) \tag{56}
\]
for all $j$, then one can tighten it: for any $0 < \epsilon < 1/2$, there is a constant $C_\epsilon$ independent of $J$ and $t$ such that
\[
\mathbb{E}|p^j|^2 \le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}\,,\qquad \mathbb{E}|x^j|^2 \le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon} \tag{57}
\]
for all $j$. Note that for any $\alpha < 1$, we can choose $\epsilon < \frac{1-\alpha}{2}$ to make $\frac{1+\alpha}{2} - \epsilon > \alpha$.

Proof First, by Lemma 6, equation (49), we have the rough estimates
\[
\mathbb{E}|p^j|^2 \lesssim O\left(J^{-\alpha}\right)\,,\qquad \mathbb{E}|\bar{x}|^2 \lesssim O\left(J^{-\alpha}\right)\,. \tag{58}
\]
Similarly to the derivation of (54), we subtract the two particle systems (8) and (18). With the calculation in Supp. B and Lemma 7, for any $0 < \epsilon < 1/2$ there is a $J^*_\epsilon > 0$ such that for $J > J^*_\epsilon$ and $0 \le t \le 1$,
\[
\begin{aligned}
\frac{1}{J}\sum_{j=1}^J\frac{\mathrm{d}\,\mathbb{E}|x^j|^2}{\mathrm{d}t} \le {}& C_\epsilon\,J^{-1/2}\left(\left(\mathbb{E}|\bar{x}|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\bar{p}|^2\right)^{1-\epsilon}\right) + C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}\left(\left(\mathbb{E}|x^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|p^1|^2\right)^{1-\epsilon}\right)\\
&+ C_\epsilon\,J^{-1/2}\left(\left(\mathbb{E}|x^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\tilde{x}^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\tilde{p}^1|^2\right)^{1-\epsilon}\right) + C\left(\mathbb{E}|\tilde{x}^1|^2 + \mathbb{E}|\tilde{p}^1|^2\right)\\
&+ C_\epsilon\,J^{-1}\,\mathbb{E}|x^1|^2 + C_\epsilon\,J^{-\frac{1}{2}-\alpha+\alpha\epsilon}\,,
\end{aligned} \tag{59}
\]
where $C_\epsilon$ is a constant independent of $J$ and $t$. Inserting (56) and (58) back into (59), we can bound the first four groups of terms:
\[
\begin{aligned}
C_\epsilon\,J^{-1/2}\left(\left(\mathbb{E}|\bar{x}|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\bar{p}|^2\right)^{1-\epsilon}\right) &\le C_\epsilon\,J^{-\frac{1}{2}-\alpha+\alpha\epsilon}\,,\\
C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}\left(\left(\mathbb{E}|x^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|p^1|^2\right)^{1-\epsilon}\right) &\le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}\left(\mathbb{E}|x^1|^2\right)^{\frac{1}{2}-\frac{\epsilon}{4}}\,,\\
C_\epsilon\,J^{-1/2}\left(\left(\mathbb{E}|x^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\tilde{x}^1|^2\right)^{1-\epsilon} + \left(\mathbb{E}|\tilde{p}^1|^2\right)^{1-\epsilon}\right) &\le C_\epsilon\,J^{-1/2}\left(\mathbb{E}|x^1|^2\right)^{1-\epsilon}\,,\\
C\left(\mathbb{E}|\tilde{x}^1|^2 + \mathbb{E}|\tilde{p}^1|^2\right) &\le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\alpha\epsilon}\,,
\end{aligned}
\]
which implies, for $0 < \epsilon < 1/2$ and $J > J^*_\epsilon$,
\[
\frac{\mathrm{d}\,\mathbb{E}|x^1|^2}{\mathrm{d}t} = \frac{1}{J}\sum_{j=1}^J\frac{\mathrm{d}\,\mathbb{E}|x^j|^2}{\mathrm{d}t} \le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon}\left(\mathbb{E}|x^1|^2\right)^{\frac{1}{2}-\frac{\epsilon}{4}} + C_\epsilon\,J^{-1/2}\left(\mathbb{E}|x^1|^2\right)^{1-\epsilon} + C_\epsilon\,\mathbb{E}|x^1|^2 + C_\epsilon\,J^{-\frac{1+\alpha}{2}+\alpha\epsilon}\,.
\]
Similarly to (55), defining $X^\beta = \mathbb{E}\,J^\beta|x^1|^2$, we have
\[
\frac{\mathrm{d}X^\beta}{\mathrm{d}t} \le C_\epsilon\,J^{-\frac{1+\alpha}{2}+\epsilon+\frac{\beta(2+\epsilon)}{4}}\left(X^\beta\right)^{\frac{1}{2}-\frac{\epsilon}{4}} + C_\epsilon\,J^{-\frac{1}{2}+\epsilon\beta}\left(X^\beta\right)^{1-\epsilon} + C_\epsilon\,X^\beta + C_\epsilon\,J^{-\frac{1+\alpha}{2}+\alpha\epsilon+\beta}\,,
\]
which implies
\[
\|X^\beta\|_{L^\infty} \lesssim \max\left\{O(1)\,,\ J^{-\frac{1+\alpha}{2}+\epsilon+\frac{\beta(2+\epsilon)}{4}}\,,\ J^{-\frac{1}{2}+\epsilon\beta}\,,\ J^{-\frac{1+\alpha}{2}+\alpha\epsilon+\beta}\right\} \tag{60}
\]
for $J > J^*_\epsilon$. Noting that
\[
\|X^\beta\|_{L^\infty} \le \left(J^*_\epsilon\right)^\beta\sup_{0\le t\le 1}\mathbb{E}|x_t^1|^2 \le \left(J^*_\epsilon\right)^\beta C \lesssim O(1)
\]
for all $J \le J^*_\epsilon$, with the constant $C$ stemming from the boundedness in Corollary 2, we have that (60) holds true for all $J > 0$. Therefore, we can choose $\beta = \frac{1+\alpha}{2} - \alpha\epsilon$ to obtain
\[
\mathbb{E}|x^j|^2 = \mathbb{E}|x^1|^2 \lesssim O\left(J^{-\frac{1+\alpha}{2}+\alpha\epsilon}\right)
\]
for any $\epsilon < 1/2$, which concludes (57). ⊓⊔

Finally, we are ready to prove Proposition 2.
Finally, we are ready to prove Proposition 2.

Proof
We first note that, by the definition of the Wasserstein distance, for any $0 \le t \le 1$,
\[
\mathbb{E}\big(W_2^2(M_{v_t}, M_{u_t})\big) \le \frac{1}{J}\sum_{j=1}^J \mathbb{E}\big|u^j_t - v^j_t\big|^2 = \frac{1}{J}\sum_{j=1}^J \mathbb{E}\big|x^j_t\big|^2,
\]
and thus the estimate (32) holds true once (31) is shown. For that we directly apply Lemma 8: starting with $\alpha_0 = 0$, we recursively use the lemma, equation (57) in particular, with
\[
\alpha_n = \frac{1}{2} + \frac{\alpha_{n-1}}{2} - \epsilon,
\]
until the rate saturates at $\lim_{n\to\infty} \alpha_n = 1 - 2\epsilon$. Since $\epsilon$ is an arbitrarily small number, we conclude the proof. ⊓⊔
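The first inequality above is simply a synchronous-coupling bound: pairing $u^j$ with $v^j$ gives one particular coupling of the two empirical measures, while $W_2$ minimizes over all couplings. As a quick numerical illustration (our own sketch, not part of the paper; in one dimension the $W_2$ distance between two equal-size empirical measures is obtained by matching sorted atoms):

import numpy as np

rng = np.random.default_rng(0)
J = 1000
u = rng.normal(0.0, 1.0, J)            # particles u^j
v = u + 0.1 * rng.normal(0.0, 1.0, J)  # coupled particles v^j

# Synchronous-coupling bound: (1/J) * sum_j |u^j - v^j|^2
coupling_bound = np.mean((u - v) ** 2)

# Exact squared W2 between the two empirical measures (1-D: match sorted atoms)
w2_squared = np.mean((np.sort(u) - np.sort(v)) ** 2)

print(w2_squared <= coupling_bound)  # always True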
Acknowledgements The research of Q.L. and Z.D. was supported in part by the National Science Foundation under awards 1619778 and 1750488, and by the Wisconsin Data Science Initiative. Both authors would like to thank Andrew Stuart for the helpful discussions.

References
1. K. Bergemann and S. Reich. A localization technique for ensemble Kalman filters. Quarterly Journal of the Royal Meteorological Society, 136(648):701-707, 2010.
2. K. Bergemann and S. Reich. A mollified ensemble Kalman filter. Quarterly Journal of the Royal Meteorological Society, 136(651):1636-1643, 2010.
3. D. Blömker, C. Schillings, P. Wacker, and S. Weissmann. Well posedness and convergence analysis of the ensemble Kalman inversion. Inverse Problems, 2019.
4. D. Blömker, C. Schillings, and P. Wacker. A strongly convergent numerical scheme from ensemble Kalman inversion. SIAM Journal on Numerical Analysis, 56(4):2537-2562, 2018.
5. F. Bolley, J. A. Cañizo, and J. A. Carrillo. Stochastic mean-field limit: non-Lipschitz forces and swarming. Mathematical Models and Methods in Applied Sciences, 21(11):2179-2210, 2011.
6. J. A. Cañizo, J. A. Carrillo, and J. Rosado. A well-posedness theory in measures for some kinetic models of collective motion. Mathematical Models and Methods in Applied Sciences, 21(03):515-539, 2011.
7. K. Craig and A. Bertozzi. A blob method for the aggregation equation. Mathematics of Computation, 85, 2014.
8. M. Dashti and A. M. Stuart. The Bayesian Approach to Inverse Problems. Springer International Publishing, Cham, 2017.
9. P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Approximations. Springer-Verlag, 2004.
10. Z. Ding and Q. Li. Ensemble Kalman sampling: mean-field limit and convergence analysis. arXiv:1910.12923, 2019.
11. Z. Ding, Q. Li, and J. Lu. Ensemble Kalman inversion for nonlinear problems: weights, consistency, and variance bounds, 2020.
12. A. Doucet, N. de Freitas, and N. Gordon. An Introduction to Sequential Monte Carlo Methods. Springer New York, New York, NY, 2001.
13. O. G. Ernst, B. Sprungk, and H.-J. Starkloff. Analysis of the ensemble and polynomial chaos Kalman filters in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification, 3(1):823-851, 2015.
14. G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research: Oceans, 99(C5):10143-10162, 1994.
15. G. Evensen. The ensemble Kalman filter: theoretical formulation and practical implementation. Ocean Dynamics, 53(4):343-367, 2003.
16. G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer-Verlag New York, Secaucus, NJ, USA, 2006.
17. N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707-738, 2015.
18. A. Garbuno-Inigo, F. Hoffmann, W. Li, and A. M. Stuart. Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler. arXiv:1903.08866, 2019.
19. M. Ghil, S. Cohn, J. Tavantzis, K. Bube, and E. Isaacson. Applications of Estimation Theory to Numerical Weather Prediction. Springer New York, New York, NY, 1981.
20. M. Herty and G. Visconti. Kinetic methods for inverse problems. Kinetic & Related Models, 12:1109, 2019.
21. P. L. Houtekamer and H. L. Mitchell. A sequential ensemble Kalman filter for atmospheric data assimilation. Monthly Weather Review, 129(1):123-137, 2001.
22. M. A. Iglesias, K. Law, and A. M. Stuart. Ensemble Kalman methods for inverse problems. Inverse Problems, 29(4):045001, 2013.
23. T. Lange and W. Stannat. On the continuous time limit of the ensemble Kalman filter, 2019.
24. K. J. H. Law, H. Tembine, and R. Tempone. Deterministic mean-field ensemble Kalman filtering. SIAM Journal on Scientific Computing, 38(3):A1251-A1279, 2016.
25. F. Le Gland, V. Monbet, and V. Tran. Large sample asymptotics for the ensemble Kalman filter. Handbook on Nonlinear Filtering, 2011.
26. Q. Liu and D. Wang. Stein variational gradient descent: a general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems 29, pages 2378-2386, 2016.
27. J. Lu, Y. Lu, and J. Nolen. Scaling limit of the Stein variational gradient descent: the mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648-671, 2019.
28. Y. Lu, J. Lu, and J. Nolen. Accelerating Langevin sampling with birth-death, 2019.
29. M. Pavon, E. G. Tabak, and G. Trigila. The data-driven Schroedinger bridge. arXiv:1806.01364, 2018.
30. S. Reich. A dynamical systems framework for intermittent data assimilation. BIT Numerical Mathematics, 51(1):235-249, 2011.
31. S. Reich. Data assimilation: the Schrödinger perspective. Acta Numerica, 28:635-711, 2019.
32. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. 2nd ed. Springer, New York, 2004.
33. C. Schillings and A. M. Stuart. Analysis of the ensemble Kalman filter for inverse problems. SIAM Journal on Numerical Analysis, 55(3):1264-1290, 2017.
34. C. Schillings and A. M. Stuart. Convergence analysis of ensemble Kalman inversion: the linear, noisy case. Applicable Analysis, 97(1):107-123, 2018.
35. A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451-559, 2010.
36. A. Sznitman. Topics in propagation of chaos. In École d'Été de Probabilités de Saint-Flour XIX — 1989, pages 165-251. Springer Berlin Heidelberg, 1991.
A Moment bounds for sums of independent mean-zero random variables
In this section, we prove a lemma that is used in the proof of Lemma 3.
Lemma 9
Assume $x_1, \cdots, x_J$ are i.i.d. random variables that satisfy (for $p \ge 2$)
\[
\mathbb{E}x_i = 0, \qquad L_p = \mathbb{E}|x_i|^p < \infty.
\]
Then we have
\[
\Bigg(\mathbb{E}\Big|\sum_{j=1}^J x_j\Big|^p\Bigg)^{1/p} \le C J^{1/2},
\]
where $C$ is a constant that depends only on $L_p$ and $p$.

Proof Without loss of generality, we assume $p$ is an even number and $J > p/2$. Then $\mathbb{E}\big|\sum_{j=1}^J x_j\big|^p = \mathbb{E}\big(\sum_{j=1}^J x_j\big)^p$. Since the $\{x_i\}$ are independent with zero mean, we have
\[
\mathbb{E}\Big(\sum_{j=1}^J x_j\Big)^p = \sum_{j_1+j_2+\cdots+j_J = p} \binom{p}{j_1,\cdots,j_J}\,\mathbb{E}\big(x_1^{j_1} x_2^{j_2} \cdots x_J^{j_J}\big),
\]
where the $\{j_n\}$ are non-negative integers, none of which equals 1 (any term with some $j_n = 1$ vanishes because $\mathbb{E}x_i = 0$). For each term in the summation, the generalized Hölder inequality gives
\[
\mathbb{E}\big(x_1^{j_1} x_2^{j_2} \cdots x_J^{j_J}\big) \le \prod_{n=1}^J \big(\mathbb{E}|x_n|^p\big)^{j_n/p} = L_p.
\]
Since each multinomial coefficient is at most $p!$, this implies
\[
\mathbb{E}\Big(\sum_{j=1}^J x_j\Big)^p \le p!\,L_p \sum_{j_1+\cdots+j_J = p} 1 = p!\,L_p\,|I|, \tag{61}
\]
where
\[
I = \Bigg\{(j_1,\cdots,j_J)\ \Bigg|\ j_n \in \mathbb{N} \setminus \{1\},\ \sum_{n=1}^J j_n = p\Bigg\}
\]
and $|I|$ denotes the cardinality of the set $I$. In $I$, if $j_n$ does not equal zero, then $j_n$ is at least 2, meaning there are at most $p/2$ nonzero indices. Therefore
\[
|I| \le P(J, p/2)\,|I_2| \le J^{p/2}\,|I_2| \le C(p)\,J^{p/2}. \tag{62}
\]
Here $P(J, p/2)$ denotes the number of ways of choosing $p/2$ indices out of $J$, and is thus smaller than $J^{p/2}$, and $I_2$ is a new set defined by
\[
I_2 = \Bigg\{(i_1,\cdots,i_{p/2})\ \Bigg|\ i_n \in \mathbb{N} \setminus \{1\},\ \sum_{n=1}^{p/2} i_n = p\Bigg\}.
\]
Its cardinality has no $J$ dependence, and thus we bound it by $C(p)$, a constant depending on $p$ only. Combining (61) and (62) and taking the $p$-th root concludes the proof. ⊓⊔
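As a quick numerical sanity check of Lemma 9 (our own illustration, not part of the argument), one can estimate the $p$-th moment of the sum by Monte Carlo and verify that its $p$-th root indeed scales like $J^{1/2}$:

import numpy as np

rng = np.random.default_rng(1)
p = 4  # an even moment order, as in the proof
for J in (100, 400, 1600):
    # 10000 independent replicas of a sum of J i.i.d. mean-zero variables
    sums = rng.uniform(-1.0, 1.0, size=(10000, J)).sum(axis=1)
    lhs = np.mean(np.abs(sums) ** p) ** (1.0 / p)
    print(J, lhs / np.sqrt(J))  # the ratio stays O(1), consistent with C * J^{1/2}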
B Bound of high moments of $\{u^j\}$

Proof For convenience, we omit the subscript $t$ in $u$, $\bar{u}$, $e$, $\bar{e}$, etc. First, we prove the boundedness of $\mathbb{E}\big[\frac{1}{J}\sum_j^J |e^j|^2\big]^p$, which we will use later:
\[
\mathbb{E}\Bigg[\frac{1}{J}\sum_j^J |e^j|^2\Bigg]^p \le \mathbb{E}\Bigg[\sum_{m=1}^K \frac{1}{J}\sum_j^J |e^j_m|^2\Bigg]^p \le C_p\,\mathbb{E}\sum_{m=1}^K \Bigg[\frac{1}{J}\sum_j^J |e^j_m|^2\Bigg]^p \le C_p V_p(e) \le C, \tag{63}
\]
which also implies
\[
\mathbb{E}\Bigg[\frac{1}{J}\sum_j^J |\bar{e}^j|^2\Bigg]^p \le C\,\mathbb{E}\Bigg[\frac{1}{J}\sum_j^J |e^j|^2\Bigg]^p \le C. \tag{64}
\]
Then, we first estimate $\mathbb{E}|u^j|^{2p}$. Using Itô's formula, for fixed $1 \le j \le J$ and $p \ge 1$, we obtain
\[
\begin{aligned}
\mathrm{d}|u^j|^{2p} = {}& -2p\big(|u^j|^{2p-2}\langle u^j, \mathrm{Cov}_u\, u^j\rangle\big)\,\mathrm{d}t + R\,\mathrm{d}W^j_t \\
& + \frac{p\,|u^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle e^i, e^k\rangle^2\,\mathrm{d}t + \frac{2p(p-1)}{J^2}\,|u^j|^{2p-4}\sum_{i,k=1}^J \langle u^j, e^i\rangle\langle u^j, e^k\rangle\langle e^i, e^k\rangle\,\mathrm{d}t \\
& + 2p\Big(|u^j|^{2p-2}\big\langle u^j, \mathrm{Cov}_{u,r}\,\Gamma^{-1}\big(r - \mathrm{m}(u)\big)\big\rangle\Big)\,\mathrm{d}t \\
& + \frac{p\,|u^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle e^i, e^k\rangle\langle r^i, r^k\rangle\,\mathrm{d}t + \frac{2p(p-1)}{J^2}\,|u^j|^{2p-4}\sum_{i,k=1}^J \langle u^j, e^i\rangle\langle u^j, e^k\rangle\langle r^i, r^k\rangle\,\mathrm{d}t,
\end{aligned} \tag{65}
\]
where $R$ is the coefficient in front of the Brownian motion. The first term is negative. To complete the computation, we need to bound the rest. The second term is bounded by
\[
\mathbb{E}\Bigg[\frac{|u^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle e^i, e^k\rangle^2\Bigg] \le \mathbb{E}\Bigg[|u^j|^{2p-2}\Bigg(\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg)^2\Bigg] \le \big(\mathbb{E}|u^j|^{2p}\big)^{(p-1)/p}\Bigg(\mathbb{E}\Bigg[\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg]^{2p}\Bigg)^{1/p},
\]
and the third term by
\[
\frac{1}{J^2}\,\mathbb{E}\Bigg[|u^j|^{2p-4}\sum_{i,k=1}^J \langle u^j, e^i\rangle\langle u^j, e^k\rangle\langle e^i, e^k\rangle\Bigg] \le \mathbb{E}\Bigg[|u^j|^{2p-2}\Bigg(\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg)^2\Bigg] \le \big(\mathbb{E}|u^j|^{2p}\big)^{(p-1)/p}\Bigg(\mathbb{E}\Bigg[\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg]^{2p}\Bigg)^{1/p}.
\]
Similarly, the remaining terms are bounded by
\[
\mathbb{E}\Big(|u^j|^{2p-2}\big\langle u^j, \mathrm{Cov}_{u,r}\,\Gamma^{-1}\big(r - \mathrm{m}(u)\big)\big\rangle\Big) \le C\,\mathbb{E}\Bigg[|u^j|^{2p-1}\Bigg(\frac{1}{J}\sum_{k=1}^J |e^k|\Bigg)\Bigg] \le C\big(\mathbb{E}|u^j|^{2p}\big)^{(2p-1)/(2p)}\Bigg(\mathbb{E}\Bigg[\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg]^{p}\Bigg)^{1/(2p)},
\]
\[
\mathbb{E}\Bigg[\frac{|u^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle e^i, e^k\rangle\langle r^i, r^k\rangle\Bigg] \le C\,\mathbb{E}\Bigg[|u^j|^{2p-2}\Bigg(\frac{1}{J}\sum_{i=1}^J |e^i|\Bigg)^2\Bigg] \le C\big(\mathbb{E}|u^j|^{2p}\big)^{(p-1)/p}\Bigg(\mathbb{E}\Bigg[\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg]^{p}\Bigg)^{1/p},
\]
and
\[
\frac{1}{J^2}\,\mathbb{E}\Bigg[|u^j|^{2p-4}\sum_{i,k=1}^J \langle u^j, e^i\rangle\langle u^j, e^k\rangle\langle r^i, r^k\rangle\Bigg] \le C\,\mathbb{E}\Bigg[|u^j|^{2p-2}\Bigg(\frac{1}{J}\sum_{i=1}^J |e^i|\Bigg)^2\Bigg] \le C\big(\mathbb{E}|u^j|^{2p}\big)^{(p-1)/p}\Bigg(\mathbb{E}\Bigg[\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg]^{p}\Bigg)^{1/p}.
\]
Plugging all these inequalities back into (65) and utilizing (64), we have
\[
\frac{\mathrm{d}\,\mathbb{E}|u^j|^{2p}}{\mathrm{d}t} \le C\big(\mathbb{E}|u^j|^{2p}\big)^{(p-1)/p} \quad\Rightarrow\quad \mathbb{E}|u^j|^{2p} \le C. \tag{66}
\]
Then, to deal with $\mathbb{E}|\bar{u}^j|^{2p}$, we use Itô's formula similarly: for fixed $1 \le j \le J$ and $p \ge 1$, we obtain
\[
\begin{aligned}
\mathrm{d}|\bar{u}^j|^{2p} = {}& -2p\big(|\bar{u}^j|^{2p-2}\langle \bar{u}^j, \mathrm{Cov}_{u,\bar{u}}\, \bar{u}^j\rangle\big)\,\mathrm{d}t + R\,\mathrm{d}W^j_t \\
& + \frac{p\,|\bar{u}^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle e^i, \bar{e}^k\rangle\langle \bar{e}^k, e^i\rangle\,\mathrm{d}t + \frac{2p(p-1)}{J^2}\,|\bar{u}^j|^{2p-4}\sum_{i,k=1}^J \langle \bar{u}^j, \bar{e}^i\rangle\langle \bar{u}^j, \bar{e}^k\rangle\langle e^i, e^k\rangle\,\mathrm{d}t \\
& + 2p\Big(|\bar{u}^j|^{2p-2}\big\langle \bar{u}^j, \mathrm{Cov}_{\bar{u},r}\,\Gamma^{-1}\big(r - \mathrm{m}(u)\big)\big\rangle\Big)\,\mathrm{d}t \\
& + \frac{p\,|\bar{u}^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle \bar{e}^i, \bar{e}^k\rangle\langle r^i, r^k\rangle\,\mathrm{d}t + \frac{2p(p-1)}{J^2}\,|\bar{u}^j|^{2p-4}\sum_{i,k=1}^J \langle \bar{u}^j, \bar{e}^i\rangle\langle \bar{u}^j, \bar{e}^k\rangle\langle r^i, r^k\rangle\,\mathrm{d}t,
\end{aligned}
\]
where $R$ is again the coefficient in front of the Brownian motion. The six terms are considered separately.

Term 1:
\[
\big|\mathbb{E}\big(|\bar{u}^j|^{2p-2}\langle \bar{u}^j, \mathrm{Cov}_{u,\bar{u}}\,\bar{u}^j\rangle\big)\big| \le \mathbb{E}\Bigg(|\bar{u}^j|^{2p-1}\,\frac{1}{J}\sum_{k=1}^J |e^k||\bar{e}^k|\Bigg) \le \big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(2p-1)/(2p)}\Bigg(\mathbb{E}\Bigg[\frac{1}{J}\sum_{k=1}^J |e^k||\bar{e}^k|\Bigg]^{2p}\Bigg)^{1/(2p)} \le C\big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(2p-1)/(2p)},
\]
where in the last inequality we use (63), (64) and (66) with Hölder's inequality.

Term 2:
\[
\Bigg|\mathbb{E}\Bigg[\frac{|\bar{u}^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle e^i, \bar{e}^k\rangle\langle \bar{e}^k, e^i\rangle\Bigg]\Bigg| \le C\,\mathbb{E}\Bigg[|\bar{u}^j|^{2p-2}\Bigg(\frac{1}{J}\sum_{i=1}^J |e^i|^2\Bigg)\Bigg(\frac{1}{J}\sum_{k=1}^J |\bar{e}^k|^2\Bigg)\Bigg] \le C V_p^{1/p}(e)\,\big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(p-1)/p}.
\]

Term 3:
\[
\Bigg|\mathbb{E}\Bigg[\frac{|\bar{u}^j|^{2p-4}}{J^2}\sum_{i,k=1}^J \langle \bar{u}^j, \bar{e}^i\rangle\langle \bar{u}^j, \bar{e}^k\rangle\langle e^i, e^k\rangle\Bigg]\Bigg| \le C\,\mathbb{E}\Bigg[\frac{|\bar{u}^j|^{2p-2}}{J^2}\sum_{i,k=1}^J |\bar{e}^i||\bar{e}^k||e^i||e^k|\Bigg] \le C V_p^{1/p}(e)\,\big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(p-1)/p}.
\]

Term 4:
\[
\Big|\mathbb{E}\Big(|\bar{u}^j|^{2p-2}\big\langle \bar{u}^j, \mathrm{Cov}_{\bar{u},r}\,\Gamma^{-1}\big(r - \mathrm{m}(u)\big)\big\rangle\Big)\Big| \le M\,\mathbb{E}\Bigg(|\bar{u}^j|^{2p-1}\,\frac{1}{J}\sum_{k=1}^J |\bar{e}^k|\Bigg) \le C\big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(2p-1)/(2p)},
\]
where in the last inequality we use (63) and (66) with Hölder's inequality.

Term 5:
\[
\Bigg|\mathbb{E}\Bigg[\frac{|\bar{u}^j|^{2p-2}}{J^2}\sum_{i,k=1}^J \langle \bar{e}^i, \bar{e}^k\rangle\langle r^k, r^i\rangle\Bigg]\Bigg| \le C M\,\mathbb{E}\Bigg[|\bar{u}^j|^{2p-2}\Bigg(\frac{1}{J}\sum_{i=1}^J |\bar{e}^i|\Bigg)^2\Bigg] \le C V_p^{1/p}(e)\,\big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(p-1)/p}.
\]

Term 6:
\[
\Bigg|\mathbb{E}\Bigg[\frac{|\bar{u}^j|^{2p-4}}{J^2}\sum_{i,k=1}^J \langle \bar{u}^j, \bar{e}^i\rangle\langle \bar{u}^j, \bar{e}^k\rangle\langle r^i, r^k\rangle\Bigg]\Bigg| \le C\,\mathbb{E}\Bigg[\frac{|\bar{u}^j|^{2p-2}}{J^2}\sum_{i,k=1}^J |\bar{e}^i||\bar{e}^k||r^i||r^k|\Bigg] \le C V_p^{1/p}(e)\,\big(\mathbb{E}|\bar{u}^j|^{2p}\big)^{(p-1)/p}.
\]
Combining the six bounds and applying Lemma 4, we obtain the boundedness of $\mathbb{E}|\bar{u}^j|^{2p}$. Then, to prove the second inequality of (45), it suffices to prove
\[
\big(\mathbb{E}\|\mathrm{Cov}(u_t)\|^p\big)^{1/p} \le C_p,
\]
which is a direct result of the expansion of $\mathrm{Cov}(u_t)$ and the triangle inequality:
\[
\big(\mathbb{E}\|\mathrm{Cov}(u_t)\|^p\big)^{1/p} \le \frac{1}{J}\sum_{j=1}^J \Big(\mathbb{E}\big\|(u^j - \bar{u}) \otimes (u^j - \bar{u})\big\|^p\Big)^{1/p} \le \frac{1}{J}\sum_{j=1}^J \Big(\mathbb{E}\big|u^j - \bar{u}\big|^{2p}\Big)^{1/p} \le C.
\]
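The last chain of inequalities uses the fact that a rank-one operator $v \otimes v$ has operator norm $|v|^2$. A small numerical illustration of this triangle-inequality bound on the ensemble covariance (our own sketch, with hypothetical ensemble sizes; not part of the paper):

import numpy as np

rng = np.random.default_rng(2)
J, d = 50, 5
u = rng.normal(size=(J, d))            # ensemble u^1, ..., u^J in R^d
centered = u - u.mean(axis=0)          # u^j - u_bar

cov = centered.T @ centered / J        # Cov(u) as an average of rank-one terms

op_norm = np.linalg.norm(cov, 2)                       # spectral norm of Cov(u)
rank_one_avg = np.mean(np.sum(centered ** 2, axis=1))  # (1/J) sum_j |u^j - u_bar|^2

print(op_norm <= rank_one_avg)  # always True, by the triangle inequality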