Log-Normalization Constant Estimation using the Ensemble Kalman-Bucy Filter with Application to High-Dimensional Models
BY DAN CRISAN, PIERRE DEL MORAL, AJAY JASRA & HAMZA RUZAYQAT

Department of Mathematics, Imperial College London, London, SW7 2AZ, UK. E-Mail: [email protected]

INRIA Bordeaux Sud-Ouest Research Center & Institut de Mathématiques de Bordeaux, Bordeaux, 33405, FR. E-Mail: [email protected]

Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, KSA. E-Mail: [email protected], [email protected]
Abstract
In this article we consider the estimation of the log-normalization constant associated to a class of continuous-time filtering models. In particular, we consider estimates based upon the ensemble Kalman-Bucy filter associated with several nonlinear Kalman-Bucy diffusions. Using new conditional bias results for the mean of the aforementioned methods, we analyze the empirical log-scale normalization constants in terms of their $\mathbb{L}_n$-errors and conditional bias. Depending on the type of nonlinear Kalman-Bucy diffusion, we show that these are of order $(t^{1/2}/N^{1/2}) + t/N$ or $1/N^{1/2}$ ($\mathbb{L}_n$-errors) and of order $[t + t^{1/2}]/N$ or $1/N$ (conditional bias), where $t$ is the time horizon and $N$ is the ensemble size. Finally, we use these results for online static parameter estimation for the above filtering models and implement the methodology for both linear and nonlinear models.

Keywords: Kalman-Bucy filter, Riccati equations, nonlinear Markov processes.
1 Introduction

The filtering problem concerns the recursive estimation of a partially observed Markov process conditioned on a path of observations. It is found in a wide class of real applications, including finance, applied mathematics and engineering; see [2] for instance. This article focuses upon the computation of the normalizing constant for certain classes of filtering problems, to be introduced below. These normalizing constants can be of interest in statistics and engineering for model selection and/or parameter estimation.

In this article we study linear and Gaussian models in continuous-time. It is well-known that, for such models, under some minimal assumptions, the filter is Gaussian, with mean satisfying the Kalman-Bucy (KB) equations and the covariance matrix obeying a Riccati equation. In many cases of practical interest, these latter equations may not be computable, for instance if one does not have access to an entire trajectory of data. Instead one can use some suitable approximations obtained via time-discretization methods. However, even then, if the dimension $r_1$ of the state-vector is very large, the numerical approximation of the KB and Riccati equations can become computationally prohibitive, often with a computational effort of at least $\mathcal{O}(r_1^2)$ per update. The case where $r_1$ is large occurs in many applications including ocean and atmosphere science (e.g. [1]) and oil reservoir simulations (e.g. [17]). A rather elegant and successful solution, in discrete-time, to this problem was developed in [10] in the guise of the ensemble Kalman filter (EnKF), which can reduce the cost to $\mathcal{O}(r_1)$ per update (for a fixed ensemble size).

The EnKF can be interpreted as a mean-field particle approximation of a conditional McKean-Vlasov diffusion (the Kalman-Bucy diffusion). This latter diffusion shares the same law as the filter associated to the linear and Gaussian model in continuous-time. Hence a possible alternative to recursively solving the KB and Riccati equations is to generate $N \in \mathbb{N}$ independent copies of the KB diffusion and use a simple Monte Carlo estimate for expectations with respect to the filter. However, the diffusion process cannot be simulated exactly, but can be approximated in a principled way by allowing the $N$ samples to interact; precise details are given in Section 2. The resulting method, named the ensemble Kalman-Bucy filter (EnKBF), is by now rather well-understood with several contributions on its convergence (as $N \to \infty$) analysis; see for instance [3, 5, 8, 12, 19].

In this work we focus upon using several versions of the EnKBF for an online estimate of the normalization constant. In particular, the contributions of this work are:

1. New results on the conditional bias of the mean using the EnKBF.
2. A derivation of an estimate of the normalization constant using the EnKBF.
3. A proof that the $\mathbb{L}_n$-error of the estimate on the log-scale is of $\mathcal{O}(\sqrt{t/N} + t/N)$ or $\mathcal{O}(1/\sqrt{N})$, depending on the nonlinear Kalman-Bucy diffusion, where $t$ is the time parameter.
4. A proof that the conditional bias of the estimate on the log-scale is of $\mathcal{O}((t + \sqrt{t})/N)$ or $\mathcal{O}(1/N)$, depending on the nonlinear Kalman-Bucy diffusion.
5. A development of a method that uses this estimate to perform online static parameter estimation.

The result in 1. is of independent interest, but is used directly in 4. To the best of our knowledge, the estimate in 2. is new and can be computed with little extra computational cost over applying an EnKBF (under time discretization).
Whilst several authors have used the EnKF for normalization constant estimation, e.g. [9], we have not seen the EnKBF version investigated in the literature.

In addition to contributions 3. & 4., the results establish the decay of the mean square error (MSE), for instance, of the log-normalizing constant estimate as the time parameter increases. This rate is expected to occur in practice, as we will see in simulations, and parallels other results found in the literature for particle estimates of the normalization constant (e.g. particle filters [7]). In relation to 5., if one assumes that the model of interest has several unknown and time-homogeneous parameters, $\theta$ say, then one is interested in estimating such parameters, for instance using likelihood-based methods. We show how our estimate can be leveraged for this latter task. In this paper we use the simultaneous perturbation stochastic approximation (SPSA) method [18], popular in engineering applications. SPSA is based upon a finite difference estimator of the gradient, w.r.t. $\theta$, of the log-normalizing constant. It constructs a stochastic approximation scheme for the estimation of static parameters. We are not aware of this method having been used for the EnKBF. As it is a zero-order optimization method, we expect it to be computationally less expensive than resorting to other estimates of the gradient (of the log-normalizing constant). It should be noted that our work focuses upon the standard or 'vanilla' EnKBF, the deterministic EnKBF [16] and deterministic transport-type EnKBFs [15]; other extensions are possible, for instance, to the feedback particle filter.

This article is structured as follows. In Section 2 we provide details on the class of filtering problems that we address, as well as details on the ensemble Kalman-Bucy filter. The conditional bias of the mean associated to various EnKBFs is also presented there. In Section 3 we discuss how the normalizing constant estimate can be computed, as well as its $\mathbb{L}_n$-error and conditional bias. The latter is supported by numerical results checking that the order of the convergence rate indeed holds, even under a naive Euler discretization. In Section 4 we show how the normalizing constant estimate can be used in online static parameter estimation problems.

2 Description of the Model and Algorithms
2.1 The Filtering Model

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space together with a filtration $(\mathcal{G}_t)_{t \ge 0}$ which satisfies the usual conditions. On $(\Omega, \mathcal{F}, \mathbb{P})$ we consider two $\mathcal{G}_t$-adapted processes $X = \{X_t, t \ge 0\}$ and $Y = \{Y_t, t \ge 0\}$ that form a time-homogeneous linear-Gaussian filtering model of the following form:

\[ dX_t = A X_t\, dt + R_1^{1/2}\, dW_t, \qquad dY_t = C X_t\, dt + R_2^{1/2}\, dV_t. \tag{1} \]

In the above display, $(W_t, V_t)$ is a $\mathcal{G}_t$-adapted $(r_1 + r_2)$-dimensional Brownian motion, $X_0$ is a $\mathcal{G}_0$-measurable $\mathbb{R}^{r_1}$-valued Gaussian random vector with mean and covariance matrix $(\mathbb{E}(X_0), P_0)$ (independent of $(W_t, V_t)$), the symmetric matrices $R_1^{1/2}$ and $R_2^{1/2}$ are invertible, $A$ is a square $(r_1 \times r_1)$-matrix, $C$ is an $(r_2 \times r_1)$-matrix, and $Y_0 = 0$. We let $\mathcal{F}_t = \sigma(Y_s, s \le t)$ be the filtration generated by the observation process.

It is well-known that the conditional distribution $\eta_t$ of the signal state $X_t$ given $\mathcal{F}_t$ is an $r_1$-dimensional Gaussian distribution with mean and covariance matrix

\[ \widehat{X}_t := \mathbb{E}(X_t \mid \mathcal{F}_t) \qquad \text{and} \qquad P_t := \mathbb{E}\big( (X_t - \mathbb{E}(X_t \mid \mathcal{F}_t))(X_t - \mathbb{E}(X_t \mid \mathcal{F}_t))' \big) \]

given by the Kalman-Bucy and the Riccati equations

\[ d\widehat{X}_t = A \widehat{X}_t\, dt + P_t C' R_2^{-1} \big( dY_t - C \widehat{X}_t\, dt \big) \tag{2} \]
\[ \partial_t P_t = \mathrm{Ricc}(P_t) \tag{3} \]

with the Riccati drift function from $\mathcal{S}_{r_1}^+$ into $\mathcal{S}_{r_1}$ (where $\mathcal{S}_{r_1}^+$ (resp. $\mathcal{S}_{r_1}$) is the collection of symmetric positive definite (resp. semi-definite) $r_1 \times r_1$ matrices) defined for any $Q \in \mathcal{S}_{r_1}^+$ by

\[ \mathrm{Ricc}(Q) = AQ + QA' - QSQ + R \qquad \text{with} \qquad R := R_1 \quad \text{and} \quad S := C' R_2^{-1} C. \tag{4} \]

2.2 Kalman-Bucy (Nonlinear) Diffusion Processes

We now consider three conditional nonlinear McKean-Vlasov type diffusion processes:

\[ d\overline{X}_t = A \overline{X}_t\, dt + R_1^{1/2}\, d\overline{W}_t + P_{\overline{\eta}_t} C' R_2^{-1} \big[ dY_t - \big( C \overline{X}_t\, dt + R_2^{1/2}\, d\overline{V}_t \big) \big] \tag{5} \]
\[ d\overline{X}_t = A \overline{X}_t\, dt + R_1^{1/2}\, d\overline{W}_t + P_{\overline{\eta}_t} C' R_2^{-1} \Big[ dY_t - C\, \frac{\overline{X}_t + \overline{\eta}_t(e)}{2}\, dt \Big] \tag{6} \]
\[ d\overline{X}_t = A \overline{X}_t\, dt + \frac{1}{2} R_1 P_{\overline{\eta}_t}^{-1} \big( \overline{X}_t - \overline{\eta}_t(e) \big)\, dt + P_{\overline{\eta}_t} C' R_2^{-1} \Big[ dY_t - C\, \frac{\overline{X}_t + \overline{\eta}_t(e)}{2}\, dt \Big] \tag{7} \]

where $(\overline{W}_t, \overline{V}_t, \overline{X}_0)$ are independent copies of $(W_t, V_t, X_0)$ (thus independent of the signal and the observation path). In the above displayed formulae, $P_{\overline{\eta}_t}$ stands for the covariance matrix

\[ P_{\overline{\eta}_t} = \overline{\eta}_t\big[ (e - \overline{\eta}_t(e))(e - \overline{\eta}_t(e))' \big] \qquad \text{with} \qquad \overline{\eta}_t := \mathrm{Law}(\overline{X}_t \mid \mathcal{F}_t) \quad \text{and} \quad e(x) := x \tag{8} \]

and we use the notation $\eta_t(f) = \int_{\mathbb{R}^{r_1}} f(x)\, \eta_t(dx)$ for $\eta_t$-integrable $f : \mathbb{R}^{r_1} \to \mathbb{R}^{r_2}$ (in particular, $f = e$ and thus $r_2 = r_1$ in (6)-(8)). Any of these probabilistic models will be commonly referred to as Kalman-Bucy (nonlinear) diffusion processes. In the following we will denote by $\overline{\mathcal{G}}_t$ the augmented filtration generated by $\overline{X}_0$ and the triplet of independent Brownian motions $(Y_t, \overline{W}_t, \overline{V}_t)$. The process (5) corresponds to the vanilla type ensemble Kalman-Bucy filter that is typically used in the literature. The process (6) is associated to the so-called deterministic ensemble Kalman-Bucy filter [16] and (7) is a deterministic transport-inspired equation [15].
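For concreteness, the following is a minimal sketch (not taken from the paper) of how the exact filter (2)-(4) can be integrated numerically with a naive Euler scheme; the function and array names (`ricc`, `kalman_bucy_step`, `R2inv`, etc.) are our own conventions.

```python
import numpy as np

def ricc(Q, A, R1, S):
    """Riccati drift (4): Ricc(Q) = A Q + Q A' - Q S Q + R1."""
    return A @ Q + Q @ A.T - Q @ S @ Q + R1

def kalman_bucy_step(x_hat, P, dY, dt, A, C, R1, R2inv):
    """One Euler step of the Kalman-Bucy equation (2) and Riccati equation (3),
    given an observation increment dY over a step of size dt."""
    S = C.T @ R2inv @ C
    gain = P @ C.T @ R2inv                              # P_t C' R_2^{-1}
    x_new = x_hat + A @ x_hat * dt + gain @ (dY - C @ x_hat * dt)
    P_new = P + ricc(P, A, R1, S) * dt
    return x_new, P_new
```

This recursion serves as the reference solution against which the ensemble approximations below can be compared in low dimensions.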
We have the following result that is considered, for instance, in [8]:

Lemma 2.1. Let $\overline{X}_t$ be a process such that $\mathbb{E}[\|\overline{X}_0\|^2] < \infty$ and that satisfies any one of (5)-(7). Then the conditional mean and the conditional covariance matrix (given $\mathcal{F}_t$) of any of the nonlinear Kalman-Bucy diffusions (5)-(7) satisfy equations (2) and (3), respectively.

This result enables us to approximate the mean and covariance associated to the linear filtering problem in (1) by simulating $N$ i.i.d. samples from one of the processes (5)-(7) exactly and then using the sample mean and sample covariance to approximate the mean and covariance of the filter. Since this exact simulation is seldom possible, we consider how one can generate $N$-particle systems whose sample mean and sample covariance can be used to replace the exact simulation.
Remark 2.1. Let us observe that equations (5)-(7) indeed have a unique solution in the class of processes $Z_t$ such that $\mathbb{E}[\sup_{t \in [0,T]} \|Z_t\|^2] < \infty$. To show existence, one considers the (linear versions of the) equations (5)-(7) with the conditional expectations $\overline{\eta}_t(e)$ and $\overline{\eta}_t[(e - \overline{\eta}_t(e))(e - \overline{\eta}_t(e))']$ replaced by the solutions of the equations (2) and (3), respectively, that is,

\[ d\overline{X}_t = A \overline{X}_t\, dt + R_1^{1/2}\, d\overline{W}_t + P_t C' R_2^{-1} \big[ dY_t - \big( C \overline{X}_t\, dt + R_2^{1/2}\, d\overline{V}_t \big) \big] \tag{9} \]
\[ d\overline{X}_t = A \overline{X}_t\, dt + R_1^{1/2}\, d\overline{W}_t + P_t C' R_2^{-1} \Big[ dY_t - C\, \frac{\overline{X}_t + \widehat{X}_t}{2}\, dt \Big] \tag{10} \]
\[ d\overline{X}_t = A \overline{X}_t\, dt + \frac{1}{2} R_1 P_t^{-1} \big( \overline{X}_t - \widehat{X}_t \big)\, dt + P_t C' R_2^{-1} \Big[ dY_t - C\, \frac{\overline{X}_t + \widehat{X}_t}{2}\, dt \Big]. \tag{11} \]

Equations (9)-(11) have a unique solution (as they are linear) that indeed satisfies the corresponding (nonlinear) equations (5)-(7). To show the uniqueness of the solutions of equations (5)-(7), one observes first that any solution $\overline{\eta}_t$ of the (nonlinear) equations (5)-(7) has its conditional expectations $\overline{\eta}_t(e)$ and $\overline{\eta}_t[(e - \overline{\eta}_t(e))(e - \overline{\eta}_t(e))']$ uniquely characterized by the equations (2) and (3). Therefore they satisfy the corresponding linear versions of the equations (5)-(7), with the conditional expectations replaced by the solutions of the equations (2) and (3). In other words, they satisfy equations (9)-(11), and they are therefore unique, as the corresponding linear equations (9)-(11) have a unique solution.
Remark 2.2. If one modifies (1) to

\[ dX_t = f(X_t)\, dt + R_1^{1/2}\, dW_t, \qquad dY_t = C X_t\, dt + R_2^{1/2}\, dV_t, \]

for some non-linear function $f : \mathbb{R}^{r_1} \to \mathbb{R}^{r_1}$, one can consider a modification of any of (5)-(7). For instance, in the case (5),

\[ d\overline{X}_t = f(\overline{X}_t)\, dt + R_1^{1/2}\, d\overline{W}_t + P_{\overline{\eta}_t} C' R_2^{-1} \big[ dY_t - \big( C \overline{X}_t\, dt + R_2^{1/2}\, d\overline{V}_t \big) \big]. \]

We note however that any approximation of this process, as alluded to above, does not typically provide an approximation of the non-linear filter. Nonetheless it is considered in many works in the field.

2.3 Ensemble Kalman-Bucy Filters

We now consider three ensemble Kalman-Bucy filters that correspond to the mean-field particle interpretation of the nonlinear diffusion processes (5)-(7). To be more precise, we let $(\overline{W}_t^i, \overline{V}_t^i, \xi_0^i)_{1 \le i \le N}$ be $N$ independent copies of $(\overline{W}_t, \overline{V}_t, \overline{X}_0)$. In this notation, we have the three McKean-Vlasov type interacting diffusion processes for $i \in \{1, \ldots, N\}$:

\[ d\xi_t^i = A \xi_t^i\, dt + R_1^{1/2}\, d\overline{W}_t^i + p_t C' R_2^{-1} \big[ dY_t - \big( C \xi_t^i\, dt + R_2^{1/2}\, d\overline{V}_t^i \big) \big] \tag{12} \]
\[ d\xi_t^i = A \xi_t^i\, dt + R_1^{1/2}\, d\overline{W}_t^i + p_t C' R_2^{-1} \Big[ dY_t - C\, \frac{\xi_t^i + \eta_t^N(e)}{2}\, dt \Big] \tag{13} \]
\[ d\xi_t^i = A \xi_t^i\, dt + \frac{1}{2} R_1\, p_t^{-1} \big( \xi_t^i - \eta_t^N(e) \big)\, dt + p_t C' R_2^{-1} \Big[ dY_t - C\, \frac{\xi_t^i + \eta_t^N(e)}{2}\, dt \Big] \tag{14} \]

with the rescaled particle covariance matrices

\[ p_t := \Big( 1 - \frac{1}{N} \Big)^{-1} P_{\eta_t^N} = \frac{1}{N-1} \sum_{1 \le i \le N} \big( \xi_t^i - m_t \big) \big( \xi_t^i - m_t \big)' \tag{15} \]

and the empirical measures and sample mean

\[ \eta_t^N := \frac{1}{N} \sum_{1 \le i \le N} \delta_{\xi_t^i} \qquad \text{and} \qquad m_t := \frac{1}{N} \sum_{1 \le i \le N} \xi_t^i. \]

Note that for (14) one needs $N \ge r_1 + 1$ for the almost sure invertibility of $p_t$, but it is also sufficient to use the pseudo-inverse instead of the inverse.

These processes have been thoroughly studied in the literature and a review can be found in [3]. We have the evolution equations for (12) and (13):

\[ dm_t = A m_t\, dt + p_t C' R_2^{-1} ( dY_t - C m_t\, dt ) + \frac{1}{\sqrt{N}}\, d\overline{M}_t \]
\[ dp_t = \mathrm{Ricc}(p_t)\, dt + \frac{1}{\sqrt{N-1}}\, dM_t \]

with a triplet of pairwise orthogonal martingales formed by $\overline{M}_t$, $M_t$ and the innovation martingale

\[ \widehat{M}_t := Y_t - C \int_0^t \widehat{X}_s\, ds. \]

Note that for any $t \ge 0$, the absolute moments of $\overline{M}_t$ and $M_t$ are uniformly (w.r.t. $N$) upper-bounded. We remark that the mathematical theory for (14) in the linear-Gaussian setting reverts to that of the standard Kalman-Bucy filter, as detailed in [5].
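As a small illustration, the sample mean $m_t$ and the rescaled covariance $p_t$ of (15) can be computed from an ensemble as follows; this is a sketch under the assumption that the ensemble is stored as an $(N, r_1)$ array.

```python
import numpy as np

def ensemble_stats(xi):
    """Sample mean m_t and rescaled covariance p_t as in (15); xi has shape (N, r1)."""
    N = xi.shape[0]
    m = xi.mean(axis=0)
    centred = xi - m
    p = centred.T @ centred / (N - 1)   # the (N-1)-normalisation of (15)
    return m, p
```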
Remark 2.3. Returning to the context of Remark 2.2, one could, for instance, run the following version of (12) for $i \in \{1, \ldots, N\}$:

\[ d\xi_t^i = f(\xi_t^i)\, dt + R_1^{1/2}\, d\overline{W}_t^i + p_t C' R_2^{-1} \big[ dY_t - \big( C \xi_t^i\, dt + R_2^{1/2}\, d\overline{V}_t^i \big) \big], \]

again noting that this will typically not produce a consistent approximation of the non-linear filter, as would be the case for (12) with the model (1).

We denote by $\| \cdot \|$ the $\mathbb{L}_2$-norm for vectors. For a square matrix $B$, say, we set $B^{\mathrm{sym}} = \frac{1}{2}(B + B')$, with $\mu(B)$ being the largest eigenvalue of $B^{\mathrm{sym}}$. In the cases of (12) and (13), [8] considers the following assumption (recall (4)):

\[ \text{we have } \mu(S) > 0 \text{ with } S = \mu(S)\, \mathrm{Id} \tag{16} \]

where $\mathrm{Id}$ is the $r_1 \times r_1$ identity matrix (in general we use $\mathrm{Id}$ to denote the identity matrix of appropriate dimension, depending on the context). [8] proves the following time-uniform convergence theorem for the mean.

Theorem 2.1.
Consider the cases of (12) and (13). Assume that (16) holds and that $\mu(A) < 0$. Then for any $n \ge 1$ and $N$ sufficiently large, we have that

\[ \sup_{t \ge 0} \mathbb{E}\big[ \| m_t - \widehat{X}_t \|^n \big]^{1/n} \le \frac{C(n)}{\sqrt{N}} \]

where $C(n) < \infty$ is a constant that does not depend upon $N$.

We denote by $P_\infty$ the solution to $\mathrm{Ricc}(P_\infty) = 0$. Set $\mu(A - P_\infty S)$ as the largest eigenvalue of $\frac{1}{2}\big( (A - P_\infty S) + (A - P_\infty S)' \big)$. For the case of (13), the following result, presented in [3], can be shown.

Theorem 2.2.
Consider the case of (13). Assume that $\mu(A - P_\infty S) < 0$ and $S \in \mathcal{S}_{r_1}^+$. Then for any $n \ge 1$, $N \ge 1$ there exists a $t(n, N) > 0$, with $t(n, N) \to \infty$ as $N \to \infty$, such that for any $t \in [0, t(n, N)]$

\[ \mathbb{E}\big[ \| m_t - \widehat{X}_t \|^n \big]^{1/n} \le \frac{C}{\sqrt{N}} \]

where $C < \infty$ is a constant that does not depend upon $N$; $t(n, N)$ is characterized in [6].

We note that $t(n, N)$ is $\mathcal{O}(\log(N))$; see [6, Theorem 2.1] when $\epsilon = 1/\sqrt{N}$ ($\epsilon$ is as in [6, Theorem 2.1]). For the case of (14) we have the following result. Below we cite [3, eq. (2.7)] and it should be noted that the notation of that article differs slightly from this one, so we mean of course that the equation holds in the notation of this article. Also note that we use the inverse of $p_t$, but as noted above this need not be the case and then the constraint on $N$ below is not required.

Theorem 2.3.
Consider the case of (14). Assume that [3, eq. (2.7)] holds. Then for any $n \ge 1$ there exist $C < \infty$, $\kappa \in (0, 1)$ such that for any $N \ge r_1 + 1$ and $t \ge 0$ we have that

\[ \mathbb{E}\big[ \| m_t - \widehat{X}_t \|^n \big]^{1/n} \le \frac{C\, \kappa^t}{\sqrt{N}}. \]
We set

\[ \widehat{m}_t := \mathbb{E}(m_t \mid \mathcal{F}_t), \qquad \widehat{p}_t := \mathbb{E}(p_t \mid \mathcal{F}_t), \]

and consider bounds on $\mathbb{E}\big( \|\widehat{m}_t - \widehat{X}_t\|^n \big)^{1/n}$. These bounds do not currently exist in the literature and will allow us to investigate our new estimator for the normalization constant later on in the article. In the case (14) we have (see e.g. [5])

\[ \widehat{m}_t = \widehat{X}_t \qquad \text{and} \qquad \widehat{p}_t = P_t \tag{17} \]

so we focus only on (12) and (13).

Using the second order Taylor-type expansions presented in [4], when the pair $(A, R_1^{1/2})$ is stabilizable and $(A, S^{1/2})$ is detectable, we have the uniform almost sure bias estimate and, for any $n \ge 1$, the $\mathbb{L}_n$-mean error estimates

\[ 0 \le P_t - \widehat{p}_t \le C_1 / N \qquad \text{and} \qquad \mathbb{E}\big( \| P_t - p_t \|^n \mid \mathcal{F}_t \big)^{1/n} \le C_2(n) / \sqrt{N} \tag{18} \]

as soon as $N$ is chosen sufficiently large, for some deterministic constants $C_1$, $C_2(n)$ that do not depend on the time horizon or $N$ (but $C_2(n)$ does depend upon $n$). Whenever $\mu(A) < 0$, Theorem 2.1 yields the rather crude estimate

\[ \mathbb{E}\big( \| \widehat{m}_t - \widehat{X}_t \|^n \big)^{1/n} = \mathbb{E}\big( \| \mathbb{E}( m_t - \widehat{X}_t \mid \mathcal{F}_t ) \|^n \big)^{1/n} \le \mathbb{E}\big( \| m_t - \widehat{X}_t \|^n \big)^{1/n} \le \frac{C(n)}{\sqrt{N}}. \tag{19} \]

Our objective is to improve the estimate (19). We observe that

\[ d(m_t - \widehat{X}_t) = A (m_t - \widehat{X}_t)\, dt + (p_t - P_t) C' R_2^{-1} ( dY_t - C m_t\, dt ) + P_t C' R_2^{-1} ( dY_t - C m_t\, dt ) - P_t C' R_2^{-1} \big( dY_t - C \widehat{X}_t\, dt \big) + \frac{1}{\sqrt{N}}\, d\overline{M}_t \tag{20} \]

which yields

\[ d(\widehat{m}_t - \widehat{X}_t) = (A - P_t S)(\widehat{m}_t - \widehat{X}_t)\, dt + (\widehat{p}_t - P_t) C' R_2^{-1} \big( dY_t - C \widehat{X}_t\, dt \big) + \mathbb{E}\big( (P_t - p_t) S (m_t - \widehat{X}_t) \mid \mathcal{F}_t \big)\, dt. \tag{21} \]

Equation (21) is deduced by conditioning both sides of (20) with respect to $\mathcal{F}_t$, exchanging the (stochastic) integration with the conditional expectation for all the terms on the right hand side of (20). To justify this Fubini-like exchange one can use, for example, Lemma 3.21 in [2]. One also needs to use the fact that, for arbitrary $0 \le s \le t$, $\widehat{m}_s = \mathbb{E}(m_s \mid \mathcal{F}_s) = \mathbb{E}(m_s \mid \mathcal{F}_t)$ and $\widehat{p}_s := \mathbb{E}(p_s \mid \mathcal{F}_s) = \mathbb{E}(p_s \mid \mathcal{F}_t)$. To justify this, one can use, for example, Proposition 3.15 in [2]. We note that both Proposition 3.15 and Lemma 3.21 hold true when the conditioning is done with respect to the equivalent probability measure $\widetilde{\mathbb{P}}$ under which $Y$ is a Brownian motion. However, since $(\overline{W}_t, \overline{V}_t, \overline{X}_0)$ are independent of $Y$ under $\mathbb{P}$, they remain independent of $Y$ under $\widetilde{\mathbb{P}}$ and therefore conditioning the processes $\xi^i$ under $\widetilde{\mathbb{P}}$ is the same as conditioning them under $\mathbb{P}$. Note that this argument does not apply to the original signal $X$.

Let $\mathcal{E}_{s,t}$ be the exponential semigroup (or state transition matrix) associated with the smooth flow of matrices $t \mapsto (A - P_t S)$, defined for any $s \le t$ by the forward and backward differential equations

\[ \partial_t \mathcal{E}_{s,t} = (A - P_t S)\, \mathcal{E}_{s,t} \qquad \text{and} \qquad \partial_s \mathcal{E}_{s,t} = -\mathcal{E}_{s,t}\, (A - P_s S), \qquad \text{with} \qquad \mathcal{E}_{s,s} = \mathrm{Id}. \]

Equivalently, in terms of the matrices $\mathcal{E}_t := \mathcal{E}_{0,t}$, we have $\mathcal{E}_{s,t} = \mathcal{E}_t \mathcal{E}_s^{-1}$. Under some observability and controllability conditions, the drift matrices $(A - P_t S)$ deliver a uniformly stable time-varying linear system in the sense that

\[ \| \mathcal{E}_{s,t} \| \le c\, e^{-\lambda (t - s)} \qquad \text{for some } \lambda > 0. \tag{22} \]
See for instance Eq. (2.13) in the review article [3]. In this notation, recalling that $\widehat{m}_0 = \widehat{X}_0$, we have the bias formula

\[ \widehat{m}_t - \widehat{X}_t = \int_0^t \mathcal{E}_{s,t}\, \underbrace{(\widehat{p}_s - P_s)}_{\in [-c_1/N,\, 0] \text{ a.e.}}\; C' R_2^{-1} \big( dY_s - C \widehat{X}_s\, ds \big) + \int_0^t \mathcal{E}_{s,t}\, \mathbb{E}\big( (P_s - p_s) S (m_s - \widehat{X}_s) \mid \mathcal{F}_s \big)\, ds. \]

Note that $(\widehat{p}_s - P_s) \in [-c_1/N, 0]$ a.e. can be justified via (18). Combining the Burkholder-Davis-Gundy inequality with the almost sure and uniform bias estimate (18) and the exponential decay (22), we obtain the uniform estimate

\[ \mathbb{E}\Big( \Big\| \int_0^t \mathcal{E}_{s,t} (\widehat{p}_s - P_s) C' R_2^{-1} \big( dY_s - C \widehat{X}_s\, ds \big) \Big\|^n \Big)^{1/n} \le \frac{C_1(n)}{N} \]

for some deterministic constants $C_1(n)$ that do not depend upon $t$ nor $N$. Likewise, combining Hölder's inequality with the uniform variance estimate (18), we have the almost sure inequality

\[ \mathbb{E}\Big( \big\| \mathbb{E}\big( (P_s - p_s) S (m_s - \widehat{X}_s) \mid \mathcal{F}_s \big) \big\|^n \Big)^{1/n} \le \frac{C_2(n)}{N} \]

for some deterministic constants $C_2(n)$ that do not depend upon $t$ nor $N$. Applying the generalized Minkowski inequality, we have thus proved the following theorem.

Theorem 2.4.
Consider the cases of (12) and (13). Assume that (16) holds and that $\mu(A) < 0$. Then for any $n \ge 1$ and $N$ sufficiently large, we have

\[ \sup_{t \ge 0} \mathbb{E}\big( \| \widehat{m}_t - \widehat{X}_t \|^n \big)^{1/n} \le \frac{C(n)}{N} \]

where $C(n) < \infty$ is a constant that does not depend upon $N$.

3 Estimation of the Normalizing Constant

3.1 The Normalizing Constant and its Estimator

We define

\[ Z_t := \frac{d\mathcal{L}_{(X,Y)_t}}{d\mathcal{L}_{(X,W)_t}} \]

to be the density of $\mathcal{L}_{(X,Y)_t}$, the law of the process $(X, Y)$ up to time $t$, with respect to $\mathcal{L}_{(X,W)_t}$, the law of the process $(X, W)$ up to time $t$. That is,

\[ \mathbb{E}[ f(X_t) g(Y_t) ] = \mathbb{E}[ f(X_t) g(W_t) Z_t(X, Y) ]. \]

One can show (see Exercise 3.14, p. 56 in [2]) that

\[ Z_t(X, Y) = \exp\Big[ \int_0^t \Big( \langle C X_s, R_2^{-1} dY_s \rangle - \frac{1}{2} \langle X_s, S X_s \rangle\, ds \Big) \Big]. \]

Following a standard approach (see, e.g., Chapter 3 in [2]), we introduce a probability measure $\widetilde{\mathbb{P}}$ by specifying its Radon-Nikodym derivative with respect to $\mathbb{P}$ to be given by $Z_t(X, Y)^{-1}$, i.e.

\[ \frac{d\widetilde{\mathbb{P}}}{d\mathbb{P}} \bigg|_{\mathcal{G}_t} = Z_t^{-1}. \]

Under $\widetilde{\mathbb{P}}$, the observation process $Y$ is a scaled Brownian motion independent of $X$; additionally, the law of the signal process $X$ under $\widetilde{\mathbb{P}}$ is the same as its law under $\mathbb{P}$. Moreover, for every $f$ defined on the signal path space, we have the Bayes formula

\[ \mathbb{E}\big( f((X_s)_{s \in [0,t]}) \mid \mathcal{F}_t \big) = \frac{ \widetilde{\mathbb{E}}\big( f((X_s)_{s \in [0,t]})\, Z_t(X, Y) \mid \mathcal{F}_t \big) }{ \widetilde{\mathbb{E}}\big( Z_t(X, Y) \mid \mathcal{F}_t \big) } \tag{23} \]

where $\widetilde{\mathbb{E}}(\cdot)$ means expectation with respect to $\widetilde{\mathbb{P}}$. Formula (23) is commonly known as the Kallianpur-Striebel formula; for a proof see, e.g., Proposition 3.16 in [2]. Since, under $\widetilde{\mathbb{P}}$, $X$ and $Y$ are independent, we can interpret the numerator $\widetilde{\mathbb{E}}\big( f((X_s)_{s \in [0,t]}) Z_t(X, Y) \mid \mathcal{F}_t \big)$ and the denominator $\widetilde{\mathbb{E}}\big( Z_t(X, Y) \mid \mathcal{F}_t \big)$ as integrals with respect to the law of the signal, where the integrand has the observation process path $Y$ kept fixed. We can therefore let $Z_t(Y)$ be the likelihood function defined by

\[ Z_t(Y) := \mathbb{E}^Y\big( Z_t(X, Y) \big), \]

where $\mathbb{E}^Y(\cdot)$ stands for the expectation w.r.t. the signal process when the observation is fixed and independent of the signal. In this notation, the Bayes formula (23) takes the form

\[ \mathbb{E}\big( f((X_s)_{s \in [0,t]}) \mid \mathcal{F}_t \big) = \frac{ \mathbb{E}^Y\big( f((X_s)_{s \in [0,t]})\, Z_t(X, Y) \big) }{ \mathbb{E}^Y\big( Z_t(X, Y) \big) }. \]

We have

\[ dZ_t(X, Y) = \Big[ \langle C X_t, R_2^{-1} dY_t \rangle - \frac{1}{2} \langle X_t, S X_t \rangle\, dt \Big] Z_t + \frac{1}{2} \langle X_t, S X_t \rangle\, Z_t\, dt = Z_t(X, Y)\, \langle C X_t, R_2^{-1} dY_t \rangle. \]

We check this claim using the fact that

\[ d\langle C X_t, R_2^{-1} dY_t \rangle\, d\langle C X_t, R_2^{-1} dY_t \rangle = \sum_{k,l} (C X_t)(k)\, (R_2^{-1/2} dV_t)(k)\, (C X_t)(l)\, (R_2^{-1/2} dV_t)(l) = \sum_{k,l,k',l'} (C X_t)(k)\, R_2^{-1/2}(k, k')\, dV_t(k')\, (C X_t)(l)\, R_2^{-1/2}(l, l')\, dV_t(l') = \sum_{k,l} (C X_t)(k)\, R_2^{-1}(k, l)\, (C X_t)(l)\, dt = \langle X_t, S X_t \rangle\, dt. \]

This implies that

\[ Z_t(Y) = 1 + \int_0^t Z_s(Y)\, Z_s(Y)^{-1}\, \mathbb{E}^Y\big( Z_s(X, Y)\, \langle C X_s, R_2^{-1} dY_s \rangle \big) = 1 + \int_0^t Z_s(Y)\, \langle C \widehat{X}_s, R_2^{-1} dY_s \rangle, \]

from which we conclude that

\[ Z_t(Y) = \exp\Big[ \int_0^t \Big( \langle C \widehat{X}_s, R_2^{-1} dY_s \rangle - \frac{1}{2} \langle \widehat{X}_s, S \widehat{X}_s \rangle\, ds \Big) \Big], \]

or equivalently $dZ_t(Y) = Z_t(Y)\, \langle C \widehat{X}_t, R_2^{-1} dY_t \rangle$. The proof of this statement is an immediate application of Girsanov's theorem and follows in a similar manner to the proof of Proposition 3.13 in [2]. The interpretation above can be made rigorous; see e.g. the robust representation formulation of the filtering problem described in Chapter 5 in [2].
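As an illustration, the exact log-likelihood $U_t(Y) = \log Z_t(Y)$ can be accumulated along a discretized observation path via the representation just derived; the following sketch (our own, assuming the `kalman_bucy_step` helper from the earlier sketch) is the reference quantity against which the ensemble estimates below can be compared.

```python
import numpy as np

def log_likelihood(dYs, dt, x0, P0, A, C, R1, R2inv):
    """Accumulate U_t(Y) = int <C xhat_s, R2^{-1} dY_s> - (1/2) int <xhat_s, S xhat_s> ds
    over observation increments dYs (a (K, r2) array), alongside the exact filter."""
    S = C.T @ R2inv @ C
    x_hat, P, U = x0.copy(), P0.copy(), 0.0
    for dY in dYs:
        U += (C @ x_hat) @ (R2inv @ dY) - 0.5 * (x_hat @ (S @ x_hat)) * dt
        x_hat, P = kalman_bucy_step(x_hat, P, dY, dt, A, C, R1, R2inv)
    return U
```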
Given the sample mean $m_t$ of any of the ensemble Kalman-Bucy filters of Section 2.3, our estimate of the likelihood $Z_t(Y)$ is

\[ Z_t^N(Y) = \exp\Big[ \int_0^t \Big( \langle C m_s, R_2^{-1} dY_s \rangle - \frac{1}{2} \langle m_s, S m_s \rangle\, ds \Big) \Big] \tag{24} \]

which satisfies the equation

\[ dZ_t^N(Y) = Z_t^N(Y)\, \langle C m_t, R_2^{-1} dY_t \rangle. \tag{25} \]

We remark that any of the three processes (12)-(14) can be used to compute $Z_t^N(Y)$.

We now seek to justify the estimator $Z_t^N(Y)$. We have

\[ \overline{Z}_t(X, Y) := Z_t(Y)^{-1} Z_t(X, Y) = \exp\Big[ \int_0^t \Big( \langle C (X_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle - \Big[ \tfrac{1}{2} \langle X_s, S X_s \rangle - \tfrac{1}{2} \langle \widehat{X}_s, S \widehat{X}_s \rangle - \langle X_s - \widehat{X}_s, S \widehat{X}_s \rangle \Big]\, ds \Big) \Big] \]
\[ = \exp\Big[ \int_0^t \Big( \langle C (X_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle - \frac{1}{2} \langle X_s - \widehat{X}_s, S (X_s - \widehat{X}_s) \rangle\, ds \Big) \Big], \]

from which we conclude that

\[ d\overline{Z}_t(X, Y) = \overline{Z}_t(X, Y)\, \langle C (X_t - \widehat{X}_t), R_2^{-1} ( dY_t - C \widehat{X}_t\, dt ) \rangle. \]

This already implies that

\[ \mathbb{E}\big[ \overline{Z}_t(X, Y) \mid \mathcal{F}_t \big] = 1 \iff \mathbb{E}\big[ Z_t(X, Y) \mid \mathcal{F}_t \big] = Z_t(Y). \]

Now, we have

\[ Z_t^N(Y)\, Z_t^{-1}(Y) = \exp\Big[ \int_0^t \Big( \langle C (m_s - \widehat{X}_s), R_2^{-1} dY_s \rangle - \Big[ \tfrac{1}{2} \langle m_s, S m_s \rangle - \tfrac{1}{2} \langle \widehat{X}_s, S \widehat{X}_s \rangle \Big]\, ds \Big) \Big] \]
\[ = \exp\Big[ \int_0^t \Big( \langle C (m_s - \widehat{X}_s), R_2^{-1} dY_s \rangle - \tfrac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big) \Big] \times \exp\Big[ \int_0^t \Big( \tfrac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle - \Big[ \tfrac{1}{2} \langle m_s, S m_s \rangle - \tfrac{1}{2} \langle \widehat{X}_s, S \widehat{X}_s \rangle \Big] \Big)\, ds \Big] \]
\[ = \exp\Big[ \int_0^t \Big( \langle C (m_s - \widehat{X}_s), R_2^{-1} dY_s \rangle - \tfrac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big) \Big] \times \exp\Big[ \int_0^t \langle \widehat{X}_s - m_s, S \widehat{X}_s \rangle\, ds \Big] \]
\[ = \exp\Big[ \int_0^t \Big( \langle C (m_s - \widehat{X}_s), R_2^{-1} dY_s \rangle - \tfrac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big) \Big] \times \exp\Big[ -\int_0^t \langle C (m_s - \widehat{X}_s), R_2^{-1} C \widehat{X}_s \rangle\, ds \Big] \]
\[ = \exp\Big[ \int_0^t \Big( \langle C (m_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle - \tfrac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big) \Big]. \]

Observe that $dY_s - C \widehat{X}_s\, ds$ is an $\mathcal{F}_s$-martingale increment. We also have

\[ d\big( Z_t^N(Y)\, Z_t^{-1}(Y) \big) = Z_t^N(Y)\, Z_t^{-1}(Y)\, \langle C (m_t - \widehat{X}_t), R_2^{-1} ( dY_t - C \widehat{X}_t\, dt ) \rangle. \tag{26} \]

To conclude, it follows that $Z_t^N(Y)\, Z_t^{-1}(Y)$ is a positive local martingale and therefore a supermartingale, hence

\[ \mathbb{E}\big( Z_t^N(Y)\, Z_t^{-1}(Y) \big) \le \mathbb{E}\big( Z_0^N(Y)\, Z_0^{-1}(Y) \big) = 1 \]

for all $t \ge 0$. To justify the martingale property of $Z_t^N(Y)\, Z_t^{-1}(Y)$ one can use, for example, an argument based on Corollary 5.14 in [13].
As a result of the above calculations, we can deduce that $Z_t^N(Y)$ is in some sense well-defined, but a biased estimator of the normalization constant. Therefore, we will focus on the logarithm of the normalization constant, as it is this latter quantity that is used in practical algorithms. We begin by investigating the conditional bias.

3.2 Conditional Bias

We now consider the estimate of the logarithm of the normalizing constant, $U_t^N(Y) := \log(Z_t^N(Y))$, with the notation $U_t(Y) := \log(Z_t(Y))$. Set $\widehat{U}_t(Y) := \mathbb{E}(U_t^N(Y) \mid \mathcal{F}_t)$. Then, using (26), we have

\[ U_t^N(Y) - U_t(Y) = \log\big( Z_t^N(Y) / Z_t(Y) \big) = \int_0^t \Big[ \langle C (m_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle - \frac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big]. \]

This yields the bias formula

\[ \widehat{U}_t(Y) - U_t(Y) = \int_0^t \Big[ \langle C (\widehat{m}_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle - \frac{1}{2} \mathbb{E}\big( \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle \mid \mathcal{F}_s \big)\, ds \Big]. \tag{27} \]

Taking the expectation in the above display, when (16) holds and $\mu(A) < 0$, and applying Theorem 2.1, we readily check that

\[ 0 \le \mathbb{E}( U_t(Y) ) - \mathbb{E}\big( \widehat{U}_t(Y) \big) = \frac{\mu(S)}{2} \int_0^t \mathbb{E}\big( \| m_s - \widehat{X}_s \|^2 \big)\, ds \le C\, \frac{t}{N}, \]

which is the bias, but is not of particular interest. So we will focus on the conditional bias $\mathbb{E}\big( [ \widehat{U}_t(Y) - U_t(Y) ]^n \big)^{1/n}$.

In the case (14), recall (17). In this context, using the generalized Minkowski inequality and under the assumptions of Theorem 2.3, we find that

\[ \sup_{t \ge 0} \mathbb{E}\Big( \big| \widehat{U}_t(Y) - U_t(Y) \big|^n \Big)^{1/n} \le \frac{C(n)}{N} \]

where $C(n) < \infty$ is a constant that does not depend upon $N$.

For the cases of (12) and (13), we start with the fact that the conditional bias in (27) decomposes into two terms:

\[ \alpha_t := \int_0^t \langle C (\widehat{m}_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle \qquad \text{and} \qquad \beta_t := \frac{1}{2} \int_0^t \mathbb{E}\big( \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle \mid \mathcal{F}_s \big)\, ds. \]

Using the uniform estimates presented in Section 2.3 we have

\[ \mathbb{E}( | \beta_t |^n )^{1/n} \le C(n)\, \frac{t}{N} \]

for some deterministic constants $C(n)$ that do not depend on the time horizon. On the other hand, combining the Burkholder-Davis-Gundy inequality with the uniform estimate (19), we have the rather crude estimate

\[ \mathbb{E}( | \alpha_t |^n ) \le C(n) \Big( \frac{t}{N} \Big)^{n/2} \tag{28} \]

for some deterministic constants $C(n)$ that do not depend on the time horizon. Arguing as for the proof of Theorem 2.4, we check that

\[ \mathbb{E}( | \alpha_t |^n ) \le C(n) \Big( \frac{\sqrt{t}}{N} \Big)^{n}. \]

Observe that the above improves the crude estimate stated in (28). This yields the following corollary.
Corollary 3.1.
Consider the cases of (12) and (13). Assume that (16) holds and that $\mu(A) < 0$. Then for any $n \ge 1$, $t \ge 0$ and $N$ sufficiently large, we have that

\[ \mathbb{E}\Big( \big| \widehat{U}_t(Y) - U_t(Y) \big|^n \Big)^{1/n} \le \frac{C(n)\, ( t + \sqrt{t} )}{N} \]

where $C(n) < \infty$ is a constant that does not depend upon $t$ or $N$.

3.3 $\mathbb{L}_n$-Error
We now consider the $\mathbb{L}_n$-error of the estimate of the log of the normalizing constant, $U_t^N(Y) = \log(Z_t^N(Y))$. In the cases of (12) and (13) we have the following.
Proposition 3.1.
Consider the cases of (12) and (13). Assume that (16) holds and that $\mu(A) < 0$. Then for any $n \ge 1$, $t \ge 0$ and $N$ sufficiently large, we have that

\[ \mathbb{E}\Big( \big| U_t^N(Y) - U_t(Y) \big|^n \Big)^{1/n} \le C(n) \Big( \sqrt{\frac{t}{N}} + \frac{t}{N} \Big) \]

where $C(n) < \infty$ is a constant that does not depend upon $t$ or $N$.

Proof. We have that

\[ U_t^N(Y) - U_t(Y) = \int_0^t \Big[ \langle C (m_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle - \frac{1}{2} \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big]. \]

So one can apply the Minkowski inequality to yield that

\[ \mathbb{E}\Big( \big| U_t^N(Y) - U_t(Y) \big|^n \Big)^{1/n} \le \mathbb{E}\Big( \Big| \int_0^t \langle C (m_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle \Big|^n \Big)^{1/n} + \frac{1}{2}\, \mathbb{E}\Big( \Big| \int_0^t \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big|^n \Big)^{1/n}. \tag{29} \]

For the left term on the R.H.S. of (29) one can use the Burkholder-Davis-Gundy inequality along with Theorem 2.1 to yield that

\[ \mathbb{E}\Big( \Big| \int_0^t \langle C (m_s - \widehat{X}_s), R_2^{-1} ( dY_s - C \widehat{X}_s\, ds ) \rangle \Big|^n \Big)^{1/n} \le C(n) \sqrt{\frac{t}{N}} \]

for some constant $C(n) < \infty$ that does not depend upon $t$ or $N$. For the right term on the R.H.S. of (29) one can use the generalized Minkowski inequality and Theorem 2.1 to give

\[ \mathbb{E}\Big( \Big| \int_0^t \langle m_s - \widehat{X}_s, S (m_s - \widehat{X}_s) \rangle\, ds \Big|^n \Big)^{1/n} \le C(n)\, \frac{t}{N} \]

for some constant $C(n) < \infty$ that does not depend upon $t$ or $N$. The proof is now easily concluded.

In the case of (13) we can refine further:
Consider the case of (13). Assume that $\mu(A - P_\infty S) < 0$ and $S \in \mathcal{S}_{r_1}^+$. Then for any $n \ge 1$, $N \ge 1$ there exists a $t(N) > 0$, with $t(N) \to \infty$ as $N \to \infty$, such that for any $t \in [0, t(N)]$

\[ \mathbb{E}\Big( \big| U_t^N(Y) - U_t(Y) \big|^n \Big)^{1/n} \le C(n) \Big( \sqrt{\frac{t}{N}} + \frac{t}{N} \Big) \]

where $C(n) < \infty$ is a constant that does not depend upon $t$ or $N$.

Proof. The proof is essentially as that of Proposition 3.1, except that one should use Theorem 2.2 instead of Theorem 2.1.

In the case of (14) we have the following, whose proof is again similar to the above results.
Proposition 3.3.
Consider the case of (14). Assume that [3, eq. (2.7)] holds. Then for any $n \ge 1$, $t \ge 0$ and $N \ge r_1 + 1$, we have that

\[ \mathbb{E}\Big( \big| U_t^N(Y) - U_t(Y) \big|^n \Big)^{1/n} \le \frac{C(n)}{\sqrt{N}} \]

where $C(n) < \infty$ is a constant that does not depend upon $t$ or $N$.

Both Proposition 3.1 and Proposition 3.2 establish that one can estimate the log of the normalization constant using the ensemble Kalman-Bucy type filters, with a mean square error (for instance) that grows at most linearly in time. Proposition 3.3 provides a uniform-in-time error, mostly following from the deterministic nature of the algorithm and the dependence on standard Kalman-Bucy theory.

One can say more when considering the average estimation error over a window of time $w$, as we now state. We restrict ourselves to the mean square error below. In the cases of both (12) and (13) we have the following time-uniform upper-bound.
Corollary 3.2. Consider the cases of (12) and (13). Assume that (16) holds and that $\mu(A) < 0$. Then for any $t > 0$, $0 < w < t$ and $N$ sufficiently large, we have that

\[ \mathbb{E}\Bigg[ \Bigg( \frac{1}{w} \Bigg\{ \log\Bigg( \frac{Z_t^N(Y)}{Z_{t-w}^N(Y)} \Bigg) - \log\Bigg( \frac{Z_t(Y)}{Z_{t-w}(Y)} \Bigg) \Bigg\} \Bigg)^2 \Bigg] \le \frac{C}{N} \]

where $C < \infty$ is a constant that does not depend upon $t$, $w$ or $N$.
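As a small illustration, the windowed quantity appearing in Corollary 3.2 (and in Corollary 3.3 below) can be computed from a stored path of log-normalizing-constant estimates; the storage convention in this sketch is our own assumption.

```python
def windowed_increment(U, k, w_steps, dt):
    """(1/w) * [log Z^N_t - log Z^N_{t-w}] with t = k*dt and w = w_steps*dt,
    where U[j] stores the running estimate of U^N at time j*dt."""
    return (U[k] - U[k - w_steps]) / (w_steps * dt)
```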
In the case of (13) we refine further and have the following time-uniform upper-bound.
Corollary 3.3.
Consider the case of (13). Assume that $\mu(A - P_\infty S) < 0$ and $S \in \mathcal{S}_{r_1}^+$. Then for any $N \ge 1$ there exists a $t(N) > 0$, with $t(N) \to \infty$ as $N \to \infty$, such that for any $t \in (0, t(N)]$, $0 < w < t$,

\[ \mathbb{E}\Bigg[ \Bigg( \frac{1}{w} \Bigg\{ \log\Bigg( \frac{Z_t^N(Y)}{Z_{t-w}^N(Y)} \Bigg) - \log\Bigg( \frac{Z_t(Y)}{Z_{t-w}(Y)} \Bigg) \Bigg\} \Bigg)^2 \Bigg] \le \frac{C}{N} \]

where $C < \infty$ is a constant that does not depend upon $t$, $w$ or $N$.

Note that, as we require $t \in (0, t(N)]$ and $t(N) = \mathcal{O}(\log(N))$ in Corollary 3.3, it may not be as useful as Corollary 3.2; on the other hand, the assumption of a stable system in Corollary 3.2 is much stronger than the hypotheses in Corollary 3.3; see [3] for a discussion.

3.4 Numerical Verification

For $(i, k, L) \in \{1, \ldots, N\} \times \mathbb{N}_0 \times \mathbb{N}$, let $\Delta_L = 2^{-L}$ and consider the Euler-discretization of (12)-(14):

(F1)
\[ \xi_{(k+1)\Delta_L}^i = \xi_{k\Delta_L}^i + A \xi_{k\Delta_L}^i \Delta_L + R_1^{1/2} \big\{ W_{(k+1)\Delta_L}^i - W_{k\Delta_L}^i \big\} + p_{k\Delta_L} C' R_2^{-1} \Big( \big\{ Y_{(k+1)\Delta_L} - Y_{k\Delta_L} \big\} - \big[ C \xi_{k\Delta_L}^i \Delta_L + R_2^{1/2} \big\{ V_{(k+1)\Delta_L}^i - V_{k\Delta_L}^i \big\} \big] \Big) \]

(F2)
\[ \xi_{(k+1)\Delta_L}^i = \xi_{k\Delta_L}^i + A \xi_{k\Delta_L}^i \Delta_L + R_1^{1/2} \big\{ W_{(k+1)\Delta_L}^i - W_{k\Delta_L}^i \big\} + p_{k\Delta_L} C' R_2^{-1} \Bigg( \big\{ Y_{(k+1)\Delta_L} - Y_{k\Delta_L} \big\} - C \Bigg( \frac{\xi_{k\Delta_L}^i + m_{k\Delta_L}}{2} \Bigg) \Delta_L \Bigg) \]

(F3)
\[ \xi_{(k+1)\Delta_L}^i = \xi_{k\Delta_L}^i + A \xi_{k\Delta_L}^i \Delta_L + \frac{1}{2} R_1 ( p_{k\Delta_L} )^{-1} \big( \xi_{k\Delta_L}^i - m_{k\Delta_L} \big) \Delta_L + p_{k\Delta_L} C' R_2^{-1} \Bigg( \big\{ Y_{(k+1)\Delta_L} - Y_{k\Delta_L} \big\} - C \Bigg( \frac{\xi_{k\Delta_L}^i + m_{k\Delta_L}}{2} \Bigg) \Delta_L \Bigg) \tag{30} \]

and the discretization of $Z_T^N(Y)$:

\[ Z_T^{N,L}(Y) = \exp\Bigg[ \sum_{k=0}^{T/\Delta_L - 1} \Big( \big\langle m_{k\Delta_L},\, C' R_2^{-1} \big[ Y_{(k+1)\Delta_L} - Y_{k\Delta_L} \big] \big\rangle - \frac{1}{2} \langle m_{k\Delta_L},\, S\, m_{k\Delta_L} \rangle\, \Delta_L \Big) \Bigg]. \tag{31} \]

For the purpose of showing that the mean square error of the estimate of the log of the normalization constant in the cases F1 & F2 is of $\mathcal{O}(t/N)$ and in the case F3 is of $\mathcal{O}(1/N)$, we take $r_1 = r_2 = 1$, a negative scalar $A$, $R_1^{-1/2} = 1$, $R_2^{-1/2} = 2$, $C$ a uniform random number in $(0, 1)$ and $\xi_0^i$ i.i.d. $\sim \mathcal{N}(0.5, 0.2)$ (normal distribution with mean 0.5 and variance 0.2). In Tables 1-2 and Figures 1-2, we show that the rate of the MSE of the estimate in (31) for the cases F1 & F2 is of $\mathcal{O}(t/N)$, even though we have used a naive time discretization for our results. In Table 3 and Figure 3 we show that the rate in the F3 case is of $\mathcal{O}(1/N)$.
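The following is a sketch of one step of the F1 recursion in (30), vectorized over the ensemble, together with one summand of (31); the function names, array layout and random number generator interface are our own conventions, not the authors' code.

```python
import numpy as np

def enkbf_f1_step(xi, dY, DL, A, C, R1sq, R2sq, R2inv, rng):
    """One Euler step of (30), case F1, for all N particles at once.
    xi: (N, r1) ensemble; dY: (r2,) observation increment over a step of size DL."""
    N, r1 = xi.shape
    r2 = C.shape[0]
    m = xi.mean(axis=0)
    p = (xi - m).T @ (xi - m) / (N - 1)            # rescaled covariance (15)
    dW = rng.normal(size=(N, r1)) * np.sqrt(DL)    # Brownian increments per particle
    dV = rng.normal(size=(N, r2)) * np.sqrt(DL)
    innovation = dY - (xi @ C.T) * DL - dV @ R2sq.T  # perturbed-observation innovation
    gain = p @ C.T @ R2inv                         # p_t C' R_2^{-1}
    return xi + (xi @ A.T) * DL + dW @ R1sq.T + innovation @ gain.T

def logz_increment(m, dY, DL, C, R2inv, S):
    """One summand of (31), evaluated at the current sample mean m."""
    return (C @ m) @ (R2inv @ dY) - 0.5 * (m @ (S @ m)) * DL
```

Summing `logz_increment` over the $T/\Delta_L$ steps and exponentiating gives $Z_T^{N,L}(Y)$; the F2 and F3 steps differ only in the innovation and drift terms, as in (30).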
        N = 1000               N = 500                N = 250
 t      MSE        MSE/(t/N)   MSE        MSE/(t/N)   MSE        MSE/(t/N)
 50     2.194E-04  4.4E-03     5.404E-04  5.4E-03     8.903E-04  4.5E-03
 100    5.483E-04  5.5E-03     1.442E-03  7.2E-03     3.082E-03  7.7E-03
 200    1.085E-03  5.4E-03     2.901E-03  7.3E-03     4.442E-03  5.6E-03
 400    2.283E-03  5.7E-03     5.047E-03  6.3E-03     1.017E-02  6.4E-03
 800    4.894E-03  6.1E-03     8.579E-03  5.4E-03     1.985E-02  6.2E-03
 1600   9.716E-03  6.1E-03     2.039E-02  6.4E-03     2.904E-02  4.5E-03
 3200   1.974E-02  6.2E-03     3.504E-02  5.5E-03     7.835E-02  6.1E-03
 6400   3.571E-02  5.6E-03     7.087E-02  5.5E-03     1.599E-01  6.2E-03

Table 1: The mean square error (MSE) and MSE/(t/N) for N ∈ {250, 500, 1000} in the F1 case.
        N = 250                N = 500                N = 1000
 t      MSE        MSE/(t/N)   MSE        MSE/(t/N)   MSE        MSE/(t/N)
 50     1.137E-03  5.7E-03     5.127E-04  5.1E-03     3.317E-04  6.6E-03
 100    2.157E-03  5.4E-03     1.128E-03  5.6E-03     8.892E-04  8.9E-03
 200    4.601E-03  5.8E-03     2.708E-03  6.8E-03     1.260E-03  6.3E-03
 400    9.777E-03  6.1E-03     5.250E-03  6.6E-03     2.436E-03  6.1E-03
 800    1.979E-02  6.2E-03     9.949E-03  6.2E-03     4.639E-03  5.8E-03
 1600   4.901E-02  7.7E-03     1.595E-02  5.0E-03     8.895E-03  5.6E-03
 3200   9.593E-02  7.5E-03     3.001E-02  4.7E-03     1.925E-02  6.0E-03
 6400   1.744E-01  6.8E-03     5.415E-02  4.2E-03     3.635E-02  5.7E-03

Table 2: The MSE and MSE/(t/N) for N ∈ {250, 500, 1000} in the F2 case.
 N      MSE        MSE × N
 50     1.073E-05  5.4E-04
 100    4.951E-06  5.0E-04
 200    2.867E-06  5.7E-04
 400    1.492E-06  6.0E-04
 800    8.157E-07  6.5E-04
 1600   3.119E-07  5.0E-04
 3200   1.773E-07  5.7E-04
 6400   6.344E-08  4.1E-04

Table 3: The MSE and MSE × N for t = 100 in the F3 case.
Figure 1: The mean square error associated to the EnKBF in the F1 case. This plots the MSE against the time parameter for N ∈ {250, 500, 1000}; in each case the MSE grows approximately proportionally to t/N, consistent with the O(t/N) rate.
Figure 2: The mean square error associated to the EnKBF in the F2 case. This plots the MSE against the time parameter for N ∈ {250, 500, 1000}; in each case the MSE grows approximately proportionally to t/N.

Figure 3: The mean square error associated to the EnKBF in the F3 case. This plots the MSE against the ensemble size N for fixed time t = 100; the MSE decays approximately as 1/N.

4 Online Static Parameter Estimation

4.1 The Approach

We now assume that there is a collection of unknown parameters, $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$, associated to the model (1). For instance, $\theta$ could consist of some or all of the elements of $A$, $C$, $R_1$, $R_2$. We will then include an additional subscript $\theta$ in each of the mathematical objects that have been introduced in the previous two sections. As an example, we would write

\[ Z_{t,\theta}(X, Y) = \exp\Big[ \int_0^t \Big( \langle C X_s, R_2^{-1} dY_s \rangle - \frac{1}{2} \langle X_s, S X_s \rangle\, ds \Big) \Big]. \]

For $0 < s < t$ we introduce the notation $Z_{s,t,\theta}(Y) := Z_{t,\theta}(Y) / Z_{s,\theta}(Y)$, with the obvious extension to $Z_{s,t,\theta}^N(Y) := Z_{t,\theta}^N(Y) / Z_{s,\theta}^N(Y)$.

The basic idea of our approach is to consider the recursive estimation of $\theta$ on the basis of the arriving observations. In particular, for notational convenience, we shall consider a method which updates our current estimate of $\theta$ at unit times. Our objective is to follow a recursive maximum likelihood (RML) method (e.g. [11]) which is based upon the following update at any time $t \in \mathbb{N}$:

\[ \theta_t = \theta_{t-1} + \kappa_t\, \nabla_\theta \log\big( Z_{t-1,t,\theta}(Y) \big) \Big|_{\theta = \theta_{t-1}} \]

where $\{\kappa_t\}_{t \in \mathbb{N}}$ is a sequence of real numbers with $\kappa_t > 0$, $\sum_{t \in \mathbb{N}} \kappa_t = \infty$, $\sum_{t \in \mathbb{N}} \kappa_t^2 < \infty$. Computing the gradient of $Z_{t-1,t,\theta}(Y)$ can be computationally expensive, so our approach is to use a type of finite difference estimator via SPSA. We note that a classical finite difference estimator would require $\mathcal{O}(d_\theta)$ evaluations of $Z_{t-1,t,\theta}(Y)$, whereas the SPSA method keeps this number of evaluations down to 2; as computational cost is a concern, we prefer this aforementioned method.

Our approach is the following, noting that we will use the argument $(k)$ to denote the $k$-th element of a vector; a code sketch of the update in step 2. is given after the description.

1. Initialization: Set an initial $\theta_0 \in \Theta$ and choose two step-size sequences $\{\kappa_t\}_{t \in \mathbb{N}}$ and $\{\nu_t\}_{t \in \mathbb{N}}$ such that $\kappa_t > 0$, $\kappa_t, \nu_t \to 0$, $\sum_{t \in \mathbb{N}} \kappa_t = \infty$ and $\sum_{t \in \mathbb{N}} (\kappa_t / \nu_t)^2 < \infty$. Set $t = 1$ and generate the initial ensemble i.i.d. from $\eta_{0, \theta_0}$.

2. Iteration:
• For $k \in \{1, \ldots, d_\theta\}$, independently, sample $\Delta_t(k)$ from a Bernoulli distribution with success probability $1/2$ and support $\{-1, 1\}$.
• Set $\theta_{t-1}^+ = \theta_{t-1} + \nu_t \Delta_t$ and $\theta_{t-1}^- = \theta_{t-1} - \nu_t \Delta_t$.
• Using the EnKBF, with samples simulated under $\theta_{t-1}$, generate estimates $Z_{t-1,t,\theta_{t-1}^+}^N(Y)$ and $Z_{t-1,t,\theta_{t-1}^-}^N(Y)$.
• Set, for $k \in \{1, \ldots, d_\theta\}$,
\[ \theta_t(k) = \theta_{t-1}(k) + \frac{\kappa_t}{2 \nu_t \Delta_t(k)} \log\Bigg( \frac{ Z_{t-1,t,\theta_{t-1}^+}^N(Y) }{ Z_{t-1,t,\theta_{t-1}^-}^N(Y) } \Bigg). \]
• Run the EnKBF from $t - 1$ (starting with the samples simulated under $\theta_{t-1}$) up to time $t$ using the parameter $\theta_t$. Set $t = t + 1$ and return to the start of step 2.

We note that in practice one must run a time-discretization of the EnKBF, rather than the EnKBF itself. We remark also that the use of SPSA for parameter estimation associated to hidden Markov models has been considered previously, for instance in [14].
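The following sketches the update in step 2. in the standard SPSA form; the helper `window_logZ(theta, t)`, which would run the discretized EnKBF over $[t-1, t]$ under `theta` and return $\log Z_{t-1,t,\theta}^N(Y)$, is hypothetical.

```python
import numpy as np

def spsa_update(theta, t, kappa, nu, window_logZ, rng):
    """One SPSA-based recursive maximum likelihood update (step 2.)."""
    delta = rng.choice(np.array([-1.0, 1.0]), size=theta.shape)  # Rademacher directions
    log_up = window_logZ(theta + nu[t] * delta, t)    # estimate under theta^+
    log_down = window_logZ(theta - nu[t] * delta, t)  # estimate under theta^-
    # simultaneous-perturbation finite-difference gradient, one update per coordinate
    return theta + kappa[t] / (2.0 * nu[t] * delta) * (log_up - log_down)
```

Note that a single pair of EnKBF runs updates all $d_\theta$ coordinates at once, which is the computational appeal of SPSA over coordinate-wise finite differences.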
4.2 Numerical Results

We use the algorithm described in the previous section, along with the Euler-discretization in (30), to estimate the parameters in three different models. In particular, we show that the algorithm works in both linear and nonlinear models. In all three models, the data is generated from the true parameters.
In the first example, we take $A = \theta_1\, \mathrm{Id}$ and $R_1^{-1/2} = \theta_2\, R$, where $R$ is a fixed $r_1 \times r_1$ matrix, $C = \alpha(r_1, r_2)\, C^*$, where $C^*$ is a uniform random matrix and $\alpha(r_1, r_2)$ is a constant, and $R_2^{-1/2} = \alpha(r_2)\, \mathrm{Id}$, where $\alpha(r_2)$ is a constant. In Figures 4-9, we show the results for the estimation of the parameters $\theta_1$ and $\theta_2$ in the cases $r_1 = r_2 = 2$ and $r_1 = r_2 = 100$. In all cases we take $N = 100$, except in the case $r_1 = r_2 = 100$ in F3, where we take $N = 200$ to ensure the invertibility of $p_t$; otherwise the condition number of $p_t$ would be huge. The discretization level is $L = 8$ in all cases. The initial state is $X_0 \sim \mathcal{N}(4 \cdot \mathbf{1}_{r_1}, \mathrm{Id})$, where $\mathbf{1}_{r_1}$ is the vector of 1's in $\mathbb{R}^{r_1}$.

The results display that, in a reasonable case, one can estimate low-dimensional parameters using RML via SPSA. We now consider a few nonlinear models.
Figure 4: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2)$ in the case F1 with $r_1 = r_2 = 2$. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*)$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 5: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2)$ in the case F1 with $r_1 = r_2 = 100$. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*)$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 6: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2)$ in the case F2 with $r_1 = r_2 = 2$. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*)$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 7: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2)$ in the case F2 with $r_1 = r_2 = 100$. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*)$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 8: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2)$ in the case F3 with $r_1 = r_2 = 2$. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*)$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 9: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2)$ in the case F3 with $r_1 = r_2 = 100$. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*)$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).

We now consider the following nonlinear model with $r_1 = r_2 = 3$, which is a simplified mathematical model for atmospheric convection:

\[ dX_t = f(X_t)\, dt + R_1^{1/2}\, dW_t, \qquad dY_t = C X_t\, dt + R_2^{1/2}\, dV_t, \]

where

\[ f_1(X_t) = \theta_1 \big( X_t(2) - X_t(1) \big), \qquad f_2(X_t) = \theta_2 X_t(1) - X_t(2) - X_t(1) X_t(3), \qquad f_3(X_t) = X_t(1) X_t(2) - \theta_3 X_t(3), \]

and $X_t(i)$ is the $i$-th component of $X_t$. We also have $R_1^{1/2} = \mathrm{Id}$,

\[ C_{ij} = \begin{cases} 1 & \text{if } i = j \\ 1 & \text{if } i = j - 1 \\ 0 & \text{otherwise,} \end{cases} \qquad i, j \in \{1, 2, 3\}, \]

and $(R_2^{1/2})_{ij} = 2\, q\big( \min\{ |i - j|,\, r_2 - |i - j| \} \big)$, $i, j \in \{1, 2, 3\}$, where

\[ q(x) = \begin{cases} 1 - \tfrac{3}{2} x + \tfrac{1}{2} x^3 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

In Figures 10-12, we show the results for the estimation of the parameters $\theta = (\theta_1, \theta_2, \theta_3)$ in the F1, F2 and F3 cases. The ensemble size is $N = 100$ and the discretization level is $L = 8$ in all cases. The initial state is $X_0 \sim \mathcal{N}(\mathbf{1}_{r_1}, \mathrm{Id})$, where $\mathbf{1}_{r_1}$ is the vector of 1's in $\mathbb{R}^{r_1}$.
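A sketch of the drift $f$ of this convection model (the function name is our own):

```python
import numpy as np

def lorenz63_drift(x, theta):
    """Drift f of the convection model; x in R^3, theta = (theta_1, theta_2, theta_3)."""
    t1, t2, t3 = theta
    return np.array([t1 * (x[1] - x[0]),
                     t2 * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - t3 * x[2]])
```

This drift replaces the linear term $A \xi_{k\Delta_L}^i \Delta_L$ in the Euler schemes (30), in the spirit of Remark 2.3.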
Figure 10: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2, \theta_3)$ in the case F1. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*, \theta_3^*)$, with $\theta_1^* = 10$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 11: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2, \theta_3)$ in the case F2. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*, \theta_3^*)$, with $\theta_1^* = 10$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).
Figure 12: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $(\theta_1, \theta_2, \theta_3)$ in the case F3. The algorithm is initialized away from the true values. The green horizontal lines represent the true parameter values $(\theta_1^*, \theta_2^*, \theta_3^*)$, with $\theta_1^* = 10$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).

Finally, we consider the following nonlinear model, Lorenz 96, with $r_1 = r_2 = 40$. The solution of this model has chaotic behavior and it describes the evolution of a scalar quantity on a circle of latitude:

\[ dX_t = f(X_t)\, dt + R_1^{1/2}\, dW_t, \qquad dY_t = X_t\, dt + R_2^{1/2}\, dV_t, \]

where

\[ f_i(X_t) = \big( X_t(i+1) - X_t(i-2) \big) X_t(i-1) - X_t(i) + \theta, \]

$X_t(i)$ is the $i$-th component of $X_t$, and it is assumed that
$X_t(-1) = X_t(r_1 - 1)$, $X_t(0) = X_t(r_1)$ and $X_t(r_1 + 1) = X_t(1)$. We also have $R_1^{1/2}$ a constant multiple of $\mathrm{Id}$ and $R_2^{1/2} = \mathrm{Id}$. Here $\theta$ represents the external force in the system, while $(X_t(i+1) - X_t(i-2)) X_t(i-1)$ is the advection-like term and $-X_t(i)$ is the damping term. In Figure 13, we show the results for the estimation of the parameter $\theta$ in the F1, F2 and F3 cases. The ensemble size is $N = 100$ and the discretization level is $L = 8$ in all cases. In the F1 & F2 cases, $X_0$ is initialized as follows: $X_0(1)$ is slightly perturbed from 8 and $X_0(k) = 8$ for $1 < k \le r_1$. In F3, to avoid having the matrix $p_0$ equal to zero, we take $X_0 \sim \mathcal{N}(8 \cdot \mathbf{1}_{r_1}, c\, \mathrm{Id})$ for a small constant $c > 0$.
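A sketch of the Lorenz 96 drift under the periodic convention above (the function name is our own; with 0-based indexing, np.roll implements the wrap-around):

```python
import numpy as np

def lorenz96_drift(x, theta):
    """Drift f_i = (x(i+1) - x(i-2)) x(i-1) - x(i) + theta with periodic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + theta
```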
Figure 13: The blue curves, along with their average (in red), are trajectories from the execution of the algorithm in Section 4.1 for the estimation of $\theta$ in the case F1 (left), F2 (middle) and F3 (right). The initial value of $\theta$ is 10. The green horizontal lines represent the true parameter value $\theta^* = 8$. The step sizes $\nu_t$ and $\kappa_t$ decay polynomially in $t$ ($\kappa_t$ held constant over an initial burn-in period).

Acknowledgements

AJ & HR were supported by KAUST baseline funding. DC has been partially supported by EU project STUOD - DLV-856408.
References

[1] Allen, J. I., Eknes, M. & Evensen, G. (2002). An ensemble Kalman filter with a complex marine ecosystem model: hindcasting phytoplankton in the Cretan Sea. Ann. Geophys., 1-13.
[2] Bain, A. & Crisan, D. (2009). Fundamentals of Stochastic Filtering. Springer, New York.
[3] Bishop, A. N. & Del Moral, P. (2020). On the mathematical theory of ensemble (linear-Gaussian) Kalman-Bucy filtering. arXiv preprint arXiv:2006.08843.
[4] Bishop, A. N. & Del Moral, P. (2020). On the stability of matrix-valued Riccati diffusions. Electron. J. Probab., 1-40.
[5] Bishop, A. N. & Del Moral, P. (2017). On the stability of Kalman-Bucy diffusion processes. SIAM J. Control Optim., 55, 4015-4047.
[6] Bishop, A. N. & Del Moral, P. (2019). Stability properties of systems of linear stochastic differential equations with random coefficients. SIAM J. Control Optim., 57, 1023-1042.
[7] Cerou, F., Del Moral, P. & Guyader, A. (2011). A non-asymptotic variance theorem for un-normalized Feynman-Kac particle models. Ann. Inst. Henri Poincaré, 47, 629-649.
[8] Del Moral, P. & Tugaut, J. (2018). On the stability and the uniform propagation of chaos properties of ensemble Kalman-Bucy filters. Ann. Appl. Probab., 28, 790-850.
[9] Drovandi, C., Everitt, R., Golightly, A. & Prangle, D. (2019). Ensemble MCMC: accelerating pseudo-marginal MCMC for state space models using the ensemble Kalman filter. arXiv preprint arXiv:1906.02014v2.
[10] Evensen, G. (1994). Sequential data assimilation with a non-linear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res., 99, 10143-10162.
[11] Le Gland, F. & Mevel, M. (1997). Recursive identification in hidden Markov models. Proc. 36th IEEE Conf. Decision and Control, 3468-3473.
[12] Le Gland, F., Monbet, V. & Tran, V.-D. (2011). Large sample asymptotics for the ensemble Kalman filter. In The Oxford Handbook of Nonlinear Filtering, Oxford Univ. Press, Oxford, 598-631.
[13] Karatzas, I. & Shreve, S. (1991). Brownian Motion and Stochastic Calculus. Second edition, Springer, New York.
[14] Poyiadjis, G., Singh, S. S. & Doucet, A. (2006). Gradient-free maximum likelihood parameter estimation with particle filters. Proc. Amer. Control Conf., 6-9.
[15] Reich, S. & Cotter, C. (2013). Ensemble filter techniques for intermittent data assimilation. In Large Scale Inverse Problems: Computational Methods and Applications in the Earth Sciences (eds. M. Cullen, M. A. Freitag, S. Kindermann & R. Scheichl), 91-134. De Gruyter.
[16] Sakov, P. & Oke, P. R. (2008). A deterministic formulation of the ensemble Kalman filter: an alternative to ensemble square root filters. Tellus A, 60, 361-371.
[17] Skjervheim, J.-A., Evensen, G., Aanonsen, S. I., Ruud, B. O. & Johansen, T. A. (2005). Incorporating 4D seismic data in reservoir simulation models using ensemble Kalman filter. SPE Annual Technical Conference, SPE 95789.
[18] Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Autom. Control, 37, 332-341.
[19] Tong, X. T., Majda, A. J. & Kelly, D. (2016). Nonlinear stability and ergodicity of ensemble based Kalman filters. Nonlinearity, 29, 657-691.