Score-Based Parameter Estimation for a Class of Continuous-Time State Space Models
Alexandros Beskos, Dan Crisan, Ajay Jasra, Nikolas Kantas, Hamza Ruzayqat
BY ALEXANDROS BESKOS, DAN CRISAN, AJAY JASRA, NIKOLAS KANTAS & HAMZA RUZAYQAT
Department of Statistical Science, University College London, London, WC1E 6BT, UK. E-Mail: [email protected]
Department of Mathematics, Imperial College London, London, SW7 2AZ, UK. E-Mail: [email protected], [email protected]
Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955, KSA. E-Mail: [email protected], [email protected]
Abstract
We consider the problem of parameter estimation for a class of continuous-time state space models. In particular, we explore the case of a partially observed diffusion, with data also arriving according to a diffusion process. Based upon a standard identity of the score function, we consider two particle filter based methodologies to estimate the score function. Both methods rely on an online estimation algorithm for the score function, as described, e.g., in [11], of O(N²) cost, with N ∈ N the number of particles. The first approach employs a simple Euler discretization and standard particle smoothers and is of cost O(N² + N∆_l^{-1}) per unit time, where ∆_l = 2^{-l}, l ∈ N, is the time-discretization step. The second approach is new and based upon a novel diffusion bridge construction. It yields a new backward type Feynman-Kac formula in continuous-time for the score function and is presented along with a particle method for its approximation. Considering a time-discretization, the cost is O(N²∆_l^{-1}) per unit time. To improve computational costs, we then consider multilevel methodologies for the score function. We illustrate our parameter estimation method via stochastic gradient approaches in several numerical examples.

Keywords: Score Function, Parameter Estimation, Particle Filter, Diffusion Bridges.
1 Introduction

We consider the problem of parameter estimation for continuous-time state space models. These are models comprising stochastic differential equations (SDEs) describing a hidden dynamic state and their observations. Such models are ubiquitous in a large number of practical applications in science, engineering, finance, and economics; see [9] for an overview. Inference in state space models, also known as hidden Markov models, hinges upon computing conditional probability distributions of the dynamic hidden state given the acquired observations and unknown static parameters. This is referred to as the stochastic filtering problem, which is in general intractable, but reliable numerical approximations are routinely available [1, 9]. The problem of inferring the unknown parameters is more challenging. In this paper we focus on maximum likelihood inference and gradient methods that are performed in an online manner. The offline or iterative case, given a fixed batch of observations, can also be treated using our proposed methods. The approach considered in this article is to make use of the gradient of the log-likelihood, commonly referred to as the score function, within a stochastic gradient algorithm (see e.g. [11, 12]). Intrinsically, there are several challenges arising with such an approach. Firstly, when one adopts a continuous-time model and assumes access to arbitrarily high frequency observations, one does not observe in practice truly in continuous-time, therefore some sort of time-discretization is required. Secondly, in both discrete-time and continuous-time formulations, there are very few cases where the score function is analytically available.
Both these issues imply that numerical approximations are required.

We consider two different approaches for the numerical approximation of the score function. The first is to simply time-discretize a representation of the score function and then apply discrete-time numerical approximation schemes [11, 12, 20]. The second is to develop a numerical approximation scheme directly on continuous-time path-space to estimate the score function and then (necessarily) discretize the algorithm in time. The order of designing the estimation method and time-discretization can be rather important. Often the second approach is preferable in terms of both performance and robustness as the discretization mesh vanishes, see e.g. [22, 21]. We then use the score estimate for implementing recursive maximum likelihood, where the parameters are updated at unit time intervals ([19, 24]). The particular choice of time interval length is without loss of generality, and allows the score to accumulate sufficient information from the observations before updating the parameters. Whilst a fully continuous-time implementation for the parameter update is desirable, there are fundamental challenges in terms of establishing a valid likelihood function for the multiplicative noise case.

In the first approach, we consider a well-known expression for the score function, for instance as given in [6]. Given this formula, one can produce an Euler discretized version of the score and work in discrete-time but with high frequency. The score is an expectation of an additive functional of the hidden state path conditioned on the available observations; the conditional law involved is commonly referred to as the smoothing distribution. In this context, many well-known particle smoothing schemes can now be adopted, such as the ones described in [11, 20]. These latter approaches are simulation-based schemes whose convergence is based upon the number of samples N ∈ N.
We prove some technical results for the discretized problem, which together with the work in [11] allow us to conjecture that to obtain a mean square error (MSE) of O(ε²), for given ε > 0, we require a computational cost of O(ε^{-4}) per unit time. The latter derives from an algorithmic cost of O(N² + N∆_l^{-1}) per unit time, where ∆_l = 2^{-l} is the time-discretization step. As we explain later in the article, this corresponds to a best case scenario, due to the intrinsic nature of the algorithm. In particular, we start with a continuous-time formula and time-discretize it, but the deduced numerical algorithm can be problematic in terms of computational complexity as l → ∞. Whilst in some scenarios one does not observe any issues, examples can be found where the variance of the method can explode as l grows ([33]), thus putting into question the validity of the conjecture on the cost to achieve an MSE of O(ε²).

This motivates the introduction of our second approach, where we build upon a change of measure technique proposed in [26]. This latter approach has so far been used in very different contexts than the present paper, namely related to discretely observed diffusions for Bayesian inference and Markov chain Monte Carlo (MCMC) [30, 32] or smoothing for linear observation functions [31]. Our approach is a data augmentation scheme, whereby, at unit time intervals, the end points for the hidden state are sampled and the path is connected using diffusion bridges. Then, starting again from the formula for the score function in [6], we will use the change of measure associated with a diffusion bridge and its driving Brownian noise; see also [33] where a related approach is used for a different class of problems. Based upon this change of measure we develop a new backward type Feynman-Kac formula in continuous-time. This new formula facilitates an adaptation of the method in [11] in true continuous-time, albeit one cannot apply it in practice.
We time-discretize the algorithm and conjecture that to obtain an MSE of O(ε²), for given ε > 0, we need a cost of O(ε^{-6}) per unit time, which derives from an algorithmic cost of O(N²∆_l^{-1}) per unit time. We note, however, that this computational complexity will not explode with increasing l, as may be the case in the first approach. To improve the cost required for a given MSE, we develop a novel multilevel Monte Carlo extension that can, in some cases, achieve an MSE of O(ε²) at a cost per unit time of O(ε^{-4}). We remark that although our MSE-cost statements are based upon conjectures, they are verified numerically. Direct proofs of these require substantial technical results that will be the topic of future work.

We conclude this introduction by emphasizing that our contributions are aimed at dealing with both continuous-time observations and hidden states. As mentioned earlier, this poses very particular challenges relative to earlier works that deal with filtering and smoothing when discrete-time observations/models are used, as in [5, 13, 14, 31, 32, 25]. None of these works look at continuous-time observations. Similarly, online likelihood estimation of the parameters using the score function is considered in [13, 14] only for the case of discrete-time observations of hidden diffusions.

The contributions of this paper can be summarized as follows:

• We investigate the efficiency and accuracy of two fundamentally different numerical approximations of the score function, on its own and when used for the purpose of recursive maximum likelihood. Both methods rely on fairly standard tools such as changes of measure, particle smoothing and Euler time-discretization.

• We provide a detailed discussion on the computational complexity of each method. We illustrate the performance and computational cost for several models in numerical examples that consider estimation of the score function and parameter estimation.
• The second approach is a novel method that operates directly on the path-space. The approach improves performance and is robust to arbitrarily small time-discretization at the expense of additional computational cost. The latter is reduced using a new Multilevel Particle Filter; see [15, 18] for some existing approaches.

This article is structured as follows. In Section 2 the basic problem is formulated in continuous-time. In Section 3 we consider our first method for online score estimation and explain the various features associated to it. In Section 4 our second method for online score estimation is developed. In Section 5 our numerical results are presented.
Notation

Let (X, 𝒳) be a measurable space. We write B_b(X) for the set of bounded measurable functions ϕ : X → R^d, d ∈ N, and C(X) for the continuous ones. For ϕ : R^d → R, Lip_{‖·‖}(R^d) denotes the collection of real-valued functions that are Lipschitz w.r.t. the Euclidean distance ‖·‖. That is, ϕ ∈ Lip_{‖·‖}(R^d) if there exists a C < +∞ such that for any x, y ∈ R^d, |ϕ(x) − ϕ(y)| ≤ C‖x − y‖. N_s(µ, Σ) denotes an s-dimensional Gaussian law of mean µ and covariance Σ; if s = 1 we omit the subscript s. For a vector/matrix X, X* denotes the transpose of X. For A ∈ 𝒳, δ_A(du) denotes the Dirac measure of A, and if A = {x} with x ∈ X, we write δ_x(du). For a vector-valued function in d dimensions ϕ(x) (resp. d-dimensional vector x) we write the i-th component, 1 ≤ i ≤ d, as ϕ^{(i)}(x) (resp. x^{(i)}). For a d × q matrix x we write the (i, j)-th entry as x^{(ij)}. N = {1, 2, . . .} and N_0 = N ∪ {0}.

2 Problem Formulation
We consider the parameter space θ ∈ Θ ⊂ R^{d_θ}, with Θ compact, d_θ ∈ N. The stochastic processes {Y_t}_{t≥0}, {X_t}_{t≥0} of interest are defined upon the probability triple (Ω, F, P_θ), with Y_t ∈ R^{d_y}, X_t ∈ R^{d_x}, d_y, d_x ∈ N, initial conditions X_0 = x* ∈ R^{d_x}, Y_0 = y* ∈ R^{d_y}, and are determined as the solution of the system of SDEs:

dY_t = h_θ(X_t) dt + dB_t;   (1)
dX_t = b_θ(X_t) dt + σ(X_t) dW_t.   (2)

Here, for each θ ∈ Θ, h_θ : R^{d_x} → R^{d_y}, b_θ : R^{d_x} → R^{d_x}, σ : R^{d_x} → R^{d_x × d_x}, with σ of full rank, and {B_t}_{t≥0}, {W_t}_{t≥0} are independent standard Brownian motions of dimension d_y, d_x respectively. To minimize technical difficulties, the following assumptions are made throughout the paper:

(D1) (i) σ is continuous, bounded; a(x) := σ(x)σ(x)* is uniformly elliptic;
(ii) for each θ, h_θ and b_θ are bounded, measurable; h_θ^{(i)} ∈ Lip_{‖·‖}(R^{d_x}), 1 ≤ i ≤ d_y;
(iii) the gradients ∇_θ h_θ : R^{d_x} → R^{d_y × d_θ} and ∇_θ b_θ : R^{d_x} → R^{d_x × d_θ} exist, and are continuous, bounded, measurable; ∇_θ h_θ^{(ij)} ∈ Lip_{‖·‖}(R^{d_x}), 1 ≤ i ≤ d_y, 1 ≤ j ≤ d_θ;
(iv) let φ_θ(x) = (∇_θ b_θ(x))* a(x)^{-1} σ(x); for any θ, φ_θ^{(ij)} ∈ Lip_{‖·‖}(R^{d_x}), 1 ≤ i ≤ d_θ, 1 ≤ j ≤ d_x.

We introduce the probability measure P̄_θ, defined via the Radon-Nikodym derivative:

Z_{t,θ} := dP_θ/dP̄_θ |_{F_t} = exp{ ∫_0^t h_θ(X_s)* dY_s − (1/2) ∫_0^t h_θ(X_s)* h_θ(X_s) ds },   (3)

with F_t = σ({X_s, Y_s}_{0≤s≤t}). A simple implication of Girsanov's theorem gives that:

Z_{T,θ} ≡ p_θ({Y_t}_{0≤t≤T} | {X_t}_{0≤t≤T}),   (4)

for a time horizon T > 0, the right hand side quantity denoting the (un-normalized version of the) density – w.r.t. the Wiener measure on C([0, T], R^{d_y}) – of the data {Y_t}_{0≤t≤T} conditionally on the signal {X_t}_{0≤t≤T}. Henceforth, Ē_θ denotes expectation w.r.t.
P̄_θ, so that under P̄_θ the process {X_t}_{t≥0} follows the dynamics in (2), whereas {Y_t}_{t≥0} is a Brownian motion independent of {X_t}_{t≥0}. With the Zakai equation in mind, we define, for ϕ ∈ B_b(R^{d_x}):

γ_{t,θ}(ϕ) := Ē_θ[ ϕ(X_t) exp{ ∫_0^t h_θ(X_s)* dY_s − (1/2) ∫_0^t h_θ(X_s)* h_θ(X_s) ds } | Y_t ],

where Y_t is the filtration generated by the process {Y_s}_{0≤s≤t}.

Our objective is to produce estimates of the score function:

∇_θ log p_θ({Y_t}_{0≤t≤T}) ≡ ∇_θ log(γ_{T,θ}(1)).

In our setting, the score function writes as (see e.g. [6]):

∇_θ log(γ_{T,θ}(1)) = Ē_θ[λ_{T,θ} Z_{T,θ} | Y_T] / Ē_θ[Z_{T,θ} | Y_T],   (5)

where we have defined:

λ_{T,θ} := ∫_0^T (∇_θ b_θ(X_t))* a(X_t)^{-1} σ(X_t) dW_t + ∫_0^T (∇_θ h_θ(X_t))* dY_t − ∫_0^T (∇_θ h_θ(X_t))* h_θ(X_t) dt.

For completeness, a derivation of (5) can be found in Appendix A. We remark that one can derive a formula for the score function when σ depends upon θ, which is given in Section 4. We will assume throughout that T ∈ N. Note also that an application of Bayes' rule gives, almost surely:

Ē_θ[λ_{T,θ} Z_{T,θ} | Y_T] / Ē_θ[Z_{T,θ} | Y_T] = E_θ[λ_{T,θ} | Y_T],   (6)

where E_θ denotes expectation w.r.t. P_θ.

In the offline case, suppose one has obtained data {Y_t}_{0≤t≤T}. Then it is possible to perform standard gradient ascent using (6), updating θ iteratively:

θ_{m+1} = θ_m + α_m E_{θ_m}[λ_{T,θ_m} | Y_T],   (7)

where α_m ∈ R^+, m ∈ N, are decreasing step-sizes. Instead, here we will mainly focus on an online gradient estimation procedure as follows.
Given an initial θ_0 ∈ Θ, as we obtain the observation path continuously in time we will update θ at times T ∈ N using the following recursion:

θ_T = θ_{T−1} + α_T ( ∇_θ log(γ_{T,θ_{T−1}}(1)) − ∇_θ log(γ_{T−1,θ_{T−1}}(1)) )   (8)
    ≡ θ_{T−1} + α_T ∇_θ log p_{θ_{T−1}}( {Y_t}_{T−1≤t≤T} | {Y_t}_{0≤t≤T−1} ),

where, for T ∈ N, α_T ∈ R^+ is a collection of step-sizes that satisfy Σ_{T∈N} α_T = ∞, Σ_{T∈N} α_T² < ∞, to ensure convergence of the estimation as T → ∞; see [3, 19] for details. This scheme can provide an online estimate for the parameter vector as data arrive. Steps are performed at O(1) times to ensure that enough information has accumulated to update the parameter. The adoption of unit times is made only for notational convenience. As both recursions (7) and (8) cannot be computed exactly, we focus upon methodologies that approximate the score function ∇_θ log(γ_{T,θ}(1)).

In practice, we will have to work with a discretization of the model in (1)-(2). We assume access to a path of data {Y_t}_{0≤t≤T} which is available up to an (almost) arbitrarily fine level of time discretization. This would be a very finely discretized path, as accessing the actual continuous path of observations is not possible; this point is discussed later on. One could focus on a time-discretization of either side of (6); however, as is conventional in the literature (e.g. [1, 18]), we focus on the left hand side.

Let l ∈ N and consider an Euler-Maruyama time-discretization with step-size ∆_l = 2^{-l}. That is, for k ∈ {1, 2, . . . , T/∆_l}:

X̃_{k∆_l} = X̃_{(k−1)∆_l} + b_θ(X̃_{(k−1)∆_l}) ∆_l + σ(X̃_{(k−1)∆_l}) (W_{k∆_l} − W_{(k−1)∆_l}),   X̃_0 = x*.   (9)

Note that the Brownian motion in (9) is the same as in (2) under both P_θ and P̄_θ. We set: λ^l_{T,θ}(x_0, x_{∆_l}, . . .
, x_T) := Σ_{k=0}^{T/∆_l − 1} { (∇_θ b_θ(x_{k∆_l}))* a(x_{k∆_l})^{-1} σ(x_{k∆_l}) (W_{(k+1)∆_l} − W_{k∆_l}) + (∇_θ h_θ(x_{k∆_l}))* (Y_{(k+1)∆_l} − Y_{k∆_l}) − (∇_θ h_θ(x_{k∆_l}))* h_θ(x_{k∆_l}) ∆_l }.   (10)

We remark that λ^l_{T,θ} is a function also of the observations, but this dependence is suppressed from the notation. For k ∈ {0, 1, . . . , T/∆_l − 1}, we define:

g^l_{k,θ}(x_{k∆_l}) := exp{ h_θ(x_{k∆_l})* (y_{(k+1)∆_l} − y_{k∆_l}) − (∆_l/2) h_θ(x_{k∆_l})* h_θ(x_{k∆_l}) }.

Note that:

Z^l_{T,θ}(x_0, x_{∆_l}, . . . , x_T) := Π_{k=0}^{T/∆_l − 1} g^l_{k,θ}(x_{k∆_l}) = exp{ Σ_{k=0}^{T/∆_l − 1} [ h_θ(x_{k∆_l})* (y_{(k+1)∆_l} − y_{k∆_l}) − (∆_l/2) h_θ(x_{k∆_l})* h_θ(x_{k∆_l}) ] }

is a time-discretization of Z_{T,θ}. We thus obtain the discretized approximation of the score function ∇_θ log(γ_{T,θ}(1)):

∇_θ log(γ^l_{T,θ}(1)) := Ē_θ[ λ^l_{T,θ}(X̃_0, X̃_{∆_l}, . . . , X̃_T) Z^l_{T,θ}(X̃_0, X̃_{∆_l}, . . . , X̃_T) | Y_T ] / Ē_θ[ Z^l_{T,θ}(X̃_0, X̃_{∆_l}, . . . , X̃_T) | Y_T ].   (11)

We have the following result, which establishes the convergence of our Euler approximation. The proof and the associated additional assumptions (D2-3) are given in Appendix B.

Proposition 3.1.
Assume (D1-3). Then, for any (T, θ) ∈ [0, ∞) × Θ there exists a C < +∞ such that for any l ∈ N we have:

| E_θ[ ∇_θ log(γ_{T,θ}(1)) − ∇_θ log(γ^l_{T,θ}(1)) ] | ≤ C ∆_l^{1/2}.

The result is fairly standard, but we note it is not a simple application of results on SDEs in filtering (e.g. [23, 28]), and this is reflected in the proof. The rate of convergence of the approximation will be very relevant for some of our subsequent discussions.
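The Euler-Maruyama scheme (9), applied jointly to the signal (2) and the observation path (1) on the grid of mesh ∆_l = 2^{-l}, can be sketched as follows. This is a minimal illustration only: the scalar drift b_θ, observation function h_θ and diffusion coefficient σ below are hypothetical stand-ins (chosen bounded and Lipschitz in the spirit of (D1)), not models used in this paper.

```python
import numpy as np

def euler_paths(theta, T=4, l=6, x0=0.0, y0=0.0, seed=0):
    """Euler-Maruyama discretization (step Delta_l = 2^{-l}) of the pair
    dX = b_theta(X) dt + sigma(X) dW,  dY = h_theta(X) dt + dB.
    b, h, sig below are illustrative scalar stand-ins."""
    rng = np.random.default_rng(seed)
    dl = 2.0 ** (-l)                      # Delta_l = 2^{-l}
    n = int(T / dl)                       # number of fine steps
    b = lambda x: theta * np.sin(x)       # hypothetical drift b_theta
    h = lambda x: np.tanh(x)              # hypothetical bounded, Lipschitz h_theta
    sig = lambda x: 1.0                   # hypothetical (constant) sigma
    x = np.empty(n + 1); y = np.empty(n + 1)
    x[0], y[0] = x0, y0
    dW = rng.normal(0.0, np.sqrt(dl), n)  # signal Brownian increments
    dB = rng.normal(0.0, np.sqrt(dl), n)  # observation Brownian increments
    for k in range(n):
        x[k + 1] = x[k] + b(x[k]) * dl + sig(x[k]) * dW[k]
        y[k + 1] = y[k] + h(x[k]) * dl + dB[k]
    return x, y

xs, ys = euler_paths(theta=0.5)
```

Refining l halves the mesh and doubles the per-unit-time simulation cost, which is the O(∆_l^{-1}) factor appearing in the cost statements above.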
From here on, the X̃ notation is dropped for simplicity. Consider the time interval [k, k+1] and the k-th update of (8). We define the discrete-time approximation (at level l) as:

u_{k,l} = (x_{k+∆_l}, . . . , x_{k+1}) ∈ E_l := (R^{d_x})^{∆_l^{-1}},   k ∈ N_0.

The discretized (un-normalized) density corresponding to p_θ({Y_t}_{k≤t≤k+1} | {X_t}_{0≤t≤T}) is:

G^l_{k,θ}(u_{k−1,l}, u_{k,l}) := Π_{p=0}^{∆_l^{-1} − 1} g^l_{k+p,θ}(x_{k+p∆_l}),

where we set u_{−1,l} = x*, for each l ∈ N. We denote by m^l_θ the Euler transition density induced by the time-discretisation (9), and then write the initial distribution and Markov transition kernel for the discrete-time process with k ∈ N as follows:

η^l_{0,θ}(du_{0,l}) = Π_{p=1}^{∆_l^{-1}} m^l_θ(x_{(p−1)∆_l}, x_{p∆_l}) dx_{p∆_l};
M^l_θ(u_{k−1,l}, du_{k,l}) = Π_{p=1}^{∆_l^{-1}} m^l_θ(x_{k+(p−1)∆_l}, x_{k+p∆_l}) dx_{k+p∆_l}.

Remark 3.1.
The definition of G^l_{k,θ}(u_{k−1,l}, u_{k,l}), M^l_θ(u_{k−1,l}, du_{k,l}) implies that: i) G^l_{k,θ}(u_{k−1,l}, u_{k,l}) involves u_{k−1,l} only via its very last element, x_k; ii) the dynamics of u_{k,l} conditionally on u_{k−1,l} depend only on the very last element, x_k, of u_{k−1,l}.

We can now state the discrete-time filtering distribution for k ∈ N_0:

π^l_{k,θ}(d(u_{0,l}, . . . , u_{k,l})) := ( Π_{p=0}^{k} G^l_{p,θ}(u_{p−1,l}, u_{p,l}) ) η^l_{0,θ}(du_{0,l}) Π_{p=1}^{k} M^l_θ(u_{p−1,l}, du_{p,l}) / ∫_{E_l^{k+1}} ( Π_{p=0}^{k} G^l_{p,θ}(u_{p−1,l}, u_{p,l}) ) η^l_{0,θ}(du_{0,l}) Π_{p=1}^{k} M^l_θ(u_{p−1,l}, du_{p,l}).   (12)

That is, π^l_{k,θ}(d(u_{0,l}, . . . , u_{k,l})) is a discrete-time approximation of the filtering distribution:

π_{k,θ}(d({X_t}_{0≤t≤k})) := P_θ(d{X_t}_{0≤t≤k} | {Y_t}_{0≤t≤k}).

Expression (12) corresponds to a standard Feynman-Kac model (see e.g. [10]), thus one can approximate the involved filtering distributions via the corresponding Monte Carlo methodology.

We develop a Monte Carlo method for the approximation of the discretised score function in (11). This is accomplished by presenting a backward formula for (11). We define for any p ∈ N_0:

f^l_θ(x_{p∆_l}, x_{(p+1)∆_l}) := (∇_θ b_θ(x_{p∆_l}))* a(x_{p∆_l})^{-1} σ(x_{p∆_l}) (W_{(p+1)∆_l} − W_{p∆_l}) + (∇_θ h_θ(x_{p∆_l}))* (Y_{(p+1)∆_l} − Y_{p∆_l}) − (∇_θ h_θ(x_{p∆_l}))* h_θ(x_{p∆_l}) ∆_l,

and let:

Λ^l_{k,θ}(u_{k−1,l}, u_{k,l}) := Σ_{p=0}^{∆_l^{-1} − 1} f^l_θ(x_{k+p∆_l}, x_{k+(p+1)∆_l});
F^l_{T,θ}(u_{0,l}, . . . , u_{T−1,l}) := Σ_{k=0}^{T−1} Λ^l_{k,θ}(u_{k−1,l}, u_{k,l}) ( ≡ λ^l_{T,θ}(x_0, x_{∆_l}, . . . , x_T) ),   (13)

for λ^l_{T,θ}(x_0, x_{∆_l}, . . . , x_T) as defined in (10) and used in the score function approximation (11). Thus:

∇_θ log(γ^l_{T,θ}(1)) = ∫_{E_l^T} F^l_{T,θ}(u_{0,l}, . . . , u_{T−1,l}) Q^l_{T−1,θ}(d(u_{0,l}, . . .
, u_{T−1,l})),   (14)

where Q^l_{T−1,θ}(d(u_{0,l}, . . . , u_{T−1,l})) is a time-discretisation of the smoothing law:

Q_{T−1,θ}(d{X_t}_{0≤t≤T}) := P_θ(d{X_t}_{0≤t≤T} | {Y_t}_{0≤t≤T}).

Now, by the time-reversal formula for hidden Markov models (see e.g. [11, 12]) one has:

Q^l_{T−1,θ}(d(u_{0,l}, . . . , u_{T−1,l})) := π^l_{T−1,θ}(du_{T−1,l}) Π_{k=1}^{T−1} B^l_{k,θ,π^l_{k−1,θ}}(u_{k,l}, du_{k−1,l}),

for the backward Markov kernel:

B^l_{k,θ,π^l_{k−1,θ}}(u_{k,l}, du_{k−1,l}) := π^l_{k−1,θ}(du_{k−1,l}) G^l_{k,θ}(u_{k−1,l}, u_{k,l}) m^l_θ(u_{k−1,l}, u_{k,l}) / π^l_{k−1,θ}( G^l_{k,θ}(·, u_{k,l}) m^l_θ(·, u_{k,l}) ),   (15)

under the standard notation:

π^l_{k−1,θ}( G^l_{k,θ}(·, u_{k,l}) m^l_θ(·, u_{k,l}) ) = ∫_{E_l} π^l_{k−1,θ}(du_{k−1,l}) G^l_{k,θ}(u_{k−1,l}, u_{k,l}) m^l_θ(u_{k−1,l}, u_{k,l}).

Remark 3.2.
The model structure gives important cancellations in (15), so that:

B^l_{k,θ,π^l_{k−1,θ}}(u_{k,l}, du_{k−1,l}) ≡ π^l_{k−1,θ}(du_{k−1,l}) g^l_{k,θ}(x_k) m^l_θ(x_k, x_{k+∆_l}) / ∫_{E_l} π^l_{k−1,θ}(du_{k−1,l}) g^l_{k,θ}(x_k) m^l_θ(x_k, x_{k+∆_l}).

The objective now is to approximate the right hand side of (14) using particle approximations. Our online particle approximation of the gradient of the log-likelihood in (14), for a given l ∈ N, is presented in Algorithm 1. Our estimates are given in (16) and (18) in Algorithm 1. The approach is the method introduced in [11, 12].

There are several remarks worth making before proceeding. Firstly, the cost of this algorithm per unit time is O(N∆_l^{-1} + N²). In detail, the cost of the particle filter is O(N∆_l^{-1}). The cost of calculating F^{l,N}_{k,θ}(u^i_{k,l}), 1 ≤ i ≤ N, in (17) is O(N²); ∆_l is not involved here due to cancellations – see Remark 3.2. The cost of (18) is O(N), given the particle filter has already been executed. There are several implications of this remark. Based upon the results in [11] and Proposition 3.1, one expects that, under appropriate assumptions, we will have the following MSE, for k, N ∈ N:

E_θ[ ‖ ∇̂_θ log(γ^l_{k,θ}(1)) − ∇_θ log(γ_{k,θ}(1)) ‖² ] ≤ C ( 1/N + ∆_l ),   (19)

where C is a constant that does not depend on N or l. To achieve an MSE of O(ε²) for some given ε > 0, one sets l so that ∆_l = ε² (i.e. l = O(|log(ε)|)) and N = O(ε^{-2}). The cost per unit time of doing this is then O(ε^{-4}). We note that to choose l as specified, one has to have access to an appropriately finely observed data set, and this is assumed throughout.
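The balancing of the two terms in the conjectured bound above can be made concrete with a small helper; constants are ignored and only the orders are tracked, matching the choices ∆_l = ε², N = O(ε^{-2}) just described.

```python
import math

def tuning_for_mse(eps):
    """Balance the two terms of the conjectured MSE bound C(1/N + Delta_l):
    take Delta_l = eps^2, i.e. the smallest l with 2^{-l} <= eps^2,
    and N = eps^{-2}; the resulting cost per unit time is O(eps^{-4})."""
    l = math.ceil(2 * math.log2(1.0 / eps))    # smallest l with 2^{-l} <= eps^2
    N = math.ceil(eps ** -2)
    cost_per_unit_time = N * 2 ** l + N ** 2   # O(N Delta_l^{-1} + N^2)
    return l, N, cost_per_unit_time
```

For example, eps = 0.1 gives l = 7 (so ∆_l = 2^{-7} ≤ 10^{-2}) and N = 100, with both cost terms of order ε^{-4}.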
Algorithm 1 Online Score Function Estimation for a given l ∈ N.

1. For i ∈ {1, . . . , N}, sample u^i_{0,l} i.i.d. from η^l_{0,θ}(·). The estimate of ∇_θ log(γ^l_{1,θ}(1)) is:

∇̂_θ log(γ^l_{1,θ}(1)) := Σ_{i=1}^{N} G^l_{0,θ}(x*, u^i_{0,l}) Λ^l_{0,θ}(x*, u^i_{0,l}) / Σ_{i=1}^{N} G^l_{0,θ}(x*, u^i_{0,l}).   (16)

Set k = 1 and, for i ∈ {1, . . . , N}, ǔ^i_{−1,l} = x*.

2. For i ∈ {1, . . . , N}, sample ǔ^i_{k−1,l} from:

Σ_{i=1}^{N} [ G^l_{k−1,θ}(ǔ^i_{k−2,l}, u^i_{k−1,l}) / Σ_{j=1}^{N} G^l_{k−1,θ}(ǔ^j_{k−2,l}, u^j_{k−1,l}) ] δ_{u^i_{k−1,l}}(·).

If k = 1, for i ∈ {1, . . . , N}, set F^{l,N}_{0,θ}(ǔ^i_{0,l}) = Λ^l_{0,θ}(x*, ǔ^i_{0,l}).

3. For i ∈ {1, . . . , N}, sample u^i_{k,l} from M^l_θ(ǔ^i_{k−1,l}, ·). For i ∈ {1, . . . , N}, compute:

F^{l,N}_{k,θ}(u^i_{k,l}) = Σ_{j=1}^{N} g^l_{k,θ}(x̌^j_k) m^l_θ(x̌^j_k, x^i_{k+∆_l}) { F^{l,N}_{k−1,θ}(ǔ^j_{k−1,l}) + Λ^l_{k,θ}(ǔ^j_{k−1,l}, u^i_{k,l}) } / Σ_{j=1}^{N} g^l_{k,θ}(x̌^j_k) m^l_θ(x̌^j_k, x^i_{k+∆_l}).   (17)

The estimate of ∇_θ log(γ^l_{k+1,θ}(1)) is:

∇̂_θ log(γ^l_{k+1,θ}(1)) := Σ_{i=1}^{N} G^l_{k,θ}(ǔ^i_{k−1,l}, u^i_{k,l}) F^{l,N}_{k,θ}(u^i_{k,l}) / Σ_{i=1}^{N} G^l_{k,θ}(ǔ^i_{k−1,l}, u^i_{k,l}).   (18)

Set k = k + 1 and return to the start of step 2.

Typically, one could use a multilevel Monte Carlo method, as in [18], to reduce the cost to achieve an MSE of O(ε²). However, in this case, as the O(N²) cost dominates and does not depend on l, one can easily check that such a variance reduction method will not improve our particle method. To understand this, one can prove a bound on the MSE, for instance of the type conjectured later in this article (38), and then try to minimize the cost by selecting the appropriate number of samples on each level to obtain a given MSE. This latter problem leads to a constrained minimization problem that can be solved using Lagrange multipliers (as in e.g.
[7]), but one can show that this yields that the order of the cost to achieve an MSE of O(ε²) is still O(ε^{-4}).

Secondly, it is important to note that the method in [20] can reduce the cost of online score estimation to O(N∆_l^{-1}) per unit time. However, in order to do so, one requires that m^l_θ(x, x') is uniformly lower-bounded in (x, x'), which does not typically occur for Euler-discretized diffusion densities. As a result, we only use the approach shown in Algorithm 1. Note that [14], in a different but related context, considers using unbiased and non-negative estimates of the transition density in the approach in [20], but such estimates are not always available.

Thirdly, and rather importantly, there is a potential issue related to the construction of the algorithm. We have started with a continuous-time formula, discretized it and applied what are essentially discrete-time methods for smoothing of additive functionals. A serious caveat is that the algorithm is not well-defined as l → ∞, which is what we mean by saying it has no (Wiener) path-space formulation. The source of the issue is related to using approximations of the transition density m^l_θ(x̌^j_k, x^i_{k+∆_l}) in (17), which can degenerate when l is high. This will result in increasing Monte Carlo variance and computational cost, and may mean that C in (19) explodes exponentially in l. We refer the reader to [33, Figure 1.1] for a numerical example. This issue has manifested itself in MCMC schemes for inferring fully observed SDEs (see e.g. [21]), but in the context of particle smoothing and Algorithm 1 there are additional considerations. The resampling operation introduces discontinuities.
Often such terminology refers to the lack of continuity of the mapping:

θ → ∇̂_θ log(γ^l_{k,θ}(1)),

but here we are interested in the behavior of m^l_θ(x̌^j_k, x^i_{k+∆_l}) when we combine points x̌^j_k and x^i_{k+∆_l} that are intrinsically not obtained in a continuous way as ∆_l diminishes. This issue has not received attention in earlier numerical studies, but remains a concern. As a result, we now consider defining an algorithm that is robust to the size of the time-discretization mesh and hence has a path-space formulation.

We begin this section with a review of the method in [26, 30]. For simplicity we consider the case t ∈ [0, 1] and let X := {X_t}_{t∈[0,1]} and W := {W_t}_{t∈[0,1]}. Let p_θ(x, t; x', 1) denote the unknown transition density from time t to 1 associated to (2), and let also p_θ(x, x') := p_θ(x, 0; x', 1). Suppose one could sample from p_θ to obtain (x, x') ∈ R^{d_x} × R^{d_x}. Then one can interpolate these points by using a bridge process which has a drift given by b_θ(x) + a(x) ∇_x log p_θ(x, t; x', 1), as we will explain below. Let P_{θ,x,x'} denote the law of the solution of the SDE (2), on [0, 1], started at x* = x and conditioned to hit x' at time 1.

As p_θ is intractable in general, we consider a user-specified auxiliary process {X̃_t}_{t∈[0,1]} following:

dX̃_t = b̃_θ(t, X̃_t) dt + σ̃(t, X̃_t) dW_t,   t ∈ [0, 1],   X̃_0 = x,   (20)

where for each parameter value θ ∈ Θ, b̃_θ : [0, 1] × R^{d_x} → R^{d_x} and σ̃ : [0, 1] × R^{d_x} → R^{d_x × d_x} is such that ã(1, x') := σ̃(1, x') σ̃(1, x')* ≡ a(x'). Most importantly, (20) is chosen so that its transition density p̃_θ is available. One possible choice is to use an Ornstein-Uhlenbeck process (e.g. obtained using linearizations or variational inference with (2) [26, Section 1.3]); see also [26, Section 2.2] for technical conditions on b̃_θ, ã, p̃_θ.
The main purpose of {X̃_t}_{t∈[0,1]} is to sample x' and use its transition density to construct another process {X°_t}_{t∈[0,1]} conditioned to hit x' at t = 1. The latter will form an importance proposal for {X_t}_{t∈[0,1]}. Let:

dX°_t = b°_θ(t, X°_t; x') dt + σ(X°_t) dW_t,   t ∈ [0, 1],   X°_0 = x,   (21)

where:

b°_θ(t, x; x') = b_θ(x) + a(x) ∇_x log p̃_θ(x, t; x', 1),

and denote by P°_{θ,x,x'} the probability law of the solution of (21). The SDE in (21) gives rise to a function:

W → C_θ(x, W, x'),   (22)

mapping the driving Wiener noise W to the solution of (21), so we have effectively reparameterized the problem from X to (W, x').

Now, following [26], the two measures P_{θ,x,x'} and P°_{θ,x,x'} are absolutely continuous w.r.t. each other, with Radon-Nikodym derivative:

dP_{θ,x,x'}/dP°_{θ,x,x'}(X) = exp{ ∫_0^1 L_θ(t, X_t) dt } × p̃_θ(x, x') / p_θ(x, x'),   (23)

where:

L_θ(t, x) := ( b_θ(x) − b̃_θ(t, x) )* ∇_x log p̃_θ(x, t; x', 1) − (1/2) Tr{ [ a(x) − ã(t, x) ] [ −∇_x² log p̃_θ(x, t; x', 1) − ∇_x log p̃_θ(x, t; x', 1) ∇_x log p̃_θ(x, t; x', 1)* ] },

with Tr(·) denoting the trace of a square matrix. Note that, in the case when σ = σ(x) is not a constant function, then, typically, x' → p̃_θ(x, x') will not integrate to 1 and will give rise to a non-trivial distribution to sample from.
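A simulation of the guided proposal (21) can be sketched for the simplest auxiliary choice: a driftless auxiliary process with unit diffusion (b̃ = 0, σ̃ = 1), for which p̃_θ(x, t; x', 1) = N(x'; x, 1−t) is Gaussian and ∇_x log p̃_θ(x, t; x', 1) = (x'−x)/(1−t). The scalar drift b_θ below is a hypothetical choice for illustration; this is not one of the paper's models.

```python
import numpy as np

def guided_bridge(x, x_prime, theta, l=8, seed=1):
    """Euler scheme for the guided proposal (21) with an illustrative
    auxiliary process: b_tilde = 0, sigma_tilde = 1, so that
    grad_x log p_tilde(x, t; x', 1) = (x' - x) / (1 - t)."""
    rng = np.random.default_rng(seed)
    dl = 2.0 ** (-l)
    n = int(1.0 / dl)
    b = lambda z: -theta * z          # hypothetical drift b_theta
    sig = 1.0                         # sigma = 1, so a = 1 = a_tilde(1, x')
    path = np.empty(n + 1)
    path[0] = x
    for k in range(n - 1):            # stop one step early: guiding term blows up at t = 1
        t = k * dl
        drift = b(path[k]) + (x_prime - path[k]) / (1.0 - t)   # b° of (21)
        path[k + 1] = path[k] + drift * dl + sig * np.sqrt(dl) * rng.normal()
    path[n] = x_prime                 # the bridge is pinned at x' at time 1
    return path
```

The singular guiding term (x'−path)/(1−t) is what forces the path to hit x' at the terminal time; in practice the last step is pinned explicitly, as above.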
As the complete algorithm will require being able to sample from the transition density, we rewrite:

dP_{θ,x,x'}/dP̃_{θ,x,x'}(X) = exp{ ∫_0^1 L_θ(t, X_t) dt } × [ p̃_θ(x, x') / ( p_θ(x, x') p̂_θ(x, x') ) ] × p̂_θ(x, x'),   (24)

where an arbitrary, tractable and easy-to-sample density p̂_θ(x, x') is used to sample x'.

We return to the expression of the score function in (5) and use the alternative change of measure described above. Consider the processes:

X_k := {X_t}_{t∈[k,k+1]},   Y_k := {Y_t}_{t∈[k,k+1]},   k ∈ N_0.

We introduce the following notation:

Ψ_θ(X_k) = ∫_k^{k+1} L_θ(t, X_t) dt;
J_{k,θ}(X_k, Y_k) = ∫_k^{k+1} h_θ(X_t)* dY_t − (1/2) ∫_k^{k+1} h_θ(X_t)* h_θ(X_t) dt;
Λ_{k,θ}(X_k, Y_k) = ∫_k^{k+1} (∇_θ b_θ(X_t))* a(X_t)^{-1} ( dX_t − b_θ(X_t) dt ) + ∫_k^{k+1} (∇_θ h_θ(X_t))* dY_t − ∫_k^{k+1} (∇_θ h_θ(X_t))* h_θ(X_t) dt;
Φ_{k,θ}(X_k, Y_k) = J_{k,θ}(X_k, Y_k) + Ψ_θ(X_k) + log[ p̃_θ(x_k, x_{k+1}) / p̂_θ(x_k, x_{k+1}) ].

Note all the integrands can be computed point-wise.
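Since the integrands are computed point-wise, each functional above is approximated in practice by Riemann/Itô sums over a fine grid. A minimal sketch for J_{k,θ} over one unit interval follows; the observation function h defaults to a hypothetical bounded choice and is not a model from the paper.

```python
import numpy as np

def log_girsanov_weight(x_path, y_path, dl, h=np.tanh):
    """Discretization of J_{k,theta} = int h(X_t)* dY_t - (1/2) int h(X_t)* h(X_t) dt
    over one unit interval, from scalar paths sampled on a grid of mesh dl.
    The stochastic integral uses left-point (Ito) evaluation of h(X_t)."""
    hx = h(x_path[:-1])               # h(X_t) at the left grid points
    dy = np.diff(y_path)              # observation increments dY_t
    return float(np.sum(hx * dy) - 0.5 * np.sum(hx * hx) * dl)
```

Ψ_θ and Λ_{k,θ} are discretized in exactly the same left-point fashion, with dX_t and dt increments in place of dY_t where appropriate.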
Remark 4.1.
The Wiener process in (22) is defined on the time interval $[0,1]$, thus so is the transform $C_{\theta}(x, \mathsf{W}, x')$. In the derivations below, one needs to calculate $J_{k,\theta}(C_{\theta}(x_k, \mathsf{W}_k, x_{k+1}), Y_k)$ and $\Lambda_{k,\theta}(C_{\theta}(x_k, \mathsf{W}_k, x_{k+1}), Y_k)$, for $\mathsf{W}_k$'s that correspond to samples from the Wiener measure on $[0,1]$. With some abuse of notation, it is to be understood that the path $C_{\theta}(x_k, \mathsf{W}_k, x_{k+1})$ is 'shifted' from $[0,1]$ to $[k, k+1]$, so that all quantities below agree with the notation introduced above. Also, the calculation of $\Psi_{\theta}(C_{\theta}(x_k, \mathsf{W}_k, x_{k+1}))$ will be required, but this should create no confusion.

For $T \in \mathbb{N}$ the score function in (5) can be rewritten as:
$$
\nabla_{\theta}\log(\gamma_{T,\theta}(1)) \equiv \frac{\mathbb{E}_{\theta}\big[\big(\sum_{k=0}^{T-1}\Lambda_{k,\theta}(X_k, Y_k)\big)\exp\big\{\sum_{k=0}^{T-1} J_{k,\theta}(X_k, Y_k)\big\}\,\big|\,\mathscr{Y}_T\big]}{\mathbb{E}_{\theta}\big[\exp\big\{\sum_{k=0}^{T-1} J_{k,\theta}(X_k, Y_k)\big\}\,\big|\,\mathscr{Y}_T\big]}.
$$
Making use of the transform (22) and the density expression in (23), we can equivalently write:
$$
\nabla_{\theta}\log(\gamma_{T,\theta}(1)) = \frac{\widetilde{\mathbb{E}}_{\theta}\big[\big(\sum_{k=0}^{T-1}\Lambda_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}), Y_k)\big)\exp\big\{\sum_{k=0}^{T-1}\Phi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}), Y_k)\big\}\,\big|\,\mathscr{Y}_T\big]}{\widetilde{\mathbb{E}}_{\theta}\big[\exp\big\{\sum_{k=0}^{T-1}\Phi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}), Y_k)\big\}\,\big|\,\mathscr{Y}_T\big]}, \qquad (25)
$$
where the expectation $\widetilde{\mathbb{E}}_{\theta}[\,\cdot\,|\,\mathscr{Y}_T]$ is considered under the probability measure:
$$
\widetilde{\mathbb{P}}_{\theta}\big(d(\mathsf{W}_0, x_1, \ldots, \mathsf{W}_{T-1}, x_T)\big) := \bigotimes_{k=0}^{T-1}\big[\mathbb{W}(d\mathsf{W}_k)\otimes \hat{p}_{\theta}(x_k, x_{k+1})\,dx_{k+1}\big], \qquad (26)
$$
independently of $\mathscr{Y}_T$; here, $\mathbb{W}$ is the standard Wiener measure on $[0,1]$ and $x_0 = x^{*}$.

Remark 4.2.
The approach that has been adopted here can also be used if $\sigma$ depends upon $\theta$. Assuming the formula is well-defined, one would have a score function with an expression of the type:
$$
\frac{\widetilde{\mathbb{E}}_{\theta}\big[\big(\sum_{k=0}^{T-1}\Xi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}))\big)\exp\big\{\sum_{k=0}^{T-1}\Phi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}), Y_k)\big\}\,\big|\,\mathscr{Y}_T\big]}{\widetilde{\mathbb{E}}_{\theta}\big[\exp\big\{\sum_{k=0}^{T-1}\Phi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}), Y_k)\big\}\,\big|\,\mathscr{Y}_T\big]},
$$
where:
$$
\Xi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1})) = \nabla_{\theta}\Phi_{k,\theta}(C_{\theta}(x_k,\mathsf{W}_k,x_{k+1}), Y_k) + \nabla_{\theta}\log\hat{p}_{\theta}(x_k, x_{k+1}),
$$
and with an appropriate modification of the approach to allow $\sigma$ to depend on $\theta$. To keep consistency with the ideas in Section 3, we do not consider this formula from here on, but remark that the extension of the forthcoming methodology to this case is straightforward.

The expression in (25), together with the (trivially) Markovian dynamics for the process in (26), allows one to construct a backward Feynman-Kac type formula as in (14). To better connect the approach here with that in Section 3, define $u_k = (\mathsf{W}_k, x_{k+1})$ for $k \in \mathbb{N}_0$, $u_{-1} = x^{*}$ and:
$$
\Lambda^{C}_{k,\theta}(u_{k-1}, u_k) = \Lambda_{k,\theta}(C_{\theta}(x_k, \mathsf{W}_k, x_{k+1}), Y_k); \qquad
\Phi^{C}_{k,\theta}(u_{k-1}, u_k) = \Phi_{k,\theta}(C_{\theta}(x_k, \mathsf{W}_k, x_{k+1}), Y_k);
$$
$$
F_{T,\theta}(u_0, \ldots, u_{T-1}) = \sum_{k=0}^{T-1}\Lambda^{C}_{k,\theta}(u_{k-1}, u_k).
$$
The superscript $C$ is motivated by the well-posedness of the formula in continuous-time path-space. Set:
$$
\pi_{k,\theta}\big(d(u_0, \ldots, u_k)\big) := \frac{\big(\prod_{p=0}^{k}\exp\{\Phi^{C}_{p,\theta}(u_{p-1}, u_p)\}\big)\,\widetilde{\mathbb{P}}_{\theta}\big(d(u_0, \ldots, u_k)\big)}{\int_{\mathsf{E}^{k+1}}\big(\prod_{p=0}^{k}\exp\{\Phi^{C}_{p,\theta}(u_{p-1}, u_p)\}\big)\,\widetilde{\mathbb{P}}_{\theta}\big(d(u_0, \ldots, u_k)\big)},
$$
with $\mathsf{E} = C([0,1], \mathbb{R}^{d_x}) \times \mathbb{R}^{d_x}$. Then, we have the representation:
$$
\nabla_{\theta}\log(\gamma_{T,\theta}(1)) = \int_{\mathsf{E}^{T}} F_{T,\theta}(u_0, \ldots, u_{T-1})\,\mathbb{Q}_{T-1,\theta}\big(d(u_0, \ldots, u_{T-1})\big), \qquad (27)
$$
where:
$$
\mathbb{Q}_{T-1,\theta}\big(d(u_0, \ldots, u_{T-1})\big) = \pi_{T-1,\theta}(du_{T-1})\prod_{k=1}^{T-1} B_{k,\theta,\pi_{k-1,\theta}}(u_k, du_{k-1}),
$$
and:
$$
B_{k,\theta,\pi_{k-1,\theta}}(u_k, du_{k-1}) := \frac{\pi_{k-1,\theta}(du_{k-1})\,\exp\{\Phi^{C}_{k,\theta}(u_{k-1}, u_k)\}\,\hat{p}_{\theta}(x_k, x_{k+1})}{\pi_{k-1,\theta}\big(\exp\{\Phi^{C}_{k,\theta}(\cdot, u_k)\}\,\hat{p}_{\theta}(\cdot, x_{k+1})\big)}.
$$
We remark that formula (27) is a type of backward Feynman-Kac formula in continuous-time, which to our knowledge is new. As in the case of Algorithm 1, an effective Monte Carlo approximation of such a smoothing expectation (27) is given in Algorithm 2. The estimates of the score function are given in equations (28)-(29) in Algorithm 2.
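The core of the particle approximation of (27) is the online update of the smoothed additive functional, in the style of the $O(N)$-per-particle recursion of [11]. The following is an illustrative sketch of one generic step of such a recursion (the names are our own assumptions: `W[j, i]` plays the role of the unnormalized backward weight $\exp\{\Phi^C_{k,\theta}(\check u^j_{k-1}, u^i_k)\}\,\hat p_\theta(\check x^j_k, x^i_{k+1})$ and `Lam[j, i]` of the increment $\Lambda^C_{k,\theta}(\check u^j_{k-1}, u^i_k)$):

```python
import numpy as np

def online_functional_update(F_prev, W, Lam):
    """One step of the online smoothing recursion:
        F_k(u^i) = sum_j W[j, i] * (F_{k-1}(u^j) + Lam[j, i]) / sum_j W[j, i].
    F_prev has shape (N,); W and Lam have shape (N, N). Cost is O(N^2)
    per time step, i.e. O(N) per particle."""
    W = np.asarray(W, dtype=float)
    F_prev = np.asarray(F_prev, dtype=float)
    Lam = np.asarray(Lam, dtype=float)
    num = (W * (F_prev[:, None] + Lam)).sum(axis=0)
    return num / W.sum(axis=0)

F1 = online_functional_update([0.0, 0.0], np.ones((2, 2)),
                              [[1.0, 2.0], [3.0, 4.0]])   # -> [2.0, 3.0]
```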
Algorithm 2
Online Score Function Estimation on Path-Space

1. For $i \in \{1, \ldots, N\}$, sample $u^i_0$ i.i.d. from $\mathbb{W}(\cdot)\otimes\hat{p}_{\theta}(x^{*}, \cdot)$. The estimate of $\nabla_{\theta}\log(\gamma_{1,\theta}(1))$ is:
$$
\widehat{\nabla_{\theta}\log(\gamma_{1,\theta}(1))} := \frac{\sum_{i=1}^{N}\exp\{\Phi^{C}_{0,\theta}(x^{*}, u^i_0)\}\,\Lambda^{C}_{0,\theta}(x^{*}, u^i_0)}{\sum_{i=1}^{N}\exp\{\Phi^{C}_{0,\theta}(x^{*}, u^i_0)\}}. \qquad (28)
$$
Set $k = 1$ and, for $i \in \{1, \ldots, N\}$, $\check{u}^i_{-1} = x^{*}$.

2. For $i \in \{1, \ldots, N\}$, sample $\check{u}^i_{k-1}$ from:
$$
\sum_{i=1}^{N}\frac{\exp\{\Phi^{C}_{k-1,\theta}(\check{u}^i_{k-2}, u^i_{k-1})\}}{\sum_{j=1}^{N}\exp\{\Phi^{C}_{k-1,\theta}(\check{u}^j_{k-2}, u^j_{k-1})\}}\,\delta_{\{u^i_{k-1}\}}(\cdot).
$$
If $k = 1$, for $i \in \{1, \ldots, N\}$, set $\widetilde{F}^{N}_{k-1,\theta}(\check{u}^i_0) = \Lambda^{C}_{0,\theta}(x^{*}, \check{u}^i_0)$.

3. For $i \in \{1, \ldots, N\}$, sample $u^i_k$ from $\mathbb{W}(\cdot)\otimes\hat{p}_{\theta}(\check{x}^i_k, \cdot)$. For $i \in \{1, \ldots, N\}$, compute:
$$
\widetilde{F}^{N}_{k,\theta}(u^i_k) = \frac{\sum_{j=1}^{N}\exp\{\Phi^{C}_{k,\theta}(\check{u}^j_{k-1}, u^i_k)\}\,\hat{p}_{\theta}(\check{x}^j_k, x^i_{k+1})\,\big\{\widetilde{F}^{N}_{k-1,\theta}(\check{u}^j_{k-1}) + \Lambda^{C}_{k,\theta}(\check{u}^j_{k-1}, u^i_k)\big\}}{\sum_{j=1}^{N}\exp\{\Phi^{C}_{k,\theta}(\check{u}^j_{k-1}, u^i_k)\}\,\hat{p}_{\theta}(\check{x}^j_k, x^i_{k+1})}.
$$
The estimate of $\nabla_{\theta}\log(\gamma_{k+1,\theta}(1))$ is:
$$
\widehat{\nabla_{\theta}\log(\gamma_{k+1,\theta}(1))} := \frac{\sum_{i=1}^{N}\exp\{\Phi^{C}_{k,\theta}(\check{u}^i_{k-1}, u^i_k)\}\,\widetilde{F}^{N}_{k,\theta}(u^i_k)}{\sum_{i=1}^{N}\exp\{\Phi^{C}_{k,\theta}(\check{u}^i_{k-1}, u^i_k)\}}. \qquad (29)
$$
Set $k = k + 1$ and return to the start of 2.

4.3 Time-Discretization

Whilst conceptually important, the path-space valued Algorithm 2 can seldom be implemented directly in practice; unbiased methods, e.g. [4], may be possible, but would be cumbersome. We develop a time-discretization procedure, in a similar manner to that considered in Section 3.2. We will discretize on the uniform grid with increment $\Delta_l = 2^{-l}$. Let $k \in \mathbb{N}_0$ and define:
$$
u_{k,l} = (z_{k+\Delta_l}, z_{k+2\Delta_l}, \ldots, z_{k+1-\Delta_l}, x_{k+1}) \in (\mathbb{R}^{d_x})^{\Delta_l^{-1}} = \mathsf{E}_l,
$$
where $z_{k+\Delta_l}, z_{k+2\Delta_l}, \ldots, z_{k+1-\Delta_l}$ will represent increments of Brownian motion and $u_{-1,l} = x^{*}$. Define the Markov kernel on $\mathsf{E}_l$, for $k \in \mathbb{N}$:
$$
\widetilde{M}^{l}_{\theta}(u_{k-1,l}, du_{k,l}) = \Big(\prod_{s=1}^{\Delta_l^{-1}-1}\varphi_l(z_{k+s\Delta_l})\,dz_{k+s\Delta_l}\Big)\,\hat{p}_{\theta}(x_k, x_{k+1})\,dx_{k+1},
$$
where $\varphi_l(z_{k+s\Delta_l})$ is the density associated to the $\mathcal{N}_{d_x}(0, \Delta_l I_{d_x})$ distribution. We denote the density of $\widetilde{M}^{l}_{\theta}$ as $\widetilde{Q}^{l}_{\theta}$. Set:
$$
\widetilde{\eta}^{l}_{0,\theta}(du_{0,l}) = \Big(\prod_{s=1}^{\Delta_l^{-1}-1}\varphi_l(z_{s\Delta_l})\,dz_{s\Delta_l}\Big)\,\hat{p}_{\theta}(x^{*}, x_1)\,dx_1.
$$
Now set, for $(k, s) \in \{0, 1, \ldots, T-1\} \times \{0, 1, \ldots, \Delta_l^{-1}-1\}$:
$$
X_{(s+1)\Delta_l + k} = X_{s\Delta_l + k} + b^{\circ}_{\theta}(s\Delta_l, X_{s\Delta_l + k}; x_{k+1})\,\Delta_l + \sigma(X_{s\Delta_l + k})\,Z_{(s+1)\Delta_l + k}. \qquad (30)
$$
Define, for $k \in \mathbb{N}_0$:
$$
\Psi^{l}_{\theta}(u_{k-1,l}, u_{k,l}) = \sum_{p=0}^{\Delta_l^{-1}-1} L_{\theta}(p\Delta_l, x_{k+p\Delta_l})\,\Delta_l;
$$
$$
J^{l}_{k,\theta}(u_{k-1,l}, u_{k,l}) = \sum_{p=0}^{\Delta_l^{-1}-1} h_{\theta}(x_{k+p\Delta_l})^{*}(Y_{k+(p+1)\Delta_l} - Y_{k+p\Delta_l}) - \frac{1}{2}\sum_{p=0}^{\Delta_l^{-1}-1} h_{\theta}(x_{k+p\Delta_l})^{*}h_{\theta}(x_{k+p\Delta_l})\,\Delta_l;
$$
$$
\Phi^{l}_{k,\theta}(u_{k-1,l}, u_{k,l}) = J^{l}_{k,\theta}(u_{k-1,l}, u_{k,l}) + \Psi^{l}_{\theta}(u_{k-1,l}, u_{k,l}) + \log\frac{\widetilde{p}_{\theta}(x_k, x_{k+1})}{\hat{p}_{\theta}(x_k, x_{k+1})};
\qquad
\widetilde{G}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l}) = \exp\{\Phi^{l}_{k,\theta}(u_{k-1,l}, u_{k,l})\}.
$$
Now, for $k \in \mathbb{N}_0$:
$$
\widetilde{\Lambda}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l}) = \sum_{p=0}^{\Delta_l^{-1}-1}(\nabla_{\theta}b_{\theta}(x_{k+p\Delta_l}))^{*}a(x_{k+p\Delta_l})^{-1}\big(x_{k+(p+1)\Delta_l} - x_{k+p\Delta_l} - b_{\theta}(x_{k+p\Delta_l})\,\Delta_l\big)
$$
$$
+ \sum_{p=0}^{\Delta_l^{-1}-1}(\nabla_{\theta}h_{\theta}(x_{k+p\Delta_l}))^{*}(Y_{k+(p+1)\Delta_l} - Y_{k+p\Delta_l}) - \sum_{p=0}^{\Delta_l^{-1}-1}(\nabla_{\theta}h_{\theta}(x_{k+p\Delta_l}))^{*}h_{\theta}(x_{k+p\Delta_l})\,\Delta_l.
$$
Set:
$$
\widetilde{F}^{l}_{T,\theta}(u_{0,l}, \ldots, u_{T-1,l}) = \sum_{k=0}^{T-1}\widetilde{\Lambda}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l}).
$$
Writing expectations w.r.t. $\widetilde{\eta}^{l}_{0,\theta}(du_{0,l})\prod_{k=1}^{T-1}\widetilde{M}^{l}_{\theta}(u_{k-1,l}, du_{k,l})$ as $\widetilde{\mathbb{E}}^{l}_{\theta}[\,\cdot\,|\,\mathscr{Y}_T]$, our discretized approximation of $\nabla_{\theta}\log(\gamma_{T,\theta}(1))$ is:
$$
\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{T,\theta}(1)) := \frac{\widetilde{\mathbb{E}}^{l}_{\theta}\big[\widetilde{F}^{l}_{T,\theta}(U_{0,l}, \ldots, U_{T-1,l})\prod_{k=0}^{T-1}\widetilde{G}^{l}_{k,\theta}(U_{k-1,l}, U_{k,l})\,\big|\,\mathscr{Y}_T\big]}{\widetilde{\mathbb{E}}^{l}_{\theta}\big[\prod_{k=0}^{T-1}\widetilde{G}^{l}_{k,\theta}(U_{k-1,l}, U_{k,l})\,\big|\,\mathscr{Y}_T\big]}.
$$
We note that, whilst the terms $\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{T,\theta}(1))$ and $\nabla_{\theta}\log(\gamma^{l}_{T,\theta}(1))$ should both converge to $\nabla_{\theta}\log(\gamma_{T,\theta}(1))$ as $l \rightarrow \infty$, they will in general be different for any fixed $l$. One can also easily develop a discretized time-reversal formula such as (14) which will converge precisely to (27) as $l \rightarrow \infty$. Define, for $k \in \mathbb{N}_0$:
$$
\widetilde{\pi}^{l}_{k,\theta}\big(d(u_{0,l}, \ldots, u_{k,l})\big) := \frac{\big(\prod_{p=0}^{k}\widetilde{G}^{l}_{p,\theta}(u_{p-1,l}, u_{p,l})\big)\,\widetilde{\eta}^{l}_{0,\theta}(du_{0,l})\prod_{p=1}^{k}\widetilde{M}^{l}_{\theta}(u_{p-1,l}, du_{p,l})}{\int_{\mathsf{E}^{k+1}_l}\big(\prod_{p=0}^{k}\widetilde{G}^{l}_{p,\theta}(u_{p-1,l}, u_{p,l})\big)\,\widetilde{\eta}^{l}_{0,\theta}(du_{0,l})\prod_{p=1}^{k}\widetilde{M}^{l}_{\theta}(u_{p-1,l}, du_{p,l})}.
$$
Then, we have that:
$$
\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{T,\theta}(1)) = \int_{\mathsf{E}^{T}_l}\widetilde{F}^{l}_{T,\theta}(u_{0,l}, \ldots, u_{T-1,l})\,\widetilde{\mathbb{Q}}^{l}_{T-1,\theta}\big(d(u_{0,l}, \ldots, u_{T-1,l})\big), \qquad (31)
$$
where we have defined:
$$
\widetilde{\mathbb{Q}}^{l}_{T-1,\theta}\big(d(u_{0,l}, \ldots, u_{T-1,l})\big) := \widetilde{\pi}^{l}_{T-1,\theta}(du_{T-1,l})\prod_{k=1}^{T-1}\widetilde{B}^{l}_{k,\theta,\widetilde{\pi}^{l}_{k-1,\theta}}(u_{k,l}, du_{k-1,l})
$$
and:
$$
\widetilde{B}^{l}_{k,\theta,\widetilde{\pi}^{l}_{k-1,\theta}}(u_{k,l}, du_{k-1,l}) := \frac{\widetilde{\pi}^{l}_{k-1,\theta}(du_{k-1,l})\,\widetilde{G}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l})\,\widetilde{Q}^{l}_{\theta}(u_{k-1,l}, u_{k,l})}{\widetilde{\pi}^{l}_{k-1,\theta}\big(\widetilde{G}^{l}_{k,\theta}(\cdot, u_{k,l})\,\widetilde{Q}^{l}_{\theta}(\cdot, u_{k,l})\big)}.
$$
We remark that, due to the structure of the model:
$$
\widetilde{B}^{l}_{k,\theta,\widetilde{\pi}^{l}_{k-1,\theta}}(u_{k,l}, du_{k-1,l}) = \frac{\widetilde{\pi}^{l}_{k-1,\theta}(du_{k-1,l})\,\widetilde{G}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l})\,\hat{p}_{\theta}(x_k, x_{k+1})}{\int_{\mathsf{E}_l}\widetilde{\pi}^{l}_{k-1,\theta}(du_{k-1,l})\,\widetilde{G}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l})\,\hat{p}_{\theta}(x_k, x_{k+1})}.
$$

Remark 4.3.
It is important to note that – in contrast to Remark 3.2 – there is no cancellation of the terms $\widetilde{G}^{l}_{k,\theta}(u_{k-1,l}, u_{k,l})$ in the numerator and denominator of this backward kernel. This is precisely due to recursion (30), which leads to a path-dependence of the future co-ordinates of the discretized bridge on the terminal position $x_{k+1}$.

Our online particle approximation of the gradient of the log-likelihood in (31), for a given $l \in \mathbb{N}$, is presented in Algorithm 3; our estimates are given in (33) and (35) there. Algorithm 3 is simply the time-discretization of the procedure presented in Algorithm 2. A number of remarks are again of interest. Firstly, the cost of the algorithm per unit time is now $O(N^2\Delta_l^{-1})$. The increase in computational cost over Algorithm 1 stems from the fact that, when computing $\widetilde{\Lambda}^{l}_{k,\theta}(\check{u}^j_{k-1,l}, u^i_{k,l})$ in (34), one must solve the recursion (30) for each $(i, j) \in \{1, \ldots, N\}^2$, which has cost $O(\Delta_l^{-1})$, and it is this cost that dominates. Secondly, following the discussion in Section 3.3, one also expects that, under appropriate assumptions, the MSE satisfies, for $(k, N) \in \mathbb{N}^2$:
$$
\mathbb{E}_{\theta}\Big[\big\|\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{k,\theta}(1))} - \nabla_{\theta}\log(\gamma_{k,\theta}(1))\big\|^2\Big] \leq C\Big(\frac{1}{N} + \Delta_l\Big), \qquad (32)
$$
for a constant $C$ that does not depend on $N$, $l$. To achieve an MSE of $O(\epsilon^2)$ for some given $\epsilon > 0$, one sets $l = O(|\log(\epsilon)|)$ and $N = O(\epsilon^{-2})$. The cost per unit time of doing this is then $O(\epsilon^{-6})$. This is significantly worse than the approach in Algorithm 1, but we remark that it is unclear whether the constant $C$ in the bound (19) is independent of $l$; indeed, we have conjectured that it may actually explode exponentially in $l$. We expect, however, that $C$ in (32) is independent of $l$, precisely due to the path-space development we have adopted.
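For intuition only, the parameter choices implied by the conjectured bound (32) and the $O(N^2\Delta_l^{-1})$ per-unit-time cost can be sketched as follows (constants are suppressed and the function name is an illustrative assumption):

```python
import math

def single_level_settings(eps):
    """Given MSE <= C(1/N + Delta_l) as in (32), choose l so that
    Delta_l = 2^-l <= eps^2 and N = O(eps^-2); the resulting cost per
    unit time, N^2 * Delta_l^{-1}, is then O(eps^-6)."""
    l = math.ceil(2 * math.log2(1.0 / eps))   # smallest l with 2^-l <= eps^2
    N = math.ceil(eps ** -2)
    cost = N ** 2 * 2 ** l                    # N^2 * Delta_l^{-1}
    return l, N, cost

l, N, cost = single_level_settings(0.1)       # l = 7, N = 100
```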
We remark, however, that one can use an MLMC method to reduce this cost; this algorithm is presented in the next section.

Algorithm 3

Modified Online Score Function Estimation for a given $l \in \mathbb{N}$.

1. For $i \in \{1, \ldots, N\}$, sample $u^i_{0,l}$ i.i.d. from $\widetilde{\eta}^{l}_{0,\theta}(\cdot)$. The estimate of $\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{1,\theta}(1))$ is:
$$
\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{1,\theta}(1))} := \frac{\sum_{i=1}^{N}\widetilde{G}^{l}_{0,\theta}(x^{*}, u^i_{0,l})\,\widetilde{\Lambda}^{l}_{0,\theta}(x^{*}, u^i_{0,l})}{\sum_{i=1}^{N}\widetilde{G}^{l}_{0,\theta}(x^{*}, u^i_{0,l})}. \qquad (33)
$$
Set $k = 1$ and, for $i \in \{1, \ldots, N\}$, $\check{u}^i_{-1,l} = x^{*}$.

2. For $i \in \{1, \ldots, N\}$, sample $\check{u}^i_{k-1,l}$ from:
$$
\sum_{i=1}^{N}\frac{\widetilde{G}^{l}_{k-1,\theta}(\check{u}^i_{k-2,l}, u^i_{k-1,l})}{\sum_{j=1}^{N}\widetilde{G}^{l}_{k-1,\theta}(\check{u}^j_{k-2,l}, u^j_{k-1,l})}\,\delta_{\{u^i_{k-1,l}\}}(\cdot).
$$
If $k = 1$, for $i \in \{1, \ldots, N\}$, set $\widetilde{F}^{l,N}_{k-1,\theta}(\check{u}^i_{0,l}) = \widetilde{\Lambda}^{l}_{0,\theta}(x^{*}, \check{u}^i_{0,l})$.

3. For $i \in \{1, \ldots, N\}$, sample $u^i_{k,l}$ from $\widetilde{M}^{l}_{\theta}(\check{u}^i_{k-1,l}, \cdot)$. For $i \in \{1, \ldots, N\}$, compute:
$$
\widetilde{F}^{l,N}_{k,\theta}(u^i_{k,l}) = \frac{\sum_{j=1}^{N}\widetilde{G}^{l}_{k,\theta}(\check{u}^j_{k-1,l}, u^i_{k,l})\,\hat{p}_{\theta}(\check{x}^j_k, x^i_{k+1})\,\big\{\widetilde{F}^{l,N}_{k-1,\theta}(\check{u}^j_{k-1,l}) + \widetilde{\Lambda}^{l}_{k,\theta}(\check{u}^j_{k-1,l}, u^i_{k,l})\big\}}{\sum_{j=1}^{N}\widetilde{G}^{l}_{k,\theta}(\check{u}^j_{k-1,l}, u^i_{k,l})\,\hat{p}_{\theta}(\check{x}^j_k, x^i_{k+1})}. \qquad (34)
$$
The estimate of $\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{k+1,\theta}(1))$ is:
$$
\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{k+1,\theta}(1))} = \frac{\sum_{i=1}^{N}\widetilde{G}^{l}_{k,\theta}(\check{u}^i_{k-1,l}, u^i_{k,l})\,\widetilde{F}^{l,N}_{k,\theta}(u^i_{k,l})}{\sum_{i=1}^{N}\widetilde{G}^{l}_{k,\theta}(\check{u}^i_{k-1,l}, u^i_{k,l})}. \qquad (35)
$$
Set $k = k + 1$ and return to the start of 2.

4.5 Multilevel Particle Filter

We now present a new multilevel particle filter along with online estimation of the score function. We fix $l \in \mathbb{N}$ for now and, for $(k, s) \in \mathbb{N}_0 \times \{l, l-1\}$, define:
$$
u_{k,s} = (z_{k+\Delta_s,s}, z_{k+2\Delta_s,s}, \ldots, z_{k+1-\Delta_s,s}, x_{k+1,s}) \in (\mathbb{R}^{d_x})^{\Delta_s^{-1}} = \mathsf{E}_s.
$$
We give the approach in Algorithm 4. Before explaining how one can use Algorithm 4 to provide online estimates of the score function, several remarks are required. The first is related to the couplings mentioned in Algorithm 4, point 2 and point 3, bullet 1. The coupling in point 2 requires a way to resample the indices of the particles so that they have the correct marginals. This topic has been investigated considerably in the literature, see e.g. [17, 27], and techniques that have been adopted include sampling the maximal coupling, e.g. [15], or using the $L^2$-Wasserstein optimal coupling [2]; in general, the latter is found to be better in terms of variance reduction, but can only be implemented when $d_x = 1$. We rely upon the maximal coupling in this paper, which has a cost of $O(N)$ per unit time. For point 3, bullet 1, one again has a considerable degree of flexibility. In this article we sample the maximal coupling, which can be achieved at a cost that is at most $O(N)$ per unit time using the algorithm of [29]. The second main remark of interest is that the basic filter that is sampled in Algorithm 4 is an entirely new coupled particle filter for diffusions (i.e. different to [15, 18]). The utility of the approach relative to [18] is of great interest in the context of filtering.

Set $(l^{*}, L) \in \mathbb{N}^2$ with $l^{*} < L$. The idea is to run Algorithm 4, independently, for $l \in \{l^{*}, \ldots, L\}$, each with $N_l$ particles and, independently, Algorithm 3 for $l = l^{*} - 1$ with $N_{l^{*}-1}$ particles. We then consider the estimate, for $k \in \mathbb{N}$:
$$
\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{L}_{k,\theta}(1))}^{\,ML} := \sum_{l=l^{*}}^{L}\Big\{\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{k,\theta}(1))} - \widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l-1}_{k,\theta}(1))}\Big\} + \widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l^{*}-1}_{k,\theta}(1))},
$$
where the summands on the right-hand side are defined in (36) and (37) and the last term on the right-hand side is as in either (33) or (35) (depending on $k$).
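Concretely, the multilevel estimate combines one coarse single-level estimate with independently computed coupled differences; a minimal sketch (the names are our own assumptions: `diffs[l]` stands for the coupled estimate of the difference between levels $l$ and $l-1$ from Algorithm 4, `base` for the level $l^{*}-1$ estimate from Algorithm 3, and `level_particles` for an allocation of the form $N_l \propto \epsilon^{-2}\Delta_l^{1/2+\rho}$):

```python
import math

def multilevel_combine(base, diffs):
    """Telescoping multilevel sum: base + sum over l of (level-l minus
    level-(l-1)) coupled difference estimates."""
    return base + sum(diffs[l] for l in sorted(diffs))

def level_particles(eps, l, rho=0.25):
    """Illustrative allocation N_l = ceil(eps^-2 * Delta_l^(1/2 + rho)),
    with Delta_l = 2^-l; finer levels receive fewer particles."""
    return math.ceil(eps ** -2 * (2.0 ** -l) ** (0.5 + rho))

est = multilevel_combine(1.0, {7: 0.5, 8: 0.25})   # 1.75
```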
Now, one expects that, under appropriate assumptions, based upon our previous remarks and the work in [18], one can prove the following result for $(k, N_{l^{*}-1:L}) \in \mathbb{N}^{L-l^{*}+2}$:
$$
\mathbb{E}_{\theta}\Big[\big\|\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{L}_{k,\theta}(1))}^{\,ML} - \nabla_{\theta}\log(\gamma_{k,\theta}(1))\big\|^2\Big] \leq C\Big(\sum_{l=l^{*}-1}^{L}\frac{\Delta_l^{\beta}}{N_l} + \Delta_L\Big), \qquad (38)
$$
for a constant $C$ that does not depend on $N_{l^{*}-1:L}$, $L$; also, $\beta = 1$ if $\sigma$ is a constant function, else $\beta = 1/2$. Choose: i) $L$ so that $\Delta_L = O(\epsilon^2)$, for $\epsilon > 0$ given; ii) if $\beta = 1$, $N_l = O(\epsilon^{-2}\Delta_l^{1/2+\rho})$ for some $0 < \rho < 1/2$. These selections yield an MSE of $O(\epsilon^2)$ for a cost of $O(\epsilon^{-4})$. If $\beta = 1/2$, one can set $N_l = O(\epsilon^{-2}\Delta_l^{1/2+\rho}\Delta_L^{-\rho})$ for some $\rho > 0$. This will yield an MSE of $O(\epsilon^2)$ for a cost of $O(\epsilon^{-(4+4\rho)})$. Such results are at least as good as the method in Section 3.3, assuming that the latter approach does not collapse with $l$. We remark that it is possible to produce an almost-surely unbiased estimator of the score function, when $\theta$ is the true parameter, using a combination of the multilevel method that has been developed here and the approach in [16]. This is left for future work.

Algorithm 4

Coupled Online Score Function Estimation for a given $l \in \mathbb{N}$.

1. • For $i \in \{1, \ldots, N\}$, sample $u^i_{0,l}$ i.i.d. from $\widetilde{\eta}^{l}_{0,\theta}(\cdot)$.
   • For $(i, p) \in \{1, \ldots, N\} \times \{1, \ldots, \Delta_{l-1}^{-1}-1\}$, set $z^i_{p\Delta_{l-1},l-1} = z^i_{p\Delta_{l-1},l} + z^i_{p\Delta_{l-1}-\Delta_l,l}$ and $x^i_{1,l-1} = x^i_{1,l}$.
The estimate of $\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{1,\theta}(1)) - \nabla_{\theta}\log(\widetilde{\gamma}^{l-1}_{1,\theta}(1))$ is:
$$
\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{1,\theta}(1))} - \widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l-1}_{1,\theta}(1))} := \frac{\sum_{i=1}^{N}\widetilde{G}^{l}_{0,\theta}(x^{*}, u^i_{0,l})\,\widetilde{\Lambda}^{l}_{0,\theta}(x^{*}, u^i_{0,l})}{\sum_{i=1}^{N}\widetilde{G}^{l}_{0,\theta}(x^{*}, u^i_{0,l})} - \frac{\sum_{i=1}^{N}\widetilde{G}^{l-1}_{0,\theta}(x^{*}, u^i_{0,l-1})\,\widetilde{\Lambda}^{l-1}_{0,\theta}(x^{*}, u^i_{0,l-1})}{\sum_{i=1}^{N}\widetilde{G}^{l-1}_{0,\theta}(x^{*}, u^i_{0,l-1})}. \qquad (36)
$$
Set $k = 1$ and, for $i \in \{1, \ldots, N\}$, $\check{u}^i_{-1,l} = \check{u}^i_{-1,l-1} = x^{*}$.

2. For $i \in \{1, \ldots, N\}$, sample $(\alpha_l(i), \alpha_{l-1}(i)) \in \{1, \ldots, N\}^2$ from a coupling of:
$$
\sum_{i=1}^{N}\frac{\widetilde{G}^{l}_{k-1,\theta}(\check{u}^i_{k-2,l}, u^i_{k-1,l})}{\sum_{j=1}^{N}\widetilde{G}^{l}_{k-1,\theta}(\check{u}^j_{k-2,l}, u^j_{k-1,l})}\,\delta_{\{i\}}(\cdot)
\quad\text{and}\quad
\sum_{i=1}^{N}\frac{\widetilde{G}^{l-1}_{k-1,\theta}(\check{u}^i_{k-2,l-1}, u^i_{k-1,l-1})}{\sum_{j=1}^{N}\widetilde{G}^{l-1}_{k-1,\theta}(\check{u}^j_{k-2,l-1}, u^j_{k-1,l-1})}\,\delta_{\{i\}}(\cdot),
$$
and set $(\check{u}^i_{k-1,l}, \check{u}^i_{k-1,l-1}) = (u^{\alpha_l(i)}_{k-1,l}, u^{\alpha_{l-1}(i)}_{k-1,l-1})$. If $k = 1$, for $i \in \{1, \ldots, N\}$ and $s \in \{l, l-1\}$, set $\widetilde{F}^{s,N}_{k-1,\theta}(\check{u}^i_{0,s}) = \widetilde{\Lambda}^{s}_{0,\theta}(x^{*}, \check{u}^i_{0,s})$.

3. • For $i \in \{1, \ldots, N\}$, sample $(x^i_{k+1,l}, x^i_{k+1,l-1})$ from a coupling of $\hat{p}_{\theta}(\check{x}^i_{k,l}, \cdot)$ and $\hat{p}_{\theta}(\check{x}^i_{k,l-1}, \cdot)$.
   • For $i \in \{1, \ldots, N\}$, sample $z^i_{k+\Delta_l,l}, \ldots, z^i_{k+1-\Delta_l,l}$ i.i.d. from $\prod_{s=1}^{\Delta_l^{-1}-1}\varphi_l(\cdot)$.
   • For $(i, p) \in \{1, \ldots, N\} \times \{1, \ldots, \Delta_{l-1}^{-1}-1\}$, set $z^i_{k+p\Delta_{l-1},l-1} = z^i_{k+p\Delta_{l-1},l} + z^i_{k+p\Delta_{l-1}-\Delta_l,l}$.
   • For $(i, s) \in \{1, \ldots, N\} \times \{l, l-1\}$, compute:
$$
\widetilde{F}^{s,N}_{k,\theta}(u^i_{k,s}) = \frac{\sum_{j=1}^{N}\widetilde{G}^{s}_{k,\theta}(\check{u}^j_{k-1,s}, u^i_{k,s})\,\hat{p}_{\theta}(\check{x}^j_{k,s}, x^i_{k+1,s})\,\big\{\widetilde{F}^{s,N}_{k-1,\theta}(\check{u}^j_{k-1,s}) + \widetilde{\Lambda}^{s}_{k,\theta}(\check{u}^j_{k-1,s}, u^i_{k,s})\big\}}{\sum_{j=1}^{N}\widetilde{G}^{s}_{k,\theta}(\check{u}^j_{k-1,s}, u^i_{k,s})\,\hat{p}_{\theta}(\check{x}^j_{k,s}, x^i_{k+1,s})}.
$$
The estimate of $\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{k+1,\theta}(1)) - \nabla_{\theta}\log(\widetilde{\gamma}^{l-1}_{k+1,\theta}(1))$ is:
$$
\widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l}_{k+1,\theta}(1))} - \widehat{\nabla_{\theta}\log(\widetilde{\gamma}^{l-1}_{k+1,\theta}(1))} := \frac{\sum_{i=1}^{N}\widetilde{G}^{l}_{k,\theta}(\check{u}^i_{k-1,l}, u^i_{k,l})\,\widetilde{F}^{l,N}_{k,\theta}(u^i_{k,l})}{\sum_{i=1}^{N}\widetilde{G}^{l}_{k,\theta}(\check{u}^i_{k-1,l}, u^i_{k,l})} - \frac{\sum_{i=1}^{N}\widetilde{G}^{l-1}_{k,\theta}(\check{u}^i_{k-1,l-1}, u^i_{k,l-1})\,\widetilde{F}^{l-1,N}_{k,\theta}(u^i_{k,l-1})}{\sum_{i=1}^{N}\widetilde{G}^{l-1}_{k,\theta}(\check{u}^i_{k-1,l-1}, u^i_{k,l-1})}. \qquad (37)
$$
Set $k = k + 1$ and return to the start of 2.

5 Numerical Results
In this section, we consider three models to investigate the various properties of our algorithms. The score function is estimated using both Algorithms 1 and 3 for a fixed $\theta$. We will show, as expected, that they are equivalent for a large number of particles $N$ and a high level of discretization $l$. We then compare the cost of Algorithm 3 and its multilevel version, Algorithm 4. As an application of our methods, we use Algorithms 1 and 4 for parameter estimation via stochastic gradient. We remark that we do not use Algorithm 3 for parameter estimation because it is 'slow' compared to the other algorithms, as illustrated in Figure 1. In Figure 1 we consider Model 1, as described in the next section, with $T = 40$, $l = 10$, $\kappa = 2$ and fixed values of $\theta$ and $x^{*}$. Figure 1 provides a comparison between the cost per simulation of Algorithms 1 and 3 versus the number of particles $N$. As predicted by our theoretical conjectures, we see that the cost of Algorithm 1 is significantly lower than that of Algorithm 3.
Figure 1: Comparison between the cost per simulation of Algorithms 1 and 3 versus the number of particles $N$. We run both algorithms on Model 1.

In the following, the parameters $(\kappa, \sigma)$ are fixed.

Model 1:
Let $d_x = d_y = 1$, $d_\theta = 2$ and consider the following linear SDEs:
$$
dX_t = \theta_1 X_t\,dt + \sigma\,dW_t; \qquad dY_t = \theta_2(\kappa - X_t)\,dt + dB_t.
$$

Model 2:
Let $d_x = d_y = 1$, $d_\theta = 3$ and consider a nonlinear diffusion process along with a linear diffusion process of observations:
$$
dX_t = \big(\theta_1 X_t + \theta_2 X_t\big)\,dt + \sigma\,dW_t; \qquad dY_t = \theta_3(\kappa - X_t)\,dt + dB_t.
$$

Model 3:
Let $d_x = d_y = 1$, $d_\theta = 3$ and consider a nonlinear signal along with a nonlinear diffusion process of observations. The first SDE is a Cox-Ingersoll-Ross process after a 1-1 transform. Thus:
$$
dX_t = \frac{1}{2}\Big(\frac{4\theta_1\theta_2 - \sigma^2}{X_t} - \theta_1 X_t\Big)dt + \sigma\,dW_t; \qquad dY_t = \theta_3(\kappa - X_t)\,dt + dB_t.
$$
This model has a solution if and only if $2\theta_1\theta_2 > \sigma^2$.

In all our results, data are generated from the model under the finest discretization considered. In Algorithm 3, for all the models above, $\widetilde{p}_{\theta}(x, t; x', 1) = \mathcal{N}(x'; x, (1-t)\sigma^2)$ and $\hat{p}_{\theta}(x, x') = \widetilde{p}_{\theta}(x, x')$.

For each model, we fix the parameter $\theta$ and estimate the score function using Algorithms 1 and 3, with model-specific numbers of particles $N$ in each case. In both algorithms, we set the discretization level to $l = 10$. In Models 1, 2 and 3 we set model-specific values of $\kappa$, $x^{*}$ and $\sigma$; $T = 50$ for all models. Figure 2 summarizes the results of 56 replications of the estimates of the score function, for each model and at each unit time point. These simulations are implemented in parallel using 8 CPUs. The figure illustrates that both algorithms are equivalent for large $N$ and $l$, as one would expect.

We now compare the costs of Algorithms 3 and 4. We take $l^{*}$ as 7, 6 and 7 in Models 1, 2 and 3, respectively. The parameters of the models are as in the previous section. In Figure 3, $L$ is 10, and the ground truth is computed at level 11 with $N = 2{,}000$. In Figure 3 we can observe the cost against MSE curves, which appear to follow our earlier conjectures on algorithmic costs.

We use Algorithms 1 and 4 to estimate the parameters in each model. In Algorithm 1, the level of discretization, $l$, is 10 and the number of particles, $N$, is 2,000. In Algorithm 4, we use $l^{*} = 7$, $L = 10$ and the number of particles on each level $l \in \{l^{*}-1, \ldots, L\}$ is $N_l = 2^{L}(L - l^{*} + 2)\Delta_l^{1/2+\rho}$, with a model-specific value of $\rho$ in each of Models 1, 2 and 3.

Figure 4 considers Model 1. We fix model-specific values of $x^{*}$ and $\sigma$, with $\kappa = 2$ and $T = 20{,}000$. The parameter values used to generate the data, $(\theta_1^{\star}, \theta_2^{\star})$, are shown as dashed lines in Figure 4. For the stochastic gradient algorithm, we used a fixed initial value and a polynomially decaying step-size $\alpha_k$. Figure 5 considers Model 2. We fix model-specific values of $x^{*}$, $\sigma$ and $\kappa$, with $T = 20{,}000$.
The parameter values used to generate the data, $(\theta_1^{\star}, \theta_2^{\star}, \theta_3^{\star})$, are shown as dashed lines in Figure 5. For the stochastic gradient algorithm, we used a fixed initial value and a polynomially decaying step-size $\alpha_k$. Figure 6 considers Model 3. We fix model-specific values of $x^{*}$ and $\sigma$, with $\kappa = 2$ and $T = 20{,}000$. The parameter values used to generate the data, $(\theta_1^{\star}, \theta_2^{\star}, \theta_3^{\star})$ with $\theta_1^{\star} = 2$, are shown as dashed lines in Figure 6. For the stochastic gradient algorithm, we again used a fixed initial value and a polynomially decaying step-size. In all cases considered (Figures 4-6), our selected settings allow for an accurate estimation of the parameter values over long time periods.
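The stochastic gradient recursion used in these experiments has the generic form $\theta_{k+1} = \theta_k + \alpha_k\,\widehat{S}_k$, with $\widehat{S}_k$ an online score estimate at $\theta_k$ (from Algorithm 1 or 4) and $\alpha_k$ a decaying step-size. A minimal sketch with a placeholder gradient oracle (the callable `score_estimate` and the step-size exponent are illustrative assumptions, not the exact settings used above):

```python
def stochastic_gradient(theta0, score_estimate, n_iters=100,
                        alpha=lambda k: (k + 1) ** -0.85):
    """theta_{k+1} = theta_k + alpha_k * S_k, with S_k = score_estimate(theta_k).
    score_estimate stands in for a (noisy) online score estimator."""
    theta = list(theta0)
    for k in range(n_iters):
        S = score_estimate(theta)
        a = alpha(k)
        theta = [t + a * s for t, s in zip(theta, S)]
    return theta

# Toy check with an exact (noise-free) score of a concave log-likelihood:
# gradient is -(theta - 1), so the recursion should settle at theta = 1.
th = stochastic_gradient([0.0], lambda t: [-(t[0] - 1.0)], n_iters=50)
```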
Figure 2: Trajectories from the execution of Algorithms 1 and 3 for the estimation of the score function in Models 1-3.
Acknowledgements
AJ & HR were supported by KAUST baseline funding. AB acknowledges support from a Leverhulme Trust Prize. DC acknowledges partial support from an ERC Synergy Grant. NK acknowledges funding from a J.P. Morgan A.I. Faculty Award.
A Derivation of (5)
Recall that under $\overline{\mathbb{P}}_{\theta}$ the original processes (1)-(2) have dynamics:
$$
dY_t = dB_t; \qquad dX_t = b_{\theta}(X_t)\,dt + \sigma(X_t)\,dW_t.
$$
We consider the processes:
$$
dY_t = dB_t; \qquad dX_t = \sigma(X_t)\,dW_t,
$$
so that if $\overline{\mathbb{P}}$ denotes their law, then we have the Radon-Nikodym derivative:
$$
\frac{d\overline{\mathbb{P}}_{\theta}}{d\overline{\mathbb{P}}} = \exp\Big\{\int_0^T b_{\theta}(X_s)^{*}a(X_s)^{-1}\,dX_s - \frac{1}{2}\int_0^T b_{\theta}(X_s)^{*}a(X_s)^{-1}b_{\theta}(X_s)\,ds\Big\}.
$$
Figure 3: Cost per simulation versus MSE on a log-log scale for Algorithms 3 and 4. The dashed lines (reference slopes $\epsilon^{-6}$ and $\epsilon^{-4}$) are for reference.

The log-likelihood is $\log(\gamma_{T,\theta}(1)) = \log \overline{\mathbb{E}}_{\theta}[Z_{T,\theta}\,|\,\mathscr{Y}_T]$, with $Z_{T,\theta} = d\mathbb{P}_{\theta}/d\overline{\mathbb{P}}_{\theta}$, thus:
$$
\nabla_{\theta}\log(\gamma_{T,\theta}(1)) = \frac{1}{\overline{\mathbb{E}}_{\theta}[Z_{T,\theta}\,|\,\mathscr{Y}_T]}\,\nabla_{\theta}\overline{\mathbb{E}}_{\theta}\big[d\mathbb{P}_{\theta}/d\overline{\mathbb{P}}_{\theta}\,\big|\,\mathscr{Y}_T\big].
$$
For convenience we use the notation $\overline{\mathbb{E}}_{\theta,X}$, $\overline{\mathbb{E}}_{X}$ for the marginal expectations w.r.t. the original process $X$ and the $\theta$-free process $X$ defined above, respectively. Notice that we can write:
$$
\nabla_{\theta}\overline{\mathbb{E}}_{\theta}\big[d\mathbb{P}_{\theta}/d\overline{\mathbb{P}}_{\theta}\,\big|\,\mathscr{Y}_T\big]
= \nabla_{\theta}\overline{\mathbb{E}}_{\theta,X}\big[d\mathbb{P}_{\theta}/d\overline{\mathbb{P}}_{\theta}\big]
= \nabla_{\theta}\overline{\mathbb{E}}_{X}\big[(d\mathbb{P}_{\theta}/d\overline{\mathbb{P}}_{\theta})(d\overline{\mathbb{P}}_{\theta}/d\overline{\mathbb{P}})\big]
$$
$$
= \overline{\mathbb{E}}_{X}\big[Z_{T,\theta}\,\nabla_{\theta}\log(Z_{T,\theta})\,(d\overline{\mathbb{P}}_{\theta}/d\overline{\mathbb{P}})\big] + \overline{\mathbb{E}}_{X}\big[Z_{T,\theta}\,\nabla_{\theta}\log(d\overline{\mathbb{P}}_{\theta}/d\overline{\mathbb{P}})\,(d\overline{\mathbb{P}}_{\theta}/d\overline{\mathbb{P}})\big]
= \overline{\mathbb{E}}_{\theta,X}\big[\big(\nabla_{\theta}\log(Z_{T,\theta}) + \nabla_{\theta}\log(d\overline{\mathbb{P}}_{\theta}/d\overline{\mathbb{P}})\big)Z_{T,\theta}\big].
$$
One can now verify that, for $\lambda_{T,\theta}$ as defined in the main text:
$$
\nabla_{\theta}\log(Z_{T,\theta}) + \nabla_{\theta}\log(d\overline{\mathbb{P}}_{\theta}/d\overline{\mathbb{P}}) = \lambda_{T,\theta}.
$$

B Proofs for Proposition 3.1
The Appendix is organized as follows. Section B.1 states our additional assumptions. Section B.2 gives the proof of Proposition 3.1. Section B.3 provides some technical results needed in the proof.
Figure 4: Trajectories from the execution of Algorithm 1 (left panel) and Algorithm 4 (right panel) for the estimation of $(\theta_1, \theta_2)$ from Model 1. The horizontal dashed lines in the plots show the true parameter values $(\theta_1^{\star}, \theta_2^{\star})$.

B.1 Assumptions
Set $H^{l}_{T,\theta}(x_0, x_{\Delta_l}, \ldots, x_T) := \log\{Z^{l}_{T,\theta}(x_0, x_{\Delta_l}, \ldots, x_T)\}$. In addition:
$$
\rho_{T,\theta} := \lambda_{T,\theta}\,Z_{T,\theta}; \qquad (39)
$$
$$
\rho^{l}_{T,\theta}(x_0, x_{\Delta_l}, \ldots, x_T) := \lambda^{l}_{T,\theta}(x_0, x_{\Delta_l}, \ldots, x_T)\,Z^{l}_{T,\theta}(x_0, x_{\Delta_l}, \ldots, x_T); \qquad (40)
$$
$$
\widetilde{H}^{l,s}_{T,\theta}(x_0, \ldots, x_T, \widetilde{x}_0, \ldots, \widetilde{x}_T) := \exp\big\{s\,H^{l}_{T,\theta}(x_0, \ldots, x_T) + (1-s)\,H^{l}_{T,\theta}(\widetilde{x}_0, \ldots, \widetilde{x}_T)\big\}.
$$
Note that, as $h_{\theta}$ is bounded, there exists a $p > 1$ such that:
$$
\sup_{l}\,\sup_{s\in[0,1]}\overline{\mathbb{E}}_{\theta}\big[\widetilde{H}^{l,s}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T, \widetilde{X}_0, \widetilde{X}_{\Delta_l}, \ldots, \widetilde{X}_T)^{p}\big] < +\infty; \qquad (41)
$$
$$
\sup_{l}\,\overline{\mathbb{E}}_{\theta}\big[Z^{l}_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \ldots, \widetilde{X}_T)^{p}\big] < +\infty. \qquad (42)
$$
(D2) We have:
$$
\sup_{l}\,\overline{\mathbb{E}}_{\theta}\Bigg[\frac{\big\|\overline{\mathbb{E}}_{\theta}\big[\lambda^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,\big|\,\mathscr{Y}_T\big]\big\|^{2}}{\overline{\mathbb{E}}_{\theta}\big[Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,\big|\,\mathscr{Y}_T\big]^{2}}\Bigg] < +\infty.
$$
(D3) We have:

1. There exists a $p > 1$ such that:
$$
\sup_{l}\,\sup_{s\in[0,1]}\overline{\mathbb{E}}_{\theta}\Big[\big\|\lambda^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\big\|^{p}\cdot\widetilde{H}^{l,s}_{T,\theta}(X_0, \ldots, X_T, \widetilde{X}_0, \ldots, \widetilde{X}_T)^{p}\Big] < +\infty.
$$
Figure 5: Trajectories from the execution of Algorithm 1 (left panel) and Algorithm 4 (right panel) for the estimation of $(\theta_1, \theta_2, \theta_3)$ from Model 2. The horizontal dashed lines in the plots show the true parameter values $(\theta_1^{\star}, \theta_2^{\star}, \theta_3^{\star})$.

2. We have that:
$$
\sup_{l}\,\overline{\mathbb{E}}_{\theta}\Bigg[\frac{\overline{\mathbb{E}}_{\theta}[Z_{T,\theta}\,|\,\mathscr{Y}_T]^{2}}{\overline{\mathbb{E}}_{\theta}[Z^{l}_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \ldots, \widetilde{X}_T)\,|\,\mathscr{Y}_T]^{2}}\Bigg] < +\infty;
$$
$$
\sup_{l}\,\overline{\mathbb{E}}_{\theta}\Bigg[\Bigg\|\frac{\overline{\mathbb{E}}_{\theta}[Z_{T,\theta}\,|\,\mathscr{Y}_T]\cdot\overline{\mathbb{E}}_{\theta}[\rho^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,|\,\mathscr{Y}_T]}{\overline{\mathbb{E}}_{\theta}[Z^{l}_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \ldots, \widetilde{X}_T)\,|\,\mathscr{Y}_T]\cdot\overline{\mathbb{E}}_{\theta}[Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,|\,\mathscr{Y}_T]}\Bigg\|^{2}\Bigg] < +\infty.
$$

B.2 Proof of Proposition 3.1
Proof of Proposition 3.1.
Noting the definitions (39)-(40), the result follows easily by adding and subtracting the term:
$$
\overline{\mathbb{E}}_{\theta}\Bigg[\frac{\overline{\mathbb{E}}_{\theta}[\rho^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,|\,\mathscr{Y}_T]}{\overline{\mathbb{E}}_{\theta}[Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,|\,\mathscr{Y}_T]}\Bigg]
$$
inside the $\mathbb{L}_2$-norm, and then applying the triangle inequality followed by Lemmas B.2 and B.5.
Figure 6: Trajectories from the execution of Algorithm 1 (left panel) and Algorithm 4 (right panel) for the estimation of $(\theta_1, \theta_2, \theta_3)$ from Model 3. The horizontal dashed lines in the plots show the true parameter values $(\theta_1^{\star}, \theta_2^{\star}, \theta_3^{\star})$.

B.3 Technical Results
Throughout the section
$C < +\infty$ is a constant that does not depend upon $l$ (but may depend upon $T$ and $\theta$), with a value that may change from line to line.

Lemma B.1.
Assume (D1). Then for any $l \in \mathbb{N}$ we have:
$$
\Big|\overline{\mathbb{E}}_{\theta}\big[\rho_{T,\theta}\big] - \overline{\mathbb{E}}_{\theta}\big[\rho^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\big]\Big| = 0.
$$

Proof.
We have $\overline{\mathbb{E}}_{\theta}[\rho_{T,\theta}] = \mathbb{E}_{\theta}[\lambda_{T,\theta}]$. Then using, under $\mathbb{P}_{\theta}$, the structure of $\lambda_{T,\theta}$ and of the process, we get $\mathbb{E}_{\theta}[\lambda_{T,\theta}] = 0$. For $\overline{\mathbb{E}}_{\theta}[\rho^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)]$, by using the zero expectations of the Brownian increments, it easily follows, by taking expectations backward in time, that:
$$
\overline{\mathbb{E}}_{\theta}\Big[\Big\{\sum_{k=0}^{T/\Delta_l - 1}(\nabla_{\theta}b_{\theta}(X_{k\Delta_l}))^{*}a(X_{k\Delta_l})^{-1}\sigma(X_{k\Delta_l})(W_{(k+1)\Delta_l} - W_{k\Delta_l})\Big\}\,Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\Big] = 0,
$$
so that:
$$
\overline{\mathbb{E}}_{\theta}\big[\rho^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\big]
= \overline{\mathbb{E}}_{\theta}\Big[\sum_{k=0}^{T/\Delta_l - 1}\Big\{(\nabla_{\theta}h_{\theta}(X_{k\Delta_l}))^{*}(Y_{(k+1)\Delta_l} - Y_{k\Delta_l}) - (\nabla_{\theta}h_{\theta}(X_{k\Delta_l}))^{*}h_{\theta}(X_{k\Delta_l})\,\Delta_l\Big\}\,Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\Big].
$$
Using properties of the independent standard Brownian motion $\{Y_t\}_{t\geq 0}$ under $\overline{\mathbb{P}}_{\theta}$, it is straightforward to show that:
$$
\overline{\mathbb{E}}_{\theta}\Big[\Big\{\sum_{k=0}^{T/\Delta_l - 1}(\nabla_{\theta}h_{\theta}(X_{k\Delta_l}))^{*}(Y_{(k+1)\Delta_l} - Y_{k\Delta_l})\Big\}\,Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\Big]
= \overline{\mathbb{E}}_{\theta}\Big[\Big\{\sum_{k=0}^{T/\Delta_l - 1}(\nabla_{\theta}h_{\theta}(X_{k\Delta_l}))^{*}h_{\theta}(X_{k\Delta_l})\,\Delta_l\Big\}\,Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\Big],
$$
which concludes the proof.

Lemma B.2.
Assume (D1-2). Then for any $(T, \theta) \in [0, \infty) \times \Theta$ there exists a $C < +\infty$ such that for any $l \in \mathbb{N}$ we have:
$$
\Bigg|\mathbb{E}_{\theta}\Bigg[\frac{\overline{\mathbb{E}}_{\theta}[\rho_{T,\theta}\,|\,\mathscr{Y}_T]}{\overline{\mathbb{E}}_{\theta}[Z_{T,\theta}\,|\,\mathscr{Y}_T]} - \frac{\overline{\mathbb{E}}_{\theta}[\rho^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,|\,\mathscr{Y}_T]}{\overline{\mathbb{E}}_{\theta}[Z^{l}_{T,\theta}(X_0, X_{\Delta_l}, \ldots, X_T)\,|\,\mathscr{Y}_T]}\Bigg]\Bigg| \leq C\,\Delta_l.
$$

Proof.
We begin by noting a well-known decomposition, that is, almost surely: E θ [ ρ T,θ | Y T ] E θ [ Z T,θ | Y T ] − E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ]= − E θ [ Z T,θ | Y T ] (cid:16) E θ [ ρ T,θ | Y T ] − E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:17) − E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] E θ [ Z T,θ | Y T ] (cid:16) E θ [ Z T,θ | Y T ] − E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:17) . Hence: R : = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E θ (cid:34) E θ [ ρ T,θ | Y T ] E θ [ Z T,θ | Y T ] − E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E θ (cid:34) E θ [ ρ T,θ | Y T ] − E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ]+ E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:16) E θ [ Z T,θ | Y T ] − E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:17) (cid:35) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . R ≤ E θ (cid:34) (cid:13)(cid:13) E θ [ ρ lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:13)(cid:13) E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:35) / × E θ (cid:104) (cid:16) E θ [ Z T,θ | Y T ] − E θ [ Z lT,θ ( X , X ∆ l , . . . , X T ) | Y T ] (cid:17) (cid:105) / . Using [8, Theorem 21.5] and (D2) one can conclude.
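The key mechanism in the proof of Lemma B.1 — the observation-increment terms in the discretized score are exactly compensated, in expectation, by the $h_\theta^* h_\theta \, \Delta_l$ terms once the change-of-measure weight $Z^l_{T,\theta}$ is included — reduces, in the scalar case, to the Gaussian identity $\mathbb{E}[\xi \exp(h\xi - h^2\Delta/2)] = h\Delta$ for $\xi \sim \mathcal{N}(0, \Delta)$. The snippet below is an illustrative Monte Carlo check of this identity (a sketch, not code from the paper; the values of `delta` and `h` are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of the scalar Gaussian identity behind Lemma B.1:
#   for xi ~ N(0, Delta) and a constant h,
#   E[ xi * exp(h*xi - 0.5*h^2*Delta) ] = h * Delta.
# Here xi plays the role of an observation increment Y_{(k+1)Dl} - Y_{k Dl},
# and exp(h*xi - 0.5*h^2*Delta) the corresponding factor of Z^l_{T,theta}.

rng = np.random.default_rng(0)
delta = 0.25        # time step Delta_l (arbitrary choice)
h = 1.3             # plays the role of h_theta(X_{k Delta_l}) (arbitrary choice)
n = 2_000_000       # number of Monte Carlo samples

xi = rng.normal(0.0, np.sqrt(delta), size=n)     # Brownian increments
weight = np.exp(h * xi - 0.5 * h**2 * delta)     # exponential (Girsanov-type) weight

lhs = np.mean(xi * weight)   # Monte Carlo estimate of E[xi * weight]
rhs = h * delta              # exact value of the expectation
print(lhs, rhs)
```

Up to Monte Carlo error the two printed values agree, which is precisely why the $(\nabla_\theta h_\theta)^* (Y_{(k+1)\Delta_l} - Y_{k\Delta_l})$ terms and the $(\nabla_\theta h_\theta)^* h_\theta \, \Delta_l$ terms cancel in expectation against each other.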
Lemma B.3.
Assume (D1). Then for any $(T, \theta) \in [0, \infty) \times \Theta$ there exists a $C < +\infty$ such that for any $l \in \mathbb{N}$ we have:
\[
\mathbb{E}_\theta\Bigl[ \Bigl( \mathbb{E}_\theta\bigl[ Z^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T) - Z^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T) \,\big|\, \mathscr{Y}_T \bigr] \Bigr)^2 \Bigr] \le C \, \Delta_l .
\]

Proof.
We have, via the conditional Jensen inequality (within this proof, write $Z^l(X) := Z^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T)$, $Z^l(\widetilde{X}) := Z^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T)$, and similarly for $H^l$):
\[
R := \mathbb{E}_\theta\Bigl[ \bigl( \mathbb{E}_\theta[ Z^l(X) - Z^l(\widetilde{X}) \mid \mathscr{Y}_T ] \bigr)^2 \Bigr] \le \mathbb{E}_\theta\Bigl[ \bigl( Z^l(X) - Z^l(\widetilde{X}) \bigr)^2 \Bigr] .
\]
Now, by the Mean Value Theorem:
\[
Z^l(X) - Z^l(\widetilde{X}) = \int_0^1 \widetilde{H}^{l,s}_{T,\theta}(X_0, \dots, X_T, \widetilde{X}_0, \dots, \widetilde{X}_T) \, ds \times \bigl( H^l(X) - H^l(\widetilde{X}) \bigr) . \tag{43}
\]
Hence, setting $q > 1$, $q' = 1 - 1/q$ and applying Hölder's inequality, $R$ is upper-bounded by:
\[
\mathbb{E}_\theta\Bigl[ \Bigl( \int_0^1 \widetilde{H}^{l,s}_{T,\theta}(X_0, \dots, X_T, \widetilde{X}_0, \dots, \widetilde{X}_T) \, ds \Bigr)^{2q} \Bigr]^{1/q} \times \mathbb{E}_\theta\Bigl[ \bigl| H^l(X) - H^l(\widetilde{X}) \bigr|^{2/q'} \Bigr]^{q'} .
\]
Then by (41):
\[
R \le C \, \mathbb{E}_\theta\Bigl[ \bigl| H^l(X) - H^l(\widetilde{X}) \bigr|^{2/q'} \Bigr]^{q'} .
\]
Now, we note that $H^l(X) - H^l(\widetilde{X}) = M^l_{T,\theta} - N^l_{T,\theta}$, where:
\[
M^l_{T,\theta} := \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ h_\theta(X_{k\Delta_l})^* - h_\theta(\widetilde{X}_{k\Delta_l})^* \bigr\} (Y_{(k+1)\Delta_l} - Y_{k\Delta_l}) ; \qquad
N^l_{T,\theta} := \frac{\Delta_l}{2} \sum_{k=0}^{T/\Delta_l - 1} \bigl[ h_\theta(X_{k\Delta_l})^* h_\theta(X_{k\Delta_l}) - h_\theta(\widetilde{X}_{k\Delta_l})^* h_\theta(\widetilde{X}_{k\Delta_l}) \bigr] .
\]
Applying the $C_{2/q'}$-inequality, one has:
\[
R \le C \, \mathbb{E}_\theta\Bigl[ \bigl| H^l(X) - H^l(\widetilde{X}) \bigr|^{2/q'} \Bigr]^{q'} \le C \Bigl( \mathbb{E}_\theta[ |M^l_{T,\theta}|^{2/q'} ] + \mathbb{E}_\theta[ |N^l_{T,\theta}|^{2/q'} ] \Bigr)^{q'} . \tag{44}
\]
We first focus on the first term on the right side of (44). Applying the $C_{2/q'}$-inequality $d_y$ times, we have:
\[
\mathbb{E}_\theta[ |M^l_{T,\theta}|^{2/q'} ] \le C \sum_{i=1}^{d_y} \mathbb{E}_\theta\Bigl[ \Bigl| \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ h^{(i)}_\theta(X_{k\Delta_l}) - h^{(i)}_\theta(\widetilde{X}_{k\Delta_l}) \bigr\} \bigl( Y^{(i)}_{(k+1)\Delta_l} - Y^{(i)}_{k\Delta_l} \bigr) \Bigr|^{2/q'} \Bigr] . \tag{45}
\]
We consider just the $i$-th summand on the right side, as the argument to be used is essentially exchangeable w.r.t. $i$. Applying the Burkholder-Davis-Gundy (BDG) inequality:
\[
\mathbb{E}_\theta\Bigl[ \Bigl| \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ h^{(i)}_\theta(X_{k\Delta_l}) - h^{(i)}_\theta(\widetilde{X}_{k\Delta_l}) \bigr\} \bigl( Y^{(i)}_{(k+1)\Delta_l} - Y^{(i)}_{k\Delta_l} \bigr) \Bigr|^{2/q'} \Bigr]
\le C \, \mathbb{E}_\theta\Bigl[ \Bigl( \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ h^{(i)}_\theta(X_{k\Delta_l}) - h^{(i)}_\theta(\widetilde{X}_{k\Delta_l}) \bigr\}^2 \bigl( Y^{(i)}_{(k+1)\Delta_l} - Y^{(i)}_{k\Delta_l} \bigr)^2 \Bigr)^{1/q'} \Bigr] .
\]
Now, applying the Minkowski inequality, and making use of the independence and Brownian property of the increments $(Y^{(i)}_{(k+1)\Delta_l} - Y^{(i)}_{k\Delta_l})$, along with $h^{(i)}_\theta \in \mathrm{Lip}_{\|\cdot\|}(\mathbb{R}^{d_x})$:
\[
\mathbb{E}_\theta\Bigl[ \Bigl| \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ h^{(i)}_\theta(X_{k\Delta_l}) - h^{(i)}_\theta(\widetilde{X}_{k\Delta_l}) \bigr\} \bigl( Y^{(i)}_{(k+1)\Delta_l} - Y^{(i)}_{k\Delta_l} \bigr) \Bigr|^{2/q'} \Bigr]
\le C \, \Delta_l^{1/q'} \Bigl( \sum_{k=0}^{T/\Delta_l - 1} \mathbb{E}_\theta\bigl[ \| X_{k\Delta_l} - \widetilde{X}_{k\Delta_l} \|^{2/q'} \bigr]^{q'} \Bigr)^{1/q'} .
\]
Then, using standard results for Euler discretizations:
\[
\mathbb{E}_\theta\Bigl[ \Bigl| \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ h^{(i)}_\theta(X_{k\Delta_l}) - h^{(i)}_\theta(\widetilde{X}_{k\Delta_l}) \bigr\} \bigl( Y^{(i)}_{(k+1)\Delta_l} - Y^{(i)}_{k\Delta_l} \bigr) \Bigr|^{2/q'} \Bigr]
\le C \, \Delta_l^{1/q'} \Bigl( \sum_{k=0}^{T/\Delta_l - 1} \Delta_l \Bigr)^{1/q'} \le C \, \Delta_l^{1/q'} .
\]
Thus, on returning to (45), we have shown that:
\[
\mathbb{E}_\theta[ |M^l_{T,\theta}|^{2/q'} ] \le C \, \Delta_l^{1/q'} . \tag{46}
\]
Note that, as $h^{(i)}_\theta \in \mathrm{Lip}_{\|\cdot\|}(\mathbb{R}^{d_x})$ and $h^{(i)}_\theta \in \mathcal{B}_b(\mathbb{R}^{d_x})$, it follows that $(h^{(i)}_\theta)^2 \in \mathrm{Lip}_{\|\cdot\|}(\mathbb{R}^{d_x})$. So, using very similar calculations to those for $M^l_{T,\theta}$ (except without the need for the BDG inequality), one can prove that:
\[
\mathbb{E}_\theta[ |N^l_{T,\theta}|^{2/q'} ] \le C \, \Delta_l^{1/q'} . \tag{47}
\]
Thus, combining (46)-(47) with (44), one can conclude.

Lemma B.4. Assume (D1) and part 1. of (D3). Then for any $(T, \theta) \in [0, \infty) \times \Theta$ there exists a $C < +\infty$ such that for any $l \in \mathbb{N}$ we have:
\[
\mathbb{E}_\theta\Bigl[ \bigl\| \mathbb{E}_\theta\bigl[ \rho^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T) - \rho^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T) \,\big|\, \mathscr{Y}_T \bigr] \bigr\|^2 \Bigr] \le C \, \Delta_l .
\]

Proof.
We have, via the $C_2$-inequality (within this proof, write $\rho^l(X) := \rho^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T)$, $\rho^l(\widetilde{X}) := \rho^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T)$, and similarly for $Z^l$, $\lambda^l$ and $H^l$):
\[
\mathbb{E}_\theta\Bigl[ \bigl\| \mathbb{E}_\theta[ \rho^l(X) - \rho^l(\widetilde{X}) \mid \mathscr{Y}_T ] \bigr\|^2 \Bigr] \le C \, ( R_1 + R_2 ) , \tag{48}
\]
where:
\[
R_1 := \mathbb{E}_\theta\Bigl[ \| \lambda^l(X) \|^2 \, \bigl( Z^l(X) - Z^l(\widetilde{X}) \bigr)^2 \Bigr] ; \qquad
R_2 := \mathbb{E}_\theta\Bigl[ Z^l(\widetilde{X})^2 \, \bigl\| \lambda^l(X) - \lambda^l(\widetilde{X}) \bigr\|^2 \Bigr] .
\]
We will upper-bound $R_1$ and $R_2$ respectively and sum the bounds to conclude. Applying (43), we have:
\[
R_1 = \mathbb{E}_\theta\Bigl[ \| \lambda^l(X) \|^2 \Bigl( \int_0^1 \widetilde{H}^{l,s}_{T,\theta}(X_0, \dots, X_T, \widetilde{X}_0, \dots, \widetilde{X}_T) \, ds \Bigr)^2 \bigl( H^l(X) - H^l(\widetilde{X}) \bigr)^2 \Bigr] .
\]
Then, setting $q > 1$, $q' = 1 - 1/q$ and applying Hölder's inequality:
\[
R_1 \le \mathbb{E}_\theta\Bigl[ \| \lambda^l(X) \|^{2q} \Bigl( \int_0^1 \widetilde{H}^{l,s}_{T,\theta}(X_0, \dots, X_T, \widetilde{X}_0, \dots, \widetilde{X}_T) \, ds \Bigr)^{2q} \Bigr]^{1/q} \times \mathbb{E}_\theta\Bigl[ \bigl| H^l(X) - H^l(\widetilde{X}) \bigr|^{2/q'} \Bigr]^{q'} .
\]
Then, noting (44) along with (46)-(47) and part 1. of (D3), we have:
\[
R_1 \le C \, \Delta_l . \tag{49}
\]
For $R_2$, we note that $\lambda^l(X) - \lambda^l(\widetilde{X}) = M^{l,1}_{T,\theta} + M^{l,2}_{T,\theta} - \bar{N}^l_{T,\theta}$, where:
\[
M^{l,1}_{T,\theta} := \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ \phi_\theta(X_{k\Delta_l}) - \phi_\theta(\widetilde{X}_{k\Delta_l}) \bigr\} ( W_{(k+1)\Delta_l} - W_{k\Delta_l} ) ;
\]
\[
M^{l,2}_{T,\theta} := \sum_{k=0}^{T/\Delta_l - 1} \bigl\{ \nabla_\theta h_\theta(X_{k\Delta_l})^* - \nabla_\theta h_\theta(\widetilde{X}_{k\Delta_l})^* \bigr\} ( Y_{(k+1)\Delta_l} - Y_{k\Delta_l} ) ;
\]
\[
\bar{N}^l_{T,\theta} := \Delta_l \sum_{k=0}^{T/\Delta_l - 1} \bigl[ \nabla_\theta h_\theta(X_{k\Delta_l})^* h_\theta(X_{k\Delta_l}) - \nabla_\theta h_\theta(\widetilde{X}_{k\Delta_l})^* h_\theta(\widetilde{X}_{k\Delta_l}) \bigr] .
\]
We remark that one can deal with the terms $M^{l,1}_{T,\theta}$, $M^{l,2}_{T,\theta}$ (resp. $\bar{N}^l_{T,\theta}$) in (almost) the same way as with $M^l_{T,\theta}$ (resp. $N^l_{T,\theta}$) in the proof of Lemma B.3. So, essentially following the above argument for $R_1$, by splitting the terms $Z^l(\widetilde{X})^2$ and $\| \lambda^l(X) - \lambda^l(\widetilde{X}) \|^2$ via the same Hölder inequality, using (42) along with the inferred bounds on $\mathbb{E}_\theta[ |M^{l,j}_{T,\theta}|^{2/q'} ]$, $j \in \{1, 2\}$, and $\mathbb{E}_\theta[ |\bar{N}^l_{T,\theta}|^{2/q'} ]$ (i.e. $C \Delta_l^{1/q'}$, as in (46)-(47)), one can deduce that:
\[
R_2 \le C \, \Delta_l . \tag{50}
\]
Combining (49)-(50) with (48) concludes the proof.

Lemma B.5.
Assume (D1), (D3). Then for any $(T, \theta) \in [0, \infty) \times \Theta$ there exists a $C < +\infty$ such that for any $l \in \mathbb{N}$ we have:
\[
\left| \mathbb{E}_\theta\!\left[ \frac{\mathbb{E}_\theta[\rho^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T) \mid \mathscr{Y}_T]}{\mathbb{E}_\theta[Z^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T) \mid \mathscr{Y}_T]} - \frac{\mathbb{E}_\theta[\rho^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T) \mid \mathscr{Y}_T]}{\mathbb{E}_\theta[Z^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T) \mid \mathscr{Y}_T]} \right] \right| \le C \, \Delta_l^{1/2} .
\]

Proof. We have (abbreviating, within this proof, $\rho^l(X) := \rho^l_{T,\theta}(X_0, X_{\Delta_l}, \dots, X_T)$, $\rho^l(\widetilde{X}) := \rho^l_{T,\theta}(\widetilde{X}_0, \widetilde{X}_{\Delta_l}, \dots, \widetilde{X}_T)$, and similarly for $Z^l$):
\[
R := \left| \mathbb{E}_\theta\!\left[ \frac{\mathbb{E}_\theta[\rho^l(X) \mid \mathscr{Y}_T]}{\mathbb{E}_\theta[Z^l(X) \mid \mathscr{Y}_T]} - \frac{\mathbb{E}_\theta[\rho^l(\widetilde{X}) \mid \mathscr{Y}_T]}{\mathbb{E}_\theta[Z^l(\widetilde{X}) \mid \mathscr{Y}_T]} \right] \right| = \bigl| \mathbb{E}_\theta[ R_1 ( R_4 - R_2 R_3 ) ] \bigr| ,
\]
which follows from the elementary identity $\frac{A}{B} - \frac{a}{b} = \frac{1}{b} \bigl\{ (A - a) - \frac{A}{B} (B - b) \bigr\}$, where:
\[
R_1 = \mathbb{E}_\theta[ Z^l(\widetilde{X}) \mid \mathscr{Y}_T ]^{-1} ; \qquad
R_2 = \frac{\mathbb{E}_\theta[ \rho^l(X) \mid \mathscr{Y}_T ]}{\mathbb{E}_\theta[ Z^l(X) \mid \mathscr{Y}_T ]} ;
\]
\[
R_3 = \mathbb{E}_\theta[ Z^l(X) - Z^l(\widetilde{X}) \mid \mathscr{Y}_T ] ; \qquad
R_4 = \mathbb{E}_\theta[ \rho^l(X) - \rho^l(\widetilde{X}) \mid \mathscr{Y}_T ] .
\]
Using the Cauchy-Schwarz inequality and part 2. of (D3) (which controls the moments of $R_1$ and $R_2$), along with Lemmas B.3-B.4 (which bound the second moments of $R_3$ and $R_4$ by $C \Delta_l$), one can deduce that:
\[
| R | \le C \, \Delta_l^{1/2} ,
\]
which concludes the proof.

References

[1]
Bain, A. & Crisan, D. (2009). Fundamentals of Stochastic Filtering. Springer: New York.
[2] Ballesio, M., Jasra, A., von Schwerin, E. & Tempone, R. (2020). A Wasserstein coupled particle filter for multilevel estimation. arXiv:2004.03981.
[3] Benveniste, A., Métivier, M. & Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximation. Springer-Verlag: New York.
[4] Beskos, A., Papaspiliopoulos, O., Roberts, G. & Fearnhead, P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). J. R. Statist. Soc. Ser. B, 333-382.
[5] Botha, I., Kohn, R. & Drovandi, C. (2020). Particle methods for stochastic differential equation mixed effects models. Bayes. Anal. (to appear).
[6] Campillo, F. & Le Gland, F. (1989). Maximum likelihood estimation for partially observed diffusions: direct maximization vs the EM algorithm. Stoch. Proc. Appl., 245-274.
[7] Cliffe, K. A., Giles, M. B., Scheichl, R. & Teckentrup, A. L. (2011). Multilevel Monte Carlo methods and applications to elliptic PDEs with random coefficients. Comput. Vis. Sci., 3-15.
[8] Crisan, D. (2011). Discretizing the continuous-time filtering problem: order of convergence. In [9], 572-597.
[9] Crisan, D. & Rozovskii, B. (2011). The Oxford Handbook of Nonlinear Filtering. Oxford University Press.
[10] Del Moral, P. (2004). Feynman-Kac Formulae. Springer: New York.
[11] Del Moral, P., Doucet, A. & Singh, S. S. (2010). A backward particle interpretation of Feynman-Kac formulae. M2AN, 947-975.
[12] Del Moral, P., Doucet, A. & Singh, S. S. (2010). Forward smoothing using sequential Monte Carlo. arXiv:1012.5390.
[13] Etienne, M. P., Gloaguen, P., Le Corff, S. & Olsson, J. (2020). Backward importance sampling for partially observed diffusion processes. arXiv:2002.05438.
[14] Gloaguen, P., Etienne, M. P. & Le Corff, S. (2018). Online sequential Monte Carlo smoother for partially observed diffusion processes. EURASIP J. Adv. Sig. Proc., article 9.
[15] Jasra, A., Kamatani, K., Law, K. J. H. & Zhou, Y. (2017). Multilevel particle filters. SIAM J. Numer. Anal., 3068-3096.
[16] Jasra, A., Law, K. J. H. & Yu, F. (2020). Unbiased filtering of a class of partially observed diffusions. arXiv:2002.03747.
[17] Jasra, A. & Yu, F. (2020). Central limit theorems for coupled particle filters. Adv. Appl. Probab. (to appear).
[18] Jasra, A., Yu, F. & Heng, J. (2020). Multilevel particle filters for the nonlinear filtering problem in continuous time. Statist. Comp. (to appear).
[19] Le Gland, F. & Mevel, M. (1997). Recursive identification in hidden Markov models. Proc. 36th IEEE Conf. Dec. Contr., 3468-3473.
[20] Olsson, J. & Westerborn, J. (2017). Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. Bernoulli, 1951-1996.
[21] Papaspiliopoulos, O., Roberts, G. O. & Stramer, O. (2013). Data augmentation for diffusions. J. Comp. Graph. Stat., 665-688.
[22] Papaspiliopoulos, O. & Roberts, G. (2012). Importance sampling techniques for estimation of diffusion models. In Stat. Meth. Stoch. Diff. Eq., 311-340.
[23] Picard, J. (1984). Approximation of nonlinear filtering problems and order of convergence. In Filtering and Control of Random Processes, 219-236. Springer: Berlin, Heidelberg.
[24] Poyiadjis, G., Doucet, A. & Singh, S. S. (2011). Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 65-80.
[25] Särkkä, S. & Sottinen, T. (2008). Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. Bayes. Anal., 555-584.
[26] Schauer, M., van der Meulen, F. & van Zanten, H. (2017). Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli, 2917-2950.
[27] Sen, D., Thiery, A. & Jasra, A. (2018). On coupling particle filters. Statist. Comp., 461-475.
[28] Talay, D. (1984). Efficient numerical schemes for the approximation of expectations of functionals of the solution of a SDE, and applications. In Filtering and Control of Random Processes, 294-313. Springer: Berlin, Heidelberg.
[29] Thorisson, H. (2000). Coupling, Stationarity, and Regeneration. Springer: New York.
[30] van der Meulen, F. & Schauer, M. (2017). Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Elec. J. Stat., 2358-2396.
[31] van der Meulen, F. & Schauer, M. (2017). Continuous-discrete smoothing of diffusions. arXiv:1712.03807.
[32] Whitaker, G. A., Golightly, A., Boys, R. J. & Sherlock, C. (2017). Improved bridge constructs for stochastic differential equations. Statist. Comp., 885-900.
[33] Yonekura, S. &