Particle-based, online estimation of tangent filters with application to parameter estimation in nonlinear state-space models
Jimmy Olsson and Johan Westerborn Alenl¨ov
Department of Mathematics, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
e-mail: [email protected], [email protected]

Abstract:
This paper presents a novel algorithm for efficient online estimation of the filter derivatives in general hidden Markov models. The algorithm, which has a linear computational complexity and very limited memory requirements, is furnished with a number of convergence results, including a central limit theorem with an asymptotic variance that can be shown to be uniformly bounded in time. Using the proposed filter derivative estimator we design a recursive maximum likelihood algorithm updating the parameters according to the gradient of the one-step predictor log-likelihood. The efficiency of this online parameter estimation scheme is illustrated in a simulation study.
1. Introduction
The general state-space hidden Markov models form a powerful statistical modelling tool which is presently applied across a wide range of scientific and engineering disciplines; see e.g. [2, 4] and the references therein for examples. In the literature, the term "hidden Markov model" is often restricted to the case where the state space of the hidden chain is a finite set, and we hence use the term "state-space model" (SSM) to stress that the state spaces of the models that we consider are completely general. More specifically, SSMs are bivariate stochastic processes consisting of an unobservable Markov chain {X_t}_{t∈N} and an observable process {Y_t}_{t∈N}, referred to as the state process and the observation process and taking on values in some general state spaces X and Y, respectively. When using these models in practice, the statistician is typically interested in computing conditional distributions of the unobservable states given some fixed observation record or data stream {y_t}_{t∈N}. Generally, these distributions cannot be expressed in a closed form, and the user needs to rely on approximations. Moreover, any SSM is typically parameterised by unknown model parameters θ which need to be calibrated ex ante.

The frequentist approach to parameter estimation consists in maximising the likelihood function θ ↦ L_θ(y_{0:t}), where L_θ(y_{0:t}) denotes the joint density of the observations evaluated at the given data y_{0:t} = (y_0, . . . , y_t) (this will be our generic notation for vectors), or, equivalently, the log-likelihood function θ ↦ ℓ_θ(y_{0:t}) := log L_θ(y_{0:t}). Even though the likelihood cannot generally be expressed in a closed form, there are several approaches to approximate maximisation of the same. These methods are typically based on either the expectation-maximisation algorithm [10], where the maximisation is performed over an intermediate quantity, or the tools of gradient-based optimisation.
The latter algorithms produce, in a steepest-ascent manner, a sequence {θ_n}_{n∈N} of parameter estimates converging to the maximum likelihood estimator; i.e., at each iteration, a parameter estimate θ_n is updated recursively through a move in the direction given by an estimate of the score function ∇_θ ℓ_{θ_n}(y_{0:t}).

There are mainly two approaches to estimation of the score function. The first one goes via Fisher's identity [see, e.g., 4, Proposition 10.1.6], which relates the score function of the observed data to that of the complete data and thus transfers the problem to that of joint smoothing, i.e., the computation of the joint posterior distribution of X_{0:t} given Y_{0:t} = y_{0:t}. The second approach, which is the focus of the present paper, goes via the decomposition

ℓ_θ(y_{0:t}) = Σ_{s=0}^{t} ℓ_θ(y_s | y_{0:s−1}),   (1.1)

where each one-step predictor log-likelihood ℓ_θ(y_s | y_{0:s−1}) (with ℓ_θ(y_0 | y_{0:−1}) := ℓ_θ(y_0)) can be computed via the sensitivity equations; see Cappé et al. [4, Section 10.2.4] and Section 4 for details. Let π_{s;θ}(x_s) be the density of the predictor at time s, i.e., of the distribution of the state X_s conditionally on the observation record Y_{0:s−1} = y_{0:s−1}; it is then, by swapping the order of the nabla and integration operations, possible to express the gradient ∇_θ ℓ_θ(y_s | y_{0:s−1}) as a functional of the gradient ∇_θ π_{s;θ}(x_s), the latter being typically referred to as the filter derivative or tangent filter or filter sensitivity.
It should be noted that, despite its name, the tangent filter is the gradient of the predictor rather than of the filter, where the filter at time s is defined as the distribution of X_s conditionally on Y_{0:s} = y_{0:s}, i.e., the updated predictor.

Appealingly, the recursive updating scheme for the tangent filters given by the sensitivity equations requires only access to the filter distributions (and not the full joint smoothing distribution, as in the case of Fisher identity-based score approximation). If estimation is to be carried through in the offline mode, by processing repeatedly a fixed batch of observations until convergence, the approach based on Fisher's identity is typically computationally more efficient and hence to be preferred. On the other hand, if estimation is to be carried through in an online manner, i.e., through a single sweep of a (potentially infinitely) long sequence of observations, then the sensitivity-equation approach is better suited, as the filter distributions can be updated recursively at constant computational cost. Online parameter learning is relevant when dealing with very big batches of data or in scenarios requiring the parameters to be continuously updated on the basis of a continuous stream of arriving observations.

If the state space of the latent Markov chain is a finite set or if the model belongs to the linear Gaussian SSMs, exact computation of the filter distributions, and thus of the filter derivatives and the gradient of the one-step predictor log-likelihood, is possible [28, 4]. However, as touched upon above, the general case calls for approximation of these quantities, and existing approaches to nonlinear filtering can typically be divided into two classes. The first class contains methods relying on linearisation, such as the extended Kalman filter, the unscented Kalman filter [1, 22], etc., whose success is generally limited in the presence of significant nonlinear or non-Gaussian model components.
The second class is formed by the sequential Monte Carlo (SMC) or particle filtering approaches [18], where a sample of particles with associated importance weights is propagated sequentially and randomly in time according to the model dynamics in order to approximate the flows of filter and predictor distributions. Due to their high capability of solving potentially very difficult nonlinear/non-Gaussian filtering problems, SMC methods have, as indicated by the over 8,000 Google Scholar citations of the first monograph [16] on the topic, seen a rapid and ever-increasing interest during the last decades. For an exposé of current particle-based approaches to parameter estimation in nonlinear SSMs, we refer to [23]. SMC-based score function estimation via tangent filters was discussed by [14, 36, 37, 8].

The work [8], which can be viewed as the state of the art when it concerns SMC-based tangent filter estimation, expresses each tangent filter as a centered, smoothed expectation of an additive state functional. Using the so-called backward decomposition of the joint smoothing distribution and a recursive form of this decomposition going back to [3], [7] are able to form, on-the-fly and through the perspicacious propagation of a number of smoothed statistics, one statistic per particle, non-collapsed particle approximations of such additive smoothed expectations. Moreover, [7] furnish their estimator with a central limit theorem (CLT) describing the weak convergence, as the size of the particle sample tends to infinity, of the particle tangent filter approximations to the exact counterparts. Appealingly, the fact that the smoothed expectations are centered allows, under strong mixing assumptions, the asymptotic variance of the mentioned CLT to be bounded uniformly in time, which establishes the long-term stochastic stability of the algorithm for large sample sizes.
However, since the propagation of the smoothed statistic associated with a given particle requires an expectation under the ancestral posterior of that particle to be computed by summation, the computational complexity of the algorithm is quadratic in the number of particles. The user is hence forced to keep the number of particles at a modest value, which may imply severe numerical instability in practice (see Section 4.3 for an illustration).

On the basis of the approach taken by [8], we develop an efficient and numerically stable algorithm for particle approximation of the filter derivatives in SSMs. The approach that we propose includes, as a sub-routine, the particle-based, rapid incremental smoother (PaRIS) [introduced recently by us in 34], an algorithm that is tailor-made for online approximation of smoothed additive state functionals. By replacing, in the forward recursion proposed by [8], the computation of each smoothed statistic by a conditionally unbiased Monte Carlo estimate, the computational complexity becomes, under mild assumptions, linear in the number of particles. Since the original estimator can be viewed as a Rao-Blackwellisation of the new one, we will sometimes refer to this measure as "de-Rao-Blackwellisation". This allows the user to mount a considerably larger number of particles for a given computational budget, which compensates by far for the slight increase of asymptotic variance added by the additional sampling step. Thus, the technique that we propose has some similarities with that used in particle-based two-filter smoothing, where a linear computational complexity is obtained by replacing certain marginalisations by random sampling [see 17] (however, for the sake of completeness we remark that two-filter particle smoothers, which require a so-called backward information filter to be run backwards in time, cannot be used in online settings). By reusing techniques developed by [34] and Del Moral et al.
[8], we are able to furnish the proposed algorithm with an exponential concentration inequality and a CLT, whose asymptotic variance is given by that of the Rao-Blackwellised counterpart plus one additional term that is inversely proportional to the sample size of the supplementary, PaRISian sampling operation. This result is analogous to the results obtained by [32] for the case of particle-based two-filter smoothing. Moreover, by assuming strong mixing of the model, the asymptotic variance can be bounded uniformly in time.

In the second part of the paper, we cast our PaRIS-based tangent filter estimator into the framework of recursive maximum likelihood estimation. The numerical performance of this online parameter estimation algorithm is investigated in a simulation study, where we are able to report significant improvement over the approach of [8] for online parameter learning in the stochastic volatility model of [19]. In addition, we propose our algorithm as a prototypical solution to the simultaneous localisation and mapping (SLAM) problem, with promising results.

Finally, we remark that the parameter estimation algorithm was outlined by us in the conference note [33]. The present paper provides rigorous theoretical results underpinning the initial observations made in [33] and complements the simulation study of that work with additional examples.

After having introduced, prefatorily, general SSMs, tangent filters, and SMC-based smoothing (including the PaRIS) in Section 2, Section 3 presents the proposed approach to tangent filter estimation. In addition, Section 3.1 includes theoretical convergence results in the form of an exponential concentration inequality and a CLT with a time-uniform bound on the asymptotic variance.
Section 4 applies the PaRISian tangent filter estimator to recursive maximum likelihood estimation, and the resulting online parameter estimation algorithm is tested numerically in Section 4.3. Finally, some conclusions are drawn in Section 5, and Appendix A contains all proofs.
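To fix ideas, the steepest-ascent recursion described above can be sketched in a few lines of Python. The i.i.d. Gaussian toy model, its score function, and the step-size schedule below are illustrative assumptions only, not part of the SSM class studied in this paper; they merely show one gradient step being taken per incoming observation, in the spirit of the decomposition (1.1).

```python
import random

def score(theta, y, sigma2=1.0):
    # Gradient of the per-observation Gaussian log-likelihood
    # l_theta(y) = -0.5 * (y - theta)^2 / sigma2 + const (toy model).
    return (y - theta) / sigma2

def recursive_ml(observations, theta0=0.0):
    # Steepest-ascent update theta_{n+1} = theta_n + gamma_{n+1} * score
    # estimate, processing the data stream in a single online sweep.
    theta = theta0
    for n, y in enumerate(observations, start=1):
        gamma = n ** -0.6  # Robbins-Monro-type step sizes
        theta += gamma * score(theta, y)
    return theta

random.seed(0)
true_mean = 2.0
ys = [random.gauss(true_mean, 1.0) for _ in range(5000)]
theta_hat = recursive_ml(ys)
```

In the nonlinear SSM setting of this paper, the exact per-observation score is unavailable and is replaced by the particle-based tangent filter estimates developed below.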
2. Preliminaries
In the following we let N denote the natural numbers and set N* := N \ {0}. For any measurable space (E, ℰ), where ℰ is a countably generated σ-algebra, we denote by F(ℰ) the set of bounded ℰ/B(R)-measurable functions on E, where B(R) denotes the Borel σ-algebra on R. For any function h ∈ F(ℰ), we let ‖h‖_∞ := sup_{x∈E} |h(x)| denote the supremum norm of h. Let M(ℰ) denote the set of σ-finite measures on (E, ℰ) and M₁(ℰ) ⊂ M(ℰ) the set of probability measures on the same space.

Now, let (X, 𝒳) and (Y, 𝒴) be measurable spaces and Q_θ : X × 𝒳 → [0, 1] and G_θ : X × 𝒴 → [0, 1] Markov transition kernels. We will often subject kernels to operations such as multiplication and tensor multiplication. Moreover, a transition kernel induces two different operators, one on bounded measurable functions and one on measures; we refer to Appendix B for details. Moreover, let χ ∈ M₁(𝒳). The kernels above are parameterised by a model parameter vector θ belonging to some space Θ ⊆ R^d, d ∈ N*, and we will use subindices to stress the dependence of any distributions on these parameters. We define an SSM as the canonical Markov chain {(X_t, Y_t)}_{t∈N} with initial distribution χ ⊗ G_θ (see Appendix B for a definition of the product ⊗ of a measure and a kernel) and Markov transition kernel

(X × Y) × (𝒳 ⊗ 𝒴) ∋ ((x, y), A) ↦ ∫∫ 1_A(x′, y′) Q_θ(x, dx′) G_θ(x′, dy′).   (2.1)

In this setting, the state process {X_t}_{t∈N} is assumed to be only partially observed through the observation process {Y_t}_{t∈N}. Under the dynamics (2.1),
(i) the state process {X_t}_{t∈N} is itself a Markov chain with transition kernel Q_θ and initial distribution χ;
(ii) the observations are, conditionally on the states, independent and such that the conditional distribution of each Y_t depends on the corresponding X_t only and is given by the emission distribution G_θ(X_t, ·).

For simplicity we have assumed that the initial distribution χ does not depend on θ. Throughout this paper we will consider the fully dominated case where for all θ ∈ Θ, Q_θ and G_θ admit densities q_θ and g_θ with respect to some reference measures µ ∈ M(𝒳) and ν ∈ M(𝒴), respectively. This means that for all θ ∈ Θ, x ∈ X, f ∈ F(𝒳), and h ∈ F(𝒴),

Q_θ f(x) = ∫ f(x′) q_θ(x, x′) µ(dx′),   G_θ h(x) = ∫ h(y) g_θ(x, y) ν(dy).
Likewise, χ is assumed to have a density, which we denote by the same symbol, with respect to the same reference measure µ.

In the following we assume that we are given a fixed sequence {y_t}_{t∈N} of observations of {Y_t}_{t∈N}, and in order to avoid notational overload we will generally suppress any dependence on these observations from the notation by denoting, e.g., by g_{t;θ} the function X ∋ x ↦ g_θ(x, y_t). For any triple (s, s′, t) ∈ N³ such that s ≤ s′ we denote by φ_{s:s′|t;θ} the conditional distribution of X_{s:s′} given Y_{0:t} = y_{0:t}. For any f ∈ F(𝒳^{⊗(s′−s+1)}), this distribution can be expressed as

φ_{s:s′|t;θ} f = L_θ(y_{0:t})^{−1} ∫ ⋯ ∫ f(x_{s:s′}) ( ∏_{m=0}^{t} g_{m;θ}(x_m) ) χ(dx_0) ∏_{ℓ=0}^{(t∨s′)−1} Q_θ(x_ℓ, dx_{ℓ+1}),

where

L_θ(y_{0:t}) = ∫ ⋯ ∫ g_{0;θ}(x_0) χ(dx_0) ∏_{ℓ=0}^{t−1} g_{ℓ+1;θ}(x_{ℓ+1}) Q_θ(x_ℓ, dx_{ℓ+1})   (2.2)

is the observed data likelihood. In the cases s = s′ = t and s = s′ = t + 1 we let φ_{t;θ} and π_{t+1;θ} be shorthand notation for the filter and predictor distributions φ_{t:t|t;θ} and φ_{t+1:t+1|t;θ}, respectively. In the case s = 0 and s′ = t, the distribution φ_{0:t|t;θ} is referred to as the joint-smoothing distribution. The filter and predictor distributions are closely related; indeed, the well-known filtering recursion provides that for all t ∈ N and f ∈ F(𝒳),

φ_{t;θ} f = π_{t;θ}(g_{t;θ} f) / π_{t;θ} g_{t;θ},   (2.3)
π_{t+1;θ} f = φ_{t;θ} Q_θ f,   (2.4)

where, by convention, π_{0;θ} := χ.

We will often deal with sums and products of functions with possibly different arguments. Since these functions will be defined on products of X, we will, when needed, let, with slight abuse of notation, subscripts define the domains of such sums and products.
For instance, f_t f̃_t : X ∋ x_t ↦ f_t(x_t) f̃_t(x_t), while f_t + f̃_{t+1} : X² ∋ x_{t:t+1} ↦ f_t(x_t) + f̃_{t+1}(x_{t+1}).

It may be shown [see, e.g., 4, Proposition 3.3.6] that the state process retains the Markov property when evolving conditionally on Y_{0:t} = y_{0:t} in the time-reversed direction. Moreover, the distribution of X_s given X_{s+1} and Y_{0:t} = y_{0:t} is, for s ≤ t, given by the backward kernel denoted by ←Q_{φ_{s;θ}}. Under the assumption of a fully dominated model the backward kernel can, for x_{s+1} ∈ X and f ∈ F(𝒳), be expressed as

←Q_{φ_{s;θ}} f(x_{s+1}) := ∫ f(x_s) q_θ(x_s, x_{s+1}) φ_{s;θ}(dx_s) / ∫ q_θ(x′_s, x_{s+1}) φ_{s;θ}(dx′_s).

Consequently, using the backward kernel, we may express the joint-smoothing distribution φ_{0:t|t;θ} as

φ_{0:t|t;θ} = φ_{t;θ} T_{t;θ},   (2.5)
1, ˜ h s is a measurable function on X . This setting allows for recursivecomputation of { T t ; θ h t } t ∈ N according to T t +1; θ h t +1 ( x t +1 ) = Z { T t ; θ h t ( x t ) + ˜ h t ( x t : t +1 ) } ←− Q φ t ; θ ( x t +1 , d x t ); (2.6)see [3, 7]. The recursion (2.6) will be a key ingredient also in the coming developments. The gradient of the prediction filter, known as the filter derivative or the tangent filter, is asigned measure which we, following [8, Section 2], define as follows. We start with consideringthe gradient of φ t | t − θ and then extract the marginal of this quantity. First, note that for all f t ∈ F ( X (cid:15) ( t +1) ),1 L θ ( y t − ) ∇ θ Z f t ( x t ) χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t ) = φ t | t − θ ( h t ; θ f t ) , (2.7)where h t ; θ ( x t ) = P t − s =0 ˜ h s ; θ ( x s : s +1 ), with˜ h s ; θ ( x s : s +1 ) = ∇ θ log g s ; θ ( x s ) + ∇ θ log q θ ( x s , x s +1 ) , (2.8)is complete data score function . The identity (2.7) is established by, first, swapping, on the lefthand side, the order of differentiation and integration and, second, dividing and multiplying theintegrand by the complete data likelihood. In the case f t = X t +1 , (2.7) coincides with Fisher’sidentity [see, e.g., 4, Proposition 10.1.6]. Now, using (2.7), we obtain straightforwardly ∇ θ φ t | t − θ f t = ∇ θ L θ ( y t − ) Z f t ( x t ) χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t )= 1 L θ ( y t − ) ∇ θ Z f t ( x t ) χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t ) − φ t | t − θ f t L θ ( y t − ) ∇ θ Z χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t )= φ t | t − θ ( h t ; θ f t ) − φ t | t − θ f t × φ t | t − θ h t ; θ . (2.9) . Olsson and J. Westerborn Alenl¨ov/Online estimation of tangent filters In other words, ∇ θ φ t | t − θ f t coincides with the covariance of f t and the complete data scorefunction h t ; θ under φ t | t − θ . 
Thus, in order to obtain, from (2.9), an expression of the gradient of the predictor distribution π_{t;θ}, which is the marginal of φ_{0:t|t−1;θ} with respect to the last state, we consider f_t ∈ F(𝒳), let f̄_t(x_{0:t}) = f_t(x_t), and use the backward decomposition (2.5) to obtain

∇_θ π_{t;θ} f_t = π_{t;θ}{(T_{t;θ} h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ}) f_t}.   (2.10)

The tangent filter η_{t;θ} is now defined as the signed measure

η_{t;θ} f_t := ∇_θ π_{t;θ} f_t,   f_t ∈ F(𝒳).

Since the predictors as well as the backward statistics {T_{t;θ} h_t}_{t∈N} may be updated recursively through the filtering recursion (2.3)–(2.4) and the recursion (2.6), respectively, the identity (2.10) allows for recursive updating of the filter derivatives as well. Still, since neither the predictors nor the backward statistics are generally available in a closed form, we need to resort to approximations. This will be the topic of the next section.

Clearly, the approach to filter derivative estimation outlined above requires access to the filter and predictor distributions as well as the backward statistics. However, these quantities are available in a closed form only in very exceptional cases, e.g. when the state-space model is linear Gaussian or the state space X is a finite set. In the general case we have to rely on approximations, and in this paper we employ SMC methods for this purpose. As mentioned in the introduction, SMC methods are genetic-type algorithms propagating recursively a random sample of particles with associated weights through repeated importance sampling. In the following we assume that all random variables are well defined on a common probability space (Ω, F, P).

The bootstrap particle filter [18] updates sequentially in time a set of particles with associated importance weights in order to approximate the filter and prediction distribution flows {φ_{t;θ}}_{t∈N} and {π_{t;θ}}_{t∈N}, respectively, given the sequence {y_t}_{t∈N} of observations.
These methods are best described in a recursive manner. Thus, assume that we have at hand a particle sample {ξ^i_t}_{i=1}^N approximating the predictor π_{t;θ} in the sense that for all f ∈ F(𝒳), as N tends to infinity,

π^N_{t;θ} f := N^{−1} Σ_{i=1}^{N} f(ξ^i_t) ≈ π_{t;θ} f.

Using the updating step (2.3) of the filtering recursion we can, as soon as the observation y_t becomes available, transform the uniformly weighted particle sample {ξ^i_t}_{i=1}^N into a weighted particle sample approximating φ_{t;θ} by associating each particle in the sample with the importance weight ω^i_t := g_{t;θ}(ξ^i_t). Now, the weighted particle sample {(ω^i_t, ξ^i_t)}_{i=1}^N targets φ_{t;θ} in the sense that for all f ∈ F(𝒳), as N tends to infinity,

φ^N_{t;θ} f := Σ_{i=1}^{N} (ω^i_t / Ω_t) f(ξ^i_t) ≈ φ_{t;θ} f,

where Ω_t := Σ_{i=1}^{N} ω^i_t denotes the weight sum. In order to form a uniformly weighted particle sample {ξ^i_{t+1}}_{i=1}^N targeting the next predictor π_{t+1;θ} we plug the filter approximation φ^N_{t;θ} into the prediction step (2.4), which results in an approximation of π_{t+1;θ} given by a mixture proportional to Σ_{i=1}^{N} ω^i_t Q_θ(ξ^i_t, ·). The particle cloud is then updated through sampling from this mixture, an operation that is typically carried through in two steps, selection and mutation, which are analogous to updating and prediction. In the selection step, a set {I^i_{t+1}}_{i=1}^N of indices is drawn from Pr({ω^i_t}_{i=1}^N), where for any positive numbers {a_i}_{i=1}^k, Pr({a_i}_{i=1}^k) denotes the categorical distribution on {1, . . . , k} induced by the probabilities proportional to {a_i}_{i=1}^k. After this, the mutation step propagates the particles forwards according to the dynamics of the state process, i.e., for all i ∈ {1, . . . , N},

ξ^i_{t+1} ∼ Q_θ(ξ^{I^i_{t+1}}_t, ·).
The algorithm is initialised by drawing {ξ^i_0}_{i=1}^N ∼ χ^{⊗N}.

The bootstrap particle filter is well suited for approximating the filter and prediction distribution flows, but it can also be used for estimating the flow of joint-smoothing distributions. In the Poor man's smoother, this is done by tracing, backwards in time, the genealogical history of the particles, and using the empirical measures associated with these ancestral lineages as approximations of the joint-smoothing distributions. Unfortunately, the repeated resampling operations of the bootstrap particle filter collapse these trajectories in the long run, implying, for large t, the existence of a random time point T < t before which all the trajectories coincide, i.e., all the particles {ξ^i_t}_{i=1}^N have a common time-T ancestor. Thus, this naive approach leads to a severely depleted estimator; see [24, 25] for discussions of this particle path degeneracy phenomenon and [20, 35, 34] for theoretical analyses of the same.

A way of detouring the particle path degeneracy phenomenon goes via the backward decomposition (2.5). Moreover, as we will see next, (2.6) provides, when the objective function of interest is of additive form, a means for recursive, non-collapsed particle smoothing. This requires each backward kernel ←Q_{φ_{s;θ}}, s ∈ N, to be approximated, which is naturally done through

←Q_{φ^N_{s;θ}} f(x) = Σ_{i=1}^{N} [ω^i_s q_θ(ξ^i_s, x) / Σ_{ℓ=1}^{N} ω^ℓ_s q_θ(ξ^ℓ_s, x)] f(ξ^i_s),   (x, f) ∈ X × F(𝒳).

The forward-only smoothing algorithm proposed by [7] consists in replacing, in (2.6), ←Q_{φ_{t;θ}} by the approximation ←Q_{φ^N_{t;θ}}. Proceeding again recursively and assuming that we have at hand estimates {τ̃^i_t}_{i=1}^N of {T_{t;θ} h_t(ξ^i_t)}_{i=1}^N, we obtain the updating formula

τ̃^i_{t+1} = Σ_{j=1}^{N} [ω^j_t q_θ(ξ^j_t, ξ^i_{t+1}) / Σ_{ℓ=1}^{N} ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})] (τ̃^j_t + h̃_t(ξ^j_t, ξ^i_{t+1})).   (2.11)

The recursion is initialised by setting τ̃^i_0 = 0 for all i ∈ {1, . . . , N}. On the basis of these estimates, each expectation φ_{0:t|t−1;θ} h_t may then be approximated by

φ^N_{0:t|t−1;θ} h_t := N^{−1} Σ_{i=1}^{N} τ̃^i_t.

This method allows for online estimation of φ_{0:t+1|t;θ} h_{t+1}. It also has the appealing property that only the current statistics {τ̃^i_t}_{i=1}^N and particle cloud {(ξ^i_t, ω^i_t)}_{i=1}^N need to be stored. However, since the updating formula (2.11) requires, at each time step, a sum of N terms to be computed for each particle, the resulting algorithm has a computational complexity that grows quadratically with the number N of particles. Needless to say, this forces the user to keep the particle sample size relatively small, resulting in low precision.

In order to increase the accuracy for a given computational budget, [34] replace (2.11) by the Monte Carlo estimate

τ^i_{t+1} = Ñ^{−1} Σ_{j=1}^{Ñ} ( τ^{J^{(i,j)}_{t+1}}_t + h̃_t(ξ^{J^{(i,j)}_{t+1}}_t, ξ^i_{t+1}) ),   (2.12)

where the sample size Ñ ∈ N* (the so-called precision parameter) is typically very small compared to N and {J^{(i,j)}_{t+1}}_{j=1}^{Ñ} are i.i.d. samples from Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N). Using the updated {τ^i_{t+1}}_{i=1}^N, an estimate of φ_{0:t+1|t;θ} h_{t+1} = π_{t+1;θ} T_{t+1;θ} h_{t+1} is obtained as N^{−1} Σ_{i=1}^{N} τ^i_{t+1}. As before, the algorithm is initialised by setting τ^i_0 = 0 for i ∈ {1, . . . , N}.

In its most basic form, also this approach has an O(N²) complexity, since it requires the normalising constant Σ_{ℓ=1}^{N} ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1}) of the distribution Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N) to be computed particle-wise, i.e. for all ξ^i_{t+1}. In order to speed up the algorithm we apply the accept-reject sampling approach proposed by [12].
This technique presupposes that there exists a constant ε̄ > 0 such that q_θ(x, x′) ≤ ε̄ for all (x, x′) ∈ X², an assumption that is satisfied for most models. Then, in order to sample from Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N), a candidate J* ∼ Pr({ω^i_t}_{i=1}^N) is accepted with probability q_θ(ξ^{J*}_t, ξ^i_{t+1})/ε̄. This procedure is repeated until acceptance. Under certain mixing assumptions (see Assumption 2 below) it can be shown (see Douc et al. [12, Proposition 2] and Olsson and Westerborn [34, Theorem 10]) that the expected number of trials needed by this approach, which is referred to as the particle-based, rapid incremental smoother (PaRIS), to update {τ^i_t}_{i=1}^N into {τ^i_{t+1}}_{i=1}^N is O(ÑN).

A key discovery of [34] is that the algorithm converges, as N tends to infinity, for all fixed Ñ ∈ N*. In particular, Olsson and Westerborn [34, Corollary 5] provide a CLT with an asymptotic variance that may, as long as Ñ ≥ 2, be shown to grow only linearly with t, which is optimal for a Monte Carlo estimator on the product space [we refer to 34, Section 1, for a discussion]. In addition, the additional asymptotic variance imposed by the "de-Rao-Blackwellisation" strategy (2.12) is inversely proportional to Ñ − 1, which suggests that Ñ should be kept at a modest value (in fact, using, say, Ñ ∈ {2, 3} typically works well in simulations). On the other hand, in the case Ñ = 1, the estimator suffers from a degeneracy problem that resembles closely that of the Poor man's smoother, implying an asymptotic variance that grows quadratically fast with t. Similar stochastic stability aspects of the tangent filter estimator proposed in the present paper will be discussed in Section 3.2.
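The accept-reject draw of a backward index just described can be sketched as follows in Python. The concrete weights, particle locations, and Gaussian transition density are illustrative stand-ins for ω^ℓ_t, ξ^ℓ_t, and q_θ (they are not taken from the paper); the bound EPS_BAR plays the role of ε̄.

```python
import math
import random

def q(x, x_next, sigma=1.0):
    # Illustrative Gaussian transition density standing in for q_theta(x, x').
    return math.exp(-0.5 * (x_next - x) ** 2 / sigma ** 2) / (sigma * math.sqrt(2 * math.pi))

EPS_BAR = 1.0 / math.sqrt(2 * math.pi)  # sup of q, attained at x' = x (role of eps-bar)

def sample_backward_index(weights, particles, x_next, rng):
    # Draw J from Pr({omega^l q(xi^l, x_next)}) WITHOUT computing the
    # normalising constant: propose J* from Pr({omega^l}) and accept with
    # probability q(xi^{J*}, x_next) / EPS_BAR, repeating until acceptance.
    while True:
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if rng.random() * EPS_BAR <= q(particles[j], x_next):
            return j

rng = random.Random(1)
weights = [0.2, 0.5, 0.3]
particles = [-1.0, 0.0, 1.5]
x_next = 0.3
draws = [sample_backward_index(weights, particles, x_next, rng) for _ in range(20000)]
```

Since the proposal is drawn in O(1) (given a precomputed sampler for Pr({ω^ℓ})) and the expected number of rejections stays bounded under the mixing assumptions discussed above, each backward index costs O(1) expected work, giving the O(ÑN) overall complexity of the PaRIS update.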
3. Tangent filter estimation: main results
The estimator that we propose is based on the identity (2.10), which we, for a given θ, approximate online using the PaRIS algorithm. More specifically, we estimate, on-the-fly as new observations appear, the flow {η_{t;θ}}_{t∈N} of tangent filters. This is done by running a particle filter targeting the predictor distributions while updating estimates {τ^i_t}_{i=1}^N of {T_{t;θ} h_{t;θ}(ξ^i_t)}_{i=1}^N, where h_{t;θ} is the complete data score given by (2.8). The statistics {τ^i_t}_{i=1}^N are updated efficiently using the PaRIS updating rule (2.12). At each time step t, our algorithm returns

η^N_{t;θ} f_t = N^{−1} Σ_{i=1}^{N} ( τ^i_t − N^{−1} Σ_{j=1}^{N} τ^j_t ) f_t(ξ^i_t)

as an estimate of η_{t;θ} f_t for any f_t ∈ F(𝒳). We summarise the algorithm in the pseudo-code below, where every line involving the index i should be executed for all i ∈ {1, . . . , N}.

draw ξ^i_0 ∼ χ
set τ^i_0 ← 0
for t ← 0, 1, 2, . . . do
  set ω^i_t ← g_{t;θ}(ξ^i_t)
  draw I^i_{t+1} ∼ Pr({ω^ℓ_t}_{ℓ=1}^N)
  draw ξ^i_{t+1} ∼ q_θ(ξ^{I^i_{t+1}}_t, ·)
  for j ← 1, . . . , Ñ do
    draw J^{(i,j)}_{t+1} ∼ Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N)
  end for
  set τ^i_{t+1} ← Ñ^{−1} Σ_{j=1}^{Ñ} ( τ^{J^{(i,j)}_{t+1}}_t + h̃_{t;θ}(ξ^{J^{(i,j)}_{t+1}}_t, ξ^i_{t+1}) )
  set τ̄_{t+1} ← N^{−1} Σ_{ℓ=1}^{N} τ^ℓ_{t+1}
  set η^N_{t+1;θ} ← N^{−1} Σ_{ℓ=1}^{N} ( τ^ℓ_{t+1} − τ̄_{t+1} ) δ_{ξ^ℓ_{t+1}}
end for

Remark 1. The previous algorithm can be viewed as a "de-Rao-Blackwellisation" of the tangent filter estimator proposed in [8], which is based on the updating rule (2.11) instead of (2.12). With the exception of this important difference, the two algorithms follow the same recursion. Consequently, the algorithm of [8] has an O(N²) complexity.

The first part of the convergence analysis of our algorithm will be carried through under the following, relatively mild, assumption.
Assumption 1.
(i) For all t ∈ N and θ ∈ Θ, g_{t;θ} is a positive and bounded measurable function.
(ii) For all θ ∈ Θ, q_θ ∈ F(𝒳^{⊗2}).

Assumption 1(i) implies finiteness and positiveness of the particle importance weights, while Assumption 1(ii) allows, apart from some technical arguments needed in the proofs, the accept-reject sampling technique mentioned in Section 2.4 to be used.

We begin by establishing an exponential concentration inequality, valid for all finite particle sample sizes N, for our PaRIS-based tangent filter estimator. This yields the strong (P-a.s.) consistency (as N tends to infinity) of the algorithm as a corollary. In addition, we state and establish a CLT with an asymptotic variance given by the sum of the asymptotic variance of the estimator proposed by [8] and one term reflecting the additional variance introduced by the supplementary sampling performed in the PaRIS version. Finally, we are able to derive, by operating with strong mixing assumptions, a time-uniform bound on the asymptotic variance, which is inversely proportional to Ñ − 1. This guarantees that the algorithm is stochastically stable in the long run. All proofs, which use results obtained by [34], are postponed to Appendix A.
Theorem 2.
Let Assumption 1 hold. Then for all t ∈ N, θ ∈ Θ, f_t ∈ F(X), and Ñ ∈ N* there exist (c_t, c̃_t) ∈ (R*_+)² (depending on θ, Ñ, and f_t) such that for all N ∈ N* and ε > 0,

P( |η^N_{t;θ} f_t − η_{t;θ} f_t| ≥ ε ) ≤ c_t exp( −c̃_t N ε (ε ∧ 1) ).

The strong consistency of the estimator follows.
Corollary 3.
Let Assumption 1 hold. Then for all t ∈ N, θ ∈ Θ, f_t ∈ F(X), and Ñ ∈ N* it holds, P-a.s., lim_{N→∞} η^N_{t;θ} f_t = η_{t;θ} f_t.

We now turn to weak convergence. For this purpose, we introduce, for all t ∈ N, the unnormalised kernels

L_{t;θ} f(x) := g_{t;θ}(x) Q_θ f(x),  (x, f) ∈ X × F(X),

and, for all (s, t) ∈ N² such that s ≤ t, the retro-prospective kernels

D_{s,t;θ} f(x_s) := ∫∫ f(x_{0:t}) g_{t;θ}(x_t) T_{s;θ}(x_s, dx_{0:s−1}) L_{s;θ} ⊗ ··· ⊗ L_{t−1;θ}(x_s, dx_{s+1:t}),
D̃_{s,t;θ} f(x_s) := D_{s,t;θ}( f − φ_{t|t−1;θ} f )(x_s),

for (x_s, f) ∈ X × F(X^{⊗(t+1)}), operating simultaneously in the backward and forward directions.

Theorem 4.
Let Assumption 1 hold. For all t ∈ N, θ ∈ Θ, Ñ ∈ N*, and f_t ∈ F(X), as N → ∞,

√N ( η^N_{t;θ} f_t − η_{t;θ} f_t ) →_D σ_{t;θ}(f_t) Z,

where Z is a standard Gaussian random variable and

σ²_{t;θ}(f_t) := σ̃²_{t;θ}(f_t) + Σ_{s=0}^{t−1} Σ_{ℓ=0}^{s} Ñ^{ℓ−(s+1)} ς_{s,ℓ,t;θ}(f_t),  (3.1)

where

σ̃²_{t;θ}(f_t) := Σ_{s=0}^{t−1} π_{s+1;θ} [ D̃_{s+1,t;θ}{ (h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ})(f_t − π_{t;θ} f_t) } ]² / ( π_{s+1;θ} L_{s+1;θ} ··· L_{t−1;θ} 𝟙_X )²  (3.2)

is the asymptotic variance of the tangent filter estimator proposed by [8] and

ς_{s,ℓ,t;θ}(f_t) := π_{ℓ+1;θ}{ ←Q_{φ_{ℓ;θ}}( T_{ℓ;θ} h_{ℓ;θ} + h̃_{ℓ;θ} − T_{ℓ+1;θ} h_{ℓ+1;θ} )² L_{ℓ+1;θ} ··· L_{s;θ}( L_{s+1;θ} ··· L_{t−1;θ}{ f_t − π_{t;θ} f_t } )² } / [ ( π_{ℓ+1;θ} L_{ℓ+1;θ} ··· L_{s;θ} 𝟙_X )( π_{s+1;θ} L_{s+1;θ} ··· L_{t−1;θ} 𝟙_X )² ].

Remark 5.
We note that, for all t ∈ N,

lim_{Ñ→∞} σ²_{t;θ}(f_t) = σ̃²_{t;θ}(f_t).

This is in line with our expectations, as the tangent filter estimator proposed by [8] can be viewed as a Rao-Blackwellisation of our estimator.

We now turn to the numerical stability of the algorithm and show that the error of our tangent filter estimator stays uniformly bounded in time. There are several ways of formulating such stability; in the present paper we follow [6, 13] and establish a time-uniform bound on the asymptotic variance of the estimator. The analysis requires Assumption 1 to be sharpened as follows.
Assumption 2. (i) There exists ρ > 1 such that for all θ ∈ Θ and (x, x′) ∈ X²,

ρ^{−1} ≤ q_θ(x, x′) ≤ ρ.

(ii) There exist constants 0 < δ < δ̄ < ∞ such that for all t ∈ N, all θ ∈ Θ, and x ∈ X,

δ ≤ g_{t;θ}(x) ≤ δ̄.

Under Assumption 2 we define ϱ := 1 − ρ^{−2}.

These (admittedly very restrictive) strong mixing assumptions typically require the state space X of the underlying Markov chain to be a compact set. Nevertheless, such assumptions are standard in the SMC analysis literature; see [6] and, e.g., [5, 13] for refinements. Finally, we assume that the terms (2.8) of the complete data score function are uniformly bounded.
There exists a constant |h̃|_∞ ∈ R_+ such that for all s ∈ N and θ ∈ Θ, osc(h̃_{s;θ}) ≤ |h̃|_∞.

In conformity with the strong mixing assumptions, Assumption 3 also typically requires the state space X to be compact.

The previous assumptions allow us to establish the following theorem, providing a time-uniform bound on the sequence {σ²_{t;θ}(f)}_{t∈N} of asymptotic variances. Generally, when the objective function is of additive form, the variance of a Monte Carlo estimator typically increases linearly with the number of terms, or, equivalently, with the dimension of the underlying product space [see e.g. 34, Section 1, for a discussion]. Thus, as our estimator is based on the identities (2.9) and (2.10), which express the tangent filter in terms of additive smoothed expectations over the path space, it may seem surprising that we are able to derive a time-uniform bound on the asymptotic variance. However, recall from (2.9) that the tangent filter η_{t;θ} f_t coincides with the covariance of f_t and the complete data score function h_{t;θ} under φ_{t|t−1;θ}. Thus, when the model exhibits forgetting properties, which is the case under the strong mixing assumptions above, the sequence {η_{t;θ} f_t}_{t∈N} stays bounded, and time-uniform convergence of the Monte Carlo estimator is hence within reach.

Theorem 6.
Let Assumptions 2 and 3 hold. Then there exist constants (c, c̃) ∈ (R*_+)² such that for all t ∈ N, θ ∈ Θ, Ñ ≥ 2, and f_t ∈ F(X),

σ²_{t;θ}(f_t) ≤ ‖f_t‖²_∞ ( c + c̃ / ((Ñ − 1)(1 − ϱ)) )  and  σ̃²_{t;θ}(f_t) ≤ c ‖f_t‖²_∞.  (3.3)

The bound (3.3) on the first term (3.2) of the asymptotic variance, corresponding to the variance of the Rao-Blackwellised algorithm, was derived by [8]. As expected from the theoretical results obtained by [34], the bound on the second term is inversely proportional to Ñ − 1, which indicates that boosting the number Ñ of PaRISian backward draws from a moderate to a very large value will have a negligible effect on the precision. We hence recommend keeping Ñ low, say at most 5, in order to gain computational efficiency. A more detailed discussion on the choice of Ñ can be found in [34].
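The practical content of this recommendation can be seen from the Ñ-dependent factor 1/(Ñ − 1) in the bound. The arithmetic below is purely illustrative (the constants c and c̃ are model dependent and not computed here); it only makes the diminishing returns in Ñ explicit.

```python
def excess_factor(n_tilde: int) -> float:
    """The N~-dependent factor 1/(N~ - 1) multiplying the excess variance
    term in the time-uniform bound; constants are left out on purpose."""
    return 1.0 / (n_tilde - 1)

# Going from N~ = 2 to N~ = 5 already removes three quarters of the
# N~-dependent excess, while N~ -> infinity can remove at most the rest.
factors = {n: excess_factor(n) for n in range(2, 11)}
```

For instance, `factors[2]` equals 1 and `factors[5]` equals 0.25, so increasing Ñ beyond a handful of backward draws buys very little additional precision per unit of computational work.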
4. Application to recursive maximum likelihood estimation
Given a fixed data record y_{0:t}, our goal is to perform maximum likelihood estimation of θ, i.e., to find the parameter vector θ* ∈ Θ such that θ* = arg max_{θ∈Θ} L_θ(y_{0:t}), where L_θ(y_{0:t}) is the likelihood of the observed data given in (2.2). Equivalently, we may maximise the log-likelihood ℓ_θ(y_{0:t}) = log L_θ(y_{0:t}). There are many different approaches to such maximisation; see [4, Chapter 10] for a general overview.

Here we will focus on the following Robbins-Monro scheme: at iteration n, let

θ_n = θ_{n−1} + γ_n Z_n,

where Z_n is a noisy measurement of ∇_θ ℓ_θ(y_{0:t})|_{θ=θ_{n−1}}, i.e., of the score of the observed data evaluated at θ = θ_{n−1}, and {γ_n}_{n∈N*} is a sequence of positive step sizes satisfying the standard stochastic approximation requirements Σ_{n=1}^∞ γ_n = ∞ and Σ_{n=1}^∞ γ²_n < ∞. Note that this approach requires approximation of Z_n at each iteration of the algorithm. If the number of observations is very large, computing Z_n is costly, and since many iterations are often required for convergence, this results in an impractical algorithm. Moreover, if we receive a new observation, Z_n needs to be recalculated, which turns the procedure into an offline algorithm.

We now sketch the basic principles of recursive maximum likelihood. First note that since {(X_t, Y_t)}_{t∈N} is a Markov chain, the quadruple {(X_t, Y_t, π_{t;θ}, η_{t;θ})}_{t∈N} forms, by (2.3) and (2.4) and for all θ ∈ Θ, a Markov chain as well. In the case where the state space X is a finite set, [27] showed that this chain is ergodic under certain mixing assumptions; this result was later extended by [11] to the case where X is compact.
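The generic Robbins-Monro recursion introduced above is easy to state in code. The sketch below is a toy illustration only, unrelated to SSMs: it maximises a one-dimensional Gaussian log-likelihood from a stream of noisy score evaluations, with step sizes γ_n = n^{−0.7} (an assumed choice) satisfying the two summability conditions.

```python
import numpy as np

def robbins_monro(grad_estimate, theta0, n_iter, gamma=lambda n: n ** -0.7):
    """Generic Robbins-Monro recursion theta_n = theta_{n-1} + gamma_n * Z_n,
    where Z_n = grad_estimate(theta_{n-1}) is a noisy gradient measurement.
    The step sizes gamma_n = n^{-0.7} satisfy sum gamma_n = inf and
    sum gamma_n^2 < inf."""
    theta = theta0
    for n in range(1, n_iter + 1):
        theta = theta + gamma(n) * grad_estimate(theta)
    return theta

# Toy example: fit the mean of an N(mu, 1) model from streaming data.
# The score of a single observation y is (y - mu); its expectation
# vanishes exactly at the true mean mu* = 2.0.
rng = np.random.default_rng(0)
mu_hat = robbins_monro(lambda mu: rng.normal(2.0, 1.0) - mu,
                       theta0=0.0, n_iter=20000)
```

Each call to `grad_estimate` consumes one fresh observation, which is precisely the online flavour sought after in the RML setting below.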
Now, assume that the data is generated by a model parameterised by a distinctive θ* ∈ Θ; then, denoting by Π_{θ,θ*} the stationary distribution of this chain and Π̃_{θ,θ*}(·) = Π_{θ,θ*}(X × ·) its marginal with respect to the last three components, it holds for all θ ∈ Θ, P-a.s.,

lim_{t→∞} t^{−1} ∇_θ ℓ_θ(Y_{0:t}) = lim_{t→∞} t^{−1} Σ_{s=0}^t ∇_θ ℓ_θ(Y_s | Y_{0:s−1})
= lim_{t→∞} t^{−1} Σ_{s=0}^t [ π_{s;θ}(∇_θ g_{s;θ}) + η_{s;θ} g_{s;θ} ] / ( π_{s;θ} g_{s;θ} )
= ∫∫∫ [ π(∇_θ g_θ(·, y)) + η(g_θ(·, y)) ] / [ π(g_θ(·, y)) ] Π̃_{θ,θ*}(d(y, π, η)) =: λ(θ, θ*),

where π_{s;θ}(∇_θ g_{s;θ}), η_{s;θ} g_{s;θ}, and π_{s;θ} g_{s;θ} depend implicitly on Y_s. These identities follow, in order, from the decomposition (1.1) of the log-likelihood, from applying the nabla operator, and from Birkhoff's ergodic theorem. By identifiability [see, e.g., 4, Section 12.4], the true parameter θ* solves λ(θ, θ*) = 0, and the task of solving this equation may be cast into the framework of stochastic approximation and the Robbins-Monro algorithm

θ_{t+1} = θ_t + γ_{t+1} ζ_{t+1},  t ∈ N,  (4.1)

where ζ_{t+1} is a noisy observation of λ(θ_t, θ*). Ideally, such a noisy observation would be formed by estimating λ(θ_t, θ*) by a draw from Π̃_{θ,θ*}, which is, needless to say, not possible in practice. Thus, by introducing one more level of approximation and allowing for Markovian perturbations, one may instead, following (2.10), estimate λ(θ_t, θ*) by ζ_{t+1} := (ζ¹_{t+1} + ζ²_{t+1}) / ζ³_{t+1}, where

ζ¹_{t+1} := π_{t+1}(∇_θ g_{t+1;θ}|_{θ=θ_t}),
ζ²_{t+1} := η_{t+1} g_{t+1;θ_t},
ζ³_{t+1} := π_{t+1} g_{t+1;θ_t},

with η_{t+1} f := π_{t+1}{ (T_{t+1} − π_{t+1} T_{t+1}) f }, f ∈ F(X), and the statistics {T_t}_{t∈N} being, following (2.6), updated recursively according to

T_{t+1}(x_{t+1}) = ∫ { T_t(x_t) + h̃_{t;θ_t}(x_{t:t+1}) } ←Q_{φ_t}(x_{t+1}, dx_t),  x_{t+1} ∈ X,  (4.2)

and initialised as T_0 ≡ 0. The statistic (4.2) depends on the new observation Y_{t+1} through the term h̃_{t;θ_t}. In addition, the measures {φ_t}_{t∈N} and {π_t}_{t∈N} are updated according to the recursion

φ_t f = π_t(g_{t;θ_t} f) / π_t g_{t;θ_t},  π_{t+1} f = φ_t Q_{θ_t} f,  f ∈ F(X),  (4.3)

with, by convention, π_0 := χ. Note that the recursion (4.3) updates the filter and predictor distributions on the basis of the current parameter fit. This means that for all t ∈ N, the measures φ_t, π_{t+1}, and η_{t+1} depend on all the past parameter values {θ_s}_{s=0}^t.

This approach yields an online algorithm in which a new observation can be incorporated into the estimator without recalculating the latter from scratch. In the case where X is finite, this algorithm was studied in [28], which also provides assumptions under which the sequence {θ_t}_{t∈N*} converges towards the parameter θ* [see also 38, for refinements].

In the general case, the recursions (4.2) and (4.3) are intractable. Their translation into the language of particles is, however, immediate: one approximates (4.3) by a particle filter updated according to the time-varying parameters, and (4.2) by means of a PaRISian updating step (2.12), the latter executed for some suitable precision parameter Ñ ≥ 2. The parameter fit is then updated recursively as each new observation y_{t+1} is received, by first calculating a particle approximation ζ̂_{t+1} of ζ_{t+1} and then updating the parameter according to θ_{t+1} = θ_t + γ_{t+1} ζ̂_{t+1}, where {γ_t}_{t∈N*} satisfies the standard stochastic optimisation requirements (i.e., infinite sum and finite sum of squares) above. The algorithm, which is illustrated numerically in the next section, is detailed in pseudo-code below; as before, every line involving the index i should be executed for all i ∈ {1, …, N}.

set arbitrarily θ_0
draw ξ^i_0 ∼ χ
set τ^i_0 ← 0
for t ← 0, 1, 2, … 
do
    set ω^i_t ← g_{θ_t}(ξ^i_t, y_t)
    draw I^i_{t+1} ∼ Pr({ω^ℓ_t}_{ℓ=1}^N)
    draw ξ^i_{t+1} ∼ q_{θ_t}(ξ^{I^i_{t+1}}_t, ·)
    for j ← 1, …, Ñ do
        draw J^{(i,j)}_{t+1} ∼ Pr({ω^ℓ_t q_{θ_t}(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N)
    end for
    set τ^i_{t+1} ← Ñ^{-1} Σ_{j=1}^{Ñ} ( τ_t^{J^{(i,j)}_{t+1}} + h̃_{t;θ_t}(ξ_t^{J^{(i,j)}_{t+1}}, ξ^i_{t+1}) )
    set τ̄_{t+1} ← N^{-1} Σ_{ℓ=1}^N τ^ℓ_{t+1}
    set ζ̂¹_{t+1} ← N^{-1} Σ_{ℓ=1}^N ∇g_{θ_t}(ξ^ℓ_{t+1}, y_{t+1})
    set ζ̂²_{t+1} ← N^{-1} Σ_{ℓ=1}^N ( τ^ℓ_{t+1} − τ̄_{t+1} ) g_{θ_t}(ξ^ℓ_{t+1}, y_{t+1})
    set ζ̂³_{t+1} ← N^{-1} Σ_{ℓ=1}^N g_{θ_t}(ξ^ℓ_{t+1}, y_{t+1})
    set θ_{t+1} ← θ_t + γ_{t+1} ( ζ̂¹_{t+1} + ζ̂²_{t+1} ) / ζ̂³_{t+1}
end for

We benchmark our algorithm on two different models, namely
• the stochastic volatility model of [19], where we compare our estimator to that of [8], and
• the simultaneous localisation and mapping (SLAM) model, to which we apply our algorithm as a proof of concept.

Consider the stochastic volatility model

X_{t+1} = φ X_t + σ V_{t+1},
Y_t = β exp(X_t / 2) U_t,  t ∈ N,

proposed by [19], where {V_t}_{t∈N*} and {U_t}_{t∈N} are independent sequences of mutually independent standard Gaussian noise variables. In this model the parameters to be estimated are θ = (φ, σ², β). We compared our PaRISian RML estimator to the estimator proposed by [8], the latter being referred to as the "Rao-Blackwellised estimator". Since there is a significant difference in computational complexity between the algorithms, we chose the particle sample size of each algorithm in such a way that the running times of the two algorithms are close to identical. With our implementation, using N = 100 in the algorithm of [8] yields close to the same CPU time as using (N, Ñ) = (1400,
2) in the PaRISian RML estimator. For both algorithms we let { γ t } t ∈ N = { t − . } t ∈ N . The algorithms were executed on an observation data record comprising500 ,
000 observations generated under the parameters θ* = (0. , . , . ). As can be seen in Figure 1, the parameter estimates produced by the PaRISian version exhibit appreciably lower variance than those produced by the Rao-Blackwellised estimator (using a considerably smaller particle sample size). Indeed, computing the sample variances over the final parameter estimates indicates an improvement by almost an order of magnitude to the benefit of the PaRISian version for the same computational time. It can also be seen in Figure 1 that the trajectories exhibit some jumps, mainly in the β variable. This occurs when the estimate ζ̂³_{t+1} is close to zero, which corresponds to time steps when the particles fail to cover the support of the emission density. Since the computationally more efficient PaRISian version allows considerably more particles to be used for the given budget, the problem is much smaller for this algorithm; indeed, whereas the learning curves of the Rao-Blackwellised version exhibit a high degree of volatility and several large jumps (the peaks are cut for visibility) with subsequent very long excursions in the parameter space, the PaRISian version exhibits only one significant jump of moderate size.

Fig 1. Particle learning trajectories produced by the PaRISian RML estimator (left panel) and the Rao-Blackwellised RML estimator (right panel) proposed in [8] for, from top to bottom, φ, σ², and β. The former and latter algorithms used N = 1,400 and N = 400 particles, respectively (leading to comparable CPU times). For each algorithm, replicates were generated on the same data set with different, randomised initial parameters (being the same for both algorithms). For the Rao-Blackwellised version, the plot of β does not contain the full trajectories due to very high peaks.
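The per-observation parameter update in the pseudo-code above can be isolated in a short routine. The sketch below is schematic (the function name is ours, and the arrays of particle weights, weight gradients, and PaRIS statistics are assumed to be produced elsewhere by the particle filter pass); it forms the estimates ζ̂¹, ζ̂², ζ̂³ and applies the step θ_{t+1} = θ_t + γ_{t+1}(ζ̂¹ + ζ̂²)/ζ̂³.

```python
import numpy as np

def rml_parameter_step(theta, gamma, w_grad, w, tau):
    """One recursive-maximum-likelihood parameter step (schematic sketch).

    Arguments, all assumed precomputed for the current particle cloud:
      w_grad : array of grad_theta g_theta(xi_i, y) per particle
      w      : array of g_theta(xi_i, y) per particle
      tau    : array of PaRIS statistics tau_i per particle
    Returns theta + gamma * (zeta1 + zeta2) / zeta3.
    """
    zeta1 = w_grad.mean()                    # particle estimate of pi(grad g)
    zeta2 = np.mean((tau - tau.mean()) * w)  # particle estimate of eta(g)
    zeta3 = w.mean()                         # particle estimate of pi(g)
    return theta + gamma * (zeta1 + zeta2) / zeta3
```

Note that when ζ̂³ is close to zero (particles missing the support of the emission density) the step blows up, which is exactly the source of the occasional jumps visible in Figure 1.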
The simultaneous localisation and mapping (SLAM) problem is fundamental in robotics. In the version considered here we let the state space consist of the coordinates and the bearing of a robot moving in the plane, i.e. X = R² × (−π, π]. The prior motion of the robot is modeled as a Markov chain X_t := (X¹_t, X²_t, X³_t), t ∈ N, on this space, defined through the state equations

X¹_{t+1} = X¹_t + d_{t+1} cos(X³_t) + ε¹_{t+1},
X²_{t+1} = X²_t + d_{t+1} sin(X³_t) + ε²_{t+1},
X³_{t+1} = X³_t + α_{t+1} + ε³_{t+1},

where {ε^i_t}_{t∈N*}, i ∈ {1, 2, 3}, are independent sequences of mutually independent noise variables with known variances σ²_i, i ∈ {1, 2, 3}. The sequence {(d_t, α_t)}_{t∈N*} provides the commands, in terms of speed and bearing changes, of the robot at each time point.

The robot observes a landmark, defined by its position in the plane, by measuring the distance and the relative bearing to the same. Assuming L landmarks, the observations at time t are given by Y_t = {Y^i_t}_{i∈O_t}, where O_t ⊆ {1, …, L} is the set of observed landmarks. For each observed landmark i,

Y^i_t = ( ‖(θ^i_1, θ^i_2) − (X¹_t, X²_t)‖ + ε^{i,1}_t ,  arctan( (θ^i_2 − X²_t) / (θ^i_1 − X¹_t) ) − X³_t + ε^{i,2}_t ),

where ‖·‖ denotes the Euclidean norm, (θ^i_1, θ^i_2) is the location of landmark i, and (ε^{i,1}_t, ε^{i,2}_t) are independent noise variables with known variances (ς²_1, ς²_2), the noises of different time steps and different landmarks being independent.

In this setting we wish to estimate the locations of all the landmarks, which implies 2L unknown parameters. Note that the noise parameters of the model are assumed to be calibrated beforehand. Several existing works apply particle methods to the SLAM problem; see, e.g., the FastSLAM [30] and
FastSLAM 2.0 [31] algorithms. More recently, an online EM version was proposed by [26], and the Marginal-SLAM algorithm [29] is based on an updating step similar to (4.1).

Using the model above, data was simulated for L = 9 landmarks located according to the green dots in Figure 2. The observations were generated by simulating 100,000 observations of the robot moving around a closed loop, with noise parameters σ₁ = σ₂ = 0.—, σ₃ = π/—, ς₁ = 0.25, and ς₂ = π/—. The robot observes landmarks within a 30 m radius and a 180° field of vision.

Using this simulated data we performed 20 independent runs of our PaRISian RML algorithm in order to estimate the landmark positions. For each landmark, the initial estimate of its position was drawn randomly from a bivariate Gaussian distribution with the true landmark position as mean and the identity matrix as covariance matrix. The algorithm used (N, Ñ) = (500, 2) and the particles were initialised at (0, 0) with zero bearing, i.e., the same as the robot's starting position. Since the problem is invariant under translations and rotations, we fix two landmarks and assume that these are known a priori. The results are displayed in Figure 2, where the estimates cluster closely around the true parameter values. The figure also shows the estimated filtered trajectory of the robot during its first 3 laps. In order to illustrate the convergence of the landmark position estimates, we consider the evolution of their MSE over time. The results are reported in Figure 3, where the average MSE of 4 landmarks is computed based on the 20 independent runs of the algorithm. As is clear from the picture, the processing of the full observation record yields MSE values that fluctuate stably around zero.

Needless to say, even though this prototypical solution to the SLAM problem seems promising, a lot of additional work is needed to obtain an algorithm running in real applications. Moreover, the comprehensive task of benchmarking our approach against the full palette of existing algorithms treating this fundamental problem falls outside the scope of the present paper.

Fig 2. Resulting estimates of the landmarks for the SLAM problem using the PaRISian RML algorithm. The circles are the true positions of the landmarks and green stars are the resulting estimates. The black dots denote the particle estimates of the robot's positions during the first three laps. Rerunning the algorithm yields similar trajectories.
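To fix ideas, the SLAM state and observation equations above can be simulated directly. The sketch below implements only the model dynamics, not the estimation algorithm; the function names are our own, and `np.arctan2` is used in place of the plain arctan in the text so that the bearing lands in the correct quadrant.

```python
import numpy as np

def move(state, d, alpha, sig, rng):
    """Robot motion model: advance (x1, x2, bearing) by command (d, alpha)
    with Gaussian noise of standard deviations sig = (sig1, sig2, sig3)."""
    x1, x2, b = state
    x1 += d * np.cos(b) + rng.normal(0.0, sig[0])
    x2 += d * np.sin(b) + rng.normal(0.0, sig[1])
    # wrap the perturbed bearing back into (-pi, pi]
    b = (b + alpha + rng.normal(0.0, sig[2]) + np.pi) % (2 * np.pi) - np.pi
    return np.array([x1, x2, b])

def observe(state, landmark, sig_r, sig_b, rng):
    """Range-and-bearing observation of one landmark (l1, l2)."""
    x1, x2, b = state
    l1, l2 = landmark
    dist = np.hypot(l1 - x1, l2 - x2) + rng.normal(0.0, sig_r)
    bearing = np.arctan2(l2 - x2, l1 - x1) - b + rng.normal(0.0, sig_b)
    return np.array([dist, bearing])
```

With zero noise, a unit forward command from the origin moves the robot to (1, 0), and a landmark at (3, 4) is observed at distance 5 with bearing arctan2(4, 3), matching the model equations.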
5. Conclusion
We have presented a novel algorithm for efficient particle-based estimation of tangent filters in general SSMs. The algorithm involves the PaRIS algorithm proposed by [34] as a key ingredient. The estimator is furnished with several convergence results, the main result being a CLT at the rate √N. The convergence analysis builds on results obtained by [34, 8], and our time-uniform bound on the asymptotic variance extends the existing results in [8]. Importantly, under weak assumptions, the computational complexity of our algorithm grows only linearly with the number of particles, whereas the complexity of existing methods such as the estimator of [8], which may be viewed as a Rao-Blackwellisation of our estimator, is typically quadratic in the number of particles. We may thus expect the computational benefit of the "de-Rao-Blackwellisation" that characterises PaRIS to greatly exceed the price of the additional variance introduced by the PaRISian backward-sampling procedure.

In the second part of the paper we cast our PaRISian tangent filter estimator into the framework of RML, yielding a computationally efficient and easily implemented online parameter estimation procedure. As is clear from the simulations, the fact that our online estimator allows, compared to existing methods, more particles to be used for a given computational budget is important for the stability of the online estimates.

Appealingly, the asymptotic variance of our estimator can be bounded uniformly in time; in other words, the estimator stays stochastically stable in the long term.

Fig 3. Resulting mean squared error estimates for 4 of the landmarks, based on the 20 independent runs of the PaRISian RML algorithm for the SLAM problem.

Needless to say, the strong
mixing assumptions driving the stability analysis are restrictive, and relaxing these assumptions is a natural direction of research. [21] provides an O(t) bound on the asymptotic variance of the forward-filtering backward-smoothing (FFBSm) algorithm [15], which is equivalent to the Rao-Blackwellised PaRIS in the case of additive state functionals (the online formulation was, however, first given by Del Moral et al. [7]), under assumptions that point to applications on non-compact state spaces, and the same approach may be applicable to our PaRISian tangent filter estimator.

Finally, it remains to prove, under general, verifiable assumptions, the convergence of the parameter estimates produced by our PaRISian RML algorithm. Such an analysis is expected to be technically very involved, and is hence beyond the scope of the present paper. In [28], in which the convergence of exact RML in HMMs is established for the case of a finite state space, the proof consists in showing that the extended Markov chain comprising the state and observation processes as well as the prediction and tangent filter flows is ergodic, and in applying standard stochastic approximation results [9]. A route to the convergence of our proposed PaRISian RML algorithm could be to include also the particle cloud and the particle-based statistics {τ^i_t}_{i=1}^N in the Markov chain and study the ergodicity of the same. An alternative and more direct approach would be to cast the problem into the framework of biased stochastic gradient search, which was analysed recently by [39].

References

[1] B. D. O. Anderson and J. B. Moore.
Optimal Filtering. Prentice-Hall, New Jersey, 1979.
[2] O. Cappé. Ten years of HMMs (online bibliography 1989–2000). http://perso.telecom-paristech.fr/~cappe/docs/hmmbib.html, 2001.
[3] O. Cappé. Online EM algorithm for hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749, 2011.
[4] O. Cappé, E. Moulines, and T. Rydén.
Inference in Hidden Markov Models. Springer, New York, 2005.
[5] D. Crisan and K. Heine. Stability of the discrete time filter in terms of the tails of noise distributions. Journal of the London Mathematical Society. Second Series, 78(2):441–458, 2008.
[6] P. Del Moral and A. Guionnet. On the stability of interacting processes with applications to filtering and genetic algorithms. Annales de l'Institut Henri Poincaré, 37:155–194, 2001.
[7] P. Del Moral, A. Doucet, and S. Singh. A backward particle interpretation of Feynman-Kac formulae. ESAIM: Mathematical Modelling and Numerical Analysis, 44(5):947–975, 2010.
[8] P. Del Moral, A. Doucet, and S. S. Singh. Uniform stability of a particle approximation of the optimal filter derivative. SIAM Journal on Control and Optimization, 53(3):1278–1304, 2015.
[9] B. Delyon. General results on the convergence of stochastic algorithms. IEEE Transactions on Automatic Control, 41(9):1245–1255, 1996.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38 (with discussion), 1977.
[11] R. Douc and C. Matias. Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7(3):381–420, 2001.
[12] R. Douc, A. Garivier, E. Moulines, and J. Olsson. Sequential Monte Carlo smoothing for general state space hidden Markov models. Annals of Applied Probability, 21(6):2109–2145, 2011.
[13] R. Douc, E. Moulines, and J. Olsson. Long-term stability of sequential Monte Carlo methods under verifiable conditions. Annals of Applied Probability, 24(5):1767–1802, 2014.
[14] A. Doucet and V. B. Tadić. Parameter estimation in general state-space models using particle methods. Annals of the Institute of Statistical Mathematics, 55(2):409–422, 2003.
[15] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte-Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000.
[16] A. Doucet, N. De Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.
[17] P. Fearnhead, D. Wyncoll, and J. Tawn. A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464, 2010.
[18] N. Gordon, D. Salmond, and A. F. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140:107–113, 1993.
[19] J. Hull and A. White. The pricing of options on assets with stochastic volatilities. The Journal of Finance, 42:281–300, 1987.
[20] P. E. Jacob, L. M. Murray, and S. Rubenthaler. Path storage in the particle filter. Statistics and Computing, pages 1–10, 2013.
[21] A. Jasra. On the behaviour of the backward interpretation of Feynman-Kac formulae under verifiable conditions. Journal of Applied Probability, 52(2):339–359, 2015.
[22] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In
AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, 1997.
[23] N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, and N. Chopin. On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351, 2015.
[24] G. Kitagawa. Monte-Carlo filter and smoother for non-Gaussian nonlinear state space models.
Journal of Computational and Graphical Statistics, 1:1–25, 1996.
[25] G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space model. In Sequential Monte Carlo Methods in Practice, pages 177–195. Springer, New York, 2001.
[26] S. Le Corff, G. Fort, and E. Moulines. Online expectation maximization algorithm to solve the SLAM problem. In , pages 225–228, 2011.
[27] F. Le Gland and L. Mevel. Geometric ergodicity in hidden Markov models. Research Report RR-2991, INRIA, 1996.
[28] F. Le Gland and L. Mevel. Recursive estimation in hidden Markov models. In
Proceedings of the 36th IEEE Conference on Decision and Control, pages 3468–3473, 1997.
[29] R. Martinez-Cantin, N. de Freitas, and J. A. Castellanos. Analysis of particle methods for simultaneous robot localization and mapping and a new algorithm: Marginal-SLAM. In
Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 2415–2420, 2007.
[30] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002. AAAI.
[31] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003. IJCAI.
[32] T. N. M. Nguyen, S. Le Corff, and E. Moulines. On the two-filter approximations of marginal smoothing distributions in general state-space models. Advances in Applied Probability, 50(1):154–177, 2017.
[33] J. Olsson and J. Westerborn. Efficient parameter inference in general hidden Markov models using the filter derivatives. In , pages 3984–3988, 2016.
[34] J. Olsson and J. Westerborn. Efficient particle-based online smoothing in general hidden Markov models: The PaRIS algorithm. Bernoulli, 23(3):1951–1996, 2017.
[35] J. Olsson, O. Cappé, R. Douc, and E. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in non-linear state space models. Bernoulli, 14(1):155–179, 2008.
[36] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle methods for optimal filter derivative: application to parameter estimation. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 925–928, 2005.
[37] G. Poyiadjis, A. Doucet, and S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 2011.
[38] V. B. Tadić. Analyticity, convergence, and convergence rate of recursive maximum-likelihood estimation in hidden Markov models. IEEE Transactions on Information Theory, 56(12):6406–6432, 2010.
[39] V. B. Tadić and A. Doucet. Asymptotic bias of stochastic gradient search.
Annals of Applied Probability, 27(6):3255–3304, 2017.

Appendix A: Proofs
Define for all t ∈ N and θ ∈ Θ,

L_{t;θ} : X × 𝒳 ∋ (x, A) ↦ g_{t;θ}(x) Q_θ(x, A).

(Note that our definition of L_{t;θ} differs from that used by [34], in which the order of g_{t;θ} and Q_θ is swapped.) With this notation, by the filtering recursion (2.3)–(2.4),

π_{t+1;θ} = π_{t;θ} L_{t;θ} / ( π_{t;θ} L_{t;θ} 𝟙_X ),  (A.1)

with, as previously, φ_{0|−1;θ} := χ. This condensed form of the filtering recursion will be used in Section A.3.

In the coming analysis the following decomposition will be instrumental. For all t ∈ N,

η^N_{t;θ} f_t − η_{t;θ} f_t = N^{-1} Σ_{i=1}^N { ( τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t(ξ^i_t) − π_{t;θ} f_t ) }
− π_{t;θ}{ ( T_{t;θ} h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t − π_{t;θ} f_t ) }
− ( N^{-1} Σ_{i=1}^N τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} )( N^{-1} Σ_{i=1}^N f_t(ξ^i_t) − π_{t;θ} f_t ).  (A.2)

A.1. Proof of Theorem 2
We apply the decomposition (A.2). Note that

N^{-1} Σ_{i=1}^N { ( τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t(ξ^i_t) − π_{t;θ} f_t ) } − π_{t;θ}{ ( T_{t;θ} h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t − π_{t;θ} f_t ) }
= N^{-1} Σ_{i=1}^N ω^i_t { τ^i_t F_{t;θ}(ξ^i_t) + F̃_{t;θ}(ξ^i_t) } − π_{t;θ}{ g_{t;θ}( T_{t;θ} h_{t;θ} F_{t;θ} + F̃_{t;θ} ) },  (A.3)

where

F_{t;θ}(x) := ( f_t(x) − π_{t;θ} f_t ) / g_{t;θ}(x),  F̃_{t;θ}(x) := −π_{t;θ} T_{t;θ} h_{t;θ} { f_t(x) − π_{t;θ} f_t } / g_{t;θ}(x),  x ∈ X.  (A.4)

Now, since F_{t;θ} g_{t;θ} and F̃_{t;θ} g_{t;θ} both belong to F(X), Proposition 7 provides constants c_t > 0 and c̃_t > 0 such that for all ε > 0,

P( | N^{-1} Σ_{i=1}^N ω^i_t { τ^i_t F_{t;θ}(ξ^i_t) + F̃_{t;θ}(ξ^i_t) } − π_{t;θ}{ g_{t;θ}( T_{t;θ} h_{t;θ} F_{t;θ} + F̃_{t;θ} ) } | ≥ ε ) ≤ c_t exp( −c̃_t N ε² ).  (A.5)

To deal with the second part of the decomposition (A.2), we use the same technique. First, by applying Proposition 7 with f ≡ 1/g_{t;θ} and f̃ ≡ 0, we obtain constants a_t > 0 and ã_t > 0 such that for all ε > 0,

P( | N^{-1} Σ_{i=1}^N τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} | ≥ ε ) ≤ a_t exp( −ã_t N ε² ).  (A.6)

Similarly, using Proposition 7 with f ≡ 0 and f̃ ≡ f_t/g_{t;θ} provides constants b_t > 0 and b̃_t > 0 such that for all ε > 0,

P( | N^{-1} Σ_{i=1}^N f_t(ξ^i_t) − π_{t;θ} f_t | ≥ ε ) ≤ b_t exp( −b̃_t N ε² ).  (A.7)

Combining (A.5), (A.6), and (A.7) yields, for all ε > 0,

P( | η^N_{t;θ} f_t − η_{t;θ} f_t | ≥ ε ) ≤ a_t exp( −ã_t N ε/2 ) + b_t exp( −b̃_t N ε/2 ) + c_t exp{ −c̃_t N (ε/2)² },

from which the statement of the theorem follows.

The following result is obtained by inspection of the proof of Olsson and Westerborn [34, Theorem 1(i)].

Proposition 7.
Let Assumption 1 hold. Then for all $t \in \mathbb{N}$, all $\theta \in \Theta$, all additive state functionals $h_t \in \mathsf{F}(\mathcal{X}^{\otimes(t+1)})$, all measurable functions $f_t$ and $\tilde{f}_t$ such that $f_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$ and $\tilde{f}_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$, and all $\tilde{N} \in \mathbb{N}$, there exist constants $c_t > 0$ and $\tilde{c}_t > 0$ (possibly depending on $\theta$, $h_t$, $f_t$, $\tilde{f}_t$, and $\tilde{N}$) such that for all $\varepsilon > 0$,
\[
\mathbb{P}\left( \left| \frac{1}{N} \sum_{i=1}^N \omega^i_t \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} - \pi_{t;\theta}\{ g_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) \} \right| \geq \varepsilon \right) \leq c_t \exp(-\tilde{c}_t N \varepsilon^2),
\]
where $\{ (\xi^i_t, \omega^i_t, \tau^i_t) \}_{i=1}^N$ are produced using the PaRIS algorithm.

A.2. Proof of Corollary 3
The $\mathbb{P}$-a.s. convergence of $\eta^N_{t;\theta} f_t$ to $\eta_{t;\theta} f_t$ is implied straightforwardly by the exponential concentration inequality in Theorem 2. Indeed, note that
\[
\mathbb{P}\left( \lim_{N \to \infty} \eta^N_{t;\theta} f_t \neq \eta_{t;\theta} f_t \right) = \lim_{k \to \infty} \lim_{n \to \infty} \mathbb{P}\left( \bigcup_{N \geq n} \left\{ \left| \eta^N_{t;\theta} f_t - \eta_{t;\theta} f_t \right| \geq \frac{1}{k} \right\} \right);
\]
now, by Theorem 2,
\[
\mathbb{P}\left( \bigcup_{N \geq n} \left\{ \left| \eta^N_{t;\theta} f_t - \eta_{t;\theta} f_t \right| \geq \frac{1}{k} \right\} \right) \leq c_t \exp(-\tilde{c}_t n k^{-2}) \sum_{N=0}^{\infty} \exp(-\tilde{c}_t N k^{-2}),
\]
where the right-hand side tends to zero as $n$ tends to infinity. This completes the proof.

A.3. Proof of Theorem 4
By combining (A.2) and (A.3),
\begin{align*}
\sqrt{N}\left( \eta^N_{t;\theta} f_t - \eta_{t;\theta} f_t \right)
&= \sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t F_{t;\theta}(\xi^i_t) + \tilde{F}_{t;\theta}(\xi^i_t) \} - \pi_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta}) \right) \\
&\quad - \sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N f_t(\xi^i_t) - \pi_{t;\theta} f_t \right)\left( \frac{1}{N} \sum_{i=1}^N \tau^i_t - \pi_{t;\theta} T_{t;\theta} h_{t;\theta} \right),
\end{align*}
where in this case
\[
F_{t;\theta}(x) := f_t(x) - \pi_{t;\theta} f_t, \qquad \tilde{F}_{t;\theta}(x) := -\pi_{t;\theta} T_{t;\theta} h_{t;\theta} \{ f_t(x) - \pi_{t;\theta} f_t \}, \qquad x \in \mathsf{X},
\]
differ from the functions defined in (A.4) by the factor $1/g_{t;\theta}$. By Proposition 8, since $F_{t;\theta} \in \mathsf{F}(\mathcal{X})$ and $\tilde{F}_{t;\theta} \in \mathsf{F}(\mathcal{X})$,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t F_{t;\theta}(\xi^i_t) + \tilde{F}_{t;\theta}(\xi^i_t) \} - \pi_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta}) \right) \overset{\mathcal{D}}{\longrightarrow} \sigma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_{t;\theta})\, Z,
\]
where $Z$ is standard normally distributed and $\sigma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_{t;\theta}) = \sigma_{t;\theta}(h_{t;\theta})$, with $\sigma_{t;\theta}(h_{t;\theta})$ being defined in (3.1). Now, Proposition 7 and Proposition 8 yield
\[
\frac{1}{N} \sum_{i=1}^N \tau^i_t \overset{\mathbb{P}}{\longrightarrow} \pi_{t;\theta} T_{t;\theta} h_{t;\theta}
\quad \text{and} \quad
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N f_t(\xi^i_t) - \pi_{t;\theta} f_t \right) \overset{\mathcal{D}}{\longrightarrow} \sigma_{t;\theta}\langle 0, f_t \rangle(h_{t;\theta})\, Z
\]
(with $0$ denoting the zero function), respectively, implying, by Slutsky's theorem,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N f_t(\xi^i_t) - \pi_{t;\theta} f_t \right)\left( \frac{1}{N} \sum_{i=1}^N \tau^i_t - \pi_{t;\theta} T_{t;\theta} h_{t;\theta} \right) \overset{\mathbb{P}}{\longrightarrow} 0.
\]
Finally, we complete the proof by noting that the term $\tilde{\sigma}_{t;\theta}(f_t)$ in (3.1) coincides with the asymptotic variance provided by Del Moral et al. [8, Theorem 3.2].

Proposition 8.
Let Assumption 1 hold. Then for all $t \in \mathbb{N}$, all $\theta \in \Theta$, all additive state functionals $h_t \in \mathsf{F}(\mathcal{X}^{\otimes(t+1)})$, all measurable functions $f_t \in \mathsf{F}(\mathcal{X})$ and $\tilde{f}_t \in \mathsf{F}(\mathcal{X})$, and all $\tilde{N} \in \mathbb{N}$, as $N \to \infty$,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} - \pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) \right) \overset{\mathcal{D}}{\longrightarrow} \sigma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t)\, Z,
\]
where $Z$ is a standard Gaussian random variable and
\[
\sigma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \tilde{\sigma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) + \sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-(s+1)} \varsigma_{s,\ell,t;\theta}\langle f_t \rangle(h_t), \tag{A.8}
\]
with
\[
\tilde{\sigma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \sum_{s=0}^{t-1} \frac{\pi_{s+1;\theta} \tilde{D}_{s+1,t;\theta}(h_t f_t + \tilde{f}_t)}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
\]
and
\[
\varsigma_{s,\ell,t;\theta}\langle f_t \rangle(h_t) := \frac{\pi_{\ell+1;\theta}\{ \overleftarrow{Q}_{\phi_{\ell;\theta}}(T_{\ell;\theta} h_\ell + \tilde{h}_\ell - T_{\ell+1;\theta} h_{\ell+1}) L_{\ell+1;\theta} \cdots L_{s;\theta}( L_{s+1;\theta} \cdots L_{t-1;\theta} f_t ) \}}{(\pi_{\ell+1;\theta} L_{\ell+1;\theta} \cdots L_{s;\theta} \mathbb{1}_{\mathsf{X}})(\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}})}.
\]

Proof of Proposition 8. Assume first that $\pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) = 0$. Then, by Lemma 9 and Slutsky's theorem, as $\Omega_t / N \overset{\mathbb{P}}{\longrightarrow} \pi_{t;\theta} g_{t;\theta}$ by Proposition 7,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} \right)
= \sqrt{N}\left( \sum_{i=1}^N \frac{\omega^i_t}{\Omega_t} \{ \tau^i_t F_{t;\theta}(\xi^i_t) + \tilde{F}_{t;\theta}(\xi^i_t) \} \right) \frac{\Omega_t}{N}
\overset{\mathcal{D}}{\longrightarrow} \pi_{t;\theta} g_{t;\theta} \times \Gamma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_t)\, Z,
\]
where again $Z$ has standard Gaussian distribution, $\Gamma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_t)$ is given in Lemma 9, and we have set $F_{t;\theta} := f_t / g_{t;\theta}$ and $\tilde{F}_{t;\theta} := \tilde{f}_t / g_{t;\theta}$ and used, first, that $g_{t;\theta} F_{t;\theta} = f_t \in \mathsf{F}(\mathcal{X})$ and $g_{t;\theta} \tilde{F}_{t;\theta} = \tilde{f}_t \in \mathsf{F}(\mathcal{X})$ and, second, that
\[
\phi_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta})
= \frac{\pi_{t;\theta}\{ g_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta}) \}}{\pi_{t;\theta} g_{t;\theta}}
= \frac{\pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t)}{\pi_{t;\theta} g_{t;\theta}} = 0.
\]
Now, by iterating (A.1) we conclude that for all $(s, t) \in \mathbb{N}^2$ such that $s < t$,
\[
\pi_{t;\theta}
= \frac{\pi_{t-1;\theta} L_{t-1;\theta}}{\pi_{t-1;\theta} L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
= \frac{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta} / (\pi_{t-2;\theta} L_{t-2;\theta} \mathbb{1}_{\mathsf{X}})}{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta} \mathbb{1}_{\mathsf{X}} / (\pi_{t-2;\theta} L_{t-2;\theta} \mathbb{1}_{\mathsf{X}})}
= \frac{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta}}{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
= \cdots
= \frac{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta}}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
\]
and, consequently,
\[
\pi_{t;\theta} g_{t;\theta} = \frac{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta}}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}. \tag{A.9}
\]
Finally, by (A.9) it holds that
\[
\pi_{t;\theta} g_{t;\theta} \times \Gamma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_t) = \sigma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t),
\]
where $\Gamma_{t;\theta}\langle \cdot, \cdot \rangle(h_t)$ is defined in (A.10) and $\sigma_{t;\theta}\langle \cdot, \cdot \rangle(h_t)$ is defined in (A.8). In the general case, the previous holds true when $\tilde{f}_t$ is replaced by $\tilde{f}_t - \pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t)$, which completes the proof.

The following lemma is obtained by inspection of the proof of Olsson and Westerborn [34, Theorem 3].

Lemma 9.
Let Assumption 1 hold. Then for all $t \in \mathbb{N}$, all $\theta \in \Theta$, all additive state functionals $h_t \in \mathsf{F}(\mathcal{X}^{\otimes(t+1)})$, all measurable functions $f_t$ and $\tilde{f}_t$ such that $f_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$ and $\tilde{f}_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$, and all $\tilde{N} \in \mathbb{N}$, as $N \to \infty$,
\[
\sqrt{N}\left( \sum_{i=1}^N \frac{\omega^i_t}{\Omega_t} \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} - \phi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) \right) \overset{\mathcal{D}}{\longrightarrow} \Gamma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t)\, Z,
\]
where $Z$ is a standard normally distributed random variable and
\[
\Gamma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \tilde{\Gamma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) + \sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-(s+1)} \gamma_{s,\ell,t;\theta}\langle f_t \rangle(h_t), \tag{A.10}
\]
with
\[
\tilde{\Gamma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \sum_{s=0}^{t-1} \frac{\pi_{s+1;\theta} \tilde{D}_{s+1,t;\theta}\{ g_{t;\theta}(h_t f_t + \tilde{f}_t) \}}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta}}
\]
and
\[
\gamma_{s,\ell,t;\theta}\langle f_t \rangle(h_t) := \frac{\pi_{\ell+1;\theta}\{ \overleftarrow{Q}_{\phi_{\ell;\theta}}(T_{\ell;\theta} h_\ell + \tilde{h}_\ell - T_{\ell+1;\theta} h_{\ell+1}) L_{\ell+1;\theta} \cdots L_{s;\theta}( L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta} f_t ) \}}{(\pi_{\ell+1;\theta} L_{\ell+1;\theta} \cdots L_{s;\theta} \mathbb{1}_{\mathsf{X}})(\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta})}.
\]

A.4. Proof of Theorem 6
As noted above, the first term of the asymptotic variance $\sigma_{t;\theta}(f_t)$ coincides with the asymptotic variance $\tilde{\sigma}_{t;\theta}(f_t)$ obtained by Del Moral et al. [8, Theorem 3.2]. The same work provides a constant $c \in \mathbb{R}_+$ such that $\tilde{\sigma}_{t;\theta}(f_t) \leq c \|f_t\|_\infty$, and we may hence focus on bounding the second term of the asymptotic variance.

For this purpose, note that for all $s \leq t - 2$ and $x_{s+1} \in \mathsf{X}$,
\[
L_{s+1;\theta} \cdots L_{t-1;\theta}(f_t - \pi_{t;\theta} f_t)(x_{s+1})
= g_{s+1;\theta}(x_{s+1})\, Q_\theta L_{s+2;\theta} \cdots L_{t-2;\theta}\{ g_{t-1;\theta}(Q_\theta f_t - \phi_{t-1;\theta} Q_\theta f_t) \}(x_{s+1}). \tag{A.11}
\]
By applying the forgetting of the filter, or, more particularly, Douc et al. [12, Lemma 10], to (A.11) we obtain
\[
\| L_{s+1;\theta} \cdots L_{t-1;\theta}(f_t - \pi_{t;\theta} f_t) \|_\infty \leq \bar{\delta}\, \| Q_\theta L_{s+2;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}} \|_\infty \|f_t\|_\infty\, \varrho^{t-s-1}.
\]
Note that in the previous bound, the exponential contraction follows from the fact that the objective function $f_t$ is centred around its predicted mean. The latter is a consequence of the fact that the tangent filter is, being a covariance, centred itself (recall the identities (2.9) and (2.10) and the decomposition (A.2)). In addition, from the proof of Olsson and Westerborn [34, Theorem 8] we extract, using Assumption 3,
\[
\| T_{\ell;\theta} h_{\ell;\theta} + \tilde{h}_{\ell;\theta} - T_{\ell+1;\theta} h_{\ell+1;\theta} \|_\infty \leq |\tilde{h}|_\infty \left( \frac{\bar{\delta} \varrho}{1 - \varrho} + 1 \right),
\]
and under Assumption 2, for all $x \in \mathsf{X}$,
\[
\rho\, \underline{\delta}\, \mu L_{s+2;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}
\leq L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}(x)
\leq \rho\, \bar{\delta}\, \mu L_{s+2;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}.
\]
Combining the previous bounds gives
\[
\varsigma_{s,\ell,t;\theta}(f_t)
= \frac{\pi_{\ell+1;\theta}\{ \overleftarrow{Q}_{\phi_{\ell;\theta}}(T_{\ell;\theta} h_{\ell;\theta} + \tilde{h}_{\ell;\theta} - T_{\ell+1;\theta} h_{\ell+1;\theta}) L_{\ell+1;\theta} \cdots L_{s;\theta}( L_{s+1;\theta} \cdots L_{t-1;\theta} \{ f_t - \pi_{t;\theta} f_t \} ) \}}{(\pi_{\ell+1;\theta} L_{\ell+1;\theta} \cdots L_{s;\theta} \mathbb{1}_{\mathsf{X}})(\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}})}
\leq \tilde{c}\, \|f_t\|_\infty\, \varrho^{t-s-1},
\]
where
\[
\tilde{c} := 4 |\tilde{h}|_\infty \left( \frac{\bar{\delta} \varrho}{1 - \varrho} + 1 \right) \frac{\bar{\delta}}{\underline{\delta}(1 - \varrho)}.
\]
Finally, summing up yields
\begin{align*}
\sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-(s+1)} \varsigma_{s,\ell,t;\theta}(f_t)
&\leq \tilde{c}\, \|f_t\|_\infty \sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-s-1} \varrho^{t-s-1} \\
&= \tilde{c}\, \|f_t\|_\infty (\tilde{N}-1)^{-1} \sum_{s=0}^{t-1} \varrho^{t-s-1}(1 - \tilde{N}^{-s-1}) \\
&\leq \tilde{c}\, \|f_t\|_\infty (\tilde{N}-1)^{-1} \sum_{s=0}^{t-1} \varrho^{t-s-1}
\leq \frac{\tilde{c}\, \|f_t\|_\infty}{(\tilde{N}-1)(1-\varrho)},
\end{align*}
which completes the proof.

Appendix B: Kernels
Given two measurable spaces $(\mathsf{X}, \mathcal{X})$ and $(\mathsf{Y}, \mathcal{Y})$, an unnormalised transition kernel $K$ between these spaces induces two integral operators, one acting on functions and the other on measures. Specifically, we define the function
\[
K h : \mathsf{X} \ni x \mapsto \int h(x, y)\, K(x, \mathrm{d}y) \qquad (h \in \mathsf{F}(\mathcal{X} \otimes \mathcal{Y}))
\]
and the measure
\[
\nu K : \mathcal{Y} \ni A \mapsto \int K(x, A)\, \nu(\mathrm{d}x) \qquad (\nu \in \mathsf{M}(\mathcal{X}))
\]
whenever these quantities are well-defined. Moreover, let $L$ be another unnormalised transition kernel from $(\mathsf{Y}, \mathcal{Y})$ to the measurable space $(\mathsf{Z}, \mathcal{Z})$; then two different products of $K$ and $L$ can be defined, namely
\[
K L : \mathsf{X} \times \mathcal{Z} \ni (x, A) \mapsto \int K(x, \mathrm{d}y)\, L(y, A)
\]
and
\[
K \otimes L : \mathsf{X} \times (\mathcal{Y} \otimes \mathcal{Z}) \ni (x, A) \mapsto \iint \mathbb{1}_A(y, z)\, K(x, \mathrm{d}y)\, L(y, \mathrm{d}z)
\]
whenever these are well-defined. These products form new transition kernels from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Z}, \mathcal{Z})$ and from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y} \times \mathsf{Z}, \mathcal{Y} \otimes \mathcal{Z})$, respectively. Also the $\otimes$-product of a kernel $K$ and a measure $\nu \in \mathsf{M}(\mathcal{X})$ is defined as the new measure
\[
\nu \otimes K : \mathcal{X} \otimes \mathcal{Y} \ni A \mapsto \iint \mathbb{1}_A(x, y)\, K(x, \mathrm{d}y)\, \nu(\mathrm{d}x).
\]
Finally, for any kernel $K$ and any bounded measurable function $h$ we write $(K h)^2 := (K h)(K h)$ and $K h^2 := K(h^2)$.
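On finite spaces the kernel operations above reduce to elementary matrix algebra, which gives a convenient way to sanity-check the definitions. The following Python/NumPy sketch (not part of the paper; the matrices $K$, $L$, the measure $\nu$, and the functions $h$, $f$ are arbitrary made-up examples) represents a kernel from $\mathsf{X}$ to $\mathsf{Y}$ as a matrix with entries $K(x, \{y\})$, and checks that marginalising the $\otimes$-products recovers the plain products and that $K h^2 = K(h^2)$ differs in general from $(K h)^2$:

```python
import numpy as np

# Finite spaces: X = {0, 1}, Y = {0, 1, 2}, Z = {0, 1}.
# A kernel K from X to Y is a nonnegative matrix K[x, y] = K(x, {y}).
K = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5]])          # kernel X -> Y
L = np.array([[0.6, 0.4],
              [0.5, 0.5],
              [0.1, 0.9]])               # kernel Y -> Z

# Action on functions: (K h)(x) = sum_y h(x, y) K(x, {y}),  h in F(X (x) Y).
h = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])
Kh = (h * K).sum(axis=1)

# Action on measures: (nu K)(A) = sum_x K(x, A) nu({x}),  nu in M(X).
nu = np.array([0.3, 0.7])
nuK = nu @ K

# Product kernel KL from X to Z: (KL)(x, A) = sum_y K(x, {y}) L(y, A).
KL = K @ L

# Tensor product K (x) L from X to Y x Z: K(x, {y}) L(y, {z}).
KtensorL = K[:, :, None] * L[None, :, :]

# Tensor product of a measure and a kernel: (nu (x) K)({(x, y)}) = nu({x}) K(x, {y}).
nutensorK = nu[:, None] * K

# Marginalising the tensor products recovers the plain products.
assert np.allclose(KtensorL.sum(axis=1), KL)    # summing out y gives KL
assert np.allclose(nutensorK.sum(axis=0), nuK)  # summing out x gives nu K

# The two squaring conventions differ in general: K h^2 = K(h^2) vs. (K h)^2.
f = np.array([1.0, -1.0, 2.0])                  # bounded function on Y
Kf2 = K @ (f ** 2)                              # K(h^2)
Kf_sq = (K @ f) ** 2                            # (K h)^2
assert not np.allclose(Kf2, Kf_sq)
```

Since the example kernels happen to be Markov (rows summing to one), the product $KL$ is Markov as well, mirroring the fact that products of normalised transition kernels remain normalised.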