Particle-based, online estimation of tangent filters with application to parameter estimation in nonlinear state-space models
Jimmy Olsson and Johan Westerborn Alenl¨ov
Department of Mathematics, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
e-mail: [email protected], [email protected]

Abstract:
This paper presents a novel algorithm for efficient online estimation of the filter derivatives in general hidden Markov models. The algorithm, which has a linear computational complexity and very limited memory requirements, is furnished with a number of convergence results, including a central limit theorem with an asymptotic variance that can be shown to be uniformly bounded in time. Using the proposed filter derivative estimator we design a recursive maximum likelihood algorithm updating the parameters according to the gradient of the one-step predictor log-likelihood. The efficiency of this online parameter estimation scheme is illustrated in a simulation study.
1. Introduction
The general state-space hidden Markov models form a powerful statistical modelling tool which is presently applied across a wide range of scientific and engineering disciplines; see e.g. [2, 4] and the references therein for examples. In the literature, the term "hidden Markov model" is often restricted to the case where the state space of the hidden chain is a finite set, and we hence use the term "state-space model" (SSM) to stress that the state spaces of the models that we consider are completely general. More specifically, SSMs are bivariate stochastic processes consisting of an unobservable Markov chain {X_t}_{t∈N} and an observable process {Y_t}_{t∈N}, referred to as the state process and the observation process and taking on values in some general state spaces X and Y, respectively. When using these models in practice, the statistician is typically interested in computing conditional distributions of the unobservable states given some fixed observation record or data stream {y_t}_{t∈N}. Generally, these distributions cannot be expressed in a closed form, and the user needs to rely on approximations. Moreover, any SSM is typically parameterised by unknown model parameters θ which need to be calibrated ex ante.

The frequentist approach to parameter estimation consists in maximising the likelihood function θ ↦ L_θ(y_{0:t}), where L_θ(y_{0:t}) denotes the joint density of the observations evaluated at the given data y_{0:t} = (y_0, . . . , y_t) (this will be our generic notation for vectors), or, equivalently, the log-likelihood function θ ↦ ℓ_θ(y_{0:t}) := log L_θ(y_{0:t}). Even though the likelihood cannot generally be expressed in a closed form, there are several approaches to approximate maximisation of the same. These methods are typically based on either the expectation-maximisation algorithm [10], where the maximisation is performed over an intermediate quantity, or the tools of gradient-based optimisation.
The latter algorithms produce, in a steepest-ascent manner, a sequence {θ_n}_{n∈N} of parameter estimates converging to the maximum likelihood estimator; i.e., at each iteration, a parameter estimate θ_n is updated recursively through a move in the direction given by an estimate of the score function ∇_θ ℓ_{θ_n}(y_{0:t}).

There are mainly two approaches to estimation of the score function. The first one goes via Fisher's identity [see, e.g., 4, Proposition 10.1.6], which relates the score function of the observed data to that of the complete data and thus transfers the problem to that of joint smoothing, i.e., the computation of the joint posterior distribution of X_{0:t} given Y_{0:t} = y_{0:t}. The second approach, which is the focus of the present paper, goes via the decomposition

ℓ_θ(y_{0:t}) = Σ_{s=0}^{t} ℓ_θ(y_s | y_{0:s−1}),   (1.1)

where each one-step predictor log-likelihood ℓ_θ(y_s | y_{0:s−1}) (with ℓ_θ(y_0 | y_{0:−1}) := ℓ_θ(y_0)) can be computed via the sensitivity equations; see Cappé et al. [4, Section 10.2.4] and Section 4 for details. Let π_{s;θ}(x_s) be the density of the predictor at time s, i.e., of the distribution of the state X_s conditionally on the observation record Y_{0:s−1} = y_{0:s−1}; it is then, by swapping the order of the nabla and integration operations, possible to express the gradient ∇_θ ℓ_θ(y_s | y_{0:s−1}) as a functional of the gradient ∇_θ π_{s;θ}(x_s), the latter being typically referred to as the filter derivative or tangent filter or filter sensitivity.
It should be noted that, despite its name, the tangent filter is the gradient of the predictor rather than of the filter, where the filter at time s is defined as the distribution of X_s conditionally on Y_{0:s} = y_{0:s}, i.e., the updated predictor.

Appealingly, the recursive updating scheme for the tangent filters given by the sensitivity equations requires only access to the filter distributions (and not the full joint smoothing distribution, as in the case of Fisher identity-based score approximation). If estimation is to be carried through in the offline mode, by processing repeatedly a fixed batch of observations until convergence, the approach based on Fisher's identity is typically computationally more efficient and hence to be preferred. On the other hand, if estimation is to be carried through in an online manner, i.e., through a single sweep of a (potentially infinitely) long sequence of observations, then the sensitivity-equation approach is better suited, as the filter distributions can be updated recursively at constant computational cost. Online parameter learning is relevant when dealing with very big batches of data or in scenarios requiring the parameters to be continuously updated on the basis of a continuous stream of arriving observations.

If the state space of the latent Markov chain is a finite set or if the model belongs to the linear Gaussian SSMs, exact computation of the filter distributions, and thus of the filter derivatives and the gradient of the one-step predictor log-likelihood, is possible [28, 4]. However, as touched upon above, the general case calls for approximation of these quantities, and existing approaches to nonlinear filtering can typically be divided into two classes. The first class contains methods relying on linearisation, such as the extended Kalman filter, the unscented Kalman filter [1, 22], etc., whose success is generally limited in the presence of significant nonlinear or non-Gaussian model components.
The second class is formed by the sequential Monte Carlo (SMC) or particle filtering approaches [18], where a sample of particles with associated importance weights is propagated sequentially and randomly in time according to the model dynamics in order to approximate the flows of filter and predictor distributions. Due to their high capability of solving potentially very difficult nonlinear/non-Gaussian filtering problems, SMC methods have, as indicated by the over 8,000 Google Scholar citations of the first monograph [16] on the topic, seen a rapid and ever-increasing interest during the last decades. For an exposé of current particle-based approaches to parameter estimation in nonlinear SSMs, we refer to [23]. SMC-based score function estimation via tangent filters was discussed by [14, 36, 37, 8].

The work [8], which can be viewed as the state of the art when it concerns SMC-based tangent filter estimation, expresses each tangent filter as a centered, smoothed expectation of an additive state functional. Using the so-called backward decomposition of the joint smoothing distribution and a recursive form of this decomposition going back to [3], [7] are able to form, on-the-fly and through the perspicacious propagation of a number of smoothed statistics, one statistic per particle, non-collapsed particle approximations of such additive smoothed expectations. Moreover, [7] furnish their estimator with a central limit theorem (CLT) describing the weak convergence, as the size of the particle sample tends to infinity, of the particle tangent filter approximations to the exact counterparts. Appealingly, the fact that the smoothed expectations are centered allows, under strong mixing assumptions, the asymptotic variance of the mentioned CLT to be bounded uniformly in time, which establishes the long-term stochastic stability of the algorithm for large sample sizes.
However, since the propagation of the smoothed statistic associated with a given particle requires an expectation under the ancestral posterior of that particle to be computed by summation, the computational complexity of the algorithm is quadratic in the number of particles. The user is hence forced to keep the number of particles at a modest value, which may imply severe numerical instability in practice (see Section 4.3 for an illustration).

On the basis of the approach taken by [8], we develop an efficient and numerically stable algorithm for particle approximation of the filter derivatives in SSMs. The approach that we propose includes, as a sub-routine, the particle-based, rapid incremental smoother (PaRIS) [introduced recently by us in 34], an algorithm that is tailor-made for online approximation of smoothed additive state functionals. By replacing, in the forward recursion proposed by [8], the computation of each smoothed statistic by a conditionally unbiased Monte Carlo estimate, the computational complexity becomes, under mild assumptions, linear in the number of particles. Since the original estimator can be viewed as a Rao-Blackwellisation of the new one, we will sometimes refer to this measure as "de-Rao-Blackwellisation". This allows the user to mount a considerably larger number of particles for a given computational budget, which compensates by far for the slight increase of asymptotic variance added by the additional sampling step. Thus, the technique that we propose has some similarities with that used in particle-based two-filter smoothing, where a linear computational complexity is obtained by replacing certain marginalisations by random sampling [see 17] (however, for the sake of completeness we remark that two-filter particle smoothers, which require a so-called backward information filter to be run backwards in time, cannot be used in online settings). By reusing techniques developed by [34] and Del Moral et al.
[8], we are able to furnish the proposed algorithm with an exponential concentration inequality and a CLT, whose asymptotic variance is given by that of the Rao-Blackwellised counterpart plus one additional term that is inversely proportional to the sample size of the supplementary, PaRISian sampling operation. This result is analogous to the results obtained by [32] for the case of particle-based two-filter smoothing. Moreover, by assuming strong mixing of the model, the asymptotic variance can be bounded uniformly in time.

In the second part of the paper, we cast our PaRIS-based tangent filter estimator into the framework of recursive maximum likelihood estimation. The numerical performance of this online parameter estimation algorithm is investigated in a simulation study, where we are able to report significant improvement over the approach of [8] for online parameter learning in the stochastic volatility model of [19]. In addition, we propose our algorithm as a prototypical solution to the simultaneous localisation and mapping (SLAM) problem, with promising results.

Finally, we remark that the parameter estimation algorithm was outlined by us in the conference note [33]. The present paper provides rigorous theoretical results underpinning the initial observations made in [33] and complements the simulation study of that work with additional examples.

After having introduced, prefatorily, general SSMs, tangent filters, and SMC-based smoothing (including the PaRIS) in Section 2, Section 3 presents the proposed approach to tangent filter estimation. In addition, Section 3.1 includes theoretical convergence results in the form of an exponential concentration inequality and a CLT with a time-uniform bound on the asymptotic variance.
Section 4 applies the PaRISian tangent filter estimator to recursive maximum likelihood estimation, and the resulting online parameter estimation algorithm is tested numerically in Section 4.3. Finally, some conclusions are drawn in Section 5, and Appendix A contains all proofs.
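To fix ideas, the steepest-ascent recursion described above can be sketched in a few lines of Python. The i.i.d. Gaussian toy model, its score function, and the step-size schedule below are illustrative assumptions only, not part of the SSM class studied in this paper; they merely show one gradient step being taken per incoming observation, in the spirit of the decomposition (1.1).

```python
import random

def score(theta, y, sigma2=1.0):
    # Gradient of the per-observation Gaussian log-likelihood
    # l_theta(y) = -0.5 * (y - theta)^2 / sigma2 + const (toy model).
    return (y - theta) / sigma2

def recursive_ml(observations, theta0=0.0):
    # Steepest-ascent update theta_{n+1} = theta_n + gamma_{n+1} * score
    # estimate, processing the data stream in a single online sweep.
    theta = theta0
    for n, y in enumerate(observations, start=1):
        gamma = n ** -0.6  # Robbins-Monro-type step sizes
        theta += gamma * score(theta, y)
    return theta

random.seed(0)
true_mean = 2.0
ys = [random.gauss(true_mean, 1.0) for _ in range(5000)]
theta_hat = recursive_ml(ys)
```

In the nonlinear SSM setting of this paper, the exact per-observation score is unavailable and is replaced by the particle-based tangent filter estimates developed below.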
2. Preliminaries
In the following we let N denote the natural numbers and set N* := N \ {0}. For any measurable space (E, ℰ), where ℰ is a countably generated σ-algebra, we denote by F(ℰ) the set of bounded ℰ/B(R)-measurable functions on E, where B(R) denotes the Borel σ-algebra on R. For any function h ∈ F(ℰ), we let ‖h‖_∞ := sup_{x∈E} |h(x)| denote the supremum norm of h. Let M(ℰ) denote the set of σ-finite measures on (E, ℰ) and M₁(ℰ) ⊂ M(ℰ) the set of probability measures on the same space.

Now, let (X, 𝒳) and (Y, 𝒴) be measurable spaces and Q_θ : X × 𝒳 → [0, 1] and G_θ : X × 𝒴 → [0, 1] Markov transition kernels. We will often subject kernels to operations such as multiplication and tensor multiplication. Moreover, a transition kernel induces two different operators, one on bounded measurable functions and one on measures; we refer to Appendix B for details. Moreover, let χ ∈ M₁(𝒳). The kernels above are parameterised by a model parameter vector θ belonging to some space Θ ⊆ R^d, d ∈ N*, and we will use subindices to stress the dependence of any distributions on these parameters. We define an SSM as the canonical Markov chain {(X_t, Y_t)}_{t∈N} with initial distribution χ ⊗ G_θ (see Appendix B for a definition of the product ⊗ of a measure and a kernel) and Markov transition kernel

(X × Y) × (𝒳 ⊗ 𝒴) ∋ ((x, y), A) ↦ ∫∫ 1_A(x′, y′) Q_θ(x, dx′) G_θ(x′, dy′).   (2.1)

In this setting, the state process {X_t}_{t∈N} is assumed to be only partially observed through the observation process {Y_t}_{t∈N}. Under the dynamics (2.1),
(i) the state process {X_t}_{t∈N} is itself a Markov chain with transition kernel Q_θ and initial distribution χ;
(ii) the observations are, conditionally on the states, independent and such that the conditional distribution of each Y_t depends on the corresponding X_t only and is given by the emission distribution G_θ(X_t, ·).

For simplicity we have assumed that the initial distribution χ does not depend on θ. Throughout this paper we will consider the fully dominated case where for all θ ∈ Θ, Q_θ and G_θ admit densities q_θ and g_θ with respect to some reference measures µ ∈ M(𝒳) and ν ∈ M(𝒴), respectively. This means that for all θ ∈ Θ, x ∈ X, f ∈ F(𝒳), and h ∈ F(𝒴),

Q_θ f(x) = ∫ f(x′) q_θ(x, x′) µ(dx′),   G_θ h(x) = ∫ h(y) g_θ(x, y) ν(dy).
Likewise, χ is assumed to have a density, which we denote by the same symbol, with respect to the same reference measure µ.

In the following we assume that we are given a fixed sequence {y_t}_{t∈N} of observations of {Y_t}_{t∈N}, and in order to avoid notational overload we will generally suppress any dependence on these observations from the notation by denoting, e.g., by g_{t;θ} the function X ∋ x ↦ g_θ(x, y_t). For any triple (s, s′, t) ∈ N³ such that s ≤ s′ we denote by φ_{s:s′|t;θ} the conditional distribution of X_{s:s′} given Y_{0:t} = y_{0:t}. For any f ∈ F(𝒳^{⊗(s′−s+1)}), this distribution can be expressed as

φ_{s:s′|t;θ} f = L_θ(y_{0:t})^{−1} ∫ ⋯ ∫ f(x_{s:s′}) ( ∏_{m=0}^{t} g_{m;θ}(x_m) ) χ(dx_0) ∏_{ℓ=0}^{(t∨s′)−1} Q_θ(x_ℓ, dx_{ℓ+1}),

where

L_θ(y_{0:t}) = ∫ ⋯ ∫ g_{0;θ}(x_0) χ(dx_0) ∏_{ℓ=0}^{t−1} g_{ℓ+1;θ}(x_{ℓ+1}) Q_θ(x_ℓ, dx_{ℓ+1})   (2.2)

is the observed data likelihood. In the cases s = s′ = t and s = s′ = t + 1 we let φ_{t;θ} and π_{t+1;θ} be shorthand notation for the filter and predictor distributions φ_{t:t|t;θ} and φ_{t+1:t+1|t;θ}, respectively. In the case s = 0 and s′ = t, the distribution φ_{0:t|t;θ} is referred to as the joint-smoothing distribution. The filter and predictor distributions are closely related; indeed, the well-known filtering recursion provides that for all t ∈ N and f ∈ F(𝒳),

φ_{t;θ} f = π_{t;θ}(g_{t;θ} f) / π_{t;θ} g_{t;θ},   (2.3)
π_{t+1;θ} f = φ_{t;θ} Q_θ f,   (2.4)

where, by convention, π_{0;θ} := χ.

We will often deal with sums and products of functions with possibly different arguments. Since these functions will be defined on products of X, we will, when needed, let, with slight abuse of notation, subscripts define the domains of such sums and products.
For instance, f_t f̃_t : X ∋ x_t ↦ f_t(x_t) f̃_t(x_t), while f_t + f̃_{t+1} : X² ∋ x_{t:t+1} ↦ f_t(x_t) + f̃_{t+1}(x_{t+1}).

It may be shown [see, e.g., 4, Proposition 3.3.6] that the state process retains the Markov property when evolving conditionally on Y_{0:t} = y_{0:t} in the time-reversed direction. Moreover, the distribution of X_s given X_{s+1} and Y_{0:t} = y_{0:t} is, for s ≤ t, given by the backward kernel denoted by ←Q_{φ_{s;θ}}. Under the assumption of a fully dominated model the backward kernel can, for x_{s+1} ∈ X and f ∈ F(𝒳), be expressed as

←Q_{φ_{s;θ}} f(x_{s+1}) := ∫ f(x_s) q_θ(x_s, x_{s+1}) φ_{s;θ}(dx_s) / ∫ q_θ(x′_s, x_{s+1}) φ_{s;θ}(dx′_s).

Consequently, using the backward kernel, we may express the joint-smoothing distribution φ_{0:t|t;θ} as

φ_{0:t|t;θ} = φ_{t;θ} T_{t;θ},   (2.5)
1, ˜ h s is a measurable function on X . This setting allows for recursivecomputation of { T t ; θ h t } t ∈ N according to T t +1; θ h t +1 ( x t +1 ) = Z { T t ; θ h t ( x t ) + ˜ h t ( x t : t +1 ) } ←− Q φ t ; θ ( x t +1 , d x t ); (2.6)see [3, 7]. The recursion (2.6) will be a key ingredient also in the coming developments. The gradient of the prediction filter, known as the filter derivative or the tangent filter, is asigned measure which we, following [8, Section 2], define as follows. We start with consideringthe gradient of φ t | t − θ and then extract the marginal of this quantity. First, note that for all f t ∈ F ( X (cid:15) ( t +1) ),1 L θ ( y t − ) ∇ θ Z f t ( x t ) χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t ) = φ t | t − θ ( h t ; θ f t ) , (2.7)where h t ; θ ( x t ) = P t − s =0 ˜ h s ; θ ( x s : s +1 ), with˜ h s ; θ ( x s : s +1 ) = ∇ θ log g s ; θ ( x s ) + ∇ θ log q θ ( x s , x s +1 ) , (2.8)is complete data score function . The identity (2.7) is established by, first, swapping, on the lefthand side, the order of differentiation and integration and, second, dividing and multiplying theintegrand by the complete data likelihood. In the case f t = X t +1 , (2.7) coincides with Fisher’sidentity [see, e.g., 4, Proposition 10.1.6]. Now, using (2.7), we obtain straightforwardly ∇ θ φ t | t − θ f t = ∇ θ L θ ( y t − ) Z f t ( x t ) χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t )= 1 L θ ( y t − ) ∇ θ Z f t ( x t ) χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t ) − φ t | t − θ f t L θ ( y t − ) ∇ θ Z χ (d x ) t − Y s =0 g s ; θ ( x s ) q θ ( x s , x s +1 ) µ (cid:15) t (d x t )= φ t | t − θ ( h t ; θ f t ) − φ t | t − θ f t × φ t | t − θ h t ; θ . (2.9) . Olsson and J. Westerborn Alenl¨ov/Online estimation of tangent filters In other words, ∇ θ φ t | t − θ f t coincides with the covariance of f t and the complete data scorefunction h t ; θ under φ t | t − θ . 
Thus, in order to obtain, from (2.9), an expression of the gradient of the predictor distribution π_{t;θ}, which is the marginal of φ_{0:t|t−1;θ} with respect to the last state, we consider f_t ∈ F(𝒳), let f̄_t(x_{0:t}) = f_t(x_t), and use the backward decomposition (2.5) to obtain

∇_θ π_{t;θ} f_t = π_{t;θ}{(T_{t;θ} h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ}) f_t}.   (2.10)

The tangent filter η_{t;θ} is now defined as the signed measure

η_{t;θ} f_t := ∇_θ π_{t;θ} f_t,   f_t ∈ F(𝒳).

Since the predictors as well as the backward statistics {T_{t;θ} h_t}_{t∈N} may be updated recursively through the filtering recursion (2.3)–(2.4) and the recursion (2.6), respectively, the identity (2.10) allows for recursive updating of the filter derivatives as well. Still, since neither the predictors nor the backward statistics are generally available in a closed form, we need to resort to approximations. This will be the topic of the next section.

Clearly, the approach to filter derivative estimation outlined above requires access to the filter and predictor distributions as well as the backward statistics. However, these quantities are available in a closed form only in very exceptional cases, e.g. when the state-space model is linear Gaussian or the state space X is a finite set. In the general case we have to rely on approximations, and in this paper we employ SMC methods for this purpose. As mentioned in the introduction, SMC methods are genetic-type algorithms propagating recursively a random sample of particles with associated weights through repeated importance sampling. In the following we assume that all random variables are well defined on a common probability space (Ω, F, P).

The bootstrap particle filter [18] updates sequentially in time a set of particles with associated importance weights in order to approximate the filter and prediction distribution flows {φ_{t;θ}}_{t∈N} and {π_{t;θ}}_{t∈N}, respectively, given the sequence {y_t}_{t∈N} of observations.
These methods are best described in a recursive manner. Thus, assume that we have at hand a particle sample {ξ^i_t}_{i=1}^N approximating the predictor π_{t;θ} in the sense that for all f ∈ F(𝒳), as N tends to infinity,

π^N_{t;θ} f := N^{−1} Σ_{i=1}^{N} f(ξ^i_t) ≈ π_{t;θ} f.

Using the updating step (2.3) of the filtering recursion we can, as soon as the observation y_t becomes available, transform the uniformly weighted particle sample {ξ^i_t}_{i=1}^N into a weighted particle sample approximating φ_{t;θ} by associating each particle in the sample with the importance weight ω^i_t := g_{t;θ}(ξ^i_t). Now, the weighted particle sample {(ω^i_t, ξ^i_t)}_{i=1}^N targets φ_{t;θ} in the sense that for all f ∈ F(𝒳), as N tends to infinity,

φ^N_{t;θ} f := Σ_{i=1}^{N} (ω^i_t / Ω_t) f(ξ^i_t) ≈ φ_{t;θ} f,

where Ω_t := Σ_{i=1}^{N} ω^i_t denotes the weight sum. In order to form a uniformly weighted particle sample {ξ^i_{t+1}}_{i=1}^N targeting the next predictor π_{t+1;θ} we plug the filter approximation φ^N_{t;θ} into the prediction step (2.4), which results in an approximation of π_{t+1;θ} given by a mixture proportional to Σ_{i=1}^{N} ω^i_t Q_θ(ξ^i_t, ·). The particle cloud is then updated through sampling from this mixture, an operation that is typically carried through in two steps, selection and mutation, which are analogous to updating and prediction. In the selection step, a set {I^i_{t+1}}_{i=1}^N of indices is drawn from Pr({ω^i_t}_{i=1}^N), where for any positive numbers {a_i}_{i=1}^k, Pr({a_i}_{i=1}^k) denotes the categorical distribution on {1, . . . , k} induced by the probabilities proportional to {a_i}_{i=1}^k. After this, the mutation step propagates the particles forwards according to the dynamics of the state process, i.e., for all i ∈ {1, . . . , N},

ξ^i_{t+1} ∼ Q_θ(ξ^{I^i_{t+1}}_t, ·).
The algorithm is initialised by drawing {ξ^i_0}_{i=1}^N ∼ χ^{⊗N}.

The bootstrap particle filter is well suited for approximating the filter and prediction distribution flows, but it can also be used for estimating the flow of joint-smoothing distributions. In the Poor man's smoother, this is done by tracing, backwards in time, the genealogical history of the particles, and using the empirical measures associated with these ancestral lineages as approximations of the joint-smoothing distributions. Unfortunately, the repeated resampling operations of the bootstrap particle filter collapse these trajectories in the long run, implying, for large t, the existence of a random time point T < t before which all the trajectories coincide, i.e., all the particles {ξ^i_t}_{i=1}^N have a common time-T ancestor. Thus, this naive approach leads to a severely depleted estimator; see [24, 25] for discussions of this particle path degeneracy phenomenon and [20, 35, 34] for theoretical analyses of the same.

A way of detouring the particle path degeneracy phenomenon goes via the backward decomposition (2.5). Moreover, as we will see next, (2.6) provides, when the objective function of interest is of additive form, a means for recursive, non-collapsed particle smoothing. This requires each backward kernel ←Q_{φ_{s;θ}}, s ∈ N, to be approximated, which is naturally done through

←Q_{φ^N_{s;θ}} f(x) = Σ_{i=1}^{N} [ω^i_s q_θ(ξ^i_s, x) / Σ_{ℓ=1}^{N} ω^ℓ_s q_θ(ξ^ℓ_s, x)] f(ξ^i_s),   (x, f) ∈ X × F(𝒳).

The forward-only smoothing algorithm proposed by [7] consists in replacing, in (2.6), ←Q_{φ_{t;θ}} by the approximation ←Q_{φ^N_{t;θ}}. Proceeding again recursively and assuming that we have at hand estimates {τ̃^i_t}_{i=1}^N of {T_{t;θ} h_t(ξ^i_t)}_{i=1}^N, we obtain the updating formula

τ̃^i_{t+1} = Σ_{j=1}^{N} [ω^j_t q_θ(ξ^j_t, ξ^i_{t+1}) / Σ_{ℓ=1}^{N} ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})] (τ̃^j_t + h̃_t(ξ^j_t, ξ^i_{t+1})).   (2.11)

The recursion is initialised by setting τ̃^i_0 = 0 for all i ∈ {1, . . . , N}. On the basis of these estimates, each expectation φ_{0:t|t−1;θ} h_t may then be approximated by

φ^N_{0:t|t−1;θ} h_t := N^{−1} Σ_{i=1}^{N} τ̃^i_t.

This method allows for online estimation of φ_{0:t+1|t;θ} h_{t+1}. It also has the appealing property that only the current statistics {τ̃^i_t}_{i=1}^N and particle cloud {(ξ^i_t, ω^i_t)}_{i=1}^N need to be stored. However, since the updating formula (2.11) requires, at each time step, a sum of N terms to be computed for each particle, the resulting algorithm has a computational complexity that grows quadratically with the number N of particles. Needless to say, this forces the user to keep the particle sample size relatively small, resulting in low precision.

In order to increase the accuracy for a given computational budget, [34] replace (2.11) by the Monte Carlo estimate

τ^i_{t+1} = Ñ^{−1} Σ_{j=1}^{Ñ} ( τ^{J^{(i,j)}_{t+1}}_t + h̃_t(ξ^{J^{(i,j)}_{t+1}}_t, ξ^i_{t+1}) ),   (2.12)

where the sample size Ñ ∈ N* (the so-called precision parameter) is typically very small compared to N and {J^{(i,j)}_{t+1}}_{j=1}^{Ñ} are i.i.d. samples from Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N). Using the updated {τ^i_{t+1}}_{i=1}^N, an estimate of φ_{0:t+1|t;θ} h_{t+1} = π_{t+1;θ} T_{t+1;θ} h_{t+1} is obtained as N^{−1} Σ_{i=1}^{N} τ^i_{t+1}. As before, the algorithm is initialised by setting τ^i_0 = 0 for i ∈ {1, . . . , N}.

In its most basic form, also this approach has an O(N²) complexity, since it requires the normalising constant Σ_{ℓ=1}^{N} ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1}) of the distribution Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N) to be computed particle-wise, i.e. for all ξ^i_{t+1}. In order to speed up the algorithm we apply the accept-reject sampling approach proposed by [12].
This technique presupposes that there exists a constant ε̄ > 0 such that q_θ(x, x′) ≤ ε̄ for all (x, x′) ∈ X², an assumption that is satisfied for most models. Then, in order to sample from Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N), a candidate J* ∼ Pr({ω^i_t}_{i=1}^N) is accepted with probability q_θ(ξ^{J*}_t, ξ^i_{t+1})/ε̄. This procedure is repeated until acceptance. Under certain mixing assumptions (see Assumption 2 below) it can be shown (see Douc et al. [12, Proposition 2] and Olsson and Westerborn [34, Theorem 10]) that the expected number of trials needed by this approach, which is referred to as the particle-based, rapid incremental smoother (PaRIS), to update {τ^i_t}_{i=1}^N into {τ^i_{t+1}}_{i=1}^N is O(ÑN).

A key discovery of [34] is that the algorithm converges, as N tends to infinity, for all fixed Ñ ∈ N*. In particular, Olsson and Westerborn [34, Corollary 5] provide a CLT with an asymptotic variance that may, as long as Ñ ≥ 2, be shown to grow only linearly with t, which is optimal for a Monte Carlo estimator on the product space [we refer to 34, Section 1, for a discussion]. In addition, the additional asymptotic variance imposed by the "de-Rao-Blackwellisation" strategy (2.12) is inversely proportional to Ñ − 1, which suggests that Ñ should be kept at a modest value (in fact, using, say, Ñ ∈ {2, 3} typically works well in simulations). On the other hand, in the case Ñ = 1, the estimator suffers from a degeneracy problem that resembles closely that of the Poor man's smoother, implying an asymptotic variance that grows quadratically fast with t. Similar stochastic stability aspects of the tangent filter estimator proposed in the present paper will be discussed in Section 3.2.
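The accept-reject draw of a backward index just described can be sketched as follows in Python. The concrete weights, particle locations, and Gaussian transition density are illustrative stand-ins for ω^ℓ_t, ξ^ℓ_t, and q_θ (they are not taken from the paper); the bound EPS_BAR plays the role of ε̄.

```python
import math
import random

def q(x, x_next, sigma=1.0):
    # Illustrative Gaussian transition density standing in for q_theta(x, x').
    return math.exp(-0.5 * (x_next - x) ** 2 / sigma ** 2) / (sigma * math.sqrt(2 * math.pi))

EPS_BAR = 1.0 / math.sqrt(2 * math.pi)  # sup of q, attained at x' = x (role of eps-bar)

def sample_backward_index(weights, particles, x_next, rng):
    # Draw J from Pr({omega^l q(xi^l, x_next)}) WITHOUT computing the
    # normalising constant: propose J* from Pr({omega^l}) and accept with
    # probability q(xi^{J*}, x_next) / EPS_BAR, repeating until acceptance.
    while True:
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if rng.random() * EPS_BAR <= q(particles[j], x_next):
            return j

rng = random.Random(1)
weights = [0.2, 0.5, 0.3]
particles = [-1.0, 0.0, 1.5]
x_next = 0.3
draws = [sample_backward_index(weights, particles, x_next, rng) for _ in range(20000)]
```

Since the proposal is drawn in O(1) (given a precomputed sampler for Pr({ω^ℓ})) and the expected number of rejections stays bounded under the mixing assumptions discussed above, each backward index costs O(1) expected work, giving the O(ÑN) overall complexity of the PaRIS update.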
3. Tangent filter estimation: main results
The estimator that we propose is based on the identity (2.10), which we, for a given θ, approximate online using the PaRIS algorithm. More specifically, we estimate, on-the-fly as new observations appear, the flow {η_{t;θ}}_{t∈N} of tangent filters. This is done by running a particle filter targeting the predictor distributions while updating estimates {τ^i_t}_{i=1}^N of {T_{t;θ} h_{t;θ}(ξ^i_t)}_{i=1}^N, where h_{t;θ} is the complete data score given by (2.8). The statistics {τ^i_t}_{i=1}^N are updated efficiently using the PaRIS updating rule (2.12). At each time step t, our algorithm returns

η^N_{t;θ} f_t = N^{−1} Σ_{i=1}^{N} ( τ^i_t − N^{−1} Σ_{j=1}^{N} τ^j_t ) f_t(ξ^i_t)

as an estimate of η_{t;θ} f_t for any f_t ∈ F(𝒳). We summarise the algorithm in the pseudo-code below, where every line involving the index i should be executed for all i ∈ {1, . . . , N}.

draw ξ^i_0 ∼ χ
set τ^i_0 ← 0
for t ← 0, 1, 2, . . . do
  set ω^i_t ← g_{t;θ}(ξ^i_t)
  draw I^i_{t+1} ∼ Pr({ω^ℓ_t}_{ℓ=1}^N)
  draw ξ^i_{t+1} ∼ q_θ(ξ^{I^i_{t+1}}_t, ·)
  for j ← 1, . . . , Ñ do
    draw J^{(i,j)}_{t+1} ∼ Pr({ω^ℓ_t q_θ(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N)
  end for
  set τ^i_{t+1} ← Ñ^{−1} Σ_{j=1}^{Ñ} ( τ^{J^{(i,j)}_{t+1}}_t + h̃_{t;θ}(ξ^{J^{(i,j)}_{t+1}}_t, ξ^i_{t+1}) )
  set τ̄_{t+1} ← N^{−1} Σ_{ℓ=1}^{N} τ^ℓ_{t+1}
  set η^N_{t+1;θ} ← N^{−1} Σ_{ℓ=1}^{N} ( τ^ℓ_{t+1} − τ̄_{t+1} ) δ_{ξ^ℓ_{t+1}}
end for

Remark 1. The previous algorithm can be viewed as a "de-Rao-Blackwellisation" of the tangent filter estimator proposed in [8], which is based on the updating rule (2.11) instead of (2.12). With the exception of this important difference, the two algorithms follow the same recursion. Consequently, the algorithm of [8] has an O(N²) complexity.

The first part of the convergence analysis of our algorithm will be carried through under the following, relatively mild, assumption.
Assumption 1.
(i) For all t ∈ N and θ ∈ Θ, g_{t;θ} is a positive and bounded measurable function.
(ii) For all θ ∈ Θ, q_θ ∈ F(𝒳^{⊗2}).

Assumption 1(i) implies finiteness and positiveness of the particle importance weights, while Assumption 1(ii) allows, apart from some technical arguments needed in the proofs, the accept-reject sampling technique mentioned in Section 2.4 to be used.

We begin by establishing an exponential concentration inequality, valid for all finite particle sample sizes N, for our PaRIS-based tangent filter estimator. This yields the strong (P-a.s.) consistency (as N tends to infinity) of the algorithm as a corollary. In addition, we state and establish a CLT with an asymptotic variance given by the sum of the asymptotic variance of the estimator proposed by [8] and one term reflecting the additional variance introduced by the supplementary sampling performed in the PaRIS version. Finally, we are able to derive, by operating with strong mixing assumptions, a time-uniform bound on the asymptotic variance, which is inversely proportional to Ñ − 1. This guarantees that the algorithm is stochastically stable in the long run. All proofs, which use results obtained by [34], are postponed to Appendix A.
Theorem 2.
Let Assumption 1 hold. Then for all t ∈ N, θ ∈ Θ, f_t ∈ F(X), and Ñ ∈ N* there exist (c_t, c̃_t) ∈ (R*_+)² (depending on θ, Ñ, and f_t) such that for all N ∈ N* and ε > 0,

P( |η^N_{t;θ} f_t − η_{t;θ} f_t| ≥ ε ) ≤ c_t exp( −c̃_t N ε (ε ∧ 1) ).

The strong consistency of the estimator follows.
Corollary 3.
Let Assumption 1 hold. Then for all t ∈ N, θ ∈ Θ, f_t ∈ F(X), and Ñ ∈ N* it holds, P-a.s., lim_{N→∞} η^N_{t;θ} f_t = η_{t;θ} f_t.

We now turn to weak convergence. For this purpose, we introduce, for all t ∈ N, the unnormalised kernels

L_{t;θ} f(x) := g_{t;θ}(x) Q_θ f(x),  (x, f) ∈ X × F(X),

and, for all (s, t) ∈ N² such that s ≤ t, the retro-prospective kernels

D_{s,t;θ} f(x_s) := ∫∫ f(x_{0:t}) g_{t;θ}(x_t) T_{s;θ}(x_s, dx_{0:s−1}) L_{s;θ} ⊗ ··· ⊗ L_{t−1;θ}(x_s, dx_{s+1:t}),
D̃_{s,t;θ} f(x_s) := D_{s,t;θ}( f − φ_{t|t−1;θ} f )(x_s),

for (x_s, f) ∈ X × F(X^{⊗(t+1)}), operating simultaneously in the backward and forward directions.

Theorem 4.
Let Assumption 1 hold. For all t ∈ N, θ ∈ Θ, Ñ ∈ N*, and f_t ∈ F(X), as N → ∞,

√N ( η^N_{t;θ} f_t − η_{t;θ} f_t ) →_D σ_{t;θ}(f_t) Z,

where Z is a standard Gaussian random variable and

σ²_{t;θ}(f_t) := σ̃²_{t;θ}(f_t) + Σ_{s=0}^{t−1} Σ_{ℓ=0}^{s} Ñ^{ℓ−(s+1)} ς_{s,ℓ,t;θ}(f_t),  (3.1)

where

σ̃²_{t;θ}(f_t) := Σ_{s=0}^{t−1} π_{s+1;θ} [ D̃_{s+1,t;θ}{ (h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ})(f_t − π_{t;θ} f_t) } ]² / ( π_{s+1;θ} L_{s+1;θ} ··· L_{t−1;θ} 𝟙_X )²  (3.2)

is the asymptotic variance of the tangent filter estimator proposed by [8] and

ς_{s,ℓ,t;θ}(f_t) := π_{ℓ+1;θ}{ ←Q_{φ_{ℓ;θ}}( T_{ℓ;θ} h_{ℓ;θ} + h̃_{ℓ;θ} − T_{ℓ+1;θ} h_{ℓ+1;θ} )² L_{ℓ+1;θ} ··· L_{s;θ}( L_{s+1;θ} ··· L_{t−1;θ}{ f_t − π_{t;θ} f_t } )² } / [ ( π_{ℓ+1;θ} L_{ℓ+1;θ} ··· L_{s;θ} 𝟙_X )( π_{s+1;θ} L_{s+1;θ} ··· L_{t−1;θ} 𝟙_X )² ].

Remark 5.
We note that, for all t ∈ N,

lim_{Ñ→∞} σ²_{t;θ}(f_t) = σ̃²_{t;θ}(f_t).

This is in line with our expectations, as the tangent filter estimator proposed by [8] can be viewed as a Rao-Blackwellisation of our estimator.

We now turn to the numerical stability of the algorithm and show that the error of our tangent filter estimator stays uniformly bounded in time. There are several ways of formulating such stability; in the present paper we follow [6, 13] and establish a time-uniform bound on the asymptotic variance of the estimator. The analysis requires Assumption 1 to be sharpened as follows.
Assumption 2. (i) There exists ρ > 1 such that for all θ ∈ Θ and (x, x′) ∈ X²,

ρ^{−1} ≤ q_θ(x, x′) ≤ ρ.

(ii) There exist constants 0 < δ < δ̄ < ∞ such that for all t ∈ N, all θ ∈ Θ, and x ∈ X,

δ ≤ g_{t;θ}(x) ≤ δ̄.

Under Assumption 2 we define ϱ := 1 − ρ^{−2}.

These (admittedly very restrictive) strong mixing assumptions typically require the state space X of the underlying Markov chain to be a compact set. Nevertheless, such assumptions are standard in the SMC analysis literature; see [6] and, e.g., [5, 13] for refinements. Finally, we assume that the terms (2.8) of the complete data score function are uniformly bounded.
There exists a constant |h̃|_∞ ∈ R_+ such that for all s ∈ N and θ ∈ Θ, osc(h̃_{s;θ}) ≤ |h̃|_∞.

In conformity with the strong mixing assumptions, Assumption 3 also typically requires the state space X to be compact.

The previous assumptions allow us to establish the following theorem, providing a time-uniform bound on the sequence {σ²_{t;θ}(f)}_{t∈N} of asymptotic variances. Generally, when the objective function is of additive form, the variance of a Monte Carlo estimator typically increases linearly with the number of terms, or, equivalently, with the dimension of the underlying product space [see e.g. 34, Section 1, for a discussion]. Thus, as our estimator is based on the identities (2.9) and (2.10), which express the tangent filter in terms of additive smoothed expectations over the path space, it may seem surprising that we are able to derive a time-uniform bound on the asymptotic variance. However, recall from (2.9) that the tangent filter η_{t;θ} f_t coincides with the covariance of f_t and the complete data score function h_{t;θ} under φ_{t|t−1;θ}. Thus, when the model exhibits forgetting properties, which is the case under the strong mixing assumptions above, the sequence {η_{t;θ} f_t}_{t∈N} stays bounded, and time-uniform convergence of the Monte Carlo estimator is hence within reach.

Theorem 6.
Let Assumptions 2 and 3 hold. Then there exist constants (c, c̃) ∈ (R*_+)² such that for all t ∈ N, θ ∈ Θ, Ñ ≥ 2, and f_t ∈ F(X),

σ²_{t;θ}(f_t) ≤ ‖f_t‖²_∞ ( c + c̃ / ((Ñ − 1)(1 − ϱ)) )  and  σ̃²_{t;θ}(f_t) ≤ c ‖f_t‖²_∞.  (3.3)

The bound (3.3) on the first term (3.2) of the asymptotic variance, corresponding to the variance of the Rao-Blackwellised algorithm, was derived by [8]. As expected from the theoretical results obtained by [34], the bound on the second term is inversely proportional to Ñ − 1, which indicates that boosting the number Ñ of PaRISian backward draws from a moderate to a very large value will have a negligible effect on the precision. We hence recommend keeping Ñ low, say at most 5, in order to gain computational efficiency. A more detailed discussion on the choice of Ñ can be found in [34].
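The practical content of this recommendation can be seen from the Ñ-dependent factor 1/(Ñ − 1) in the bound. The arithmetic below is purely illustrative (the constants c and c̃ are model dependent and not computed here); it only makes the diminishing returns in Ñ explicit.

```python
def excess_factor(n_tilde: int) -> float:
    """The N~-dependent factor 1/(N~ - 1) multiplying the excess variance
    term in the time-uniform bound; constants are left out on purpose."""
    return 1.0 / (n_tilde - 1)

# Going from N~ = 2 to N~ = 5 already removes three quarters of the
# N~-dependent excess, while N~ -> infinity can remove at most the rest.
factors = {n: excess_factor(n) for n in range(2, 11)}
```

For instance, `factors[2]` equals 1 and `factors[5]` equals 0.25, so increasing Ñ beyond a handful of backward draws buys very little additional precision per unit of computational work.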
4. Application to recursive maximum likelihood estimation
Given a fixed data record y_{0:t}, our goal is to perform maximum likelihood estimation of θ, i.e., to find the parameter vector θ* ∈ Θ such that θ* = arg max_{θ∈Θ} L_θ(y_{0:t}), where L_θ(y_{0:t}) is the likelihood of the observed data given in (2.2). Equivalently, we may maximise the log-likelihood ℓ_θ(y_{0:t}) = log L_θ(y_{0:t}). There are many different approaches to such maximisation; see [4, Chapter 10] for a general overview.

Here we will focus on the following Robbins-Monro scheme: at iteration n, let

θ_n = θ_{n−1} + γ_n Z_n,

where Z_n is a noisy measurement of ∇_θ ℓ_θ(y_{0:t})|_{θ=θ_{n−1}}, i.e., of the score of the observed data evaluated at θ = θ_{n−1}, and {γ_n}_{n∈N*} is a sequence of positive step sizes satisfying the standard stochastic approximation requirements Σ_{n=1}^∞ γ_n = ∞ and Σ_{n=1}^∞ γ²_n < ∞. Note that this approach requires approximation of Z_n at each iteration of the algorithm. If the number of observations is very large, computing Z_n is costly, and since many iterations are often required for convergence, this results in an impractical algorithm. Moreover, if we receive a new observation, Z_n needs to be recalculated, which turns the procedure into an offline algorithm.

We now sketch the basic principles of recursive maximum likelihood. First note that since {(X_t, Y_t)}_{t∈N} is a Markov chain, the quadruple {(X_t, Y_t, π_{t;θ}, η_{t;θ})}_{t∈N} forms, by (2.3) and (2.4) and for all θ ∈ Θ, a Markov chain as well. In the case where the state space X is a finite set, [27] showed that this chain is ergodic under certain mixing assumptions; this result was later extended by [11] to the case where X is compact.
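The generic Robbins-Monro recursion introduced above is easy to state in code. The sketch below is a toy illustration only, unrelated to SSMs: it maximises a one-dimensional Gaussian log-likelihood from a stream of noisy score evaluations, with step sizes γ_n = n^{−0.7} (an assumed choice) satisfying the two summability conditions.

```python
import numpy as np

def robbins_monro(grad_estimate, theta0, n_iter, gamma=lambda n: n ** -0.7):
    """Generic Robbins-Monro recursion theta_n = theta_{n-1} + gamma_n * Z_n,
    where Z_n = grad_estimate(theta_{n-1}) is a noisy gradient measurement.
    The step sizes gamma_n = n^{-0.7} satisfy sum gamma_n = inf and
    sum gamma_n^2 < inf."""
    theta = theta0
    for n in range(1, n_iter + 1):
        theta = theta + gamma(n) * grad_estimate(theta)
    return theta

# Toy example: fit the mean of an N(mu, 1) model from streaming data.
# The score of a single observation y is (y - mu); its expectation
# vanishes exactly at the true mean mu* = 2.0.
rng = np.random.default_rng(0)
mu_hat = robbins_monro(lambda mu: rng.normal(2.0, 1.0) - mu,
                       theta0=0.0, n_iter=20000)
```

Each call to `grad_estimate` consumes one fresh observation, which is precisely the online flavour sought after in the RML setting below.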
Now, assume that the data is generated by a model parameterised by a distinctive θ* ∈ Θ; then, denoting by Π_{θ,θ*} the stationary distribution of this chain and Π̃_{θ,θ*}(·) = Π_{θ,θ*}(X × ·) its marginal with respect to the last three components, it holds for all θ ∈ Θ, P-a.s.,

lim_{t→∞} t^{−1} ∇_θ ℓ_θ(Y_{0:t}) = lim_{t→∞} t^{−1} Σ_{s=0}^t ∇_θ ℓ_θ(Y_s | Y_{0:s−1})
= lim_{t→∞} t^{−1} Σ_{s=0}^t [ π_{s;θ}(∇_θ g_{s;θ}) + η_{s;θ} g_{s;θ} ] / ( π_{s;θ} g_{s;θ} )
= ∫∫∫ [ π(∇_θ g_θ(·, y)) + η(g_θ(·, y)) ] / [ π(g_θ(·, y)) ] Π̃_{θ,θ*}(d(y, π, η)) =: λ(θ, θ*),

where π_{s;θ}(∇_θ g_{s;θ}), η_{s;θ} g_{s;θ}, and π_{s;θ} g_{s;θ} depend implicitly on Y_s. These identities follow, in order, from the decomposition (1.1) of the log-likelihood, from applying the nabla operator, and from Birkhoff's ergodic theorem. By identifiability [see, e.g., 4, Section 12.4], the true parameter θ* solves λ(θ, θ*) = 0, and the task of solving this equation may be cast into the framework of stochastic approximation and the Robbins-Monro algorithm

θ_{t+1} = θ_t + γ_{t+1} ζ_{t+1},  t ∈ N,  (4.1)

where ζ_{t+1} is a noisy observation of λ(θ_t, θ*). Ideally, such a noisy observation would be formed by estimating λ(θ_t, θ*) by a draw from Π̃_{θ,θ*}, which is, needless to say, not possible in practice. Thus, by introducing one more level of approximation and allowing for Markovian perturbations, one may instead, following (2.10), estimate λ(θ_t, θ*) by ζ_{t+1} := (ζ¹_{t+1} + ζ²_{t+1}) / ζ³_{t+1}, where

ζ¹_{t+1} := π_{t+1}(∇_θ g_{t+1;θ}|_{θ=θ_t}),
ζ²_{t+1} := η_{t+1} g_{t+1;θ_t},
ζ³_{t+1} := π_{t+1} g_{t+1;θ_t},

with η_{t+1} f := π_{t+1}{ (T_{t+1} − π_{t+1} T_{t+1}) f }, f ∈ F(X), and the statistics {T_t}_{t∈N} being, following (2.6), updated recursively according to

T_{t+1}(x_{t+1}) = ∫ { T_t(x_t) + h̃_{t;θ_t}(x_{t:t+1}) } ←Q_{φ_t}(x_{t+1}, dx_t),  x_{t+1} ∈ X,  (4.2)

and initialised as T_0 ≡ 0. The statistic (4.2) depends on the new observation Y_{t+1} through the term h̃_{t;θ_t}. In addition, the measures {φ_t}_{t∈N} and {π_t}_{t∈N} are updated according to the recursion

φ_t f = π_t(g_{t;θ_t} f) / π_t g_{t;θ_t},  π_{t+1} f = φ_t Q_{θ_t} f,  f ∈ F(X),  (4.3)

with, by convention, π_0 := χ. Note that the recursion (4.3) updates the filter and predictor distributions on the basis of the current parameter fit. This means that for all t ∈ N, the measures φ_t, π_{t+1}, and η_{t+1} depend on all the past parameter values {θ_s}_{s=0}^t.

This approach yields an online algorithm in which a new observation can be incorporated into the estimator without recalculating the latter from scratch. In the case where X is finite, this algorithm was studied in [28], which also provides assumptions under which the sequence {θ_t}_{t∈N*} converges towards the parameter θ* [see also 38, for refinements].

In the general case, the recursions (4.2) and (4.3) are intractable. Their translation into the language of particles is, however, immediate: one approximates (4.3) by a particle filter updated according to the time-varying parameters, and (4.2) by means of a PaRISian updating step (2.12), the latter executed for some suitable precision parameter Ñ ≥ 2. The parameter fit is then updated recursively as each new observation y_{t+1} is received, by first calculating a particle approximation ζ̂_{t+1} of ζ_{t+1} and then updating the parameter according to θ_{t+1} = θ_t + γ_{t+1} ζ̂_{t+1}, where {γ_t}_{t∈N*} satisfies the standard stochastic optimisation requirements (i.e., infinite sum and finite sum of squares) above. The algorithm, which is illustrated numerically in the next section, is detailed in pseudo-code below; as before, every line involving the index i should be executed for all i ∈ {1, …, N}.

set arbitrarily θ_0
draw ξ^i_0 ∼ χ
set τ^i_0 ← 0
for t ← 0, 1, 2, … 
do
    set ω^i_t ← g_{θ_t}(ξ^i_t, y_t)
    draw I^i_{t+1} ∼ Pr({ω^ℓ_t}_{ℓ=1}^N)
    draw ξ^i_{t+1} ∼ q_{θ_t}(ξ^{I^i_{t+1}}_t, ·)
    for j ← 1, …, Ñ do
        draw J^{(i,j)}_{t+1} ∼ Pr({ω^ℓ_t q_{θ_t}(ξ^ℓ_t, ξ^i_{t+1})}_{ℓ=1}^N)
    end for
    set τ^i_{t+1} ← Ñ^{-1} Σ_{j=1}^{Ñ} ( τ_t^{J^{(i,j)}_{t+1}} + h̃_{t;θ_t}(ξ_t^{J^{(i,j)}_{t+1}}, ξ^i_{t+1}) )
    set τ̄_{t+1} ← N^{-1} Σ_{ℓ=1}^N τ^ℓ_{t+1}
    set ζ̂¹_{t+1} ← N^{-1} Σ_{ℓ=1}^N ∇g_{θ_t}(ξ^ℓ_{t+1}, y_{t+1})
    set ζ̂²_{t+1} ← N^{-1} Σ_{ℓ=1}^N ( τ^ℓ_{t+1} − τ̄_{t+1} ) g_{θ_t}(ξ^ℓ_{t+1}, y_{t+1})
    set ζ̂³_{t+1} ← N^{-1} Σ_{ℓ=1}^N g_{θ_t}(ξ^ℓ_{t+1}, y_{t+1})
    set θ_{t+1} ← θ_t + γ_{t+1} ( ζ̂¹_{t+1} + ζ̂²_{t+1} ) / ζ̂³_{t+1}
end for

We benchmark our algorithm on two different models, namely
• the stochastic volatility model of [19], where we compare our estimator to that of [8], and
• the simultaneous localisation and mapping (SLAM) model, to which we apply our algorithm as a proof of concept.

Consider the stochastic volatility model

X_{t+1} = φ X_t + σ V_{t+1},
Y_t = β exp(X_t / 2) U_t,  t ∈ N,

proposed by [19], where {V_t}_{t∈N*} and {U_t}_{t∈N} are independent sequences of mutually independent standard Gaussian noise variables. In this model the parameters to be estimated are θ = (φ, σ², β). We compared our PaRISian RML estimator to the estimator proposed by [8], the latter being referred to as the "Rao-Blackwellised estimator". Since there is a significant difference in computational complexity between the algorithms, we chose the particle sample size of each algorithm in such a way that the running times of the two algorithms are close to identical. With our implementation, using N = 100 in the algorithm of [8] yields close to the same CPU time as using (N, Ñ) = (1400,
2) in the PaRISian RML estimator. For both algorithms we let { γ t } t ∈ N = { t − . } t ∈ N . The algorithms were executed on an observation data record comprising500 ,
000 observations generated under the parameters θ* = (0. , . , . ). As can be seen in Figure 1, the parameter estimates produced by the PaRISian version exhibit appreciably lower variance than those produced by the Rao-Blackwellised estimator (using a considerably smaller particle sample size). Indeed, computing the sample variances over the final parameter estimates indicates an improvement by almost an order of magnitude to the benefit of the PaRISian version for the same computational time. It can also be seen in Figure 1 that the trajectories exhibit some jumps, mainly in the β variable. This occurs when the estimate ζ̂³_{t+1} is close to zero, which corresponds to time steps when the particles fail to cover the support of the emission density. Since the computationally more efficient PaRISian version allows considerably more particles to be used for the given budget, the problem is much smaller for this algorithm; indeed, whereas the learning curves of the Rao-Blackwellised version exhibit a high degree of volatility and several large jumps (the peaks are cut for visibility) with subsequent very long excursions in the parameter space, the PaRISian version exhibits only one significant jump of moderate size.

Fig 1. Particle learning trajectories produced by the PaRISian RML estimator (left panel) and the Rao-Blackwellised RML estimator (right panel) proposed in [8] for, from top to bottom, φ, σ², and β. The former and latter algorithms used N = 1,400 and N = 400 particles, respectively (leading to comparable CPU times). For each algorithm, replicates were generated on the same data set with different, randomised initial parameters (being the same for both algorithms). For the Rao-Blackwellised version, the plot of β does not contain the full trajectories due to very high peaks.
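The per-observation parameter update in the pseudo-code above can be isolated in a short routine. The sketch below is schematic (the function name is ours, and the arrays of particle weights, weight gradients, and PaRIS statistics are assumed to be produced elsewhere by the particle filter pass); it forms the estimates ζ̂¹, ζ̂², ζ̂³ and applies the step θ_{t+1} = θ_t + γ_{t+1}(ζ̂¹ + ζ̂²)/ζ̂³.

```python
import numpy as np

def rml_parameter_step(theta, gamma, w_grad, w, tau):
    """One recursive-maximum-likelihood parameter step (schematic sketch).

    Arguments, all assumed precomputed for the current particle cloud:
      w_grad : array of grad_theta g_theta(xi_i, y) per particle
      w      : array of g_theta(xi_i, y) per particle
      tau    : array of PaRIS statistics tau_i per particle
    Returns theta + gamma * (zeta1 + zeta2) / zeta3.
    """
    zeta1 = w_grad.mean()                    # particle estimate of pi(grad g)
    zeta2 = np.mean((tau - tau.mean()) * w)  # particle estimate of eta(g)
    zeta3 = w.mean()                         # particle estimate of pi(g)
    return theta + gamma * (zeta1 + zeta2) / zeta3
```

Note that when ζ̂³ is close to zero (particles missing the support of the emission density) the step blows up, which is exactly the source of the occasional jumps visible in Figure 1.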
The simultaneous localisation and mapping (SLAM) problem is fundamental in robotics. In the version considered here we let the state space consist of the coordinates and the bearing of a robot moving in the plane, i.e. X = R² × (−π, π]. The prior motion of the robot is modeled as a Markov chain X_t := (X¹_t, X²_t, X³_t), t ∈ N, on this space, defined through the state equations

X¹_{t+1} = X¹_t + d_{t+1} cos(X³_t) + ε¹_{t+1},
X²_{t+1} = X²_t + d_{t+1} sin(X³_t) + ε²_{t+1},
X³_{t+1} = X³_t + α_{t+1} + ε³_{t+1},

where {ε^i_t}_{t∈N*}, i ∈ {1, 2, 3}, are independent sequences of mutually independent noise variables with known variances σ²_i, i ∈ {1, 2, 3}. The sequence {(d_t, α_t)}_{t∈N*} provides the commands, in terms of speed and bearing changes, of the robot at each time point.

The robot observes a landmark, defined by its position in the plane, by measuring the distance and the relative bearing to the same. Assuming L landmarks, the observations at time t are given by Y_t = {Y^i_t}_{i∈O_t}, where O_t ⊆ {1, …, L} is the set of observed landmarks. For each observed landmark i,

Y^i_t = ( ‖(θ^i_1, θ^i_2) − (X¹_t, X²_t)‖ + ε^{i,1}_t ,  arctan( (θ^i_2 − X²_t) / (θ^i_1 − X¹_t) ) − X³_t + ε^{i,2}_t ),

where ‖·‖ denotes the Euclidean norm, (θ^i_1, θ^i_2) is the location of landmark i, and (ε^{i,1}_t, ε^{i,2}_t) are independent noise variables with known variances (ς²_1, ς²_2), the noises of different time steps and different landmarks being independent.

In this setting we wish to estimate the locations of all the landmarks, which implies 2L unknown parameters. Note that the noise parameters of the model are assumed to be calibrated beforehand. Several existing works apply particle methods to the SLAM problem; see, e.g., the FastSLAM [30] and
FastSLAM 2.0 [31] algorithms. More recently, an online EM version was proposed by [26], and the Marginal-SLAM algorithm [29] is based on an updating step similar to (4.1).

Using the model above, data was simulated for L = 9 landmarks located according to the green dots in Figure 2. The observations were generated by simulating 100,000 observations of the robot moving around a closed loop, with noise parameters σ₁ = σ₂ = 0.—, σ₃ = π/—, ς₁ = 0.25, and ς₂ = π/—. The robot observes landmarks within a 30 m radius and a 180° field of vision.

Using this simulated data we performed 20 independent runs of our PaRISian RML algorithm in order to estimate the landmark positions. For each landmark, the initial estimate of its position was drawn randomly from a bivariate Gaussian distribution with the true landmark position as mean and the identity matrix as covariance matrix. The algorithm used (N, Ñ) = (500, 2) and the particles were initialised at (0, 0) with zero bearing, i.e., the same as the robot's starting position. Since the problem is invariant under translations and rotations, we fix two landmarks and assume that these are known a priori. The results are displayed in Figure 2, where the estimates cluster closely around the true parameter values. The figure also shows the estimated filtered trajectory of the robot during its first 3 laps. In order to illustrate the convergence of the landmark position estimates, we consider the evolution of their MSE over time. The results are reported in Figure 3, where the average MSE of 4 landmarks is computed based on the 20 independent runs of the algorithm. As is clear from the picture, the processing of the full observation record yields MSE values that fluctuate stably around zero.

Needless to say, even though this prototypical solution to the SLAM problem seems promising, a lot of additional work is needed to obtain an algorithm running in real applications. Moreover, the comprehensive task of benchmarking our approach against the full palette of existing algorithms treating this fundamental problem falls outside the scope of the present paper.

Fig 2. Resulting estimates of the landmarks for the SLAM problem using the PaRISian RML algorithm. The circles are the true positions of the landmarks and green stars are the resulting estimates. The black dots denote the particle estimates of the robot's positions during the first three laps. Rerunning the algorithm yields similar trajectories.
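To fix ideas, the SLAM state and observation equations above can be simulated directly. The sketch below implements only the model dynamics, not the estimation algorithm; the function names are our own, and `np.arctan2` is used in place of the plain arctan in the text so that the bearing lands in the correct quadrant.

```python
import numpy as np

def move(state, d, alpha, sig, rng):
    """Robot motion model: advance (x1, x2, bearing) by command (d, alpha)
    with Gaussian noise of standard deviations sig = (sig1, sig2, sig3)."""
    x1, x2, b = state
    x1 += d * np.cos(b) + rng.normal(0.0, sig[0])
    x2 += d * np.sin(b) + rng.normal(0.0, sig[1])
    # wrap the perturbed bearing back into (-pi, pi]
    b = (b + alpha + rng.normal(0.0, sig[2]) + np.pi) % (2 * np.pi) - np.pi
    return np.array([x1, x2, b])

def observe(state, landmark, sig_r, sig_b, rng):
    """Range-and-bearing observation of one landmark (l1, l2)."""
    x1, x2, b = state
    l1, l2 = landmark
    dist = np.hypot(l1 - x1, l2 - x2) + rng.normal(0.0, sig_r)
    bearing = np.arctan2(l2 - x2, l1 - x1) - b + rng.normal(0.0, sig_b)
    return np.array([dist, bearing])
```

With zero noise, a unit forward command from the origin moves the robot to (1, 0), and a landmark at (3, 4) is observed at distance 5 with bearing arctan2(4, 3), matching the model equations.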
5. Conclusion
We have presented a novel algorithm for efficient particle-based estimation of tangent filters in general SSMs. The algorithm involves the PaRIS algorithm proposed by [34] as a key ingredient. The estimator is furnished with several convergence results, the main result being a CLT at the rate √N. The convergence analysis builds on results obtained by [34, 8], and our time-uniform bound on the asymptotic variance extends the existing results in [8]. Importantly, under weak assumptions, the computational complexity of our algorithm grows only linearly with the number of particles, whereas the complexity of existing methods such as the estimator of [8], which may be viewed as a Rao-Blackwellisation of our estimator, is typically quadratic in the number of particles. We may thus expect the computational benefit of the "de-Rao-Blackwellisation" that characterises PaRIS to greatly exceed the price of the additional variance introduced by the PaRISian backward-sampling procedure.

In the second part of the paper we cast our PaRISian tangent filter estimator into the framework of RML, yielding a computationally efficient and easily implemented online parameter estimation procedure. As is clear from the simulations, the fact that our online estimator allows, compared to existing methods, more particles to be used for a given computational budget is important for the stability of the online estimates.

Appealingly, the asymptotic variance of our estimator can be bounded uniformly in time; in other words, the estimator stays stochastically stable in the long term.

Fig 3. Resulting mean squared error estimates for 4 of the landmarks, based on the 20 independent runs of the PaRISian RML algorithm for the SLAM problem.

Needless to say, the strong
mixing assumptions driving the stability analysis are restrictive, and relaxing these assumptions is a natural direction of research. [21] provides an O(t) bound on the asymptotic variance of the forward-filtering backward-smoothing (FFBSm) algorithm [15], which is equivalent to the Rao-Blackwellised PaRIS in the case of additive state functionals (the online formulation was, however, first given by Del Moral et al. [7]), under assumptions that point to applications on non-compact state spaces, and the same approach may be applicable to our PaRISian tangent filter estimator.

Finally, it remains to prove, under general, verifiable assumptions, the convergence of the parameter estimates produced by our PaRISian RML algorithm. Such an analysis is expected to be technically very involved, and is hence beyond the scope of the present paper. In [28], in which the convergence of exact RML in HMMs is established for the case of a finite state space, the proof consists in showing that the extended Markov chain comprising the state and observation processes as well as the prediction and tangent filter flows is ergodic, and in applying standard stochastic approximation results [9]. A route to the convergence of our proposed PaRISian RML algorithm could be to include also the particle cloud and the particle-based statistics {τ^i_t}_{i=1}^N in the Markov chain and study the ergodicity of the same. An alternative and more direct approach would be to cast the problem into the framework of biased stochastic gradient search, which was analysed recently by [39].

References

[1] B. D. O. Anderson and J. B. Moore.
Optimal Filtering. Prentice-Hall, New Jersey, 1979.
[2] O. Cappé. Ten years of HMMs (online bibliography 1989–2000). http://perso.telecom-paristech.fr/~cappe/docs/hmmbib.html, 2001.
[3] O. Cappé. Online EM algorithm for hidden Markov models. Journal of Computational and Graphical Statistics, 20(3):728–749, 2011.
[4] O. Cappé, E. Moulines, and T. Rydén.
Inference in Hidden Markov Models. Springer, New York, 2005.
[5] D. Crisan and K. Heine. Stability of the discrete time filter in terms of the tails of noise distributions. Journal of the London Mathematical Society. Second Series, 78(2):441–458, 2008.
[6] P. Del Moral and A. Guionnet. On the stability of interacting processes with applications to filtering and genetic algorithms. Annales de l'Institut Henri Poincaré, 37:155–194, 2001.
[7] P. Del Moral, A. Doucet, and S. Singh. A backward particle interpretation of Feynman-Kac formulae. ESAIM: Mathematical Modelling and Numerical Analysis, 44(5):947–975, 2010.
[8] P. Del Moral, A. Doucet, and S. S. Singh. Uniform stability of a particle approximation of the optimal filter derivative. SIAM Journal on Control and Optimization, 53(3):1278–1304, 2015.
[9] B. Delyon. General results on the convergence of stochastic algorithms. IEEE Transactions on Automatic Control, 41(9):1245–1255, 1996.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38 (with discussion), 1977.
[11] R. Douc and C. Matias. Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7(3):381–420, 2001.
[12] R. Douc, A. Garivier, E. Moulines, and J. Olsson. Sequential Monte Carlo smoothing for general state space hidden Markov models. Annals of Applied Probability, 21(6):2109–2145, 2011.
[13] R. Douc, E. Moulines, and J. Olsson. Long-term stability of sequential Monte Carlo methods under verifiable conditions. Annals of Applied Probability, 24(5):1767–1802, 2014.
[14] A. Doucet and V. B. Tadić. Parameter estimation in general state-space models using particle methods. Annals of the Institute of Statistical Mathematics, 55(2):409–422, 2003.
[15] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte-Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10:197–208, 2000.
[16] A. Doucet, N. De Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.
[17] P. Fearnhead, D. Wyncoll, and J. Tawn. A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464, 2010.
[18] N. Gordon, D. Salmond, and A. F. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140:107–113, 1993.
[19] J. Hull and A. White. The pricing of options on assets with stochastic volatilities. The Journal of Finance, 42:281–300, 1987.
[20] P. E. Jacob, L. M. Murray, and S. Rubenthaler. Path storage in the particle filter. Statistics and Computing, pages 1–10, 2013.
[21] A. Jasra. On the behaviour of the backward interpretation of Feynman-Kac formulae under verifiable conditions. Journal of Applied Probability, 52(2):339–359, 2015.
[22] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In
AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, 1997.
[23] N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, and N. Chopin. On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351, 2015.
[24] G. Kitagawa. Monte-Carlo filter and smoother for non-Gaussian nonlinear state space models.
Journal of Computational and Graphical Statistics, 1:1–25, 1996.
[25] G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space model. In Sequential Monte Carlo Methods in Practice, pages 177–195. Springer, New York, 2001.
[26] S. Le Corff, G. Fort, and E. Moulines. Online expectation maximization algorithm to solve the SLAM problem. In , pages 225–228, 2011.
[27] F. Le Gland and L. Mevel. Geometric ergodicity in hidden Markov models. Research Report RR-2991, INRIA, 1996.
[28] F. Le Gland and L. Mevel. Recursive estimation in hidden Markov models. In
Proceedings of the 36th IEEE Conference on Decision and Control, pages 3468–3473, 1997.
[29] R. Martinez-Cantin, N. de Freitas, and J. A. Castellanos. Analysis of particle methods for simultaneous robot localization and mapping and a new algorithm: Marginal-SLAM. In
Proceedings 2007 IEEE International Conference on Robotics and Automation, pages 2415–2420, 2007.
[30] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002. AAAI.
[31] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003. IJCAI.
[32] T. N. M. Nguyen, S. Le Corff, and E. Moulines. On the two-filter approximations of marginal smoothing distributions in general state-space models. Advances in Applied Probability, 50(1):154–177, 2017.
[33] J. Olsson and J. Westerborn. Efficient parameter inference in general hidden Markov models using the filter derivatives. In , pages 3984–3988, 2016.
[34] J. Olsson and J. Westerborn. Efficient particle-based online smoothing in general hidden Markov models: The PaRIS algorithm. Bernoulli, 23(3):1951–1996, 2017.
[35] J. Olsson, O. Cappé, R. Douc, and E. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in non-linear state space models. Bernoulli, 14(1):155–179, 2008.
[36] G. Poyiadjis, A. Doucet, and S. S. Singh. Particle methods for optimal filter derivative: application to parameter estimation. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 925–928, 2005.
[37] G. Poyiadjis, A. Doucet, and S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 2011.
[38] V. B. Tadić. Analyticity, convergence, and convergence rate of recursive maximum-likelihood estimation in hidden Markov models. IEEE Transactions on Information Theory, 56(12):6406–6432, 2010.
[39] V. B. Tadić and A. Doucet. Asymptotic bias of stochastic gradient search.
Annals of Applied Probability, 27(6):3255–3304, 2017.

Appendix A: Proofs
Define for all t ∈ N and θ ∈ Θ,

L_{t;θ} : X × 𝒳 ∋ (x, A) ↦ g_{t;θ}(x) Q_θ(x, A).

(Note that our definition of L_{t;θ} differs from that used by [34], in which the order of g_{t;θ} and Q_θ is swapped.) With this notation, by the filtering recursion (2.3)–(2.4),

π_{t+1;θ} = π_{t;θ} L_{t;θ} / ( π_{t;θ} L_{t;θ} 𝟙_X ),  (A.1)

with, as previously, φ_{0|−1;θ} := χ. This condensed form of the filtering recursion will be used in Section A.3.

In the coming analysis the following decomposition will be instrumental. For all t ∈ N,

η^N_{t;θ} f_t − η_{t;θ} f_t = N^{-1} Σ_{i=1}^N { ( τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t(ξ^i_t) − π_{t;θ} f_t ) }
− π_{t;θ}{ ( T_{t;θ} h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t − π_{t;θ} f_t ) }
− ( N^{-1} Σ_{i=1}^N τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} )( N^{-1} Σ_{i=1}^N f_t(ξ^i_t) − π_{t;θ} f_t ).  (A.2)

A.1. Proof of Theorem 2
We apply the decomposition (A.2). Note that

N^{-1} Σ_{i=1}^N { ( τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t(ξ^i_t) − π_{t;θ} f_t ) } − π_{t;θ}{ ( T_{t;θ} h_{t;θ} − π_{t;θ} T_{t;θ} h_{t;θ} )( f_t − π_{t;θ} f_t ) }
= N^{-1} Σ_{i=1}^N ω^i_t { τ^i_t F_{t;θ}(ξ^i_t) + F̃_{t;θ}(ξ^i_t) } − π_{t;θ}{ g_{t;θ}( T_{t;θ} h_{t;θ} F_{t;θ} + F̃_{t;θ} ) },  (A.3)

where

F_{t;θ}(x) := ( f_t(x) − π_{t;θ} f_t ) / g_{t;θ}(x),  F̃_{t;θ}(x) := −π_{t;θ} T_{t;θ} h_{t;θ} { f_t(x) − π_{t;θ} f_t } / g_{t;θ}(x),  x ∈ X.  (A.4)

Now, since F_{t;θ} g_{t;θ} and F̃_{t;θ} g_{t;θ} both belong to F(X), Proposition 7 provides constants c_t > 0 and c̃_t > 0 such that for all ε > 0,

P( | N^{-1} Σ_{i=1}^N ω^i_t { τ^i_t F_{t;θ}(ξ^i_t) + F̃_{t;θ}(ξ^i_t) } − π_{t;θ}{ g_{t;θ}( T_{t;θ} h_{t;θ} F_{t;θ} + F̃_{t;θ} ) } | ≥ ε ) ≤ c_t exp( −c̃_t N ε² ).  (A.5)

To deal with the second part of the decomposition (A.2), we use the same technique. First, by applying Proposition 7 with f ≡ 1/g_{t;θ} and f̃ ≡ 0, we obtain constants a_t > 0 and ã_t > 0 such that for all ε > 0,

P( | N^{-1} Σ_{i=1}^N τ^i_t − π_{t;θ} T_{t;θ} h_{t;θ} | ≥ ε ) ≤ a_t exp( −ã_t N ε² ).  (A.6)

Similarly, using Proposition 7 with f ≡ 0 and f̃ ≡ f_t/g_{t;θ} provides constants b_t > 0 and b̃_t > 0 such that for all ε > 0,

P( | N^{-1} Σ_{i=1}^N f_t(ξ^i_t) − π_{t;θ} f_t | ≥ ε ) ≤ b_t exp( −b̃_t N ε² ).  (A.7)

Combining (A.5), (A.6), and (A.7) yields, for all ε > 0,

P( | η^N_{t;θ} f_t − η_{t;θ} f_t | ≥ ε ) ≤ a_t exp( −ã_t N ε/2 ) + b_t exp( −b̃_t N ε/2 ) + c_t exp{ −c̃_t N (ε/2)² },

from which the statement of the theorem follows.

The following result is obtained by inspection of the proof of Olsson and Westerborn [34, Theorem 1(i)].

Proposition 7.
Let Assumption 1 hold. Then for all $t \in \mathbb{N}$, all $\theta \in \Theta$, all additive state functionals $h_t \in \mathsf{F}(\mathcal{X}^{\otimes(t+1)})$, all measurable functions $f_t$ and $\tilde{f}_t$ such that $f_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$ and $\tilde{f}_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$, and all $\tilde{N} \in \mathbb{N}$, there exist constants $c_t > 0$ and $\tilde{c}_t > 0$ (possibly depending on $\theta$, $h_t$, $f_t$, $\tilde{f}_t$, and $\tilde{N}$) such that for all $\varepsilon > 0$,
\[
\mathbb{P}\left( \left| \frac{1}{N} \sum_{i=1}^N \omega^i_t \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} - \pi_{t;\theta}\{ g_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) \} \right| \geq \varepsilon \right) \leq c_t \exp(-\tilde{c}_t N \varepsilon^2),
\]
where $\{ (\xi^i_t, \omega^i_t, \tau^i_t) \}_{i=1}^N$ are produced using the PaRIS algorithm.

A.2. Proof of Corollary 3
The $\mathbb{P}$-a.s. convergence of $\eta^N_{t;\theta} f_t$ to $\eta_{t;\theta} f_t$ is implied straightforwardly by the exponential concentration inequality in Theorem 2. Indeed, note that
\[
\mathbb{P}\left( \lim_{N \to \infty} \eta^N_{t;\theta} f_t \neq \eta_{t;\theta} f_t \right) = \lim_{k \to \infty} \lim_{n \to \infty} \mathbb{P}\left( \bigcup_{N \geq n} \left\{ \left| \eta^N_{t;\theta} f_t - \eta_{t;\theta} f_t \right| \geq \frac{1}{k} \right\} \right);
\]
now, by Theorem 2,
\[
\mathbb{P}\left( \bigcup_{N \geq n} \left\{ \left| \eta^N_{t;\theta} f_t - \eta_{t;\theta} f_t \right| \geq \frac{1}{k} \right\} \right) \leq c_t \exp(-\tilde{c}_t n k^{-2}) \sum_{N=0}^{\infty} \exp(-\tilde{c}_t N k^{-2}),
\]
where the right-hand side tends to zero as $n$ tends to infinity. This completes the proof.

A.3. Proof of Theorem 4
By combining (A.2) and (A.3),
\begin{align*}
\sqrt{N}\left( \eta^N_{t;\theta} f_t - \eta_{t;\theta} f_t \right)
&= \sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t F_{t;\theta}(\xi^i_t) + \tilde{F}_{t;\theta}(\xi^i_t) \} - \pi_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta}) \right) \\
&\quad - \sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N f_t(\xi^i_t) - \pi_{t;\theta} f_t \right)\left( \frac{1}{N} \sum_{i=1}^N \tau^i_t - \pi_{t;\theta} T_{t;\theta} h_{t;\theta} \right),
\end{align*}
where in this case
\[
F_{t;\theta}(x) := f_t(x) - \pi_{t;\theta} f_t, \qquad \tilde{F}_{t;\theta}(x) := -\pi_{t;\theta} T_{t;\theta} h_{t;\theta} \{ f_t(x) - \pi_{t;\theta} f_t \}, \qquad x \in \mathsf{X},
\]
differ from the functions defined in (A.4) by the factor $1/g_{t;\theta}$. By Proposition 8, since $F_{t;\theta} \in \mathsf{F}(\mathcal{X})$ and $\tilde{F}_{t;\theta} \in \mathsf{F}(\mathcal{X})$,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t F_{t;\theta}(\xi^i_t) + \tilde{F}_{t;\theta}(\xi^i_t) \} - \pi_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta}) \right) \overset{\mathcal{D}}{\longrightarrow} \sigma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_{t;\theta})\, Z,
\]
where $Z$ is standard normally distributed and $\sigma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_{t;\theta}) = \sigma_{t;\theta}(h_{t;\theta})$, with $\sigma_{t;\theta}(h_{t;\theta})$ being defined in (3.1). Now, Proposition 7 and Proposition 8 yield
\[
\frac{1}{N} \sum_{i=1}^N \tau^i_t \overset{\mathbb{P}}{\longrightarrow} \pi_{t;\theta} T_{t;\theta} h_{t;\theta}
\quad \text{and} \quad
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N f_t(\xi^i_t) - \pi_{t;\theta} f_t \right) \overset{\mathcal{D}}{\longrightarrow} \sigma_{t;\theta}\langle 0, f_t \rangle(h_{t;\theta})\, Z
\]
(with $0$ denoting the zero function), respectively, implying, by Slutsky's theorem,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N f_t(\xi^i_t) - \pi_{t;\theta} f_t \right)\left( \frac{1}{N} \sum_{i=1}^N \tau^i_t - \pi_{t;\theta} T_{t;\theta} h_{t;\theta} \right) \overset{\mathbb{P}}{\longrightarrow} 0.
\]
Finally, we complete the proof by noting that the term $\tilde{\sigma}_{t;\theta}(f_t)$ in (3.1) coincides with the asymptotic variance provided by Del Moral et al. [8, Theorem 3.2].

Proposition 8.
Let Assumption 1 hold. Then for all $t \in \mathbb{N}$, all $\theta \in \Theta$, all additive state functionals $h_t \in \mathsf{F}(\mathcal{X}^{\otimes(t+1)})$, all measurable functions $f_t \in \mathsf{F}(\mathcal{X})$ and $\tilde{f}_t \in \mathsf{F}(\mathcal{X})$, and all $\tilde{N} \in \mathbb{N}$, as $N \to \infty$,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} - \pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) \right) \overset{\mathcal{D}}{\longrightarrow} \sigma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t)\, Z,
\]
where $Z$ is a standard Gaussian random variable and
\[
\sigma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \tilde{\sigma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) + \sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-(s+1)} \varsigma_{s,\ell,t;\theta}\langle f_t \rangle(h_t), \tag{A.8}
\]
with
\[
\tilde{\sigma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \sum_{s=0}^{t-1} \frac{\pi_{s+1;\theta} \tilde{D}_{s+1,t;\theta}(h_t f_t + \tilde{f}_t)}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
\]
and
\[
\varsigma_{s,\ell,t;\theta}\langle f_t \rangle(h_t) := \frac{\pi_{\ell+1;\theta}\{ \overleftarrow{Q}_{\phi_{\ell;\theta}}(T_{\ell;\theta} h_\ell + \tilde{h}_\ell - T_{\ell+1;\theta} h_{\ell+1}) L_{\ell+1;\theta} \cdots L_{s;\theta}( L_{s+1;\theta} \cdots L_{t-1;\theta} f_t ) \}}{(\pi_{\ell+1;\theta} L_{\ell+1;\theta} \cdots L_{s;\theta} \mathbb{1}_{\mathsf{X}})(\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}})}.
\]

Proof of Proposition 8. Assume first that $\pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) = 0$. Then, by Lemma 9 and Slutsky's theorem, as $\Omega_t / N \overset{\mathbb{P}}{\longrightarrow} \pi_{t;\theta} g_{t;\theta}$ by Proposition 7,
\[
\sqrt{N}\left( \frac{1}{N} \sum_{i=1}^N \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} \right)
= \sqrt{N}\left( \sum_{i=1}^N \frac{\omega^i_t}{\Omega_t} \{ \tau^i_t F_{t;\theta}(\xi^i_t) + \tilde{F}_{t;\theta}(\xi^i_t) \} \right) \frac{\Omega_t}{N}
\overset{\mathcal{D}}{\longrightarrow} \pi_{t;\theta} g_{t;\theta} \times \Gamma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_t)\, Z,
\]
where again $Z$ has standard Gaussian distribution, $\Gamma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_t)$ is given in Lemma 9, and we have set $F_{t;\theta} := f_t / g_{t;\theta}$ and $\tilde{F}_{t;\theta} := \tilde{f}_t / g_{t;\theta}$ and used, first, that $g_{t;\theta} F_{t;\theta} = f_t \in \mathsf{F}(\mathcal{X})$ and $g_{t;\theta} \tilde{F}_{t;\theta} = \tilde{f}_t \in \mathsf{F}(\mathcal{X})$ and, second, that
\[
\phi_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta})
= \frac{\pi_{t;\theta}\{ g_{t;\theta}(T_{t;\theta} h_{t;\theta} F_{t;\theta} + \tilde{F}_{t;\theta}) \}}{\pi_{t;\theta} g_{t;\theta}}
= \frac{\pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t)}{\pi_{t;\theta} g_{t;\theta}} = 0.
\]
Now, by iterating (A.1) we conclude that for all $(s, t) \in \mathbb{N}^2$ such that $s < t$,
\[
\pi_{t;\theta}
= \frac{\pi_{t-1;\theta} L_{t-1;\theta}}{\pi_{t-1;\theta} L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
= \frac{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta} / (\pi_{t-2;\theta} L_{t-2;\theta} \mathbb{1}_{\mathsf{X}})}{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta} \mathbb{1}_{\mathsf{X}} / (\pi_{t-2;\theta} L_{t-2;\theta} \mathbb{1}_{\mathsf{X}})}
= \frac{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta}}{\pi_{t-2;\theta} L_{t-2;\theta} L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
= \cdots
= \frac{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta}}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}
\]
and, consequently,
\[
\pi_{t;\theta} g_{t;\theta} = \frac{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta}}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}}. \tag{A.9}
\]
Finally, by (A.9) it holds that
\[
\pi_{t;\theta} g_{t;\theta} \times \Gamma_{t;\theta}\langle F_{t;\theta}, \tilde{F}_{t;\theta} \rangle(h_t) = \sigma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t),
\]
where $\Gamma_{t;\theta}\langle \cdot, \cdot \rangle(h_t)$ is defined in (A.10) and $\sigma_{t;\theta}\langle \cdot, \cdot \rangle(h_t)$ is defined in (A.8). In the general case, the previous holds true when $\tilde{f}_t$ is replaced by $\tilde{f}_t - \pi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t)$, which completes the proof.

The following lemma is obtained by inspection of the proof of Olsson and Westerborn [34, Theorem 3].

Lemma 9.
Let Assumption 1 hold. Then for all $t \in \mathbb{N}$, all $\theta \in \Theta$, all additive state functionals $h_t \in \mathsf{F}(\mathcal{X}^{\otimes(t+1)})$, all measurable functions $f_t$ and $\tilde{f}_t$ such that $f_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$ and $\tilde{f}_t g_{t;\theta} \in \mathsf{F}(\mathcal{X})$, and all $\tilde{N} \in \mathbb{N}$, as $N \to \infty$,
\[
\sqrt{N}\left( \sum_{i=1}^N \frac{\omega^i_t}{\Omega_t} \{ \tau^i_t f_t(\xi^i_t) + \tilde{f}_t(\xi^i_t) \} - \phi_{t;\theta}(T_{t;\theta} h_t f_t + \tilde{f}_t) \right) \overset{\mathcal{D}}{\longrightarrow} \Gamma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t)\, Z,
\]
where $Z$ is a standard normally distributed random variable and
\[
\Gamma_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \tilde{\Gamma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) + \sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-(s+1)} \gamma_{s,\ell,t;\theta}\langle f_t \rangle(h_t), \tag{A.10}
\]
with
\[
\tilde{\Gamma}_{t;\theta}\langle f_t, \tilde{f}_t \rangle(h_t) := \sum_{s=0}^{t-1} \frac{\pi_{s+1;\theta} \tilde{D}_{s+1,t;\theta}\{ g_{t;\theta}(h_t f_t + \tilde{f}_t) \}}{\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta}}
\]
and
\[
\gamma_{s,\ell,t;\theta}\langle f_t \rangle(h_t) := \frac{\pi_{\ell+1;\theta}\{ \overleftarrow{Q}_{\phi_{\ell;\theta}}(T_{\ell;\theta} h_\ell + \tilde{h}_\ell - T_{\ell+1;\theta} h_{\ell+1}) L_{\ell+1;\theta} \cdots L_{s;\theta}( L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta} f_t ) \}}{(\pi_{\ell+1;\theta} L_{\ell+1;\theta} \cdots L_{s;\theta} \mathbb{1}_{\mathsf{X}})(\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} g_{t;\theta})}.
\]

A.4. Proof of Theorem 6
As noted above, the first term of the asymptotic variance $\sigma_{t;\theta}(f_t)$ coincides with the asymptotic variance $\tilde{\sigma}_{t;\theta}(f_t)$ obtained by Del Moral et al. [8, Theorem 3.2]. The same work provides a constant $c \in \mathbb{R}_+$ such that $\tilde{\sigma}_{t;\theta}(f_t) \leq c \|f_t\|_\infty$, and we may hence focus on bounding the second term of the asymptotic variance.

For this purpose, note that for all $s \leq t - 2$ and $x_{s+1} \in \mathsf{X}$,
\[
L_{s+1;\theta} \cdots L_{t-1;\theta}(f_t - \pi_{t;\theta} f_t)(x_{s+1})
= g_{s+1;\theta}(x_{s+1})\, Q_\theta L_{s+2;\theta} \cdots L_{t-2;\theta}\{ g_{t-1;\theta}(Q_\theta f_t - \phi_{t-1;\theta} Q_\theta f_t) \}(x_{s+1}). \tag{A.11}
\]
By applying the forgetting of the filter, or, more particularly, Douc et al. [12, Lemma 10], to (A.11) we obtain
\[
\| L_{s+1;\theta} \cdots L_{t-1;\theta}(f_t - \pi_{t;\theta} f_t) \|_\infty \leq \bar{\delta}\, \| Q_\theta L_{s+2;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}} \|_\infty \|f_t\|_\infty\, \varrho^{t-s-1}.
\]
Note that in the previous bound, the exponential contraction follows from the fact that the objective function $f_t$ is centred around its predicted mean. The latter is a consequence of the fact that the tangent filter is, being a covariance, centred itself (recall the identities (2.9) and (2.10) and the decomposition (A.2)). In addition, from the proof of Olsson and Westerborn [34, Theorem 8] we extract, using Assumption 3,
\[
\| T_{\ell;\theta} h_{\ell;\theta} + \tilde{h}_{\ell;\theta} - T_{\ell+1;\theta} h_{\ell+1;\theta} \|_\infty \leq |\tilde{h}|_\infty \left( \frac{\bar{\delta} \varrho}{1 - \varrho} + 1 \right),
\]
and under Assumption 2, for all $x \in \mathsf{X}$,
\[
\rho\, \underline{\delta}\, \mu L_{s+2;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}
\leq L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}(x)
\leq \rho\, \bar{\delta}\, \mu L_{s+2;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}}.
\]
Combining the previous bounds gives
\[
\varsigma_{s,\ell,t;\theta}(f_t)
= \frac{\pi_{\ell+1;\theta}\{ \overleftarrow{Q}_{\phi_{\ell;\theta}}(T_{\ell;\theta} h_{\ell;\theta} + \tilde{h}_{\ell;\theta} - T_{\ell+1;\theta} h_{\ell+1;\theta}) L_{\ell+1;\theta} \cdots L_{s;\theta}( L_{s+1;\theta} \cdots L_{t-1;\theta} \{ f_t - \pi_{t;\theta} f_t \} ) \}}{(\pi_{\ell+1;\theta} L_{\ell+1;\theta} \cdots L_{s;\theta} \mathbb{1}_{\mathsf{X}})(\pi_{s+1;\theta} L_{s+1;\theta} \cdots L_{t-1;\theta} \mathbb{1}_{\mathsf{X}})}
\leq \tilde{c}\, \|f_t\|_\infty\, \varrho^{t-s-1},
\]
where
\[
\tilde{c} := 4 |\tilde{h}|_\infty \left( \frac{\bar{\delta} \varrho}{1 - \varrho} + 1 \right) \frac{\bar{\delta}}{\underline{\delta}(1 - \varrho)}.
\]
Finally, summing up yields
\begin{align*}
\sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-(s+1)} \varsigma_{s,\ell,t;\theta}(f_t)
&\leq \tilde{c}\, \|f_t\|_\infty \sum_{s=0}^{t-1} \sum_{\ell=0}^{s} \tilde{N}^{\ell-s-1} \varrho^{t-s-1} \\
&= \tilde{c}\, \|f_t\|_\infty (\tilde{N}-1)^{-1} \sum_{s=0}^{t-1} \varrho^{t-s-1}(1 - \tilde{N}^{-s-1}) \\
&\leq \tilde{c}\, \|f_t\|_\infty (\tilde{N}-1)^{-1} \sum_{s=0}^{t-1} \varrho^{t-s-1}
\leq \frac{\tilde{c}\, \|f_t\|_\infty}{(\tilde{N}-1)(1-\varrho)},
\end{align*}
which completes the proof.

Appendix B: Kernels
Given two measurable spaces $(\mathsf{X}, \mathcal{X})$ and $(\mathsf{Y}, \mathcal{Y})$, an unnormalised transition kernel $K$ between these spaces induces two integral operators, one acting on functions and the other on measures. Specifically, we define the function
\[
K h : \mathsf{X} \ni x \mapsto \int h(x, y)\, K(x, \mathrm{d}y) \qquad (h \in \mathsf{F}(\mathcal{X} \otimes \mathcal{Y}))
\]
and the measure
\[
\nu K : \mathcal{Y} \ni A \mapsto \int K(x, A)\, \nu(\mathrm{d}x) \qquad (\nu \in \mathsf{M}(\mathcal{X}))
\]
whenever these quantities are well-defined. Moreover, let $L$ be another unnormalised transition kernel from $(\mathsf{Y}, \mathcal{Y})$ to the measurable space $(\mathsf{Z}, \mathcal{Z})$; then two different products of $K$ and $L$ can be defined, namely
\[
K L : \mathsf{X} \times \mathcal{Z} \ni (x, A) \mapsto \int K(x, \mathrm{d}y)\, L(y, A)
\]
and
\[
K \otimes L : \mathsf{X} \times (\mathcal{Y} \otimes \mathcal{Z}) \ni (x, A) \mapsto \iint \mathbb{1}_A(y, z)\, K(x, \mathrm{d}y)\, L(y, \mathrm{d}z)
\]
whenever these are well-defined. These products form new transition kernels from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Z}, \mathcal{Z})$ and from $(\mathsf{X}, \mathcal{X})$ to $(\mathsf{Y} \times \mathsf{Z}, \mathcal{Y} \otimes \mathcal{Z})$, respectively. Also the $\otimes$-product of a kernel $K$ and a measure $\nu \in \mathsf{M}(\mathcal{X})$ is defined as the new measure
\[
\nu \otimes K : \mathcal{X} \otimes \mathcal{Y} \ni A \mapsto \iint \mathbb{1}_A(x, y)\, K(x, \mathrm{d}y)\, \nu(\mathrm{d}x).
\]
Finally, for any kernel $K$ and any bounded measurable function $h$ we write $(K h)^2 := (K h)(K h)$ and $K h^2 := K(h^2)$.
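On finite spaces the kernel operations above reduce to elementary matrix algebra, which gives a convenient way to sanity-check the definitions. The following Python/NumPy sketch (not part of the paper; the matrices $K$, $L$, the measure $\nu$, and the functions $h$, $f$ are arbitrary made-up examples) represents a kernel from $\mathsf{X}$ to $\mathsf{Y}$ as a matrix with entries $K(x, \{y\})$, and checks that marginalising the $\otimes$-products recovers the plain products and that $K h^2 = K(h^2)$ differs in general from $(K h)^2$:

```python
import numpy as np

# Finite spaces: X = {0, 1}, Y = {0, 1, 2}, Z = {0, 1}.
# A kernel K from X to Y is a nonnegative matrix K[x, y] = K(x, {y}).
K = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5]])          # kernel X -> Y
L = np.array([[0.6, 0.4],
              [0.5, 0.5],
              [0.1, 0.9]])               # kernel Y -> Z

# Action on functions: (K h)(x) = sum_y h(x, y) K(x, {y}),  h in F(X (x) Y).
h = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])
Kh = (h * K).sum(axis=1)

# Action on measures: (nu K)(A) = sum_x K(x, A) nu({x}),  nu in M(X).
nu = np.array([0.3, 0.7])
nuK = nu @ K

# Product kernel KL from X to Z: (KL)(x, A) = sum_y K(x, {y}) L(y, A).
KL = K @ L

# Tensor product K (x) L from X to Y x Z: K(x, {y}) L(y, {z}).
KtensorL = K[:, :, None] * L[None, :, :]

# Tensor product of a measure and a kernel: (nu (x) K)({(x, y)}) = nu({x}) K(x, {y}).
nutensorK = nu[:, None] * K

# Marginalising the tensor products recovers the plain products.
assert np.allclose(KtensorL.sum(axis=1), KL)    # summing out y gives KL
assert np.allclose(nutensorK.sum(axis=0), nuK)  # summing out x gives nu K

# The two squaring conventions differ in general: K h^2 = K(h^2) vs. (K h)^2.
f = np.array([1.0, -1.0, 2.0])                  # bounded function on Y
Kf2 = K @ (f ** 2)                              # K(h^2)
Kf_sq = (K @ f) ** 2                            # (K h)^2
assert not np.allclose(Kf2, Kf_sq)
```

Since the example kernels happen to be Markov (rows summing to one), the product $KL$ is Markov as well, mirroring the fact that products of normalised transition kernels remain normalised.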