Dimension-free Wasserstein contraction of nonlinear filters
Nick Whiteley
School of Mathematics, University of Bristol
September 5, 2018
Abstract
For a class of partially observed diffusions, sufficient conditions are given for the map from the initial condition of the signal to the filtering distribution to be contractive with respect to Wasserstein distances, with rate which has no dependence on the dimension of the state-space and is stable under tensor products of the model. The main assumptions are that the signal has affine drift and constant diffusion coefficient, and that the likelihood functions are log-concave. Contraction estimates are obtained from an $h$-process representation of the transition probabilities of the signal reweighted so as to condition on the observations.

1 Introduction

Let $(\theta_t)_{t\in\mathbb{R}_+}$, called the signal process, be the solution of the stochastic differential equation:
$$\mathrm{d}\theta_t = (\alpha + \beta\theta_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t, \qquad (1.1)$$
where $\alpha \in \mathbb{R}^p$, $\beta$ is a $p \times p$ matrix of reals, $\sigma \geq 0$ is a scalar, and $(B_t)_{t\in\mathbb{R}_+}$ is $p$-dimensional Brownian motion. Let observations $(Y_k)_{k\in\mathbb{N}}$ each be valued in a measurable space $(\mathsf{Y}, \mathcal{Y})$, conditionally independent given $(\theta_t)_{t\in\mathbb{R}_+}$ and such that the conditional probability that $Y_k$ lies in $A \in \mathcal{Y}$ given $(\theta_t)_{t\in\mathbb{R}_+}$ is of the form $\int_A g_k(\theta_{k\Delta}, y)\,\chi(\mathrm{d}y)$, for a measure $\chi$ on $\mathsf{Y}$, a function $g_k : \mathbb{R}^p \times \mathsf{Y} \to (0,\infty)$ and a constant $\Delta > 0$.

The filtering distributions $(\pi_k^\mu)_{k\in\mathbb{N}}$ associated with a fixed sequence $(y_k)_{k\in\mathbb{N}}$ and a probability measure $\mu$ on the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^p)$ are defined by
$$\pi_k^\mu(A) := \frac{\mathbb{E}_\mu\left[\mathbb{1}_A(\theta_{k\Delta}) \prod_{j=0}^k g_j(\theta_{j\Delta}, y_j)\right]}{\mathbb{E}_\mu\left[\prod_{j=0}^k g_j(\theta_{j\Delta}, y_j)\right]}, \qquad A \in \mathcal{B}(\mathbb{R}^p), \qquad (1.2)$$
where $\mathbb{E}_\mu$ denotes expectation with respect to the law of the solution of (1.1) with $\theta_0 \sim \mu$. When $(y_0, \ldots, y_k)$ are replaced in (1.2) by the random variables $(Y_0, \ldots, Y_k)$ distributed according to the above prescription and with true initialization also $\theta_0 \sim \mu$, then $\pi_k^\mu$ is a version of the conditional distribution of $\theta_{k\Delta}$ given $(Y_0, \ldots, Y_k)$.
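To make (1.2) concrete, here is a minimal self-normalised Monte Carlo sketch of a filtering distribution for an illustrative 1-D instance of (1.1) with a Gaussian likelihood; the exact one-step transition used below is the discrete-time form (1.3) given later. All parameter values, the choice of $g$, and the observation sequence are assumptions for illustration only, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D instance of (1.1): d(theta_t) = (alpha + beta*theta_t)dt + sigma dB_t
alpha, beta, sigma, Delta = 0.0, -1.0, 1.0, 0.5
B = np.exp(beta * Delta)                       # B = e^{Delta*beta}
a = alpha * (B - 1.0) / beta                   # a = e^{Delta*beta} int_0^Delta e^{-t*beta} alpha dt
noise_sd = sigma * np.sqrt((np.exp(2 * beta * Delta) - 1.0) / (2.0 * beta))

def g(theta, y):
    # illustrative log-concave likelihood: Gaussian observation of the signal
    return np.exp(-0.5 * (y - theta) ** 2)

N = 100_000
theta = rng.normal(0.0, 1.0, size=N)           # theta_0 ~ mu = N(0, 1)
logw = np.log(g(theta, 0.3))                   # weight by g_0(theta_0, y_0)
for y in [0.1, -0.2, 0.4]:                     # y_1, y_2, y_3
    theta = a + B * theta + noise_sd * rng.normal(size=N)   # exact transition
    logw += np.log(g(theta, y))
w = np.exp(logw - logw.max())
w /= w.sum()                                   # self-normalisation, as in (1.2)
post_mean = float(np.sum(w * theta))           # estimate of the mean of pi_3^mu
```

The weighted sample approximates $\pi_3^\mu$; the division by the sum of the weights plays the role of the normalisation by the denominator in (1.2).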
It shall be assumed throughout that, whichever $(y_k)_{k\in\mathbb{N}}$ and $\mu$ we consider, the denominator in (1.2) is finite for each $k$, so that $(\pi_k^\mu)_{k\in\mathbb{N}}$ are well defined as probability measures. When $\mu$ is $\delta_\theta$, the Dirac mass located at $\theta$, we shall write $\mathbb{E}_\theta$ and $\pi_k^\theta$ instead of $\mathbb{E}_{\delta_\theta}$ and $\pi_k^{\delta_\theta}$.

Our overall aim is to obtain bounds on Wasserstein distance between differently initialized filtering distributions, say $\pi_k^\mu, \pi_k^\nu$, in terms of distance between $\mu$ and $\nu$, and to find conditions under which the former distance decays as $k \to \infty$ at a rate which does not depend on the dimension of the state-space $\mathbb{R}^p$. The question of under what conditions the filtering distributions forget their initial condition has been approached using a variety of techniques; see [2, Chap. 4] for an overview. The topic of dependence on dimension has received attention only quite recently, motivated by the increasing importance of inference problems involving high-dimensional stochastic processes. Recent contributions such as [6, 3, 4] study the rate of forgetting in total variation distance and $V$-norm, and the rate estimates obtained there depend on the constants associated with minorization-type conditions for the signal process. However such constants, and therefore the rate estimates based upon them, typically degrade with the dimension of the state-space. Infinite-dimensional filtering is treated in [12], where stability results are obtained involving weak convergence and the notion of local ergodicity, which pertains to the mixing properties of finite-dimensional components of the infinite-dimensional signal process, conditional on the observations. The results hold under mild conditions and do not quantify the rate of convergence.
For signals with certain spatio-temporal mixing properties, [10] provides local, quantitative filter stability results which do not degrade with dimension, as part of their particle filter analysis.

The approach taken here does not rely on spatial structure of the model, but is instead connected with contraction properties of gradient flows and convexity, and influenced by analyses of Markov processes using abstract ideas of curvature and underlying links to functional inequalities [1]. The proofs ultimately rely on a quite simple coupling technique and the pathwise stability properties of diffusions whose drifts involve the gradients of certain convex potentials. This convexity arises from a combination of two features of the model we consider: firstly log-concavity of the likelihood functions $\theta \mapsto g_k(\theta, y_k)$, which will be one of our main assumptions (stated precisely below), and secondly a log-concavity-preservation characteristic of the signal model (1.1).

Log-concave likelihoods appear, for example, in statistical regression models built around the exponential family of distributions, in particular in Generalized Linear Models [8], which are used to solve high-dimensional data analysis problems in disciplines such as neuroscience, genomics and internet traffic prediction. In this setting $y_k = (y_k^1, \ldots, y_k^n) \in \mathbb{R}^n =: \mathsf{Y}$, and with known covariates $x_k = (x_k^{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, $g_k(\theta, y_k)$ is of the form:
$$g_k(\theta, y_k) = \exp\left(\sum_{i=1}^n \left[ y_k^i \sum_{j=1}^p x_k^{ij} \theta^j - \psi\left(\sum_{j=1}^p x_k^{ij} \theta^j\right) + \log \phi(y_k^i) \right]\right),$$
where $\theta = (\theta^1, \ldots, \theta^p)$, $\phi$ is a given function, and $\psi$ is convex, the latter implying $\theta \mapsto g_k(\theta, y_k)$ is log-concave. The situation in which the regression parameter $\theta$ is treated as time-varying is known as a Dynamic Generalized Linear Model [5].
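Since $\psi$ is convex, the Hessian of $\theta \mapsto \log g_k(\theta, y_k)$ is $-\sum_{i=1}^n \psi''(x_k^i \cdot \theta)\, x_k^i (x_k^i)^T \preceq 0$, which is exactly the log-concavity claim. The following sketch checks this numerically in the Poisson case $\psi(\eta) = e^\eta$; the covariates and data are randomly generated, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))            # covariate rows x_k^i
y = rng.poisson(1.0, size=n).astype(float)

psi, d2psi = np.exp, np.exp            # Poisson case: psi(eta) = e^eta, convex

def log_g(theta):
    # log g_k(theta, y_k), up to the constant sum_i log phi(y_k^i)
    eta = X @ theta
    return float(y @ eta - psi(eta).sum())

def hessian_log_g(theta):
    eta = X @ theta
    return -(X * d2psi(eta)[:, None]).T @ X   # -sum_i psi''(eta_i) x_i x_i^T

for _ in range(5):
    H = hessian_log_g(rng.normal(size=p))
    assert np.linalg.eigvalsh(H).max() <= 1e-8   # negative semidefinite
```

The same check applies verbatim to any other convex $\psi$, for instance $\psi(\eta) = \log(1 + e^\eta)$ in the dynamic logistic regression example of section 3.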
Linear-Gaussian vector auto-regressions for $(\theta_{k\Delta})_{k\in\mathbb{N}}$ are a popular choice in practice, and indeed the solution of (1.1) satisfies
$$\theta_{(k+1)\Delta} = a + B\theta_{k\Delta} + \sigma \xi_{k+1}, \qquad (1.3)$$
where $\xi_{k+1} = e^{\Delta\beta} \int_{k\Delta}^{(k+1)\Delta} e^{-(t-k\Delta)\beta}\,\mathrm{d}B_t$ is a Gaussian random variable and $a = e^{\Delta\beta}\int_0^\Delta e^{-t\beta}\alpha\,\mathrm{d}t$, $B = e^{\Delta\beta}$.

The signal model (1.1) also has an important analytical property: it is known that the semigroup of transition operators $(P_t)_{t\in\mathbb{R}_+}$ associated with (1.1) preserves log-concavity, meaning that for any log-concave function $f$ and $t > 0$, $P_t f$ is log-concave; see for example [7]. Combined with log-concavity of $\theta \mapsto g_k(\theta, y_k)$, the Markov property of $(\theta_t)_{t\in\mathbb{R}_+}$ and the fact that a pointwise product of log-concave functions is log-concave, this implies that $\theta \mapsto \varphi_{j,k}(\theta) := \mathbb{E}_\theta\left[\prod_{i=j}^k g_i(\theta_{(i-j)\Delta}, y_i)\right]$ is log-concave. Functions of the latter form play an important role in filter stability because they provide the re-weighting of transition probabilities which corresponds to conditioning on observations, and this is where the convex potentials alluded to earlier arise.

It is important to note that log-concavity of $\varphi_{j,k}$ cannot be expected in much greater generality. It was established in [7] that among all diffusions of the form $\mathrm{d}\theta_t = b(\theta_t)\,\mathrm{d}t + \sigma(\theta_t)\,\mathrm{d}B_t$, with $b(\cdot), \sigma(\cdot)$ satisfying some mild regularity conditions, it is only in the case that $b(\cdot)$ is affine and $\sigma(\cdot)$ is a constant that $P_t f$ is log-concave for all log-concave $f$. This motivates our focus on signal processes of the form (1.1).

Notation and conventions

A function $f : \mathbb{R}^p \to (0,\infty)$ is called log-concave if
$$\log f(cu + (1-c)v) \geq c \log f(u) + (1-c) \log f(v), \qquad \forall u, v \in \mathbb{R}^p,\ c \in [0,1],$$
and strongly log-concave if there exists a log-concave function $\tilde{f}$ and a constant $\lambda_f \in (0,\infty)$ such that $f(u) = \exp\left(-\frac{\lambda_f}{2} u^T u\right)\tilde{f}(u)$.
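The constants in (1.3) can be verified numerically. In the sketch below, $B = e^{\Delta\beta}$ is computed by eigendecomposition (an illustrative symmetric, invertible $\beta$ is assumed so that it diagonalises with real eigenvalues), and the quadrature value of $a = e^{\Delta\beta}\int_0^\Delta e^{-t\beta}\alpha\,\mathrm{d}t$ is checked against the closed form $\beta^{-1}(e^{\Delta\beta} - I)\alpha$, valid for invertible $\beta$:

```python
import numpy as np

beta = np.array([[-1.0, 0.3], [0.3, -2.0]])   # illustrative symmetric drift matrix
alpha = np.array([0.5, -1.0])
Delta = 0.7

evals, V = np.linalg.eigh(beta)

def expm_beta(t):
    # e^{beta t} via the eigendecomposition beta = V diag(evals) V^T
    return (V * np.exp(evals * t)) @ V.T

B = expm_beta(Delta)                           # B = e^{Delta beta} in (1.3)
a_closed = np.linalg.solve(beta, (B - np.eye(2)) @ alpha)

# a = e^{Delta beta} int_0^Delta e^{-t beta} alpha dt, by trapezoidal quadrature
ts = np.linspace(0.0, Delta, 20001)
vals = np.stack([expm_beta(-t) @ alpha for t in ts])
dt = ts[1] - ts[0]
a_quad = B @ ((vals.sum(axis=0) - 0.5 * (vals[0] + vals[-1])) * dt)

assert np.allclose(a_quad, a_closed, atol=1e-7)
```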
For a measure $\mu$, function $f$ and integral kernel $K$, we shall write $\mu f = \int f(u)\,\mu(\mathrm{d}u)$, $\mu K(\cdot) = \int \mu(\mathrm{d}u) K(u, \cdot)$, $Kf(u) = \int f(v) K(u, \mathrm{d}v)$. For a nonnegative function $f$, $\mu \cdot f$ denotes the measure $\mu(\mathrm{d}u) f(u)$. The gradient and Laplace operators with respect to $\theta$ are denoted $\nabla_\theta$ and $\nabla_\theta^2$. The indicator function on a set $A$ is denoted $\mathbb{1}_A$. The class of real-valued, twice continuously differentiable functions on $\mathbb{R}^p$ is denoted $C^2$.

The order-$q$ Wasserstein distance between probability measures on $\mathcal{B}(\mathbb{R}^p)$ is:
$$W_q(\mu,\nu) := \left(\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^p \times \mathbb{R}^p} \|u - v\|^q\,\gamma(\mathrm{d}u, \mathrm{d}v)\right)^{1/q},$$
where $\Gamma(\mu,\nu)$ is the set of all couplings of $\mu$ and $\nu$, and $\|\cdot\|$ is the Euclidean norm.

Assumption 1.
For each $k \in \mathbb{N}$, $\theta \mapsto g_k(\theta, y_k)$ is strictly positive, a member of $C^2$, and there exists a constant $\lambda_g(k) \in [0,\infty)$ and a log-concave function $\tilde{g}_k : \mathbb{R}^p \to (0,\infty)$ such that
$$g_k(\theta, y_k) = \exp\left[-\frac{\lambda_g(k)}{2}\theta^T\theta\right]\tilde{g}_k(\theta).$$

Theorem 1.
If Assumption 1 holds, then for any $q \geq 1$,
$$W_q(\pi_k^\theta, \pi_k^\vartheta) \leq \exp\left(-\sum_{j=1}^k \int_0^\Delta \lambda(j,t)\,\mathrm{d}t\right)\|\theta - \vartheta\|, \qquad \forall k \geq 1,\ \theta, \vartheta \in \mathbb{R}^p, \qquad (2.1)$$
where
$$\lambda(j,t) := \lambda_{\mathrm{sig}} + \frac{\sigma^2 \lambda_g(j)\, \lambda_\beta^{\min}(\Delta - t)}{1 + \sigma^2 \lambda_g(j) \int_t^\Delta \lambda_\beta^{\max}(\Delta - s)\,\mathrm{d}s},$$
$\lambda_{\mathrm{sig}} \in \mathbb{R}$ is the smallest eigenvalue of $-(\beta + \beta^T)/2$, and $\lambda_\beta^{\min}(t), \lambda_\beta^{\max}(t) \in (0,\infty)$ are respectively the smallest and largest eigenvalues of $e^{\beta t}(e^{\beta t})^T$.

If Assumption 1 is satisfied with $\lambda_g(k) = 0$ for all $k$, so that $\theta \mapsto g_k(\theta, y_k)$ is log-concave but not necessarily strongly log-concave, then (2.1) becomes:
$$W_q(\pi_k^\theta, \pi_k^\vartheta) \leq \exp(-k\Delta\lambda_{\mathrm{sig}})\|\theta - \vartheta\|. \qquad (2.2)$$
Note that the right hand side of this bound has no dependence on the observations $(y_k)_{k\in\mathbb{N}}$. Since $\lambda_g(k) = 0$ allows $\theta \mapsto g_k(\theta, y_k)$ to be a constant, in which case $\pi_k^\theta(\cdot) = P_{k\Delta}(\theta, \cdot)$, Theorem 1 implies that $\lambda_{\mathrm{sig}}$, if it is positive, is the exponential rate of Wasserstein contraction of $(P_t)_{t\in\mathbb{R}_+}$. In summary, assuming $\theta \mapsto g_k(\theta, y_k)$ is log-concave and $\lambda_{\mathrm{sig}} > 0$, the exponential rate of Wasserstein contraction of the filters $(\pi_k^\theta)_{k\in\mathbb{N}}$ is positive and at least that of the $(P_{k\Delta})_{k\in\mathbb{N}}$.

As soon as $\sigma > 0$, the observations can help achieve contraction of the filters without contraction of $(P_t)_{t\in\mathbb{R}_+}$. For example, with $\beta = -\lambda_{\mathrm{sig}} I$ for any $\lambda_{\mathrm{sig}} \in \mathbb{R}$, we have $\lambda_\beta^{\min}(t) = \lambda_\beta^{\max}(t) = e^{-2\lambda_{\mathrm{sig}} t}$, and it is straightforward to check that (2.1) becomes:
$$W_q(\pi_k^\theta, \pi_k^\vartheta) \leq \prod_{j=1}^k \frac{\exp(-\lambda_{\mathrm{sig}}\Delta)}{1 + \sigma^2\lambda_g(j)\int_0^\Delta e^{-2\lambda_{\mathrm{sig}} t}\,\mathrm{d}t}\|\theta - \vartheta\|,$$
so that if $\lambda_{\mathrm{sig}} \leq 0$ and $\Delta$ are fixed, contraction can be achieved if the products $\sigma^2\lambda_g(j)$, $j \in \mathbb{N}$, are sufficiently large. A notable case is when $\lambda_{\mathrm{sig}} = 0$:
$$W_q(\pi_k^\theta, \pi_k^\vartheta) \leq \prod_{j=1}^k \frac{1}{1 + \sigma^2\lambda_g(j)\Delta}\|\theta - \vartheta\|.$$

The quantities $(\lambda_g(k))_{k\in\mathbb{N}}$, $\lambda_{\mathrm{sig}}$, $\lambda_\beta^{\min}(t)$, $\lambda_\beta^{\max}(t)$ and $\sigma$ appearing in (2.1) do not necessarily have any dependence on the dimension of the state space, $\mathbb{R}^p$, and are stable under tensor products of the model described in section 1, in the sense that:

1)
$$g_k(\theta, y_{k,i}) = \exp\left[-\frac{\lambda_g(k)}{2}\theta^T\theta\right]\tilde{g}_{k,i}(\theta),\ i = 1,2 \implies g_k(\theta, y_{k,1})\, g_k(\vartheta, y_{k,2}) = \exp\left[-\frac{\lambda_g(k)}{2}\left(\theta^T\theta + \vartheta^T\vartheta\right)\right]\tilde{g}_{k,1}(\theta)\,\tilde{g}_{k,2}(\vartheta),$$
2) $\mathrm{spectrum}\{(\beta + \beta^T)/2\} = \mathrm{spectrum}\{(\beta_\otimes + \beta_\otimes^T)/2\}$,
3) $\mathrm{spectrum}\{e^{\beta t}(e^{\beta t})^T\} = \mathrm{spectrum}\{e^{\beta_\otimes t}(e^{\beta_\otimes t})^T\}$,

where $\beta_\otimes$ denotes the Kronecker product $I_2 \otimes \beta$, with $I_2$ the $2 \times 2$ identity matrix. This amounts to saying that if one expands the model to state-space $\mathbb{R}^{2p}$ by defining the signal to be two independent copies of (1.1), with independent observations $y_k = (y_{k,1}, y_{k,2}) \in \mathsf{Y}^2$ whose likelihood functions have common log-concavity parameter $\lambda_g(k)$, then there is no degradation of $\lambda(j,t)$ in (2.1).

Under the assumptions of Theorem 1 the bound in (2.1) cannot be improved in general: for example if $\sigma = 0$ and $\beta = -\lambda_{\mathrm{sig}} I$, then $W_q(\pi_k^\theta, \pi_k^\vartheta) = \exp(-k\Delta\lambda_{\mathrm{sig}})\|\theta - \vartheta\|$. The case of initial distributions which are not necessarily Dirac measures is addressed in section 3.

Our starting point is the well-known fact that the filtering distributions can be written in terms of the transition probabilities of the signal process re-weighted so as to condition on observations. Fix $k \geq 0$ and define:
$$\varphi_{k,k}(\theta) := g_k(\theta, y_k), \qquad (2.3)$$
$$\varphi_{j,k}(\theta) := g_j(\theta, y_j)\, P_\Delta\varphi_{j+1,k}(\theta), \qquad 0 \leq j < k, \qquad (2.4)$$
$$R_{j,k}(\theta, A) := \frac{\int_A P_\Delta(\theta, \mathrm{d}\vartheta)\,\varphi_{j,k}(\vartheta)}{P_\Delta\varphi_{j,k}(\theta)}, \qquad 0 \leq j \leq k.$$
We will need the following preliminary lemma.

Lemma 1.
For every log-concave $f$ and $t > 0$, $P_t f$ is log-concave.

Proof. See [7, proof of Prop. 1.3].

Lemma 2.
We have
$$\pi_k^\theta(A) = R_{1,k} R_{2,k} \cdots R_{k,k}(\theta, A). \qquad (2.5)$$
If Assumption 1 holds, then for each $j, k$ such that $0 \leq j \leq k$, there exists a log-concave function $\tilde{\varphi}_{j,k}$ such that:
$$\varphi_{j,k}(\theta) = \exp\left[-\frac{\lambda_g(j)}{2}\theta^T\theta\right]\tilde{\varphi}_{j,k}(\theta). \qquad (2.6)$$

Proof.
The expression for $\pi_k^\theta(A)$ follows from (1.2) and the Markov property of the signal. The second claim is established by repeated application to (2.3)–(2.4) of Lemma 1 and the fact that the pointwise product of log-concave functions is log-concave.

The Wasserstein bound in Theorem 1 is a consequence of contraction estimates for the kernels $R_{j,k}$ derived in sections 2.3 and 2.4. In particular, (2.1) is obtained by combining Proposition 1, which is based on a synchronous coupling of an $h$-process interpretation of $R_{j,k}$, where $h$ is a certain space-time harmonic function, with Proposition 2, which quantifies the log-concavity of $h$ inherited from that of $\varphi_{j,k}$ in (2.6).

2.3 $h$-transform of the signal process

Let $C([0,\Delta], \mathbb{R}^p \times [0,\Delta])$ be the space of $\mathbb{R}^p \times [0,\Delta]$-valued, continuous functions on $[0,\Delta]$, endowed with the supremum norm. Let $(\theta_t, t)_{t\in[0,\Delta]}$ be the associated space-time coordinate process and let $\mathcal{F} = (\mathcal{F}_t)_{t\in[0,\Delta]}$ be the filtration it generates. The extended generator (in the sense of [11, p. 285]) of the space-time process on $C([0,\Delta], \mathbb{R}^p \times [0,\Delta])$ under the law associated with (1.1), acting on functions $f$ on $\mathbb{R}^p \times \mathbb{R}_+$, is:
$$\mathcal{L}f(\theta, t) := \frac{\partial}{\partial t}f(\theta, t) + (\alpha + \beta\theta)^T\nabla_\theta f(\theta, t) + \frac{\sigma^2}{2}\nabla_\theta^2 f(\theta, t).$$

Lemma 3.
Let Assumption 1 hold, fix any $j, k$ such that $0 \leq j \leq k$ and define
$$h(\theta, t) := P_{\Delta - t}\varphi_{j,k}(\theta). \qquad (2.7)$$
There exists a probability kernel $P^h : \mathbb{R}^p \times \mathcal{F}_\Delta \to [0,1]$ such that for any $\theta_0 \in \mathbb{R}^p$ and $A \in \mathcal{B}(\mathbb{R}^p)$, $R_{j,k}(\theta_0, A) = P^h(\theta_0, \{\theta_\Delta \in A\})$, and under $P^h(\theta_0, \cdot)$ the extended generator of the space-time process $(\theta_t, t)_{t\in[0,\Delta]}$ on $C([0,\Delta], \mathbb{R}^p \times [0,\Delta])$ is:
$$\mathcal{L}^h f(\theta, t) := \mathcal{L}f(\theta, t) + \sigma^2\left(\nabla_\theta \log h(\theta, t)\right)^T\nabla_\theta f(\theta, t). \qquad (2.8)$$

Proof.
Let $P : \mathbb{R}^p \times \mathcal{F}_\Delta \to [0,1]$ be a probability kernel such that $P(\theta_0, \cdot)$ is the law of the space-time process associated with (1.1) on the time horizon $[0,\Delta]$, initialized from the point $(\theta_0, 0)$.

Note the following properties of the functions $\varphi_{j,k}$. Under Assumption 1, for all $k \geq 0$, $\theta \mapsto g_k(\theta, y_k)$ is strictly positive and therefore so is $\varphi_{j,k}$ for all $j \leq k$. Also, it follows from the assumption that for all $k \geq 0$, $\theta \mapsto g_k(\theta, y_k)$ is a member of $C^2$, combined with (2.3)–(2.4) and (1.3), that $\varphi_{j,k} \in C^2$. By the log-concavity established in Lemma 2, there exists a constant $c$ such that $\varphi_{j,k}(\theta)$ grows no faster than $e^{c\|\theta\|}$ as $\|\theta\| \to \infty$.

Now fix $j, k$ as in the statement. Then $\theta \mapsto h(\theta, t)$ is strictly positive, log-concave by Lemma 2 and Lemma 1, and a member of $C^2$ because of (2.7) and $\varphi_{j,k} \in C^2$. With:
$$D_t := \frac{h(\theta_t, t)}{h(\theta_0, 0)},$$
$(D_t)_{t\in[0,\Delta]}$ is a $(\mathcal{F}_t, P(\theta_0, \cdot))$-continuous martingale, and the expected value of $D_t$ under $P(\theta_0, \cdot)$ is $1$. Now define the probability kernel $P^h(\theta, \cdot) := D_\Delta \cdot P(\theta, \cdot)$. Under $P^h(\theta_0, \cdot)$, $(\theta_t)_{t\in[0,\Delta]}$ is an inhomogeneous Markov process with transition probabilities:
$$P^h_{s,t}(\theta, \mathrm{d}\vartheta) := P_{t-s}(\theta, \mathrm{d}\vartheta)\frac{h(\vartheta, t)}{h(\theta, s)},$$
and $R_{j,k}(\theta, A) = P^h_{0,\Delta}(\theta, A) = P^h(\theta, \{\theta_\Delta \in A\})$. By [11, Prop. 3.9, p. 357], the extended generator of the space-time process under $P^h(\theta_0, \cdot)$ is $\mathcal{L}^h f = h^{-1}\mathcal{L}(hf)$, which is equal to the r.h.s. of (2.8) because $\int P_s(\theta, \mathrm{d}\vartheta)\, h(\vartheta, s + t) = h(\theta, t)$ and hence $\mathcal{L}(h) = 0$.

Proposition 1.
Fix any $j, k$ such that $0 \leq j \leq k$. If there exists a continuous function $\lambda_h : [0,\Delta] \to [0,\infty)$ and a function $\tilde{h} : \mathbb{R}^p \times [0,\Delta] \to (0,\infty)$ such that for each $t$, $\theta \mapsto \tilde{h}(\theta, t)$ is log-concave and $h$ as in Lemma 3 satisfies
$$h(\theta, t) = \exp\left[-\frac{\lambda_h(t)}{2}\theta^T\theta\right]\tilde{h}(\theta, t),$$
then for any $q \geq 1$,
$$W_q\left(R_{j,k}(\theta, \cdot), R_{j,k}(\vartheta, \cdot)\right) \leq \exp\left(-\lambda_{\mathrm{sig}}\Delta - \sigma^2\int_0^\Delta \lambda_h(t)\,\mathrm{d}t\right)\|\theta - \vartheta\|.$$

Proof.
Consider the synchronous coupling
$$\theta_t = \theta + \int_0^t \left(\alpha + \beta\theta_s + \sigma^2\nabla_\theta \log h(\theta_s, s)\right)\mathrm{d}s + \sigma B_t,$$
$$\vartheta_t = \vartheta + \int_0^t \left(\alpha + \beta\vartheta_s + \sigma^2\nabla_\theta \log h(\vartheta_s, s)\right)\mathrm{d}s + \sigma B_t.$$
By Ito's formula, for any continuous function $\zeta : [0,\Delta] \to \mathbb{R}$,
$$\|\theta_t - \vartheta_t\|^2 e^{2\int_0^t \zeta(s)\mathrm{d}s} = \|\theta - \vartheta\|^2 + 2\int_0^t \left(\zeta(s)\|\theta_s - \vartheta_s\|^2 + (\theta_s - \vartheta_s)^T\beta(\theta_s - \vartheta_s)\right)e^{2\int_0^s \zeta(u)\mathrm{d}u}\,\mathrm{d}s + 2\int_0^t \sigma^2\left(\nabla_\theta \log h(\theta_s, s) - \nabla_\theta \log h(\vartheta_s, s)\right)^T(\theta_s - \vartheta_s)\, e^{2\int_0^s \zeta(u)\mathrm{d}u}\,\mathrm{d}s. \qquad (2.9)$$
Now set $\zeta(s) = \lambda_{\mathrm{sig}} + \sigma^2\lambda_h(s)$. For any skew-symmetric matrix, say $A$, and any $u \in \mathbb{R}^p$, $u^T A u = (Au)^T u = u^T A^T u = -u^T A u$, hence $u^T A u = 0$, so
$$u^T\beta u = \frac{1}{2}u^T(\beta + \beta^T)u \leq -\lambda_{\mathrm{sig}}\|u\|^2, \qquad \forall u \in \mathbb{R}^p. \qquad (2.10)$$
The assumption on $h$ implies
$$\left(\nabla_\theta \log h(\theta, s) - \nabla_\theta \log h(\vartheta, s)\right)^T(\theta - \vartheta) \leq -\lambda_h(s)\|\theta - \vartheta\|^2, \qquad \theta, \vartheta \in \mathbb{R}^p. \qquad (2.11)$$
Applying (2.10) and (2.11) to (2.9) gives:
$$\|\theta_\Delta - \vartheta_\Delta\| \leq \exp\left(-\int_0^\Delta \left(\lambda_{\mathrm{sig}} + \sigma^2\lambda_h(t)\right)\mathrm{d}t\right)\|\theta - \vartheta\|.$$
The proof is completed by taking expectations and applying Lemma 3.

2.4 Quantifying log-concavity of $\theta \mapsto h(\theta, t)$

The main result of this section is Proposition 2, which complements Lemma 1 by quantifying the influence on the log-concavity of $\theta \mapsto h(\theta, t)$ of the parameters of the signal process and the log-concavity of the likelihood functions, and provides verification of the hypotheses of Proposition 1.

Proposition 2.
Let Assumption 1 hold, fix $j, k$ such that $0 \leq j \leq k$ and let $h$ be as in Lemma 3. Then there exists a function $\tilde{h} : \mathbb{R}^p \times [0,\Delta] \to (0,\infty)$ such that $\theta \mapsto \tilde{h}(\theta, t)$ is log-concave and
$$h(\theta, t) = \exp\left[-\frac{\lambda_h(t)}{2}\theta^T\theta\right]\tilde{h}(\theta, t),$$
where
$$\lambda_h(t) := \frac{\lambda_g(j)\,\lambda_\beta^{\min}(\Delta - t)}{1 + \sigma^2\lambda_g(j)\int_t^\Delta \lambda_\beta^{\max}(\Delta - s)\,\mathrm{d}s},$$
and $\lambda_\beta^{\min}(t), \lambda_\beta^{\max}(t)$ are respectively the smallest and largest eigenvalues of $e^{\beta t}(e^{\beta t})^T$.

We shall make use of the following well-known lemma [9, Thm. 6].
Lemma 4.
For every function $(u, v) \mapsto f(u, v)$ on $\mathbb{R}^p \times \mathbb{R}^q$ which is log-concave in $(u, v)$, the integral $\int f(u, v)\,\mathrm{d}v$ is a log-concave function of $u$.
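Lemma 4 can be sanity-checked numerically. The function $f(u,v) = e^{-(u^2 + uv + v^2)}$ is jointly log-concave (its quadratic form has matrix $\begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix} \succ 0$), and here $\int f(u,v)\,\mathrm{d}v = \sqrt{\pi}\, e^{-3u^2/4}$ in closed form; the grid and tolerances below are illustrative choices:

```python
import numpy as np

u = np.linspace(-4.0, 4.0, 401)
v = np.linspace(-12.0, 12.0, 2401)
U, W = np.meshgrid(u, v, indexing="ij")
f = np.exp(-(U**2 + U * W + W**2))            # jointly log-concave in (u, v)

marg = f.sum(axis=1) * (v[1] - v[0])          # int f(u, v) dv, Riemann sum

# discrete second derivative of log(marg): nonpositive <=> log-concave marginal
log_m = np.log(marg)
d2 = log_m[2:] - 2.0 * log_m[1:-1] + log_m[:-2]
assert d2.max() <= 1e-9

# agreement with the closed form sqrt(pi) * exp(-3 u^2 / 4)
assert np.allclose(log_m, 0.5 * np.log(np.pi) - 0.75 * u**2, atol=1e-6)
```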
Lemma 5.
Let $F, S$ be real, square, symmetric matrices such that $F + S$ is invertible. Then
$$v^T F v + (u - v)^T S(u - v) = u^T C u + z^T(F + S)z,$$
where $C := F(F + S)^{-1}S$ and $z := v - (F + S)^{-1}Su$.

Proof. Using the assumed symmetry of $F$ and $S$,
$$z^T(F + S)z = v^T(F + S)v - 2u^T S v + u^T S(F + S)^{-1}S u.$$
Therefore, since $F(F + S)^{-1}S + S(F + S)^{-1}S = S$,
$$u^T C u + z^T(F + S)z = u^T S u + v^T(F + S)v - 2u^T S v = v^T F v + (u - v)^T S(u - v).$$

Lemma 6.
Let $f$ be any function of the form $f(u) = \exp\left(-\frac{1}{2}u^T F u\right)\tilde{f}(u)$, $u \in \mathbb{R}^p$, where $F$ is a real symmetric matrix and $\tilde{f}$ is log-concave, and let $S$ be a real symmetric matrix such that $F + S$ is invertible. Then for any $a \in \mathbb{R}^p$ and $p \times p$ real matrix $B$,
$$f(v)\exp\left[-\frac{1}{2}(v - a - Bu)^T S(v - a - Bu)\right] = \exp\left(-\frac{1}{2}u^T B^T C B u\right)\exp\left[-\frac{1}{2}\left(a^T C a + 2a^T C B u\right)\right]\tilde{f}(v)\exp\left[-\frac{1}{2}z^T(F + S)z\right],$$
where $C = F(F + S)^{-1}S$ and $z = v - (F + S)^{-1}S(a + Bu)$.

Proof.
Using Lemma 5 with $u$ there replaced by $a + Bu$,
$$f(v)\exp\left[-\frac{1}{2}(v - a - Bu)^T S(v - a - Bu)\right] = \tilde{f}(v)\exp\left[-\frac{1}{2}\left\{v^T F v + (v - a - Bu)^T S(v - a - Bu)\right\}\right]$$
$$= \tilde{f}(v)\exp\left[-\frac{1}{2}\left\{(a + Bu)^T C(a + Bu) + z^T(F + S)z\right\}\right]$$
$$= \exp\left(-\frac{1}{2}u^T B^T C B u\right)\exp\left[-\frac{1}{2}\left(a^T C a + 2a^T C B u\right)\right]\tilde{f}(v)\exp\left[-\frac{1}{2}z^T(F + S)z\right].$$

Proof of Proposition 2. First note that for the signal process $(\theta_t)_{t\in\mathbb{R}_+}$ as per (1.1),
$$m_t := \mathbb{E}_{\theta_0}[\theta_t] = a_t + e^{\beta t}\theta_0, \qquad \Sigma_t := \mathbb{E}_{\theta_0}\left[(\theta_t - m_t)(\theta_t - m_t)^T\right] = \sigma^2\int_0^t e^{\beta(t-s)}\left(e^{\beta(t-s)}\right)^T\mathrm{d}s,$$
where
$$a_t := e^{\beta t}\int_0^t \left(e^{\beta s}\right)^{-1}\alpha\,\mathrm{d}s.$$
It follows that $u^T\Sigma_t^{-1}u \geq \Lambda_t^{-1}u^T u$ for all $u \in \mathbb{R}^p$, with the shorthand $\Lambda_t := \sigma^2\int_0^t \lambda_\beta^{\max}(s)\,\mathrm{d}s$.

Applying Lemma 6 with $a = a_t$, $B = e^{\beta t}$, $S = I\Lambda_t^{-1}$, $f = \varphi_{j,k}$, $F = I\lambda_g(j)$, and Lemma 2,
$$\varphi_{j,k}(\theta)\exp\left[-\frac{1}{2}(\theta - a_t - e^{\beta t}\theta_0)^T\Sigma_t^{-1}(\theta - a_t - e^{\beta t}\theta_0)\right]$$
$$= \exp\left(-\frac{\lambda_g(j)\lambda_\beta^{\min}(t)}{2(1 + \lambda_g(j)\Lambda_t)}\theta_0^T\theta_0\right)\cdot\exp\left[-\frac{1}{2}\left(a_t^T C_t a_t + 2a_t^T C_t e^{\beta t}\theta_0\right)\right]\tilde{\varphi}_{j,k}(\theta)\exp\left[-\frac{1}{2}z_t^T\left(\lambda_g(j) + \Lambda_t^{-1}\right)z_t\right] \qquad (2.12)$$
$$\cdot\exp\left[-\frac{\lambda_g(j)}{2(1 + \lambda_g(j)\Lambda_t)}\theta_0^T\left(\left(e^{\beta t}\right)^T e^{\beta t} - I\lambda_\beta^{\min}(t)\right)\theta_0\right] \qquad (2.13)$$
$$\cdot\exp\left[-\frac{1}{2}(\theta - a_t - e^{\beta t}\theta_0)^T\left(\Sigma_t^{-1} - \Lambda_t^{-1}I\right)(\theta - a_t - e^{\beta t}\theta_0)\right], \qquad (2.14)$$
where $C_t = I\frac{\lambda_g(j)}{1 + \lambda_g(j)\Lambda_t}$ and $z_t = \theta - \frac{1}{1 + \lambda_g(j)\Lambda_t}(a_t + e^{\beta t}\theta_0)$.

The product of the terms in (2.12)–(2.14) is jointly log-concave in $(\theta_0, \theta)$. Therefore by Lemma 4, there exists a function $\tilde{h}$ such that $\theta \mapsto \tilde{h}(\theta, t)$ is log-concave and
$$h(\theta_0, t) = P_{\Delta - t}\varphi_{j,k}(\theta_0) = \frac{1}{\sqrt{(2\pi)^p\det\Sigma_{\Delta - t}}}\int \varphi_{j,k}(\theta)\exp\left[-\frac{1}{2}(\theta - a_{\Delta - t} - e^{\beta(\Delta - t)}\theta_0)^T\Sigma_{\Delta - t}^{-1}(\theta - a_{\Delta - t} - e^{\beta(\Delta - t)}\theta_0)\right]\mathrm{d}\theta$$
$$= \exp\left(-\frac{\lambda_g(j)\lambda_\beta^{\min}(\Delta - t)}{2(1 + \lambda_g(j)\Lambda_{\Delta - t})}\theta_0^T\theta_0\right)\tilde{h}(\theta_0, t).$$
Since $\Lambda_{\Delta - t} = \sigma^2\int_t^\Delta \lambda_\beta^{\max}(\Delta - s)\,\mathrm{d}s$, this is the decomposition claimed in the statement, which completes the proof.

Obtaining a satisfactory generalization of Theorem 1 to allow for initial distributions $\mu$ other than Dirac measures appears to be a non-trivial matter. The difficulty is that the corresponding generalization of (2.5) from which to start is:
$$\pi_k^\mu(A) = \mu_{0,k} R_{1,k} R_{2,k} \cdots R_{k,k}(A), \qquad \mu_{0,k} := \frac{\mu\cdot\varphi_{0,k}}{\mu\varphi_{0,k}},$$
so a direct corollary of Theorem 1 is:
$$W_q(\pi_k^\mu, \pi_k^\nu) \leq \exp\left(-\sum_{j=1}^k \int_0^\Delta \lambda(j,t)\,\mathrm{d}t\right) W_q(\mu_{0,k}, \nu_{0,k}). \qquad (3.1)$$
But it cannot immediately be deduced from (3.1) that $\lim_k W_q(\pi_k^\mu, \pi_k^\nu) = 0$, due to the dependence of $W_q(\mu_{0,k}, \nu_{0,k})$ on $k$.

An alternative is to work with a certain family of weighted Wasserstein distances between filtering distributions. As we shall see, this is equivalent to establishing forgetting of the initial condition for smoothing distributions, which unlike filtering distributions condition on future as well as past and present observations. The starting point from which to describe this equivalence in more detail is the following lemma.

Lemma 7.
Let $d(\cdot,\cdot)$ be a metric on the set of probability measures on $\mathcal{B}(\mathbb{R}^p)$ and let $\phi : \mathbb{R}^p \to (0,\infty)$. Then $d_\phi(\cdot,\cdot)$ defined by:
$$d_\phi : (\mu, \nu) \mapsto d\left(\frac{\mu\cdot\phi}{\mu\phi}, \frac{\nu\cdot\phi}{\nu\phi}\right)$$
is a metric on the subset of probability measures $\{\mu \text{ on } \mathcal{B}(\mathbb{R}^p) : \mu\phi < \infty\}$.

Proof. It follows immediately from the assumption that $d$ is a metric and $\phi$ is strictly positive that, on the given domain $\{\mu : \mu\phi < \infty\}$, $d_\phi$ is nonnegative, symmetric, satisfies the triangle inequality, and $\mu = \nu \Rightarrow d_\phi(\mu,\nu) = 0$. For the reverse implication, we have $d_\phi(\mu,\nu) = 0 \Rightarrow \mu^\phi := \frac{\mu\cdot\phi}{\mu\phi} = \frac{\nu\cdot\phi}{\nu\phi} =: \nu^\phi$, so since $\phi$ is strictly positive, $\mu \ll \nu$ and $\mathrm{d}\mu^\phi/\mathrm{d}\nu^\phi = (\mathrm{d}\mu/\mathrm{d}\nu)(\nu\phi/\mu\phi)$, $\nu^\phi$-a.e. and then also $\nu$-a.e. since $\phi$ is strictly positive. Thus $\mathrm{d}\mu/\mathrm{d}\nu = \mathrm{const.}$, $\nu$-a.e., and since $\mu$ and $\nu$ are probability measures, it follows that $\mu = \nu$.

Introduce the nonnegative integral kernels
$$Q_k(\theta, \mathrm{d}\vartheta) := g_{k-1}(\theta, y_{k-1})\, P_\Delta(\theta, \mathrm{d}\vartheta), \quad k \geq 1, \qquad Q_{j,k} := Q_{j+1}\cdots Q_k, \quad 0 \leq j < k, \qquad (3.2)$$
and the probability measures
$$\eta_k^\mu(A) := \frac{\mu Q_{0,k}(A)}{\mu Q_{0,k}(\mathbb{R}^p)}, \quad k \geq 1, \qquad \eta_0^\mu := \mu, \qquad A \in \mathcal{B}(\mathbb{R}^p),$$
for any $\mu$ such that the denominator is finite. Note that $\eta_k^\mu = \pi_{k-1}^\mu P_\Delta$. We shall use the functions appearing in the following assumption to define a family of weighted Wasserstein distances.

Assumption 2.
There exists a probability measure $\mu$ such that for each $k \in \mathbb{N}$, the following pointwise limit exists:
$$\phi_{k,\infty}(\theta) := \lim_{\ell\to\infty}\frac{\varphi_{k,\ell}(\theta)}{\eta_k^\mu\varphi_{k,\ell}}, \qquad (3.3)$$
$\phi_{k,\infty}(\theta) \in (0,\infty)$ for all $\theta \in \mathbb{R}^p$, and the functions $(\phi_{k,\infty})_{k\in\mathbb{N}}$ so defined belong to $C^2$ and satisfy
$$Q_k\phi_{k,\infty} = \varsigma_{k-1}\phi_{k-1,\infty}, \qquad k \geq 1, \qquad (3.4)$$
where $\varsigma_k := \int \eta_k^\mu(\mathrm{d}u)\, g_k(u, y_k) \in (0,\infty)$.

Before discussing the interpretation of Assumption 2, consider the following lemma, which mirrors Lemma 2.
Lemma 8.
If Assumption 2 holds, then for any $\mu$ such that for all $k \in \mathbb{N}$, $\pi_k^\mu P_\Delta\phi_{k+1,\infty} < \infty$, the probability measures $(\pi_{k,\infty}^\mu)_{k\in\mathbb{N}}$ defined by:
$$\pi_{k,\infty}^\mu(A) := \frac{\pi_k^\mu\left(\mathbb{1}_A\, P_\Delta\phi_{k+1,\infty}\right)}{\pi_k^\mu P_\Delta\phi_{k+1,\infty}}, \qquad A \in \mathcal{B}(\mathbb{R}^p), \qquad (3.5)$$
satisfy
$$\pi_{k,\infty}^\mu(A) = \pi_{0,\infty}^\mu R_{1,\infty}\cdots R_{k,\infty}(A), \qquad (3.6)$$
with the Markov kernels
$$R_{k,\infty}(\theta, \mathrm{d}\vartheta) := P_\Delta(\theta, \mathrm{d}\vartheta)\frac{\phi_{k,\infty}(\vartheta)}{P_\Delta\phi_{k,\infty}(\theta)}.$$
If additionally Assumption 1 holds, then for each $k \in \mathbb{N}$, there exists a log-concave function $\tilde{\phi}_{k,\infty}$ such that
$$\phi_{k,\infty}(\theta) = \exp\left[-\frac{\lambda_g(k)}{2}\theta^T\theta\right]\tilde{\phi}_{k,\infty}(\theta).$$

Proof.
To establish (3.6) it suffices to show $\pi_{k-1,\infty}^\mu R_{k,\infty} = \pi_{k,\infty}^\mu$. We have
$$\pi_{k-1,\infty}^\mu R_{k,\infty}(A) = \frac{\pi_{k-1}^\mu\left(P_\Delta(\phi_{k,\infty})\, R_{k,\infty}(A)\right)}{\pi_{k-1}^\mu P_\Delta\phi_{k,\infty}} = \frac{\pi_{k-1}^\mu P_\Delta\left(\mathbb{1}_A\phi_{k,\infty}\right)}{\pi_{k-1}^\mu P_\Delta\phi_{k,\infty}} = \frac{\pi_{k-1}^\mu P_\Delta\left(\mathbb{1}_A\, Q_{k+1}\phi_{k+1,\infty}\right)}{\pi_{k-1}^\mu P_\Delta Q_{k+1}\phi_{k+1,\infty}} = \frac{\pi_k^\mu\left(\mathbb{1}_A\, P_\Delta\phi_{k+1,\infty}\right)}{\pi_k^\mu P_\Delta\phi_{k+1,\infty}} = \pi_{k,\infty}^\mu(A),$$
where (3.4), (3.2) and the identity $\pi_k^\mu(A) = \pi_{k-1}^\mu\left[P_\Delta\left(\mathbb{1}_A\, Q_{k+1}\mathbb{1}_{\mathbb{R}^p}\right)\right]/\pi_{k-1}^\mu\left[P_\Delta\left(Q_{k+1}\mathbb{1}_{\mathbb{R}^p}\right)\right]$ have been used.

For the second claim, the fact that $\phi_{j,\infty}$ is log-concave for every $j \in \mathbb{N}$ follows from its definition as the pointwise limit in (3.3) and the log-concavity of $\varphi_{j,k}$ established in Lemma 2. By Lemma 1, $P_\Delta\phi_{k+1,\infty}$ is log-concave, and since by Assumption 2, $\phi_{k,\infty} = \varsigma_k^{-1}Q_{k+1}\phi_{k+1,\infty}$, we may take $\tilde{\phi}_{k,\infty}(\theta) = \varsigma_k^{-1}\tilde{g}_k(\theta)\, P_\Delta\phi_{k+1,\infty}(\theta)$.

Recalling from section 1 the interpretation of $\pi_k^\mu$ as the conditional distribution of $\theta_{k\Delta}$ given $(y_0, \ldots, y_k)$, the measure $\pi_k^\mu\cdot(P_\Delta\varphi_{k+1,k+\ell})/\pi_k^\mu P_\Delta\varphi_{k+1,k+\ell}$ is the smoothing distribution which conditions additionally on $(y_{k+1}, \ldots, y_{k+\ell})$. The interpretation of (3.3) is then that $\phi_{k,\infty}$ is the function with which to re-weight $\pi_k^\mu P_\Delta$ in order to condition on the infinite data record $(y_{k+\ell})_{\ell\in\mathbb{N}}$.

The question of whether there exists a well-behaved (in the sense of satisfying the other requirements of Assumption 2) function which achieves this conditioning is closely connected to the question of filter stability; see [14] for a discussion on doubly infinite time horizons. Indeed it is clear from (3.5) that Assumption 2 implies that the filtering and smoothing measures, $\pi_k^\mu$ and $\pi_{k,\infty}^\mu$, are equivalent, despite the fact that $\pi_{k,\infty}^\mu$ conditions on an infinite number of observations.
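On a finite grid, the filter-to-smoother reweighting of the kind appearing in (3.5) (with a finite rather than infinite data record) can be checked exactly: reweighting the filter by $P_\Delta$ applied to a backward function built as in (2.3)–(2.4) reproduces the smoothing marginal obtained by brute-force marginalisation of the joint. All model ingredients below (grid, kernel, likelihoods) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
m, T, k = 12, 4, 1                      # grid size, time points 0..T-1, query time k
x = np.linspace(-2.0, 2.0, m)
P = np.exp(-((x[None, :] - 0.8 * x[:, None]) ** 2) / 0.2)
P /= P.sum(axis=1, keepdims=True)       # discretised transition kernel, rows sum to 1
G = [np.exp(-0.5 * (yk - x) ** 2) for yk in rng.normal(size=T)]   # g_0, ..., g_{T-1}
mu = np.full(m, 1.0 / m)

# forward filter up to time k, then reweight by P applied to the backward function
pi = mu * G[0]; pi /= pi.sum()
for j in range(1, k + 1):
    pi = (pi @ P) * G[j]; pi /= pi.sum()
phi = G[T - 1]
for j in range(T - 2, k, -1):
    phi = G[j] * (P @ phi)              # backward functions, cf. (2.3)-(2.4)
smooth_rw = pi * (P @ phi); smooth_rw /= smooth_rw.sum()

# brute force: build the joint over (theta_0, ..., theta_{T-1}) and marginalise
joint = mu * G[0]
for j in range(1, T):
    joint = joint[..., :, None] * P * G[j]   # appends the time-j dimension
smooth_bf = joint.sum(axis=tuple(i for i in range(T) if i != k))
smooth_bf /= smooth_bf.sum()

assert np.allclose(smooth_rw, smooth_bf, atol=1e-12)
```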
Various existing tools are available to verify Assumption 2; we shall illustrate some of them in an example below. It is an open question whether Assumption 2 can be deduced directly from Theorem 1.

When Assumption 2 holds, we shall consider the family of weighted Wasserstein distances
$$W_{q,k}(\mu, \nu) := W_q\left(\frac{\mu\cdot P_\Delta\phi_{k+1,\infty}}{\mu P_\Delta\phi_{k+1,\infty}}, \frac{\nu\cdot P_\Delta\phi_{k+1,\infty}}{\nu P_\Delta\phi_{k+1,\infty}}\right), \qquad k \in \mathbb{N},$$
whenever $\mu, \nu$ satisfy appropriate integrability conditions for this object to be well-defined. The interest in the distances $W_{q,k}$ is the identity:
$$W_{q,k}(\pi_k^\mu, \pi_k^\nu) = W_q(\pi_{k,\infty}^\mu, \pi_{k,\infty}^\nu), \qquad (3.7)$$
which follows from (3.5). Thus $W_{q,k}$ quantifies distance between $\pi_k^\mu$ and $\pi_k^\nu$ as the $W_q$-distance between the corresponding smoothing distributions $\pi_{k,\infty}^\mu$ and $\pi_{k,\infty}^\nu$.

We denote the set of probability measures
$$\mathcal{P}_q := \left\{\mu \text{ on } \mathcal{B}(\mathbb{R}^p) : \int(1 + \|u\|^q)\, P_\Delta\phi_{1,\infty}(u)\,\mu(\mathrm{d}u) < \infty \text{ and } \pi_k^\mu P_\Delta\phi_{k+1,\infty} < \infty,\ \forall k \in \mathbb{N}\right\}.$$

Theorem 2. If Assumptions 1 and 2 hold, then for any $q \geq 1$,
$$W_{q,k}(\pi_k^\mu, \pi_k^\nu) \leq \exp\left(-\sum_{j=1}^k \int_0^\Delta \lambda(j,t)\,\mathrm{d}t\right) W_{q,0}(\pi_0^\mu, \pi_0^\nu), \qquad \forall k \geq 1,\ \mu, \nu \in \mathcal{P}_q,$$
where $\lambda(j,t)$ is as in Theorem 1.

Given the identities (3.6) and (3.7), the proof of Theorem 2 follows almost exactly the same programme as the proof of Theorem 1, except working with the kernels $R_{k,\infty}$, the functions $\phi_{k,\infty}$ and their log-concavity in Lemma 8, instead of $R_{j,k}$, $\varphi_{j,k}$ and their log-concavity in Lemma 2. The requirement $\mu, \nu \in \mathcal{P}_q$ ensures that $W_{q,0}(\mu,\nu)$ and $\pi_{k,\infty}^\mu, \pi_{k,\infty}^\nu$ are well-defined.

Example - dynamic logistic regression
Consider the case: $\sigma > 0$, $\beta = -I\lambda_{\mathrm{sig}}$ for some $\lambda_{\mathrm{sig}} > 0$, and with $\mathsf{Y} = \{0,1\}^n$, the observations $Y_k = (Y_k^1, \ldots, Y_k^n)$ are conditionally independent given $\theta_{k\Delta}$, with the conditional probability of $\{Y_k^i = 1\}$ being $1/(1 + e^{-\sum_j \theta_{k\Delta}^j x_k^{ij}})$, where the $x_k^{ij}$ are known covariates. The likelihood function is then:
$$g_k(\theta, y_k) = \exp\left(\sum_{i=1}^n\left[\sum_{j=1}^p y_k^i x_k^{ij}\theta^j - \log\left(1 + e^{\sum_{j=1}^p x_k^{ij}\theta^j}\right)\right]\right).$$
For any $(y_k)_{k\in\mathbb{N}}$, Assumption 1 is satisfied with $\lambda_g(k) = 0$, and therefore (2.2) holds by Theorem 1. Checking Assumption 2 is more involved; we shall use some results from [13].

Let us assume that the covariates satisfy
$$\sup_{k\geq 0}\sum_{i,j}\left(x_k^{ij}\right)^2 < \infty, \qquad (3.8)$$
and fix an arbitrary sequence of observations $(y_k)_{k\in\mathbb{N}}$. The following properties of this model are easily checked (see [13, Sec. 3.1] for a similar example): there exists a constant $c > 0$ such that with
$$V(\theta) := 1 + c\|\theta\|, \qquad C_d := \{\theta \in \mathbb{R}^p : V(\theta) \leq d\}, \qquad (3.9)$$
we have for some $d_0 \in [1,\infty)$ and all $d \geq d_0$:

• $\sup_k g_k(\theta, y_k) \leq 1$, $\forall \theta \in \mathbb{R}^p$, and there exist constants $\delta \in (0,1)$, $b_d \in [0,\infty)$ such that
$$P_\Delta(e^V) \leq \exp\left(V(1 - \delta) + b_d\mathbb{1}_{C_d}\right), \qquad (3.10)$$

• $\inf_k g_k(\theta, y_k)\, P_\Delta(\theta, C_d) > 0$, $\forall \theta \in \mathbb{R}^p$,

• there exist constants $\epsilon_d^-, \epsilon_d^+$ such that $\forall \theta \in C_d$ and $k \in \mathbb{N}$,
$$\epsilon_d^-\,\nu_d(\mathrm{d}\vartheta)\mathbb{1}_{C_d}(\vartheta) \leq g_k(\theta, y_k)\, P_\Delta(\theta, \mathrm{d}\vartheta)\mathbb{1}_{C_d}(\vartheta) \leq \epsilon_d^+\,\nu_d(\mathrm{d}\vartheta)\mathbb{1}_{C_d}(\vartheta),$$
where the probability measure $\nu_d$ is the normalized restriction of Lebesgue measure to $C_d$.

Define the norm on functions $f : \mathbb{R}^p \to \mathbb{R}$, $\|f\|_{e^V} := \sup_\theta |f(\theta)|/e^{V(\theta)}$.

Proposition 3.
For any $\mu$ such that $\mu(e^V) < \infty$, define $\phi_{j,k}(\theta) := \varphi_{j,k}(\theta)/\pi_{j-1}^\mu P_\Delta\varphi_{j,k}$. Then:

1) $\sup_{k\geq 0}\eta_k^\mu(e^V) < \infty$,

2) $\sup_{0\leq j\leq k}\|\phi_{j,k}\|_{e^V} < \infty$,

3) for all $d \geq d_0$, $\inf_{0\leq j\leq k}\inf_{\theta\in C_d}\phi_{j,k}(\theta) > 0$,

4) for all $0 < j \leq k$, $Q_j\phi_{j,k} = \varsigma_{j-1}\phi_{j-1,k}$, where $\varsigma_j = \int\eta_j^\mu(\mathrm{d}\theta)\, g_j(\theta, y_j)$,

5) there exist constants $\rho < 1$ and $c_\mu < \infty$ such that for any $f : \mathbb{R}^p \to \mathbb{R}$ with $\|f\|_{e^V} < \infty$,
$$\left|\frac{Q_{j,k}f(\theta)}{\prod_{i=j}^{k-1}\varsigma_i} - \phi_{j,k-1}(\theta)\,\eta_k^\mu f\right| \leq \rho^{k-j}\,\|f\|_{e^V}\, c_\mu\, e^{V(\theta)}\mu(e^V), \qquad \forall \theta \in \mathbb{R}^p,\ 0 \leq j < k.$$

Proof.
The properties identified immediately before the statement of the proposition and the requirement $\mu(e^V) < \infty$ imply that conditions (H1)–(H4) of [13] are satisfied. Then 1) and 2) are established by [13, Prop. 1 and 2], 3) by [13, Lem. 10], 4) by [13, Lem. 1], and 5) by [13, Thm. 1].

The following proposition establishes that the conditions of Theorem 2 are satisfied.

Proposition 4.
For any sequence of observations $(y_k)_{k\in\mathbb{N}}$, the dynamic logistic regression model described above satisfies Assumption 2 with $\sup_{k\geq 0}\|\phi_{k,\infty}\|_{e^V} < \infty$, and for any $q \geq 1$,
$$W_{q,k}(\pi_k^\mu, \pi_k^\nu) \leq \exp(-k\Delta\lambda_{\mathrm{sig}})\, W_{q,0}(\pi_0^\mu, \pi_0^\nu), \qquad (3.11)$$
for all $\mu, \nu$ in the set of probability measures $\left\{\mu \text{ on } \mathcal{B}(\mathbb{R}^p) : \int(1 + \|u\|^q)\, e^{c\|u\|}\mu(\mathrm{d}u) < \infty\right\}$, where $c$ is as in (3.9).

Remark. The constant $\rho < 1$ appearing in part 5) of Proposition 3 and obtained using the techniques of [13] may degrade with the dimension of the state-space. Note, however, that $\rho$ does not appear in (3.11); it serves only as an intermediate tool used in the following proof to help establish that Assumption 2 holds.

Proof of Proposition 4.
Choose any $\mu$ such that $\mu(e^V) < \infty$. Noting the identities $\pi_{k-1}^\mu P_\Delta\varphi_{k,\ell} = \prod_{j=k}^\ell\varsigma_j$ and $\phi_{j,k} = Q_{j,k+1}\mathbb{1}_{\mathbb{R}^p}/\prod_{i=j}^k\varsigma_i$, we have for any $\ell \geq 1$,
$$\phi_{j,k} - \phi_{j,k+\ell} = \frac{Q_{j,k+1}}{\prod_{i=j}^k\varsigma_i}\left(1 - \frac{Q_{k+1,k+\ell+1}\mathbb{1}_{\mathbb{R}^p}}{\prod_{i=k+1}^{k+\ell}\varsigma_i}\right).$$
Since $\prod_{i=k+1}^{k+\ell}\varsigma_i = \eta_{k+1}^\mu Q_{k+1,k+\ell+1}\mathbb{1}_{\mathbb{R}^p}$, we have $\eta_{k+1}^\mu\left(1 - \frac{Q_{k+1,k+\ell+1}\mathbb{1}_{\mathbb{R}^p}}{\prod_{i=k+1}^{k+\ell}\varsigma_i}\right) = 0$, and by part 2) of Proposition 3,
$$\sup_{j,k,\ell}\left\|\frac{Q_{k+1,k+\ell+1}\mathbb{1}_{\mathbb{R}^p}}{\prod_{i=k+1}^{k+\ell}\varsigma_i}\right\|_{e^V} =: c_Q < \infty,$$
so an application of part 5) of Proposition 3 gives:
$$\|\phi_{j,k} - \phi_{j,k+\ell}\|_{e^V} \leq \rho^{k+1-j}\, c_Q\, c_\mu\, \mu(e^V), \qquad \forall \ell \geq 1.$$
It follows that for each $j$, $(\phi_{j,k})_{k\geq j}$ is a Cauchy sequence in the Banach space of functions $f : \mathbb{R}^p \to \mathbb{R}$ with $\|f\|_{e^V} < \infty$, endowed with the norm $\|\cdot\|_{e^V}$. With the strong limit of $(\phi_{j,k})_{k\geq j}$ then denoted $\phi_{j,\infty}$, we have $\|\phi_{j,\infty}\|_{e^V} < \infty$ and $\phi_{j,\infty}(\theta) = \lim_{k\to\infty}\phi_{j,k}(\theta)$ pointwise.

From part 4) of Proposition 3,
$$Q_j\phi_{j,k} = Q_j\phi_{j,\infty} + Q_j(\phi_{j,k} - \phi_{j,\infty}) = \varsigma_{j-1}\phi_{j-1,\infty} + \varsigma_{j-1}(\phi_{j-1,k} - \phi_{j-1,\infty}) = \varsigma_{j-1}\phi_{j-1,k},$$
and since, using (3.10), $\|Q_j(e^V)\|_{e^V} < \infty$, $\|\phi_{j-1,k} - \phi_{j-1,\infty}\|_{e^V} \to 0$ and $\|Q_j(\phi_{j,k} - \phi_{j,\infty})\|_{e^V} \leq \|Q_j(e^V)\|_{e^V}\|\phi_{j,k} - \phi_{j,\infty}\|_{e^V} \to 0$, both as $k \to \infty$, we have $Q_j\phi_{j,\infty} = \varsigma_{j-1}\phi_{j-1,\infty}$. Since $g_j(\theta, y_j) \in (0,1]$, we have $\varsigma_j \in (0,1]$, and using part 3) of Proposition 3, $Q_j\phi_{j,\infty}(\theta) > 0$ for all $\theta$, hence $\phi_{j-1,\infty}(\theta) > 0$ for all $\theta$. Also $\|\phi_{j,\infty}\|_{e^V} < \infty$ implies $\phi_{j,\infty}(\theta) < \infty$ for all $\theta$. The membership $\phi_{j-1,\infty} \in C^2$ follows from $Q_j\phi_{j,\infty} = \varsigma_{j-1}\phi_{j-1,\infty}$ together with $\theta \mapsto g_{j-1}(\theta, y_{j-1}) \in C^2$ by Assumption 1 and the fact that $P_\Delta$ is given by (1.3). That completes the verification of Assumption 2.

To complete the proof, observe that in order for $\mu \in \mathcal{P}_q$ it is sufficient that $\int(1 + \|\theta\|^q)\, e^{V(\theta)}\mu(\mathrm{d}\theta) < \infty$, because $\sup_{k\geq 0}\|\phi_{k,\infty}\|_{e^V} < \infty$ (by part 2) of Proposition 3), $\pi_{k-1}^\mu P_\Delta = \eta_k^\mu$, and by part 1) of Proposition 3, $\sup_k \eta_k^\mu(e^V) < \infty$.

Acknowledgement.
The author thanks Anthony Lee for helpful comments.

References

[1] P. Cattiaux and A. Guillin. Semi log-concave Markov diffusions. In Séminaire de Probabilités XLVI, pages 231–292. Springer, 2014.

[2] D. Crisan and B. Rozovskii. The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.

[3] R. Douc, E. Gassiat, B. Landelle, and E. Moulines. Forgetting of the initial distribution for nonergodic hidden Markov chains. The Annals of Applied Probability, 20(5):1638–1662, 2010.

[4] M. Gerber and N. Whiteley. Stability with respect to initial conditions in V-norm for nonlinear filters with ergodic observations. Journal of Applied Probability, 54(1):118–133, 2017.

[5] J. Harrison and M. West. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer, New York, 1999.

[6] M.L. Kleptsyna and A.Y. Veretennikov. On discrete time ergodic filters with wrong initial data. Probability Theory and Related Fields, 141(3-4):411–444, 2008.

[7] A.V. Kolesnikov. On diffusion semigroups preserving the log-concavity. Journal of Functional Analysis, 186(1):196–205, 2001.

[8] P. McCullagh and J.A. Nelder. Generalized Linear Models, volume 37 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1989.

[9] A. Prékopa. On logarithmic concave measures and functions. Acta Scientiarum Mathematicarum, 34:334–343, 1973.

[10] P. Rebeschini and R. van Handel. Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809–2866, 2015.

[11] D. Revuz and M. Yor. Continuous Martingales and Brownian Motion. Springer, 3rd edition, 1999.

[12] X.T. Tong and R. van Handel. Conditional ergodicity in infinite dimension. The Annals of Probability, 42(6):2243–2313, 2014.

[13] N. Whiteley. Stability properties of some particle filters.
The Annals of Applied Probability, 23(6):2500–2537, 2013.

[14] N. Whiteley and A. Lee. Twisted particle filters. The Annals of Statistics, 42(1):115–141, 2014.