Invariant density adaptive estimation for ergodic jump diffusion processes over anisotropic classes
Chiara Amorino ∗ , Arnaud Gloter ∗ January 22, 2020
Abstract
We consider the solution $X = (X_t)_{t \ge 0}$ of a multivariate stochastic differential equation with Lévy-type jumps and with unique invariant probability measure with density $\mu$. We assume that a continuous record of observations $X^T = (X_t)_{0 \le t \le T}$ is available. In the case without jumps, Reiss and Dalalyan [7] and Strauch [24] have found convergence rates of invariant density estimators, under respectively isotropic and anisotropic Hölder smoothness constraints, which are considerably faster than those known from standard multivariate density estimation. We extend these works by obtaining, in the presence of jumps, estimators which achieve, for $d \ge 3$, the same convergence rates as in the case without jumps for the estimation of the invariant density $\mu$ of the jump diffusion $X$.

Keywords: adaptive bandwidth selection, anisotropic density estimation, ergodic diffusion with jumps, Lévy driven SDE.

1 Introduction

Diffusion phenomena arise from Markovian stochastic modeling, as solutions of SDEs with or without jumps, in many areas of applied mathematics. Their investigation concerns different mathematical branches, and research interest in questions such as the existence and regularity of solutions of stochastic differential equations has therefore grown constantly over the past years. The study of the statistical properties of diffusion models has emerged since such models are widely used for applications in finance and biology.
Diffusion processes with jumps, in particular, have been used in neuroscience, for instance in [8], while in finance they have been introduced to model the dynamics of asset prices [13], [20], exchange rates [3], or volatility processes [2]. In this work, we aim at estimating adaptively the invariant density $\mu$ associated to the process $(X_t)_{t \ge 0}$, solution of the following multivariate stochastic differential equation with Lévy-type jumps:
$$X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t a(X_s)\,dW_s + \int_0^t \int_{\mathbb{R}^d \setminus \{0\}} \gamma(X_{s^-})\, z\, \tilde{\mu}(ds,dz), \qquad (1)$$
where $W$ is a $d$-dimensional Brownian motion and $\tilde{\mu}$ a compensated Poisson random measure with possibly infinite jump activity. We assume that a continuous record of observations $X^T = (X_t)_{0 \le t \le T}$ is available.

Practical concerns raise new questions, such as the dependence of statistical features on the observation scheme: for the applications, it is a subject of interest to consider basic questions in different observation scenarios. From a theoretical point of view, however, it is also of substantial interest to work under the assumption that a continuous record of the diffusion considered is available. In this framework, it belongs to the folklore of statistics for stochastic processes without jumps that the invariant density can be estimated, under standard nonparametric assumptions, at rates faster than those of classical multivariate density estimation on $\mathbb{R}^d$.

∗ Laboratoire de Mathématiques et Modélisation d'Évry, CNRS, Univ Évry, Université Paris-Saclay, 91037, Évry, France.

The Russian school considered anisotropic spaces from the beginning of the theory of function spaces in the 1950s-1960s (the developments are surveyed in [22]). However, results on minimax rates of convergence in classical statistical models remained rare for a long time. The question of optimal bandwidth selection for density estimation based on i.i.d. observations, with respect to sup-norm risk, was not completely solved until the rather recent developments gathered in [17].
The methodology detailed in Goldenshluger and Lepski [11] inspired the data-driven selection procedure of the bandwidth of the kernel estimator proposed by many authors, such as Strauch in [24] and Comte, Prieur and Samson in [6], and provides the starting point for the study of our adaptive procedure as well.

In this paper, we provide a nonparametric estimator of the invariant density $\mu$ with a fully data-driven bandwidth procedure. We propose to estimate the invariant density $\mu$ by means of a kernel estimator; we therefore introduce a kernel function $K: \mathbb{R} \to \mathbb{R}$. A natural estimator of $\mu$ at $x \in \mathbb{R}^d$ in the anisotropic context is given by
$$\hat{\mu}_{h,T}(x) = \frac{1}{T \prod_{l=1}^d h_l} \int_0^T \prod_{m=1}^d K\Big(\frac{x_m - X_u^m}{h_m}\Big)\,du,$$
where $h = (h_1, \dots, h_d)$ is a multi-index bandwidth, which will be chosen through the data-driven selection procedure. We first prove some bounds on the transition semigroup and on the transition density that will be useful to find sharp upper bounds on the variance of integral functionals of the diffusion $X$. Through them, we find the following convergence rates for the pointwise estimation of the invariant density of our diffusion with jumps:
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim \begin{cases} \frac{(\log T)^{(2 - \frac{1+\alpha}{2}) \vee 1}}{T} & \text{for } d = 1, \\ \frac{\log T}{T} & \text{for } d = 2, \\ T^{-\frac{2\bar{\beta}}{2\bar{\beta}+d-2}} & \text{for } d \ge 3, \end{cases}$$
where $\alpha \in (0,2)$ is the degree of jump activity of the Lévy process and $\bar{\beta}$ is the harmonic mean smoothness of the invariant density over the $d$ different dimensions.

We remark that the rate we find for $d \ge 3$ is analogous to the one obtained in the continuous case, with the harmonic mean smoothness $\bar{\beta}$ playing the role of the common smoothness over the $d$ dimensions. The case $d = 1$ evidences the main difference between what happens with and without jumps. Indeed, while in the continuous case the optimal convergence rate was $\frac{1}{T}$, the rate we find now lies between $\frac{\log T}{T}$ and $\frac{(\log T)^{3/2}}{T}$. It is worth noting that such a convergence rate is not necessarily the optimal one in the jump framework. As a matter of fact, in the continuous case different approaches, such as the diffusion local time, have been used to get the rate $\frac{1}{T}$; we do not exclude the possibility that also in the presence of jumps the implementation of other methods could lead to a convergence rate faster than the one presented above for the one-dimensional setting.

To complete the comparison with the continuous framework, we recall that in both [7] and [24] the convergence rate found in the case $d = 2$ was $\frac{(\log T)^2}{T}$, and so the convergence of the estimator appears to be faster in the presence of jumps than without them. The reason is that, to find the convergence rate, the transition density $(p_t)_{t \in \mathbb{R}^+}$ needs to be upper bounded. In [7] the authors assume $p_t(x,y) \le c\,(t^{-d/2} + t^{d/2})$, and in [24] Nash and Poincaré inequalities lead Strauch to a bound analogous to the one in [7]; Lemma 1 below provides us with a different bound, which guides us to a different rate.
However, in the absence of the term $t^{d/2}$ in the assumption above, which is the case for example for a bounded drift, the convergence rate in the continuous setting also turns out to be $\frac{\log T}{T}$, as in the jump-diffusion case.

It is moreover worth noting that, while in [7] and [24] the existence of the transition density and a bound on it had to be assumed, we derive them through Lemma 1: all the assumptions we need are directly on the model (1). In particular, we no longer need to assume that the drift is of the form $b = -\nabla V$ (where $V \in C^2$ is referred to as a potential), as it was in both [7] and [24].

After having provided the rates of convergence of the estimators, we finally propose, in the case $d \ge 3$, a fully data-driven selection procedure for the bandwidth of the kernel estimator, inspired by the methodology detailed in Goldenshluger and Lepski [11]. The method has the decisive advantage of being anisotropic: the bandwidths selected in each direction are in general different, which is coherent with the possibly different regularities with respect to each variable. Finally, we prove that for the selected bandwidth the following estimation holds:
$$\mathbb{E}\big[\|\hat{\mu}_{\tilde{h}} - \mu\|_A^2\big] \le c_1 \inf_{h \in \mathcal{H}_T}\big(B(h) + V(h)\big) + c_2\,e^{-c_3(\log T)^2}, \qquad (2)$$
where $\|\cdot\|_A$ denotes the $L^2$ norm on $A$, a compact subset of $\mathbb{R}^d$, and $\mathcal{H}_T$ the set of candidate bandwidths; $B(h)$ is a bias term and $V(h)$ an estimate of the variance bound. We remark that the estimator leads to an automatic trade-off between the bias and the variance: the second term on the right-hand side of (2) is indeed negligible compared to the first one. Moreover, as the rate-optimal choice $h(T)$ belongs to the set of candidate bandwidths $\mathcal{H}_T$, (2) becomes
$$\mathbb{E}\big[\|\hat{\mu}_{\tilde{h}} - \mu\|_A^2\big] \le c_1\, T^{-\frac{2\bar{\beta}}{2\bar{\beta}+d-2}} + c_2\,e^{-c_3(\log T)^2},$$
where $\bar{\beta}$ is the harmonic mean smoothness of the invariant density.

The paper is organised as follows. We give in Section 2 the assumptions on the process $X$. In Section 3 we define the anisotropic Hölder balls and construct our estimator. Section 4 is devoted to the statements of our main results, which are proven in the two following sections. In particular, we show in Section 5 how we obtain the convergence rates for the invariant density estimation, while in Section 6 we prove that the estimator obtained through our bandwidth selection procedure is adaptive. Some technical results are moreover presented in the Appendix.

2 Model assumptions

We consider the question of nonparametric estimation of the invariant density of a $d$-dimensional diffusion process $X$, assuming that a continuous record $X^T = \{X_t,\ 0 \le t \le T\}$ up to time $T$ is observed.
This diffusion is given as a strong solution of the following stochastic differential equation with jumps:
$$X_t = X_0 + \int_0^t b(X_s)\,ds + \int_0^t a(X_s)\,dW_s + \int_0^t \int_{\mathbb{R}^d \setminus \{0\}} \gamma(X_{s^-})\, z\, \tilde{\mu}(ds,dz), \quad t \in [0,T], \qquad (3)$$
where $b: \mathbb{R}^d \to \mathbb{R}^d$, $a: \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$ and $\gamma: \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$; $W = (W_t,\ t \ge 0)$ is a $d$-dimensional Brownian motion and $\mu$ is a Poisson random measure on $[0,T] \times \mathbb{R}^d$ associated to the Lévy process $L = (L_t)_{t \in [0,T]}$, with $L_t := \int_0^t \int_{\mathbb{R}^d} z\,\tilde{\mu}(ds,dz)$. The compensated measure is $\tilde{\mu} = \mu - \bar{\mu}$; we suppose that the compensator has the following form: $\bar{\mu}(dt,dz) := F(dz)\,dt$, where conditions on the Lévy measure $F$ will be given later. The initial condition $X_0$, $W$ and $L$ are independent.

In what follows, we suppose the following assumptions hold:

A1: The functions $b(x)$, $\gamma(x)$ and $a(x)$ are globally Lipschitz and, for some $c \ge 1$, $c^{-1} I_{d \times d} \le a(x) \le c\, I_{d \times d}$, where $I_{d \times d}$ denotes the $d \times d$ identity matrix. Denoting by $|\cdot|$ and $\langle \cdot, \cdot \rangle$ respectively the Euclidean norm and the scalar product in $\mathbb{R}^d$, we suppose moreover that there exists a constant $c > 0$ such that $|b(x)| \le c$ for all $x \in \mathbb{R}^d$.

Under Assumption A1, equation (3) admits a unique non-explosive càdlàg adapted solution possessing the strong Markov property, cf. [1] (Theorems 6.2.9 and 6.4.6).
A2 (Drift condition): There exist $\tilde{C} > 0$ and $\tilde{\rho} > 0$ such that $\langle x, b(x) \rangle \le -\tilde{C}\,|x|$ for all $x$ with $|x| \ge \tilde{\rho}$.

We furthermore need the following assumptions on the jumps:
A3 (Jumps):
1. The Lévy measure $F$ is absolutely continuous with respect to the Lebesgue measure and we denote $F(z) = \frac{F(dz)}{dz}$.
2. We suppose that there exists $c > 0$ such that, for all $z \in \mathbb{R}^d$, $F(z) \le \frac{c}{|z|^{d+\alpha}}$, with $\alpha \in (0,2)$, and that $\mathrm{supp}(F) = \mathbb{R}^d$.
3. The jump coefficient $\gamma$ is upper bounded, i.e. $\sup_{x \in \mathbb{R}^d} |\gamma(x)| := \gamma_{\max} < \infty$. We suppose moreover that, for all $x \in \mathbb{R}^d$, $\mathrm{Det}(\gamma(x)) \ne 0$.
4. If $\alpha = 1$, we require, for any $0 < r < R < \infty$, $\int_{r < |z| < R} z\, F(z)\,dz = 0$.
We say that $X$ is exponentially ergodic if it admits a unique invariant distribution $\pi$ and if, additionally, there exist positive constants $c$ and $\rho$ such that, for each $f$ centered under $\mu$,
$$\|P_t f\|_{L^2(\mu)} \le c\,e^{-\rho t}\,\|f\|_\infty.$$
We will see in Lemma 2 that both the exponential ergodicity and the exponential $\beta$-mixing can be derived from our assumptions.

In Lemmas 1 and 2 below we prove some bounds on the transition semigroup and on the transition density that will be useful to establish tight upper bounds on the variance
$$\mathrm{Var}\Big(\int_0^T f(X_s)\,ds\Big), \qquad f \in L^2(\mu),$$
of integral functionals of the diffusion $X$. Bounds of this type were proven before, in [7] (cf. their Proposition 1), by combining estimates based on the spectral gap inequality and on upper bounds on the transition densities of $X$. Through them, the authors prove, under isotropic Hölder smoothness constraints, convergence rates of invariant density estimators for pointwise estimation which are considerably faster than those known from standard multivariate density estimation. We replace the spectral gap inequality with a control from $L^2$ to $L^\infty$ given by the exponential ergodicity. Moreover, contrary to [7], we do not need to assume that such controls hold true, since we obtain them as a consequence of Lemmas 1 and 2 below, having required assumptions only directly on the model (3). In the next section we construct adaptive estimators for the density in the multidimensional diffusion case with jumps, which achieve fast rates of convergence over anisotropic Hölder balls.

3 Anisotropic Hölder classes and construction of the estimator

In several cases, the regularity of a function $g: \mathbb{R}^d \to \mathbb{R}$ depends on the direction in $\mathbb{R}^d$ considered. We thus work under the following anisotropic smoothness constraints.

Definition 2.
Let $\beta = (\beta_1, \dots, \beta_d)$, $\beta_i > 0$, and $L = (L_1, \dots, L_d)$, $L_i > 0$. A function $g: \mathbb{R}^d \to \mathbb{R}$ is said to belong to the anisotropic Hölder class $\mathcal{H}_d(\beta, L)$ if, for all $i \in \{1, \dots, d\}$,
$$\|D_i^k g\|_\infty \le L_i \quad \forall k = 0, 1, \dots, \lfloor \beta_i \rfloor, \qquad \big\|D_i^{\lfloor \beta_i \rfloor} g(\cdot + t e_i) - D_i^{\lfloor \beta_i \rfloor} g(\cdot)\big\|_\infty \le L_i\,|t|^{\beta_i - \lfloor \beta_i \rfloor} \quad \forall t \in \mathbb{R},$$
where $D_i^k g$ denotes the $k$-th order partial derivative of $g$ with respect to the $i$-th component, $\lfloor \beta_i \rfloor$ denotes the largest integer strictly smaller than $\beta_i$, and $e_1, \dots, e_d$ denote the canonical basis of $\mathbb{R}^d$.

From now on we deal with the estimation of the density $\mu$ belonging to the anisotropic Hölder class $\mathcal{H}_d(\beta, L)$. Given the observation $X^T$ of a diffusion $X$, solution of (3), we propose to estimate the invariant density $\mu$ by means of a kernel estimator. To estimate $\mu \in \mathcal{H}_d(\beta, L)$ we therefore introduce a kernel function $K: \mathbb{R} \to \mathbb{R}$ satisfying
$$\int_{\mathbb{R}} K(x)\,dx = 1, \quad \|K\|_\infty < \infty, \quad \mathrm{supp}(K) \subset [-1,1], \quad \int_{\mathbb{R}} K(x)\,x^l\,dx = 0 \ \text{ for all } l \in \{1, \dots, M\},$$
with $M \ge \max_i \beta_i$. Denoting by $X_t^j$, $j \in \{1, \dots, d\}$, the $j$-th component of $X_t$, $t \ge 0$, a natural estimator of $\mu$ at $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$ in the anisotropic context is given by
$$\hat{\mu}_{h,T}(x) = \frac{1}{T \prod_{l=1}^d h_l} \int_0^T \prod_{m=1}^d K\Big(\frac{x_m - X_u^m}{h_m}\Big)\,du. \qquad (5)$$
As we will see in Section 4.2, a main question concerns the choice of the multi-index bandwidth $h = (h_1, \dots, h_d)^T$.

4 Main results

4.1 Convergence rates

We want to investigate the convergence rates for invariant density estimation. In order to determine the asymptotic behaviour of our estimator as $T \to \infty$, we study the variance of general additive functionals of $X$ in dimension $d$. To do so, we need some properties such as the exponential ergodicity of the process and a bound on the transition density. These properties will be derived from our assumptions through the following lemmas, which we prove in the appendix. The following bounds on the transition density and on the transition semigroup hold true.

Lemma 1.
Suppose that A1-A3 hold. Then, for $T \ge 0$, there exists a transition density $p_t(x,y)$ and, for any $t \in [0,T]$, there exist $c > 0$ and $\lambda > 0$ such that, for any pair of points $x, y \in \mathbb{R}^d$, we have
$$p_t(x,y) \le c\Big(t^{-\frac{d}{2}}\,e^{-\lambda \frac{|y-x|^2}{t}} + \frac{t}{(\sqrt{t} + |y-x|)^{d+\alpha}}\Big).$$

Lemma 2.
Suppose that A1-A3 hold. Then the process $X$ is exponentially ergodic and exponentially $\beta$-mixing.

On the basis of the two previous lemmas we can prove the following bound on the variance, which is the heart of the study of the convergence rate.
Proposition 1.
Suppose that A1-A3 hold and let $f: \mathbb{R}^d \to \mathbb{R}$ be a bounded, measurable function with support $\mathcal{S}$ satisfying $|\mathcal{S}| < 1$. Then there exists a constant $C$, independent of $f$, such that
- $\mathrm{Var}\big(\int_0^T f(X_t)\,dt\big) \le C\,T\,\|f\|_\infty^2\,|\mathcal{S}|^2\big(1 + (\log\frac{1}{|\mathcal{S}|})^{2-\frac{1+\alpha}{2}} + \log\frac{1}{|\mathcal{S}|}\big)$ for $d = 1$,
- $\mathrm{Var}\big(\int_0^T f(X_t)\,dt\big) \le C\,T\,\|f\|_\infty^2\,|\mathcal{S}|^2\big(1 + \log\frac{1}{|\mathcal{S}|}\big)$ for $d = 2$,
- $\mathrm{Var}\big(\int_0^T f(X_t)\,dt\big) \le C\,T\,\|f\|_\infty^2\,|\mathcal{S}|^{1+\frac{2}{d}}$ for $d \ge 3$.

From the bias-variance decomposition in the anisotropic case (see Proposition 1 in [5]) we get the bound
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim \sum_{l=1}^d h_l^{2\beta_l} + T^{-2}\,\mathrm{Var}\Big(\frac{1}{\prod_{l=1}^d h_l}\int_0^T \prod_{m=1}^d K\Big(\frac{x_m - X_t^m}{h_m}\Big)\,dt\Big).$$
We want to bound the variance above by applying Proposition 1 to the function $f(y) := \frac{1}{\prod_{l=1}^d h_l}\prod_{m=1}^d K\big(\frac{x_m - y_m}{h_m}\big)$. As will be explained in the proof of Proposition 2 in Section 5, for $d \ge 3$ this gives
$$\mathrm{Var}\Big(\frac{1}{\prod_{l=1}^d h_l}\int_0^T \prod_{m=1}^d K\Big(\frac{x_m - X_t^m}{h_m}\Big)\,dt\Big) \le c\,T\,\Big(\prod_{l=1}^d h_l\Big)^{\frac{2}{d}-1},$$
which leads us to the following convergence rate.

Proposition 2.
Suppose that A1-A3 hold. If $\mu \in \mathcal{H}_d(\beta, L)$, then the estimator given in (5) satisfies, for $d \ge 3$, the following risk estimate:
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim \sum_{l=1}^d h_l^{2\beta_l} + T^{-1}\Big(\prod_{l=1}^d h_l\Big)^{\frac{2}{d}-1}. \qquad (6)$$
Defining $\bar{\beta}$ through $\frac{1}{\bar{\beta}} := \frac{1}{d}\sum_{l=1}^d \frac{1}{\beta_l}$, the rate-optimal choice $h_l = h_l(T) = (\frac{1}{T})^{\frac{\bar{\beta}}{\beta_l(2\bar{\beta}+d-2)}}$ yields the convergence rate
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim T^{-\frac{2\bar{\beta}}{2\bar{\beta}+d-2}}.$$
We underline that, in the continuous case, the convergence rate found by Strauch in [24] for the estimation of an invariant density $\mu$ belonging to the anisotropic Hölder class $\mathcal{H}_d(\beta+1, L)$ is $T^{-\frac{2(\bar{\beta}+1)}{2(\bar{\beta}+1)+d-2}}$ for $d \ge 3$. We here estimate $\mu$ over the anisotropic Hölder class $\mathcal{H}_d(\beta, L)$ and therefore extend [24] to the jump-diffusion case: the convergence rate we obtain is the same as in the case without jumps, which is also analogous to the rate first obtained by Reiss and Dalalyan in [7] for the estimation of the invariant density $\mu$ over the isotropic Hölder class $\mathcal{H}_d(\beta+1, L)$, up to replacing the mean smoothness $\bar{\beta}+1$ with $\beta+1$, the common smoothness over the $d$ different dimensions.

For $d = 1$ and $d = 2$ the bound on the variance changes. Therefore, the rate-optimal choice of $h$ is different as well, as explained in the following two propositions.

Proposition 3. Suppose that A1-A3 hold. If $\mu \in \mathcal{H}_d(\beta, L)$, then the estimator given in (5) satisfies, for $d = 1$, the following risk estimate:
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim h^{2\beta} + \frac{1}{T}\Big(1 + \big(\log\tfrac{1}{h}\big)^{2-\frac{1+\alpha}{2}} + \log\tfrac{1}{h}\Big). \qquad (7)$$
The rate-optimal choice of $h$ yields the convergence rate
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim \frac{(\log T)^{(2-\frac{1+\alpha}{2}) \vee 1}}{T}.$$
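As a concrete illustration of the kernel estimator (5) underlying these propositions, the following sketch evaluates it from a time-discretized record via a Riemann sum; the paper works with a continuous record, and the Epanechnikov kernel and all numerical values below are illustrative choices, not the paper's.

```python
import numpy as np

def epanechnikov(u):
    # compactly supported kernel on [-1, 1] with integral 1 and zero first moment
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def invariant_density_estimator(X, dt, x, h):
    """Riemann-sum approximation of (5):
    mu_hat(x) = 1/(T * prod_l h_l) * int_0^T prod_m K((x_m - X_u^m)/h_m) du,
    from a record X of shape (n_steps, d) sampled every dt."""
    T = X.shape[0] * dt
    u = (x[None, :] - X) / h[None, :]           # (n, d) scaled differences
    prod_k = np.prod(epanechnikov(u), axis=1)   # product kernel over coordinates
    return prod_k.sum() * dt / (T * np.prod(h))
```

As a sanity check, feeding a stationary i.i.d. Gaussian record in $d = 2$ returns an estimate close to the true density value $\frac{1}{2\pi}$ at the origin.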
It is worth remarking that Proposition 3 exhibits the main difference between the cases with and without jumps. Indeed, while in the continuous case the convergence rate was $\frac{1}{T}$, it now depends on the degree of jump activity $\alpha$ and lies between $\frac{\log T}{T}$ and $\frac{(\log T)^{3/2}}{T}$. We stress that the convergence rate found above for the estimation of the invariant density of a stochastic differential equation with jumps in the one-dimensional setting is not necessarily the optimal one. In the continuous case, other methods have been explored for such an estimation when $d = 1$, such as the use of the diffusion local time, to get the optimal rate $\frac{1}{T}$. We do not rule out the possibility of obtaining a sharper bound through the exploitation of other approaches also in the jump case, finding therefore a convergence rate faster than the one presented in the previous proposition.

Proposition 4.
Suppose that A1-A3 hold. If $\mu \in \mathcal{H}_d(\beta, L)$, then the estimator given in (5) satisfies, for $d = 2$, the following risk estimate:
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim h_1^{2\beta_1} + h_2^{2\beta_2} + \frac{1}{T}\Big(1 + \log\Big(\frac{1}{h_1 h_2}\Big)\Big). \qquad (8)$$
The rate-optimal choice of $h$ yields the convergence rate
$$\mathbb{E}\big[|\hat{\mu}_{h,T}(x) - \mu(x)|^2\big] \lesssim \frac{\log T}{T}.$$
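Propositions 2-4 fix the rate-optimal calibration; for $d \ge 3$ it is fully explicit in terms of the harmonic mean smoothness, and can be sketched numerically. The smoothness vector used below is hypothetical, chosen only to exercise the formulas.

```python
import numpy as np

def rate_optimal_bandwidths(beta, T):
    """Rate-optimal choice of Proposition 2 for d >= 3:
    h_l(T) = T^{-a_l} with a_l = beta_bar / (beta_l * (2*beta_bar + d - 2)),
    where beta_bar is the harmonic mean of the smoothness indices.
    Returns the bandwidths and the squared-error rate T^{-2 beta_bar/(2 beta_bar + d - 2)}."""
    beta = np.asarray(beta, dtype=float)
    d = beta.size
    beta_bar = d / np.sum(1.0 / beta)          # harmonic mean smoothness
    a = beta_bar / (beta * (2.0 * beta_bar + d - 2.0))
    h = T ** (-a)
    rate = T ** (-2.0 * beta_bar / (2.0 * beta_bar + d - 2.0))
    return h, rate
```

By construction, this choice balances the bias term $h_l^{2\beta_l}$ against the variance term $T^{-1}(\prod_l h_l)^{2/d-1}$ of (6).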
Comparing our result with the convergence rate obtained in the continuous case over the isotropic Hölder class $\mathcal{H}_d(\beta+1, L)$ in [7] and the anisotropic Hölder class $\mathcal{H}_d(\beta+1, L)$ in [24], which is $\frac{(\log T)^2}{T}$ in both works, one can observe that the convergence rate appears to be faster in the presence of jumps. The reason is that in [7] the transition density is assumed to be upper bounded by $C(t^{-d/2} + t^{d/2})$, which is a bound different from the one we get from Lemma 1. Had the term $t^{d/2}$ been absent from their assumption, e.g. for a bounded drift, the convergence rate in the continuous case could have been improved to $\frac{\log T}{T}$, which is also what we get in the jump-diffusion case. In [24], Nash and Poincaré inequalities lead the author to an upper bound on the transition density which is analogous to the one found in [7] (see Remark 2.4 of [24]).

From the pointwise estimation of the invariant density gathered in the three previous propositions we move to estimation in $L^2(A)$, where $A$ is a compact subset of $\mathbb{R}^d$. In the sequel, for $A \subset \mathbb{R}^d$ compact and $g \in L^2(A)$, $\|g\|_A^2 := \int_A |g(x)|^2\,dx$ denotes the squared $L^2$ norm with respect to the Lebesgue measure on $A$. As a consequence of Propositions 2, 3 and 4, and of the fact that the constants appearing in the proofs do not depend on $x$, the following corollary holds true.

Corollary 1. If $\mu \in \mathcal{H}_d(\beta, L)$, then for the rate-optimal choice of $h = h(T)$ provided in Propositions 2, 3 and 4 we have the following risk estimates:
$$\mathbb{E}\big[\|\hat{\mu}_{h,T} - \mu\|_A^2\big] \lesssim V_d(T) := \begin{cases} \frac{(\log T)^{(2-\frac{1+\alpha}{2}) \vee 1}}{T} & \text{for } d = 1, \\ \frac{\log T}{T} & \text{for } d = 2, \\ T^{-\frac{2\bar{\beta}}{2\bar{\beta}+d-2}} & \text{for } d \ge 3. \end{cases} \qquad (9)$$
The proof of Corollary 1 will be given in Section 5.

4.2 Adaptive procedure

The question of density estimation belongs to the canonical framework of nonparametric statistics. As detailed in Propositions 3 and 4, both the bandwidth and the upper bound on the rate of convergence appearing on the right-hand side of (7) and (8) do not depend on the unknown smoothness of the invariant density $\mu$, and so there is no gain in implementing a data-driven bandwidth selection procedure for density estimation in the framework of continuous observations of a one- or two-dimensional diffusion process with jumps. Hence, throughout the sequel we restrict to the case $d \ge 3$. For $d \ge$
3, instead, the proposed bandwidth choice depends on the regularity of the density $\mu$, which is unknown. This is why we study a data-driven bandwidth selection device. We emphasize that the $d$ selected bandwidths are different, and this anisotropy property is important in our setting: the regularity can vary from one direction to another. The bandwidth selection procedure has to be able to provide such different choices for $h_1, h_2, \dots, h_d$. To select $h$ adequately, we propose the following method, inspired by Goldenshluger and Lepski [11]. We define the set of candidate bandwidths $\mathcal{H}_T$ as
$$\mathcal{H}_T \subset \Big\{h = (h_1, \dots, h_d)^T \in (0,1)^d : \Big(\frac{\log T}{T}\Big)^{d} \le \prod_{l=1}^d h_l \le \Big(\frac{1}{\log T}\Big)^{\frac{d}{d-2}}\Big\}. \qquad (10)$$
The conditions on $\prod_{l=1}^d h_l$ just given are needed in order to use the Talagrand inequality, on the basis of which we show our adaptive result. We suppose moreover that the cardinality of $\mathcal{H}_T$ grows at most polynomially in $T$, that is, there exists $c > 0$ such that $|\mathcal{H}_T| \le c\,T^c$. An example of $\mathcal{H}_T$ is the following set of candidate bandwidths:
$$\mathcal{H}_T := \Big\{h = (h_1, \dots, h_d)^T \in (0,1)^d : h_i = \frac{1}{k_i} \text{ with } k_i \in \mathbb{N}, \ (\log T)^{\frac{d}{d-2}} \le \prod_{l=1}^d k_l \le \Big(\frac{T}{\log T}\Big)^{d}\Big\}. \qquad (11)$$
In correspondence with the variation of $h \in \mathcal{H}_T$, we have the following family of estimators, defined as in (5):
$$\mathcal{F}(\mathcal{H}_T) := \Big\{\hat{\mu}_h(x) := \frac{1}{T}\int_0^T K_h(X_u - x)\,du : x \in \mathbb{R}^d,\ h \in \mathcal{H}_T\Big\},$$
where, for $y \in \mathbb{R}^d$,
$$K_h(y) := \frac{1}{\prod_{l=1}^d h_l}\prod_{m=1}^d K\Big(\frac{y_m}{h_m}\Big). \qquad (12)$$
We aim at selecting an estimator from the family $\mathcal{F}(\mathcal{H}_T)$ in a completely data-driven way, based only on the observation of the continuous trajectory of the process $X$ solution of (3). We now turn to describing the selection procedure from $\mathcal{F}(\mathcal{H}_T)$, which is based on auxiliary estimators relying on the convolution operator.
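A candidate set of the form (11) can be enumerated as follows. The product bounds `lower` and `upper` are left as parameters standing in for the $(\log T / T)$-type constraints of (10), whose exact exponents are dictated by the Talagrand-inequality argument; the numerical values in the usage below are purely illustrative.

```python
import itertools
import numpy as np

def candidate_bandwidths(d, k_max, lower, upper):
    """Candidate set in the spirit of (11): h_i = 1/k_i with integer k_i,
    keeping only bandwidths whose product lies in [lower, upper]
    (a stand-in for the constraints of (10))."""
    H = []
    for ks in itertools.product(range(1, k_max + 1), repeat=d):
        h = 1.0 / np.array(ks, dtype=float)
        if lower <= h.prod() <= upper:
            H.append(h)
    return H
```

For instance, `candidate_bandwidths(3, 4, 0.05, 0.2)` keeps anisotropic triples such as $(1, \tfrac12, \tfrac14)$ while discarding $(1,1,1)$, whose product violates the upper constraint.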
To the best of our knowledge, it was introduced in [18] as a device to circumvent the lack of ordering among a set of estimators in the anisotropic case, where an increase of the variance of an estimator does not imply a decrease of its bias. For any bandwidths $h = (h_1, \dots, h_d)^T$, $\eta = (\eta_1, \dots, \eta_d)^T \in \mathcal{H}_T$ and $x \in \mathbb{R}^d$, we define
$$K_h * K_\eta(x) := \prod_{j=1}^d (K_{h_j} * K_{\eta_j})(x_j) = \prod_{j=1}^d \int_{\mathbb{R}} K_{h_j}(u - x_j)\,K_{\eta_j}(u)\,du.$$
We moreover define the kernel estimators
$$\hat{\mu}_{h,\eta}(x) := \frac{1}{T}\int_0^T (K_h * K_\eta)(X_u - x)\,du, \qquad x \in \mathbb{R}^d.$$
We remark that, by the way the kernel estimators are defined and since the convolution is commutative, $\hat{\mu}_{h,\eta} = \hat{\mu}_{\eta,h}$.

The proposed selection procedure relies on comparing the differences $\hat{\mu}_{h,\eta} - \hat{\mu}_\eta$. We define
$$A(h) := \sup_{\eta \in \mathcal{H}_T}\big(\|\hat{\mu}_{h,\eta} - \hat{\mu}_\eta\|_A^2 - V(\eta)\big)_+, \qquad (13)$$
with
$$V(h) := \frac{k}{T}\Big(\prod_{l=1}^d h_l\Big)^{\frac{2}{d}-1},$$
where $k$ is a large numerical constant. In particular, it is sufficient to choose it larger than twice the constants appearing in Lemma 4. Even if $k$ is not explicit, it can be calibrated by simulations, as done for example in Section 5 of [6], through the implementation of a method inspired by Goldenshluger and Lepski [11] and revisited more recently by Lacour, Massart and Rivoirard in [16]. Heuristically, $A(h)$ is an estimate of the squared bias and $V(h)$ of the variance bound. It is worth noticing that the penalty term $V(h)$ used here comes from Proposition 1 applied to the kernel function. Thus, the selection is done by setting
$$\tilde{h} := \arg\min_{h \in \mathcal{H}_T}\big(A(h) + V(h)\big). \qquad (14)$$
We introduce the following notation: $\mu_h := K_h * \mu$, which is the function estimated without bias by $\hat{\mu}_h$, i.e. $\mathbb{E}[\hat{\mu}_h(x)] = \mu_h(x)$. Moreover, we define $\mu_{h,\eta} := K_h * K_\eta * \mu$ and a bias term
$$B(h) := \|\mu_h - \mu\|_{\tilde{A}}^2,$$
where $\|\cdot\|_{\tilde{A}}$ denotes the $L^2$ norm on $\tilde{A}$, a compact set in $\mathbb{R}^d$ given by $\tilde{A} := \{\zeta \in \mathbb{R}^d : d(\zeta, A) \le \sqrt{d}\}$. The following result holds.
Theorem 1.
Suppose that assumptions A1-A3 hold. Then we have
$$\mathbb{E}\big[\|\hat{\mu}_{\tilde{h}} - \mu\|_A^2\big] \le c_1 \inf_{h \in \mathcal{H}_T}\big(B(h) + V(h)\big) + c_2\,e^{-c_3(\log T)^2},$$
for positive constants $c_1$, $c_2$, $c_3$.

The bound stated in Theorem 1 shows that the estimator leads to an automatic trade-off between the bias $\|\mu_h - \mu\|_{\tilde{A}}^2$ and the variance $V(h)$, up to the multiplicative constant $c_1$. The last term is indeed negligible. The proof of Theorem 1 is postponed to Section 6. We recall that Proposition 2 provides the rate-optimal choice $h(T)$ for $d \ge 3$, which is $h_l(T) = (\frac{1}{T})^{\frac{\bar{\beta}}{\beta_l(2\bar{\beta}+d-2)}}$. Using such a bandwidth, we will prove in Section 6 the following theorem.

Theorem 2.
Suppose that assumptions A1-A3 hold and let $\mathcal{H}_T$ be defined by (11). Then we have
$$\mathbb{E}\big[\|\hat{\mu}_{\tilde{h}} - \mu\|_A^2\big] \le c_1\Big(\frac{1}{T}\Big)^{\frac{2\bar{\beta}}{2\bar{\beta}+d-2}} + c_2\,e^{-c_3(\log T)^2},$$
for positive constants $c_1$, $c_2$, $c_3$.

Underlining once again that the second term on the right-hand side above is negligible compared to the first, the risk bound we get using the bandwidth provided by our selection procedure converges to zero fast. In particular, its convergence rate coincides with the optimal one provided by both [7] and [24] in the case without jumps.
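The selection rule (13)-(14) can be sketched numerically under illustrative simplifications: a time-discretized record, a small grid over the compact set $A$ with a Riemann approximation of the $L^2(A)$ norm, an Epanechnikov kernel, a quadrature approximation of the one-dimensional convolutions $K_{h_j} * K_{\eta_j}$, and a hypothetical value for the constant $k$ (called `kappa` below), which in practice would be calibrated by simulation.

```python
import numpy as np

def epan(u):
    # compactly supported kernel with integral 1 (illustrative choice)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def K_conv_1d(x, h, eta, n_quad=101):
    # quadrature for (K_h * K_eta)(x) = int K_h(x - u) K_eta(u) du
    u = np.linspace(-eta, eta, n_quad)
    du = u[1] - u[0]
    w = epan(u / eta) / eta                       # K_eta(u)
    f = epan((x[..., None] - u) / h) / h          # K_h(x - u)
    return (f * w).sum(axis=-1) * du

def mu_hat(X, grid, h):
    # kernel estimator (5) on grid points, from a discretized record X (n, d)
    vals = np.ones((grid.shape[0], X.shape[0]))
    for j in range(X.shape[1]):
        vals *= epan((grid[:, None, j] - X[None, :, j]) / h[j]) / h[j]
    return vals.mean(axis=1)

def mu_hat_conv(X, grid, h, eta):
    # auxiliary estimator mu_hat_{h,eta} with the convolved kernel K_h * K_eta
    vals = np.ones((grid.shape[0], X.shape[0]))
    for j in range(X.shape[1]):
        vals *= K_conv_1d(grid[:, None, j] - X[None, :, j], h[j], eta[j])
    return vals.mean(axis=1)

def gl_select(X, grid, cell_vol, H, kappa, T):
    # Goldenshluger-Lepski choice (14): minimize A(h) + V(h) over H
    d = X.shape[1]
    est = [mu_hat(X, grid, h) for h in H]                      # mu_hat_eta
    V = [kappa / T * np.prod(h) ** (2.0 / d - 1.0) for h in H]  # penalty V(h)
    crit = []
    for i, h in enumerate(H):
        A = 0.0
        for eta, mu_eta, V_eta in zip(H, est, V):
            diff = mu_hat_conv(X, grid, h, eta) - mu_eta
            # Riemann approximation of ||mu_hat_{h,eta} - mu_hat_eta||_A^2
            A = max(A, max(float((diff**2).sum()) * cell_vol - V_eta, 0.0))
        crit.append(A + V[i])
    return H[int(np.argmin(crit))]
```

The sketch preserves the structure of the procedure: $A(h)$ estimates the squared bias by comparing the doubly-smoothed estimators to the singly-smoothed ones, and the selected bandwidth trades it off against the variance proxy $V(h)$.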
5 Convergence rates

In this section we prove Propositions 2, 3 and 4, which give the convergence rates for the estimation of the invariant density $\mu \in \mathcal{H}_d(\beta, L)$ in the three different situations $d = 1$, $d = 2$ and $d \ge 3$, uniformly over the points $x$ considered. We start by showing the bound on the variance gathered in Proposition 1.

5.1 Proof of Proposition 1

Proof.
We consider first of all the case $d \ge 3$. We define the function $f_c := f - \mu(f)$. From the symmetry and the stationarity we have
$$\mathrm{Var}\Big(\int_0^T f(X_s)\,ds\Big) = 2\int_0^T\!\!\int_0^s \mathbb{E}[f_c(X_s) f_c(X_t)]\,dt\,ds = 2\int_0^T\!\!\int_0^s \mathbb{E}[f_c(X_0) f_c(X_{s-t})]\,dt\,ds.$$
Applying the change of variable $u := s - t$, using Fubini and computing the integral, the quantity above equals $2\int_0^T (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du$. Let now $0 < \delta < D \le T$, where the specific choice of $\delta$ and $D$ will be given later. The idea is to treat the integral differently according to the interval in which $u$ lies. For this reason we write $\int_0^T (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du$ as
$$\int_0^\delta (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du + \int_\delta^D (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du + \int_D^T (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du. \qquad (15)$$
We now observe that
$$\int_0^\delta (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du = \int_0^\delta (T-u)\big(\mathbb{E}[f(X_0) f(X_u)] - (\mu(f))^2\big)\,du \le c\,T \int_0^\delta |\langle P_u f, f\rangle_\mu|\,du, \qquad (16)$$
where we have denoted by $\langle \cdot,\cdot\rangle_\mu$ the scalar product associated with the norm with respect to the measure $\mu$, for which $\langle g,h\rangle_\mu := \int_{\mathbb{R}^d} g(x) h(x) \mu(x)\,dx$ for $g, h \in L^2(\mu)$. In the last inequality we have moreover used that $(\mu(f))^2$ is nonnegative. Now we use the Cauchy-Schwarz inequality and the fact that $P_u$ is a contraction in $L^2(\mu)$ to get
$$\int_0^\delta |\langle P_u f, f\rangle_\mu|\,du \le \int_0^\delta \|P_u f\|_\mu\, \|f\|_\mu\,du \le \int_0^\delta \|f\|_\mu^2\,du \le \|f\|_\infty^2\,\mu(\mathcal{S})\,\delta, \qquad (17)$$
where in the last inequality we have used the estimate $\|f\|_\mu^2 = \int_{\mathcal{S}} |f(x)|^2 \mu(x)\,dx \le \|f\|_\infty^2\,\mu(\mathcal{S})$.

Concerning the second integral in (15), we remark that (16) still holds on $[\delta, D]$. We then estimate it through the definition of the transition semigroup:
$$\int_\delta^D |\langle P_u f, f\rangle_\mu|\,du \le \int_\delta^D \int_{\mathbb{R}^d} |f(x)| \int_{\mathbb{R}^d} |f(y)|\,p_u(x,y)\,dy\,\mu(x)\,dx\,du. \qquad (18)$$
We want to use the bound on the transition density given in Lemma 1, which holds for $t \in [0,T]$ but is not uniform for large $t$. Nevertheless, for $t \ge 1$ we have
$$p_t(x,y) = \int_{\mathbb{R}^d} p_{t-1}(x,\zeta)\,p_1(\zeta,y)\,d\zeta \le c\int_{\mathbb{R}^d} p_{t-1}(x,\zeta)\Big(e^{-\lambda|y-\zeta|^2} + \frac{1}{(1+|y-\zeta|)^{d+\alpha}}\Big)\,d\zeta \le c\int_{\mathbb{R}^d} p_{t-1}(x,\zeta)\,d\zeta \le c,$$
where the constant $c$ changes from line to line. The right-hand side of (18) is therefore upper bounded by
$$\int_\delta^D \int_{\mathbb{R}^d} |f(x)|\,c\int_{\mathbb{R}^d} |f(y)|\Big(u^{-\frac{d}{2}} e^{-\lambda\frac{|y-x|^2}{u}} + \frac{u}{(\sqrt{u}+|y-x|)^{d+\alpha}} + 1\Big)\,dy\,\mu(x)\,dx\,du$$
$$\le \int_\delta^D \int_{\mathcal{S}} |f(x)|\,c\int_{\mathcal{S}} |f(y)|\big(u^{-\frac{d}{2}} + u^{1-\frac{d+\alpha}{2}} + 1\big)\,dy\,\mu(x)\,dx\,du \le c\,\|f\|_\infty^2\,\mu(\mathcal{S})\,|\mathcal{S}|\int_\delta^D \big(u^{-\frac{d}{2}} + u^{1-\frac{d+\alpha}{2}} + 1\big)\,du,$$
where in both integrals we have bounded the absolute value of $f$ by its sup norm. We now compute the integral with respect to the variable $u$. We observe that, since $d \ge 3$, the exponent $1 - \frac{d}{2}$ obtained after integrating the first term is negative. The exponent of the second term after integration is $2 - \frac{d+\alpha}{2}$; it is positive if $d < 4 - \alpha$, which is possible only if $\alpha \in (0,1)$ and $d = 3$, and negative otherwise. Therefore, we have to consider the two different possibilities, according to the sign of this exponent. It follows
$$\int_\delta^D |\langle P_u f, f\rangle_\mu|\,du \le c\,\|f\|_\infty^2\,\mu(\mathcal{S})\,|\mathcal{S}|\,\big(\delta^{1-\frac{d}{2}} + \delta^{2-\frac{d+\alpha}{2}}\,1_{\{d \ge 4-\alpha\}} + D^{2-\frac{d+\alpha}{2}}\,1_{\{d < 4-\alpha\}} + D\big). \qquad (19)$$
We are now left to estimate the third integral of (15). From Lemma 2, it is upper bounded by
$$c\,T\int_D^T |\langle P_u f_c, f_c\rangle_\mu|\,du \le c\,T\int_D^T \|P_u f_c\|_{L^2(\mu)}\,\|f_c\|_{L^2(\mu)}\,du \le c\,T\int_D^T e^{-\rho u}\,\|f_c\|_\infty^2\,du. \qquad (20)$$
We recall that $f_c(x) = f(x) - \mu(f)$, where $\mu(f) = \int_{\mathcal{S}} f(x)\mu(x)\,dx$. Therefore
$$|f_c(x)| \le |f(x)| + |\mu(f)| \le |f(x)| + \|f\|_\infty\,\mu(\mathcal{S}) \qquad (21)$$
and so $\|f_c\|_\infty \le \|f\|_\infty + \|f\|_\infty\,\mu(\mathcal{S}) \le c\,\|f\|_\infty$, where in the last inequality we have used the estimate
$$\mu(\mathcal{S}) \le \|\mu\|_\infty\,|\mathcal{S}| \le c\,|\mathcal{S}| \qquad (22)$$
and the fact that $|\mathcal{S}| < 1$. Hence
$$c\,T\int_D^T |\langle P_u f_c, f_c\rangle_\mu|\,du \le c\,T\,\|f\|_\infty^2\int_D^T e^{-\rho u}\,du \le c\,T\,\|f\|_\infty^2\,e^{-\rho D}. \qquad (23)$$
Replacing (17), (19) and (23) in (15), we obtain
$$\Big|\int_0^T (T-u)\,\mathbb{E}[f_c(X_0) f_c(X_u)]\,du\Big| \le c\,T\,\|f\|_\infty^2\,|\mathcal{S}|\Big(\delta + |\mathcal{S}|\,\delta^{1-\frac{d}{2}} + |\mathcal{S}|\,\delta^{2-\frac{d+\alpha}{2}}\,1_{\{d \ge 4-\alpha\}} + |\mathcal{S}|\,D^{2-\frac{d+\alpha}{2}}\,1_{\{d < 4-\alpha\}} + |\mathcal{S}|\,D\Big) + c\,T\,\|f\|_\infty^2\,e^{-\rho D}. \qquad (24)$$
We now want to choose $\delta$ and $D$ so that the estimate above is as sharp as possible. Recalling that the exponents of $\delta$ in the second and third terms on the right-hand side of (24) are negative, a small choice of $\delta$ makes the first term small while the second and third become large, and conversely for a large $\delta$. In the same way, the behaviour of the last two terms on the right-hand side of (24) depends on the choice of $D$. Aiming at balancing them, we define $\delta := |\mathcal{S}|^{\frac{2}{d}}$ and $D := \big[\max\big(-\frac{c}{\rho}\log(|\mathcal{S}|),\,1\big)\big] \wedge T$ for a constant $c$ large enough.
Replacingthem in (24) if T > ( − ρ log( |S| )) we get | Z T ( T − u ) E [ f c ( X ) f c ( X u )] du | ≤ c T k f k ∞ ( |S| d + |S| − αd + |S| ( log |S| ) − d + α + |S| log |S| + |S| ) , which give us the result we wanted remarking that both 2 and 1 + − αd are always more than 1 + d for d ≥ α ∈ (0 , T ≤ ( − ρ log( |S| )), by the definition of D we obtain D = T . We still have |S| D − d + α { d< − α } ≤ c |S| ( log |S| ) − d + α and, moreover, the last integral which we dealt with in(23) is in this case between T and T and so its contribution is null. Hence, the result still holds true.We now consider the case d = 1. We can act exactly like we did in the case d ≥
3, splitting the integral into three parts. Estimates (17) and (23) still hold while, using again Lemma 1 on the interval $[\delta, D]$, we have
$$\Big| \int_\delta^D (T-u) \, \mathbb{E}[f_c(X_0) f_c(X_u)] \, du \Big| \le c T \|f\|_\infty^2 \mu(\mathcal{S}) |\mathcal{S}| \int_\delta^D \Big( u^{-\frac{1}{2}} + u^{1-\frac{1+\alpha}{2}} + 1 \Big) du \le c T \|f\|_\infty^2 \mu(\mathcal{S}) |\mathcal{S}| \Big( D^{\frac{1}{2}} + D^{2-\frac{1+\alpha}{2}} + D \Big),$$
where we have used that now both the exponents we get after integrating are positive. In total, in the case $d = 1$, using also (22), we therefore have
$$\mathrm{Var}\Big( \int_0^T f(X_s) \, ds \Big) \le c T \|f\|_\infty^2 \Big( |\mathcal{S}| \delta + |\mathcal{S}|^2 \big( D^{\frac{1}{2}} + D^{2-\frac{1+\alpha}{2}} + D \big) + e^{-\frac{\rho D}{2}} \Big).$$
As we have already done, we want to make the estimate here above as sharp as possible. This time there is no constraint on the smallness of $\delta$, so we can directly choose $\delta := 0$. Regarding $D$ we observe that, if $\alpha > 1$, then $2 - \frac{1+\alpha}{2} < 1$; otherwise $2 - \frac{1+\alpha}{2} \ge 1$. In each case we have the same trade-off we had in the case $d \ge 3$, so we choose again $D := \max(-\frac{4}{\rho} \log(|\mathcal{S}|), 1) \wedge T$; it follows
$$\mathrm{Var}\Big( \int_0^T f(X_s) \, ds \Big) \le c T \|f\|_\infty^2 |\mathcal{S}|^2 \Big( 1 + \big(\log \tfrac{1}{|\mathcal{S}|}\big)^{2-\frac{1+\alpha}{2}} + \log \tfrac{1}{|\mathcal{S}|} \Big).$$
In the case $d = 2$, estimates (17) and (23) keep holding true. The bound on the transition density gathered in Lemma 1 for $d = 2$ yields
$$\Big| \int_\delta^D (T-u) \, \mathbb{E}[f_c(X_0) f_c(X_u)] \, du \Big| \le c T \|f\|_\infty^2 \mu(\mathcal{S}) |\mathcal{S}| \int_\delta^D \Big( u^{-1} + u^{1-\frac{2+\alpha}{2}} + 1 \Big) du \le c T \|f\|_\infty^2 \mu(\mathcal{S}) |\mathcal{S}| \Big( \log\big(\tfrac{D}{\delta}\big) + D^{2-\frac{2+\alpha}{2}} + D \Big),$$
having remarked that $2 - \frac{2+\alpha}{2} = 1 - \frac{\alpha}{2} > 0$ for $\alpha \in (0,2)$. Hence
$$\mathrm{Var}\Big( \int_0^T f(X_s) \, ds \Big) \le c T \|f\|_\infty^2 \Big( |\mathcal{S}| \delta + |\mathcal{S}|^2 \big( \log\big(\tfrac{D}{\delta}\big) + D^{2-\frac{2+\alpha}{2}} + D \big) + e^{-\frac{\rho D}{2}} \Big).$$
Aiming at balancing the terms, we choose again $\delta := |\mathcal{S}|$ and $D := \max(-\frac{4}{\rho} \log(|\mathcal{S}|), 1) \wedge T$. It follows that $\log(\frac{D}{\delta}) \le c |\log|\log|\mathcal{S}||| + c |\log(|\mathcal{S}|)|$ and so, observing that $\log|\log(|\mathcal{S}|)|$ is negligible compared to $\log(\frac{1}{|\mathcal{S}|})$, the bound on the variance becomes
$$\mathrm{Var}\Big( \int_0^T f(X_s) \, ds \Big) \le c T \|f\|_\infty^2 |\mathcal{S}|^2 \Big( 1 + \log\big(\tfrac{1}{|\mathcal{S}|}\big) \Big),$$
where we have also used that $2 - \frac{2+\alpha}{2}$ is always less than $1$ and so $(\log(\frac{1}{|\mathcal{S}|}))^{2-\frac{2+\alpha}{2}} \le c \log(\frac{1}{|\mathcal{S}|})$. The proposition is therefore proved.

Proof.
Estimate (6) is a straightforward consequence of the bias-variance decomposition and of Proposition 1 applied to $f(y) := \prod_{l=1}^d \frac{1}{h_l} \prod_{m=1}^d K\big(\frac{x_m - y_m}{h_m}\big)$, whose support $\mathcal{S}$ is such that $|\mathcal{S}| \le c \prod_{l=1}^d h_l$ and which is by construction such that $\|f\|_\infty \le c (\prod_{l=1}^d h_l)^{-1}$.
To find the optimal choice of $h$ we define $h_l(T) := (\frac{1}{T})^{a_l}$ for $l \in \{1, \dots, d\}$ and we look for $a_1, \dots, a_d$ such that the upper bound on the mean squared error in the right hand side of (6) turns out as small as possible. Replacing the definition of $h_l(T)$ in the bias-variance decomposition, this means looking for the $a_1, \dots, a_d$ which balance all the terms, and so we have to solve the following system:
$$\begin{cases} 2 \beta_i a_i = 2 \beta_{i+1} a_{i+1} & \forall i \in \{1, \dots, d-1\}, \\ 2 \beta_d a_d = 1 - \frac{d-2}{d} \sum_{l=1}^d a_l. \end{cases}$$
We observe that, as a consequence of the first $d-1$ equations, we can express $a_l$ as $\frac{\beta_d}{\beta_l} a_d$ for each $l \in \{1, \dots, d-1\}$. Therefore, the last equation becomes
$$2 \beta_d a_d = 1 - \frac{d-2}{d} \beta_d a_d \sum_{l=1}^d \frac{1}{\beta_l}.$$
Setting $\frac{1}{\bar{\beta}} := \frac{1}{d} \sum_{l=1}^d \frac{1}{\beta_l}$, it follows
$$2 \beta_d a_d = 1 - (d-2) \frac{\beta_d a_d}{\bar{\beta}},$$
which yields
$$a_d = \frac{\bar{\beta}}{\beta_d (2\bar{\beta} + d - 2)}, \qquad a_l = \frac{\bar{\beta}}{\beta_l (2\bar{\beta} + d - 2)} \quad \forall l \in \{1, \dots, d\}.$$
Taking in the right hand side of (6) the rate optimal choice $h_l(T) = (\frac{1}{T})^{\frac{\bar{\beta}}{\beta_l (2\bar{\beta} + d - 2)}}$, we get the convergence rate we wanted.

Proof.
The upper bound on the mean squared error follows from the bias-variance decomposition and from Proposition 1, recalling that for $f(y) := \frac{1}{h} K(\frac{x - y}{h})$ we have $\|f\|_\infty \le c h^{-1}$ and its support $\mathcal{S}$ is such that $|\mathcal{S}| \le c h$.
Now, aiming at balancing the terms, we take $h := (\frac{1}{T})^{a}$, getting that the mean squared error is upper bounded by
$$\Big(\frac{1}{T}\Big)^{2 a \beta} + \frac{1}{T} + \frac{(a \log T)^{2 - \frac{1+\alpha}{2}}}{T} + \frac{a \log T}{T}.$$
If $a$ gets bigger, clearly $h$ gets smaller; it is enough to take $a$ such that $2 a \beta > 1$ to make the first term negligible, getting the convergence rate $\frac{(\log T)^{2 - \frac{1+\alpha}{2}}}{T}$ for $\alpha \le 1$ and $\frac{\log T}{T}$ for $\alpha > 1$.

Proof.
Again, (8) follows naturally from the bias-variance decomposition and Proposition 1. Regarding the convergence rate, we take again $h_l := (\frac{1}{T})^{a_l}$ for $l = 1, 2$. Then $\log(\frac{1}{h_1 h_2}) = (a_1 + a_2) \log T$ and so the mean squared error is upper bounded by
$$\Big(\frac{1}{T}\Big)^{2 a_1 \beta_1} + \Big(\frac{1}{T}\Big)^{2 a_2 \beta_2} + c \, \frac{\log T}{T}.$$
Taking $a_1$ and $a_2$ big enough to make the first two terms here above negligible compared to the third, we get the convergence rate $\frac{\log T}{T}$.

Proof.
It is a straightforward consequence of Propositions 2, 3 and 4, after having remarked that the constants appearing in all the previous propositions do not depend on the point $x$ considered. Indeed, such propositions yield
$$\mathbb{E}\big[ \| \hat{\mu}_{h,T} - \mu \|^2_A \big] = \mathbb{E}\Big[ \int_A |\hat{\mu}_{h,T}(x) - \mu(x)|^2 \, dx \Big] \le c \int_A V_d(T) \, dx \le c |A| \, V_d(T).$$
The heart of the proof of Theorem 1 consists in finding an upper bound for the expected value of $A(h)$, which is gathered in the following proposition.

Proposition 5.
Suppose that Assumptions A1-A3 hold. Then, $\forall h \in \mathcal{H}_T$,
$$\mathbb{E}[A(h)] \le c_1 B(h) + c_2 e^{-c_3 (\log T)^2}.$$
Proposition 5 will be proven after the proofs of Theorems 1 and 2.

Proof of Theorem 1.
Proof.
From the triangle inequality it follows, $\forall h \in \mathcal{H}_T$,
$$\|\hat{\mu}_{\tilde{h}} - \mu\|^2 \le c \big( \|\hat{\mu}_{\tilde{h}} - \hat{\mu}_{h, \tilde{h}}\|^2 + \|\hat{\mu}_{h, \tilde{h}} - \hat{\mu}_h\|^2 + \|\hat{\mu}_h - \mu\|^2 \big). \quad (25)$$
By the definition (13) of $A(h)$ it follows that the first and the second terms of (25) are respectively upper bounded by $A(h) + V(\tilde{h})$ and $A(\tilde{h}) + V(h)$, having also used on the second term that $\hat{\mu}_{h, \tilde{h}} = \hat{\mu}_{\tilde{h}, h}$. Then, since $\tilde{h}$ has been defined in (14) as the $h \in \mathcal{H}_T$ for which $A(h) + V(h)$ is minimal, we clearly have that $A(\tilde{h}) + V(\tilde{h}) \le A(h) + V(h)$. Hence, for any $h \in \mathcal{H}_T$, we get
$$\|\hat{\mu}_{\tilde{h}} - \mu\|^2 \le c \big( A(h) + V(h) + \|\hat{\mu}_h - \mu\|^2 \big). \quad (26)$$
We want an upper bound for the expected value of the left hand side of (26), and so we need to evaluate $\mathbb{E}[\|\hat{\mu}_h - \mu\|^2] = \mathbb{E}[\int_A |\hat{\mu}_h(x) - \mu(x)|^2 \, dx]$. From the standard bias-variance decomposition, recalling that $\mu_h = K_h * \mu$ is such that $\mathbb{E}[\hat{\mu}_h(x)] = \mu_h(x)$, we get
$$\mathbb{E}\Big[ \int_A |\hat{\mu}_h(x) - \mu(x)|^2 \, dx \Big] = \int_A |\mu_h(x) - \mu(x)|^2 \, dx + \int_A \mathbb{E}\big[ |\hat{\mu}_h(x) - \mu_h(x)|^2 \big] \, dx.$$
Now, we can upper bound the first term of the right hand side here above by enlarging the integration domain, getting
$$\int_A |\mu_h(x) - \mu(x)|^2 \, dx \le \int_{\tilde{A}} |\mu_h(x) - \mu(x)|^2 \, dx = B(h). \quad (27)$$
Moreover, as a consequence of Proposition 1 in the case $d \ge 3$,
$$\int_A \mathbb{E}\big[ |\hat{\mu}_h(x) - \mu_h(x)|^2 \big] \, dx = \int_A \mathrm{Var}\Big( \frac{1}{T \prod_{l=1}^d h_l} \int_0^T \prod_{m=1}^d K\Big(\frac{x_m - X^m_u}{h_m}\Big) \, du \Big) \, dx \le |A| \, \frac{c}{T} \Big(\prod_{l=1}^d h_l\Big)^{-\frac{d-2}{d}}. \quad (28)$$
Comparing the upper bound given in (28) with the definition of $V(h)$ and using also (26) and (27), we get, for each $h \in \mathcal{H}_T$,
$$\mathbb{E}\big[ \|\hat{\mu}_{\tilde{h}} - \mu\|^2 \big] \le c \big( B(h) + V(h) \big) + c \, \mathbb{E}[A(h)].$$
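The selection rule (13)-(14) behind the argument above can be sketched numerically. The following is a purely illustrative, heavily simplified 1-d version on i.i.d. draws (standing in for the continuous record): for each candidate bandwidth $h$ it compares the doubly smoothed estimators $\hat\mu_{h,\eta}$ with $\hat\mu_\eta$ to form $A(h)$, adds the penalty $V(h)$, and keeps the minimizer $\tilde h$. The data, grid, kernel and the penalty constant ($k = 1$) are our own hypothetical choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=2000)                 # i.i.d. stand-in for (X_t)_{0<=t<=T}
grid = np.linspace(-3, 3, 121)
dx = grid[1] - grid[0]
H = [0.05, 0.1, 0.2, 0.4, 0.8]            # candidate bandwidth set H_T
K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)  # Gaussian kernel

def mu_hat(h):
    # kernel density estimator with bandwidth h, evaluated on the grid
    return K((grid[:, None] - X[None, :]) / h).mean(axis=1) / h

def smooth(f, eta):
    # K_eta * f on the grid, i.e. the doubly smoothed estimator
    W = K((grid[:, None] - grid[None, :]) / eta) / eta
    return W @ f * dx

est = {h: mu_hat(h) for h in H}
V = lambda h: 1.0 / (len(X) * h)          # penalty shape V(h), constant k = 1

def A(h):
    # A(h) = sup_eta ( ||mu_hat_{h,eta} - mu_hat_eta||^2 - V(eta) )_+
    vals = [np.sum((smooth(est[h], eta) - est[eta]) ** 2) * dx - V(eta)
            for eta in H]
    return max(0.0, max(vals))

h_tilde = min(H, key=lambda h: A(h) + V(h))  # rule (14)
```

The point of the sketch is only the shape of the criterion: $A(h)$ penalizes bandwidths whose estimator changes too much under further smoothing, while $V(h)$ penalizes small bandwidths.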
Now, from Proposition 5 and the arbitrariness of the bandwidth $h$ we are considering, it follows
$$\mathbb{E}\big[ \|\hat{\mu}_{\tilde{h}} - \mu\|^2 \big] \le c \inf_{h' \in \mathcal{H}_T} \big( B(h') + V(h') \big) + c_1 e^{-c_2 (\log T)^2},$$
as we wanted.
As a consequence of Theorem 1 we get, considering the rate optimal choice $h_l(T) = (\frac{1}{T})^{\frac{\bar{\beta}}{\beta_l (2\bar{\beta} + d - 2)}}$ provided by Proposition 2, the estimate gathered in Theorem 2. Its proof relies on the fact that, by the way it was found in Proposition 2, if the rate optimal bandwidth belongs to $\mathcal{H}_T$ then $\inf_{h \in \mathcal{H}_T} (B(h) + V(h))$ is clearly realized by it.

Proof.
We observe that, for the rate optimal choice $h(T)$ of the bandwidth, the conditions gathered in the right hand side of (10), which are $\frac{(\log T)^d}{T^d} \le \prod_{l=1}^d h_l \le (\frac{1}{\log T})^{\frac{d}{d+2}}$, hold true. Indeed,
$$\prod_{l=1}^d h_l(T) = \Big(\frac{1}{T}\Big)^{\frac{\bar{\beta}}{2\bar{\beta}+d-2} \sum_{l=1}^d \frac{1}{\beta_l}} = \Big(\frac{1}{T}\Big)^{\frac{d}{2\bar{\beta}+d-2}}.$$
The upper bound condition is therefore
$$\Big(\frac{1}{T}\Big)^{\frac{d}{2\bar{\beta}+d-2}} \le \Big(\frac{1}{\log T}\Big)^{\frac{d}{d+2}},$$
which holds true for $\log T \le T^{\frac{d+2}{2\bar{\beta}+d-2}}$. Now we observe that $\frac{d+2}{2\bar{\beta}+d-2} > 0$ is equivalent to $\bar{\beta} > \frac{2-d}{2}$, which is always true since $d \ge 3$. In particular we can write $\frac{d+2}{2\bar{\beta}+d-2} =: \gamma_1 > 0$ and, given that eventually for $T$ going to $\infty$ it is $\log T \le T^{\gamma_1}$, we have $\prod_{l=1}^d h_l(T) \le (\frac{1}{\log T})^{\frac{d}{d+2}}$.
In the same way it is $(\frac{1}{T})^{\frac{d}{2\bar{\beta}+d-2}} \ge \frac{(\log T)^d}{T^d}$. By the same reasoning as here above, this holds true if $1 - \frac{1}{2\bar{\beta}+d-2} =: \gamma_2$ is positive. Being $2\bar{\beta} + d - 2 > 1$ for $\bar{\beta} > 0$ and $d \ge 3$, it turns out $\gamma_2 > 0$, as we wanted.
Up to considering $\tilde{h}_l(T) := \lfloor T^{\frac{\bar{\beta}}{\beta_l (2\bar{\beta}+d-2)}} \rfloor^{-1}$ instead of $h_l(T)$, which is asymptotically equivalent and which leads to the same convergence rate, we have that the rate optimal choice belongs to the set of candidate bandwidths $\mathcal{H}_T$ proposed in (11).
Having now $h(T) \in \mathcal{H}_T$, by the way the rate optimal choice was found in Proposition 2, the $\inf_{h \in \mathcal{H}_T} (B(h) + V(h))$ is clearly realized by it, and so the bound stated in Theorem 1 is actually (see also Corollary 1)
$$\mathbb{E}\big[ \|\hat{\mu}_{\tilde{h}} - \mu\|^2_A \big] \le c \Big(\frac{1}{T}\Big)^{\frac{2\bar{\beta}}{2\bar{\beta}+d-2}} + c_1 e^{-c_2 (\log T)^2},$$
as we wanted.
We have shown Theorem 1 using, as main tool, the bound on $\mathbb{E}[A(h)]$ stated in Proposition 5. Its proof, as we will see in the next section, relies on the use of Berbee's coupling method as in Viennet [26] and on a version of the Talagrand inequality given in Klein and Rio [12].

Proof.
To find a bound for $\mathbb{E}[A(h)]$, for each $h \in \mathcal{H}_T$ we want to use the Talagrand inequality, which is stated for independent random variables. Therefore, we start by introducing blocks which are mutually independent. We do it through the use of Berbee's coupling method, as done in Viennet [26], Proposition 5.1 and its proof p. 484.
We assume that $T = 2 p_T q_T$, with $p_T$ an integer and $q_T$ a real number to be chosen. We split the initial process $X = (X_t)_{t \in [0,T]}$ into $2 p_T$ processes of length $q_T$: for each $j \in \{1, \dots, p_T\}$ we consider
$$X^{j,1} := (X_t)_{t \in [2(j-1) q_T, (2j-1) q_T]} \qquad \text{and} \qquad X^{j,2} := (X_t)_{t \in [(2j-1) q_T, 2j q_T]}.$$
Then, there exists a process $(X^*_t)_{t \in [0,T]}$ satisfying the following properties:
1. For $j \in \{1, \dots, p_T\}$, the processes $X^{j,1}$ and $X^{*j,1} := (X^*_t)_{t \in [2(j-1) q_T, (2j-1) q_T]}$ have the same distribution, and so have the processes $X^{j,2}$ and $X^{*j,2} := (X^*_t)_{t \in [(2j-1) q_T, 2j q_T]}$.
2. For $j \in \{1, \dots, p_T\}$, $\mathbb{P}(X^{j,1} \neq X^{*j,1}) \le \beta_X(q_T)$ and $\mathbb{P}(X^{j,2} \neq X^{*j,2}) \le \beta_X(q_T)$, where $\beta_X$ is the $\beta$-mixing coefficient of the process $X$ as in (4).
3. For each $k \in \{1, 2\}$, the processes $X^{*1,k}, \dots, X^{*p_T,k}$ are independent.
We denote by $\hat{\mu}^*_h$ the estimator computed using $X^*_t$ instead of $X_t$ and we write $\hat{\mu}^*_h = \frac{1}{2}(\hat{\mu}^{*(1)}_h + \hat{\mu}^{*(2)}_h)$ to separate the part coming from the $X^{*\cdot,1}$ (superindex (1)) and the one coming from the $X^{*\cdot,2}$ (superindex (2)), having
$$\hat{\mu}^{*(1)}_h := \frac{1}{p_T q_T} \sum_{j=1}^{p_T} \int_{2(j-1) q_T}^{(2j-1) q_T} K_h(X^*_u - x) \, du.$$
In a natural way we define moreover $\hat{\mu}^*_{h,\eta} := K_\eta * \hat{\mu}^*_h$, which can be written again as $\frac{1}{2}(\hat{\mu}^{*(1)}_{h,\eta} + \hat{\mu}^{*(2)}_{h,\eta})$, to separate the contributions of $X^{*\cdot,1}$ and $X^{*\cdot,2}$.
With this background we can evaluate $\mathbb{E}[A(h)]$. We recall that, as defined in (13),
$$A(h) := \sup_{\eta \in \mathcal{H}_T} \big( \|\hat{\mu}_{h,\eta} - \hat{\mu}_\eta\|^2 - V(\eta) \big)_+.$$
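The block decomposition above can be sketched in a few lines; a minimal illustration with hypothetical sample values $T = 1000$ and $q_T = 25$ (chosen so that $T = 2 p_T q_T$ holds exactly), checking that the odd blocks (feeding $\hat\mu^{*(1)}$) and the even blocks (feeding $\hat\mu^{*(2)}$) interleave and tile $[0, T]$:

```python
# Hypothetical sample values for the horizon T and the block length q_T.
T, q_T = 1000.0, 25.0
p_T = int(T / (2 * q_T))                  # here T = 2 * p_T * q_T exactly

# X^{j,1} lives on [2(j-1)q_T, (2j-1)q_T], X^{j,2} on [(2j-1)q_T, 2j q_T]
odd = [(2 * (j - 1) * q_T, (2 * j - 1) * q_T) for j in range(1, p_T + 1)]
even = [((2 * j - 1) * q_T, 2 * j * q_T) for j in range(1, p_T + 1)]

assert 2 * p_T * q_T == T
# the two families interleave and tile [0, T] without gaps or overlap
blocks = sorted(odd + even)
assert blocks[0][0] == 0 and blocks[-1][1] == T
assert all(blocks[i][1] == blocks[i + 1][0] for i in range(len(blocks) - 1))
```

After Berbee's coupling, the odd blocks of $X^*$ are mutually independent, and likewise the even ones, which is exactly what the Talagrand inequality below requires.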
Now we can see $\hat{\mu}_{h,\eta} - \hat{\mu}_\eta$ as a sum of different terms, which we deal with singularly:
$$\hat{\mu}_{h,\eta} - \hat{\mu}_\eta = (\hat{\mu}_{h,\eta} - \hat{\mu}^*_{h,\eta}) + (\hat{\mu}^*_{h,\eta} - \mu_{h,\eta}) + (\mu_{h,\eta} - \mu_\eta) + (\mu_\eta - \hat{\mu}^*_\eta) + (\hat{\mu}^*_\eta - \hat{\mu}_\eta).$$
As a consequence of the triangle inequality and of the definition of $A(h)$, the following estimate holds true:
$$A(h) \le c \sup_{\eta \in \mathcal{H}_T} \Big[ \|\hat{\mu}_{h,\eta} - \hat{\mu}^*_{h,\eta}\|^2 + \Big( \|\hat{\mu}^*_{h,\eta} - \mu_{h,\eta}\|^2 - \frac{V(\eta)}{2} \Big)_+ + \|\mu_{h,\eta} - \mu_\eta\|^2 + \Big( \|\mu_\eta - \hat{\mu}^*_\eta\|^2 - \frac{V(\eta)}{2} \Big)_+ + \|\hat{\mu}^*_\eta - \hat{\mu}_\eta\|^2 \Big] =: c \sup_{\eta \in \mathcal{H}_T} \Big[ \sum_{i=1}^5 I^{h,\eta}_i \Big].$$
We start by considering $I^{h,\eta}_5$. We define the set
$$\Omega^* := \{ X_t = X^*_t \ \forall t \in [0,T] \}.$$
As a consequence of the second property of the process $X^*$ and of the $\beta$-mixing with exponential decay shown in Lemma 2, we get, recalling that $2 p_T q_T = T$ (with $q_T$ and $p_T$ to be chosen),
$$\mathbb{P}(\Omega^{*c}) \le 2 p_T \beta_X(q_T) \le c \, \frac{T}{q_T} e^{-\gamma q_T}. \quad (29)$$
From the definition of $\hat{\mu}^*_\eta$ and the Jensen inequality it is
$$\|\hat{\mu}^*_\eta - \hat{\mu}_\eta\|^2 = \int_A \Big( \frac{1}{T} \int_0^T \big[ K_\eta(X_t - x) - K_\eta(X^*_t - x) \big] \, dt \Big)^2 dx \, 1_{\Omega^{*c}} \le c \int_A \frac{1}{T} \int_0^T \big[ K_\eta(X_t - x)^2 + K_\eta(X^*_t - x)^2 \big] \, dt \, dx \, 1_{\Omega^{*c}} \le c \|K_\eta\|^2_\infty \, 1_{\Omega^{*c}}. \quad (30)$$
By the definition (12) we get that, $\forall \eta \in \mathcal{H}_T$, $\|K_\eta\|_\infty \le c (\prod_{l=1}^d \eta_l)^{-1}$. We recall that, from how we have defined the set $\mathcal{H}_T$ in (10), we have $\forall \eta \in \mathcal{H}_T$ that $\prod_{l=1}^d \eta_l > \frac{(\log T)^d}{T^d}$ and so $\|K_\eta\|_\infty \le c (\prod_{l=1}^d \eta_l)^{-1} \le c \frac{T^d}{(\log T)^d}$. Replacing this bound in (30), it follows
$$\sup_{\eta \in \mathcal{H}_T} \|\hat{\mu}^*_\eta - \hat{\mu}_\eta\|^2 \le c \, \frac{T^{2d}}{(\log T)^{2d}} \, 1_{\Omega^{*c}}.$$
We take its expectation and we use (29), getting a term which depends on $q_T$, a real number still to be chosen. From the arbitrariness of $q_T$ we get a convergence to zero as fast as we want, for $T$ going to $\infty$. Indeed, taking $q_T := (\log T)^2$ yields, $\forall h, \eta \in \mathcal{H}_T$,
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_5 \Big] = \mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} \|\hat{\mu}^*_\eta - \hat{\mu}_\eta\|^2 \Big] \le c \, \frac{T^{2d+1}}{(\log T)^{2d+2}} \, e^{-\gamma (\log T)^2}. \quad (31)$$
Regarding $\sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_1$, we estimate it through (31) and the following lemma, which will be proven in the appendix.

Lemma 3.
Let $f : \mathbb{R}^d \to \mathbb{R}$ be a bounded, measurable function with support $\mathcal{S}$ satisfying $\mathrm{diam}(\mathcal{S}) \le \sqrt{d}$; let $\tilde{A}$ be the compact set in $\mathbb{R}^d$ such that $A \subset \tilde{A}$ and $\tilde{A} = \{ \zeta : d(\zeta, A) \le \sqrt{d} \}$, and let $g$ be a function in $L^2(\tilde{A})$. Then
$$\|f * g\|_A \le \|f\|_{1, \mathbb{R}^d} \, \|g\|_{2, \tilde{A}},$$
where we have denoted by $\|\cdot\|_A$ the usual $L^2$ norm on $A$, by $\|\cdot\|_{1, \mathbb{R}^d}$ the $L^1$ norm on $\mathbb{R}^d$ and by $\|\cdot\|_{2, \tilde{A}}$ the $L^2$ norm on $\tilde{A}$.

We recall that $\hat{\mu}_{h,\eta} = K_\eta * \hat{\mu}_h$ and $\hat{\mu}^*_{h,\eta} = K_\eta * \hat{\mu}^*_h$. Therefore, remarking that $\mathrm{diam}(\mathrm{supp}(K)) \le 1$ implies, for $K_\eta$, that $\mathrm{diam}(\mathrm{supp}(K_\eta)) \le \sqrt{d}$, we can use Lemma 3, which yields
$$\sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_1 = \sup_{\eta \in \mathcal{H}_T} \|K_\eta * (\hat{\mu}_h - \hat{\mu}^*_h)\|^2 \le \sup_{\eta \in \mathcal{H}_T} \|K_\eta\|^2_{1, \mathbb{R}^d} \, \|\hat{\mu}_h - \hat{\mu}^*_h\|^2_{2, \tilde{A}}.$$
Taking the expected value, using that $\|K_\eta\|_{1, \mathbb{R}^d} \le c$ $\forall \eta \in \mathcal{H}_T$ and the equation (31), and remarking that the dependence on the integration set considered is hidden in the constant $c$, in which this time $|\tilde{A}|$ will appear instead of $|A|$, we get
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_1 \Big] \le c \, \frac{T^{2d+1}}{(\log T)^{2d+2}} \, e^{-\gamma (\log T)^2}. \quad (32)$$
We still use Lemma 3 to study $\sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_3$, recalling that $\mu_{h,\eta} = K_\eta * \mu_h$ and $\mu_\eta = K_\eta * \mu$. It yields
$$\sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_3 = \sup_{\eta \in \mathcal{H}_T} \|K_\eta * (\mu_h - \mu)\|^2 \le \sup_{\eta \in \mathcal{H}_T} \|K_\eta\|^2_{1, \mathbb{R}^d} \, \|\mu_h - \mu\|^2_{2, \tilde{A}} \le c \|\mu_h - \mu\|^2_{2, \tilde{A}} = c B(h). \quad (33)$$
We are left to study $I^{h,\eta}_2$ and $I^{h,\eta}_4$, for which we need the following lemma, which will be shown right after the proof of this proposition.

Lemma 4.
For $i = 1, 2$, there exist some positive constants $c^*_1, c^*_2, c^*_3$ and a constant $k^*$ such that, for any $\bar{k} \ge k^*$,
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} \Big( \|\mu_\eta - \hat{\mu}^{*(i)}_\eta\|^2 - \frac{\bar{k}}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} \Big)_+ \Big] \le c^*_1 T^{c^*_2} e^{-c^*_3 (\log T)^2}. \quad (34)$$
Moreover, there exists $k_0$ such that, for any $\bar{k} \ge k_0$,
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} \Big( \|\mu_{h,\eta} - \hat{\mu}^{*(i)}_{h,\eta}\|^2 - \frac{\bar{k}}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} \Big)_+ \Big] \le c^*_1 T^{c^*_2} e^{-c^*_3 (\log T)^2}. \quad (35)$$
Concerning $I^{h,\eta}_4$, observe that $\mu_\eta - \hat{\mu}^*_\eta = \frac{1}{2} (2 \mu_\eta - \hat{\mu}^{*(1)}_\eta - \hat{\mu}^{*(2)}_\eta)$. Hence, from the triangle inequality and the definition of the positive part function, we get
$$I^{h,\eta}_4 \le c \Big( \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|^2 - \frac{V(\eta)}{2} \Big)_+ + c \Big( \|\mu_\eta - \hat{\mu}^{*(2)}_\eta\|^2 - \frac{V(\eta)}{2} \Big)_+.$$
From (34), for a $k$ in the definition of $V(\eta)$ big enough, for which we have $k > (k^* \vee k_0)$, we get
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_4 \Big] \le c_1 T^{c_2} e^{-c_3 (\log T)^2}. \quad (36)$$
In the same way, remarking that
$$I^{h,\eta}_2 \le c \Big( \|\hat{\mu}^{*(1)}_{h,\eta} - \mu_{h,\eta}\|^2 - \frac{V(\eta)}{2} \Big)_+ + c \Big( \|\hat{\mu}^{*(2)}_{h,\eta} - \mu_{h,\eta}\|^2 - \frac{V(\eta)}{2} \Big)_+$$
and using (35), it follows
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} I^{h,\eta}_2 \Big] \le c_1 T^{c_2} e^{-c_3 (\log T)^2}. \quad (37)$$
From (31), (32), (33), (36) and (37) we obtain, for any $h \in \mathcal{H}_T$,
$$\mathbb{E}[A(h)] \le c \, \frac{T^{2d+1}}{(\log T)^{2d+2}} \, e^{-\gamma (\log T)^2} + c B(h) + c_1 T^{c_2} e^{-c_3 (\log T)^2} \le c B(h) + c_1 e^{-c_2 (\log T)^2},$$
as we wanted.
To conclude the proof of the adaptive procedure we need to show Lemma 4, whose core is the use of the Talagrand inequality. First of all, we recall the following version of the Talagrand inequality, which has been stated as Lemma 2 in [6] and which is a straightforward consequence of the Talagrand inequality given in Klein and Rio [12].

Lemma 5.
Let $T_1, \dots, T_p$ be independent random variables with values in some Polish space $\mathcal{X}$, let $\mathcal{R}$ be a countable class of measurable functions from $\mathcal{X}$ into $[-1, 1]$ and set
$$v_p(r) := \frac{1}{p} \sum_{j=1}^p \big[ r(T_j) - \mathbb{E}[r(T_j)] \big].$$
Then,
$$\mathbb{E}\Big[ \Big( \sup_{r \in \mathcal{R}} |v_p(r)|^2 - 2 H^2 \Big)_+ \Big] \le c \Big( \frac{v}{p} \, e^{-c_1 \frac{p H^2}{v}} + \frac{M^2}{p^2} \, e^{-c_2 \frac{p H}{M}} \Big), \quad (38)$$
with $c$ a universal constant and where
$$\sup_{r \in \mathcal{R}} \|r\|_\infty \le M, \qquad \mathbb{E}\Big[ \sup_{r \in \mathcal{R}} |v_p(r)| \Big] \le H, \qquad \sup_{r \in \mathcal{R}} \frac{1}{p} \sum_{j=1}^p \mathrm{Var}(r(T_j)) \le v.$$

Proof of Lemma 4.

Proof.
Since the two cases $i = 1$ and $i = 2$ are similar, we study only one of them. We start by proving (34); the proof of inequality (35) follows the same line. We first observe that
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} \Big( \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|^2_A - \frac{\bar{k}}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} \Big)_+ \Big] \le \sum_{\eta \in \mathcal{H}_T} \mathbb{E}\Big[ \Big( \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|^2_A - \frac{\bar{k}}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} \Big)_+ \Big]. \quad (39)$$
Our goal is now to find a bound for the right hand side of the inequality here above using the version of the Talagrand inequality gathered in Lemma 5. To do it, we need to introduce some notation. We observe that
$$\|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|_A = \sup_{r, \, \|r\|_2 = 1} \langle \mu_\eta - \hat{\mu}^{*(1)}_\eta, r \rangle,$$
and the supremum can be considered over a countable dense set of functions $r$ such that $\|r\|_2 = 1$; let us denote this set by $\mathcal{B}(1)$. We define
$$T_j(z) := \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} K_\eta(X^*_t - z) \, dt, \qquad r(T_j) := \int_A T_j(z) r(z) \, dz.$$
Thus
$$v_{p_T}(r) = \langle \hat{\mu}^{*(1)}_\eta - \mu_\eta, r \rangle = \frac{1}{p_T} \sum_{j=1}^{p_T} \big[ r(T_j) - \mathbb{E}[r(T_j)] \big]$$
is a centered empirical process with independent variables
$$\psi_r(X^{*j,1}) := r(T_j) - \mathbb{E}[r(T_j)] = \int_A \Big( \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} \big[ K_\eta(X^*_t - z) - \mathbb{E}[K_\eta(X^*_t - z)] \big] \, dt \Big) r(z) \, dz,$$
to which we want to apply the Talagrand inequality (38). Therefore, we have to compute $M$, $H$ and $v$ as defined in Lemma 5. We start with the computation of $M$. For any $r \in \mathcal{B}(1)$ it is, using the definition of $r(T_j)$ and the Cauchy-Schwarz inequality,
$$\Big| \int_A T_j(z) r(z) \, dz \Big| \le \Big( \int_A T_j(z)^2 \, dz \Big)^{\frac{1}{2}}.$$
Now, from the definition of $T_j$ and the Jensen inequality it follows
$$\int_A T_j(z)^2 \, dz \le \int_A \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} K_\eta(X^*_t - z)^2 \, dt \, dz \le c \|K_\eta\|^2_\infty |\mathcal{S}| \le c \Big(\prod_{l=1}^d \eta_l\Big)^{-2} \Big(\prod_{l=1}^d \eta_l\Big) = c \Big(\prod_{l=1}^d \eta_l\Big)^{-1},$$
where we have also used that the support of $K_\eta$ is a set $\mathcal{S}$ whose size is $c \prod_{l=1}^d \eta_l$. Hence,
$$\Big| \int_A T_j(z) r(z) \, dz \Big| \le c \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{1}{2}} =: M.$$
(40)
Regarding the computation of $H$, from the definition of $v_{p_T}(r)$ and the fact that the random variables $\psi_r(X^{*j,1})$ are centered and independent, it follows
$$\mathbb{E}\Big[ \sup_{r \in \mathcal{B}(1)} |v_{p_T}(r)|^2 \Big] = \mathbb{E}\big[ \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|^2_A \big] = \int_A \mathrm{Var}\Big( \frac{1}{p_T} \sum_{j=1}^{p_T} \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} \big( K_\eta(X^*_t - z) - \mathbb{E}[K_\eta(X^*_t - z)] \big) \, dt \Big) dz$$
$$= \int_A \frac{1}{p_T} \mathrm{Var}\Big( \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} K_\eta(X^*_t - z) \, dt \Big) dz \le c \, \frac{|A|}{p_T q_T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}},$$
where in the last inequality we have used the estimate for the variance in the case $d \ge 3$ on the function $K_\eta(\cdot - z)$, for which we know that its support $\mathcal{S}$ is such that $|\mathcal{S}| \le c \prod_{l=1}^d \eta_l$. It yields
$$\mathbb{E}\Big[ \sup_{r \in \mathcal{B}(1)} |v_{p_T}(r)|^2 \Big] \le \frac{c}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} =: H^2. \quad (41)$$
In order to use Lemma 5, we are left to compute $v$. We observe it is
$$\frac{1}{p_T} \sum_{j=1}^{p_T} \mathrm{Var}\Big( \int_A \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} K_\eta(X^*_t - z) \, dt \, r(z) \, dz \Big) = \frac{1}{p_T} \sum_{j=1}^{p_T} \mathrm{Var}\Big( \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} (K_\eta * r)(X^*_t) \, dt \Big).$$
We want to prove a tight upper bound for the variance of the integral functional $\frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} f_\eta(X^*_t) \, dt$ of the diffusion $X^*$, where we have denoted $f_\eta := K_\eta * r$. Following the proof we have given of Proposition 1, we have
$$\mathrm{Var}\Big( \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} f_\eta(X^*_t) \, dt \Big) \le \frac{c}{q_T^2} \int_0^{q_T} (q_T - u) \, \big| \mathbb{E}[f_{\eta,c}(X^*_0) f_{\eta,c}(X^*_u)] \big| \, du$$
$$= \frac{c}{q_T^2} \Big( \int_0^D (q_T - u) \, \big| \mathbb{E}[f_{\eta,c}(X^*_0) f_{\eta,c}(X^*_u)] \big| \, du + \int_D^{q_T} (q_T - u) \, \big| \mathbb{E}[f_{\eta,c}(X^*_0) f_{\eta,c}(X^*_u)] \big| \, du \Big),$$
where we have introduced $f_{\eta,c}(x) := f_\eta(x) - \mu(f_\eta)$ and $D$, a quantity to be chosen in order to balance the contributions of the two integrals here above. We denote moreover by $P^*_t$ the transition semigroup of the process $X^*$. Concerning the integral between $0$ and $D$, we act like we did in (16) and we use the Cauchy-Schwarz inequality and the fact that $P^*_t$ is a contraction.
It follows
$$\frac{c}{q_T^2} \int_0^D (q_T - u) \, \big| \mathbb{E}[f_{\eta,c}(X^*_0) f_{\eta,c}(X^*_u)] \big| \, du \le \frac{c}{q_T} \Big| \int_0^D \langle P^*_t f_\eta, f_\eta \rangle_\mu \, dt \Big| \le \frac{c}{q_T} \int_0^D \|P^*_t f_\eta\|_\mu \|f_\eta\|_\mu \, dt \le \frac{c}{q_T} \int_0^D \|f_\eta\|^2_\mu \, dt \le c \, \frac{D}{q_T}, \quad (42)$$
where in the last inequality we have used the fact that $\|\mu\|_\infty \le c$, the Young inequality and the definitions of the kernel function and of $r$, in order to say
$$\|f_\eta\|^2_\mu = \|K_\eta * r\|^2_\mu \le c \|K_\eta * r\|^2_{2, \mathbb{R}^d} \le c \|K_\eta\|^2_{1, \mathbb{R}^d} \|r\|^2_{2, \mathbb{R}^d} \le c. \quad (43)$$
Regarding the integral between $D$ and $q_T$, we act like we did in (20), using the exponential ergodicity gathered in Lemma 2, to get
$$\frac{c}{q_T^2} \int_D^{q_T} (q_T - u) \, \big| \mathbb{E}[f_{\eta,c}(X^*_0) f_{\eta,c}(X^*_u)] \big| \, du \le \frac{c}{q_T} \Big| \int_D^{q_T} \langle P^*_t f_{\eta,c}, f_{\eta,c} \rangle_\mu \, dt \Big| \le \frac{c}{q_T} \int_D^{q_T} e^{-\rho t} \|f_{\eta,c}\|^2_\infty \, dt. \quad (44)$$
We now recall that $f_{\eta,c}(x) = f_\eta(x) - \mu(f_\eta)$ with $\mu(f_\eta) = \int_{\mathbb{R}^d} f_\eta(x) \mu(x) \, dx$. Hence, from the Cauchy-Schwarz inequality and (43), we get $|\mu(f_\eta)| \le c$ and therefore
$$|f_{\eta,c}(x)| \le |f_\eta(x)| + c. \quad (45)$$
To estimate the sup norm of $f_{\eta,c}$ we remark that, $\forall y \in \mathbb{R}^d$,
$$|f_\eta(y)| = |K_\eta * r(y)| = \Big| \int_A K_\eta(y - z) r(z) \, dz \Big| \le \Big( \int_A K_\eta(y - z)^2 \, dz \Big)^{\frac{1}{2}} \Big( \int_A r(z)^2 \, dz \Big)^{\frac{1}{2}} = \Big( \int_{\mathcal{S}} K_\eta(y - z)^2 \, dz \Big)^{\frac{1}{2}},$$
where we have used the Cauchy-Schwarz inequality and the fact that $r \in \mathcal{B}(1)$, so that its $2$-norm is equal to $1$ by definition. Moreover, the kernel function is different from $0$ only on its support $\mathcal{S}$, whose size is $c \prod_{l=1}^d \eta_l$. Hence, recalling also that the sup norm of $K_\eta$ is upper bounded by $c (\prod_{l=1}^d \eta_l)^{-1}$, we get
$$|K_\eta * r(y)| \le c \big( \|K_\eta\|^2_\infty |\mathcal{S}| \big)^{\frac{1}{2}} \le \frac{c}{\sqrt{\prod_{l=1}^d \eta_l}}.$$
It follows
$$\|f_{\eta,c}\|_\infty \le \frac{c}{\sqrt{\prod_{l=1}^d \eta_l}} + c \le \frac{c}{\sqrt{\prod_{l=1}^d \eta_l}}, \quad (46)$$
given that the second term is negligible compared to the first. Replacing (46) in (44), using also (42), we obtain
$$\mathrm{Var}\Big( \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} f_\eta(X^*_t) \, dt \Big) \le \frac{c}{q_T} \Big( D + \frac{e^{-\rho D}}{\prod_{l=1}^d \eta_l} \Big).$$
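The balancing of the two terms in the last display can be checked directly. Writing $s$ for $\prod_{l=1}^d \eta_l$, the choice $D = \frac{1}{\rho} \log(\frac{1}{s})$ (the balance point when neither the cap at $q_T$ nor the floor at $1$ binds) makes $e^{-\rho D}/s = 1$, so the two contributions $D/q_T$ and $e^{-\rho D}/(s q_T)$ sum exactly to $\frac{1}{q_T}(1 + \frac{1}{\rho}\log\frac{1}{s})$. A minimal numerical check with hypothetical sample values for $s$, $\rho$ and $q_T$:

```python
import math

# Hypothetical sample values: s = prod_l eta_l, mixing rate rho, block length q_T.
s, rho, q_T = 1e-3, 0.5, 400.0

D = math.log(1.0 / s) / rho              # balancing value of D
lhs = D / q_T + math.exp(-rho * D) / (s * q_T)   # the two variance terms
rhs = (1.0 + math.log(1.0 / s) / rho) / q_T      # the claimed common bound
assert abs(lhs - rhs) / rhs < 1e-12
```

This is exactly the shape of the quantity $v$ taken in (47) below, up to the multiplicative constant $c$.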
We look for a $D$ for which the first and the second terms of the right hand side of the inequality here above have the same magnitude. Therefore, we choose $D := \max(-\frac{1}{\rho} \log(\prod_{l=1}^d \eta_l), 1) \wedge q_T$. Replacing such a value we get, if $q_T > -\frac{1}{\rho} \log(\prod_{l=1}^d \eta_l)$,
$$\frac{1}{p_T} \sum_{j=1}^{p_T} \mathrm{Var}\Big( \int_A \frac{1}{q_T} \int_{2(j-1) q_T}^{(2j-1) q_T} K_\eta(X^*_t - z) \, dt \, r(z) \, dz \Big) \le \frac{c}{q_T} \Big( 1 + \log\Big( \frac{1}{\prod_{l=1}^d \eta_l} \Big) \Big).$$
Otherwise, if $q_T \le -\frac{1}{\rho} \log(\prod_{l=1}^d \eta_l)$, by the definition of $D$ we have $D = q_T$. We still have the contribution of $c \frac{D}{q_T}$, which is in this case less than $\frac{c}{q_T} \log(\frac{1}{\prod_{l=1}^d \eta_l})$ and, moreover, the contribution of the integral between $D$ and $q_T$ is now null, since we have $D = q_T$. Hence we can take
$$v := \frac{c}{q_T} \Big( 1 + \log\Big( \frac{1}{\prod_{l=1}^d \eta_l} \Big) \Big). \quad (47)$$
We use Lemma 5 on the right hand side of (39), recalling that $\|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|_A = \sup_{r \in \mathcal{B}(1)} \langle \mu_\eta - \hat{\mu}^{*(1)}_\eta, r \rangle = \sup_{r \in \mathcal{B}(1)} |v_{p_T}(r)|$, with $M$, $H$ and $v$ as found in (40), (41) and (47). It follows
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} \Big( \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|^2 - \frac{\bar{k}}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} \Big)_+ \Big] \le c \sum_{\eta \in \mathcal{H}_T} \Bigg[ \frac{1 + \log(\frac{1}{\prod_l \eta_l})}{p_T q_T} \, e^{-c_1 \frac{(\prod_l \eta_l)^{-\frac{d-2}{d}}}{1 + \log(\frac{1}{\prod_l \eta_l})}} + \frac{(\prod_l \eta_l)^{-1}}{p_T^2} \, e^{-c_2 \frac{\sqrt{T}}{q_T} (\prod_l \eta_l)^{\frac{1}{d}}} \Bigg].$$
We recall that $2 p_T q_T = T$, where $q_T$ was chosen above equation (31) as $(\log T)^2$; we can therefore upper bound the right hand side of the equation here above with
$$c \sum_{\eta \in \mathcal{H}_T} \Bigg[ \frac{1 + \log(\frac{1}{\prod_l \eta_l})}{T} \, e^{-c_1 \frac{(\prod_l \eta_l)^{-\frac{d-2}{d}}}{1 + \log(\frac{1}{\prod_l \eta_l})}} + \frac{(\log T)^4}{T^2 \prod_l \eta_l} \, e^{-c_2 \frac{\sqrt{T}}{(\log T)^2} (\prod_l \eta_l)^{\frac{1}{d}}} \Bigg] \le \Big( \frac{\log T}{T} \, e^{-c (\log T)^2} + \frac{T^{2d}}{(\log T)^{2d}} \, e^{-c (\log T)^2} \Big) |\mathcal{H}_T|,$$
where in the last inequality we have used that, by the definition (10) we have given of $\mathcal{H}_T$, $\forall \eta \in \mathcal{H}_T$ we have $\frac{(\log T)^d}{T^d} \le \prod_{l=1}^d \eta_l \le (\frac{1}{\log T})^{\frac{d}{d+2}}$, and so
$$\frac{(\prod_l \eta_l)^{-\frac{d-2}{d}}}{1 + \log(\frac{1}{\prod_l \eta_l})} \ge c (\log T)^2 \qquad \text{and} \qquad \frac{\sqrt{T}}{(\log T)^2} \Big(\prod_l \eta_l\Big)^{\frac{1}{d}} \ge c (\log T)^2;$$
we have therefore upper bounded each element of the sum with a quantity which does not depend on $\eta$. We have moreover assumed that $|\mathcal{H}_T|$ has polynomial growth in $T$, and so there is a constant $c_2 > 0$ such that
$$\mathbb{E}\Big[ \sup_{\eta \in \mathcal{H}_T} \Big( \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|^2 - \frac{\bar{k}}{T} \Big(\prod_{l=1}^d \eta_l\Big)^{-\frac{d-2}{d}} \Big)_+ \Big] \le \Big( \frac{\log T}{T} \, e^{-c (\log T)^2} + \frac{T^{2d}}{(\log T)^{2d}} \, e^{-c (\log T)^2} \Big) T^{c_2};$$
inequality (34) follows.
As a consequence of Lemma 3 and the definition of the kernel function, we moreover have
$$\|\mu_{h,\eta} - \hat{\mu}^{*(1)}_{h,\eta}\| = \|K_h * (\mu_\eta - \hat{\mu}^{*(1)}_\eta)\| \le \|K_h\|_{1, \mathbb{R}^d} \, \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|_{2, \tilde{A}} \le c \, \|\mu_\eta - \hat{\mu}^{*(1)}_\eta\|_{2, \tilde{A}}.$$
Using this and (34), which we have just shown, we obtain (35).
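Before moving to the appendix, the exponent bookkeeping used in Proposition 2 and in the proof of Theorem 2 can be verified numerically: with hypothetical smoothness values $\beta_1, \dots, \beta_d$, the closed form $a_l = \bar\beta / (\beta_l (2\bar\beta + d - 2))$, where $\frac{1}{\bar\beta} = \frac{1}{d}\sum_l \frac{1}{\beta_l}$, makes all bias exponents $2\beta_l a_l$ equal to each other, to the variance exponent $1 - \frac{d-2}{d}\sum_l a_l$, and to the rate exponent $\frac{2\bar\beta}{2\bar\beta + d - 2}$.

```python
import numpy as np

# Hypothetical anisotropic smoothness indices beta_1, ..., beta_d (d = 3 here).
beta = np.array([1.5, 2.0, 3.0])
d = len(beta)
beta_bar = d / np.sum(1.0 / beta)        # harmonic-mean smoothness

# closed-form balancing exponents from Proposition 2
a = beta_bar / (beta * (2 * beta_bar + d - 2))

bias_exponents = 2 * beta * a            # squared-bias exponents (in 1/T)
variance_exponent = 1 - (d - 2) / d * a.sum()
rate_exponent = 2 * beta_bar / (2 * beta_bar + d - 2)

assert np.allclose(bias_exponents, bias_exponents[0])   # all biases balanced
assert np.isclose(bias_exponents[0], variance_exponent) # bias = variance
assert np.isclose(bias_exponents[0], rate_exponent)     # = convergence rate
```

For these sample values $\bar\beta = 2$, so the common exponent is $\frac{2\bar\beta}{2\bar\beta + d - 2} = 0.8$, i.e. the mean squared error decays like $T^{-0.8}$.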
A Appendix
A.1 Proof of Lemma 1
Proof.
Lemma 1 relies on the first point of Theorem 1.1 in [4], which needs the following assumptions on the coefficients $a$, $b$, $\gamma$ and on the jumps.
$(H_a)$ There are $c_0 > 1$ and $\beta \in (0,1)$ such that, for all $x, y \in \mathbb{R}^d$, $|a(x) - a(y)| \le c_0 |x - y|^\beta$ and, for some $c_1 \ge 1$, $c_1^{-1} I_{d \times d} \le a(x) \le c_1 I_{d \times d}$.
$(H_k)$ The function $k(x, z) := |z|^{d+\alpha} F\big(\frac{z}{\gamma(x)}\big)$ is bounded, measurable and, if $\alpha = 1$, for any $0 < r < R < \infty$ it is
$$\int_{r \le |z| \le R} z \, k(x, z) \, |z|^{-d-1} \, dz = 0. \quad (48)$$
$(H_b)$ The function $b$ belongs to the Kato class $\mathbb{K}$ which is, as defined in [4],
$$\mathbb{K} := \Big\{ f : \mathbb{R}^d \to \mathbb{R} \ \text{ satisfies } \ \lim_{\delta \to 0} \sup_x \int_0^\delta \int_{\mathbb{R}^d} \bar{f}(x + y) \, \eta_{\alpha, \frac{\alpha-1}{2}}(s, y) \, dy \, ds = 0 \Big\},$$
where we have denoted $\bar{f}(x + y)$ as an abbreviation for $|f|(x + y) + |f|(x - y)$ and $\eta_{\alpha, \gamma}(t, x) := t^{\gamma} (|x| + \sqrt{t})^{-d-\alpha}$.
Throughout this paper we have assumed that Assumptions A1-A3 hold. From A1 it follows $(H_a)$, since we have asked the boundedness of $a$: in the case in which $x$ and $y$ are such that $|x - y| > 1$, there exists $c$ such that $|a(x) - a(y)| \le |a(x)| + |a(y)| \le c \le c |x - y|^\beta$, for each $\beta \in (0,1)$. In the case $|x - y| \le 1$, instead, we have
$$|a(x) - a(y)| \le L |x - y| = L |x - y|^{1-\beta} |x - y|^\beta \le L |x - y|^\beta,$$
as a consequence of the Lipschitz continuity.
Regarding $(H_k)$, we have that $k$ is a bounded function as a consequence of the second and third points of A3. Indeed,
$$|k(x, z)| = |z|^{d+\alpha} \Big| F\Big(\frac{z}{\gamma(x)}\Big) \Big| \le |z|^{d+\alpha} \, \frac{c \, |\gamma(x)|^{d+\alpha}}{|z|^{d+\alpha}} \le c \, \gamma_{\max}^{d+\alpha} < \infty.$$
Observing moreover that (48) holds true on the basis of the fourth point of A3, $(H_k)$ clearly follows.
We are left to show that $b \in \mathbb{K}$. Noticing that (cf. also Remark 2.6 in [4])
$$\int_0^\delta \eta_{\alpha, \frac{\alpha-1}{2}}(s, y) \, ds = \int_0^\delta \frac{s^{\frac{\alpha-1}{2}}}{(|y| + \sqrt{s})^{d+\alpha}} \, ds \le c \, \frac{(|y|^2 \wedge \delta)^{\frac{1+\alpha}{2}}}{|y|^{d+\alpha}} = \frac{1}{|y|^{d-1}} \Big( 1 \wedge \frac{\delta}{|y|^2} \Big)^{\frac{1+\alpha}{2}},$$
it is enough to show that
$$M_b(\delta) := \sup_{x \in \mathbb{R}^d} \int_{\mathbb{R}^d} \frac{|b(x + y)|}{|y|^{d-1}} \Big( 1 \wedge \frac{\delta}{|y|^2} \Big)^{\frac{1+\alpha}{2}} \, dy \underset{\delta \to 0}{\longrightarrow} 0$$
to get $b \in \mathbb{K}$. From Assumption A1 we know that $b$ is upper bounded by a constant. Therefore we have
$$\sup_{x \in \mathbb{R}^d} \Big| \int_{\mathbb{R}^d} \frac{|b(x + y)|}{|y|^{d-1}} \Big( 1 \wedge \frac{\delta}{|y|^2} \Big)^{\frac{1+\alpha}{2}} \, dy \Big| \le c \int_{\{y : |y| \le \sqrt{\delta}\}} \frac{dy}{|y|^{d-1}} + c \int_{\{y : |y| > \sqrt{\delta}\}} \frac{1}{|y|^{d-1}} \Big( \frac{\delta}{|y|^2} \Big)^{\frac{1+\alpha}{2}} \, dy. \quad (49)$$
We move to a polar coordinate system, getting that the right hand side of the equation (49) here above is upper bounded by
$$c \int_0^{\sqrt{\delta}} d\rho + c \, \delta^{\frac{1+\alpha}{2}} \int_{\sqrt{\delta}}^\infty \rho^{-1-\alpha} \, d\rho = c \sqrt{\delta} + c \, \delta^{\frac{1+\alpha}{2}} \delta^{-\frac{\alpha}{2}} = c \sqrt{\delta},$$
which clearly goes to $0$ for $\delta \to 0$. It yields that $(H_b)$ holds true. It entails that we can use Theorem 1.1 of [4]; Lemma 1 follows.

A.2 Proof of Lemma 2
Proof.
The exponential ergodicity and the exponential $\beta$-mixing of the process $X$ are shown in Proposition 3.8 and in the second point of Theorem 2.2 of [19]. To use them, we have to show that Assumptions 1, 2 and 3* stated in [19] hold.
Assumption 1 of [19] is a regularity condition which corresponds to our Assumption A1.
We want to show that point (b) of Assumption 2 of [19] holds, which is the following:
(b) There exists a constant $\Delta > 0$ such that $X_\Delta$ admits a density $p_\Delta(x, y)$ with respect to the Lebesgue measure on $\mathbb{R}^d$ for every $x \in \mathbb{R}^d$, and $(x, y) \mapsto p_\Delta(x, y)$ is bounded in $y \in \mathbb{R}^d$ and in $x \in K$ for every compact $K \subset \mathbb{R}^d$. Moreover, for every $x \in \mathbb{R}^d$ and every open ball $U \subset \mathbb{R}^d$ there exists a point $z = z(x, U) \in \mathrm{supp}(F)$ such that $\gamma(x) \cdot z \in U$.
We observe that the existence of a bounded density has already been proven in Lemma 1. Moreover, from the second and third points of A3, we know that $\mathrm{supp}(F) = \mathbb{R}^d$ and that $\gamma$ is an invertible matrix. Hence, for every $x \in \mathbb{R}^d$ and every open ball $U \subset \mathbb{R}^d$ there exists a point $z = z(x, U) \in \mathbb{R}^d$ such that $\gamma(x) \cdot z \in U$.
To conclude, we have to prove that Assumption 3* holds, and so we have to show the existence of a Lyapunov function. We therefore want to provide a function $f^*$ which satisfies the drift condition $A f^* \le -c_1 f^* + c_2$ for $c_1 > 0$, $c_2 > 0$, where $A$ denotes the generator of the diffusion, which is the sum of the continuous and discrete parts
$$A_c f(x) := \frac{1}{2} \sum_{i,j=1}^d a_{i,j}(x) \, \partial^2_{i,j} f(x) + \sum_{i=1}^d b_i(x) \, \partial_i f(x)$$
and
$$A_d f(x) := \int_{\mathbb{R}^d} \big( f(x + \gamma(x) \cdot z) - f(x) - \gamma(x) \cdot z \cdot \nabla f(x) \big) F(z) \, dz,$$
for every function $f : \mathbb{R}^d \to \mathbb{R}$, $f \in C^2(\mathbb{R}^d)$.
From the fifth point of condition A3 we know there exists $\epsilon_0 > 0$ such that $\int_{\mathbb{R}^d} |z|^2 e^{\epsilon_0 |z|} F(z) \, dz \le c$. For an $\epsilon \in (0, \epsilon_0]$ to be chosen small enough, we define $f^*(x) := e^{\epsilon |x|}$. We observe it is
$$\partial_i f^*(x) = \epsilon \, e^{\epsilon |x|} \frac{x_i}{|x|} \qquad \text{and} \qquad \partial^2_{i,j} f^*(x) = \epsilon \, \frac{x_i x_j}{|x|^2} \, e^{\epsilon |x|} \Big( \epsilon - \frac{1}{|x|} \Big) + \epsilon \, \frac{e^{\epsilon |x|}}{|x|} \, 1_{\{j = i\}}.$$
(50)
We therefore have, using also the drift condition gathered in Assumption A2, $\forall x : |x| > \tilde{\rho}$,
$$A_c f^*(x) \le \epsilon \, e^{\epsilon |x|} \Big( \epsilon + \frac{2}{|x|} \Big) \sum_{i,j=1}^d |a_{i,j}(x)| + \epsilon \, \frac{e^{\epsilon |x|}}{|x|} \langle x, b(x) \rangle \le c \, \epsilon \, e^{\epsilon |x|} \Big( \epsilon + \frac{1}{|x|} \Big) \sum_{i,j=1}^d |a_{i,j}(x)| - \tilde{C} \epsilon \, e^{\epsilon |x|}. \quad (51)$$
Concerning the discrete part of the generator, from a Taylor expansion with integral remainder we have
$$|A_d f^*(x)| \le \int_{\mathbb{R}^d} \int_0^1 \big| (\gamma(x) \cdot z)^T \cdot H_{f^*}(x + s \gamma(x) \cdot z) \cdot (\gamma(x) \cdot z) \big| \, ds \, F(z) \, dz,$$
where $H_{f^*}(x + s \gamma(x) \cdot z)$ denotes the Hessian matrix of the function $f^*$ computed in the point $x + s \gamma(x) \cdot z$.
We now split the integral in the right hand side here above, to act differently depending on whether $|z|$ is more or less than $\frac{|x|}{2 \|\gamma\|_\infty}$. We therefore get
$$|A_d f^*(x)| \le \int_{\{z : |z| \le \frac{|x|}{2 \|\gamma\|_\infty}\}} \int_0^1 \big| (\gamma(x) \cdot z)^T \cdot H_{f^*}(x + s \gamma(x) \cdot z) \cdot (\gamma(x) \cdot z) \big| \, ds \, F(z) \, dz + \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}\}} \int_0^1 \big| (\gamma(x) \cdot z)^T \cdot H_{f^*}(x + s \gamma(x) \cdot z) \cdot (\gamma(x) \cdot z) \big| \, ds \, F(z) \, dz =: I_1 + I_2.$$
Concerning $I_1$, from (50) it follows
$$I_1 \le c \int_{\{z : |z| \le \frac{|x|}{2 \|\gamma\|_\infty}\}} \int_0^1 |z|^2 \|\gamma\|^2_\infty \, \epsilon \, e^{\epsilon |x + s z \gamma(x)|} \Big( \epsilon + \frac{1}{|x + s z \gamma(x)|} \Big) F(z) \, dz \, ds \le c \, \epsilon \, e^{\epsilon |x|} \int_{\{z : |z| \le \frac{|x|}{2 \|\gamma\|_\infty}\}} |z|^2 e^{\epsilon |z| \|\gamma\|_\infty} \Big( \epsilon + \frac{1}{|x| - |z| \|\gamma\|_\infty} \Big) F(z) \, dz \le c \, \epsilon \, e^{\epsilon |x|} \Big( \epsilon + \frac{1}{|x|} \Big), \quad (52)$$
where in the last inequality we have used the fifth point of A3 and the fact that, on the integration set we are considering, it is $|z| \|\gamma\|_\infty \le \frac{|x|}{2}$ and so $|x| - |z| \|\gamma\|_\infty \ge \frac{|x|}{2}$.
We now study the term $I_2$, which we further split as
$$I_2 = I_{2,1} + I_{2,2} := \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}, \, |x + s z \gamma(x)| \le 1\}} \int_0^1 \big| (\gamma(x) \cdot z)^T \cdot H_{f^*}(x + s \gamma(x) \cdot z) \cdot (\gamma(x) \cdot z) \big| \, ds \, F(z) \, dz + \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}, \, |x + s z \gamma(x)| > 1\}} \int_0^1 \big| (\gamma(x) \cdot z)^T \cdot H_{f^*}(x + s \gamma(x) \cdot z) \cdot (\gamma(x) \cdot z) \big| \, ds \, F(z) \, dz.$$
On $I_{2,1}$ we can upper bound the entries of the Hessian matrix by a constant times $\epsilon$, and so we get
$$I_{2,1} \le c \, \epsilon \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}, \, |x + s z \gamma(x)| \le 1\}} |z|^2 \|\gamma\|^2_\infty \, F(z) \, dz \le c \, \epsilon \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}\}} |z|^2 e^{\epsilon |z|} e^{-\epsilon |z|} F(z) \, dz \le c \, \epsilon \, e^{-\epsilon \frac{|x|}{2 \|\gamma\|_\infty}}, \quad (53)$$
where in the last inequality we have used that $|z| > \frac{|x|}{2 \|\gamma\|_\infty}$; after that, we have enlarged the domain of integration and used the fifth point of Assumption A3 to upper bound the integral with a constant.
On $I_{2,2}$ we still use (50), getting
$$I_{2,2} \le c \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}, \, |x + s z \gamma(x)| > 1\}} \int_0^1 |z|^2 \|\gamma\|^2_\infty \, \epsilon \, e^{\epsilon |x + s z \gamma(x)|} \Big( \epsilon + \frac{1}{|x + s z \gamma(x)|} \Big) F(z) \, dz \, ds \le c \, (\epsilon + 1) \, \epsilon \, e^{\epsilon |x|} \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}\}} |z|^2 e^{\epsilon |z| \|\gamma\|_\infty} F(z) \, dz. \quad (54)$$
We define $J(x) := \int_{\{z : |z| > \frac{|x|}{2 \|\gamma\|_\infty}\}} |z|^2 e^{\epsilon |z| \|\gamma\|_\infty} F(z) \, dz$. From (52), (53) and (54), observing that the domain of integration of the integral defined as $J$ contains the one in (54) and using the boundedness of $\gamma$, we get
$$|A_d f^*(x)| \le c_0 \, \epsilon \, e^{\epsilon |x|} \Big( \epsilon + \frac{1}{|x|} + e^{-c \epsilon |x|} + (\epsilon + 1) J(x) \Big). \quad (55)$$
From (51) and (55) we get, using also the boundedness of $a$ and of $J$, which follows from A1 and the fifth point of A3,
$$A f^*(x) \le \epsilon \, e^{\epsilon |x|} (c \epsilon - \tilde{C}) + c \, \epsilon \, e^{\epsilon |x|} \Big( \frac{1}{|x|} + e^{-c \epsilon |x|} + J(x) \Big). \quad (56)$$
Since $\epsilon > 0$ can be chosen arbitrarily small, $c \epsilon$ is therefore less than $\tilde{C}$ and so the first term here above turns out to be negative. Moreover, the second term on the right hand side of (56) is $o(f^*)$. Indeed, $c \epsilon (\frac{1}{|x|} + e^{-c \epsilon |x|})$ clearly goes to $0$ for $|x| \to \infty$, and the domain of integration of the integral defined as $J(x)$ is the set of $z$ such that $|z| > \frac{|x|}{2 \|\gamma\|_\infty}$; therefore, for $|x| \to \infty$, the contribution of the integral becomes null.
It follows $A f^* \le -c_1 f^* + o(f^*)$, as we wanted. We get that the drift condition holds true for the proposed function $f^*$, and so Assumption 3* does.

A.3 Proof of Lemma 3
Proof.