On the nonparametric inference of coefficients of self-exciting jump-diffusion
Chiara Amorino (1) , Charlotte Dion (2) , Arnaud Gloter (3) , Sarah Lemler (4)
November 26, 2020
Abstract
In this paper, we consider a one-dimensional diffusion process with jumps driven by a Hawkes process. We are interested in the estimation of the volatility function and of the jump function from discrete high-frequency observations over a long time horizon. We first propose to estimate the volatility coefficient. To do so, we introduce in our estimation procedure a truncation function that allows us to take the jumps of the process into account, and we estimate the volatility function on a linear subspace of $L^2(A)$, where $A$ is a compact interval of $\mathbb{R}$. We obtain a bound for the empirical risk of the volatility estimator and establish an oracle inequality for the adaptive estimator to measure the performance of the procedure. Then, we propose an estimator of the sum of the squared volatility and of the squared jump coefficient multiplied by the conditional expectation of the intensity of the jumps. The idea behind this is to recover the jump function. We also establish a bound for the empirical risk of the non-adaptive estimator of this sum and an oracle inequality for the final adaptive estimator. We conduct a simulation study to measure the accuracy of our estimators in practice and we discuss the possibility of recovering the jump function from our estimation procedure.

Keywords: Jump diffusion, Hawkes process, Volatility estimation, Nonparametric, Adaptation. AMS: 62G05, 60G55.

1 Introduction

The present work focuses on the jump-diffusion process introduced in [18]. It is defined as the solution of the equation
$$dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t + a(X_{t^-}) \sum_{j=1}^{M} dN_t^{(j)}, \qquad (1)$$
where $X_{t^-}$ denotes the process of left limits, $N = (N^{(1)}, \ldots, N^{(M)})$ is an $M$-dimensional Hawkes process with intensity function $\lambda$, and $W$ is a standard Brownian motion independent of $N$.
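As a concrete illustration, a trajectory of (1) with $M = 1$ can be generated with a crude Euler scheme in which the Hawkes jump over a step of length $dt$ is approximated by a Bernoulli draw of parameter $\lambda_t\,dt$. This scheme and all coefficient choices below are illustrative assumptions of ours, not the paper's simulation design.

```python
import numpy as np

def euler_hawkes_jump_diffusion(b, sigma, a, zeta, c, alpha, x0, T, n, rng):
    """Crude Euler scheme for model (1) with M = 1 and kernel h(t) = c*exp(-alpha*t):
    on each step of length dt, a jump of N is drawn as Bernoulli(lambda * dt),
    which is only a small-dt approximation of the true point process."""
    dt = T / n
    x = np.empty(n + 1)
    lam = np.empty(n + 1)
    x[0], lam[0] = x0, zeta          # start the intensity at its baseline
    for i in range(n):
        dN = rng.uniform() < lam[i] * dt          # approximate jump indicator
        dW = rng.normal(0.0, np.sqrt(dt))         # Brownian increment
        x[i + 1] = x[i] + b(x[i]) * dt + sigma(x[i]) * dW + a(x[i]) * dN
        lam[i + 1] = lam[i] - alpha * (lam[i] - zeta) * dt + c * dN
    return x, lam

# Illustrative choices: mean-reverting drift, bounded volatility, constant jump size.
rng = np.random.default_rng(1)
X, lam = euler_hawkes_jump_diffusion(
    b=lambda x: -2.0 * x, sigma=lambda x: 1.0, a=lambda x: 0.5,
    zeta=1.0, c=0.3, alpha=1.0, x0=0.0, T=50.0, n=5000, rng=rng)
```

Note that with this update the intensity never falls below its baseline $\zeta$ and jumps upward by $c$ at each event, mimicking the self-excitation mechanism.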
(1) Unité de Recherche en Mathématiques, Université du Luxembourg, [email protected]. Chiara Amorino gratefully acknowledges financial support of ERC Consolidator Grant 815703 "STAMFORD: Statistical Methods for High Dimensional Diffusions". (2)
LPSM, Sorbonne Université, 75005 Paris, [email protected]. (3)
Laboratoire de Mathématiques et Modélisation d'Evry, CNRS, Univ Evry, Université Paris-Saclay, 91037 Evry, France, [email protected]. (4)
Université Paris-Saclay, École CentraleSupélec, MICS Laboratory, France, [email protected].

Some probabilistic results have been established for this model in [18], such as the ergodicity and the $\beta$-mixing property. A second work was then conducted to estimate the drift function of the model using a model selection procedure, and upper bounds on the risk of this adaptive estimator were established in [17] in the high-frequency observation context. In this work, we are interested in estimating the volatility function $\sigma$ and the jump function $a$. The jumps of the process make the estimation of these two functions difficult. We assume that discrete observations of a path of $X$ are available, at high frequency and on a large time interval.

Let us first notice that this model has practical relevance when thinking of continuous phenomena impacted by exterior events with an auto-excitation structure. For example, one can think of interest rate models in insurance (see [22]); or, in neuroscience, of the evolution of the membrane potential of a neuron impacted by the signals of the other neurons around it (see [17]). Indeed, it is common to describe the spike train of a neuron through a Hawkes process, which models the auto-excitation of the phenomenon: for a certain type of neurons, when the neuron spikes once, the probability that it will spike again increases. Finally, referring to [7] for a complete review of Hawkes processes in finance, the reader can see the considered model as a generalisation of the so-called mutually-exciting jump diffusion proposed in [5] to study the evolution of an asset price. This process generalises Poisson jumps (or Lévy jumps, which have independent increments) to auto-exciting jumps, and is more tractable than jumps driven by a Lévy process.

Nonparametric estimation of the coefficients of a stochastic differential equation from the observation of a discrete path is a challenge that has been widely studied in the literature.
From the frequentist point of view in the high-frequency context one can cite [23, 12], and from the Bayesian one, recently, [1]. Nevertheless, the purpose of this article falls more under the scope of statistics for stochastic processes with jumps. The literature on diffusions with jumps driven by a pure centred Lévy process is large; for example, one can refer to [29] and [31].

The first goal of this work is to estimate the volatility coefficient $\sigma^2$. As is well known, in the presence of jumps the approximate quadratic variation based on the squared increments of $X$ no longer converges to the integrated volatility. As in [28], we base our approach on the truncated quadratic variation to estimate the coefficient $\sigma^2$. In particular, instead of a sharp truncation, we use a smooth function to filter the jumps, as is done in [4] in the classical jump-diffusion context. The structure of the jumps here is very different from the one induced by a pure-jump Lévy process. Indeed, the increments are not independent, and this makes it necessary to develop a proper methodology, such as the one presented hereafter.

Secondly, we want to recover the coefficient $a$. It is important to note that, as presented in [31], in the classical jump-diffusion framework (where a Lévy process is used instead of the Hawkes process, for $M = 1$) it is possible to obtain an estimator of the function $\sigma^2 + a^2$ by considering the quadratic increments (without truncation) of the process. This is no longer the case here, due to the form of the intensity function of the Hawkes process. Indeed, we recover a more complicated function to be estimated, as explained in the following.

The estimation of the volatility function and of the jump function in Model (1) is challenging in the sense that we have to take into account the jumps of the Hawkes process. Statistical inference for the volatility and the jump function in a jump-diffusion model with jumps driven by a Hawkes process has never been studied before.
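The smooth jump filter used in place of a sharp indicator can be sketched as follows; the specific bump-function construction and the helper names are our own illustrative choices (the text only requires a smooth function equal to $1$ near $0$ and vanishing for large arguments).

```python
import numpy as np

def phi(z):
    """Smooth cutoff: 1 for |z| <= 1, 0 for |z| >= 2, C-infinity in between,
    built from the classical bump function exp(-1/t)."""
    z = np.atleast_1d(np.abs(np.asarray(z, dtype=float)))
    out = np.zeros_like(z)
    out[z <= 1.0] = 1.0
    band = (z > 1.0) & (z < 2.0)
    t = z[band] - 1.0                     # t in (0, 1) on the transition band
    bump = lambda s: np.exp(-1.0 / s)
    out[band] = bump(1.0 - t) / (bump(1.0 - t) + bump(t))  # smooth step from 1 to 0
    return out

def truncated_squared_increments(x, dt, beta=0.4):
    """Squared increments (X_{t_{i+1}} - X_{t_i})^2 / dt multiplied by the smooth
    filter at threshold dt**beta, so that jump increments are suppressed."""
    dx = np.diff(x)
    return (dx ** 2 / dt) * phi(dx / dt ** beta)
```

With $\beta < 1/2$, Brownian increments of order $\sqrt{dt}$ stay below the threshold $dt^\beta$ and are kept, while macroscopic jumps exceed it and are filtered out.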
As for the estimation of the drift in [17], we assume that the coupled process $(X, \lambda)$ is ergodic, stationary and exponentially $\beta$-mixing. In order to estimate the volatility nonparametrically, we consider, as in [4], a truncation of the increments of the quadratic variation of $X$ that allows us to judge whether or not a jump occurred in a given time interval. We estimate $\sigma^2$ on a collection of subspaces of $L^2$ by minimizing a least squares contrast over each model, and we establish a bound on the risk of the obtained estimators. Then, we propose an adaptive selection of the model and we obtain non-asymptotic oracle inequalities for the adaptive estimator that guarantee its theoretical performance.

In the second part of this work, we are interested in the estimation of the jump function. As said before, it is not possible to recover the jump function $a$ directly from the quadratic increments of $X$: what appears naturally is the sum of the squared volatility and of the product of the squared jump function and the jump intensity. The jump intensity is hard to control properly, and it is unobserved. To overcome this problem we introduce the conditional expectation of the intensity given the observation of $X$, which leads us to estimate the sum of the squared volatility and of the product of $a^2$ and the conditional expectation of the jump intensity given $X$. We again carry out a penalized minimum contrast estimation procedure and establish a non-asymptotic oracle inequality for the final adaptive estimator. Both adaptive estimators are studied using Talagrand's concentration inequalities.

We then discuss the estimation of $a$, obtained as a quotient in which we plug the estimators of $\sigma^2$ and $g := \sigma^2 + a^2 \times f$, where $f$ is the conditional expectation of the jump intensity, which we do not know in practice. We propose to estimate $f$ using a Nadaraya-Watson estimator.
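A minimal sketch of such a Nadaraya-Watson regression estimator (Gaussian kernel; the function name and bandwidth are illustrative choices of ours):

```python
import numpy as np

def nadaraya_watson(x_obs, y_obs, x_eval, h):
    """Nadaraya-Watson estimate of E[Y | X = x] with a Gaussian kernel of
    bandwidth h: a locally weighted average of the responses y_obs."""
    u = (np.asarray(x_eval)[:, None] - np.asarray(x_obs)[None, :]) / h
    w = np.exp(-0.5 * u ** 2)            # kernel weights K((x - X_i) / h)
    return (w @ np.asarray(y_obs)) / w.sum(axis=1)
```

Here $f$ is the regression of the cumulated intensity on $X$, so in a simulation study one can take the responses to be the simulated intensity values; the bandwidth $h$ drives the usual bias-variance trade-off.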
We show that the risk of the estimator of $a^2$ accumulates the errors coming from the estimation of the three functions $\sigma^2$, $g$ and the conditional expectation of the jump intensity, which shows how hard it is to estimate $a$ correctly.

Finally, we have conducted a simulation study to observe the behavior of our estimators in practice. We compare the empirical risks of our estimators to the risks of the oracle estimators, to which we have access in a simulation study (they correspond to the estimator in the collection of models which minimises the empirical error). We show that we can recover the volatility $\sigma^2$ and $g$ rather well from our procedure, but that it is harder to recover the jump function $a$.

The model is described in Section 2, where some assumptions on the model are discussed and we give some ergodic and $\beta$-mixing properties of the process $(X_t, \lambda_t)$. In Section 3 we detail the procedure to estimate the volatility $\sigma^2$: we establish a bound for the risk of the non-adaptive estimator, then we propose an adaptive procedure to choose the best estimator among the collection of estimators and give an oracle inequality for this final estimator. Section 4 is devoted to the estimation of $\sigma^2 + a^2 \times f$, where $f$ is the expectation of the jump intensity $\lambda$ given $X$. In this section, we explain why we have to estimate this function, we detail the estimation procedure and we establish bounds for the risks of the non-adaptive and adaptive estimators. The estimation of the jump coefficient $a$ is discussed in Section 5. In Section 6 we conduct a simulation study, and we give a short conclusion and some perspectives on this work in Section 7. Finally, the proofs of the main results are detailed in Section 8 and the technical results are proved in Appendix A.

2 The model

2.1 The Hawkes process

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. We define the Hawkes process, for $t \geq 0$, as an $M$-dimensional point process $N_t := (N_t^{(1)}, \ldots, N_t^{(M)})$ whose intensity $\lambda$ is a vector of non-negative stochastic intensity functions given by a collection of baseline intensities, namely positive constants $\zeta_j$ for $j \in \{1, \ldots, M\}$, and by $M \times M$ interaction functions $h_{i,j}: \mathbb{R}^+ \to \mathbb{R}^+$, which are measurable functions ($i, j \in \{1, \ldots, M\}$). For $i \in \{1, \ldots, M\}$ we also introduce $n^{(i)}$, a discrete point measure on $\mathbb{R}^-$ satisfying
$$\int_{\mathbb{R}^-} h_{i,j}(t - s)\, n^{(i)}(ds) < \infty \quad \text{for all } t \geq 0.$$
These measures can be interpreted as the initial condition of the process. The linear Hawkes process with initial condition $n^{(i)}$ and parameters $(\zeta_i, h_{i,j})_{1 \leq i,j \leq M}$ is a multivariate counting process $(N_t)_{t \geq 0}$ such that, for all $i \neq j$, $\mathbb{P}$-almost surely, $N^{(i)}$ and $N^{(j)}$ never jump simultaneously. Moreover, for any $i \in \{1, \ldots, M\}$, the compensator of $N^{(i)}$ is given by $\Lambda_t^{(i)} := \int_0^t \lambda_s^{(i)}\, ds$, where $\lambda$ is the intensity process of the counting process $N$ and satisfies the following equation:
$$\lambda_t^{(i)} = \zeta_i + \sum_{j=1}^{M} \int_0^{t^-} h_{i,j}(t - u)\, dN_u^{(j)} + \sum_{j=1}^{M} \int_{-\infty}^{0} h_{i,j}(t - u)\, dn_u^{(j)}.$$
We remark that $N_t^{(j)}$ is the cumulative number of events in the $j$-th component at time $t$, while $dN_t^{(j)}$ represents the number of points in the time increment $[t, t + dt]$. We define $\tilde{N}_t := N_t - \Lambda_t$ and $\bar{\mathcal{F}}_t := \sigma(N_s, 0 \leq s \leq t)$ the history of the counting process $N$ (see Daley and Vere-Jones [15]). The intensity process $\lambda = (\lambda^{(1)}, \ldots, \lambda^{(M)})$ of the counting process $N$ is the $\bar{\mathcal{F}}_t$-predictable process that makes $\tilde{N}_t$ an $\bar{\mathcal{F}}_t$-local martingale. Requiring the functions $h_{i,j}$ to be locally integrable, it is possible to show the existence of a process $(N_t^{(j)})_{t \geq 0}$ with standard arguments (see for example [16]).
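For the exponential kernels introduced just below, the process can be simulated exactly by Ogata's thinning algorithm. The following univariate ($M = 1$) sketch is illustrative (function and parameter names are our own); it uses the fact that, for a decaying kernel, the intensity right after the current time bounds the intensity until the next event.

```python
import numpy as np

def simulate_hawkes_exp(zeta, c, alpha, T, rng):
    """Exact simulation of a univariate linear Hawkes process with kernel
    h(t) = c * exp(-alpha * t) on [0, T] by Ogata's thinning algorithm.
    Between events the intensity decays, so its current value is a valid
    upper bound for the thinning step. Stationarity needs c / alpha < 1."""
    events = []
    s = 0.0
    lam_plus = zeta                           # intensity just after the current time
    while True:
        w = rng.exponential(1.0 / lam_plus)   # candidate waiting time
        s += w
        if s > T:
            break
        lam_s = zeta + (lam_plus - zeta) * np.exp(-alpha * w)  # decayed intensity
        if rng.uniform() * lam_plus <= lam_s:  # accept with prob lam_s / lam_plus
            events.append(s)
            lam_plus = lam_s + c               # self-excitation jump of size c
        else:
            lam_plus = lam_s
    return np.array(events)
```

The empirical event rate converges to $\zeta/(1 - c/\alpha)$ when the branching ratio $c/\alpha$ is below one, which matches the spectral-radius condition of Assumption 2 below in the univariate case.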
We denote by $\zeta_j$ the exogenous intensity of the process and by $(T_k^{(j)})_{k \geq 1}$ the non-decreasing jump times of the process $N^{(j)}$. We interpret the interaction functions $h_{i,j}$ (also called kernel functions or transfer functions) as the influence of the past activity of subject $i$ on the subject $j$, while the parameter $\zeta_j > 0$ quantifies the spontaneous activity of subject $j$. In the sequel we focus on exponential interaction functions:
$$h_{i,j}: \mathbb{R}^+ \to \mathbb{R}^+, \qquad h_{i,j}(t) = c_{ij} e^{-\alpha t}, \quad \alpha > 0, \; c_{ij} > 0, \; 1 \leq i, j \leq M.$$
With this choice of $h_{i,j}$ the conditional intensity process $(\lambda_t)$ is Markovian. In this case we can introduce the auxiliary Markov process $Y = (Y^{(ij)})$:
$$Y_t^{(ij)} = c_{i,j} \int_0^t e^{-\alpha(t-s)}\, dN_s^{(j)} + c_{i,j} \int_{-\infty}^0 e^{-\alpha(t-s)}\, dn_s^{(j)}, \qquad 1 \leq i, j \leq M.$$
The intensity can then be expressed in terms of sums of these Markovian processes: for all $1 \leq i \leq M$,
$$\lambda_t^{(i)} = f_i\Big(\sum_{j=1}^M Y_{t^-}^{(ij)}\Big), \quad \text{with } f_i(x) = \zeta_i + x.$$
We remark that all the point processes $N^{(j)}$ behave as homogeneous Poisson processes with constant intensities $\zeta_j$ before the first occurrence. Then, as soon as the first occurrence appears for a particular $N^{(i)}$, it affects the whole process, increasing the conditional intensities through the interaction functions $h_{i,j}$. Let us emphasize that, following [18], it is possible not to assume the positivity of the coefficients $c_{i,j}$, taking then $f_i(x) = (\zeta_i + x)^+$. This is particularly important for neuronal applications, where neurons can have an excitatory or inhibitory behavior.

2.2 Model Assumptions

In this work we consider the following jump-diffusion model, written as $M + 1$ stochastic differential equations:
$$\begin{cases} d\lambda_t^{(i)} = -\alpha(\lambda_t^{(i)} - \zeta_i)\, dt + \sum_{j=1}^M c_{i,j}\, dN_t^{(j)}, & i = 1, \ldots, M, \\ dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t + a(X_{t^-}) \sum_{j=1}^M dN_t^{(j)}, \end{cases} \qquad (2)$$
with $\lambda_0^{(j)}$ and $X_0$ random variables independent of the others. In particular, $(\lambda_t^{(1)}, \ldots, \lambda_t^{(M)}, X_t)$ is a Markovian process for the general filtration
$$\mathcal{F}_t := \sigma\big(W_s, N_s^{(j)},\; j = 1, \ldots, M, \; 0 \leq s \leq t\big).$$
We aim at estimating, in a nonparametric way, the volatility $\sigma^2$ and the jump coefficient $a$ from a discrete observation of the process $X$. The process $X$ is observed at high frequency on the time interval $[0, T]$. For $0 = t_0 \leq t_1 \leq \ldots \leq t_n = T$, the observations are denoted by $X_{t_i}$. We define $\Delta_{n,i} := t_{i+1} - t_i$ and $\Delta_n := \sup_{i=0,\ldots,n-1} \Delta_{n,i}$. We assume that $\Delta_n \to 0$ and $n\Delta_n \to \infty$ as $n \to \infty$. We suppose that there exist constants $c_1, c_2 > 0$ such that, for all $i \in \{0, \ldots, n-1\}$, $c_1 \Delta_n \leq \Delta_{n,i} \leq c_2 \Delta_n$. Furthermore, we require that there exists $\varepsilon > 0$ such that
$$n^{\varepsilon} \log n = o\big(\sqrt{n \Delta_n}\big). \qquad (3)$$
The size parameter $M$ is fixed and finite throughout, and asymptotic properties are obtained as $T \to \infty$. Requiring that the size of the discretization step is essentially constant, as we do by asking that the maximal and minimal discretization steps differ only by a constant, is a fairly classical assumption in our framework. On the other hand, the step condition gathered in (3) is more technical, yet essential to prove our main results.

Assumption 1 (Assumptions on the coefficients of X).
1. The coefficients $a$, $b$ and $\sigma$ are globally Lipschitz.
2. There exist positive constants $a_1$, $\sigma_0$ and $\sigma_1$ such that $|a(x)| < a_1$ and $\sigma_0 < \sigma(x) < \sigma_1$ for all $x \in \mathbb{R}$.
3. The coefficients $b$ and $\sigma$ are of class $C^2$ and there exist positive constants $c$, $c'$, $q$ such that, for all $x \in \mathbb{R}$, $|b'(x)| + |\sigma'(x)| + |a'(x)| \leq c$ and $|b''(x)| + |\sigma''(x)| \leq c'(1 + |x|^q)$.
4. There exist $d > 0$ and $r > 0$ such that, for all $x$ satisfying $|x| > r$, we have $xb(x) \leq -dx^2$.

The first three assumptions ensure the existence of a strong solution $X$ of the considered stochastic differential equation (the proof can be adapted from [27], under the Lipschitz condition on the jump coefficient $a$). The last assumption is introduced in order to study the long-time behavior of $X$ and to ensure its ergodicity (see [18]). Note that the assumption on $a$ can be relaxed (see [18]).

Assumption 2 (Assumptions on the kernels).
1. Let $H$ be the matrix with entries $H_{i,j} := \int_0^\infty h_{i,j}(t)\, dt = \frac{c_{ij}}{\alpha}$, for $1 \leq i, j \leq M$. The matrix $H$ has a spectral radius smaller than $1$.
2. We suppose that $\sum_{j=1}^M \zeta_j > 0$ and that the matrix $H$ is invertible.
3. For all $i, j$ such that $1 \leq i, j \leq M$, $c_{ij} \leq \alpha$.

The first point of Assumption 2 implies that the process $(N_t)$ admits a version with stationary increments (see [10]); in the sequel we always consider this assumption satisfied. The process $(N_t)$ then corresponds to its asymptotic limit and $(\lambda_t)$ is a stationary process. The second point of A2 is needed in order to ensure the positive Harris recurrence of the couple $(X_t, \lambda_t)$; a discussion about it can be found in Section 2.3 of [17].

In the sequel, we repeatedly use the ergodic properties of the process $Z_t := (X_t, \lambda_t)$. From Theorem 3.6 in [18] we know that, under Assumptions 1 and 2, the process $(X_t, \lambda_t)_{t \geq 0}$ is positive Harris recurrent with unique invariant measure $\pi(dx)$. Moreover, in [18], the Foster-Lyapunov condition in the exponential frame implies that, for all $t \geq 0$, $\mathbb{E}[X_t^2] < \infty$ (see Proposition 3.4). In the sequel we need $X$ to have arbitrarily big moments and, therefore, we propose a modified Lyapunov function. In particular, following the ideas in [18], we take $V: \mathbb{R} \times \mathbb{R}^{M \times M} \to \mathbb{R}$ such that
$$V(x, y) := |x|^m + e^{\sum_{i,j} m_{ij} |y^{(ij)}|}, \qquad (4)$$
where $m \geq 2$ and $m_{ij} := \frac{k_i}{\alpha}$, $k \in \mathbb{R}_+^M$ being a left eigenvector of $H$, which exists and has non-negative components under our Assumption 2 (see [18], below Assumption 3.3). We now introduce the generator of the process $\tilde{Z}_t := (X_t, Y_t)$, defined for sufficiently smooth test functions $g$ by
$$A_{\tilde{Z}}\, g(x, y) = -\alpha \sum_{i,j=1}^M y^{(ij)} \partial_{y^{(ij)}} g(x, y) + \partial_x g(x, y)\, b(x) + \frac{1}{2} \sigma^2(x)\, \partial_x^2 g(x, y) \qquad (5)$$
$$+ \sum_{j=1}^M f_j\Big(\sum_{k=1}^M y^{(jk)}\Big) \big[ g(x + a(x), y + \Delta_j) - g(x, y) \big],$$
with $(\Delta_j)^{(il)} = c_{i,j} \mathbf{1}_{j=l}$, for all $1 \leq i, l \leq M$.
Then, the following proposition holds true. Proposition 1.
Suppose that A1 and A2 hold. Let $V$ be as in (4). Then there exist positive constants $d_0$ and $d_1$ such that the following Foster-Lyapunov type drift condition holds:
$$A_{\tilde{Z}} V \leq d_0 - d_1 V.$$
Proposition 1 is proven in the Appendix. As the process $\lambda$ is included in $Y$, so that we can recover it starting from $Y$, the ergodicity of $\tilde{Z}$ implies the ergodicity of $Z$ as well. As a consequence, both $X$ and $\lambda$ have bounded moments of any order. Let us now add a third assumption.

Assumption 3. $(X_0, \lambda_0)$ has probability law $\pi$.

Then the process $(X_t, \lambda_t)_{t \geq 0}$ is in its stationary regime. We recall that the process $Z$ is called $\beta$-mixing if $\beta_Z(t) = o(1)$ as $t \to \infty$, and exponentially $\beta$-mixing if there exists a constant $\gamma_1 > 0$ such that $\beta_Z(t) = O(e^{-\gamma_1 t})$ as $t \to \infty$, where $\beta_Z$ is the $\beta$-mixing coefficient of the process $Z$, defined for a Markov process with transition semigroup $(P_t)_{t \in \mathbb{R}^+}$ by
$$\beta_Z(t) := \int_{\mathbb{R} \times \mathbb{R}^M} \|P_t(z, \cdot) - \pi\|_{TV}\, \pi(dz), \qquad (6)$$
where $\|\cdot\|_{TV}$ stands for the total variation norm of a signed measure. Moreover, we set
$$\beta_X(t) := \int_{\mathbb{R} \times \mathbb{R}^M} \big\|P_t^X(z, \cdot) - \pi_X\big\|_{TV}\, \pi(dz),$$
where $P_t^X(z, \cdot)$ is the projection of $P_t(z, \cdot)$ on $X$, i.e. $P_t^X(z, dx) := P_t(z, dx \times \mathbb{R}^M)$, and $\pi_X(dx) := \pi(dx \times \mathbb{R}^M)$ is the projection of $\pi$ on the coordinate $X$. Then, according to Theorem 4.9 in [18], under A1-A3 the process $Z_t := (X_t, \lambda_t)$ is exponentially $\beta$-mixing and there exist some constants $K, \gamma > 0$ such that
$$\beta_X(t) \leq \beta_Z(t) \leq K e^{-\gamma t}.$$

3 Estimation of the volatility

With the background introduced in the previous sections, we are now ready to deal with the estimation of the volatility function, to which this section is dedicated. We remind the reader that the procedure is based on the observations $(X_{t_i})_{i=1,\ldots,n}$. First of all, in Subsection 3.1, we propose a non-adaptive estimator based on the squared increments of the process $X$.
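For a histogram (piecewise-constant) basis on $A$, the least squares contrast has an explicit minimizer: on each cell, the estimator equals the average of the truncated statistics whose design point falls in that cell. A minimal sketch of this projection step (names and bin conventions are our own) is:

```python
import numpy as np

def histogram_projection(x_design, t_stat, A_min, A_max, D):
    """Least-squares projection of the statistics t_stat onto the histogram basis
    with D cells on [A_min, A_max]: the minimizer of the contrast on each cell is
    the cell average of the t_stat whose design point x_design lies in the cell."""
    edges = np.linspace(A_min, A_max, D + 1)
    idx = np.clip(np.searchsorted(edges, x_design, side="right") - 1, 0, D - 1)
    inside = (x_design >= A_min) & (x_design <= A_max)
    est = np.zeros(D)
    for k in range(D):
        sel = inside & (idx == k)
        if sel.any():
            est[k] = t_stat[sel].mean()
    return edges, est
```

In the adaptive step of Subsection 3.2, the number of cells $D$ plays the role of the dimension $D_m$ selected by penalization.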
To do so, we decompose such increments into several terms, aiming to isolate the volatility function. Among the other terms, we can recognize a bias term (which we will show to be small), the contribution of the Brownian part (which is centered) and the contribution of the jumps. To make the latter small as well, we introduce a truncation function (see Lemma 2 below). We can then define a contrast function, based on the truncated squared increments of $X$, and the associated estimator of the volatility. In Proposition 3, which is the main result of this subsection, we prove a bound for the empirical risk of the proposed volatility estimator.

As the presented estimator depends on the model, in Subsection 3.2 we introduce a fully data-driven procedure to automatically select the best model in the sense of the empirical risk. We choose the model minimizing the sum of the contrast and a penalization function, as explained in (13). In Theorem 1 we show that the estimator associated with the selected model automatically realizes the best compromise between the bias term and the penalty term.

Let us consider the increments of the process $X$:
$$X_{t_{i+1}} - X_{t_i} = \int_{t_i}^{t_{i+1}} b(X_s)\, ds + \int_{t_i}^{t_{i+1}} \sigma(X_s)\, dW_s + \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^M dN_s^{(j)} = \int_{t_i}^{t_{i+1}} b(X_s)\, ds + Z_{t_i} + J_{t_i}, \qquad (7)$$
where $Z$ and $J$ are given in Equation (8):
$$Z_{t_i} := \int_{t_i}^{t_{i+1}} \sigma(X_s)\, dW_s, \qquad J_{t_i} := \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^M dN_s^{(j)}. \qquad (8)$$
To estimate $\sigma^2$ for a diffusion process (without jumps), the idea is to consider the random variables $T_{t_i} := \frac{1}{\Delta_n}(X_{t_{i+1}} - X_{t_i})^2$. Following this idea, we decompose $T_{t_i}$ in order to isolate the contribution of the volatility evaluated at $X_{t_i}$. In particular, Equation (7) yields
$$T_{t_i} = \frac{1}{\Delta_n}(X_{t_{i+1}} - X_{t_i})^2 = \sigma^2(X_{t_i}) + A_{t_i} + B_{t_i} + E_{t_i}, \qquad (9)$$
where $A$, $B$, $E$ are functions of $Z$, $J$:
$$A_{t_i} := \frac{1}{\Delta_n}\Big(\int_{t_i}^{t_{i+1}} b(X_s)\, ds\Big)^2 + \frac{2}{\Delta_n}(Z_{t_i} + J_{t_i}) \int_{t_i}^{t_{i+1}} \big(b(X_s) - b(X_{t_i})\big)\, ds + \frac{1}{\Delta_n} \int_{t_i}^{t_{i+1}} \big(\sigma^2(X_s) - \sigma^2(X_{t_i})\big)\, ds + 2 b(X_{t_i}) Z_{t_i},$$
$$B_{t_i} := \frac{1}{\Delta_n}\Big[Z_{t_i}^2 - \int_{t_i}^{t_{i+1}} \sigma^2(X_s)\, ds\Big], \qquad E_{t_i} := 2 b(X_{t_i}) J_{t_i} + \frac{2}{\Delta_n} Z_{t_i} J_{t_i} + \frac{1}{\Delta_n} J_{t_i}^2.$$
The term $A_{t_i}$ is small, whereas $B_{t_i}$ is centered. In order to make $E_{t_i}$ small as well, we introduce the truncation function $\varphi_{\Delta_{n,i}^\beta}(X_{t_{i+1}} - X_{t_i})$, for $\beta \in (0, 1/2)$. It is a smooth version of the indicator function, built from a smooth $\varphi$ such that $\varphi(\zeta) = 0$ for each $\zeta$ with $|\zeta| \geq 2$ and $\varphi(\zeta) = 1$ for each $\zeta$ with $|\zeta| \leq 1$, rescaled by the threshold: $\varphi_{\Delta_{n,i}^\beta}(\zeta) := \varphi(\zeta / \Delta_{n,i}^\beta)$. The idea is to use the size of the increment $X_{t_{i+1}} - X_{t_i}$ to judge whether a jump occurred or not in the interval $[t_i, t_{i+1})$. As it is hard for the increment of a process with continuous transitions to overcome the threshold $\Delta_{n,i}^\beta$ for $\beta < 1/2$, we can assert the presence of a jump in $[t_i, t_{i+1})$ if $|X_{t_{i+1}} - X_{t_i}| > \Delta_{n,i}^\beta$. Hence, we consider the random variables
$$T_{t_i}\, \varphi_{\Delta_{n,i}^\beta}(\Delta_i X) = \sigma^2(X_{t_i}) + \tilde{A}_{t_i} + B_{t_i} + E_{t_i}\, \varphi_{\Delta_{n,i}^\beta}(\Delta_i X),$$
with
$$\tilde{A}_{t_i} := \sigma^2(X_{t_i})\big(\varphi_{\Delta_{n,i}^\beta}(\Delta_i X) - 1\big) + A_{t_i}\, \varphi_{\Delta_{n,i}^\beta}(\Delta_i X) + B_{t_i}\big(\varphi_{\Delta_{n,i}^\beta}(\Delta_i X) - 1\big).$$
The just introduced $\tilde{A}_{t_i}$ is once again a small term, because $A_{t_i}$ was, and because the truncation function does not differ much from the indicator function, as justified in Lemma 1 below. In the sequel, the constant $c$ may change value from line to line.

Lemma 1.
Suppose that A1-A3 hold. Then, for any $k \geq 1$,
$$\mathbb{E}\big[\,|\varphi_{\Delta_{n,i}^\beta}(\Delta_i X) - 1|^k\,\big] \leq c\, \Delta_{n,i}.$$
The proof of Lemma 1 can be found in the Appendix, as is the proof of Lemma 2 below, which illustrates the reason why we have introduced a truncation function. Indeed, without the presence of $\varphi$, the same lemma would have held with just $c \Delta_{n,i}$ on the right-hand side. Filtering the contribution of the jumps, we gain an extra factor $\Delta_{n,i}^{\beta q}$ which, as we will see in Proposition 2, makes the contribution of $E_{t_i}$ small.

Lemma 2.
Suppose that A1-A3 hold. Then, for $q \geq 1$ and for any $k \geq 1$,
$$\mathbb{E}\Big[\,|J_{t_i}|^q\, \varphi^k_{\Delta_{n,i}^\beta}(\Delta_i X)\Big] \leq c\, \Delta_{n,i}^{1 + \beta q}.$$
From Lemmas 1 and 2 above, it is possible to show the following proposition, whose proof can also be found in the Appendix.
Proposition 2.
Suppose that A1-A3 hold. Then, for $\beta \in (1/4, 1/2)$:
1. $\forall \tilde\varepsilon > 0$, $\mathbb{E}[\tilde{A}_{t_i}^2] \leq c \Delta_{n,i}^{1-\tilde\varepsilon}$ and $\mathbb{E}[\tilde{A}_{t_i}^4] \leq c \Delta_{n,i}^{1-\tilde\varepsilon}$;
2. $\mathbb{E}[B_{t_i} \mid \mathcal{F}_{t_i}] = 0$, $\mathbb{E}[B_{t_i}^2 \mid \mathcal{F}_{t_i}] \leq c \sigma_1^4$, $\mathbb{E}[B_{t_i}^4] \leq c$;
3. $\mathbb{E}\big[|E_{t_i}| \varphi_{\Delta_{n,i}^\beta}(\Delta_i X)\big] \leq c \Delta_{n,i}^{\beta}$, $\mathbb{E}\big[E_{t_i}^2 \varphi_{\Delta_{n,i}^\beta}(\Delta_i X)\big] \leq c \Delta_{n,i}^{2\beta - 1/2}$, $\mathbb{E}\big[E_{t_i}^4 \varphi_{\Delta_{n,i}^\beta}(\Delta_i X)\big] \leq c \Delta_{n,i}^{4\beta - 3/2}$.

In the proposition above one can see in detail in which sense the contributions of $\tilde{A}_{t_i}$ and of the truncated $E_{t_i}$ are small. Moreover, an analysis of the centered Brownian term $B_{t_i}$ and of its powers is provided.

Based on these variables, we propose a nonparametric estimation procedure for the function $\sigma^2(\cdot)$ on a closed interval $A$ of $\mathbb{R}$. We consider a linear subspace $S_m = \mathrm{span}(\varphi_1, \ldots, \varphi_{D_m})$ of $L^2(A)$ of dimension $D_m$, where $(\varphi_i)_i$ is an orthonormal basis of $L^2(A)$. We denote $\tilde{S}_n := \cup_{m \in \mathcal{M}_n} S_m$, where $\mathcal{M}_n \subset \mathbb{N}$ is a set of indexes for the model collection. The contrast function is defined by
$$\gamma_{n,M}(t) := \frac{1}{n} \sum_{i=0}^{n-1} \big(t(X_{t_i}) - T_{t_i}\, \varphi_{\Delta_{n,i}^\beta}(\Delta_i X)\big)^2 \qquad (10)$$
with the $T_{t_i}$ given in Equation (9). The associated least squares contrast estimator is
$$\widehat{\sigma}^2_m := \arg\min_{t \in S_m} \gamma_{n,M}(t). \qquad (11)$$
We observe that, as $\widehat{\sigma}^2_m$ achieves the minimum, it represents the projection of the truncated statistics onto the space $S_m$. The approximation spaces $S_m$ have to satisfy the following properties.

Assumption 4 (Assumptions on the subspaces).
1. There exists $\phi_1$ such that, for any $t \in S_m$, $\|t\|_\infty^2 \leq \phi_1 D_m \|t\|^2$.
2. The spaces $S_m$ have finite dimension $D_m$ and are nested: for all $m < m' \in \mathcal{M}_n$, $S_m \subset S_{m'}$.
3. For any positive $d$ there exists $\tilde{\varepsilon} > 0$ such that, for any $\varepsilon < \tilde{\varepsilon}$, $\sum_{m \in \mathcal{M}_n} e^{-d D_m^{1-\varepsilon}} \leq \Sigma(d)$, where $\Sigma(d)$ denotes a finite constant depending only on $d$.

We now introduce the empirical norm
$$\|t\|_n^2 := \frac{1}{n} \sum_{i=0}^{n-1} t^2(X_{t_i}).$$
The main result of this section is a bound on $\mathbb{E}[\|\widehat{\sigma}^2_m - \sigma^2\|_n^2]$, gathered in the following proposition. Its proof can be found in Section 8.1.

Proposition 3.
Suppose that A1-A4 hold and that $\beta \in (1/4, 1/2)$. If $\Delta_n \to 0$ and, for some $\varepsilon > 0$, $n^\varepsilon \log n = o(\sqrt{n\Delta_n})$ as $n \to \infty$, and $D_n \leq C \frac{\sqrt{n\Delta_n}}{\log(n)\, n^\varepsilon}$ for a constant $C > 0$, then the estimator $\widehat{\sigma}^2_m$ of $\sigma^2$ on $A$ given by Equation (11) satisfies
$$\mathbb{E}\big[\|\widehat{\sigma}^2_m - \sigma^2\|_n^2\big] \leq 13 \inf_{t \in S_m} \|t - \sigma^2\|_{\pi_X}^2 + C_1 \frac{\sigma_1^4 D_m}{n} + C_2\, \Delta_n^{2\beta - 1/2} + C_3 \frac{\Delta_n^{1 \wedge (2\beta - 1/2)}}{n}, \qquad (12)$$
with $C_1$, $C_2$ and $C_3$ positive constants.

This proposition gives a bound on the empirical risk of $\widehat{\sigma}^2_m$. The right-hand side of Equation (12) decomposes into different types of error. The first term corresponds to the bias term, which decreases with the dimension $D_m$ of the approximation space $S_m$. The second term corresponds to the variance term, i.e. the estimation error, and, contrary to the bias, it increases with $D_m$. The third term comes from the discretization error and the controls obtained in Proposition 2, taking the jumps into account. The fourth term arises when evaluating the norm $\|\widehat{\sigma}^2_m - \sigma^2\|_n$ on the set where $\|\cdot\|_n$ and $\|\cdot\|_{\pi_X}$ are not equivalent. This inequality ensures that our estimator $\widehat{\sigma}^2_m$ does almost as well as the best approximation of the true function by a function of $S_m$. Finally, it should be noted that the variance term is the same as for a diffusion without jumps; the remainder terms, however, are larger because of the jumps.

We now want to define a criterion to automatically select the best dimension $D_m$ (and thus the best model) in the sense of the empirical risk. This procedure should be adaptive, meaning independent of $\sigma^2$ and dependent only on the observations. The final model is chosen by minimizing the following criterion:
$$\widehat{m} := \arg\min_{m \in \mathcal{M}_n} \big\{\gamma_{n,M}(\widehat{\sigma}^2_m) + \mathrm{pen}_\sigma(m)\big\}, \qquad (13)$$
with $\mathrm{pen}_\sigma(\cdot)$ the increasing function of $D_m$ given by
$$\mathrm{pen}_\sigma(m) := \kappa \frac{D_m}{n}, \qquad (14)$$
where $\kappa$ is a constant which has to be calibrated. The next theorem is proven in Section 8.1.

Theorem 1.
Suppose that A1-A4 hold and that $\beta \in (1/4, 1/2)$. If $\Delta_n \to 0$ and, for some $\varepsilon > 0$, $n^\varepsilon \log n = o(\sqrt{n\Delta_n})$ as $n \to \infty$, and $D_n \leq C \frac{\sqrt{n\Delta_n}}{\log(n)\, n^\varepsilon}$ for $C > 0$, then the estimator $\widehat{\sigma}^2_{\widehat{m}}$ of $\sigma^2$ on $A$ given by Equations (11) and (13) satisfies
$$\mathbb{E}\big[\|\widehat{\sigma}^2_{\widehat{m}} - \sigma^2\|_n^2\big] \leq C_1 \inf_{m \in \mathcal{M}_n} \Big\{ \inf_{t \in S_m} \|t - \sigma^2\|_{\pi_X}^2 + \mathrm{pen}_\sigma(m) \Big\} + C_2\, \Delta_n^{2\beta - 1/2} + C_3 \frac{\Delta_n^{2\beta - 1/2}}{n} + \frac{C_4}{n},$$
where $C_1 > 1$ is a numerical constant and $C_2$, $C_3$, $C_4$ are positive constants depending in particular on $\Delta_n$ and $\sigma_1$.

This inequality ensures that the final estimator $\widehat{\sigma}^2_{\widehat{m}}$ automatically realizes the best compromise between the bias term and the penalty term, which is of the same order as the variance term.

4 Estimation of the sum $\sigma^2 + a^2 \times f$

In addition to the estimation of the volatility, our goal is to estimate, once again in a nonparametric way, the jump coefficient $a$. The idea is to study the sum of the squared volatility and of a squared-jump-coefficient term, and to deduce from it a way to estimate $a$ (see Section 5 below). However, what naturally appears is the squared volatility plus the product of the squared jump coefficient and the jump intensity which, as we will see in the sequel, leads to some difficulties. To overcome these difficulties, we are led to consider the conditional expectation of the intensity of the jumps given $X_{t_i}$. We therefore analyze the squared increments of the process $X$ differently, to highlight the role of this conditional expectation. In the following we use, for the decomposition of the squared increments, essentially the same notation as before: we denote the small bias term by $A_{t_i}$, the Brownian contribution by $B_{t_i}$ and the jump contribution by $E_{t_i}$, even though these terms no longer have the same form as in Section 3. In particular, $A_{t_i}$ and $E_{t_i}$ are redefined below, while the Brownian contribution $B_{t_i}$ remains exactly the same.
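Once estimates of $\sigma^2$, $g$ and $f$ are available on a grid, the quotient step of Section 5 can be sketched as follows; clipping the numerator at zero and flooring the denominator are our own illustrative safeguards against small or negative values, not part of the paper's procedure.

```python
import numpy as np

def a_squared_plugin(g_hat, sigma2_hat, f_hat, floor=1e-3):
    """Plug-in estimator a^2 ~ (g - sigma^2) / f on a grid: the numerator is
    clipped at 0 (a^2 is non-negative) and the denominator is kept away from 0."""
    num = np.maximum(np.asarray(g_hat) - np.asarray(sigma2_hat), 0.0)
    return num / np.maximum(np.asarray(f_hat), floor)
```

The resulting risk accumulates the errors of all three plugged-in estimators, which is why recovering $a$ is harder than recovering $\sigma^2$ or $g$.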
To these, as previously anticipated, a term $C_{t_i}$ deriving from the conditional expectation of the intensity is added. Besides, as in the previous section, we show that $A_{t_i}$ is small and $B_{t_i}$ is centered. Moreover, in this case we also need the jump part to be centered. Therefore, we consider the compensated measure $d\tilde{N}_t$ instead of $dN_t$, relocating the difference into the drift. Let us rewrite the process of interest as
$$\begin{cases} d\lambda_t^{(j)} = -\alpha(\lambda_t^{(j)} - \zeta_j)\, dt + \sum_{i=1}^M c_{j,i}\, dN_t^{(i)}, \\ dX_t = \big(b(X_t) + a(X_{t^-}) \sum_{i=1}^M \lambda_t^{(i)}\big)\, dt + \sigma(X_t)\, dW_t + a(X_{t^-}) \sum_{i=1}^M d\tilde{N}_t^{(i)}. \end{cases} \qquad (15)$$
We now set
$$J_{t_i} := \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{i=1}^M d\tilde{N}_s^{(i)}. \qquad (16)$$
The increments of the process $X$ are such that
$$X_{t_{i+1}} - X_{t_i} = \int_{t_i}^{t_{i+1}} \Big(b(X_s) + a(X_{s^-}) \sum_{j=1}^M \lambda_s^{(j)}\Big)\, ds + Z_{t_i} + J_{t_i}, \qquad (17)$$
where $J$ is given in Equation (16) and $Z$ is unchanged, given in Equation (8). Let us define this time:
$$A_{t_i} := \frac{1}{\Delta_n} \Big(\int_{t_i}^{t_{i+1}} \big(b(X_s) + a(X_{s^-}) \textstyle\sum_{j=1}^M \lambda_s^{(j)}\big)\, ds\Big)^2 + \frac{1}{\Delta_n} \int_{t_i}^{t_{i+1}} \big(\sigma^2(X_s) - \sigma^2(X_{t_i})\big)\, ds$$
$$+ \frac{2}{\Delta_n}(Z_{t_i} + J_{t_i}) \int_{t_i}^{t_{i+1}} \Big[\big(b(X_s) - b(X_{t_i})\big) + \Big(a(X_{s^-}) \sum_{j=1}^M \lambda_s^{(j)} - a(X_{t_i}) \sum_{j=1}^M \lambda_{t_i}^{(j)}\Big)\Big]\, ds$$
$$+ \frac{1}{\Delta_n} \int_{t_i}^{t_{i+1}} \big(a^2(X_s) - a^2(X_{t_i})\big) \sum_{j=1}^M \lambda_s^{(j)}\, ds + \frac{a^2(X_{t_i})}{\Delta_n} \int_{t_i}^{t_{i+1}} \sum_{j=1}^M \big(\lambda_s^{(j)} - \lambda_{t_i}^{(j)}\big)\, ds$$
$$+ 2\Big(b(X_{t_i}) + a(X_{t_i}) \sum_{j=1}^M \lambda_{t_i}^{(j)}\Big) Z_{t_i} + 2\Big(b(X_{t_i}) + a(X_{t_i}) \sum_{j=1}^M \lambda_{t_i}^{(j)}\Big) J_{t_i}, \qquad (18)$$
$$E_{t_i} := \frac{2}{\Delta_n} Z_{t_i} J_{t_i} + \frac{1}{\Delta_n} \Big[J_{t_i}^2 - \int_{t_i}^{t_{i+1}} a^2(X_s) \sum_{j=1}^M \lambda_s^{(j)}\, ds\Big]. \qquad (19)$$
The term $A_{t_i}$ is small, whereas $B_{t_i}$ (which is the same as in the previous section) and $E_{t_i}$ are centered. Moreover, let us define the quantity
$$\sum_{j=1}^M \mathbb{E}\big[\lambda_{t_i}^{(j)} \mid X_{t_i}\big] = \sum_{j=1}^M \frac{\int_{\mathbb{R}^M} z_j\, \pi(X_{t_i}, z_1, \ldots, z_M)\, dz_1 \ldots dz_M}{\int_{\mathbb{R}^M} \pi(X_{t_i}, z_1, \ldots, z_M)\, dz_1 \ldots dz_M},$$
where $\pi$ is the invariant density of the process $(X, \lambda)$, whose existence has been discussed in Section 2.3, and
$$C_{t_i} := a^2(X_{t_i}) \sum_{j=1}^M \big(\lambda_{t_i}^{(j)} - \mathbb{E}[\lambda_{t_i}^{(j)} \mid X_{t_i}]\big).$$
It comes the following decomposition:

  T_{t_i} = (X_{t_{i+1}} − X_{t_i})²/Δ_n = σ²(X_{t_i}) + a²(X_{t_i}) Σ_{j=1}^M E[ λ_{t_i}^{(j)} | X_{t_i} ] + A_{t_i} + B_{t_i} + C_{t_i} + E_{t_i}.   (20)

In this last decomposition of the squared increments we have isolated the sum of the volatility and the jump coefficient times the conditional expectation of the intensity given X_{t_i}, an object on which we can finally use the same approach as before. Thus, as previously, the other terms need to be evaluated. The term A_{t_i} is small and B_{t_i} and E_{t_i} are centered. Moreover, the newly added term C_{t_i} is clearly centered, by construction, once conditioned on the random variable X_{t_i}, and, as we will see in the sequel, this is enough to get our main results. As explained above Assumption 3, the Foster–Lyapunov condition in the exponential frame implies the existence of bounded moments for λ, so we also get E[ λ_{t_i}^{(j)} | X_{t_i} ] < ∞ for any j ∈ {1, …, M}. The properties listed above are stated in Proposition 4 below, whose proof can be found in the appendix.

Proposition 4.
Suppose that A1–A3 hold. Then,
1. for all ε̃ > 0, E[A_{t_i}²] ≤ c Δ_{n,i}^{1−ε̃} and E[A_{t_i}⁴] ≤ c Δ_{n,i}^{1−ε̃};
2. E[B_{t_i} | F_{t_i}] = 0, E[B_{t_i}² | F_{t_i}] ≤ c σ1⁴, E[B_{t_i}⁴] ≤ c;
3. E[E_{t_i} | F_{t_i}] = 0, E[E_{t_i}² | F_{t_i}] ≤ c a1⁴ Δ_{n,i}^{−1} Σ_{j=1}^M λ_{t_i}^{(j)}, E[E_{t_i}⁴] ≤ c Δ_{n,i}^{−3};
4. E[C_{t_i} | X_{t_i}] = 0, E[C_{t_i}²] ≤ c, E[C_{t_i}⁴] ≤ c.

From Proposition 4 one can see in detail how small the bias term A_{t_i} is. Moreover, it sheds light on the fact that the Brownian term and the jump term are centered with respect to the filtration (F_t), while C is centered with respect to the σ-algebra generated by the process X.

Based on the variables just introduced, we propose a nonparametric estimation procedure for the function

  g(x) := σ²(x) + a²(x) f(x)   (21)

with

  f(x) = Σ_{j=1}^M ( ∫_{R^M} z_j π(x, z_1, …, z_M) dz_1 … dz_M ) / ( ∫_{R^M} π(x, z_1, …, z_M) dz_1 … dz_M )   (22)

on a closed interval A. We consider S_m, the linear subspace of L²(A) defined in the previous section for m ∈ M_n and satisfying Assumption 4. The contrast function is defined almost as before, since this time we no longer need to truncate the contribution of the jumps. For t ∈ S̃_n,

  γ_{n,M}(t) := (1/n) Σ_{i=0}^{n−1} ( t(X_{t_i}) − T_{t_i} )²,

where the T_{t_i} are this time given by Equation (20). The associated mean-squares contrast estimator is

  ĝ_m := argmin_{t ∈ S_m} γ_{n,M}(t).   (23)

We want to bound the empirical risk E[‖ĝ_m − g‖²_n]. We state it in the next proposition, whose proof can be found in Section 8.2.

Proposition 5. Suppose that A1–A4 hold. If Δ_n → 0 and, for some ε > 0, n^ε log n = o(√(nΔ_n)) and D_n ≤ C0 √(nΔ_n)/(log(n) n^ε) for some C0 > 0, then the estimator ĝ_m of g on A satisfies, for any ε̃ > 0, E[ ‖ĝ_m − g‖²_n ] ≤
13 inf_{t ∈ S_m} ‖t − g‖²_{πX} + C1 (σ1⁴ + a1⁴ + 1) D_m/(nΔ_n) + C2 Δ_n^{1−ε̃} + C3/(nΔ_n),   (24)

with C1, C2 and C3 positive constants. As in the previous section, this inequality measures the performance of our estimator ĝ_m in the empirical norm. The right-hand side of Equation (24) decomposes into four different types of error. The first term corresponds to the bias, which decreases with the dimension D_m of the approximation space S_m. The second term corresponds to the variance, i.e. the estimation error, and, contrary to the bias, it increases with D_m. The third term comes from the discretization error and the controls obtained in Proposition 4. The fourth term appears when evaluating the norm ‖ĝ_m − g‖²_n in the case where ‖·‖_n and ‖·‖_{πX} are not equivalent.

Finally, let us compare this result with the bound (12) obtained for the estimator σ̂²_m. The main difference is that the second term is here of order D_m/(nΔ_n) instead of D_m/n as it was previously. As a consequence, in practice the risks will depend mainly on nΔ_n for the estimation of g and on n for the estimation of σ².

Also for the estimation of g we define a criterion in order to select the best dimension D_m in the sense of the empirical risk. This procedure should be adaptive, meaning independent of g and dependent only on the observations. The final chosen model minimizes the following criterion:

  m̂₂ := argmin_{m ∈ M_n} { γ_{n,M}(ĝ_m) + pen_g(m) },   (25)

with pen_g(·) the increasing function of D_m given by

  pen_g(m) := κ2 D_m/(nΔ_n),   (26)

where κ2 is a constant which has to be calibrated. We remark that the m̂ introduced here is not the same as in Equation (13).
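In practical terms, a selection rule of this penalized-contrast type sweeps the candidate dimensions, fits a least-squares projection for each, and minimizes contrast plus penalty. The following is a minimal illustrative sketch (our own code, not the authors'): the trigonometric basis, the toy data and the value of the constant `kappa` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def fourier_design(x, dim):
    """First `dim` functions of a Fourier basis on [min(x), max(x)]."""
    u = 2 * np.pi * (x - x.min()) / (x.max() - x.min() + 1e-12)
    cols = [np.ones_like(u)]
    k = 1
    while len(cols) < dim:
        cols.append(np.cos(k * u))
        if len(cols) < dim:
            cols.append(np.sin(k * u))
        k += 1
    return np.column_stack(cols)

def select_model(x, y, max_dim, pen):
    """argmin_m  gamma_n(hat g_m) + pen(m), with gamma_n the least-squares contrast."""
    best_m, best_crit = None, np.inf
    for m in range(1, max_dim + 1):
        Phi = fourier_design(x, m)
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        crit = np.mean((y - Phi @ coef) ** 2) + pen(m)
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m

# toy check: a smooth target observed in noise; pen(m) = kappa * m / (n * dt)
n, dt, kappa = 5000, 0.01, 0.5
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 0.5 * np.cos(np.pi * x) + 0.3 * rng.standard_normal(n)
m_hat = select_model(x, y, max_dim=20, pen=lambda m: kappa * m / (n * dt))
```

On this toy example the penalty prevents the overfitting that the raw contrast alone would produce: the selected dimension stays close to the two basis functions actually needed.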
The model minimizing the right-hand side of (13) is in fact m̂₁, while the one introduced in Equation (25) is m̂₂; when it causes no confusion we denote both by m̂ in order to lighten the notation. We analyse the quantity E[ ‖ĝ_m̂ − g‖²_n ] in the following theorem, whose proof can be found in Section 8.2.

Theorem 2.
Suppose that A1–A4 hold. If Δ_n → 0 and, for some ε > 0, n^ε log n = o(√(nΔ_n)) and D_n ≤ C0 √(nΔ_n)/(log(n) n^ε) for some C0 > 0, then the estimator ĝ_m̂ of g on A satisfies, for any ε̃ > 0,

  E[ ‖ĝ_m̂ − g‖²_n ] ≤ C1 inf_{m ∈ M_n} { inf_{t ∈ S_m} ‖t − g‖²_{πX} + pen_g(m) } + C2 Δ_n^{1−ε̃} + C3/(nΔ_n) + C4/(n²Δ_n),

where C1 > 0 is a numerical constant and C2, C3, C4 are positive constants depending in particular on Δ_n, a² and σ². This oracle inequality guarantees that our final estimator ĝ_m̂ automatically realizes the best compromise between the bias term and the penalty term, which is of the same order as the variance term. Since it is harder to estimate g, because we have to deal with the conditional expectation of the intensity f, the last two error terms are larger than those obtained in Theorem 1 for the estimation of σ².

Estimation of the jump coefficient
The challenge is to get an estimator of the coefficient a(·). A natural idea is to replace the conditional expectation in the definition of g given in Equation (21) by an estimator. Let us remind the reader of the notation f(x) := Σ_{j=1}^M E[ λ_{t_i}^{(j)} | X_{t_i} = x ] (see Equation (22)) and g(x) = σ²(x) + a²(x) f(x). The function f can be estimated through a classical estimator, for example the Nadaraya–Watson estimator. This is only possible if the intensity of the Hawkes process is known and the jump times are observed. We make this assumption in this section and denote this estimator by f̂_h, where h > 0 is a bandwidth. To build an estimator of a²(·) of the form (ĝ_{m₂}(x) − σ̂²_{m₁}(x))/f̂_h(x), we also assume that f > f0 on A for some f0 > 0. We then set

  â²_z := ( (ĝ_{m₂}(x) − σ̂²_{m₁}(x)) / f̂_h(x) ) 1_{f̂_h(x) > f0/2},

with z = (m₁, m₂, h). Let us study this estimator in the empirical norm. Due to the disjoint supports of the two terms, and together with the Cauchy–Schwarz inequality, we obtain

  ‖â²_z − a²‖²_n = ‖ ( (ĝ_{m₂} − g)/f̂_h + (σ² − σ̂²_{m₁})/f̂_h + ((g − σ²)/f) · (f − f̂_h)/f̂_h ) 1_{f̂_h > f0/2} ‖²_n + ‖ ((g − σ²)/f) 1_{f̂_h ≤ f0/2} ‖²_n.   (27)
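A minimal plug-in sketch of this quotient estimator, in our own notation (the three estimators are passed in as plain functions, and f0 is assumed known):

```python
import numpy as np

def a2_plugin(x, g_hat, sig2_hat, f_hat, f0):
    """hat a^2(x) = (hat g(x) - hat sigma^2(x)) / hat f(x) on {hat f > f0/2}, else 0."""
    g, s2, f = g_hat(x), sig2_hat(x), f_hat(x)
    out = np.zeros_like(np.asarray(x, dtype=float))
    keep = f > f0 / 2          # indicator 1_{hat f > f0/2}
    out[keep] = (g[keep] - s2[keep]) / f[keep]
    return out

# toy consistency check: g = sigma^2 + a^2 * f with sigma^2 = 1, a^2(x) = x^2, f = 2
x = np.linspace(0.0, 1.0, 5)
a2 = a2_plugin(x,
               g_hat=lambda u: 1.0 + 2.0 * u ** 2,
               sig2_hat=lambda u: np.ones_like(u),
               f_hat=lambda u: 2.0 * np.ones_like(u),
               f0=2.0)
```

When the three input functions are exact, the quotient recovers a² wherever the denominator stays above the threshold; the indicator only matters where f̂_h is small.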
Let us note here that f can be lower bounded by construction. Indeed, its definition, jointly with the fact that λ_{t_i}^{(j)} > ζ_j (which follows from the positivity of the kernels h_{i,j}), provides the wanted lower bound. For â²_z to be an estimator, f0 must be known or estimated.

Numerical results
In this section we present our numerical study on synthetic data.
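Before describing the experiments, here is a sketch of the data-generating mechanism used throughout this section: a univariate Hawkes process simulated by Ogata's thinning algorithm, followed by an Euler scheme with jumps for X. This is an illustrative implementation under assumptions of ours, not the authors' code: the parameter values and the coefficients b(x) = −x, σ(x) = 1, a(x) = 1 are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hawkes(xi, lam0, c, alpha, T):
    """Univariate Hawkes process via Ogata's thinning.

    Intensity: lam(t) = xi + (lam0 - xi)*exp(-alpha*t)
                        + sum_{T_k < t} c*exp(-alpha*(t - T_k)).
    """
    jumps = []

    def lam(t):
        s = xi + (lam0 - xi) * np.exp(-alpha * t)
        return s + sum(c * np.exp(-alpha * (t - Tk)) for Tk in jumps)

    t = 0.0
    while t < T:
        lam_bar = lam(t) + c        # valid upper bound: lam only decays between jumps
        t += rng.exponential(1.0 / lam_bar)
        if t < T and rng.uniform() * lam_bar <= lam(t):
            jumps.append(t)
    return np.array(jumps)

def euler_with_jumps(b, sigma, a, jumps, x0, dt, n):
    """Euler scheme for dX = b dt + sigma dW + a dN, N the Hawkes counting process."""
    X = np.empty(n + 1)
    X[0] = x0
    for k in range(n):
        dN = np.sum((jumps > k * dt) & (jumps <= (k + 1) * dt))
        X[k + 1] = (X[k] + b(X[k]) * dt
                    + sigma(X[k]) * np.sqrt(dt) * rng.standard_normal()
                    + a(X[k]) * dN)
    return X

jumps = simulate_hawkes(xi=0.5, lam0=0.5, c=0.5, alpha=5.0, T=100.0)
X = euler_with_jumps(lambda x: -x, lambda x: 1.0, lambda x: 1.0,
                     jumps, x0=0.0, dt=0.01, n=10000)
```

The thinning bound lam(t) + c is valid because the intensity decays between jumps and can only increase by c at a jump; this is the standard trick for self-exciting intensities with exponential kernels.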
We simulate the Hawkes process N with M = 1 for simplicity, and we denote by (T_k)_k the sequence of jump times. The multidimensional structure of the Hawkes process makes it possible to consider many kinds of data, but what impacts the dynamics of X is the cumulative Hawkes process; in that sense we do not lose generality by taking M = 1. In this case, the intensity process is written as

  λ_t = ξ + (λ_0 − ξ) e^{−αt} + Σ_{T_k < t} c e^{−α(t − T_k)},

where the baseline ξ and the initial value λ_0 are fixed to small positive constants, the excitation coefficient c is a positive constant chosen smaller than α (so that the process is stable), and α = 5.

Then we simulate (X_Δ, …, X_{(n+1)Δ}) with an Euler scheme with constant time step Δ_i = Δ. Because of the additional jump term (when a ≠ 0), to the best of our knowledge it is not possible to use more sophisticated classical schemes. A simulation algorithm is also detailed in [18], Section 2.3.

In order to challenge the proposed methodology, we investigate different kinds of models. In this section we present the results for the following four models:
(a) b(x) = −x, σ(x) = 1, a(x) = √(0.1 x);
(b) b(x) = −x + sin(x), σ(x) = √((3 + x²)/(1 + x²)), a(x) = 1;
(c) b(x) = −x, σ(x) = √x, a(x) = 1;
(d) b(x) = −x, σ(x) = √x, a(x) = x 1_{[−5, +5]}(x) + 5·1_{(5, +∞)}(x) − 5·1_{(−∞, −5)}(x).

The drift is chosen linear in order to satisfy the assumptions and, as the estimation of b is not of interest here, keeping the same drift coefficient lets us focus on the differences due to the coefficients σ², a². For example, in models (c) and (d), σ does not satisfy Assumption 1. Let us now detail the numerical estimation strategy.

It is important to remind the reader that the estimation procedures are based only on the observations (X_{kΔ})_{k=0,…,n}. Indeed, the estimators σ̂²_m̂ and ĝ_m̂ of σ² and g, respectively defined by (11) and (23), are based on the statistics

  T_{kΔ} = (X_{(k+1)Δ} − X_{kΔ})²/Δ,  k = 0, …, n − 1.

Estimation of σ². To compute σ̂²_m we use a version of the truncated quadratic variation, through a function ϕ that vanishes when the increments of the data are too large compared to the typical increments of a continuous diffusion process. Precisely, we choose

  T^ϕ_{kΔ} := T_{kΔ} × ϕ( (X_{(k+1)Δ} − X_{kΔ})/Δ^β ),   (28)

where ϕ(x) = 1 for |x| < 1, ϕ(x) = 0 for |x| ≥ 2, and ϕ decreases smoothly (through an explicit exponential bridge) on 1 ≤ |x| < 2. This choice for the smooth function ϕ is discussed in [4].

Estimation of g.
As far as the estimation of g := σ² + a² × f is concerned, we do not know the true conditional expectations f(x_{t_k}) = E[λ_{t_k} | X_{t_k} = x_{t_k}]. Thus we compare the estimations of g to the approximate function g̃(x) = σ²(x) + a²(x) × NW_ĥ(x), where the function f(x) = ∫ z π(x, z) dz / π_X(x), which corresponds to E[λ | X = x], is estimated with the classical Nadaraya–Watson estimator NW_h, h being the bandwidth parameter. To do so we use the R function ksmooth. Then ĥ is chosen through a leave-one-out cross-validation procedure. We are aware of the fact that the NW estimator can be defined with two bandwidths (one for the numerator and one for the denominator), as presented in [13], but we choose the simplest version here.

Choice of the subspaces of L²(A). The spaces S_m are generated by the Fourier basis. The maximal dimension D_n is chosen equal to 20 for this study. The theoretical dimension ⌊√(nΔ)/(n^ε log n)⌋ is often too small in practice, since we have to consider higher dimensions to estimate non-regular functions. In the theoretical part, the estimation is done on a fixed compact interval A. Here it is slightly different: we consider for each model the random data range as the estimation interval. This is more adapted to a real-life data set situation.

Let us remind the reader that the two penalty functions, pen_σ given in Equation (14) and pen_g given in Equation (26), depend on constants named κ1, κ2. These constants need to be chosen once and for all for each estimator in order to compute the final adaptive estimators σ̂²_m̂ and ĝ_m̂. We now explain how these choices are made.

Choice of the universal constants.
In order to choose the universal constants κ1 and κ2, we investigate models with varying b, a, σ (different from those used to validate the procedure later on) for several values of n and for Δ ∈ {0.1, 0.01}. We compute Monte-Carlo estimators of the risks E[‖σ̂²_m̂ − σ²‖²_n] and E[‖ĝ_m̂ − g̃‖²_n]. We choose to do N_rep = 1000 repetitions and estimate these expectations by the averages

  (1/N_rep) Σ_{k=1}^{N_rep} ‖σ̂²^{(k)}_m̂ − σ²‖²_n  and  (1/N_rep) Σ_{k=1}^{N_rep} ‖ĝ^{(k)}_m̂ − g̃‖²_n.

Finally, comparing the risks as functions of κ1, κ2 leads to selecting values that achieve a good compromise over all experiments. Applying this procedure we finally choose κ1 = 100 and κ2 = 100.

Choice of the threshold β. The parameter β appears in Equation (28). This parameter helps the algorithm decide whether the process has jumped or not. Its admissible range is given in the theoretical part, and the value of β is fixed accordingly in the simulations.

Choice of the bandwidth h. The bandwidth h in the Nadaraya–Watson estimator of the conditional expectation is chosen through a leave-one-out cross-validation procedure. Since the true conditional expectation is unknown, we focus on the estimation of g̃, which depends on this estimator anyway: it is the estimation procedure of g that is evaluated. Other choices of bandwidth exist, such as the Goldenshluger and Lepski method [21] or the Penalized Comparison to Overfitting approach [26].

As for the calibration phase, we compute Monte-Carlo estimators of the empirical risks, with N_rep = 1000 repetitions. In the risk Tables 1 and 2, we present, for the different models and different values of (Δ, n):
the average of the estimated risk over 1000 simulations (MISE) and the standard deviation in brackets.

Figure 1: Models (a), (b), (c) with n = 10000, Δ = 0.01. Three final estimators in plain green (plain line), true σ² in black (dotted line).

                 Δ = 0.1, n = 1000             Δ = 0.1, n = 10000            Δ = 0.01, n = 10000
  Model     σ̂²_m̂           σ̂²_m*           σ̂²_m̂           σ̂²_m*           σ̂²_m̂           σ̂²_m*
  (a)    0.410 (0.280)  0.361 (0.285)   0.385 (0.122)  0.278 (0.088)   0.015 (0.028)  0.010 (0.023)
  (b)    0.187 (1.678)  0.107 (0.989)   0.046 (1.162)  0.027 (1.014)   0.005 (0.015)  0.005 (0.008)
  (c)    1.201 (0.216)  0.798 (0.208)   0.452 (0.062)  0.366 (0.042)   0.015 (0.012)  0.008 (0.007)

Table 1: Estimation on a compact interval. Average and standard deviation of the estimated risks ‖σ̂²_m̂ − σ²‖²_n and ‖σ̂²_m* − σ²‖²_n computed over 1000 repetitions.

We also print the result for the oracle function in both cases. Indeed, since on simulations we know the functions σ², g̃, we can compute the estimator in the collection M_n = {1, …, D_n} which minimises in m the errors ‖σ̂²_m − σ²‖²_n and ‖ĝ_m − g̃‖²_n. Let us denote these oracle estimators by σ̂²_m* and ĝ_m* respectively. These are not true estimators, as they are not available in practice; nevertheless they provide the benchmark. The goal of this numerical study is thus to see how close the risks of σ̂²_m̂, ĝ_m̂ are to the risks of these two oracle functions. Let us detail the results for each estimator.

Estimation of σ². Figure 1 shows, for models (a), (b), (c), three estimators σ̂²_m̂ in green (light grey) and the true function σ² in black (dotted line). We can appreciate here the good reconstruction of the function σ² by our estimator. Table 1 sums up the results of the estimator σ̂²_m̂ for the different models and different parameter choices.
We also present the results for the oracle estimator σ̂²_m*, as mentioned previously. The estimates of the MISE and the standard deviation are very close to the oracle ones. As shown in the theoretical part, we can notice that the MISE decreases when n increases. Besides, as the variance term is proportional to 1/n, when n is fixed and large enough we can see the clear influence of Δ: from 0.1 to 0.01, the MISEs are divided by at least 10. Model (c) seems to be the most challenging for the procedure.
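The Monte-Carlo risk figures reported in these tables are averages of squared empirical norms over independent repetitions; in outline (our own toy experiment, where the "estimator" is simply the truth plus noise, so the risk should land near the noise variance 0.01):

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_risk(estimate_once, truth, n_rep=200):
    """Estimate E||hat s - s||_n^2 by averaging over n_rep synthetic data sets."""
    risks = np.empty(n_rep)
    for r in range(n_rep):
        grid, fit = estimate_once(rng)              # one data set -> (x_i, estimate at x_i)
        risks[r] = np.mean((fit - truth(grid)) ** 2)  # squared empirical norm
    return risks.mean(), risks.std()

def estimate_once(rng):
    grid = np.linspace(-1.0, 1.0, 50)
    return grid, np.sin(grid) + 0.1 * rng.standard_normal(grid.size)

mean_risk, sd_risk = mc_risk(estimate_once, np.sin)
```

Reporting the standard deviation alongside the mean, as in the tables, makes the Monte-Carlo error of the reported MISE visible.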
Estimation of g̃. Figure 2 shows, for each of the three models (a), (b), (c), three estimators ĝ_m̂ of g̃ in green (light grey) and the function g̃ in black (dotted line). The beams of the three
Figure 2: Models (a), (b), (c) with n = 10000, Δ = 0.01. Three final estimators of g̃ in plain green (plain line) and g̃ in black (dotted line).

                 Δ = 0.1, n = 1000             Δ = 0.1, n = 10000            Δ = 0.01, n = 10000
  Model     ĝ_m̂            ĝ_m*            ĝ_m̂            ĝ_m*            ĝ_m̂            ĝ_m*
  (a)    1.363 (0.715)  0.895 (0.606)   0.948 (0.193)  0.735 (0.195)   0.129 (0.141)  0.109 (0.120)
  (b)    0.915 (0.520)  0.474 (0.393)   0.313 (0.174)  0.198 (0.079)   0.240 (0.100)  0.098 (0.072)
  (c)    0.707 (0.964)  0.311 (0.320)   0.236 (0.202)  0.099 (0.056)   0.073 (0.130)  0.035 (0.035)

Table 2: Estimation on a compact interval. Average and standard deviation of the estimated risks ‖ĝ_m̂ − g̃‖²_n and ‖ĝ_m* − g̃‖²_n computed over 1000 repetitions.

realisations of the estimator are satisfying. We observe that the procedure has difficulties with Model (a), an impression confirmed by the risk estimates in Table 2 below. For the two other models the estimators seem closer to the true function; the estimation seems to work better in Model (c) than in Model (b), which is also corroborated by the risk estimates given in Table 2. Table 2 gives the Mean Integrated Squared Errors (MISEs) of the estimator ĝ_m̂ obtained from our procedure and of the oracle estimator ĝ_m*, which is the best one in the collection, for the three models with different values of Δ and n. As expected, we observe that the MISEs are smaller when n increases and Δ decreases. Models (a), (b), (c) give relatively good results, even if, as already said, it seems a little more difficult to estimate g correctly in Model (a), probably because the volatility σ² is constant in this case. Compared with the results on the estimation of σ², here the variance is proportional to 1/(nΔ) and thus the risks are larger in general.
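The Nadaraya–Watson step used to build g̃ can be sketched as follows: a Gaussian-kernel stand-in (our own minimal version, not the R ksmooth call used in the study) with the leave-one-out bandwidth choice described above.

```python
import numpy as np

rng = np.random.default_rng(3)

def nw(x0, x, y, h):
    """Nadaraya-Watson regression estimate of E[Y | X = x0], Gaussian kernel."""
    w = np.exp(-0.5 * ((np.asarray(x0)[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / np.maximum(w.sum(axis=1), 1e-300)

def loo_bandwidth(x, y, grid):
    """Leave-one-out cross-validation: pick h minimizing the prediction error."""
    best_h, best = None, np.inf
    for h in grid:
        w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
        np.fill_diagonal(w, 0.0)                 # drop the i-th point for its own prediction
        pred = (w @ y) / np.maximum(w.sum(axis=1), 1e-300)
        err = np.mean((y - pred) ** 2)
        if err < best:
            best_h, best = h, err
    return best_h

x = np.sort(rng.uniform(-2.0, 2.0, 400))
y = np.sin(x) + 0.2 * rng.standard_normal(400)
h = loo_bandwidth(x, y, grid=[0.05, 0.1, 0.2, 0.4, 0.8])
fit = nw(x, x, y, h)
```

A single bandwidth is used for numerator and denominator, matching the "simplest version" choice made in the text.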
Estimation of a. As explained in Section 5, the challenge is to get an approximation of the coefficient a from the two previous estimators. A main numerical issue is that, according to the theoretical and numerical results, the best settings for the estimation of σ² and g are not the same: the smaller Δ is, the better the estimation of σ² (only a large n matters), while on the contrary nΔ needs to be large to estimate g properly. To overcome this difficulty, we choose a thin discretization of the trajectories of X. We simulate discrete paths of the process X first with Δ = 10⁻³, n = 10⁵.

Figure 3: Model (d). Final estimators ĝ_m̂, σ̂²_m̂ and â² in plain green (plain line), and the true g̃, σ² and a² in plain black (dotted line), from left to right respectively.

Then, we first compute ĝ_m̂, the estimator of g̃, on all the observations. Secondly, we compute σ̂²_m̂, the estimator of σ², from a subsample of the discretized observations (one observation out of ten, thus Δ = 0.01, n = 10000). We finally compute the estimator

  â²(x) = ( ĝ_m̂(x) − σ̂²_m̂(x) ) / NW_ĥ(x).

This procedure is slightly different from the one presented in Section 5. Indeed, here we have plugged into â² the final estimators of σ² and g, rather than computing the whole collection â²_z with z = (m₁, m₂, h) the parameter to be chosen. Nevertheless, as the two procedures to select m₁ and m₂ have been intensively studied before, and the cross-validation method to select h is also well known, this way of proceeding seems more natural. Besides, the risk bound obtained on â²_z in Equation (27) suggests that the better the three functions σ², g, f are estimated, the better the estimation of a² will be.
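At the level of the statistics T, the two-scale trick described above (fine grid for g̃, one-in-ten subsample for σ²) reduces to the following sketch; the code is illustrative and runs on a pure diffusion path, where both empirical means should be close to σ² = 1.

```python
import numpy as np

rng = np.random.default_rng(4)

def t_stats(X, dt):
    """Squared normalized increments T_{t_i} = (X_{t_{i+1}} - X_{t_i})^2 / dt."""
    return np.diff(X) ** 2 / dt

# fine path: dt = 1e-3, n = 1e5 (as in the text); sigma = 1, no drift or jumps here
dt, n = 1e-3, 100_000
X = np.concatenate(([0.0], np.cumsum(np.sqrt(dt) * rng.standard_normal(n))))

T_fine = t_stats(X, dt)                  # would feed the estimator of g
T_coarse = t_stats(X[::10], 10 * dt)     # one observation in ten: feeds sigma^2
```

Subsampling multiplies the step by ten without changing the path, which is exactly why the two estimators can be computed from a single simulated trajectory.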
Nevertheless, one could set up a selection procedure for z in order to minimize the estimation risk on a². We present in Figure 3 the results obtained for model (d), in which neither σ nor a is constant; indeed, for the three other models, in which one of the coefficients of the jump-diffusion is constant, our procedure has difficulties estimating g, σ² and a² properly. We see that the final estimator â²_ẑ is not so far from the true function, even if there are some fluctuations around it. This is understandable, because the errors coming from the estimations of σ² and g add up, as can be seen in Inequality (27). Moreover, it should not be forgotten that we do not know g exactly: we already make an error by estimating g̃ instead of g, and this error is then reflected in the estimate of a².

This paper investigates the jump-diffusion model with jumps driven by a Hawkes process. This model usefully completes the collection of jump-diffusion models by taking into account dependency in the jump process. The dynamics of the trajectories obtained from this model are impacted by the Hawkes process, which acts independently of the diffusion process. This work focuses on the estimation of the unknown coefficients σ² and a². We propose a classical adaptive estimator of σ² based on the truncated increments of the observed discrete trajectory; this allows us to estimate the diffusion coefficient when no jump is detected. Then, we estimate the sum g := σ² + a² × f. Indeed, it is this function, and not σ² + a², that can be estimated. The multiplicative term f is the sum of the conditional expectations of the jump intensities.
This function is estimated separately through a Nadaraya–Watson estimator. The proposed estimator of g is built using all the increments of the quadratic variation this time. Furthermore, a main issue is to reach the jump coefficient a from the first two estimators σ̂²_m̂ and ĝ_m̂, for which the theoretical and numerical results are convincing. The last section of this article answered this question partially. In fact, it is simple to build an estimator of a² from the two previous ones together with an estimator of the unknown conditional intensity function f. Nevertheless, this is possible only if the jumps of the Hawkes process are observed, which is the case in the simulation study. Thus, when real-life data arise, the jump times of the counting process must be observed in order to reach a with our methodology; otherwise, the issue remains an open question. Moreover, the proposed estimator â²_z, with z = (m₁, m₂, h), is a quotient of estimators, and the denominator must be lower bounded to ensure that the estimator is well defined. This could be carefully studied, theoretically and numerically, and be the object of further work. The choice of the 3-dimensional parameter z could also be investigated: instead of choosing the triplet (m̂₁, m̂₂, ĥ) proposed in the simulations (where the first two are given in (13) and (25) respectively and the third is the cross-validation bandwidth), one could propose an adaptive estimator of a² by choosing the triplet that minimizes an estimator of the risk. This is an interesting mathematical question which requires more attention and is beyond the primary purpose of this work. Finally, our analysis sheds light on the importance of further investigating the conditional intensity function f, which depends on the invariant density π.
A future perspective would be to propose a kernel estimator of the invariant density π and to study in depth its behaviour and asymptotic properties, following the same approach as in [32] and [3]. A projection method is instead considered in [25] in order to estimate the invariant density associated with a piecewise deterministic Markov process (PDMP). As a consequence, it would become possible to discuss the properties of the related estimator of f. To conclude, the innovative procedure we have presented could be used to investigate real-life data sets. For example, some neuronal data, as explained in [17], could be interpreted through this model, since both X and the jump times of N are observed. This is a work in progress.

This section is devoted to the proof of the results stated in Section 3. We start by proving Proposition 3.
Proof.
We want to show an upper bound for the empirical risk E[ ‖σ̂²_m − σ²‖²_n ]. First of all we remark that, if t is a deterministic function, then E[‖t‖²_n] = ‖t‖²_{πX}, where ‖t‖²_{πX} := ∫_A t²(x) π_X(dx) and π_X(dx) = π(dx × R^M) is the projection of π on the coordinate X, which exists by Theorem 2.3 in [17] (proof in [18]).

By the definition of T_{t_i} we have

  γ_{n,M}(t) := (1/n) Σ_{i=0}^{n−1} ( t(X_{t_i}) − T_{t_i} ϕ^β_{Δ_n,i}(Δ_i X) )²
    = (1/n) Σ_{i=0}^{n−1} ( t(X_{t_i}) − σ²(X_{t_i}) − (Ã_{t_i} + B_{t_i} + E_{t_i} ϕ^β_{Δ_n,i}(Δ_i X)) )²
    = ‖t − σ²‖²_n + (1/n) Σ_{i=0}^{n−1} (Ã_{t_i} + B_{t_i} + E_{t_i} ϕ^β_{Δ_n,i}(Δ_i X))²
      − (2/n) Σ_{i=0}^{n−1} (Ã_{t_i} + B_{t_i} + E_{t_i} ϕ^β_{Δ_n,i}(Δ_i X)) ( t(X_{t_i}) − σ²(X_{t_i}) ).

As σ̂²_m minimizes γ_{n,M}(t), for any σ²_m ∈ S_m we have γ_{n,M}(σ̂²_m) ≤ γ_{n,M}(σ²_m) and therefore

  ‖σ̂²_m − σ²‖²_n ≤ ‖σ²_m − σ²‖²_n + (2/n) Σ_{i=0}^{n−1} (Ã_{t_i} + B_{t_i} + E_{t_i} ϕ^β_{Δ_n,i}(Δ_i X)) ( σ̂²_m(X_{t_i}) − σ²_m(X_{t_i}) ).

Let us denote the contrast function

  ν_n(t) := (1/n) Σ_{i=0}^{n−1} B_{t_i} t(X_{t_i}).   (29)

It follows, for any d > 0,

  ‖σ̂²_m − σ²‖²_n ≤ ‖σ²_m − σ²‖²_n + (d/n) Σ_{i=0}^{n−1} (Ã_{t_i} + E_{t_i} ϕ^β_{Δ_n,i}(Δ_i X))² + (1/d) ‖σ̂²_m − σ²_m‖²_n + 2 ν_n(σ̂²_m − σ²_m).

The linearity of the function ν_n in t implies that

  2 ν_n(σ̂²_m − σ²_m) = 2 ‖σ̂²_m − σ²_m‖_{πX} ν_n( (σ̂²_m − σ²_m)/‖σ̂²_m − σ²_m‖_{πX} ) ≤ 2 ‖σ̂²_m − σ²_m‖_{πX} sup_{t ∈ B_m} ν_n(t);

then, using that when d >
0, we have 2xy ≤ x²/d + d y², we obtain the upper bound

  2 ν_n(σ̂²_m − σ²_m) ≤ (1/d) ‖σ̂²_m − σ²_m‖²_{πX} + d sup_{t ∈ B_m} ν_n²(t),

where B_m := { t ∈ S_m : ‖t‖_{πX} ≤ 1 }. Finally, using the Cauchy–Schwarz inequality leads to

  ‖σ̂²_m − σ²‖²_n ≤ ‖σ²_m − σ²‖²_n + (2d/n) Σ_{i=0}^{n−1} Ã²_{t_i} + (2d/n) Σ_{i=0}^{n−1} E²_{t_i} ϕ^{2β}_{Δ_n,i}(Δ_i X) + (1/d) ‖σ̂²_m − σ²_m‖²_n + d sup_{t ∈ B_m} ν_n²(t) + (1/d) ‖σ̂²_m − σ²_m‖²_{πX}.   (30)

Let us set

  Ω_n := { ω : ∀ t ∈ S̃_n \ {0}, | ‖t‖²_n/‖t‖²_{πX} − 1 | ≤ 1/2 },   (31)

on which the norms ‖·‖_{πX} and ‖·‖_n are equivalent. We now act differently to bound the risk on Ω_n and on Ω^c_n.

Bound of the risk on Ω_n. On Ω_n we have

  ‖σ̂²_m − σ²_m‖²_{πX} ≤ 2 ‖σ̂²_m − σ²_m‖²_n ≤ 4 ‖σ̂²_m − σ²‖²_n + 4 ‖σ² − σ²_m‖²_n,

where the last estimate uses the triangle inequality. In the same way we get

  ‖σ̂²_m − σ²_m‖²_n ≤ 2 ‖σ̂²_m − σ²‖²_n + 2 ‖σ² − σ²_m‖²_n.

Replacing these bounds in (30) we obtain

  ‖σ̂²_m − σ²‖²_n ≤ ‖σ²_m − σ²‖²_n + (2d/n) Σ_i Ã²_{t_i} + (2d/n) Σ_i E²_{t_i} ϕ^{2β}_{Δ_n,i}(Δ_i X) + d sup_{t ∈ B_m} ν_n²(t) + (6/d) ‖σ̂²_m − σ²‖²_n + (6/d) ‖σ² − σ²_m‖²_n.

We need d to be larger than 6; taking d = 7 we obtain

  ‖σ̂²_m − σ²‖²_n ≤ 13 ‖σ²_m − σ²‖²_n + (98/n) Σ_{i=0}^{n−1} Ã²_{t_i} + (98/n) Σ_{i=0}^{n−1} E²_{t_i} ϕ^{2β}_{Δ_n,i}(Δ_i X) + 49 sup_{t ∈ B_m} ν_n²(t).
(32)

We denote by (ψ_l)_l an orthonormal basis of S_m for the L²_{πX} norm (thus ∫_R ψ²_l(x) π_X(x) dx = 1). Each t ∈ B_m can be written t = Σ_{l=1}^{D_m} α_l ψ_l with Σ_{l=1}^{D_m} α²_l ≤ 1. Then

  sup_{t ∈ B_m} ν_n²(t) = sup_{Σ_l α²_l ≤ 1} ν_n²( Σ_{l=1}^{D_m} α_l ψ_l ) ≤ sup_{Σ_l α²_l ≤ 1} ( Σ_{l=1}^{D_m} α²_l ) ( Σ_{l=1}^{D_m} ν_n²(ψ_l) ) = Σ_{l=1}^{D_m} ν_n²(ψ_l).   (33)

To study the risk we need to take expectations. From (32), (33) and the first and third points of Proposition 2 we get

  E[ ‖σ̂²_m − σ²‖²_n 1_{Ω_n} ] ≤ 13 E[ ‖σ²_m − σ²‖²_n ] + c Δ_n^{1−ε̃} + c Δ_n^{4β−1} + 49 Σ_{l=1}^{D_m} E[ν_n²(ψ_l)].   (34)

By the definition (29) of ν_n, it is ν_n(ψ_l) = (1/n) Σ_{i=0}^{n−1} B_{t_i} ψ_l(X_{t_i}). As B_{t_i} is conditionally centered, using the second point of Proposition 2,

  Σ_{l=1}^{D_m} E[ν_n²(ψ_l)] ≤ (c/n²) Σ_{i=0}^{n−1} Σ_{l=1}^{D_m} E[ ψ²_l(X_{t_i}) E[B²_{t_i} | F_{t_i}] ] ≤ (c/n²) Σ_{i=0}^{n−1} Σ_{l=1}^{D_m} σ1⁴ E[ψ²_l(X_{t_i})] ≤ c σ1⁴ D_m/n.

Replacing this inequality in (34) yields

  E[ ‖σ̂²_m − σ²‖²_n 1_{Ω_n} ] ≤ 13 E[ ‖σ²_m − σ²‖²_n ] + c Δ_n^{4β−1} + c σ1⁴ D_m/n.
Since on Ω_n the empirical norm and the π_X-norm are equivalent, and the reasoning above applies for any σ²_m ∈ S_m, it clearly follows that E[ ‖σ̂²_m − σ²‖²_n 1_{Ω_n} ] ≤
13 inf_{t ∈ S_m} ‖t − σ²‖²_{πX} + c Δ_n^{4β−1} + c σ1⁴ D_m/n.   (35)

Bound of the risk on Ω^c_n. The complement Ω^c_n of the set Ω_n given in Equation (31) is defined as

  Ω^c_n = { ω ∈ Ω : ∃ t* ∈ S̃_n \ {0}, | ‖t*‖²_n/‖t*‖²_{πX} − 1 | > 1/2 }.

Let us set e = (e_{t_0}, …, e_{t_{n−1}}), where e_{t_i} := T_{t_i} ϕ^β_{Δ_n,i}(Δ_i X) − σ²(X_{t_i}) = Ã_{t_i} + B_{t_i} + E_{t_i} ϕ^β_{Δ_n,i}(Δ_i X). Moreover,

  Π_m T^ϕ = Π_m ( T_{t_0} ϕ^β_{Δ_n,0}(Δ_0 X), …, T_{t_{n−1}} ϕ^β_{Δ_n,n−1}(Δ_{n−1} X) ) = ( σ̂²_m(X_{t_0}), …, σ̂²_m(X_{t_{n−1}}) ),

where Π_m is the Euclidean orthogonal projection onto S_m. Then, according to the definition of the projection,

  ‖σ̂²_m − σ²‖²_n = ‖Π_m T^ϕ − σ²‖²_n = ‖Π_m T^ϕ − Π_m σ²‖²_n + ‖Π_m σ² − σ²‖²_n ≤ ‖T^ϕ − σ²‖²_n + ‖σ²‖²_n = ‖e‖²_n + ‖σ²‖²_n.

Therefore, from the Cauchy–Schwarz inequality and the boundedness of σ²(x),

  E[ ‖σ̂²_m − σ²‖²_n 1_{Ω^c_n} ] ≤ E[ ‖e‖²_n 1_{Ω^c_n} ] + E[ ‖σ²‖²_n 1_{Ω^c_n} ]
    = (1/n) Σ_{i=0}^{n−1} E[ e²_{t_i} 1_{Ω^c_n} ] + (1/n) Σ_{i=0}^{n−1} E[ σ⁴(X_{t_i}) 1_{Ω^c_n} ]
    ≤ (1/n) Σ_{i=0}^{n−1} E[e⁴_{t_i}]^{1/2} P(Ω^c_n)^{1/2} + σ1⁴ P(Ω^c_n).

From Lemma 6.4 in [17], if nΔ_n/(log n)² → ∞ and D_n ≤ √(nΔ_n)/log n, then

  P(Ω^c_n) ≤ c/n².   (36)

In the hypotheses of our proposition we have required that n^ε log n = o(√(nΔ_n)). As for n going to ∞ we have (log n)²/(nΔ_n) ≤ (n^ε log n/√(nΔ_n))² →
0, the first condition in Lemma 6.4 in [17] hold true.Regarding the bound on D n , we have assumed D n ≤ √ n ∆ n log n n ε ≤ √ n ∆ n log n and so we can apply thehere above mentioned lemma, which yields (36).We are left to evaluate E [ e t i ]. From Proposition 2 it follows E (cid:2) e t i (cid:3) ≤ E (cid:20) ˜ A t i + B t i + E t i ϕ βn,i (∆ i X ) (cid:21) ≤ c ∆ − ˜ εn + c + c ∆ β − n ≤ c ∆ ∧ β − n . Putting the pieces together it yields E (cid:104)(cid:13)(cid:13)(cid:98) σ m − σ (cid:13)(cid:13) n Ω cn (cid:105) ≤ c ∆ ∧ β − n n + cn ≤ c ∆ ∧ β − n n . (37)From (35) and (37) it follows E (cid:104)(cid:13)(cid:13)(cid:98) σ m − σ (cid:13)(cid:13) n (cid:105) ≤
$$13\,\inf_{t \in S_m}\|t - \sigma^2\|^2_{\pi_X} + C\,\sigma_1^4\,\frac{D_m}{n} + C\,\Delta_n^{4\beta-1} + C\,\frac{\Delta_n^{0 \wedge (4\beta-\frac{3}{2})}}{n}.$$

8.1.2 Proof of Theorem 1

Proof.
We analyse the quantity $\mathbb{E}[\|\widehat{\sigma}^2_{\widehat m} - \sigma^2\|^2_n]$, acting again in a different way depending on whether or not we are on $\Omega_n$. On $\Omega^c_n$ the proof can be led as before, getting
$$\mathbb{E}\big[\|\widehat{\sigma}^2_{\widehat m} - \sigma^2\|^2_n\,\mathbf{1}_{\Omega^c_n}\big] \le \frac{c\,\Delta_n^{0 \wedge (4\beta-\frac{3}{2})}}{n}. \qquad (38)$$
Now we investigate what happens on $\Omega_n$. By the definition of $\widehat m$ it is
$$\gamma_{n,M}(\widehat{\sigma}^2_{\widehat m}) + \mathrm{pen}(\widehat m) \le \gamma_{n,M}(\widehat{\sigma}^2_m) + \mathrm{pen}(m) \le \gamma_{n,M}(\sigma^2_m) + \mathrm{pen}(m)$$
and so, acting as before (32), we get
$$\mathbb{E}\big[\|\widehat{\sigma}^2_{\widehat m} - \sigma^2\|^2_n\,\mathbf{1}_{\Omega_n}\big] \le 13\,\mathbb{E}\big[\|\sigma^2_m - \sigma^2\|^2_n\big] + \frac{98}{n}\sum_{i=0}^{n-1}\mathbb{E}[\widetilde{A}^2_{t_i}] + \frac{98}{n}\sum_{i=0}^{n-1}\mathbb{E}\big[(E_{t_i}\,\varphi_{\Delta^\beta_{n,i}}(\Delta_i X))^2\big] + 49\,\mathbb{E}\Big[\sup_{t \in B_{m,\widehat m}}\nu^2_n(t)\Big] + 7\,\mathrm{pen}(m) - \mathbb{E}[\mathrm{pen}(\widehat m)], \qquad (39)$$
where $\nu_n$ has been defined in (29) and $B_{m,m'} := \{h \in S_m + S_{m'} : \|h\|_{\pi_X} \le 1\}$. We want to control the term $\mathbb{E}[\sup_{t \in B_{m,\widehat m}}(\nu_n(t))^2]$ and, to do that, we introduce the function $p(m,m')$, which is such that
$$p(m,m') = \frac{1}{49}\big(\mathrm{pen}(m) + \mathrm{pen}(m')\big). \qquad (40)$$
It is
$$\mathbb{E}\Big[\sup_{t \in B_{m,\widehat m}}\nu^2_n(t)\Big] \le \mathbb{E}[p(m,\widehat m)] + \sum_{m' \in \mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{t \in B_{m,m'}}(\nu_n(t))^2 - p(m,m')\Big)_+\Big].$$
In order to bound the second term in the right hand side here above we want to use Lemma 7 in [30]. We can remark that, for any $p \ge 2$,
$$\mathbb{E}[|B_{t_i}|^{p}] \le \frac{c}{\Delta^{p}_{n,i}}\,\mathbb{E}[Z^{2p}_{t_i}] + c\,\sigma_1^{2p}.$$
According to Proposition 4.2 in Barlow and Yor [8] (B.D.G. inequality with optimal constants) there exists a constant $c$ such that, for any $p > 0$,
$$\mathbb{E}\big[Z^{2p}_{t_i}\big] \le c^{2p}\,(2p)^{p}\,\Delta^{p}_{n,i}\,\sigma_1^{2p}.$$
It follows $\mathbb{E}[|B_{t_i}|^{p}] \le c^{p}(2p)^{p}\sigma_1^{2p} + c\,\sigma_1^{2p} \le c^{p}(2p)^{p}\sigma_1^{2p}$.
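The moment chain just used can be spelled out; here $Z_{t_i} = \int_{t_i}^{t_{i+1}}\sigma(X_s)\,dW_s$, and the display below is a sketch in the notation above (constants $c$ change from line to line, and are not the authors' exact constants):

```latex
% Barlow--Yor (B.D.G. with optimal constants), then Jensen and \sigma \le \sigma_1:
\mathbb{E}\big[Z_{t_i}^{2p}\big]
  \le c^{2p}(2p)^{p}\,
      \mathbb{E}\Big[\Big(\int_{t_i}^{t_{i+1}}\sigma^{2}(X_s)\,ds\Big)^{p}\Big]
  \le c^{2p}(2p)^{p}\,\Delta_{n,i}^{p}\,\sigma_1^{2p} ,
% hence, dividing by \Delta_{n,i}^{p}:
\mathbb{E}\big[|B_{t_i}|^{p}\big]
  \le \frac{c}{\Delta_{n,i}^{p}}\,\mathbb{E}\big[Z_{t_i}^{2p}\big] + c\,\sigma_1^{2p}
  \le c^{p}(2p)^{p}\,\sigma_1^{2p} .
```

The point is that the time step cancels exactly, so it is the factor $(2p)^{p}$ alone that drives the growth of the moments of $B_{t_i}$ in $p$, which is the kind of control the chaining argument behind Lemma 7 of [30] requires.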
By Lemma 7 in [30] there exists a constant $k_0$ such that, for any $m, m' \in \mathcal{M}_n$,
$$\mathbb{E}\Big[\Big(\sup_{t \in B_{m,m'}}\nu^2_n(t) - k_0\,c\,\sigma_1^4\,\frac{D_m + D_{m'}}{n}\Big)_+\Big] \le c\,\frac{e^{-(D_m + D_{m'})}}{n}. \qquad (41)$$
We have said, in the definition of the penalization function $\mathrm{pen}_\sigma$ given in Subsection 3.2, that the constant $k$ has to be calibrated. In particular, we need it to be such that $k \ge k_0\,c\,\sigma_1^4$, where $\sigma_1$ is the upper bound for the volatility provided in the second point of Assumption 1 and $k_0$ and $c$ are as in Lemma 7 of [30]. We underline that Lemma 7 in [30] has been proved for a noisy diffusion. However, the same reasoning applies for a jump diffusion (see the proof of Theorem 13 in [31]) and for our framework as well, as it is based on a projection argument and on algebraic computations which still hold true.
From (41) and the fourth point of Assumption 4 we get
$$\sum_{m' \in \mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{t \in B_{m,m'}}\nu^2_n(t) - p(m,m')\Big)_+\Big] \le \frac{c}{n}\sum_{m' \in \mathcal{M}_n}e^{-(D_m + D_{m'})} \le \frac{c}{n}.$$
It provides us, using also (37) and Proposition 2,
$$\mathbb{E}\big[\|\widehat{\sigma}^2_{\widehat m} - \sigma^2\|^2_n\big] \le \mathbb{E}\big[\|\sigma^2_m - \sigma^2\|^2_n\big] + c\,\Delta_n^{4\beta-1} + \frac{c}{n} + 8\,\mathrm{pen}(m) + \frac{c\,\Delta_n^{0\wedge(4\beta-\frac{3}{2})}}{n} + \frac{c}{n} \le C\,\inf_{m \in \mathcal{M}_n}\Big\{\inf_{t \in S_m}\|t - \sigma^2\|^2_{\pi_X} + \mathrm{pen}(m)\Big\} + C\,\Delta_n^{4\beta-1} + C\,\frac{\Delta_n^{0\wedge(4\beta-\frac{3}{2})}}{n} + \frac{C}{n}.$$

8.2 Proofs of the results on the estimation of $g$

In this section we prove the results stated in Section 4.
Proof.
The proof follows the same scheme as the proof of Proposition 3. We want to upper bound the empirical risk $\mathbb{E}[\|\widehat g_m - g\|^2_n]$. By the definition of $T_{t_i}$ we have
$$\gamma_{n,M}(t) := \frac{1}{n}\sum_{i=0}^{n-1}\big(t(X_{t_i}) - T_{t_i}\big)^2 = \frac{1}{n}\sum_{i=0}^{n-1}\big(t(X_{t_i}) - g(X_{t_i}) - (A_{t_i} + B_{t_i} + C_{t_i} + E_{t_i})\big)^2,$$
so that
$$\gamma_{n,M}(t) = \|t - g\|^2_n + \frac{1}{n}\sum_{i=0}^{n-1}(A_{t_i} + B_{t_i} + C_{t_i} + E_{t_i})^2 - \frac{2}{n}\sum_{i=0}^{n-1}(A_{t_i} + B_{t_i} + C_{t_i} + E_{t_i})\big(t(X_{t_i}) - g(X_{t_i})\big).$$
As $\widehat g_m$ minimizes $\gamma_{n,M}(t)$, for any $g_m \in S_m$ it is $\gamma_{n,M}(\widehat g_m) \le \gamma_{n,M}(g_m)$ and therefore
$$\|\widehat g_m - g\|^2_n \le \|g_m - g\|^2_n + \frac{2}{n}\sum_{i=0}^{n-1}(A_{t_i} + B_{t_i} + C_{t_i} + E_{t_i})\big(\widehat g_m(X_{t_i}) - g_m(X_{t_i})\big).$$
Using the Cauchy-Schwarz inequality and the fact that, for $d >$
$0$, $2xy \le \frac{x^2}{d} + d\,y^2$, we get
$$\|\widehat g_m - g\|^2_n \le \|g_m - g\|^2_n + \frac{2d}{n}\sum_{i=0}^{n-1}A^2_{t_i} + \frac{1}{d}\|\widehat g_m - g_m\|^2_n + 2d\,\sup_{t \in B_m}\nu^2_{n,1}(t) + \frac{1}{d}\|\widehat g_m - g_m\|^2_{\pi_X} + 2d\,\sup_{t \in B_m}\nu^2_{n,2}(t), \qquad (42)$$
where $B_m = \{t \in S_m : \|t\|_{\pi_X} \le 1\}$ and
$$\nu_{n,1}(t) := \frac{1}{n}\sum_{i=0}^{n-1}(B_{t_i} + E_{t_i})\,t(X_{t_i}), \qquad \nu_{n,2}(t) := \frac{1}{n}\sum_{i=0}^{n-1}C_{t_i}\,t(X_{t_i}). \qquad (43)$$
We still denote by $\Omega_n$ the set on which the norms $\|\cdot\|_{\pi_X}$ and $\|\cdot\|_n$ are equivalent, given by Equation (31). We now act differently to bound the risk on $\Omega_n$ and $\Omega^c_n$.

Bound of the risk on $\Omega_n$. On $\Omega_n$ it is
$$\|\widehat g_m - g_m\|^2_{\pi_X} \le 2\,\|\widehat g_m - g_m\|^2_n \le 4\,\|\widehat g_m - g\|^2_n + 4\,\|g - g_m\|^2_n,$$
where in the last estimation we have used the triangular inequality. Replacing it in (42) we get
$$\|\widehat g_m - g\|^2_n \le \|g_m - g\|^2_n + \frac{2d}{n}\sum_{i=0}^{n-1}A^2_{t_i} + 2d\,\sup_{t\in B_m}\nu^2_{n,1}(t) + 2d\,\sup_{t\in B_m}\nu^2_{n,2}(t) + \frac{6}{d}\,\|\widehat g_m - g\|^2_n + \frac{6}{d}\,\|g - g_m\|^2_n.$$
We need $d$ to be more than $6$; we take $d = 7$, obtaining
$$\|\widehat g_m - g\|^2_n \le 13\,\|g_m - g\|^2_n + \frac{98}{n}\sum_{i=0}^{n-1}A^2_{t_i} + 98\,\sup_{t \in B_m}\nu^2_{n,1}(t) + 98\,\sup_{t \in B_m}\nu^2_{n,2}(t). \qquad (44)$$
We now need to introduce a different orthonormal basis of $S_m$, compared to the one we proposed in Section 8.1 for the estimation of the volatility. The reason why it is necessary to change it is that in $\mathbb{E}[E_{t_i}\,|\,\mathcal{F}_{t_i}]$ we now get a term that depends on $\lambda_{t_i}$, which is an extra difficulty compared with the reasoning we applied below (34). Hence, we consider an orthonormal basis $(\tilde\psi_k)_k$ of $S_m$ for which
$$\mathbb{E}\big[\tilde\psi^2_k(X_{t_i}, l)\,\big|\,\lambda_{t_i} = l\big] = 1, \qquad (45)$$
where $\lambda_{t_i} = (\lambda^{(1)}_{t_i}, \dots, \lambda^{(M)}_{t_i})$.
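A basis with the normalisation (45) can be produced by a weighted Gram-Schmidt procedure, one conditioning value $l$ at a time. A minimal numerical sketch follows; the grid, the weights `w` (standing in for a discretised conditional law $\pi(dx\,|\,\lambda_{t_i}=l)$) and the monomial starting family are all illustrative assumptions, not the paper's objects:

```python
import numpy as np

def gram_schmidt_weighted(phi, w):
    """Orthonormalise the rows of `phi` for the inner product
    <f, g>_w = sum_i w_i * f_i * g_i, a discrete stand-in for
    an L^2(pi(dx | lambda = l)) inner product."""
    w = np.asarray(w, dtype=float)
    basis = []
    for f in np.asarray(phi, dtype=float):
        for b in basis:
            f = f - np.sum(w * f * b) * b    # remove the projection on b
        norm = np.sqrt(np.sum(w * f * f))
        if norm > 1e-12:                     # skip numerically dependent rows
            basis.append(f / norm)
    return np.array(basis)

# toy check: monomials on a grid, weights from a discretised density
x = np.linspace(-1.0, 1.0, 200)
w = np.exp(-x**2); w /= w.sum()
psi = gram_schmidt_weighted(np.vstack([x**0, x, x**2]), w)
G = psi @ np.diag(w) @ psi.T                 # weighted Gram matrix
print(np.allclose(G, np.eye(3), atol=1e-8))  # prints True
```

Since the weights change with the conditioning value $l$, the coordinates of a fixed function $t$ in such a basis change with $l$ as well, which is exactly why the coefficients $\alpha_l$ below depend on $\lambda_{t_i}$.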
It is possible to build such a basis starting from the one we have introduced in the proof of Proposition 3, through the Gram-Schmidt process, for the scalar product in $L^2(\pi(dx\,|\,\lambda_{t_i} = l))$, for $l \in \mathbb{R}^M$. Each $t \in B_m$ can be written
$$t = \sum_{l=1}^{D_m}\alpha_l\,\tilde\psi_l, \qquad \text{with } \sum_{l=1}^{D_m}\alpha^2_l(\lambda_{t_i}) \le 1.$$
We underline that this time, unlike it was in the estimation of the volatility, the coefficients $\alpha_l$ depend on $\lambda_{t_i}$. We omit it in the sequel to lighten the notation. Then, for $j = 1$ and $j = 2$,
$$\sup_{t \in B_m}\nu^2_{n,j}(t) = \sup_{\sum_{l=1}^{D_m}\alpha^2_l \le 1}\nu^2_{n,j}\Big(\sum_{l=1}^{D_m}\alpha_l\tilde\psi_l\Big) \le \sup_{\sum_{l=1}^{D_m}\alpha^2_l \le 1}\Big(\sum_{l=1}^{D_m}\alpha^2_l\Big)\Big(\sum_{l=1}^{D_m}\nu^2_{n,j}(\tilde\psi_l)\Big) = \sum_{l=1}^{D_m}\nu^2_{n,j}(\tilde\psi_l). \qquad (46)$$
To study the risk we need to evaluate the expected value. From (44), (46) and using the first point of Proposition 4, we get
$$\mathbb{E}\big[\|\widehat g_m - g\|^2_n\,\mathbf{1}_{\Omega_n}\big] \le 13\,\mathbb{E}\big[\|g_m - g\|^2_n\big] + c\,\Delta_n^{1-\tilde\varepsilon} + 98\sum_{l=1}^{D_m}\mathbb{E}[\nu^2_{n,1}(\tilde\psi_l)] + 98\sum_{l=1}^{D_m}\mathbb{E}[\nu^2_{n,2}(\tilde\psi_l)]. \qquad (47)$$
By the definition (43) of $\nu_{n,1}$ and the points 2 and 3 of Proposition 4, it is
$$\sum_{l=1}^{D_m}\mathbb{E}[\nu^2_{n,1}(\tilde\psi_l)] \le \frac{c}{n^2}\sum_{i=0}^{n-1}\sum_{l=1}^{D_m}\mathbb{E}\big[\tilde\psi^2_l(X_{t_i}, \lambda_{t_i})\,\mathbb{E}[(B_{t_i} + E_{t_i})^2\,|\,\mathcal{F}_{t_i}]\big] \le \frac{c}{n^2}\sum_{i=0}^{n-1}\sum_{l=1}^{D_m}\mathbb{E}\Big[\tilde\psi^2_l(X_{t_i}, \lambda_{t_i})\Big(c\,\sigma_1^4 + c\,a_1^4\,\Delta_{n,i}\sum_{j=1}^{M}|\lambda^{(j)}_{t_i}|^2\Big)\Big].$$
We observe that the first term in the right hand side here above is
$$\frac{c\,\sigma_1^4}{n^2}\sum_{i=0}^{n-1}\sum_{l=1}^{D_m}\mathbb{E}[\tilde\psi^2_l(X_{t_i}, \lambda_{t_i})] \le \frac{c\,D_m}{n},$$
where we moved to the conditional expectation with respect to $\lambda_{t_i} = (\lambda^{(1)}_{t_i}, \dots, \lambda^{(M)}_{t_i})$ and used (45). Regarding the second term, we remark it is
$$\mathbb{E}\Big[\tilde\psi^2_l(X_{t_i}, \lambda_{t_i})\,c\,a_1^4\,\Delta_{n,i}\sum_{j=1}^{M}|\lambda^{(j)}_{t_i}|^2\Big] = \mathbb{E}\Big[\mathbb{E}\big[\tilde\psi^2_l(X_{t_i}, \lambda_{t_i})\,\big|\,\lambda_{t_i}\big]\,c\,a_1^4\,\Delta_{n,i}\sum_{j=1}^{M}|\lambda^{(j)}_{t_i}|^2\Big] = c\,a_1^4\,\Delta_{n,i}\sum_{j=1}^{M}\mathbb{E}[|\lambda^{(j)}_{t_i}|^2] \le c\,a_1^4\,\Delta_{n,i},$$
where in the last inequality we have used the boundedness of the moments of $\lambda$ and (45). It follows
$$\frac{c}{n^2}\sum_{i=0}^{n-1}\sum_{l=1}^{D_m}\mathbb{E}\Big[\tilde\psi^2_l(X_{t_i}, \lambda_{t_i})\,c\,a_1^4\,\Delta_{n,i}\sum_{j=1}^{M}|\lambda^{(j)}_{t_i}|^2\Big] \le \frac{c\,D_m\,a_1^4\,\Delta_{n,i}}{n}.$$
Hence,
$$\sum_{l=1}^{D_m}\mathbb{E}[\nu^2_{n,1}(\tilde\psi_l)] \le c\,(\sigma_1^4 + a_1^4)\,\frac{D_m}{n}. \qquad (48)$$
In order to evaluate $\mathbb{E}[\nu^2_{n,2}(\tilde\psi_l)]$, the following lemma will be useful.

Lemma 3.
Suppose that A1-A3 hold true. Then,
$$\mathrm{Var}\Big(\frac{1}{n}\sum_{i=0}^{n-1}C_{t_i}\,\tilde\psi_l(X_{t_i}, \lambda_{t_i})\Big) \le \frac{c}{n\,\Delta_n}.$$
The proof of Lemma 3 is in the appendix. Lemma 3 yields
$$\sum_{l=1}^{D_m}\mathbb{E}\big[\nu^2_{n,2}(\tilde\psi_l)\big] \le \frac{c\,D_m}{n\,\Delta_n}.$$
Replacing the inequality here above and (48) in (47) we get, using also that $\Delta_{n,i} \ge c\,\Delta_{\min}$ and the fact that there exist $c_1$ and $c_2$ for which $c_1 \le \Delta_n/\Delta_{\min} \le c_2$,
$$\mathbb{E}\big[\|\widehat g_m - g\|^2_n\,\mathbf{1}_{\Omega_n}\big] \le 13\,\mathbb{E}\big[\|g_m - g\|^2_n\big] + c\,\Delta_n^{1-\tilde\varepsilon} + c\,(\sigma_1^4 + a_1^4 + 1)\,\frac{D_m}{n\,\Delta_n}.$$
As the choice $g_m \in S_m$ is arbitrary, we obtain
$$\mathbb{E}\big[\|\widehat g_m - g\|^2_n\,\mathbf{1}_{\Omega_n}\big] \le$$
$$13\,\inf_{t \in S_m}\|t - g\|^2_{\pi_X} + c\,\Delta_n^{1-\tilde\varepsilon} + c\,(\sigma_1^4 + a_1^4 + 1)\,\frac{D_m}{n\,\Delta_n}. \qquad (49)$$

Bound of the risk on $\Omega^c_n$. Let us set $e = (e_{t_0}, \dots, e_{t_{n-1}})$, where $e_{t_i} := T_{t_i} - g(X_{t_i}) = A_{t_i} + B_{t_i} + C_{t_i} + E_{t_i}$. Moreover,
$$\Pi_m T = \Pi_m(T_{t_0}, \dots, T_{t_{n-1}}) = (\widehat g_m(X_{t_0}), \dots, \widehat g_m(X_{t_{n-1}})),$$
where $\Pi_m$ is the Euclidean orthogonal projection over $S_m$. Then, according to the definition of the projection,
$$\|\widehat g_m - g\|^2_n = \|\Pi_m T - g\|^2_n = \|\Pi_m T - \Pi_m g\|^2_n + \|\Pi_m g - g\|^2_n \le \|T - g\|^2_n + \|g\|^2_n = \|e\|^2_n + \|g\|^2_n.$$
Therefore, from the Cauchy-Schwarz inequality,
$$\mathbb{E}\big[\|\widehat g_m - g\|^2_n\,\mathbf{1}_{\Omega^c_n}\big] \le \frac{1}{n}\sum_{i=0}^{n-1}\mathbb{E}[e^2_{t_i}\,\mathbf{1}_{\Omega^c_n}] + \frac{1}{n}\sum_{i=0}^{n-1}\mathbb{E}[g^2(X_{t_i})\,\mathbf{1}_{\Omega^c_n}] \le \frac{1}{n}\sum_{i=0}^{n-1}\mathbb{E}[e^4_{t_i}]^{1/2}\,\mathbb{P}(\Omega^c_n)^{1/2} + \frac{1}{n}\sum_{i=0}^{n-1}\mathbb{E}[g^4(X_{t_i})]^{1/2}\,\mathbb{P}(\Omega^c_n)^{1/2}.$$
Moreover, using the boundedness of both $a$ and $\sigma$ and the fact that $\mathbb{E}[|\lambda_{t_i}|^4] < \infty$, we obtain $\mathbb{E}[g^4(X_{t_i})] < \infty$. We are left to evaluate $\mathbb{E}[e^4_{t_i}]$. From Proposition 4 it follows
$$\mathbb{E}[e^4_{t_i}] \le c\,\mathbb{E}[A^4_{t_i} + B^4_{t_i} + C^4_{t_i} + E^4_{t_i}] \le c\,\Delta_n^{1-\tilde\varepsilon} + c + c + \frac{c}{\Delta^2_{n,i}} \le \frac{c}{\Delta^2_n}.$$
Putting the pieces together, it yields
$$\mathbb{E}\big[\|\widehat g_m - g\|^2_n\,\mathbf{1}_{\Omega^c_n}\big] \le \frac{c}{n\,\Delta_n} + \frac{c}{n} \le \frac{c}{n\,\Delta_n}. \qquad (50)$$
From (49) and (50) it follows
$$\mathbb{E}\big[\|\widehat g_m - g\|^2_n\big] \le 13\,\inf_{t \in S_m}\|t - g\|^2_{\pi_X} + C\,(\sigma_1^4 + a_1^4 + 1)\,\frac{D_m}{n\,\Delta_n} + C\,\Delta_n^{1-\tilde\varepsilon} + \frac{C}{n\,\Delta_n}.$$

Proof of Theorem 2.
We act again in a different way depending on whether or not we are on $\Omega_n$. On $\Omega^c_n$ the proof can be led as before, getting
$$\mathbb{E}\big[\|\widehat g_{\widehat m} - g\|^2_n\,\mathbf{1}_{\Omega^c_n}\big] \le \frac{c}{n\,\Delta_n}. \qquad (51)$$
Now we investigate what happens on $\Omega_n$. In particular, we analyse what happens on $\mathcal{O} \subset \Omega_n$, a set which will be defined later (see (56)). By the definition of $\widehat m$ we have
$$\gamma_{n,M}(\widehat g_{\widehat m}) + \mathrm{pen}(\widehat m) \le \gamma_{n,M}(\widehat g_m) + \mathrm{pen}(m) \le \gamma_{n,M}(g_m) + \mathrm{pen}(m)$$
and so, following the proof of Equation (44), we get
$$\mathbb{E}\big[\|\widehat g_{\widehat m} - g\|^2_n\,\mathbf{1}_{\mathcal{O}}\big] \le 13\,\mathbb{E}\big[\|g_m - g\|^2_n\big] + \frac{98}{n}\sum_{i=0}^{n-1}\mathbb{E}[A^2_{t_i}] + 98\,\mathbb{E}\Big[\sup_{t \in B_{m,\widehat m}}\nu^2_n(t)\,\mathbf{1}_{\mathcal{O}}\Big] + 7\,\mathrm{pen}(m) - \mathbb{E}[\mathrm{pen}(\widehat m)],$$
where
$$\nu_n(t) := \frac{1}{n}\sum_{i=0}^{n-1}(B_{t_i} + C_{t_i} + E_{t_i})\,t(X_{t_i}), \qquad B_{m,m'} := \{h \in S_m + S_{m'} : \|h\|_{\pi_X} \le 1\}.$$
In order to control the term $\mathbb{E}[\sup_{t \in B_{m,\widehat m}}\nu^2_n(t)\,\mathbf{1}_{\mathcal{O}}]$, we introduce the function
$$p(m,m') := \frac{1}{98}\big(\mathrm{pen}(m) + \mathrm{pen}(m')\big).$$
It is
$$\mathbb{E}\Big[\sup_{t \in B_{m,\widehat m}}\nu^2_n(t)\,\mathbf{1}_{\mathcal{O}}\Big] \le \mathbb{E}[p(m,\widehat m)] + \sum_{m' \in \mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{t \in B_{m,m'}}\nu^2_n(t) - p(m,m')\Big)_+\mathbf{1}_{\mathcal{O}}\Big].$$
Replacing it in the inequality here above and using the first point of Proposition 4 we get
$$\mathbb{E}\big[\|\widehat g_{\widehat m} - g\|^2_n\,\mathbf{1}_{\mathcal{O}}\big] \le 13\,\mathbb{E}\big[\|g_m - g\|^2_n\big] + c\,\Delta_n^{1-\tilde\varepsilon} + 98\,\mathbb{E}[p(m,\widehat m)] + 7\,\mathrm{pen}(m) - \mathbb{E}[\mathrm{pen}(\widehat m)] + 98\sum_{m' \in \mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{t \in B_{m,m'}}\nu^2_n(t) - p(m,m')\Big)_+\mathbf{1}_{\mathcal{O}}\Big].$$
We have introduced the function $p(m,m')$ with the purpose of using the Talagrand inequality on the last term in the right hand side of the equation here above. We recall the following version of the Talagrand inequality, which has been stated in [31] and proved by Birgé and Massart (1998) [9] (Corollary 2, p. 354) and Comte and Merlevède (2002) [14] (pp. 222-223).

Lemma 4.
Let $T_1, \dots, T_n$ be independent random variables with values in some Polish space $\mathcal{X}$ and $v_p : B_{m,m'} \to \mathbb{R}$ such that
$$v_p(r) := \frac{1}{p}\sum_{j=1}^{p}\big[r(T_j) - \mathbb{E}[r(T_j)]\big].$$
Then,
$$\mathbb{E}\Big[\Big(\sup_{r \in B_{m,m'}}|v_p(r)|^2 - 2H^2\Big)_+\Big] \le c\Big(\frac{v}{p}\,e^{-c_1\frac{pH^2}{v}} + \frac{M^2}{p^2}\,e^{-c_2\frac{pH}{M}}\Big), \qquad (52)$$
with $c$ a universal constant and where
$$\sup_{r \in B_{m,m'}}\|r\|_\infty \le M, \qquad \mathbb{E}\Big[\sup_{r \in B_{m,m'}}|v_p(r)|\Big] \le H, \qquad \sup_{r \in B_{m,m'}}\frac{1}{p}\sum_{j=1}^{p}\mathrm{Var}(r(T_j)) \le v.$$
We observe that in the Talagrand lemma here above the random variables $T_1, \dots, T_n$ are supposed to be independent. Starting from our variables we can get independent variables through Berbee's coupling method. We recall it below; it is proved by Viennet in Proposition 5.1 of [33], while an analogous statement in continuous time can be found in [3].

Lemma 5.
Let $(M_t)_{t \ge 0}$ be a stationary and exponentially $\beta$-mixing process observed at discrete times $t_0 \le t_1 \le \dots \le t_n = T$. Let $p_n$ and $q_n$ be two integers such that $n = 2\,p_n\,q_n$. For any $j \in \{1, 2\}$ and $1 \le k \le p_n$ we consider the random variables
$$U_{k,j} := \big(M_{t_{(2(k-1)+j-1)q_n+1}}, \dots, M_{t_{(2(k-1)+j)q_n}}\big),$$
i.e. the odd-numbered ($j=1$) and even-numbered ($j=2$) blocks of $q_n$ consecutive observations. There exist random variables $M^*_{t_1}, \dots, M^*_{t_n}$ such that
$$U^*_{k,j} := \big(M^*_{t_{(2(k-1)+j-1)q_n+1}}, \dots, M^*_{t_{(2(k-1)+j)q_n}}\big)$$
satisfy the following properties.
- For any $j \in \{1, 2\}$, the random vectors $U^*_{1,j}, \dots, U^*_{p_n,j}$ are independent.
- For any $(j,k) \in \{1,2\} \times \{1, \dots, p_n\}$, $U_{k,j}$ and $U^*_{k,j}$ have the same distribution.
- For any $(j,k) \in \{1,2\} \times \{1, \dots, p_n\}$, $\mathbb{P}(U_{k,j} \ne U^*_{k,j}) \le \beta_M(q_n\,\Delta_{\min})$, where $\beta_M$ is the $\beta$-mixing coefficient of the process $(M_t)$.

We want to apply Berbee's coupling lemma to the random vectors $(B_{t_i} + C_{t_i} + E_{t_i},\, X_{t_i})$, that we write as a function of $Z_t = (X_t, \lambda_t)$, which is stationary and exponentially $\beta$-mixing, as discussed in Section 2.3. We define the $\sigma$-algebra
$$\widetilde{\mathcal{F}}_{t_i} := \sigma(X_s, \lambda_s,\ s \in (t_i, t_{i+1}]), \qquad (53)$$
completed with the null sets. Because of the exponential $\beta$-mixing of $(X_t, \lambda_t)$ we know it is $\beta(\widetilde{\mathcal{F}}_{t_i}, \widetilde{\mathcal{F}}_{t_j}) \le c\,e^{-\gamma|t_j - t_i|}$. From (2) and the fact that we have assumed the matrix $(c_{i,j})$ to be invertible, it is possible to write both $Z_{t_i}$ and $J_{t_i}$ as functions of $X$ and $\lambda$, and so they are measurable with respect to $\widetilde{\mathcal{F}}_{t_i}$. By the definition of $B_{t_i}$, $C_{t_i}$ and $E_{t_i}$ it follows that $B_{t_i} + C_{t_i} + E_{t_i}$ is also measurable with respect to $\widetilde{\mathcal{F}}_{t_i}$.
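The index bookkeeping behind Lemma 5 is simple: the $n$ observation times are cut into $2p_n$ consecutive blocks of length $q_n$, and the odd-numbered and even-numbered blocks form the two families $U_{k,1}$, $U_{k,2}$. A small illustrative sketch of this splitting (not part of the proof):

```python
def berbee_blocks(n, q):
    """Split indices 0..n-1 (with n = 2*p*q) into two families of p blocks
    of length q: odd-numbered blocks (j=1) and even-numbered blocks (j=2).
    After Berbee's coupling, blocks within one family are independent."""
    assert n % (2 * q) == 0, "need n = 2 * p * q"
    p = n // (2 * q)
    odd  = [list(range(2 * k * q,       (2 * k + 1) * q)) for k in range(p)]
    even = [list(range((2 * k + 1) * q, (2 * k + 2) * q)) for k in range(p)]
    return odd, even

odd, even = berbee_blocks(n=24, q=3)
print(odd[0], even[0])   # [0, 1, 2] [3, 4, 5]
```

Interleaving the two families is what buys independence: within one family, consecutive blocks are separated by a gap of $q_n$ observations, i.e. a time lag of at least $q_n\Delta_{\min}$, over which the $\beta$-mixing coefficient is small.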
We can therefore use Berbee's coupling lemma on $(B_{t_i} + C_{t_i} + E_{t_i},\, X_{t_i})$. For $t \in B_{m,m'}$, according to Berbee's coupling lemma, we can construct
$$U^*_{k,j} := \frac{1}{q_n}\sum_{l=1}^{q_n}(B + C + E)^*_{t_{(2(k-1)+j-1)q_n+l}}\ t\big(X^*_{t_{(2(k-1)+j-1)q_n+l}}\big)$$
such that, for $j \in \{1,2\}$, the random variables $(U^*_{k,j})_{1 \le k \le p_n}$ are independent and have the same distribution as
$$U_{k,j} := \frac{1}{q_n}\sum_{l=1}^{q_n}(B + C + E)_{t_{(2(k-1)+j-1)q_n+l}}\ t\big(X_{t_{(2(k-1)+j-1)q_n+l}}\big).$$
Let us set $\Omega^* := \{\omega,\ \forall j, \forall k,\ U_{k,j} = U^*_{k,j}\}$; by Berbee's coupling lemma it is
$$\mathbb{P}(\Omega^{*,c}) \le 2\,p_n\,\beta_Z(q_n\,\Delta_{\min}) \le c\,\frac{n}{q_n}\,e^{-\gamma\,q_n\,\Delta_{\min}}.$$
We recall that $p_n$ and $q_n$ are two integers to be chosen such that $2\,p_n\,q_n = n$. It is enough to take $q_n := \lfloor\frac{c}{\gamma\,\Delta_{\min}}\log n\rfloor$ in the bound here above to get
$$\mathbb{P}(\Omega^{*,c}) \le \frac{c}{n^2\,\log n}. \qquad (54)$$
We want to apply the Talagrand inequality on $v^*_n(t) := v^{1,*}_n(t) + v^{2,*}_n(t)$, where
$$v^{1,*}_n(t) = \frac{1}{p_n}\sum_{k=1}^{p_n}U^*_{k,1}, \qquad v^{2,*}_n(t) = \frac{1}{p_n}\sum_{k=1}^{p_n}U^*_{k,2}.$$
To do that, we first of all observe that, as a consequence of Proposition 4, it is $\mathbb{E}[U^*_{k,j}] = 0$ for any $j \in \{1,2\}$ and any $k \in \{1, \dots, p_n\}$. Now we want to compute the constants $M$, $v$ and $H$ as defined in Lemma 4. To compute $M$ we introduce the following set
$$\Omega_B := \big\{\omega,\ \forall j, \forall k,\ \forall \varepsilon > 0,\ |U^*_{k,j}| \le \tilde c\,n^\varepsilon\sqrt{D}\big\}, \qquad (55)$$
with $D := D_m + D_{m'}$. The following lemma is proven in the appendix.

Lemma 6. Suppose that A1-A3 hold. Then there exists $c > 0$ such that $\mathbb{P}(\Omega^c_B) \le \frac{c}{n^2}$.

We set
$$\mathcal{O} := \Omega_n \cap \Omega_B \cap \Omega^*. \qquad (56)$$
On $\mathcal{O}$ the random variables $|U^*_{k,j}|$ are replaced by $|U^*_{k,j}| \wedge \tilde c\,n^\varepsilon\sqrt{D}$. As the original variables $U$ and the independent ones $U^*$ are the same on $\mathcal{O}$ even after truncation, the truncated random variables are still independent and we can use the Talagrand inequality on them. From (36), (54) and Lemma 6 it follows $\mathbb{P}(\mathcal{O}^c) \le \frac{c}{n^2}$. We act on $\mathcal{O}^c$ as we did on $\Omega^c_n$, getting
$$\mathbb{E}\big[\|\widehat g_{\widehat m} - g\|^2_n\,\mathbf{1}_{\mathcal{O}^c}\big] \le \frac{c}{n\,\Delta_n}. \qquad (57)$$
On the other side, on $\mathcal{O}$ we are really going to use Talagrand's inequality to control
$$\sum_{m' \in \mathcal{M}_n}\mathbb{E}\Big[\Big(\sup_{t \in B_{m,m'}}\nu^2_n(t) - p(m,m')\Big)_+\mathbf{1}_{\mathcal{O}}\Big].$$
From the definition of $\Omega_B$, we clearly obtain that $M := c\,n^\varepsilon\sqrt{D}$. With the purpose of computing $v$ we observe that, for any $t \in B_{m,m'}$, by stationarity it is
$$\mathrm{Var}(U^*_{k,j}) \le \frac{c}{q_n}\sum_{l=1}^{q_n}\mathbb{E}\Big[t^2(X^*_{t_l})\,C^{*,2}_{t_l}\Big(\sigma_1^2 + a_1^2\,\Delta_n\sum_{j=1}^{M}\lambda^{(j)}_{t_l}\Big)\Big],$$
where we have used the second and the third points of Proposition 4. This variance is therefore upper bounded by
$$\mathrm{Var}(U^*_{k,j}) \le \frac{c}{q_n}\sum_{l=1}^{q_n}\mathbb{E}\big[t^{2p}(X^*_{t_l})\big]^{\frac{1}{p}}\,\mathbb{E}\Big[C^{*,2q}_{t_l}\Big(\sigma_1^2 + a_1^2\,\Delta_n\sum_{j=1}^{M}\lambda^{(j)}_{t_l}\Big)^{q}\Big]^{\frac{1}{q}}, \qquad (58)$$
where we have used the Hölder inequality with $q$ big and $p$ next to $1$. We can see $t^{2p}(X^*_{t_l})$ as
$$t^{2p}(X^*_{t_l}) = t^2(X^*_{t_l})\,t^{2p-2}(X^*_{t_l}) \le \|t\|^{2p-2}_\infty\,t^2(X^*_{t_l}).$$
As $p$ has been chosen next to $1$, $\|t\|^{2p-2}_\infty \le \|t\|^{\delta}_\infty \le c\,D^{\delta}$ for any $\delta$ arbitrarily small. Using also the boundedness of the moments of $C$ and $\lambda$, it follows that the right hand side of (58) is upper bounded by
$$\frac{c\,D^{\delta}}{q_n\,\Delta_n} =: v.$$
In order to compute $H$ we observe it is
$$\mathbb{E}\Big[\Big|\frac{1}{p_n}\sum_{k=1}^{p_n}U^*_{k,j}\Big|\Big] \le \sqrt{\mathbb{E}\big[(v^{j,*}_{n}(t))^2\big]}.$$
To find an upper bound for the right hand side here above we act in a similar way to how we did before (46): we introduce the orthonormal basis $(\bar\psi_k)_k$ for which $\mathbb{E}[\bar\psi^2_k(X_{t_i}, l)\,|\,\lambda_{t_i} = l] = 1$, such that each $t \in B_{m,m'}$ can be written as
$$t = \sum_{l=1}^{D}\bar\alpha_l\,\bar\psi_l, \qquad \text{with } \sum_{l=1}^{D}\bar\alpha^2_l(\lambda_{t_i}) \le 1.$$
The coefficients $\bar\alpha_l$ depend on $\lambda_{t_i}$. We omit it in the sequel to lighten the notation.
Similarly to (46), we have
$$\sup_{t \in B_{m,m'}}(v^{j,*}_{n}(t))^2 = \sup_{\sum_{l=1}^{D}\bar\alpha^2_l \le 1}\Big(v^{j,*}_{n}\Big(\sum_{l=1}^{D}\bar\alpha_l\bar\psi_l\Big)\Big)^2 \le \sup_{\sum_{l=1}^{D}\bar\alpha^2_l \le 1}\Big(\sum_{l=1}^{D}\bar\alpha^2_l\Big)\Big(\sum_{l=1}^{D}(v^{j,*}_{n}(\bar\psi_l))^2\Big) = \sum_{l=1}^{D}(v^{j,*}_{n}(\bar\psi_l))^2.$$
Acting exactly as we did in order to get (48), and using Lemma 3 on $v^{1,*}_n$ and $v^{2,*}_n$, we obtain
$$\sqrt{\mathbb{E}\big[(v^{j,*}_{n}(t))^2\big]} \le c\,\sqrt{\frac{D}{n\,\Delta_n}} =: H.$$
We now use the Talagrand inequality as in Lemma 4. It follows
$$\mathbb{E}\Big[\Big(\sup_{t \in B_{m,\widehat m}}(\nu^*_n(t))^2 - 2H^2\Big)_+\mathbf{1}_{\mathcal{O}}\Big] \le \frac{c\,D^{\delta}}{p_n\,q_n\,\Delta_n}\exp\Big(-c\,\frac{D}{n\,\Delta_n}\,\frac{p_n\,q_n\,\Delta_n}{D^{\delta}}\Big) + \frac{c\,n^{2\varepsilon}\,D}{p^2_n}\exp\Big(-c\,\frac{p_n\,\sqrt{D}}{\sqrt{n\,\Delta_n}\,n^{\varepsilon}\,\sqrt{D}}\Big) = \frac{c\,D^{\delta}}{n\,\Delta_n}\exp\big(-c\,D^{1-\delta}\big) + \frac{c\,n^{2\varepsilon}\,D}{p^2_n}\exp\Big(-c\,\frac{\sqrt{n}}{q_n\,\sqrt{\Delta_n}\,n^{\varepsilon}}\Big).$$
We recall that $q_n = c\,\frac{\log n}{\Delta_{\min}}$. We observe that, as $\Delta_{\min}$ and $\Delta_n$ differ only up to a constant,
$$c\,\frac{\sqrt{n}}{q_n\,\sqrt{\Delta_n}\,n^{\varepsilon}} = c\,\frac{\sqrt{n\,\Delta_n}}{n^{\varepsilon}\,\log n}.$$
Moreover, it goes to $\infty$ for $n$ going to infinity, as we have assumed that $n^{\varepsilon}\log n = o(\sqrt{n\Delta_n})$. Therefore, the second term here above is negligible compared to the first one. It follows, using also the definition of $p(m,\widehat m)$, the fact that, for $D \ge 1$, the quantity $D^{\delta}\,e^{-c'D^{1-\delta}}$ is bounded.

For the following proofs, the lemma stated and proved below is a very helpful tool. It provides the size of the increments of both $X$ and $\lambda$.

Lemma 7. Suppose that A1-A3 hold. Then, there exist positive constants $c_1$ and $c_2$ such that, for all $t > s$ with $|t - s| < 1$, the following hold true.
1. For all $p \ge 2$, $\mathbb{E}[|X_t - X_s|^p] \le c_1\,|t - s|$.
2. For all $p \ge 2$ and for any $j \in \{1, \dots, M\}$, $\mathbb{E}[|\lambda^{(j)}_t - \lambda^{(j)}_s|^p] \le c_1\,|t - s|$.
3. $\mathbb{E}[\,|\lambda_t - \lambda_s|\ |\ \mathcal{F}_s] \le c_2\,|t - s|\,(1 + |\lambda_s|)$, where $\lambda = (\lambda^{(1)}, \dots, \lambda^{(M)})$ and $|\cdot|$ stands for the Euclidean norm.
4. For any $j \in \{1, \dots, M\}$, $\sup_{h \in [0,1]}\mathbb{E}[\,|\lambda^{(j)}_{s+h}|\ |\ \mathcal{F}_s] \le |\lambda^{(j)}_s| + c_2\,|h|\,(1 + |\lambda^{(j)}_s|)$.

Proof. We start proving the first point. From the dynamics (2) of the process $X$ we have
$$|X_t - X_s|^p \le c\,\Big|\int_s^t b(X_u)\,du\Big|^p + c\,\Big|\int_s^t \sigma(X_u)\,dW_u\Big|^p + c\,\Big|\int_s^t a(X_{u^-})\sum_{j=1}^{M}dN^{(j)}_u\Big|^p =: I_1 + I_2 + I_3.$$
From the Jensen inequality, the polynomial growth of $b$ and the fact that $X$ has bounded moments it follows
$$\mathbb{E}[I_1] \le c\,|t - s|^{p-1}\int_s^t\mathbb{E}[|b(X_u)|^p]\,du \le c\,|t - s|^p. \qquad (59)$$
Using the Burkholder-Davis-Gundy inequality, the Jensen inequality and the boundedness of $\sigma$, it is
$$\mathbb{E}[I_2] \le c\,\mathbb{E}\Big[\Big(\int_s^t\sigma^2(X_u)\,du\Big)^{p/2}\Big] \le c\,|t - s|^{\frac{p}{2}-1}\int_s^t\mathbb{E}[|\sigma(X_u)|^p]\,du \le c\,\sigma_1^p\,|t - s|^{p/2}. \qquad (60)$$
To evaluate $I_3$, the Kunita inequality will be useful. We refer to the Appendix of [24] for its proof in a general form, while below (A7) on page 52 of [2] an example of its application can be found, in a form closer to the one we are going to use. For a compensated Poisson random measure $\tilde\mu = \mu - \bar\mu$ and a jump coefficient $l(x, z)$, indeed, the Kunita inequality provides the following:
$$\mathbb{E}\Big[\Big|\int_0^t\int_{\mathbb{R}}l(X_{s^-}, z)\,\tilde\mu(ds, dz)\Big|^p\Big] \le c\,\mathbb{E}\Big[\int_0^t\int_{\mathbb{R}}|l(X_{s^-}, z)|^p\,\bar\mu(ds, dz)\Big] + c\,\mathbb{E}\Big[\Big|\int_0^t\int_{\mathbb{R}}l^2(X_{s^-}, z)\,\bar\mu(ds, dz)\Big|^{p/2}\Big].$$
We remark that, up to changing the constant $c$ in the right hand side, the equation here above holds with the measure $\mu$ instead of the compensated one $\tilde\mu$. In the sequel we will apply the Kunita inequality to the measure $dN^{(j)}_u$ and the compensated one $d\widetilde N^{(j)}_u$, for $j \in \{1, \dots, M\}$.
The compensator is in this case $\lambda^{(j)}_u\,du$. Using on $I_3$ the Kunita inequality together with the Jensen inequality and the boundedness of $a$ we get
$$\mathbb{E}[I_3] \le c\sum_{j=1}^{M}\mathbb{E}\Big[\int_s^t|a(X_{u^-})|^p\,\lambda^{(j)}_u\,du + \Big(\int_s^t a^2(X_{u^-})\,\lambda^{(j)}_u\,du\Big)^{p/2}\Big] \le \sum_{j=1}^{M}c\,|a|^p_\infty\int_s^t\mathbb{E}[\lambda^{(j)}_u]\,du + c\,|a|^p_\infty\,|t - s|^{\frac{p}{2}-1}\int_s^t\mathbb{E}[|\lambda^{(j)}_u|^{p/2}]\,du \le c\,|a|^p_\infty\,(|t - s| + |t - s|^{p/2}) = c\,|a|^p_\infty\,|t - s|. \qquad (61)$$
From (59), (60) and (61), as $|t - s| < 1$, it follows $\mathbb{E}[|X_t - X_s|^p] \le c\,|t - s|$.

Point 2. Concerning the second point, for any $j \in \{1, \dots, M\}$ it is
$$|\lambda^{(j)}_t - \lambda^{(j)}_s|^p \le c\,\Big|\alpha\int_s^t(\lambda^{(j)}_u - \zeta_j)\,du\Big|^p + c\,\Big|\int_s^t\sum_{i=1}^{M}c_{i,j}\,dN^{(i)}_u\Big|^p.$$
Acting as in the proof of the first point, using as main arguments the Jensen inequality, the Kunita inequality and the boundedness of the moments of $\lambda$, we easily get the wanted estimation.

Point 3. We consider the dynamics of $\lambda$ gathered in (2) in matrix form, and so we have
$$\lambda_t - \lambda_s = -\alpha\int_s^t(\lambda_u - \zeta)\,du + \int_s^t c\,dN_u =: D_s + G_s,$$
where $\lambda_t = (\lambda^{(1)}_t, \dots, \lambda^{(M)}_t)$ and $c \in \mathbb{R}^M \times \mathbb{R}^M$. We start evaluating $D_s$. By adding and subtracting $\lambda_s$, and denoting by $\mathbb{E}_s[\cdot]$ the quantity $\mathbb{E}[\cdot\,|\,\mathcal{F}_s]$, we easily get
$$\mathbb{E}_s[|D_s|] \le c\,|t - s|\,(1 + |\lambda_s|) + c\int_s^t\mathbb{E}_s[|\lambda_u - \lambda_s|]\,du.$$
On $G_s$ we use the compensation formula and we apply the same reasoning as before, getting
$$\mathbb{E}_s[|G_s|] \le \mathbb{E}_s\Big[\int_s^t|c|\,|\lambda_u|\,du\Big] \le c\,|t - s|\,|\lambda_s| + c\int_s^t\mathbb{E}_s[|\lambda_u - \lambda_s|]\,du.$$
Putting the pieces together it follows
$$\mathbb{E}_s[|\lambda_t - \lambda_s|] \le c\,|t - s|\,(1 + |\lambda_s|) + c\int_s^t\mathbb{E}_s[|\lambda_u - \lambda_s|]\,du.$$
We use the Gronwall lemma, which yields
$$\mathbb{E}_s[|\lambda_t - \lambda_s|] \le c\,|t - s|\,(1 + |\lambda_s|)\,e^{c}.$$
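The Gronwall step used here for Point 3 can be made explicit. Writing $u(t)$ for the conditional expectation of the increment, the bound obtained above is an integral inequality of exactly the form handled by the lemma (a sketch in the notation of the proof, with $c$ the constant above):

```latex
% With u(t) := E_s[ |\lambda_t - \lambda_s| ]
% and the nondecreasing map a(t) := c |t-s| (1 + |\lambda_s|):
u(t) \le a(t) + c \int_s^t u(v)\,dv , \qquad u(s) = 0 .
% Gronwall's lemma in integral form then gives
u(t) \le a(t)\, e^{c(t-s)} \le c\,|t-s|\,(1 + |\lambda_s|)\, e^{c} ,
% where the last step uses |t-s| < 1.
```

This is why the constant $e^{c}$ can be absorbed into $c_2$ in the statement of Lemma 7.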
Point 4. We observe that, for any $h \in [0, 1]$,
$$\mathbb{E}_s[|\lambda^{(j)}_{s+h}|] \le |\lambda^{(j)}_s| + \mathbb{E}_s[|\lambda^{(j)}_{s+h} - \lambda^{(j)}_s|] \le |\lambda^{(j)}_s| + c\,|h|\,(1 + |\lambda_s|),$$
where we have used the just-shown third point of this lemma.

A.1 Proof of Proposition 1

Proof. We write $V(x, y) = V_1(x) + V_2(y)$, where $V_1(x) = |x|^m$ for $m$ arbitrarily big and $V_2(y) = e^{\sum_{i,j}m_{ij}|y^{(ij)}|}$. From the definition (5) of $A_{\tilde z}$ we have $A_{\tilde z}V = A^{(1)}_{\tilde z}V + A^{(2)}_{\tilde z}V$, where
$$A^{(1)}_{\tilde z}V(x, y) := \partial_x V(x, y)\,b(x) + \frac{1}{2}\sigma^2(x)\,\partial^2_x V(x, y) + \sum_{j=1}^{M}f_j\Big(\sum_{k=1}^{M}y^{(jk)}\Big)\big[V_1(x + a(x)) - V_1(x)\big] = m\,|x|^{m-2}x\,b(x) + \frac{1}{2}\sigma^2(x)\,m(m-1)\,|x|^{m-2} + \sum_{j=1}^{M}f_j\Big(\sum_{k=1}^{M}y^{(jk)}\Big)\big[|x + a(x)|^m - |x|^m\big]$$
is the diffusion part and
$$A^{(2)}_{\tilde z}V(x, y) := A_{\tilde z}V(x, y) - A^{(1)}_{\tilde z}V(x, y) = -\alpha\sum_{i,j=1}^{M}y^{(ij)}\,\partial_{y^{(ij)}}V(x, y) + \sum_{j=1}^{M}f_j\Big(\sum_{k=1}^{M}y^{(jk)}\Big)\big[V_2(y + \Delta_j) - V_2(y)\big]$$
is the jump part of the generator. The arguments of the proof of Proposition 4.5 in [11] imply that
$$A^{(2)}_{\tilde z}V(x, y) = A^{(2)}_{\tilde z}V_2(y) \le -c_1\,V_2(y) + c_2\,\mathbf{1}_K(y), \qquad (62)$$
with $c_1$ and $c_2$ some positive constants and $K$ some compact of $\mathbb{R}^{M \times M}$. Moreover, denoting by $\bar f(y) := \sum_{j=1}^{M}f_j(\sum_{k=1}^{M}y^{(jk)})$ the total jump rate, it is
$$A^{(1)}_{\tilde z}V(x, y) = m\,|x|^{m-2}x\,b(x) + \frac{1}{2}\sigma^2(x)\,m(m-1)\,|x|^{m-2} + \bar f(y)\big[|x + a(x)|^m - |x|^m\big].$$
From the drift condition on $b$ gathered in the fourth point of Assumption 1 and the boundedness of both $\sigma$ and $a$ it follows
$$A^{(1)}_{\tilde z}V(x, y) \le -d\,m\,|x|^m + c\,|x|^{m-1} + \bar f(y)\,(c_1|x|^{m-1} + \dots + c_m). \qquad (63)$$
We observe that, for any $x$ such that $|x| > r$, $|x|^{m-1}$ is negligible compared to $|x|^m = V_1(x)$. To study the last term in the right hand side of (63), we choose $p > 1$ such that $p\,(m-1) < m$ (i.e. $p < \frac{m}{m-1}$) and $q$ its conjugate exponent, $\frac{1}{p} + \frac{1}{q} = 1$. Then, by the Young inequality,
$$\bar f(y)\,(c_1|x|^{m-1} + \dots + c_m) \le \frac{c}{p}\,(c_1|x|^{m-1} + \dots + c_m)^p + \frac{c}{q}\,\bar f(y)^q.$$
The first term is again negligible compared to $|x|^m = V_1(x)$, being $p\,(m-1) < m$. To estimate the second one, we observe that, for each $y \in \mathbb{R}^{M\times M}$, the total jump rate $\bar f(y)$ can be seen as $\sum_{i=1}^{M}(\zeta_i + \sum_j y^{(ij)})$ (see page 12 in [18]). Therefore, it is
$$\bar f(y) \le \bar c + \tilde c\sum_{i,j=1}^{M}|y^{(ij)}| \le \bar c + \tilde c\,\log(V_2(y)),$$
which is negligible with respect to the negative term $-c_1 V_2(y)$ of (62). The same reasoning applies for $\frac{c}{q}\bar f(y)^q$. It follows that
$$A_{\tilde z}V(x, y) \le -d\,m\,|x|^m + o(V_1(x)) + o(V_2(y)),$$
which, together with (62), concludes the proof.

A.2 Proof of Lemma 1

Proof. By the definition of $\varphi$, for any $k \ge 1$, $|\varphi_{\Delta^\beta_{n,i}}(\Delta_i X) - 1|^k$ is different from zero only if $|\Delta_i X| > \Delta^\beta_{n,i}$. Therefore,
$$\mathbb{E}\big[|\varphi_{\Delta^\beta_{n,i}}(\Delta_i X) - 1|^k\big] \le c\,\mathbb{E}\big[\mathbf{1}_{\{|\Delta_i X| > \Delta^\beta_{n,i}\}}\big] = c\,\mathbb{E}\big[\mathbf{1}_{\{|\Delta_i X| > \Delta^\beta_{n,i},\ |J_{t_i}| \le \frac{1}{2}\Delta^\beta_{n,i}\}}\big] + c\,\mathbb{E}\big[\mathbf{1}_{\{|\Delta_i X| > \Delta^\beta_{n,i},\ |J_{t_i}| > \frac{1}{2}\Delta^\beta_{n,i}\}}\big]. \qquad (64)$$
We denote by $\Delta_i X^c$ the increment of the continuous part of $X$, which is
$$\Delta_i X^c := X^c_{t_{i+1}} - X^c_{t_i} = \int_{t_i}^{t_{i+1}}b(X_s)\,ds + Z_{t_i}.$$
The first term in the right hand side of (64) is bounded by
$$c\,\mathbb{P}\Big(|\Delta_i X^c| > \tfrac{1}{2}\Delta^\beta_{n,i}\Big) \le c\,\frac{\mathbb{E}[|\Delta_i X^c|^r]}{\Delta^{\beta r}_{n,i}} \le c\,\Delta^{r(\frac{1}{2}-\beta)}_{n,i}, \qquad (65)$$
where we have used the Markov inequality and a classical estimation for the continuous increments of $X$ (see for example point 6 of Lemma 1 in [4]). In order to evaluate the second term in the right hand side of (64), instead, we have to introduce the set
$$N_{i,n} := \Big\{\sum_{j=1}^{M}|\Delta_i N^{(j)}| := \sum_{j=1}^{M}|N^{(j)}_{t_{i+1}} - N^{(j)}_{t_i}| \le \frac{\Delta^\beta_{n,i}}{2\,a_1}\Big\}.$$
We observe that, on $N^c_{i,n}$, there exists $j \in \{1, \dots, M\}$ such that $|\Delta_i N^{(j)}| \ne 0$. Therefore,
$$\mathbb{P}(N^c_{i,n}) \le \sum_{j=1}^{M}\mathbb{P}(|\Delta_i N^{(j)}| \ge 1) \le \sum_{j=1}^{M}\mathbb{E}[|\Delta_i N^{(j)}|] \le c\,\Delta_{n,i}. \qquad (66)$$
On $N_{i,n}$, instead, for $n$ large enough the threshold $\frac{\Delta^\beta_{n,i}}{2a_1}$ is smaller than $1$, so that $|\Delta_i N^{(j)}| = 0$ for all $j$, and therefore $N_{i,n} \cap \{|J_{t_i}| > \frac{1}{2}\Delta^\beta_{n,i}\} = \emptyset$. It follows that the second term in the right hand side of (64) is
$$c\,\mathbb{E}\big[\mathbf{1}_{\{|\Delta_i X| > \Delta^\beta_{n,i},\ |J_{t_i}| > \frac{1}{2}\Delta^\beta_{n,i},\ N_{i,n}\}}\big] + c\,\mathbb{E}\big[\mathbf{1}_{\{|\Delta_i X| > \Delta^\beta_{n,i},\ |J_{t_i}| > \frac{1}{2}\Delta^\beta_{n,i},\ N^c_{i,n}\}}\big] \le c\,\mathbb{P}(N^c_{i,n}) \le c\,\Delta_{n,i}.$$
Putting the pieces together, as $r$ is arbitrary, it follows
$$\mathbb{E}\big[|\varphi_{\Delta^\beta_{n,i}}(\Delta_i X) - 1|^k\big] \le c\,\Delta_{n,i}.$$

A.3 Proof of Lemma 2

Proof. Again, we act differently depending on whether the jumps are big or not:
$$\mathbb{E}\big[|J_{t_i}|^q\,\varphi^k_{\Delta^\beta_{n,i}}(\Delta_i X)\big] = \mathbb{E}\big[|J_{t_i}|^q\,\varphi^k_{\Delta^\beta_{n,i}}(\Delta_i X)\,\mathbf{1}_{\{|J_{t_i}| > 4\Delta^\beta_{n,i}\}}\big] + \mathbb{E}\big[|J_{t_i}|^q\,\varphi^k_{\Delta^\beta_{n,i}}(\Delta_i X)\,\mathbf{1}_{\{|J_{t_i}| \le 4\Delta^\beta_{n,i}\}}\big]. \qquad (67)$$
By the definition of $\varphi$, it is different from $0$ only if $|\Delta_i X| \le 2\,\Delta^\beta_{n,i}$. As $\Delta_i X = \Delta_i X^c + J_{t_i}$, on the first set we obtain $|\Delta_i X^c| > 2\,\Delta^\beta_{n,i}$, so that, using the Hölder inequality with exponents $p_1$ big and $q_1$ next to $1$,
$$\mathbb{E}\big[|J_{t_i}|^q\,\varphi^k_{\Delta^\beta_{n,i}}(\Delta_i X)\,\mathbf{1}_{\{|J_{t_i}| > 4\Delta^\beta_{n,i}\}}\big] \le \mathbb{E}\big[|J_{t_i}|^{q p_1}\big]^{\frac{1}{p_1}}\,\mathbb{P}\big(|\Delta_i X^c| > 2\Delta^\beta_{n,i}\big)^{\frac{1}{q_1}} \le c\,\Delta^{\frac{1}{p_1}}_{n,i}\,\Delta^{\frac{r}{q_1}(\frac{1}{2}-\beta)}_{n,i} \le c\,\Delta^{r(\frac{1}{2}-\beta)-\varepsilon}_{n,i}.$$
On the second set we use again $N_{i,n}$. On $N_{i,n}$ the increments $\Delta_i N^{(j)}$ are null and so $|J_{t_i}| = 0$. On $N^c_{i,n}$ instead, using also (66), we have
$$\mathbb{E}\big[|J_{t_i}|^q\,\varphi^k_{\Delta^\beta_{n,i}}(\Delta_i X)\,\mathbf{1}_{\{|J_{t_i}| \le 4\Delta^\beta_{n,i},\ N^c_{i,n}\}}\big] \le c\,\Delta^{\beta q}_{n,i}\,\mathbb{P}(N^c_{i,n}) \le c\,\Delta^{\beta q + 1}_{n,i}.$$
By the arbitrariness of $r$ it follows $\mathbb{E}[|J_{t_i}|^q\,\varphi^k_{\Delta^\beta_{n,i}}(\Delta_i X)] \le c\,\Delta^{\beta q + 1}_{n,i}$, as we wanted.

A.4 Proof of Proposition 2

Proof. As the second point is useful in order to show the first one, we start proving point 2.

Point 2. By definition we know that $B_{t_i}$ is centered. In the sequel we denote by $\mathbb{E}_i[\cdot]$ the conditional expected value $\mathbb{E}[\cdot\,|\,\mathcal{F}_{t_i}]$.
Regarding the second moment, it is
$$\mathbb{E}_i[B^2_{t_i}] \le \frac{c}{\Delta^2_{n,i}}\,\mathbb{E}_i\Big[Z^4_{t_i} + \Big(\int_{t_i}^{t_{i+1}}\sigma^2(X_s)\,ds\Big)^2\Big] \le \frac{c}{\Delta^2_{n,i}}\,\mathbb{E}_i\Big[\Big(\int_{t_i}^{t_{i+1}}\sigma^2(X_s)\,ds\Big)^2\Big] \le c\,\sigma_1^4,$$
where we have used, sequentially, the BDG inequality, the Jensen inequality and the boundedness of $\sigma$. Using the same arguments we show the following:
$$\mathbb{E}_i[B^4_{t_i}] \le \frac{c}{\Delta^4_{n,i}}\,\mathbb{E}_i\Big[Z^8_{t_i} + \Big(\int_{t_i}^{t_{i+1}}\sigma^2(X_s)\,ds\Big)^4\Big] \le \frac{c}{\Delta^4_{n,i}}\,\mathbb{E}_i\Big[\Big(\int_{t_i}^{t_{i+1}}\sigma^2(X_s)\,ds\Big)^4\Big] \le c\,\sigma_1^8.$$

Point 1. We analyse the behaviour of
$$\widetilde A_{t_i} = \sigma^2(X_{t_i})\,\big(\varphi_{\Delta^\beta_{n,i}}(\Delta_i X) - 1\big) + A_{t_i}\,\varphi_{\Delta^\beta_{n,i}}(\Delta_i X) + B_{t_i}\,\big(\varphi_{\Delta^\beta_{n,i}}(\Delta_i X) - 1\big).$$
From the Hölder inequality, the boundedness of $\sigma$ and a repeated use of Lemma 1 we get
$$\mathbb{E}[\widetilde A^2_{t_i}] \le c\,\sigma_1^4\,\Delta_{n,i} + c\,\mathbb{E}\big[A^2_{t_i}\,\varphi^2_{\Delta^\beta_{n,i}}(\Delta_i X)\big] + c\,\mathbb{E}[B^{2p}_{t_i}]^{\frac{1}{p}}\,\Delta^{\frac{1}{q}}_{n,i}.$$
We evaluate the moments of $B_{t_i}$ acting as in the proof of the second point, and we choose $p$ big and $q$ next to $1$, getting
$$\mathbb{E}[\widetilde A^2_{t_i}] \le c\,\sigma_1^4\,\Delta_{n,i} + c\,\mathbb{E}\big[A^2_{t_i}\,\varphi^2_{\Delta^\beta_{n,i}}(\Delta_i X)\big] + c\,\sigma_1^4\,\Delta^{1-\tilde\varepsilon}_{n,i}, \qquad (68)$$
for any $\tilde\varepsilon > 0$. We now study $\mathbb{E}[A^2_{t_i}\varphi^2_{\Delta^\beta_{n,i}}(\Delta_i X)]$. From its definition, recalling that $\varphi$ is a bounded function, we obtain
$$\mathbb{E}\big[A^2_{t_i}\,\varphi^2_{\Delta^\beta_{n,i}}(\Delta_i X)\big] \le \frac{c}{\Delta^2_{n,i}}\,\mathbb{E}\Big[\Big(\int_{t_i}^{t_{i+1}}b(X_s)\,ds\Big)^4\Big] + \frac{c}{\Delta^2_{n,i}}\,\mathbb{E}\Big[(Z_{t_i} + J_{t_i})^2\Big(\int_{t_i}^{t_{i+1}}\big(b(X_s) - b(X_{t_i})\big)\,ds\Big)^2\Big] + \frac{c}{\Delta^2_{n,i}}\,\mathbb{E}\Big[\Big(\int_{t_i}^{t_{i+1}}\big(\sigma^2(X_s) - \sigma^2(X_{t_i})\big)\,ds\Big)^2\Big] + 4\,\mathbb{E}\big[b^2(X_{t_i})\,Z^2_{t_i}\big] =: \sum_{j=1}^{4}I_j.$$
From the Jensen inequality, the polynomial growth of $b$ and the existence of bounded moments of $X$ we get
$$I_1 \le \frac{c}{\Delta^2_{n,i}}\,\Delta^3_{n,i}\int_{t_i}^{t_{i+1}}\mathbb{E}[b^4(X_s)]\,ds \le c\,\Delta^2_{n,i}. \qquad (69)$$
On $I_2$ we use first of all the Hölder inequality. Then, on the first factor we use the B.D.G.
and Kunita inequal-ities, as in (60) and (61), while on the second the finite increments theorem, the boundednessof b (cid:48) and the first point of Lemma 7: I ≤ c ∆ n,i E [( Z t i + J t i ) ] E (cid:34)(cid:18)(cid:90) t i +1 t i b ( X s ) − b ( X t i ) ds (cid:19) (cid:35) ≤ c ∆ n,i ∆ n,i ∆ n,i E (cid:20)(cid:90) t i +1 t i c | X s − X t i | ds (cid:21) ≤ c ∆ n,i . (70)In order to study the behaviour of I , Jensen inequality, the finite increment theorem, theboundedness of the derivative of σ and the first point of Lemma 7 will be once again useful. I ≤ c ∆ n,i ∆ n,i E (cid:20)(cid:90) t i +1 t i c | X s − X t i | ds (cid:21) ≤ c ∆ n,i . (71)From Holder inequality, the polynomial growth of b , the boundedness of the moments of X andBDG inequality we obtain I ≤ c E [ b ( X t i ) ] E [ Z t i ] ≤ c ∆ n,i . (72)Putting the pieces together it follows that, for any ˜ ε > E [ ˜ A t i ] ≤ c ∆ − ˜ εn,i . We now evaluate E [ ˜ A t i ]. Acting as above (68) it easily follows E [ ˜ A t i ] ≤ c ∆ − ˜ εn,i + E [ A t i ϕ ∆ βn,i (∆ i X )] . Replacing the definition of A t i we get that E [ A t i ϕ βn,i (∆ i X )] is again the sum of 4 terms, that wenow denote as ˜ I , . . . , ˜ I . Using exactly the same arguments as in the study of E [ A t i ϕ βn,i (∆ i X )]we easily get ˜ I ≤ c ∆ n,i ∆ n,i (cid:90) t i +1 t i E [ b ( X s )] ds ≤ c ∆ n,i , ˜ I ≤ c ∆ n,i E [( Z t i + J t i ) ] E (cid:34)(cid:18)(cid:90) t i +1 t i b ( X s ) − b ( X t i ) ds (cid:19) (cid:35) ≤ c ∆ n,i (∆ n,i + ∆ n,i )∆ n,i E (cid:20)(cid:90) t i +1 t i c | X s − X t i | ds (cid:21) ≤ c ∆ n,i , ˜ I ≤ c ∆ n,i ∆ n,i E (cid:20)(cid:90) t i +1 t i c | X s − X t i | ds (cid:21) ≤ c ∆ n,i , ˜ I ≤ E [ b ( X t i ) ] E [ Z t i ] ≤ c ∆ n,i . The four equations here above provide the wanted result.38 oint 3 In order to show the estimations on the jumps gathered in the third point of Proposition 3we repeatedly use Lemma 2. 
Using also Hölder inequality with $p$ big and $q$ close to $1$, BDG inequality, the polynomial growth of $b$ and the boundedness of the moments of $X$, it is

\[ \mathbb{E}\big[|E_{t_i}|\,\varphi_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] \le c\,\mathbb{E}\big[|b(X_{t_i})|\,|J_{t_i}|\,\varphi_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] + \frac{c}{\Delta_{n,i}}\,\mathbb{E}\big[|Z_{t_i}|\,|J_{t_i}|\,\varphi_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] + \frac{c}{\Delta_{n,i}}\,\mathbb{E}\big[|J_{t_i}|^{2}\,\varphi_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] \]
\[ \le c\,\mathbb{E}\big[|b(X_{t_i})|^{p}\big]^{1/p}\,\mathbb{E}\big[|J_{t_i}|^{q}\,\varphi^{q}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big]^{1/q} + \frac{c}{\Delta_{n,i}}\,\mathbb{E}\big[|Z_{t_i}|^{p}\big]^{1/p}\,\mathbb{E}\big[|J_{t_i}|^{q}\,\varphi^{q}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big]^{1/q} + \frac{c}{\Delta_{n,i}}\,\Delta_{n,i}^{1+2\beta}; \]

thus, because $\beta \in (0, \frac12)$, we can always find an $\varepsilon > 0$ such that $\frac12+\beta-\varepsilon > 2\beta$, and it comes

\[ \mathbb{E}\big[|E_{t_i}|\,\varphi_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] \le c\,\Delta_{n,i}^{\frac1q+\beta} + \frac{c}{\Delta_{n,i}}\,\Delta_{n,i}^{1/2}\,\Delta_{n,i}^{\frac1q+\beta} + c\,\Delta_{n,i}^{2\beta} = c\,\Delta_{n,i}^{1+\beta-\varepsilon} + c\,\Delta_{n,i}^{\frac12+\beta-\varepsilon} + c\,\Delta_{n,i}^{2\beta} = c\,\Delta_{n,i}^{2\beta}. \]

In an analogous way we obtain

\[ \mathbb{E}\big[|E_{t_i}|^{2}\,\varphi^{2}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] \le c\,\mathbb{E}\big[|b(X_{t_i})|^{2p}\big]^{1/p}\,\mathbb{E}\big[|J_{t_i}|^{2q}\,\varphi^{2q}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big]^{1/q} + \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}\big[|Z_{t_i}|^{2p}\big]^{1/p}\,\mathbb{E}\big[|J_{t_i}|^{2q}\,\varphi^{2q}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big]^{1/q} + \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}\big[|J_{t_i}|^{4}\,\varphi^{2}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] \]
\[ \le c\,\Delta_{n,i}^{\frac1q+2\beta} + \frac{c}{\Delta_{n,i}^{2}}\,\Delta_{n,i}\,\Delta_{n,i}^{\frac1q+2\beta} + \frac{c}{\Delta_{n,i}^{2}}\,\Delta_{n,i}^{1+4\beta} = c\,\Delta_{n,i}^{1+2\beta-\varepsilon} + c\,\Delta_{n,i}^{2\beta-\varepsilon} + c\,\Delta_{n,i}^{4\beta-1} = c\,\Delta_{n,i}^{4\beta-1}, \]

where the last equality is, again, a consequence of the fact that we can always find $\varepsilon > 0$ such that $2\beta-\varepsilon > 4\beta-1$. Finally, acting as before,

\[ \mathbb{E}\big[|E_{t_i}|^{4}\,\varphi^{4}_{\Delta_{n,i}^{\beta}}(\Delta_i X)\big] \le c\,\Delta_{n,i}^{1+4\beta-\varepsilon} + \frac{c}{\Delta_{n,i}^{4}}\,\Delta_{n,i}^{2}\,\Delta_{n,i}^{1+4\beta-\varepsilon} + \frac{c}{\Delta_{n,i}^{4}}\,\Delta_{n,i}^{1+8\beta} \le c\,\Delta_{n,i}^{8\beta-3}. \]

A.5 Proof of Proposition 4

Proof.

Point 1. Regarding the first point, we first of all introduce

\[ \tilde b(X_s) := b(X_s) + a(X_s)\sum_{j=1}^{M} \lambda_s^{(j)}. \]

We observe that, as $b$ has polynomial growth, $a$ is bounded and both $\lambda$ and $X$ have bounded moments of any order, $\tilde b$ has bounded moments of any order as well. Recalling that $A_{t_i}$ is given as in (18), we can write

\[ A_{t_i} =: \sum_{j=1}^{7} \bar I_j. \]

Replacing $b$ with $\tilde b$, we already know from (69), (70), (71) and (72) that

\[ \mathbb{E}\big[\bar I_1^{2} + \bar I_2^{2} + \bar I_3^{2} + \bar I_4^{2}\big] \le c\,\Delta_{n,i}. \quad (73) \]

We now consider $\bar I_5$. From Assumption 1 we know that the function $a$ is Lipschitz and has a bounded derivative.
Therefore, we use the finite increments theorem followed by the first point of Lemma 7. This provides, using also Jensen inequality and Hölder inequality with $q$ big and $p$ close to $1$,

\[ \mathbb{E}\big[\bar I_5^{2}\big] \le \frac{c}{\Delta_{n,i}^{2}}\,\Delta_{n,i} \int_{t_i}^{t_{i+1}} \mathbb{E}\Big[\big(a(X_s)-a(X_{t_i})\big)^{2} \Big(\sum_{j=1}^{M} \lambda_s^{(j)}\Big)^{2}\Big]\,ds \le \frac{c}{\Delta_{n,i}} \int_{t_i}^{t_{i+1}} \mathbb{E}\big[\big(a(X_s)-a(X_{t_i})\big)^{2p}\big]^{1/p}\, \mathbb{E}\Big[\Big(\sum_{j=1}^{M} \lambda_s^{(j)}\Big)^{2q}\Big]^{1/q}\,ds \le \frac{c}{\Delta_{n,i}} \int_{t_i}^{t_{i+1}} \Delta_{n,i}^{1/p}\,ds \le c\,\Delta_{n,i}^{1-\tilde\varepsilon}, \quad (74) \]

where we have also used the boundedness of the moments of $\lambda$. On $\bar I_6$ we use that $a^{2}(x) \le a_1^{2}$ and the second point of Lemma 7, getting

\[ \mathbb{E}\big[\bar I_6^{2}\big] \le \frac{c}{\Delta_{n,i}^{2}}\,\Delta_{n,i} \int_{t_i}^{t_{i+1}} \sum_{j=1}^{M} \mathbb{E}\big[\big(\lambda_s^{(j)}-\lambda_{t_i}^{(j)}\big)^{2}\big]\,ds \le c\,\Delta_{n,i}. \quad (75) \]

To conclude the proof of the bound on $\mathbb{E}[A_{t_i}^{2}]$ we are left to evaluate $\bar I_7$. We do that through Hölder and Kunita inequalities. It yields

\[ \mathbb{E}\big[\bar I_7^{2}\big] \le c\,\mathbb{E}\big[\tilde b^{2}(X_{t_i})\,J_{t_i}^{2}\big] \le \mathbb{E}\big[\tilde b^{2p}(X_{t_i})\big]^{1/p}\,\mathbb{E}\big[J_{t_i}^{2q}\big]^{1/q} \le c\,\Delta_{n,i}^{1-\tilde\varepsilon}, \quad (76) \]

where in the last inequality we have chosen $p$ big and $q$ close to $1$. From (73), (74), (75) and (76) it follows

\[ \mathbb{E}\big[A_{t_i}^{2}\big] \le c\,\Delta_{n,i}^{1-\tilde\varepsilon}. \]

Concerning the fourth moment of $A_{t_i}$, as before we know from Proposition 2 that

\[ \mathbb{E}\big[\bar I_1^{4} + \bar I_2^{4} + \bar I_3^{4} + \bar I_4^{4}\big] \le c\,\Delta_{n,i}. \quad (77) \]

Acting as in (74) we get

\[ \mathbb{E}\big[\bar I_5^{4}\big] \le \frac{c}{\Delta_{n,i}^{4}}\,\Delta_{n,i}^{3} \int_{t_i}^{t_{i+1}} \mathbb{E}\Big[\big(a(X_s)-a(X_{t_i})\big)^{4} \Big(\sum_{j=1}^{M} \lambda_s^{(j)}\Big)^{4}\Big]\,ds \quad (78) \]
\[ \le \frac{c}{\Delta_{n,i}} \int_{t_i}^{t_{i+1}} \mathbb{E}\big[\big(a(X_s)-a(X_{t_i})\big)^{4p}\big]^{1/p}\, \mathbb{E}\Big[\Big(\sum_{j=1}^{M} \lambda_s^{(j)}\Big)^{4q}\Big]^{1/q}\,ds \le c\,\Delta_{n,i}^{1-\tilde\varepsilon}. \]

In the same way, acting as in (75), we obtain

\[ \mathbb{E}\big[\bar I_6^{4}\big] \le \frac{c}{\Delta_{n,i}^{4}}\,\Delta_{n,i}^{3} \int_{t_i}^{t_{i+1}} \sum_{j=1}^{M} \mathbb{E}\big[\big(\lambda_s^{(j)}-\lambda_{t_i}^{(j)}\big)^{4}\big]\,ds \le c\,\Delta_{n,i}. \quad (79) \]

Finally,

\[ \mathbb{E}\big[\bar I_7^{4}\big] \le c\,\mathbb{E}\big[\tilde b^{4p}(X_{t_i})\big]^{1/p}\,\mathbb{E}\big[J_{t_i}^{4q}\big]^{1/q} \le c\,\Delta_{n,i}^{1-\tilde\varepsilon}, \quad (80) \]

by the boundedness of the moments of $\tilde b$ and Kunita inequality. This concludes the proof of the first point.

Point 2. We observe that $B_{t_i}$ is defined in the same way in Section 3 and in Section 4. Therefore, the second point has already been shown in point 2 of Proposition 2.
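The martingale structure used for the jump terms in this proof can be summarized in one display. This is only a restatement of the definitions recalled in the proof of Point 3 (no additional assumption):

```latex
% J_{t_i} is the increment of a compensated (martingale) jump integral:
J_{t_i} \;=\; \int_{t_i}^{t_{i+1}} a(X_{s^-})\sum_{j=1}^{M} d\widetilde N^{(j)}_s,
\qquad
\widetilde N^{(j)}_t \;:=\; N^{(j)}_t \;-\; \int_0^{t}\lambda^{(j)}_s\,ds,
% hence, conditionally on the past,
\qquad
\mathbb{E}\big[J_{t_i}\,\big|\,\mathcal F_{t_i}\big] \;=\; 0 .
```

This is the reason why, below, $\mathbb{E}_i[E_{t_i}] = 0$ comes for free, while only the second and fourth conditional moments require work.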
Point 3. By the definition of $E_{t_i}$ it clearly follows $\mathbb{E}_i[E_{t_i}] = 0$. We now analyse

\[ \mathbb{E}_i\big[E_{t_i}^{2}\big] \le \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}_i\big[Z_{t_i}^{2} J_{t_i}^{2}\big] + \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}_i\Big[\Big(J_{t_i} + \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^{M} \lambda_s^{(j)}\,ds\Big)^{2}\Big]. \quad (81) \]

We are going to show that the first term in the right hand side of (81) is negligible compared to the second one. By a conditional version of Hölder, BDG and Kunita inequalities we get

\[ \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}_i\big[Z_{t_i}^{2} J_{t_i}^{2}\big] \le \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}_i\big[Z_{t_i}^{2p}\big]^{1/p}\,\mathbb{E}_i\big[J_{t_i}^{2q}\big]^{1/q} \le \frac{c}{\Delta_{n,i}^{2}}\,\Delta_{n,i}\,\Delta_{n,i}^{1/q} \le c\,\Delta_{n,i}^{-\varepsilon}, \quad (82) \]

for any $\varepsilon > 0$. To study the last term in the right hand side of (81) we recall that $J_{t_i} = \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^{M} d\tilde N_s^{(j)}$. Therefore, from the conditional Kunita inequality, we have

\[ \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}_i\Big[\Big(J_{t_i} + \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^{M} \lambda_s^{(j)}\,ds\Big)^{2}\Big] \le \frac{c}{\Delta_{n,i}^{2}}\,\mathbb{E}_i\Big[\int_{t_i}^{t_{i+1}} a^{2}(X_{s^-}) \sum_{j=1}^{M} \lambda_s^{(j)}\,ds + 2\Big(\int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^{M} \lambda_s^{(j)}\,ds\Big)^{2}\Big] \le \frac{c\,a_1^{2}}{\Delta_{n,i}^{2}}\,(1+\Delta_{n,i}) \int_{t_i}^{t_{i+1}} \mathbb{E}_i\Big[\sum_{j=1}^{M} \lambda_s^{(j)}\Big]\,ds, \]

where we have also used Jensen inequality on the last term here above, which is the reason why we get an extra $\Delta_{n,i}$. From the fourth point of Lemma 7 it follows that the quantity here above is upper bounded by $c\,a_1^{2}\,\Delta_{n,i}^{-1} \sum_{j=1}^{M} \lambda_{t_i}^{(j)}$, plus a negligible term. Replacing it and (82) in (81) it follows

\[ \mathbb{E}_i\big[E_{t_i}^{2}\big] \le c\,a_1^{2}\,\Delta_{n,i}^{-1} \sum_{j=1}^{M} \lambda_{t_i}^{(j)} + c\,\Delta_{n,i}^{-\varepsilon} \le c\,a_1^{2}\,\Delta_{n,i}^{-1} \sum_{j=1}^{M} \lambda_{t_i}^{(j)}, \]

where the last inequality is a consequence of the fact that $\lambda$ is always strictly positive. Regarding the fourth moment of $E_{t_i}$, from Kunita, Hölder and Jensen inequalities we have

\[ \mathbb{E}\big[E_{t_i}^{4}\big] \le \frac{c}{\Delta_{n,i}^{4}}\,\mathbb{E}\big[Z_{t_i}^{4p}\big]^{1/p}\,\mathbb{E}\big[J_{t_i}^{4q}\big]^{1/q} + \frac{c}{\Delta_{n,i}^{4}}\,\mathbb{E}\Big[\Big(J_{t_i} + \int_{t_i}^{t_{i+1}} a(X_{s^-}) \sum_{j=1}^{M} \lambda_s^{(j)}\,ds\Big)^{4}\Big] \le \frac{c}{\Delta_{n,i}^{4}}\,\big(\Delta_{n,i}^{2}\,\Delta_{n,i}^{1-\varepsilon} + \Delta_{n,i} + \Delta_{n,i}^{2}\big) \le c\,\Delta_{n,i}^{-3}. \]

Point 4. The result follows directly from the definition of $C_{t_i}$ and the boundedness of $a$ and of the moments of $\lambda$.

A.6 Proof of Lemma 3

Proof.
It is

\[ C_{t_i}\,\tilde\psi_l(X_{t_i},\lambda_{t_i}) = a^{2}(X_{t_i}) \sum_{j=1}^{M} \big(\lambda_{t_i}^{(j)} - \mathbb{E}\big[\lambda_{t_i}^{(j)}\,\big|\,X_{t_i}\big]\big)\, \tilde\psi_l(X_{t_i},\lambda_{t_i}) =: f(X_{t_i},\lambda_{t_i}). \]

Since

\[ \mathrm{Var}\Big(\frac1n \sum_{i=0}^{n-1} f(X_{t_i},\lambda_{t_i})\Big) \le \frac{1}{n^{2}} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \mathrm{Cov}\big(f(X_{t_i},\lambda_{t_i}),\, f(X_{t_j},\lambda_{t_j})\big), \]

we need to estimate the covariance. As explained in Section 2.3, we know that, under our assumptions, the process $Z := (X,\lambda)$ is $\beta$-mixing with exponential decay. It means that there exists $\gamma > 0$ such that

\[ \beta_X(t) \le \beta_Z(t) \le C e^{-\gamma t}. \]

If a process $Y$ is $\beta$-mixing, then it is also $\alpha$-mixing, and so the following estimation holds (see Theorem 3 in Section 1.2.2 of [19]):

\[ |\mathrm{Cov}(Y_{t_i}, Y_{t_j})| \le c\, \|Y_{t_i}\|_{p}\, \|Y_{t_j}\|_{q}\, \alpha^{1/r}(Y_{t_i}, Y_{t_j}), \]

with $p$, $q$ and $r$ such that $\frac1p + \frac1q + \frac1r = 1$. Using that $\alpha(Z_{t_i}, Z_{t_j}) \le \beta_Z(|t_i - t_j|) \le C e^{-\gamma|t_i - t_j|}$, in our case the inequality here above becomes

\[ \big|\mathrm{Cov}\big(f(X_{t_i},\lambda_{t_i}),\, f(X_{t_j},\lambda_{t_j})\big)\big| \le c\, e^{-\frac{\gamma}{r}|t_i - t_j|}, \]

where we have also used the definition of $f$, the boundedness of $a$ and the existence of moments of $\lambda$ in order to include the two norms in the constant $c$.

We introduce a partition of $(0, T_n]$ based on the sets $A_k := \big(k\frac{T_n}{n}, (k+1)\frac{T_n}{n}\big]$, for which $(0, T_n] = \cup_{k=0}^{n-1} A_k$. Now each point $t_i$ in $(0, T_n]$ can be seen as $t_{k,h}$, where $k$ identifies the particular set $A_k$ to which the point belongs while, defining $M_k := |A_k|$, $h$ is a number in $\{1, \ldots, M_k\}$ which enumerates the points in each set. It follows

\[ \frac{c}{n^{2}} \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} e^{-\frac{\gamma}{r}|t_i - t_j|} \le \frac{c}{n^{2}} \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{n-1} \sum_{h_1=1}^{M_{k_1}}\sum_{h_2=1}^{M_{k_2}} e^{-\frac{\gamma}{r}|t_{k_1,h_1} - t_{k_2,h_2}|} \le \frac{c\, e^{\frac{\gamma}{r}\frac{T_n}{n}}}{n^{2}} \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{n-1} \sum_{h_1=1}^{M_{k_1}}\sum_{h_2=1}^{M_{k_2}} e^{-\frac{\gamma}{r}|k_1-k_2|\frac{T_n}{n}}, \]

where the last inequality is a consequence of the following estimation: for each $k_1, k_2 \in \{0, \ldots, n-1\}$ it is $|t_{k_1,h_1} - t_{k_2,h_2}| \ge |k_1 - k_2|\frac{T_n}{n} - \frac{T_n}{n}$.

Now we observe that the exponent does not depend on $h_1$, $h_2$ anymore, hence the last term here above can be upper bounded by $c\, e^{\frac{\gamma}{r}\frac{T_n}{n}}\, \frac{1}{n^{2}} \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{n-1} M_{k_1} M_{k_2}\, e^{-\frac{\gamma}{r}|k_1-k_2|\frac{T_n}{n}}$. Moreover, remarking that the length of each interval $A_k$ is $\frac{T_n}{n}$, it is easy to show that we can always upper bound $M_k$ by $\frac{T_n}{n\,\Delta_{\min}}$; with $T_n = \sum_{i=0}^{n-1} \Delta_{n,i} \le n\,\Delta_n$, this gives $M_k \le \frac{\Delta_n}{\Delta_{\min}}$, that we have assumed bounded by a constant $c$. Furthermore, still using that $T_n \le n\,\Delta_n$, we have $e^{\frac{\gamma}{r}\frac{T_n}{n}} \le e^{\frac{\gamma}{r}\Delta_n} \le c$.

To conclude, we have to evaluate $\frac{c}{n^{2}}\sum_{k_1=0}^{n-1}\sum_{k_2=0}^{n-1} e^{-\frac{\gamma}{r}|k_1-k_2|\frac{T_n}{n}}$. We define $j := k_1 - k_2$ and we apply a change of variable, getting

\[ \frac{c}{n^{2}} \sum_{k_1=0}^{n-1}\sum_{k_2=0}^{n-1} e^{-\frac{\gamma}{r}|k_1-k_2|\frac{T_n}{n}} \le \frac{c}{n^{2}} \sum_{j=-(n-1)}^{n-1} e^{-\frac{\gamma}{r}|j|\frac{T_n}{n}}\, |n-j| \le \frac{c}{n} \sum_{j=-(n-1)}^{n-1} e^{-\frac{\gamma}{r}|j|\Delta_{\min}} \le \frac{c}{n\,\big(1 - e^{-\frac{\gamma}{r}\Delta_{\min}}\big)} \le \frac{c}{T_n}, \]

as we wanted.

A.7 Proof of Lemma 6

Proof. In order to estimate the probability of the complement of the set $\Omega_B$, as defined in (55), we first of all observe that, as $t \in B_{m,m'}$ whose dimension is $D$, it is $|t(X_{t_k})| \le \|t\|_\infty \le cD$. Now we find an upper bound for the probability of $\Omega_B^{c}$ focusing on what happens for $j = 1$ and $k = 0$. It is

\[ \mathbb{P}\big(|U^{*}_{1,0}| \ge \tilde c\, n^{\varepsilon}\big) \le \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} \big|B^{*}_{t_k} + C^{*}_{t_k} + E^{*}_{t_k}\big| \ge \tilde c\, n^{\varepsilon}\Big) \]
\[ \le \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} |B^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big) + \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} |C^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big) + \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} |E^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big). \quad (83) \]

From the definition of $B$ it is

\[ \frac{1}{q_n}\sum_{k=1}^{q_n} |B^{*}_{t_k}| \le \frac{c}{q_n\,\Delta_n} \sum_{k=1}^{q_n} Z_{t_k}^{2} + c. \]
(84)

Moreover, using Markov inequality and the boundedness of $\sigma$,

\[ \mathbb{P}\big(|Z_{t_k}| \ge c\,\sigma_1 \sqrt{\Delta_n}\,\log n\big) = \mathbb{P}\Big(e^{\frac{|Z_{t_k}|}{\sigma_1\sqrt{\Delta_n}}} \ge n^{c}\Big) \le n^{-c}\,\mathbb{E}\Big[e^{\frac{|Z_{t_k}|}{\sigma_1\sqrt{\Delta_n}}}\Big] \le n^{-c}\,c\,\mathbb{E}\Big[e^{\frac{c'}{\Delta_n \sigma_1^{2}}\int_{t_k}^{t_{k+1}} \sigma^{2}(X_s)\,ds}\Big] \le c'\, n^{-c}. \quad (85) \]

Therefore, as the constant $c$ in (84) can be moved to the other side of the inequality in the first probability of (83) (and so it turns out not to be influential), the first probability of (83) is upper bounded by $q_n\, n^{-c}$, which is arbitrarily small. Concerning the second term of (83), we use Markov inequality and the fact that $C$ has bounded moments. We get, for all $r \ge 1$,

\[ \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} |C^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big) \le q_n\, \mathbb{P}\Big(|C^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big) \le \frac{c\, q_n\, \mathbb{E}\big[|C^{*}_{t_k}|^{r}\big]}{n^{r\varepsilon}} \le \frac{c\, q_n}{n^{r\varepsilon}}. \]

Regarding the third term of (83) we observe that, replacing the value of $q_n$, we get

\[ \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} |E^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big) = \mathbb{P}\Big(\sum_{k=1}^{q_n} |E^{*}_{t_k}| \ge \frac{\tilde c}{3}\, \frac{n^{\varepsilon} \log n}{\Delta_{\min}}\Big). \quad (86) \]

We now recall that, from the definition of $E_{t_k}$, it is

\[ \sum_{k=1}^{q_n} |E^{*}_{t_k}| \le \Big|\frac{c}{\Delta_{\min}} \sum_{k=1}^{q_n} Z_{t_k} J_{t_k}\Big| + \Big|\frac{c}{\Delta_{\min}} \sum_{k=1}^{q_n} J_{t_k}^{2}\Big| + \Big|\frac{c}{\Delta_{\min}} \int_0^{t_{q_n}} a^{2}(X_{s^-}) \sum_{j=1}^{M} \lambda^{(j)}(s)\,ds\Big| =: I_1 + I_2 + I_3. \]

The right hand side of (86) is upper bounded by

\[ \mathbb{P}\Big(I_1 \ge \frac{\tilde c}{9}\, \frac{n^{\varepsilon} \log n}{\Delta_{\min}}\Big) + \mathbb{P}\Big(I_2 \ge \frac{\tilde c}{9}\, \frac{n^{\varepsilon} \log n}{\Delta_{\min}}\Big) + \mathbb{P}\Big(I_3 \ge \frac{\tilde c}{9}\, \frac{n^{\varepsilon} \log n}{\Delta_{\min}}\Big). \]

Concerning the first one, we observe it is

\[ I_1 \le \frac{c}{\Delta_{\min}} \sum_{k=1}^{q_n} \big(Z_{t_k}^{2} + J_{t_k}^{2}\big) =: I_{1,1} + I_{1,2}. \]

The probability that $I_{1,1}$ is bigger than $\frac{\tilde c}{9}\frac{n^{\varepsilon}\log n}{\Delta_{\min}}$ is arbitrarily small as a consequence of (85). $I_{1,2}$ is instead equal, up to a constant, to $I_2$ and so it is enough to study such a term.
From Markov, Hölder, BDG and Kunita inequalities we have

\[ \mathbb{P}\Big(I_3 \ge \frac{\tilde c}{9}\, \frac{n^{\varepsilon} \log n}{\Delta_{\min}}\Big) \le \frac{\mathbb{E}\big[I_3^{r}\big]}{\big(n^{\varepsilon} \log n\, \Delta_{\min}^{-1}\big)^{r}} \le \frac{c\,\Delta_{\min}^{-r}\, t_{q_n}^{r}}{\big(n^{\varepsilon} \log n\, \Delta_{\min}^{-1}\big)^{r}} \le \frac{c}{n^{\varepsilon r}}, \]

where we underline that the order of $t_{q_n}$ is $c\, q_n \Delta_n = c\, \frac{\log n}{\Delta_{\min}}\, \Delta_n = c\, \log n$. This is arbitrarily small. Concerning $I_2$, we want to estimate $\mathbb{P}\big(\sum_{k=0}^{q_n-1} J_{t_k}^{2} \ge c_1\, n^{\varepsilon} \log n\big)$. We now consider two different possibilities, starting from the definition of the following set:

\[ A := \big\{\exists\, \tilde k \in \{0, \ldots, q_n-1\} \ \text{such that} \ J_{t_{\tilde k}}^{2} \ge n^{\varepsilon}\big\}. \]

Then

\[ \mathbb{P}\Big(\sum_{k=0}^{q_n-1} J_{t_k}^{2} \ge c_1\, n^{\varepsilon} \log n\Big) = \mathbb{P}\Big(\sum_{k=0}^{q_n-1} J_{t_k}^{2} \ge c_1\, n^{\varepsilon} \log n,\ A\Big) + \mathbb{P}\Big(\sum_{k=0}^{q_n-1} J_{t_k}^{2} \ge c_1\, n^{\varepsilon} \log n,\ A^{c}\Big). \]

We observe that Markov inequality and Kunita inequality yield

\[ \mathbb{P}\Big(\sum_{k=0}^{q_n-1} J_{t_k}^{2} \ge c_1\, n^{\varepsilon} \log n,\ A\Big) \le \mathbb{P}(A) \le \frac{q_n\, \mathbb{E}\big[(J_{t_{\tilde k}}^{2})^{r}\big]}{n^{\varepsilon r}} \le \frac{c\,\Delta_n\, q_n}{n^{\varepsilon r}} = \frac{c\, \log n}{n^{\varepsilon r}}, \]

which is arbitrarily small by the arbitrariness of $r$. We remark that on $A^{c}$, for every $k \in \{0, \ldots, q_n-1\}$, it is $J_{t_k}^{2} < n^{\varepsilon}$. Therefore, to have the sum of them bigger than $c_1\, n^{\varepsilon} \log n$ we should have at least $c_1\, n^{\varepsilon} \log n$ jumps. Hence, denoting by $\Delta N_q$ the number of jumps in $[0, t_{q_n}]$, we have

\[ \mathbb{P}\Big(\sum_{k=0}^{q_n-1} J_{t_k}^{2} \ge c_1\, n^{\varepsilon} \log n,\ A^{c}\Big) \le \mathbb{P}\big(\Delta N_q > c_1\, n^{\varepsilon} \log n\big) \le \frac{c\, \mathbb{E}\big[(\Delta N_q)^{r}\big]}{\big(n^{\varepsilon} \log n\big)^{r}} \le \frac{c\, t_{q_n}^{r}}{\big(n^{\varepsilon} \log n\big)^{r}} \le \frac{c}{n^{\varepsilon r}}, \]

where again we have used Markov inequality and we got a quantity arbitrarily small. We put all the pieces together and we observe that we can choose, in particular, $r$ for which

\[ \mathbb{P}\Big(\frac{1}{q_n}\sum_{k=1}^{q_n} |E^{*}_{t_k}| \ge \frac{\tilde c}{3}\, n^{\varepsilon}\Big) \le \frac{c}{n^{2}}. \]

In the same way it is possible to choose $r$ and $\tilde c$ such that $\mathbb{P}(\Omega_B^{c}) \le \frac{c}{n^{2}}$.

References