A fully data-driven approach to minimizing CVaR for portfolio of assets via SGLD with discontinuous updating
AA fully data-driven approach to minimizing CVaR for portfolio ofassets via SGLD with discontinuous updating ∗ Sotirios Sabanis and Ying Zhang School of Mathematics, The University of Edinburgh, UK. The Alan Turing Institute, UK.
July 6, 2020
Abstract
A new approach in stochastic optimization via the use of stochastic gradient Langevin dynam-ics (SGLD) algorithms, which is a variant of stochastic gradient decent (SGD) methods, allows usto efficiently approximate global minimizers of possibly complicated, high-dimensional landscapes.With this in mind, we extend here the non-asymptotic analysis of SGLD to the case of discontin-uous stochastic gradients. We are thus able to provide theoretical guarantees for the algorithm’sconvergence in (standard) Wasserstein distances for both convex and non-convex objective func-tions. We also provide explicit upper estimates of the expected excess risk associated with theapproximation of global minimizers of these objective functions.All these findings allow us to devise and present a fully data-driven approach for the optimalallocation of weights for the minimization of CVaR of portfolio of assets with complete theoreticalguarantees for its performance. Numerical results illustrate our main findings.
We are concerned in this article with the study of stochastic optimization problems of the formminimize U ( θ ) := E [ f ( θ, X )] , (1)where the gradient of f is discontinuous in θ ∈ R d and X is a random element with a smooth density.Within this framework, we highlight and solve the problem of minimizing CVaR (expected shortfall) ofa portfolio of assets in terms of optimal selection of weights for individual assets as explained in Section5.2.2. We offer theoretical guarantees for the approximate solution of the optimization problem (1)by generating a ˆ θ such that the expected excess risk E [ U (ˆ θ )] − inf θ ∈ R d U ( θ )is minimized. To achieve this, we analyse the convergence properties of the stochastic gradientLangevin dynamics (SGLD) algorithm with discontinuous updating H , which is given by θ λ = θ , θ λn +1 = θ λn − λH ( θ λn , X n +1 ) + (cid:112) β − λξ n +1 , n ∈ N , (2)where θ is an R d -valued random variable, λ > β > H : R d × R m → R d is a measurable function satisfying ∇ U ( θ ) = E [ H ( θ, X )] with( X n ) n ∈ N being an i.i.d. sequence, and ( ξ n ) n ∈ N is an independent sequence of standard d -dimensional ∗ This work was supported by The Alan Turing Institute for Data Science and AI under EPSRC grant EP/N510129/1.Y. Z. was supported by The Maxwell Institute Graduate School in Analysis and its Applications, a Centre for DoctoralTraining funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016508/01), the ScottishFunding Council, Heriot-Watt University and the University of Edinburgh. a r X i v : . [ q -f i n . P M ] J u l aussian random variables. One recalls heere that the SGLD algorithm (2) can be viewed as adiscretization of the Langevin SDE: Z = θ , dZ t = − h ( Z t ) dt + (cid:112) β − dB t , (3)where h := ∇ U and ( B t ) t ≥ represents the standard Brownian motion. Moreover, it is well-knownthat, under appropriate conditions, the Langevin SDE (3) admits a unique invariant measure π β (cid:29) exp( − βU ( θ )) which concentrates around the minimizers of U when β is sufficiently large, , see [16] formore details.Theoretical guarantees of the SGLD algorithm (2) to the target distribution π β have been es-tablished in Wasserstein-2 distance under the assumptions that H is convex and (locally) Lipschitzcontinuous, see [1], [2], [10] and references therein. Recently, these results are considered under moregeneralised conditions aiming to include a wider range of practical applications. To relax the convex-ity condition, a dissipativity condition is proposed in [19], and the convergence result is obtained inWasserstein-2 distance with the rate λ / n . This is the first such result in non-convex optimization,which is then improved in the work [23] and [8]. Compared to [19], a higher rate of convergence withdependence on n is achieved in [23] following a direct analysis of the ergodicity of the overdampedLangevin Monte Carlo (LMC) algorithms, while a rate 1/2 in Wasserstein-1 distance is obtained in[8] by using the contraction results developed in [14].As for the generalisation of the smoothness of H , to the best of the author’s knowledge, there areno theoretical guarantees established in the literature for the SGLD algorithm (2) with discontinuousgradient. We present here the first such results. We are inspired by similar studies for stochasticgradient descent (SGD) algorithms, see [15] and [7] and references therein. 
In particular, [15] providesan almost sure convergence result, while [7] provides a strong L convergence result with rate 1/2.In this paper, we establish non-asymptotic error bounds for the SGLD algorithm (2) with dis-continuous gradient H . More precisely, non-asymptotic results in Wasserstein-1 and Wasserstein-2distances between the law of the n -th iterate of the SGLD algorithm (2) and the target distribution π β are obtained under convexity and dissipativity conditions for H . This allows us to then provide fullanalytic results concerning the expected excess risk of the associated optimization problem (1). Allthis is achieved by assuming that H is decomposed in to two parts F and G , where F : R d × R m → R d is locally Lipschitz continuous and G : R d × R m → R d is bounded. Furthermore, H is assumed tosatisfy a conditional Lipschitz-continuity (CLC) property proposed in [7], which is given explicitly inAssumption 3 below.We illustrate the applicability of our findings by presenting examples from quantile and VaR, CVaRestimations in Section 5. In particular, we solve the problem of optimal allocation of weights for theminimization of CVaR of a portfolio of assets. This is also the first such result in the literature to thebest of the author’s knowledge. Numerical experiments are implemented and their results support ourtheoretical findings.The paper is organised as follows. Section 2 presents the assumptions and main results. In Section3, the proofs for the main theorems in the non-convex case are provided, which are followed by theproofs for the results in the convex case in Section 4. Practical examples along with the minimizationalgorithm of CVaR for a portfolio of assets are presented in Section 5 while auxiliary results areprovided in Section A.We conclude this section by introducing some notation. Let (Ω , F , P ) be a probability space. Wedenote by E [ X ] the expectation of a random variable X . For any x ∈ R d , denote by x ( i ) the i -th entryof the vector. Fix an integer d ≥
1. For an R d -valued random variable X , its law on B ( R d ) (the Borelsigma-algebra of R d ) is denoted by L ( X ). Scalar product is denoted by (cid:104)· , ·(cid:105) , with | · | standing forthe corresponding norm (where the dimension of the space may vary depending on the context). For µ ∈ P ( R d ) and for a non-negative measurable f : R d → R , the notation µ ( f ) := (cid:82) R d f ( θ ) µ ( dθ ) is used.Given a Markov kernel R on R d and a function f integrable under R ( x, · ), for any x ∈ R d , denote by Rf ( x ) = (cid:82) R d f ( y ) R ( x, dy ). For any integer q ≥
1, let P ( R q ) denote the set of probability measureson B ( R q ). For µ, ν ∈ P ( R d ), let C ( µ, ν ) denote the set of probability measures ζ on B ( R d ) such thatits respective marginals are µ, ν . For two probability measures µ and ν , the Wasserstein distance of2rder p ≥ W p ( µ, ν ) := inf ζ ∈C ( µ,ν ) (cid:18)(cid:90) R d (cid:90) R d | θ − θ (cid:48) | p ζ ( dθdθ (cid:48) ) (cid:19) /p , µ, ν ∈ P ( R d ) . (4) Denote by G n := σ ( X k , k ≤ n, k ∈ N ), for any n ∈ N . ( X n ) n ∈ N is an R m -valued, ( G n ) n ∈ N -adaptedprocess. It is assumed throughout the paper that θ , G ∞ and ( ξ n ) n ∈ N are independent. Moreover, thefollowing assumptions are considered: Assumption 1.
Let H : R d × R m → R d take the form H ( θ, x ) = F ( θ, x ) + G ( θ, x ) , θ ∈ R d , x ∈ R m , where F : R d × R m → R d and G : R d × R m → R d satisfy the following:(i) F : R d × R m → R d is jointly Lipschitz continuous in both variables, i.e. there exist L , L > , ρ ≥ such that for any θ, θ (cid:48) ∈ R d , x, x (cid:48) ∈ R m , | F ( θ, x ) − F ( θ (cid:48) , x (cid:48) ) | ≤ (1 + | x | + | x (cid:48) | ) ρ ( L | θ − θ (cid:48) | + L | x − x (cid:48) | ) . (ii) G ( θ, x ) : R d × R m → R d is bounded in θ , i.e. there exist K : R m → R + such that for any θ ∈ R d , x ∈ R m , | G ( θ, x ) | ≤ K ( x ) . Assumption 2.
We assume the inital value θ satisfies E [ | θ | ] < ∞ . The process ( X n ) n ∈ N is i.i.d.with E [ | X | ρ +4 ] < ∞ and E [ K ( X )] < ∞ . Moreover, it satisfies E [ H ( θ, X )] = h ( θ ) . Remark 1.
By Assumption 1, for all θ ∈ R d and x ∈ R m , | H ( θ, x ) | ≤ (1 + | x | ) ρ +1 ( L | θ | + L ) + F ∗ ( x ) , where F ∗ ( x ) = | F (0 , | + K ( x ) . For any x ∈ R m , ρ ≥ , denote by K ρ ( x ) = (1 + 2 | x | ) ρ +4 . (5) One notices that by Assumption 2, E [ K ρ ( X )] is well defined. Assumption 3.
There exists a positive constant
L > such that, for all θ, θ (cid:48) ∈ R d , E [ | H ( θ, X ) − H ( θ (cid:48) , X ) | ] ≤ L | θ − θ (cid:48) | . Remark 2.
Assumptions 2 and 3 imply, for all θ, θ (cid:48) ∈ R d , | h ( θ ) − h ( θ (cid:48) ) | ≤ L | θ − θ (cid:48) | . (6) Remark 3.
Assumption 3 is satisfied for a wide class of ( X n ) n ∈ N , see Section 5 for the examples.Here, for the illustrative purpose, one considers the following simple example. Suppose G ( θ, x ) = (cid:80) Nj =1 ˙ g j ( θ, x ) (cid:84) mi =1 { x ( i ) ∈ I i,j ( θ ) } is a lower semi-continuous function, where N ∈ N ∗ , ˙ g j : R d × R m → R d are bounded and jointly Lipschitz continuous functions, i.e. there exist L , L , K > such that forany θ, θ (cid:48) ∈ R d , x, x (cid:48) ∈ R m , j = 1 , . . . , N | ˙ g j ( θ, x ) − ˙ g j ( θ (cid:48) , x (cid:48) ) | ≤ (1 + | x | + | x (cid:48) | ) ρ ( L | θ − θ (cid:48) | + L | x − x (cid:48) | ) , | ˙ g j ( θ, x ) | ≤ K , the intervals I i,j ( θ ) take the form ( −∞ , ¯ g ( i ) j ( θ )) , (¯ g ( i ) j ( θ ) , ∞ ) or (˜ g ( i ) j ( θ ) , ˆ g ( i ) j ( θ )) , and ¯ g ( i ) j , ˜ g ( i ) j , ˆ g ( i ) j : R d → R are Lipschitz continuous functions. In this case, it is enough to require the marginal den-sity function of X ( i )0 is continuous and bounded for any i = 1 , . . . , m . Then, the property stated inAssumption 3 holds.Proof. See Appendix A.1. 3 .1 Nonconvex case
Further to the assumptions above, we consider the following conditions on U , which can be viewed asa generalization of the convexity assumption. Assumption 4.
There exist A : R m → R d × d , b : R m → R such that for any x, y ∈ R d , (cid:104) y, A ( x ) y (cid:105) ≥ and for all θ ∈ R d and x ∈ R m , (cid:104) F ( θ, x ) , θ (cid:105) ≥ (cid:104) θ, A ( x ) θ (cid:105) − b ( x ) . The smallest eigenvalue of E [ A ( X )] is a positive real number a > and E [ b ( X )] = b > . Define first λ max = min (cid:40) min { a, a / } L ) E [ K ρ ( X )] , a (cid:41) , (7)where L , a are given in Assumption 1 and 4 respectively, and K ρ ( x ) for any x ∈ R m is defined in (5). Theorem 1.
Let Assumptions 1, 2, 3 and 4 hold. Then, for any n ∈ N , < λ ≤ λ max , there existconstants C , C , C > such that, W ( L ( θ λn ) , π β ) ≤ C e − C λn ( E [ | θ | ] + 1) + C √ λ, n ∈ N , (8) where C , C and C are given explicitly in (29) . Theorem 1 provides the rate of convergence between the law of the SGLD algorithm (2) and thetarget distribution π β in W distance. An analogous result in Wasserstein-2 distance can be obtained. Corollary 1.
Let Assumptions 1, 2, 3 and 4 hold. Then, for any n ∈ N , < λ ≤ λ max given in (7) ,there exist constants C , C , C > such that, W ( L ( θ λn ) , π β ) ≤ C e − C λn ( E [ | θ | ] + 1) + C λ / , n ∈ N , where C , C and C are given explicitly in (30) . By using the convergence result in Wasserstein-2 distance as presented in Corollary 1, one canobtain an upper bound for the expected excess risk E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ). Corollary 2.
Let Assumptions 1, 2, 3 and 4 hold. Then, for every < λ ≤ λ max given in (7) , thereexist constants ˆ C , ˆ C , ˆ C , ˆ C > such that the expected excess risk can be estimated as E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ) ≤ ˆ C e − ˆ C λn + ˆ C λ / + ˆ C /β, where ˆ θ = θ λn , and ˆ C , ˆ C , ˆ C , ˆ C > are given explicitly in (32) and (33) . Recall Assumption 1, where it is assumed H = F + G . In this section, we present (improved)convergence results of the SGLD algorithm (2) under the convexity condition of F and G .In the case that F satisfies a convexity condition but not G , the result in Theorem 1 can berecovered. Assumption 5.
There exist ˆ A : R m → R d × d such that for any x, y ∈ R d , (cid:104) y, ˆ A ( x ) y (cid:105) ≥ and for each θ, θ (cid:48) ∈ R d , x ∈ R m , (cid:104) F ( θ, x ) − F ( θ (cid:48) , x ) , θ − θ (cid:48) (cid:105) ≥ (cid:104) θ − θ (cid:48) , ˆ A ( x )( θ − θ (cid:48) ) (cid:105) . The smallest eigenvalue of E [ ˆ A ( X )] is a positive real number ˆ a > (cid:15) with (cid:15) > . emark 4. By Assumptions 1 and 5, one obtains, for θ ∈ R d and x ∈ R m , (cid:104) F ( θ, x ) , θ (cid:105) ≥ (cid:104) θ, ˆ A ∗ ( x ) θ (cid:105) − ˆ b ( x ) , where ˆ A ∗ ( x ) = ˆ A ( x ) − (cid:15) I d and ˆ b ( x ) = ( L (1 + | x | ) ρ +1 + | F (0 , | ) / (4 (cid:15) ) .Proof. See Appendix A.2.
Corollary 3.
Let Assumptions 1, 2, 3 and 5 hold. Then, for any n ∈ N , < λ ≤ λ ∗ max , where λ ∗ max = min (cid:40) min { a ∗ , ( a ∗ ) / } L ) E [ K ρ ( X )] , a ∗ (cid:41) with a ∗ = ˆ a − (cid:15) , there exist constants C ∗ , C ∗ , C ∗ > such that, W ( L ( θ λn ) , π β ) ≤ C ∗ e − C ∗ λn ( E [ | θ | ] + 1) + C ∗ √ λ, n ∈ N . (9)If G is assumed to be convex in addition to Assumption 5, then it can be shown that the rateof convergence is 1/2 in Wasserstein-2 distance between the law of the SGLD algorithm (2) and thetarget distribution π β , which appeared to be optimal, see [1, Example 3.4]. Assumption 6.
There exist ˆ A : R m → R d × d such that for any x, y ∈ R d , (cid:104) y, ˆ A ( x ) y (cid:105) ≥ and for each θ, θ (cid:48) ∈ R d , x ∈ R m , (cid:104) G ( θ, x ) − G ( θ (cid:48) , x ) , θ − θ (cid:48) (cid:105) ≥ (cid:104) θ − θ (cid:48) , ˆ A ( x )( θ − θ (cid:48) ) (cid:105) . The smallest eigenvalue of E [ ˆ A ( X )] is a positive real number ˆ a > . Remark 5.
Assumptions 5 and 6 imply, for each θ, θ (cid:48) ∈ R d , x ∈ R m , (cid:104) H ( θ, x ) − H ( θ (cid:48) , x ) , θ − θ (cid:48) (cid:105) ≥ (cid:104) θ − θ (cid:48) , ˆ A ( x )( θ − θ (cid:48) ) (cid:105) , where ˆ A ( x ) = ˆ A ( x ) + ˆ A ( x ) . Moreover, one obtains (cid:104) h ( θ ) − h ( θ (cid:48) ) , θ − θ (cid:48) (cid:105) ≥ ˆ a | θ − θ (cid:48) | , where ˆ a = ˆ a + ˆ a . Remark 6.
By Remark 2 and Remark 5, [18, Theorem 2.1.12] shows that (cid:104) h ( θ ) − h ( θ (cid:48) ) , θ − θ (cid:48) (cid:105) ≥ ˆ a ∗ | θ − θ (cid:48) | + 1ˆ a + L | h ( θ ) − h ( θ (cid:48) ) | , where ˆ a ∗ = ˆ aL/ (ˆ a + L ) . Define ¯ λ max = min { / a + L ) , ˆ a/ (4 L E [ K ρ ( X )]) } (10)with ˆ a = ˆ a + ˆ a given in Remark 5. Under the convexity condition of H , the non-asymptotic boundfor W ( L ( θ γn ) , π β ) is obtained with the optimal convergence rate 1/2. The explicit statement is givenbelow. Theorem 2.
Let Assumptions 1, 2, 3, 5 and 6 hold. Then, for any n ∈ N , < λ < ¯ λ max given in (10) , there exist constants C , C , C > such that, W ( L ( θ λn ) , π β ) ≤ C e − C λn + C √ λ, where C , C and C are given explicitly in (41) . If ρ = 0 in Assumption 1, then the result holds for λ ∈ min { / a + L ) , / (6 L ) } . By using Theorem 2, one can obtain an upper bound for the expected excess risk E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ) in the convex case. Corollary 4.
Let Assumptions 1, 2, 3, 5 and 6 hold. Then, for every < λ ≤ ¯ λ max given in (10) ,there exist constants ˆ C , ˆ C , ˆ C , ˆ C > such that the expected excess risk can be estimated as E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ) ≤ ˆ C e − ˆ C λn + ˆ C √ λ + ˆ C /β, where ˆ θ = θ λn , and ˆ C , ˆ C , ˆ C , ˆ C > are given explicitly in (43) and (44) . Proofs of the main results: nonconvex case
Denote by F t the natural filtration of B t , t ∈ R + . It is a classic result that SDE (3) has a uniquesolution adapted to ( F t ) t ∈ R + , since h is Lipschitz-continuous by (6). In order to obtain the convergenceresults in Theorem 1 and Corollary 1, we first introduce some auxiliary processes. Define the Lyapunov function for each p ≥ V p ( θ ) := (1 + | θ | ) p/ , θ ∈ R d , and similarly v p ( x ) := (1+ x ) p/ , for any real x ≥
0. Notice that these functions are twice continuouslydifferentiable and lim | θ |→∞ ∇ V p ( θ ) V p ( θ ) = 0 . Let P V p denote the set of µ ∈ P ( R d ) satisfying (cid:82) R d V p ( θ ) µ ( dθ ) < ∞ .Consider the following auxiliary processes. For each λ > Z λt := Z λt , t ∈ R + . Notice that ˜ B λt := B λt / √ λ , t ∈ R + is also a Brownian motion and dZ λt = − λh ( Z λt ) dt + (cid:112) β − λd ˜ B λt , Z λ = θ . Then, F λt := F λt , t ∈ R + is the natural filtration of ˜ B λt , t ∈ R + . One notice that F λt is independentof G ∞ ∨ σ ( θ ). Then, define the continuous-time interpolation of the SGLD algorithm (2) as d ¯ θ λt = − λH (¯ θ λ (cid:98) t (cid:99) , X (cid:100) t (cid:101) ) dt + (cid:112) β − λd ˜ B λt , (11)with initial condition ¯ θ λ = θ . In addition, due to the homogeneous nature of the coefficients ofequation (11), the law of the interpolated process coincides with the law of the SGLD algorithm (2)at grid-points, i.e. L (¯ θ λn ) = L ( θ λn ), for each n ∈ N . Hence, crucial estimates for the SGLD can bederived by studying equation (11).Furthermore, consider a continuous-time process ζ s,v,λt , t ≥ s , which denotes the solution of theSDE dζ s,v,λt = − λh ( ζ s,v,λt ) dt + (cid:112) β − λd ˜ B λt . with initial condition ζ s,v,λs := v , v ∈ R d . Definition 1.
Fix n ∈ N and define ¯ ζ λ,nt = ζ nT, ¯ θ λnT ,λt where T := (cid:98) /λ (cid:99) . Intuitively, ¯ ζ λ,nt is a process started from the value of the SGLD process (11) at time nT and maderun until time t ≥ nT with the continuous-time Langevin dynamics. We proceed by establishing the moment bounds of the processes (¯ θ λt ) t ≥ and (¯ ζ λ,nt ) t ≥ . Lemma 1.
Let Assumptions 1, 2 and 4 hold. For any < λ < λ max given in (7) , n ∈ N , t ∈ ( n, n +1] , E (cid:104) | ¯ θ λt | (cid:105) ≤ (1 − aλ ( t − n ))(1 − aλ ) n E (cid:2) | θ | (cid:3) + c ( λ max + a − ) , here c = ( c + 2 d/β ) , c = 8 E (cid:2) K ( X ) (cid:3) a − + 2 b + 4 λ max L E [ K ρ ( X )] + 4 λ max E (cid:2) F ∗ ( X ) (cid:3) . (12) In addition, sup t E | ¯ θ λt | ≤ E (cid:2) | θ | (cid:3) + c ( λ max + a − ) < ∞ . Similarly, one obtains E (cid:104) | ¯ θ λt | (cid:105) ≤ (1 − aλ ( t − n ))(1 − aλ ) n E | θ | + c ( λ max + a − ) , where c = (1 + aλ max ) c + 12 d β − ( λ max + 9 a − ) (13) with c given in (18) . Moreover, this implies sup t E | ¯ θ λt | < ∞ .Proof. For any n ∈ N and t ∈ ( n, n + 1], define ∆ n,t = ¯ θ λn − λH (¯ θ λn , X n +1 )( t − n ). By using (11), it iseasily seen that for t ∈ ( n, n + 1] E (cid:104) | ¯ θ λt | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) = E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) + (2 λ/β ) d ( t − n ) . Then, by using Assumptions 1, 2, 4 and Remark 1, one obtains E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) = | ¯ θ λn | − λ ( t − n ) E (cid:104)(cid:68) ¯ θ λn , H (¯ θ λn , X n +1 ) (cid:69) (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) + λ ( t − n ) E (cid:104) | H (¯ θ λn , X n +1 ) | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ | ¯ θ λn | − λ ( t − n ) (cid:68) ¯ θ λn , E [ A ( X )] ¯ θ λn (cid:69) + 2 λ ( t − n ) b − λ ( t − n ) E (cid:104)(cid:68) ¯ θ λn , G (¯ θ λn , X n +1 ) (cid:69) (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) + λ ( t − n ) E (cid:104) ((1 + | X n +1 | ) ρ +1 ( L | ¯ θ λn | + L ) + F ∗ ( X n +1 )) (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + 2 λ ( t − n ) b + 2 λ ( t − n ) E [ K ( X )] | ¯ θ λn | + 2 λ ( t − n ) L E [ K ρ ( X )] | ¯ θ λn | + 4 λ ( t − n ) L E [ K ρ ( X )] + 4 λ ( t − n ) E (cid:2) F ∗ ( X ) (cid:3) , where the last inequality is obtained by using ( a + b ) ≤ a + 2 b , for a, b ≥ λ < λ max with λ max given in (7), E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (cid:18) − aλ ( t − n ) (cid:19) | ¯ θ λn | + 2 λ ( t − n ) E [ K ( X )] | ¯ θ λn | + 2 λ ( t − n ) b + 4 λ ( t − n ) L E [ K ρ ( X )] + 4 λ ( t − n ) E [ F ∗ ( X )] . For | ¯ θ λn | > E [ K ( X )] a − , one obtains − aλ ( t − n ) | ¯ θ λn | + 2 λ ( t − n ) E [ K ( X )] | ¯ θ λn | < , which implies E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + 2 λ ( t − n ) b + 4 λ ( t − n ) L E [ K ρ ( X )] + 4 λ ( t − n ) E (cid:2) F ∗ ( X ) (cid:3) . For | ¯ θ λn | ≤ E [ K ( X )] a − , we have E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (cid:18) − aλ ( t − n ) (cid:19) | ¯ θ λn | + 8 λ ( t − n ) E (cid:2) K ( X ) (cid:3) a − + 2 λ ( t − n ) b + 4 λ ( t − n ) L E [ K ρ ( X )] + 4 λ ( t − n ) E (cid:2) F ∗ ( X ) (cid:3) . Combining the two cases yields E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + λ ( t − n ) c , c = 8 E (cid:2) K ( X ) (cid:3) a − + 2 b + 4 λ max L E [ K ρ ( X )] + 4 λ max E (cid:2) F ∗ ( X ) (cid:3) . Therefore, one obtains E (cid:104) | ¯ θ λt | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + λ ( t − n ) c , where c = ( c + 2 d/β ) and the result follows by induction. 
To calculate a higher moment, denote byΞ λn,t = { λβ − } / ( ˜ B λt − ˜ B λn ), for t ∈ ( n, n + 1], one calculates E (cid:104) | ¯ θ λt | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) = E (cid:20)(cid:16) | ∆ n,t | + | Ξ λn,t | + 2 (cid:68) ∆ n,t , Ξ λn,t (cid:69)(cid:17) (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:21) = E (cid:104) | ∆ n,t | + | Ξ λn,t | + 2 | ∆ n,t | | Ξ λn,t | + 4 | ∆ n,t | (cid:68) ∆ n,t , Ξ λn,t (cid:69) +4 | Ξ λn,t | (cid:68) ∆ n,t , Ξ λn,t (cid:69) + 4 (cid:16)(cid:68) ∆ n,t , Ξ λn,t (cid:69)(cid:17) (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:21) ≤ E (cid:104) | ∆ n,t | + | Ξ λn,t | + 6 | ∆ n,t | | Ξ λn,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 + aλ ( t − n )) E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) + (1 + 9 / ( aλ ( t − n ))) E (cid:104) | Ξ λn,t | (cid:105) . (14)where the last inequality holds due to 2 ab ≤ εa + ε − b , for a, b ≥ ε > ε = aλ ( t − n ).Then, one continues with calculating E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) = E (cid:20)(cid:16) | ¯ θ λn | − λ ( t − n ) (cid:68) ¯ θ λn , H (¯ θ λn , X n +1 ) (cid:69) + λ ( t − n ) | H (¯ θ λn , X n +1 ) | (cid:17) (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:21) ≤ | ¯ θ λn | + E (cid:104) λ ( t − n ) | ¯ θ λn | | H (¯ θ λn , X n +1 ) | − λ ( t − n ) (cid:68) ¯ θ λn , H (¯ θ λn , X n +1 ) (cid:69) | ¯ θ λn | − λ ( t − n ) | H (¯ θ λn , X n +1 ) | (cid:68) ¯ θ λn , H (¯ θ λn , X n +1 ) (cid:69) + λ ( t − n ) | H (¯ θ λn , X n +1 ) | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) . By Remark 1, for q ≥
1, one observes E (cid:104) | H (¯ θ λn , X n +1 ) | q (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ E (cid:2) (1 + | X | ) qρ + q (cid:3) (2 q − L q | ¯ θ λn | q + 2 q − L q ) + 2 q − E [ F q ∗ ( X )] . (15)Then, by using Assumption 4 and by taking q = 2 , , E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + 4 bλ ( t − n ) | ¯ θ λn | + 4 λ ( t − n ) E [ K ( X )] | ¯ θ λn | + 12 λ ( t − n ) L E [ K ρ ( X )] | ¯ θ λn | + 24 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) | ¯ θ λn | + 16 λ ( t − n ) L E [ K ρ ( X )] | ¯ θ λn | + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) | ¯ θ λn | + 8 λ ( t − n ) L E [ K ρ ( X )] | ¯ θ λn | + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) , which implies, by using λ < λ max E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + 4 λ ( t − n ) E [ K ( X )] | ¯ θ λn | + 4 bλ ( t − n ) | ¯ θ λn | + 24 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) | ¯ θ λn | + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) | ¯ θ λn | + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) . For | ¯ θ λn | > E [ K ( X )] a − , one obtains − aλ ( t − n )3 | ¯ θ λn | + 4 λ ( t − n ) E [ K ( X )] | ¯ θ λn | < , similarly, for | ¯ θ λn | > (12 ba − + 72 a − λ max (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) ) / , we have − aλ ( t − n )3 | ¯ θ λn | + 4 bλ ( t − n ) | ¯ θ λn | + 24 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) | ¯ θ λn | < , | ¯ θ λn | > (192 a − λ (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) ) / − aλ ( t − n )3 | ¯ θ λn | + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) | ¯ θ λn | < . Denote by M = max (cid:110) E [ K ( X )] a − , (12 ba − + 72 a − λ max (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) ) / , (192 a − λ (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) ) / (cid:111) . (16)For | ¯ θ λn | > M , one obtains E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) . As for | ¯ θ λn | ≤ M , we have E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + 4 λ ( t − n ) E [ K ( X )] M + 4 bλ ( t − n ) M + 24 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) M + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) M + 64 λ ( t − n ) (cid:0) L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) . Combining the two cases yields E (cid:104) | ∆ n,t | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + λ ( t − n ) c , (17)where c = 4 E [ K ( X )] M +4 bM +152(1+ λ max ) (cid:0) (1 + L ) E [ K ρ ( X )] + (1 + E (cid:2) F ∗ ( X ) (cid:3)(cid:1) (1+ M ) (18)with M given in (16). Substituting (17) into (14), one obtains E (cid:104) | ¯ θ λt | (cid:12)(cid:12)(cid:12) ¯ θ λn (cid:105) ≤ (1 + aλ ( t − n ))(1 − aλ ( t − n )) | ¯ θ λn | + (1 + aλ ( t − n )) λ ( t − n ) c + 12 d λ β − ( t − n ) (1 + 9 / ( aλ ( t − n ))) ≤ (1 − aλ ( t − n )) | ¯ θ λn | + λ ( t − n ) c , where c = (1 + aλ max ) c + 12 d β − ( λ max + 9 a − ). The proof completes by induction. Remark 7.
One notices that in Lemma 1, the step-size restriction is the following: ˆ λ max = min (cid:40) a L E [ K ρ ( X )] , a / L E [ K ρ ( X )]) / , a / (32 L E [ K ρ ( X )]) / , a (cid:41) . Theorem 1 and Corollary 1 still hold by using ˆ λ max . However, in order to make notation compact, therestriction is chosen to be λ max given in (7) , which can be deduced from the above expression. Corollary 5.
Let Assumptions 1, 2 and 4 hold. For any < λ < λ max given in (7) , n ∈ N , t ∈ ( n, n + 1] , E [ V (¯ θ λt )] ≤ − aλ ) (cid:98) t (cid:99) E [ V ( θ )] + 2 c ( λ max + a − ) + 2 , where c is given in (13) . Next, we present a drift condition associated with the SDE (3), which will be used to obtain themoment bounds of the process (¯ ζ λ,nt ) t ≥ . 9 emma 2. Let Assumptions 1, 2 and 4 hold. Then, for each p ≥ , θ ∈ R d , ∆ V p β − (cid:104) h ( θ ) , ∇ V p ( θ ) (cid:105) ≤ − ¯ c ( p ) V p ( θ ) + ˜ c ( p ) , where ¯ c ( p ) = ap/ and ˜ c ( p ) = (3 / ap v p +1 ( M p ) with M p given in (19) .Proof. One notices that, by Assumptions 1 and 2, for any θ ∈ R d , h ( θ ) = E [ H ( θ, X )] = E [ F ( θ, X ) + G ( θ, X )]. Then, one calculates,∆ V p β − (cid:104) h ( θ ) , ∇ V p ( θ ) (cid:105) = β − p ( p − | θ | V p − ( θ ) + β − pdV p − ( θ ) − pV p − ( θ ) (cid:104) E [ F ( θ, X ) + G ( θ, X )] , θ (cid:105)≤ − apV p ( θ ) + ( ap + bp + β − p ( p −
2) + β − pd ) V p − ( θ ) + p E [ K ( X )] | θ | V p − ( θ ) , where the last inequality is obtained due to Assumption 4. By observing | θ | ≤ (cid:112) | θ | , denote by M p = (cid:112) (4 / b/ (3 a ) + 4 d/ (3 aβ ) + 4( p − / (3 aβ ) + 4 E [ K ( X )] / (3 a )) − . (19)For | θ | > M p , one obtains ∆ V p β − (cid:104) h ( θ ) , ∇ V p ( θ ) (cid:105) ≤ − ( ap/ V p ( θ ), while for | θ | ≤ M p , we have ∆ V p β − (cid:104) h ( θ ) , ∇ V p ( θ ) (cid:105) ≤ (3 / ap v p +1 ( M p ). Combining the two cases yields the desired result.The following Lemma provides the second and the fourth moment of the process (¯ ζ λ,nt ) t ≥ . Lemma 3.
Let Assumptions 1, 2 and 4 hold. For any < λ < λ max given in (7) , t ≥ nT , n ∈ N ,one obtains the following inequality E [ V (¯ ζ λ,nt )] ≤ e − aλt/ E [ V ( θ )] + 3v ( M ) + c ( λ max + a − ) + 1 , where the process ¯ ζ λ,nt is defined in Definition 1 and c is given in (12) . Furthermore, E [ V (¯ ζ λ,nt )] ≤ e − aλt E [ V ( θ )] + 3v ( M ) + 2 c ( λ max + a − ) + 2 , where c is given in (13) .Proof. For any p ≥
1, application of Ito’s lemma and taking expectation yields E [ V p (¯ ζ λ,nt )] = E [ V p (¯ θ λnT )] + (cid:90) tnT E (cid:34) λ ∆ V p (¯ ζ λ,ns ) β − λ (cid:104) h (¯ ζ λ,ns ) , ∇ V p (¯ ζ λ,ns ) (cid:105) (cid:35) ds. Differentiating both sides and using Lemma 2, we arrive at ddt E [ V p (¯ ζ λ,nt )] = E (cid:34) λ ∆ V p (¯ ζ λ,nt ) β − λ (cid:104) h (¯ ζ λ,nt ) , ∇ V p (¯ ζ λ,nt ) (cid:105) (cid:35) ≤ − λ ¯ c ( p ) E [ V p (¯ ζ λ,nt )] + λ ˜ c ( p ) , which yields E [ V p (¯ ζ λ,nt )] ≤ e − λ ( t − nT )¯ c ( p ) E [ V p (¯ θ λnT )] + ˜ c ( p )¯ c ( p ) (cid:16) − e − λ ¯ c ( p )( t − nT ) (cid:17) ≤ e − λ ( t − nT )¯ c ( p ) E [ V p (¯ θ λnT )] + ˜ c ( p )¯ c ( p ) . Now for p = 2, by using Corollary 5,one obtains E [ V (¯ ζ λ,nt )] ≤ e − λ ( t − nT )¯ c (2) E [ V (¯ θ λnT )] + ˜ c (2)¯ c (2)10 (1 − aλ ) nT e − λ ( t − nT )¯ c (2) E [ V ( θ )] + ˜ c (2)¯ c (2) + c ( λ max + a − ) + 1 ≤ e − aλt/ E [ V ( θ )] + 3v ( M ) + c ( λ max + a − ) + 1 , where the last inequality holds due to 1 − z ≤ e − z for z ≥ c (2) = a/
2. Similarly, for p = 4, oneobtains E [ V (¯ ζ λ,nt )] ≤ e − λ ( t − nT )¯ c (4) E [ V (¯ θ λnT )] + ˜ c (4)¯ c (4) ≤ − aλ ) nT e − λ ( t − nT )¯ c (4) E [ V ( θ )] + ˜ c (4)¯ c (4) + 2 c ( λ max + a − ) + 2 ≤ e − aλt E [ V ( θ )] + 3v ( M ) + 2 c ( λ max + a − ) + 2 , where the last inequality holds due to 1 − z ≤ e − z for z ≥ c (4) = a . We introduce a functional which is crucial to obtain the convergence rate in W . For any p ≥ µ, ν ∈ P V p , w ,p ( µ, ν ) := inf ζ ∈C ( µ,ν ) (cid:90) R d (cid:90) R d [1 ∧ | θ − θ (cid:48) | ](1 + V p ( θ ) + V p ( θ (cid:48) )) ζ ( dθdθ (cid:48) ) , (20)and it satisfies trivially W ( µ, ν ) ≤ w ,p ( µ, ν ) . (21)The case p = 2, i.e. w , , is used throughout the section. The result below states a contractionproperty of w , . Proposition 1.
Let Z (cid:48) t , t ∈ R + be the solution of (3) with initial condition Z (cid:48) = θ which is inde-pendent of F ∞ and satisfies | θ | is finite. Then, w , ( L ( Z t ) , L ( Z (cid:48) t )) ≤ ˆ ce − ˙ ct w , ( L ( θ ) , L ( θ (cid:48) )) , where the constants ˙ c and ˆ c are given in Lemma 7.Proof. See Proposition 3.14 of [8].By using the contraction property provided in Proposition 1, one can construct the non-asymptoticbound between L (¯ θ λt ) and L ( Z λt ), t ∈ [ nT, ( n + 1) T ], in W distance by decomposing the error usingthe auxiliary process ¯ ζ λ,nt : W ( L (¯ θ λt ) , L ( Z λt )) ≤ W ( L (¯ θ λt ) , L (¯ ζ λ,nt )) + W ( L (¯ ζ λ,nt ) , L ( Z λt )) . (22)One notices that when 1 < λ ≤ λ max , the result holds trivially. Thus, we consider the case 0 < λ ≤ / < λT ≤ Lemma 4.
Let Assumption 1, 2, 3, and 4 hold. For any < λ < λ max given in (7) , t ∈ [ nT, ( n +1) T ] , W ( L (¯ θ λt ) , L (¯ ζ λ,nt )) ≤ √ λ ( e − an/ ¯ C , E [ V ( θ )] + ¯ C , ) / , where ¯ C , and ¯ C , are given in (25) .Proof. To handle the first term in (22), we start by establishing an upper bound in Wasserstein-2distance and the statement follows by noticing W ≤ W . By employing synchronous coupling, using(11) and the definition of ¯ ζ λ,nt in Definition 1, one obtains (cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) ≤ λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ ζ λ,ns ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) . (cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) ≤ λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ θ λ (cid:98) s (cid:99) ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) + λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) h (¯ θ λ (cid:98) s (cid:99) ) − h (¯ ζ λ,ns ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) . Taking squares on both sides and the application of Remark 2 yield (cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) ≤ λ (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ θ λ (cid:98) s (cid:99) ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) + 2 λL (cid:90) tnT (cid:12)(cid:12)(cid:12) ¯ θ λ (cid:98) s (cid:99) − ¯ ζ λ,ns (cid:12)(cid:12)(cid:12) ds. By taking expectations on both sides and by using ( a + b ) ≤ a + 2 b , for a, b >
0, one obtains E (cid:20)(cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) (cid:21) ≤ λ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ θ λ (cid:98) s (cid:99) ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) (cid:35) + 4 λL (cid:90) tnT E (cid:20)(cid:12)(cid:12)(cid:12) ¯ θ λ (cid:98) s (cid:99) − ¯ θ λs (cid:12)(cid:12)(cid:12) (cid:21) ds + 4 λL (cid:90) tnT E (cid:20)(cid:12)(cid:12)(cid:12) ¯ θ λs − ¯ ζ λ,ns (cid:12)(cid:12)(cid:12) (cid:21) ds, which implies due to λT ≤ E (cid:20)(cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) (cid:21) ≤ λL ( e − aλnT ¯ σ Y E [ V ( θ )] + ˜ σ Y ) + 4 λL (cid:90) tnT E (cid:20)(cid:12)(cid:12)(cid:12) ¯ θ λs − ¯ ζ λ,ns (cid:12)(cid:12)(cid:12) (cid:21) ds + 2 λ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ θ λ (cid:98) s (cid:99) ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) (cid:35) , (23)where ¯ σ Y and ˜ σ Y are provided in (52). Next, we bound the last term in (23) by partitioning theintegral. Assume that nT + K ≤ t ≤ nT + K + 1 where K + 1 ≤ T . Thus we can write (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ θ λ (cid:98) s (cid:99) ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) k =1 I k + R K (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) where I k = H (¯ θ λnT + k − , X nT + k ) − h (¯ θ λnT + k − ) , R K = ( t − ( nT + K ))( H (¯ θ λnT + K , X nT + K +1 ) − h (¯ θ λnT + K )) . Taking squares of both sides (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) k =1 I k + R K (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = K (cid:88) k =1 | I k | + 2 K (cid:88) k =2 k − (cid:88) j =1 (cid:104) I k , I j (cid:105) + 2 K (cid:88) k =1 (cid:104) I k , R K (cid:105) + | R K | , Finally, we take expectations of both sides. Define the filtration H t = F λ ∞ ∨ G (cid:98) t (cid:99) . We first note thatfor any k = 2 , . . . , K , j = 1 , . . . , k − E (cid:104) I k , I j (cid:105) = E [ E [ (cid:104) I k , I j (cid:105)|H nT + k − ]] , = E (cid:104) E (cid:104)(cid:68) H (¯ θ λnT + k − , X nT + k ) − h (¯ θ λnT + k − ) , H (¯ θ λnT + j − , X nT + j ) − h (¯ θ λnT + j − ) (cid:69)(cid:12)(cid:12)(cid:12) H nT + k − (cid:105)(cid:105) , = E (cid:104)(cid:68) E (cid:104) H (¯ θ λnT + k − , X nT + k ) − h (¯ θ λnT + k − ) (cid:12)(cid:12)(cid:12) H nT + k − (cid:105) , H (¯ θ λnT + j − , X nT + j ) − h (¯ θ λnT + j − ) (cid:69)(cid:105) , = 0 . By the same argument E (cid:104) I k , R K (cid:105) = 0 for all 1 ≤ k ≤ K . Therefore, the last term of (23) is boundedas 2 λ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) tnT (cid:104) H (¯ θ λ (cid:98) s (cid:99) , X (cid:100) s (cid:101) ) − h (¯ θ λ (cid:98) s (cid:99) ) (cid:105) ds (cid:12)(cid:12)(cid:12)(cid:12) (cid:35) = 2 λ K (cid:88) k =1 E (cid:2) | I k | (cid:3) + 2 λ E (cid:2) | R K | (cid:3) λ ( e − aλnT ¯ σ Z E [ V ( θ )] + ˜ σ Z ) , where the last inequality holds due to Lemma 12 and ¯ σ Z and ˜ σ Z are provided in (51). 
Therefore, thebound (23) becomes E (cid:20)(cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) (cid:21) ≤ λL (cid:90) tnT E (cid:20)(cid:12)(cid:12)(cid:12) ¯ θ λs − ¯ ζ λ,ns (cid:12)(cid:12)(cid:12) (cid:21) ds + 4 λe − aλnT ( L ¯ σ Y + ¯ σ Z ) E [ V ( θ )] + 4 λ ( L ˜ σ Y + ˜ σ Z ) , Using Gr¨onwall’s inequality yields E (cid:20)(cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) (cid:21) ≤ λe L (cid:104) e − aλnT ( L ¯ σ Y + ¯ σ Z ) E [ V ( θ )] + ( L ˜ σ Y + ˜ σ Z ) (cid:105) , which implies by λT ≥ / W ( L (¯ θ λt ) , L (¯ ζ λ,nt )) ≤ E (cid:20)(cid:12)(cid:12)(cid:12) ¯ ζ λ,nt − ¯ θ λt (cid:12)(cid:12)(cid:12) (cid:21) ≤ λ ( e − an/ ¯ C , E [ V ( θ )] + ¯ C , ) , (24)where ¯ C , = 4 e L ( L ¯ σ Y + ¯ σ Z ) , ¯ C , = 4 e L ( L ˜ σ Y + ˜ σ Z ) (25)with ¯ σ Y , ˜ σ Y provided in (52) and ¯ σ Z , ˜ σ Z given in (51).Then, the following Lemma provides the bound for the second term in (22). Lemma 5.
Let Assumption 1, 2, 3 and 4 hold. For any < λ < λ max given in (7) , t ∈ [ nT, ( n + 1) T ] , W ( L (¯ ζ λ,nt ) , L ( Z λt )) ≤ √ λ ( e − min { ˙ c,a/ } n/ ¯ C , E [ V ( θ )] + ¯ C , ) , where ¯ C , , ¯ C , is given in (26) .Proof. To upper bound the second term W ( L (¯ ζ λ,nt ) , L ( Z λt )) in (22), we adapt the proof from Lemma 3.28in [8]. By Proposition 1, Corollary 5, Lemma 3 and 4, one obtains W ( L (¯ ζ λ,nt ) , L ( Z λt )) ≤ n (cid:88) k =1 W ( L (¯ ζ λ,kt ) , L (¯ ζ λ,k − t )) , ≤ n (cid:88) k =1 w , ( L ( ζ kT, ¯ θ λkT ,λt ) , L ( ζ kT, ¯ ζ λ,k − kT ,λt )) ≤ ˆ c n (cid:88) k =1 exp( − ˙ c ( n − k )) w , ( L (¯ θ λkT ) , L (¯ ζ λ,k − kT )) ≤ ˆ c n (cid:88) k =1 exp( − ˙ c ( n − k )) W ( L (¯ θ λkT ) , L (¯ ζ λ,k − kT )) (cid:20) (cid:110) E [ V (¯ θ λkT )] (cid:111) / + (cid:110) E [ V (¯ ζ λ,k − kT )] (cid:111) / (cid:21) ≤ ( √ λ ) − ˆ c n (cid:88) k =1 exp( − ˙ c ( n − k )) W ( L (¯ θ λkT ) , L (¯ ζ λ,k − kT ))+ 3 √ λ ˆ c n (cid:88) k =1 exp( − ˙ c ( n − k )) (cid:104) E [ V (¯ θ λkT )] + E [ V (¯ ζ λ,k − kT )] (cid:105) ≤ √ λe − min { ˙ c,a/ } n n ˆ c ( e min { ˙ c,a/ } ¯ C , E [ V ( θ )] + 12 E [ V ( θ )])+ √ λ ˆ c − exp( − ˙ c ) ( ¯ C , + 12 c ( λ max + a − ) + 9v ( M ) + 15) ≤ √ λ ( e − min { ˙ c,a/ } n/ ¯ C , E [ V ( θ )] + ¯ C , ) 13here the last inequality holds due to e − αn ( n +1) ≤ α − , for α >
0, and we take α = min { ˙ c, a/ } / C , = ˆ c (cid:18) { ˙ c, a/ } (cid:19) ( e min { ˙ c,a/ } ¯ C , + 12)¯ C , = ˆ c − exp( − ˙ c ) ( ¯ C , + 12 c ( λ max + a − ) + 9v ( M ) + 15) (26)with ¯ C , , ¯ C , given in 25, ˆ c , ˙ c given in Lemma 7, c is given in (13) and M given in (19).By using similar arguments as in Lemma 5, an analogous result can be obtained in W distance,which is given in the following corollary. Corollary 6.
Let Assumption 1, 2, 3 and 4 hold. For any < λ < λ max given in (7) , t ∈ [ nT, ( n +1) T ] , W ( L (¯ ζ λ,nt ) , L ( Z λt )) ≤ λ / ( e − min { ˙ c,a/ } n/ ¯ C ∗ , E / [ V ( θ )] + ¯ C ∗ , ) , where ¯ C ∗ , , ¯ C ∗ , is given in (27) .Proof. One notices that W ≤ (cid:112) w , , then one writes W ( L (¯ ζ λ,nt ) , L ( Z λt )) ≤ n (cid:88) k =1 W ( L (¯ ζ λ,kt ) , L (¯ ζ λ,k − t )) ≤ n (cid:88) k =1 √ w / , ( L ( ζ kT, ¯ θ λkT ,λt ) , L ( ζ kT, ¯ ζ λ,k − kT ,λt )) ≤ √ c n (cid:88) k =1 exp( − ˙ c ( n − k ) / W / ( L (¯ θ λkT ) , L (¯ ζ λ,k − kT )) (cid:20) (cid:110) E [ V (¯ θ λkT )] (cid:111) / + (cid:110) E [ V (¯ ζ λ,k − kT )] (cid:111) / (cid:21) / ≤ λ − / √ c n (cid:88) k =1 exp( − ˙ c ( n − k ) / W ( L (¯ θ λkT ) , L (¯ ζ λ,k − kT ))+ λ / √ c n (cid:88) k =1 exp( − ˙ c ( n − k ) / (cid:20) (cid:110) E [ V (¯ θ λkT )] (cid:111) / + (cid:110) E [ V (¯ ζ λ,k − kT )] (cid:111) / (cid:21) ≤ √ cλ / e − min { ˙ c,a/ } n/ n ( e min { ˙ c,a/ } / ¯ C / , E / [ V ( θ )] + 2 √ E / [ V ( θ )])+ √ cλ / − exp( − ˙ c/
2) ( ¯ C / , + 2 √ c ( λ max + a − ) / + √ / ( M ) + √ ≤ λ / ( e − min { ˙ c,a/ } n/ ¯ C ∗ , E / [ V ( θ )] + ¯ C ∗ , ) , where ¯ C ∗ , = √ c (cid:18) { ˙ c, a/ } (cid:19) ( e min { ˙ c,a/ } / ¯ C / , + 2 √ C ∗ , = √ c − exp( − ˙ c/
2) ( ¯ C / , + 2 √ c ( λ max + a − ) / + √ / ( M ) + √ , (27)with ¯ C , , ¯ C , given in 25, ˆ c , ˙ c given in Lemma 7, c is given in (13) and M given in Lemma 2. Thiscompletes the proof.Finally, by using the inequality (22) and the results from previous lemmas, one can obtain thenon-asymptotic bound between ¯ θ λt and Z λt , t ∈ [ nT, ( n + 1) T ], in W distance. Lemma 6.
Let Assumption 1, 2, 3 and 4 hold. For any < λ < λ max given in (7) , t ∈ [ nT, ( n + 1) T ] , W ( L (¯ θ λt ) , L ( Z λt )) ≤ ¯ C √ λ ( e − min { ˙ c,a/ } n/ E [ V ( θ )] + 1) , where ¯ C is given in (28) . roof. By using Lemma 4 and 5, one obtains W ( L (¯ θ λt ) , L ( Z λt )) ≤ W ( L (¯ θ λt ) , L (¯ ζ λ,nt )) + W ( L (¯ ζ λ,nt ) , L ( Z λt )) ≤ √ λ ( e − an/ ¯ C / , E / [ V ( θ )] + ¯ C / , ) + √ λ ( e − min { ˙ c,a/ } n/ ¯ C , E [ V ( θ )] + ¯ C , ) ≤ ¯ C √ λ ( e − min { ˙ c,a/ } n/ E [ V ( θ )] + 1) , where ¯ C = ¯ C / , + ¯ C / , + ¯ C , + ¯ C , . (28)Before proceeding to the proofs of the main results, we provide explicitly the constants ˙ c and ˆ c inProposition 1. Lemma 7.
The contraction constant in Proposition 1 is given by ˙ c = min { ¯ φ, ¯ c ( p ) , c ( p ) ˙ (cid:15) ¯ c ( p ) } / , where the explicit expressions for ¯ c ( p ) and ˜ c ( p ) can be found in Lemma 2 and ¯ φ is given by ¯ φ = (cid:18)(cid:112) π/L ¯ b exp (cid:18)(cid:16) ¯ b √ L/ / √ L (cid:17) (cid:19)(cid:19) − . Furthermore, any ˙ (cid:15) can be chosen which satisfies the following inequality ˙ (cid:15) ≤ ∧ (cid:32) c ( p ) (cid:112) π/L (cid:90) ˜ b exp (cid:18)(cid:16) s √ L/ / √ L (cid:17) (cid:19) ds (cid:33) − , where ˜ b = (cid:112) c ( p ) / ¯ c ( p ) − and ¯ b = (cid:112) c ( p )(1 + ¯ c ( p )) / ¯ c ( p ) − . The constant ˆ c is given as the ratio C /C , where C , C are given explicitly in [8, Lemma 3.26].Proof. See [8, Lemma 3.26].
Proof of Theorem 1
One notes that, by Lemma 6 and Proposition 1, for t ∈ [ nT, ( n + 1) T ] W ( L (¯ θ λt ) , π β ) ≤ W ( L (¯ θ λt ) , L ( Z λt )) + W ( L ( Z λt ) , π β ) ≤ ¯ C √ λ ( e − min { ˙ c,a/ } n/ E [ V ( θ )] + 1) + ˆ ce − ˙ cλt w , ( θ , π β ) ≤ ¯ C √ λ ( e − min { ˙ c,a/ } n/ E [ V ( θ )] + 1) + ˆ ce − ˙ cλt (cid:20) E [ V ( θ )] + (cid:90) R d V ( θ ) π β ( dθ ) (cid:21) ≤ e − min { ˙ c,a/ } n/ ( λ / ¯ C + ˆ c )(1 + E [ | θ | ])+ ˆ ce − min { ˙ c,a/ } n/ (cid:20) (cid:90) R d V ( θ ) π β ( dθ ) (cid:21) + √ λ ¯ C , which implies, for any n ∈ N W ( L ( θ λn ) , π β ) ≤ C e − C λn (1 + E [ | θ | ]) + C √ λ, where C = min { ˙ c, a/ } / , C = 2 (cid:20) ( λ / ¯ C + ˆ c ) + ˆ c (cid:18) (cid:90) R d V ( θ ) π β ( dθ ) (cid:19)(cid:21) , C = ¯ C , (29)with ¯ C given in 28. Proof of Corollary 1
By using (24) in Lemma 4, Corollary 6 and Proposition 1, one obtains W ( L (¯ θ λt ) , π β ) ≤ W ( L (¯ θ λt ) , L ( Z λt )) + W ( L ( Z λt ) , π β )15 W ( L (¯ θ λt ) , L (¯ ζ λ,nt )) + W ( L (¯ ζ λ,nt ) , L ( Z λt )) + W ( L ( Z λt ) , π β ) ≤ √ λ ( e − an/ ¯ C , E [ V ( θ )] + ¯ C , ) / + λ / ( e − min { ˙ c,a/ } n/ ¯ C ∗ , E / [ V ( θ )] + ¯ C ∗ , ) , + (cid:113) w , ( L ( Z λt ) , π β ) ≤ λ / ˜ C ( e − min { ˙ c,a/ } n/ E [ V ( θ )] + 1) + ˆ c / e − ˙ cλt/ (cid:113) w , ( θ , π β ) , where ˜ C = λ / ¯ C / , + λ / ¯ C / , + ¯ C ∗ , + ¯ C ∗ , and it can be further calculated as W ( L (¯ θ λt ) , π β ) ≤ λ / ˜ C ( e − min { ˙ c,a/ } n/ E [ V ( θ )] + 1)+ √ c / e − ˙ cλt/ (cid:18) E [ V ( θ )] + (cid:90) R d V ( θ ) π β ( dθ ) (cid:19) / ≤ e − min { ˙ c,a/ } n/ ( λ / ˜ C + √ c / )(1 + E [ | θ | ])+ √ c / e − min { ˙ c,a/ } n/ (cid:20) (cid:90) R d V ( θ ) π β ( dθ ) (cid:21) + λ / ˜ C , Finally, one obtains W ( L ( θ λn ) , π β ) ≤ C e − C λn E [ | θ | + 1] + C λ / where C = min { ˙ c, a/ } / , C = 2 (cid:20) ( λ / ˜ C + √ c / ) + ˆ c / (cid:18) (cid:90) R d V ( θ ) π β ( dθ ) (cid:19)(cid:21) , C = ˜ C . (30) Proof of Corollary 2
To obtain an upper bound for the expected excess risk E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ),one considers the following splitting E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ) = (cid:16) E [ U (ˆ θ )] − E [ U ( Z ∞ )] (cid:17) + (cid:18) E [ U ( Z ∞ )] − inf θ ∈ R d U ( θ ) (cid:19) , (31)where ˆ θ = θ λn and Z ∞ ∼ π β with π β ( θ ) = exp( − βU ( θ )) for all θ ∈ R d . By using [19, Lemma 3.5],Lemma 1, 14 and Corollary 1, the first term on the RHS of (31) can be bounded by E [ U (ˆ θ )] − E [ U ( Z ∞ )] ≤ (cid:16) L ( E (cid:2) | θ | (cid:3) + ( c + E [ K ( X )] /a )( λ max + a − )) / + | h (0) | (cid:17) W ( L ( θ λn ) , π β ) ≤ (cid:16) L ( E (cid:2) | θ | (cid:3) + ( c + E [ K ( X )] /a )( λ max + a − )) / + | h (0) | (cid:17) (cid:16) C e − C λn E [ | θ | + 1] + C λ / (cid:17) ≤ ˆ C e − ˆ C λn + ˆ C λ / , where ˆ C = C , ˆ C = C (cid:16) L ( E (cid:2) | θ | (cid:3) + ( c + E [ K ( X )] /a )( λ max + a − )) / + | h (0) | (cid:17) E [ | θ | + 1] , ˆ C = C (cid:16) L ( E (cid:2) | θ | (cid:3) + ( c + E [ K ( X )] /a )( λ max + a − )) / + | h (0) | (cid:17) , (32)with C , C , C given in (30) and c given in (12). Moreover, the second term on the RHS of (31) canbe estimated by using [19, Proposition 3.4], which gives, E [ U ( Z ∞ )] − inf θ ∈ R d U ( θ ) ≤ ˆ C β , where ˆ C = d (cid:18) eβLad (cid:18) dβ + 2 b + E [ K ( X )] a (cid:19)(cid:19) . (33)Finally, one obtains E [ U (ˆ θ )] − inf θ ∈ R d U ( θ ) ≤ ˆ C e − ˆ C λn + ˆ C λ / + ˆ C /β. Proof of the main results: convex case
The analysis of the convergence results in the convex case, i.e. Theorem 2, relies on the properties ofthe LMC algorithm, known also as the unadjusted Langevin algorithm (ULA). The LMC algorithmassociated with SDE (3) is given explicitly by, for any n ∈ N ,˙ θ λn +1 := ˙ θ λn − λh ( ˙ θ λn ) + (cid:112) β − λξ n +1 , ˙ θ λ := θ . (34)For 0 < λ < ¯ λ max , the Markov kernel ˙ R λ associated with (34) is given by, for all A ∈ B ( R d ) and θ ∈ R d , ˙ R λ ( θ, A ) = (cid:90) A (4 β − πλ ) − d/ exp (cid:16) − β (4 λ ) − | y − θ + λh ( θ ) | (cid:17) dy. In this section, the moment estimates of the SDE (3), the LMC algorithm (34) and the SGLD algorithm(2) are presented which contribute to the analysis of the convergence resuts.
Under Assumptions 5 and 6, U has a unique minimizer θ ∗ ∈ R d . Denote by ( P t ) t ≥ the semigroupassociated with SDE (3). The statements below provide a moment bound and a convergence resultfor SDE (3). Lemma 8 (Proposition 1 in [12]) . Let Assumptions 1, 2, 3, 5 and 6 hold.(i) For all t > and y ∈ R d , (cid:90) R d | y − θ ∗ | P t ( θ, dy ) ≤ | θ − θ ∗ | e − at + ( d/ (ˆ aβ ))(1 − e − at ) . (ii) The stationary distribution π β satisfies (cid:90) R d | y − θ ∗ | π β ( dy ) ≤ d/ (ˆ aβ ) . The following lemma provides moment estimates for ( ˙ θ n ) n ∈ N and it states that ˙ R λ admits aninvariant measure π λ which may differ from π β . Lemma 9.
Let Assumptions 1, 2, 5 and 6 hold. Then, for all < λ < ¯ λ max given in (10) , oneobtains:(i) For all t > and θ ∈ R d , (cid:90) R d | y − θ ∗ | ˙ R nλ ( θ, dy ) ≤ (1 − a ∗ λ ) n | θ − θ ∗ | + ( d/ (ˆ a ∗ β ))(1 − (1 − a ∗ λ ) n ) . (ii) The Markov kernel ˙ R λ has a unique stationary distribution π λ and it satisfies (cid:90) R d | θ − θ ∗ | π λ ( dθ ) ≤ d/ (ˆ a ∗ β ) . (iii) For all n ∈ R d and θ ∈ R d , W ( δ θ ˙ R nλ , π λ ) ≤ e − ˆ a ∗ λn ( | θ − θ ∗ | + d/ (ˆ a ∗ β )) / . The lemma below presents a second moment bound for θ λn in the convex case.17 emma 10. Let Assumptions 1, 2, 5 hold. For any < λ < ¯ λ max given in (10) , E (cid:20)(cid:12)(cid:12)(cid:12) θ λn − θ ∗ (cid:12)(cid:12)(cid:12) (cid:21) ≤ (1 − ˆ aλ ) n E (cid:2) | θ − θ ∗ | (cid:3) + ¯ c ˆ a − , where ¯ c = 32 E (cid:2) K ( X ) (cid:3) ˆ a − + 9¯ λ max ( L E [ K ρ ( X )] | θ ∗ | + L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3) ) + 2 dβ − . (35) This implies sup n E (cid:104)(cid:12)(cid:12) θ λn +1 − θ ∗ (cid:12)(cid:12) (cid:105) ≤ E (cid:2) | θ − θ ∗ | (cid:3) + ¯ c ˆ a − < ∞ . Furthermore, if ρ = 0 in Assumption1, the result holds for λ ∈ min { / a + L ) , / (6 L ) } with ˆ a = ˆ a + ˆ a .Proof. By using (2), one writes, for any n ∈ N , | θ λn +1 − θ ∗ | = | θ λn − θ ∗ | + 2 (cid:68) θ λn − θ ∗ , − λH ( θ λn , X n +1 ) + (cid:112) β − λξ n +1 (cid:69) + | − λH ( θ λn , X n +1 ) + (cid:112) β − λξ n +1 | = | θ λn − θ ∗ | − λ (cid:68) θ λn − θ ∗ , H ( θ λn , X n +1 ) − H ( θ ∗ , X n +1 ) (cid:69) − λ (cid:68) θ λn − θ ∗ , H ( θ ∗ , X n +1 ) (cid:69) + 2 (cid:68) θ λn − θ ∗ , (cid:112) β − λξ n +1 (cid:69) + λ | H ( θ λn , X n +1 ) | − λ (cid:68) H ( θ λn , X n +1 ) , (cid:112) β − λξ n +1 (cid:69) + 2 β − λ | ξ n +1 | . Taking conditional expectation on both sides and by using Remark 1 and Assumption 1, 5 yield E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) = | θ λn − θ ∗ | − λ E (cid:104) (cid:68) θ λn − θ ∗ , F ( θ λn , X n +1 ) − F ( θ ∗ , X n +1 ) (cid:69)(cid:12)(cid:12)(cid:12) θ λn (cid:105) − λ E (cid:104) (cid:68) θ λn − θ ∗ , G ( θ λn , X n +1 ) − G ( θ ∗ , X n +1 ) (cid:69)(cid:12)(cid:12)(cid:12) θ λn (cid:105) − λ (cid:68) θ λn − θ ∗ , h ( θ ∗ ) (cid:69) + λ E (cid:104) | H ( θ λn , X n +1 ) | (cid:12)(cid:12)(cid:12) θ λn (cid:105) + 2 dβ − λ (36) ≤ | θ λn − θ ∗ | − λ ˆ a | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | + λ E (cid:20) (cid:16) (1 + | X n +1 | ) ρ +1 ( L | θ λn − θ ∗ | + L | θ ∗ | + L ) + F ∗ ( X n +1 ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) θ λn (cid:21) + 2 dβ − λ ≤ (1 − a λ ) | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | + 2 λ L E [ K ρ ( X )] | θ λn − θ ∗ | + 6 λ L E [ K ρ ( X )] | θ ∗ | + 6 λ L E [ K ρ ( X )] + 6 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ which implies, for 0 < λ < ¯ λ max , E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (cid:18) −
32 ˆ a λ (cid:19) | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | + 6 λ L E [ K ρ ( X )] | θ ∗ | + 6 λ L E [ K ρ ( X )] + 6 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ. Then, for | θ λn − θ ∗ | > E [ K ( X )] ˆ a − , one notices that −
12 ˆ a λ | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | < , and this indicates E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (1 − ˆ a λ ) | θ λn − θ ∗ | + 6 λ L E [ K ρ ( X )] | θ ∗ | + 6 λ L E [ K ρ ( X )] + 6 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ. Similarly, for | θ λn − θ ∗ | ≤ E [ K ( X )] ˆ a − , one obtains E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (cid:18) −
32 ˆ a λ (cid:19) | θ λn − θ ∗ | + 32 λ E (cid:2) K ( X ) (cid:3) ˆ a −
18 6 λ L E [ K ρ ( X )] | θ ∗ | + 6 λ L E [ K ρ ( X )] + 6 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ. Combining the two cases yields E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (1 − ˆ aλ ) | θ λn − θ ∗ | + λc , where c = 32 E (cid:2) K ( X ) (cid:3) ˆ a − + 6¯ λ max ( L E [ K ρ ( X )] | θ ∗ | + L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3) ) + 2 dβ − . Theresult follows by induction.Moreover, one observes that when ρ = 0 in Assumption 1, F is co-coercive, i.e. for any θ, θ (cid:48) ∈ R d and for every x ∈ R m (cid:10) θ − θ (cid:48) , F ( θ, x ) − F ( θ (cid:48) , x ) (cid:11) ≥ L | F ( θ, x ) − F ( θ (cid:48) , x ) | . (37)Then, by substituting (37) into (36), one obtains E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ | θ λn − θ ∗ | − λ ˆ a | θ λn − θ ∗ | − λ L E (cid:104) | F ( θ λn , X n +1 ) − F ( θ ∗ , X n +1 ) | (cid:12)(cid:12)(cid:12) θ λn (cid:105) + 4 λ E [ K ( X )] | θ λn − θ ∗ | + λ E (cid:104) | H ( θ λn , X n +1 ) | (cid:12)(cid:12)(cid:12) θ λn (cid:105) + 2 dβ − λ ≤ (cid:18) − λ ˆ a (cid:19) | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | + (cid:18) λ − λ L (cid:19) E (cid:104) | F ( θ λn , X n +1 ) − F ( θ ∗ , X n +1 ) | (cid:12)(cid:12)(cid:12) θ λn (cid:105) + 3 λ E (cid:104) | F ( θ ∗ , X n +1 ) | (cid:12)(cid:12) θ λn (cid:105) + 3 λ E (cid:2) K ( X ) (cid:3) + 2 dβ − λ, which implies for λ ∈ min { / a , / (6 L ) } E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (cid:18) − λ ˆ a (cid:19) | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | + 9 λ L E [ K ρ ( X )] | θ ∗ | + 9 λ L E [ K ρ ( X )] + 9 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ. By using the same arguments as above, consider the case | θ λn − θ ∗ | > E [ K ( X )] ˆ a − , one notices that −
12 ˆ a λ | θ λn − θ ∗ | + 4 λ E [ K ( X )] | θ λn − θ ∗ | < , and this indicates E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (1 − ˆ a λ ) | θ λn − θ ∗ | + 9 λ L E [ K ρ ( X )] | θ ∗ | + 9 λ L E [ K ρ ( X )] + 9 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ. Similarly, for | θ λn − θ ∗ | ≤ E [ K ( X )] ˆ a − , one obtains E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (cid:18) −
32 ˆ a λ (cid:19) | θ λn − θ ∗ | + 32 λ E (cid:2) K ( X ) (cid:3) ˆ a − + 9 λ L E [ K ρ ( X )] | θ ∗ | + 9 λ L E [ K ρ ( X )] + 9 λ E (cid:2) F ∗ ( X ) (cid:3) + 2 dβ − λ. Combining the two cases yields E (cid:104) | θ λn +1 − θ ∗ | (cid:12)(cid:12)(cid:12) θ λn (cid:105) ≤ (1 − ˆ aλ ) | θ λn − θ ∗ | + λ ¯ c , where ¯ c = 32 E (cid:2) K ( X ) (cid:3) ˆ a − + 9¯ λ max ( L E [ K ρ ( X )] | θ ∗ | + L E [ K ρ ( X )] + E (cid:2) F ∗ ( X ) (cid:3) ) + 2 dβ − .19 .2 Convergence results We aim to establish the non-asymptotic bound in Wasserstein-2 distance between L ( θ λn ) and π β . Toachieve this, we consider the following decomposition: W ( L ( θ λn ) , π β ) ≤ W ( L ( θ λn ) , L ( ˙ θ λn )) + W ( L ( ˙ θ λn ) , π λ ) + W ( π λ , π β ) . (38)The lemma presented below provides the non-asymptotic estimates for the last two terms in (38). Theorem 3. [12, Corollary 7] Let Assumptions 1, 2, 3, 5 and 6 hold. Then, for any < λ < ¯ λ max given in (10) , the Markov chain ( ˙ θ λn ) n ∈ N admits an invariant measure π λ such that, for all n ∈ N , W ( L ( ˙ θ λn ) , π λ ) ≤ ¯ C e − ˆ a ∗ λn , where ¯ C = ( | θ − θ | + d/ ˆ a ∗ β ) / is given in Lemma 9 (iii) with ˆ a ∗ = ˆ aL/ (ˆ a + L ) . Furthermore, W ( π β , π λ ) ≤ ¯ C , √ λ, where ¯ C , = (cid:0) dL (ˆ a ∗ β ) − (2 λ + (ˆ a ∗ ) − )(1 + λ L + L λ/ ˆ a ) (cid:1) / . (39)The non-asymptotic estimate for the first term in (38) is provided in the following lemma. Lemma 11.
Let Assumptions 1, 2, 3, 5 and 6 hold. For any < λ < ¯ λ max given in (10) , one obtains W ( L ( ˙ θ λn ) , L ( θ λn )) ≤ ¯ C , √ λ, where ¯ C , = (cid:112) c / a ∗ c = (8 L + 16 L E [ K ρ ( X )])( E (cid:2) | θ | (cid:3) + ˆ a − ¯ c )+ (8 L + 40 L E [ K ρ ( X )]) | θ ∗ | + 24 L E [ K ρ ( X )] + 24 E (cid:2) F ∗ ( X ) (cid:3) . (40) Proof.
By using synchronous coupling for the algorithms (34) and (2), one obtains | ˙ θ λn +1 − θ λn +1 | = | ˙ θ λn − θ λn − λ ( h ( ˙ θ λn ) − H ( θ λn , X n +1 )) | = | ˙ θ λn − θ λn | − λ (cid:104) ˙ θ λn − θ λn , h ( ˙ θ λn ) − H ( θ λn , X n +1 ) (cid:105) + λ | h ( ˙ θ λn ) − H ( θ λn , X n +1 ) | ≤ | ˙ θ λn − θ λn | − λ (cid:104) ˙ θ λn − θ λn , h ( ˙ θ λn ) − h ( θ λn ) (cid:105) − λ (cid:104) ˙ θ λn − θ λn , h ( θ λn ) − H ( θ λn , X n +1 ) (cid:105) + 2 λ | h ( ˙ θ λn ) − h ( θ λn ) | + 2 λ | h ( θ λn ) − H ( θ λn , X n +1 ) | , which implies, by taking conditional expectation on both sides and by using Remark 6 E (cid:104) | ˙ θ λn +1 − θ λn +1 | (cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:105) ≤ | ˙ θ λn − θ λn | − a ∗ λ | ˙ θ λn − θ λn | − λ ˆ a + L | h ( ˙ θ λn ) − h ( θ λn ) | + 2 λ | h ( ˙ θ λn ) − h ( θ λn ) | + 2 λ E (cid:104) | h ( θ λn ) − H ( θ λn , X n +1 ) | (cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:105) , where ˆ a ∗ = ˆ aL/ (ˆ a + L ). For λ < ¯ λ max , one obtains by using Remark 1 and 2 E (cid:104) | ˙ θ λn +1 − θ λn +1 | (cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:105) ≤ (1 − a ∗ λ ) | ˙ θ λn − θ λn | + 4 λ E (cid:104) | h ( θ λn ) | (cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:105) + 4 λ E (cid:104) | H ( θ λn , X n +1 ) | (cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:105) ≤ (1 − a ∗ λ ) | ˙ θ λn − θ λn | + 4 λ L E (cid:104) | θ λn − θ ∗ | (cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:105) + 4 λ E (cid:20) (cid:16) (1 + | X n +1 | ) ρ +1 ( L | θ λn − θ ∗ | + L | θ ∗ | + L ) + F ∗ ( X n +1 ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) ˙ θ λn , θ λn (cid:21) (1 − a ∗ λ ) | ˙ θ λn − θ λn | + (4 λ L + 8 λ L E [ K ρ ( X )]) | θ λn − θ ∗ | + 24 λ L E [ K ρ ( X )] | θ ∗ | + 24 λ L E [ K ρ ( X )] + 24 λ E (cid:2) F ∗ ( X ) (cid:3) . Finally, one calculates by using Lemma 10, E (cid:104) | ˙ θ λn +1 − θ λn +1 | (cid:105) ≤ (1 − a ∗ λ ) E (cid:104) | ˙ θ λn − θ λn | (cid:105) + (4 λ L + 8 λ L E [ K ρ ( X )]) E (cid:104) | θ λn − θ ∗ | (cid:105) + 24 λ L E [ K ρ ( X )] | θ ∗ | + 24 λ L E [ K ρ ( X )] + 24 λ E (cid:2) F ∗ ( X ) (cid:3) ≤ (1 − a ∗ λ ) E (cid:104) | ˙ θ λn − θ λn | (cid:105) + λ c , where c = (8 L +16 L E [ K ρ ( X )])( E (cid:2) | θ | (cid:3) +ˆ a − ¯ c )+(8 L +40 L E [ K ρ ( X )]) | θ ∗ | +24 L E [ K ρ ( X )]+24 E (cid:2) F ∗ ( X ) (cid:3) . The result follows by induction. Proof of Theorem 2
One observes that by using Theorem 3 and Lemma 11,
\[
W_2(\mathcal{L}(\theta_n^\lambda), \pi_\beta) \le W_2\big(\mathcal{L}(\theta_n^\lambda), \mathcal{L}(\dot\theta_n^\lambda)\big) + W_2\big(\mathcal{L}(\dot\theta_n^\lambda), \pi_\lambda\big) + W_2(\pi_\lambda, \pi_\beta) \le \bar C_{2,2}\sqrt{\lambda} + \bar C_1 e^{-\hat a^*\lambda n} + \bar C_{2,1}\sqrt{\lambda} \le C_1 e^{-C_0\lambda n} + C_2\sqrt{\lambda},
\]
where
\[
C_0 = \hat a^*, \qquad C_1 = \bar C_1, \qquad C_2 = \bar C_{2,1} + \bar C_{2,2}, \quad (41)
\]
with $\hat a^* = \hat a L/(\hat a + L)$, $\bar C_1$ given in Theorem 3, and $\bar C_{2,1}$ and $\bar C_{2,2}$ given in (39) and (40), respectively.
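The bound in Theorem 2 can be turned into an explicit recipe for choosing the step size and the number of iterations. The following elementary computation is a sketch of such a choice; splitting a target accuracy $\varepsilon$ into two equal halves is our illustrative convention, not a prescription from the analysis above.
\[
\lambda \le \min\Big\{\bar\lambda_{\max},\ \Big(\frac{\varepsilon}{2C_2}\Big)^{2}\Big\} \;\Longrightarrow\; C_2\sqrt{\lambda} \le \frac{\varepsilon}{2},
\qquad
n \ge \frac{1}{C_0\lambda}\log\frac{2C_1}{\varepsilon} \;\Longrightarrow\; C_1 e^{-C_0\lambda n} \le \frac{\varepsilon}{2},
\]
so that $W_2(\mathcal{L}(\theta_n^\lambda), \pi_\beta) \le \varepsilon$ after $n = O\big(\varepsilon^{-2}\log(1/\varepsilon)\big)$ iterations.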
Proof of Corollary 4. The proof follows the same lines as the proof of Corollary 2. To obtain an upper bound for the expected excess risk $\mathbb{E}[U(\hat\theta)] - \inf_{\theta\in\mathbb{R}^d} U(\theta)$, one considers
\[
\mathbb{E}[U(\hat\theta)] - \inf_{\theta\in\mathbb{R}^d} U(\theta) = \Big(\mathbb{E}[U(\hat\theta)] - \mathbb{E}[U(Z_\infty)]\Big) + \Big(\mathbb{E}[U(Z_\infty)] - \inf_{\theta\in\mathbb{R}^d} U(\theta)\Big), \quad (42)
\]
where $\hat\theta = \theta_n^\lambda$ and $Z_\infty \sim \pi_\beta$ with $\pi_\beta(\theta) \propto \exp(-\beta U(\theta))$ for all $\theta\in\mathbb{R}^d$. By using [19, Lemma 3.5], Lemmas 8 and 10, and Theorem 2, the first term on the RHS of (42) can be bounded by
\[
\mathbb{E}[U(\hat\theta)] - \mathbb{E}[U(Z_\infty)] \le \Big(L\big(\mathbb{E}[|\theta_0-\theta^*|^2] + \bar c_0\hat a^{-1} + |\theta^*|^2\big)^{1/2} + |h(0)|\Big)\,W_2(\mathcal{L}(\theta_n^\lambda), \pi_\beta)
\le \Big(L\big(\mathbb{E}[|\theta_0-\theta^*|^2] + \bar c_0\hat a^{-1} + |\theta^*|^2\big)^{1/2} + |h(0)|\Big)\big(C_1 e^{-C_0\lambda n} + C_2\sqrt{\lambda}\big) \le \hat C_1 e^{-\hat C_0\lambda n} + \hat C_2\sqrt{\lambda},
\]
where $\theta^*\in\mathbb{R}^d$ is a minimizer of $U$, and
\[
\hat C_0 = C_0, \qquad \hat C_1 = C_1\Big(L\big(\mathbb{E}[|\theta_0-\theta^*|^2] + \bar c_0\hat a^{-1} + |\theta^*|^2\big)^{1/2} + |h(0)|\Big), \qquad \hat C_2 = C_2\Big(L\big(\mathbb{E}[|\theta_0-\theta^*|^2] + \bar c_0\hat a^{-1} + |\theta^*|^2\big)^{1/2} + |h(0)|\Big), \quad (43)
\]
with $C_0, C_1, C_2$ given in (41) and $\bar c_0$ given in (35). Moreover, the second term on the RHS of (42) can be estimated by using [19, Proposition 3.4], which gives
\[
\mathbb{E}[U(Z_\infty)] - \inf_{\theta\in\mathbb{R}^d} U(\theta) \le \hat C_3/\beta, \quad\text{where}\quad \hat C_3 = \frac{d}{2}\log\bigg(\frac{e\beta L}{d}\Big(\frac{d}{\hat a\beta} + |\theta^*|^2\Big)\bigg). \quad (44)
\]
Finally, one obtains
\[
\mathbb{E}[U(\hat\theta)] - \inf_{\theta\in\mathbb{R}^d} U(\theta) \le \hat C_1 e^{-\hat C_0\lambda n} + \hat C_2\sqrt{\lambda} + \hat C_3/\beta.
\]

5 Applications

5.1 Quantile estimation with $L^2$ regularization

We consider the problem of quantile estimation for AR(1) processes, which has been discussed in [7], [17] and [22] amongst others, with $L^2$ regularization. It is assumed therefore that the data $X_t\in\mathbb{R}$, $t\in\mathbb{Z}$, follow an AR(1) process given by
\[
X_{t+1} = \alpha X_t + \bar\xi_{t+1},
\]
where $\alpha$ is a constant with $|\alpha|<1$ and $(\bar\xi_t)_{t\in\mathbb{Z}}$ are i.i.d. standard normal random variables. The above expression can be rewritten as
\[
X_t = \sum_{j=0}^\infty \alpha^j\,\bar\xi_{t-j}.
\]
One notes that $X_t$ has a stationary distribution $\pi_X$, which is normal with mean $0$ and variance $1/(1-\alpha^2)$. Our task is to identify the $q$-th quantile of the stationary distribution $\pi_X$ using the SGLD algorithm (2); in other words, we aim to solve the following problem:
\[
\min_\theta\ \mathbb{E}\big[l_q(X_\infty - \theta)\big] + \gamma|\theta|^2,
\]
where $X_\infty\sim\pi_X$ and
\[
l_q(z) = \begin{cases} qz, & z\ge 0,\\ (q-1)z, & z<0.\end{cases}
\]
The stochastic gradient $H:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is given by
\[
H(\theta, x) = -q + \mathbb{1}_{\{x<\theta\}} + 2\gamma\theta, \quad (45)
\]
where $\gamma$ is a positive constant. To check Assumption 1, denote $F(\theta,x) = -q + 2\gamma\theta$ and $G(\theta,x) = \mathbb{1}_{\{x<\theta\}}$. It can easily be seen that Assumption 1 holds with $\rho=0$, $L_1=2\gamma$, $L_2=0$ and $K(x)=1$. Then, by Remark 3 and its proof in A.1, Assumption 3 holds with $L = 2\gamma+1$. Moreover, Assumption 4 holds with $A(x) = \gamma I_d$ and $b(x) = q^2/(4\gamma)$, which implies $a=\gamma$ and $b=q^2/(4\gamma)$.
One notes that the value of the $q$-th quantile of $\pi_X$ is given by $\theta^* = N^{-1}(q)/\sqrt{1-\alpha^2}$, where $N(\cdot)$ is the cumulative distribution function of the standard normal distribution. For the simulation, set $\alpha = 0.5$ and $q = 0.95$, and thus $\theta^* = 1.89$. Moreover, let $m = 1$, $\theta_0 = 3$, $\beta = 10$ and $\gamma = 10^{-}$. Note that we use the step restriction given in Remark 7 for all the examples in this section.
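For readers who wish to reproduce this experiment, the following is a minimal Python sketch of the update (2) with the stochastic gradient (45). The specific constants below (step size, inverse temperature, iteration and burn-in counts, seed) are illustrative stand-ins, not the exact values used for Figure 5.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (stand-ins, not the paper's exact values).
alpha, q, gamma = 0.5, 0.95, 1e-4      # AR(1) coefficient, quantile level, L2 penalty
lam, beta, n_iter, burn = 1e-3, 1e10, 10**6, 10**4

def H(theta, x):
    """Stochastic gradient (45): H(theta, x) = -q + 1_{x < theta} + 2*gamma*theta."""
    return -q + float(x < theta) + 2.0 * gamma * theta

theta, x = 3.0, 0.0                     # theta_0 = 3; AR(1) state
path = []
for n in range(n_iter):
    x = alpha * x + rng.standard_normal()         # fresh AR(1) data point X_{n+1}
    noise = np.sqrt(2.0 * lam / beta) * rng.standard_normal()
    theta = theta - lam * H(theta, x) + noise     # SGLD update (2)
    if n >= burn:
        path.append(theta)

# Should stabilise near theta* = N^{-1}(q) / sqrt(1 - alpha^2).
print(np.mean(path))
```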
In Figure 5.1, the left graph is obtained by using the SGLD algorithm (2) with $\lambda = 10^{-}$ and the number of iterations $n = 10$. It shows the path of $\theta_n$ with the first 10000 iterations discarded; the path stabilises around the true value $\theta^* = 1.89$. The right graph of Figure 5.1 illustrates the rate of convergence of the SGLD algorithm in Wasserstein-1 distance based on 5000 samples. The slope of the results in $W_1$ obtained using numerical experiments is 0.5022, which supports our theoretical finding in Theorem 1 with rate 1/2.

Figure 5.1: [Left] Path of $\theta_n$ when $q = 0.95$. [Right] Rate of convergence of the SGLD algorithm.

5.2 Computing VaR and CVaR

In this section, we consider the problem of computing Value-at-Risk (VaR) and Conditional-Value-at-Risk (CVaR), which are two commonly used risk measures in financial risk management. In order to obtain the two quantities, one considers the following optimization problem:
\[
\min_\theta V(\theta) = \min_\theta\bigg(\mathbb{E}\Big[\theta + \frac{1}{1-\bar q}\big(f(X) - \theta\big)^+\Big] + \gamma|\theta|^2\bigg), \quad (46)
\]
where $0 < \bar q < 1$, $f$ is continuous, and $f(X)$ is integrable with respect to the probability measure. As noted in [4], $f$ can represent more complicated payoff structures than simple vanilla instruments, while $X$ can accommodate a large family of asset distributions, including those generated by stochastic/local volatility models; see e.g. [11], [20], [21] and references therein. Then, by [4, Proposition 2.1], $\mathrm{VaR}_{\bar q}(f(X)) = \operatorname{argmin}_\theta V(\theta)$ and $\mathrm{CVaR}_{\bar q}(f(X)) = \min_\theta V(\theta)$.

5.2.1 A single asset

To compute VaR, the stochastic gradient $H:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ of the SGLD algorithm (2) is given by
\[
H(\theta, x) = 1 - \frac{1}{1-\bar q}\,\mathbb{1}_{\{f(x)\ge\theta\}} + 2\gamma\theta = -\frac{\bar q}{1-\bar q} + \frac{1}{1-\bar q}\,\mathbb{1}_{\{f(x)<\theta\}} + 2\gamma\theta.
\]
Let $f(x) = x$; one notices that the above expression has a similar form to (45). Then, one can check that Assumptions 1-4 are satisfied. More precisely, denote by $F(\theta,x) = -\bar q/(1-\bar q) + 2\gamma\theta$ and $G(\theta,x) = \mathbb{1}_{\{x<\theta\}}/(1-\bar q)$; Assumption 1 holds with $\rho = 0$, $L_1 = 2\gamma$, $L_2 = 0$ and $K(x) = 1/(1-\bar q)$. Let $X$ be a one-dimensional random variable with finite fourth moment; then Assumption 2 is satisfied. Denote by $\bar c_d$ the upper bound of the density of $X$; Assumption 3 holds with $L = 2\gamma + \bar c_d/(1-\bar q)$. Furthermore, Assumption 4 holds with $A(x) = \gamma I_d$ and $b(x) = \bar q^2/(4\gamma(1-\bar q)^2)$, which implies $a = \gamma$ and $b = \bar q^2/(4\gamma(1-\bar q)^2)$.
For the numerical experiments, we set $\theta_0 = 0$, $\beta = 10$, $\gamma = 10^{-}$, $\lambda = 10^{-}$ and the number of iterations $n = 10$. Tables 1 and 2 present VaR and CVaR for the normal distribution and the Student's t distribution. VaR* and CVaR* in the tables denote the theoretical values, while VaR_SGLD and CVaR_SGLD denote the numerical approximations obtained from the SGLD algorithm (2). Each approximation in the tables is based on 10000 samples and is followed by its sample standard deviation in brackets.

|               | q̄ = 0.95 |       |               |                 | q̄ = 0.99 |       |               |                 |
|               | VaR*     | CVaR* | VaR_SGLD      | CVaR_SGLD       | VaR*     | CVaR* | VaR_SGLD      | CVaR_SGLD       |
|---------------|----------|-------|---------------|-----------------|----------|-------|---------------|-----------------|
| µ = 0, σ = 1  | 1.645    | 2.062 | 1.642 (0.02)  | 2.062 (0.0006)  | 2.326    | 2.677 | 2.329 (0.04)  | 2.662 (0.0038)  |
| µ = 1, σ = 2  | 4.290    | 5.124 | 4.294 (0.03)  | 5.126 (0.0006)  | 5.653    | 6.335 | 5.640 (0.06)  | 6.336 (0.0032)  |
| µ = 3, σ = 5  | 11.224   | 13.311| 11.230 (0.05) | 13.305 (0.0006) | 14.632   | 16.337| 14.643 (0.11) | 16.313 (0.006)  |

Table 1: VaR and CVaR for the normal distribution N(µ, σ).

|              | q̄ = 0.95 |       |              |                | q̄ = 0.99 |       |              |                |
|              | VaR*     | CVaR* | VaR_SGLD     | CVaR_SGLD      | VaR*     | CVaR* | VaR_SGLD     | CVaR_SGLD      |
|--------------|----------|-------|--------------|----------------|----------|-------|--------------|----------------|
| d.f. = 10    | 1.812    | 2.416 | 1.808 (0.02) | 2.407 (0.0005) | 2.764    | 3.357 | 2.767 (0.05) | 3.350 (0.003)  |
| d.f. = 7     | 1.895    | 2.595 | 1.895 (0.03) | 2.594 (0.0008) | 2.998    | 3.757 | 3.001 (0.05) | 3.782 (0.0024) |
| d.f. = 3     | 2.353    | 3.876 | 2.358 (0.03) | 3.873 (0.0008) | 4.541    | 6.968 | 4.542 (0.08) | 6.967 (0.0028) |

Table 2: VaR and CVaR for the Student's t distribution.

Figure 5.2: [Left] Path of $\theta_n$ (VaR) for the Student's t-distribution. [Right] Rate of convergence of the SGLD algorithm based on 5000 samples.

In addition, in Figure 5.2, the left graph illustrates the path of $\theta_n$ for the t-distribution, whereas the right graph shows that the rate of convergence of the SGLD algorithm (2) is 0.4811. One notes that the samples from $\pi_\beta$ are generated by running the SGLD algorithm with $\lambda = 10^{-}$.
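The single-asset computation can be sketched in a few lines of Python. The hyperparameters, the choice $X \sim N(0,1)$ and the Monte Carlo estimate of $V(\hat\theta)$ below are illustrative assumptions, not the paper's exact experimental configuration; CVaR is obtained by plugging the SGLD output into the objective of (46).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in hyperparameters (not the paper's exact values).
q_bar, gamma, lam, beta, n_iter = 0.95, 1e-4, 1e-3, 1e10, 10**6

def H(theta, x):
    """Stochastic gradient of (46) with f(x) = x:
    H(theta, x) = 1 - 1_{x >= theta}/(1 - q_bar) + 2*gamma*theta."""
    return 1.0 - float(x >= theta) / (1.0 - q_bar) + 2.0 * gamma * theta

theta = 0.0
for _ in range(n_iter):
    x = rng.standard_normal()                      # one fresh sample of X ~ N(0, 1)
    theta += -lam * H(theta, x) + np.sqrt(2.0 * lam / beta) * rng.standard_normal()

var_hat = theta                                    # VaR_{q_bar} = argmin V
# CVaR_{q_bar} = min V: estimate V(theta_hat) by Monte Carlo.
xs = rng.standard_normal(10**6)
cvar_hat = theta + np.mean(np.maximum(xs - theta, 0.0)) / (1.0 - q_bar) + gamma * theta**2
print(var_hat, cvar_hat)   # compare with VaR* = 1.645, CVaR* = 2.062 for N(0, 1)
```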
5.2.2 Minimizing CVaR for a portfolio of assets

To minimize CVaR for a given portfolio, we consider the following optimization problem:
\[
\min_{\hat\theta} V(\hat\theta) = \min_{\hat\theta}\ \mathbb{E}\bigg[\frac{1}{1-\bar q}\Big(\sum_{i=1}^n g_i(w)X_i - \theta\Big)^+ + \theta\bigg] + \gamma|\hat\theta|^2, \quad (47)
\]
where the parameter $\hat\theta := (\theta, w)^\top = (\theta, w_1, \dots, w_n)^\top$ and
\[
g_i(w) := \frac{e^{w_i}}{\sum_{j=1}^n e^{w_j}} \in (0, 1)
\]
for $i = 1, \dots, n$. By solving (47), we obtain not only VaR for a given portfolio, but also the optimal weight for each asset in the portfolio such that CVaR is minimized.
For reasons of brevity, we assume here that the $X_i$'s, $i = 1, \dots, n$, are i.i.d. one-dimensional random variables (with finite fourth moments). Our results can be naturally extended to the case of dependent data streams via the concept of $L$-mixing, as explained in [8].
Let $c_X$ and $c_{\bar X}$ denote the first and second absolute moments, respectively, of the $X_i$'s. Moreover, let $|x|f_{X_i}(x)$ be bounded for any $i$ and $x\in\mathbb{R}$. Note that this latter requirement is satisfied for a wide range of distributions, for example, the distributions shown in Table 3. Then, the stochastic gradient $H_{\hat\theta}(\hat\theta, x): \mathbb{R}^{n+1}\times\mathbb{R}^n \to \mathbb{R}^{n+1}$ is defined as
\[
H_{\hat\theta}(\hat\theta, x) := \big(H_\theta(\hat\theta, x), H_{w_1}(\hat\theta, x), \dots, H_{w_n}(\hat\theta, x)\big)^\top,
\]
where $H_\theta(\hat\theta, x): \mathbb{R}^{n+1}\times\mathbb{R}^n \to \mathbb{R}$ and $H_{w_j}(\hat\theta, x): \mathbb{R}^{n+1}\times\mathbb{R}^n \to \mathbb{R}$ for all $j$ are given by
\[
H_\theta(\hat\theta, x) = 1 - \frac{1}{1-\bar q}\,\mathbb{1}_{\{\sum_{i=1}^n g_i(w)x_i \ge \theta\}} + 2\gamma\theta,
\]
and
\[
H_{w_j}(\hat\theta, x) = \frac{1}{1-\bar q}\,\hat g_{w_j}(w, x)\,\mathbb{1}_{\{\sum_{i=1}^n g_i(w)x_i \ge \theta\}} + 2\gamma w_j,
\]
where $\hat g_{w_j}(w, x) = \sum_{i=1}^n \frac{\partial g_i(w)}{\partial w_j}\,x_i$ for any $j = 1, \dots, n$, with
\[
\frac{\partial g_j(w)}{\partial w_j} = \frac{e^{w_j}\big(\sum_{l\neq j} e^{w_l}\big)}{\big(\sum_{l=1}^n e^{w_l}\big)^2}, \qquad \frac{\partial g_i(w)}{\partial w_j} = -\frac{e^{w_i}e^{w_j}}{\big(\sum_{l=1}^n e^{w_l}\big)^2} \quad \text{for } i\neq j.
\]
One notes that $|\hat g_{w_j}(w, x)| \le \sum_{i=1}^n |x_i|$ for any $j$. Moreover, if Assumptions 1-4 hold for $H_\theta$ and for $H_{w_j}$ for any $j$, then the assumptions hold for $H_{\hat\theta}$.
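As a complement to these formulas, the following Python sketch runs the SGLD recursion on the extended parameter $\hat\theta = (\theta, w)$ for a toy two-asset portfolio. The sampler for $(X_1, X_2)$, the hyperparameters and the seed are illustrative assumptions rather than the paper's exact experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_assets, q_bar, gamma, lam, beta = 2, 0.95, 1e-4, 1e-3, 1e10

def sample_X():
    # Toy example (an assumption): two independent standard normal assets.
    return rng.standard_normal(n_assets)

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def H_hat(theta, w, x):
    """Stochastic gradient of (47) in (theta, w); portfolio weights are g(w)."""
    g = softmax(w)
    ind = float(g @ x >= theta)                 # 1_{sum_i g_i(w) x_i >= theta}
    H_theta = 1.0 - ind / (1.0 - q_bar) + 2.0 * gamma * theta
    # dg_i/dw_j = g_i (1_{i=j} - g_j), hence ghat_j = sum_i (dg_i/dw_j) x_j
    # collapses to g_j * (x_j - g @ x), matching the partial derivatives above.
    ghat = g * (x - g @ x)
    H_w = ghat * ind / (1.0 - q_bar) + 2.0 * gamma * w
    return H_theta, H_w

theta, w = 0.0, np.zeros(n_assets)
for _ in range(10**6):
    x = sample_X()
    H_t, H_w = H_hat(theta, w, x)
    theta += -lam * H_t + np.sqrt(2 * lam / beta) * rng.standard_normal()
    w += -lam * H_w + np.sqrt(2 * lam / beta) * rng.standard_normal(n_assets)

print(softmax(w), theta)   # optimal weights g(w) and the portfolio VaR estimate
```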
We first check the assumptions for $H_\theta$. Denote by
\[
F_\theta(\hat\theta, x) = 2\gamma\theta, \qquad G_\theta(\hat\theta, x) = 1 - \frac{1}{1-\bar q}\,\mathbb{1}_{\{\sum_{i=1}^n g_i(w)x_i \ge \theta\}},
\]
so that $H_\theta = F_\theta + G_\theta$. Assumption 1 holds with $\rho = 0$, $L_1 = 2\gamma$, $L_2 = 0$ and $K(x) = (2-\bar q)/(1-\bar q)$. By taking into consideration the expression of $K(x)$ and the construction of the problem, Assumption 2 is satisfied. Assumption 4 holds with $A(x) = 2\gamma I_d$ and $b(x) = 0$, which implies $a = 2\gamma$ and $b = 0$. To check Assumption 3, one considers $\hat\theta' := (\bar\theta, w)^\top$ and calculates, assuming without loss of generality that $g_n(w) = \max\{g_1(w), \dots, g_n(w)\}$,
\[
\mathbb{E}\big[\big|H_\theta(\hat\theta, X) - H_\theta(\hat\theta', X)\big|\big] \le 2\gamma\,|\theta - \bar\theta| + \frac{1}{1-\bar q}\,\mathbb{E}\big[\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(w)X_i \ge \bar\theta\}}\big|\big] \le 2\gamma\,\big|\hat\theta - \hat\theta'\big| + \frac{1}{1-\bar q}\,(E_1 + E_2),
\]
where $E_1 = \mathbb{E}\big[\mathbb{1}_{\{\theta \le \sum_i g_i(w)X_i \le \bar\theta\}}\big]$ and $E_2 = \mathbb{E}\big[\mathbb{1}_{\{\bar\theta \le \sum_i g_i(w)X_i \le \theta\}}\big]$. To estimate $E_1$, one writes
\[
\mathbb{E}\big[\mathbb{1}_{\{\theta \le \sum_i g_i(w)X_i \le \bar\theta\}}\big] = \mathbb{E}\Big[\mathbb{E}\big[\mathbb{1}_{\{(\theta - \sum_{i\neq n} g_i(w)X_i)/g_n(w) \le X_n \le (\bar\theta - \sum_{i\neq n} g_i(w)X_i)/g_n(w)\}}\,\big|\,X_1, \dots, X_{n-1}\big]\Big]
\]
\[
= \int_{-\infty}^\infty\!\!\cdots\!\int_{-\infty}^\infty \int_{(\theta - \sum_{i\neq n} g_i(w)x_i)/g_n(w)}^{(\bar\theta - \sum_{i\neq n} g_i(w)x_i)/g_n(w)} f_{X_n}(z)\,dz\, f_{X_{n-1}}(x_{n-1})\,dx_{n-1}\cdots f_{X_1}(x_1)\,dx_1 \le n\,c_{X_n}\,\big|\hat\theta - \hat\theta'\big|,
\]
where we use the fact $g_n(w) \ge 1/n$ in the last inequality, and $c_{X_n}$ denotes the upper bound of the density of $X_n$. $E_2$ can be estimated by using similar arguments. Then, one obtains
\[
\mathbb{E}\big[\big|H_\theta(\hat\theta, X) - H_\theta(\hat\theta', X)\big|\big] \le \big(2\gamma + 2n\,c_{X_n}/(1-\bar q)\big)\,\big|\hat\theta - \hat\theta'\big|,
\]
which implies that Assumption 3 holds with $L = 2\gamma + 2n\,c_{X_n}/(1-\bar q)$.
Next, we check the assumptions for $H_{w_j}$. Denote by
\[
F_{w_j}(\hat\theta, x) = 2\gamma w_j, \qquad G_{w_j}(\hat\theta, x) = \frac{1}{1-\bar q}\,\hat g_{w_j}(w, x)\,\mathbb{1}_{\{\sum_i g_i(w)x_i \ge \theta\}},
\]
so that $H_{w_j} = F_{w_j} + G_{w_j}$. Assumption 1 holds with $\rho = 0$, $L_1 = 2\gamma$, $L_2 = 0$ and $K(x) = \sum_i |x_i|/(1-\bar q)$. By taking into consideration the expression of $K(x)$ and the construction of the problem, Assumption 2 is satisfied. Assumption 4 holds with $A(x) = 2\gamma I_d$ and $b(x) = 0$, which implies $a = 2\gamma$ and $b = 0$. Then, we check Assumption 3 for $H_{w_1}$; the arguments follow the same lines for any other $H_{w_j}$, $j = 2, \dots, n$. Consider $\hat\theta^\sharp := (\theta, \bar w)^\top = (\theta, \bar w_1, w_2, \dots, w_n)^\top$. Then, one calculates
\[
\mathbb{E}\big[\big|H_{w_1}(\hat\theta, X) - H_{w_1}(\hat\theta^\sharp, X)\big|\big] \le 2\gamma\,|w_1 - \bar w_1| + \frac{1}{1-\bar q}\,\mathbb{E}\big[\big|\hat g_{w_1}(w, X)\,\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \hat g_{w_1}(\bar w, X)\,\mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\big]
\]
\[
\le 2\gamma\,\big|\hat\theta - \hat\theta^\sharp\big| + \frac{1}{1-\bar q}\,\mathbb{E}\big[\big|\hat g_{w_1}(w, X) - \hat g_{w_1}(\bar w, X)\big|\,\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}}\big] + \frac{1}{1-\bar q}\,\mathbb{E}\big[\big|\hat g_{w_1}(\bar w, X)\big|\,\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\big]
\]
\[
\le 2\gamma\,\big|\hat\theta - \hat\theta^\sharp\big| + \frac{2n\,c_X}{1-\bar q}\,|w_1 - \bar w_1| + \frac{1}{1-\bar q}\,\mathbb{E}\big[\big|\hat g_{w_1}(\bar w, X)\big|\,\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\big],
\]
where the third inequality holds due to the fact that $|\hat g_{w_1}(w, X) - \hat g_{w_1}(\bar w, X)| \le 2|w_1 - \bar w_1|\sum_i |X_i|$. Then, by using $|\hat g_{w_1}(\bar w, x)| \le \sum_i |x_i|$,
\[
\mathbb{E}\big[\big|H_{w_1}(\hat\theta, X) - H_{w_1}(\hat\theta^\sharp, X)\big|\big] \le \big(2\gamma + 2n\,c_X/(1-\bar q)\big)\big|\hat\theta - \hat\theta^\sharp\big| + \frac{1}{1-\bar q}\,\mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\Big]
\]
\[
\le \big(2\gamma + 2n\,c_X/(1-\bar q)\big)\big|\hat\theta - \hat\theta^\sharp\big| + \frac{2(n-1)\Big(c_X\big(\bar c_{X_n} + \bar c_{X_1}\big) + \big(c_{\bar X} + (n-1)c_X\big)\big(c_{X_n} + c_{X_1}\big)\Big)}{1-\bar q}\,\big|\hat\theta - \hat\theta^\sharp\big|, \quad (48)
\]
where $c_X$ and $c_{\bar X}$ denote the first and the second absolute moments of the $X_i$'s, respectively, $\bar c_{X_i}$ is the upper bound of the function $|x|f_{X_i}$, and $c_{X_i}$ is the upper bound of the density of $X_i$. The detailed calculations behind the last inequality in (48) are given in Appendix A.3. Thus, Assumption 3 holds with
\[
L = 2\gamma + \frac{2n\,c_X}{1-\bar q} + \frac{2(n-1)\Big(c_X\big(\bar c_{X_n} + \bar c_{X_1}\big) + \big(c_{\bar X} + (n-1)c_X\big)\big(c_{X_n} + c_{X_1}\big)\Big)}{1-\bar q}.
\]
For the numerical experiments, we set $\theta_0 = 0$, $\beta = 10$, $\gamma = 10^{-}$, $\lambda = 10^{-}$ and the number of iterations $n = 10$. Table 3 illustrates the 95% VaR and CVaR obtained using the SGLD algorithm for a portfolio of two assets $X_1$ and $X_2$ with weights $g_1(w)$ and $g_2(w)$, respectively. The reference values $w_1^*$, $w_2^*$, VaR* and CVaR* are obtained numerically in the following way (a grid-search sketch implementing these steps is given at the end of this subsection):
1. First, we create 100 evenly spaced numbers over the interval $[0, 1]$.
2. Then, for each pair of assets $X_1$ and $X_2$, we assign each of the 100 numbers to $g_1(w)$, which is the weight of $X_1$, and calculate the 95% CVaR for the combination $g_1(w)X_1 + g_2(w)X_2$.
3. Finally, we obtain the minimum CVaR and the corresponding $g_1(w)$ among the 100 values. We denote them CVaR* and $g_1(w^*)$. Here, one notes that the corresponding VaR* can be calculated using the optimal weights $g_1(w^*)$ and $g_2(w^*)$.

|                 |                     | SGLD algorithm |          |          |           | Reference |      |        |        |
| X₁              | X₂                  | w₁        | w₂        | VaR_SGLD | CVaR_SGLD | w₁*  | w₂*  | VaR*   | CVaR*  |
|-----------------|---------------------|-----------|-----------|----------|-----------|------|------|--------|--------|
| N(500, ·)       | N(0, 10⁻⁴)          | 0.00002   | 0.99998   | 0.025    | 0.03      | 0    | 1    | 0.016  | 0.021  |
| N(0, ·)         | N(0, 10⁻⁴)          | 0.000006  | 0.999994  | 0.016    | 0.25      | 0    | 1    | 0.016  | 0.021  |
| N(1, 4)         | N(0, 1)             | 0.111     | 0.889     | 1.615    | 2.004     | 0.11 | 0.89 | 1.617  | 1.999  |
| N(0, 1)         | t with d.f. = 2.01  | 0.917     | 0.083     | 1.567    | 1.975     | 0.9  | 0.1  | 1.531  | 1.971  |
| N(0, 1)         | t with d.f. = 10    | 0.577     | 0.423     | 1.236    | 1.554     | 0.58 | 0.42 | 1.224  | 1.553  |
| N(0, 1)         | t with d.f. = 1000  | 0.503     | 0.497     | 1.15     | 1.46      | 0.5  | 0.5  | 1.165  | 1.461  |
| N(1, 4)         | t with d.f. = 2.01  | 0.596     | 0.404     | 2.941    | 4.130     | 0.61 | 0.39 | 2.985  | 4.115  |
| N(1, 4)         | t with d.f. = 10    | 0.172     | 0.828     | 1.743    | 2.290     | 0.17 | 0.83 | 1.779  | 2.286  |
| N(1, 4)         | t with d.f. = 1000  | 0.113     | 0.887     | 1.594    | 2.008     | 0.11 | 0.89 | 1.619  | 2.002  |
| N(0, 1)         | Logistic(0, 1)      | 0.775     | 0.225     | 1.422    | 1.816     | 0.78 | 0.22 | 1.442  | 1.813  |
| N(0, 1)         | Logistic(0, 29)     | 0.999     | 0.001     | 1.633    | 2.110     | 1    | 0    | 1.645  | 2.063  |
| N(0, 1)         | Logistic(2, 10)     | 0.997     | 0.003     | 1.650    | 2.101     | 1    | 0    | 1.648  | 2.065  |
| N(1, 4)         | Logistic(0, 1)      | 0.402     | 0.598     | 2.635    | 3.262     | 0.4  | 0.6  | 2.607  | 3.261  |
| N(1, 4)         | Logistic(0, 29)     | 0.998     | 0.002     | 4.284    | 5.145     | 1    | 0    | 4.284  | 5.116  |
| N(1, 4)         | Logistic(2, 10)     | 0.991     | 0.009     | 4.255    | 5.132     | 0.99 | 0.01 | 4.283  | 5.114  |
| N(0, 1)         | Lognormal(0, 1)     | 0.966     | 0.034     | 1.662    | 2.068     | 0.97 | 0.03 | 1.647  | 2.054  |
| N(0, 1)         | Lognormal(0, 0.01)  | 0.074     | 0.926     | 1.145    | 1.205     | 0.07 | 0.93 | 1.132  | 1.186  |
| N(0, 1)         | Lognormal(1, 4)     | 0.9997    | 0.0003    | 1.674    | 2.136     | 1    | 0    | 1.645  | 2.062  |
| N(1, 4)         | Lognormal(0, 1)     | 0.732     | 0.268     | 3.750    | 4.605     | 0.74 | 0.26 | 3.771  | 4.599  |
| N(1, 4)         | Lognormal(0, 0.01)  | 0.010     | 0.989     | 1.173    | 1.301     | 0    | 1    | 1.179  | 1.230  |
| N(1, 4)         | Lognormal(1, 4)     | 0.997     | 0.003     | 4.266    | 5.194     | 1    | 0    | 4.292  | 5.129  |
| Logistic(0, 1)  | Lognormal(0, 1)     | 0.817     | 0.183     | 2.797    | 3.727     | 0.81 | 0.19 | 2.814  | 3.724  |
| Logistic(0, 1)  | Lognormal(0, 0.01)  | 0.022     | 0.978     | 1.169    | 1.256     | 0.02 | 0.98 | 1.164  | 1.217  |
| Logistic(0, 1)  | Lognormal(1, 4)     | 0.997     | 0.003     | 2.961    | 4.030     | 1    | 0    | 2.947  | 3.971  |
| Logistic(2, 10) | Lognormal(0, 1)     | 0.043     | 0.956     | 5.245    | 8.412     | 0.04 | 0.96 | 5.198  | 8.400  |
| Logistic(2, 10) | Lognormal(0, 0.01)  | 0.009     | 0.991     | 1.184    | 1.315     | 0    | 1    | 1.179  | 1.229  |
| Logistic(2, 10) | Lognormal(1, 4)     | 0.996     | 0.004     | 31.651   | 41.748    | 0.99 | 0.01 | 31.420 | 41.738 |

Table 3: 95% VaR and CVaR for portfolios of two assets $X_1$, $X_2$ of the form $w_1X_1 + w_2X_2$.

Figure 5.3: Rate of convergence of the SGLD algorithm for $w_1$ based on 5000 samples.

Figure 5.3 shows that the rate of convergence of the SGLD algorithm (2) for the parameter $w_1$ is 0.5319, which supports the theoretical finding in Theorem 1. One notes that the samples from $\pi_\beta$ are generated by running the SGLD algorithm with $\lambda = 10^{-}$.
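The grid-search procedure described in steps 1-3 above can be sketched as follows; the sample sizes, the seed and the choice of the asset pair $N(1,4)$, $N(0,1)$ are illustrative assumptions, and the empirical CVaR is computed as the average of the losses at or above the empirical quantile.

```python
import numpy as np

rng = np.random.default_rng(3)
q_bar = 0.95

def var_cvar(samples, q_bar):
    """Empirical VaR and CVaR: quantile, and mean of the tail above it."""
    var = np.quantile(samples, q_bar)
    return var, samples[samples >= var].mean()

# Toy pair of assets (an assumption): X1 ~ N(1, 4), X2 ~ N(0, 1).
x1 = 1.0 + 2.0 * rng.standard_normal(10**6)
x2 = rng.standard_normal(10**6)

best = None
for g1 in np.linspace(0.0, 1.0, 100):         # step 1: 100 evenly spaced weights
    var, cv = var_cvar(g1 * x1 + (1.0 - g1) * x2, q_bar)   # step 2: CVaR of the mix
    if best is None or cv < best[2]:
        best = (g1, var, cv)                  # step 3: keep the minimiser
print(best)   # (g1(w*), VaR*, CVaR*); compare with the N(1,4) / N(0,1) row of Table 3
```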
References

[1] M. Barkhagen, N. H. Chau, É. Moulines, M. Rásonyi, S. Sabanis and Y. Zhang. On stochastic gradient Langevin dynamics with stationary data streams in the logconcave case. Preprint, 2018. arXiv:1812.02709
[2] N. Brosse, A. Durmus and É. Moulines. The promises and pitfalls of stochastic gradient Langevin dynamics. Advances in Neural Information Processing Systems, 8268-8278, 2018.
[3] M. Benaïm, J.-C. Fort and G. Pagès. Convergence of the one-dimensional Kohonen algorithm. Advances in Applied Probability, 30(3): 850-869, 1998.
[4] O. Bardou, N. Frikha and G. Pagès. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 15(3): 173-210, 2009.
[5] H. Cardot, P. Cénac and P.-A. Zitt. Recursive estimation of the conditional geometric median in Hilbert spaces. Electronic Journal of Statistics, 6: 2535-2562, 2012.
[6] H. Cardot, P. Cénac and P.-A. Zitt. Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19(1): 18-43, 2013.
[7] N. H. Chau, Ch. Kumar, M. Rásonyi and S. Sabanis. On fixed gain recursive estimators with discontinuity in the parameters. ESAIM Probability and Statistics, 23: 217-244, 2019.
[8] N. H. Chau, É. Moulines, M. Rásonyi, S. Sabanis and Y. Zhang. On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case. Preprint, 2019. arXiv:1905.13142
[9] H. Djellout, A. Guillin and L. Wu. Transportation cost-information inequalities and applications to random dynamical systems and diffusions. The Annals of Probability, 32(3B): 2702-2732, 2004.
[10] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. To appear in Stochastic Processes and their Applications, 2019.
[11] B. Dupire. Pricing with a Smile. Risk, 7(1): 18-20, 1994.
[12] A. Durmus and É. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Preprint, 2018. arXiv:1605.01559v3
[13] A. Eberle. Reflection couplings and contraction rates for diffusions. Probability Theory and Related Fields, 166: 851-886, 2016.
[14] A. Eberle, A. Guillin and R. Zimmer. Quantitative Harris-type theorems for diffusions and McKean-Vlasov processes. In Press, Transactions of the American Mathematical Society, 2018. https://doi.org/10.1090/tran/7576
[15] G. Fort, É. Moulines, A. Schreck and M. Vihola. Convergence of Markovian Stochastic Approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54(2): 866-893, 2016.
[16] C.-R. Hwang. Laplace's method revisited: weak convergence of probability measures. The Annals of Probability, 8(6): 1177-1182, 1980.
[17] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: Journal of the Econometric Society, 33-50, 1978.
[18] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer, 2004.
[19] M. Raginsky, A. Rakhlin and M. Telgarsky. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. Proceedings of Machine Learning Research, (65): 1674-1703, 2017.
[20] S. Sabanis. Stochastic volatility and the mean reverting process. Journal of Futures Markets, 23(1): 33-47, 2003.
[21] S. Sabanis. Stochastic volatility. International Journal of Theoretical and Applied Finance, 5(5): 515-530, 2002.
[22] I. Takeuchi, Q. V. Le, T. D. Sears and A. J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7(Jul): 1231-1264, 2006.
[23] P. Xu, J. Chen, D. Zhou and Q. Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 3122-3133, 2018.
Appendix
A.1 Proof of the claim in Remark 3
We adapt the proof from [7, Lemma 4.7] and extend it to an $\mathbb{R}^m$-valued random variable $X_0$. It suffices to consider
\[
H(\theta, X_0) = \dot g(\theta, X_0)\,\prod_{i=1}^m \mathbb{1}_{\{X_0^{(i)} \in I_i(\theta)\}},
\]
where $\theta\in\mathbb{R}^d$, $\dot g$ is bounded and jointly Lipschitz continuous, i.e. there exist $L_1, L_2, K_0 > 0$ such that, for any $\theta, \theta'\in\mathbb{R}^d$ and $x, x'\in\mathbb{R}^m$,
\[
|\dot g(\theta, x) - \dot g(\theta', x')| \le (1 + |x| + |x'|)^\rho\big(L_1|\theta - \theta'| + L_2|x - x'|\big), \qquad |\dot g(\theta, x)| \le K_0,
\]
and the intervals $I_i(\theta)$ take the form $(-\infty, \bar g^{(i)}(\theta))$ with $\bar g^{(i)}$ Lipschitz. One notices that the proof follows the same lines when $I_i(\theta)$ takes the form $(\bar g^{(i)}(\theta), \infty)$ or $(\tilde g^{(i)}(\theta), \hat g^{(i)}(\theta))$ with $\bar g^{(i)}, \tilde g^{(i)}, \hat g^{(i)}$ Lipschitz. One writes
\[
\big|H(\theta, X_0) - H(\theta', X_0)\big| \le \Big|\dot g(\theta, X_0)\prod_{i=1}^m\mathbb{1}_{\{X_0^{(i)} < \bar g^{(i)}(\theta)\}} - \dot g(\theta', X_0)\prod_{i=1}^m\mathbb{1}_{\{X_0^{(i)} < \bar g^{(i)}(\theta)\}}\Big| + \Big|\dot g(\theta', X_0)\prod_{i=1}^m\mathbb{1}_{\{X_0^{(i)} < \bar g^{(i)}(\theta)\}} - \dot g(\theta', X_0)\prod_{i=1}^m\mathbb{1}_{\{X_0^{(i)} < \bar g^{(i)}(\theta')\}}\Big|
\le L_1(1 + 2|X_0|)^\rho\,|\theta - \theta'| + K_0\,\mathbb{1}_{\bigcap_{i=1}^m\{X_0^{(i)} \in [\bar g^{(i)}(\theta),\, \bar g^{(i)}(\theta'))\}},
\]
where $K_\rho(x)$ for any $x\in\mathbb{R}^m$ is defined in (5), and we assume without loss of generality that $\bar g^{(i)}(\theta) \le \bar g^{(i)}(\theta')$ for all $i = 1, \dots, m$. By taking expectations on both sides and by using the Cauchy-Schwarz inequality, one obtains
\[
\mathbb{E}\big[\big|H(\theta, X_0) - H(\theta', X_0)\big|\big] \le L_1\,\mathbb{E}\big[(1 + 2|X_0|)^\rho\big]\,|\theta - \theta'| + K_0\,\mathbb{P}\bigg(\bigcap_{i=1}^m\big\{X_0^{(i)} \in [\bar g^{(i)}(\theta),\, \bar g^{(i)}(\theta'))\big\}\bigg)
\]
\[
\le L_1\,\mathbb{E}\big[(1 + 2|X_0|)^\rho\big]\,|\theta - \theta'| + K_0\int_{\bar g^{(m)}(\theta)}^{\bar g^{(m)}(\theta')}\!\!\cdots\int_{\bar g^{(1)}(\theta)}^{\bar g^{(1)}(\theta')} f_{X_0}\big(x^{(1)}, \dots, x^{(m)}\big)\,dx^{(1)}\cdots dx^{(m)}
\le L_1\,\mathbb{E}\big[(1 + 2|X_0|)^\rho\big]\,|\theta - \theta'| + K_0\int_{\bar g^{(1)}(\theta)}^{\bar g^{(1)}(\theta')} f_{X_0^{(1)}}\big(x^{(1)}\big)\,dx^{(1)}
\]
\[
\le L_1\,\mathbb{E}\big[(1 + 2|X_0|)^\rho\big]\,|\theta - \theta'| + K_0 K_1 L_3\,|\theta - \theta'| \le (L_1 + K_0 K_1 L_3)\,\mathbb{E}\big[(1 + 2|X_0|)^\rho\big]\,|\theta - \theta'|,
\]
where $f_{X_0^{(i)}}$ denotes the marginal density function of $X_0^{(i)}$, $K_1$ is an upper bound of $f_{X_0^{(1)}}$, and $L_3$ is a Lipschitz constant for $\bar g^{(1)}$. Taking $L = (L_1 + K_0 K_1 L_3)\,\mathbb{E}[(1 + 2|X_0|)^\rho]$ completes the proof.

A.2 Proof of the claim in Remark 4
By Assumption 5, one obtains, for $\theta\in\mathbb{R}^d$ and $x\in\mathbb{R}^m$,
\[
\big\langle F(\theta, x) - F(0, x),\ \theta\big\rangle \ge \big\langle \theta,\ \hat A(x)\theta\big\rangle,
\]
which implies
\[
\big\langle F(\theta, x),\ \theta\big\rangle \ge \big\langle \theta, \hat A(x)\theta\big\rangle + \big\langle F(0, x), \theta\big\rangle \ge \big\langle \theta, \hat A(x)\theta\big\rangle - |F(0, x)|\,|\theta| \ge \big\langle \theta, \hat A(x)\theta\big\rangle - \epsilon|\theta|^2 - \big(L_2(1 + |x|)^{\rho+1} + |F(0, 0)|\big)^2/(4\epsilon) \ge \big\langle \theta, \hat A^*(x)\theta\big\rangle - \hat b(x),
\]
where the third inequality holds due to Assumption 1 and the fact that $uv \le \epsilon u^2 + v^2/(4\epsilon)$ for any $u, v \ge 0$ and $\epsilon > 0$, with $\hat A^*(x) = \hat A(x) - \epsilon I_d$ and $\hat b(x) = \big(L_2(1 + |x|)^{\rho+1} + |F(0, 0)|\big)^2/(4\epsilon)$.

A.3 Validity of Assumption 3 for the VaR-CVaR algorithm in Section 5.2

We aim to show that Assumption 3 is valid for $H_{w_1}$. To achieve this, it is enough to prove that
(1) the inequality $|\hat g_{w_1}(w, X) - \hat g_{w_1}(\bar w, X)| \le 2|w_1 - \bar w_1|\sum_i |X_i|$ holds, and
(2) the last inequality in (48) is satisfied.
To prove (1), recall that for every $j = 1, \dots, n$ and $i \neq j$,
\[
\frac{\partial g_j(w)}{\partial w_j} = \frac{e^{w_j}\big(\sum_{l\neq j} e^{w_l}\big)}{\big(\sum_{l=1}^n e^{w_l}\big)^2}, \qquad \frac{\partial g_i(w)}{\partial w_j} = -\frac{e^{w_i}e^{w_j}}{\big(\sum_{l=1}^n e^{w_l}\big)^2}.
\]
Then, one calculates (recall that $w$ and $\bar w$ differ only in their first coordinates)
\[
\big|\hat g_{w_1}(w, X) - \hat g_{w_1}(\bar w, X)\big| = \Big|\sum_{i=1}^n \frac{\partial g_i(w)}{\partial w_1}X_i - \sum_{i=1}^n \frac{\partial g_i(\bar w)}{\partial w_1}X_i\Big| \le \Big|\frac{\partial g_1(w)}{\partial w_1} - \frac{\partial g_1(\bar w)}{\partial w_1}\Big|\,|X_1| + \Big|\sum_{i\neq 1}\frac{\partial g_i(w)}{\partial w_1}X_i - \sum_{i\neq 1}\frac{\partial g_i(\bar w)}{\partial w_1}X_i\Big|
\]
\[
\le \Bigg|\frac{e^{w_1}\sum_{l\neq1}e^{w_l}}{\big(\sum_{l}e^{w_l}\big)^2} - \frac{e^{\bar w_1}\sum_{l\neq1}e^{w_l}}{\big(\sum_{l\neq1}e^{w_l} + e^{\bar w_1}\big)^2}\Bigg|\,|X_1| + \sum_{i\neq1}\Bigg|\frac{e^{w_i}e^{\bar w_1}}{\big(\sum_{l\neq1}e^{w_l} + e^{\bar w_1}\big)^2} - \frac{e^{w_i}e^{w_1}}{\big(\sum_{l}e^{w_l}\big)^2}\Bigg|\,|X_i|
\]
\[
= \frac{\sum_{l\neq1}e^{w_l}}{\big(\sum_{l}e^{w_l}\big)^2\big(\sum_{l\neq1}e^{w_l} + e^{\bar w_1}\big)^2}\Big|\big(\textstyle\sum_{l\neq1}e^{w_l}\big)^2\big(e^{w_1} - e^{\bar w_1}\big) + e^{\bar w_1}e^{w_1}\big(e^{\bar w_1} - e^{w_1}\big)\Big|\,|X_1| + \sum_{i\neq1}\frac{e^{w_i}}{\big(\sum_{l}e^{w_l}\big)^2\big(\sum_{l\neq1}e^{w_l} + e^{\bar w_1}\big)^2}\Big|\big(\textstyle\sum_{l\neq1}e^{w_l}\big)^2\big(e^{\bar w_1} - e^{w_1}\big) + e^{\bar w_1}e^{w_1}\big(e^{w_1} - e^{\bar w_1}\big)\Big|\,|X_i|
\]
\[
\le 2\,|w_1 - \bar w_1|\,\sum_{i=1}^n |X_i|,
\]
where the last inequality holds due to $1 - e^{-x} \le x$ for all $x \ge 0$. To prove (2), assume without loss of generality that $g_n(w) = \max\{g_1(w), \dots, g_n(w)\}$.
Then:
(i) For $\bar w_1 \ge w_1$, one calculates
\[
\mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_{i=1}^n g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_{i=1}^n g_i(\bar w)X_i \ge \theta\}}\big|\Big] \le I_1 + I_2, \quad (49)
\]
where
\[
I_1 = \mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_{i=1}^n g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_{l\neq1} g_l(w)X_l + g_1(\bar w)X_1 \ge \theta\}}\big|\Big],
\]
\[
I_2 = \mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_{l\neq1} g_l(w)X_l + g_1(\bar w)X_1 \ge \theta\}} - \mathbb{1}_{\{\sum_{l\neq1,2} g_l(w)X_l + g_1(\bar w)X_1 + g_2(\bar w)X_2 \ge \theta\}}\big|\Big] + \cdots + \mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{g_n(w)X_n + \sum_{l\neq n} g_l(\bar w)X_l \ge \theta\}} - \mathbb{1}_{\{\sum_{i=1}^n g_i(\bar w)X_i \ge \theta\}}\big|\Big].
\]
To estimate $I_1$, one writes
\[
I_1 \le \mathbb{E}\Big[\sum_i |X_i|\,\mathbb{1}_{\{(\theta - \sum_{l\neq n} g_l(w)X_l)/g_n(w)\ \le\ X_n\ \le\ (\theta - g_1(\bar w)X_1 - \sum_{l\neq1,n} g_l(w)X_l)/g_n(w)\}}\Big] + \mathbb{E}\Big[\sum_i |X_i|\,\mathbb{1}_{\{(\theta - g_1(\bar w)X_1 - \sum_{l\neq1,n} g_l(w)X_l)/g_n(w)\ \le\ X_n\ \le\ (\theta - \sum_{l\neq n} g_l(w)X_l)/g_n(w)\}}\Big].
\]
The first term on the RHS of the inequality above can be further estimated by conditioning on $X_1, \dots, X_{n-1}$ and integrating the band for $X_n$, which yields
\[
\le c_{X_n}\big(c_{\bar X} + (n-1)c_X\big)\,\frac{|g_1(w) - g_1(\bar w)|}{g_n(w)} + \bar c_{X_n}\,c_X\,\frac{|g_1(w) - g_1(\bar w)|}{g_n(w)}
= \big(c_{X_n}(c_{\bar X} + (n-1)c_X) + \bar c_{X_n}c_X\big)\,\frac{\big(\sum_{l\neq1}e^{w_l}\big)\,\big|e^{\bar w_1} - e^{w_1}\big|}{e^{w_n}\big(e^{\bar w_1} + \sum_{l\neq1}e^{w_l}\big)}
\]
\[
\le \big(c_{X_n}(c_{\bar X} + (n-1)c_X) + \bar c_{X_n}c_X\big)\,\frac{\sum_{l\neq1}g_l(w)}{g_n(w)}\cdot\frac{e^{\bar w_1}}{e^{\bar w_1} + \sum_{l\neq1}e^{w_l}}\,|\bar w_1 - w_1| \le \big(c_{X_n}(c_{\bar X} + (n-1)c_X) + \bar c_{X_n}c_X\big)(n-1)\,|\bar w_1 - w_1|,
\]
where $c_{\bar X}$ denotes the second absolute moment of the $X_i$'s, $c_{X_n}$ is the upper bound of the density of $X_n$, and we use $1 - e^{-x} \le x$ for $x \ge 0$ together with $\sum_{l\neq1}g_l(w)/g_n(w) \le n-1$. The second term in the bound for $I_1$ is treated in the same way. Similarly, each term of $I_2$ can be split into two band events for $X_1$ (with denominators $g_1(\bar w)$) and estimated by the same conditioning argument; a representative term satisfies
\[
\le \big(\bar c_{X_1}c_X + c_{X_1}(c_{\bar X} + (n-1)c_X)\big)\,|\bar w_1 - w_1|,
\]
where $c_X$ denotes the first absolute moment of the $X_i$'s and $\bar c_{X_1}$ is the upper bound of the function $|x|f_{X_1}$. Thus, in the case $\bar w_1 \ge w_1$, (49) becomes
\[
\mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\Big] \le 2\Big(\big((n-1)c_{X_n} + c_{X_1}\big)\big(c_{\bar X} + (n-1)c_X\big) + c_X\big((n-1)\bar c_{X_n} + \bar c_{X_1}\big)\Big)\,|\bar w_1 - w_1|.
\]
(ii) As for the case $w_1 > \bar w_1$, the calculations are close to the above; however, one considers a different splitting as follows:
\[
\mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_{i=1}^n g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_{i=1}^n g_i(\bar w)X_i \ge \theta\}}\big|\Big] \le T_1 + T_2, \quad (50)
\]
where
\[
T_1 = \mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_{i=1}^n g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_{l\neq n} g_l(w)X_l + g_n(\bar w)X_n \ge \theta\}}\big|\Big] + \cdots + \mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{g_1(w)X_1 + g_2(w)X_2 + \sum_{l\neq1,2} g_l(\bar w)X_l \ge \theta\}} - \mathbb{1}_{\{g_1(w)X_1 + \sum_{l\neq1} g_l(\bar w)X_l \ge \theta\}}\big|\Big],
\]
\[
T_2 = \mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{g_1(w)X_1 + \sum_{l\neq1} g_l(\bar w)X_l \ge \theta\}} - \mathbb{1}_{\{\sum_{i=1}^n g_i(\bar w)X_i \ge \theta\}}\big|\Big].
\]
To estimate $T_1$, one splits each summand into two band events for $X_1$ (with denominators $g_1(w)$) and argues exactly as above; a representative term satisfies
\[
\le \big(\bar c_{X_1}c_X + c_{X_1}(c_{\bar X} + (n-1)c_X)\big)\,|w_1 - \bar w_1|.
\]
In addition, $T_2$ can be estimated analogously with $X_n$ as the integration variable, its first term being bounded by
\[
(n-1)\big(\bar c_{X_n}c_X + c_{X_n}(c_{\bar X} + (n-1)c_X)\big)\,|w_1 - \bar w_1|.
\]
Thus, for the case $w_1 > \bar w_1$, we have
\[
\mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\Big] \le 2(n-1)\Big(c_X\big(\bar c_{X_n} + \bar c_{X_1}\big) + \big(c_{\bar X} + (n-1)c_X\big)\big(c_{X_n} + c_{X_1}\big)\Big)\,|w_1 - \bar w_1|.
\]
Combining the two cases, one obtains
\[
\mathbb{E}\Big[\sum_i |X_i|\,\big|\mathbb{1}_{\{\sum_i g_i(w)X_i \ge \theta\}} - \mathbb{1}_{\{\sum_i g_i(\bar w)X_i \ge \theta\}}\big|\Big] \le 2(n-1)\Big(c_X\big(\bar c_{X_n} + \bar c_{X_1}\big) + \big(c_{\bar X} + (n-1)c_X\big)\big(c_{X_n} + c_{X_1}\big)\Big)\,|w_1 - \bar w_1|.
\]

A.4 Auxiliary results
Lemma 12.
Let Assumptions 1, 2, 3 and 4 hold. For any $t\in[nT, (n+1)T]$, $n\in\mathbb{N}$, and $k = 1, \dots, K+1$ with $K+1 \le T$, one obtains
\[
\mathbb{E}\Big[\big|H\big(\bar\theta^\lambda_{nT+k-1}, X_{nT+k}\big) - h\big(\bar\theta^\lambda_{nT+k-1}\big)\big|^2\Big] \le e^{-a\lambda nT}\,\bar\sigma_Z\,\mathbb{E}[V(\theta_0)] + \tilde\sigma_Z,
\]
where
\[
\bar\sigma_Z = 4\,\mathbb{E}[K_\rho^2(X_0)]\big(L^2 + L_1^2\big), \qquad
\tilde\sigma_Z = 4\,\mathbb{E}[K_\rho^2(X_0)]\big(L^2 + L_1^2\big)\,c_0\big(\lambda_{\max} + a^{-1}\big) + 4|h(0)|^2 + 8L_1^2\,\mathbb{E}[K_\rho^2(X_0)] + 8\,\mathbb{E}\big[(F^*(X_0))^2\big]. \quad (51)
\]

Proof.
One notices that, by Remarks 1 and 2,
\[
\mathbb{E}\Big[\big|H\big(\bar\theta^\lambda_{nT+k-1}, X_{nT+k}\big) - h\big(\bar\theta^\lambda_{nT+k-1}\big)\big|^2\Big] \le 2\,\mathbb{E}\Big[\big|h\big(\bar\theta^\lambda_{nT+k-1}\big)\big|^2\Big] + 2\,\mathbb{E}\Big[\big|H\big(\bar\theta^\lambda_{nT+k-1}, X_{nT+k}\big)\big|^2\Big]
\]
\[
\le 2\,\mathbb{E}\Big[\big(L\,\big|\bar\theta^\lambda_{nT+k-1}\big| + |h(0)|\big)^2\Big] + 2\,\mathbb{E}\Big[\Big(\big(1 + |X_{nT+k}|\big)^{\rho+1}\big(L_1\big|\bar\theta^\lambda_{nT+k-1}\big| + L_1\big) + F^*(X_{nT+k})\Big)^2\Big]
\]
\[
\le 4L^2\,\mathbb{E}\Big[\big|\bar\theta^\lambda_{nT+k-1}\big|^2\Big] + 4|h(0)|^2 + 4L_1^2\,\mathbb{E}[K_\rho^2(X_0)]\,\mathbb{E}\Big[\big|\bar\theta^\lambda_{nT+k-1}\big|^2\Big] + 8L_1^2\,\mathbb{E}[K_\rho^2(X_0)] + 8\,\mathbb{E}\big[(F^*(X_0))^2\big]
\]
\[
\le 4\,\mathbb{E}[K_\rho^2(X_0)]\big(L^2 + L_1^2\big)\Big(e^{-a\lambda nT}\,\mathbb{E}[V(\theta_0)] + c_0\big(\lambda_{\max} + a^{-1}\big)\Big) + 4|h(0)|^2 + 8L_1^2\,\mathbb{E}[K_\rho^2(X_0)] + 8\,\mathbb{E}\big[(F^*(X_0))^2\big],
\]
where the last inequality holds due to Lemma 1. Finally, one obtains
\[
\mathbb{E}\Big[\big|h\big(\bar\zeta^{\lambda,n}_t\big) - H\big(\bar\zeta^{\lambda,n}_t, X_{nT+k}\big)\big|^2\Big] \le e^{-a\lambda nT}\,\bar\sigma_Z\,\mathbb{E}[V(\theta_0)] + \tilde\sigma_Z,
\]
where $\bar\sigma_Z$ and $\tilde\sigma_Z$ are given in (51).

Lemma 13.
Let Assumptions 1, 2 and 4 hold. For any $t > 0$, one obtains
\[
\mathbb{E}\Big[\big|\bar\theta^\lambda_t - \bar\theta^\lambda_{\lfloor t\rfloor}\big|^2\Big] \le \lambda\Big(e^{-a\lambda\lfloor t\rfloor}\,\bar\sigma_Y\,\mathbb{E}[V(\theta_0)] + \tilde\sigma_Y\Big),
\]
where
\[
\bar\sigma_Y = 2\lambda_{\max}L_1^2\,\mathbb{E}[K_\rho^2(X_0)], \qquad
\tilde\sigma_Y = 2\lambda_{\max}L_1^2\,\mathbb{E}[K_\rho^2(X_0)]\,c_0\big(\lambda_{\max} + a^{-1}\big) + 4\lambda_{\max}L_1^2\,\mathbb{E}[K_\rho^2(X_0)] + 4\lambda_{\max}\,\mathbb{E}\big[(F^*(X_0))^2\big] + 2d\beta^{-1}. \quad (52)
\]

Proof.
For any $t > 0$, one calculates
\[
\mathbb{E}\Big[\big|\bar\theta^\lambda_t - \bar\theta^\lambda_{\lfloor t\rfloor}\big|^2\Big] = \mathbb{E}\bigg[\Big|-\lambda\int_{\lfloor t\rfloor}^t H\big(\bar\theta^\lambda_{\lfloor t\rfloor}, X_{\lceil t\rceil}\big)\,ds + \sqrt{2\beta^{-1}\lambda}\,\big(\tilde B^\lambda_t - \tilde B^\lambda_{\lfloor t\rfloor}\big)\Big|^2\bigg]
\le 2\lambda^2\,\mathbb{E}\Big[\Big(\big(1 + |X_{\lceil t\rceil}|\big)^{\rho+1}\big(L_1\big|\bar\theta^\lambda_{\lfloor t\rfloor}\big| + L_1\big) + F^*(X_{\lceil t\rceil})\Big)^2\Big] + 2d\lambda\beta^{-1},
\]
where the inequality holds due to Remark 1, and, by applying Lemma 1, one obtains
\[
\mathbb{E}\Big[\big|\bar\theta^\lambda_t - \bar\theta^\lambda_{\lfloor t\rfloor}\big|^2\Big] \le 2\lambda^2 L_1^2\,\mathbb{E}[K_\rho^2(X_0)]\,\mathbb{E}\Big[\big|\bar\theta^\lambda_{\lfloor t\rfloor}\big|^2\Big] + 4\lambda^2 L_1^2\,\mathbb{E}[K_\rho^2(X_0)] + 4\lambda^2\,\mathbb{E}\big[(F^*(X_0))^2\big] + 2d\lambda\beta^{-1}
\le \lambda\Big(\big(1 - a\lambda\big)^{\lfloor t\rfloor}\,\bar\sigma_Y\,\mathbb{E}[V(\theta_0)] + \tilde\sigma_Y\Big),
\]
where $\bar\sigma_Y$ and $\tilde\sigma_Y$ are given in (52).

Lemma 14.
Let Assumptions 1, 2 and 4 hold. Then, for any $t > 0$, one obtains
\[
\mathbb{E}\big[|Z_t|^2\big] \le e^{-2at}\,\mathbb{E}\big[|\theta_0|^2\big] + \bigg(\frac{d}{a\beta} + \frac{b}{a} + \frac{\mathbb{E}[K^2(X_0)]}{2a^2}\bigg)\big(1 - e^{-2at}\big).
\]

Proof.
For any $t > 0$, by applying Itô's formula to $e^{2at}|Z_t|^2$, one obtains, almost surely,
\[
d\big(e^{2at}|Z_t|^2\big) = 2ae^{2at}|Z_t|^2\,dt - 2e^{2at}\big\langle Z_t, h(Z_t)\big\rangle\,dt + 2e^{2at}\big\langle Z_t, \sqrt{2\beta^{-1}}\,dB_t\big\rangle + 2d\beta^{-1}e^{2at}\,dt.
\]
Then, integrating both sides and taking expectations yield
\[
e^{2at}\,\mathbb{E}\big[|Z_t|^2\big] = \mathbb{E}\big[|\theta_0|^2\big] + 2a\int_0^t e^{2as}\,\mathbb{E}\big[|Z_s|^2\big]\,ds - 2\int_0^t e^{2as}\,\mathbb{E}\big[\langle Z_s, h(Z_s)\rangle\big]\,ds + 2d\beta^{-1}\int_0^t e^{2as}\,ds,
\]
which implies, by using Assumption 4,
\[
e^{2at}\,\mathbb{E}\big[|Z_t|^2\big] \le \mathbb{E}\big[|\theta_0|^2\big] + 2a\int_0^t e^{2as}\,\mathbb{E}\big[|Z_s|^2\big]\,ds - 2a\int_0^t e^{2as}\,\mathbb{E}\big[|Z_s|^2\big]\,ds + 2b\int_0^t e^{2as}\,ds + 2\int_0^t e^{2as}\,\mathbb{E}\big[|Z_s|\big]\,\mathbb{E}\big[K(X_0)\big]\,ds + 2d\beta^{-1}\int_0^t e^{2as}\,ds
\le \mathbb{E}\big[|\theta_0|^2\big] + \Big(2b + \mathbb{E}\big[K^2(X_0)\big]/a + 2d\beta^{-1}\Big)\big(e^{2at} - 1\big)/(2a).
\]
Finally, one obtains
\[
\mathbb{E}\big[|Z_t|^2\big] \le e^{-2at}\,\mathbb{E}\big[|\theta_0|^2\big] + \Big(2b + \mathbb{E}\big[K^2(X_0)\big]/a + 2d\beta^{-1}\Big)\big(1 - e^{-2at}\big)/(2a).
\]