Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach. Part II: Theoretical Analysis
Valentin De Bortoli, Alain Durmus, Ana F. Vidal, Marcelo Pereyra
aa r X i v : . [ m a t h . S T ] A ug Maximum likelihood estimation of regularisationparameters in high-dimensional inverse problems: anempirical Bayesian approachPart II: Theoretical Analysis
Valentin De Bortoli ∗ , Alain Durmus † , Marcelo Pereyra ‡ , and Ana F. Vidal § Part of this work has been presented at the 25th IEEE International Conferenceon Image Processing (ICIP) [50] CMLA - École normale supérieure Paris-Saclay, CNRS, Université Paris-Saclay, 94235 Cachan, France. Maxwell Institute for Mathematical Sciences & School of Mathematical and Computer Sciences,Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom.
August 14, 2020
Abstract
This paper presents a detailed theoretical analysis of the three stochastic approximationproximal gradient algorithms proposed in our companion paper [49] to set regularization pa-rameters by marginal maximum likelihood estimation. We prove the convergence of a moregeneral stochastic approximation scheme that includes the three algorithms of [49] as specialcases. This includes asymptotic and non-asymptotic convergence results with natural andeasily verifiable conditions, as well as explicit bounds on the convergence rates. Importantly,the theory is also general in that it can be applied to other intractable optimisation prob-lems. A main novelty of the work is that the stochastic gradient estimates of our scheme areconstructed from inexact proximal Markov chain Monte Carlo samplers. This allows the useof samplers that scale efficiently to large problems and for which we have precise theoreticalguarantees.
Numerous imaging problems require performing inferences on an unknown image of interest x ∈ R d from some observed data y . Canonical examples include image denoising [12, 28], compressivesensing [18, 40], super-resolution [35, 51], tomographic reconstruction [13], image inpainting [24, 44],source separation [9, 8], fusion [46, 31], and phase retrieval [10, 26]. Such imaging problems canbe formulated in a Bayesian statistical framework, where inferences are derived from the so-calledposterior distribution of x given y , which for the purpose of this paper we specify as follows p ( x | y, θ ) = p ( y | x ) p ( x | θ ) /p ( y | θ ) where p ( y | x ) = exp {− f y ( x ) } with f y ∈ C ( R d , R ) is the likelihood function, and the prior distri-bution is p ( x | θ ) = exp {− θ ⊤ g ( x ) } with g : R d → R d Θ and θ ∈ Θ ⊂ R d Θ . The function f y acts as adata-fidelity term, g as a regulariser that promotes desired structural or regularity properties (e.g.,smoothness, piecewise-regularity, or sparsity [11]), and θ is a regularisation parameter that con-trols the amount of regularity enforced. Most Bayesian methods in the imaging literature considermodels for which f y and g are convex functions and report as solution the maximum-a-posteriori(MAP) Bayesian estimatorargmin f y,θ , where f y,θ ( x ) = f y ( x ) + θ ⊤ g ( x ) for any x ∈ R d . (1) ∗ Email: [email protected] † Email: [email protected] ‡ Email: [email protected] § Email: [email protected] y = Ax + w ,where A ∈ R d × R d is some problem-specific linear operator and the noise w has distribution N(0 , σ I d ) with variance σ > . Then, for any x ∈ R d f y ( x ) = (2 σ ) − k Ax − y k . With regardsto the prior, a common choice in imaging is to set Θ = R + and g ( x ) = k Bx k for some suitablebasis or dictionary B ∈ R d ′ × R d , or g ( x ) = TV( x ) , where TV( x ) is the isotropic total variationpseudo-norm given by TV( x ) = P i p (∆ hi x ) + (∆ vi x ) where ∆ vi and ∆ hi denote horizontal andvertical first-order local (pixel-wise) difference operators.Importantly, when f y and g are convex, problem (1) is also convex and can usually be efficientlysolved by using modern proximal convex optimisation techniques [11], with remarkable guaranteeson the solutions delivered.Setting the value of θ can be notoriously difficult, especially in problems that are ill-posed orill-conditioned where the regularisation has a dramatic impact on the recovered estimates. Werefer to [27] and [49, Section 1] for illustrations and a detailed review of the existing methods forsetting set θ .In our companion paper [49], we present a new method to set regularisation parameters. Moreprecisely, in [49], we adopt an empirical Bayesian approach and set θ by maximum marginallikelihood estimation, i.e. θ ⋆ ∈ arg max θ ∈ Θ log p ( y | θ ) , where p ( y | θ ) = Z R d p ( y, x | θ )d x , p ( y, x | θ ) ∝ exp[ − f y,θ ( x )] . (2)To solve (2), we aim at using gradient based optimization methods. The gradient of θ log p ( y | θ ) ,can be computed using Fisher’s identity, see [49, Proposition A.1], which implies under mild inte-grability conditions on f y and g , for any θ ∈ Θ , ∇ θ log p ( y | θ ) = − Z R d g (˜ x ) p (˜ x | y, θ )d˜ x + Z R d g (˜ x ) p (˜ x | θ )d˜ x . It follows that θ
7→ ∇ θ log p ( y | θ ) can be written as a sum of two parametric integrals which areuntractable in most cases. Therefore, we propose to use a stochastic approximation (SA) schemeand, in particular, we define three different algorithms to solve (2) [49, Algorithm 3.1, Algorithm3.2, Algorithm 3.3]. These algorithms are extensively demonstrated in [49] through a range ofapplications and comparisons with alternative approaches from the state-of-the-art.In the present paper we theoretically analyse these three SA schemes and establish naturaland easily verifiable conditions for convergence. For generality, rather than presenting algorithm-specific analyses, we establish detailed convergence results for a more general SA scheme that coversthe three algorithms of [49] as specific cases. Indeed, all these methods boil down to defining asequence ( θ n ) n ∈ N satisfying a recursion of the form: for any n ∈ N , θ n +1 = Π Θ " θ n − δ n +1 m n m n X k =1 (cid:8) g ( X nk ) − g ( ¯ X nk ) (cid:9) , (3)where Π Θ is the projection onto a convex closed set Θ , ( X nk ) k ∈{ ,...,m n } and ( ¯ X nk ) k ∈{ ,...,m n } aretwo independent stochastic processes targeting x p ( x | y, θ ) and x p ( x | θ ) respectively, ( m n ) n ∈ N is a sequence of batch-sizes and ( δ n ) n ∈ N ∗ is a sequence of stepsizes. In this paper, we are interestedin establishing the convergence of the averaging of ( θ n ) n ∈ N to a solution of (2) in this setting. SAhas been extensively studied during the past decades [41, 29, 38, 47, 33, 34, 7, 6, 48]. Recently,quantitative results have been obtained in [45, 2, 39, 1, 43]. In contrast to [1], here we considerthe case where ( X nk ) k ∈{ ,...,m n } and ( ¯ X nk ) k ∈{ ,...,m n } are inexact Markov chains which target x p ( x | y, θ ) and x p ( x | θ ) respectively and are based on some generalizations of the UnadjustedLangevin Algorithm (ULA) [42]. In the recent years, ULA has attracted a lot of attention sincethis algorithm exhibits favorable high-dimensional convergence properties in the case where thetarget distribution admits a differentiable density, see [20, 22, 14, 15]. However, in most imagingmodels, the penalty function g is not differentiable and therefore x p ( x | y, θ ) and x p ( x | θ ) arenot differentiable as well. Therefore, we consider proximal Langevin samplers which are specificallydesign to overcome this issue: the Moreau-Yoshida Unadjusted Langevin Algorithm (MYULA),see [23], and the Proximal Unadjusted Langevin Operator (PULA), see [21].A similar approximation scheme to (3) is studied in [1]. More precisely [1, Theorem 3, Theorem4] are similar to Theorem 6 and Theorem 7. Contrarily to that work, here we do not require theMarkov kernels we use to exactly target x p ( x | θ ) and x p ( x | y, θ ) but allow some biasin the estimation which is accounted for in our convergence rates. This relaxation to biased2stimates plays a central role in the capacity of the method to scale efficiently to large problems.Moreover, the present paper is also a complement of [17] which establishes general conditions forthe convergence of inexact Markovian SA but only apply these results to ULA. In this study, wedo not consider a general Markov kernel but rather specialize the results of [17] to MYULA andPULA Markov kernels. However, to apply results of [17], new quantitative geometric convergenceproperties on MYULA and PULA have to be established.The remainder of the paper is organized as follows. In Section 2, we recall our notations andconventions. 
In Section 3, we define the class of optimisation problems considered and the SAscheme (3). This setting includes the optimization problem presented in (2) and the three specificalgorithms introduced in [49]. Then, in Section 4, we present a detailed analysis of the theoreticalproperties of the proposed methodology. First, we show new ergodicity results for the MYULAand PULA samplers. In a second part, we provide easily verifiable conditions for convergence andquantitative convergence rates for the averaging sequences designed from (3). The proofs of theseresults are gathered in Section 5. We denote by
B(0 , R ) and B(0 , R ) the open ball, respectively the closed ball, with radius R in R d .Denote by B ( R d ) the Borel σ -field of R d , F ( R d ) the set of all Borel measurable functions on R d andfor f ∈ F ( R d ) , k f k ∞ = sup x ∈ R d | f ( x ) | . For µ a probability measure on ( R d , B ( R d )) and f ∈ F ( R d ) a µ -integrable function, denote by µ ( f ) the integral of f w.r.t. µ . For f ∈ F ( R d ) , the V -norm of f is given by k f k V = sup x ∈ R d | f ( x ) | /V ( x ) . Let ξ be a finite signed measure on ( R d , B ( R d )) . The V -total variation norm of ξ is defined as k ξ k V = sup f ∈ F ( R d ) , k f k V (cid:12)(cid:12)(cid:12)(cid:12)Z R d f ( x )d ξ ( x ) (cid:12)(cid:12)(cid:12)(cid:12) . If V ≡ , then k · k V is the total variation norm on measures denoted by k · k TV .Let U be an open set of R d . We denote by C k ( U , R d Θ ) the set of R d Θ -valued k -differentiablefunctions, respectively the set of compactly supported R d Θ -valued k -differentiable functions. C k ( U ) stands C k ( U , R ) . Let f : U → R , we denote by ∇ f , the gradient of f if it exists. f is said to be m -convex with m > if for all x, y ∈ R d and t ∈ [0 , , f ( tx + (1 − t ) y ) tf ( x ) + (1 − t ) f ( y ) − ( m / t (1 − t ) k x − y k . Let (Ω , F , P ) be a probability space. Denote by µ ≪ ν if µ is absolutely continuous w.r.t. ν and d µ/ d ν an associated density. Let µ, ν be two probability measures on ( R d , B ( R d )) . Define theKullback-Leibler divergence of µ from ν by KL ( µ | ν ) = (R R d d µ d ν ( x ) log (cid:16) d µ d ν ( x ) (cid:17) d ν ( x ) , if µ ≪ ν , + ∞ otherwise . Let Θ ⊂ R d Θ and f : Θ → R . We consider the optimisation problem θ ⋆ ∈ arg min θ ∈ Θ f ( θ ) , (4)in scenarios where it is not possible to evaluate f nor ∇ f because they are computationally in-tractable. Problem (4) includes the marginal likelihood estimation problem (2) of our companionpaper [49] as the special case f = − log p ( y |· ) . We make the following general assumptions on f and Θ , which are in particular verified by the imaging models considered in [49]. A1. Θ is a convex compact set and Θ ⊂ B(0 , R Θ ) with R Θ > . A 2.
There exist an open set U ⊂ R p and L f > such that Θ ⊂ U , f ∈ C ( U , R ) and for any θ , θ ∈ Θ k∇ θ f ( θ ) − ∇ θ f ( θ ) k L f k θ − θ k . For any θ ∈ Θ , there exist H θ , ¯ H θ : R d → R d Θ and two probability distributions π θ , ¯ π θ on ( R d , B ( R d )) satisfying for any θ ∈ Θ ∇ θ f ( θ ) = Z R d H θ ( x )d π θ ( x ) + Z R d ¯ H θ ( x )d¯ π θ ( x ) . In addition, ( θ, x ) H θ ( x ) and ( θ, x ) ¯ H θ ( x ) are measurable. Remark 1.
Note that if f ∈ C (Θ) then A is automatically satisfied under A , since Θ iscompact. In every model considered in our companion paper [49], θ
7→ − log p ( y | θ ) is continuouslytwice differentiable on each compact using the dominated convergence theorem and therefore A holds under A . Remark 2.
Assumption A is verified in the three cases considered in our companion paper [49,Algorithm 3.1, Algorithm 3.2, Algorithm 3.3]:(a) if the regulariser g is α positively homogeneous with α > and d Θ = 1 , corresponding to [49,Algorithm 3.1], then for any θ ∈ Θ , H θ = g , ¯ H θ = − d/ ( αθ ) , π θ is the probability measure withdensity w.r.t. the Lebesgue measure x p ( x | y, θ ) and ¯ π θ is any probability measure;(b) if the regulariser g is separably positively homogeneous as in [49, Algorithm 3.2], then for any θ ∈ Θ , H θ = g , ¯ H θ = ( − | A i | / ( α i θ i )) i ∈{ ,...,d Θ } , π θ is the probability measure with density w.r.t. theLebesgue measure x p ( x | y, θ ) and ¯ π θ is any probability measure;(c) if the regulariser g is inhomogeneous, corresponding to [49, Algorithm 3.3], then for any θ ∈ Θ , ¯ H θ = − g , H θ = g , π θ and ¯ π θ are the probability measures associated with the posterior and theprior, with density w.r.t. the Lebesgue measure x p ( x | y, θ ) and x p ( x | θ ) respectively. We now present in Algorithm 1, the stochastic algorithm we consider in order to solve (4).This method encompasses the schemes introduced in the companion paper [49, Algorithm 3.1,Algorithm 3.2, Algorithm 3.3]. Starting from ( X , ¯ X ) ∈ R d × R d and θ ∈ Θ , we define on aprobability space (Ω , F , P ) , the sequence ( { ( X nk , ¯ X nk ) : k ∈ { , . . . , m n }} , θ n ) n ∈ N by the followingrecursion for n ∈ N and k ∈ { , . . . , m n − } ( X nk ) k ∈{ ,...,m n } is a MC with kernel K γ n ,θ n and X n = X n − m n − given F n − , ( ¯ X nk ) k ∈{ ,...,m n } is a MC with kernel ¯K γ ′ n ,θ n and ¯ X n = ¯ X n − m n − given F n − ,θ n +1 = Π Θ " θ n − δ n +1 m n m n X k =1 (cid:8) H θ n ( X nk ) + ¯ H θ n ( ¯ X nk ) (cid:9) , (5)where ( X − m − , ¯ X − m − ) = ( X , ¯ X ) , { (K γ,θ , ¯K γ,θ ) : γ > , θ ∈ Θ } is a family of Markov kernels on R d × B ( R d ) , ( m n ) n ∈ N ∈ ( N ∗ ) N , δ n , γ n , γ ′ n > for any n ∈ N , Π Θ is the projection onto Θ and F n is defined as follows for all n ∈ N ∪ {− }F n = σ (cid:0) θ , { ( X ℓk , ¯ X ℓk ) k ∈{ ,...,m ℓ } : ℓ ∈ { , . . . , n }} (cid:1) , F − = σ ( θ , X , ¯ X ) . Define for any N ∈ N , ¯ θ N = N − X n =0 δ n θ n , N − X n =0 δ n . In the sequel, we are interested in the convergence of ( f (¯ θ N )) N ∈ N to a minimum of f in the casewhere the Markov kernels { (K γ,θ , ¯K γ,θ ) : γ > , θ ∈ Θ } , used in Algorithm 1 are either the onesassociated with MYULA or PULA. We now present these two MCMC methods for which someanalysis is required in our study of ( f (¯ θ N )) N ∈ N . Given the high dimensionality involved, it is fundamental to carefully choose the families of Markovkernels { K γ,θ , ¯K γ,θ : γ > , θ ∈ Θ } driving Algorithm 1. In the experimental part of this work,see [49, Section 4], we use the MYULA Markov kernel recently proposed in [23], which is a state-of-the-art proximal Markov chain Monte Carlo (MCMC) method specifically designed for high-dimensional models that are are log-concave but not smooth. The method is derived from the4 lgorithm 1 General algorithm Input: initial { θ , X , ¯ X } , ( δ n , γ n , γ ′ n , m n ) n ∈ N , number of iterations N . for n = 0 to N − do if n > then Set X n = X n − m n − , Set ¯ X n = ¯ X n − m n − , end if for k = 0 to m n − do Sample X nk +1 ∼ K γ n ,θ n ( X nk , · ) , Sample ¯ X nk +1 ∼ ¯K γ ′ n ,θ n ( ¯ X nk , · ) , end for Set θ n +1 = Π Θ h θ n − δ n +1 m n P m n k =1 (cid:8) H θ n ( X nk ) + ¯ H θ n ( ¯ X nk ) (cid:9)i . 
end for Output: ¯ θ N = { P N − n =0 δ n } − P N − n =0 δ n θ n .discretisation of an over-damped Langevin diffusion, ( ¯ X t ) t > , satisfying the following stochasticdifferential equation d X t = −∇ x F ( X t )d t + √ B t , (6)where F : R d R is a continuously differentiable potential and ( B t ) t > is a standard d -dimensionalBrownian motion. Under mild assumptions, this equation has a unique strong solution [25, Chapter4, Theorem 2.3]. Accordingly, the law of ( X t ) t > converges as t → ∞ to the diffusion’s uniqueinvariant distribution, with probability density given by π ( x ) ∝ e − F ( x ) for all x ∈ R d [42, Theorem2.2]. Hence, to use (6) as a Monte Carlo method to sample from the posterior p ( x | y, θ ), we set F ( x ) = log p ( x | y, θ ) and thus specify the desired target density. Similarly, to sample from the priorwe set F ( x ) = −∇ x log p ( x | θ ) .However, sampling directly from (6) is usually not computationally feasible. Instead, we usuallyresort to a discrete-time Euler-Maruyama approximation of (6) that leads to the following Markovchain ( X k ) k ∈ N with X ∈ R d , given for any k ∈ N byULA : X k +1 = X k − γ ∇ x F ( X k ) + p γZ k +1 , where γ > is a discretisation step-size and ( Z k ) k ∈ N ∗ is a sequence of i.i.d d -dimensional zero-meanGaussian random variables with an identity covariance matrix. This Markov chain is commonlyknown as the Unadjusted Langevin Algorithm (ULA) [42]. Under some additional assumptionson F , namely Lipschitz continuity of ∇ x F , the ULA chain inherits the convergence properties of(6) and converges to a stationary distribution that is close to the target π , with γ controlling atrade-off between accuracy and convergence speed [23]. Remark 3.
In this form, the ULA algorithm is limited to distributions where F is a Lipschitzcontinuously differentiable function. However, in the imaging problems of interest this is usuallynot the case [49]. For example, to implement any of the algorithms presented in [49] it is necessaryto sample from the posterior distribution p ( x | y, θ ) (corresponding to π θ in Section 3.1), whichwould require setting for any x ∈ R d , F ( x ) = f y ( x ) + θ ⊤ g ( x ) . Similarly, one of the algorithmsalso requires sampling from the prior distribution x p ( x | θ ) (corresponding to ¯ π θ in Section 3.1),which requires setting for any x ∈ R d , F ( x ) = θ ⊤ g ( x ) . In both cases, if g is not smooth then ULAcannot be directly applied. The MYULA kernel was designed precisely to overcome this limitation. Suppose that the target potential admits a decomposition F = V + U where V is Lipschitzdifferentiable and U is not smooth but convex over R d . In MYULA, the differentiable part ishandled via the gradient ∇ x V in a manner akin to ULA, whereas the non-differentiable convexpart is replaced by a smooth approximation U λ ( x ) given by the Moreau-Yosida envelope of U , see[5, Definition 12.20], defined for any x ∈ R d and λ > by U λ ( x ) = min ˜ x ∈ R d n U (˜ x ) + (1 / λ ) k x − ˜ x k o . (7)Similarly, we define the proximal operator for any x ∈ R d and λ > by prox λU ( x ) = arg min ˜ x ∈ R d n U (˜ x ) + (1 / λ ) k x − ˜ x k o . (8)5or any λ > , the Moreau-Yosida envelope U λ is continuously differentiable with gradient givenfor any x ∈ R d by ∇ U λ ( x ) = ( x − prox λU ( x )) /λ , (9)(see, e.g., [5, Proposition 16.44]). Using this approximation we obtain the MYULA kernel associ-ated with ( X k ) k ∈ N given by X ∈ R d and the following recursion for any k ∈ N MYULA : X k +1 = X k − γ ∇ x V ( X k ) − γ ∇ x U λ ( X k ) + p γZ k +1 . (10)Returning to the imaging problems of interest, we define the MYULA families of Markov kernels { R γ,θ , ¯R γ,θ : γ > , θ ∈ Θ } that we use in Algorithm 1 to target π θ and ¯ π θ for θ ∈ Θ as follows.By Remark 3, we set V = f y and U = θ ⊤ g , ¯ V = 0 and ¯ U = θ ⊤ g . Then, for any θ ∈ Θ and γ > , R γ,θ associated with ( X k ) k ∈ N is given by X ∈ R d and the following recursion for any k ∈ N X k +1 = X k − γ ∇ x f y ( X k ) − γ n X k − prox λθ ⊤ g ( X k ) o /λ + p γZ k +1 . (11)Similarly, for any θ ∈ Θ and γ ′ > , ¯R γ,θ associated with ( X k ) k ∈ N is given by X ∈ R d and thefollowing recursion for any k ∈ N ¯ X k +1 = ¯ X k − γ ′ n ¯ X k − prox λ ′ θ ⊤ g ( ¯ X k ) o /λ ′ + p γZ k +1 , (12)where we recall that λ, λ ′ > are the smoothing parameters associated with θ ⊤ g λ , γ, γ ′ > are thediscretisation steps and ( Z k ) k ∈ N ∗ is a sequence of i.i.d d -dimensional zero-mean Gaussian randomvariables with an identity covariance matrix.Notice that other ways of splitting the target potential F can be straightforwardly implemented.For example, instead of a single non-smooth convex term U , one might choose a splitting involvingseveral non-smooth terms to simplify the computation of the proximal operators (each term wouldbe replaced by its Moreau-Yosida envelope in (6)). Similarly, although we usually to associate V, ¯ V and U, ¯ U to the log-likelihood and the log-prior, some cases might benefit from a differentsplitting. Moreover, as illustrated in Section 3.2.2 below, other discrete approximations of theLangevin diffusion could be considered too. 
As an alternative to MYULA, one could also consider using the Proximal Unadjusted LangevinAlgorithm (PULA) introduced in [21], which replaces the (forward) gradient step of MYULA bya composition of a backward and forward step. More precisely, PULA defines the Markov chain ( X k ) k ∈ N starting from X ∈ R d by the following recursion: for any k ∈ N PULA : X k +1 = prox λU ( X k ) − γ ∇ x U (prox λU ( X k )) + p γZ k +1 . (13)To highlight the connection with MYULA we note that for any x ∈ R d and λ > , ∇ U λ ( x ) =( x − prox λU ( x )) /λ by [5, Proposition 12.30]. Therefore, if we set λ = γ we obtain that (13) can berewritten for any k ∈ N a X k +1 = X k − γ ∇ x V ( X k ) − γ ∇ x U (prox λU ( X k )) + p γZ k +1 , which corresponds to (10) with λ = γ , except that the term ∇ x U ( X k ) in (10) is replaced by ∇ x U (prox λU ( X k )) in (10).Going back to the imaging problems of interest, to define the PULA families of Markov kernels { S γ,θ , ¯S γ,θ : γ > , θ ∈ Θ } that we use in Algorithm 1 to target π θ and ¯ π θ for θ ∈ Θ we proceedas follows. We set V = f y and U = θ ⊤ g , ¯ V = 0 and ¯ U = θ ⊤ g . Then, by Remark 3, for any θ ∈ Θ and γ > , S γ,θ associated with ( X k ) k ∈ N is given by X ∈ R d and the following recursion for any k ∈ N X k +1 = prox λθ ⊤ g ( X k ) − γ ∇ x f y (prox λθ ⊤ g ( X k )) + p γZ k +1 , (14)Similarly, for any θ ∈ Θ and γ ′ > , ¯S γ,θ associated with ( X k ) k ∈ N is given by X ∈ R d and thefollowing recursion for any k ∈ N ¯ X k +1 = prox λ ′ θ ⊤ g ( ¯ X k ) + p γZ k +1 . (15)Recall that λ, λ ′ > are the smoothing parameters associated with θ ⊤ g λ , γ, γ ′ > are thediscretisation steps and ( Z k ) k ∈ N ∗ is a sequence of i.i.d d -dimensional zero-mean Gaussian random6ariables with an identity covariance matrix. Again, one could use PULA with a different splittingof F .Finally, we note at this point that the MYULA and PULA kernels (11), (12), (14) and (15),do not target the posterior or prior distributions exactly but rather an approximation of thesedistributions. This is mainly due to two facts: 1) we are not able to use the exact Langevin diffusion(6), so we resort to a discrete approximation instead; and 2) we replace the non-differentiable termswith their Moreau-Yosida envelopes. As a result of these approximation errors, Algorithm 1 willexhibit some asymptotic estimation bias. This error is controlled by λ, λ ′ , γ, γ ′ , and δ , and can bemade arbitrarily small at the expense of additional computing time, see Theorem 7 in Section 4. Before establishing our main convergence results about Algorithm 1, see Section 4.1, we deriveergodicity properties on the Markov chains given by (10) and (13). We consider the followingassumptions on π θ and ¯ π θ . These assumptions are satisfied for a large class of models in Bayesianimaging sciences, and in particular by the models considered in our companion paper [49]. H 1.
For any θ ∈ Θ , there exist V θ , ¯ V θ , U θ , ¯ U θ : R d → [0 , + ∞ ) convex functions satisfying thefollowing conditions.(a) For any θ ∈ Θ and x ∈ R d , π θ ( x ) ∝ exp [ − V θ ( x ) − U θ ( x )] , ¯ π θ ( x ) ∝ exp (cid:2) − ¯ V θ ( x ) − ¯ U θ ( x ) (cid:3) , and min (cid:18) inf θ ∈ Θ Z R d exp[ − V θ (˜ x ) − U θ (˜ x )]d˜ x, inf θ ∈ Θ Z R d exp[ − ¯ V θ (˜ x ) − ¯ U θ (˜ x )]d˜ x (cid:19) > . (16) (b) For any θ ∈ Θ , V θ and ¯ V θ are continuously differentiable and there exists L > such thatfor any θ ∈ Θ and x, y ∈ R d max (cid:0) k∇ x V θ ( x ) − ∇ x V θ ( y ) k , k∇ x ¯ V θ ( x ) − ∇ x ¯ V θ ( y ) k (cid:1) L k x − y k . In addition, there exist R V, , R V, > such that for any θ ∈ Θ , there exist x ⋆θ , ¯ x ⋆θ ∈ R d with x ⋆θ ∈ arg min R d V θ , ¯ x ⋆θ ∈ arg min R d ¯ V θ , x ⋆θ , ¯ x ⋆θ ∈ B(0 , R V, ) and V θ ( x ⋆θ ) , ¯ V θ (¯ x ⋆θ ) ∈ B(0 , R V, ) .(c) There exists M > such that for any θ ∈ Θ and x, y ∈ R d max (cid:0) k U θ ( x ) − U θ ( y ) k , k ¯ U θ ( x ) − ¯ U θ ( y ) k (cid:1) M k x − y k . In addition, there exist R U, , R U, > such that for any θ ∈ Θ , there exist x ♯θ , ¯ x ♯θ ∈ R d with x ♯θ , ¯ x ♯θ ∈ B(0 , R U, ) and U θ ( x ♯θ ) , ¯ U θ (¯ x ♯θ ) ∈ B(0 , R U, ) . Note that (16) in H Θ is compact and the functions θ R R d exp[ − V θ (˜ x ) − U θ (˜ x )]d˜ x and θ R R d exp[ − ¯ V θ (˜ x ) − ¯ U θ (˜ x )]d˜ x are continuous. This latter condition can bethen easily verified using the Lebesgue dominated convergence theorem and some assumptionson { V θ , ¯ V θ , U θ , ¯ U θ : θ ∈ Θ } . Note that if there exists V : R d → [0 , + ∞ ) such that for any θ ∈ Θ , V θ = V and there exists x ⋆ ∈ R d with x ⋆ ∈ arg min R d V then one can choose x ⋆θ = x ⋆ for any θ ∈ Θ in H R V, = 0 . Similarly if for any θ ∈ Θ , U θ (0) = 0 then one can choose x ♯θ = 0 in H R U, = R U, = 0 . These conditions are satisfied by all the modelsstudied in [49].As emphasized in Section 3.1, we use a stochastic approximation proximal gradient approachto minimize f and therefore we need to consider Monte Carlo estimators for ∇ θ f ( θ ) and θ ∈ Θ .These estimators are derived from Markov chains targeting π θ and ¯ π θ respectively. We consider twoMCMC methodologies to construct the Markov chains. A first option, as proposed in Section 3.2.1,is to use MYULA to sample from π θ and ¯ π θ . Let κ > and { R γ,θ : γ > , θ ∈ Θ } be the familyof kernels defined for any x ∈ R d , γ > , θ ∈ Θ and A ∈ B ( R d ) by R γ,θ ( x, A ) = (4 π γ ) − d/ Z A exp (cid:16) (cid:13)(cid:13) y − x + γ ∇ x V θ ( x ) + κ − (cid:8) x − prox γκU θ ( x ) (cid:9)(cid:13)(cid:13) . (4 γ ) (cid:17) d y . (17)7ote that (17) is the Markov kernel associated with the recursion (10) with U ← U θ , V ← V θ and λ ← κγ . For any γ, κ > and θ ∈ Θ corresponds to R γ,κγ,θ in [49]. Consider also the family ofMarkov kernels { ¯R γ,θ : γ > , θ ∈ Θ } such that for any γ > and θ ∈ Θ , ¯R γ,θ is the Markovkernel defined by (17) but with ¯ U θ and ¯ V θ in place of U θ and V θ respectively. The coefficient κ isrelated to λ in (11) by κ = λ/γ .Moreover, although our companion paper [49] only considers the MYULA kernel, the theoreticalresults we present in this paper also hold if the algorithms are implemented using PULA [21]. Definethe family { S γ,θ : γ > , θ ∈ Θ } , for any x ∈ R d , γ > , θ ∈ Θ and A ∈ B ( R d ) by S γ,θ ( x, A ) = (4 π γ ) − d/ Z A exp (cid:16) (cid:13)(cid:13) y − prox γκU θ ( x ) + γ ∇ x V θ (prox γκU θ ( x )) (cid:13)(cid:13) . (4 γ ) (cid:17) d y . 
(18)Note that (17) is the Markov kernel associated with the recursion (13) with U ← U θ , V ← V θ and λ ← κγ . Consider also the family of Markov kernels { ¯S γ,θ : γ > , θ ∈ Θ } such that forany γ > and θ ∈ Θ , ¯S γ,θ is the Markov kernel defined by the recursion (18) but with ¯ U θ and ¯ V θ in place of U θ and V θ respectively. We use the results derived in [17] to analyse the sequencegiven by (5) with { (K γ,θ , ¯K γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } = { (R γ,θ , ¯R γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } or { (S γ,θ , ¯S γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } . To this end, we impose that for any γ ∈ (0 , ¯ γ ] and θ ∈ Θ ,the kernels K γ,θ and ¯K γ,θ admit an invariant probability distribution, denoted by π γ,θ and ¯ π γ,θ respectively which are approximations of π θ and ¯ π θ defined in A
3, and geometrically convergetowards them. More precisely, we show in Theorem 4 and Theorem 5 below, that MYULA andPULA satisfy these conditions if at least one of the following assumptions is verified:
H2.
There exists m > such that for any θ ∈ Θ , V θ and ¯ V θ are m -convex. H 3.
There exist η > and c > such that for any θ ∈ Θ and x ∈ R d , min( U θ ( x ) , ¯ U θ ( x )) > η k x k − c . Note that if for any θ ∈ Θ , U θ is convex on R d and sup θ ∈ Θ ( R R d exp[ − U θ (˜ x )]d˜ x ) < + ∞ , then H H x p ( x | θ ) is log-concave and proper for any θ ∈ Θ . In [49], if theprior x p ( x | θ ) is improper for some θ ∈ Θ then we require H i.e. for any y ∈ C d y ,there exists m > such that for any θ ∈ Θ , x p ( x | y, θ ) is m -log-concave. Finally, we believe that H η > and c > such that for any θ ∈ Θ and x ∈ R d , min( U θ ( x )+ V θ ( x ) , ¯ U θ ( x )+ ¯ V θ ( x )) > η k x k− c . In particular, this latter condition holdsin the case where x p ( x | θ ) = exp[ − θ ⊤ TV( x )] and sup θ ∈ Θ ( R R d exp[ − U θ (˜ x ) + V θ (˜ x )]d˜ x ) < + ∞ .Consider for any m ∈ N ∗ and α > , the two functions W m and W α given for any x ∈ R d by W m ( x ) = 1 + k x k m , W α = exp (cid:20) α q k x k (cid:21) . (19) Theorem 4.
Assume H and H or H . Let ¯ κ > > κ > / , ¯ γ < min { (2 − /κ ) / L , / ( m + L ) } if H holds and ¯ γ < min { (2 − /κ ) / L , η/ (2 ML ) } if H holds. Then for any a ∈ (0 , , there exist ¯ A ,a > and ρ a ∈ (0 , such that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] , R γ,θ and ¯R γ,θ admitinvariant probability measures π γ,θ , respectively ¯ π γ,θ . In addition, for any x, y ∈ R d and n ∈ N wehave max (cid:0) k δ x R nγ,θ − π γ,θ k W a , k δ x ¯R nγ,θ − ¯ π γ,θ k W a (cid:1) ¯ A ,a ¯ ρ γna W a ( x ) , max (cid:0) k δ x R nγ,θ − δ y R nγ,θ k W a , k δ x ¯R nγ,θ − δ y ¯R nγ,θ k W a (cid:1) ¯ A ,a ¯ ρ γna { W a ( x ) + W a ( y ) } , with W = W m and m ∈ N ∗ if H holds and W = W α with α < min( κη/ , η/ if H holds.Proof. The proof is postponed to Section 5.2.
Theorem 5.
Assume H and H or H . Let Let ¯ κ > > κ > / , ¯ γ < / ( m + L ) if H holdsand ¯ γ < / L if H holds. Then for any a ∈ (0 , , there exist A ,a > and ρ a ∈ (0 , such thatfor any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] , S γ,θ and ¯S γ,θ admit an invariant probability measure π γ,θ and ¯ π γ,θ respectively. In addition, for any x, y ∈ R d and n ∈ N we have max (cid:0) k δ x S nγ,θ − π γ,θ k W a , k δ x ¯S nγ,θ − ¯ π γ,θ k W a (cid:1) A ,a ρ γna W a ( x ) , max (cid:0) k δ x S nγ,θ − δ y S nγ,θ k W a , k δ x ¯S nγ,θ − δ y ¯S nγ,θ k W a (cid:1) A ,a ρ γna { W a ( x ) + W a ( y ) } , with W = W m and m ∈ N ∗ if H holds and W = W α with α < κη/ if H holds.Proof. The proof is postponed to Section 5.3. 8 .2 Main results
We now state our main results regarding the convergence of the sequence defined by (5) under thefollowing additional regularity assumption.
H4.
There exist M Θ > and f Θ ∈ C( R + , R + ) such that for any θ , θ ∈ Θ , x ∈ R d , max (cid:0) k∇ x V θ ( x ) − ∇ x V θ ( x ) k , k∇ x ¯ V θ ( x ) − ∇ x ¯ V θ ( x ) k (cid:1) M Θ k θ − θ k (1 + k x k ) , max (cid:0) k∇ x U κ θ ( x ) − ∇ x U κ θ ( x ) k , k∇ x ¯ U κ θ ( x ) − ∇ x ¯ U κ θ ( x ) k (cid:1) f Θ ( κ ) k θ − θ k (1 + k x k ) . In Theorem 6, we give sufficient conditions on the parameters of the algorithm under which thesequence ( θ n ) n ∈ N converges a.s., and we give explicit convergence rates in Theorem 7. Theorem 6.
Assume A , A , A and that f is convex. Let κ ∈ [ κ, ¯ κ ] with ¯ κ > > κ > / .Assume H and one of the following conditions:(a) H holds, ¯ γ < min(2 / ( m + L ) , (2 − /κ ) / L , L − ) and there exists m ∈ N ∗ and C m > suchthat for any θ ∈ Θ and x ∈ R d , k H θ ( x ) k C m W / m ( x ) and k ¯ H θ ( x ) k C m W / m ( x ) .(b) H holds, ¯ γ < min((2 − /κ ) / L , η/ (2 ML ) , L − ) and there exists < α < η/ , C α > suchthat for any θ ∈ Θ and x ∈ R d , k H θ ( x ) k C α W / α ( x ) and k ¯ H θ ( x ) k C α W / α ( x ) .Let ( γ n ) n ∈ N , ( δ n ) n ∈ N be sequences of non-increasing positive real numbers and ( m n ) n ∈ N be a se-quence of non-decreasing positive integers satisfying δ < /L f and γ < ¯ γ . Let ( { ( X nk , ¯ X nk ) : k ∈{ , . . . , m n }} , θ n ) n ∈ N be given by (5) . In addition, assume that P + ∞ n =0 δ n +1 = + ∞ , P + ∞ n =0 δ n +1 γ / n < + ∞ and that one of the following conditions holds:(1) P + ∞ n =0 δ n +1 / ( m n γ n ) < + ∞ ; (2) m n = m ∈ N ∗ for all n ∈ N , sup n ∈ N | δ n +1 − δ n | δ − n < + ∞ , H holds and we have P + ∞ n =0 δ n +1 γ − n < + ∞ , P + ∞ n =0 δ n +1 γ − n +1 ( γ n − γ n +1 ) < + ∞ . Then ( θ n ) n ∈ N converges a.s. to some θ ⋆ ∈ arg min Θ f . Furthermore, a.s. there exists C > suchthat for any n ∈ N ∗ ( n X k =1 δ k f ( θ k ) , n X k =1 δ k ) − min Θ f C , n X k =1 δ k ! . Proof.
The proof is postponed to Section 5.6.These results are similar to the ones identified in [17, Theorem 1, Theorem 5, Theorem 6] forthe Stochastic Optimization with Unadjusted Langevin (SOUL) algorithm. Note that in SOUL thepotential is assumed to be differentiable and the sampler is given by ULA, whereas in Theorem 6,the results are stated for PULA and MYULA samplers.Although rigorously establishing convexity of f is usually not possible for imaging models, weexpect that in many cases, for any of its minimizer θ ⋆ , f is convex in some neighborhood of θ ⋆ .For example, this is the case if its Hessian is definite positive around this point.Assume that δ n ∼ n − a , γ n ∼ n − b and m n ∼ n − c with a, b, c > . We now distinguish two casesdepending on if for all n ∈ N , m n = m ∈ N ∗ (fixed batch size) or not (increasing size).1) In the increasing batch size case, Theorem 6 ensures that ( θ n ) n ∈ N converges if the followinginequalities are satisfied a + b/ > , a − b + c > , a . (20)Note in particular that c > , i.e. the number of Markov chain iterates required to compute theestimator of the gradient increases at each step. However, for any a ∈ [0 , there exist b, c > such that (20) is satisfied. In the special setting where a = 0 then for any ε > ε > such that b = 2 + ε and c = 3 + ε satisfy the results of (20) hold.2) In the fixed batch size case, which implies that c = 0 , Theorem 6 ensures that ( θ n ) n ∈ N convergesif the following inequalities are satisfied a + b/ > , a − b ) > , a + b + 1 − b > a , which can be rewritten as b ∈ (2(1 − a ) , min( a − / , a/ , a ∈ [0 , . The interval (2( a − , min( a − / , a/ is then not empty if and only if a ∈ (5 / , .9 heorem 7. Assume A , A , A and that f is convex. Let κ ∈ [ κ, ¯ κ ] with ¯ κ > > κ > / .Assume H and that the condition (a) or (b) in Theorem 6 is satisfied. Let ( γ n ) n ∈ N , ( δ n ) n ∈ N besequences of non-increasing positive real numbers and ( m n ) n ∈ N be a sequence of non-decreasingpositive integers satisfying δ < /L f and γ < ¯ γ . Let ( { ( X nk , ¯ X nk ) : k ∈ { , . . . , m n }} , θ n ) n ∈ N begiven by (5) E "( n X k =1 δ k f ( θ k ) , n X k =1 δ k ) − min Θ f E n , n X k =1 δ k ! , where(a) E n = C ( n − X k =0 δ k +1 γ / k + n − X k =0 δ k +1 / ( m k γ k ) + n − X k =0 δ k +1 / ( m k γ k ) ) . (21) (b) or if m n = m for all n ∈ N , sup n ∈ N | δ n +1 − δ n | δ − n < + ∞ and H holds E n = C ( n − X k =0 δ k +1 γ / k + n − X k =0 δ k +1 /γ k + n − X k =0 δ k +1 γ − k +1 ( γ k − γ k +1 ) ) . (22) Proof.
The proof is postponed to Section 5.7.First, note that if the stepsize is fixed and recalling that κ = λ/γ then the condition γ < (2 − /κ ) / L can be rewritten as γ < / ( L + λ − ) . Assume that ( δ n ) n ∈ N is non-increasing, lim n → + ∞ δ n =0 , lim n → + ∞ m n = + ∞ and γ n = γ > for all n ∈ N . In addition, assume that P n ∈ N ∗ δ n = + ∞ then, by [37, Problem 80, Part I], it holds that ( lim n → + ∞ [ ( P nk =1 δ k /m k )/( P nk =1 δ k ) ] = lim n → + ∞ /m n = 0 ;lim n → + ∞ (cid:2) (cid:0)P nk =1 δ k (cid:1)(cid:14) ( P nk =1 δ k ) (cid:3) = lim n → + ∞ δ n = 0 . (23)Therefore, using (21) we obtain that lim sup n → + ∞ E "( n X k =1 δ k f ( θ k ) , n X k =1 δ k ) − min f C √ γ . Similarly, if the stepsize is fixed and the number of Markov chain iterates is fixed, i.e. for all n ∈ N , γ n = γ and m n = m with γ > and m ∈ N ∗ , combining (22) and (23) we obtain that lim sup n → + ∞ E "( n X k =1 δ k f ( θ k ) , n X k =1 δ k ) − min f C √ γ . In this section, we gather the proofs of Section 4. First, in Section 5.1 we derive some usefultechnical lemmas. In Section 5.2, we prove Theorem 4, using minorisation and Foster-Lyapunovdrift conditions. Similarly, we prove Theorem 5 in Section 5.3. Next, we show Theorem 6 byapplying [17, Theorem 1, Theorem 3] and Theorem 7 by applying [17, Theorem 2, Theorem4], which boils down to verifying that [17, H1, H2] are satisfied. In Section 5.4, we show that[17, H1, H2] hold if the sequence is given by (5) where { (K γ,θ , ¯K γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } = { (R γ,θ , ¯R γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } defined in (18), i.e. we consider PULA as a sampling schemein the optimization algorithm. In Section 5.5 we check that [17, H1, H2] are satisfied when { (K γ,θ , ¯K γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } = { (S γ,θ , ¯S γ,θ ) : γ ∈ (0 , ¯ γ ] , θ ∈ Θ } defined in (17), i.e. whenconsidering MYULA as a sampling scheme. Finally, we prove Theorem 6 in Section 5.6 andTheorem 7 in Section 5.7. 10 .1 Technical lemmas We say that a Markov kernel R on R d × B ( R d ) satisfies a discrete Foster-Lyapunov drift condition D d ( W, λ, b ) if there exist λ ∈ (0 , , b > and a measurable function W : R d → [1 , + ∞ ) such thatfor all x ∈ R d R W ( x ) λW ( x ) + b . We will use the following result.
Lemma 8.
Let R be a Markov kernel on R d × B ( R d ) which satisfies D d ( W, λ γ , bγ ) with λ ∈ (0 , , b > , γ > and a measurable function W : R d → [1 , + ∞ ) . Then, we have for any x ∈ R d R ⌈ /γ ⌉ W ( x ) (1 + b log − (1 /λ ) λ − ¯ γ ) W ( x ) . Proof.
Using [17, Lemma 9] we have for any x ∈ R d R ⌈ /γ ⌉ W ( x ) λ γ ⌈ /γ ⌉ + bγ ⌈ /γ ⌉− X k =0 λ γk W ( x ) (1 + b log − (1 /λ ) λ − ¯ γ ) W ( x ) . We continue this section by giving some results on proximal operators. Some of them arewell-known but their proof is given for completeness.
Lemma 9.
Let κ > and U : R d → R convex. Assume that U is M -Lipschitz with M > , then U κ is M -Lipschitz and for any x ∈ R d , k x − prox κ U ( x ) k κ M .Proof. Let κ > . We have for any x, y ∈ R d by (7) and (8) U κ ( x ) − U κ ( y )= k x − prox κ U ( x ) k / (2 κ ) + U (prox κ U ( x )) − k y − prox κ U ( y ) k / (2 κ ) − U (prox κ U ( y )) k y − prox κ U ( y ) k / (2 κ ) + U ( x − y + prox κ U ( y )) − k y − prox κ U ( y ) k / (2 κ ) − U (prox κ U ( y )) M k x − y k . Hence, U κ is M -Lipschitz. Since by [5, Proposition 12.30], U κ is continuously differentiable wehave for any x ∈ R d , k∇ U κ ( x ) k M . Combining this result with the fact that for any x ∈ R d , ∇ U κ ( x ) = ( x − prox κ U ( x )) / κ by [5, Proposition 12.30] concludes the proof. Lemma 10.
Let U : R d → [0 , + ∞ ) be a convex and M -Lipschitz function with M > . Then forany κ > and z, z ′ ∈ R d , h prox κ U ( z ) − z, z i − κ U ( z ) + κ M + κ { U ( z ′ ) + M k z ′ k} . Proof. κ > and z, z ′ ∈ R d . Since ( z − prox κ U ( z )) / κ ∈ ∂U (prox κ U ( z )) [5, Proposition 16.44], wehave κ { U ( z ′ ) − U (prox κ U ( z )) } > h z − prox κ U ( z ) , z ′ − prox κ U ( z ) i > h z − prox κ U ( z ) , z ′ − z i + k z − prox κ U ( z ) k > h z − prox κ U ( z ) , z ′ − z i . Combining this result, the fact that U is M -Lipschitz and Lemma 9 we get that h prox κ U ( z ) − z, z i κ U ( z ′ ) − κ U ( z ) + κ M k z − prox κ U ( z ) k + k z ′ k k z − prox κ U ( z ) k − κ U ( z ) + κ M + κ { U ( z ′ ) + M k z ′ k} , which concludes the proof Lemma 11.
Let κ , κ > and U : R d → R convex and lower semi-continuous. For any x ∈ R d we have k prox κ U ( x ) − prox κ U ( x ) k κ − κ )( U (prox κ U ( x )) − U (prox κ U ( x ))) . If in addition, U is M -Lipschitz with M > then k prox κ U ( x ) − prox κ U ( x ) k M | κ − κ | . roof. Let x ∈ R d . By definition of prox κ U ( x ) we have κ U (prox κ U ( x )) + k x − prox κ U ( x ) k κ U (prox κ U ( x )) + k x − prox κ U ( x ) k . Combining this result and the fact that ( x − prox κ U ( x )) / κ ∈ ∂U (prox κ U ( x )) we have k prox κ U ( x ) − prox κ U ( x ) k κ { U (prox κ U ( x )) − U (prox κ U ( x )) } + 2 h x − prox κ U ( x ) , prox κ U ( x ) − prox κ U ( x ) i κ { U (prox κ U ( x )) − U (prox κ U ( x )) } + 2 κ { U (prox κ U ( x )) − U (prox κ U ( x )) } κ − κ )( U (prox κ U ( x )) − U (prox κ U ( x ))) , which concludes the proof. Lemma 12.
Let V : R d → R m -convex and continuously differentiable with m > . Assume thatthere exists M > such that for any x, y ∈ R d k∇ V ( x ) − ∇ V ( y ) k M k x − y k . Assume that there exists x ⋆ ∈ arg min R d V , then for any γ ∈ (0 , ¯ γ ] with ¯ γ < / ( M + m ) and x ∈ R d k x − γ ∇ V ( x ) k (1 − γ̟ ) k x k + γ { (2 / ( m + M ) − ¯ γ ) − + 4 ̟ } k x ⋆ k , with ̟ = m M/ ( m + M ) .Proof. Let x ∈ R d , γ ∈ (0 , ¯ γ ] and ¯ γ < / ( m + M ) . Using [36, Theorem 2.1.11] and the fact that forany a, b, ε > , εa + b /ε > ab we have k x − γ ∇ V ( x ) k k x k − γ h∇ V ( x ) − ∇ V ( x ⋆ ) , x − x ⋆ i + γ ¯ γ k∇ V ( x ) − ∇ V ( x ⋆ ) k + 2 γ k x ⋆ k k∇ V ( x ) − ∇ V ( x ⋆ ) k k x k − γ̟ k x − x ⋆ k − γ (2 / ( m + M ) − ¯ γ ) k∇ V ( x ) − ∇ V ( x ⋆ ) k + 2 γ k x ⋆ k k∇ V ( x ) − ∇ V ( x ⋆ ) k k x k − γ̟ k x − x ⋆ k − γ (2 / ( m + M ) − ¯ γ ) k∇ V ( x ) − ∇ V ( x ⋆ ) k + γ (2 / ( m + M ) − ¯ γ ) k∇ V ( x ) − ∇ V ( x ⋆ ) k + γ/ (2 / ( m + M ) − ¯ γ ) k x ⋆ k (1 − γ̟ ) k x k + 4 γ̟ k x ⋆ k k x k + γ/ (2 / ( m + M ) − ¯ γ ) k x ⋆ k (1 − γ̟ ) k x k + γ (cid:8) (2 / ( m + M ) − ¯ γ ) − + 4 ̟ (cid:9) k x ⋆ k . Lemma 13.
Assume H and H . Then for any κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] with ¯ γ < / ( m + L ) and x ∈ R d , we have (cid:13)(cid:13) prox γκU θ ( x ) − γ ∇ x V θ (prox γκU θ ( x )) (cid:13)(cid:13) (1 − γ̟/ k x k + γ (cid:2) ¯ γκ M + (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, +2 κ M ̟ − (cid:3) , with ̟ = mL / ( m + L ) . Proof.
Let κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] and x ∈ R d . Using H H
2, Lemma 9, Lemma 12, theCauchy-Schwarz inequality and that for any α, β > , max t ∈ R ( − αt + 2 βt ) = β /α , we have (cid:13)(cid:13) prox γκU θ ( x ) − γ ∇ x V θ (prox γκU θ ( x )) (cid:13)(cid:13) (1 − γ̟ ) (cid:13)(cid:13) prox γκU θ ( x ) (cid:13)(cid:13) + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) k x ⋆θ k (1 − γ̟ ) (cid:13)(cid:13) x − prox γκU θ ( x ) − x (cid:13)(cid:13) + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, (1 − γ̟ ) k x k + γ κ M + 2 γκ M k x k + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, (1 − γ̟/ k x k + γ κ M + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 γκ M k x k − γ̟ k x k / (1 − γ̟/ k x k + γ ¯ γκ M + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 γκ M ̟ − . emma 14. Assume H and H . Then for any κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] with ¯ γ < / L and x ∈ R d , we have (cid:13)(cid:13) prox γκU θ ( x ) − γ ∇ x V θ (prox γκU θ ( x )) (cid:13)(cid:13) k x k + γ (cid:2) γκ M + 2 κ c + 2 κ ( R U, + M R U, )+(2 /L − ¯ γ ) − R V, − κη k x k (cid:3) . Proof.
Let κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] and x ∈ R d . Using H H
3, Lemma 9 and Lemma 10 andLemma 12 we have (cid:13)(cid:13) prox γκU θ ( x ) − γ ∇ x V θ (prox γκU θ ( x )) (cid:13)(cid:13) k prox γκU θ ( x ) k + γ/ (2 / L − ¯ γ ) R V, k x k + γ κ M + 2 h prox γκU θ ( x ) − x, x i + γ/ (2 / L − ¯ γ ) R V, k x k + 3 γ κ M − γκU ( x ) + 2 γκ ( U ( x ♯θ ) + M k x ♯θ k ) + γ/ (2 / L − ¯ γ ) R V, k x k + 3 γ κ M − γκη k x k + 2 γκ c + 2 γκ ( U ( x ♯θ ) + M k x ♯θ k ) + γ/ (2 / L − ¯ γ ) R V, k x k + γ (cid:2) γκ M + 2 κ c + 2 κ ( R U, + M R U, ) + (2 /L − ¯ γ ) − R V, − κη k x k (cid:3) . Lemma 15.
Assume H and H . Then for any κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] with ¯ γ < / ( m + L ) and x ∈ R d , we have k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) k (1 − γ̟/ k x k + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 γ ML R V, + γ M + 2 γ M (1 + ¯ γ L ) ̟ − , with ̟ = mL / (2 m + 2 L ) .Proof. Let κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] and x ∈ R d . Using H H
2, Lemma 9, Lemma 12 and that forany α, β > , max( − αt + 2 βt ) = β /α we have k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) k k x − γ ∇ x V θ ( x ) k + 2 γ M k x − γ {∇ x V θ ( x ) − ∇ x V θ ( x ⋆θ ) }k + γ M (1 − γ̟ ) k x k + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) k x ⋆θ k + 2 γ M k x k + 2 γ M k∇ x V θ ( x ) − ∇ x V θ ( x ⋆θ ) k + γ M (1 − γ̟ ) k x k + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) k x ⋆θ k + 2 γ M k x k + 2 γ ML k x k + 2 γ ML k x ⋆θ k + γ M (1 − γ̟/ k x k + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 γ ML R V, + γ M + 2 γ M (1 + ¯ γ L ) k x k − γ̟ k x k / (1 − γ̟/ k x k + γ (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 γ ML R V, + γ M + 2 γ M (1 + ¯ γ L ) ̟ − . Lemma 16.
Assume H and H . Then for any κ > , θ ∈ Θ , x ∈ R d and γ ∈ (0 , ¯ γ ] with ¯ γ < min(2 / L , η/ (2 ML )) , we have k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) k k x k + γ (cid:2) (2 /L − ¯ γ ) − R V, + 3¯ γ M + 2 c + 2( M R U, + R U, ) + 2¯ γ ML R V, − η k x k (cid:3) . Proof.
Let κ > , θ ∈ Θ , γ ∈ (0 , ¯ γ ] and x ∈ R d . Using H H
3, (7), Lemma 9 and Lemma 10 we13ave k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) k k x − γ ∇ x V θ ( x ) k − γ h x − γ ∇ x V θ ( x ) , ∇ x U γκθ ( x ) i + γ M k x − γ ∇ x V θ ( x ) k − κ − h x − γ ∇ x V θ ( x ) , x − prox γκU θ ( x ) i + γ M k x − γ ∇ x V θ ( x ) k − κ − h x, x − prox γκU θ ( x ) i + 2 κ − γ k∇ x V θ ( x ) k k x − prox γκU θ ( x ) k + γ M k x − γ ∇ x V θ ( x ) k + 3 γ M − γη k x k + 2 γ c + 2 γ ( M k x ♯θ k + U ( x ♯θ )) + 2 γ ¯ γ M k∇ x V θ ( x ) k k x − γ ∇ x V θ ( x ) k + 3 γ ¯ γ M − γη k x k + 2 γ c + 2 γ ( M R U, + R U, ) + 2 γ ¯ γ ML k x k + 2 γ ¯ γ ML k x ⋆θ k k x − γ ∇ x V θ ( x ) k + 3 γ ¯ γ M − γη k x k + 2 γ c + 2 γ ( M R U, + R U, ) + 2 γ ¯ γ ML k x ⋆θ k , where we have used for the last inequality that ¯ γ < η/ (2 ML ) . Then, we can conclude using H k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) k k x k + γ/ (2 /L − ¯ γ ) R V, + 3 γ ¯ γ M − γη k x k + 2 γ c + 2 γ ( M R U, + R U, ) + 2 γ ¯ γ ML R V, k x k + γ (cid:2) (2 /L − ¯ γ ) − R V, + 3¯ γ M + 2 c + 2( M R U, + R U, ) + 2¯ γ ML R V, − η k x k (cid:3) . For υ ∈ R d and σ > , denote Υ υ, σ the d -dimensional Gaussian distribution with mean υ andcovariance matrix σ Id . Lemma 17.
For any σ , σ > and υ , υ ∈ R d , we have KL (Υ υ , σ Id | Υ υ , σ Id ) = k υ − υ k / (2 σ ) + ( d/ (cid:8) − log( σ / σ ) − σ / σ (cid:9) . In addition, if σ > σ KL (Υ υ , σ Id | Υ υ , σ Id ) k υ − υ k / (2 σ ) + ( d/ − σ / σ ) . Proof.
Let X be a d -dimensional Gaussian random variable with mean υ and covariance matrix σ Id . We have that KL (Υ υ , σ Id | Υ υ , σ Id ) = E h log n ( σ / σ ) d/ exp h − k X − υ k / (2 σ ) + k X − υ k / (2 σ ) ioi = − ( d/
2) log( σ / σ ) + E h − k X − υ k / (2 σ ) + k X − υ k / (2 σ ) i = − ( d/
2) log( σ / σ ) + (1 / σ − − σ − ) E h − k X − υ k i + (cid:13)(cid:13) υ − υ (cid:13)(cid:13) / (2 σ )= − ( d/
2) log( σ / σ ) + ( d/ σ / σ −
1) + (cid:13)(cid:13) υ − υ (cid:13)(cid:13) / (2 σ )= k υ − υ k / (2 σ ) + ( d/ (cid:8) − log( σ / σ ) − σ / σ (cid:9) . In the case where σ > σ , let s = σ / σ − . Since s > we have log(1 + s ) > s − s . Therefore,we get that − log( σ / σ ) − σ / σ = − log(1 + s ) + s s , which concludes the proof. We show that under H H
3, Foster-Lyapunov drifts hold for MYULA in Lemma 18 andLemma 19. Combining these Foster-Lyapunov drifts with an appropriate minorisation conditionLemma 20, we obtain the geometric ergodicity of the underlying Markov chain in Theorem 21.
Lemma 18.
Assume H and H . Then for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] and γ ∈ (0 , ¯ γ ] with ¯ κ > > κ > / , ¯ γ < / ( m + L ) , R γ,θ and ¯R γ,θ satisfy D d ( W , λ γ , b γ ) with λ = exp [ − ̟/ ,b = (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2¯ γ ML R V, + ¯ γ M + 2 d + 2 M (1 + ¯ γ L ) ̟ − + ̟/ ,̟ = mL / ( m + L ) , here for any x ∈ R d , W ( x ) = 1 + k x k . In addition, for any m ∈ N ∗ , there exist λ m ∈ (0 , , b m > such that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] with ¯ κ > > κ > / , ¯ γ < / ( m + L ) , R γ,θ and ¯R γ,θ satisfy D d ( W m , λ γm , b m γ ) , where W m is given in (19) .Proof. We show the property for R γ,θ only as the proof for ¯R γ,θ is identical. Let θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] and x ∈ R d . Let Z be a d -dimensional Gaussian random variable with zero mean andidentity covariance matrix. Using Lemma 15 we have Z R d k y k R γ,θ ( x, d y ) = E (cid:20)(cid:13)(cid:13)(cid:13) x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) + p γZ (cid:13)(cid:13)(cid:13) (cid:21) = k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) k + 2 γd (1 − γ̟/ k x k + γ (cid:2)(cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, +2¯ γ ML R V, + ¯ γ M + 2 d + 2 M (1 + ¯ γ L ) ̟ − (cid:3) . Therefore, we get Z R d (1 + k y k )R γ,θ ( x, d y ) (1 − γ̟/ k x k ) + γ (cid:2)(cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, +2¯ γ ML R V, + ¯ γ M + 2 d + 2 M (1 + ¯ γ L ) ̟ − + ̟/ (cid:3) , which concludes the first part of the proof. Let T γ,θ ( x ) = x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) . In thesequel, for any k ∈ { , . . . , m } , b, ˜ b k > and λ, ˜ λ k ∈ [0 , are constants independent of γ whichmay take different values at each appearance. Note that using Lemma 15, for any k ∈ { , . . . , m } there exist ˜ λ k ∈ (0 , and ˜ b k > such that kT γ,θ ( x ) k k { ˜ λ γk k x k + γ ˜ b k } k (24) ˜ λ γkk k x k k + γ k max(˜ b k , k max(¯ γ, k − n k x k k − o ˜ λ γk k x k k + ˜ b k γ n k x k k − o (1 + k x k k )(1 + ˜ b k γ ) . Therefore, combining (24) and the Cauchy-Schwarz inequality we obtain Z R d (1 + k y k )R γ,θ ( x, d y ) = 1 + E h ( kT γ,θ ( x ) k + 2 p γ hT γ,θ ( x ) , Z i + 2 γ k Z k ) m i = 1 + m X k =0 k X ℓ =0 (cid:18) mk (cid:19)(cid:18) kℓ (cid:19) kT γ,θ ( x ) k m − k ) (3 k − ℓ ) / γ ( k + ℓ ) / E h hT γ,θ ( x ) , Z i k − ℓ k Z k ℓ i kT γ,θ ( x ) k m + 2 m/ m X k =1 k X ℓ =0 (cid:18) mk (cid:19)(cid:18) kℓ (cid:19) kT γ,θ ( x ) k m − k ) γ ( k + ℓ ) / E h hT γ,θ ( x ) , Z i k − ℓ k Z k ℓ i { (1 , } c ( k, ℓ ) kT γ,θ ( x ) k m + γ m/ m X k =1 k X ℓ =0 (cid:18) mk (cid:19)(cid:18) kℓ (cid:19) kT γ,θ ( x ) k m − k − ℓ ¯ γ ( k + ℓ ) / − E h k Z k k + ℓ i { (1 , } c ( k, ℓ ) λ γ m k x k m + b m γ n k x k m − o + γ m/ m max(¯ γ, m sup k ∈{ ,...,m } n (1 + ˜ b k ¯ γ ) E h k Z k k io (1 + k x k m − ) λ γ k x k m + γb (1 + k x k m − ) λ γ/ (1 + k x k m ) + γb (1 + k x k m − ) + λ γ (1 + k x k m ) − λ γ/ (1 + k x k m ) . Using that λ γ − λ γ/ − log(1 /λ ) γλ γ/ / , concludes the proof. Lemma 19.
Assume H and H . Then for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] and γ ∈ (0 , ¯ γ ] with ¯ κ > > > / , ¯ γ < min(2 / L , η/ (2 ML )) , R γ,θ and ¯R γ,θ satisfy D d ( W, λ γ , bγ ) with λ = e − α ,b e = (4 /L − γ ) − R V, + (3 / γ M + c + M R U, + R U, + ¯ γ ML R V, + d + 2 α ,b = αb e e α ¯ γb e W ( R ) ,W = W α , α < η/ ,R η = max (2 b e / ( η − α ) , , (25) where W α is given in (19) .Proof. We show the property for R γ,θ only as the proof for ¯R γ,θ is identical. Let θ ∈ Θ , κ ∈ [ κ, ¯ κ ] γ ∈ (0 , ¯ γ ] , x ∈ R d and Z be a d -dimensional Gaussian random variable with zero mean and identitycovariance matrix. Using Lemma 16 we have Z R d k y k R γ,θ ( x, d y ) = k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ k + 2 γd k x k + γ (cid:2) (2 /L − ¯ γ ) − R V, + 3¯ γ M + 2 c + 2( M R U, + R U, ) + 2¯ γ ML R V, + 2 d − η k x k (cid:3) . Using the log-Sobolev inequality [3, Proposition 5.4.1] and Jensen’s inequality we get that R γ,θ W ( x ) exp (cid:2) α R γ,θ φ ( x ) + α γ (cid:3) (26) exp " α (cid:18) Z R d k y k R γ,θ ( x, d y ) (cid:19) / + α γ . We now distinguish two cases:(a) If k x k > R η , recalling that R η is given in (25), then (2 /L − ¯ γ ) − R V, + 3¯ γ M + 2 c + 2( M R U, + R U, ) + 2¯ γ ML R V, + 2 d − η k x k − α k x k . In this case using that φ − ( x ) k x k > / and that for any t > , √ t t/ we have (cid:18) Z R d k y k R γ,θ ( x, d y ) (cid:19) / − φ ( x ) γφ − ( x ) (cid:0) (2 /L − ¯ γ ) − R V, + 3¯ γ M + 2 c + 2( M R U, + R U, ) + 2¯ γ ML R V, + 2 d − η k x k (cid:1)(cid:14) − αγφ − ( x ) k x k − αγ . Hence, R γ,θ W ( x ) " α (cid:18) Z R d k y k R γ,θ ( x, d y ) (cid:19) / + α γ e − α γ W ( x ) . (b) If k x k R η then using that for any t > , √ t t/ we have (cid:18) Z R d k y k R γ,θ ( x, d y ) (cid:19) / − φ ( x ) γ ((4 /L − γ ) − R V, + (3 / γ M + c + M R U, + R U, + ¯ γ ML R V, + d ) . Therefore, using (26), we get R γ,θ W ( x ) exp (cid:2) αγ (cid:8) (4 /L − γ ) − R V, + (3 / γ M + c + M R U, + R U, + ¯ γ ML R V, + d + α (cid:9)(cid:3) W ( x ) . Since for all a > b , e a − e b ( a − b )e a we obtain that R γ,θ W ( x ) λ γ W ( x ) + γαb e e α ¯ γb e W ( R η ) , which concludes the proof. 16 emma 20. Assume H . For any κ ∈ [ κ, ¯ κ ] , θ ∈ Θ , γ ∈ (0 , ¯ γ ] with ¯ κ > > κ > / , ¯ γ < (2 − /κ ) / L and x, y ∈ R d max (cid:16) k δ x R ⌈ /γ ⌉ γ,θ − δ y R ⌈ /γ ⌉ γ,θ k TV , k δ x ¯R ⌈ /γ ⌉ γ,θ − δ y ¯R ⌈ /γ ⌉ γ,θ k TV (cid:17) − Φ n − k x − y k / (2 √ o , where Φ is the cumulative distribution function of the standard normal distribution on R .Proof. We only show that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] with ¯ κ > > κ > / , ¯ γ < (2 − /κ ) / L and x, y ∈ R d , we have k δ x R ⌈ /γ ⌉ γ,θ − δ y R ⌈ /γ ⌉ γ,θ k TV − Φ (cid:8) − k x − y k / (2 √ (cid:9) as the proof of for ¯R γ,θ is similar. Let κ ∈ [ κ, ¯ κ ] , θ ∈ Θ , γ ∈ (0 , ¯ γ ] . We have that x V θ ( x ) + U γκθ ( x ) is convex,continuously differentiable and satisfies for any x, y ∈ R d k∇ x V θ ( x ) + ∇ x U γκθ ( x ) − ∇ x V θ ( y ) − ∇ x U γκθ ( y ) k { L + 1 / ( γκ ) } k x − y k , Combining this result with [36, Theorem 2.1.5, Equation (2.1.8)] and the fact that γ / { L +1 / ( γκ ) } since ¯ γ (2 − /κ ) / L , we have for any x, y ∈ R d k x − γ ∇ x V θ ( x ) − γ ∇ x U γκθ ( x ) − y + γ ∇ x V θ ( y ) + γ ∇ x U γκθ ( y ) k k x − y k . The proof is then an application of [16, Proposition 3b] with ℓ ← , for any x ∈ R d , T γ,θ ( x ) ← x − γ ∇ x V θ ( x ) − γ ∇ x ∇ U γκθ ( x ) and Π ← Id . Theorem 21.
Assume H and H or H . Let ¯ κ > > κ > / , ¯ γ < min { (2 − /κ ) / L , / ( m + L ) } if H holds and ¯ γ < min { (2 − /κ ) / L , η/ (2 ML ) } if H holds. Then for any a ∈ (0 , , there exist A ,a > and ρ a ∈ (0 , such that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] , R γ,θ and ¯R γ,θ admitinvariant probability measures π γ,θ , respectively ¯ π γ,θ , and for any x, y ∈ R d and n ∈ N we have max (cid:0) k δ x R nγ,θ − π γ,θ k W a , k δ x ¯R nγ,θ − ¯ π γ,θ k W a (cid:1) A ,a ρ γna W a ( x ) , max (cid:0) k δ x R nγ,θ − δ y R nγ,θ k W a , k δ x ¯R nγ,θ − δ y ¯R nγ,θ k W a (cid:1) A ,a ρ γna { W a ( x ) + W a ( y ) } , with W = W m and m ∈ N ∗ if H holds and W = W α with α < min( κη/ , η/ if H holds, see (19) .Proof. We only show that for any a ∈ (0 , , there exist A ,a > and ρ a ∈ (0 , such that forany θ ∈ Θ , κ ∈ [ κ, ¯ κ ] and γ ∈ (0 , ¯ γ ] we have k δ x R nγ,θ − π γ,θ k W a A ,a ρ γna W a ( x ) and k δ x R nγ,θ − δ y R nγ,θ k W a A ,a ρ γna { W a ( x ) + W a ( y ) } , since the proof for ¯R γ,θ is similar . Let a ∈ [0 , . First,using Jensen’s inequality and Lemma 18 if H H λ a and b a such that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] , R γ,θ and ¯R γ,θ satisfy D d ( W a , λ γa , b a γ ) .Combining [16, Theorem 6], Lemma 20 and D d ( W a , λ γa , b a γ ) , we get that there exist ¯ A ,a > and ρ a ∈ (0 , such that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] , x, y ∈ R d and n ∈ N , R γ,θ and ¯R γ,θ admitinvariant probability measures π γ,θ and ¯ π γ,θ respectively and max (cid:8) k δ x R nγ,θ − δ y R nγ,θ k W a , k δ x ¯R nγ,θ − δ y ¯R nγ,θ k W a (cid:9) ¯ A ,a ρ γna { W a ( x ) + W a ( y ) } . (27)Using that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] and γ ∈ (0 , ¯ γ ] , R γ,θ and ¯R γ,θ satisfy D d ( W a , λ γa , b a γ ) and [17,Lemma S2] we have π γ,θ ( W a ) b a γ/ (1 − λ γa ) b a λ − ¯ γa / log(1 /λ a ) . (28)Hence, combining (27) and (28), we have for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] and n ∈ N max (cid:8) k δ x R nγ,θ − π γ,θ k W , k δ x ¯R nγ,θ − ¯ π γ,θ k W (cid:9) ¯ A ,a ρ γna (1 + b a λ − ¯ γa / log(1 /λ a )) W a ( x ) . We conclude upon letting A ,a = ¯ A ,a (1 + b a λ − ¯ γa / log(1 /λ a )) . We show that under H H
3, Foster-Lyapunov drifts hold for PULA in Lemma 22 and Lemma 23.Combining these Foster-Lyapunov drifts with an appropriate minorisation condition Lemma 24,we obtain the geometric ergodicity of the underlying Markov chain in Theorem 25.17 emma 22.
Assume H and H . Then for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] and γ ∈ (0 , ¯ γ ] with ¯ κ > > κ > / and ¯ γ < / ( m + L ) , S γ,θ and ¯S γ,θ satisfy D d ( W , λ γ , b γ ) with λ = exp [ − ̟/ ,b = ¯ γ ¯ κ M + (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 d + 2¯ κ M ̟ − + ̟/ ,̟ = mL / ( m + L ) , where for any x ∈ R d , W ( x ) = 1 + k x k . In addition, for any m ∈ N ∗ , there exist λ m ∈ (0 , , b m > such that for any θ ∈ Θ , κ ∈ [ κ, ¯ κ ] and γ ∈ (0 , ¯ γ ] with ¯ κ > > κ > / and ¯ γ < / ( m + L ) , S γ,θ and ¯S γ,θ satisfy D d ( W m , λ γm , b m γ ) , where W m is given in (19) .Proof. We show the property for S γ,θ only as the proof for ¯S γ,θ is identical. Let θ ∈ Θ , κ ∈ [ κ, ¯ κ ] , γ ∈ (0 , ¯ γ ] and x ∈ R d . Let Z be a d -dimensional Gaussian random variable with zero mean andidentity covariance matrix. Using Lemma 13 we have Z R d k y k S γ,θ ( x, d y ) = E (cid:20)(cid:13)(cid:13)(cid:13) prox γκU θ ( x ) − γ ∇ x V θ (prox γκU θ ( x )) + p γZ (cid:13)(cid:13)(cid:13) (cid:21) (1 − γ̟/ k x k + γ (cid:2) ¯ γκ M + (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, +2 κ M ̟ − (cid:3) + 2 γd . Therefore, we get Z R d (1 + k y k )S γ,θ ( x, d y ) (1 − γ̟/ k x k ) + γ (cid:2) ¯ γκ M + (cid:8) (2 / ( m + L ) − ¯ γ ) − + 4 ̟ (cid:9) R V, + 2 d + 2 κ M ̟ − + ̟/ (cid:3) , which concludes the first part of the proof using that for any t > , − t e − t . The proof of theresult for W = W m with m ∈ N ∗ is a straightforward adaptation of the one of Lemma 18 and isleft to the reader. Lemma 23.
Lemma 23. Assume H1 and H3. Then for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$ with $\bar{\kappa}>1>\underline{\kappa}>1/2$ and $\bar{\gamma}<1/(2L)$, $\mathrm{S}_{\gamma,\theta}$ and $\bar{\mathrm{S}}_{\gamma,\theta}$ satisfy $\mathbf{D}_d(W,\lambda^\gamma,b\gamma)$ with
\begin{align*}
\lambda &= \mathrm{e}^{-\alpha^2}\;, & \tilde{b} &= (3/2)\bar{\gamma}\bar{\kappa}^2M^2 + \bar{\kappa}c + \bar{\kappa}(R_{U,1}+MR_{U,2}) + (4/L-\bar{\gamma})^{-1}R_{V,1}^2 + d + 2\alpha\;, \\
b &= \alpha\tilde{b}\,\mathrm{e}^{\alpha\bar{\gamma}\tilde{b}}\,W(R_\eta)\;, & W &= W_\alpha\;,\quad 0<\alpha<\underline{\kappa}\eta/8\;,\quad R_\eta = \max\big(2\tilde{b}/(\underline{\kappa}\eta-8\alpha),\,1\big)\;,
\end{align*}
and where $W_\alpha$ is given in (19).

Proof. We show the property for $\mathrm{S}_{\gamma,\theta}$ only as the proof for $\bar{\mathrm{S}}_{\gamma,\theta}$ is identical. Let $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\gamma\in(0,\bar{\gamma}]$, $x\in\mathbb{R}^d$, and let $Z$ be a $d$-dimensional Gaussian random variable with zero mean and identity covariance matrix. Using Lemma 14 we have
\begin{align*}
\int_{\mathbb{R}^d}\|y\|^2\,\mathrm{S}_{\gamma,\theta}(x,\mathrm{d}y) &\le \big\|\operatorname{prox}_{\gamma\kappa U_\theta}(x)-\gamma\nabla_xV_\theta\big(\operatorname{prox}_{\gamma\kappa U_\theta}(x)\big)\big\|^2 + 2\gamma d \\
&\le \|x\|^2 + \gamma\big[\gamma\bar{\kappa}^2M^2 + 2\bar{\kappa}c + 2\bar{\kappa}(R_{U,1}+MR_{U,2}) + (2/L-\bar{\gamma})^{-1}R_{V,1}^2 + 2d - \kappa\eta\|x\|\big]\;.
\end{align*}
Using the log-Sobolev inequality [3, Proposition 5.4.1] and Jensen's inequality we get that
\[
\mathrm{S}_{\gamma,\theta}W(x) \le \exp\big[\alpha\,\mathrm{S}_{\gamma,\theta}\phi(x) + \alpha^2\gamma\big] \le \exp\Big[\alpha\Big(1+\int_{\mathbb{R}^d}\|y\|^2\,\mathrm{S}_{\gamma,\theta}(x,\mathrm{d}y)\Big)^{1/2} + \alpha^2\gamma\Big]\;. \tag{29}
\]
We now distinguish two cases.
(a) If $\|x\| > R_\eta$ then $\phi^{-1}(x)\|x\| > 1/2$ and, by definition of $R_\eta$,
\[
\gamma\bar{\kappa}^2M^2 + 2\bar{\kappa}c + 2\bar{\kappa}(R_{U,1}+MR_{U,2}) + (2/L-\bar{\gamma})^{-1}R_{V,1}^2 + 2d - \kappa\eta\|x\| \le -8\alpha\|x\|\;.
\]
In this case, using that for any $t>-1$, $\sqrt{1+t}\le 1+t/2$, we get
\begin{align*}
\Big(1+\int_{\mathbb{R}^d}\|y\|^2\,\mathrm{S}_{\gamma,\theta}(x,\mathrm{d}y)\Big)^{1/2} - \phi(x) &\le (\gamma/2)\,\phi^{-1}(x)\big[\gamma\bar{\kappa}^2M^2 + 2\bar{\kappa}c + 2\bar{\kappa}(R_{U,1}+MR_{U,2}) + (2/L-\bar{\gamma})^{-1}R_{V,1}^2 + 2d - \kappa\eta\|x\|\big] \\
&\le -4\alpha\gamma\,\phi^{-1}(x)\|x\| \le -2\alpha\gamma\;,
\end{align*}
and therefore
\[
\mathrm{S}_{\gamma,\theta}W(x) \le \exp\Big[\alpha\Big(1+\int_{\mathbb{R}^d}\|y\|^2\,\mathrm{S}_{\gamma,\theta}(x,\mathrm{d}y)\Big)^{1/2} + \alpha^2\gamma\Big] \le \mathrm{e}^{-\alpha^2\gamma}\,W(x)\;.
\]
(b) If $\|x\|\le R_\eta$ then, using that for any $t>-1$, $\sqrt{1+t}\le 1+t/2$,
\[
\Big(1+\int_{\mathbb{R}^d}\|y\|^2\,\mathrm{S}_{\gamma,\theta}(x,\mathrm{d}y)\Big)^{1/2} - \phi(x) \le \gamma\big[(3/2)\gamma\bar{\kappa}^2M^2 + \bar{\kappa}c + \bar{\kappa}(R_{U,1}+MR_{U,2}) + (4/L-\bar{\gamma})^{-1}R_{V,1}^2 + d\big]\;.
\]
Therefore, using (29), we get
\[
\mathrm{S}_{\gamma,\theta}W(x)/W(x) \le \exp\big[\alpha\gamma\big\{(3/2)\gamma\bar{\kappa}^2M^2 + \bar{\kappa}c + \bar{\kappa}(R_{U,1}+MR_{U,2}) + (4/L-\bar{\gamma})^{-1}R_{V,1}^2 + d + \alpha\big\}\big] \le \mathrm{e}^{\alpha\tilde{b}\gamma}\;.
\]
Since for all $a\ge b$, $\mathrm{e}^a - \mathrm{e}^b \le (a-b)\mathrm{e}^a$, we obtain that
\[
\mathrm{S}_{\gamma,\theta}W(x) \le \lambda^\gamma W(x) + \gamma\,\alpha\tilde{b}\,\mathrm{e}^{\alpha\bar{\gamma}\tilde{b}}\,W(R_\eta)\;,
\]
which concludes the proof.
Lemma 24. Assume H1. For any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$ with $\bar{\kappa}>1>\underline{\kappa}>1/2$ and $\bar{\gamma}<1/(2L)$, and any $x,y\in\mathbb{R}^d$,
\[
\max\Big(\|\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{\gamma,\theta}-\delta_y\mathrm{S}^{\lceil 1/\gamma\rceil}_{\gamma,\theta}\|_{\mathrm{TV}},\, \|\delta_x\bar{\mathrm{S}}^{\lceil 1/\gamma\rceil}_{\gamma,\theta}-\delta_y\bar{\mathrm{S}}^{\lceil 1/\gamma\rceil}_{\gamma,\theta}\|_{\mathrm{TV}}\Big) \le 1 - 2\Phi\big\{-\|x-y\|/(2\sqrt{2})\big\}\;,
\]
where $\Phi$ is the cumulative distribution function of the standard normal distribution on $\mathbb{R}$.

Proof. We only show that for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\gamma\in(0,\bar{\gamma}]$ with $\bar{\gamma}<1/(2L)$, and $x,y\in\mathbb{R}^d$, $\|\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{\gamma,\theta}-\delta_y\mathrm{S}^{\lceil 1/\gamma\rceil}_{\gamma,\theta}\|_{\mathrm{TV}} \le 1-2\Phi\{-\|x-y\|/(2\sqrt{2})\}$, since the proof for $\bar{\mathrm{S}}_{\gamma,\theta}$ is similar. Let $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$. Using [36, Theorem 2.1.5, Equation (2.1.8)] and the fact that the proximal operator is non-expansive [5, Proposition 12.28], we have for any $x,y\in\mathbb{R}^d$
\[
\big\|\operatorname{prox}_{\gamma\kappa U_\theta}(x)-\operatorname{prox}_{\gamma\kappa U_\theta}(y) - \gamma\big(\nabla_xV_\theta(\operatorname{prox}_{\gamma\kappa U_\theta}(x))-\nabla_xV_\theta(\operatorname{prox}_{\gamma\kappa U_\theta}(y))\big)\big\|^2 \le \big\|\operatorname{prox}_{\gamma\kappa U_\theta}(x)-\operatorname{prox}_{\gamma\kappa U_\theta}(y)\big\|^2 \le \|x-y\|^2\;.
\]
The proof is then an application of [16, Proposition 3b] with $\ell\leftarrow 1$ and, for any $x\in\mathbb{R}^d$, $\mathrm{T}_{\gamma,\theta}(x)\leftarrow \operatorname{prox}_{\gamma\kappa U_\theta}(x)-\gamma\nabla_xV_\theta(\operatorname{prox}_{\gamma\kappa U_\theta}(x))$ and $\Pi\leftarrow\operatorname{Id}$.
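For intuition on where the expression involving $\Phi$ comes from (a standard Gaussian fact, stated here under the convention $\|\mu-\nu\|_{\mathrm{TV}} = \sup_{\mathsf{A}}|\mu(\mathsf{A})-\nu(\mathsf{A})|$): for two Gaussians with a common isotropic covariance,
\[
\|\mathrm{N}(\mu_1,\sigma^2\operatorname{Id}) - \mathrm{N}(\mu_2,\sigma^2\operatorname{Id})\|_{\mathrm{TV}} = 1 - 2\Phi\big(-\|\mu_1-\mu_2\|/(2\sigma)\big)\;.
\]
After $\lceil 1/\gamma\rceil$ steps, the accumulated Gaussian noise has covariance $2\gamma\lceil 1/\gamma\rceil\operatorname{Id}\approx 2\operatorname{Id}$, i.e. $\sigma\approx\sqrt{2}$, while the drift above is non-expansive; this is what [16, Proposition 3b] makes rigorous.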
Theorem 25. Assume H1 and either H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<2/(m+L)$ if H2 holds and $\bar{\gamma}<1/(2L)$ if H3 holds. Then for any $a\in(0,1]$, there exist $A_{1,a}>0$ and $\rho_a\in(0,1)$ such that for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$, $\mathrm{S}_{\gamma,\theta}$ and $\bar{\mathrm{S}}_{\gamma,\theta}$ admit invariant probability measures $\pi_{\gamma,\theta}$ and $\bar{\pi}_{\gamma,\theta}$ respectively, and for any $x,y\in\mathbb{R}^d$ and $n\in\mathbb{N}$ we have
\begin{align*}
\max\big(\|\delta_x\mathrm{S}^n_{\gamma,\theta}-\pi_{\gamma,\theta}\|_{W^a},\, \|\delta_x\bar{\mathrm{S}}^n_{\gamma,\theta}-\bar{\pi}_{\gamma,\theta}\|_{W^a}\big) &\le A_{1,a}\,\rho_a^{\gamma n}\,W^a(x)\;,\\
\max\big(\|\delta_x\mathrm{S}^n_{\gamma,\theta}-\delta_y\mathrm{S}^n_{\gamma,\theta}\|_{W^a},\, \|\delta_x\bar{\mathrm{S}}^n_{\gamma,\theta}-\delta_y\bar{\mathrm{S}}^n_{\gamma,\theta}\|_{W^a}\big) &\le A_{1,a}\,\rho_a^{\gamma n}\,\{W^a(x)+W^a(y)\}\;,
\end{align*}
with $W=W_m$ and $m\in\mathbb{N}^*$ if H2 holds, and $W=W_\alpha$ with $\alpha<\underline{\kappa}\eta/8$ if H3 holds, see (19).

Proof. The proof is similar to the one of Theorem 21.
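As with MYULA above, the following minimal sketch shows one transition of the kernel $\mathrm{S}_{\gamma,\theta}$, assuming the PULA update appearing in the proofs of Lemmas 22–24 (proximal step, then a gradient step on $V_\theta$ at the proximal point, then Gaussian noise); `pula_step`, `grad_V` and `prox_U` are placeholder names as before.

```python
import numpy as np

def pula_step(x, grad_V, prox_U, gamma, kappa, rng):
    """One PULA transition (sketch):
    X' = T_{gamma,theta}(prox_{gamma*kappa*U_theta}(X)) + sqrt(2*gamma)*Z,
    with T_{gamma,theta}(y) = y - gamma*grad V_theta(y)."""
    p = prox_U(x, gamma * kappa)     # proximal point
    drift = p - gamma * grad_V(p)    # gradient step taken at the proximal point
    return drift + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
```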
Lemma 26 implies that [17, H1a] holds. The geometric ergodicity proved in Theorem 25 implies [17, H1b]. We then show in Corollary 31 that the distance between the invariant probability measure of the Markov chain and the target distribution is controlled, and therefore [17, H1c] is satisfied. Finally, we show that [17, H2] is satisfied in Proposition 32.
Lemma 26. Assume H1, and H2 or H3, and let $(X^n_k,\bar{X}^n_k)_{n\in\mathbb{N},k\in\{0,\dots,m_n\}}$ be given by (5) with $\{(\mathrm{K}_{\gamma,\theta},\bar{\mathrm{K}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\} = \{(\mathrm{S}_{\gamma,\theta},\bar{\mathrm{S}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}$ and $\kappa\in[\underline{\kappa},\bar{\kappa}]$ with $\bar{\kappa}>1>\underline{\kappa}>1/2$. Then there exists $A_2>0$ such that for any $n,p\in\mathbb{N}$ and $k\in\{0,\dots,m_n\}$,
\[
\mathbb{E}\big[\mathrm{S}^p_{\gamma_n,\theta_n}W(X^n_k)\,\big|\,X^0_0\big] \le A_2\,W(X^0_0)\;,\qquad \mathbb{E}\big[\bar{\mathrm{S}}^p_{\gamma_n,\theta_n}W(\bar{X}^n_k)\,\big|\,\bar{X}^0_0\big] \le A_2\,W(\bar{X}^0_0)\;,
\]
\[
\mathbb{E}\big[W(X^0_0)\big]<+\infty\;,\qquad \mathbb{E}\big[W(\bar{X}^0_0)\big]<+\infty\;,
\]
with $W=W_m$, $m\in\mathbb{N}^*$, and $\bar{\gamma}<2/(m+L)$ if H2 holds, and $W=W_\alpha$ with $\alpha<\underline{\kappa}\eta/8$ and $\bar{\gamma}<1/(2L)$ if H3 holds, see (19).

Proof. Combining [17, Lemma S15] and Lemma 22 if H2 holds, or Lemma 23 if H3 holds, concludes the proof.
Lemma 27. Assume H1, and H2 or H3. We have $\sup_{\theta\in\Theta}\{\pi_\theta(W)+\bar{\pi}_\theta(W)\}<+\infty$, with $W=W_m$, $m\in\mathbb{N}^*$, if H2 holds and $W=W_\alpha$ with $\alpha<\eta$ if H3 holds, see (19).

Proof. We only show that $\sup_{\theta\in\Theta}\pi_\theta(W)<+\infty$ since the proof for $\bar{\pi}_\theta$ is similar. Let $m\in\mathbb{N}^*$, $\alpha<\eta$ and $\theta\in\Theta$. The proof is divided into two parts.
(a) If H2 holds, we have
\begin{align*}
\int_{\mathbb{R}^d}(1+\|x\|^{2m})\exp[-U_\theta(x)-V_\theta(x)]\,\mathrm{d}x &\le \int_{\mathbb{R}^d}(1+\|x\|^{2m})\exp[-V_\theta(x)]\,\mathrm{d}x \\
&\le \int_{\mathbb{R}^d}(1+\|x\|^{2m})\exp\big[-V_\theta(x^\star_\theta)-m\|x-x^\star_\theta\|^2/2\big]\,\mathrm{d}x \\
&\le \exp\big[R_{V,2}+mR_{V,1}^2/2\big]\int_{\mathbb{R}^d}(1+\|x\|^{2m})\exp\big[mR_{V,1}\|x\|-m\|x\|^2/2\big]\,\mathrm{d}x\;.
\end{align*}
Hence, using H2,
\[
\sup_{\theta\in\Theta}\pi_\theta(W) \le \exp\big[R_{V,2}+mR_{V,1}^2/2\big]\int_{\mathbb{R}^d}(1+\|x\|^{2m})\exp\big[mR_{V,1}\|x\|-m\|x\|^2/2\big]\,\mathrm{d}x \Big/ \inf_{\theta\in\Theta}\Big\{\int_{\mathbb{R}^d}\exp[-U_\theta(x)-V_\theta(x)]\,\mathrm{d}x\Big\} < +\infty\;.
\]
(b) If H3 holds, we have
\[
\int_{\mathbb{R}^d}\exp[\alpha\phi(x)]\exp[-U_\theta(x)-V_\theta(x)]\,\mathrm{d}x \le \int_{\mathbb{R}^d}\exp[\alpha\phi(x)]\exp[-U_\theta(x)]\,\mathrm{d}x \le \mathrm{e}^{c}\int_{\mathbb{R}^d}\exp[\alpha(1+\|x\|)]\exp[-\eta\|x\|]\,\mathrm{d}x\;.
\]
Since $\alpha<\eta$, we have using H3
\[
\sup_{\theta\in\Theta}\pi_\theta(W) \le \mathrm{e}^{c}\int_{\mathbb{R}^d}\exp[\alpha(1+\|x\|)]\exp[-\eta\|x\|]\,\mathrm{d}x \Big/ \inf_{\theta\in\Theta}\Big\{\int_{\mathbb{R}^d}\exp[-U_\theta(x)-V_\theta(x)]\,\mathrm{d}x\Big\} < +\infty\;,
\]
which concludes the proof.
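As a concrete one-dimensional illustration of case (b) (an illustrative instance, not taken from the paper): take $d=1$, $V_\theta\equiv 0$ and $U_\theta(x)=\theta|x|$ with $\Theta=[\underline{\theta},\bar{\theta}]\subset(0,+\infty)$, so that the tail condition holds with $\eta=\underline{\theta}$ and $c=0$. Then, since $\phi(x)\le 1+|x|$, for any $\alpha<\underline{\theta}$,
\[
\sup_{\theta\in\Theta}\pi_\theta(W_\alpha) \le \sup_{\theta\in\Theta}\frac{\int_{\mathbb{R}}\mathrm{e}^{\alpha(1+|x|)}\,\mathrm{e}^{-\theta|x|}\,\mathrm{d}x}{\int_{\mathbb{R}}\mathrm{e}^{-\theta|x|}\,\mathrm{d}x} \le \frac{\mathrm{e}^{\alpha}\cdot 2/(\underline{\theta}-\alpha)}{2/\bar{\theta}} = \frac{\bar{\theta}\,\mathrm{e}^{\alpha}}{\underline{\theta}-\alpha} < +\infty\;,
\]
in agreement with the requirement $\alpha<\eta$.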
Theorem 28. Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<2/(m+L)$ if H2 holds and $\bar{\gamma}<1/(2L)$ if H3 holds. Then for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$ we have
\[
\max\Big(\|\pi^\sharp_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}},\, \|\bar{\pi}^\sharp_{\gamma,\theta}-\bar{\pi}_\theta\|_{W^{1/2}}\Big) \le \tilde{\Psi}(\gamma)\;,
\]
where for any $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\pi^\sharp_{\gamma,\theta}$, respectively $\bar{\pi}^\sharp_{\gamma,\theta}$, is the invariant probability measure of $\mathrm{S}_{\gamma,\theta}$, respectively $\bar{\mathrm{S}}_{\gamma,\theta}$, given by (18) and associated with $\kappa=1$. In addition, for any $\gamma\in(0,\bar{\gamma}]$,
\[
\tilde{\Psi}(\gamma) = \sqrt{2}\,\big\{b\lambda^{-\bar{\gamma}}/\log(1/\lambda) + \sup_{\theta\in\Theta}\pi_\theta(W) + \sup_{\theta\in\Theta}\bar{\pi}_\theta(W)\big\}^{1/2}\,(L^2d+M^2)^{1/2}\,\sqrt{\gamma}\;,
\]
and where $W=W_m$ with $m\in\mathbb{N}^*$ and $\bar{\gamma},\lambda,b$ are given in Lemma 22 if H2 holds, and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta)$ and $\bar{\gamma},\lambda,b$ are given in Lemma 23 if H3 holds, see (19).

Proof. We only show that for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$, $\|\pi^\sharp_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}}\le\tilde{\Psi}(\gamma)$, since the proof of $\|\bar{\pi}^\sharp_{\gamma,\theta}-\bar{\pi}_\theta\|_{W^{1/2}}\le\tilde{\Psi}(\gamma)$ is similar. Let $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\gamma\in(0,\bar{\gamma}]$ and $x\in\mathbb{R}^d$. Using Theorem 25, we obtain that $(\delta_x\mathrm{S}^n_{\gamma,\theta})_{n\in\mathbb{N}}$, with $\kappa=1$, converges weakly towards $\pi^\sharp_{\gamma,\theta}$. Using that $\mu\mapsto\operatorname{KL}(\mu|\pi_\theta)$ is lower semi-continuous for any $\theta\in\Theta$, see [19, Lemma 1.4.3b], and [21, Corollary 18], we get that
\[
\operatorname{KL}\big(\pi^\sharp_{\gamma,\theta}\,\big|\,\pi_\theta\big) \le \liminf_{n\to+\infty}\operatorname{KL}\Big(n^{-1}\sum_{k=1}^n\delta_x\mathrm{S}^k_{\gamma,\theta}\,\Big|\,\pi_\theta\Big) \le \gamma\,(L^2d+M^2)\;.
\]
Using a generalized Pinsker inequality, see [22, Lemma 24], Lemma 27 and Lemma 22 if H2 holds, or Lemma 23 if H3 holds, we obtain
\[
\|\pi^\sharp_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}} \le \sqrt{2}\,\big(\pi^\sharp_{\gamma,\theta}(W)+\pi_\theta(W)\big)^{1/2}\operatorname{KL}\big(\pi^\sharp_{\gamma,\theta}\,\big|\,\pi_\theta\big)^{1/2} \le \sqrt{2}\,\big\{b\lambda^{-\bar{\gamma}}/\log(1/\lambda)+\sup_{\theta\in\Theta}\pi_\theta(W)\big\}^{1/2}(L^2d+M^2)^{1/2}\gamma^{1/2}\;,
\]
which concludes the proof.
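As a sanity check on the order of $\tilde{\Psi}$ (a toy computation, not part of the argument): if $U_\theta\equiv 0$ and $V_\theta(x)=\|x\|^2/2$, then $\operatorname{prox}_{\gamma\kappa U_\theta}=\operatorname{Id}$ and $\mathrm{S}_{\gamma,\theta}$ reduces to the ULA recursion $X_{n+1}=(1-\gamma)X_n+\sqrt{2\gamma}\,Z_{n+1}$, whose invariant measure is explicit:
\[
\pi^\sharp_{\gamma} = \mathrm{N}\Big(0,\tfrac{2\gamma}{1-(1-\gamma)^2}\operatorname{Id}\Big) = \mathrm{N}\big(0,(1-\gamma/2)^{-1}\operatorname{Id}\big)\;,
\]
so each marginal variance is biased by $\gamma/(2-\gamma)=O(\gamma)$, comfortably within the $O(\sqrt{\gamma})$ envelope given by $\tilde{\Psi}$.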
Lemma 29. Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<2/(m+L)$ if H2 holds and $\bar{\gamma}<1/(2L)$ if H3 holds. Then there exists $\bar{B}>0$ such that for any $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$, $x\in\mathbb{R}^d$ and $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$ with $i\in\{1,2\}$ we have
\[
\max\Big(\|\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}-\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\|_{W^{1/2}},\,\|\delta_x\bar{\mathrm{S}}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}-\delta_x\bar{\mathrm{S}}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\|_{W^{1/2}}\Big) \le \bar{B}\,\gamma\,|\kappa_1-\kappa_2|\,W^{1/2}(x)\;,
\]
where for any $i\in\{1,2\}$, $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\mathrm{S}_{i,\gamma,\theta}$ is given by (18) and associated with $\kappa\leftarrow\kappa_i$, and $W=W_m$ with $m\in\mathbb{N}^*$ if H2 holds. In addition, $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta)$ if H3 holds, see (19).

Proof. We only show that for any $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$, $x\in\mathbb{R}^d$ and $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$ with $i\in\{1,2\}$ we have $\|\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}-\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\|_{W^{1/2}}\le\bar{B}\gamma|\kappa_1-\kappa_2|W^{1/2}(x)$, since the proof for $\bar{\mathrm{S}}_{1,\gamma,\theta}$ and $\bar{\mathrm{S}}_{2,\gamma,\theta}$ is similar. Let $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$, $x\in\mathbb{R}^d$ and $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$ with $i\in\{1,2\}$. Using a generalized Pinsker inequality, see [22, Lemma 24], we have
\[
\|\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}-\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\|_{W^{1/2}} \le \sqrt{2}\,\big(\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}W(x)+\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}W(x)\big)^{1/2}\operatorname{KL}\big(\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}\,\big|\,\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\big)^{1/2}\;. \tag{30}
\]
Using [30, Lemma 4.1] we get that $\operatorname{KL}(\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}\,|\,\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}) \le \operatorname{KL}(\tilde{\mu}_1|\tilde{\mu}_2)$ where, setting $T=\gamma\lceil 1/\gamma\rceil$, $\tilde{\mu}_i$, $i\in\{1,2\}$, is the probability measure over $\mathcal{B}(\mathrm{C}([0,T],\mathbb{R}^d))$ defined for any $\mathsf{A}\in\mathcal{B}(\mathrm{C}([0,T],\mathbb{R}^d))$ by $\tilde{\mu}_i(\mathsf{A})=\mathbb{P}((X^i_t)_{t\in[0,T]}\in\mathsf{A})$, and for any $t\in[0,T]$,
\[
\mathrm{d}X^i_t = b_i\big(t,(X^i_s)_{s\in[0,T]}\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t\;,\qquad X^i_0 = x\;,
\]
with, for any $(\omega_s)_{s\in[0,T]}\in\mathrm{C}([0,T],\mathbb{R}^d)$ and $t\in[0,T]$,
\[
b_i\big(t,(\omega_s)_{s\in[0,T]}\big) = \sum_{p\in\mathbb{N}}\mathbb{1}_{[p\gamma,(p+1)\gamma)}(t)\,\mathrm{T}_{\gamma,\theta}\big(\operatorname{prox}_{\gamma\kappa_iU_\theta}(\omega_{p\gamma})\big)\;,
\]
where for any $y\in\mathbb{R}^d$, $\mathrm{T}_{\gamma,\theta}(y)=y-\gamma\nabla_xV_\theta(y)$. Since $(X^i_t)_{t\in[0,T]}\in\mathrm{C}([0,T],\mathbb{R}^d)$, and $b_1$ and $b_2$ are continuous, [32, Theorem 7.19] applies and we obtain that $\tilde{\mu}_1\ll\tilde{\mu}_2$ and
\[
\frac{\mathrm{d}\tilde{\mu}_1}{\mathrm{d}\tilde{\mu}_2}\big((X_t)_{t\in[0,T]}\big) = \exp\Big\{-(1/4)\int_0^T\big\|b_1(t,(X_s)_{s\in[0,T]})-b_2(t,(X_s)_{s\in[0,T]})\big\|^2\,\mathrm{d}t + (1/2)\int_0^T\big\langle b_1(t,(X_s)_{s\in[0,T]})-b_2(t,(X_s)_{s\in[0,T]}),\,\mathrm{d}X_t - b_2(t,(X_s)_{s\in[0,T]})\,\mathrm{d}t\big\rangle\Big\}\;,
\]
where the equality holds almost surely. As a consequence we obtain that
\[
\operatorname{KL}(\tilde{\mu}_1|\tilde{\mu}_2) = (1/4)\,\mathbb{E}\Big[\int_0^T\big\|b_1(t,(X_s)_{s\in[0,T]})-b_2(t,(X_s)_{s\in[0,T]})\big\|^2\,\mathrm{d}s\Big]\;. \tag{31}
\]
In addition, using Lemma 11, we have for any $(\omega_s)_{s\in[0,T]}\in\mathrm{C}([0,T],\mathbb{R}^d)$ and $t\in[0,T]$,
\[
\big\|b_1(t,(\omega_s)_{s\in[0,T]})-b_2(t,(\omega_s)_{s\in[0,T]})\big\| \le \big\|\operatorname{prox}_{\gamma\kappa_1U_\theta}(\omega_{\gamma\lfloor t/\gamma\rfloor})-\operatorname{prox}_{\gamma\kappa_2U_\theta}(\omega_{\gamma\lfloor t/\gamma\rfloor})\big\| \le \gamma\,|\kappa_1-\kappa_2|\,M\;. \tag{32}
\]
Combining this result and (31) we get that
\[
\operatorname{KL}\big(\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}\,\big|\,\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\big) \le (1+\bar{\gamma})\,M^2\gamma^2\,|\kappa_1-\kappa_2|^2/4\;. \tag{33}
\]
Combining (33) and (30) we get that
\[
\|\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}-\delta_x\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\|_{W^{1/2}} \le 2^{-1/2}(1+\bar{\gamma})^{1/2}M\,\big(\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}W(x)+\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}W(x)\big)^{1/2}\gamma\,|\kappa_1-\kappa_2|\;.
\]
We conclude upon using Lemma 8, and Lemma 22 if H2 holds, or Lemma 23 if H3 holds.
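The discrete-time counterpart of (31) (the equal-step-size case of the bound provided by Lemma 17, which is used in (34) and (40) below) is the explicit Kullback–Leibler divergence between Gaussian transition kernels sharing the covariance $2\gamma\operatorname{Id}$:
\[
\operatorname{KL}\big(\mathrm{N}(\mu_1,2\gamma\operatorname{Id})\,\big|\,\mathrm{N}(\mu_2,2\gamma\operatorname{Id})\big) = \frac{\|\mu_1-\mu_2\|^2}{4\gamma}\;,
\]
with $\mu_1,\mu_2$ the deterministic drifts of the two kernels; this is where the factors $1/(4\gamma_2)$ in those bounds come from.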
Proposition 30. Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<2/(m+L)$ if H2 holds and $\bar{\gamma}<1/(2L)$ if H3 holds. Then there exists $B_1>0$ such that for any $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$ and $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$ with $i\in\{1,2\}$ we have
\[
\max\big(\|\pi^1_{\gamma,\theta}-\pi^2_{\gamma,\theta}\|_{W^{1/2}},\,\|\bar{\pi}^1_{\gamma,\theta}-\bar{\pi}^2_{\gamma,\theta}\|_{W^{1/2}}\big) \le B_1\,\gamma\,|\kappa_1-\kappa_2|\;,
\]
where for any $i\in\{1,2\}$, $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\pi^i_{\gamma,\theta}$, respectively $\bar{\pi}^i_{\gamma,\theta}$, is the invariant probability measure of $\mathrm{S}_{i,\gamma,\theta}$, respectively $\bar{\mathrm{S}}_{i,\gamma,\theta}$, given by (18) and associated with $\kappa\leftarrow\kappa_i$. In addition, $W=W_m$ with $m\in\mathbb{N}^*$ if H2 holds and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta)$ if H3 holds, see (19).

Proof. We only show that for any $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$ and $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$ with $i\in\{1,2\}$, $\|\pi^1_{\gamma,\theta}-\pi^2_{\gamma,\theta}\|_{W^{1/2}}\le B_1\gamma|\kappa_1-\kappa_2|$, since the proofs for $\bar{\pi}^1_{\gamma,\theta}$ and $\bar{\pi}^2_{\gamma,\theta}$ are similar. Let $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$, $x\in\mathbb{R}^d$ and $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$, $i\in\{1,2\}$. Using Theorem 25 we have
\[
\lim_{n\to+\infty}\|\delta_x\mathrm{S}^n_{1,\gamma,\theta}-\delta_x\mathrm{S}^n_{2,\gamma,\theta}\|_{W^{1/2}} = \|\pi^1_{\gamma,\theta}-\pi^2_{\gamma,\theta}\|_{W^{1/2}}\;.
\]
Let $n=q\lceil 1/\gamma\rceil$ with $q\in\mathbb{N}$. Using Theorem 25 with $a=1/2$, that $W^{1/2}(x)\le W(x)$ for any $x\in\mathbb{R}^d$, Lemma 29, Lemma 8, and Lemma 22 if H2 holds, or Lemma 23 if H3 holds, we obtain
\begin{align*}
\|\delta_x\mathrm{S}^n_{1,\gamma,\theta}-\delta_x\mathrm{S}^n_{2,\gamma,\theta}\|_{W^{1/2}} &\le \sum_{k=0}^{q-1}\big\|\delta_x\mathrm{S}^{(k+1)\lceil 1/\gamma\rceil}_{1,\gamma,\theta}\mathrm{S}^{(q-k-1)\lceil 1/\gamma\rceil}_{2,\gamma,\theta}-\delta_x\mathrm{S}^{k\lceil 1/\gamma\rceil}_{1,\gamma,\theta}\mathrm{S}^{(q-k)\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\big\|_{W^{1/2}} \\
&\le \sum_{k=0}^{q-1}A_{1,1/2}\,\rho_{1/2}^{\,q-k-1}\,\big\|\delta_x\mathrm{S}^{k\lceil 1/\gamma\rceil}_{1,\gamma,\theta}\big\{\mathrm{S}^{\lceil 1/\gamma\rceil}_{1,\gamma,\theta}-\mathrm{S}^{\lceil 1/\gamma\rceil}_{2,\gamma,\theta}\big\}\big\|_{W^{1/2}} \\
&\le A_{1,1/2}\sum_{k=0}^{q-1}\rho_{1/2}^{\,q-k-1}\,\bar{B}\,\gamma\,|\kappa_1-\kappa_2|\,\delta_x\mathrm{S}^{k\lceil 1/\gamma\rceil}_{1,\gamma,\theta}W(x) \\
&\le A_{1,1/2}\sum_{k=0}^{q-1}\rho_{1/2}^{\,q-k-1}\,\bar{B}\,\gamma\,|\kappa_1-\kappa_2|\,\big(1+b\lambda^{-\bar{\gamma}}/\log(1/\lambda)\big)\,W(x) \\
&\le A_{1,1/2}\,\bar{B}\,\big(1+b\lambda^{-\bar{\gamma}}/\log(1/\lambda)\big)(1-\rho_{1/2})^{-1}\,\gamma\,|\kappa_1-\kappa_2|\,W(x)\;,
\end{align*}
which concludes the proof with $B_1 = 2A_{1,1/2}\bar{B}(1+b\lambda^{-\bar{\gamma}}/\log(1/\lambda))/(1-\rho_{1/2})$ upon setting $x=0$.
Corollary 31. Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<2/(m+L)$ if H2 holds and $\bar{\gamma}<1/(2L)$ if H3 holds. Then for any $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, we have
\[
\max\big(\|\pi_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}},\,\|\bar{\pi}_{\gamma,\theta}-\bar{\pi}_\theta\|_{W^{1/2}}\big) \le \Psi_1(\gamma)\;,
\]
where for any $\gamma\in(0,\bar{\gamma}]$, $\pi_{\gamma,\theta}$ is the invariant probability measure of $\mathrm{S}_{\gamma,\theta}$ given by (18). In addition, $\Psi_1(\gamma)=\tilde{\Psi}(\gamma)+B_1\gamma|\kappa-1|$, where $\tilde{\Psi}$ is given in Theorem 28 and $B_1$ in Proposition 30, and $W=W_m$ with $m\in\mathbb{N}^*$ if H2 holds and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta)$ if H3 holds, see (19).

Proof. We only show that for any $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$ we have $\|\pi_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}}\le\Psi_1(\gamma)$, since the proof for $\bar{\pi}_{\gamma,\theta}$ and $\bar{\pi}_\theta$ is similar. Let $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$. The proof is a direct application of Theorem 28 and Proposition 30 upon noticing that
\[
\|\pi_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}} \le \|\pi_{\gamma,\theta}-\pi^\sharp_{\gamma,\theta}\|_{W^{1/2}} + \|\pi^\sharp_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}}\;,
\]
where $\pi^\sharp_{\gamma,\theta}$ is the invariant probability measure of $\mathrm{S}_{\gamma,\theta}$ given by (18) and associated with $\kappa=1$.
Proposition 32. Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<2/(m+L)$ if H2 holds and $\bar{\gamma}<1/(2L)$ if H3 holds. Then there exists $A_4>0$ such that for any $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\theta_1,\theta_2\in\Theta$, $\gamma_1,\gamma_2\in(0,\bar{\gamma}]$ with $\gamma_2<\gamma_1$, $a\in[1/4,1/2]$ and $x\in\mathbb{R}^d$,
\[
\max\big(\|\delta_x\mathrm{S}_{\gamma_1,\theta_1}-\delta_x\mathrm{S}_{\gamma_2,\theta_2}\|_{W^a},\,\|\delta_x\bar{\mathrm{S}}_{\gamma_1,\theta_1}-\delta_x\bar{\mathrm{S}}_{\gamma_2,\theta_2}\|_{W^a}\big) \le \big(\Lambda_1(\gamma_1,\gamma_2)+\Lambda_2(\gamma_1,\gamma_2)\|\theta_1-\theta_2\|\big)\,W^a(x)\;,
\]
with $\Lambda_1(\gamma_1,\gamma_2)=A_4(\gamma_1/\gamma_2-1)$, $\Lambda_2(\gamma_1,\gamma_2)=A_4\gamma_1^{1/2}$, and where $W=W_m$ with $m\in\mathbb{N}$ and $m\ge 2$ if H2 is satisfied, and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta)$ if H3 is satisfied, see (19).

Proof. We only show that for any $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\theta_1,\theta_2\in\Theta$, $\gamma_1,\gamma_2\in(0,\bar{\gamma}]$ with $\gamma_2<\gamma_1$, $a\in[1/4,1/2]$ and $x\in\mathbb{R}^d$ we have $\|\delta_x\mathrm{S}_{\gamma_1,\theta_1}-\delta_x\mathrm{S}_{\gamma_2,\theta_2}\|_{W^a} \le (\Lambda_1(\gamma_1,\gamma_2)+\Lambda_2(\gamma_1,\gamma_2)\|\theta_1-\theta_2\|)W^a(x)$, since the proof for $\bar{\mathrm{S}}_{\gamma_1,\theta_1}$ and $\bar{\mathrm{S}}_{\gamma_2,\theta_2}$ is similar. Let $a\in[1/4,1/2]$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\theta_1,\theta_2\in\Theta$ and $\gamma_1,\gamma_2\in(0,\bar{\gamma}]$ with $\gamma_2<\gamma_1$. Using a generalized Pinsker inequality, see [22, Lemma 24], we have
\[
\|\delta_x\mathrm{S}_{\gamma_1,\theta_1}-\delta_x\mathrm{S}_{\gamma_2,\theta_2}\|_{W^a} \le \sqrt{2}\,\big(\delta_x\mathrm{S}_{\gamma_1,\theta_1}W^{2a}(x)+\delta_x\mathrm{S}_{\gamma_2,\theta_2}W^{2a}(x)\big)^{1/2}\operatorname{KL}\big(\delta_x\mathrm{S}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{S}_{\gamma_2,\theta_2}\big)^{1/2}\;.
\]
Combining this result, Jensen's inequality and Lemma 22 if H2 holds, or Lemma 23 if H3 holds, we get
\[
\|\delta_x\mathrm{S}_{\gamma_1,\theta_1}-\delta_x\mathrm{S}_{\gamma_2,\theta_2}\|_{W^a} \le 2\,(1+b\bar{\gamma})^{1/2}\,\big\{\operatorname{KL}\big(\delta_x\mathrm{S}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{S}_{\gamma_2,\theta_2}\big)\big\}^{1/2}\,W^a(x)\;.
\]
Denote, for $\upsilon\in\mathbb{R}^d$ and $\sigma>0$, by $\Upsilon_{\upsilon,\sigma}$ the $d$-dimensional Gaussian distribution with mean $\upsilon$ and covariance matrix $\sigma\operatorname{Id}$. Using Lemma 17 and the fact that $\gamma_1>\gamma_2$ we have
\[
\operatorname{KL}\big(\delta_x\mathrm{S}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{S}_{\gamma_2,\theta_2}\big) \le \frac{d}{2}\,(\gamma_1/\gamma_2-1)^2 + \big\|\mathrm{T}_{\gamma_1,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_2}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\|^2\big/(4\gamma_2)\;, \tag{34}
\]
with $\mathrm{T}_{\gamma,\theta}(z)=z-\gamma\nabla_xV_\theta(z)$ for any $\theta\in\Theta$, $\gamma\in(0,\bar{\gamma}]$ and $z\in\mathbb{R}^d$. We have
\begin{align}
(1/4)\,\big\|\mathrm{T}_{\gamma_1,\theta_1}&\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_2}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\|^2 \tag{35}\\
&\le \big\|\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big)\big\|^2 + \big\|\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\|^2 \notag\\
&\quad + \big\|\mathrm{T}_{\gamma_1,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)\big\|^2 + \big\|\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_2}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\|^2\;. \notag
\end{align}
First, using H1, [36, Theorem 2.1.5, Equation (2.1.8)] and Lemma 11, we have
\[
\big\|\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big)\big\| \le \big\|\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)-\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big\| \le M\,|\gamma_1\kappa-\gamma_2\kappa|\;. \tag{36}
\]
Second, using (9), H1, [36, Theorem 2.1.5, Equation (2.1.8)] and H4, we have
\[
\big\|\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\| \le \gamma_2\kappa\,\big\|\nabla_xU^{\gamma_2\kappa}_{\theta_1}(x)-\nabla_xU^{\gamma_2\kappa}_{\theta_2}(x)\big\| \le \sup_{t\in[0,\bar{\gamma}\bar{\kappa}]}\{f_{\Theta}(t)\}\,\gamma_2\kappa\,\|\theta_1-\theta_2\|\,(1+\|x\|)\;. \tag{37}
\]
Third, using H1, and H2 or H3, we have
\[
\big\|\mathrm{T}_{\gamma_1,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)\big\| \le (\gamma_1-\gamma_2)\,\big\|\nabla_xV_{\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)\big\| \le (\gamma_1-\gamma_2)\,L\,\big\|\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)-x^\star_{\theta_1}\big\| \le (\gamma_1-\gamma_2)\,L\,(R_{V,1}+\bar{\gamma}\bar{\kappa}M+\|x\|)\;. \tag{38}
\]
Finally, using H1 and H4, we have
\begin{align}
\big\|\mathrm{T}_{\gamma_2,\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_2}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\| &\le \gamma_2\,\big\|\nabla_xV_{\theta_1}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)-\nabla_xV_{\theta_2}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\| \tag{39}\\
&\le \gamma_2\,M_\Theta\,\|\theta_1-\theta_2\|\,\big(1+\big\|\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big\|\big) \le \gamma_2\,M_\Theta\,\|\theta_1-\theta_2\|\,(1+\bar{\gamma}\bar{\kappa}M+\|x\|)\;. \notag
\end{align}
Therefore, combining (36), (37), (38) and (39) in (35), there exists $A_{4,1}>0$ such that for any $\gamma_1,\gamma_2>0$ with $\gamma_2<\gamma_1$ and $\theta_1,\theta_2\in\Theta$,
\[
\big\|\mathrm{T}_{\gamma_1,\theta_1}\big(\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)-\mathrm{T}_{\gamma_2,\theta_2}\big(\operatorname{prox}_{\gamma_2\kappa U_{\theta_2}}(x)\big)\big\|^2 \le A_{4,1}\,\big[(\gamma_1-\gamma_2)^2+\gamma_2^2\,\|\theta_1-\theta_2\|^2\big]\,W^a(x)\;.
\]
Using this result in (34), there exists $A_{4,2}>0$ such that
\[
\operatorname{KL}\big(\delta_x\mathrm{S}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{S}_{\gamma_2,\theta_2}\big) \le A_{4,2}\,\big[(\gamma_1/\gamma_2-1)^2+\gamma_1\,\|\theta_1-\theta_2\|^2\big]\,W^a(x)\;,
\]
which implies the announced result upon setting $A_4=2\sqrt{A_{4,2}}\,(1+b\bar{\gamma})^{1/2}$ and using that for any $u,v\ge 0$, $\sqrt{u+v}\le\sqrt{u}+\sqrt{v}$.

In this section, similarly to Section 5.5 for PULA, we show that [17, H1, H2] hold for MYULA.
Lemma 33.
Assume H1, and H2 or H3, and let $(X^n_k,\bar{X}^n_k)_{n\in\mathbb{N},k\in\{0,\dots,m_n\}}$ be given by (5) with $\{(\mathrm{K}_{\gamma,\theta},\bar{\mathrm{K}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}=\{(\mathrm{R}_{\gamma,\theta},\bar{\mathrm{R}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}$ and $\kappa\in[\underline{\kappa},\bar{\kappa}]$ with $\bar{\kappa}>1>\underline{\kappa}>1/2$. Then there exists $\bar{A}_2>0$ such that for any $n,p\in\mathbb{N}$ and $k\in\{0,\dots,m_n\}$,
\[
\mathbb{E}\big[\mathrm{R}^p_{\gamma_n,\theta_n}W(X^n_k)\,\big|\,X^0_0\big]\le\bar{A}_2\,W(X^0_0)\;,\qquad \mathbb{E}\big[\bar{\mathrm{R}}^p_{\gamma_n,\theta_n}W(\bar{X}^n_k)\,\big|\,\bar{X}^0_0\big]\le\bar{A}_2\,W(\bar{X}^0_0)\;,
\]
\[
\mathbb{E}\big[W(X^0_0)\big]<+\infty\;,\qquad \mathbb{E}\big[W(\bar{X}^0_0)\big]<+\infty\;,
\]
with $W=W_m$, $m\in\mathbb{N}^*$, and $\bar{\gamma}<2/(m+L)$ if H2 holds, and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta/4)$ and $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\eta/(2ML)\}$ if H3 holds, see (19).

Proposition 34.
Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,2/(m+L)\}$ if H2 holds and $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,\eta/(2ML)\}$ if H3 holds. Then there exists $\bar{B}_{2,1}>0$ such that for any $\theta\in\Theta$, $\kappa_i\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$,
\[
\max\big(\|\pi^1_{\gamma,\theta}-\pi^2_{\gamma,\theta}\|_{W^{1/2}},\,\|\bar{\pi}^1_{\gamma,\theta}-\bar{\pi}^2_{\gamma,\theta}\|_{W^{1/2}}\big)\le\bar{B}_{2,1}\,\gamma\;,
\]
where for any $i\in\{1,2\}$, $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\pi^i_{\gamma,\theta}$, respectively $\bar{\pi}^i_{\gamma,\theta}$, is the invariant probability measure of $\mathrm{R}_{i,\gamma,\theta}$, respectively $\bar{\mathrm{R}}_{i,\gamma,\theta}$, given by (17) and associated with $\kappa\leftarrow\kappa_i$. In addition, $W=W_m$ with $m\in\mathbb{N}^*$ if H2 holds and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta/4)$ if H3 holds, see (19).

Proof. The proof is similar to the one of Proposition 30 upon setting, for any $i\in\{1,2\}$ and $(\omega_s)_{s\in[0,T]}\in\mathrm{C}([0,T],\mathbb{R}^d)$ with $T=\gamma\lceil 1/\gamma\rceil$,
\[
b_i\big(t,(\omega_s)_{s\in[0,T]}\big)=\omega_{\lfloor t/\gamma\rfloor\gamma}-\gamma\nabla_xV_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})-\gamma\nabla_xU^{\gamma\kappa_i}_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})\;,
\]
and replacing (32) in Lemma 29 by
\[
\big\|b_1(t,(\omega_s)_{s\in[0,T]})-b_2(t,(\omega_s)_{s\in[0,T]})\big\| = \big\|-\gamma\nabla_xU^{\gamma\kappa_1}_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})+\gamma\nabla_xU^{\gamma\kappa_2}_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})\big\|\le 2\gamma M\;.
\]

Proposition 35. Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,2/(m+L),\,L^{-1}\}$ if H2 holds and $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,\eta/(2ML),\,L^{-1}\}$ if H3 holds. Then there exists $\bar{B}_{2,2}>0$ such that for any $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$ we have
\[
\max\Big(\|\pi^\flat_{\gamma,\theta}-\pi^\sharp_{\gamma,\theta}\|_{W^{1/2}},\,\|\bar{\pi}^\flat_{\gamma,\theta}-\bar{\pi}^\sharp_{\gamma,\theta}\|_{W^{1/2}}\Big)\le\bar{B}_{2,2}\,\gamma^2\;,
\]
where for any $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\pi^\flat_{\gamma,\theta}$, respectively $\bar{\pi}^\flat_{\gamma,\theta}$, is the invariant probability measure of $\mathrm{R}_{\gamma,\theta}$, respectively $\bar{\mathrm{R}}_{\gamma,\theta}$, given by (17) and associated with $\kappa=1$, and $\pi^\sharp_{\gamma,\theta}$, respectively $\bar{\pi}^\sharp_{\gamma,\theta}$, is the invariant probability measure of $\mathrm{S}_{\gamma,\theta}$, respectively $\bar{\mathrm{S}}_{\gamma,\theta}$, given by (18) and associated with $\kappa=1$. In addition, $W=W_m$ with $m\in\mathbb{N}^*$ if H2 holds and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta/4)$ if H3 holds, see (19).

Proof. The proof is similar to the one of Proposition 30 upon setting, for any $(\omega_s)_{s\in[0,T]}\in\mathrm{C}([0,T],\mathbb{R}^d)$ with $T=\gamma\lceil 1/\gamma\rceil$,
\begin{align*}
b_1\big(t,(\omega_s)_{s\in[0,T]}\big)&=\operatorname{prox}_{\gamma U_\theta}(\omega_{\lfloor t/\gamma\rfloor\gamma})-\gamma\nabla_xV_\theta\big(\operatorname{prox}_{\gamma U_\theta}(\omega_{\lfloor t/\gamma\rfloor\gamma})\big)\;,\\
b_2\big(t,(\omega_s)_{s\in[0,T]}\big)&=\omega_{\lfloor t/\gamma\rfloor\gamma}-\gamma\nabla_xV_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})-\gamma\nabla_xU^{\gamma}_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})\;,
\end{align*}
and replacing (32) in Lemma 29 by the following bound, obtained using (9) and Lemma 9:
\begin{align*}
\big\|b_1(t,(\omega_s)_{s\in[0,T]})-b_2(t,(\omega_s)_{s\in[0,T]})\big\| &= \big\|\operatorname{prox}_{\gamma U_\theta}(\omega_{\lfloor t/\gamma\rfloor\gamma})-\gamma\nabla_xV_\theta\big(\operatorname{prox}_{\gamma U_\theta}(\omega_{\lfloor t/\gamma\rfloor\gamma})\big)-\omega_{\lfloor t/\gamma\rfloor\gamma}+\gamma\nabla_xV_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})+\gamma\big(\omega_{\lfloor t/\gamma\rfloor\gamma}-\operatorname{prox}_{\gamma U_\theta}(\omega_{\lfloor t/\gamma\rfloor\gamma})\big)/\gamma\big\| \\
&= \gamma\,\big\|\nabla_xV_\theta\big(\operatorname{prox}_{\gamma U_\theta}(\omega_{\lfloor t/\gamma\rfloor\gamma})\big)-\nabla_xV_\theta(\omega_{\lfloor t/\gamma\rfloor\gamma})\big\| \le LM\gamma^2\;.
\end{align*}

Proposition 36.
Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,2/(m+L),\,L^{-1}\}$ if H2 holds and $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,\eta/(2ML),\,L^{-1}\}$ if H3 holds. Then for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$, we have
\[
\max\big(\|\pi_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}},\,\|\bar{\pi}_{\gamma,\theta}-\bar{\pi}_\theta\|_{W^{1/2}}\big)\le\bar{\Psi}_1(\gamma)\;,
\]
where $\pi_{\gamma,\theta}$, respectively $\bar{\pi}_{\gamma,\theta}$, is the invariant probability measure of $\mathrm{R}_{\gamma,\theta}$, respectively $\bar{\mathrm{R}}_{\gamma,\theta}$, given by (17). In addition, $\bar{\Psi}_1(\gamma)=\tilde{\Psi}(\gamma)+\bar{B}_{2,1}\gamma+\bar{B}_{2,2}\gamma^2$, where $\tilde{\Psi}$ is given in Theorem 28 and $\bar{B}_{2,1}$, $\bar{B}_{2,2}$ in Proposition 34 and Proposition 35, and $W=W_m$ with $m\in\mathbb{N}^*$ if H2 holds and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta/4)$ if H3 holds, see (19).

Proof. We only show that for any $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\|\pi_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}}\le\bar{\Psi}_1(\gamma)$, as the proof for $\bar{\pi}_{\gamma,\theta}$ and $\bar{\pi}_\theta$ is similar. First note that for any $\theta\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma\in(0,\bar{\gamma}]$ we have
\[
\|\pi_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}} \le \|\pi_{\gamma,\theta}-\pi^\flat_{\gamma,\theta}\|_{W^{1/2}}+\|\pi^\flat_{\gamma,\theta}-\pi^\sharp_{\gamma,\theta}\|_{W^{1/2}}+\|\pi^\sharp_{\gamma,\theta}-\pi_\theta\|_{W^{1/2}}\;,
\]
where for any $\theta\in\Theta$ and $\gamma\in(0,\bar{\gamma}]$, $\pi^\flat_{\gamma,\theta}$ is the invariant probability measure of $\mathrm{R}_{\gamma,\theta}$ given by (17) and associated with $\kappa=1$, and $\pi^\sharp_{\gamma,\theta}$ is the invariant probability measure of $\mathrm{S}_{\gamma,\theta}$ given by (18) and associated with $\kappa=1$. We conclude upon combining Proposition 34, Proposition 35 and Theorem 28.

Proposition 37.
Assume H1, and H2 or H3. Let $\bar{\kappa}>1>\underline{\kappa}>1/2$, and let $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,2/(m+L)\}$ if H2 holds and $\bar{\gamma}<\min\{(2-1/\underline{\kappa})/L,\,\eta/(2ML)\}$ if H3 holds. Then there exists $\bar{A}_4>0$ such that for any $\theta_1,\theta_2\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\gamma_1,\gamma_2\in(0,\bar{\gamma}]$ with $\gamma_2<\gamma_1$, $a\in[1/4,1/2]$ and $x\in\mathbb{R}^d$,
\[
\max\big(\|\delta_x\mathrm{R}_{\gamma_1,\theta_1}-\delta_x\mathrm{R}_{\gamma_2,\theta_2}\|_{W^a},\,\|\delta_x\bar{\mathrm{R}}_{\gamma_1,\theta_1}-\delta_x\bar{\mathrm{R}}_{\gamma_2,\theta_2}\|_{W^a}\big)\le\big(\bar{\Lambda}_1(\gamma_1,\gamma_2)+\bar{\Lambda}_2(\gamma_1,\gamma_2)\|\theta_1-\theta_2\|\big)\,W^a(x)\;,
\]
with $\bar{\Lambda}_1(\gamma_1,\gamma_2)=\bar{A}_4(\gamma_1/\gamma_2-1)$, $\bar{\Lambda}_2(\gamma_1,\gamma_2)=\bar{A}_4\gamma_1^{1/2}$, and where $W=W_m$ with $m\in\mathbb{N}$ and $m\ge 2$ if H2 is satisfied, and $W=W_\alpha$ with $\alpha<\min(\underline{\kappa}\eta/8,\eta/4)$ if H3 is satisfied, see (19).

Proof. First, note that we only show that for any $\theta_1,\theta_2\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$, $\gamma_1,\gamma_2\in(0,\bar{\gamma}]$ with $\gamma_2<\gamma_1$, $a\in[1/4,1/2]$ and $x\in\mathbb{R}^d$, we have $\|\delta_x\mathrm{R}_{\gamma_1,\theta_1}-\delta_x\mathrm{R}_{\gamma_2,\theta_2}\|_{W^a}\le(\bar{\Lambda}_1(\gamma_1,\gamma_2)+\bar{\Lambda}_2(\gamma_1,\gamma_2)\|\theta_1-\theta_2\|)W^a(x)$, since the proof for $\bar{\mathrm{R}}_{\gamma_1,\theta_1}$ and $\bar{\mathrm{R}}_{\gamma_2,\theta_2}$ is similar. Let $a\in[1/4,1/2]$, $\theta_1,\theta_2\in\Theta$, $\kappa\in[\underline{\kappa},\bar{\kappa}]$ and $\gamma_1,\gamma_2\in(0,\bar{\gamma}]$ with $\gamma_2<\gamma_1$. Using a generalized Pinsker inequality [22, Lemma 24] we have
\[
\|\delta_x\mathrm{R}_{\gamma_1,\theta_1}-\delta_x\mathrm{R}_{\gamma_2,\theta_2}\|_{W^a}\le\sqrt{2}\,\big(\delta_x\mathrm{R}_{\gamma_1,\theta_1}W^{2a}(x)+\delta_x\mathrm{R}_{\gamma_2,\theta_2}W^{2a}(x)\big)^{1/2}\operatorname{KL}\big(\delta_x\mathrm{R}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{R}_{\gamma_2,\theta_2}\big)^{1/2}\;.
\]
Combining this result, Jensen's inequality and Lemma 22 if H2 holds, or Lemma 23 if H3 holds, we get
\[
\|\delta_x\mathrm{R}_{\gamma_1,\theta_1}-\delta_x\mathrm{R}_{\gamma_2,\theta_2}\|_{W^a}\le 2\,(1+b\bar{\gamma})^{1/2}\operatorname{KL}\big(\delta_x\mathrm{R}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{R}_{\gamma_2,\theta_2}\big)^{1/2}W^a(x)\;.
\]
Using Lemma 17 and the fact that $\gamma_1>\gamma_2$ we have
\[
\operatorname{KL}\big(\delta_x\mathrm{R}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{R}_{\gamma_2,\theta_2}\big)\le\frac{d}{2}(\gamma_1/\gamma_2-1)^2+\big\|\gamma_1\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_2}(x)+\gamma_1\nabla_xU^{\gamma_1\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_2}(x)\big\|^2\big/(4\gamma_2)\;. \tag{40}
\]
We have
\begin{align}
\big\|\gamma_1\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_2}(x)&+\gamma_1\nabla_xU^{\gamma_1\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_2}(x)\big\|^2 \tag{41}\\
&\le 4\,\big\|\gamma_2\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_2}(x)\big\|^2+4\,\big\|\gamma_1\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_1}(x)\big\|^2 \notag\\
&\quad+4\,\big\|\gamma_1\nabla_xU^{\gamma_1\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_1}(x)\big\|^2+4\,\big\|\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_2}(x)\big\|^2\;.\notag
\end{align}
First, using H4, we have
\[
\big\|\gamma_2\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_2}(x)\big\|\le\gamma_2 M_\Theta\,\|\theta_1-\theta_2\|\,(1+\|x\|)\;. \tag{42}
\]
Second, using H2 or H3, we have
\[
\big\|\gamma_1\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_1}(x)\big\|\le(\gamma_1-\gamma_2)\,\|\nabla_xV_{\theta_1}(x)\|\le(\gamma_1-\gamma_2)\,L\,\|x-x^\star_{\theta_1}\|\le(\gamma_1-\gamma_2)\,L\,(R_{V,1}+\|x\|)\;. \tag{43}
\]
Third, using H1, H4, Lemma 9 and Lemma 11, we have
\[
\big\|\gamma_1\nabla_xU^{\gamma_1\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_1}(x)\big\| = \big\|\big(x-\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)\big)/\kappa-\big(x-\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big)/\kappa\big\| \le \kappa^{-1}\big\|\operatorname{prox}_{\gamma_1\kappa U_{\theta_1}}(x)-\operatorname{prox}_{\gamma_2\kappa U_{\theta_1}}(x)\big\| \le M(\gamma_1-\gamma_2)\;. \tag{44}
\]
Finally, using H4, we have
\[
\big\|\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_2}(x)\big\|\le\gamma_2\,\Big(\sup_{t\in[0,\bar{\gamma}\bar{\kappa}]}f_\Theta(t)\Big)\,\|\theta_1-\theta_2\|\;. \tag{45}
\]
Combining (42), (43), (44) and (45) in (41), we get that there exists $\bar{A}_{4,1}>0$ such that
\[
\big\|\gamma_1\nabla_xV_{\theta_1}(x)-\gamma_2\nabla_xV_{\theta_2}(x)+\gamma_1\nabla_xU^{\gamma_1\kappa}_{\theta_1}(x)-\gamma_2\nabla_xU^{\gamma_2\kappa}_{\theta_2}(x)\big\|^2\le\bar{A}_{4,1}\,\big[(\gamma_1-\gamma_2)^2+\gamma_2^2\,\|\theta_1-\theta_2\|^2\big]\,W^a(x)\;.
\]
Using this result in (40), we obtain that there exists $\bar{A}_{4,2}>0$ such that
\[
\operatorname{KL}\big(\delta_x\mathrm{R}_{\gamma_1,\theta_1}\,\big|\,\delta_x\mathrm{R}_{\gamma_2,\theta_2}\big)\le\bar{A}_{4,2}\,\big[(\gamma_1/\gamma_2-1)^2+\gamma_1\,\|\theta_1-\theta_2\|^2\big]\,W^a(x)\;,
\]
which implies the announced result upon setting $\bar{A}_4=2\sqrt{\bar{A}_{4,2}}\,(1+b\bar{\gamma})^{1/2}$ and using that for any $u,v\ge 0$, $\sqrt{u+v}\le\sqrt{u}+\sqrt{v}$.

5.6 Proof of Theorem 6

We divide the proof into two parts.
(a) First, assume that $(X^n_k)_{n\in\mathbb{N},k\in\{0,\dots,m_n\}}$ and $(\bar{X}^n_k)_{n\in\mathbb{N},k\in\{0,\dots,m_n\}}$ are given by (5) with $\{(\mathrm{K}_{\gamma,\theta},\bar{\mathrm{K}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}=\{(\mathrm{S}_{\gamma,\theta},\bar{\mathrm{S}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}$. Then Lemma 26 implies that [17, H1a] is satisfied with $A\leftarrow A_2$, and Theorem 25 implies that [17, H1b] holds with $A\leftarrow A_{1,a}$ and $\rho\leftarrow\rho_a$. Finally, using Corollary 31 we get that [17, H1c] holds with $\Psi\leftarrow\Psi_1$. Therefore, we can apply [17, Theorem 1] and we obtain that the sequence $(\theta_n)_{n\in\mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty}\delta_n=+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}\Psi_1(\gamma_n)<+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}^2/(m_n\gamma_n)<+\infty\;.
\]
Since $\Psi_1(\gamma_n)=O(\gamma_n^{1/2})$ by Corollary 31, these summability conditions are satisfied under the summability assumptions of Theorem 6-(1). Proposition 32 implies that [17, H2] holds with $\Lambda_1\leftarrow\Lambda_1$ and $\Lambda_2\leftarrow\Lambda_2$. Therefore, if $m_n=m$ for all $n\in\mathbb{N}$, we can apply [17, Theorem 3] and we obtain that the sequence $(\theta_n)_{n\in\mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty}\delta_n=+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}\Psi_1(\gamma_n)<+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}^2\gamma_n^{-1}<+\infty\;,\qquad\sum_{n=0}^{+\infty}(\delta_{n+1}/\gamma_n)\big(\Lambda_1(\gamma_n,\gamma_{n+1})+\delta_{n+1}\Lambda_2(\gamma_n,\gamma_{n+1})\big)<+\infty\;.
\]
These summability conditions are satisfied under the summability assumptions of Theorem 6-(2).
(b) Second, assume that $(X^n_k)_{n\in\mathbb{N},k\in\{0,\dots,m_n\}}$ and $(\bar{X}^n_k)_{n\in\mathbb{N},k\in\{0,\dots,m_n\}}$ are given by (5) with $\{(\mathrm{K}_{\gamma,\theta},\bar{\mathrm{K}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}=\{(\mathrm{R}_{\gamma,\theta},\bar{\mathrm{R}}_{\gamma,\theta}) : \gamma\in(0,\bar{\gamma}],\,\theta\in\Theta\}$. Then Lemma 33 implies that [17, H1a] is satisfied with $A\leftarrow\bar{A}_2$, and Theorem 21 implies that [17, H1b] holds with $A\leftarrow\bar{A}_{1,a}$ and $\rho\leftarrow\bar{\rho}_a$. Finally, using Proposition 36 we get that [17, H1c] holds with $\Psi\leftarrow\bar{\Psi}_1$. Therefore, we can apply [17, Theorem 1] and we obtain that the sequence $(\theta_n)_{n\in\mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty}\delta_n=+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}\bar{\Psi}_1(\gamma_n)<+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}^2/(m_n\gamma_n)<+\infty\;.
\]
Since $\bar{\Psi}_1(\gamma_n)=O(\gamma_n^{1/2})$ by Proposition 36, these summability conditions are satisfied under the summability assumptions of Theorem 6-(1). Proposition 37 implies that [17, H2] holds with $\Lambda_1\leftarrow\bar{\Lambda}_1$ and $\Lambda_2\leftarrow\bar{\Lambda}_2$. Therefore, if $m_n=m$ for all $n\in\mathbb{N}$, we can apply [17, Theorem 3] and we obtain that the sequence $(\theta_n)_{n\in\mathbb{N}}$ converges a.s. if
\[
\sum_{n=0}^{+\infty}\delta_n=+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}\bar{\Psi}_1(\gamma_n)<+\infty\;,\qquad\sum_{n=0}^{+\infty}\delta_{n+1}^2\gamma_n^{-1}<+\infty\;,\qquad\sum_{n=0}^{+\infty}(\delta_{n+1}/\gamma_n)\big(\bar{\Lambda}_1(\gamma_n,\gamma_{n+1})+\delta_{n+1}\bar{\Lambda}_2(\gamma_n,\gamma_{n+1})\big)<+\infty\;.
\]
These summability conditions are satisfied under the summability assumptions of Theorem 6-(2).
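For concreteness, an illustrative family of schedules meeting all of the above requirements (one admissible choice among many; these particular exponents are not prescribed by Theorem 6) is $\delta_n=\delta_0(n+1)^{-4/5}$, $\gamma_n=\gamma_0(n+1)^{-1/2}$, together with $m_n=\lceil(n+1)^{4/5}\rceil$ for part (1) or $m_n=m$ fixed for part (2). Indeed, $\Psi_1(\gamma_n),\bar{\Psi}_1(\gamma_n)=O(n^{-1/4})$ and $\gamma_n/\gamma_{n+1}-1=O(1/n)$, so
\[
\sum_n\delta_n=+\infty\;,\qquad \sum_n\delta_{n+1}\gamma_n^{1/2}\lesssim\sum_n n^{-21/20}<+\infty\;,\qquad \sum_n\delta_{n+1}^2\gamma_n^{-1}\lesssim\sum_n n^{-11/10}<+\infty\;,
\]
while $\sum_n(\delta_{n+1}/\gamma_n)\Lambda_1(\gamma_n,\gamma_{n+1})\lesssim\sum_n n^{-13/10}$ and $\sum_n(\delta_{n+1}^2/\gamma_n)\Lambda_2(\gamma_n,\gamma_{n+1})\lesssim\sum_n n^{-27/20}$ are finite as well.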
5.7 Proof of Theorem 7

The proof is similar to the one of Theorem 6, using [17, Theorem 2, Theorem 4] instead of [17, Theorem 1, Theorem 3].
Acknowledgements

AD acknowledges financial support from Polish National Science Center grant NCN UMO-2018/31/B/ST1/00253. MP acknowledges financial support from EPSRC under grant EP/T007346/1.

References

[1] Yves F. Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. J. Mach. Learn. Res., 18(1):310–342, 2017.
[2] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, pages 451–459, 2011.
[3] D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion operators, volume 348 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Cham, 2014.
[4] Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin. A simple proof of the Poincaré inequality for a large class of probability measures including the log-concave case. Electron. Commun. Probab., 13:60–66, 2008.
[5] Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, Cham, second edition, 2017. With a foreword by Hédy Attouch.
[6] M. Benaim. A dynamical system approach to stochastic approximations. SIAM J. Control Optim., 34(2):437–472, 1996.
[7] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.
[8] Sebastian Berisha, James G. Nagy, and Robert J. Plemmons. Deblurring and sparse unmixing of hyperspectral images using multiple point spread functions. SIAM Journal on Scientific Computing, 37(5):S389–S406, 2015.
[9] José M. Bioucas-Dias, Antonio Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul Gader, and Jocelyn Chanussot. Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(2):354–379, 2012.
[10] Emmanuel J. Candès, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015.
[11] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161–319, 2016.
[12] Emilie Chouzenoux, Anna Jezierska, Jean-Christophe Pesquet, and Hugues Talbot. A convex approach for image restoration with exact Poisson–Gaussian likelihood. SIAM Journal on Imaging Sciences, 8(4):2662–2682, 2015.
[13] Julianne Chung and Linh Nguyen. Motion estimation and correction in photoacoustic tomographic reconstruction. SIAM Journal on Imaging Sciences, 10(1):216–242, 2017.
[14] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[15] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
[16] V. De Bortoli and A. Durmus. Convergence of diffusions and their discretizations: from continuous to discrete processes and back, 2019.
[17] V. De Bortoli, A. Durmus, M. Pereyra, and A. F. Vidal. Efficient stochastic optimisation by unadjusted Langevin Monte Carlo. Application to maximum marginal likelihood and empirical Bayesian estimation, 2019.
[18] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[19] Paul Dupuis and Richard S. Ellis. A weak convergence approach to the theory of large deviations. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons, Inc., New York, 1997. A Wiley-Interscience Publication.
[20] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. ArXiv e-prints, May 2016.
[21] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
[22] Alain Durmus, Eric Moulines, et al. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
[23] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
[24] Bruno Galerne and Arthur Leclaire. Texture inpainting using efficient Gaussian conditional simulation. SIAM Journal on Imaging Sciences, 10(3):1446–1474, 2017.
[25] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic differential equations and diffusion processes, volume 24 of North-Holland Mathematical Library. North-Holland Publishing Co., Amsterdam; Kodansha, Ltd., Tokyo, second edition, 1989.
[26] Mark A. Iwen, Aditya Viswanathan, and Yang Wang. Fast phase retrieval from local correlation measurements. SIAM Journal on Imaging Sciences, 9(4):1655–1688, 2016.
[27] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems, volume 160. Springer Science & Business Media, 2006.
[28] Michael Kech and Felix Krahmer. Optimal injectivity conditions for bilinear inverse problems with applications to identifiability of deconvolution problems. SIAM Journal on Applied Algebra and Geometry, 1(1):20–37, 2017.
[29] Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[30] Solomon Kullback. Information theory and statistics. John Wiley and Sons, Inc., New York; Chapman and Hall, Ltd., London, 1959.
[31] Shutao Li, Xudong Kang, Leyuan Fang, Jianwen Hu, and Haitao Yin. Pixel-level image fusion: A survey of the state of the art. Information Fusion, 33:100–112, 2017.
[32] Robert S. Liptser and Albert N. Shiryaev. Statistics of random processes. II, volume 6 of Applications of Mathematics (New York). Springer-Verlag, Berlin, expanded edition, 2001. Translated from the 1974 Russian original by A. B. Aries, Stochastic Modelling and Applied Probability.
[33] M. Métivier and P. Priouret. Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Trans. Inform. Theory, 30(2, part 1):140–151, 1984.
[34] M. Métivier and P. Priouret. Théorèmes de convergence presque sure pour une classe d'algorithmes stochastiques à pas décroissant. Probab. Theory Related Fields, 74(3):403–428, 1987.
[35] Veniamin I. Morgenshtern and Emmanuel J. Candès. Super-resolution of positive sources: The discrete setup. SIAM Journal on Imaging Sciences, 9(1):412–444, 2016.
[36] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[37] George Pólya and Gabor Szegő. Problems and theorems in analysis. I. Classics in Mathematics. Springer-Verlag, Berlin, 1998. Series, integral calculus, theory of functions; translated from the German by Dorothee Aeppli; reprint of the 1978 English translation.
[38] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[39] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[40] Saiprasad Ravishankar and Yoram Bresler. Efficient blind compressed sensing using sparsifying transforms with convergence guarantees and application to magnetic resonance imaging. SIAM Journal on Imaging Sciences, 8(4):2519–2557, 2015.
[41] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[42] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
[43] Lorenzo Rosasco, Silvia Villa, and Bang Công Vũ. Convergence of stochastic proximal gradient algorithm. Applied Mathematics & Optimization, pages 1–27, 2019.
[44] Carola-Bibiane Schönlieb. Partial Differential Equation Methods for Image Inpainting, volume 29. Cambridge University Press, 2015.
[45] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
[46] Miguel Simões, José Bioucas-Dias, Luis B. Almeida, and Jocelyn Chanussot. A convex formulation for hyperspectral image superresolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing, 53(6):3373–3388, 2015.
[47] Weijie Su, Stephen P. Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17:153:1–153:43, 2016.
[48] V. B. Tadić and A. Doucet. Asymptotic bias of stochastic gradient search. Ann. Appl. Probab., 27(6):3255–3304, 2017.
[49] Ana F. Vidal, Valentin De Bortoli, Marcelo Pereyra, and Alain Durmus. Maximum likelihood estimation of regularisation parameters in high-dimensional inverse problems: an empirical Bayesian approach. Part I: Methodology and experiments, 2019.
[50] Ana Fernandez Vidal and Marcelo Pereyra. Maximum likelihood estimation of regularisation parameters. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1742–1746. IEEE, 2018.
[51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.