On polynomial-time computation of high-dimensional posterior measures by Langevin-type algorithms
RICHARD NICKL AND SVEN WANG, UNIVERSITY OF CAMBRIDGE†, SEPTEMBER 14, 2020
Abstract.
The problem of generating random samples of high-dimensional posterior distributions is considered. The main results consist of non-asymptotic computational guarantees for Langevin-type MCMC algorithms which scale polynomially in key quantities such as the dimension of the model, the desired precision level, and the number of available statistical measurements. As a direct consequence, it is shown that posterior mean vectors as well as optimisation-based maximum a posteriori (MAP) estimates are computable in polynomial time, with high probability under the distribution of the data. These results are complemented by statistical guarantees for recovery of the ground truth parameter generating the data. Our results are derived in a general high-dimensional non-linear regression setting (with Gaussian process priors) where posterior measures are not necessarily log-concave, employing a set of local 'geometric' assumptions on the parameter space, and assuming that a good initialiser of the algorithm is available. The theory is applied to a representative non-linear example from PDEs involving a steady-state Schrödinger equation.
Contents
1. Introduction
1.1. Basic setting and contributions
1.2. Discussion of related literature
1.3. Notations and conventions
2. Main results for the Schrödinger model
2.1. Bayesian inference with Gaussian process priors
2.2. Polynomial time guarantees for Bayesian posterior computation
3. General theory for random design regression
3.1. Local curvature bounds for the likelihood function
3.2. Construction of the likelihood surrogate function
3.3. Non-asymptotic bounds for Bayesian posterior computation
3.4. Proof of Lemma 3.4
3.5. A chaining lemma for empirical processes
3.6. Proofs for Section 3.3
4. Proofs for the Schrödinger model

† Department of Pure Mathematics & Mathematical Statistics, Wilberforce Road, CB3 0WB Cambridge, UK.
Email: [email protected], [email protected]. We gratefully acknowledge support by the European Research Council, ERC grant agreement 647812 (UQMSI).
1. Introduction
Markov chain Monte Carlo (MCMC) type algorithms are a key methodology in computational mathematics and statistics. The main idea is to generate a Markov chain (ϑ_k : k ∈ N) whose laws L(ϑ_k) on R^D approximate its invariant measure. In Bayesian inference the relevant invariant measure has a probability density of the form

(1) π(θ | Z^{(N)}) ∝ e^{ℓ_N(θ)} π(θ), θ ∈ R^D.

Here π is a prior density function for a parameter θ ∈ R^D and the map ℓ_N : R^D → R is the 'data-log-likelihood' based on N observations Z^{(N)} from some statistical model, so that π(·|Z^{(N)}) is the density of the Bayesian posterior probability distribution on R^D arising from the observations.

It can be challenging to give performance guarantees for MCMC algorithms in the increasingly complex and high-dimensional statistical models relevant in contemporary data science. By 'high-dimensional' we mean that the model dimension D may be large (e.g., proportional to a power of N). Without any further assumptions, accurate sampling from π(·|Z^{(N)}) in high dimensions can then be expected to be intractable (see below for more discussion). For MCMC methods the computational hardness typically manifests itself in an exponential dependence on D or N of the 'mixing time' of the Markov chain (ϑ_k : k ∈ N) towards its equilibrium measure (1).

In this work we develop mathematical techniques which allow us to overcome such computational hardness barriers. We consider diffusion-based MCMC algorithms targeting the Gibbs-type measure with density π(·|Z^{(N)}) from (1) in a non-linear and high-dimensional setting. The prior π will be assumed to be Gaussian – the main challenge thus arises from the non-convexity of −ℓ_N. We will show how local geometric properties of the statistical model can be combined with recent developments in Bayesian nonparametric statistics [51, 53] and the non-asymptotic theory of Langevin algorithms [16, 20, 21] to justify the 'polynomial time' feasibility of such sampling methods.

While the approach is general, it crucially takes advantage of the particular geometric structure of the statistical model at hand. In a large class of high-dimensional non-linear inference problems arising throughout applied mathematics, such structure is described by partial differential equations (PDEs). Examples that come to mind are inverse and data assimilation problems, and, in particular since influential work by A. Stuart [63], MCMC-based Bayesian methodology is frequently used in such settings, especially for the task of uncertainty quantification. We refer the reader to [37], [38], [30], [9], [42], [15], [63], [49], [14], [61], [18], [2] and the references therein. A main contribution of this paper is to demonstrate the feasibility of our proof strategy in a (for such PDE problems) prototypical non-linear example where the parameter θ models the potential in a steady-state Schrödinger equation. This PDE arises in various applications such as photo-acoustics, e.g., [3], and provides a suitable framework to lay out the main mathematical ideas underpinning our proofs.
1.1. Basic setting and contributions.
To summarise our key results we now introduce a more concrete setting. For O a bounded subset of R^d, d ∈ N, and Θ some parameter space, consider a family of 'regression' functions {G(θ) : θ ∈ Θ} ⊂ L²(O), where L²(O) denotes the usual Lebesgue space of square-integrable functions. This induces a 'forward map'

(2) G : Θ → L²(O),

and we suppose that N observations Z^{(N)} = (Y_i, X_i : i = 1, ..., N) arising via

(3) Y_i = G(θ)(X_i) + ε_i, i = 1, ..., N,

are given, where ε_i ∼ N(0,1) are independent noise variables, and design variables X_i are drawn uniformly at random from the domain O (independently of the ε_i). While natural parameter spaces Θ can be infinite-dimensional, in numerical practice a D-dimensional discretisation of Θ is employed, where D can possibly be large. The log-likelihood function of the data (Y_i, X_i) then equals, up to additive constants, the usual least squares criterion

(4) ℓ_N(θ) = −(1/2) Σ_{i=1}^N [Y_i − G(θ)(X_i)]², θ ∈ R^D.

The aim is to recover θ from Z^{(N)}. A wide-spread practice in statistical science is to employ Gaussian (process) priors Π with multivariate normal probability densities π on R^D; from a numerical point of view the Bayesian approach to inference in such problems is then precisely concerned with (approximate) evaluation of the posterior measure (1).

As discussed above, in important physical applications the forward map G is described implicitly by a partial differential equation. For example suppose that G(θ) = u_{f_θ} arises as the solution u = u_{f_θ} to the following elliptic boundary value problem for a Schrödinger equation

(5) (Δ/2)u − f_θ u = 0 on O, u = g on ∂O,

with a suitable parameterisation θ ↦ f_θ > 0, θ ∈ R^D (see (17) below for details). In such cases, the map G is non-linear and −ℓ_N(θ) is not convex. The probability measure with density π(·|Z^{(N)}) given in (1) may then be highly complex to evaluate in a high-dimensional setting, with computational cost scaling exponentially as D → ∞. For instance, complexity theory for high-dimensional numerical integration (see [57, 58] for general references) implies that computing the integral of a D-dimensional real-valued Lipschitz function – such as the normalising factor implicit in (1) – by a deterministic algorithm has worst case cost scaling as D^{D/2} [34, 64]. Relaxing a worst case analysis, Monte Carlo methods can in principle obtain dimension-free guarantees (with high probability under the randomisation scheme). However, a curse of dimensionality may persist as one typically is only able to sample approximately from the target measure, and since the approximation error incurred, e.g., by the mixing time of a Markov chain, could scale exponentially in dimension. The references [5, 6], [4], [60], [48, 76] discuss this issue in a variety of contexts. In addition, since the distribution becomes increasingly 'spiked' as the statistical information increases (i.e., N → ∞), commonly used iterative algorithms can take a time exponential in N to exit neighbourhoods of local optima of the posterior surface π(·|Z^{(N)}) (e.g., [22], Example 4).

In light of the preceding discussion one may ask whether the approximate calculation of basic aspects of π(·|Z^{(N)}) – such as its mean vector (expected value), real-valued functionals ∫_{R^D} H(θ)π(θ|Z^{(N)})dθ, or mode – is feasible at a computational cost which grows at most polynomially in D, N and the desired (inverse) precision level. Very few rigorous results providing
even just partial such guarantees appear to be available. The notable exception Hairer, Stuart and Vollmer [32], along with some other important references, will be discussed below.

Let us describe the scope of the methods to be developed in this article in the problem of approximate computation of the high-dimensional posterior mean vector in the PDE model (5) with the Schrödinger equation. We will require mild regularity assumptions on D, Π and on the ground truth θ_0 generating the data (3) – full details can be found in Section 2. If Π is a D-dimensional Gaussian process prior with covariance equal to a rescaled inverse Laplacian raised to some large enough power α ∈ N, if the model dimension grows at most as D ≲ N^{d/(2α+d)}, and if θ_0 is sufficiently well-approximated by its 'discretisation' in R^D (see (28)), we obtain the following main result.

Theorem 1.1.
Suppose that data Z^{(N)} = (Y_i, X_i : i = 1, ..., N) arise through (3) in the Schrödinger model (5) and let P > 0. Then, for any precision level ε ≥ N^{−P}, there exists a (randomised) algorithm whose output θ̂_ε ∈ R^D can be computed at computational cost

(6) O(N^{b_1} D^{b_2} ε^{−b_3}), b_1, b_2, b_3 > 0,

and such that, with high probability (under the joint law of Z^{(N)} and the randomisation mechanism),

‖θ̂_ε − E^Π[θ|Z^{(N)}]‖_{R^D} ≤ ε,

where E^Π[θ|Z^{(N)}] = ∫_{R^D} θ π(θ|Z^{(N)}) dθ denotes the mean vector of the posterior distribution Π(·|Z^{(N)}) with density (1).

We further show in Theorem 2.6 that θ̂_ε also recovers the ground truth θ_0, within precision ε. The method underlying Theorem 1.1 consists of an initialisation step which requires solving a standard convex optimisation problem, followed by iterations (ϑ_k) of a discretised gradient based Langevin-type MCMC algorithm, at each step requiring a single evaluation of ∇ℓ_N (which itself amounts to solving a standard linear elliptic boundary value problem). In particular our results will imply that the posterior mean can be computed by ergodic averages (1/J) Σ_{k≤J} ϑ_k along the MCMC chain (after some burn-in time), see Theorem 2.5 (which implies Theorem 1.1). The laws L(ϑ_k) of the iterates (ϑ_k) in fact provide a global approximation

W_2(L(ϑ_k), Π(·|Z^{(N)})) ≤ ε, k ≥ k_mix,

of the high-dimensional posterior measure on R^D, in the Wasserstein distance W_2. Our explicit convergence guarantees will ensure that both the 'mixing time' k_mix and the number of required iterations J to reach precision level ε scale polynomially in D, N, ε^{−1}. Similar statements hold true for the computation of real-valued functionals ∫_{R^D} H(θ)π(θ|Z^{(N)})dθ for Lipschitz maps H : R^D → R and of maximum a posteriori (MAP) estimates. See Theorems 2.7, 2.8 as well as Proposition 2.4 for precise statements.

The key idea underlying our proofs is to demonstrate first that, with high probability under the law generating the data Z^{(N)}, the target measure Π(·|Z^{(N)}) from (1) is locally log-concave on a region in R^D where most of its mass concentrates. Then we show that a 'localised' Langevin-type algorithm, when initialised into the region of log-concavity, possesses polynomial time convergence guarantees in 'moderately' high-dimensional models. That sufficiently precise initialisation is possible has to be shown in each problem individually (for the Schrödinger model, see Section B.3). Our proofs provide a template (outlined in Section 3) that can be used in principle also in general settings as long as the linearisation ∇_θ G(θ_0) of G at the ground truth parameter θ_0 satisfies a suitable stability estimate (i.e., a quantitative injectivity property related to the 'Fisher information' operator of the statistical model). We verify this stability property for the Schrödinger equation using elliptic PDE techniques (see Lemma 4.7), but our approach may succeed in a variety of other non-linear forward models arising in inverse problems [40, 52, 63, 69], integral X-ray geometry [36, 51, 59], and also in the context of data assimilation and filtering [15, 49, 61]. Further advancing our understanding of the computational complexity of such PDE-constrained high-dimensional inference problems poses a formidable challenge for future research.

1.2. Discussion of related literature.
Both the statistical and computational aspects of high-dimensional Bayes procedures have been the subject of great interest in recent years. Frequentist convergence properties of high- and infinite-dimensional Bayes procedures were intensely studied in the last two decades. For 'direct' statistical models we refer to the recent monograph [26] (and references therein), and in the non-linear (PDE) setting relevant here to [1, 29, 51, 53–56].

We now discuss some representative papers studying mixing properties of MCMC algorithms in high-dimensional settings, and refer to the references cited in these articles for various further important results.
1.2.1. Mixing times for pCN-type algorithms.
The important contribution [32] by Hairer, Stuart and Vollmer derives dimension-independent convergence guarantees for the preconditioned Crank–Nicolson (pCN) algorithm, using ergodicity results for infinite-dimensional Markov chains from Hairer, Mattingly and Scheutzow [31]. The task of sampling from a general measure arising from a Gaussian process prior and a general likelihood function exp(−Φ(θ)) is considered there. Their results are hence naturally compatible with the setting considered in this paper, where Φ is given by (4), i.e. Φ = Φ_N = −ℓ_N, and it is natural to ask (a) whether the bounds from [32] apply to this class of problems and (b) if they apply, how they quantitatively depend on N and the model dimension.

The key Assumptions 2.10, 2.11, and 2.13 made in [32] can be summarised as (A) a global lower bound on the acceptance probability of the pCN scheme as well as (B) a (local) Lipschitz continuity requirement on Φ. In non-linear PDE problems, part (B) can usually be verified (e.g., [56]), while part (A) is more challenging: due to the global nature of the assumption, it seems that verification of (A) will typically require bounds for likelihood ratios exp(Φ(θ) − Φ(θ̄)) with θ, θ̄ arbitrarily far apart. Of course, in some specific problems an initial bound may be obtained by invoking inequalities like (18). However, the resulting lower bounds on the acceptance probabilities in the pCN scheme will decrease exponentially in N. We also note that, though dimension-independent, the main Theorems 2.12 and 2.14 from [32] remain implicit (non-quantitative) in the relevant quantities from Assumptions (A) and (B); this seems to stem both from the utilised proof techniques, such as considerations regarding level sets of Lyapunov functions (cf. [32], p.2474), as well as from the qualitative nature of the key underlying probabilistic weak Harris theorem proved in [31]. Summarising, while it would be very exciting to see the results of [32] extended to yield quantitative bounds which are polynomial in both N, D, serious technical and conceptual innovations seem to be required. In the present context, when exploiting local average curvature of the likelihood surface arising from PDE structure, it appears more promising to investigate gradient based MCMC schemes.
1.2.2. Computational guarantees for Langevin-type algorithms.
For the important gradient-based class of Langevin Monte Carlo (LMC) algorithms, the first non-asymptotic convergence guarantees suited to high-dimensional settings were obtained by Dalalyan [16] for log-concave densities, and shortly afterwards extended by Durmus and Moulines [20, 21] to closely related cases. Our proofs rely substantially on these convergence results for the strongly log-concave case (see Appendix A for a review).
Very recently further extensions have emerged, notably [48] and [74], which establish convergence guarantees assuming either that the density to be sampled from is convex outside of some region, or that the target measure satisfies functional inequalities of log-Sobolev and Poincaré type. However, it appears that both of these results, when applied to (4) without further substantial work, yield bounds that scale exponentially in N. Indeed, the bound in Theorem 1 of [48] evidently depends exponentially on the Lipschitz constant of the gradient ∇ℓ_N; and ad hoc verification of the assumptions from [74] would utilise the Holley–Stroock perturbation principle [35] (and (18)), exhibiting the same exponential dependence. Alternative, more elaborate ways of verifying functional inequalities in this context would be highly interesting, but this is not the approach we take in the present paper.

1.2.3. Relationship to Bernstein–von Mises theorems.
A key idea in our proofs is to use approximate curvature of ℓ_N(θ) 'near' the ground truth θ_0. On a deeper level this idea is related to the possibility of a Bernstein–von Mises theorem, which would establish precise Gaussian ('Laplace-type') approximations to posterior distributions; see [41, 43, 72] for the classical versions of such results in 'low-dimensional' statistical models, and [10–12, 25] for high- or infinite-dimensional versions thereof.

Such an approach is taken by [4], who attempt to exploit the asymptotic 'normality' of the posterior measure to establish bounds on the computation time of MCMC-based posterior sampling, building on seminal work by Lovász, Simonovits and Vempala [46, 47] on the complexity of general Metropolis–Hastings schemes. While [4] potentially allow for moderately high-dimensional situations (by appealing to high-dimensional Bernstein–von Mises theorems from [25]), their sampling guarantees hold for rescaled posterior measures arising as laws of √N(θ − θ̃)|Z^{(N)}, where θ̃ = θ̃(Z^{(N)}) is an initial 'semi-parametrically efficient centring' of the posterior draws θ|Z^{(N)} (cf. also Remark 2.10 below). In our setting such a centring is not generally available (in fact, showing that one can compute such centrings, such as the posterior mode or mean, in polynomial time is a main aim of our analysis). The setting in [4] thus appears somewhat unnatural for the problems studied here, also because the conditions there do not appear to permit Gaussian priors.

For the Schrödinger equation example considered in the present paper, Bernstein–von Mises theorems were obtained in the recent paper [53] (in a slightly different but closely related measurement setting). While we follow [53] in using elliptic PDE theory to quantify the amount of curvature expressed in the 'limiting information operator' arising from the Schrödinger model, our proofs are in fact not based on an asymptotic Gaussian approximation of the posterior distribution. Rather, we use tools from high-dimensional probability to deduce local curvature bounds directly for the likelihood surface, and then show that the posterior measure is approximated, in Wasserstein distance, by a globally log-concave measure that concentrates around the posterior mode (see Theorem 4.14). While one can think of this as a 'non-asymptotic' version of a Bernstein–von Mises theorem, the underlying techniques do not require the full inversion of the information operator (as in [53] or also in [50, 55]), but solely rely on a 'stability estimate' for the local linearisation of the forward map, and hence are likely to apply to a larger class of PDEs. A further key advantage of our approach is that we do not require the initialiser for the algorithm to be a 'semi-parametrically efficient' estimator (as [4] do); instead, only a sufficiently fast 'nonparametric' convergence rate is required, which substantially enlarges the class of admissible initialisation strategies.

1.2.4. Regularisation/optimisation literature.
Regularisation-driven optimisation methods have been studied for a long time in applied mathematics, see for instance the monographs [23, 39]. In the setting of non-linear operator equations in Hilbert spaces and with deterministic noise, 'local' convergence guarantees for iterative (gradient or 'Landweber') methods have been obtained
in [33, 39], assuming that optimisation is performed over a (sufficiently small) neighbourhood of a maximum. The proof techniques underlying our main results also allow us to derive guarantees for gradient descent algorithms targeting, for instance, maximum a posteriori (MAP) estimates, see Section 2.2.5. Specifically, in Theorem 2.8, global convergence guarantees for the computation of MAP estimates over a high-dimensional discretisation space are given, in our genuine statistical framework, paralleling our main results for Langevin sampling methods, which can be regarded as randomised versions of classical gradient methods. A main attraction of studying such randomised algorithms, and more generally of solving the problem of Bayesian computation, is of course that one can access entire posterior distributions, which is required for quantifying the statistical uncertainty in the reconstruction provided by point estimates such as the posterior mean or mode.
1.3. Notations and conventions.
Throughout, N will denote the number of observations in (3) and D will denote the dimension of the model from (4). For a real-valued function f : R^D → R, its gradient and Hessian are denoted by ∇f and ∇²f, respectively, while Δ = ∇^T∇ denotes the Laplace operator. For any matrix A ∈ R^{D×D}, we denote the operator norm by

‖A‖_op := sup_{ψ : ‖ψ‖_{R^D} ≤ 1} ‖Aψ‖_{R^D}.

If A is positive definite and symmetric, then we denote the minimal and maximal eigenvalues of A by λ_min(A) and λ_max(A) respectively, with condition number κ(A) := λ_max(A)/λ_min(A). The Euclidean norm on R^D will be denoted by ‖·‖_{R^D}. The space ℓ²(N) denotes the usual sequence space of square-summable sequences (a_n : n ∈ N), normed by ‖·‖_{ℓ²}. For any a ∈ R, we write a_+ = max{a, 0}. Throughout, ≲, ≳, ≃ will denote (in-)equalities up to multiplicative constants.

For a Borel subset O ⊆ R^d, d ∈ N, let L^p = L^p(O) be the usual spaces of functions endowed with the norm ‖h‖^p_{L^p} = ∫_O |h(x)|^p dx, where dx is Lebesgue measure. The usual L²(O) inner product is denoted by ⟨·,·⟩_{L²(O)}. If O is a smooth domain in R^d, then C(O) denotes the space of bounded continuous functions h : O → R equipped with the supremum norm ‖·‖_∞, and C^α(O), α ∈ N, denote the usual spaces of α-times continuously differentiable functions on O with bounded derivatives. Likewise we denote by H^α(O) the usual order-α Sobolev spaces of weakly differentiable functions with square integrable partial derivatives up to order α ∈ N, and this definition extends to positive α ∉ N by interpolation [66]. We also define (H^1_0(O))^* as the topological dual space of

( H^1_0(O) = {h ∈ H^1(O) : tr(h) = 0}, ‖·‖_{H^1(O)} ),

where tr(·) denotes the usual trace operator on O. We will repeatedly use the inequalities

(7) ‖gh‖_{H^α} ≤ c(α, O)‖g‖_{H^α}‖h‖_{H^α}, α > d/2,

(8) ‖h‖_{H^β} ≤ c(β, α, O)‖h‖^{(α−β)/α}_{L²}‖h‖^{β/α}_{H^α}, 0 ≤ β ≤ α,

for g, h ∈ H^α, see, e.g., [45]. For Borel probability measures µ_1, µ_2 on R^D with finite second moments we define the Wasserstein distance

(9) W²_2(µ_1, µ_2) = inf_{ν ∈ Γ(µ_1,µ_2)} ∫_{R^D × R^D} ‖θ − ϑ‖²_{R^D} dν(θ, ϑ),

where Γ(µ_1, µ_2) is the set of all 'couplings' of µ_1 and µ_2 (see, e.g., [75]). Finally, we say that a map H : R^D → R is Lipschitz if it has finite Lipschitz norm

(10) ‖H‖_Lip := sup_{x ≠ y, x,y ∈ R^D} |H(x) − H(y)| / ‖x − y‖_{R^D}.

2. Main results for the Schrödinger model
Our object of study in this section is a nonlinear forward model arising with a (steady state) Schrödinger equation. Throughout, let O ⊂ R^d be a bounded domain with smooth boundary ∂O. For convenience we will restrict to d ≤ 3 (dimensions d ≥ 4 requiring a different choice of norm in our proofs, cf. Section 3), and we normalise vol(O) = 1. Suppose that g ∈ C^∞(∂O) is a given function prescribing boundary values g ≥ g_min > 0 on ∂O. For an 'attenuation potential' f ∈ H^α(O), consider solutions u = u_f of the PDE

(11) (Δ/2)u − fu = 0 on O, u = g on ∂O.

If α > d/2 and f ≥ 0, a unique solution u_f ∈ C²(O) ∩ C(Ō) to the Schrödinger equation (11) exists. The non-linearity of the map f ↦ u_f becomes apparent from the classical Feynman–Kac formula (e.g., Theorem 4.7 in [13])

(12) u_f(x) = u_{f,g}(x) = E^x[ g(X_{τ_O}) e^{−∫_0^{τ_O} f(X_s) ds} ], x ∈ O,

where (X_s : s ≥
0) is a d-dimensional Brownian motion started at x, with exit time τ_O from O. This PDE appears in various settings in applied mathematics; for example, an application to photo-acoustics is discussed in Section 3 of [3].
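To make the representation (12) concrete, the following minimal Python sketch evaluates u_f(x) by Monte Carlo. It assumes, purely for illustration, the one-dimensional domain O = (0,1); the function names, step size and path count are ours and do not appear in the paper.

```python
import numpy as np

def u_f_feynman_kac(x, f, g, n_paths=500, dt=1e-3, rng=None):
    """Monte Carlo evaluation of the Feynman-Kac formula (12) on O = (0,1):
    run Brownian paths started at x until they exit O, accumulating the
    time-integral of the potential f along each path."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_paths):
        X, intf = x, 0.0
        while 0.0 < X < 1.0:
            intf += f(X) * dt                  # Riemann sum for int_0^tau f(X_s) ds
            X += np.sqrt(dt) * rng.standard_normal()
        total += g(X) * np.exp(-intf)          # g evaluated at the exit point X_tau
    return total / n_paths

# sanity check: for f = 0 and g = 1 the solution of (11) is u = 1 identically
print(u_f_feynman_kac(0.3, f=lambda x: 0.0, g=lambda x: 1.0))  # ~ 1.0
```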
2.1. Bayesian inference with Gaussian process priors.

2.1.1. The Dirichlet-Laplacian and Gaussian random fields.
In Bayesian statistics popular choices of prior probability measures arise from Gaussian random fields whose covariance kernels are related to the Laplace operator Δ, see, e.g., Section 2.4 in [63] and also Example 11.8 in [26] (where the closely related 'Whittle–Matérn' processes are considered).

Let g_O denote the symmetric Green kernel of the Dirichlet Laplacian on O, which for ψ ∈ L²(O) describes the unique solution v = V[ψ] = ∫_O g_O(·, y)ψ(y) dy ∈ H²(O) of the Poisson equation Δv/2 = ψ on O. By standard results (Section 5.A in [66]) the compact, ⟨·,·⟩_{L²(O)}-self-adjoint operator V has eigenfunctions (e_k : k ∈ N) forming an orthonormal basis of L²(O) such that V[ψ] = Σ_{k=1}^∞ µ_k ⟨e_k, ψ⟩_{L²(O)} e_k, with (negative) eigenvalues µ_k satisfying the Weyl asymptotics (e.g., Corollary 8.3.5 in [67])

(13) λ_k = 1/|µ_k| ≃ k^{2/d} as k → ∞, 0 < λ_k < λ_{k+1}, k ∈ N.

The 'spectrally defined' Sobolev-type spaces 𝓗^α = {F ∈ L²(O) : Σ_{k=1}^∞ λ_k^α ⟨F, e_k⟩²_{L²(O)} < ∞} are isomorphic to corresponding Hilbert sequence spaces

h^α := {θ ∈ ℓ²(N) : ‖θ‖²_{h^α} = Σ_{k=1}^∞ λ_k^α θ_k² < ∞}, h⁰ =: ℓ²(N).

One shows that 𝓗^α is a closed subspace of H^α(O) and that the sequence norm ‖·‖_{h^α} is equivalent to ‖·‖_{H^α(O)} on 𝓗^α. For α even, this follows from the usual isomorphism theorems for the α/2-fold iterated Dirichlet Laplacian, and for general α by interpolation, see Section 5.A in [66]. One also shows that any F ∈ H^α(O) supported strictly inside of O belongs to 𝓗^α.
A centred Gaussian random field M^α on O can be defined by the infinite random series

(14) M^α(x) = Σ_{k=1}^∞ λ_k^{−α/2} g_k e_k(x), x ∈ O, g_k ∼ i.i.d. N(0, 1).

For α > d/2, M^α defines a Gaussian Borel random variable in C(O) ∩ {h uniformly continuous : h = 0 on ∂O}, with reproducing kernel Hilbert space equal to 𝓗^α (see Example 2.6.15 in [28]), thus providing natural priors for α-regular functions vanishing at ∂O. Such Dirichlet boundary conditions could be replaced by Neumann conditions at the expense of minor changes (see p.473 in [66]). We finally note that our techniques may in principle extend to other classes of priors, such as the exponential Besov-type priors considered in [42], but we focus our development here on the most commonly used class of α-regular Gaussian process priors.
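For illustration, a draw from the truncated series (14) takes a few lines of Python. The sketch below again assumes O = (0,1), where the Dirichlet Laplacian has eigenfunctions e_k(x) = √2 sin(kπx) and eigenvalues of order k², consistent with the Weyl rate (13) for d = 1; the normalisation of V is ignored here, and the code is ours rather than the paper's.

```python
import numpy as np

def sample_M_alpha(alpha, D, grid, rng=None):
    """Truncated Karhunen-Loeve draw of the field M_alpha in (14) on O = (0,1),
    with e_k(x) = sqrt(2) sin(k pi x) and lambda_k = (k pi)^2 ~ k^{2/d} for d = 1."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, D + 1)
    lam = (np.pi * k) ** 2                                   # Dirichlet Laplacian spectrum
    theta = lam ** (-alpha / 2) * rng.standard_normal(D)     # theta_k = lambda_k^{-alpha/2} g_k
    basis = np.sqrt(2) * np.sin(np.pi * np.outer(grid, k))   # e_k evaluated on the grid
    return basis @ theta                                     # sum_k theta_k e_k(x)

draw = sample_M_alpha(alpha=3, D=100, grid=np.linspace(0, 1, 200))  # one prior sample path
```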
2.1.2. Re-parameterisation, regular link functions, and forward map. To use Gaussian random fields such as M^α (which take values in all of R) to model a potential f ≥ K_min ≥ 0, we employ regular link functions.

Definition 2.1 (Regular link function). Let K_min ∈ [0, ∞). We say that Φ : R → (K_min, ∞) is a regular link function if it is bijective, smooth, strictly increasing (i.e. Φ′ > 0 on R) and if for any k ≥ 1 the k-th derivative of Φ satisfies sup_{x∈R} |Φ^{(k)}(x)| < ∞.

For a simple example of a regular link function Φ, see e.g. Example 3.2 of [56]. To ease notation, we denote the composition operator associated to Φ by

(15) Φ_* : L²(O) → L²(O), F ↦ Φ ∘ F = Φ_*(F).

Now, to describe a natural parameter space for f, we will first expand functions F ∈ L²(O) in the orthonormal basis from Section 2.1.1,

(16) F = F_θ = Σ_{k=1}^∞ θ_k e_k, (θ_k : k = 1, 2, ...) ∈ ℓ²(N),

and denote by Ψ(θ) = F_θ the map Ψ : ℓ²(N) → L²(O) that associates to the vector θ the 'Fourier' series of F_θ. We then apply a regular link function Φ to F_θ and set f_θ := Φ ∘ F_θ. For α > d/2, F_θ ∈ H^α(O) implies f_θ ∈ H^α(O), and hence solutions of the Schrödinger equation (11) exist for such f. The overall forward map describing our parametrisation is then given by

(17) G : h^α → L²(O), G(θ) = u_{f_θ},

that is, G is the composition of Ψ, Φ_* and the solution map f ↦ u_f of (11). We shall frequently regard G as a map on the closed linear subspace R^D of h^α consisting of the first D coefficients (θ_1, ..., θ_D) of θ ∈ h^α. Moreover, it will be tacitly assumed that a regular link function Φ : R → (K_min, ∞), K_min ≥
0, has been chosen. We also note that the solutions of (11) are uniformly bounded by a constant independent of θ ∈ h^α, specifically

(18) ‖G(θ)‖_∞ = ‖u_{f_θ}‖_∞ ≤ ‖g‖_∞,

as follows from (12) and f_θ ≥
0. This 'bounded range' property of G is relative to the norm employed; for instance, the ‖u_{f_θ}‖_{H^α}-norms are not uniformly bounded in θ ∈ h^α for general α.

2.1.3. Measurement model, prior, likelihood and posterior.
For the forward map G from (17), we now consider the measurement model

(19) Y_i = G(θ)(X_i) + ε_i, i = 1, ..., N, ε_i ∼ i.i.d. N(0, 1), X_i ∼ i.i.d. Uniform(O).

The i.i.d. random vectors

(20) Z^{(N)} = (Z_i)_{i=1}^N = (Y_i, X_i)_{i=1}^N

are drawn from a product measure on (R × O)^N that we denote by P^N_θ = ⊗_{i=1}^N P_θ. The coordinate (Lebesgue) densities p_θ of the joint probability density p^N_θ = Π_{i=1}^N p_θ of P^N_θ are of the form

(21) p_θ(y, x) := (1/√(2π)) exp{ −(1/2)[y − G(θ)(x)]² }, y ∈ R, x ∈ O,

(recalling vol(O) = 1), and we can define the log-likelihood function as

(22) ℓ_N(θ) ≡ log p^N_θ + N log √(2π) = −(1/2) Σ_{i=1}^N (Y_i − G(θ)(X_i))².

When using Gaussian process prior models in Bayesian statistics, a common discretisation approach is to truncate the ('Karhunen–Loève' type) expansion of the prior in a suitable basis, cf. [17, 32, 42, 63]. In our context this will mean that we truncate the series defining the random field M^α in (14) at some finite dimension D to be specified. For integer α to be chosen, and recalling the eigenvalues (λ_k : k ∈ N) of the Dirichlet Laplacian from (13), we thus consider priors

(23) θ ∼ Π = Π_N ∼ N(0, N^{−d/(2α+d)} Λ_α^{−1}), Λ_α = diag(λ_1^α, ..., λ_D^α),

supported in the subspace R^D of h^α consisting of its first D coordinates. The Lebesgue density of Π on R^D will be denoted by π. The posterior measure Π(·|Z^{(N)}) on R^D then arises from data Z^{(N)} in (19) via Bayes' formula. Writing ‖θ‖_{h^α} = ‖F_θ‖_{h^α}, the probability density function of Π(·|Z^{(N)}) is given by

(24) π(θ|Z^{(N)}) ∝ e^{ℓ_N(θ)}π(θ) ∝ exp{ −(1/2) Σ_{i=1}^N (Y_i − G(θ)(X_i))² − (N^{d/(2α+d)}/2)‖θ‖²_{h^α} }, θ ∈ R^D.

2.2. Polynomial time guarantees for Bayesian posterior computation.
2.2.1. Description of the algorithm.
We now describe the Langevin-type algorithm targeting the posterior measure Π(·|Z^{(N)}). It requires the choice of an initialiser θ_init and of constants ǫ, K, γ. Throughout, we use the initialiser θ_init = θ_init(Z^{(N)}) ∈ R^D constructed in Theorem B.6 in Section B.3 (computable in O(N^{b_0}) many steps, for some b_0 > 0). For ǫ > 0, define the region

(25) ˆB = {θ ∈ R^D : ‖θ − θ_init‖_{R^D} ≤ ǫD^{−1/d}/2}.

We then construct a proxy function ℓ̃_N : R^D → R which agrees on ˆB with the log-likelihood function ℓ_N from (22). Specifically, take the cut-off function α = α_η from (53) and the convex function g = g_η from (52) with the choice η = ǫD^{−1/d} and |·| = ‖·‖_{R^D}. Note that α is compactly supported and identically one on ˆB, and that g vanishes on ˆB. Then, for K to be chosen, ℓ̃_N takes the form

(26) ℓ̃_N(θ) := α(θ)ℓ_N(θ) − Kg(θ), θ ∈ R^D.
This induces a proxy probability measure, correspondingly denoted by Π̃(·|Z^{(N)}), with log-density

(27) log π̃(θ|Z^{(N)}) = ℓ̃_N(θ) − (N^{d/(2α+d)}/2)‖θ‖²_{h^α} + const., θ ∈ R^D.

Note that π̃(·|Z^{(N)}) coincides with the posterior density π(·|Z^{(N)}) on the set ˆB up to a (random) normalising constant. The MCMC scheme we consider is then given in Algorithm 1, and the law of the resulting Markov chain (ϑ_k) ∈ R^D will be denoted by P_{θ_init}.

Algorithm 1
Input: Initialiser θ_init ∈ R^D, convexification parameters ǫ, K > 0, step size γ > 0, and an i.i.d. sequence ξ_k ∼ N(0, I_{D×D}).
Output: Markov chain ϑ_0, ϑ_1, ..., ϑ_k, ... ∈ R^D.
  initialise ϑ_0 = θ_init
  for k = 0, 1, ... do
    ϑ_{k+1} = ϑ_k + γ∇ log π̃(ϑ_k|Z^{(N)}) + √(2γ) ξ_{k+1}
  return (ϑ_k : k = 1, 2, ...)

While the algorithm is related to stochastic optimisation methods based on gradient descent, the diffusivity term is of constant order in k, allowing (ϑ_k) to explore the entire support of the target measure. It coincides with the unadjusted Langevin algorithm (see Appendix A) targeting π(·|Z^{(N)}) as long as the iterates (ϑ_k) stay within the region ˆB ⊂ R^D we have initialised to. When (ϑ_k) exits ˆB, the Markov chain is forced by the 'proxy' function ℓ̃_N to eventually return to ˆB. This procedure is justified since most of the posterior mass will be shown to concentrate on ˆB with high probability under the law of Z^{(N)}. [In fact, a key step of our proofs is to control the Wasserstein distance between the measures induced by the densities π(·|Z^{(N)}), π̃(·|Z^{(N)}), cf. Theorem 4.14.] Note that while the ball in (25) shrinks as the dimension D → ∞, relative to the step-sizes γ permitted below, ˆB has asymptotically growing diameter. The results that follow show that the Markov chain (ϑ_k) nevertheless mixes sufficiently fast to reconstruct the posterior surface on ˆB with arbitrary precision after a polynomial runtime; a schematic implementation is sketched below.
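In Python, Algorithm 1 amounts to a few lines. The sketch below treats ∇ log π̃(·|Z^{(N)}) as an abstract callable (in the Schrödinger model each evaluation involves a linear elliptic PDE solve); the function name is ours.

```python
import numpy as np

def algorithm_1(grad_log_pi_tilde, theta_init, gamma, n_steps, rng=None):
    """Unadjusted Langevin iteration of Algorithm 1:
    theta_{k+1} = theta_k + gamma * grad log pi-tilde(theta_k) + sqrt(2*gamma) * xi_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    D = theta_init.shape[0]
    chain = np.empty((n_steps + 1, D))
    chain[0] = theta_init
    for k in range(n_steps):
        drift = gamma * grad_log_pi_tilde(chain[k])
        chain[k + 1] = chain[k] + drift + np.sqrt(2 * gamma) * rng.standard_normal(D)
    return chain
```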
To demonstrate the performance of Algorithm 1 in a large N, D scenario, we now make the following specific choices of the key algorithm parameters ǫ, K, γ.

Condition 2.2. Let θ_init be the initialiser from Theorem B.6 and suppose that

ǫ := 1/log N, K := N D^{2/d}(log N)², γ ≤ [N D^{2/d}(log N)²]^{−1}.

2.2.2. Conditions involving θ_0. The convergence guarantees obtained below hold for moderately high-dimensional models where D is permitted to grow polynomially in N, and under the frequentist assumption that the data Z^{(N)} from (19) is generated from a fixed ground truth θ_0 inducing the law P^N_{θ_0}. Note that we do not assume that θ_0 ∈ R^D, but rather that θ_0 ∈ h^α is sufficiently well approximated by its ℓ²(N)-projection θ_{0,D} onto R^D. The precise condition, which is discussed in more detail in Remark 2.9 below, reads as follows.

Condition 2.3.
For integers d ≤ 3 and α > 6, suppose data Z^{(N)} from (20) arise in the Schrödinger model (19) for some fixed θ_0 ∈ h^α. Moreover, suppose that D ∈ N is such that, for some constants c_0 > 0, 0 < c′ < 1/2, and with θ_{0,D} = ((θ_0)_1, ..., (θ_0)_D),

(28) D ≤ c_0 N^{d/(2α+d)}, ‖G(θ_{0,D}) − G(θ_0)‖_{L²(O)} ≤ c′ N^{−α/(2α+d)}.

Though it will be left implicit, the results we obtain in this section depend on θ_0 only through c′ and an upper bound S ≥ ‖θ_0‖_{h^α}.

2.2.3. Computational guarantees for ergodic MCMC averages.
We first present a concentration inequality for ergodic averages along the Markov chain (ϑ_k). Proposition 2.4 is non-asymptotic in nature; hence its statement necessarily involves various constants whose dependence on D and N is tracked. Theorems 2.5 and 2.6 then demonstrate how the desired polynomial time computation guarantees, including Theorem 1.1, can be deduced from it.

For 'burn-in' time J_in ∈ N and MCMC samples (ϑ_k : k = J_in + 1, ..., J_in + J) from Algorithm 1, define

π̂^J_{J_in}(H) = (1/J) Σ_{k=J_in+1}^{J_in+J} H(ϑ_k), H : R^D → R.

We also set, for c_0 > 0,

(29) B(γ) := c_0 [ γ D^{(d+24)/d}(log N)² + γ² N D^{(d+44)/d}(log N)² ] + exp(−N^{d/(2α+d)}).

The quantity B(γ) is an upper bound for the error incurred by the Euler discretisation of the Langevin dynamics (see (163) below) and by the 'proxy' construction (27).
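Given a chain produced as in the sketch following Algorithm 1, the estimator π̂^J_{J_in}(H) is just a truncated running average; a two-line illustrative helper (our naming):

```python
import numpy as np

def ergodic_average(chain, H, J_in, J):
    """Compute (1/J) * sum_{k=J_in+1}^{J_in+J} H(theta_k) for a chain stored
    with one iterate per row; with H the identity this yields the
    posterior-mean estimate theta-bar of Theorem 2.5."""
    return np.mean([H(chain[k]) for k in range(J_in + 1, J_in + J + 1)], axis=0)

# posterior-mean estimate: theta_bar = ergodic_average(chain, lambda t: t, J_in, J)
```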
Proposition 2.4. Assume Condition 2.3 is satisfied and consider iterates ϑ_k of the Markov chain from Algorithm 1 with θ_init, ǫ, K, γ satisfying Condition 2.2. Then there exist constants c_1, ..., c_5 > 0 such that for all N ∈ N, any Lipschitz function H : R^D → R, any burn-in period

(30) J_in ≥ (log N / (γ N D^{−2/d})) × log(D + B(γ)^{−1}),

any J ∈ N, any t ≥ c_3 ‖H‖_Lip √(B(γ)), and on events E_N (measurable subsets of (R × O)^N) of probability P^N_{θ_0}(E_N) ≥ 1 − c_1 exp(−c_2 N^{d/(2α+d)}),

P_{θ_init}( |π̂^J_{J_in}(H) − E^Π(H|Z^{(N)})| ≥ t ) ≤ c_4 exp( −(c_5 t² N Jγ D^{−2/d}) / (‖H‖²_Lip (1 + D^{2/d}/(N Jγ))) ).

The next result concerns computation of the posterior mean vector

E^Π[θ|Z^{(N)}] = ∫_{R^D} θ π(θ|Z^{(N)}) dθ

by ergodic averages

θ̄^J_{J_in} := (1/J) Σ_{k=J_in+1}^{J_in+J} ϑ_k, J_in, J ∈ N,

within prescribed precision level ε. For convenience we assume ε ≥ N^{−P}, which is natural in view of the statistical error to be considered in Theorem 2.6 below. To this end, we make an explicit choice for the step size parameter

(31) γ = γ_ε = min( ε² D^{−(d+24)/d}, ε² N^{−1/2} D^{−(22+d/2)/d}, N^{−1} D^{−2/d} ) × (log N)^{−2}.
Theorem 2.5.
Assume Condition 2.3 is satisfied. Fix
P > 0 and let ε ≥ N^{−P}. Consider iterates ϑ_k of the Markov chain from Algorithm 1 with θ_init, ǫ, K satisfying Condition 2.2 and with γ = γ_ε as in (31). Then there exist c_1, c_2, c_3 > 0 and at most polynomially growing constants

(32) g_{D,N,ε} = O(D^{b̄_1} N^{b̄_2} ε^{−b̄_3}), b̄_1, b̄_2, b̄_3 > 0,

such that for all N ∈ N, J_in ≥ g_{D,N,ε}, J ∈ N, and on events E_N of probability P^N_{θ_0}(E_N) ≥ 1 − c_1 exp(−c_2 N^{d/(2α+d)}),

(33) P_{θ_init}( ‖θ̄^J_{J_in} − E^Π[θ|Z^{(N)}]‖_{R^D} ≥ ε ) ≤ c_3 D exp(−J/g_{D,N,ε}).

Theorem 2.5 implies that for J_in ∧ J ≫ g_{D,N,ε} × log D, one can compute the posterior mean vector within precision ε > 0, at overall computational cost polynomial in N, D, ε^{−1}. Similar guarantees for general functionals E^Π(H|Z^{(N)}) can be obtained as long as ‖H‖_Lip grows at most polynomially in D.

We conclude this subsection with a result concerning recovery of the actual target of statistical inference, that is, the ground truth θ_0. It combines Theorem 2.5 with a statistical rate of convergence of E^Π[θ|Z^{(N)}] to θ_0, obtained by adapting recent results from [51] to the present situation.

Theorem 2.6.
Consider the setting of Theorem 2.5 with P = α²/((2α+d)(α+2)). There exist further constants c_1, c_2, c_3, c_4 > 0 such that for all N ∈ N, all ε ≥ c_4 N^{−(α/(2α+d))·(α/(α+2))}, with g_{D,N,ε} from (32) and on events E_N of probability P^N_{θ_0}(E_N) ≥ 1 − c_1 exp(−c_2 N^{d/(2α+d)}),

(34) P_{θ_init}( ‖θ̄^J_{J_in} − θ_0‖_{ℓ²} ≥ ε ) ≤ c_3 exp(−J/g_{D,N,ε}).

While the statistical minimax-optimal rate towards θ_0 ∈ h^α in this problem can be expected to be faster than N^{−P} (see [53]), it appears unclear how to obtain this rate when F_θ is discretised by means of the (for the purposes of the present paper essential) spectral decomposition of the Dirichlet Laplacian from Section 2.1.1. The difficulty arises with the approximation theory of the space H^α_c(O) (equal to the completion of C^∞_c(O) in H^α(O)) and is not discussed further here.

2.2.4. Global bounds for posterior approximation in Wasserstein distance.
The previous theorems concern the computation of specific posterior characteristics; one may also be interested in global mixing properties of the laws L(ϑ_k) induced by the Markov chain (ϑ_k : k ∈ N) towards the target Π(·|Z^{(N)}), for instance in the Wasserstein distance from (9).

Theorem 2.7.
Assume Condition 2.3 is satisfied, let L(ϑ_k) denote the law of the k-th iterate ϑ_k of the Markov chain from Algorithm 1 with θ_init, ǫ, K, γ satisfying Condition 2.2, and let B(γ), c_0 be as in (29). For any P > 0 there exist constants c_1, ..., c_5 > 0 such that on events E_N of probability P^N_{θ_0}(E_N) ≥ 1 − c_1 exp(−c_2 N^{d/(2α+d)}), and for all N ∈ N, the following holds.

i) For any k ≥ 0,

(35) W_2(L(ϑ_k), Π[·|Z^{(N)}]) ≤ c_3 D^{α/d} (1 − c_4 N D^{−2/d} γ)^{k+1} + B(γ).

ii) For any 'precision level' ε ≥ N^{−P} and for γ = γ_ε from (31), there exists

(36) k_mix = O(N^{b̃_1} D^{b̃_2} ε^{−b̃_3}), b̃_1, b̃_2, b̃_3 > 0,

such that for any k ≥ k_mix, W_2(L(ϑ_k), Π[·|Z^{(N)}]) ≤ ε.

The first term on the right hand side of (35) characterises the rate of geometric convergence towards equilibrium of (ϑ_k); the factor N D^{−2/d} γ can be thought of as a spectral gap of the Markov chain (related to the 'average local curvature' of ℓ_N(·) near θ_0 in the Schrödinger model). Choosing γ = γ_ε as in (31), part ii) further establishes 'polynomial-time' mixing of the MCMC scheme towards the posterior measure.

2.2.5. Computation of the MAP estimate.
Our techniques also imply the following guarantees for the computation of maximum a posteriori (MAP) estimates

θ̂_MAP ∈ arg max_{θ ∈ R^D} π(θ|Z^{(N)})

by a classical gradient descent method applied to the 'proxy' posterior surface (27).

Theorem 2.8.
Assume Condition 2.3 is satisfied and let θ_init denote the initialiser from Theorem B.6. For k = 0, 1, 2, ..., consider the gradient descent algorithm

ϑ_0 = θ_init, ϑ_{k+1} = ϑ_k + γ∇ log π̃(ϑ_k|Z^{(N)}), γ = [N D^{2/d}(log N)²]^{−1}.

There exist constants c_1, c_2, c_3, c_4, c_5 > 0 such that for all N ∈ N, and on events E_N of probability at least P^N_{θ_0}(E_N) ≥ 1 − c_1 exp(−c_2 N^{d/(2α+d)}), we have the following:

i) A unique maximiser θ̂_MAP of π(θ|Z^{(N)}) over R^D exists.

ii) For all k ≥ 0, we have the geometric convergence

‖ϑ_k − θ̂_MAP‖_{R^D} ≤ c_3 D^{1/d} (1 − c_4 D^{−4/d}(log N)^{−2})^{k+1}.

iii) Finally, we can choose k = O(D^{4/d}(log N)³) such that ‖ϑ_k − θ_0‖_{ℓ²} ≤ c_5 N^{−(α/(2α+d))·(α/(α+2))}.
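The scheme of Theorem 2.8 is the noise-free counterpart of Algorithm 1; a minimal sketch, with ∇ log π̃ again an abstract callable and the function name ours:

```python
def gradient_descent_map(grad_log_pi_tilde, theta_init, gamma, n_steps):
    """Plain gradient ascent on the surrogate log-posterior (27); on the events
    of Theorem 2.8 the iterates converge geometrically to the MAP estimate."""
    theta = theta_init.copy()
    for _ in range(n_steps):
        theta = theta + gamma * grad_log_pi_tilde(theta)
    return theta
```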
Remark 2.9 (About Condition 2.3) . In principle the upper bound for D required in Condition 2.3could be replaced by general conditions on D (alike those from Lemma 3.4) which do not becomemore stringent as α increases. From a statistical point of view, however, a choice D ≤ c N d/ (2 α + d ) is natural as it corresponds to the optimal ‘bias-variance’ tradeoff underpinning the convergencerate towards θ ∈ h α from Theorem 2.6. [In fact, the second requirement in (28) can be checkedfor θ ∈ h α and D ≃ N d/ (2 α + d ) , since G is ℓ ( N ) − L ( O ) Lipschitz.] Moreover, combined with α >
6, such a choice of D provides a convenient sufficient condition throughout our proofs: It is used critically when showing (in Theorem 4.14) that the proxy posterior measure Π̃(·|Z^{(N)}) contracts about a ‖·‖_{R^D}-neighbourhood of θ_0 of radius of order D^{−1/d}, on which the Fisher information in the Schrödinger model has a stable behaviour (see (116)). It is also required for our initialiser θ_init to lie in this neighbourhood (Theorem B.6). While it is conceivable that the condition on α could be weakened (as discussed, e.g., in the next remark), it would come at the expense of considerable further technicalities that we wish to avoid here.

Remark 2.10 (Preconditioning and rescaling). Given the 'local' nature of Algorithm 1, one may be interested in sampling from the distribution of θ|Z^{(N)} by first running an appropriate modification of Algorithm 1 generating samples (ψ_k : k ≥
0) of the rescaled and recentred law ν_N of A_N^{−1}(θ − θ_init), with probability density proportional to dν_N(ψ) ∝ π(θ_init + A_N ψ|Z^{(N)}), where A_N ∈ R^{D×D} is a sequence of 'preconditioning' matrices, and then setting ϑ_k = θ_init + A_N ψ_k, see, e.g., Section 4.2 in [16]. The techniques underlying our proofs also apply to such preconditioned algorithms by obvious modifications of the surrogate construction (using also that W_2(L(ϑ_k), Π(·|Z^{(N)})) ≲ W_2(L(ψ_k), ν_N), where the constant in ≲ depends only polynomially on the eigenvalues of A_N). This may speed up the algorithm (e.g., in terms of the explicit constants b_i, b̄_i, b̃_i in Theorems 1.1, 2.5 and 2.7, respectively); for instance, in the Schrödinger equation it would be natural to choose for A_N the action of the Laplace operator Δ on R^D to 'stabilise' the curvature bounds in Lemma 4.7. However, when investigating the question of existence of polynomial time sampling algorithms, such preconditioning arguments appear less relevant. For instance, for the pCN algorithm discussed in Section 1.2, the global likelihood ratios determining the mixing time of the Markov chain obtained in [32] still grow exponentially in N after rescaling. Likewise, for rescaled Langevin algorithms, the 'qualitative' picture of computational hardness (in the context of the present paper) remains unchanged.

3. General theory for random design regression
In proving the results from Section 2, we will first develop some theory which applies to general nonlinear regression models. We thus consider in this section the measurement model (3) for a general forward map G that satisfies a set of analytic conditions to be detailed below. Let Θ be a (measurable) linear subspace of ℓ²(N) which itself admits a subspace R^D ⊆ Θ for some D ∈ N. Let O be a Borel subset of R^d, d ≥
1, and consider a model of regression functions {G(θ) : θ ∈ Θ} via a Borel-measurable forward map G : Θ → C(O). While we regard each G(θ) as a continuous real-valued function, the results of this section readily extend to vector or matrix fields over manifolds O, see Remark 3.11. Our data are given by Z_i = (Y_i, X_i) arising from

(37) Y_i = G(θ)(X_i) + ε_i, i = 1, ..., N,

where X_i ∼ i.i.d. P_X, with P_X a Borel probability measure on O, and where ε_i ∼ i.i.d. N(0, 1), independent of the X_i's. We write Z^{(N)} = (Z_1, ..., Z_N) for the full data vector with joint distribution P^N_θ = ⊗_{i=1}^N P_θ on (R × O)^N, with expectation operator E^N_θ = ⊗_{i=1}^N E_θ. Then the log-likelihood functions of the data Z^{(N)} and of a single observation Z = (Y, X) ∼ P_θ are given by

(38) ℓ_N(θ) ≡ ℓ_N(θ, Z^{(N)}) = −(1/2) Σ_{i=1}^N [Y_i − G(θ)(X_i)]², ℓ(θ) ≡ ℓ(θ, Z) = −(1/2)[Y − G(θ)(X)]²,

respectively. If we regard these maps as being defined on R^D ⊆ Θ, and if Π is a Gaussian prior supported in R^D, then we obtain the posterior measure Π(·|Z^{(N)}) with probability density π(·|Z^{(N)}) on R^D as in (24).

The main results of this section are Theorems 3.7 and 3.8, providing convergence guarantees for a Langevin sampling method for the posterior distribution that depend polynomially on the model dimension D and the number N of measurements, and which hold on an event (i.e., a measurable subset E of the sample space (R × O)^N supporting the data Z^{(N)}) of the form

E := E_conv ∩ E_init ∩ E_wass.

On E_conv the negative log-likelihood −ℓ_N(θ) will be strongly convex in some region B ⊆ R^D, while E_init is the event that allows one to initialise the method at some (data-driven) θ_init = θ_init(Z^{(N)}) in that set B. Finally, intersection with E_wass further guarantees that the posterior measure Π(·|Z^{(N)}) is close in Wasserstein distance to a globally log-concave surrogate probability measure Π̃(·|Z^{(N)}) which locally coincides with Π(·|Z^{(N)}) up to proportionality factors. In applying the results of this section to a concrete sampling problem, one needs to show that all the events E_conv, E_init, E_wass have sufficiently high frequentist P^N_{θ_0}-probability, where θ_0 is the ground truth parameter generating the data (37). For the event E_conv we provide a generic method in Lemma 3.4, based on a stability estimate for the linearisation of the map G combined with high-dimensional concentration of measure techniques. The events E_init and E_wass are somewhat more specific to a given problem, see Remark 3.10 for more discussion.

We will assume the set B ⊆ R^D of local convexity to be of ellipsoidal form.

Definition 3.1.
A norm |·| on R^D is called ellipsoidal if there exists a positive definite, symmetric matrix M ∈ R^{D×D} such that |θ| = (θ^T M θ)^{1/2} for any θ ∈ R^D.

Throughout this section, for some centring θ* ∈ R^D, scalar η > 0 and ellipsoidal norm |·| with associated matrix M, let B denote the open subset of R^D given by

(39) B := {θ ∈ R^D : |θ − θ*| < η}.

One may think of θ* as the projection of θ_0 onto R^D, but at this stage this is not necessary. While for the Schrödinger model with d ≤ 3 we can choose |·| = ‖·‖_{R^D}, in general (e.g., when d ≥ 4) a non-trivial ellipsoidal norm adapted to the problem may be required.
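For concreteness, the ellipsoidal norm of Definition 3.1 amounts to the following trivial helper (our code, for illustration only):

```python
import numpy as np

def ellipsoidal_norm(theta, M):
    """|theta| = (theta^T M theta)^{1/2} for symmetric positive definite M;
    M = identity recovers the Euclidean norm used in the Schrodinger model (d <= 3)."""
    return float(np.sqrt(theta @ (M @ theta)))
```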
In what follows, θ ∈ Θ is anarbitrary ‘ground truth’ and the gradient operator ∇ = ∇ θ will always act on G viewed as a mapon the subspace R D ⊆ Θ. Specifically we shall write ∇G ( θ ) and ∇ G ( θ ) for the following vectorand matrix fields ∇G ( θ ) : O → R D , ∇ G ( θ ) : O → R D × D , respectively. The following condition summarises some quantitative regularity conditions on themap G . These have to hold locally on the set B (and are satisfied, for instance, for any smooth G ). To formulate them we equip R D and R D × D with the Euclidean norm k · k R D and the operatornorm k · k op = k · k R D → R D (for linear maps from R D → R D ) respectively, and the functional normsof R D - or R D × D -valued fields are understood relative to these norms. [So for instance, in (40), onerequires a bound k for sup x ∈O k∇ G ( θ )( x ) k R D → R D that is uniform in θ ∈ B .] Assumption 3.2 (Local regularity) . Let B be given in (39). i) For any x ∈ O , the map θ
↦ G(θ)(x) is twice continuously differentiable on B.

ii) For some k_1, k_2, k_3 > 0,

(40) sup_{θ∈B} ‖G(θ) − G(θ_0)‖_∞ ≤ k_1, sup_{θ∈B} ‖∇G(θ)‖_{L^∞(O,R^D)} ≤ k_2, sup_{θ∈B} ‖∇²G(θ)‖_{L^∞(O,R^{D×D})} ≤ k_3.

iii) For some m_1, m_2, m_3 > 0 and any θ, θ̄ ∈ B, we have

‖G(θ) − G(θ̄)‖_∞ ≤ m_1|θ − θ̄|, ‖∇G(θ) − ∇G(θ̄)‖_{L^∞(O,R^D)} ≤ m_2|θ − θ̄|, ‖∇²G(θ) − ∇²G(θ̄)‖_{L^∞(O,R^{D×D})} ≤ m_3|θ − θ̄|.

We now turn to the central condition underlying the results in this section, in terms of a local curvature bound on E_{θ_0}[−∇²ℓ(θ, Z)], with ℓ(θ) : R^D → R from (38). To motivate it, notice that

(41) −∇²ℓ(θ, Z) = [∇G(θ)(X)][∇G(θ)(X)]^T + [G(θ)(X) − Y]∇²[G(θ)(X)].
If the design distribution P_X is uniform on a bounded domain O, then at θ = θ_0 the E^N_{θ_0}-expectation of the last expression can be represented as

(42) v^T E_{θ_0}[−∇²ℓ(θ_0, Z)] v = ‖∇G(θ_0)^T v‖²_{L²(O)}, v ∈ R^D;

indeed, under P_{θ_0} we have Y − G(θ_0)(X) = ε with E[ε|X] = 0, so that the second term in (41) vanishes in expectation. Therefore, if a suitable 'L²(O)-stability estimate' for the linearisation ∇G of G at θ_0 is available, the key condition (43) below holds at θ_0; by regularity of G this should extend to θ sufficiently close to θ_0. In the example with the Schrödinger equation studied in Section 2, such a stability estimate indeed follows from elliptic PDE theory, see Lemma 4.7.

Note that the Hessian E_{θ_0}[−∇²ℓ(θ, Z)] is symmetric (by (41) and Assumption 3.2 i)), and recall that λ_min(A) denotes the smallest eigenvalue of a symmetric matrix A.

Assumption 3.3 (Local curvature). Let B be given in (39) and let ℓ : R^D → R be as in (38).

i) For some c_min > 0, we have

(43) inf_{θ∈B} λ_min( E_{θ_0}[−∇²ℓ(θ, Z)] ) ≥ c_min.

ii) For some c_max ≥ c_min > 0, we have

(44) sup_{θ∈B} [ |E_{θ_0}ℓ(θ, Z)| + ‖E_{θ_0}[∇ℓ(θ, Z)]‖_{R^D} + ‖E_{θ_0}[∇²ℓ(θ, Z)]‖_op ] ≤ c_max.

The following lemma, which is based on concentration of measure arguments, shows that the local 'average' curvature bound in (43) carries over to the 'observed' log-likelihood function, with high frequentist P^N_{θ_0}-probability, whenever D ≤ R_N, where the dimension constraint is explicitly quantified in terms of the constants featuring in the previous hypotheses. The expression for R_N substantially simplifies in concrete settings but, in this general form, reflects the various non-asymptotic stochastic regimes of the log-likelihood function and its derivatives.

Lemma 3.4.
Suppose that data arise from (37) with ℓ_N : R^D → R given by (38), and suppose Assumptions 3.2, 3.3 are satisfied. There exists a universal constant C > 0 such that if

(45) R_N := CN min{ c_min/(C_G η), c²_min/(C²_G η²), c²_min/(C′_G)², c²_min/k²_3, c_max/(C″_G η), c²_max/((C″_G)² η²), c²_max/(C‴_G)², c²_max/(k²_1 + k²_2) },

where

(46) C_G := k_2 m_2 + k_3 m_1 + k_1 m_3 + m_2, C′_G := k²_2 + k_1 k_3 + k_3, C″_G := k_1 m_2 + k_2 m_1 + m_2 + k_1 m_3 + m_3, C‴_G := k_1 k_2 + k²_2 + k_1 k_3 + k_3,

then for any D, N ≥ 1 satisfying D ≤ R_N, we have

(47) P^N_{θ_0}( inf_{θ∈B} λ_min[−∇²ℓ_N(θ, Z^{(N)})] < N c_min/2 ) ≤ e^{−R_N},

as well as

(48) P^N_{θ_0}( sup_{θ∈B} [ |ℓ_N(θ, Z^{(N)})| + ‖∇ℓ_N(θ, Z^{(N)})‖_{R^D} + ‖∇²ℓ_N(θ, Z^{(N)})‖_op ] > N(5c_max + 1) ) ≤ e^{−R_N} + e^{−N/8}.
Inspection of the proof shows that for the first inequality (47), the terms involving c_max can be removed from the definition of R_N. In the sequel we will restrict considerations to the event

(49) E_conv := { inf_{θ∈B} λ_min[−∇²ℓ_N(θ)] ≥ N c_min/2 } ∩ { sup_{θ∈B} [ |ℓ_N(θ)| + ‖∇ℓ_N(θ)‖_{R^D} + ‖∇²ℓ_N(θ)‖_op ] ≤ N(5c_max + 1) },

whose P^N_{θ_0}-probability is controlled by Lemma 3.4.

3.2. Construction of the likelihood surrogate function.
For Bayesian computation via Langevin-type algorithms one needs to ensure recurrence of the underlying diffusion process, a sufficient condition for which is global log-concavity (on R^D) of the target measure to be sampled from, see Appendix A. To this end we now construct a 'surrogate log-likelihood function' ℓ̃_N : R^D → R for the log-likelihood ℓ_N such that ℓ̃_N = ℓ_N identically on the subset {θ ∈ R^D : |θ − θ*| ≤ η/2} of B from (39), and which will be shown to be globally log-concave on the event E from (60) below. In order to perform the convexification of −ℓ_N, one needs to identify the region B up to sufficient precision. In what follows, we denote by θ_init = θ_init(Z^{(N)}) ∈ R^D a (data-driven) point estimator at which the sampling algorithm is initialised; and we define the event E_init (a measurable subset of (R × O)^N) by

(50) E_init := { |θ_init − θ*| ≤ η/4 },

on which the initialiser θ_init belongs to the region B. That such initialisation is possible (i.e., that E_init has sufficiently high P^N_{θ_0}-probability for appropriate η >
0) is proved for the Schrödinger model in Theorem B.6.

We require two auxiliary functions, g_η (globally convex) and α_η (a cut-off function). For some smooth and symmetric (about 0) function ϕ : R → [0, ∞) satisfying supp(ϕ) ⊆ [−1,
1] and ∫_R ϕ(x) dx = 1, let us define the mollifiers ϕ_h(x) := h^{−1}ϕ(x/h), h >
0. Then, we define the functions γ̃_η, γ_η : R → R by

(51) γ̃_η(t) := 0 if t < 3η/2, γ̃_η(t) := (t − 3η/2)² if t ≥ 3η/2, and γ_η(t) := [ϕ_{η/2} ∗ γ̃_η](t),

where ∗ denotes convolution, and

(52) g_η : R^D → [0, ∞), g_η(θ) := γ_η(|θ − θ_init|).

Finally, for some smooth α : [0, ∞) → [0,
1] which satisfies α(t) = 1 for t ∈ [0, 5/4] and α(t) = 0 for t ∈ [7/4, ∞), we define the 'cut-off' function

(53) α_η : R^D → [0, 1], α_η(θ) = α(|θ − θ_init|/η).
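A schematic implementation may clarify the convexification entering the surrogate ℓ̃_N defined in (54) just below. The sketch uses the un-mollified ramp from (51) and a C¹ 'smoothstep' in place of the smooth function α; both are simplifications of (not identical to) the paper's C^∞ constructions, with constants matching the reconstruction above.

```python
import numpy as np

def g_eta(theta, theta_init, eta):
    """Convex penalty (52), without the mollification step of (51): zero near
    theta_init, quadratic growth beyond the ramp point 3*eta/2."""
    t = np.linalg.norm(theta - theta_init)
    return 0.0 if t < 1.5 * eta else (t - 1.5 * eta) ** 2

def alpha_eta(theta, theta_init, eta):
    """Cut-off (53): equals 1 for |theta - theta_init| <= 5*eta/4 and 0 beyond
    7*eta/4, with a C^1 smoothstep standing in for the smooth alpha."""
    t = np.linalg.norm(theta - theta_init) / eta
    s = np.clip((7 / 4 - t) / (1 / 2), 0.0, 1.0)
    return s * s * (3 - 2 * s)

def surrogate_loglik(theta, ell_N, theta_init, eta, K):
    """Surrogate log-likelihood (54): alpha_eta * ell_N - K * g_eta."""
    return alpha_eta(theta, theta_init, eta) * ell_N(theta) - K * g_eta(theta, theta_init, eta)
```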
Definition 3.5. For the auxiliary functions g_η, α_η from (52), (53) and K > 0, we define the surrogate likelihood function ℓ̃_N by

(54) ℓ̃_N : R^D → R, ℓ̃_N(θ) := α_η(θ)ℓ_N(θ) − Kg_η(θ).

When the constant K is chosen sufficiently large in terms of N, η and the constant c_max from Assumption 3.3, the following global convexity property can be proved for ℓ̃_N (see Appendix B for a proof).
Proposition 3.6.
On the event E conv ∩ E init (cf. (49), (50)), when ˜ ℓ N from (54) is defined withany constant K satisfying (55) K ≥ CN ( c max + 1) · λ max ( M ) /η λ min ( M ) , ( C > depending only on the function α above), we have ℓ N ( θ ) = ˜ ℓ N ( θ ) for all θ ∈ R D s.t. | θ − θ ∗ | ≤ η/ , and inf θ ∈ R D λ min (cid:0) − ∇ ˜ ℓ N ( θ ) (cid:1) ≥ N c min / , (56) as well as k∇ ˜ ℓ N ( θ ) − ∇ ˜ ℓ N (¯ θ ) k R D ≤ Kλ max ( M ) k θ − ¯ θ k R D , θ, ¯ θ ∈ R D . (57)3.3. Non-asymptotic bounds for Bayesian posterior computation.
We now consider theproblem of generating random samples from the posterior measureΠ[ B | Z ( N ) ] = R B e ℓ N ( θ,Z ( N ) ) d Π( θ ) R R D e ℓ N ( θ,Z ( N ) ) d Π( θ ) , B ⊆ R D measurable , arising from data (37) with log-likelihood (38) and Gaussian N (0 , Σ) prior Π of density π on R D ,with positive definite covariance matrix Σ ∈ R D × D .We use the stochastic gradient method obtained from an Euler discretisation of the D -dimensional Langevin diffusion (see Appendix A) with drift vector field ∇ (˜ ℓ N + log π ) basedon the surrogate likelihood function. More precisely, for stepsize γ > ξ k ∼ i.i.d. N (0 , I D × D ), define a Markov chain as ϑ = θ init ,ϑ k +1 = ϑ k + γ (cid:2) ∇ ˜ ℓ N ( ϑ k ) − Σ − ϑ k (cid:3) + p γξ k +1 , k = 0 , , . . . (58)Probabilities and expectations with respect to the law of this Markov chain (random only throughthe ξ k , conditional on the data Z ( N ) ) will be denoted by P θ init , E θ init respectively. The invari-ant measure of the underlying continuous time Langevin diffusion equals the surrogate posteriordistribution given by˜Π[ B | Z ( N ) ] := R B e ˜ ℓ N ( θ,Z ( N ) ) d Π( θ ) R R D e ˜ ℓ N ( θ,Z ( N ) ) d Π( θ ) , B ⊆ R D measurable . In the following results we assume that the Wasserstein distance W between ˜Π( ·| Z ( N ) ) andΠ( ·| Z ( N ) ) can be controlled, specifically, for any ρ >
0, let us define the event(59) E wass ( ρ ) := (cid:8) W (cid:0) Π (cid:2) · | Z ( N ) (cid:3) , ˜Π (cid:2) · | Z ( N ) (cid:3)(cid:1) ≤ ρ/ (cid:9) . For the Schr¨odinger model this is achieved in Theorem 4.14, for ρ decaying exponentially in N ,using that most of the posterior mass (and its mode) concentrate on the set B from (39).Our first result consists of a global Wasserstein-approximation of Π( ·| Z ( N ) ) by the law L ( ϑ k ) on R D of the k -th iterate ϑ k arising from (58). Theorem 3.7 (Non-asymptotic Wasserstein mixing) . Suppose that the model given by (37)-(38)fulfills the Assumptions 3.2, 3.3 for some < η ≤ , that D, N ∈ N are such that D ≤ R N with R N from (45) and let K be as in (55). Further define the constants m := N c min / λ min (Σ − ) , Λ := 7 Kλ max ( M ) + λ max (Σ − ) . Then for any < γ ≤ / Λ and any ρ > the algorithm ( ϑ k : k ≥ from (58) satisfies, on theevent (i.e., measurable subset of ( R × O ) N ) (60) E := E conv ∩ E init ∩ E wass ( ρ ) , (with E conv , E init , E wass ( ρ ) defined in (49), (50), (59), respectively), and all k ≥ , (61) W (cid:0) L ( ϑ k ) , Π[ ·| Z ( N ) ] (cid:1) ≤ ρ + b ( γ ) + 4 (cid:0) τ (Σ , M, R ) + Dm (cid:1)(cid:16) − γm (cid:17) k , where, for some universal constants c , c > , any R ≥ k θ ∗ k R D and κ (Σ) = λ max (Σ) /λ min (Σ) , (62) b ( γ ) = c h γD Λ m + γ D Λ m i , τ (Σ , M, R ) = c κ (Σ) h η λ min ( M ) + R i . From the previous theorem we can obtain the following bound on the computation of posteriorfunctionals by ergodic averages of ϑ k collected after some burn-in time J in ∈ N . Specifically, if wedefine, for any H : R D → R integrable with respect to Π( ·| Z ( N ) ), the random variable(63) ˆ π JJ in ( H ) = 1 J J in + J X k = J in +1 H ( ϑ k ) , we obtain the following non-asymptotic concentration bound. Theorem 3.8 (Lipschitz functionals) . In the setting of the previous theorem, there exist furtherconstants c , c > such that for any ρ > , any burn-in period (64) J in ≥ c mγ × log (cid:16) ρ + b ( γ ) + τ (Σ , M, R ) + Dm (cid:17) , any J ∈ N , any Lipschitz function H : R D → R , any (65) t ≥ √ k H k Lip p ρ + b ( γ ) and on the event E from (60), we have P θ init (cid:16)(cid:12)(cid:12) ˆ π JJ in ( H ) − E Π [ H | Z ( N ) ] (cid:12)(cid:12) ≥ t (cid:17) ≤ (cid:16) − c t m Jγ k H k Lip (1 + 1 / ( mJγ )) (cid:17) . (66)From the last theorem one can obtain as a direct consequence the following guarantee for com-putation of the posterior mean E Π [ θ | Z ( N ) ] by the ergodic average accrued along the Markov chain. Corollary 3.9.
In the setting of Theorem 3.8, if we define ¯ θ JJ in = 1 J J in + J X k = J in +1 ϑ k , then on the event E and for t ≥ √ p ρ + b ( γ ) , we have for some constant c > that P θ init (cid:16)(cid:13)(cid:13) ¯ θ JJ in − E Π [ θ | Z ( N ) ] (cid:13)(cid:13) R D ≥ t (cid:17) ≤ D exp (cid:16) − c t m JγD (1 + 1 / ( mJγ ) (cid:17) . (67) ANGEVIN ALGORITHMS 21
The two previous results imply that one can compute the posterior mean (or E Π [ H | Z ( N ) ] with k H k Lip ≤
1) within precision ε > ǫ & √ ρ : For instance if γ is chosen as γ ≃ min n ε m D Λ , εm / D / Λ o , then the overall number of required MCMC iterations J in + J depends polynomially on the quantities N, D, m − , Λ , ε − . When the latter three constants exhibit at most polynomial growth in N, D (asis the case for the Schr¨odinger equation treated in Section 2), we can deduce that polynomial-timecomputation of such posterior characteristics is feasible, on the event E from (60) at computationalcost J in + J = O ( N b D b ε − b ) , b , b , b >
0, with P θ init -probability as close to 1 as desired. Remark 3.10 (About the events E init , E wass ) . Controlling the probability of the events E init , E wass (featuring in the definition of E in (60)) on which the preceding bounds hold may pose a formidablechallenge in its own right when considering a concrete ‘forward map’ G . For our prototypicalexample of the Schr¨odinger equation from Section 2, this is achieved in Sections 4.2 and B.3. Theproofs there give some guidance for how to proceed in other settings, too. In essence one canexpect that in bounding the P Nθ -probability of the events E init , E wass , global ‘stability’ and ‘range’properties of the map G will play a role, whereas the Assumptions 3.2, 3.3 employed in this sectionare ‘local’ in the sense that they concern properties of G on B from (39) only. Discerning local fromglobal requirements on G in this way appears helpful both in the proofs and in the exposition ofthe main ideas of this paper. Remark 3.11 (Extensions to vector-valued data) . The key results of this section apply to othersettings where the ‘forward’ map G ( θ ) defines an element of the space of continuous maps C ( M → V ) from a d -dimensional compact Riemannian manifold M (possibly with boundary) into a finite-dimensional vector space V of fixed finite dimension dim ( V ) < ∞ . If we assume that the statisticalerrors ( ε i : i = 1 , . . . , N ) in equation (37) are i.i.d. N (0 , Id V ) in V , then the log-likelihood functionof the model is not given by (38) but instead of the form ℓ N ( θ ) = − N X i =1 k Y i − G ( θ )( X i ) k V , ℓ ( θ ) = − k Y − G ( θ )( X ) k V , where the X i , X are drawn i.i.d. from a Borel measure P X on M . Imposing Assumption 3.2 withthe obvious modification of the norms there for V -valued maps, and if Assumption 3.3 holds for thepreceding definition of ℓ ( θ ), then the conclusion of Lemma 3.4 remains valid as stated, after basicnotational adjustments in its proof.3.4. Proof of Lemma 3.4.
It suffices to prove the assertion for R N ≥
1. We first need some morenotation: For any x ∈ O , we denote the point evaluation map by G x : Θ → R , θ
7→ G ( θ )( x ) . For Z = ( Y, X ) ∼ P θ , we will frequently use the following identities in the proofs below (where werecall that ∇ and ∇ act on the θ -variable). − ℓ ( θ, Z ) = 12 (cid:2) Y − G X ( θ ) (cid:3) = 12 (cid:2) G X ( θ ) + ε − G X ( θ ) (cid:3) , −∇ ℓ ( θ, Z ) = (cid:2) G X ( θ ) − G ( θ ) − ε (cid:3) ∇G X ( θ ) , −∇ ℓ ( θ, Z ) = ∇G X ( θ ) ∇G X ( θ ) T + (cid:2) G X ( θ ) − G ( θ ) − ε (cid:3) ∇ G X ( θ ) , − E θ (cid:2) ℓ ( θ, Z ) (cid:3) = 12 + 12 E X [ G X ( θ ) − G X ( θ )] , (68) where we note that by Assumption 3.2, the Hessian ∇ ℓ ( θ, Z ) is a symmetric D × D matrix field.When no confusion can arise, we will suppress the second argument Z and write ℓ ( θ ) for ℓ ( θ, Z ).Throughout, P N := N − P Ni =1 δ Z i denotes the empirical measure induced by Z ( N ) , which actson measurable functions h : R × O → R via P N ( h ) = Z R ×O hdP N = 1 N N X i =1 h ( Z i ) . Proof of (47).
Let us write ¯ ℓ N := ℓ N /N . Then, by a standard inequality due to Weyl as wellas Assumption 3.3, we have for any θ ∈ B that λ min (cid:2) − ∇ ¯ ℓ N ( θ ) (cid:3) ≥ λ min (cid:0) E θ (cid:2) − ∇ ℓ ( θ ) (cid:3)(cid:1) − (cid:13)(cid:13) ∇ ¯ ℓ N ( θ ) − E θ (cid:2) ∇ ℓ ( θ ) (cid:3)(cid:13)(cid:13) op ≥ c min − (cid:13)(cid:13) ∇ ¯ ℓ N ( θ ) − E θ (cid:2) ∇ ℓ ( θ ) (cid:3)(cid:13)(cid:13) op . (69)Hence we deduce P Nθ (cid:16) inf θ ∈B λ min (cid:2) ∇ ℓ N ( θ, Z ) (cid:3) < N c min / (cid:17) ≤ P Nθ (cid:16)(cid:13)(cid:13) ∇ ¯ ℓ N ( θ ) − E θ (cid:2) ∇ ℓ ( θ ) (cid:3)(cid:13)(cid:13) op ≥ c min / θ ∈ B (cid:17) ≤ P Nθ (cid:16) sup θ ∈B sup v : k v k R D ≤ (cid:12)(cid:12)(cid:12) v T (cid:16) ∇ ¯ ℓ N ( θ ) − E θ [ ∇ ℓ ( θ )] (cid:17) v (cid:12)(cid:12)(cid:12) ≥ c min / (cid:17) = P Nθ (cid:16) sup θ ∈B sup v : k v k R D ≤ (cid:12)(cid:12) P N ( g v,θ ) (cid:12)(cid:12) ≥ c min / (cid:17) , (70)where g v,θ ( · ) := v T (cid:16) ∇ ℓ ( θ, · ) − E θ [ ∇ ℓ ( θ )] (cid:17) v, v ∈ R D . The next step is to reduce the supremum over { v : k v k R D ≤ } to a suitable finite maximumover grid points v i by a contraction argument (commonly used in high-dimensional probability).For ρ >
0, let N ( ρ ) denote the minimal number of balls of k · k R D − radius ρ required to cover { v : k v k R D ≤ } , and let v i , k v i k R D ≤
1, be the centre points of a minimal covering. Thus for any v ∈ R D there exists an index i such that k v − v i k R D ≤ ρ . Hence, writing shorthand M θ = ∇ ¯ ℓ N ( θ ) − E θ [ ∇ ℓ ( θ )] , θ ∈ B , we have by the Cauchy-Schwarz inequality and the symmetry of the matrix M θ , v T M θ v = v Ti M θ v i + ( v − v i ) T M θ v + v Ti M θ ( v − v i )= v Ti M θ v i + k v − v i k R D k M θ v k R D + k v − v i k R D k M θ v i k R D ≤ v Ti M θ v i + 2 ρ sup v : k v k R D ≤ v T M θ v. Choosing ρ = and taking suprema it follows that for any θ ∈ B ,(71) sup v : k v k R D ≤ v T M θ v ≤ i =1 ,...,N (1 / v Ti M θ v i . ANGEVIN ALGORITHMS 23
Since the covering ( v i ) is independent of θ , we can further estimate the right hand side of (70) bya union bound to the effect that P Nθ (cid:16) sup θ ∈B sup v : k v k R D ≤ (cid:12)(cid:12)(cid:12) v T M θ v (cid:12)(cid:12)(cid:12) ≥ c min / (cid:17) ≤ N (1 / · sup v : k v k R D ≤ P Nθ (cid:16) sup θ ∈B (cid:12)(cid:12)(cid:12) v T M θ v (cid:12)(cid:12)(cid:12) ≥ c min / (cid:17) ≤ N (1 / · sup v : k v k R D ≤ h P Nθ (cid:16) sup θ ∈B (cid:12)(cid:12)(cid:12) P N ( g v,θ − g v,θ ∗ ) (cid:12)(cid:12)(cid:12) ≥ c min / (cid:17) + P Nθ (cid:0)(cid:12)(cid:12) P N ( g v,θ ∗ ) (cid:12)(cid:12) ≥ c min / (cid:1)i , (72)where we recall that θ ∗ is the centrepoint of the set B from (39). For the rest of the proof, we fixany v ∈ R D with k v k R D = 1. Next, we use (68) to decompose the ‘uncentred’ part of g v,θ as − v T ∇ ℓ ( θ, Z ) v = v T h G X ( θ ) ∇G X ( θ ) T + (cid:2) G X ( θ ) − G X ( θ ) (cid:3) ∇ G X ( θ ) i v − εv T ∇ G X ( θ ) v =: ˜ g Iv,θ ( X ) + εg IIv,θ ( X ) , such that g v,θ ( z ) = g Iv,θ ( x ) + εg IIv,θ ( x ) , where we have defined the centred version of ˜ g Iv,θ as g Iv,θ ( x ) = ˜ g Iv,θ ( x ) − E θ [˜ g Iv,θ ( X )] , x ∈ O . We can therefore bound the right hand side of (72) by N (cid:16) (cid:17) · sup v : k v k R D ≤ (cid:20) P Nθ (cid:16) sup θ ∈B (cid:12)(cid:12) N N X i =1 ( g Iv,θ − g Iv,θ ∗ )( X i ) (cid:12)(cid:12) ≥ c min (cid:17) + P Nθ (cid:16)(cid:12)(cid:12) N N X i =1 g Iv,θ ∗ ( X i ) (cid:12)(cid:12) ≥ c min (cid:17) + P Nθ (cid:16) sup θ ∈B (cid:12)(cid:12) N N X i =1 ε i ( g IIv,θ − g IIv,θ ∗ )( X i ) (cid:12)(cid:12) ≥ c min (cid:17) + P Nθ (cid:16)(cid:12)(cid:12) N N X i =1 ε i g IIv,θ ∗ ( X i ) (cid:12)(cid:12) ≥ c min (cid:17)(cid:21) =: N (1 / · ( i + ii + iii + iv ) . We now use empirical process techniques (Lemma 3.12 and also Hoeffding’s inequality) to boundthe preceding probabilities.
Terms i and ii . In order to apply Lemma 3.12 to term i , we require some preparations. By thedefinition of ˜ g Iv,θ and of the operator norm k · k op , using the elementary identity v T ( aa T − bb T ) v = v T ( a + b )( a − b ) T v for any v, a, b ∈ R D and Assumption 3.2, we have that for any θ, ¯ θ ∈ B , k ˜ g Iv,θ − ˜ g Iv, ¯ θ k ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:2) ∇G ( θ ) ∇G ( θ ) T + (cid:2) G ( θ ) − G ( θ ) (cid:3) ∇ G ( θ ) (cid:3) − (cid:2) ∇G (¯ θ ) ∇G (¯ θ ) T + (cid:2) G (¯ θ ) − G ( θ ) (cid:3) ∇ G (¯ θ ) (cid:3)(cid:13)(cid:13)(cid:13) L ∞ ( O , R D × D ) ≤ (cid:13)(cid:13)(cid:13)(cid:2) ∇G ( θ ) − ∇G (¯ θ ) (cid:3)(cid:2) ∇G ( θ ) + ∇G (¯ θ ) (cid:3) T (cid:13)(cid:13)(cid:13) L ∞ ( O , R D × D ) + (cid:13)(cid:13)(cid:13)(cid:2) G ( θ ) − G (¯ θ ) (cid:3) ∇ G ( θ ) (cid:13)(cid:13)(cid:13) L ∞ ( O , R D × D ) + (cid:13)(cid:13)(cid:13)(cid:2) G (¯ θ ) − G ( θ ) (cid:3)(cid:2) ∇ G ( θ ) − ∇ G (¯ θ ) (cid:3)(cid:13)(cid:13)(cid:13) L ∞ ( O , R D × D ) ≤ m k | θ − ¯ θ | + m k | θ − ¯ θ | + m k | θ − ¯ θ | ≤ C G | θ − ¯ θ | . (73) In particular, by (39) we obtain the uniform boundsup θ ∈B k g Iv,θ − g Iv,θ ∗ k ∞ ≤ θ ∈B k ˜ g Iv,θ ( X ) − ˜ g Iv,θ ∗ k ∞ ≤ C G | θ − θ ∗ | ≤ C G η. (74)We introduce the rescaled function class h Iθ := g Iv,θ − g Iv,θ ∗ C G η , H I = { h Iθ : θ ∈ B} , which has envelope and variance proxy bounded as(75) sup θ ∈B k h Iθ k ∞ ≤ / ≡ U, sup θ ∈B (cid:0) E θ (cid:2) h Iθ ( X ) (cid:3)(cid:1) ≤ / ≡ σ. Next, if d ( θ, ¯ θ ) = E θ (cid:2) ( h Iθ ( X ) − h I ¯ θ ( X )) (cid:3) , d ∞ ( θ, ¯ θ ) = k h Iθ − h I ¯ θ k ∞ , θ, ¯ θ ∈ B , then using (73) we have that d ( θ, ¯ θ ) ≤ d ∞ ( θ, ¯ θ ) ≤ | θ − ¯ θ | /η, θ, ¯ θ ∈ B . Thus for any ρ ∈ (0 , N (cid:0) H I , d , ρ (cid:1) ≤ N (cid:0) H I , d ∞ , ρ (cid:1) ≤ N (cid:0) B , | · | /η, ρ (cid:1) ≤ (3 /ρ ) D . For any A ≥ Z log( A/x ) dx = log( A ) + 1 , Z p log( A/x ) dx ≤ A A − p log( A ) , [see p.190 of [28] for the latter inequality], and hence, using this for A = 3, we can respectivelybound the L ∞ and L metric entropy integrals of H I by J ∞ ( H I ) = Z U log N ( H I , d ∞ , ρ ) dρ . D,J ( H I ) ≤ Z σ q log N ( H I , d , ρ ) dρ . √ D. Now, an application of Lemma 3.12 below implies that for any x ≥ L ′ >
0, we have that(77) P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 h Iθ ( X i ) (cid:12)(cid:12)(cid:12) ≥ L ′ h √ D + √ x + ( D + x ) / √ N i(cid:17) ≤ e − x . We also have by the definition of g Iv,θ ∗ that k g Iv,θ ∗ k ∞ ≤ k ˜ g Iv,θ ∗ k ∞ ≤ k + k k ) , and hence by Hoeffding’s inequality (Theorem 3.1.2 in [28]) that(78) ii ≤ (cid:16) − N c min · k + k k ) (cid:17) ≤ (cid:16) − N c min C ′ G (cid:17) . Now if we define(79) R ,IN := CN min (cid:26) c min C G η , c min C G η , c min C ′ G (cid:27) , ANGEVIN ALGORITHMS 25 then for any D ≤ R ,IN and choosing x = 4 R ,IN we have L h √ D + √ x + ( D + x ) / √ N i ≤ c min √ N C G η , R ,IN ≤ N c min C ′ G , whenever C > i and of h Iθ , we obtain ii + i ≤ e − R ,IN + P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 h Iθ ( X i ) (cid:12)(cid:12)(cid:12) ≥ c min √ N C G η (cid:17) ≤ e − R ,IN . (80) Terms iii and iv . Let us now treat the empirical process indexed by the functions { g IIv,θ : θ ∈ B} .Since k v k R D ≤
1, we have for any θ, ¯ θ ∈ B , k g IIv,θ − g IIv, ¯ θ k ∞ ≤ k∇ G ( θ ) − ∇ G (¯ θ ) k L ∞ ( O , R D × D ) ≤ m | θ − ¯ θ | , which also yields the envelope boundsup θ ∈B (cid:13)(cid:13) g IIv,θ − g IIv,θ ∗ (cid:13)(cid:13) ∞ ≤ m sup θ ∈B | θ − θ ∗ | ≤ m η. Now the rescaled function class h IIθ := g IIv,θ − g IIv,θ ∗ m η , H II = { h IIθ : θ ∈ B} , admits envelopes sup θ ∈B k h IIv,θ k ∞ ≤ / ≡ U, sup θ ∈B (cid:0) E θ (cid:2) h IIv,θ ( X ) (cid:3)(cid:1) ≤ / ≡ σ. Thus defining d ( θ, ¯ θ ) := E θ (cid:2) ( h IIv,θ ( X ) − h IIv, ¯ θ ( X )) (cid:3) , d ∞ ( θ, ¯ θ ) = k h IIv,θ − h IIv, ¯ θ k ∞ , θ, ¯ θ ∈ B we have d ( θ, ¯ θ ) ≤ d ∞ ( θ, ¯ θ ) ≤ | θ − ¯ θ | /η, θ, ¯ θ ∈ B . Therefore, just as with the bounds obtained for term i , we have N (cid:0) H II , d , ρ (cid:1) ≤ (3 /ρ ) D and thus,by Lemma 3.12 below,(81) P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 ε i h IIθ ( X i ) (cid:12)(cid:12)(cid:12) ≥ L ′ h √ D + √ x + ( D + x ) / √ N i(cid:17) ≤ e − x , x ≥ . Moreover, by the hypotheses, k g IIv,θ ∗ k ∞ ≤ k , and hence, invoking the Bernstein inequality (96)with U = σ ≡ k , we obtain that(82) P Nθ (cid:16)(cid:12)(cid:12)(cid:12) √ N N X i =1 ε i g IIv,θ ∗ ( X i ) (cid:12)(cid:12)(cid:12) ≥ k √ x + k x √ N (cid:17) ≤ e − x , x > . We can now set R ,IIN := CN min (cid:26) c min m η , c min m η , c min k , c min k (cid:27) , and choosing x = 4 R ,IIN in the preceding displays, we obtain that for C > D ≤ R ,IIN , iii + iv ≤ P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 ε i h IIθ ( X i ) (cid:12)(cid:12)(cid:12) ≥ c min √ N m η (cid:17) + P Nθ (cid:16)(cid:12)(cid:12)(cid:12) √ N N X i =1 ε i g IIv,θ ∗ ( X i ) (cid:12)(cid:12)(cid:12) ≥ c min √ N (cid:17) ≤ e − R ,IIN . (83) Combining the terms.
By combining the bounds (70), (72), (80), (83) and using that N (1 / ≤ D ≤ e D (cf. Proposition 4.3.34 in [28]) we obtain that since D ≤ R N ≤ min( R ,IN , R ,IIN ) from (45), P Nθ (cid:16) inf θ ∈B λ min (cid:0) − ∇ ℓ N ( θ, Z ) (cid:1) < N c min / (cid:17) ≤ N (1 / · ( i + ii + iii + iv ) ≤ e D − R ,IN + 4 e D − R ,IIN ≤ e −R N , completing the proof of (47). (cid:3) Proof of (48).
We derive probability bounds for each of the three terms in (48) separately.The general scheme of proof for each of the three bounds is similar to the proof of (47), and wecondense some of the steps to follow.
Second order term.
Using that c max ≥ c min , we can replace (70) by P Nθ (cid:16) sup θ ∈B λ max (cid:2) − ∇ ℓ N ( θ, Z ) (cid:3) ≥ N c max / (cid:17) ≤ P Nθ (cid:16) sup θ ∈B sup v : k v k R D ≤ (cid:12)(cid:12) P N ( g v,θ ) (cid:12)(cid:12) ≥ c min / (cid:17) . From here onwards, this term can be treated exactly as in the proof of (47) and thus, for D ≤ R n from (45), we deduce(84) P Nθ (cid:16) sup θ ∈B λ max (cid:2) − ∇ ℓ N ( θ, Z ) (cid:3) ≥ N c max / (cid:17) ≤ e −R N . First order term.
First, let us denote f v,θ ( z ) := v T (cid:16) ∇ ℓ ( θ, z ) − E θ [ ∇ ℓ ( θ, Z )] (cid:17) , k v k R D ≤ , θ ∈ B , and let ( v i : i = 1 , ..., N (1 / k · k R D -covering with balls of radius 1 /
2, ofthe unit ball { θ : k θ k R D ≤ } . Then for any v there exists v i such that k v − v i k R D ≤ / | P N ( f v,θ ) | ≤ | P N ( f v,θ − f v i ,θ ) | + | P N ( f v i ,θ ) |≤ k v − v i k R D (cid:13)(cid:13) ∇ ¯ ℓ N ( θ ) − E θ (cid:2) ∇ ℓ ( θ ) (cid:3)(cid:13)(cid:13) R D + | P N ( f v i ,θ ) |≤ (cid:13)(cid:13) ∇ ¯ ℓ N ( θ ) − E θ (cid:2) ∇ ℓ ( θ ) (cid:3)(cid:13)(cid:13) R D + | P N ( f v i ,θ ) | . Therefore, since k u k R D = sup v : k v k R D ≤ | v T u | for any u ∈ R D , we deduce for any θ ∈ B ,(85) sup v : k v k R D ≤ | P N ( f v,θ ) | ≤ ≤ i ≤ N (1 / | P N ( f v i ,θ ) | . ANGEVIN ALGORITHMS 27
We can hence estimate P Nθ (cid:16) sup θ ∈B k∇ ¯ ℓ N ( θ ) k R D ≥ c max / (cid:17) ≤ P Nθ (cid:16) sup θ ∈B sup v : k v k R D ≤ (cid:12)(cid:12) v T (cid:2) ∇ ¯ ℓ N ( θ ) − E θ [ ∇ ℓ ( θ )] (cid:3)(cid:12)(cid:12) ≥ c max / (cid:17) ≤ N (1 / · sup v : k v k R D ≤ P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f v,θ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) . (86)We fix v ∈ R D with k v k R D ≤
1. Using (68), by decomposing the ‘uncentred’ part of f v,θ into v T ∇ ℓ ( θ, Z ) = v T ∇G X ( θ ) (cid:2) G X ( θ ) − G ( θ ) (cid:3) − εv T ∇G X ( θ ) =: ˜ f Iv,θ ( X ) − εf IIv,θ ( X ) , we can then write f v,θ ( z ) = f Iv,θ ( x ) + εf IIv,θ ( x ) , where we have further defined f Iv,θ ( x ) := ˜ f Iv,θ ( x ) − E θ [ ˜ f Iv,θ ( X )]. We then estimate the probabilityon the right hand side of (86) as follows, P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f v,θ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) ≤ P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f Iv,θ − f Iv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) + P Nθ (cid:0)(cid:12)(cid:12) P N (cid:0) f Iv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) + P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f IIv,θ − f IIv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) + P Nθ (cid:0)(cid:12)(cid:12) P N (cid:0) f IIv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) =: i + ii + iii + iv. (87)We first treat the terms i and ii . By the definition of ˜ f Iv,θ and Assumption 3.2, we have that forany θ, ¯ θ ∈ B , (cid:13)(cid:13) ˜ f Iv,θ − ˜ f Iv, ¯ θ (cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:2) ∇G ( θ ) − ∇G (¯ θ ) (cid:3)(cid:2) G ( θ ) − G ( θ ) (cid:3) + ∇G (¯ θ ) (cid:2) G ( θ ) − G (¯ θ ) (cid:3)(cid:13)(cid:13) L ∞ ( O , R D ) ≤ ( k m + k m ) | θ − ¯ θ | . Again using Assumption 3.2, we have likewisesup θ ∈B (cid:13)(cid:13) ˜ f Iv,θ − ˜ f Iv,θ ∗ (cid:13)(cid:13) ∞ ≤ ( k m + k m ) η. Moreover, using that k f Iv,θ ∗ k ∞ ≤ k k , Hoeffding’s inequality yields that ii ≤ (cid:16) − N c max k k (cid:17) . Therefore, by using Lemma 3.12 in the same manner as in (77), we obtain that the rescaled process h Iv,θ := ˜ f Iv,θ − ˜ f Iv,θ ∗ k m + k m ) η satisfies(88) P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 h Iθ ( X i ) (cid:12)(cid:12)(cid:12) ≥ L ′ h √ D + √ x + ( D + x ) / √ N i(cid:17) ≤ e − x , x ≥ . Thus, setting R ,IN =: CN min n c max ( k m + k m ) η , c max ( k m + k m ) η , c max k k o ,8 R. NICKL AND S. WANG
1. Using (68), by decomposing the ‘uncentred’ part of f v,θ into v T ∇ ℓ ( θ, Z ) = v T ∇G X ( θ ) (cid:2) G X ( θ ) − G ( θ ) (cid:3) − εv T ∇G X ( θ ) =: ˜ f Iv,θ ( X ) − εf IIv,θ ( X ) , we can then write f v,θ ( z ) = f Iv,θ ( x ) + εf IIv,θ ( x ) , where we have further defined f Iv,θ ( x ) := ˜ f Iv,θ ( x ) − E θ [ ˜ f Iv,θ ( X )]. We then estimate the probabilityon the right hand side of (86) as follows, P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f v,θ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) ≤ P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f Iv,θ − f Iv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) + P Nθ (cid:0)(cid:12)(cid:12) P N (cid:0) f Iv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) + P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N (cid:0) f IIv,θ − f IIv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) + P Nθ (cid:0)(cid:12)(cid:12) P N (cid:0) f IIv,θ ∗ (cid:1)(cid:12)(cid:12) ≥ c max / (cid:1) =: i + ii + iii + iv. (87)We first treat the terms i and ii . By the definition of ˜ f Iv,θ and Assumption 3.2, we have that forany θ, ¯ θ ∈ B , (cid:13)(cid:13) ˜ f Iv,θ − ˜ f Iv, ¯ θ (cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:2) ∇G ( θ ) − ∇G (¯ θ ) (cid:3)(cid:2) G ( θ ) − G ( θ ) (cid:3) + ∇G (¯ θ ) (cid:2) G ( θ ) − G (¯ θ ) (cid:3)(cid:13)(cid:13) L ∞ ( O , R D ) ≤ ( k m + k m ) | θ − ¯ θ | . Again using Assumption 3.2, we have likewisesup θ ∈B (cid:13)(cid:13) ˜ f Iv,θ − ˜ f Iv,θ ∗ (cid:13)(cid:13) ∞ ≤ ( k m + k m ) η. Moreover, using that k f Iv,θ ∗ k ∞ ≤ k k , Hoeffding’s inequality yields that ii ≤ (cid:16) − N c max k k (cid:17) . Therefore, by using Lemma 3.12 in the same manner as in (77), we obtain that the rescaled process h Iv,θ := ˜ f Iv,θ − ˜ f Iv,θ ∗ k m + k m ) η satisfies(88) P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 h Iθ ( X i ) (cid:12)(cid:12)(cid:12) ≥ L ′ h √ D + √ x + ( D + x ) / √ N i(cid:17) ≤ e − x , x ≥ . Thus, setting R ,IN =: CN min n c max ( k m + k m ) η , c max ( k m + k m ) η , c max k k o ,8 R. NICKL AND S. WANG and choosing x = 3 R ,IN in (88), we obtain that for C > D ≤ R ,IN ,(89) ii + i ≤ e − R ,IN + P Nθ (cid:16)(cid:12)(cid:12)(cid:12) √ N N X i =1 h Iv,θ ( X i ) (cid:12)(cid:12)(cid:12) ≥ c max √ N k m + k m ) η (cid:17) ≤ e − R ,IN . We now treat the terms iii and iv . As k v k R D ≤
1, we have that for any θ, ¯ θ ∈ B , k f IIv,θ − f IIv, ¯ θ k ∞ ≤ m | θ − ¯ θ | , k f IIv,θ − f IIv,θ ∗ k ∞ ≤ m η, k f IIv,θ ∗ k ∞ ≤ k . Therefore, by utilising the Lemma 3.12 below as well as Bernstein’s inequality (96) in precisely thesame manner as in the derivations of (81) and (82) respectively, we obtain the two inequalities P Nθ (cid:16) sup θ ∈B √ N (cid:12)(cid:12)(cid:12) N X i =1 ε i f IIv,θ ( X i ) − f IIv,θ ∗ ( X i )4 m η (cid:12)(cid:12)(cid:12) ≥ L ′ h √ D + √ x + ( D + x ) / √ N i(cid:17) ≤ e − x , x ≥ , and P Nθ (cid:16)(cid:12)(cid:12)(cid:12) √ N N X i =1 ε i f IIv,θ ∗ ( X i ) (cid:12)(cid:12)(cid:12) ≥ k √ x + k x √ N (cid:17) ≤ e − x , x > . Thus, if we set R ,IIN := CN min n c max m η , c max m η , c max k , c max k o , then for C > D ≤ R ,IIN and choosing x = 3 R ,IIN in the preceding displays,we obtain(90) iii + iv ≤ e − R ,IIN . By combining (86), (87), (89), (90), using that N (1 / ≤ e D (cf. Proposition 4.3.34 in [28]) andsince D ≤ R N ≤ min( R ,IN , R ,IIN ), we conclude that P Nθ (cid:16) sup θ ∈B k∇ ¯ ℓ N ( θ ) k R D ≥ c max / (cid:17) ≤ N (1 / · ( i + ii + iii + iv ) ≤ e D − R ,IN + 4 e D − R ,IIN ≤ e −R N . (91) Order zero term.
As with the previous terms, we introduce a decomposition − ℓ ( θ, Z ) = 12 (cid:2) G X ( θ ) − G X ( θ ) (cid:3) − ε (cid:2) G X ( θ ) − G X ( θ ) (cid:3) + ε l Iθ ( X ) + εl IIθ ( X ) + ε , and therefore, defining l Iθ ( x ) =: ˜ l Iθ ( x ) − E θ [˜ l Iθ ( X )] , x ∈ O , we have that − ℓ ( θ, Z ) + E θ [ ℓ ( θ )] = l Iθ ( X ) + εl IIθ ( X ) + ε . ANGEVIN ALGORITHMS 29
Then, using Assumption 3.3, we can estimate P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) ¯ ℓ N ( θ, Z ) (cid:12)(cid:12) ≥ c max + 1 (cid:1) ≤ P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) ¯ ℓ N ( θ, Z ) − E θ [ ℓ ( θ, Z )] (cid:12)(cid:12) ≥ c max + 1 (cid:1) ≤ P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N ( l Iθ − l Iθ ∗ ) (cid:12)(cid:12) ≥ c max (cid:1) + P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N ( l Iθ ∗ ) (cid:12)(cid:12) ≥ c max (cid:1) + P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N ( l IIθ − l IIθ ∗ ) (cid:12)(cid:12) ≥ c max (cid:1) + P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) P N ( l IIθ ∗ ) (cid:12)(cid:12) ≥ c max (cid:1) + P Nθ (cid:16) N N X i =1 ε i ≥ (cid:17) =: i + ii + iii + iv + v. To bound the preceding terms, we use Assumption 3.2 to deduce that for all θ, ¯ θ ∈ B , k l Iθ − l I ¯ θ k ∞ ≤ k ˜ l Iθ − ˜ l I ¯ θ k ∞ = (cid:13)(cid:13) − G ( θ ) (cid:2) G ( θ ) − G (¯ θ ) (cid:3) + G ( θ ) − G (¯ θ ) (cid:13)(cid:13) ∞ = (cid:13)(cid:13)(cid:2) ( G ( θ ) − G ( θ )) + ( G (¯ θ ) − G ( θ )) (cid:3)(cid:2) G ( θ ) − G (¯ θ ) (cid:3)(cid:13)(cid:13) ∞ ≤ k m | θ − ¯ θ | , as well as sup θ ∈B k l Iθ − l Iθ ∗ k ∞ ≤ k m η, k l Iθ ∗ k ∞ ≤ k . Moreover, again by Assumption 3.2 we have that for all θ, ¯ θ ∈ B , k l IIθ − l II ¯ θ k ∞ ≤ m | θ − ¯ θ | , sup θ ∈B k l IIθ − l IIθ ∗ k ∞ ≤ m η, k l IIθ ∗ k ∞ ≤ k . Next, similarly as for the second and first order terms, in order to control the terms i and iii wenow apply Lemma 3.12 to the empirical processes indexed by the rescaled empirical processes h Iθ := l Iθ − l Iθ ∗ k m η , h IIθ := l IIθ − l IIθ ∗ m η , and in order to control the terms ii and iv , we respectively apply Hoeffding’s inequality and Bern-stein’s inequality (96) in the same manner as before. Overall, if we set R ,IN := CN min n c max k m η , c max k m η , c max k o , R ,IIN := CN min n c max m η , c max m η , c max k , c max k o , (92)then for C > D ≤ R N ≤ min( R ,IN , R ,IIN ), i + ii + iii + iv ≤ P Nθ (cid:0) sup θ ∈B √ N (cid:12)(cid:12) N X i =1 h Iθ ( X i ) (cid:12)(cid:12) ≥ c max √ N k m η (cid:1) + 2 exp (cid:16) − N c max k (cid:17) + P Nθ (cid:0) sup θ ∈B √ N (cid:12)(cid:12) N X i =1 h IIθ ( X i ) (cid:12)(cid:12) ≥ c max √ N m η (cid:1) + 2 e −R ,IIN ≤ e −R ,IN + 4 e −R ,IIN ≤ e −R N . Finally, we estimate the term v by a standard tail inequality (see Theorem 3.1.9 in [28]), v = P Nθ (cid:16) N X i =1 ( ε i − ≥ N (cid:17) ≤ e − N/ , and thus obtain(93) P Nθ (cid:0) sup θ ∈B (cid:12)(cid:12) ¯ ℓ N ( θ, Z ) (cid:12)(cid:12) ≥ c max + 1 (cid:1) ≤ i + ii + iii + iv + v ≤ e −R N + e − N/ . Conclusion.
By combining (84), (91) and (93), the proof of (48) is completed. (cid:3)
A chaining lemma for empirical processes.
The following key technical lemma is basedon a chaining argument for stochastic processes with a mixed tail (cf. Theorem 2.2.28 in Talagrand[65] and Theorem 3.5 in Dirksen [19]). For us it will be sufficient to control the ‘generic chaining’functionals employed in these references by suitable metric entropy integrals. For any (semi-)metric d on a metric space T , we denote by N = N ( T, d, ρ ) the minimal cardinality of a covering of T byballs with centres ( t i : i = 1 , . . . , N ) ⊂ T such that for all t ∈ T there exists i such that d ( t, t i ) < ρ .Below we require the index set Θ to be countable (to avoid measurability issues). Whenever weapply Lemma 3.12 in this article with an uncountable set Θ, one can show that the supremum canbe realised as one over a countable subset of it. Lemma 3.12.
Let Θ be a countable set. Suppose a class of real-valued measurable functions H = { h θ : X → R , θ ∈ Θ } defined on a probability space ( X , A , P X ) is uniformly bounded by U ≥ sup θ k h θ k ∞ and has varianceenvelope σ ≥ sup θ E X h θ ( X ) where X ∼ P X . Define metric entropy integrals J ( H ) = Z σ p log N ( H , d , ρ ) dρ, d ( θ, θ ′ ) := q E X [ h θ ( X ) − h θ ′ ( X )] ,J ∞ ( H ) = Z U log N ( H , d ∞ , ρ ) dρ, d ∞ ( θ, θ ′ ) := k h θ − h θ ′ k ∞ . For X , . . . , X N drawn i.i.d. from P X and ε i ∼ iid N (0 , independent of all the X i ’s, considerempirical processes arising either as Z N ( θ ) = 1 √ N N X i =1 h θ ( X i ) ε i , θ ∈ Θ , or as Z N ( θ ) = 1 √ N N X i =1 ( h θ ( X i ) − Eh θ ( X )) , θ ∈ Θ . We then have for some universal constant
L > and all x ≥ , Pr (cid:18) sup θ ∈ Θ | Z N ( θ ) | ≥ L h J ( H ) + σ √ x + ( J ∞ ( H ) + U x ) / √ N i(cid:19) ≤ e − x . Proof.
We only prove the case where Z N ( θ ) = P i h θ ( X i ) ε i / √ N , the simpler case without Gaussianmultipliers is proved in the same way. We will apply Theorem 3.5 in [19], whose condition (3.8) we ANGEVIN ALGORITHMS 31 need to verify. First notice that for | λ | < / k h θ − h θ ′ k ∞ , and E ε denoting the expectation withrespect to ε , E exp (cid:8) λε ( h θ − h θ ′ )( X ) (cid:9) ≤ ∞ X k =2 | λ | k E ε | ε | k E X | h θ − h θ ′ | k ( X ) k ! ≤ λ E X [ h θ ( X ) − h θ ′ ( X )] ∞ X k =2 E ε | ε | k k ! (cid:0) | λ |k h θ − h θ ′ k ∞ (cid:1) k − ≤ exp n λ d ( θ, θ ′ )1 − | λ | d ∞ ( θ, θ ′ ) o (94)where we have used the basic fact E ε | ε | k /k ! ≤
1. By the i.i.d. hypothesis we then also have E exp n λ ( Z N ( θ ) − Z N ( θ ′ )) o ≤ exp (cid:26) λ d ( θ, θ ′ )1 − | λ | d ∞ ( θ, θ ′ ) / √ N (cid:27) . An application of the exponential Chebyshev inequality (and optimisation in λ , as in the proof ofProposition 3.1.8 in [28]) then implies that condition (3.8) in [19] holds for the stochastic process Z N ( θ ) with metrics ¯ d = 2 d and ¯ d = d ∞ / √ N .
In particular, the ¯ d -diameter ∆ ( H ) of H is atmost 4 σ and the ¯ d -diameter ∆ ( H ) of H is bounded by 4 U/ √ N . [These bounds are chosen sothat they remain valid for the process without Gaussian multipliers as well.] Theorem 3.5 in [19]now gives, for some universal constant M , and any θ † ∈ Θ thatPr (cid:18) sup θ ∈ Θ | Z N ( θ ) − Z N ( θ † ) | ≥ M (cid:0) γ ( H ) + γ ( H ) + σ √ x + ( U/ √ N ) x (cid:1)(cid:19) ≤ e − x where the ‘generic chaining’ functionals γ , γ are upper bounded by the respective metric entropyintegrals of the metric spaces ( H , ¯ d i ) , i = 1 ,
2, up to universal constants (see (2.3) in [19]). For γ also notice that a simple substitution ρ ′ = ρ √ N implies that Z U/ √ N log N ( H , ¯ d , ρ ) dρ = 1 √ N Z U log N ( H , d ∞ , ρ ′ ) dρ ′ , and we hence deduce that(95) Pr (cid:18) sup θ ∈ Θ | Z N ( θ ) − Z N ( θ † ) | ≥ ¯ L h J ( H ) + σ √ x + ( J ∞ ( H ) + U x ) / √ N i(cid:19) ≤ e − x for some universal constant ¯ L .Now what precedes also implies the classical Bernstein-inequality(96) Pr (cid:16) | Z N ( θ ) | ≥ σ √ x + U x √ N (cid:17) ≤ e − x , x > , for any fixed θ ∈ Θ , U ≥ k h θ k ∞ and σ ≥ E X h θ ( X ), proved as (3.24) in [28], using (94). Applyingthis with θ † and using (95), the final result follows now fromPr (cid:0) sup θ ∈ Θ | Z N ( θ ) | > τ ( x ) (cid:1) ≤ Pr (cid:0) sup θ ∈ Θ | Z N ( θ ) − Z N ( θ † ) | > τ ( x ) (cid:1) + Pr (cid:0) | Z N ( θ † ) | > τ ( x )) (cid:1) ≤ e − x , for any x ≥
1, where τ ( x ) = ¯ L (cid:2) J ( H ) + σ √ x + ( J ∞ ( H ) + U x ) / √ N (cid:3) and L ≥ L > (cid:3)
Proofs for Section 3.3.
We apply the results from Appendix A to µ = ˜Π( ·| Z ( N ) ). Proof of Theorem 3.7.
For any θ, ¯ θ ∈ R D , we have for the log-prior density that k∇ log π ( θ ) − ∇ log π (¯ θ ) k R D = k Σ − ( θ − ¯ θ ) k R D ≤ λ max (Σ − ) k θ − ¯ θ k R D ,λ min ( −∇ log π ( θ )) ≥ λ min (Σ − ) , and for the likelihood surrogate ˜ ℓ N , by Proposition 3.6 and on the event E , that k∇ ˜ ℓ N ( θ ) − ∇ ˜ ℓ N (¯ θ ) k R D ≤ Kλ max ( M ) k θ − ¯ θ k R D ,λ min ( −∇ ˜ ℓ N ( θ )) ≥ N c min / . Combining the last two displays, and on the event E , we can verify Assumption A.1 below for − log d ˜Π( ·| Z ( N ) ) with constants m = N c min / λ min (Σ − ) , Λ = 7 Kλ max ( M ) + λ max (Σ − ) . We may thus apply Proposition A.4 below to obtain, W ( L ( ϑ k ) , Π( ·| Z ( N ) )) ≤ W (Π( ·| Z ( N ) ) , ˜Π( ·| Z ( N ) )) + 2 W ( L ( ϑ k ) , ˜Π( ·| Z ( N ) )) ≤ ρ + b ( γ ) + 4(1 − mγ/ k h k θ init − θ max k R D + Dm i , where θ max denotes the unique maximiser of log d ˜Π( ·| Z ( N ) ) over R D (which exists on the event E conv , by virtue of strong concavity).We conclude by an estimate for k θ init − θ max k R D . To start, notice that for any θ ∈ R D we have | θ − θ init | = ( θ − θ init ) T M ( θ − θ init ) ≥ λ min ( M ) k θ − θ init k R D . (97)Thus, for any θ ∈ R D with k θ − θ init k R D ≥ η /λ min ( M ), we have that | θ − θ init | ≥ η , andtherefore also that g η ( θ ) ≥ (cid:0) | θ − θ init | − η (cid:1) ≥ | θ − θ init | . Thus, for C from (55) and any θ ∈ R D satisfying k θ − θ init k R D ≥ C + 4 η λ min ( M ) , using (97), (55) as well as the upper bound for | ℓ N ( θ ) | in the definition of E conv , we obtain − ˜ ℓ N ( θ ) = Kg η ( θ ) ≥ CN ( c max + 1) 1 + λ max ( M ) /η λ min ( M ) · | θ − θ init | ≥ C N ( c max + 1) k θ − θ init k R D ≥ N ( c max + 1) ≥ − ˜ ℓ N ( θ init ) . This implies that necessarily the unique maximiser θ ˜ ℓ of the (on E conv ) strongly concave map ˜ ℓ N over R D satisfies k θ ˜ ℓ − θ init k R D ≤ /C + 4 η /λ min ( M ) . Moreover, in view of the definition of B and the hypotheses on θ ∗ we have that k θ init k R D ≤ k θ init − θ ∗ k R D + k θ ∗ k R D ≤ | θ init − θ ∗ | p λ min ( M ) + R ≤ η p λ min ( M ) + R, which also allows us to deduce k θ ˜ ℓ k R D ≤ k θ ˜ ℓ − θ init k R D + k θ init k R D ≤ p /C + 3 η p λ min ( M ) + R. ANGEVIN ALGORITHMS 33
We further have that θ Tmax Σ − θ max ≤ θ T ˜ ℓ Σ − θ ˜ ℓ (otherwise θ max would not maximise log d ˜Π( ·| Z ( N ) ))and thus, for κ (Σ) the condition number of Σ, k θ max k R D ≤ λ min (Σ − ) θ Tmax Σ − θ max ≤ λ min (Σ − ) θ T ˜ ℓ Σ − θ ˜ ℓ ≤ κ (Σ) k θ ˜ ℓ k R D . Combining the preceding displays, the proof is now completed as follows: k θ max − θ init k R D . k θ max k R D + k θ init k R D . κ (Σ) k θ ˜ ℓ k R D + η λ min ( M ) + R . κ (Σ) h η λ min ( M ) + R i . Proof of Theorem 3.8.
For any t ≥ H : R D → R we have P θ init (cid:16)(cid:12)(cid:12) ˆ π JJ in ( H ) − E Π [ H | Z ( N ) ] (cid:12)(cid:12) ≥ t (cid:17) ≤ P θ init (cid:16)(cid:12)(cid:12) ˆ π JJ in ( H ) − E θ init [ˆ π JJ in ( H )] (cid:12)(cid:12) ≥ t − (cid:12)(cid:12) E θ init [ˆ π JJ in ( H )] − E Π [ H | Z ( N ) ] (cid:12)(cid:12)(cid:17) . (98)To further estimate the right side, note that for any k ≥ J in , by (64) and Theorem 3.7, we have W ( L ( ϑ k ) , Π( ·| Z ( N ) )) ≤ ρ + b ( γ )) . Noting that (167) below in fact holds for any probability measure µ and thus in particular for µ = Π( ·| Z ( N ) ), it follows that for any Lipschitz function H : R D → R , (cid:0) E θ init [ˆ π JJ in ( H )] − E Π [ H | Z ( N ) ] (cid:1) ≤ k H k Lip ( ρ + b ( γ )) . Thus if t ≥ H and − H yields that ther.h.s. in (98) is further bounded by P θ init (cid:16)(cid:12)(cid:12) ˆ π JJ in ( H ) − E θ init [ˆ π JJ in ( H )] (cid:12)(cid:12) ≥ t/ (cid:17) ≤ (cid:16) − c t m Jγ k H k Lip (1 + 1 / ( mJγ )) (cid:17) . Proof of Corollary 3.9.
We first estimate the probability to be bounded by P θ init (cid:16)(cid:13)(cid:13) ¯ θ JJ in − E θ init (cid:2) ¯ θ JJ in (cid:3)(cid:13)(cid:13) R D ≥ t − (cid:13)(cid:13) E θ init (cid:2) ¯ θ JJ in (cid:3) − E Π [ θ | Z ( N ) ] (cid:13)(cid:13) R D (cid:17) . Next, for any k ≥
1, let ν k denote an optimal coupling between L ( ϑ k ) and Π[ ·| Z ( N ) ] (cf. Theorem4.1 in [75]). Then by Jensen’s inequality and the definition of W from (9), (cid:13)(cid:13) E θ init (cid:2) ¯ θ JJ in (cid:3) − E Π [ θ | Z ( N ) ] (cid:13)(cid:13) R D = (cid:13)(cid:13)(cid:13)(cid:13) J J in + J X k = J in +1 Z R D × R D ( θ − θ ′ ) dν k ( θ, θ ′ ) (cid:13)(cid:13)(cid:13)(cid:13) R D = D X j =1 (cid:18) J J in + J X k = J in +1 Z R D × R D ( θ j − θ ′ j ) dν k ( θ, θ ′ ) (cid:19) ≤ J J in + J X k = J in +1 Z R D × R D D X j =1 ( θ j − θ ′ j ) dν k ( θ, θ ′ )= 1 J J in + J X k = J in +1 W ( L ( ϑ k ) , Π[ ·| Z ( N ) ]) . Thus we obtain from (61), (64) (as after (98)) that (cid:13)(cid:13) E θ init (cid:2) ¯ θ JJ in (cid:3) − E Π [ θ | Z ( N ) ] (cid:13)(cid:13) R D ≤ √ p ρ + b ( γ ) . Now for any j = 1 , ..., d , let us write H j : R D → R , θ θ j , for the j -the coordinate projectionmap, of Lipschitz constant 1. Then in the notation (63) we can write[¯ θ JJ in ] j = ˆ π JJ in ( H j ) , j = 1 , ..., D. For t ≥ p ρ + b ( γ )) and applying Proposition A.3 as in the proof of Theorem 3.8 as well as aunion bound gives P θ init (cid:16)(cid:13)(cid:13) ¯ θ JJ in − E Π [ θ | Z ( N ) ] (cid:13)(cid:13) R D ≥ t (cid:17) ≤ P θ init (cid:16)(cid:13)(cid:13) ¯ θ JJ in − E θ init (cid:2) ¯ θ JJ in (cid:3)(cid:13)(cid:13) R D ≥ t/ (cid:17) = P θ init (cid:18) D X j =1 h ˆ π JJ in ( H j ) − E θ init (cid:2) ˆ π JJ in ( H j )] i ≥ t (cid:19) ≤ D X j =1 P θ init (cid:18)h ˆ π JJ in ( H j ) − E θ init (cid:2) ˆ π JJ in ( H j )] i ≥ t D (cid:19) ≤ D exp (cid:16) − c t m JγD (cid:2) / ( mJγ ) (cid:3) (cid:17) , completing the proof of the corollary.4. Proofs for the Schr¨odinger model
In this section, we will show how the results from Section 3 can be applied to the nonlinearproblem for the Schr¨odinger equation (17). Recalling the notation of Sections 2 and 3, we will set θ ∗ = θ ,D , the norm | · | := k · k R D as well as η := ǫD − /d (for ǫ to be chosen), such that the region B from (39) equals the Euclidean ball(99) B ǫ := n θ ∈ R D : k θ − θ ,D k R D < ǫD − /d o . The first key observation is the following result on the local log-concavity of the likelihoodfunction on B ǫ , which will be proved by a combination of the concentration result Lemma 3.4 withthe PDE estimates below, notably the ‘average curvature’ bound from Lemma 4.7. Proposition 4.1.
Let θ ∈ h satisfy k θ k h ≤ S for some S > and consider ℓ N from (22)with forward map G : R D → R from (17). Then there exist constants < ǫ S = ǫ S ( O , g, Φ) ≤ and c , c , c , c > such that for any ǫ ≤ ǫ S and all D, N satisfying D ≤ c N dd +12 as well as kG ( θ ) − G ( θ ,D ) k L ( O ) ≤ c D − /d , the event E conv ( ǫ ) = n inf θ ∈B ǫ λ min (cid:0) −∇ ℓ N ( θ ) (cid:1) > c N D − /d , sup θ ∈B ǫ h | ℓ N ( θ ) | + k∇ ℓ N ( θ ) k R D + k∇ ℓ N ( θ ) k op i < c N o satisfies (100) P Nθ (cid:0) E conv ( ǫ ) (cid:1) ≥ − e − c N dd +12 . Proof.
For any θ ∈ R D , F θ as in (16), by a Sobolev embedding and (13), we have k F θ k ∞ . k θ k h . D /d k θ k R D . This and the Lemmas 4.4, 4.5, 4.6 verify Assumption 3.2 in the present setting, withconstants k ≃ k ≃ const. , k ≃ m ≃ m ≃ D /d , m ≃ D /d , ANGEVIN ALGORITHMS 35 whence the constants from (46) satisfy C G ≃ D /d , C ′G ≃ D /d , C ′′G ≃ D /d , C ′′′G ≃ const. . Moreover, Lemmas 4.7 and 4.8 verify Assumption 3.3 for our choice of η with(101) c min ≃ D − /d , c max ≃ const.Then the minimum (45) is dominated by the third term, yielding that R N = R N,D ≃ c min /C ′ G ≃ N D − /d . Therefore, we can choose c >
D, N ∈ N satisfying D ≤ cN d/ ( d +12) ,we also have D ≤ R N,D . Lemma 3.4 then implies that for all such
D, N , we have(102) P Nθ (cid:0) E cconv (cid:1) ≤ e −R N + e − N/ ≤ e − cN dd +12 . (cid:3) Next, if θ init is the estimator from Theorem B.6, then in the present setting with ǫ = 1 / log N ,the event (50) equals E init = n k θ init − θ ,D k R D ≤ N ) D /d o . Proposition 4.2.
Assuming Condition 2.3, there exist constants c , c > such that for all N ∈ N , P Nθ (cid:0) E init (cid:1) ≥ − c e − c N d/ (2 α + d ) . Proof.
Using Theorem B.6 and α >
6, we obtain that with sufficiently high probability, k θ init − θ ,D k R D . N − ( α − / (2 α + d ) = o (cid:0) (log N ) − D − /d (cid:1) . (cid:3) Next, denoting by ˜Π( ·| Z ( N ) ) the ‘surrogate’ posterior measure with density (27), and if E wass = n W ( ˜Π( ·| Z ( N ) ) , Π( ·| Z ( N ) )) ≤ exp( − N d/ (2 α + d ) ) / o , is given by (59) with ρ = exp( − N d/ (2 α + d ) ), then Theorem 4.14 implies the following approximationresult in Wasserstein distance. Proposition 4.3.
Assume Conditions 2.2 and 2.3. Then there exist constants c , c > such thatfor all N ∈ N , P Nθ (cid:0) E wass (cid:1) ≥ − c e − c N d/ (2 α + d ) . The preceding propositions imply that the events(103) E N := E conv ∩ E init ∩ E wass satisfy the probability bound P Nθ ( E N ) ≥ − c ′ e − c ′′ N d/ (2 α + d ) . In what follows, the events E N willbe tacitly further intersected with events which have probability 1 for all N large enough, ensuringthat the non-asymptotic conditions required in the results of Section 3 are eventually verified. Proof of Theorem 2.7.
We will prove Theorem 2.7 by applying Theorem 3.7 with the choices B = B ǫ from (99), ǫ = 1 / log N and K from Condition 2.2, ρ = exp( − N d/ (2 α + d ) ) and M = I D × D generating the ellipsoidal norm k · k R D . Using (13), the prior covariance Σ from (23) satisfies λ min (Σ − ) ≃ N d α + d , λ max (Σ − ) ≃ N d α + d D α/d . Then using Condition 2.2, we first have that K & N D /d (log N ) ≃ N c max · (cid:0) η − (cid:1) , verifying the lower bound (55), and then also that m, Λ > m ≃ N D − /d + N d α + d , Λ ≃ N D /d (log N ) + N d α + d D αd . The dimension condition (28) and the condition on α further imply N D − /d & N d α + d , N d α + d D αd . N, whence we further obtain(104) m ≃ N D − /d , Λ ≃ N D /d (log N ) . Noting that also γ = o (Λ − ) with our choices, Theorem 3.7 yields that on the event E N from (103),the Markov chain ( ϑ k ) satisfies the Wasserstein bound (61) with b ( γ ) . γD Λ m + γ D Λ m . γD ( d +24) /d (log N ) + γ N D ( d +44) /d (log N ) , (105)as well as τ (Σ , M, k θ ,D k R D ) . κ (Σ) ≃ D α/d . Using also that
D/m . const., the first part of Theorem 2.7 follows.For the choice of γ = γ ε from (31), straightforward calculation yields that (for N large enough)(106) B ( γ ε ) = o ( ε + N − P ) , which proves the second part of Theorem 2.7. Proof of Proposition 2.4 and of Theorems 2.5, 2.6.
The proof of Proposition 2.4 nowfollows directly from Theorem 3.8 and the preceding computations. Noting that for all N largeenough we have B ( γ ) ≤ N − P , Theorem 2.5 follows from Corollary 3.9, (106) as well as (67), for J in ≥ (log N ) / ( γ ε N D − /d ). Finally, intersecting further with the event E mean := (cid:8) k E Π [ θ | Z ( N ) ] − θ k ℓ ≤ LN − α α + d αα +2 (cid:9) , L > , Theorem 2.6 now follows from the triangle inequality and (153).
Proof of Theorem 2.8.
In the proof we intersect E N from (103) further with the event onwhich the conclusion of Theorem 4.12 holds. Part iii) then follows from part ii) and straightfor-ward calculations. Part i) follows from the arguments following (159) below, where it is proved inparticular that ˆ θ MAP is the unique maximiser of the proxy posterior density ˜ π ( ·| Z ( N ) ) over R D .We can now apply Proposition A.2 with m, Λ from (104), using also that | log ˜ π ( θ init | Z ( N ) ) − log ˜ π (ˆ θ MAP | Z ( N ) ) | . sup θ ∈B / N (cid:12)(cid:12) ℓ N ( θ ) (cid:12)(cid:12) + N d/ (2 α + d ) k ˆ θ MAP k h α + N d/ (2 α + d ) k θ init k h α . N + N d/ (2 α + d ) (1 + D α/d ) . N, in view of ℓ N = ˜ ℓ N on B / N , the definition of E init , (13) and since θ ∈ h α . ANGEVIN ALGORITHMS 37
Analytical properties of the Schr¨odinger forward map.
This section is devoted toproving the four auxiliary Lemmas 4.5-4.8 used in the proof of Proposition 4.1. Throughout weconsider forward map G : R D → L ( O ), G = G ◦ Φ ∗ ◦ Ψ given by (17) and assume the hypothesesof Proposition 4.1, where the set B ǫ was defined in (99).For any f ∈ C ( O ) with f ≥
0, by standard theory for elliptic PDEs (see e.g. Chapter 6.3 of[24]) there exists a linear, continuous operator V f : L ( O ) → H ( O ) describing (weak) solutions V f [ ψ ] = w ∈ H of the (inhomogeneous) Schr¨odinger equation(107) ( ∆2 w − f w = ψ on O ,w = 0 on ∂ O . Lemma 4.4.
For any x ∈ O , the map θ
7→ G ( θ )( x ) is twice continuously differentiable on R D . Thevector field ∇G θ : O → R D is given by v T ∇G θ ( x ) = V f θ (cid:2) u f θ (Φ ′ ◦ F θ )Ψ( v ) (cid:3) ( x ) , x ∈ O , v ∈ R D . Moreover, for any v , v ∈ R D and x ∈ O , the matrix field ∇ G θ : O → R D × D is given by v T ∇ G θ ( x ) v = V f θ (cid:2) u f θ Ψ( v )Ψ( v )(Φ ′′ ◦ F θ ) (cid:3) ( x )+ V f θ (cid:2) (Φ ′ ◦ F θ )Ψ( v ) V f θ (cid:2) u f θ (Φ ′ ◦ F θ )Ψ( v ) (cid:3)(cid:3) ( x )+ V f θ (cid:2) (Φ ′ ◦ F θ )Ψ( v ) V f θ (cid:2) u f θ (Φ ′ ◦ F θ )Ψ( v ) (cid:3)(cid:3) ( x ) . Proof.
In the notation from (17), the map θ
7→ G ( θ )( x ) can be represented as the composition δ x ◦ G ◦ Φ ∗ ◦ Ψ, where δ x : w w ( x ) denotes point evaluation. We first show that each of these fouroperators is twice differentiable. The continuous linear maps Ψ : R D → C ( O ) and δ x : C ( O ) → R are infinitely differentiable (in the Frech´et sense). Moreover, the maps G : C ( O ) ∩ { f > } → C ( O )and Φ ∗ : C ( O ) → C ( O ) ∩ { f > } are twice Fre´chet differentiable with derivatives DG, DG and D Φ ∗ , D Φ ∗ given by Lemma B.2 and (177) respectively. We deduce overall by the chain rulefor Fr´echet derivates (cf. Lemma B.3), that x
7→ G ( θ )( x ) is twice differentiable, with the desiredexpressions for the vector and matrix fields. The continuity of the second partial derivatives followsfrom inspection of the expression for the matrix field, and by applying the regularity results for V f , G and Φ ∗ from Appendix B. (cid:3) Now since k θ k h ≤ S and by the definition (99) of the set B , we have from (13) thatsup θ ∈B k θ k h ≤ k θ ,D k h + sup θ ∈B k θ − θ ,D k h . S + D d sup θ ∈B k θ − θ ,D k R D . S + 1 . It follows further from the Sobolev embedding and regularity of the link function Φ (AppendixB.1.1) that there exists a constant B = B ( S, Φ , O ) < ∞ , such that(108) sup θ ∈B h k F θ k ∞ + k F θ k H + k f θ k H + k f θ k ∞ i ≤ B. In particular, this estimate implies that the constants appearing in the inequalities from LemmaB.1 can be chosen independently of θ ∈ B , which we use frequently below.For notational convenience we also introduce spaces(109) E D := span( e , ..., e D ) ⊆ L ( O ) , D ∈ N , spanned by the first D eigenfunctions of ∆ on O (cf. Section 2.1.1).We first verify the boundedness property required in Assumption 3.2 ii).8 R. NICKL AND S. WANG
7→ G ( θ )( x ) is twice differentiable, with the desiredexpressions for the vector and matrix fields. The continuity of the second partial derivatives followsfrom inspection of the expression for the matrix field, and by applying the regularity results for V f , G and Φ ∗ from Appendix B. (cid:3) Now since k θ k h ≤ S and by the definition (99) of the set B , we have from (13) thatsup θ ∈B k θ k h ≤ k θ ,D k h + sup θ ∈B k θ − θ ,D k h . S + D d sup θ ∈B k θ − θ ,D k R D . S + 1 . It follows further from the Sobolev embedding and regularity of the link function Φ (AppendixB.1.1) that there exists a constant B = B ( S, Φ , O ) < ∞ , such that(108) sup θ ∈B h k F θ k ∞ + k F θ k H + k f θ k H + k f θ k ∞ i ≤ B. In particular, this estimate implies that the constants appearing in the inequalities from LemmaB.1 can be chosen independently of θ ∈ B , which we use frequently below.For notational convenience we also introduce spaces(109) E D := span( e , ..., e D ) ⊆ L ( O ) , D ∈ N , spanned by the first D eigenfunctions of ∆ on O (cf. Section 2.1.1).We first verify the boundedness property required in Assumption 3.2 ii).8 R. NICKL AND S. WANG Lemma 4.5.
There exists a constant
C > such that sup θ ∈B kG ( θ ) k L ∞ ≤ C, sup θ ∈B k∇G ( θ ) k L ∞ ( O , R D ) ≤ C, sup θ ∈B k∇ G ( θ ) k L ∞ ( O , R D × D ) ≤ CD /d . Proof.
The estimate for kG ( θ ) k ∞ follows immediately from (18). To estimate k∇G ( θ ) k L ∞ ( O , R D ) ,we first note that by Lemma 4.4, k∇G ( θ ) k L ∞ ( O , R D ) = sup v : k v k R D ≤ k v T ∇G ( θ ) k L ∞ ≤ sup H ∈ E D : k H k L ≤ (cid:13)(cid:13) V f θ (cid:2) u f θ (Φ ′ ◦ F θ ) H (cid:3)(cid:13)(cid:13) ∞ . Thus by the Sobolev embedding k · k ∞ . k · k H , Lemma B.1 and boundedness of Φ ′ , we have thatfor any θ ∈ B and any H ∈ E D , (cid:13)(cid:13) V f θ [ u f θ (Φ ′ ◦ F θ ) H ] (cid:13)(cid:13) ∞ . (cid:13)(cid:13) V f θ [ u f θ (Φ ′ ◦ F θ ) H ] (cid:13)(cid:13) H . (cid:13)(cid:13) u f θ (Φ ′ ◦ F θ ) H (cid:13)(cid:13) L . (cid:13)(cid:13) u f θ k ∞ k Φ ′ ◦ F θ k ∞ k H k L . k H k L . Again using Lemma 4.4, we can similarly estimate k∇ G ( θ ) k L ∞ ( O , R D ) by k∇ G ( θ ) k L ∞ ( O , R D ) ≤ sup v : k v k R D ≤ k v T ∇ G ( θ ) v k L ∞ ≤ sup H ∈ E D : k H k L ≤ (cid:13)(cid:13) V f θ (cid:2) H (Φ ′ ◦ F θ ) V f θ (cid:2) H (Φ ′ ◦ F θ ) u f θ (cid:3)(cid:3)(cid:13)(cid:13) ∞ + (cid:13)(cid:13) V f θ (cid:2) H (Φ ′′ ◦ F θ ) u f θ (cid:3)(cid:13)(cid:13) ∞ =: sup H ∈ E D : k H k L ≤ I + II. (110)Arguing as in the estimate for k∇G ( θ ) k L ∞ ( O , R D ) , we have that for any θ ∈ B and H ∈ E D , I . k H (Φ ′ ◦ F θ ) V f θ (cid:2) H (Φ ′ ◦ F θ ) u f θ (cid:3) k L . k H k L k Φ ′ ◦ F k ∞ k V f [ H (Φ ′ ◦ F ) u f ] k ∞ . k H k L k H (Φ ′ ◦ F ) u f k L . k H k L , as well as II . k H (Φ ′′ ◦ F θ ) u f θ k L . k u f θ k ∞ k Φ ′′ ◦ F θ k ∞ k H k L k H k ∞ . k H k L k H k H . D /d k H k L , where we used the basic norm estimate on E D ⊆ L ( O ) from Lemma 4.9. By combining the lastthree displays, the proof is completed. (cid:3) Next, we verify the increment bound needed in Assumption 3.2 iii).
Lemma 4.6.
There exists a constant
C > such that for any D ∈ N and any θ, θ ′ ∈ R D , kG ( θ ) − G (¯ θ ) k ∞ ≤ C k F θ − F ¯ θ k ∞ , kG ( θ ) − G (¯ θ ) k L ≤ C k F θ − F ¯ θ k L , (111) as well as, for any θ, θ ′ ∈ B , k∇G ( θ ) − ∇G (¯ θ ) k L ∞ ( O , R D ) ≤ C k F θ − F ¯ θ k ∞ , (112) k∇ G ( θ ) − ∇ G (¯ θ ) k L ∞ ( O , R D × D ) ≤ CD /d k F θ − F ¯ θ k ∞ . (113) ANGEVIN ALGORITHMS 39
Proof.
The estimate (111) follows immediately from (173) and (179). Now fix any θ, ¯ θ ∈ B . Toease notation, in what follows we write F = Ψ( θ ) , ¯ F = Ψ(¯ θ ), f = Φ ◦ F and ¯ f = Φ ◦ ¯ F . For (112),arguing as in the proof of Lemma 4.5, we first have (cid:13)(cid:13) ∇G ( θ ) − ∇G (¯ θ ) (cid:13)(cid:13) L ∞ ( O , R D ) ≤ sup v : k v k R D ≤ (cid:13)(cid:13) v T ( ∇G ( θ ) − ∇G (¯ θ )) (cid:13)(cid:13) ∞ ≤ sup H ∈ E D : k H k L ≤ (cid:13)(cid:13) V f [ H (Φ ′ ◦ F ) u f ] − V ¯ f [ H (Φ ′ ◦ ¯ F ) u ¯ f ] (cid:13)(cid:13) ∞ = sup H ∈ E D : k H k L ≤ (cid:13)(cid:13) ( V f − V ¯ f )[ H (Φ ′ ◦ F ) u f ] (cid:13)(cid:13) ∞ + (cid:13)(cid:13) V ¯ f [ H (Φ ′ ◦ F − Φ ′ ◦ ¯ F ) u ¯ f ] (cid:13)(cid:13) ∞ + (cid:13)(cid:13) V ¯ f [ H (Φ ′ ◦ F )( u f − u ¯ f )] (cid:13)(cid:13) ∞ =: sup H ∈ E D : k H k L ≤ I a + I b + I c . Now, we fix H ∈ E D for the rest of the proof. The term I a can further be estimated by repeatedlyusing the Sobolev embedding k · k ∞ . k · k H , Lemma B.1 as well as (108) and (179): I a = k V f [( f − ¯ f ) V ¯ f [ u ¯ f (Φ ′ ◦ F ) H ]] k ∞ . k V f [( f − ¯ f ) V ¯ f [ u ¯ f (Φ ′ ◦ F ) H ]] k H . k ( f − ¯ f ) V ¯ f [ u ¯ f (Φ ′ ◦ F ) H ] k L . k f − ¯ f k ∞ k u ¯ f (Φ ′ ◦ ¯ F ) H k L . k F − ¯ F k ∞ k H k L . (114)Similarly, I b is estimated as follows: I b . k H (Φ ′ ◦ F − Φ ′ ◦ ¯ F ) u ¯ f k L . k Φ ′ ◦ F − Φ ′ ◦ ¯ F k ∞ k u ¯ f k ∞ k H k L . k F − ¯ F k ∞ k H k L . Finally, we can similarly estimate I c . k ( u f − u ¯ f )(Φ ′ ◦ F ) H k L . k u f − u ¯ f k ∞ k Φ ′ ◦ F k ∞ k H k L . k F − ¯ F k ∞ k H k L , where we have also used (111). By combining the estimates for I a , I b and I c , we have completedthe proof of (112).It remains to prove (113). In analogy to (110), we may fix any v ∈ R D , and it suffices to derivea bound for v T ( ∇ G ( θ ) − ∇ G (¯ θ )) v . To ease notation, let us write H = Ψ v ∈ E D ∼ = R D , as well as h = H (Φ ′ ◦ F ) and ¯ h = H (Φ ′ ◦ ¯ F ). Then by Lemma 4.4, we have the following decomposition intoeight terms: v T ( ∇ G ( θ ) − ∇ G (¯ θ )) v = 2 V ¯ f (cid:2) ¯ hV ¯ f [¯ hu ¯ f ] (cid:3) − V f (cid:2) hV f [ hu f ] (cid:3) + V ¯ f [ u ¯ f H (Φ ′′ ◦ ¯ F )] − V f [ u f H (Φ ′′ ◦ F )]= 2( V ¯ f − V f ) (cid:2) ¯ hV ¯ f [¯ hu ¯ f ] (cid:3) + 2 V f (cid:2) (¯ h − h ) V ¯ f [¯ hu ¯ f ] (cid:3) + 2 V f (cid:2) h ( V ¯ f − V f )[¯ hu ¯ f ] (cid:3) + 2 V f (cid:2) hV f [(¯ h − h ) u ¯ f ] (cid:3) + 2 V f (cid:2) hV f [ h ( u ¯ f − u f )] (cid:3) + ( V ¯ f − V f )[ u ¯ f H (Φ ′′ ◦ ¯ F )] + V f [( u ¯ f − u f ) H (Φ ′′ ◦ ¯ F )] + V f [ u f H (Φ ′′ ◦ ¯ F − Φ ′′ ◦ F )]=: II a + II b + II c + II d + II e + II f + II g + II h . (115) To estimate these terms, we will again repeatedly use (108), the regularity estimates from Lem-mas B.1- B.2 below, the estimates k h k L , k ¯ h k L . k H k L as well as k f − ¯ f k ∞ . k F − ¯ F k ∞ , whichall hold uniformly in θ ∈ B .Using Lemma B.1, including the estimate (171) with ψ = ¯ hV ¯ f [¯ hu ¯ f ], we obtain k II a k ∞ . k f − ¯ f k ∞ k ¯ hV ¯ f [¯ hu ¯ f ] k L . k f − ¯ f k ∞ k ¯ h k L k V ¯ f [¯ hu ¯ f ] k ∞ . k f − ¯ f k ∞ k H k L k ¯ hu ¯ f k L . k f − ¯ f k ∞ k H k L k u ¯ f k ∞ . k F − ¯ F k ∞ k H k L . Similarly, we have k II b k ∞ . k (¯ h − h ) V ¯ f [¯ hu ¯ f ] k L . k H (Φ ′ ◦ ¯ F − Φ ′ ◦ F ) k L k V ¯ f [¯ hu ¯ f ] k ∞ . k u f k ∞ k H k L k ¯ F − F k ∞ k ¯ hu ¯ f k L . k H k L k ¯ F − F k ∞ , and, again using (171), k II c k ∞ . k h ( V ¯ f − V f )[¯ hu ¯ f ] k L . k h k L k ( V ¯ f − V f )[¯ hu ¯ f ] k ∞ . k H k L k ¯ f − f k ∞ k ¯ hu ¯ f k L . k H k L k ¯ F − F k ∞ . 
For II d , by following similar steps as for II b , we see that k II d k ∞ . k H k L k V f [(¯ h − h ) u ¯ f ] k ∞ . k H k L k ¯ F − F k ∞ , and similarly, using also (111), we obtain k II e k ∞ . k H k L k V f [ h ( u ¯ f − u f )] k ∞ . k H k L k u ¯ f − u f k ∞ . k H k L k ¯ F − F k ∞ . For the term II f , we note that by the Sobolev embedding, k w k ( H ) ∗ ≤ sup ψ : k ψ k H ≤ (cid:12)(cid:12) Z O wψ (cid:12)(cid:12) . k w k L sup ψ : k ψ k H ≤ k ψ k ∞ . k w k L , w ∈ L ( O ) , and consequently by Lemma B.1, k II f k ∞ = k V f [( ¯ f − f ) V ¯ f [ u ¯ f H (Φ ′′ ◦ F )]] k ∞ . k ¯ f − f k ∞ k V ¯ f [ u ¯ f H (Φ ′′ ◦ F )] k L . k ¯ f − f k ∞ k u ¯ f H (Φ ′′ ◦ F ) k ( H ) ∗ . k ¯ f − f k ∞ k u ¯ f H (Φ ′′ ◦ F ) k L . k ¯ F − F k ∞ k H k L . For terms II g and II h , by similar steps and additionally using that by Lemma 4.9, k H k ∞ . k H k H . D /d k H k L for any H ∈ E D , we obtain k II g k ∞ . k u ¯ f − u f k ∞ k H k L k Φ ′′ ◦ ¯ F k ∞ . k ¯ f − f k ∞ k H k L k H k ∞ . D /d k ¯ F − F k ∞ k H k L , as well as k II h k ∞ ≤ k u f H (Φ ′′ ◦ ¯ F − Φ ′′ ◦ F ) k L . k H k L k H k ∞ k ¯ F − F k ∞ . D /d k ¯ F − F k ∞ k H k L . By combining (115) with the estimates for the terms II a − II h , the proof of (113) is complete. (cid:3) ANGEVIN ALGORITHMS 41
We now turn to the key ‘geometric’ bound from the first part of Assumption 3.3, which quantifiesthe average curvature of the likelihood function ℓ N near θ ,D in a high-dimensional setting (when P X is uniform on O ). The curvature deteriorates with rate D − /d as D → ∞ , which is in linewith the (local) ill-posedness of the Schr¨odinger model, and the related fact that the associated‘Fisher information operator’ is of the form I , with I being the inverse of a second order (ellipticSchr¨odinger-type) operator (cf. also Section 4 in [53]). Lemma 4.7.
Let ℓ ( θ ) be as in (38) with G : R D → R from (17), and let B ǫ be as in (99). Let θ ∈ h satisfy k θ k h ≤ S for some S > . Then there exist constants < ǫ S ≤ , c , c > suchthat if also kG ( θ ) − G ( θ ,D ) k L ( O ) ≤ c D − /d , then for all D ∈ N and all ǫ ≤ ǫ S , (116) inf θ ∈B ǫ λ min (cid:0) E θ (cid:2) − ∇ ℓ ( θ ) (cid:3)(cid:1) ≥ c D − /d . Proof.
We begin by noting that for any Z = ( Y, X ) ∈ R × O , we have −∇ ℓ ( θ, Z ) = ∇G X ( θ ) G X ( θ ) T − ( Y − G X ( θ )) ∇ G X ( θ ) . Using this and Lemma 4.4, we obtain that for any v ∈ R D , with the previous notation H = Ψ( v )and h = (Φ ′ ◦ F θ ) H , v T E θ [ −∇ ℓ ( θ, Z )] v = k V f θ [ u f θ (Φ ′ ◦ F θ ) H ] k L ( O ) − h u f θ − u f θ , V f θ [ hV f θ [ hu f θ ]] i L ( O ) − h u f θ − u f θ , V f θ [ u f θ H (Φ ′′ ◦ F θ )] i L ( O ) =: I + II + III. (117)We next derive a lower bound on the term I and upper bounds for the terms II and III , for anyfixed v ∈ R D . Lower bound for I . Writing a θ := u f θ (Φ ′ ◦ F θ ), using the elliptic L -( H ) ∗ coercivity estimate(170) from Lemma B.1 below as well as (108), we have(118) √ I = k V f θ [ a θ H ] k L ( O ) & k a θ H k ( H ) ∗ k f θ k ∞ & k a θ H k ( H ) ∗ , θ ∈ B . The next step is to lower bound a θ . By Theorem 1.17 in [13], the expected exit time τ O featuring inthe Feynman-Kac formula (12) satisfies the uniform estimate sup x ∈O E x τ O ≤ K ( vol ( O ) , d ) < ∞ .Therefore, using also Jensen’s inequality and g ≥ g min >
0, we have that, with B from (108),(119) inf θ ∈B inf x ∈O u f θ ( x ) ≥ g min e − BK ( vol ( O ) ,d ) =: u min > . Also, since Φ is a regular link function, for some k = k ( B ) > θ ∈B inf x ∈O [Φ ′ ◦ F θ ]( x ) ≥ inf t ∈ [ − k,k ] Φ ′ ( t ) > , and therefore for some a min = a min (Φ , B, O , g min ) > θ ∈B inf x ∈O a θ ( x ) ≥ a min > . We thus obtain, by definition of ( H ) ∗ and the multiplication inequality (7) that for some c = c ( a min ) > k H k ( H ) ∗ = k a θ a − θ H k ( H ) ∗ ≤ k a − θ k H k a θ H k ( H ) ∗ ≤ c (1 + k a θ k H ) k a θ H k ( H ) ∗ , (121) where in the last inequality we used (178) for the function x /x . Using again (108), regularityof Φ ′ , the chain rule as well as the elliptic regularity estimate (175), we obtain that(122) sup θ ∈B k a θ k H ≤ sup θ ∈B k u f θ k H sup θ ∈B k Φ ′ ◦ F θ k H ≤ C ( g, S, O , Φ) < ∞ . Therefore, combining the displays (118), (121), (122), we have proved that, uniformly in θ ∈ B ,(123) I & k a θ H k H ) ∗ & k H k H ) ∗ c sup θ ∈B (1 + k a θ k H ) & D − /d k H k L , where we have used Lemma 4.9 below in the last inequality. Upper bound for II and III . Using the self-adjointness of V f θ on L ( O ), a Sobolev embedding,Lemma B.1, (108), the Lipschitz estimate (173) as well as (18), we have uniformly in θ ∈ B , | II | . (cid:12)(cid:12)(cid:12) Z O ( u f θ − u f θ ) V f θ [ hV f θ [ hu f θ ]] (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) Z O V f θ [ u f θ − u f θ ][ hV f θ [ hu f θ ]] (cid:12)(cid:12)(cid:12) . k V f θ [ u f θ − u f θ ] k ∞ k hV f θ [ hu f θ ] k L . k u f θ − u f θ k L k h k L k V f θ [ hu f θ ] k L . k u f θ − u f θ k L k H k L . (124)Similarly, for the term III , using also k Φ ′′ k ∞ < ∞ , we estimate | III | = (cid:12)(cid:12) h u f θ − u f θ , V f θ [ u f θ H (Φ ′′ ◦ F θ )] i L ( O ) (cid:12)(cid:12) = (cid:12)(cid:12) h V f θ [ u f θ − u f θ ] , u f θ H (Φ ′′ ◦ F θ ) i L ( O ) (cid:12)(cid:12) ≤ k V f θ [ u f θ − u f θ ] k ∞ k u f θ k ∞ k Φ ′′ ◦ F θ k ∞ k H k L . k u f θ − u f θ k L k H k L (125)Combining the displays (117), (123), (124) and (125), we have proved that for any θ ∈ B , any v ∈ R D and some constants c ′ , c ′′ > v T E θ [ −∇ ℓ ( θ, Z )] v ≥ (cid:2) c ′ D − /d − c ′′ k u f θ − u f θ k L (cid:3) k H k L . Using (111) and the hypotheses, we obtain that for some c g > k u f θ − u f θ k L ≤ kG ( θ ) − G ( θ ,D ) k L + c g k θ ,D − θ k R D ≤ ( c + c g ǫ S ) D − /d . Thus for all c , ε S > v ∈ R D with k v k R D = k Ψ( v ) k L = k H k L = 1, we obtain that for any θ ∈ B ǫ S and some c ′′′ > λ min (cid:0) E θ [ −∇ ℓ ( θ, Z )] (cid:1) ≥ c ′′′ D − /d , which completes the proof. (cid:3) Finally, we prove the upper bound required for Assumption 3.3 ii).
Lemma 4.8 (Upper bound). For every $S>0$, there exists a constant $c_{\max}>0$ such that for $\|\theta_0\|_{h^2}\le S$ and all $D\in\mathbb{N}$, we have
\[ \sup_{\theta\in\mathcal{B}}\Big[\,|E_{\theta_0}[\ell(\theta,Z)]| + \|E_{\theta_0}[\nabla\ell(\theta,Z)]\|_{\mathbb{R}^D} + \|E_{\theta_0}[\nabla^2\ell(\theta,Z)]\|_{op}\Big]\ \le\ c_{\max}. \]
Proof.
For the zeroeth order term, using Lemma 4.5, we have that for some K > θ ∈ B , | E θ [ ℓ ( θ )] | = 1 / / kG ( θ ) − G ( θ ) k L . kG ( θ ) k ∞ + k u f k ∞ ≤ K . For the first order term, similarly by Lemma 4.5 there exists some K > θ ∈ B , (cid:13)(cid:13) E θ (cid:2) − ∇ ℓ ( θ ) (cid:3)(cid:13)(cid:13) R D . (cid:13)(cid:13) hG ( θ ) − G ( θ ) , ∇G ( θ ) i L ( O ) (cid:13)(cid:13) R D . (cid:13)(cid:13) G ( θ ) − G ( θ ) (cid:13)(cid:13) ∞ (cid:13)(cid:13) ∇G ( θ ) (cid:13)(cid:13) L ∞ ( O , R D ) ≤ K . For the second order term, we recall the decomposition λ max (cid:0) E θ (cid:2) − ∇ ℓ ( θ ) (cid:3)(cid:1) = sup v : k v k R D ≤ v T E θ (cid:2) − ∇ ℓ ( θ ) (cid:3) v = sup v : k v k R D ≤ (cid:2) I + II + III (cid:3) , where the terms I − III were defined in (117). Suitable uniform upper bounds for the terms II and III have already been shown in (124) and (125) respectively, whence it suffices to upper bound theterm I . We do this by using (108) and Lemma B.1: for any θ ∈ B and any H = Ψ( v ), v ∈ R D , √ I = k V f θ [ u f θ (Φ ′ ◦ F θ ) H ] k L . k u f θ (Φ ′ ◦ F θ ) H k L . k u f θ k ∞ k Φ ′ ◦ F θ k ∞ k H k L . k v k R D . (cid:3) We conclude with the following basic comparison lemma for Sobolev norms on the subspaces E D ⊆ L ( O ) from (109). Lemma 4.9.
There exists $C>0$ such that for any $D\in\mathbb{N}$ and any $H\in E_D$,
\[ (126)\qquad \|H\|_{H^2}\le C D^{2/d}\|H\|_{L^2},\qquad \|H\|_{L^2}\le C D^{2/d}\|H\|_{(H^2)^*}. \]
Fix $D\in\mathbb{N}$. By the isomorphism property of $\Delta$ between the spaces $H^2$ and $L^2$ (see e.g. Theorem II.5.4 in [45]), we first have the norm equivalence
\[ \|\Delta H\|_{L^2}\ \lesssim\ \|H\|_{H^2}\ \lesssim\ \|\Delta H\|_{L^2},\qquad H\in E_D. \]
It follows by Weyl's law (13) that
\[ \|H\|^2_{H^2}\ \lesssim\ \sum_{k=1}^D\big|\langle H,e_k\rangle_{L^2}\big|^2\lambda_k^2\ \lesssim\ D^{4/d}\|H\|^2_{L^2}. \]
Thus, combining the above display with the following duality argument completes the proof:
\[ \|H\|_{L^2}=\sup_{\psi\in E_D:\|\psi\|_{L^2}\le 1}\big|\langle H,\psi\rangle_{L^2}\big|\ \lesssim\ D^{2/d}\sup_{\psi\in E_D:\|\psi\|_{H^2}\le 1}\big|\langle H,\psi\rangle_{L^2}\big|\ \le\ D^{2/d}\|H\|_{(H^2)^*}.\quad\square \]
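The scaling in (126) is easy to check numerically in a simple special case. The following sketch is an illustration only, not part of the argument: it takes $d=1$ with the explicit Dirichlet eigenvalues $\lambda_k=(k\pi)^2$ on $\mathcal{O}=(0,1)$ (so that Weyl's law holds exactly) and verifies that the spectrally computed ratio $\|H\|_{H^2}/\|H\|_{L^2}$ stays bounded by a constant multiple of $D^{2/d}$ over random draws of $H\in E_D$.

```python
import numpy as np

# Sanity check of Lemma 4.9 on O = (0,1) (d = 1), where the Dirichlet
# Laplacian has eigenvalues lam_k = (k*pi)**2, so Weyl's law lam_k ~ k^(2/d)
# holds exactly. For H in E_D the norms can be computed spectrally:
# ||H||_{L2}^2 = sum h_k^2 and ||H||_{H2}^2 ~ sum lam_k^2 h_k^2.
rng = np.random.default_rng(0)
d = 1
for D in [10, 100, 1000]:
    lam = (np.pi * np.arange(1, D + 1)) ** 2
    ratios = []
    for _ in range(200):
        h = rng.standard_normal(D)               # coefficients of H in E_D
        l2 = np.sqrt(np.sum(h ** 2))
        h2 = np.sqrt(np.sum((lam * h) ** 2))     # spectral H^2-norm
        ratios.append(h2 / l2)
    # Lemma 4.9 predicts ||H||_{H2} <= C * D^{2/d} * ||H||_{L2}
    print(D, max(ratios) / D ** (2 / d))         # stays bounded in D
```

Wasserstein approximation of the posterior measure.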
The main purpose of this section is to prove Theorem 4.14, which provides a bound on the Wasserstein distance between the posterior measure $\Pi(\cdot|Z^{(N)})$ from (24) and the surrogate posterior $\tilde\Pi(\cdot|Z^{(N)})$ from (27) in the Schrödinger model. The idea behind the proof of this theorem is to show that both $\Pi(\cdot|Z^{(N)})$ and $\tilde\Pi(\cdot|Z^{(N)})$ concentrate most of their mass on the region (99) where the log-likelihood function $\ell_N$ is strongly concave (with high $P^N_{\theta_0}$-probability, cf. Proposition 4.1). This involves initially a careful study of the mode (maximiser) of the posterior density, given in Theorem 4.12.

Convergence rate of MAP estimates.
For $(Y_i,X_i)_{i=1}^N$ arising from (19) with $\mathcal{G}:\mathbb{R}^D\to\mathbb{R}$ from (17), we now study maximisers
\[ (127)\qquad \hat\theta_{MAP}\in\arg\max_{\theta\in\mathbb{R}^D}\Big[-\frac{1}{2N}\sum_{i=1}^N\big(Y_i-\mathcal{G}(\theta)(X_i)\big)^2-\delta_N^2\|\theta\|^2_{h^\alpha}\Big],\qquad \delta_N=N^{-\frac{\alpha}{2\alpha+d}}, \]
of the posterior density (24). For $\Lambda_\alpha$ from (23) we will write $I(\theta):=\|\theta\|^2_{h^\alpha}=\theta^T\Lambda_\alpha\theta$ for $\theta\in\mathbb{R}^D$. We denote the empirical measure on $\mathbb{R}\times\mathcal{O}$ induced by the $Z_i=(Y_i,X_i)$'s as
\[ (128)\qquad P_N=\frac{1}{N}\sum_{i=1}^N\delta_{(Y_i,X_i)},\qquad\text{so that}\quad \int h\,dP_N=\frac{1}{N}\sum_{i=1}^N h(Y_i,X_i) \]
for any measurable map $h:\mathbb{R}\times\mathcal{O}\to\mathbb{R}$. Recall also that $p_\theta:\mathbb{R}\times\mathcal{O}\to[0,\infty)$ denotes the marginal probability densities of $P^N_\theta$ defined in (21).
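For concreteness, the objective maximised in (127) can be written down directly once the forward map is available as a black box. The following sketch is purely illustrative: `forward` (one elliptic PDE solve per evaluation in the Schrödinger model) and `Lam_alpha` are hypothetical stand-ins, not functions defined in this paper.

```python
import numpy as np

# Hypothetical sketch of the penalised least-squares objective from (127).
# `forward` stands in for theta -> (G(theta)(X_1), ..., G(theta)(X_N)),
# and `Lam_alpha` is the matrix defining ||theta||_{h^alpha}^2 =
# theta^T Lam_alpha theta; both are assumptions of this illustration.
def map_objective(theta, Y, forward, Lam_alpha, delta_N):
    residual = Y - forward(theta)            # (Y_i - G(theta)(X_i))_i
    fit = -0.5 * np.mean(residual ** 2)      # -(1/2N) * sum of squares
    penalty = delta_N ** 2 * theta @ (Lam_alpha @ theta)
    return fit - penalty                     # to be maximised over R^D
```

Lemma 4.10.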
Let $\hat\theta_{MAP}$ be any maximiser in (127), and denote by $\theta_{0,D}$ the projection of $\theta_0$ onto $\mathbb{R}^D$. We have ($P^N_{\theta_0}$-a.s.)
\[ \frac12\|\mathcal{G}(\hat\theta_{MAP})-\mathcal{G}(\theta_0)\|^2_{L^2}+\delta_N^2 I(\hat\theta_{MAP})\ \le\ \int\log\frac{p_{\hat\theta_{MAP}}}{p_{\theta_{0,D}}}\,d(P_N-P_{\theta_0})+\delta_N^2 I(\theta_{0,D})+\frac12\|\mathcal{G}(\theta_{0,D})-\mathcal{G}(\theta_0)\|^2_{L^2}. \]
Proof.
By the definitions ℓ N (ˆ θ MAP ) − ℓ N ( θ ,D ) − N δ N I (ˆ θ MAP ) ≥ − N δ N I ( θ ,D )which is the same as(129) N Z log p ˆ θ MAP p θ ,D d ( P N − P θ ) + N δ N I ( θ ,D ) ≥ N δ N I (ˆ θ MAP ) − N Z log p ˆ θ MAP p θ ,D dP θ . The last term can be decomposed as − Z log p ˆ θ MAP p θ ,D dP θ = − Z log p ˆ θ MAP p θ dP θ + Z log p θ ,D p θ dP θ = 12 kG (ˆ θ MAP ) − G ( θ ) k L ( O ) − kG ( θ ,D ) − G ( θ ) k L ( O ) where we have used a standard computation of likelihood ratios (see also Lemma 23 in [29]). Theresult follows from the last two displays after dividing by N . (cid:3) The following result can be proved by adapting techniques from M -estimation [70] (see also [71],[56]) to the present situation. We will make crucial use of the concentration Lemma 3.12. Proposition 4.11.
Let $\alpha>d$. Suppose $\|\theta_0\|_{h^\alpha}\le c$ and that $D$ is such that $\|\mathcal{G}(\theta_0)-\mathcal{G}(\theta_{0,D})\|_{L^2}\le c_0\delta_N$ for some $c,c_0>0$. Then, for any $c_1\ge 1$ we can choose $C=C(c,c_0,c_1)$ large enough so that every $\hat\theta_{MAP}$ maximising (127) satisfies
\[ (130)\qquad P^N_{\theta_0}\Big(\|\mathcal{G}(\hat\theta_{MAP})-\mathcal{G}(\theta_0)\|^2_{L^2}+\delta_N^2 I(\hat\theta_{MAP})>C\delta_N^2\Big)\ \lesssim\ e^{-c_1 N\delta_N^2}. \]
Proof.
We define functionals τ ( θ, θ ′ ) = 12 kG ( θ ) − G ( θ ′ ) k L + δ N I ( θ ) , θ ∈ R D , θ ′ ∈ h α , ANGEVIN ALGORITHMS 45 and empirical processes W N ( θ ) = Z log p θ p θ ,D d ( P N − P θ ) , W N, ( θ ) = Z log p θ p θ d ( P N − P θ ) , θ ∈ R D , so that W N ( θ ) = W N, ( θ ) − W N, ( θ ,D ) , θ ∈ R D . Using the previous lemma it suffices to bound P Nθ (cid:16) τ (ˆ θ MAP , θ ) > Cδ N , W N (ˆ θ MAP ) ≥ τ (ˆ θ MAP , θ ) − δ N I ( θ ,D ) − kG ( θ ,D ) − G ( θ ) k L / (cid:17) Since I ( θ ,D ) = k θ ,D k h α / ≤ k θ k h α / ≤ c / kG ( θ ,D ) − G ( θ ) k L ≤ c δ N by hypothesis, we can choose C large enough so that the last probability is bounded by P Nθ (cid:16) τ (ˆ θ MAP , θ ) > Cδ N , | W N (ˆ θ MAP ) | ≥ τ (ˆ θ MAP , θ ) / (cid:17) ≤ ∞ X s =1 P Nθ sup θ ∈ R D :2 s − Cδ N ≤ τ ( θ,θ ) ≤ s Cδ N | W N, ( θ ) | ≥ s Cδ N / ! + P Nθ (cid:0) | W N, ( θ ,D ) | ≥ Cδ N / (cid:1) ≤ ∞ X s =1 P Nθ (cid:18) sup θ ∈ Θ s | W N, ( θ ) | ≥ s Cδ N / (cid:19) , (131)where, for s ∈ N ,(132) Θ s := (cid:8) θ ∈ R D : τ ( θ, θ ) ≤ s Cδ N (cid:9) = (cid:8) θ ∈ R D : kG ( θ ) − G ( θ ) k L + δ N k θ k h α ≤ s +1 Cδ N (cid:9) , and where we have used that θ ,D ∈ Θ for C large enough by the hypotheses. To proceed, noticethat N W N, ( θ ) = ℓ N ( θ ) − ℓ N ( θ ) − E θ [ ℓ N ( θ ) − ℓ N ( θ )]and that, for ( Y i , X i ) ∼ i.i.d. P θ , ℓ N ( θ ) − ℓ N ( θ ) = − N X i =1 (cid:2) ( G ( θ )( X i ) − G ( θ )( X i ) + ε i ) − ε i (cid:3) = − N X i =1 ( G ( θ )( X i ) − G ( θ )( X i )) ε i − N X i =1 ( G ( θ )( X i ) − G ( θ )( X i )) , (133)so that we have to deal with two empirical processes separately. We first bound(134) ∞ X s =1 P Nθ (cid:18) sup θ ∈ Θ s | Z N ( θ ) | ≥ √ N s Cδ N / (cid:19) where Z N = 1 √ N N X i =1 h θ ( X i ) ε i , h θ = G ( θ ) − G ( θ ) , θ ∈ Θ = Θ s , s ∈ N , is as in Lemma 3.12. We will apply that lemma with bounds (recalling vol ( O ) = 1)(135) E X h θ ( X ) = kG ( θ ) − G ( θ ) k L ≤ s +1 Cδ N =: σ s , k h θ k ∞ ≤ θ kG ( θ ) k ∞ ≤ U < ∞ uniformly in all θ ∈ Θ s , for some fixed constant U = U ( g, O ) (cf. (18)). For the entropy bounds,we use that on each slice sup θ ∈ Θ s k F θ k H α ≤ C ′ s/ , which for α > d implies (using (4.184) in [28]and standard extension properties of Sobolev norms)log N (cid:0) { F θ : θ ∈ Θ s } , k · k ∞ , ρ (cid:1) ≤ K (cid:16) s/ ρ (cid:17) d/α , ρ > , for some constant K = K ( α, d, C ′ ). Since the map F θ
7→ G ( θ ) is Lipschitz for the k · k ∞ -norm(Lemma 4.6) we deduce that also(136) log N (cid:0) { h θ = G ( θ ) − G ( θ ) : θ ∈ Θ s } , k · k ∞ , ρ (cid:1) ≤ K ′ (cid:16) s/ ρ (cid:17) d/α , ρ > , and as a consequence, for α > d and J ( H ) , J ∞ ( H ) defined in Lemma 3.12,(137) J ( H ) . Z σ s (cid:16) s/ ρ (cid:17) d/ α dρ . sd/ α σ − d α s , J ∞ ( H ) . Z U (cid:16) s/ ρ (cid:17) d/α dρ . sd/ α U − dα . The sum in (134) can now be bounded by Lemma 3.12 with x = c N s δ N and the choices of σ s , U in (135) for C = C ( c ) > X s ∈ N P Nθ (cid:18) sup θ ∈ Θ s | Z N ( θ ) | ≥ √ N σ s / (cid:19) ≤ X s ∈ N e − c s Nδ N . e − c Nδ N since then, by definition of δ N , for α > d and C large enough, the quantities(139) J ( H ) . sd/ α (2 s/ √ Cδ N ) − d α . C (4 α + d ) / α √ N σ s , σ s √ x ≤ c √ C √ N σ s , and(140) 1 √ N J ∞ ( H ) . sd/ α √ N . C √ N σ s , x √ N = c C √ N σ s are all of the correct order of magnitude compared to √ N σ s .We now turn to the process corresponding to the second term in (133), which is bounded by(141) X s ∈ N P Nθ (cid:18) sup θ ∈ Θ s | Z ′ N ( θ ) | ≥ √ N s Cδ N / (cid:19) where Z ′ N is now the centred empirical process Z ′ N ( θ ) = 1 √ N N X i =1 ( h θ − E X h θ ( X ) (cid:1) , with H = { h θ = ( G ( θ ) − G ( θ )) : θ ∈ Θ s } to which we will again apply Lemma 3.12. Just as in (135) the envelopes of this process areuniformly bounded by a fixed constant, again denoted by U , which implies in particular that thebounds (137) also apply to H as then, for some constant c U > k h θ − h θ ′ k ∞ ≤ c U kG ( θ ) − G ( θ ′ ) k ∞ . Moreover on each slice Θ s the weak variances are bounded by E X h θ ( X ) ≤ c ′ U k h θ k L ≤ σ s with σ s as in (135) and some c ′ U >
0. We see that all bounds required to obtain (134) apply to theprocess Z ′ N as well, and hence the series in (131) is indeed bounded as required in the proposition,completing the proof. (cid:3) ANGEVIN ALGORITHMS 47
From a stability estimate for the map $\theta\mapsto\mathcal{G}(\theta)$ we now obtain the following convergence rate for $\|\hat\theta_{MAP}-\theta_0\|_{\ell^2}$, which in turn also bounds $\|\hat\theta_{MAP}-\theta_{0,D}\|_{\mathbb{R}^D}$.

Theorem 4.12. Let $Z^{(N)}\sim P^N_{\theta_0}$ be as in (20) where $\theta_0\in h^\alpha$, $\alpha>d$, $d\le 3$. Define $\bar\delta_N:=N^{-r(\alpha)}$ where
\[ r(\alpha)=\frac{\alpha}{2\alpha+d}\cdot\frac{\alpha}{\alpha+2}. \]
Suppose $\|\theta_0\|_{h^\alpha}\le c$ and that $D$ is such that $\|\mathcal{G}(\theta_0)-\mathcal{G}(\theta_{0,D})\|_{L^2}\le c_0\delta_N$, for some constants $c,c_0>0$. Then given $c_1>0$ we can choose $\bar C,\bar c$ large enough (depending on $c,c_0,c_1,\alpha,\mathcal{O}$) so that for all $N$ and any maximiser $\hat\theta_{MAP}$ satisfying (127), one has
\[ (142)\qquad P^N_{\theta_0}\Big(\|\hat\theta_{MAP}-\theta_0\|_{\ell^2}\le\bar C\bar\delta_N,\ \|\hat\theta_{MAP}\|_{h^\alpha}\le\bar C\Big)\ \ge\ 1-\bar c\,e^{-c_1 N\delta_N^2}. \]
Proof.
By Proposition 4.11 we can restrict to events(143) T N := (cid:8) kG (ˆ θ MAP ) − G ( θ ) k L ≤ Cδ N , k F ˆ θ MAP k H α = k ˆ θ MAP k h α ≤ √ C (cid:9) of sufficiently high P Nθ -probability. If we write ˆ f = Φ ◦ F ˆ θ MAP for Φ from (17) then by (178), on theevents T N we also have k ˆ f k H α ≤ C ′ and k ˆ f k ∞ ≤ C ′ , for some C ′ >
0. We write u ˆ f = G (ˆ θ MAP ) forthe unique solution of the Schr¨odinger equation (11) corresponding to ˆ f . We then necessarily have f = ∆ u f / (2 u f ) both for f = ˆ f and f = f , where we also use that denominator u f is boundedaway from zero by a constant depending only on C ′ ≥ k f k ∞ , O , g , see (119). Then using themultiplication and interpolation inequalities (7), (8), the regularity estimate from (176) and (178),we have for t = α/ ( α + 2), k ˆ f − f k L . k u ˆ f − u f k H . kG (ˆ θ MAP ) − G ( θ ) k tL k u ˆ f − u f k − tH α +2 . δ tN ( k ˆ f k H α + k f k H α ) . δ tN (144)on the event T N . From a Sobolev imbedding (some κ >
0) and applying (8) again we furtherdeduce k ˆ f − f k ∞ . δ ( α − d/ − κ ) / ( α +2) N → N → ∞ , hence using inf x f ( x ) > K min we also haveinf x ˆ f ( x ) ≥ K min + k for some k > T N , for all N large enough). We deduce k ˆ θ MAP − θ k ℓ ≤ k F ˆ θ MAP − F θ k L = k Φ − ◦ ˆ f − Φ − ◦ f k L . k ˆ f − f k L . δ tN on the events T N , where in the last inequality we have used regularity of the inverse link functionΦ − : [ K min + k, ∞ ) and (179). This completes the proof. (cid:3) Posterior contraction rates.
We now study the full posterior distribution (24) arising fromthe Gaussian prior Π for θ from (23). The result we shall prove parallels Theorem 4.12 but holdsfor most of the ‘mass’ of the posterior measure instead of just for its ‘mode’ ˆ θ MAP . This requiresvery different techniques and we rely on ideas from Bayesian nonparametrics [26, 73], specificallyrecent progress [51] that allows one to deal with non-linear settings (see also [29]).In the proof of Theorem 4.14 to follow we will require control of the posterior ‘normalisingfactors’, expressed via sets(145) C N = C N,K = (cid:26)Z R D e ℓ N ( θ ) − ℓ N ( θ ) d Π( θ ) ≥ Π( B ( δ N )) exp {− (1 + K ) N δ N } (cid:27) , for some K >
$0$, where $\delta_N=N^{-\alpha/(2\alpha+d)}$ and
\[ B(\delta_N)=\big\{\theta\in\mathbb{R}^D:\ \|\mathcal{G}(\theta)-\mathcal{G}(\theta_0)\|_{L^2(\mathcal{O})}<\delta_N\big\}. \]
This is achieved in the course of the proof of our next result. We denote by $c_g$ the global Lipschitz constant of the map $\theta\mapsto\mathcal{G}(\theta)$ from $\ell^2(\mathbb{N})\to L^2(\mathcal{O})$, see (111).

Theorem 4.13.
Let $Z^{(N)},\theta_0,\alpha,d,\bar\delta_N$ be as in Theorem 4.12 and let $\Pi(\cdot|Z^{(N)})$ denote the posterior distribution from (24). Suppose $\|\theta_0\|_{h^\alpha}\le c$ and that $D\le c_0N\delta_N^2$ is such that
\[ (146)\qquad \|\mathcal{G}(\theta_0)-\mathcal{G}(\theta_{0,D})\|_{L^2(\mathcal{O})}\le c_1\delta_N \]
for some finite constants $c,c_0>0$, $0<c_1<1/4$. Then for any $a>0$ there exist $c',c''$ such that for $K,L=L(a,c,c_1,c_g,\alpha,\mathcal{O})$ large enough,
\[ (147)\qquad P^N_{\theta_0}\Big(\big\{\Pi\big(\theta:\|\theta-\theta_{0,D}\|_{\mathbb{R}^D}\le L\bar\delta_N,\ \|\theta\|_{h^\alpha}\le L\,\big|\,Z^{(N)}\big)\ge 1-e^{-aN\delta_N^2}\big\}\cap\mathcal{C}_{N,K}\Big)\ \ge\ 1-c'e^{-c''N\delta_N^2}. \]
Proof.
We initially establish some auxiliary results that will allow us to apply a standard contractiontheorem from Bayesian non-parametrics, specifically in a form given in Theorem 13 in [29]. ByLemma 23 in [29] and (18) we can lower bound Π N ( B N ) in (35) in [29] by our Π N ( B ( δ N )) (afteradjusting the choice of δ N in [29] by a multiplicative constant). Then using (146), Corollary 2.6.18 in[28], and ultimately Theorem 1.2 in [44] combined with (4.184) in [28], we have for θ ′ ∼ N (0 , Λ − α ),Π N ( kG ( θ ) − G ( θ ) k L ( O ) < δ N ) ≥ Π N ( kG ( θ ) − G ( θ ,D ) k L ( O ) < δ N / ≥ Π N ( k θ − θ ,D k R D < δ N / c g ) ≥ e − Nδ N k θ ,D k hα / Pr( k θ ′ k R D < √ Nδ N / c g ) ≥ e − ¯ dNδ N (148)for some ¯ d >
0. From this we deduce further from Borell’s Gaussian iso-perimetric inequality [7] (inthe form of Theorem 2.6.12 in [28]), arguing just as in Lemma 17 in [29] (and invoking the remarkafter that lemma with κ = 0 there), that given B > M large enough (depending on¯ d, B ) such thatΠ N (cid:0) θ = θ + θ ∈ R D : k θ k R D ≤ M δ N , k θ k h α ≤ M (cid:1) ≥ − e − BNδ N . Next the eigenvalue growth λ αk . k α/d from (13) and the hypothesis on D imply that for ¯ L largeenough we have(149) k θ k h α . D α/d k θ k R D ≤ ( c N δ N ) α/d M δ N ≤ ¯ L/ N ( A cN ) ≤ e − BNδ N where A N = { θ ∈ R D : k θ k h α ≤ ¯ L } . The k · k ∞ -covering numbers of the implied set of regression functions G ( θ ) satisfy the boundslog N ( {G ( θ ) : θ ∈ A N } , k · k ∞ , δ N ) . log N ( { F θ : θ ∈ A N , k · k ∞ , δ N ) . log N ( { F : k F k H α ( O ) ≤ c ¯ L } , k · k ∞ , δ N ) . N δ N , for some c >
0, using that the map F θ
7→ G ( θ ) is globally Lipschitz for the k · k ∞ -norm (Lemma4.6) and also the bound (4.184) in [28]. By (18) and Lemma 22 in [29] the previous metric entropyinequality also holds for the Hellinger distance replacing k · k ∞ -distance on the l.h.s. in the lastdisplay. Theorem 13 and again Lemma 22 in [29] now imply that for any a > L large enough,(151) P Nθ (cid:16) Π( { θ : kG ( θ ) − G ( θ ) k L > Lδ N } ∪ A cN | Z ( N ) ) ≤ e − aNδ N (cid:17) → N → ∞ . The convergence in probability to zero obtained in the proof of Theorem 13 in [29]is in fact exponentially fast, as required in (147): This is true by virtue of the bound to followin the next display (which forms part of the proof in [29] as well), and since the type-one testing ANGEVIN ALGORITHMS 49 errors in (39) in [29] are controlled at the required exponential rate (via Theorem 7.1.4 in [28]).The inequality P Nθ (cid:16) Z B ( δ N ) e ℓ N ( θ ) − ℓ N ( θ ) d Π( θ ) ≥ Π( B ( δ N )) exp {− (1 + K ) N δ N } (cid:17) ≤ c ′ e − c ′′ Nδ N , bounding P Nθ ( C cN,K ) as required in the theorem follows from Lemma 4.15 below for large enough K and ¯ C = 1 / R D asΘ N := { θ : kG ( θ ) − G ( θ ) k L ≤ Lδ N } ∩ A N = { θ : kG ( θ ) − G ( θ ) k L ≤ Lδ N , k F θ k H α = k θ k h α ≤ ¯ L } paralleling the events T N from (143) above. Then arguing as in and after (144), one shows that(152) Θ N ⊂ ˜Θ N = { θ : k θ − θ k R D ≤ LN − r ( α ) , k θ k h α ≤ L } , increasing also the constant L if necessary, and hence the posterior probability of this event isalso lower bounded by Π( ˜Θ N | Z ( N ) ) ≥ − e − aNδ N , with the desired P Nθ -probability, proving thetheorem. (cid:3) Moreover, a quantitative uniform integrability argument from Section 5.4.5 in [51] (see theproof of Theorem 4.14, term III, below) then also gives a convergence rate for the posterior mean E Π [ θ | Z ( N ) ] towards θ , namely that for L large enough there exist ¯ c ′ , ¯ c ′′ > P Nθ (cid:0) k E Π [ θ | Z ( N ) ] − θ k ℓ > L ¯ δ N (cid:1) ≤ ¯ c ′ e − ¯ c ′′ Nδ N . Globally log-concave approximation of the posterior in Wasserstein distance.
Recall the surrogate posterior measure $\tilde\Pi(\cdot|Z^{(N)})$ from (27) with log-density
\[ (154)\qquad \log\tilde\pi_N(\theta)=\mathrm{const}+\tilde\ell_N(\theta)-N\delta_N^2\|\theta\|^2_{h^\alpha},\qquad\theta\in\mathbb{R}^D, \]
with $\theta_{init}$ and parameters $\epsilon,K$ chosen as in Condition 2.2, and with $\delta_N=N^{-\alpha/(2\alpha+d)}$. We now prove the main result of this section.

Theorem 4.14.
Assume Condition 2.3 and let $\tilde\Pi(\cdot|Z^{(N)})$ be the probability measure with density given in (27), with $K,\epsilon>0$ chosen as in Condition 2.2. Then for some $a_1,a_2>0$ and all $N\in\mathbb{N}$,
\[ P^N_{\theta_0}\Big(W\big(\tilde\Pi(\cdot|Z^{(N)}),\Pi(\cdot|Z^{(N)})\big)>e^{-N\delta_N^2/4}\Big)\ \le\ a_1 e^{-a_2 N\delta_N^2}. \]
In the proof we will require a new sequence(155) ˜ δ N = N ( − α +2) / (2 α + d ) p log N describing the ‘rate of contraction’ of the surrogate posterior obtained below. We first notice thatthe definitions of ¯ δ N (from Theorem 4.12) and of δ N imply by straightforward calculations andusing D . N δ N , α >
6, the asymptotic relations as N → ∞ ,(156) δ N D /d p log N = O (˜ δ N ) , δ N ≪ ¯ δ N ≪ ˜ δ N ≪ N D − d , which we shall use in the proof. We will prove the bound for all N large enough, which is sufficientto prove the desired inequality after adjusting the constant in . (since probabilities are alwaysbounded by one). Geometry of the surrogate posterior.
To set things up, consider MAP estimates $\hat\theta_{MAP}$ from (127). In view of (18), the function $q_N$ to be maximised over $\mathbb{R}^D$ in (127) satisfies $q_N(\theta)<q_N(0)$ for all $\theta$ such that $\|\theta\|_{h^\alpha}$ exceeds some positive constant $k_0$. Then on the compact set $M=\{\theta\in\mathbb{R}^D:\|\theta\|_{h^\alpha}\le k_0\}$ the function $q_N$ is continuous (as $\mathcal{G}$ is continuous from $\mathbb{R}^D\to L^\infty(\mathcal{O})$, Lemma 4.6), and hence attains its maximum at some $\hat\theta_M\in M$, which must be a global maximiser of $q_N$ since $q_N(\hat\theta_M)\ge q_N(0)>\inf_{\theta\in M^c}q_N(\theta)$. Conclude that a maximiser $\hat\theta_{MAP}$ exists (one shows that it can be taken to be measurable, Exercise 7.2.3 in [28]).

In view of Proposition 4.1, Theorem 4.12, Theorem B.6 (and the remark before it) and $\alpha>6$, we may work on events
\[ S_N:=\Big\{\|\theta_{init}-\theta_{0,D}\|_{\mathbb{R}^D}\le\tfrac{D^{-1/d}}{8\log N},\ \inf_{\theta\in\mathcal{B}_{1/\log N}}\lambda_{\min}\big(-\nabla^2\ell_N(\theta)\big)\ge cND^{-4/d}\Big\}\cap\Big\{\text{any }\hat\theta_{MAP}\text{ satisfies }\|\hat\theta_{MAP}-\theta_{0,D}\|_{\mathbb{R}^D}\le\min\big\{\tfrac{D^{-1/d}}{8\log N},\,\bar C\bar\delta_N\big\}\Big\}, \]
where $\mathcal{B}_\epsilon$ was defined in (99), where $\bar C$ is from (142) and where $c=c_{\min}$ from Proposition 4.1. On $S_N$ we have the following properties of $\tilde\ell_N$. First, from (26),
\[ (157)\qquad \tilde\ell_N(\theta)=\ell_N(\theta)\quad\text{for any }\theta\text{ s.t. }\|\theta-\theta_{0,D}\|_{\mathbb{R}^D}\le\frac{D^{-1/d}}{4\log N}. \]
Moreover, by Proposition 3.6, log ˜ π ( ·| Z ( N ) ) is strongly concave in view of(158) sup θ,ϑ ∈ R D , k ϑ k R D =1 ϑ T [ ∇ (log ˜ π N ( θ ))] ϑ ≤ sup θ,ϑ ∈ R D , k ϑ k R D =1 ϑ T [ ∇ ˜ ℓ N ( θ )] ϑ ≤ − cN D − /d . Finally, any ˆ θ MAP necessarily satisfies(159) 0 = ∇ log π (ˆ θ MAP | Z ( N ) ) = ∇ log ˜ π (ˆ θ MAP ) , from which we conclude that ˆ θ MAP necessarily equals the unique global maximiser of the stronglyconcave function log ˜ π ( ·| Z ( N ) ) over R D . Decomposition of the Wasserstein distance.
Now let us writeˆ B ( r ) = { θ ∈ R D : k θ − ˆ θ MAP k R D ≤ r } , for the Euclidean ball of radius r > θ MAP . Then using Theorem 6.15 in [75] with x = ˆ θ MAP , we obtain for any m > W ( ˜Π( ·| Z ( N ) ) , Π( ·| Z ( N ) )) ≤ Z R D k θ − ˆ θ MAP k R D d | ˜Π( ·| Z ( N ) ) − Π( ·| Z ( N ) ) | ( θ ) ≤ Z ˆ B ( m ˜ δ N ) k θ − ˆ θ MAP k R D d | ˜Π( ·| Z ( N ) ) − Π( ·| Z ( N ) ) | ( θ )+ 2 Z R D \ ˆ B ( m ˜ δ N ) k θ − ˆ θ MAP k R D d | ˜Π( ·| Z ( N ) ) − Π( ·| Z ( N ) ) | ( θ ) ≤ m ˜ δ N Z ˆ B ( m ˜ δ N ) d | Π( ·| Z ( N ) ) − ˜Π( ·| Z ( N ) ) | dθ + 2 Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D d ˜Π( ·| Z ( N ) )+ 2 Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D d Π( ·| Z ( N ) ) ≡ I + II + III,
and we now bound $I$, $II$, $III$ in separate steps.
Term II.
We can write the surrogate posterior density as˜ π ( θ | Z ( N ) ) = e ˜ ℓ N ( θ ) − ˜ ℓ N (ˆ θ MAP ) π ( θ ) R R D e ˜ ℓ N ( θ ) − ˜ ℓ N (ˆ θ MAP ) π ( θ ) dθ , θ ∈ R D , and will first lower bound the normalising factor. From (156) we have for any c > B N ≡ {k θ − θ ,D k R D ≤ cδ N } ⊂ n k θ − θ ,D k R D ≤ D /d log N o whenever N is large enough. Since ℓ N ( θ ) = ˜ ℓ N ( θ ) on the last set we have on an event of largeenough P Nθ -probability, Z R D e ˜ ℓ N ( θ ) − ˜ ℓ N (ˆ θ MAP ) d Π( θ ) ≥ Z B N e ˜ ℓ N ( θ ) − ˜ ℓ N (ˆ θ MAP ) d Π( θ )= Z B N e ℓ N ( θ ) − ℓ N (ˆ θ MAP ) dν ( θ ) × Π( B N ) ≥ e − ¯ cNδ N for some ¯ c = ¯ c ( ¯ d, c ), where we have used Lemma 4.15 for our choice of B N (permitted for ap-propriate choice of c > G : R D → L is Lipschitz, see Appendix B) with ν = Π( · ) / Π( B N ) , ¯ C = 1 /
2; as well as the small ball estimate for Π in (148).Now recall the prior (23) and define scaling constants V N = (2 π ) − D/ q det( N δ N Λ α ) × e ¯ cNδ N . Then on the preceding events the term II can be bounded, using a second order Taylor expansionof log ˜ π ( ·| Z ( N ) ) around its maximum ˆ θ MAP combined with (158), (159), as Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D ˜ π ( θ | Z ( N ) ) dθ ≤ e ¯ cNδ N Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D e ˜ ℓ N ( θ ) − ˜ ℓ N (ˆ θ MAP ) π ( θ ) dθ ≤ V N × Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D e ˜ ℓ N ( θ ) − Nδ N k θ k hα − ˜ ℓ N (ˆ θ MAP )+ Nδ N k ˆ θ MAP k hα dθ = V N × Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D e log ˜ π N ( θ ) − log ˜ π N (ˆ θ MAP ) dθ ≤ V N × Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D e − cND − /d k θ − ˆ θ MAP k R D / dθ ≤ V N × (cid:0) πcN D − /d (cid:1) D/ Pr (cid:0) k Z k R D > m ˜ δ N (cid:1) where we have used x e − cx ≤ e − cx / for all x ∈ R , c ≥ N such that cN D − /d ≥
1) andwhere Z ∼ N (cid:16) , cD − /d N I D × D (cid:17) . Now by D ≤ c N δ N and (156), E k Z k R D ≤ q E k Z k R D ≤ q D/ ( cD − /d N ) ≤ (2 c /c ) / δ N D /d ≤ ( m/ δ N for m large enough, so thatPr (cid:0) k Z k R D > m ˜ δ N (cid:1) ≤ Pr (cid:0) k Z k R D − E k Z k R D > ( m/ δ N (cid:1) ≤ e − m cND − /d ˜ δ N / by a concentration inequality for Lipschitz-functionals of D -dimensional Gaussian random vectors(e.g., Theorem 2.5.7 in [28] applied to ( cN D − /d / / Z ∼ N (0 , I D × D ) and F = k · k R D ). By (13)and since D . N δ N we have for some c ′ > V N ≤ e c ′ Nδ N log N so that for m large enough and using (156), the last term in the displayed array above, and hence II , is bounded by2 V N × (cid:0) πcN D − /d (cid:1) D/ × e − m cD − /d N ˜ δ N / ≤ e − m D − /d N ˜ δ N / ≤ e − Nδ N . Term III:
We first note that Theorem 4.13 and (156) imply that for every a > m large enough such thatΠ( k θ − ˆ θ MAP k R D > m ˜ δ N | Z ( N ) ) ≤ Π( k θ − θ ,D k R D > m ¯ δ N − k ˆ θ MAP − θ ,D k R D | Z ( N ) ) ≤ Π( k θ − θ ,D k R D > m ¯ δ N / | Z ( N ) ) ≤ e − aNδ N on events S ′ N ⊂ S N of sufficiently high probability. Moreover, again by Theorem 4.13, we canfurther restrict the argument that follows to the event C N,K from (145) for some
K >
0. Now usingthe Cauchy-Schwarz and Markov inequalities as well as E Nθ e ℓ N ( θ ) − ℓ N ( θ ) = 1 and the small ballestimate for Π in (148), we have P Nθ (cid:16) C N,K ∩ S ′ N , Z k θ − ˆ θ MAP k R D >m ˜ δ N k θ − ˆ θ MAP k R D d Π( ·| Z ( N ) ) > e − Nδ N / (cid:17) ≤ P Nθ (cid:16) C N,K ∩ S ′ N , Π( k θ − ˆ θ MAP k R D > m ˜ δ N | Z ( N ) ) E Π [ k θ − ˆ θ MAP k R D | Z ( N ) ] > e − Nδ N / (cid:17) ≤ P Nθ (cid:16) S ′ N , e (1+ K + ¯ d +2 − a ) Nδ N Z R D k θ − ˆ θ MAP k R D e ℓ N ( θ ) − ℓ N ( θ ) d Π( θ ) > / (cid:17) . e (1+ K + ¯ d +2 − a ) Nδ N Z R D (1 + k θ k R D ) d Π( θ ) ≤ e − a Nδ N whenever m and then a are large enough, since Π has uniformly bounded fourth moments and since k ˆ θ MAP k R D is uniformly bounded by a constant depending only on k θ k ℓ on the events S N . Term I:
On the events S N we have from (156) that for fixed m > N large enoughˆ B ( m ˜ δ N ) ⊆ { θ : k θ − θ ,D k R D ≤ / (8 D /d log N ) } . On the latter set, by (157), the probability measures ˜Π( ·| Z ( N ) ) and Π( ·| Z ( N ) ) coincide up to anormalising factor, and thus we can represent their Lebesgue densities as˜ π ( θ | Z ( N ) ) = p N π ( θ | Z ( N ) ) , θ ∈ ˆ B ( m ˜ δ N ) , for some 0 < p N < ∞ . Moreover, by the preceding estimates for terms II and III (which hold justas well without the integrating factors k θ − ˆ θ MAP k R D ), we have both p N Π( ˆ B ( m ˜ δ N ) | Z ( N ) ) = ˜Π( ˆ B ( m ˜ δ N ) | Z ( N ) ) ≥ − e − Nδ N / ⇒ − e − Nδ N / ≤ p N ,p − N ˜Π( ˆ B ( m ˜ δ N ) | Z ( N ) ) = Π( ˆ B ( m ˜ δ N ) | Z ( N ) ) ≥ − e − Nδ N / ⇒ − e − Nδ N / ≤ p N ANGEVIN ALGORITHMS 53 on events of sufficiently high P Nθ -probability. On these events necessarily p N ∈ h − e − Nδ N , − e − Nδ N i and so for N large enough Z ˆ B ( m ˜ δ N ) d | Π( ·| Z ( N ) ) − ˜Π( ·| Z ( N ) ) | ( θ ) = | − p N | Z ˆ B ( m ˜ δ N ) π ( θ | Z ( N ) ) dθ ≤ | − p N | ≤ e − Nδ N / , which is obvious for p N ≤ f ( x ) = (1 − x ) − near x = 0 also for p N >
1. Collecting the bounds for
I, II, III completes the proof. (cid:3)
An ‘exponential’ small ball lemma.
Lemma 4.15.
Let G be as in (17) and let ν be a probability measure on some ( ℓ ( N ) -measurable)set (160) B N ⊆ (cid:8) θ ∈ ℓ ( N ) : kG ( θ ) − G ( θ ) k L ≤ Cδ N (cid:9) , for some ¯ C > . Then for ℓ N from (22) we have for every K = K ( ¯ C ) > large enough and some fixed constant b > that (161) P Nθ (cid:18)Z B N e ℓ N ( θ ) − ℓ N (ˆ θ MAP ) dν ( θ ) ≤ e − (1+ K ) ¯ C Nδ N (cid:19) . e − bNδ N . The same conclusion holds true with ℓ N (ˆ θ MAP ) replaced by ℓ N ( θ ) .Proof. We proceed as in Lemma 7.3.2 in [28] to deduce from Jensen’s inequality (applied to log and R ( · ) dν ) that, for P N the empirical measure from (128), the probability in question is bounded by P Nθ Z Z B N log p θ p ˆ θ MAP dν ( θ ) d ( P N − P θ ) ≤ − (1 + K ) ¯ C δ N − Z Z B N log p θ p ˆ θ MAP dν ( θ ) dP θ ! . Now just as in the proof of Lemma 4.10 and using Theorem 4.12 we see that − Z log p θ p ˆ θ MAP dP θ = − Z log p θ p θ dP θ + Z log p θ p ˆ θ MAP dP θ = 12 kG ( θ ) − G ( θ ) k L − kG (ˆ θ MAP ) − G ( θ ) k L ≤ ¯ C δ N so that using also Fubini’s theorem the last probability can be bounded by P Nθ (cid:18) √ N Z Z B N log p θ p θ dν ( θ ) d ( P N − P θ ) ≥ K ¯ C √ N δ N / (cid:19) + P Nθ (cid:18) √ N Z log p ˆ θ MAP p θ d ( P N − P θ ) ≥ K ¯ C √ N δ N / (cid:19) . For the first probability we decompose as in (133) and consider Z N as in Lemma 3.12 for fixed h θ equal to either h or h , where h ( x ) = Z B N ( G ( θ )( x ) − G ( θ )( x )) dν ( θ ) , and h ( x ) = Z B N ( G ( θ )( x ) − G ( θ )( x )) dν ( θ ) . To each of these we apply Bernstein’s inequality (96) with x = N σ and K large enough to obtainthe desired exponential bound, using uniform boundedness kG ( θ ) − G ( θ ) k ∞ ≤ U from (18) andJensen’s inequality in the variance estimates E X h ( X ) ≤ C δ N ≡ σ in the first case and E X h ( X ) ≤ U Z B N kG ( θ ) − G ( θ ) k L dν ( θ ) ≤ U ¯ Cδ N ≡ σ for the second case. [This already proves the case where ˆ θ MAP is replaced by θ .]For the second probability, restricting to the event in the supremum below, which has sufficientlyhigh P Nθ -probability in view of Proposition 4.11, it suffices to bound P Nθ sup k θ k hα ≤ C, kG ( θ ) −G ( θ ) k L ≤ Cδ N √ N (cid:12)(cid:12)(cid:12) Z log p θ p θ d ( P N − P θ ) (cid:12)(cid:12)(cid:12) ≥ K ¯ C √ N δ N / ! . This term corresponds to the empirical process bounded in and after (131) for s = 1. Choosing K large enough the proof there now applies directly, giving the desired exponential bound. (cid:3) Appendix A. Review of convergence guarantees for ULA
In this section we collect some key results (that were used in our proofs) about convergence guarantees for an Unadjusted Langevin Algorithm (ULA) for sampling from strongly log-concave target measures, see [16, 20, 21] and also the classical reference [62]. Our presentation follows the recent article [21]. Suppose that $\mu$ is a Borel probability measure on $\mathbb{R}^D$ which has a Lebesgue density proportional to $e^{-U}$ for some potential $U:\mathbb{R}^D\to\mathbb{R}$, specifically
\[ (162)\qquad \mu(B)=\frac{\int_B e^{-U(\theta)}\,d\theta}{\int_{\mathbb{R}^D}e^{-U(\theta)}\,d\theta},\qquad B\subseteq\mathbb{R}^D\ \text{measurable}. \]
Following [21] (cf. H1 and H2 there) we will assume that the potential $U$ has a $\Lambda$-Lipschitz gradient and is $m$-strongly convex.

Assumption A.1. 1. The function $U:\mathbb{R}^D\to\mathbb{R}$ is continuously differentiable and there exists a constant $\Lambda\ge 1$ such that for all $\theta,\bar\theta\in\mathbb{R}^D$,
\[ \|\nabla U(\theta)-\nabla U(\bar\theta)\|_{\mathbb{R}^D}\le\Lambda\|\theta-\bar\theta\|_{\mathbb{R}^D}. \]
2. There exists a constant $0<m\le\Lambda$ such that for all $\theta,\bar\theta\in\mathbb{R}^D$, we have
\[ U(\bar\theta)\ \ge\ U(\theta)+\langle\nabla U(\theta),\bar\theta-\theta\rangle_{\mathbb{R}^D}+\frac{m}{2}\|\theta-\bar\theta\|^2_{\mathbb{R}^D}. \]
Under Assumption A.1, the potential $U$ has a unique minimiser over $\mathbb{R}^D$, which we shall denote by $\theta_U$. For the computation of $\theta_U$ via gradient descent methods, we have the following standard result from convex optimisation (see Theorem 1 in [16] and (9.18) in [8]).

Proposition A.2. Suppose $U:\mathbb{R}^D\to\mathbb{R}$ satisfies Assumption A.1. Then the gradient descent algorithm given by
\[ \vartheta_{k+1}=\vartheta_k-\frac{1}{\Lambda}\nabla U(\vartheta_k),\qquad k=0,1,2,\dots, \]
satisfies
\[ \|\vartheta_k-\theta_U\|^2_{\mathbb{R}^D}\ \le\ \frac{2\big(U(\vartheta_0)-U(\theta_U)\big)}{m}\Big(1-\frac{m}{\Lambda}\Big)^k,\qquad k=0,1,2,\dots \]
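A minimal implementation of the scheme in Proposition A.2 reads as follows; `grad_U` and the constants are user-supplied inputs of this sketch, and the step size $1/\Lambda$ is the one appearing in the proposition.

```python
import numpy as np

# Gradient descent of Proposition A.2 for an m-strongly convex U with
# Lambda-Lipschitz gradient; the iterates contract towards the minimiser
# theta_U at geometric rate (1 - m/Lambda).
def gradient_descent(grad_U, theta0, Lam, n_iter):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta - grad_U(theta) / Lam   # step size 1/Lambda
    return theta

# Example: U(theta) = 0.5 * m * ||theta||^2 has minimiser theta_U = 0.
m = 0.5
theta_hat = gradient_descent(lambda t: m * t, np.ones(10), Lam=1.0, n_iter=200)
```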
The results presented below establish corresponding geometric convergence bounds for stochastic gradient methods which target the entire probability measure $\mu$ (instead of just its mode $\theta_U$). Define the continuous time Langevin diffusion process as the unique strong solution $(L_t:t\ge 0)$ of the stochastic differential equation
\[ (163)\qquad dL_t=-\nabla U(L_t)\,dt+\sqrt{2}\,dW_t,\qquad t\ge 0,\ L_t\in\mathbb{R}^D, \]
where $(W_t:t\ge 0)$ is a $D$-dimensional standard Brownian motion. It is well known that the Markov process $(L_t:t\ge 0)$ has $\mu$ from (162) as its invariant measure. The Euler-Maruyama discretisation of the dynamics (163) gives rise to the discrete-time Markov chain $(\vartheta_k:k\ge 0)$,
\[ (164)\qquad \vartheta_{k+1}=\vartheta_k-\gamma\nabla U(\vartheta_k)+\sqrt{2\gamma}\,\xi_{k+1},\qquad k\ge 0, \]
where $(\xi_k:k\ge 1)$ form an i.i.d. sequence of $D$-dimensional standard Gaussian $N(0,I_{D\times D})$ vectors, and $\gamma>0$ is a fixed step size. We will refer to $(\vartheta_k)$ as the unadjusted Langevin algorithm (ULA) in what follows. We denote by $P_{\theta_{init}}$, $E_{\theta_{init}}$ the law and expectation operator, respectively, of the Markov chain $(\vartheta_k:k\ge 1)$ when started at the deterministic point $\vartheta_0=\theta_{init}$. We also write $\mathcal{L}(\vartheta_k)$ for the (marginal) distribution of the $k$-th iterate $\vartheta_k$. For any measurable function $H:\mathbb{R}^D\to\mathbb{R}$ and any $J_{in},J\ge 0$, let us define the average of $H$ along an ULA trajectory after 'burn-in' period $J_{in}$ by
\[ \hat\mu^J_{J_{in}}(H)=\frac{1}{J}\sum_{k=J_{in}+1}^{J_{in}+J}H(\vartheta_k). \]
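The ULA recursion (164) and the trajectory average $\hat\mu^J_{J_{in}}$ are straightforward to implement; the sketch below is a direct transcription, with `grad_U` and the test function `H` as hypothetical user-supplied callables, and makes no claim beyond illustrating the recursion.

```python
import numpy as np

# ULA: Euler-Maruyama discretisation of the Langevin SDE (163), together
# with the post-burn-in trajectory average hat{mu}_{J_in}^J(H).
def ula_average(grad_U, theta_init, gamma, J_in, J, H, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta_init, dtype=float)
    D = theta.size
    total = 0.0
    for k in range(1, J_in + J + 1):
        xi = rng.standard_normal(D)                          # xi_k ~ N(0, I)
        theta = theta - gamma * grad_U(theta) + np.sqrt(2 * gamma) * xi
        if k > J_in:                                         # discard burn-in
            total += H(theta)
    return total / J                                         # hat mu (H)
```

Proposition A.3.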
Suppose that $U$ satisfies Assumption A.1 and suppose $\gamma\le 1/(m+\Lambda)$. Then for all $J,J_{in}\ge 0$, $x>0$ and any Lipschitz function $H:\mathbb{R}^D\to\mathbb{R}$, we have the concentration inequality
\[ P_{\theta_{init}}\Big(\hat\mu^J_{J_{in}}(H)-E_{\theta_{init}}[\hat\mu^J_{J_{in}}(H)]\ge x\Big)\ \le\ \exp\Big(-\frac{J\gamma\,m\,x^2}{8\|H\|^2_{\mathrm{Lip}}\big(1+2/(mJ\gamma)\big)}\Big). \]
Proof.
The statement follows directly from Theorem 17 of [21], noting that $\kappa=2m\Lambda/(m+\Lambda)\in[m,2m]$ and that the constant $v_{N,n}(\gamma)$ from (28) of [21] can be upper bounded by
\[ 1+\frac{m^{-1}+2/(m+\Lambda)}{\gamma J}\ \le\ 1+\frac{2}{m\gamma J}.\quad\square \]
Proposition A.4. Suppose that $U$ satisfies Assumption A.1 and let $\gamma,J_{in},J$ and $H$ be as in Proposition A.3. Then we have for $\mu$ as in (162) that
\[ (165)\qquad W_2^2\big(\mathcal{L}(\vartheta_k),\mu\big)\ \le\ 2\Big(1-\frac{m\gamma}{2}\Big)^k\Big[\|\theta_{init}-\theta_U\|^2_{\mathbb{R}^D}+\frac{D}{m}\Big]+b(\gamma),\qquad k\ge 0, \]
where
\[ (166)\qquad b(\gamma)=\frac{36\gamma D\Lambda^2}{m^2}+\frac{12\gamma^2 D\Lambda^4}{m^3}, \]
as well as
\[ (167)\qquad \Big|E_{\theta_{init}}[\hat\mu^J_{J_{in}}(H)]-E_\mu H\Big|\ \le\ \frac{\|H\|_{\mathrm{Lip}}}{J}\sum_{k=J_{in}+1}^{J_{in}+J}W_2\big(\mathcal{L}(\vartheta_k),\mu\big). \]
Proof.
The display (167) is derived in (27) of [21]. The bound (165) follows from an application of Theorem 5 in [21] with fixed step size $\gamma>0$, where in our case, noting again that $\kappa\in[m,2m]$, the expression $u^{(1)}_n(\gamma)$ there is upper bounded by $2(1-m\gamma/2)^k$ and the expression $u^{(2)}_n(\gamma)$ there is upper bounded by (using that $\gamma\le\min\{1/\Lambda,1/m\}\le\min\{1/\Lambda,2/\kappa\}$)
\[ \Lambda^2D\gamma^2\big(\kappa^{-1}+\gamma\big)\Big(2+\frac{\gamma m}{3}+\frac{\Lambda^2\gamma^2}{6}\Big)\sum_{i=1}^k\Big(1-\frac{\kappa\gamma}{2}\Big)^{k-i}\ \le\ \Lambda^2D\gamma^2\big(\kappa^{-1}+\gamma\big)\Big(2+\frac{\gamma m}{3}+\frac{\Lambda^2\gamma^2}{6}\Big)\frac{2}{\kappa\gamma}\ \le\ 2\Lambda^2D\gamma m^{-2}\Big(18+\frac{6\Lambda^2\gamma}{m}\Big), \]
which equals (166). $\square$
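Proposition A.4 can be used as a calculator for a priori mixing guarantees: given $(m,\Lambda,D,\gamma)$ one evaluates the right-hand side of (165)-(166) to decide how many iterations suffice for a target $W_2$-precision. A small sketch follows, evaluating the display (165)-(166) exactly as stated above; all numerical values are purely illustrative.

```python
import numpy as np

# Right-hand side of (165)-(166): geometric decay in k down to the
# discretisation floor sqrt(b(gamma)); assumes gamma <= 1/(m + Lam).
def ula_w2_bound(k, gamma, m, Lam, D, dist_init_sq):
    b = 36 * gamma * D * Lam**2 / m**2 + 12 * gamma**2 * D * Lam**4 / m**3
    return np.sqrt(2 * (1 - m * gamma / 2) ** k * (dist_init_sq + D / m) + b)

# Illustrative evaluation: after enough steps the bound is dominated by b(gamma).
print(ula_w2_bound(k=10**7, gamma=1e-6, m=1.0, Lam=10.0, D=100, dist_init_sq=1.0))
```

Appendix B. Auxiliary results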
B.1.
Analytical properties of Schr¨odinger operators and link functions.
Recall the inverseSchr¨odinger operators V f from (107). Lemma B.1.
There exists a constant $C>0$ such that for any $f\in C(\mathcal{O})$ with $f\ge 0$, the following holds.
i) We have the estimates
\[ (168)\qquad \|V_f[\psi]\|_{L^2}\le C\|\psi\|_{L^2},\ \psi\in L^2(\mathcal{O});\qquad \|V_f[\psi]\|_\infty\le C\|\psi\|_\infty,\ \psi\in C(\mathcal{O}). \]
ii) For any $\psi\in L^2(\mathcal{O})$, we have that
\[ (169)\qquad \|V_f[\psi]\|_{H^2}\le C(1+\|f\|_\infty)\|\psi\|_{L^2}, \]
as well as
\[ (170)\qquad C^{-1}(1+\|f\|_\infty)^{-1}\|\psi\|_{(H^2)^*}\ \le\ \|V_f[\psi]\|_{L^2}\ \le\ C(1+\|f\|_\infty)\|\psi\|_{(H^2)^*}. \]
iii) If also $d\le 3$, then for any $\psi\in L^2(\mathcal{O})$ and any $f,\bar f\in C(\mathcal{O})$ with $f,\bar f\ge 0$, we have that
\[ (171)\qquad \|V_f[\psi]-V_{\bar f}[\psi]\|_\infty\ \lesssim\ (1+\|f\|_\infty)\|\psi\|_{L^2}\|f-\bar f\|_\infty. \]
Proof.
Part i) is a direct consequence of the Feynman-Kac formula for $V_f[\psi]$ from [13] (see also Lemma 25 in [56]). The upper bounds in part ii) likewise are proved by standard arguments for elliptic PDEs (see, e.g., Lemma 26 in [56]). In order to prove the lower bound in (170), let us denote the Schrödinger operator by $S_f[w]=\Delta w/2-fw$. Since $S_f:H^2\to L^2$ satisfies $S_fV_f[\psi]=\psi$, it suffices to show that
\[ \|S_fw\|_{(H^2)^*}\ \lesssim\ (1+\|f\|_\infty)\|w\|_{L^2},\qquad w\in H^2. \]
Using the divergence theorem we have that for such $w$,
\[ \|S_fw\|_{(H^2)^*}=\sup_{\psi\in H^2:\|\psi\|_{H^2}\le 1}\Big|\int_{\mathcal{O}}\psi S_fw\Big|=\sup_{\psi\in H^2:\|\psi\|_{H^2}\le 1}\Big|\int_{\mathcal{O}}wS_f\psi\Big|\ \le\ \|w\|_{L^2}\sup_{\psi\in H^2:\|\psi\|_{H^2}\le 1}\|S_f\psi\|_{L^2}, \]
and the term on the right hand side is further estimated by
\[ \|S_f\psi\|_{L^2}\ \lesssim\ \|\Delta\psi\|_{L^2}+\|f\psi\|_{L^2}\ \lesssim\ (1+\|f\|_\infty)\|\psi\|_{H^2}\ \le\ 1+\|f\|_\infty, \]
which proves (170). Finally, (171) is proved by using a Sobolev embedding as well as (168), (169):
\[ \|V_f[\psi]-V_{\bar f}[\psi]\|_\infty\ \lesssim\ \|V_f[(f-\bar f)V_{\bar f}[\psi]]\|_{H^2}\ \lesssim\ (1+\|f\|_\infty)\|(f-\bar f)V_{\bar f}[\psi]\|_{L^2}\ \lesssim\ (1+\|f\|_\infty)\|f-\bar f\|_\infty\|\psi\|_{L^2}.\quad\square \]

For any normed vector spaces $(V,\|\cdot\|_V)$ and $(W,\|\cdot\|_W)$ let $L(V,W)$ denote the space of bounded linear operators $V\to W$, equipped with the operator norm. For $g\in C^\infty(\partial\mathcal{O})$ and any $f\in C(\mathcal{O})$ with $f>$
$0$, there exists a unique (weak) solution $\mathcal{G}(f)\in C(\mathcal{O})$ of (11), see Theorem 4.7 in [13]. We define the operators $D\mathcal{G}_f\in L(C(\mathcal{O}),C(\mathcal{O}))$ and $D^2\mathcal{G}_f\in L(C(\mathcal{O}),L(C(\mathcal{O}),C(\mathcal{O})))$ as
\[ (172)\qquad D\mathcal{G}_f[h]=V_f[hu_f],\qquad (D^2\mathcal{G}_f[h_1])[h_2]=V_f[h_1D\mathcal{G}_f[h_2]]+V_f[h_2D\mathcal{G}_f[h_1]],\qquad h_1,h_2\in C(\mathcal{O}). \]
The next lemma establishes that these operators are suitable Fréchet derivatives of $\mathcal{G}$ on the open subset $\{f\in C(\mathcal{O}),f>0\}$ of $C(\mathcal{O})$.

Lemma B.2. i) For any $f\in C(\mathcal{O})$ with $f>0$, we have $\mathcal{G}(f)\in C(\mathcal{O})$. Moreover there exists $C>0$ such that for any $f,\bar f\in C(\mathcal{O})$ with $f,\bar f>0$,
\[ (173)\qquad \|\mathcal{G}(\bar f)-\mathcal{G}(f)\|_\infty\le C\|\bar f-f\|_\infty, \]
as well as
\[ (174)\qquad \|\mathcal{G}(\bar f)-\mathcal{G}(f)-D\mathcal{G}_f[\bar f-f]\|_\infty\le C\|\bar f-f\|^2_\infty,\qquad \|D\mathcal{G}_{\bar f}-D\mathcal{G}_f-D^2\mathcal{G}_f[\bar f-f]\|_{L(C(\mathcal{O}),C(\mathcal{O}))}\le C\|\bar f-f\|^2_\infty. \]
ii) For any integer $\alpha>d/2$ there exists a constant $C>0$ such that for all $f\in H^\alpha$ with $\inf_{x\in\mathcal{O}}f(x)>0$, we have
\[ (175)\qquad \|\mathcal{G}(f)\|_{H^2}\le C\big(\|f\|_{L^2}+\|g\|_{C^2(\partial\mathcal{O})}\big), \]
\[ (176)\qquad \|\mathcal{G}(f)\|_{H^{\alpha+2}}\le C\big(1+\|f\|^{\alpha/2}_{H^\alpha}\big)\|g\|_{C^{\alpha+2}(\partial\mathcal{O})}. \]
Proof.
The estimate (173) follows from the identity $\mathcal{G}(\bar f)-\mathcal{G}(f)=V_f[(\bar f-f)\mathcal{G}(\bar f)]$, (168) and (18). Arguing similarly and using (173), we further obtain
\[ \|\mathcal{G}(\bar f)-\mathcal{G}(f)-D\mathcal{G}_f[\bar f-f]\|_\infty=\|V_f[(\bar f-f)(\mathcal{G}(\bar f)-\mathcal{G}(f))]\|_\infty\ \lesssim\ \|(\bar f-f)(\mathcal{G}(\bar f)-\mathcal{G}(f))\|_\infty\ \lesssim\ \|\bar f-f\|^2_\infty, \]
which proves the first part of (174). For the second part of (174), we have for any $h\in C(\mathcal{O})$ that
\[ D\mathcal{G}_{\bar f}[h]-D\mathcal{G}_f[h]=V_{\bar f}[hu_{\bar f}]-V_f[hu_f]=V_{\bar f}[h(u_{\bar f}-u_f)]+(V_{\bar f}-V_f)[hu_f]=(D^2\mathcal{G}_f[\bar f-f])[h]+R_1+R_2, \]
with remainder terms $R_1,R_2$ given by
\[ R_1=[V_{\bar f}-V_f][h(u_{\bar f}-u_f)]+V_f[h(u_{\bar f}-u_f-D\mathcal{G}_f[\bar f-f])],\qquad R_2=[V_{\bar f}-V_f](hu_f)-V_f[(\bar f-f)V_f[hu_f]]. \]
Using the identity $(V_{\bar f}-V_f)\psi=V_f[(\bar f-f)V_{\bar f}[\psi]]$ with $\psi=h(u_{\bar f}-u_f)$, Lemma B.1 as well as the first part of (174), we have
\[ \|R_1\|_\infty\ \lesssim\ \|\bar f-f\|_\infty\|h(u_{\bar f}-u_f)\|_\infty+\|h\|_\infty\|u_{\bar f}-u_f-D\mathcal{G}_f[\bar f-f]\|_\infty\ \lesssim\ \|\bar f-f\|^2_\infty\|h\|_\infty, \]
and arguing similarly,
\[ \|R_2\|_\infty=\|V_f[(\bar f-f)(V_{\bar f}-V_f)[hu_f]]\|_\infty\ \lesssim\ \|\bar f-f\|_\infty\|(V_{\bar f}-V_f)[hu_f]\|_\infty\ \lesssim\ \|\bar f-f\|^2_\infty\|h\|_\infty. \]
This completes the proof of (174). To prove (175), we use that $(\Delta,\mathrm{tr}):H^2(\mathcal{O})\to L^2\times H^{3/2}(\partial\mathcal{O})$ [where $\mathrm{tr}$ denotes the boundary trace operator for the domain $\mathcal{O}$] is a topological isomorphism, see Theorem II.5.4 in [45], such that in particular
\[ \|\mathcal{G}(f)\|_{H^2}\ \lesssim\ \|fu_f\|_{L^2}+\|g\|_{C^2(\partial\mathcal{O})}\ \le\ \|f\|_{L^2}+\|g\|_{C^2(\partial\mathcal{O})}, \]
where we also used (18). Finally, (176) is proved in Lemma 27 in [56]. $\square$

B.1.1.
Properties of the map $\Phi^*$. We summarise some properties of 'regular' link functions from Definition 2.1. We recall the notation $\Phi^*$ for the associated composition operator from (15). For any $F\in C(\mathcal{O})$, define the operators $D\Phi^*_F\in L(C(\mathcal{O}),C(\mathcal{O}))$, $D^2\Phi^*_F\in L(C(\mathcal{O}),L(C(\mathcal{O}),C(\mathcal{O})))$ by
\[ (177)\qquad D\Phi^*_F[H]=H\,\Phi'\circ F,\qquad (D^2\Phi^*_F[H])[J]=HJ\,\Phi''\circ F,\qquad H,J\in C(\mathcal{O}). \]
Then for any $F,H,J\in C(\mathcal{O})$ and $x\in\mathcal{O}$, a Taylor expansion immediately implies that, with $\zeta_x,\bar\zeta_x$ denoting intermediate points between $F(x)$ and $(F+H)(x)$,
\[ \big|\big(\Phi^*(F+H)-\Phi^*(F)-D\Phi^*_F[H]\big)(x)\big|=\big|H^2(x)\Phi''(\zeta_x)/2\big|\ \le\ \tfrac12\|H\|^2_\infty\sup_{t\in\mathbb{R}}|\Phi''(t)|, \]
\[ \big|\big(D\Phi^*_{F+H}-D\Phi^*_F-D^2\Phi^*_F[H]\big)[J](x)\big|=\big|J(x)H^2(x)\Phi'''(\bar\zeta_x)/2\big|\ \le\ \tfrac12\|J\|_\infty\|H\|^2_\infty\sup_{t\in\mathbb{R}}|\Phi'''(t)|, \]
whence $D\Phi^*,D^2\Phi^*$ are the Fréchet derivatives of $\Phi^*:C(\mathcal{O})\to C(\mathcal{O})$. We also need the basic fact that for any integer $\alpha>d/2$ there exists $C>0$ such that for all $F\in H^\alpha(\mathcal{O})$,
\[ (178)\qquad \|\Phi\circ F\|_{H^\alpha}\ \le\ C\big(1+\|F\|^\alpha_{H^\alpha}\big), \]
see Lemma 29 in [56]. Finally, note that by the definition of $\Phi$, there exists $C'>0$ such that for all $\bar F,F\in C(\mathcal{O})$,
\[ (179)\qquad \|\Phi\circ\bar F-\Phi\circ F\|_\infty\le C'\|\bar F-F\|_\infty,\qquad \|\Phi\circ\bar F-\Phi\circ F\|_{L^2}\le C'\|\bar F-F\|_{L^2}. \]
B.1.2. Chain rule for Fréchet derivatives. Let $U,V$ be normed vector spaces and $\mathcal{D}\subseteq U$ an open subset. For a map $T:\mathcal{D}\to V$ we denote by $DT_\theta\in L(U,V)$ and $D^2T_\theta\in L(U,L(U,V))$ the first and second order Fréchet derivatives at $\theta\in\mathcal{D}$, respectively, whenever they exist. The following basic lemma then follows directly from the chain rule.

Lemma B.3. Suppose $U,V,W$ are (open subsets of) normed vector spaces, and suppose that $A:U\to V$ and $B:V\to W$ are both twice differentiable in the Fréchet sense. Then for any $\theta\in U$ and $H_1,H_2\in U$, we have that $D(B\circ A)_\theta=DB_{A(\theta)}\circ DA_\theta$ and
\[ (180)\qquad \big(D^2(B\circ A)_\theta[H_1]\big)[H_2]=\big(D^2B_{A(\theta)}[DA_\theta[H_1]]\big)[DA_\theta[H_2]]+DB_{A(\theta)}\big[(D^2A_\theta[H_1])[H_2]\big]. \]

B.2. Proof of Proposition 3.6.
We first record the following basic lemma without proof.
Lemma B.4.
Let $|\cdot|$ be an ellipsoidal norm on $\mathbb{R}^D$ with associated matrix $M$, $|\theta|^2=\theta^TM\theta$, and define the function $n:\theta\mapsto|\theta|$. Then for any $\theta\neq 0$, we have
\[ (181)\qquad \nabla n(\theta)=\frac{M\theta}{|\theta|},\qquad \nabla^2n(\theta)=\frac{M}{|\theta|}-\frac{M\theta(M\theta)^T}{|\theta|^3}, \]
as well as the norm estimates
\[ (182)\qquad \|\nabla n(\theta)\|_{\mathbb{R}^D}\le\sqrt{\lambda_{\max}(M)}, \]
\[ (183)\qquad \|\nabla^2n(\theta)\|_{op}\le 2\lambda_{\max}(M)/|\theta|. \]
Using Lemma B.4, we prove the following bounds on the cut-off function $\alpha_\eta$.

Lemma B.5. If $|\cdot|$ is an ellipsoidal norm with associated matrix $M$, $|\theta|^2=\theta^TM\theta$, then the function $\alpha_\eta$ from (53) satisfies that for all $\theta\in\mathbb{R}^D$,
\[ \|\nabla\alpha_\eta(\theta)\|_{\mathbb{R}^D}\le\frac{\|\alpha\|_{C^1}\sqrt{\lambda_{\max}(M)}}{\eta},\qquad \|\nabla^2\alpha_\eta(\theta)\|_{op}\le\frac{9\|\alpha\|_{C^2}\lambda_{\max}(M)}{\eta^2}. \]
Proof.
We may assume w.l.o.g. that $\theta_{init}=0$ and we write $n(\theta)=|\theta|$. The gradient bound is obtained by the chain rule and (182):
\[ \|\nabla\alpha_\eta(\theta)\|_{\mathbb{R}^D}=\big\|\eta^{-1}\alpha'(|\theta|/\eta)\nabla n(\theta)\big\|_{\mathbb{R}^D}\ \le\ \eta^{-1}\|\alpha\|_{C^1}\sqrt{\lambda_{\max}(M)}. \]
For the Hessian, we similarly employ the chain rule, (182), (183) as well as the fact that $\alpha'(t)=0$ when $t\in(0,1/4)$:
\[ \|\nabla^2\alpha_\eta(\theta)\|_{op}\ \le\ \eta^{-2}\big\|\alpha''(|\theta|/\eta)\nabla n(\theta)\nabla n(\theta)^T\big\|_{op}+\eta^{-1}\big\|\alpha'(|\theta|/\eta)\nabla^2n(\theta)\big\|_{op}\ \le\ \eta^{-2}\|\alpha\|_{C^2}\|\nabla n(\theta)\|^2_{\mathbb{R}^D}+\eta^{-1}\|\alpha\|_{C^1}\mathbf{1}_{\{|\theta|\ge\eta/4\}}\frac{2\lambda_{\max}(M)}{|\theta|}\ \le\ 9\eta^{-2}\|\alpha\|_{C^2}\lambda_{\max}(M).\quad\square \]
We now turn to the proof of Proposition 3.6. Throughout, we work on the event $E_{conv}\cap E_{init}$ defined by (49), (50); moreover we assume without loss of generality that $\theta_{init}=0$.

Proof of Proposition 3.6.
We divide the proof into five steps.
1. Local lower bound for $\alpha_\eta\ell_N$. For the set $V:=\{\theta:|\theta|\le 3\eta/4\}$, by definition of $E_{init}$, we have that $V\subseteq\mathcal{B}$. Thus using the definitions of $E_{conv}$ and of $\alpha_\eta$, we obtain
\[ (184)\qquad \inf_{\theta\in V}\lambda_{\min}\big(-\nabla^2[\alpha_\eta\ell_N](\theta)\big)\ \ge\ Nc_{\min}/2. \]
2. Upper bound for $\alpha_\eta\ell_N$. By the chain rule, Lemma B.5, the definition of $E_{conv}$ and using that $\|\alpha\|_{C^2}\ge 1$, we obtain that for any $\theta\in\mathbb{R}^D$ and some $c=c(\alpha)$,
\[ \|\nabla^2[\alpha_\eta\ell_N](\theta)\|_{op}\ \le\ |\ell_N(\theta)|\|\nabla^2\alpha_\eta(\theta)\|_{op}+2\|\nabla\alpha_\eta(\theta)\|_{\mathbb{R}^D}\|\nabla\ell_N(\theta)\|_{\mathbb{R}^D}+|\alpha_\eta(\theta)|\|\nabla^2\ell_N(\theta)\|_{op} \]
\[ \le\ \sup_{\theta\in\mathcal{B}}\Big(\big[|\alpha_\eta(\theta)|+\|\nabla\alpha_\eta(\theta)\|_{\mathbb{R}^D}+\|\nabla^2\alpha_\eta(\theta)\|_{op}\big]\big[|\ell_N(\theta)|+\|\nabla\ell_N(\theta)\|_{\mathbb{R}^D}+\|\nabla^2\ell_N(\theta)\|_{op}\big]\Big)\ \le\ c\big(\lambda_{\max}(M)/\eta^2\big)N(c_{\max}+1).\qquad(185) \]
3. Global lower bound for $\nabla^2g_\eta$. First we note that $g_\eta$ is convex on all of $\mathbb{R}^D$: Indeed, this follows from the identity $\gamma_\eta=\tilde\gamma_\eta*\varphi_{\eta/4}$, the convexity of the functions $n:\theta\mapsto|\theta|$, $\tilde\gamma_\eta$ and the fact that convolution with the positive function $\varphi_{\eta/4}$ preserves convexity. As $g_\eta$ has $C^2$ regularity, it follows that $\nabla^2g_\eta\succeq 0$ on $\mathbb{R}^D$.

We next prove a quantitative lower bound for $\nabla^2g_\eta$ on the set $V^c$. By the chain rule and Lemma B.4, we have that for any $\theta\in\mathbb{R}^D$, writing $v=\nabla n(\theta)$,
\[ \nabla^2g_\eta(\theta)=\gamma''_\eta(|\theta|)\nabla n(\theta)\nabla n(\theta)^T+\gamma'_\eta(|\theta|)\nabla^2n(\theta)=\gamma''_\eta(|\theta|)vv^T+\frac{\gamma'_\eta(|\theta|)}{|\theta|}\big(M-vv^T\big)=:A(|\theta|)vv^T+B(|\theta|)M.\qquad(186) \]
To derive lower bounds for the functions $B(\cdot)$ and $A(\cdot)$, we first observe that by the symmetry of $\varphi_{\eta/4}$ around $0$, it holds for any $t\ge 3\eta/4$ that
\[ (187)\qquad \gamma'_\eta(t)=\int_{[-\eta/4,\eta/4]}\varphi_{\eta/4}(y)\cdot 2(t-y-\eta/2)\,dy=2(t-\eta/2). \]
Thus the function $B(t)=\gamma'_\eta(t)/t$ strictly increases on $(3\eta/4,\infty)$, and for any $t\ge 3\eta/4$, we obtain
\[ (188)\qquad B(t)\ \ge\ B(3\eta/4)=\frac{\gamma'_\eta(3\eta/4)}{3\eta/4}=\frac{2(3\eta/4-\eta/2)}{3\eta/4}=\frac{2}{3}. \]
For the term $A(\cdot)$, we note that for any $t\ge 3\eta/4$, using that $\gamma''_\eta(t)=2$ as well as (187), we have
\[ (189)\qquad A(t)=2-\frac{2(t-\eta/2)}{t}\ \ge\ 0. \]
Combining the displays (186), (188), (189), we have proved the lower bound
\[ (190)\qquad \inf_{\theta\in V^c}\lambda_{\min}\big(\nabla^2g_\eta(\theta)\big)\ \ge\ 2\lambda_{\min}(M)/3. \]
4. Global upper bound for $\nabla^2g_\eta$. We note that the functions $A(\cdot),B(\cdot)$ from (186) satisfy
\[ \sup_{t\in(0,\infty)}|A(t)|\ \le\ \sup_{t\in(0,\infty)}|\gamma'_\eta(t)/t|+|\gamma''_\eta(t)|\ \le\ 4,\qquad \sup_{t\in(0,\infty)}|B(t)|\ \le\ \sup_{t\in(0,\infty)}|\gamma'_\eta(t)/t|\ \le\ 2. \]
Hence, by (186) and Lemma B.4, we obtain that
\[ (191)\qquad \|\nabla^2g_\eta(\theta)\|_{op}\ \le\ 4\|vv^T\|_{op}+2\|M\|_{op}\ \le\ 6\lambda_{\max}(M),\qquad\theta\in\mathbb{R}^D. \]
5. Combining the bounds. Combining the estimates (184), (185) and (190), we obtain that
\[ (192)\qquad \inf_{\theta\in V}\lambda_{\min}\big(-\nabla^2\tilde\ell_N(\theta)\big)\ \ge\ \frac{Nc_{\min}}{2},\qquad \inf_{\theta\in V^c}\lambda_{\min}\big(-\nabla^2\tilde\ell_N(\theta)\big)\ \ge\ \frac{2K\lambda_{\min}(M)}{3}-c\big(\lambda_{\max}(M)/\eta^2\big)N(c_{\max}+1). \]
In particular, for $K$ satisfying (55), we have
\[ \inf_{\theta\in\mathbb{R}^D}\lambda_{\min}\big(-\nabla^2\tilde\ell_N(\theta)\big)\ \ge\ \min\Big\{\frac{Nc_{\min}}{2},\frac{K\lambda_{\min}(M)}{3}\Big\}=\frac{Nc_{\min}}{2}, \]
which completes the proof of (56). To prove (57), we use (185), (191) and (55) to obtain that for all $\theta\neq\bar\theta\in\mathbb{R}^D$,
\[ \frac{\|\nabla\tilde\ell_N(\theta)-\nabla\tilde\ell_N(\bar\theta)\|_{\mathbb{R}^D}}{\|\theta-\bar\theta\|_{\mathbb{R}^D}}\ \le\ \sup_{\theta\in\mathbb{R}^D}\|\nabla^2\tilde\ell_N(\theta)\|_{op}\ \le\ c\|\alpha\|_{C^2}\big(\lambda_{\max}(M)/\eta^2\big)N(c_{\max}+1)+6K\lambda_{\max}(M)\ \le\ 7K\lambda_{\max}(M).\quad\square \]
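The closed-form derivatives (181) used throughout the preceding proof are easily validated against finite differences, which we record as an optional numerical sanity check (not used in the proofs):

```python
import numpy as np

# Finite-difference check of the gradient formula (181) for the ellipsoidal
# norm n(theta) = |theta| = sqrt(theta^T M theta); Lemma B.4, gradient part.
rng = np.random.default_rng(1)
D = 5
A = rng.standard_normal((D, D))
M = A @ A.T + np.eye(D)                      # positive definite matrix
n = lambda th: np.sqrt(th @ M @ th)

theta = rng.standard_normal(D)
grad_exact = M @ theta / n(theta)            # formula (181)
eps = 1e-6
grad_fd = np.array([(n(theta + eps * e) - n(theta - eps * e)) / (2 * eps)
                    for e in np.eye(D)])
print(np.max(np.abs(grad_exact - grad_fd)))  # agreement up to ~1e-9
```

B.3.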
Initialisation.
In this section we prove the existence of a polynomial time 'initialiser' $\theta_{init}=\theta_{init}(Z^{(N)})\in\mathbb{R}^D$ (that lies in the region $\mathcal{B}_{1/\log N}$ from (99) of strong log-concavity of the posterior measure with high $P^N_{\theta_0}$-probability, when $\alpha>6$).

Theorem B.6. Suppose $\theta_0\in h^\alpha(\mathcal{O})$ for some $\alpha>2+d/2$, $d\le 3$. Then there exists a measurable function $\theta_{init}\in\mathbb{R}^D$ of the data $Z^{(N)}$ from (20) and large enough $M'>0$ such that for all $N,D\in\mathbb{N}$ and some $\bar c>0$,
\[ P^N_{\theta_0}\Big(\|\theta_{init}-\theta_{0,D}\|_{\mathbb{R}^D}>M'N^{-(\alpha-2)/(2\alpha+d)}\Big)\ \lesssim\ e^{-\bar cN^{d/(2\alpha+d)}}. \]
Moreover $\theta_{init}$ is the output of a polynomial time algorithm involving $O(N^{b})$, $b>0$, iterations of gradient descent (each requiring a multiplication with a fixed $D'\times D'$ matrix, $D'\lesssim N^{d/(2\alpha+d)}$).

Proof. Step I.
To start, consider the wavelet frame (cid:8) φ l,r , ≤ r ≤ N l , l ∈ N (cid:9) , N l . ld , of L ( O ) constructed in Theorem 5.51 in [68]. Then for data arising from (19), choosing2 J ≃ N / (2 α + d ) = ( N δ N ) /d , δ N = N − α/ (2 α + d ) , n J ≡ X l ≤ J N l . Jd , and for multiscale vectors ( λ l,r ) ∈ R n J , define(193) ˆ λ = arg min λ ∈ R nJ N N X i =1 (cid:0) Y i − X l ≤ J,r λ l,r φ l,r ( X i ) (cid:1) + δ N k λ k h α , k λ k h α = X l,r lα λ l,r . Next we set ˆ u = ˆ u ( Z ( N ) ) = X l ≤ J,r ˆ λ l,r φ l,r , u f ,J = X l ≤ J,r λ ,l,r φ l,r , where the λ ,l,r ∈ h α +2 are frame coefficients of u f = G ( θ ) ∈ H α +2 furnished by Theorem 5.51 in[68] and the elliptic regularity estimate (176). In particular by the Sobolev embedding h α +2 ⊂ b α ∞∞ ( d <
4) and again Theorem 5.51 in [68] we can prove(194) k u f − u f ,J k L . k u f − u f ,J k ∞ . − Jα . δ N . We now apply a standard result from M estimation [70, 71], with empirical norms k u k N ) = 1 N N X i =1 u ( X i ) , conditional on the design X , . . . , X n , to obtain the following bound. Proposition B.7.
We have for α > d/ , all N and some constant c > , (195) P Nθ (cid:0) k ˆ u − u f k N ) + δ N k ˆ λ k h α > k u f − u f ,J k N ) + δ N k λ ,l,r k h α | ( X i ) Ni =1 (cid:1) ≤ e − cNδ N . Proof.
We apply Theorem 2.1 in [71]. We can bound the k · k ∞ and then also k · k ( N ) -metric entropyof the class of functions n u : u = X l ≤ J,r λ l,r φ l,r ; k λ k h α ≤ m o , m > , by the metric entropy of a ball of radius m in a H α -Sobolev space, which by (4.184) in [28] isof order H ( τ ) . ( m/τ ) d/α for every m >
0. Then arguing as in Section 3.1.1 in [71] (the onlynotational difference being that here d > (cid:3)
This implies in particular, using k u k ( N ) ≤ k u k ∞ , (194), λ ,l,r ∈ h α +2 and Theorem 5.51 in [68],that for some C, C ′ > P Nθ (cid:0) k ˆ u k H α > C (cid:1) ≤ P Nθ (cid:0) k ˆ λ k h α > C ′ (cid:1) ≤ exp {− cN δ N } . as well as(197) P Nθ (cid:0) k ˆ u − u f ,J k N ) > Cδ N (cid:1) ≤ exp {− cN δ N } . In Step IV below we establish the following restricted isometry type bound P Nθ (cid:12)(cid:12)(cid:12) k ˆ u − u f ,J k N ) k ˆ u − u f ,J k L − (cid:12)(cid:12)(cid:12) ≤ ! ≥ − c ′′ e − c ′ Nδ N (198)for some constants c ′ , c ′′ > P Nθ ≤ k ˆ u − u f ,J k N ) k ˆ u − u f ,J k L ≤ ! ≥ − c ′′ e − c ′ Nδ N . On the event A N in the last probability we can write, using again (194) and (197), for M largeenough, P Nθ (cid:0) k ˆ u − u f k L > M δ N (cid:1) ≤ P Nθ (cid:0) k ˆ u − u f ,J k L > ( M/ δ N (cid:1) ≤ P Nθ k ˆ u − u f ,J k L k ˆ u − u f ,J k N ) k ˆ u − u f ,J k N ) > ( M/ δ N , A N ! + c ′′ e − c ′ Nδ N ≤ P Nθ (cid:16) k ˆ u − u f ,J k N ) > ( M/ δ N (cid:17) + c ′′ e − c ′ Nδ N . e − cNδ N + e − c ′ Nδ N . ANGEVIN ALGORITHMS 63
Overall what precedes implies that we can find $M$ large enough such that for some constants $\bar c,\bar c'>0$,
\[ (199)\qquad P^N_{\theta_0}\Big(\|\hat u-u_{f_0}\|_{L^2}\le M\delta_N\ \text{and}\ \|\hat u\|_{H^\alpha}\le M\Big)\ \ge\ 1-\bar c'e^{-\bar cN\delta_N^2}. \]
Step II. By definition of the $\|\cdot\|_{h^\alpha}$-norm, the objective function minimised in (193) over $\mathbb{R}^{n_J}$ is $m$-strongly convex with convexity bound $m\ge 2\delta_N^2$. Moreover, noting that the sum-of-squares term $Q_N$ appearing in (193) satisfies
\[ \frac{\partial Q_N}{\partial\lambda_{l',r'}}(\lambda)=-\frac{2}{N}\sum_{i=1}^N\Big[Y_i-\sum_{l\le J,r}\lambda_{l,r}\phi_{l,r}(X_i)\Big]\phi_{l',r'}(X_i),\qquad l'\le J,\ 1\le r'\le N_{l'}, \]
we can deduce that the gradient of the objective function is globally Lipschitz with constant at most of order $O(2^{Jd})=O(N\delta_N^2)$, using standard properties of the wavelet frame from Definition 5.25 in [68]. Using (18), (96) and a standard tail inequality for $\chi^2$-random variables (Theorem 3.1.9 in [28]), one shows further that for some $\bar C>0$, with sufficiently high $P^N_{\theta_0}$-probability,
\[ Q_N(0)=\frac{1}{N}\sum_{i=1}^N\big(\varepsilon_i^2+2\varepsilon_iu_{f_0}(X_i)+u^2_{f_0}(X_i)\big)\ \le\ \bar C. \]
By Proposition A.2 and using the standard sequence norm inequality
\[ \|v\|_{h^\beta}\le 2^{J\beta}\|v\|_{\ell^2}\lesssim N^{\frac{\beta}{2\alpha+d}}\|v\|_{\ell^2},\qquad v\in\mathbb{R}^{n_J},\ \beta\ge 0, \]
we deduce that on the preceding events and for any fixed $p>0$ there exists $b>0$ such that the output $\lambda_{init}\in\mathbb{R}^{n_J}$ of $O(N^{b})$ iterations of gradient descent satisfies $\|\lambda_{init}-\hat\lambda\|_{h^\alpha}\le N^{-p}$. In particular we can choose $p$ such that, denoting
\[ u_{init}:=\sum_{l\le J,r}\lambda_{init,l,r}\phi_{l,r}, \]
we have that $\|\hat u-u_{init}\|_{H^\alpha}\lesssim\|\hat\lambda-\lambda_{init}\|_{h^\alpha}=o(\delta_N)$; hence by virtue of (199), we may restrict the rest of the proof to an event of sufficiently large probability where $u_{init}$ satisfies
\[ (200)\qquad \|u_{init}-u_{f_0}\|_{L^2}+\delta_N\|u_{init}\|_{H^\alpha}\ \le\ (2M+1)\delta_N. \]
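Step II is plain ridge-regularised least squares, so the gradient descent it invokes can be sketched in a few lines. The design matrix `Phi` and the diagonal weights encoding $\|\cdot\|_{h^\alpha}$ are assumptions of the illustration (they depend on the chosen wavelet frame) and are not objects constructed here.

```python
import numpy as np

# Sketch of Step II: gradient descent for the strongly convex objective (193).
# Phi is the N x n_J matrix Phi[i, j] = phi_j(X_i), and `weights` encodes
# ||lambda||_{h^alpha}^2 = sum_j weights[j] * lambda[j]**2 (both hypothetical).
def init_estimator(Y, Phi, weights, delta_N, n_iter):
    N, nJ = Phi.shape
    lam = np.zeros(nJ)
    # Lipschitz constant of the gradient: quadratic term plus ridge term.
    L = 2 * np.linalg.norm(Phi, 2) ** 2 / N + 2 * delta_N**2 * weights.max()
    for _ in range(n_iter):
        grad = -2 / N * Phi.T @ (Y - Phi @ lam) + 2 * delta_N**2 * weights * lam
        lam = lam - grad / L                 # Proposition A.2 step size
    return lam   # lambda_init; u_init = sum_j lam[j] * phi_j
```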
Step III. From the interpolation inequality for Sobolev norms from Section 1.3 and (200) we now obtain, with sufficiently high $P^N_{\theta_0}$-probability,
\[ (201)\qquad \|u_{init}-u_{f_0}\|_{H^2}\ \le\ \bar MN^{-(\alpha-2)/(2\alpha+d)}, \]
and the Sobolev imbedding ($d<4$) further implies $\|u_{init}-u_{f_0}\|_\infty\to 0$ as $N\to\infty$, so that we deduce from (119) that $u_{init}\ge u_{f_0}/2\ge c>0$ on $\mathcal{O}$, with sufficiently high $P^N_{\theta_0}$-probability. So on these events we can define a new estimator
\[ (202)\qquad f_{init}=\frac{\Delta u_{init}}{2u_{init}},\qquad\text{noting that}\quad f_0=\frac{\Delta u_{f_0}}{2u_{f_0}}. \]
For $F_{init}=\Phi^{-1}\circ f_{init}$, using also the regularity of the inverse link function (179), we then see
\[ \|F_{init}-F_{\theta_0}\|_{L^2}\ \lesssim\ \|f_{init}-f_0\|_{L^2}\ \lesssim\ \|u_{init}-u_{f_0}\|_{H^2}, \]
and hence for some $M'>0$,
\[ P^N_{\theta_0}\Big(\|F_{init}-F_{\theta_0}\|_{L^2}\le M'N^{-(\alpha-2)/(2\alpha+d)}\Big)\ \ge\ 1-\bar c'e^{-\bar cN\delta_N^2}. \]
We finally define $\theta_{init}$ as $\theta_{init}=(\langle F_{init},e_k\rangle_{L^2}:k\le D)\in\mathbb{R}^D$, the vector of the first $D$ 'Fourier coefficients' of $F_{init}$. Then we obtain from Parseval's identity that $\|\theta_{init}-\theta_{0,D}\|_{\mathbb{R}^D}\le\|F_{init}-F_{\theta_0}\|_{L^2}$, which combined with the last probability inequality establishes the convergence rate desired in Theorem B.6.

Step IV. Proof of (198).
Let us introduce the symmetric n J × n J , n J . Jd , matricesˆΓ ( l,r ) , ( l ′ ,r ′ ) = 1 N N X i =1 φ l,r ( X i ) φ l ′ ,r ′ ( X i ) , Γ ( l,r ) , ( l ′ ,r ′ ) = Z O φ l,r ( x ) φ l ′ ,r ′ ( x ) dP X ( x ) , and vectors (ˆ λ = ˆ λ l,r ) , ( λ = λ ,l,r ) ∈ R n J . Then we can write k ˆ u − u f ,J k N ) − k ˆ u − u f ,J k L ( O ) = (ˆ λ − λ ) T (ˆΓ − Γ)(ˆ λ − λ )and hence (one minus the) probability relevant in (198) can be bounded asPr (cid:12)(cid:12)(cid:12) (ˆ λ − λ ) T (ˆΓ − Γ)(ˆ λ − λ )(ˆ λ − λ ) T Γ(ˆ λ − λ ) (cid:12)(cid:12)(cid:12) > / ! ≤ Pr sup v ∈ R nJ : v T Γ v ≤ (cid:12)(cid:12) v T (ˆΓ − Γ) v (cid:12)(cid:12) > / ! . We also note that by the frame property of the { φ l,r } , specifically from (5.252) in [68] with s =0 , p = q = 2, for any u v = P l ≤ J,r v l,r φ l,r we have the norm equivalence(203) k v k R nJ ≃ k u v k L = X l,l ′ ≤ J,r,r ′ v l,r v l ′ ,r ′ Γ ( l,r ) , ( l ′ ,r ′ ) = v T Γ v =: k v k , with the constants implied by ≃ independent of J . Next for any κ > { v m , m = 1 , . . . , M J,κ } , M J,κ . (3 /κ ) n J denote the centres of balls of k · k Γ -radius κ covering the unit ball V Γ of ( R n J , k · k Γ ) (e.g., as inProp. 4.3.34 in [28] and using (203)). Then using the Cauchy-Schwarz inequality | v T (ˆΓ − Γ) v | = | ( v − v m + v m ) T (ˆΓ − Γ)( v − v m + v m ) |≤ k v − v m k sup v ∈ V Γ | v T (ˆΓ − Γ) v (cid:12)(cid:12) + 2 k v − v m k Γ k (ˆΓ − Γ) v k Γ + | v Tm (ˆΓ − Γ) v m |≤ ( κ + 2 κ ) sup v ∈ V Γ | v T (ˆΓ − Γ) v (cid:12)(cid:12) + | v Tm (ˆΓ − Γ) v m | so choosing κ small enough so that κ + 2 κ < / v ∈ V Γ | v T (ˆΓ − Γ) v (cid:12)(cid:12) ≤ (4 /
3) max m =1 ,...,M J | v Tm (ˆΓ − Γ) v m | , M J ≡ M J,κ . In particular, using also that M J . e c Jd ≤ e c Nδ N , the last probability is thus bounded by(205) Pr (cid:16) max m =1 ,...,M J | v Tm (ˆΓ − Γ) v m | > / (cid:17) ≤ e c Nδ N max m Pr (cid:16) | v Tm (ˆΓ − Γ) v m | > / (cid:17) . Each of the last probabilities can be bounded by Bernstein’s inequality (Prop. 3.1.7 in [28]) appliedto v Tm (ˆΓ − Γ) v m = 1 N N X i =1 Z i − EZ i , ANGEVIN ALGORITHMS 65 with i.i.d. variables Z i = Z i,m given by(206) Z i = X l,l ′ ≤ J,r,r ′ v m,l,r v m,l ′ ,r ′ φ l,r ( X i ) φ l ′ ,r ′ ( X i ) = X l ≤ J,r v m,l,r φ l,r ( X i ) X l ′ ≤ J,r ′ v m,l ′ ,r ′ φ l ′ ,r ′ ( X i ) , wit vectors v m all satisfying k v m k Γ ≤
1. For these variables we have from the Cauchy-Schwarzinequality | Z i | ≤ (cid:12)(cid:12)(cid:12) X l ≤ J,r v m,l,r φ l,r ( · ) (cid:12)(cid:12)(cid:12) ≤ k v m k R nJ X l ≤ J,r ( φ l,r ( · )) ≤ c Jd ≡ U where the constant c depends only on the wavelet frame (cf. (203) and also Definition 5.25 in [68]).Similarly, using the previous estimate, we can bound EZ i = E h X l ≤ J,r v m,l,r φ l,r ( X i ) i ≤ U Z O h X l ≤ J,r v m,l,r φ l,r ( x ) i dx = U k v m k ≤ U. Now Proposition 3.1.7 in [28] implies for some constant c > (cid:16) N | v m (ˆΓ − Γ) v m | > N/ (cid:17) ≤ n − N / N U + (2 / N U o ≤ e − c /δ N since U = c Jd ≃ N δ N . Now since α > d/ δ N = o (1 / √ N ) and thus (1 /δ N ) ≫ N δ N which means that the r.h.s in (205) is bounded by a constant multiple of e − c ′ Nδ N for some c ′ > (cid:3) References [1] K. Abraham and R. Nickl. On statistical Cald´eron problems. Mathematical Statistics andLearning, (2):165–216, 2019.[2] S. Arridge, P. Maass, O. ¨Oktem, and C.-B. Sch¨onlieb. Solving inverse problems using data-driven models. Acta Numer., 28:1–174, 2019.[3] G. Bal and G. Uhlmann. Inverse diffusion theory of photoacoustics. Inverse Problems,26(8):085010, 20, 2010.[4] A. Belloni and V. Chernozhukov. On the computational complexity of MCMC-based estimatorsin large samples. Annals of Statistics, 37, 2009.[5] T. Bengtsson, P. Bickel, and B. Li. Curse-of-dimensionality revisited: collapse of the particlefilter in very large scale systems. In Probability and statistics: essays in honor of David A.Freedman, volume 2 of Inst. Math. Stat. (IMS) Collect., pages 316–334. 2008.[6] P. Bickel, B. Li, and T. Bengtsson. Sharp failure rates for the bootstrap particle filter in highdimensions. In Pushing the limits of contemporary statistics: contributions in honor of JayantaK. Ghosh, volume 3 of Inst. Math. Stat. (IMS) Collect., pages 318–329. 2008.[7] C. Borell. The Brunn-Minkowski inequality in Gauss space. Invent. Math., 30(2):207–216,1975.[8] S. Boyd and L. Vandenberge. Convex optimization. Cambridge University Press, 2004.[9] O. Capp´e, E. Moulines, and T. Ryd´en. Inference in hidden Markov models. Springer Series inStatistics. Springer, New York, 2005.[10] I. Castillo and R. Nickl. Nonparametric Bernstein–von Mises Theorems in Gaussian whitenoise. Ann. Statist., 41(4):1999–2028, 2013. [11] I. Castillo and R. Nickl. On the Bernstein–von Mises phenomenon for nonparametric Bayesprocedures. Ann. Statist., 42(5):1941–1969, 2014.[12] I. Castillo and J. Rousseau. A Bernstein–von Mises theorem for smooth functionals in semi-parametric models. Ann. Statist., 43(6):2353–2383, 2015.[13] K. L. Chung and Z. X. Zhao. From Brownian motion to Schr¨odinger’s equation. Springer-Verlag, Berlin, 1995.[14] S. Cotter, G. Roberts, A. Stuart, and D. White. MCMC methods for functions: Modifying oldalgorithms to make them faster. Statistical Science, 28(3):424–446, 2013.[15] S. L. Cotter, M. Dashti, J. C. Robinson, and A. M. Stuart. Bayesian inverse problems forfunctions and applications to fluid mechanics. Inverse Problems, 25(11):115008, 43, 2009.[16] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concavedensities. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79(3):651–676, 2017.[17] M. Dashti and A. M. Stuart. Uncertainty quantification and weak approximation of an ellipticinverse problem. SIAM J. Numer. Anal., 49(6):2524–2542, 2011.[18] M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In: Handbook ofUncertainty Quantification, Editors R. Ghanem, D. 
[19] S. Dirksen. Tail bounds via generic chaining. Electron. J. Probab., 20: no. 53, 29 pp., 2015.
[20] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
[21] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25, 2019.
[22] A. Eberle. Reflection couplings and contraction rates for diffusions. Probab. Theory Related Fields, 166:851–886, 2015.
[23] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers Group, 1996.
[24] L. C. Evans. Partial Differential Equations. American Math. Soc., second edition, 2010.
[25] S. Ghosal. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. Journal of Multivariate Analysis, 74(1):49–68, 2000.
[26] S. Ghosal and A. W. van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, New York, 2017.
[27] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer-Verlag, Berlin–New York, 1998.
[28] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, New York, 2016.
[29] M. Giordano and R. Nickl. Consistency of Bayesian inference with Gaussian process priors in an elliptic inverse problem. Inverse Problems, 2020.
[30] H. Haario, M. Laine, M. Lehtinen, E. Saksman, and J. Tamminen. Markov chain Monte Carlo methods for high dimensional inversion in remote sensing. J. R. Stat. Soc. Ser. B Stat. Methodol., 66(3):591–607, 2004.
[31] M. Hairer, J. Mattingly, and M. Scheutzow. Asymptotic coupling and a general form of Harris' theorem with applications to stochastic delay equations. Probab. Theory Related Fields, 149:223–259, 2011.
[32] M. Hairer, A. Stuart, and S. Vollmer. Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions. Ann. Appl. Probab., 24(6):2455–2490, 2014.
[33] M. Hanke, A. Neubauer, and O. Scherzer. A convergence analysis of the Landweber iteration for nonlinear ill-posed problems. Numerische Mathematik, 1995.
[34] A. Hinrichs, E. Novak, M. Ullrich, and H. Woźniakowski. The curse of dimensionality for numerical integration of smooth functions. Math. Comp., 83(290):2853–2863, 2014.
[35] R. Holley and D. Stroock. Logarithmic Sobolev inequalities and stochastic Ising models. Journal of Statistical Physics, 46:1159–1194, 1987.
[36] J. Ilmavirta and F. Monard. Integral geometry on manifolds with boundary and applications. Inverse Problems, 2020, to appear.
[37] J. Kaipio, V. Kolehmainen, E. Somersalo, and M. Vauhkonen. Statistical inversion and Monte Carlo sampling methods in electrical impedance tomography. Inverse Problems, 16(5):1487–1522, 2000.
[38] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Number 160 in Applied Mathematical Sciences. Springer-Verlag, New York, 2004.
[39] B. Kaltenbacher, A. Neubauer, and O. Scherzer. Iterative Regularization Methods for Nonlinear Ill-Posed Problems. Radon Series on Computational and Applied Mathematics, 2008.
[40] A. Katchalov, Y. Kurylev, and M. Lassas. Inverse Boundary Spectral Problems. Chapman & Hall/CRC, Boca Raton, FL, 2001.
[41] P.-S. M. de Laplace. Théorie analytique des probabilités. Courcier, Paris, 1812.
[42] M. Lassas, E. Saksman, and S. Siltanen. Discretization-invariant Bayesian inversion and Besov space priors. Inverse Probl. Imaging, 3(1):87–122, 2009.
[43] L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York, 1986.
[44] W. V. Li and W. Linde. Approximation, metric entropy and small ball estimates for Gaussian measures. Ann. Probab., 27(3):1556–1578, 1999.
[45] J.-L. Lions and E. Magenes. Non-Homogeneous Boundary Value Problems and Applications. Vol. I. Springer-Verlag, New York–Heidelberg, 1972.
[46] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures Algorithms, 4(4):359–412, 1993.
[47] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures Algorithms, 30(3):307–358, 2007.
[48] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. Jordan. Sampling can be faster than optimization. Proc. Natl. Acad. Sci., 2019.
[49] A. J. Majda and J. Harlim. Filtering Complex Turbulent Systems. Cambridge University Press, Cambridge, 2012.
[50] F. Monard, R. Nickl, and G. P. Paternain. Efficient nonparametric Bayesian inference for X-ray transforms. Ann. Statist., 47(2):1113–1147, 2019.
[51] F. Monard, R. Nickl, and G. P. Paternain. Consistent inversion of noisy non-Abelian X-ray transforms. Comm. Pure Appl. Math., 2020, to appear.
[52] J. L. Mueller and S. Siltanen. Linear and Nonlinear Inverse Problems with Practical Applications, volume 10. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2012.
[53] R. Nickl. Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger equation. J. Eur. Math. Soc., 22:2697–2750, 2020.
[54] R. Nickl and J. Söhl. Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions. Ann. Statist., 45(4):1664–1693, 2017.
[55] R. Nickl and J. Söhl. Bernstein–von Mises theorems for statistical inverse problems II: compound Poisson processes. Electron. J. Stat., 13(2):3513–3571, 2019.
[56] R. Nickl, S. van de Geer, and S. Wang. Convergence rates for penalised least squares estimators in PDE-constrained regression problems. SIAM/ASA J. Uncertain. Quantif., 8, 2020.