Theoretical guarantees for sampling and inference in generative models with latent diffusions
arXiv preprint (math.PR)
Belinda Tzen∗   Maxim Raginsky†

Abstract
We introduce and study a class of probabilistic generative models, where the latent object is a finite-dimensional diffusion process on a finite time interval and the observed variable is drawn conditionally on the terminal point of the diffusion. We make the following contributions:

• We provide a unified viewpoint on both sampling and variational inference in such generative models through the lens of stochastic control.

• We quantify the expressiveness of diffusion-based generative models. Specifically, we show that one can efficiently sample from a wide class of terminal target distributions by choosing the drift of the latent diffusion from the class of multilayer feedforward neural nets, with the accuracy of sampling measured by the Kullback–Leibler divergence to the target distribution.

• Finally, we present and analyze a scheme for unbiased simulation of generative models with latent diffusions and provide bounds on the variance of the resulting estimators. This scheme can be implemented as a deep generative model with a random number of layers.
1 Introduction

Recently there has been much interest in using continuous-time processes to analyze discrete-time algorithms and probabilistic models (Wibisono et al., 2016; Li et al., 2017; Mandt et al., 2017; Chen et al., 2018; Yang et al., 2018). In particular, diffusion processes have been examined as a way towards a better understanding of first- and second-order optimization methods, as they afford an analysis of behavior over non-convex landscapes using a rich array of techniques from the mathematical physics literature (Li et al., 2017; Raginsky et al., 2017; Zhang et al., 2017). Gradient flows and diffusions have also found a role in the analysis of deep neural nets, where they are interpreted as describing the limiting case of infinitely many layers, with each layer being 'infinitesimally thin' (e.g., Chen et al. (2018); Li et al. (2018)). As in the case of optimization, continuous-time frameworks enable the use of a different set of tools for studying standard questions of relevance, such as sampling and inference, i.e., forward and backward passes through the network.

In this work, we consider a class of generative models where the latent object X = {X_t}_{t ∈ [0,1]} is a d-dimensional diffusion and the observable object Y is a random element of some space Y:

    dX_t = b(X_t, t; θ) dt + dW_t,   X_0 = x_0   (1.1a)
    Y ∼ q(·|X_1)   (1.1b)

∗University of Illinois.
†University of Illinois.

Here, (1.1a) is a d-dimensional Itô diffusion process whose drift b(·, ·; θ) is a member of some parametric function class, such as multilayer feedforward neural nets, and (1.1b) prescribes an observation model for generating Y conditionally on X_1. To the best of our knowledge, generative models of this form were first considered by Movellan et al. (2002) as a noisy continuous-time counterpart of recurrent neural nets. More recently, Hashimoto et al. (2016) and Ryder et al.
(2018) investigated the use of discrete-time recurrent neural nets to approximate the population dynamics of biological systems that are classically modeled by diffusions. It is natural to view (1.1) as a continuum limit of deep generative models introduced by Rezende et al. (2014); in fact, as we explain in Section 4, one can simulate a model of the above form using a deep generative model with a random number of layers. Alternatively, one can think of (1.1) as a neural stochastic differential equation, in analogy to the neural ODE framework of Chen et al. (2018).

There are three main questions that are natural to ask concerning the usefulness of such models: How expressive can they be? How might one sample from such a diffusion process? How might one perform inference on it? As our first contribution, we provide a unified view of sampling and inference through the lens of stochastic control. In particular, by adding a control u_t to the drift of some reference diffusion, one can obtain a desired distribution at t = 1, and the minimal-cost control that yields exact sampling is given by the so-called Föllmer drift (Föllmer, 1985; Dai Pra, 1991; Lehec, 2013; Eldan and Lee, 2018). Complementarily, we show that any control u_t added to the drift b(·, t; θ) in (1.1a) leads to a variational upper bound on the negative log-likelihood of a given tuple of observations (y_1, ..., y_n). Variational inference then reduces to minimizing the expected control cost over a tractable class of controls. While we provide a unifying viewpoint that captures both sampling and inference, we emphasize that this is a synthesis of a number of existing results, and serves as a conceptual underpinning and motivation for our subsequent analysis.
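As a concrete illustration of (1.1), a latent diffusion with a neural-net drift can be simulated by Euler–Maruyama time-stepping. The two-layer tanh network, the step count, and the random parameters below are illustrative assumptions for the sketch, not choices made in the paper:

```python
import numpy as np

def neural_drift(x, t, W1, b1, W2, b2):
    """Illustrative two-layer tanh network b(x, t; theta): R^d x [0,1] -> R^d."""
    inp = np.concatenate([x, [t]])  # condition the drift on time as an extra input
    h = np.tanh(W1 @ inp + b1)
    return W2 @ h + b2

def sample_latent_diffusion(x0, theta, n_steps=100, rng=None):
    """Euler-Maruyama simulation of dX_t = b(X_t, t; theta) dt + dW_t on [0, 1]."""
    rng = rng or np.random.default_rng(0)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        x = x + neural_drift(x, t, *theta) * dt + np.sqrt(dt) * rng.standard_normal(x.shape)
    return x  # X_1, the terminal point fed to the observation model q(.|X_1)

d, hidden = 2, 16
rng = np.random.default_rng(1)
theta = (0.1 * rng.standard_normal((hidden, d + 1)), np.zeros(hidden),
         0.1 * rng.standard_normal((d, hidden)), np.zeros(d))
x1 = sample_latent_diffusion(np.zeros(d), theta)
print(x1.shape)  # (2,)
```

The observed variable Y would then be drawn from q(·|x1); for a Gaussian observation model this is just x1 plus noise.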
Specifically, after establishing that diffusion-based generative models can be effectively worked with, we explore their expressive power vis-à-vis neural nets: We show that, if the target density of X_1 can be efficiently approximated using a neural net, then the corresponding Föllmer drift can also be efficiently approximated by a neural net, such that the terminal law of the diffusion with this approximate drift is ε-close to the target density in Kullback–Leibler divergence. Finally, we investigate unbiased simulation methods for generative models with underlying diffusion processes and provide bounds on the variance of the resulting estimators.

To arrive at the unified perspective of sampling and inference, we begin by formulating a stochastic control problem that captures all of our desiderata: sampling from a target probability law μ at terminal time t = 1; a set of tractable controls that might be used to take it there; and an appropriate notion of cost that captures both the 'control effort' and the terminal cost quantifying the discrepancy between the final probability law and the target measure μ.

Our first result, stated in Theorem 2.1, is an explicit characterization of the value function of this control problem, which has a free-energy interpretation and can be understood from an information-theoretic viewpoint: the Kullback–Leibler divergence between the law of the path of the controlled diffusion and that of the path of the uncontrolled diffusion is the expected total work done by the control. The negative free energy with respect to the uncontrolled process is a lower bound on that of the controlled process after accounting for the work done, and equality is achieved by the optimal control.
As pointed out above, this result is a synthesis of a number of existing results, and its main purpose is to motivate the use of controlled diffusions in probabilistic generative modeling.

We next examine the expressiveness of these generative models, which refers to their ability to generate samples from a given target distribution for X_1 when the observation model q(·|·) in (1.1b) is fixed. In Theorem 3.1, we provide quantitative guarantees for obtaining approximate samples from a given target distribution μ for X_1 when the drift b in (1.1a) is restricted to be a multilayer feedforward neural net. Specifically, we show that, if the density f of μ with respect to the standard Gaussian measure on R^d can be efficiently approximated by a feedforward neural net, then the corresponding Föllmer drift can also be approximated efficiently by a neural net. Moreover, this approximate Föllmer drift yields a diffusion {X̂_t}, such that μ̂ = Law(X̂_1) satisfies D(μ ‖ μ̂) ≤ ε for a given accuracy ε > 0. Under some assumptions on the smoothness of f and ∇f and on their uniform approximability by neural nets, the proof proceeds as follows: First, we show that the Föllmer drift can be approximated by a neural net uniformly over a given compact subset of R^d and for all t ∈ [0,1]. Then, to show that the terminal distribution resulting from this approximation is ε-close to μ in KL divergence, we use Girsanov's theorem to relate D(μ ‖ μ̂) to the expected squared error between the Föllmer drift and its neural-net approximation.

Finally, we discuss the issue of unbiased simulation with the goal of estimating expected values of functions of X_1. The standard Euler–Maruyama scheme (Graham and Talay, 2013, Chap. 7) is straightforward, but produces a biased estimator.
Typically, one uses Monte Carlo sampling to reduce the variance; if the estimator is biased, then the variance will be reduced by a factor of N^{−δ} for some δ ∈ (0, 1), instead of the optimal reduction by the factor of N^{−1}, for N Monte Carlo runs. One way to obtain an unbiased estimator is to employ a random discretization of the time interval [0, 1], where the sampling times are generated by a point process on the real line. Unbiased simulation schemes of this type have been proposed and analyzed by Bally and Kohatsu-Higa (2015), Andersson and Kohatsu-Higa (2017), and Henry-Labordère et al. (2017). Our final result, Theorem 4.1, builds on the latter work and presents an unbiased, finite-variance simulation scheme. Conceptually, the simulation scheme can be thought of as a deep latent Gaussian model in the sense of Rezende et al. (2014), but with a random number of layers. Unfortunately, the variance of the resulting estimator can exhibit exponential dependence on dimension. We show why this is the case via an analysis of the moment-generating function of the point process used to generate the random mesh and propose alternatives to reduce the variance.

Notation. The Euclidean norm of a vector x ∈ R^d will be denoted by ‖x‖; the transpose of a vector or a matrix will be indicated by (·)^T. The d-dimensional Euclidean ball of radius R centered at the origin will be denoted by B^d(R). The standard Gaussian measure on R^d will be denoted by γ_d. The Euclidean heat semigroup Q_t, t ≥ 0, acts on measurable functions f: R^d → R as follows:

    Q_t f(x) := ∫_{R^d} f(x + √t z) γ_d(dz) = E[f(x + √t Z)],  Z ∼ γ_d.   (1.2)

A function g: R^d × [0,1] → R is of class C^{2,1} if it is twice continuously differentiable in the space variable x ∈ R^d and once continuously differentiable in the time variable t ∈ [0,1].
Before addressing the specific questions posed in the Introduction, we aim to demonstrate that both sampling and variational inference in generative models of the form (1.1) can be viewed through the lens of stochastic control. We give a brief description of the relevant ideas in Appendix A; the book by Fleming and Rishel (1975) is an excellent and readable reference.

2.1 A stochastic control problem
Let (Ω, F, {F_t}, P) be a probability space with a complete and right-continuous filtration {F_t}, and let W = {W_t} be a standard d-dimensional Brownian motion adapted to {F_t}. Consider the Itô diffusion process

    dX_t = b(X_t, t) dt + dW_t,  t ∈ [0,1],  X_0 = x_0   (2.1)

where the drift b: R^d × [0,1] → R^d is sufficiently well-behaved (say, bounded and Lipschitz). Then the process {X_t} admits a transition density, i.e., a family of functions p_{s,t}: R^d × R^d → R_+ for all 0 ≤ s < t ≤ 1, such that, for all points x, y ∈ R^d and all Borel sets A ⊆ R^d,

    P[X_t ∈ A | X_s = x] = ∫_A p_{s,t}(x, y) dy   (2.2)

(see, e.g., Protter (2005, Chap. V)).

Consider the following stochastic control problem: Let U be the set of controls, i.e., measurable functions u: R^d × [0,1] → R^d. Any u ∈ U defines a diffusion process X^u = {X^u_t}_{t ∈ [0,1]} by

    dX^u_t = ( b(X^u_t, t) + u(X^u_t, t) ) dt + dW_t,  t ∈ [0,1],  X^u_0 = x_0.   (2.3)

We say that X^u is a diffusion controlled by u. Let a function g: R^d → (0, ∞) be given. For each u ∈ U, we define the family of cost-to-go functions

    J^u(x, t) := E[ (1/2) ∫_t^1 ‖u_s‖² ds − log g(X^u_1) | X^u_t = x ],  x ∈ R^d, t ∈ [0,1],   (2.4)

where u_s is shorthand for u(X^u_s, s). The value function v: R^d × [0,1] → R is defined by

    v(x, t) := inf_{u ∈ U} J^u(x, t),   (2.5)

and we say that a control u* ∈ U is optimal if J^{u*}(x, t) = v(x, t) for all x and t. The following theorem is, essentially, a synthesis of the results of Pavon (1989) and Dai Pra (1991):

Theorem 2.1.
Consider the control problem (2.4). The value function v is given by

    v(x, t) = −log E[g(X_1) | X_t = x],   (2.6)

where the conditional expectation is with respect to the uncontrolled diffusion process (2.1). Moreover, the optimal control u* is given by u*(x, t) = −∇v(x, t), where the gradient is taken with respect to the space variable x ∈ R^d, and the corresponding controlled diffusion {X*_t} = {X^{u*}_t} has the transition density

    p*_{s,t}(x, y) = p_{s,t}(x, y) exp( v(x, s) − v(y, t) ),   (2.7)

where p_{s,t}(·,·) is the transition density (2.2) of the uncontrolled process.

This result, proved in Appendix A, also admits an information-theoretic interpretation. Let P denote the probability law of the path X_{[0,1]} of the uncontrolled diffusion process (2.1) and let P^u denote the corresponding object for the controlled diffusion (2.3). Since X and X^u differ from each other by a change of drift, the probability measures P^u and P are mutually absolutely continuous, and the Radon–Nikodym derivative dP^u/dP is given by the Girsanov formula (Protter, 2005)

    dP^u/dP = exp( ∫_0^1 u_t^T dW_t + (1/2) ∫_0^1 ‖u_t‖² dt ),   (2.8)

where u_t^T dW_t := Σ_{i=1}^d u_{i,t} dW_{i,t}, with u_{i,·} and W_{i,·} denoting the ith coordinates of u and W, respectively. From (2.8), we can calculate the Kullback–Leibler divergence between P^u and P:

    D(P^u ‖ P) = E_{P^u}[ log dP^u/dP ] = (1/2) E[ ∫_0^1 ‖u_t‖² dt ].   (2.9)

Therefore, by Theorem 2.1, for any control u ∈ U, we can write

    −log E[g(X_1) | X_0 = x_0] ≤ D(P^u ‖ P) − E[log g(X^u_1) | X^u_0 = x_0],   (2.10)

with equality if and only if u = u*. An inequality of this form holds more generally for real-valued measurable functions of the entire path X_{[0,1]} (Boué and Dupuis, 1998).

We will now demonstrate how both the problem of sampling and the problem of variational inference can be addressed via the above theorem.
2.2 Sampling

Recall that, in the context of exact sampling, the objective is to construct a diffusion process {X_t}_{t ∈ [0,1]}, such that X_1 has a given target distribution μ. We will consider the case when μ is absolutely continuous with respect to the standard Gaussian measure γ_d, and let f denote the Radon–Nikodym derivative dμ/dγ_d. This problem goes back to a paper of Schrödinger (1931); for rigorous treatments, see, e.g., Jamison (1975), Föllmer (1985), Dai Pra (1991), Lehec (2013), Eldan and Lee (2018). The derivation we give below is not new (see, e.g., Dai Pra (1991, Thm. 3.1)), but the route we take is somewhat different in that we make the stochastic control aspect more explicit.

We take b(x, t) ≡ 0 and x_0 = 0 in (2.1). Then the diffusion process {X_t} is simply the standard d-dimensional Brownian motion {W_t}, which has the Gaussian transition density

    p_{s,t}(x, y) = (2π(t − s))^{−d/2} exp( −‖x − y‖² / (2(t − s)) ).

Now consider the control problem (2.4) with g = f. By Theorem 2.1, the value function v is given by v(x, t) = −log E[f(W_1) | W_t = x], and can be computed explicitly. For 0 ≤ t < 1, we have

    e^{−v(x,t)} = E[f(W_1) | W_t = x]
               = (2π(1 − t))^{−d/2} ∫_{R^d} f(y) exp( −‖x − y‖² / (2(1 − t)) ) dy
               = Q_{1−t} f(x),

where Q denotes the Euclidean heat semigroup (1.2). Hence, v(x, t) = −log Q_{1−t} f(x), and the optimal diffusion process {X*_t} has the drift u*(x, t) = −∇v(x, t) = ∇ log Q_{1−t} f(x). Following Lehec (2013) and Eldan and Lee (2018), we will refer to u* as the Föllmer drift in the sequel.

It remains to show that X*_1 ∼ μ. Using the formula (2.7) for the transition density of X* together with the facts that e^{−v(y,1)} = f(y) and e^{−v(0,0)} = E[f(W_1)] = ∫ f dγ_d = 1, we see that p*_{0,1}(0, y) dy = f(y) γ_d(dy). Then, for any Borel set A ⊆ R^d,

    P[X*_1 ∈ A] = ∫_A p*_{0,1}(0, y) dy = ∫_A f(y) γ_d(dy) = μ(A).
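Since the Föllmer drift involves only the heat semigroup acting on f, it can be estimated by plain Monte Carlo. The one-dimensional sketch below is an illustrative check, not part of the paper's construction: it takes μ = N(m, 1), for which f(y) = exp(my − m²/2) and the Föllmer drift is the constant m, and estimates ∇ log Q_{1−t} f(x) by a central difference of the Monte Carlo heat-semigroup average.

```python
import numpy as np

def follmer_drift_mc(f, x, t, n=200_000, rng=None):
    """Monte Carlo estimate of the Follmer drift u*(x, t) = d/dx log Q_{1-t} f(x),
    where Q_s f(x) = E[f(x + sqrt(s) Z)], Z ~ N(0, 1); one-dimensional sketch
    using a central difference of log Q_{1-t} f with common random numbers."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(n)          # shared Gaussian samples
    s = 1.0 - t
    eps = 1e-4
    q = lambda y: np.mean(f(y + np.sqrt(s) * z))   # MC estimate of Q_{1-t} f(y)
    return (np.log(q(x + eps)) - np.log(q(x - eps))) / (2 * eps)

# Sanity check: for mu = N(m, 1), f = d mu / d gamma_1 = exp(m y - m^2 / 2),
# so Q_{1-t} f(x) = exp(m x + const) and the Follmer drift equals m everywhere.
m = 1.5
f = lambda y: np.exp(m * y - m**2 / 2)
u = follmer_drift_mc(f, x=0.3, t=0.4)
print(round(u, 2))  # 1.5
```

With common random numbers the log-mean-exp term cancels exactly in the central difference, so the estimate of the constant drift is essentially exact here.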
Moreover, using the entropy inequality (2.10), we can show that the Föllmer drift is optimal in the following strong sense: Consider any control u ∈ U with X^u_0 = 0 and with the property that Law(X^u_1) = μ. For any such control,

    E[log f(X^u_1) | X^u_0 = 0] = ∫_{R^d} log f dμ = ∫_{R^d} log (dμ/dγ_d) dμ = D(μ ‖ γ_d),

while clearly log E[f(W_1)] = 0. Therefore, it follows from (2.10) that, for any such control u,

    D(P^u ‖ P) = (1/2) E[ ∫_0^1 ‖u_t‖² dt ] ≥ D(μ ‖ γ_d),

with equality if and only if u = u*. Thus, the Föllmer drift has the minimal 'energy' among all admissible controls that induce the distribution μ at t = 1, and this energy is precisely the Kullback–Leibler divergence between μ and the standard Gaussian measure γ_d (Dai Pra, 1991; Lehec, 2013; Eldan and Lee, 2018).

2.3 Variational inference

We now turn to the problem of variational inference. We are given an n-tuple of observations y = (y_1, ..., y_n) ∈ Y^n, and wish to upper-bound the negative log-likelihood

    L_n(y; θ) := (1/n) Σ_{i=1}^n L(y_i; θ),  where  L(y; θ) := −log E[q(y | X_1)]

and {X_t} is the diffusion process (1.1). We take b = b(·, ·; θ) in (2.1) and consider the control problem (2.4) with g(x) = q(y | x) for some fixed y ∈ Y. Then, by Theorem 2.1, any control u ∈ U gives rise to an upper bound on L(y; θ):

    L(y; θ) ≤ E[ (1/2) ∫_0^1 ‖u_t‖² dt − log q(y | X^u_1) | X^u_0 = x_0 ] =: F^u(y; θ),

where the quantity on the right-hand side can be thought of as the variational free energy that depends on the choice of the control u, and equality is achieved when u = u*. While the structure of the optimal control u* is described in Theorem 2.1, it may not be possible to derive it in closed form. However, we can fix a class Ũ ⊂ U of tractable suboptimal controls and upper-bound L(y; θ) by inf_{u ∈ Ũ} F^u(y; θ). For example, we can take Ũ to consist of all controls of the form u(x, t) = φ − b(x, t; θ) for some φ ∈ R^d.
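For this constant-shift class, the free energy can be estimated by Monte Carlo and minimized over φ. The sketch below makes illustrative assumptions not in the paper: b ≡ 0, x_0 = 0, and a Gaussian observation model q(y|x) = N(y; x, 1), for which the optimal shift is φ = y/2 in closed form.

```python
import numpy as np

def free_energy(phi, y, n_paths=20_000, rng=None):
    """Monte Carlo estimate of F^u(y) for the constant control u(x, t) = phi
    (taking b = 0, x0 = 0, and q(y|x) = N(y; x, 1)).  Under this control
    X^u_t = t*phi + W_t, and since phi is constant the control cost
    (1/2) * integral of phi^2 over [0,1] is just phi^2 / 2."""
    rng = rng or np.random.default_rng(0)
    w1 = rng.standard_normal(n_paths)                 # W_1 ~ N(0, 1)
    nll = 0.5 * (y - phi - w1) ** 2 + 0.5 * np.log(2 * np.pi)  # -log q(y | phi + W_1)
    return 0.5 * phi ** 2 + np.mean(nll)

# Grid search over the one-parameter control class; the minimizer is y/2.
y = 2.0
phis = np.linspace(-1.0, 3.0, 81)
best = phis[np.argmin([free_energy(p, y) for p in phis])]
print(round(best, 1))  # 1.0, i.e. y/2
```

Reusing the same random seed across φ values makes the grid search deterministic, a standard common-random-numbers trick.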
In that case, X^u_t is the sum of the Brownian motion W_t and the affine drift x_0 + tφ, and consequently

    F^u(y; θ) = E[ (1/2) ∫_0^1 ‖φ − b(x_0 + tφ + W_t, t; θ)‖² dt − log q(y | x_0 + φ + W_1) ],

where the expectation is taken with respect to the standard Brownian motion W. Another possibility is to consider controls of the form u(x, t) = Ax − b(x, t; θ), for some A ∈ R^{d×d}. The corresponding controlled diffusion is the Ornstein–Uhlenbeck process X^u_t = e^{At} x_0 + ∫_0^t e^{A(t−s)} dW_s, and the variational free energy can be minimized over A ∈ R^{d×d}.

3 Expressiveness
Now that we have shown that generative models of the form (1.1) allow for both sampling and variational inference, we turn to the analysis of their expressiveness. Specifically, our objective is to show that, by working with a suitably structured class of drifts b(·, ·; θ), we can achieve approximate sampling from a rich class of distributions at the terminal time t = 1.

Let μ be the target probability measure for X_1. We assume that μ is absolutely continuous with respect to γ_d and let f denote the Radon–Nikodym derivative dμ/dγ_d. From Section 2.2 we know that the diffusion process governed by the Itô SDE

    dX_t = b(X_t, t) dt + dW_t,  X_0 = 0   (3.1)

with the Föllmer drift b(x, t) = ∇ log Q_{1−t} f(x) has the property that μ = Law(X_1), and, moreover, it is optimal in the sense that it minimizes the 'energy' (1/2) ∫_0^1 E‖u_t‖² dt among all adapted drifts {u_t} that result in distribution μ at time t = 1. The main result of this section is as follows: If the Radon–Nikodym derivative f can be approximated efficiently by multilayer feedforward neural nets, then, for any ε > 0, there exists a drift b̂(x, t) = b̂(x, t; θ) that can be implemented exactly by a neural net whose parameters θ do not depend on time or space, and the terminal law μ̂ := Law(X̂_1) of the diffusion process

    dX̂_t = b̂(X̂_t, t) dt + dW_t,  X̂_0 = 0   (3.2)

is an ε-approximation to μ in KL divergence: D(μ ‖ μ̂) ≤ ε. Moreover, the size of the neural net that implements the approximate Föllmer drift b̂ can be estimated explicitly in terms of the size of a suitable approximating neural net for f.

We begin by imposing some assumptions on f. The first assumption is needed to guarantee enough regularity for the Föllmer drift:

Assumption 3.1.
The function f is differentiable, both f and ∇f are L-Lipschitz, and there exists a constant c ∈ (0, 1], such that f ≥ c everywhere.

This assumption is satisfied, for example, by Gibbs measures of the form μ(dx) = Z^{−1} e^{−‖x‖²/2 − F(x)} dx with a differentiable potential F: R^d → R_+, such that both F and ∇F are Lipschitz and F is bounded from above; see Appendix B for details.

Next, we introduce the assumptions pertaining to the approximability of f by neural nets. Let σ: R → R be a fixed nonlinearity. Given a vector w ∈ R^n and scalars α, β, define the function N^σ_{w,α,β}: R^n → R,

    N^σ_{w,α,β}(x) := α · σ( w^T x + β ).

For ℓ ≥ 1, we define the class N^σ_ℓ of ℓ-layer feedforward neural nets with activation function σ recursively as follows: N^σ_1 consists of all functions of the form x ↦ Σ_{i=1}^m N^σ_{w_i,α_i,β_i}(x) for all m ∈ N, w_1, ..., w_m ∈ R^d, α_1, ..., α_m, β_1, ..., β_m ∈ R, and, for each ℓ ≥ 1,

    N^σ_{ℓ+1} := ∪_{k ≥ 1} ∪_{m ≥ 1} { x ↦ Σ_{i=1}^m N^σ_{w_i,α_i,β_i}( h_1(x), ..., h_k(x) ) :
                  α_1, ..., α_m, β_1, ..., β_m ∈ R, w_1, ..., w_m ∈ R^k, h_1, ..., h_k ∈ N^σ_ℓ }.

An element of N^σ_ℓ is a function that represents computation by a directed acyclic graph, where each node receives inputs u_1, ..., u_k, performs a computation of the form (u_1, ..., u_k) ↦ σ(w_1 u_1 + ... + w_k u_k + β), and communicates the outcome of the computation to all the nodes in the next layer. We refer to ℓ as the depth of the neural net, and define the size of the neural net as the total number of nodes in its computation graph. We will denote by N^σ_{ℓ,s} the collection of all neural nets with depth ℓ and size s.
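To make the recursive definition concrete, here is a small sketch (with illustrative weights and tanh playing the role of σ) that builds an element of N^σ_2 by feeding two depth-1 nets into a second layer:

```python
import numpy as np

def unit(w, alpha, beta, sigma=np.tanh):
    """A single neuron N^sigma_{w,alpha,beta}(x) = alpha * sigma(w^T x + beta)."""
    return lambda x: alpha * sigma(np.dot(w, x) + beta)

def layer1(units):
    """An element of N^sigma_1: x -> sum_i N^sigma_{w_i,alpha_i,beta_i}(x)."""
    return lambda x: sum(u(x) for u in units)

def compose(units, hidden):
    """An element of N^sigma_{l+1}: the units act on (h_1(x), ..., h_k(x)),
    where h_1, ..., h_k belong to N^sigma_l."""
    return lambda x: sum(u(np.array([h(x) for h in hidden])) for u in units)

# A depth-2 net on R^2 built exactly per the recursive definition above.
rng = np.random.default_rng(0)
h1 = layer1([unit(rng.standard_normal(2), 1.0, 0.0) for _ in range(3)])
h2 = layer1([unit(rng.standard_normal(2), 0.5, 0.1) for _ in range(3)])
g = compose([unit(rng.standard_normal(2), 1.0, 0.0) for _ in range(4)], [h1, h2])
print(g(np.zeros(2)))
```

The total node count (here 3 + 3 + 4 = 10) is the "size" in the sense defined above.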
All these definitions extend straightforwardly to the case of neural nets with vector-valued output and to the case where each node may have a different activation function.

We assume that the activation function σ is differentiable and universal, in the sense that any univariate Lipschitz function which is nonconstant only on a bounded interval can be approximated arbitrarily well by an element of N^σ_1:

Assumption 3.2.
The activation function σ: R → R is differentiable. Moreover, there exists a constant c_σ > 0 depending only on σ, such that the following holds: For any L-Lipschitz function h: R → R which is constant outside the interval [−R, R] and for any δ > 0, there exist real numbers a, {(α_i, β_i, γ_i)}_{i=1}^m, where m ≤ c_σ RL/δ, such that the function

    h̃(x) = a + Σ_{i=1}^m α_i σ(β_i x + γ_i)   (3.3)

satisfies sup_{x ∈ R} |h̃(x) − h(x)| ≤ δ.

Remark 3.1.
Apart from differentiability, this is the same assumption made by Eldan and Shamir (2016). For example, it holds for differentiable sigmoidal activation functions, i.e., monotonic functions that satisfy lim_{u→−∞} σ(u) = a and lim_{u→+∞} σ(u) = b for some a ≠ b. The popular rectified linear unit (or ReLU) activation function u ↦ u ∨ 0 is universal in the above sense but not differentiable. However, we can replace it by the differentiable softplus function u ↦ (1/c) log(1 + e^{cu}), where increasing the value of c > 0 results in finer approximations to the ReLU. Also, note that the function h̃ differs from the elements of N^σ_1 by the presence of the constant term a. However, the constant function x ↦ a can be implemented by N^σ_{0, a/σ(z), z} for any z ∈ R such that σ(z) ≠ 0. Thus, we will refer to functions of the form (3.3) as one-layer neural networks of size m + 1.

We also make the following assumption regarding approximability of f by neural nets:

Assumption 3.3.
For any R > 0 and ε > 0, there exists a neural net f̂ ∈ N^σ_{ℓ,s} with ℓ, s ≤ poly(1/ε, d, L, R), such that

    sup_{x ∈ B^d(R)} |f(x) − f̂(x)| ≤ ε  and  sup_{x ∈ B^d(R)} ‖∇f(x) − ∇f̂(x)‖ ≤ ε.   (3.4)

Remark 3.2.
Typical results on neural net approximation are concerned with approximating a given function uniformly on a given compact set. By contrast, Assumption 3.3 requires uniform approximability of both f and its gradient ∇f on a compact set by some neural net f̂ and its gradient ∇f̂. Such simultaneous approximation guarantees can also be found in the literature; see, e.g., Hornik et al. (1990); Yukich et al. (1995); Li (1996). See Safran and Shamir (2017) for a discussion of various trade-offs between depth and width (maximum number of neurons per layer) in neural net approximation.

We are now in a position to state the main result of this section:

Theorem 3.1. Suppose Assumptions 3.1–3.3 are in force. Let L denote the maximum of the Lipschitz constants of f and ∇f. Then, for any 0 < ε < L/c, there exists a neural net v̂: R^d × [0,1] → R^d with size polynomial in 1/ε, d, L, c, 1/c, such that the activation function of each neuron is an element of the set {σ, σ′, ReLU}, and the following holds: If {X̂_t}_{t ∈ [0,1]} is the diffusion process governed by the Itô SDE

    dX̂_t = b̂(X̂_t, t) dt + dW_t,  X̂_0 = 0   (3.5)

with the drift b̂(x, t) = v̂(x, √(1 − t)), then μ̂ := Law(X̂_1) satisfies D(μ ‖ μ̂) ≤ ε.

The proof relies on three key steps: First, we show that the heat semigroup Q_t f(x) can be approximated by a finite sum of the form (1/N) Σ_{n ≤ N} f(x + √t z_n) uniformly for all x ∈ B^d(R) and all t ∈ [0,1], where z_1, ..., z_N ∈ R^d lie in a ball of radius O(√(d log N)). This result is stated in Appendix C and proved using empirical process methods. Next, replacing f with a suitable neural net approximation f̂, we build on this result to show that the Föllmer drift ∇ log Q_{1−t} f(x) can be approximated by a neural net using σ, σ′, and ReLU as activation functions.
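The first of these steps, the finite-sum approximation of Q_t f, is easy to check numerically. The sketch below uses an illustrative one-dimensional f with a closed-form Q_t f (f is the density of N(m, 1) with respect to γ_1); this is an assumption for the demonstration, not the paper's f.

```python
import numpy as np

def heat_semigroup_sum(f, x, t, N=100_000, rng=None):
    """Finite-sum approximation Q_t f(x) ~ (1/N) sum_n f(x + sqrt(t) z_n),
    with z_1, ..., z_N i.i.d. standard Gaussian, as in the first proof step."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(N)
    return np.mean(f(x + np.sqrt(t) * z))

# Check against the closed form: for f(y) = exp(m y - m^2 / 2), i.e.
# f = d N(m, 1) / d gamma_1, one has Q_t f(x) = exp(m x + m^2 (t - 1) / 2).
m, x, t = 0.8, 0.2, 0.5
approx = heat_semigroup_sum(lambda y: np.exp(m * y - m**2 / 2), x, t)
exact = np.exp(m * x + m**2 * (t - 1) / 2)
print(abs(approx - exact) < 0.01)  # True
```

The empirical-process argument in Appendix C strengthens this pointwise agreement to uniformity over x in a ball and t in [0,1] for a single draw of z_1, ..., z_N.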
This is the content of Theorem 3.2 below (the proof is given in Appendix D). The third step uses Girsanov theory to upper-bound the approximation error that results from replacing the Föllmer drift by this neural net.

Theorem 3.2.
Let 0 < ε < L/c and R > 0 be given. Then there exists a neural net v̂: R^d × [0,1] → R^d of size polynomial in 1/ε, d, L, R, c, 1/c, such that the activation function of each neuron is an element of the set {σ, σ′, ReLU}, and the following holds:

    sup_{x ∈ B^d(R)} sup_{t ∈ [0,1]} ‖ v̂(x, √t) − ∇ log Q_t f(x) ‖ ≤ ε
    and  max_{i ∈ [d]} sup_{x ∈ R^d} sup_{t ∈ [0,1]} | v̂_i(x, √t) | ≤ L/c.
We now complete the proof of Theorem 3.1. For any R > 0, Theorem 3.2 guarantees the existence of a neural net v̂: R^d × [0,1] → R^d that satisfies

    sup_{x ∈ B^d(R)} sup_{t ∈ [0,1]} ‖ v̂(x, √t) − ∇ log Q_t f(x) ‖ ≤ √ε   (3.6)

and

    max_{i ∈ [d]} sup_{x ∈ R^d} sup_{t ∈ [0,1]} | v̂_i(x, √t) | ≤ L/c.   (3.7)

Let P := Law(X_{[0,1]}) and P̂ := Law(X̂_{[0,1]}). The Girsanov formula gives

    D(P ‖ P̂) = (1/2) ∫_0^1 E‖ b(X_t, t) − b̂(X_t, t) ‖² dt,

where both b and b̂ are bounded, by Lemma B.1 in Appendix B and (3.7). We now proceed to estimate the integrand. For each t ∈ [0,1],

    E‖ b(X_t, t) − b̂(X_t, t) ‖² = E[ ‖ b(X_t, t) − b̂(X_t, t) ‖² 1{X_t ∈ B^d(R)} ]
                                 + E[ ‖ b(X_t, t) − b̂(X_t, t) ‖² 1{X_t ∉ B^d(R)} ]
                               =: T_1 + T_2,

where T_1 ≤ ε by (3.6). To estimate T_2, we first observe that, since the Föllmer drift is bounded in norm by L/c by Lemma B.1, we have

    P( sup_{t ∈ [0,1]} ‖X_t‖ ≥ R ) ≤ (√d + L/c)/R

(Bubeck et al., 2018, Lemma 3.8). Therefore,

    T_2 ≤ (4dL²/c²) · (√d + L/c)/R.
Choosing R large enough to guarantee T_2 ≤ ε and putting everything together, we obtain D(P ‖ P̂) ≤ ε. Therefore, D(μ ‖ μ̂) ≤ D(P ‖ P̂) ≤ ε by the data processing inequality.

4 Unbiased simulation

Now that we have shown that generative models with latent diffusions are capable of expressing a rich class of probability distributions, we turn to the problem of unbiased simulation. Specifically, given a function g: R^d → R, we wish to estimate the expectation E[g(X_1) | X_0 = x_0], where X = {X_t}_{t ∈ [0,1]} with X_0 = x_0 is a diffusion process of the form (1.1). The simplest approach is to use the Euler–Maruyama scheme: Fix a partition 0 = t_0 < t_1 < ... < t_n < t_{n+1} = 1 of [0,1] and define the Itô process {X̃_t}_{t ∈ [0,1]} by X̃_0 = x_0 and

    X̃_t = X̃_{t_i} + ∫_{t_i}^t b(X̃_{t_i}, t_i; θ) ds + ∫_{t_i}^t dW_s,  t ∈ (t_i, t_{i+1}],  i = 0, ..., n.   (4.1)

In particular, for each 1 ≤ i ≤ n + 1,

    X̃_{t_i} = X̃_{t_{i−1}} + b(X̃_{t_{i−1}}, t_{i−1}; θ)(t_i − t_{i−1}) + W_{t_i} − W_{t_{i−1}}.

We can then estimate the expectation E[g(X_1)] by g(X̃_{t_{n+1}}) ≡ g(X̃_1), but this estimate is biased: if g is, say, bounded, then

    | E[g(X_1)] − E[g(X̃_1)] | ≤ C_g(x_0) · max_{0 ≤ i ≤ n} (t_{i+1} − t_i),

where C_g(x_0) > 0 is some constant that depends on g and on the starting point x_0 (Graham and Talay, 2013). Recently, several authors (Bally and Kohatsu-Higa, 2015; Andersson and Kohatsu-Higa, 2017; Henry-Labordère et al., 2017) have studied unbiased simulation of SDEs using Euler–Maruyama schemes with random partitions, where the partition breakpoints are generated by a Poisson point process on the real line. In this section, we build on this line of work and present a scheme for unbiased simulation in the context of generative models of the form (1.1) that uses random partitions generated by arbitrary renewal processes (Kallenberg, 2002, Chap. 9) with sufficiently well-behaved densities of interrenewal times. Our analysis closely follows that of Henry-Labordère et al.
(2017), but we provide a more refined analysis of the variance of the resulting estimators.

We first describe the simulation procedure. In what follows, we will drop the index θ from the drift to keep the notation clean. Let τ_1, τ_2, ... be i.i.d. nonnegative random variables with an absolutely continuous distribution whose support contains the interval [0, ε] for some ε > 0. Let F_τ and f_τ denote the cdf and the pdf of τ_1. Let T_0 = 0 and

    T_k := ( Σ_{i=1}^k τ_i ) ∧ 1,  k ≥ 1,  and  N := max{ k : T_k < 1 }.

Define a process X̂ = {X̂_t}_{t ∈ [0,1]} with X̂_0 = x_0 as the Euler–Maruyama scheme (4.1) on the random partition 0 = T_0 < T_1 < ... < T_N < T_{N+1} ≡ 1 of [0,1], and let

    ψ̂ := 1/(1 − F_τ(1 − T_N)) · ( g(X̂_1) − g(X̂_{T_N}) 1{N > 0} ) · Π_{k=1}^N Ŵ_k / f_τ(T_k − T_{k−1}),   (4.2)

where

    Ŵ_k := ( b(X̂_{T_k}, T_k) − b(X̂_{T_{k−1}}, T_{k−1}) )^T ( W_{T_{k+1}} − W_{T_k} ) / (T_{k+1} − T_k).

This process can be interpreted as a deep generative model in the sense of Rezende et al. (2014), but with a random number of layers. Specifically, let ξ_1, ξ_2, ... be i.i.d. draws from γ_d, independent of {τ_i}, and define X̂^{(0)}, X̂^{(1)}, ..., X̂^{(N+1)} recursively by taking X̂^{(0)} = x_0 and

    X̂^{(k+1)} = X̂^{(k)} + b(X̂^{(k)}, T_k) · (T_{k+1} − T_k) + (T_{k+1} − T_k)^{1/2} ξ_{k+1},  k = 0, 1, ..., N.

Then

    ψ̂ =_d ( g(X̂^{(N+1)}) − g(X̂^{(N)}) 1{N > 0} ) / (1 − F_τ(1 − T_N))
           · Π_{k=1}^N ( b(X̂^{(k)}, T_k) − b(X̂^{(k−1)}, T_{k−1}) )^T ξ_{k+1} / ( f_τ(T_k − T_{k−1}) · (T_{k+1} − T_k)^{1/2} ),

where =_d denotes equality of probability distributions. We are now ready to state our main result on unbiased simulation (see Appendix E for the proof):

Theorem 4.1.
Suppose that the drift b(x, t) is uniformly bounded, Lipschitz in x, and 1/2-Hölder in t, i.e., for some constants b_∞ > 0 and L_b > 0,

    ‖b(x, t)‖ ≤ b_∞  and  ‖b(x, s) − b(y, t)‖ ≤ L_b ( ‖x − y‖ + |s − t|^{1/2} )   (4.3)

for all x, y ∈ R^d and all s, t ∈ [0,1]. Suppose also that

    f_τ(s) ≤ C e^{as},  s ∈ (0, 1],   (4.4)

for some constants C > 0 and a ≥ 0. Then, for any Lipschitz-continuous g: R^d → R with Lipschitz constant L_g, ψ̂ is an unbiased estimator of E[g(X_1) | X_0 = x_0] with

    Var[ψ̂] ≤ ( e^a / (1 − F_τ(1)) ) · K · M_N(κ),   (4.5)

where K = poly(|g(x_0)|, L_b, L_g, b_∞, d), κ = log poly(C, L_b, L_g, b_∞, d), and M_N(θ) := E[exp(θN)] is the moment-generating function of N.

For example, the type of drift used in the construction of Section 3 has the property (4.3). The key implication of Theorem 4.1 is that the variance of the estimator ψ̂ is controlled by the moment-generating function of N, and is therefore related to the tail behavior of the sums S_k := Σ_{i=1}^k τ_i. In some cases, one can calculate M_N in closed form. For instance, if we take τ_1, τ_2, ... i.i.d. ∼ Exp(λ) for some λ > 0, then the estimator (4.2) reduces to the one introduced by Henry-Labordère et al. (2017). Since F_τ(s) = 1 − e^{−λs} and f_τ(s) = λe^{−λs} for s ≥ 0, (4.4) holds with C = λ and a = 0; moreover, N ∼ Pois(λ) with M_N(θ) = exp( λ(e^θ − 1) ).
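As a sanity check on unbiasedness, the estimator (4.2) with exponential interrenewal times can be sketched in one dimension. The constant drift below is an illustrative choice not taken from the paper; it makes E[g(X_1)] known exactly (all weights Ŵ_k vanish, so only the N = 0 event contributes), which gives a clean target for the Monte Carlo average.

```python
import numpy as np

def unbiased_estimate(b, g, x0, lam=1.0, rng=None):
    """One draw of the estimator psi-hat of E[g(X_1) | X_0 = x0] from (4.2),
    with Exp(lam) interrenewal times; one-dimensional sketch."""
    rng = rng or np.random.default_rng()
    # Random partition 0 = T_0 < T_1 < ... < T_N < T_{N+1} = 1.
    times = [0.0]
    while True:
        t = times[-1] + rng.exponential(1.0 / lam)
        if t >= 1.0:
            break
        times.append(t)
    times.append(1.0)
    N = len(times) - 2
    # Euler-Maruyama on the random partition, recording Brownian increments.
    xs, dws = [x0], []
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        dw = np.sqrt(dt) * rng.standard_normal()
        xs.append(xs[-1] + b(xs[-1], times[k]) * dt + dw)
        dws.append(dw)
    # Assemble psi-hat: boundary term, survival factor, and weights W_k / f_tau.
    psi = g(xs[-1]) - (g(xs[N]) if N > 0 else 0.0)
    psi /= np.exp(-lam * (1.0 - times[N]))            # 1 / (1 - F_tau(1 - T_N))
    for k in range(1, N + 1):
        wk = (b(xs[k], times[k]) - b(xs[k - 1], times[k - 1])) * dws[k] \
             / (times[k + 1] - times[k])
        fk = lam * np.exp(-lam * (times[k] - times[k - 1]))  # f_tau(T_k - T_{k-1})
        psi *= wk / fk
    return psi

rng = np.random.default_rng(0)
b = lambda x, t: 0.7     # constant drift: E[g(X_1)] = 0.7 for g(x) = x, x0 = 0
est = np.mean([unbiased_estimate(b, lambda x: x, 0.0, rng=rng)
               for _ in range(100_000)])
print(round(est, 1))  # 0.7
```

For a genuinely state-dependent drift the same routine applies unchanged, but the product of weights is what drives the variance growth discussed next.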
$\mathrm{Var}[\widehat{\psi}]$ grows like $e^{\mathrm{poly}(d)}$, as already observed by Henry-Labordère et al. (2017). One way to reduce the variance is to choose the $\tau_i$'s with lighter tails. To see this, we need estimates of $M_N$; the following lemma provides a computable upper bound:

Lemma 4.1.
Let $M_\tau$ denote the moment-generating function of $\tau_1$. Then
$$M_N(\theta) \le e^\theta \inf_{\beta > 0}\left\{ (\beta + 1) e^{\theta \beta} + \sum_{k=0}^\infty \big(e^{\theta + 1} M_\tau(-\beta)\big)^k \right\}. \tag{4.6}$$
As an example, suppose $\tau_1, \tau_2, \ldots$ are i.i.d. samples from the uniform distribution on $[0, T]$ for some $T > 1$. Then
$$M_\tau(-\beta) = \frac{1}{\beta T}\big(1 - e^{-\beta T}\big),$$
and it is a matter of straightforward but lengthy algebra to show that $M_\tau(-\beta) \le e^{-2(\theta+1)}$ for all sufficiently large $\beta$. Using this in (4.6) yields the estimate $M_N(\theta) \lesssim e^{\mathrm{poly}(\theta)}$. The density of a $\mathrm{Uniform}(0, T)$ random variable clearly satisfies (4.4). Thus, applying Theorem 4.1 to the estimator (4.2) with $\tau_i \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(0, T)$, we see that its variance scales quasipolynomially in $d$, i.e., $\mathrm{Var}[\widehat{\psi}] \lesssim e^{\mathrm{polylog}(d)}$. However, choosing $\tau_i$'s with lighter tails will generally lead to larger values of $N$, i.e., a deeper generative model will be needed.

A The proof of Theorem 2.1
We first need some background on controlled diffusion processes; see, e.g., Fleming and Rishel (1975). As in Section 2, let $\mathcal{U}$ be the set of controls, where each $u \in \mathcal{U}$ defines a controlled diffusion governed by the Itô SDE
$$\mathrm{d}X^u_t = \big(b(X^u_t, t) + u(X^u_t, t)\big)\,\mathrm{d}t + \mathrm{d}W_t, \quad t \in [0,1]; \qquad X^u_0 = x_0.$$
Let $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ and $\tilde{c} : \mathbb{R}^d \to \mathbb{R}_+$ be given. For each $u \in \mathcal{U}$, we define the cost-to-go function
$$J^u(x, t) := \mathbf{E}\left[\int_t^1 c(X^u_s, u_s)\,\mathrm{d}s + \tilde{c}(X^u_1) \,\middle|\, X^u_t = x\right], \qquad x \in \mathbb{R}^d,\ t \in [0,1], \tag{A.1}$$
where $u_s$ is shorthand for $u(X^u_s, s)$. The value function $v : \mathbb{R}^d \times [0,1] \to \mathbb{R}_+$ is defined in (2.5). In general, finding an optimal control is difficult. However, a sufficient condition for optimality is given by the so-called verification theorem from the theory of controlled diffusions (see, e.g., Chap. VI of Fleming and Rishel (1975)): Suppose that there exists a function $v \in C^{2,1}(\mathbb{R}^d \times [0,1])$ that solves the Cauchy problem
$$\frac{\partial v(x,t)}{\partial t} + L_t v(x,t) = -\min_{\alpha \in \mathbb{R}^d}\big\{\alpha^{\mathsf{T}} \nabla v(x,t) + c(x, \alpha)\big\} \ \text{ on } \mathbb{R}^d \times [0,1); \qquad v(\cdot, 1) = \tilde{c}(\cdot), \tag{A.2}$$
where $L_t$ is the (time-varying) generator of the diffusion (2.1):
$$L_t h(x,t) := b(x,t)^{\mathsf{T}} \nabla h(x,t) + \frac{1}{2}\,\mathrm{tr}\,\nabla^2 h(x,t) \tag{A.3}$$
for any $h \in C^{2,1}(\mathbb{R}^d \times [0,1])$, and where the gradient and the Hessian are taken with respect to the 'space variable' $x \in \mathbb{R}^d$. Then $v$ is the value function for (A.1), and the optimal control $u^*$ is given by
$$u^*(x,t) = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^d}\big\{\alpha^{\mathsf{T}} \nabla v(x,t) + c(x,\alpha)\big\}. \tag{A.4}$$
The PDE (A.2) is called the Bellman equation associated to the control problem (A.1).

Remark A.1. In fact, the control (A.4) is optimal among a much wider class of adapted controls, i.e., all stochastic processes $\{u_t\}_{t \in [0,1]}$ adapted to the filtration $\{\mathcal{F}_t\}$. The class $\mathcal{U}$ defined above consists of so-called Markov controls, where $u_t$ is a deterministic function of $X^u_t$ and $t$. In that case, the controlled diffusion $X^u$ is a Markov process.

We now turn to the proof of Theorem 2.1. The first step is to use the logarithmic transformation due to Fleming (1978); see also Fleming and Sheu (1985); Sheu (1991). Consider the function $h(x,t) := \mathbf{E}[g(X_1) \,|\, X_t = x]$. By the Feynman–Kac formula (Kallenberg, 2002, Thm. 24.1), this function is a $C^{2,1}$ solution of the Cauchy problem
$$\frac{\partial h}{\partial t} + L_t h = 0 \ \text{ on } \mathbb{R}^d \times [0,1); \qquad h(\cdot, 1) = g(\cdot). \tag{A.5}$$
It is a matter of simple calculus to verify that $v(x,t) = -\log h(x,t)$ solves the Cauchy problem
$$\frac{\partial v}{\partial t} + L_t v = \frac{1}{2}\|\nabla v\|^2 \ \text{ on } \mathbb{R}^d \times [0,1); \qquad v(\cdot, 1) = -\log g(\cdot). \tag{A.6}$$
Moreover, using the variational representation
$$-\frac{1}{2}\|\nabla v\|^2 = \min_{\alpha \in \mathbb{R}^d}\left\{\alpha^{\mathsf{T}} \nabla v + \frac{1}{2}\|\alpha\|^2\right\},$$
where the optimizer is given by $\alpha^* = -\nabla v$, it is readily verified that (A.6) is the Bellman equation (A.2) associated to the control problem (2.4). Hence, by the verification theorem, $v(x,t) = -\log h(x,t)$ is the value function we seek, and the optimal control is given by $u^*(x,t) = -\nabla v(x,t)$. Now consider the diffusion process
$$\mathrm{d}X^*_t = \big(b(X^*_t, t) + \nabla \log h(X^*_t, t)\big)\,\mathrm{d}t + \mathrm{d}W_t,$$
which satisfies
$$-\log \mathbf{E}[g(X_1) \,|\, X_t = x] = \mathbf{E}\left[\int_t^1 \frac{1}{2}\|\nabla \log h(X^*_s, s)\|^2\,\mathrm{d}s - \log g(X^*_1) \,\middle|\, X^*_t = x\right] = \min_{u \in \mathcal{U}} \mathbf{E}\left[\int_t^1 \frac{1}{2}\|u_s\|^2\,\mathrm{d}s - \log g(X^u_1) \,\middle|\, X^u_t = x\right].$$
Since $h$ solves (A.5), the transition density of $\{X^*_t\}$ is given by (2.7) by a result of Jamison (1975) and Dai Pra (1991).

B Regularity properties of f and the Föllmer drift

We first show that Assumption 3.1 holds for Gibbs measures $\mu(\mathrm{d}x) = Z^{-1} e^{-\|x\|^2/2 - F(x)}\,\mathrm{d}x$ with sufficiently well-behaved potentials $F$. Suppose that $F : \mathbb{R}^d \to \mathbb{R}_+$ is differentiable, and both $F$ and $\nabla F$ are $L$-Lipschitz. Then $f = \mathrm{d}\mu/\mathrm{d}\gamma_d = \mathrm{const} \cdot e^{-F}$, and the Lipschitz continuity of $f$ follows from the Lipschitz continuity of $u \mapsto e^{-u}$ on $[0, \infty)$:
$$|e^{-F(x)} - e^{-F(y)}| \le |F(x) - F(y)| \le L\|x - y\|.$$
Likewise, the Lipschitz continuity of $\nabla f$ follows from the Lipschitz continuity of $\nabla F$: since $\nabla e^{-F} = -e^{-F}\nabla F$, we have
$$\|\nabla e^{-F(x)} - \nabla e^{-F(y)}\| \le e^{-F(x)}\|\nabla F(x) - \nabla F(y)\| + \|\nabla F(y)\|\,|e^{-F(x)} - e^{-F(y)}| \le (L + L^2)\|x - y\|.$$
Finally, suppose that $F$ is also bounded from above, $F \le a$ for some $a > 0$. Then $f \ge c$ everywhere, where $0 < c \le 1$ because both $\mu$ and $\gamma_d$ are probability measures.

We will also need the following simple lemma:

Lemma B.1 (Regularity of the Föllmer drift).
Under Assumption 3.1, the Föllmer drift $b(x,t) = \nabla \log Q_{1-t} f(x)$ is bounded in norm by $L/c$ and is Lipschitz with Lipschitz constant $L/c + L^2/c^2$, where $L$ is the maximum of the Lipschitz constants of $f$ and $\nabla f$.

Proof. The heat semigroup $Q_t f(x) = \mathbf{E}[f(x + \sqrt{t}\, Z)]$, $Z \sim \gamma_d$, commutes with the gradient operator: for any differentiable and Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$, $\partial_i Q_t f = Q_t \partial_i f$ for all $i \in [d]$ (Stroock, 2008, Corollary 2.2.8). Therefore, since $f(x) \ge c$ and $\|\nabla f(x)\| \le L$ for all $x$, we have $Q_t f(x) \ge c$ and $\|\nabla Q_t f(x)\| \le L$ for all $x \in \mathbb{R}^d$ and all $t \ge 0$. Consequently, for any $x \in \mathbb{R}^d$ and $t \in [0,1]$,
$$\|b(x,t)\| = \left\|\frac{\nabla Q_{1-t} f(x)}{Q_{1-t} f(x)}\right\| \le \frac{L}{c}.$$
Also, since $\nabla f$ is Lipschitz, $\|\nabla Q_t f(x) - \nabla Q_t f(x')\| \le L\|x - x'\|$ for any $x, x' \in \mathbb{R}^d$ and $t \in [0,1]$, and thus
$$\|b(x,t) - b(x',t)\| = \left\|\frac{\nabla Q_{1-t} f(x)}{Q_{1-t} f(x)} - \frac{\nabla Q_{1-t} f(x')}{Q_{1-t} f(x')}\right\| \le \frac{\|\nabla Q_{1-t} f(x) - \nabla Q_{1-t} f(x')\|}{Q_{1-t} f(x')} + \|b(x,t)\| \cdot \frac{|Q_{1-t} f(x) - Q_{1-t} f(x')|}{Q_{1-t} f(x')} \le \left(\frac{L}{c} + \frac{L^2}{c^2}\right)\|x - x'\|,$$
and the proof is complete.

C Uniform approximation of the heat semigroup by a finite sum
In this appendix, we prove the following result, which is used in the proof of Theorem 3.2:
Theorem C.1.
For any $\varepsilon > 0$ and any $R > 0$, there exist $N = \mathrm{poly}(1/\varepsilon, d, L, R)$ points $z_1, \ldots, z_N \in \mathbb{R}^d$, for which the following holds:
$$\max_{n \le N} \|z_n\| \le 4\sqrt{(d+6)\log N},$$
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} \left|\frac{1}{N}\sum_{n=1}^N f(x + \sqrt{t}\, z_n) - Q_t f(x)\right| \le \varepsilon,$$
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} \left\|\frac{1}{N}\sum_{n=1}^N \nabla f(x + \sqrt{t}\, z_n) - \nabla Q_t f(x)\right\| \le \varepsilon.$$

We gather some preliminaries first. We recall the definition of the Orlicz exponential norm of order $2$ (Giné and Nickl, 2016, Sec. 2.3): for a real-valued random variable $U$,
$$\|U\|_{\psi_2} := \inf\left\{c > 0 : \mathbf{E}\exp\left(\frac{|U|^2}{c^2}\right) \le 2\right\}.$$
The $\psi_2$ norm dominates the $L^2$ norm $\|U\|_{L^2} := (\mathbf{E}|U|^2)^{1/2}$: $\|U\|_{L^2} \le \|U\|_{\psi_2}$. A simple application of Markov's inequality leads to the following tail bound:
$$\mathbf{P}\big[|U| \ge t\|U\|_{\psi_2}\big] \le 2e^{-t^2}. \tag{C.2}$$

Lemma C.1. Let $U = \|Z\|$, where $Z \sim \gamma_d$. Then $\|U\|_{\psi_2} \le \sqrt{d} + \sqrt{6}$.

Proof. If $F : \mathbb{R}^d \to \mathbb{R}$ is $1$-Lipschitz, then the centered random variable $\xi = F(Z) - \mathbf{E}F(Z)$ has subgaussian tails (Boucheron et al., 2013, Theorem 5.6):
$$\mathbf{P}\big\{|\xi| \ge t\big\} \le 2e^{-t^2/2} \quad \text{for all } t > 0.$$
This implies that $\|\xi\|_{\psi_2} \le \sqrt{6}$ (Giné and Nickl, 2016, Eq. (2.25)). Taking $F(Z) = U$ and using the triangle inequality, we obtain $\|U\|_{\psi_2} \le \mathbf{E}U + \|U - \mathbf{E}U\|_{\psi_2} \le \sqrt{d} + \sqrt{6}$, where $\mathbf{E}U \le \|U\|_{L^2} = \sqrt{d}$ by Jensen's inequality.

Let $U_1, \ldots, U_N$, $N \ge 2$, be a collection of (possibly dependent) random variables with finite $\psi_2$ norms. Then we have the following maximal inequality:
$$\left\|\max_{j \le N} |U_j|\right\|_{\psi_2} \le 2\sqrt{\log N}\,\max_{j \le N}\|U_j\|_{\psi_2} \tag{C.3}$$
(Lemma 2.3.3 in Giné and Nickl (2016)).

We also need some results on suprema of empirical processes. Let $\mathcal{G}$ be a class of real-valued functions on some measurable space $\mathsf{Z}$.
We say that a positive function $F : \mathsf{Z} \to \mathbb{R}_+$ is an envelope of $\mathcal{G}$ if $|g(z)| \le F(z)$ for all $g \in \mathcal{G}$ and $z \in \mathsf{Z}$. Let $Z_1, \ldots, Z_N$ be i.i.d. random elements of $\mathsf{Z}$ with probability law $P$, and denote by $P_N$ the corresponding empirical distribution, i.e., $P_N(A) = N^{-1}\sum_{n \le N} \mathbf{1}\{Z_n \in A\}$ for all measurable sets $A \subseteq \mathsf{Z}$. We will use the linear functional notation for expectations, i.e., $Pg := \mathbf{E}_P[g(Z)]$ and $P_N g := \mathbf{E}_{P_N}[g(Z)] = N^{-1}\sum_{n \le N} g(Z_n)$. We are interested in the quantity
$$\|P_N - P\|_{\mathcal{G}} := \sup_{g \in \mathcal{G}} |P_N g - P g|,$$
which is a random variable under standard regularity assumptions on $\mathcal{G}$, such as separability. The expected supremum $\mathbf{E}\|P_N - P\|_{\mathcal{G}}$ is controlled by the covering numbers of $\mathcal{G}$. The $L^2(Q)$ covering numbers of $\mathcal{G}$ with respect to a probability measure $Q$ on $\mathsf{Z}$ are defined by
$$N(\mathcal{G}, L^2(Q), \varepsilon) := \min\Big\{K : \text{there exist } f_1, \ldots, f_K \in L^2(Q) \text{ such that } \sup_{g \in \mathcal{G}} \min_{k \le K} \|g - f_k\|_{L^2(Q)} \le \varepsilon\Big\}.$$
The Koltchinskii–Pollard $\varepsilon$-entropy of $\mathcal{G}$ is given by
$$H(\mathcal{G}, F, \varepsilon) := \sup_Q \sqrt{\log 2N\big(\mathcal{G}, L^2(Q), \varepsilon\|F\|_{L^2(Q)}\big)},$$
where the supremum is over all probability measures $Q$ supported on finitely many points of $\mathsf{Z}$. Then we have the following bound on the expectation of $\|P_N - P\|_{\mathcal{G}}$ (Theorem 3.5.4 and Eq. (3.177) in Giné and Nickl (2016)):

Lemma C.2. Let $\mathcal{G}$ be a class of functions containing $0$, such that
$$J(\mathcal{G}, F) := \int_0^\infty H(\mathcal{G}, F, \varepsilon)\,\mathrm{d}\varepsilon < \infty.$$
Let $Z_1, \ldots, Z_N$ be i.i.d. copies of a random element $Z$ of $\mathsf{Z}$ with probability law $P$, such that $F \in L^2(P)$. Then
$$\mathbf{E}\|P_N - P\|_{\mathcal{G}} \le \frac{8\sqrt{2}\,J(\mathcal{G}, F)\,\|F\|_{L^2(P)}}{\sqrt{N}}.$$
We also have the following generalization of Talagrand's concentration inequality to unbounded classes of functions, due to Adamczak (2008) (see also Sec. 2.3 in Koltchinskii (2011)):
Lemma C.3.
Let $\mathcal{G}$ be a class of real-valued functions on $\mathsf{Z}$ with envelope $F$. Then there exists an absolute constant $C > 0$, such that, for any $\gamma > 0$,
$$\mathbf{P}\left[\|P_N - P\|_{\mathcal{G}} \ge C\left(\mathbf{E}\|P_N - P\|_{\mathcal{G}} + \sigma_P(\mathcal{G})\sqrt{\frac{\gamma}{N}} + \left\|\max_{n \le N} F(Z_n)\right\|_{\psi_2}\frac{\sqrt{\gamma}}{N}\right)\right] \le e^{-\gamma},$$
where $\sigma_P(\mathcal{G}) := \sup_{g \in \mathcal{G}} \big(Pg^2 - (Pg)^2\big)^{1/2}$.
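As a concrete numerical illustration of the quantity $\|P_N - P\|_{\mathcal{G}}$ controlled by Lemmas C.2 and C.3, the following Python sketch estimates the empirical process supremum over a simple one-parameter Lipschitz class under $P = \gamma_1$. The specific class, grid, and sample sizes are illustrative choices for the demonstration, not part of the proof:

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(0)

# Illustrative function class G = {z -> (z - a)_+ : a in a finite grid},
# with P = N(0, 1); a stand-in for the classes covered by Lemmas C.2-C.3.
a_grid = np.linspace(-2.0, 2.0, 41)

def pop_mean(a):
    # E[(Z - a)_+] for Z ~ N(0,1) equals phi(a) - a * (1 - Phi(a)),
    # where phi and Phi are the standard normal pdf and cdf
    phi = exp(-a * a / 2.0) / sqrt(2.0 * pi)
    Phi = 0.5 * (1.0 + erf(a / sqrt(2.0)))
    return phi - a * (1.0 - Phi)

def sup_deviation(n):
    # ||P_N - P||_G: supremum over the class of |empirical mean - population mean|
    z = rng.standard_normal(n)
    emp = np.maximum(0.0, z[None, :] - a_grid[:, None]).mean(axis=1)
    pop = np.array([pop_mean(a) for a in a_grid])
    return float(np.abs(emp - pop).max())

for n in (100, 1000, 10000):
    print(n, sup_deviation(n))  # decays roughly like 1/sqrt(n)
```

The observed $1/\sqrt{n}$ decay is exactly what the expectation bound of Lemma C.2 predicts, while Lemma C.3 additionally controls the fluctuations around that expectation.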
Lemma C.4.
Let $g : \mathbb{R}^d \to \mathbb{R}$ be $L$-Lipschitz with respect to the Euclidean norm. Let $Z_1, \ldots, Z_N$ be i.i.d. copies of a $d$-dimensional random vector $Z$, such that $U := \|Z\|$ has finite $\psi_2$ norm. Then there exists an absolute constant $C > 0$, such that, for any $\gamma > 0$,
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t\in[0,1]} \left|\frac{1}{N}\sum_{n=1}^N g(x+\sqrt{t}\, Z_n) - \mathbf{E}[g(x+\sqrt{t}\, Z)]\right| \le C\left[\frac{16L\sqrt{\pi R d}\,\big((R\vee 1)+\|U\|_{\psi_2}\big)}{\sqrt{N}} + 5L\big((R\vee 1)+\|U\|_{\psi_2}\big)\sqrt{\frac{\gamma}{N}}\right] \tag{C.4}$$
with probability at least $1 - e^{-\gamma}$.

Proof. For each $x \in \mathbb{R}^d$ and $t \ge 0$, let $g_{x,t}(z) := g(x + \sqrt{t}\, z)$, and let $P$ denote the probability law of $Z$. Since $P_N g_{x,t} - P g_{x,t} = P_N(g_{x,t} - g_{0,0}) - P(g_{x,t} - g_{0,0})$ for all $x, t$, where $g_{0,0}(\cdot) = g(0)$ is a constant, we can replace each $g_{x,t}$ with $\bar{g}_{x,t} := g_{x,t} - g_{0,0}$, introduce the function class $\mathcal{G} := \{\bar{g}_{x,t} : x \in \mathsf{B}^d(R),\ t \in [0,1]\}$, and analyze the empirical process supremum
$$\|P_N - P\|_{\mathcal{G}} = \sup_{x\in\mathsf{B}^d(R)}\,\sup_{t\in[0,1]} |P_N \bar{g}_{x,t} - P\bar{g}_{x,t}|.$$
Define the function $F(z) := L\big((R\vee 1) + \|z\|\big)$. Since $\|\cdot\|_{L^2} \le \|\cdot\|_{\psi_2}$, $F \in L^2(P)$. By Lipschitz continuity, for all $z \in \mathbb{R}^d$, $x \in \mathsf{B}^d(R)$, $t \in [0,1]$, we have
$$|\bar{g}_{x,t}(z)| \le |g(x+\sqrt{t}\, z) - g(0)| \le L\|x + \sqrt{t}\, z\| \le F(z),$$
so $F$ is a square-integrable envelope of $\mathcal{G}$. Moreover, for any probability measure $Q$ supported on finitely many points in $\mathbb{R}^d$ and for all $x, x' \in \mathsf{B}^d(R)$ and $t, t' \in [0,1]$,
$$\|\bar{g}_{x,t} - \bar{g}_{x',t'}\|_{L^2(Q)} \le \|F\|_{L^2(Q)}\cdot\big(\|x - x'\| + |t - t'|^{1/2}\big).$$
Thus we can estimate the $L^2(Q)$ covering numbers of $\mathcal{G}$ by
$$N\big(\mathcal{G}, L^2(Q), \varepsilon\|F\|_{L^2(Q)}\big) \le N\big(\mathsf{B}^d(R), \|\cdot\|, \varepsilon/2\big)\cdot N\big([0,1], |\cdot|, \varepsilon^2/4\big).$$
Using standard volumetric estimates on the covering numbers of $\ell^2$ balls, we obtain the following bound on the Koltchinskii–Pollard entropy of $\mathcal{G}$:
$$H(\mathcal{G}, F, \varepsilon) \le \sqrt{d\left(\log\frac{2\sqrt{2}\sqrt{R}}{\varepsilon}\right)_+},$$
where $(u)_+ := u \vee 0$, and therefore
$$J(\mathcal{G}, F) = \int_0^\infty H(\mathcal{G}, F, \varepsilon)\,\mathrm{d}\varepsilon \le \sqrt{2\pi R d}.$$
Lemma C.2 then gives
$$\mathbf{E}\|P_N - P\|_{\mathcal{G}} \le \frac{8\sqrt{2}\,J(\mathcal{G}, F)\,\|F\|_{L^2(P)}}{\sqrt{N}} \le \frac{16 L\sqrt{\pi Rd}\,\big((R\vee 1) + \|U\|_{L^2}\big)}{\sqrt{N}} \le \frac{16 L\sqrt{\pi Rd}\,\big((R\vee 1) + \|U\|_{\psi_2}\big)}{\sqrt{N}}. \tag{C.5}$$
Furthermore, we estimate
$$\sigma_P(\mathcal{G}) \le \|F\|_{L^2(P)} \le \big\|L\big((R\vee 1) + U\big)\big\|_{\psi_2} \le L\big((R\vee 1) + \|U\|_{\psi_2}\big) \tag{C.6}$$
and
$$\left\|\max_{n\le N} F(Z_n)\right\|_{\psi_2} = L\left\|(R\vee 1) + \max_{n\le N} U_n\right\|_{\psi_2} \le L(R\vee 1) + 4L\sqrt{\log N}\,\|U\|_{\psi_2}, \tag{C.7}$$
where we have used the triangle inequality for $\|\cdot\|_{\psi_2}$, as well as the maximal inequality (C.3). Using the estimates (C.5), (C.6), and (C.7) in Adamczak's inequality, we obtain (C.4).

We are now ready to prove Theorem C.1. The proof is via the probabilistic method. Let $\varepsilon > 0$ and $R > 0$ be given, and choose
$$N = \left\lceil\left(\frac{C\sqrt{d}}{\varepsilon}\cdot L\big((R\vee 1) + \sqrt{d} + \sqrt{6}\big)\cdot\Big(16\sqrt{\pi R d} + 5\sqrt{\log 4(d+1)}\Big)\right)^2\right\rceil,$$
where $C > 0$ is the absolute constant in the bound of Lemma C.4. Let $Z_1, \ldots, Z_N$ be i.i.d. copies of $Z \sim \gamma_d$, and observe that $\mathbf{E}[f(x+\sqrt{t}\, Z)] = Q_t f(x)$ and $\mathbf{E}[\partial_i f(x+\sqrt{t}\, Z)] = \partial_i Q_t f(x) = Q_t\partial_i f(x)$ for all $x \in \mathbb{R}^d$, $t \ge 0$, and $i \in [d]$. Define the events
$$E_0 := \left\{\max_{n\le N}\|Z_n\| \ge 4\sqrt{(d+6)\log N}\right\},$$
$$E_1 := \left\{\sup_{x\in\mathsf{B}^d(R)}\,\sup_{t\in[0,1]}\left|\frac{1}{N}\sum_{n=1}^N f(x+\sqrt{t}\, Z_n) - Q_t f(x)\right| \ge \varepsilon\right\},$$
$$E_2 := \left\{\max_{i\in[d]}\,\sup_{x\in\mathsf{B}^d(R)}\,\sup_{t\in[0,1]}\left|\frac{1}{N}\sum_{n=1}^N \partial_i f(x+\sqrt{t}\, Z_n) - \partial_i Q_t f(x)\right| \ge \frac{\varepsilon}{\sqrt{d}}\right\}.$$
We will show that $\mathbf{P}\{E_0\cup E_1\cup E_2\} < 1$, which will imply that there exists at least one realization of $Z_1, \ldots, Z_N$ verifying the statement of the theorem.

By Lemma C.1, $U = \|Z\|$ satisfies $\|U\|_{\psi_2} \le \sqrt{d} + \sqrt{6}$, and therefore $U^*_N := \max_{n\le N} U_n$ satisfies $\|U^*_N\|_{\psi_2} \le 2\sqrt{2(d+6)\log N}$ by the maximal inequality (C.3). Consequently, it follows from (C.2) that
$$\mathbf{P}\{E_0\} \le \mathbf{P}\big\{U^*_N \ge \sqrt{2}\,\|U^*_N\|_{\psi_2}\big\} \le 2e^{-2} \le \frac{1}{2}.$$
Moreover, since the function $f$ and all of its partial derivatives are $L$-Lipschitz, Lemma C.4 (with $\gamma = \log 4(d+1)$) and the union bound give $\mathbf{P}\{E_1\cup E_2\} \le 1/4$. Therefore, $\mathbf{P}\{E_0\cup E_1\cup E_2\} \le 3/4$.

D The proof of Theorem 3.2: uniform approximation of the Föllmer drift by a neural net
We first collect a few preliminaries.
Lemma D.1 (cheap gradient principle, Griewank and Walther (2008)). Let $f : \mathbb{R}^d \to \mathbb{R}$ be implementable by a neural net with differentiable activation function $\sigma : \mathbb{R} \to \mathbb{R}$, where the neural net has size (number of nodes) $m$ and depth (number of layers) $\ell$. Then each coordinate of the gradient $\nabla f$ can be computed by a neural net that has size $O(m + \ell)$, and where the activation function of each neuron is an element of the set $\{\sigma, \sigma'\}$.

Lemma D.2 (approximating multiplication and reciprocals). Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function satisfying Assumption 3.2. Then:

1. For any $M > 0$ and any $\delta > 0$, there exists a $2$-layer neural net $g_2 : \mathbb{R}^2 \to \mathbb{R}$ of size $m \le c_\sigma \frac{M^2}{\delta} + 1$, such that
$$\sup_{x, y \in [-M, M]} |g_2(x, y) - xy| \le \delta. \tag{D.1}$$

2. For any $0 < a \le b < \infty$ and any $\delta > 0$, there exists a $2$-layer neural net $q : \mathbb{R} \to \mathbb{R}$ of size $m \le c_\sigma \frac{b}{a^2\delta} + 1$, such that
$$\sup_{x \in [a, b]} \left|q(x) - \frac{1}{x}\right| \le \delta. \tag{D.2}$$

Remark D.1.
These approximations suffice for our purposes. However, if one uses the ReLU activation function $x \mapsto x \vee 0$, then both multiplication and reciprocals can be $\varepsilon$-approximated by neural nets with size and depth polylogarithmic in $1/\varepsilon$ (Yarotsky, 2017; Telgarsky, 2017).

Proof.
For multiplication, we first consider the function $u \mapsto u^2 \wedge (4M^2)$, which is $4M$-Lipschitz and constant outside the interval $[-2M, 2M]$. Assumption 3.2 then grants the existence of a univariate function $g_1 : \mathbb{R} \to \mathbb{R}$ of the form (3.3) with $m \le c_\sigma \frac{M^2}{\delta}$ satisfying $|g_1(u) - u^2| \le \delta$ for all $u \in [-2M, 2M]$. The desired approximation $g_2 : \mathbb{R}^2 \to \mathbb{R}$ is given by
$$g_2(x, y) = \frac{1}{4}\big(g_1(x + y) - g_1(x - y)\big),$$
which is a $2$-layer neural net with size $m \le c_\sigma \frac{M^2}{\delta} + 1$. Indeed, using the polarization identity $xy = \frac{1}{4}\big((x+y)^2 - (x-y)^2\big)$, we have
$$\sup_{x,y \in [-M,M]} |g_2(x,y) - xy| \le \frac{1}{4}\sup_{x,y \in [-M,M]} \big|g_1(x+y) - (x+y)^2\big| + \frac{1}{4}\sup_{x,y \in [-M,M]} \big|g_1(x-y) - (x-y)^2\big| \le \delta.$$
For approximating the reciprocal, consider the univariate function
$$x \mapsto \frac{1}{a}\mathbf{1}\{x < a\} + \frac{1}{x}\mathbf{1}\{a \le x \le b\} + \frac{1}{b}\mathbf{1}\{x > b\},$$
which is $(1/a^2)$-Lipschitz and constant outside of the interval $[-b, b]$. The existence of the function $q$ with the stated properties follows immediately from Assumption 3.2.

We now prove Theorem 3.2. Let $\delta = \frac{c^2\varepsilon}{16L}$. By Theorem C.1, there exist points $z_1, \ldots, z_N \in \mathbb{R}^d$ with $N = \mathrm{poly}(1/\delta, d, L, R)$, such that
$$R_{N,d} := \max_{n \le N} \|z_n\| \le 4\sqrt{(d+6)\log N},$$
and the function $\varphi : \mathbb{R}^d \times [0,1] \to \mathbb{R}$ defined by
$$\varphi(x, t) := \frac{1}{N}\sum_{n=1}^N f(x + t z_n)$$
satisfies
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} |\varphi(x, \sqrt{t}) - Q_t f(x)| \le \delta \quad \text{and} \quad \sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} \|\nabla \varphi(x, \sqrt{t}) - \nabla Q_t f(x)\| \le \delta.$$
By Assumption 3.3, there exists a neural net $\widehat{f} : \mathbb{R}^d \to \mathbb{R}$ that approximates $f$ and the gradient of $f$ to accuracy $\delta$ on the blown-up ball $\mathsf{B}^d(R + R_{N,d})$. Then the function $\widehat{\varphi} : \mathbb{R}^d \times [0,1] \to \mathbb{R}$,
$$\widehat{\varphi}(x, t) := \frac{1}{N}\sum_{n=1}^N \widehat{f}(x + t z_n),$$
can be computed by a neural net of size $N \cdot \mathrm{poly}(1/\delta, d, L, R)$, such that
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} |\widehat{\varphi}(x, \sqrt{t}) - Q_t f(x)| \le \sup_{x \in \mathsf{B}^d(R + R_{N,d})} |\widehat{f}(x) - f(x)| + \sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} |\varphi(x, \sqrt{t}) - Q_t f(x)| \le 2\delta$$
and
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} \|\nabla\widehat{\varphi}(x, \sqrt{t}) - \nabla Q_t f(x)\| \le \sup_{x \in \mathsf{B}^d(R + R_{N,d})} \|\nabla\widehat{f}(x) - \nabla f(x)\| + \sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} \|\nabla\varphi(x, \sqrt{t}) - \nabla Q_t f(x)\| \le 2\delta.$$
Since $f$ is $L$-Lipschitz and bounded below by $c$, we have $c \le Q_t f(x) \le L(\|x\| + \sqrt{d}) + f(0)$ for any $x \in \mathbb{R}^d$ and $t \in [0,1]$. Therefore, on $\mathsf{B}^d(R) \times [0,1]$,
$$\frac{c}{2} \le \widehat{\varphi}(x, \sqrt{t}) \le L(R + \sqrt{d}) + f(0) + \frac{c}{2},$$
where we have used the fact that $\delta \le c/4$. Without loss of generality, we may assume that $L \ge 1$. Then, for any $x \in \mathsf{B}^d(R)$ and $t \in [0,1]$,
$$\big\|\nabla \log \widehat{\varphi}(x, \sqrt{t}) - \nabla \log Q_t f(x)\big\| = \left\|\frac{\nabla\widehat{\varphi}(x, \sqrt{t})}{\widehat{\varphi}(x, \sqrt{t})} - \frac{\nabla Q_t f(x)}{Q_t f(x)}\right\| \le \frac{\|\nabla\widehat{\varphi}(x, \sqrt{t}) - \nabla Q_t f(x)\|}{\widehat{\varphi}(x, \sqrt{t})} + \left\|\frac{\nabla Q_t f(x)}{Q_t f(x)}\right\|\frac{|\widehat{\varphi}(x, \sqrt{t}) - Q_t f(x)|}{\widehat{\varphi}(x, \sqrt{t})} \le \frac{4\delta}{c} + \frac{4L\delta}{c^2} \le \frac{8L\delta}{c^2} = \frac{\varepsilon}{2},$$
where we have used Lemma B.1 to bound $\big\|\frac{\nabla Q_t f}{Q_t f}\big\| \le L/c$. In other words, $\nabla \log \widehat{\varphi}(x, \sqrt{t})$ approximates $\nabla \log Q_t f(x)$ to accuracy $\varepsilon/2$ uniformly on $\mathsf{B}^d(R) \times [0,1]$.
It remains to approximate $\nabla \log \widehat{\varphi}(x, \sqrt{t})$ by a neural net to accuracy $\varepsilon/2$.

To that end, we first represent $\nabla \log \widehat{\varphi}(x, \sqrt{t})$ as a composition of several elementary operations and then approximate each step by a neural net. Specifically, the computation of $v_i = \partial_i \log \widehat{\varphi}(x, \sqrt{t})$ can be represented as a computation graph with the following structure:

1. Compute $a = \widehat{\varphi}(x, \sqrt{t})$.
2. Compute $b_i = \partial_i \widehat{\varphi}(x, \sqrt{t})$.
3. Compute $r = 1/a$.
4. Compute $v_i = r b_i$.

Given $x$ and $\sqrt{t}$, $a$ is computed by a neural net with activation function $\sigma$, of size $\mathrm{poly}(1/\delta, d, L, R)$ and depth $\mathrm{poly}(1/\delta, d, L, R)$. Therefore, by the cheap gradient principle (Lemma D.1), $b_i$ can be computed by a neural net of size $\mathrm{poly}(1/\delta, d, L, R)$, where the activation function of each neuron is an element of the set $\{\sigma, \sigma'\}$. Next, since $a$ takes values in $[c/2,\, L(R + \sqrt{d}) + f(0) + c/2]$, by Lemma D.2 the reciprocal $r = 1/a$ can be computed to accuracy $\varepsilon/(4L\sqrt{d})$ by a $2$-layer neural net with activation function $\sigma$ and of size
$$O\left(\frac{1}{c^2}\cdot\big(L(R + \sqrt{d}) + f(0) + c/2\big)\cdot\frac{L\sqrt{d}}{\varepsilon}\right) \le \mathrm{poly}(1/\varepsilon, d, L, R, c, 1/c).$$
Let $\widehat{r}$ denote the resulting approximation. Then, since $|b_i| \le L$ and $|\widehat{r}| \le 2/c + \varepsilon/(4L\sqrt{d}) \le 4/c$, by Lemma D.2 the product $\widehat{r} b_i$ can be approximated to accuracy $\varepsilon/(4\sqrt{d})$ by a $2$-layer neural net with activation function $\sigma$ and with at most
$$O\left(\big(4/c \vee L\big)^2\cdot\frac{\sqrt{d}}{\varepsilon}\right) \le \mathrm{poly}(1/\varepsilon, d, L, 1/c)$$
neurons. The overall accuracy of approximation is
$$|\widehat{v}_i - v_i| \le |\widehat{v}_i - \widehat{r} b_i| + |\widehat{r} b_i - r b_i| \le \frac{\varepsilon}{2\sqrt{d}}.$$
Thus, the vector $v = (v_1, \ldots, v_d)$ can be $(\varepsilon/2)$-approximated by $\tilde{v}(x, \sqrt{t})$, where $\tilde{v} : \mathbb{R}^d \times [0,1] \to \mathbb{R}^d$ is a neural net with vector-valued output that has size $\mathrm{poly}(1/\varepsilon, d, L, R, c, 1/c)$. Finally, since
$$\sup_{x \in \mathsf{B}^d(R)}\,\sup_{t \in [0,1]} |\tilde{v}_i(x, \sqrt{t})| \le \frac{L}{c},$$
the function $\widehat{v}_i(x, \sqrt{t}) := \min\{\max\{\tilde{v}_i(x, \sqrt{t}), -L/c\}, L/c\}$ is continuous, takes values in $[-L/c, L/c]$, and coincides with $\tilde{v}_i$ on $\mathsf{B}^d(R) \times [0,1]$. Moreover, the min and max operations can each be implemented exactly using $O(1)$ ReLU neurons.
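Steps 3 and 4 of the computation graph above rely on the multiplication and reciprocal gadgets of Lemma D.2, and the multiplication gadget is just the polarization trick. The following Python sketch demonstrates it: a piecewise-linear interpolant $g_1$ of $u \mapsto u^2$ (any such interpolant is exactly representable by a one-hidden-layer ReLU net) is turned into a product approximator $g_2(x,y) = \frac{1}{4}(g_1(x+y) - g_1(x-y))$. The range $M$ and knot count $m$ are illustrative choices, and the interpolant merely stands in for the $\sigma$-net guaranteed by Assumption 3.2:

```python
import numpy as np

M, m = 2.0, 201  # illustrative range and number of interpolation knots
knots = np.linspace(-2.0 * M, 2.0 * M, m)

def g1(u):
    # piecewise-linear interpolant of u -> u^2 on [-2M, 2M];
    # such a function is exactly a one-hidden-layer ReLU net
    return np.interp(u, knots, knots ** 2)

def g2(x, y):
    # polarization identity: xy = ((x + y)^2 - (x - y)^2) / 4
    return 0.25 * (g1(x + y) - g1(x - y))

xs = np.linspace(-M, M, 81)
err = max(abs(g2(x, y) - x * y) for x in xs for y in xs)
print(err)  # uniform error over [-M, M]^2, of order (knot spacing)^2
```

The final clipping step of the proof is equally concrete: $\min\{u, c\} = c - (c - u)_+$ and $\max\{u, -c\} = (u + c)_+ - c$, i.e., two ReLU compositions per coordinate.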
E Proof of Theorem 4.1
E.1 Unbiasedness
We follow the strategy of Henry-Labordère et al. (2017) and construct a sequence $\{\psi_n\}_{n \ge 0}$ of unbiased estimators, such that $\mathbf{E}[\psi_n] \xrightarrow{n\to\infty} \mathbf{E}[\psi]$, where $\psi := \lim_{n\to\infty}\psi_n$. By a standard approximation argument, we can assume that $g$ is bounded and Lipschitz.

Let $\Delta T_k := T_k - T_{k-1}$ and $\Delta W_k := W_{T_k} - W_{T_{k-1}}$, for $k \ge 1$. For each $n \ge 0$, let
$$\psi_n := \frac{g(\widehat{X}_1)}{1-F_\tau(\Delta T_{N+1})}\prod_{k=1}^{N\wedge n}\frac{\big(b(\widehat{X}_{T_k},T_k)-b(\widehat{X}_{T_{k-1}},T_{k-1})\big)^{\mathsf{T}}\Delta W_{k+1}}{f_\tau(\Delta T_k)\,\Delta T_{k+1}}\,\mathbf{1}_{\{N\le n\}} + \frac{1}{\prod_{k=1}^{n+1}f_\tau(\Delta T_k)}\cdot\big(b(\widehat{X}_{T_{n+1}},T_{n+1})-b(\widehat{X}_{T_n},T_n)\big)^{\mathsf{T}}\nabla h(\widehat{X}_{T_{n+1}},T_{n+1})\cdot\frac{\Delta W_{n+1}}{\Delta T_{n+1}}\,\mathbf{1}_{\{N>n\}}, \tag{E.1}$$
where $h(x,t) := \mathbf{E}[g(X_1)\,|\,X_t = x]$. We will show that $\mathbf{E}[\psi_n] = \mathbf{E}[g(X_1)]$ for all $n$ and that the sequence $\{\psi_n\}_{n\ge 0}$ is uniformly integrable. Then it will follow from the dominated convergence theorem that
$$\psi = \lim_{n\to\infty}\psi_n = \frac{1}{1 - F_\tau(1 - T_N)}\cdot g(\widehat{X}_1)\cdot\prod_{k=1}^N \frac{\widehat{W}_k}{f_\tau(T_k - T_{k-1})} \tag{E.2}$$
is also an unbiased estimator. Observe that the estimator $\widehat{\psi}$ defined in (4.2) differs from $\psi$: instead of $g(\widehat{X}_1)$, we have $g(\widehat{X}_1) - g(\widehat{X}_{T_N})\mathbf{1}_{\{N>0\}}$. Just as in Henry-Labordère et al. (2017), the term proportional to $g(\widehat{X}_{T_N})\mathbf{1}_{\{N>0\}}$ serves as a control variate to ensure that $\widehat{\psi}$ has finite variance. Indeed, since $\mathbf{E}[\Delta W_{N+1}\,|\,T_N] = 0$, it is easy to see that
$$\mathbf{E}\left[\frac{1}{1-F_\tau(1-T_N)}\cdot g(\widehat{X}_{T_N})\mathbf{1}_{\{N>0\}}\cdot\prod_{k=1}^N\frac{\widehat{W}_k}{f_\tau(T_k-T_{k-1})}\right] = 0,$$
and therefore $\mathbf{E}[\widehat{\psi} - \psi] = 0$.

Given $x, v \in \mathbb{R}^d$ and $t \in [0,1)$, consider the constant-drift diffusion process $\{\widetilde{X}^{t,x,v}_s\}_{s\in[t,1]}$ with $\widetilde{X}^{t,x,v}_t = x$ and
$$\mathrm{d}\widetilde{X}^{t,x,v}_s = v\,\mathrm{d}s + \mathrm{d}W_s, \qquad s \in [t,1].$$
This process has the infinitesimal generator
$$L^v h(x,t) := v^{\mathsf{T}}\nabla h(x,t) + \frac{1}{2}\,\mathrm{tr}\,\nabla^2 h(x,t), \qquad \forall h \in C^{2,1}(\mathbb{R}^d\times[0,1]).$$
Then, by Dynkin’s formula (Kallenberg, 2002, Lemma 19.21), for any t ≤ s ≤ , h ( ˜ X t,x,vs , s ) = h ( ˜ X t,x,vt , t ) + Z st (cid:26) ∂∂r + L v (cid:27) h ( ˜ X t,x,vr , r ) d r + M ts , (E.3)where { M ts } s ∈ [ t, is a martingale. In particular, let h ∈ C , ( R d × [0 , be a bounded solution of theCauchy problem ∂h∂t + L t h = 0 , h ( · ,
1) = g ( · ) (E.4)where L t h ( x, t ) := b ( x, t ) T ∇ h ( x, t ) + 12 tr ∇ h ( x, t ) . ∂h∂t + L v h = ( v − b ) T h, h ( · ,
1) = g ( · ) and using this in (E.3), we obtain the formula h ( ˜ X t,x,vs , s )= g ( ˜ X t,x,v ) + Z s (cid:0) b ( ˜ X t,x,vr , r ) − v (cid:1) T ∇ h ( ˜ X t,x,vr , r ) d r + M ts − M t , t ≤ s ≤ . In particular, since h ( x, t ) = E [ g ( X ) | X t = x ] by the Feynman–Kac formula, we have h ( x, t ) = E " g ( ˜ X t,x,v ) + Z t (cid:0) b ( ˜ X t,x,vs , s ) − v (cid:1) T ∇ h ( ˜ X t,x,vs , s ) d s , (E.5)where E [ M tt − M t ] = 0 since M h,t is a martingale.Using Eq. (E.5) with t = 0 and v = v := b ( x, , we have h ( x,
0) = E " g ( ˜ X ,x,v ) + Z (cid:0) b ( ˜ X t,x,vs , s ) − b ( x, (cid:1) T ∇ h ( ˜ X t,x,vs , s ) d s . Recalling that T = τ ∧ is independent of the Brownian motion { W t } and P [ T ≥
1] = P [ τ ≥
1] =1 − F τ (1) , we have E [ g ( ˜ X ,x,v )] = 11 − F τ (1) E [ g ( ˜ X ,x,v ) { T ≥ } ] , (E.6)and E "Z (cid:0) b ( ˜ X ,x,v s , s ) − b ( x, (cid:1) T ∇ h ( ˜ X ,x,v s , s ) d s = E (cid:20) f τ ( T ) (cid:0) b ( ˜ X ,x,v T , T ) − b ( x, (cid:1) T ∇ h ( ˜ X ,x,v T , T ) { T < } (cid:21) . (E.7)Since the process ˜ X ,x,v coincides with b X on [0 , T ] , it follows from (E.6) and (E.7) that h ( x, E " − F τ (∆ T ) g ( b X ) { T ≥ } + 1 f τ (∆ T ) (cid:0) b ( b X T , T ) − b ( b X T , T ) (cid:1) T ∇ h ( b X T , T ) { T < } (E.8) = E [ ψ ] , where the last equality follows from the fact that T = ∆ T ≥ if and only if N = 0 .24y Lemma E.1 in Section E.3, ∇ h ( x, E " g ( ˜ X ,x,v ) W + Z (cid:16)(cid:0) b ( ˜ X t,x,vs , s ) − b ( x, (cid:1) T ∇ h ( ˜ X t,x,vs , s ) (cid:17) W s s d s = E " − F τ (1) g ( b X ) ∆ W ∆ T { T ≥ } + 1 f τ (∆ T ) (cid:16)(cid:0) b ( b X T , T ) − b ( b X T , T ) (cid:1) T ∇ h ( b X T , T ) (cid:17) ∆ W ∆ T { T < } (E.9)Moreover, if we change the initial condition from t = 0 , v = v to t = T , v = v := b ( b X T , T ) , then itfollows from (E.9) that, conditionally on ( b X T , T ) , whenever T < , ∇ h ( b X T , T ) = E " − F τ (∆ T ) g ( b X ) ∆ W ∆ T { T ≥ } + 1 f τ (∆ T ) (cid:16)(cid:0) b ( b X T , T ) − b ( b X T , T ) (cid:1) T ∇ h ( b X T , T ) (cid:17) ∆ W ∆ T { T < } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b X T , T . (E.10)Substituting (E.10) into (E.8) and using the fact that the event { T < ≤ T } is equivalent to { N = 1 } , wehave h ( x,
0) = E [ ψ ] . Repeating this procedure, we have E [ g ( X ) | X = x ] = h ( x,
0) = E [ ψ n ] , n ≥ . We claim that the sequence { ψ n } n ≥ is uniformly integrable. To see this, first observe that, for each k , E [ k ∆ Wk +1 k|| T k +1 ] ≤ (∆ Tk +1 d ) / . Then the uniform integrability follows from the boundedness of b , g , ∇ h ,and from Lemma E.2 in Section E.3. Therefore, taking the limit as n → ∞ , we obtain E [ g ( X ) | X = x ] = lim n →∞ E [ ψ n ] = E (cid:20) lim n →∞ ψ n (cid:21) = E [ ψ ] , where the second equality follows from the dominated convergence theorem. E.2 Variance
Let $L := L_b \vee L_g$. For $0 \le k \le N+1$, let $\Delta\widehat{X}_k := \widehat{X}_{T_{k+1}} - \widehat{X}_{T_k}$ denote the increments of $\widehat{X}$. Since $T_{N+1} = 1$, we have
$$\big|g(\widehat{X}_1) - g(\widehat{X}_{T_N})\mathbf{1}_{\{N>0\}}\big| \le \begin{cases} |g(x_0)| + L\|\Delta\widehat{X}_0\|, & N = 0, \\ L\|\Delta\widehat{X}_{N+1}\|, & N > 0, \end{cases}$$
which gives
$$\big|g(\widehat{X}_1) - g(\widehat{X}_{T_N})\mathbf{1}_{\{N>0\}}\big| \le |g(x_0)|\mathbf{1}_{\{N=0\}} + L\Big(\sqrt{\Delta T_{N+1}} + \|\Delta\widehat{X}_{N+1}\|\Big).$$
We can now estimate $\widehat{\psi}$ as follows:
$$|\widehat{\psi}| \le \frac{e^a}{1-F_\tau(1)}\cdot\Big(|g(x_0)| + L\big(\sqrt{\Delta T_1} + \|\Delta\widehat{X}_1\|\big)\Big)\cdot\prod_{k=1}^N \frac{CL\big(\sqrt{\Delta T_{k+1}} + \|\Delta\widehat{X}_{k+1}\|\big)}{\Delta T_{k+1}}\cdot\|\Delta W_{k+1}\|,$$
where, for $k \ge 0$,
$$\|\Delta\widehat{X}_{k+1}\| = \|b(\widehat{X}_{T_k}, T_k)\cdot\Delta T_{k+1} + \Delta W_{k+1}\| \le b_\infty\Delta T_{k+1} + \|\Delta W_{k+1}\|.$$
Let $\mathcal{F}_k := \sigma(T_j, \widehat{X}_{T_j} : 1 \le j \le k)$. Then, since $\mathrm{Law}(\Delta W_{k+1}\,|\,\mathcal{F}_k) = \mathrm{Law}((\Delta T_{k+1})^{1/2} Z\,|\,\mathcal{F}_k)$, where $Z \sim \gamma_d$ is independent of $\mathcal{F}_k \vee \sigma(T_{k+1})$, we have
$$\mathbf{E}\left[\left(\frac{(\Delta T_{k+1})^{1/2} + \|\Delta\widehat{X}_{k+1}\|}{\Delta T_{k+1}}\cdot\|\Delta W_{k+1}\|\right)^2\,\middle|\,\mathcal{F}_k\right] \le \mathbf{E}\left[\left(\frac{b_\infty\Delta T_{k+1} + (\Delta T_{k+1})^{1/2}(1 + \|Z\|)}{\Delta T_{k+1}}\cdot\sqrt{\Delta T_{k+1}}\,\|Z\|\right)^2\,\middle|\,\mathcal{F}_k\right] \le \mathbf{E}\big[(1 + b_\infty + \|Z\|)^2\|Z\|^2\big].$$
Therefore, we can estimate
$$\mathbf{E}[\widehat{\psi}^2] \le \left(\frac{e^a}{1-F_\tau(1)}\right)^2\cdot\big(|g(x_0)| + L(1+\sqrt{d})\big)^2\cdot\mathbf{E}\big[\exp(\kappa N)\big],$$
where $\kappa := \log\big(C^2L^2\,\mathbf{E}[(1 + b_\infty + \|Z\|)^2\|Z\|^2]\big)$.

E.3 Auxiliary lemmas
The following lemma is a straightforward consequence of the Gaussian integration-by-parts formula $\nabla_x \mathbf{E}[f(x+Z)] = \mathbf{E}[f(x+Z)Z]$, $Z \sim \gamma_d$, for any $C^1$ function $f : \mathbb{R}^d \to \mathbb{R}$:

Lemma E.1 (Henry-Labordère et al. (2017)). Let $\nu$ be a positive measure on $[0,1]$. Let $\varphi : \mathbb{R}^d \times [0,1] \to \mathbb{R}$ be a continuous function, such that
$$\int_0^1 \mathbf{E}\left[\left\|\varphi(x + vt + W_t, t)\frac{W_t}{t}\right\|\right]\nu(\mathrm{d}t) < \infty.$$
Then
$$\nabla_x\left(\int_0^1 \mathbf{E}[\varphi(x + vt + W_t, t)]\,\nu(\mathrm{d}t)\right) = \int_0^1 \mathbf{E}\left[\varphi(x + vt + W_t, t)\frac{W_t}{t}\right]\nu(\mathrm{d}t).$$
The next lemma is used to show that the sequence $\{\psi_n\}$ is uniformly integrable:

Lemma E.2.
For any $C > 0$,
$$\mathbf{E}\left[\frac{C^N}{\big(1 - F_\tau(\Delta T_{N+1})\big)\prod_{k=1}^N f_\tau(\Delta T_k)(\Delta T_{k+1})^{1/2}}\right] < \infty. \tag{E.11}$$
Proof. For each $n \ge 1$, define the $n$-simplex
$$\mathsf{S}_n := \big\{(s_1, s_2, \ldots, s_n) \in [0,1]^n : 0 < s_1 < \ldots < s_n < 1\big\},$$
with the conventions $s_0 \equiv 0$ and $s_{n+1} \equiv 1$. Consider the partial sums $S_k := \sum_{i=1}^k \tau_i$. Since the $\tau_i$'s are i.i.d., the conditional joint density of $(S_1, S_2, \ldots, S_n)$ given $N = n$ is equal to
$$q_n(s_1, s_2, \ldots, s_n) = \frac{1}{\mathbf{P}[N = n]}\cdot\big(1 - F_\tau(1 - s_n)\big)\cdot\prod_{k=1}^n f_\tau(s_k - s_{k-1}), \qquad (s_1, \ldots, s_n) \in \mathsf{S}_n.$$
Then a calculation similar to the one in Appendix B of Andersson and Kohatsu-Higa (2017) leads to
$$\mathbf{E}\left[\frac{C^N}{\big(1 - F_\tau(\Delta T_{N+1})\big)\prod_{k=1}^N f_\tau(\Delta T_k)(\Delta T_{k+1})^{1/2}}\right] = \sum_{n\ge 0}\mathbf{P}[N=n]\cdot C^n\int_{\mathsf{S}_n}\frac{q_n(s_1,\ldots,s_n)\,\mathrm{d}s}{\big(1-F_\tau(1-s_n)\big)\prod_{k=1}^n f_\tau(s_k - s_{k-1})(s_{k+1}-s_k)^{1/2}} \le \sum_{n\ge 0} C^n\int_{\mathsf{S}_n}\prod_{k=0}^n (s_{k+1}-s_k)^{-1/2}\,\mathrm{d}s = \sqrt{\pi}\cdot E_{1/2,1/2}(C\sqrt{\pi}),$$
where $\mathrm{d}s$ is the Lebesgue measure on $\mathsf{S}_n$ and
$$E_{\alpha,\beta}(z) := \sum_{k=0}^\infty \frac{z^k}{\Gamma(\beta + \alpha k)}, \qquad z \in \mathbb{C},\ \alpha, \beta > 0, \tag{E.12}$$
is the Mittag–Leffler function (Erdélyi et al., 1955). When $\alpha$ and $\beta$ are both real and positive, the series in (E.12) converges for all values of $z \in \mathbb{C}$, which completes the proof.

F Proof of Lemma 4.1
For each $t \ge 0$, let $N_t := \max\{k : S_k < t \le S_{k+1}\}$. Then $N = N_1$ and $T_n = S_n$ for $n \le N$. Moreover, $\{N_t\}_{t\ge 0}$ is a renewal process with renewal times $\{S_k\}_{k\ge 1}$ and i.i.d. interrenewal times with pdf $f_\tau$. The moment-generating function of $N_t$ can be upper-bounded as follows (Glynn and Whitt, 1994):
$$\mathbf{E}[e^{\theta N_t}] \le e^\theta \sum_{k=0}^\infty e^{\theta k}\,\mathbf{P}[S_k < t]. \tag{F.1}$$
Let $t = 1$ and fix some $\beta > 0$. Then
$$\sum_{k=0}^\infty e^{\theta k}\,\mathbf{P}[S_k < 1] = \sum_{k\le\beta} e^{\theta k}\,\mathbf{P}[S_k < 1] + \sum_{k>\beta} e^{\theta k}\,\mathbf{P}[S_k < 1] \le (\beta+1)e^{\theta\beta} + \sum_{k=0}^\infty e^{\theta k}\,\mathbf{P}[S_k < k\beta^{-1}].$$
Since the $\tau_i$'s are i.i.d., we can further estimate
$$\mathbf{P}[S_k < k\beta^{-1}] = \mathbf{P}[k - \beta S_k > 0] \le e^k\,\mathbf{E}\big[e^{-\beta S_k}\big] = \big(e M_\tau(-\beta)\big)^k.$$
Substituting these estimates into (F.1) and optimizing over $\beta$, we get (4.6).

Acknowledgments
The authors would like to thank Matus Telgarsky for many enlightening discussions. This work was supported in part by the NSF CAREER award CCF-1254041, in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, in part by the Center for Advanced Electronics through Machine Learning (CAEML) I/UCRC award no. CNS-16-24811, and in part by the Office of Naval Research under grant no. N00014-12-1-0998.
References
Radosław Adamczak. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000–1034, 2008.

Patrik Andersson and Arturo Kohatsu-Higa. Unbiased simulation of stochastic differential equations using parametrix expansions. Bernoulli, 23(3):2028–2057, 2017.

Vlad Bally and Arturo Kohatsu-Higa. A probabilistic interpretation of the parametrix method. Annals of Applied Probability, 25(6):3095–3138, 2015.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Michelle Boué and Paul Dupuis. A variational representation for certain functionals of Brownian motion. Annals of Probability, 26(4):1641–1659, 1998.

Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete and Computational Geometry, 57:757–783, 2018.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Neural Information Processing Systems, 2018.

Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23:313–329, 1991.

Ronen Eldan and James R. Lee. Regularization under diffusion and anticoncentration of the information content. Duke Mathematical Journal, 167(5):969–993, 2018.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Proceedings of the 29th Annual Conference on Learning Theory, pages 907–940, 2016.

Arthur Erdélyi, Wilhelm Magnus, Fritz Oberhettinger, and Francesco G. Tricomi. Higher Transcendental Functions, volume III. McGraw-Hill, New York, 1955.

Wendell H. Fleming. Exit probabilities and optimal stochastic control. Applied Mathematics and Optimization, 4:329–346, 1978.

Wendell H. Fleming and Raymond W. Rishel. Deterministic and Stochastic Optimal Control. Springer, 1975.

Wendell H. Fleming and Sheunn-Jyi Sheu. Stochastic variational formula for fundamental solutions of parabolic PDE. Applied Mathematics and Optimization, 13:193–204, 1985.

Hans Föllmer. An entropy approach to time reversal of diffusion processes. In Stochastic Differential Systems (Marseille-Luminy, 1984), volume 69 of Lecture Notes in Control and Information Sciences. Springer, 1985.

Evarist Giné and Richard Nickl.
Mathematical Foundations of Infinite-Dimensional Statistical Models . Cam-bridge University Press, 2016.Peter W. Glynn and Ward Whitt. Large deviations behavior of counting processes and their inverses.
Queue-ing Systems , 17:107–128, 1994.Carl Graham and Denis Talay.
Stochastic Simulation and Monte Carlo Methods: Mathematical Foundationsof Stochastic Simulation , volume 68 of
Stochastic Modeling and Applied Probability . Springer, 2013.Andreas Griewank and Andrea Walther.
Evaluating Derivatives: Principles and Techniques of AlgorithmicDifferentiation . SIAM, 2nd edition, 2008.Tatsunori Hashimoto, David Gifford, and Tommi Jaakkola. Learning population-level diffusions with gen-erative RNNs. In
Proceedings of the 33rd International Conference on Machine Learning , pages 2417–2426, 2016.Pierre Henry-Labordère, Xiaolu Tian, and Nizar Touzi. Unbiased simulation of stochastic differential equa-tions.
Annals of Applied Probability , 27(6):3305–3341, 2017.Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mappingand its derivatives using multilayer feedforward networks.
Neural Networks , 3(5):551–560, 1990.Benton Jamison. The Markov processes of Schrödinger.
Zeitschrift für Wahrscheinlichkeitstheorie undverwandte Gebiete , 32(4):323–331, 1975.Olav Kallenberg.
Foundations of Modern Probability . Springer, 2nd edition, 2002.Vladimir Koltchinskii.
Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems ,volume 2033 of
Lecture Notes in Mathematics . Springer, 2011.Joseph Lehec. Representation formula for the entropy and functional inequalities.
Annales de l’InstitutHenri Poincaré – Probabilités et Statistiques , 49(3):885–899, 2013.Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic algorithms.In
Proceedings of the 34th International Conference on Machine Learning , pages 2101–2110, 2017.Qianxiao Li, Long Chen, Cheng Tai, and Weinan E. Maximum principle based algorithms for deep learning.
Journal of Machine Learning Research , 18:1–29, 2018.29in Li. Simultaneous approximation of multivariate functions and their derivatives by neural networks withone hidden layer.
Neurocomputing , 12:327–343, 1996.Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximateBayesian inference.
Journal of Machine Learning Research , 18:1–35, 2017.Javier R. Movellan, Paul Mineiro, and Ruth J. Williams. A Monte Carlo EM approach for partially observ-able diffusion processes: theory and applications to neural networks.
Neural Computation , 14:1507–1544,2002.Michele Pavon. Stochastic control and nonequilibrium thermodynamical systems.
Applied Mathematicsand Optimization , 19:187–202, 1989.Philip E. Protter.
Stochastic Integration and Differential Equations . Springer, 2nd edition, 2005.Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via Stochastic GradientLangevin Dynamics: a nonasymptotic analysis. In
Proceedings of the 2017 Conference on LearningTheory , 2017.Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approx-imate inference in deep generative models. In
Proceedings of the 2014 International Conference onMachine Learning , pages 1278–1286, 2014.Tom Ryder, Andrew Golightly, A. Steven McGough, and Dennis Prangle. Black-box variational inferencefor stochastic differential equations. In
Proceedings of the 35th International Conference on MachineLearning , pages 4423–4432, 2018.Itay Safran and Ohad Shamir. Depth-width tradeoffs in approximating natural functions with neural net-works. In
Proceedings of the 34th International Conference on Machine Learning , pages 2979–2987,2017.Erwin Schrödinger. Über die Umkehrung der Naturgesetze.
Sitzung ber Preuss. Akad. Wissen., Berlin Phys.Math. , 144, 1931.Sheunn-Jyi Sheu. Some estimates of the transition density of a nondegenerate diffusion Markov process.
Annals of Probability , 19(2):538–561, 1991.Daniel W. Stroock.
An Introduction to Partial Differential Equations for Probabilists . Cambridge UniversityPress, 2008.Matus Telgarsky. Neural networks and rational functions. In
Proceedings of the 34th International Confer-ence on Machine Learning , pages 3387–3393, 2017.Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methodsin optimization.
Proceedings of the National Academy of Sciences (U.S.) , 113(47):E7351–E7358, 2016.Lin Yang, Raman Arora, Vladimir Braverman, and Tuo Zhao. The physical systems behind optimizationalgorithms. In
Neural Information Processing Systems , 2018.Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks.
Neural Networks , 94:103–114, 2017. 30oseph E. Yukich, Maxwell B. Stinchcombe, and Halbert White. Sup-norm approximation bounds for neuralnetworks through probabilistic methods.
IEEE Transactions on Information Theory , 41(4):1021–1027,July 1995.Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevindynamics. In