Adaptive Non-reversible Stochastic Gradient Langevin Dynamics
Vikram Krishnamurthy and George Yin
Abstract
It is well known that adding any skew symmetric matrix to the gradient in the Langevin dynamics algorithm results in a non-reversible diffusion with improved convergence rate. This paper presents a gradient algorithm to adaptively optimize the choice of the skew symmetric matrix. The resulting algorithm involves a non-reversible diffusion cross coupled with a stochastic gradient algorithm that adapts the skew symmetric matrix. The algorithm uses the same data as the classical Langevin algorithm. A weak convergence proof is given for the optimality of the choice of the skew symmetric matrix. The improved convergence rate of the algorithm is illustrated numerically in Bayesian learning and tracking examples.
Keywords. Langevin dynamics, non-reversible dynamics, skew symmetric matrix, stochastic gradient algorithm, weak convergence, adaptive Bayesian learning

I. INTRODUCTION
Langevin dynamics are used for global stochastic optimization (see for example [1], [2]) and also as a non-parametric method for reconstructing (exploring) cost functions (such as posterior densities) from noisy evaluations of the gradient [3], [4]. The idea is as follows. Suppose C(θ) is a continuously differentiable cost function on the interior of a compact set Θ ⊂ R^N. Let ∇̂_θ c_k(θ_k) denote a noisy observation of the gradient ∇_θ C(θ_k) evaluated at the point θ_k ∈ R^N. Then the classical stochastic gradient Langevin algorithm and its associated continuous-time Langevin diffusion process are, respectively,

(Langevin algorithm) θ_{k+1} = θ_k − ε ∇̂_θ c_k(θ_k) + √ε √(2/β) w_k    (1)
(Langevin diffusion) dθ(t) = −∇_θ C(θ) dt + √(2/β) dW(t),  t ≥ 0.    (2)

In the Langevin algorithm (1), the step size ε is a small positive constant, {w_k, k ≥ 0} is an i.i.d. sequence of standard N-variate Gaussian random variables, and β > 0 denotes the inverse temperature parameter. In the continuous-time Langevin diffusion process (2), W(t) denotes standard N-variate Brownian motion. The Langevin dynamics algorithm (1) is obtained by an Euler-Maruyama time discretization of the Langevin diffusion process (2).
It is straightforwardly shown that the stationary distribution of the Langevin diffusion (2) is the Gibbs measure

π(θ) ∝ exp(−β C(θ)).    (3)

Therefore, the Langevin dynamics algorithm (1) leads to the following two immediate applications:
1) Reconstructing costs and Bayesian learning. For fixed β, let π̂(θ) denote the empirical density function constructed from samples {θ_k} generated by the Langevin dynamics (1). Then clearly log π̂(θ) ∝ −C(θ).
Thus the Langevin dynamics algorithm serves as a non-parametric method for reconstructing (exploring) C(θ) given the gradient estimates {∇̂_θ c(θ_k)}. Specifically, this is useful in Bayesian learning [3] where C(θ) is the expectation of the posterior density; in such cases computing the posterior can be difficult due to the normalization factor, yet it is easy to simulate noisy gradients from the product of the likelihood and the prior.
2) Global optimization. For sufficiently large β, using Laplace asymptotics, it can be shown that the Gibbs distribution π in (3) concentrates around the global minimizers of C(θ). So for large β, the Langevin dynamics algorithm (1) serves as a global minimization algorithm for a non-convex cost C(θ).

Motivation. Accelerated Non-reversible Diffusions
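As a concrete baseline, the classical update (1) can be sketched as follows. This is a minimal sketch assuming a synthetic quadratic cost C(θ) = 0.5‖θ‖² (so ∇C(θ) = θ) with additive Gaussian gradient noise; the cost, noise level, and constants are illustrative assumptions, not from the paper.

```python
import numpy as np

# Minimal sketch of the Langevin algorithm (1) on a synthetic quadratic
# cost C(theta) = 0.5*||theta||^2, so grad C(theta) = theta.  The noisy
# gradient adds Gaussian noise to mimic stochastic evaluations.
rng = np.random.default_rng(0)
N = 2          # dimension of theta
eps = 1e-2     # constant step size epsilon
beta = 1.0     # inverse temperature

def noisy_grad(theta):
    # noisy evaluation of grad C(theta) = theta
    return theta + 0.1 * rng.standard_normal(N)

theta = np.array([4.0, 4.0])
samples = []
for k in range(20000):
    w = rng.standard_normal(N)   # injected noise w_k
    theta = theta - eps * noisy_grad(theta) + np.sqrt(eps) * np.sqrt(2.0 / beta) * w
    samples.append(theta.copy())

# For this cost the Gibbs measure (3) is approximately N(0, I/beta), so
# the tail samples should have mean near 0 and variance near 1/beta.
tail = np.array(samples[5000:])
```

For this quadratic cost the empirical moments of the tail samples approximate the Gibbs measure, illustrating application 1) above.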
The Langevin dynamics (2) is a reversible diffusion process. However, the convergence rate to the stationary distribution π(·) can be slow. It is well known [5], [6], [7] that adding any skew symmetric matrix to the gradient always improves the convergence rate of Langevin dynamics to its stationary distribution. That is, for any N × N skew symmetric matrix S, the non-reversible accelerated diffusion

(Accelerated Diffusion) dθ(t) = −(I + S) ∇_θ C(θ) dt + √(2/β) dW(t),  t ≥ 0,    (4)

has a larger spectral gap and therefore converges to the same stationary distribution π(θ) faster than (2); see [5], [6], [7] for a formal proof. The resulting accelerated gradient algorithm, obtained by an Euler-Maruyama time discretization of (4), is

(Accelerated Algorithm) θ_{k+1} = θ_k − ε [I + S] ∇̂_θ c_k(θ_k) + √ε √(2/β) w_k.    (5)

Vikram Krishnamurthy is with the School of Electrical & Computer Engineering, Cornell University, NY 14853, USA ([email protected]). G. Yin is with the Department of Mathematics, University of Connecticut, Storrs, CT 06269-1009, USA ([email protected]).
In the opposite direction, it is well known that the interpolated process constructed from (1) converges weakly to (2).

Main idea
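The skew symmetric properties used above, and the invariance of the Gibbs measure under (4), can be checked numerically. The sketch below assumes a quadratic cost C(θ) = 0.5‖θ‖², for which the stationary covariance Σ of (4) solves the Lyapunov equation (I+S)Σ + Σ(I+S)′ = (2/β)I, and Σ = I/β satisfies it for any skew symmetric S.

```python
import numpy as np

# Check, for a quadratic cost C(theta) = 0.5*||theta||^2 (an illustrative
# assumption), that adding a skew-symmetric S leaves the Gibbs measure
# invariant: Sigma = I/beta solves the stationary Lyapunov equation
# (I+S) Sigma + Sigma (I+S)' = (2/beta) I for ANY skew-symmetric S.
rng = np.random.default_rng(0)
N, beta = 4, 2.0
A = rng.standard_normal((N, N))
S = A - A.T                          # skew-symmetric by construction: S' = -S
assert np.allclose(S.T, -S)

x = rng.standard_normal(N)
assert np.isclose(x @ S @ x, 0.0)    # x'Sx = 0 for all x

Sigma = np.eye(N) / beta
lhs = (np.eye(N) + S) @ Sigma + Sigma @ (np.eye(N) + S).T
assert np.allclose(lhs, (2.0 / beta) * np.eye(N))
```

The check uses only the defining property S′ = −S, so the stationary distribution is unchanged for every admissible S; only the speed of convergence differs.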
The natural question is:
How to choose the skew symmetric matrix S in the accelerated algorithm (5)? In this context, the main idea of the paper is twofold:
1) Our first result is to construct an adaptive version of the above non-reversible diffusion by adapting the skew symmetric matrix S. In simple terms, we adapt the skew symmetric matrix S in real time via a stochastic gradient algorithm, so that it converges to a local optimum. Thus the algorithm comprises a non-reversible diffusion (5) cross-coupled with another stochastic gradient algorithm that updates the skew symmetric matrix S_k at each time k. We propose 3 different non-reversible diffusion algorithms in Sec. II: a Hessian based algorithm, and two types of finite difference simultaneous perturbation stochastic approximation (SPSA) algorithms (which are computationally more efficient than the Hessian based algorithm). SPSA has been used as an efficient finite difference method for evaluating gradient estimates in classical stochastic gradient algorithms [8]. To the best of our knowledge, SPSA has not been used in the context of Langevin dynamics. In extensive numerical studies (including real datasets) we show that all 3 algorithms always perform better than the vanilla non-reversible diffusion algorithm (5).
2) Our second result is a tracking analysis for non-stationary global stochastic optimization; we show that the algorithm can track a time-varying global optimum that jump changes according to a slowly varying Markov chain. Specifically, we are interested in tracking the global minimum of a non-convex stochastic optimization problem when the minimum jump changes (evolves) over time according to the sample path of an unknown Markov chain.
Specifically, we analyze how well a fixed step size stochastic gradient Langevin algorithm (and generalized Langevin algorithms where the variance of the injected noise is adapted over time) tracks the time evolving global minima when the algorithm does not have knowledge of the Markovian evolution of the minima.

Context
For the case where C(θ) is quadratic in θ, [7] gives an algorithm to choose the optimal skew symmetric S to maximize the spectral gap of the diffusion (4). However, for general costs C(·) there is no obvious way of maximizing the spectral gap. Our idea of adapting the skew symmetric matrix in a non-reversible diffusion stems from [9], [10], where stochastic gradient algorithms were proposed for adapting the step size of a stochastic gradient algorithm. Indeed, the idea of using a stochastic gradient algorithm to update the step size was proposed originally as an exercise in [10, Exercise 4.4.2] in the context of least mean squares (LMS) algorithms. Such adaptive step size LMS algorithms have been shown to perform extremely well in wireless communication applications [11], [12]. Of course, the setup in the current paper is different since we are adapting a skew symmetric matrix to accelerate a non-reversible diffusion process (rather than adapting the scalar step size of a classical LMS algorithm).
An important feature of the adaptive non-reversible diffusion algorithms (Algorithms 1, 2 and 3 proposed in this paper) is the constant step size ε (as opposed to a decreasing step size). This facilitates estimating (tracking) parameters that evolve over time. Sec. V gives a formal weak convergence analysis of the asymptotic tracking capability of algorithm (6) when the cost C(·) jump changes over time according to a slow (but unknown) Markov chain. The most interesting case considered in Sec. V is when the reward changes at the same rate as the algorithm. Then stochastic averaging theory yields a Markov switched diffusion limit as the asymptotic behavior of the algorithm. Due to the constant step size, the appropriate notion of convergence is weak convergence [13], [14], [15].
The Markovian hyper-parameter tracking analysis generalizes our earlier work [16], [17] on stochastic gradient algorithms to the current case of Langevin dynamics.
Most existing literature analyzes stochastic approximation algorithms for tracking a parameter that evolves according to a "slowly time-varying" sample path of a continuous-valued process, so that the parameter changes by small amounts over small intervals of time. When the rate of change of the underlying parameter is slower than the adaptation rate of the stochastic approximation algorithm (e.g., a slow random walk), the mean square tracking error can be analyzed as in [10], [13], [18], [19]. In comparison, our analysis covers the case where the global optimum evolves with discrete jumps that can be arbitrarily large in magnitude on short intervals of time. Also, the jumps can occur on the same time scale as the speed of adaptation of the stochastic approximation algorithm. Two-time scale and singularly perturbed jump Markov systems are studied in [20]. Finally, we mention that [21], [22] study convergence of the Langevin dynamics stochastic gradient algorithm in a non-asymptotic setting. Although the setting in our paper is asymptotic, it is of interest in future work to study the non-asymptotic setting.

Recall S is skew symmetric if S′ = −S. Clearly the diagonal elements of a skew symmetric matrix are zero; also x′Sx = 0 for all x ∈ R^N.

II. ADAPTIVE NON-REVERSIBLE DIFFUSION ALGORITHMS
The key idea behind the adaptive algorithms below is to parametrize θ by S; denote this as θ(S). Then one can pose a stochastic optimization problem to find the skew symmetric matrix S* that minimizes C(θ(S)). In this section we propose three adaptive algorithms to adapt S: a Hessian based algorithm (Algorithm 1), an SPSA algorithm (Algorithm 2), and a two-time scale SPSA algorithm (Algorithm 3). From a practical point of view, the SPSA algorithm is numerically efficient and yields results comparable to the more expensive Hessian based algorithm.

A. Algorithm 1. Hessian Based Adaptive Diffusion
Let ε, α be small positive fixed step sizes with α = o(ε). Let e_i denote the unit vector with 1 in the i-th position. Then the algorithm is as follows:

θ_{k+1} = θ_k − ε (I + S_k) ∇̂_θ c_k(θ_k) + √ε √(2/β) w_k    (6a)
S_{k+1}(i,j) = [ S_k(i,j) − α ∇̂′_θ c_k(θ_k) D_k(i,j) ]_{S−}^{S+},  i > j    (6b)
S_{k+1}(i,j) = −S_{k+1}(j,i), i < j;  S_{k+1}(i,i) = 0    (6c)
D_{k+1}(i,j) = D_k(i,j) − ε (I + S_k) ∇̂²_θ c_k(θ_k) D_k(i,j) − ε (e_j − e_i)′ ∇̂_θ c_k(θ_k)    (6d)

Eq. (6a) is simply the non-reversible diffusion (5) with injected noise w_k. Recall β > 0 is the inverse temperature parameter and S_k is an N × N skew symmetric matrix. In (6b), S_0 is initialized to an arbitrary skew symmetric matrix. The notation [·]_{S−}^{S+} indicates that the estimate S_k(i,j) is projected onto the closed interval [S−, S+] if the estimate lies outside this region. The recursion (6b) can be viewed as a stochastic gradient algorithm with step size α to minimize the cost C(θ(S)) wrt S. Formally, the gradient is ∇_{S(i,j)} C = [∇_θ C]′ dθ/dS(i,j), so an estimate of the gradient ∇̂_{S(i,j)} c(θ_k) is ∇̂′_θ c_k(θ_k) D_k(i,j), where D_k(i,j) = dθ/dS(i,j).
The third equation (6c) enforces that S_k is skew symmetric, namely S′_k = −S_k.
The final recursion (6d) is obtained by taking the "derivative" of the first recursion with respect to S, defining the vector D_k(i,j) = dθ_k/dS(i,j) ∈ R^N by holding S fixed. Then differentiating (6a) wrt S(i,j) yields

D_{k+1}(i,j) = D_k(i,j) − ε (I + S_k) [∂/∂S(i,j)] ∇̂_θ c_k(θ_k) − ε [dS_k/dS(i,j)] ∇̂_θ c_k(θ_k)

and

[∂/∂S(i,j)] ∇̂_θ c_k(θ_k) = ∇̂²_θ c_k(θ_k) dθ_k/dS(i,j),  dS_k/dS(i,j) = (e_j − e_i).

Note that (6d) involves the Hessian ∇̂²_θ c_k(θ_k). Formally, the process D_k in (6d) is interpreted as the derivative (d/dS) θ |_{θ = θ_k}.
This derivative process is defined in the mean square sense as in [9, p. 1406]:

lim_{∆→0} E ‖ D − [θ(S + ∆) − θ(S)] / ∆ ‖² = 0.

In summary, the adaptive non-reversible diffusion algorithm (6) is given by cross coupling two stochastic gradient algorithms (the first and second recursions) along with the derivative update of θ with respect to the parameter S (the final recursion). Note that when S is a fixed constant, S+ = S− = S, algorithm (6) reduces to the non-reversible diffusion algorithm (5).

Main Convergence Result (Informal)
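Before stating the result, the cross-coupled recursions (6a)-(6d) can be sketched in code. The sketch below assumes a synthetic quadratic cost C(θ) = 0.5‖θ‖², whose Hessian is the identity, so the Hessian-vector product in (6d) reduces to D itself; the cost, noise level, and projection bounds [S−, S+] are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Algorithm 1, eqs (6a)-(6d), for N = 2 on a synthetic
# quadratic cost C(theta) = 0.5*||theta||^2 (Hessian = I, so the
# Hessian-vector product in (6d) is just D).
rng = np.random.default_rng(0)
N, eps, alpha, beta = 2, 1e-2, 1e-3, 1.0
S_lo, S_hi = -5.0, 5.0               # assumed projection interval [S-, S+]

def noisy_grad(th):
    return th + 0.1 * rng.standard_normal(N)

def unit(i):
    v = np.zeros(N)
    v[i] = 1.0
    return v

theta = np.array([4.0, 4.0])
S = np.zeros((N, N))
D = {(i, j): np.zeros(N) for i in range(N) for j in range(i)}  # D_k(i,j), i > j

for k in range(5000):
    g = noisy_grad(theta)
    # (6a): non-reversible diffusion step
    w = rng.standard_normal(N)
    theta = theta - eps * (np.eye(N) + S) @ g + np.sqrt(eps) * np.sqrt(2.0 / beta) * w
    # (6b)-(6c): projected gradient step in S(i,j), skew symmetry enforced
    for (i, j), Dij in D.items():
        S[i, j] = np.clip(S[i, j] - alpha * (g @ Dij), S_lo, S_hi)
        S[j, i] = -S[i, j]
    # (6d): derivative process; Hessian = I here, and the last term is the
    # derivative of S_k wrt S(i,j) applied to the noisy gradient
    for (i, j) in D:
        dS_g = unit(i) * g[j] - unit(j) * g[i]
        D[(i, j)] = D[(i, j)] - eps * (np.eye(N) + S) @ D[(i, j)] - eps * dS_g
```

The three recursions run on the same data stream; only (6b) uses the slower step size α = o(ε), matching the time-scale separation assumed above.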
Since Algorithm 1 uses a constant step size (as opposed to a decreasing step size), the appropriate notion of convergence is weak convergence [13], [14], [15]. Recall that weak convergence is a function space generalization of convergence in distribution of random variables. The assumptions and main result will be stated formally and proved in Sec. IV. Here we give a heuristic statement. We will show that the sequence of estimates {θ_k} generated by Algorithm 1 converges weakly to a non-reversible accelerated diffusion with the optimal skew symmetric matrix.
As is typically done in weak convergence analysis, we first represent the sequence of estimates {θ_k} generated by Algorithm 1 as a continuous-time process. This is done by constructing the continuous-time trajectory via piecewise constant interpolation. Let T denote a positive real number which denotes the finite time horizon. For t ∈ [0, T], define the continuous-time piecewise constant interpolated process parametrized by the step size ε as

θ^ε(t) = θ_k for t ∈ [εk, εk + ε).

We can now state our main result.
Informal Result 1:
Under suitable assumptions (Sec. IV), the interpolated processes (θ^ε(·), D^ε(·), S^ε(·)) converge weakly to (θ(·), D(·), S(·)) such that the limit satisfies the following system of equations:

dθ = −(I + S) ∇C(θ) dt + √(2/β) dW
(d/dt) D(i,j) = −(I + S) ∇²C(θ) D(i,j) − (e_j − e_i)′ ∇C(θ)
(d/dt) S(i,j) = [ −∇′C(θ) D(i,j) ]_{S−}^{S+},  i > j    (7)

where W(·) is N-dimensional Brownian motion. □
The most important takeaway from the above result is that the skew symmetric matrix S satisfies the projected ordinary differential equation (ODE)

Ṡ(i,j) = −(d/dS(i,j)) C(θ(S)),  S ∈ (S−, S+).    (8)

Note that (8) implies that the adaptive algorithm for adjusting S is a gradient descent method. By the weak convergence, S_k will spend nearly all of the time in an arbitrarily small neighborhood of the local minima of E{c_k(θ(S))}, which is consistent with our motivation for the adaptive non-reversible diffusion Algorithm 1.

B. Algorithm 2. SPSA based Adaptive Diffusion
An issue with Algorithm 1 is its computational cost, which is excessive for large N; this cost is due to the update (6d), which requires O(N²) operations for each pair (i,j). Also evaluating the Hessian ∇̂²_θ c_k(θ) can be difficult in some stochastic optimization problems. Examining (6), we see that the Hessian arises as a by-product of evaluating the gradient estimate ∇̂_S c_k(θ_k). Below we propose a finite difference evaluation of ∇̂_S c_k(θ_k); this does not involve the Hessian. But a naive finite difference approximation to the gradient ∇_S c_k(θ_k) would require O(N²) evaluations (simulations) of the cost; namely, evaluating c_k(θ(S + e_ij)) and c_k(θ(S − e_ij)) for each component (i,j) of S. The main idea below is to evaluate this gradient estimate using the SPSA (simultaneous perturbation stochastic approximation) algorithm [8]. The SPSA algorithm picks two random matrices S_k + µ∆_k and S_k − µ∆_k to evaluate ∇_S c(θ_k) and therefore requires only 2 evaluations (simulations) of the cost c_k(·).
Let ε, µ, α be small positive fixed step sizes. We propose the following SPSA based algorithm that does not require computation of the Hessian:

θ⁺_{k+1} = θ⁺_k − ε (I + S_k + µ∆_k) ∇̂_θ c_k(θ⁺_k) + √ε √(2/β) w_k
θ⁻_{k+1} = θ⁻_k − ε (I + S_k − µ∆_k) ∇̂_θ c_k(θ⁻_k) + √ε √(2/β) w_k
S_{k+1}(i,j) = S_k(i,j) − α [c_k(θ⁺_k) − c_k(θ⁻_k)] / [2µ ∆_k(i,j)]    (9)

Here the elements of the matrix ∆_k are simulated as follows:

∆_k(i,i) = 0;  ∆_k(i,j) = +1 w.p. 1/2, −1 w.p. 1/2, i > j;  ∆_k(i,j) = −∆_k(j,i), i < j    (10)

Note S_k + µ∆_k and S_k − µ∆_k are skew symmetric matrices by construction. Algorithm 2 has a computational cost of O(N²) at each iteration.

C. Algorithm 3. Two time scale SPSA Adaptive Diffusion
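As a concrete reference before the two-time-scale variant, the one-step SPSA recursion (9)-(10) of Algorithm 2 can be sketched as follows. The quadratic cost, gradient noise, and the clipping of S (added purely for numerical safety) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Algorithm 2, eqs (9)-(10), assuming a synthetic
# quadratic cost c(theta) = 0.5*||theta||^2 with additive gradient noise.
rng = np.random.default_rng(0)
N, eps, mu, alpha, beta = 2, 1e-2, 0.1, 1e-3, 1.0

def cost(th):
    return 0.5 * th @ th

def noisy_grad(th):
    return th + 0.1 * rng.standard_normal(N)

def sample_Delta():
    # (10): skew-symmetric matrix of +/-1 entries, zero diagonal
    Delta = np.zeros((N, N))
    for i in range(N):
        for j in range(i):
            Delta[i, j] = rng.choice([-1.0, 1.0])
            Delta[j, i] = -Delta[i, j]
    return Delta

theta_p = np.array([4.0, 4.0])
theta_m = np.array([4.0, 4.0])
S = np.zeros((N, N))
for k in range(5000):
    Delta = sample_Delta()
    w = rng.standard_normal(N)   # common injected noise for both copies
    theta_p = theta_p - eps * (np.eye(N) + S + mu * Delta) @ noisy_grad(theta_p) \
        + np.sqrt(eps) * np.sqrt(2.0 / beta) * w
    theta_m = theta_m - eps * (np.eye(N) + S - mu * Delta) @ noisy_grad(theta_m) \
        + np.sqrt(eps) * np.sqrt(2.0 / beta) * w
    # (9): SPSA update of S using only two cost evaluations;
    # the clip is a numerical-safety assumption, not part of (9)
    diff = (cost(theta_p) - cost(theta_m)) / (2.0 * mu)
    for i in range(N):
        for j in range(i):
            S[i, j] = np.clip(S[i, j] - alpha * diff / Delta[i, j], -5.0, 5.0)
            S[j, i] = -S[i, j]
```

Note that the same cost-difference scalar updates every component of S, which is what reduces the per-iteration cost to two cost evaluations regardless of N.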
Algorithm 2 discussed above simultaneously evaluates the gradient and updates the estimates in one time step. In comparison, we now construct a two-time scale algorithm that proceeds as follows. Run the following recursion on the slow time scale k = 1, 2, ...:

θ_{k+1} = θ_k − ε (I + S_k) ∇̂_θ c_k(θ_k) + √ε √(2/β) w_k    (11)

and simulate ∆_k according to (10). Then for each k, run multiple steps n = 0, ..., N on the fast time scale to evaluate the estimate D_k of the gradient of c_k(θ) wrt S_k: initialize θ⁺_0 = θ⁻_0 = θ_k and

θ⁺_{n+1} = θ⁺_n − ε (I + S_k + µ∆_k) ∇̂_θ c_n(θ⁺_n) + √ε √(2/β) w_n
θ⁻_{n+1} = θ⁻_n − ε (I + S_k − µ∆_k) ∇̂_θ c_n(θ⁻_n) + √ε √(2/β) w_n
D_{n+1}(i,j) = D_n(i,j) + [c_n(θ⁺_n) − c_n(θ⁻_n)] / [2µ ∆_n(i,j)]    (12)

Finally, update S in (11) on the slow time scale as

S_{k+1}(i,j) = S_k(i,j) − α D_N(i,j).    (13)

Note that in the special case where (11) is omitted and (12) is run for one step, Algorithm 3 specializes to Algorithm 2.

III. NUMERICAL EXAMPLES. ADAPTIVE KL DIVERGENCE AND BAYESIAN LEARNING
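Before the numerical examples, the two-time-scale recursion (11)-(13) of Algorithm 3 can be sketched as follows. As in the earlier sketches, the quadratic cost, noise level, clipping of S, and fast-loop length are illustrative assumptions; the paper denotes the fast horizon by N, renamed n_fast here to avoid clashing with the dimension.

```python
import numpy as np

# Minimal sketch of the two-time-scale Algorithm 3, eqs (11)-(13): a
# slow theta/S recursion with an inner fast loop that accumulates an
# SPSA estimate D of the gradient of the cost wrt S.
rng = np.random.default_rng(0)
N, eps, mu, alpha, beta, n_fast = 2, 1e-2, 0.1, 1e-4, 1.0, 10

def cost(th):
    return 0.5 * th @ th

def noisy_grad(th):
    return th + 0.1 * rng.standard_normal(N)

def sample_Delta():
    Delta = np.zeros((N, N))
    for i in range(N):
        for j in range(i):
            Delta[i, j] = rng.choice([-1.0, 1.0])
            Delta[j, i] = -Delta[i, j]
    return Delta

theta = np.array([4.0, 4.0])
S = np.zeros((N, N))
for k in range(2000):
    # slow time scale (11)
    theta = theta - eps * (np.eye(N) + S) @ noisy_grad(theta) \
        + np.sqrt(eps) * np.sqrt(2.0 / beta) * rng.standard_normal(N)
    # fast time scale (12): accumulate the SPSA gradient estimate D
    D = np.zeros((N, N))
    th_p, th_m = theta.copy(), theta.copy()
    for n in range(n_fast):
        Delta = sample_Delta()
        w = rng.standard_normal(N)
        th_p = th_p - eps * (np.eye(N) + S + mu * Delta) @ noisy_grad(th_p) \
            + np.sqrt(eps) * np.sqrt(2.0 / beta) * w
        th_m = th_m - eps * (np.eye(N) + S - mu * Delta) @ noisy_grad(th_m) \
            + np.sqrt(eps) * np.sqrt(2.0 / beta) * w
        diff = (cost(th_p) - cost(th_m)) / (2.0 * mu)
        for i in range(N):
            for j in range(i):
                D[i, j] += diff / Delta[i, j]
                D[j, i] = -D[i, j]
    # slow update of S (13); clipping is a numerical-safety assumption
    for i in range(N):
        for j in range(i):
            S[i, j] = np.clip(S[i, j] - alpha * D[i, j], -5.0, 5.0)
            S[j, i] = -S[i, j]
```

Running (12) for one step and dropping (11) recovers Algorithm 2, consistent with the remark above.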
This section compares the performance of our proposed non-reversible diffusion algorithms (Algorithms 1, 2 and 3) with the classical Langevin algorithm in numerical examples. We begin with a low dimensional KL divergence/Bayesian learning problem (N = 2) and then a larger N = 10 dimensional problem. In both cases, we show that Algorithms 1 and 2 converge faster than the accelerated non-reversible diffusion (5), which in turn converges faster than the classical Langevin algorithm (1).

A. Estimating KL Divergence
The aim is to use the adaptive algorithms proposed above to explore and reconstruct high value regions of the KL divergence of the posterior. As will be discussed below, a special case of this setup is the Bayesian learning discussed in [3], where the algorithms explore high probability regions of the posterior distribution.
Let θᵒ ∈ R^N denote a true parameter value (which is unknown to the algorithm). Let θ ∈ R^N denote a random variable with known prior distribution p(θ). A sequence of independent observation random variables {Y_k} is generated from a known likelihood p(y | θᵒ). The KL divergence of a sequence of T observations is

K(θᵒ, θ) = −E_θ { log [ p(θ | Y_1, ..., Y_T) / p(θᵒ | Y_1, ..., Y_T) ] }    (14)

Given the observation sequence {y_k}, suppose we use the proposed algorithms on the expected cost

C(θ) = −E_{θᵒ} { log p(θ, Y_1, ..., Y_T) } = −∫ log p(θ, y_1, ..., y_T) p(y_1, ..., y_T | θᵒ) dy_1 ... dy_T    (15)

A naive implementation of the unbiased gradient estimate is ∇_θ log p(θ, y_1, ..., y_T); this uses batches of observations of length T from the sequence {y_k}. However, since the observations y_k are i.i.d., we can instead use a single observation y_k at each time k as an unbiased sample path gradient of the cost:

∇̂_θ c_k(θ_k) = −∇_θ log p(θ_k) − T ∇_θ log p(y_k | θ_k)    (16)

With this setup, suppose the Langevin dynamics or any of the proposed algorithms above are run on the observation sequence {y_k} generated from the likelihood p(y | θᵒ). Then clearly the algorithms asymptotically generate samples {θ_k} from the stationary distribution (3), namely

π(θ) ∝ exp(−C(θ)) ∝ exp(K(θᵒ, θ))

where the proportionality constant involves terms independent of θ.
To summarize, the Langevin dynamics algorithm and the non-reversible diffusion algorithms (Algorithms 1, 2 and 3) operating on observations {y_k} can be used with the gradient estimate ∇̂_θ c_k(θ_k) in (16) to estimate the KL divergence.
Specifically, if the empirical histogram π̂(θ) is constructed from the samples {θ_k} generated by the various algorithms, then log π̂(θ) ∝ K(θᵒ, θ).
In this section, we use upper case Y for random variables and lower case y for their realizations.

Remark. Bayesian Learning:
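As a sanity check of the single-observation estimate (16) before specializing to Bayesian learning, here is a sketch under an assumed scalar conjugate Gaussian model (prior θ ~ N(0,1), likelihood y | θ ~ N(θ,1)); this toy model is an assumption, not the mixture model used in the examples below.

```python
import numpy as np

# Sketch of the gradient estimate (16),
#   grad_hat c_k(theta) = -grad log p(theta) - T * grad log p(y_k | theta),
# under an assumed Gaussian model: theta ~ N(0,1), y | theta ~ N(theta,1).
rng = np.random.default_rng(0)
T = 100
theta_true = 0.5
y = theta_true + rng.standard_normal(T)   # y_k drawn from p(y | theta_o)

def grad_estimate(th, y_k):
    grad_log_prior = -th        # d/dth log N(th; 0, 1)
    grad_log_lik = y_k - th     # d/dth log N(y_k; th, 1)
    return -grad_log_prior - T * grad_log_lik

# Averaged over the T observations, (16) matches the exact gradient of
# the negative log joint -log p(theta, y_1, ..., y_T), confirming it is
# an unbiased single-sample estimate.
th = 0.0
avg = np.mean([grad_estimate(th, yk) for yk in y])
exact = th - T * (y.mean() - th)   # grad of 0.5*th^2 + 0.5*sum_k (y_k - th)^2
assert np.isclose(avg, exact)
```

The factor T in (16) compensates for replacing the sum over the T likelihood terms by a single randomly chosen term.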
Bayesian learning described in [3] is a special case of the above setup. It deals with exploring high probability regions of a posterior density. The setup in [3] is as follows. Suppose y_1, ..., y_T is a fixed realization generated from p(y | θ). Then

C̄(θ) ≜ −log p(θ, y_1, ..., y_T)

is a deterministic cost. This is unlike the cost C(θ) in (15), which involves the sequence of random variables Y_1, ..., Y_T and an expectation. Clearly the sample path gradient ∇̂_θ c_k(θ_k) evaluated in (16) for k ∈ {1, ..., T} is a noisy unbiased estimate of the gradient of C̄(θ).
Suppose the Langevin dynamics algorithm or any of the adaptive algorithms proposed above are run on the augmented dataset z = (y_1, ..., y_T, y_1, ..., y_T, ...). Note the augmented dataset comprises multiple repetitions of y_1, ..., y_T. Then the algorithms asymptotically generate samples {θ_k} from the stationary distribution (3), namely

π(θ) ∝ exp(−C̄(θ)) = p(θ | y_1, ..., y_T)

To summarize, the Langevin dynamics algorithm and the non-reversible diffusion algorithms (Algorithms 1, 2 and 3) operating on the augmented dataset z can be used with the gradient estimate ∇̂_θ c_k(θ_k) in (16) to perform Bayesian learning. That is, the algorithms construct a non-parametric estimate of the posterior distribution p(θ | y) from the empirical density π̂(θ) by using the iterates {θ_k} generated by the algorithms.
In [3] this is termed running the algorithms on multiple sweeps of y_1, ..., y_T. Also [3] uses a decreasing step size algorithm.

B. Example 1. Bayesian Learning N = 2

Here we consider the case N = 2, θ = [θ(1), θ(2)]′, with

y_k ∼ (1/2) N(θ(1), √2) + (1/2) N(θ(1) + θ(2), √2),  θ(1) ∼ N(0, √10),  θ(2) ∼ N(0, 1)    (17)

For the true parameter value θᵒ = [0, 1]′, it can be verified that the objective −C(θ) is non-concave in θ and has two maxima, at θ = [0, 1]′ and θ = [1, −1]′.
To illustrate the posterior p(θ | y_1, ..., y_T) visually, Figure 1 plots the empirical density and contours of p(θ | y_1, ..., y_T), θ ∈ R², for T = 100 using the Metropolis Hastings algorithm.

Fig. 1: Metropolis Hastings simulation of the posterior distribution p(θ | y_1, ..., y_T), T = 100.

The augmented dataset z was generated as repetitions of y_1, ..., y_100. We ran the Langevin dynamics algorithm, the accelerated algorithm and Algorithms 1, 2 and 3 with β = 1 over the augmented dataset z for 30 independent trials, each trial with a different sample path of the injected noise {w_k}. The 2 × 2 skew symmetric matrix was initialized as

S_0 = [ 0  −s ; s  0 ]  where s ∼ N(0, 1).

Figure 2 displays the estimated posterior means E{θ(i) | y_1, ..., y_100}, i = 1, 2. As can be seen from Figure 2, Algorithms 1, 2 and 3 converge faster than the accelerated algorithm, which in turn converges faster than the classical Langevin algorithm.

Fig. 2: N = 2. Comparison of posterior means E{θ(i) | y_1, ..., y_100}, i = 1, 2, versus iterations for the various algorithms (panels: (a) posterior mean for θ(1); (b) posterior mean for θ(2); curves: Classical, Accelerated, Adaptive, SPSA, M-Hastings).

C. Example 2. Bayesian Learning N = 10

θ(i) ∼ N(µ_i, σ_i), µ_i ∼ U[−, ], σ_i ∼ U[1, ], i ∈ {1, ..., N},
y_k ∼ (1/2) N( Σ_{i=1}^{N/2} θ(i), √2 ) + (1/2) N( Σ_{j=N/2+1}^{N} θ(j), √2 )

As in the previous example, the aim is to reconstruct the posterior p(θ | y_1, ..., y_100). First, the Metropolis Hastings algorithm was used to generate samples from the posterior; we view the estimates from Metropolis Hastings as the ground truth. Next we implemented the classical Langevin algorithm, the accelerated algorithm and the adaptive algorithms. In the accelerated algorithm and Algorithms 1 and 2, the skew symmetric matrix S was initialized as a tri-diagonal matrix with elements above the diagonal chosen as N(0, 1) random variables, and elements below the diagonal chosen as the negatives of these. The augmented dataset z was generated as repetitions of y_1, ..., y_100. Each algorithm was run for 50 independent trials with small constant step sizes ε and α.
The posterior p(θ | y_1, ..., y_100) is a 10-variate distribution. Figure 3 shows the posterior mean estimates of the first two marginals, computed for the various algorithms. Also shown are the L¹ distances of these marginals to that of the Metropolis Hastings algorithm. The L¹ distance (Wasserstein 1-metric) for the first two marginals is

d(i) = ∫ | F̂_i(α(i)) − F_i(α(i)) | dα(i),  i = 1, 2    (18)

where F_i is the cumulative distribution of marginal i constructed via Metropolis Hastings (ground truth) and F̂_i is the empirical cumulative distribution constructed by the Langevin or adaptive algorithm.
The L¹ distance is more appropriate for our purposes than the Kolmogorov-Smirnov distance since typically the constant of proportionality β is not known and so the regions of support of the empirical cdfs can vary substantially.

Acknowledgement. The above simulations were done by Cornell graduate student Omer Serbetci.

IV. WEAK CONVERGENCE ANALYSIS
V. NON-STATIONARY GLOBAL OPTIMIZATION AND TRACKING ANALYSIS
Our next main result concerns estimating a time evolving global minimum in a non-stationary global stochastic optimization problem. Alternatively, we use non-reversible diffusion based algorithms to explore and track a time evolving expected cost. Since we are estimating (tracking) a time evolving global minimum/cost, we first give a model for the evolution. Below, the Markov chain {x_k} will be used as a hyper-parameter to model the evolution of the global minimum. By hyper-parameter we mean that the Markov chain model is not known or used by the algorithms. The Markov chain assumption is used only for our convergence analysis, to determine how well our proposed algorithm estimates (tracks) a global minimum/expected cost that jump changes (evolves) according to an unknown Markov chain.
Fig. 3: N = 10. Comparison of posterior means E{θ(i) | y_1, ..., y_100} and Wasserstein 1-distances d(i) for the first two marginals i = 1, 2, versus iterations for the various algorithms (panels: (a) posterior mean for θ(1); (b) posterior mean for θ(2); (c) Wasserstein-1 distance for the first marginal; (d) Wasserstein-1 distance for the second marginal).

A. Non-stationary Stochastic Optimization Problem
In this section, we treat the problem of minimizing an objective function which is randomly changing within a finite set. Effectively, instead of one objective function, we have a finite number of objective functions to deal with. For mathematical convenience, we assume that the randomly changing behavior is modeled by a "slow" Markov chain {x_k} on the finite state space X = {1, ..., X} with one-step transition probability matrix I + αQ. Here α > 0 is a small parameter and Q = (q_ij) is the generator of a continuous-time Markov chain, so that q_ij ≥ 0 for i ≠ j and Σ_j q_ij = 0 for each i ∈ X. We assume that Q is irreducible (see [23, p. 23]). For notational convenience, we have chosen the states of the Markov chain to take integer values; this entails no loss of generality.
With the above setup, we carry out an optimization problem of the form

θ_x ∈ Θ = { arg min_{θ ∈ Θ ⊂ R^N} c(θ, x) },  x ∈ X,  where c(θ, x) = E_x { c(θ, x, z) } = E { c(θ, x, z) | x_k = x },    (19)

where z is the observation. The above expectation is a conditional expectation conditioned on x_k = x. Thus in lieu of one objective function, we have X objective functions. Equivalently, we are treating a time-varying tracking problem of tracking the time-varying minimizer.

REFERENCES

[1] S. B. Gelfand and S. K. Mitter, "Recursive stochastic algorithms for global optimization in R^d,"
SIAM Journal on Control and Optimization, vol. 29, no. 5, pp. 999-1018, 1991.
[2] V. S. Borkar and S. K. Mitter, "A strong approximation theorem for stochastic recursive algorithms," Journal of Optimization Theory and Applications, vol. 100, no. 3, pp. 499-513, 1999.
[3] M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 681-688.
[4] V. Krishnamurthy and G. Yin, "Langevin dynamics for inverse reinforcement learning of stochastic gradient algorithms," arXiv preprint arXiv:2006.11674, 2020.
[5] C.-R. Hwang, S.-Y. Hwang-Ma, and S.-J. Sheu, "Accelerating Gaussian diffusions," The Annals of Applied Probability, pp. 897-913, 1993.
[6] C.-R. Hwang, S.-Y. Hwang-Ma, S.-J. Sheu et al., "Accelerating diffusions," The Annals of Applied Probability, vol. 15, no. 2, pp. 1433-1444, 2005.
[7] G. A. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Springer, 2014, vol. 60.
[8] J. Spall, Introduction to Stochastic Search and Optimization. Wiley, 2003.
[9] H. J. Kushner and J. Yang, "Analysis of adaptive step-size SA algorithms for parameter tracking," IEEE Transactions on Automatic Control, vol. 40, no. 8, pp. 1403-1410, August 1995.
[10] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, ser. Applications of Mathematics. Springer-Verlag, 1990, vol. 22.
[11] V. Krishnamurthy, G. Yin, and S. Singh, "Adaptive step size algorithms for blind interference suppression in DS/CDMA systems," IEEE Transactions on Signal Processing, vol. 49, no. 1, pp. 190-201, January 2001.
[12] V. Krishnamurthy, X. Wang, and G. Yin, "Spreading code optimization and adaptation in CDMA via discrete stochastic approximation," IEEE Transactions on Information Theory, vol. 50, no. 9, pp. 1927-1949, September 2004.
[13] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. Springer-Verlag, 2003.
[14] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence. Wiley, 1986.
[15] P. Billingsley, Convergence of Probability Measures, 2nd ed. New York: Wiley, 1999.
[16] G. Yin, V. Krishnamurthy, and C. Ion, "Regime switching stochastic approximation algorithms with application to adaptive discrete stochastic optimization," SIAM Journal on Optimization, vol. 14, no. 4, pp. 1187-1215, 2004.
[17] G. Yin, C. Ion, and V. Krishnamurthy, "How does a stochastic optimization/approximation algorithm adapt to a randomly evolving optimum/root with jump Markov sample paths," Mathematical Programming B (Special Issue dedicated to B. T. Polyak's 70th Birthday), vol. 120, no. 1, pp. 67-99, 2009.
[18] R. Simmons and S. Konig, "Probabilistic navigation in partially observable environments," in Proceedings of the 14th International Joint Conference on Artificial Intelligence. Montreal, Canada: Morgan Kaufman, 1995, pp. 1080-1087.
[19] G. V. Moustakides, "Exponential convergence of products of random matrices: Application to adaptive algorithms," International Journal of Adaptive Control and Signal Processing, vol. 12, no. 7, pp. 579-597, 1998.
[20] G. Yin and Q. Zhang, Discrete-Time Markov Chains: Two-Time-Scale Methods and Applications. Springer, 2006, vol. 55.
[21] Y. W. Teh, A. H. Thiery, and S. J. Vollmer, "Consistency and fluctuations for stochastic gradient Langevin dynamics," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 193-225, 2016.
[22] M. Raginsky, A. Rakhlin, and M. Telgarsky, "Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis," arXiv preprint arXiv:1702.03849, 2017.
[23] G. G. Yin and Q. Zhang,