Stochastic Gradient Langevin Dynamics Algorithms with Adaptive Drifts
SStochastic Gradient Langevin Dynamics Algorithms withAdaptive Drifts
Sehwan Kim, Qifan Song, and Faming Liang ∗ September 22, 2020
Abstract
Bayesian deep learning offers a principled way to address many issues concerning safety ofartificial intelligence (AI), such as model uncertainty,model interpretability, and predictionbias. However, due to the lack of efficient Monte Carlo algorithms for sampling from theposterior of deep neural networks (DNNs), Bayesian deep learning has not yet powered ourAI system. We propose a class of adaptive stochastic gradient Markov chain Monte Carlo(SGMCMC) algorithms, where the drift function is biased to enhance escape from saddlepoints and the bias is adaptively adjusted according to the gradient of past samples. Weestablish the convergence of the proposed algorithms under mild conditions, and demonstratevia numerical examples that the proposed algorithms can significantly outperform the existingSGMCMC algorithms, such as stochastic gradient Langevin dynamics (SGLD), stochasticgradient Hamiltonian Monte Carlo (SGHMC) and preconditioned SGLD, in both simulationand optimization tasks.
Keywords:
Adaptive MCMC, Adam, Bayesian Deep Learning, Momentum, StochasticGradient MCMC ∗ To whom correspondence should be addressed: Faming Liang. F. Liang is Professor (email: [email protected]), S. Kim is Graduate Student, and Q. Song is Assistant Professor, Department of Statistics,Purdue University, West Lafayette, IN 47907. a r X i v : . [ s t a t . M L ] S e p Introduction
During the past decade, deep learning has been the engine powering many successes of artificialintelligence (AI). However, the deep neural network (DNN), as the basic model of deep learning,still suffers from some fundamental issues, such as model uncertainty, model interpretability, andprediction bias, which pose a high risk on the safety of AI. In particular, the standard optimiza-tion algorithms such as stochastic gradient descent (SGD) produce only a point estimate for theDNN, where model uncertainty is completely ignored. The machine prediction/decision is blindlytaken as accurate and precise, with which the automated system might become life-threateningto humans if used in real-life settings. The universal approximation ability of the DNN enablesit to learn powerful representations that map high-dimensional features to an array of outputs.However, the representation is less interpretable, from which important features that govern thefunction of the system are hard to be identified, causing serious issues in human-machine trust.In addition, the DNN often contains an excessively large number of parameters. As a result, thetraining data tend to be over-fitted and the prediction tends to be biased.As advocated by many researchers, see e.g. Kendall and Gal (2017) and Chen (2018), Bayesiandeep learning offers a principled way to address above issues. Under the Bayesian framework,a sparse DNN can be learned by sampling from the posterior distribution formulated with anappropriate prior distribution, see e.g. Liang et al. (2018) and Polson and Rockova (2018). For thesparse DNN, interpretability of the structure and consistency of the prediction can be establishedunder mild conditions, and model uncertainty can be quantified based on the posterior samples, seee.g. Liang et al. (2018). 
However, due to the lack of efficient Monte Carlo algorithms for samplingfrom the posterior of DNNs, Bayesian deep learning has not yet powered our AI systems.Toward the goal of efficient Bayesian deep learning, a variety of stochastic gradient Markovchain Monte Carlo (SGMCMC) algorithms have been proposed in the literature, including stochas-tic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), stochastic gradient HamiltonianMonte Carlo (SGHMC) (Chen et al., 2014), and their variants. One merit of the SGMCMC al-gorithms is that they are scalable, requiring at each iteration only the gradient on a mini-batchof data as in the SGD algorithm. Unfortunately, as pointed out in Dauphin et al. (2014), DNNsoften exhibit pathological curvature and saddle points, rendering the first-order gradient basedalgorithms, such as SGLD, inefficient. To accelerate convergence, the second-order gradient algo-rithms, such as stochastic gradient Riemannian Langevin dynamics (SGRLD)(Ahn et al., 2012;2irolami and Calderhead, 2011; Patterson and Teh, 2013) and stochastic gradient RiemannianHamiltonian Monte Carlo (SGRHMC) (Ma et al., 2015), have been developed. With the use ofthe Fisher information matrix of the target distribution, these algorithms rescale the stochasticgradient noise to be isotropic near stationary points, which helps escape saddle points faster.However, calculation of the Fisher information matrix can be time consuming, which makes thesealgorithms lack scalability necessary for learning large DNNs. Instead of using the exact Fisherinformation matrix, preconditioned SGLD (pSGLD) (Li et al., 2016) approximates it by a diagonalmatrix adaptively updated with the current gradient information.Ma et al. (2015) provides a general framework for the existing SGMCMC algorithms (seeSection 2), where the stochastic gradient of the energy function (i.e., the negative log-targetdistribution) is restricted to be unbiased. However, this restriction is unnecessary. 
As shownin the recent work, see e.g., Dalalyan and Karagulyan (2017), Song et al. (2020), and Bhatiaet al. (2019), the stochastic gradient of the energy function can be biased as long as its meansquared error can be upper bounded by an appropriate function of θ t , the current sample of thestochastic gradient Markov chain. On the other hand, a variety of adaptive SGD algorithms,such as momentum (Qian, 1999), Adagrad (Duchi et al., 2011), RMSprop (Tieleman and Hinton,2012), and Adam (Kingma and Ba, 2014), have been proposed in the recent literature for dealingwith the saddle point issue encountered in deep learning. These algorithms adjust the movingdirection at each iteration according to the current gradient as well as the past ones. It was shownin Staib et al. (2019) that, compared to SGD, these algorithms escape saddle points faster andcan converge faster overall to the second-order stationary points.Motivated by the two observations above, we propose a class of adaptive SGLD algorithms,where a bias term is included in the drift function to enhance escape from saddle points andaccelerate the convergence in the presence of pathological curvatures. The bias term can beadaptively adjusted based on the path of the sampler. In particular, we propose to adjust thebias term based on the past gradients in the flavor of adaptive SGD algorithms (Ruder, 2016).We establish the convergence of the proposed adaptive SGLD algorithms under mild conditions,and demonstrate via numerical examples that the adaptive SGLD algorithms can significantlyoutperform the existing SGMCMC algorithms, such as SGLD, SGHMC and pSGLD.3 A Brief Review of Existing SGMCMC Algorithms
Let X N = ( X , X , . . . , X N ) denote a set of N independent and identically distributed samplesdrawn from the distribution f ( x | θ ), where N is the sample size and θ is the vector of parameters.Let p ( X N | θ ) = (cid:81) Ni =1 f ( X i | θ ) denote the likelihood function, let π ( θ ) denote the prior distribution of θ , and let U ( θ ) = − log p ( X N | θ ) − log π ( θ ) denote the energy function of the posterior distribution.If θ has a fixed dimension and U ( θ ) is differentiable with respect to θ , then the SGLD algorithmcan be used to simulate from the posterior, which iterates by θ t +1 = θ t − (cid:15) t +1 ∇ θ ˜ U ( θ t ) + (cid:112) (cid:15) t +1 τ η t +1 , η t +1 ∼ N (0 , I d ) , where d is the dimension of θ , I d is an d × d -identity matrix, (cid:15) t +1 is the learning rate, τ is the tem-perature, and ∇ θ ˜ U ( θ ) denotes an estimate of ∇ θ U ( θ ) based on a mini-batch of data. The learningrate can be kept as a constant or decreasing with iterations. For the former, the convergence ofthe algorithm was studied in Sato and Nakagawa (2014) and Dalalyan and Karagulyan (2017).For the latter, the convergence of the algorithm was studied in Teh et al. (2016).The SGLD algorithm has been extended in different ways. As mentioned previously, each of itsexisting variants can be formulated as a special case of a general SGMCMC algorithm given in Maet al. (2015). Let ξ denote an augmented state, which may include some auxiliary components. Forexample, SGHMC augments the state to ξ = ( θ, v ) by including an auxiliary velocity componentdenoted by v . 
Then the general SGMCMC algorithm is given by θ t +1 = θ t − (cid:15) t +1 [ D ( ξ ) + Q ( ξ )] ∇ ξ ˜ H ( ξ ) + Γ( ξ ) + (cid:112) (cid:15) t +1 τ Z t +1 , where Z t +1 ∼ N (0 , D ( ξ t )), H ( ξ ) is the energy function of the augmented system, ∇ ξ ˜ H ( ξ ) de-notes an unbiased estimate of ∇ ξ H ( ξ ), D ( ξ ) is a positive semi-definite diffusion matrix, Q ( ξ ) is askew-symmetric curl matrix, and Γ i ( ξ ) = (cid:80) dj =1 ∂∂ξ j ( D ij ( ξ ) + Q ij ( ξ )). The diffusion D ( ξ ) and curl Q ( ξ ) matrices can take various forms and the choice of the matrices will affect the rate of con-vergence of the sampler. For example, for the SGHMC algorithm, we have H ( ξ ) = U ( θ ) + v T v , D ( ξ ) = C for some positive semi-definite matrix C , and Q ( ξ ) = − II . For theSGRLD algorithm, we have ξ = θ , H ( ξ ) = U ( θ ), D ( ξ ) = G ( θ ) − , Q ( ξ ) = 0, where G ( θ ) is theFisher information matrix of the posterior distribution. By rescaling the parameter updates ac-cording to geometry information of the manifold, SGRLD generally converges faster than SGLD.However, calculating the Fisher information matrix and its inverse can be time consuming when4he dimension of θ is high and the total sample size N is large. To address this issue, pSGLDapproximates G ( θ ) using a diagonal matrix and sequentially updates the approximator using thecurrent gradient information. To be more precise, it is given by G ( θ t +1 ) = diag(1 (cid:11) ( λ + (cid:112) V ( θ t + ))) ,V ( θ t +1 ) = βV ( θ t ) + (1 − β ) ∇ θ ˜ U ( θ t ) (cid:12) ∇ θ ˜ U ( θ t ) , where λ denotes a small constant, (cid:12) and (cid:11) represent element-wise vector product and division,respectively. 
Motivated by the observations that the stochastic gradient ∇ θ ˜ U ( θ ) used in SGLD is not necessarilyunbiased and that the past gradients can be used to enhance escape from saddle points for SGD,we propose a class of adaptive SGLD algorithms, where the past gradients are used to acceleratethe convergence of the sampler by forming a bias to the drift at each iteration. A general form ofthe adaptive SGLD algorithm is given by θ t +1 = θ t − (cid:15) t +1 ( ∇ θ ˜ U ( θ t ) + aA t ) + (cid:112) (cid:15) t +1 τ η t +1 , (1)where A t is the adaptive bias term, a is called the bias factor, and η t +1 ∼ N (0 , I d ). Two adaptiveSGLD algorithms are given in what follows. In the first algorithm, the bias term is constructedbased on the momentum algorithm (Qian, 1999); and in the second algorithm, the bias term isconstructed based on the Adam algorithm (Kingma and Ba, 2014). It is known that SGD has trouble in navigating ravines, i.e., the regions where the energy surfacecurves much more steep in one dimension than in another, which are common around local energyminima (Ruder, 2016; Sutton, 1986). In this scenario, SGD oscillates across the slopes of theravine while making hesitant progress towards the local energy minima. To accelerate SGD inthe relevant direction and dampen oscillations, the momentum algorithm (Qian, 1999) updatesthe moving direction at each iteration by adding a fraction of the moving direction of the pastiteration, the so-called momentum term, to the current gradient. By accumulation, the momentum5erm increases updates for the dimensions whose gradients pointing in the same directions andreduces updates for the dimensions whose gradients change directions. As a result, the oscillationis reduced and the convergence is accelerated.
Algorithm 1
MSGLD
Input:
Data { x i } Ni =1 , subsample size n , smoothing factor 0 < β <
1, bias factor a , temperature τ , and learning rate (cid:15) ; Initialization: θ from an appropriate distribution, and m = 0. for i = 1 , , . . . , do Draw a mini-batch of data { x ∗ j } nj =1 , and calculate θ t +1 = θ t − (cid:15) ( ∇ ˜ U ( θ t ) + am t ) + e t +1 , m t = β m t − + (1 − β ) ∇ ˜ U ( θ t − ),where e t +1 ∼ N (0 , τ (cid:15)I d ), and d is the dimension of θ . end for As an analogy of the momentum algorithm in stochastic optimization, we propose the so-calledmomentum SGLD (MSGLD) algorithm, where the momentum is calculated as an exponentiallydecaying average of past stochastic gradients and added as a bias term to the drift of SGLD. Theresulting algorithm is depicted in Algorithm 1, where a constant learning rate (cid:15) is considered forsimplicity. However, as mentioned in the Appendix, the algorithm also works for the case that thelearning rate decays with iterations. The convergence of the algorithm is established in Theorem3.1, whose proof is given in the Appendix.
Theorem 3.1 (Ergodicity of MSGLD)
Suppose the conditions (A.1)-(A.5) hold (given in Ap-pendix), β ∈ (0 , is a constant, and the learning rate (cid:15) is sufficiently small. Then for any smoothfunction φ ( θ ) , L L (cid:88) k =1 φ ( θ k ) − (cid:90) Θ φ ( θ ) π ∗ ( θ ) dθ p → , as L → ∞ , where π ∗ denotes the posterior distribution of θ , and p → denotes convergence in probability. Algorithm 1 contains a few parameters, including the subsample size n , smoothing factor β ,bias factor a , temperature τ , and learning rate (cid:15) . Among these parameters, n , τ and (cid:15) are sharedwith SGLD and can be set as in SGLD. Refer to Nagapetyan et al. (2017) and Nemeth andFearnhead (2019) for more discussions on their settings. The smoothing factor β is a constant,6hich is typically set to 0.9. The bias factor a is also a constant, which is typically set to 1 or aslightly large value. The Adam algorithm (Kingma and Ba, 2014) has been widely used in deep learning, which typicallyconverges much faster than SGD. Recently, Staib et al. (2019) showed that Adam can be viewedas a preconditioned SGD algorithm, where the preconditioner is estimated in an on-line mannerand it helps escape saddle points by rescaling the stochastic gradient noise to be isotropic nearstationary points.
Algorithm 2
ASGLD
Input:
Data { x i } Ni =1 , subsample size n , smoothing factors β and β , bias factor a , temperature τ , and learning rate (cid:15) ; Initialization: θ from appropriate distribution, m = 0 and V = 0; for i = 1 , , . . . , do Draw a mini-batch of data { x ∗ j } nj =1 , and calculate θ t +1 = θ t − (cid:15) ( ∇ ˜ U ( θ t ) + am t (cid:11) √ V t + λ ) + e t +1 ,m t = β m t − + (1 − β ) ∇ ˜ U ( θ t − ), V t = β V t − + (1 − β ) ∇ ˜ U ( θ t − ) (cid:12) ∇ ˜ U ( θ t − ),where λ is a small constant added to avoid zero-divisors, e t +1 ∼ N (0 , τ (cid:15)I d ), and d is thedimension of θ . end for Motivated by this result, we propose the so-called Adam SGLD (ASGLD) algorithm. Ideally,we would construct the adaptive bias term as follows: m t = β m t − + (1 − β ) ∇ ˜ U ( θ t − ) , ˜ V t = β ˜ V t − + (1 − β ) ˜ U ( θ t − ) ˜ U ( θ t − ) T , ˜ A t = ˜ V − / t m t , (2)where β and β are smoothing factors for the first and second moments of stochastic gradients,respectively. Since ˜ V t can be viewed as an approximator of the true second moment matrix E ( ∇ θ ˜ U ( θ t − ) ∇ θ ˜ U ( θ t − ) T ) at iteration t −
1, ˜ A t can viewed as the rescaled momentum which isisotropic near stationary points. If the bias factor a is chosen appropriately, ASGLD is expected7o converge very fast. In particular, the bias term may guide the sampler to converge to a globaloptimal region quickly, similar to Adam in optimization. However, when the dimension of θ is high,calculation of ˜ V t and ˜ V − / t can be time consuming. To accelerate computation, we propose toapproximate ˜ V t using a diagonal matrix as in pSGLD. This leads to Algorithm 2. The convergenceof the algorithm is established in Theorem 3.2, whose proof is given in the Appendix. Theorem 3.2 (Ergodicity of ASGLD)
Suppose the conditions (A.1)-(A.5) hold (given in Ap-pendix), β < β are two constants between 0 and 1, and the learning rate (cid:15) is sufficiently small.Then for any smooth function φ ( θ ) , L L (cid:88) k =1 φ ( θ k ) − (cid:90) Θ φ ( θ ) π ∗ ( θ ) dθ p → , as L → ∞ , where π ∗ denotes the posterior distribution of θ , and p → denotes convergence in probability. Compared to Algorithm 1, ASGLD contains one more parameter, β , which works as thesmoothing factor for the second moment term and is suggested to take a value of 0.999 in thispaper. In addition to the Momentum and Adam algorithms, other optimization algorithms, such asAdaMax (Kingma and Ba, 2014) and Adadelta (Zeiler, 2012), can also be incorporated into SGLDto accelerate its convergence. Other than the bias term, the past gradients can also be used toconstruct an adaptive preconditioner matrix in a similar way to pSGLD. Moreover, the adaptivebias and adaptive preconditioner matrix can be used together to accelerate the convergence ofSGLD.
Before applying the adaptive SGLD algorithms to DNN models, we first illustrate their perfor-mance on three low-dimensional examples. The first example is a multivariate Gaussian distri-bution with high correlation values. The second example is a multi-modal distribution, whichmimics the scenario with multiple local energy minima. The third example is more complicated,which mimics the scenario with long narrow ravines.8 .1 A Gaussian distribution with high correlation values
Suppose that we are interested in drawing samples from π ( θ ), a Gaussian distribution with themean zero and the covariance matrix Σ = . . . For this example, we have ∇ θ U ( θ ) = Σ − θ ,and set ∇ ˜ U ( θ ) = ∇ U ( θ ) + e in simulations, where θ = ( θ , θ ) T ∈ R and e ∼ N (0 , I ). ForASGLD, we set τ = 1, a = 0 . (cid:15) = 0 . β = 0 . β = 0 . τ = 1, a = 0 . β = 0 . (cid:15) = 0 .
1. For comparison, SGLD was also run for this example with thesame learning rate (cid:15) = 0 .
1. Figure 1 shows that both ASGLD and MSGLD work well for thisexample, where the left panel shows that they can produce the same accurate estimate as SGLDfor the covariance matrix as the number of iterations becomes large.Figure 1: Performance of adaptive SGLD algorithms: (left) average absolute errors of the samplecovariance matrix produced by SGLD, ASGLD and MSGLD along with iterations, and (right)scatter plots of the samples generated by ASGLD and MSGLD during their first 1000 iterations.
The target distribution is a 2-dimensional 5-component mixture Gaussian distribution, whosedensity function is given by π ( θ ) = (cid:80) i =1 110 π exp( −(cid:107) θ − µ i (cid:107) ), where µ = ( − , − T , µ =( − , T , µ = (0 , T , µ = (3 , T , µ = (3 , T . For this example, we considered the naturalgradient variational inference (NGVI) algorithm (Lin et al., 2019), which is known to convergevery fast in the variational inference field, as the baseline algorithm for comparison.9or adaptive SGLD method, both ASGLD and MSGLD were applied to this example. We set ∇ θ ˜ U ( θ ) = ∇ U ( θ ) + e , where e ∼ N (0 , I ) and U ( θ ) = − log π ( θ ). For a fair comparison, eachalgorithm was run in 6.0 CPU minutes. The numerical results were summarized in Figure 2, whichshows the contour of the energy function and its estimates by NGVI, MSGLD and ASGLD. Theplots indicate that MSGLD and ASGLD are better at exploring the multi-modal distributionsthan NGVI.Figure 2: Contour plots of the true energy function (left) and its estimates by NGVI (middle left),MSGLD (middle right), and ASGLD (right). Consider a nonlinear regression y = f θ ( x ) + (cid:15), (cid:15) ∼ N (0 , , where x ∼ U nif [ − , θ = ( θ , θ ) T ∈ R , and f θ ( x ) = ( x − + 2 sin( θ x ) + θ + cos( θ x − − θ . As θ increases, the function f θ ( x ) fluctuates more severely. Figure 3a depicts the regression,where we set θ = 20 and θ = 10. Since the random error (cid:15) is relatively large compared to thelocal fluctuation of f θ ( x ), i.e., 2 sin( θ x ) + θ + cos( θ x − − θ , identification of the exactvalues of ( θ , θ ) can be very hard, especially when the subsample size n is small.From this regression, we simulated 5 datasets with ( θ , θ ) = (20 ,
10) independently. Eachdataset consists of 10,000 samples. To conduct Bayesian analysis for the problem, we set theprior distribution: θ ∼ N (0 ,
1) and θ ∼ N (0 , a priori independent. This choice ofthe prior distribution makes the problem even harder, which discourages the convergence of the10 a) (b) Figure 3: (a) The dashed line is for the global pattern ( x − and the solid line is for the regressionfunction f θ ( x ), where θ = 20 and θ = 10. The points represent 100 random samples from theregression. (b) Contour plot of the energy function for one dataset and the sample paths producedby SGLD, SGHMC, pSGLD, ASGLD and MSGLD in a run, where the sample paths have beenthinned by a factor of 50 for readability of the plot.posterior simulation to the true value of θ . Instead, it encourages to estimate f θ ( x ) by the globalpattern ( x − .Both ASGLD and MSGLD were run for each of the 5 datasets. Each run consisted of 30,000iterations, where the first 10,000 iterations were discarded for the burn-in process and the samplesgenerated from the remaining 20,000 iterations were averaged as the Bayesian estimate of θ . Inthe simulations, we set the subsample size n = 100. The settings of other parameters are givenin the Appendix. Table 1 shows the Bayesian estimates of θ produced by the two algorithms ineach of the five runs. The MSGLD estimate converged to the true value in all five runs, whilethe ASGLD estimates converged to the true value in four of five runs. For comparison, SGLD,SGHMC and pSGLD were also applied to this example with the settings given in the Appendix.As implied by Table 1, all the three algorithms essentially failed for this example: none of theirestimates converged to the true value!For a further exploration, Figure 3b shows the contour plot of the energy function for onedataset as well as the sample paths produced by SGLD, SGHMC, pSGLD, ASGLD and MSGLDfor the dataset. As shown by the contour plot, the energy landscape contains multiple long narrowravines, which make the existing SGMCMC algorithms essentially fail. 
However, due to the use of11able 1: Bayesian estimates of θ produced by different algorithms for the long narrow energyravines example in five independent runs, where the true value of θ is (20,10). Method θ θ -5.79 -6.59 17.43 0.99 -3.01SGLD θ -2.42 -1.76 9.02 -1.36 -4.74 θ θ θ θ -2.22 -15.44 8.17 0.65 3.75 θ θ θ θ momentum information, ASGLD and MSGLD work extremely well for this example. As indicatedby their sample paths, they can move along narrow ravines, and converge to the true value of θ very quickly. It is interesting to point out that pSGLD does not work well for this example,although it has used the past gradients in constructing the preconditioned matrix. A possiblereason for this failure is that it only approximates the preconditioned matrix by a diagonal matrixand missed the correlation between different components of θ . The dataset is available at the UCI Machine Learning Repository, which consists of 4435 traininginstances and 2000 testing instances. Each instance consists of 36 attributes which representfeatures of the earth surface images taken from a satellite. The training instances consist of 6classes, and the goal of this study is to learn a classifier for the earth surface images.We modeled this dataset by a fully connected DNN with structure 36-30-30-6 and
Relu as the12ctivation function. Let θ denote the vector of all parameters (i.e., connection weights) of theDNN. We let θ be subject to a Gaussian prior distribution: θ ∼ N (0 , I d ), where d is the dimensionof θ . The SGLD, SGHMC, pSGLD, ASGLD and MSGLD algorithms were all applied to simulatefrom the posterior of the DNN. Each algorithm was run for 3,000 epochs with the subsample size n = 50 and a decaying learning rate (cid:15) k = (cid:15) γ (cid:98) k/L (cid:99) , (3)where k indexes epochs, the initial learning rate (cid:15) = 0 . γ = 0 .
5, the step size L = 300, and (cid:98) z (cid:99) denotes the maximum integer less than z . For the purpose of optimization, the temperaturewas set to τ = 0 .
01. The settings for the specific parameters of each algorithm were given in theAppendix.Each algorithm was run for five times for the example. In each run, the training and testclassification accuracy were calculated by averaging over the last 200 samples, which were collectedfrom the last 100,000 iterations with a thinning factor of 500. For each algorithm, Table 2 reportsthe mean classification accuracy, for both training and test, averaged over five runs and its standarddeviation. The results indicate that MSGLD has significantly outperforms other algorithms in bothtraining and test for this example. While ASGLD performs similarly to pSGLD for this example.Table 2: Training and test classification accuracy produced by different SGMCMC algorithmsfor the Landsat data, where the accuracy and its standard error were calculated based on 5independent runs.
Method Training Accuracy Test AccuracySGLD 93.163 ± ± ± ± ± ± ± ± ± ± Finally, we note that for this example, the SGMCMC algorithms have been run excessively long.Figure 4a and Figure 4b show, respectively, the training and test classification errors producedby SGLD, pSGLD, SGHMC, ASGLD and MSGLD along with iterations. It indicates again that13 a) (b)
Figure 4: (a) Training classification errors produced by different SGMCMC algorithms for theLandsat data example. (b) Test classification errors produced by different SGMCMC algorithmsfor the Landsat data example.MSGLD significantly outperforms other algorithms in both training and test.
The MNIST is a benchmark dataset of computer vision, which consists of 60,000 training instancesand 10,000 test instances. Each instance is an image consisting of 28 ×
28 attributes and repre-senting a hand-written number of 0 to 9. For this data set, we tested whether ASGLD or MSGLDcan be used to train sparse DNNs. For this purpose, we considered a mixture Gaussian prior foreach of the connection weights: π ( θ k ) ∼ λ k N (0 , σ ,k ) + (1 − λ k ) N (0 , σ ,k ) , (4)where k is the index of hidden layers, and σ ,k is a relatively very small value compared to σ ,k .In this paper, we set λ k = 10 − , σ ,k = 0 . , σ ,k = 1 × − for all k .We trained a DNN with structure 784-800-800-10 using ASGLD and MSGLD for 250 epochswith subsample size 100. For the first 100 epochs, the DNN was trained as usual, i.e., with aGaussian prior N (0 , I d ) imposed on each connection weight. Then the DNN was trained for 150epochs with the prior (4). The settings for the other parameters of the algorithms were givenin the Appendix. Each algorithm was run for 3 times. In each run, the training and predictionaccuracy were calculated by averaging, respectively, the fitting and prediction results over theiterations of the last 5 epochs. The numerical results were summarized in Table 3, where “Sparse14atio” is calculated as the percentage of the learned connection weights satisfying the inequality | θ k | ≥ (cid:113) log( − λ k λ k σ k σ k ) σ k σ k σ k − σ k . The threshold is determined by solving the probability inequality P { θ k ∼ N (0 , σ ,k ) | θ k } ≤ P { θ k ∼ N (0 , σ ,k ) | θ k } .Table 3: Comparison of different algorithms for training sparse DNNs for the MNIST data, wherethe reported results for each algorithm were averaged based on three independent runs. Method Training Accuracy Test Accuracy Sparsity RatioASGLD ± ± ± ± ± ± ± ± ± Adam is a well known DNN optimization method for MNIST data. For comparison, it wasalso applied to train the sparse DNN. Interestingly, ASGLD outperforms Adam in both trainingand prediction accuracy, although its resulting network is a little more dense than that by Adam.
The CIFAR-10 and CIFAR-100 are also benchmark datasets of computer vision. CIFAR-10 con-sists of 50,000 training images and 10,000 test images, and the images were classified into 10 classes.The CIFAR-100 dataset consists of 50,000 training images and 10,000 test images, but the imageswere classified into 100 classes. We modeled both datasets using a ResNet-18He et al. (2015), andtrained the model for 250 epochs using various optimization and SGMCMC algorithms, includingSGD, Adam, SGLD, SGHMC, pSGLD, ASGLD and MSGLD with data augmentation techniques.The temperature τ was set to 1 . e − L = 40 and γ = 0 .
5; and theweight prior was set to N (0 , I d ) for all SGMCMC algorithms. For CIFAR-100, the learning ratewas set as in (3) with L = 90 and γ = 0 .
1; and the weight prior was set to N (0 , I d ) for SGLD,pSGLD, and SGHMC, and N (0 , I d ) for ASGLD and MSGLD. For SGD and Adam, we set theobjective function as − n n (cid:88) i =1 log f ( x i | θ ) + 12 λ (cid:107) θ (cid:107) , (5)15here f ( x i | θ ) denotes the likelihood function of observation i , and λ is the regularization param-eter. For SGD we set λ = 5 . e −
4; and for Adam we set λ = 0. For Adam, we have also tried thecase λ (cid:54) = 0, but the results were inferior to those reported below.For each dataset, each algorithm was run for 5 times. In each run, the training and testclassification errors were calculated by averaging over the iterations of the last 5 epochs. Table 4reported the mean training and test classification accuracy (averaged over 5 runs) of the BayesianResNet-18 for CIFAR-10 and CIFAR-100. The comparison indicates that MSGLD outperforms allother algorithms in test accuracy for the two datasets. In terms of training accuracy, ASGLD andMSGLD work more like the existing momentum-based algorithms such as ADAM and SGHMC,but tend to generalize better than them. This result has been very remarkable, as all algorithmswere run with a small number of epochs; that is, the Monte Carlo algorithms are not necessarilyslower than the stochastic optimization algorithms in deep learning.Table 4: Mean training and test classification accuracy (averaged over 5 runs) of the BayesianResNet-18 for CIFAR-10 and CIFAR-100 data CIFAR-10 CIFAR-100Method Training Test Training TestSGD 99.721 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± This paper has proposed a class of adaptive SGMCMC algorithms by including a bias term to thedrift of SGLD, where the bias term is allowed to be adaptively adjusted with past samples, past16radients, etc. The proposed algorithms have extended the framework of the existing SGMCMCalgorithms to an adaptive one in a flavor of extending the Metropolis algorithm Metropolis et al.(1953) to the adaptive Metropolis algorithm (Haaro et al., 2001) in the history of MCMC. 
Thenumerical results indicate that the proposed algorithms have inherited many attractive properties,such as quick convergence in the scenarios with long narrow ravines or saddle points, from theircounterpart optimization algorithms, while ensuring more extensive exploration of the samplespace than the optimization algorithms due to their simulation nature. As a result, the proposedalgorithms can significantly perform the existing optimization and SGMCMC algorithms in bothsimulation and optimization.For the adaptive SGMCMC algorithms, different bias terms represent different strengths inescaping from local traps, saddle points, or long narrow ravines. In the future, we will consider todevelop an adaptive SGMCMC algorithm with a complex bias term which has incorporated all thestrengths in escaping from local traps, saddle points, long narrow ravines, etc. We will also considerto incorporate other advanced SGMCMC techniques into adaptive SGMCMC algorithms to furtherimprove their performance. For example, cyclical SGMCMC (Zhang et al., 2019) proposed acyclical step size schedule, where larger steps discover new modes, and smaller steps characterizeeach mode. This technique can be easily incorporated into adaptive SGLD algorithms to improvetheir convergence to the stationary distribution.
Appendix

A Proofs of Theorems 3.1 and 3.2
Consider a generalized SGLD algorithm with a biased drift term for simulating from the target distribution π*(θ) ∝ exp{−U(θ)}. Let θ_{k+1} and θ_k be two random vectors in Θ satisfying

θ_{k+1} = θ_k − ε[∇U(θ_k) + ζ_{k+1}] + √(2τε) e_{k+1},   (6)

where e_{k+1} ∼ N(0, I_p), and ζ_{k+1} = ∇Û(θ_k) − ∇U(θ_k) denotes the deviation between the drift ∇Û(θ_k) used in simulations and the ideal drift ∇U(θ_k) = −∇ log π*(θ_k). For example, in equation (1), we have ∇Û(θ_k) = ∇_θ Ũ(θ_k) + a A_k.

For the generalized SGLD algorithm (6), we aim to analyze the deviation of the averaging estimate φ̂ = (1/L) Σ_{k=1}^L φ(θ_k) from the posterior mean φ̄ = ∫_Θ φ(θ) π*(dθ) for a bounded smooth function φ(θ) of interest. The key tool we employ in the analysis is the Poisson equation, which characterizes the fluctuation between φ and φ̄:

L g(θ) = φ(θ) − φ̄,   (7)

where g(θ) is the solution to the Poisson equation, and L is the infinitesimal generator of the Langevin diffusion,

L g := ⟨∇g, −∇U(·)⟩ + τ Δg.

By imposing the following regularity condition on the function g(θ), we can control the fluctuation of (1/L) Σ_{k=1}^L φ(θ_k) − φ̄, which ensures convergence of the sample average.

(A.1) Given a sufficiently smooth function g(θ) as defined in (7) and a function V(θ), the derivatives satisfy the inequality ‖D^j g‖ ≲ V^{p_j}(θ) for some constants p_j > 0, where j ∈ {0, 1, 2, 3}. In addition, V^p has a bounded expectation, i.e., sup_k E[V^p(θ_k)] < ∞; and V^p is smooth, i.e., sup_{s ∈ (0,1)} V^p(sθ + (1−s)ϑ) ≲ V^p(θ) + V^p(ϑ) for all θ, ϑ ∈ Θ and p ≤ 2 max_j {p_j}.

For a stronger but verifiable version of this condition, we refer readers to Vollmer et al. (2016). In what follows, we present a lemma adapted from Theorem 3 of Chen et al. (2015) with a fixed learning rate ε. Note that Chen et al. (2015) requires {ζ_k : k = 1, 2, ...} to be a zero-mean sequence, while in our case {ζ_k : k = 1, 2, ...} forms an auto-regressive sequence; nevertheless, the proof of Chen et al. (2015) still goes through. A similar lemma can also be established for a decaying learning rate sequence; refer to Theorem 5 of Chen et al. (2015) for details.

Lemma A.1
Assume condition (A.1) holds. For a smooth function φ, the mean squared error (MSE) of the generalized SGLD algorithm (6) at time T = εL is bounded as

E‖φ̂ − φ̄‖² ≲ (1/L²) Σ_{k=1}^L E‖ζ_k‖² + 1/(Lε) + ε².   (8)

To prove Theorems 3.1 and 3.2, we further make the following assumptions:

(A.2) (Smoothness) U(θ) is M-smooth; that is, there exists a constant M > 0 such that for any θ, θ′ ∈ Θ,

‖∇U(θ) − ∇U(θ′)‖ ≤ M‖θ − θ′‖.   (9)

The smoothness of ∇U(θ) is a standard assumption in studying the convergence of SGLD, and it has been used in a number of works, see e.g. Raginsky et al. (2017) and Xu et al. (2018).

(A.3) (Dissipativity) There exist constants m > 0 and b ≥ 0 such that for any θ ∈ Θ,

⟨∇U(θ), θ⟩ ≥ m‖θ‖² − b.   (10)

This assumption has been widely used in proving the geometric ergodicity of dynamical systems (Mattingly et al., 2002; Raginsky et al., 2017; Xu et al., 2018). It ensures that the sampler moves towards the origin regardless of the position of the current point.

(A.4) (Gradient noise) The stochastic gradient noise ξ(θ) = ∇Ũ(θ) − ∇U(θ) is unbiased; that is, for any θ ∈ Θ, E[ξ(θ)] = 0. In addition, there exists some constant B > 0 such that E‖ξ(θ)‖² ≤ M²‖θ‖² + B², where the expectation is taken with respect to the distribution of the gradient noise.

Lemma A.2 (Uniform L² bound) Assume conditions (A.2)–(A.4) hold. For any sufficiently small ε > 0, there exists a constant G < ∞ such that sup_k E‖θ_k‖² ≤ G².
The proof follows that of Lemma 1 in Deng et al. (2019). To make use of that proof, we rewrite equation (6) as θ_{k+1} = θ_k − ε ∇_θ L(θ_k, ζ_{k+1}) + √(2τε) e_{k+1}, where ∇_θ L(θ_k, ζ_{k+1}) = ∇_θ U(θ_k) + ζ_{k+1}, viewing ζ_{k+1} as an argument of the function L(·, ·). It is then easy to verify that conditions (A.2)–(A.4) imply the conditions of Lemma 1 of Deng et al. (2019), and thus the uniform L² bound holds. Note that, given condition (A.3), the inequality E⟨∇_θ L(θ_k, ζ_{k+1}), θ_k⟩ ≥ m E‖θ_k‖² − b, required by Deng et al. (2019) in its proof, holds as long as ε is sufficiently small or β is not very close to 1. □

Let θ* denote the minimizer of U(θ), so that ∇U(θ*) = 0. Then, by Lemma A.2 and condition (A.2), there exists a constant C₁ such that

E‖∇U(θ_k)‖² ≤ 2M²(G² + ‖θ*‖²) := C₁² < ∞.   (11)

Let ξ_{k+1} := ∇Ũ(θ_k) − ∇U(θ_k) be the gradient estimation error. By Lemma A.2 and condition (A.4), there exists a constant C₂ such that

E‖ξ_k‖² ≤ M²G² + B² := C₂² < ∞.   (12)

A.1 Proof of Theorem 3.1

Proof:
The update of the MSGLD algorithm can be rewritten as

θ_{k+1} = θ_k − ε[∇U(θ_k) + ζ_{k+1}] + √(2τε) e_{k+1},

where ζ_{k+1} = a m_k + ξ_{k+1} and m₀ = 0.

First, we study the bias of ζ_k. According to the recursive update rule of m_i, we have

E(ζ_{k+1}|F_k)/a = E(m_k|F_k) = (1−β)∇U(θ_{k−1}) + β E(m_{k−1}|F_k)
= (1−β)∇U(θ_{k−1}) + (1−β)β ∇U(θ_{k−2}) + β² E(m_{k−2}|F_k) = ···
= Σ_{i=1}^k (1−β)β^{i−1} ∇U(θ_{k−i}) + β^k E(m₀|F_k) = Σ_{i=1}^k (1−β)β^{i−1} ∇U(θ_{k−i}).

Hence, by Jensen's inequality,

‖E(ζ_{k+1}|F_k)‖ ≤ a Σ_{i=1}^k (1−β)β^{i−1} ‖∇U(θ_{k−i})‖ ≤ a √( Σ_{i=1}^k (1−β)β^{i−1} ‖∇U(θ_{k−i})‖² ).

By (11), the bias is further bounded by

E‖E(ζ_{k+1}|F_k)‖² ≤ a² Σ_{i=1}^k (1−β)β^{i−1} E‖∇U(θ_{k−i})‖² ≤ a²C₁².   (13)

For the variance of ζ_{k+1}, we have

E‖ζ_{k+1} − E(ζ_{k+1}|F_k)‖² = E‖ξ_{k+1} + a m_k − E(a m_k|F_k)‖²
= E‖ξ_{k+1} + a(1−β)∇Ũ(θ_{k−1}) + aβ m_{k−1} − a(1−β)∇U(θ_{k−1}) − aβ E(m_{k−1}|F_k)‖²
= E‖ξ_{k+1} + a(1−β)ξ_k + aβ m_{k−1} − aβ E(m_{k−1}|F_k)‖² = ···
= E‖ξ_{k+1} + Σ_{i=1}^k a(1−β)β^{i−1} ξ_{k−i+1}‖².

Due to the independence among the ξ_k's, we have

E‖ζ_{k+1} − E(ζ_{k+1}|F_k)‖² ≤ E‖ξ_{k+1}‖² + Σ_{i=1}^k a²(1−β)²β^{2(i−1)} E‖ξ_{k−i+1}‖² ≤ C₂²[1 + a²(1−β)/(1+β)],   (14)

where the last inequality follows from (12).

Combining (13) and (14), we have

E‖ζ_{k+1}‖² ≤ a²C₁² + C₂²[1 + a²(1−β)/(1+β)] < ∞,

which concludes the proof by applying Lemma A.1 and Chebyshev's inequality. □

A.2 Proof of Theorem 3.2

Proof:
The update of the ASGLD algorithm can be rewritten as

θ_{k+1} = θ_k − ε[∇U(θ_k) + ζ_{k+1}] + √(2τε) e_{k+1},

where ζ_{k+1} = a m_k ⊘ (√v_k + λ) + ξ_{k+1}, with ⊘ denoting elementwise division.

According to the recursive update rules of m_i and v_i, we have

m_i = (1−β₁)∇Ũ(θ_{i−1}) + (1−β₁)β₁ ∇Ũ(θ_{i−2}) + (1−β₁)β₁² ∇Ũ(θ_{i−3}) + ···,
v_i = (1−β₂)∇Ũ(θ_{i−1}) ⊙ ∇Ũ(θ_{i−1}) + (1−β₂)β₂ ∇Ũ(θ_{i−2}) ⊙ ∇Ũ(θ_{i−2}) + (1−β₂)β₂² ∇Ũ(θ_{i−3}) ⊙ ∇Ũ(θ_{i−3}) + ···,

where ⊙ denotes the elementwise product. Therefore, by the Cauchy–Schwarz inequality, when β₁² < β₂, we have

‖m_{i−1} ⊘ √v_{i−1}‖_∞ ≤ √( Σ_{j≥1} (1−β₁)² β₁^{2(j−1)} / ((1−β₂) β₂^{j−1}) ) ≤ √( (1−β₁)² / ((1−β₂)(1−β₁²/β₂)) ) := C₃.

It implies that ‖m_{i−1} ⊘ (√v_{i−1} + λ)‖ ≤ √p C₃ almost surely, and in consequence,

E‖E(ζ_{k+1}|F_k)‖² ≤ a²C₃²p,   (15)

and, by (12),

E‖ζ_{k+1} − E(ζ_{k+1}|F_k)‖² ≤ E‖ζ_{k+1}‖² ≤ E‖ξ_{k+1}‖² + a²E‖m_k ⊘ (√v_k + λ)‖² ≤ a²C₃²p + C₂².   (16)

Combining (15) and (16), we have E‖ζ_{k+1}‖² ≤ a²C₃²p + C₂² < ∞, which concludes the proof by applying Lemma A.1 and Chebyshev's inequality. □

B Experimental Setup
All numerical experiments on deep learning were done with PyTorch. For all SGMCMC algorithms, the initial learning rates were set at the order of O(1/N) in all experiments except for the MNIST training. For the optimization methods such as SGD and Adam, the objective function was set to (5), where f(x_i|θ) denotes the likelihood function of observation i, and λ is the regularization parameter, whose value varies across datasets.

Multi-modal distribution  For the NGVI method, we chose a 5-component mixture Gaussian distribution, setting the initial mixing proportions π_i, drawing the initial means µ_i from a mean-zero Gaussian, and setting Σ_i = I for i = 1, ..., 5. For MSGLD we set (a, β) = (10, 0.9); for ASGLD we set (a, β₁, β₂) with a = 1 and the learning rate ε = 0.05. The CPU time limit was set to 6 minutes.
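For reference, a multi-modal target of the kind used in this example can be set up as below (a hypothetical two-component, one-dimensional analogue for illustration; the paper's actual 5-component setup and parameter values are not reproduced):

```python
import numpy as np

# Hypothetical 1-d analogue with two modes (the paper's exact 5-component,
# higher-dimensional setup and parameter values are not reproduced here).
mus = np.array([-3.0, 3.0])   # assumed component means

def U(theta):
    # U(theta) = -log pi*(theta) for an equal-weight mixture of N(mu_i, 1)
    logs = -0.5 * (theta - mus) ** 2
    return -np.logaddexp(logs[0], logs[1])

def grad_U(theta, h=1e-5):
    # a numerical gradient is good enough for a sketch
    return (U(theta + h) - U(theta - h)) / (2 * h)

# Plain SGLD explores both modes of this mixture
rng = np.random.default_rng(0)
eps, theta = 0.05, 0.0
visited = set()
for _ in range(50_000):
    theta = theta - eps * grad_U(theta) + np.sqrt(2 * eps) * rng.normal()
    visited.add(int(theta > 0))
print(visited)
```

With a higher barrier between modes, plain SGLD mixes much more slowly, which is the regime where the adaptive-drift variants are intended to help.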
Distribution with long narrow ravines  Each algorithm was run for 30,000 iterations with the settings of specific parameters given in Table 5.

Table 5: Parameter settings for the distribution with long narrow ravines (columns: method, initial value, β₁, β₂, a, λ).

Landsat  Each algorithm was run for 3000 epochs with the settings of specific parameters given in Table 6.

Table 6: Parameter settings for the Landsat data example (columns: method, initial value, β₁, β₂, a, λ).

MNIST
Each algorithm was run for 250 epochs, where the first 100 epochs (stage I) were run with a conventional mean-zero Gaussian prior and the remaining epochs (stage II) with sparse learning, with λ = 0 for all 250 epochs. The settings of the specific parameters are given in Table 7.

Table 7: Parameter settings for MNIST before (stage I) and after (stage II) sparse learning (columns for each stage: method, initial value, β₁, β₂, a, λ, τ).

CIFAR-10 and CIFAR-100
For SGD, the objective function was set as (5) with λ = 5e−4. For Adam, the objective function was set as (5) with λ = 0. The settings of specific parameters are given in Table 8.

Table 8: Parameter settings for CIFAR-10 and CIFAR-100 (columns for each dataset: method, initial value, β₁, β₂, a, λ).

References

Ahn, S., Korattikara, A., and Welling, M. (2012), "Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring," in
ICML.

Bhatia, K., Ma, Y.-A., Dragan, A. D., Bartlett, P. L., and Jordan, M. I. (2019), "Bayesian Robustness: A Nonasymptotic Viewpoint," arXiv preprint arXiv:1907.11826.

Chen, C. (2018), "Uncertainty Estimation of Deep Neural Networks," PhD dissertation, University of South Carolina, U.S.A.

Chen, C., Ding, N., and Carin, L. (2015), "On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators," in NeurIPS, pp. 2278–2286.

Chen, T., Fox, E. B., and Guestrin, C. (2014), "Stochastic Gradient Hamiltonian Monte Carlo," in ICML.

Dalalyan, A. S. and Karagulyan, A. G. (2017), "User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient," CoRR, abs/1710.00095.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014), "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization," in Advances in Neural Information Processing Systems 27, eds. Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., Curran Associates, Inc., pp. 2933–2941.

Deng, W., Zhang, X., Liang, F., and Lin, G. (2019), "An Adaptive Empirical Bayesian Method for Sparse Deep Learning," in NeurIPS.

Duchi, J., Hazan, E., and Singer, Y. (2011), "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, 12, 2121–2159.

Girolami, M. and Calderhead, B. (2011), "Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion)," Journal of the Royal Statistical Society, Series B, 73, 123–214.

Haario, H., Saksman, E., and Tamminen, J. (2001), "An Adaptive Metropolis Algorithm," Bernoulli, 7, 223–242.

He, K., Zhang, X., Ren, S., and Sun, J. (2015), "Deep Residual Learning for Image Recognition," CVPR.

Kendall, A. and Gal, Y. (2017), "What uncertainties do we need in Bayesian deep learning for computer vision?" in The 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Kingma, D. and Ba, J. (2014), "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations, 1–13.

Li, C., Chen, C., Carlson, D. E., and Carin, L. (2016), "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks," in AAAI.

Liang, F., Li, Q., and Zhou, L. (2018), "Bayesian Neural Networks for Selection of Drug Sensitive Genes," Journal of the American Statistical Association, 113, 955–972.

Lin, W., Khan, M. E., and Schmidt, M. (2019), "Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations," in ICML, PMLR, Proceedings of Machine Learning Research.

Ma, Y.-A., Chen, T., and Fox, E. B. (2015), "A Complete Recipe for Stochastic Gradient MCMC," in NIPS.

Mattingly, J., Stuart, A., and Higham, D. (2002), "Ergodicity for SDEs and Approximations: Locally Lipschitz Vector Fields and Degenerate Noise," Stochastic Processes and their Applications, 101, 185–232.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953), "Equation of state calculations by fast computing machines," Journal of Chemical Physics, 21, 1087–1091.

Nagapetyan, T., Duncan, A., Hasenclever, L., Vollmer, S. J., Szpruch, L., and Zygalakis, K. (2017), "The True Cost of SGLD," arXiv:1706.02692v1.

Nemeth, C. and Fearnhead, P. (2019), "Stochastic Gradient Markov Chain Monte Carlo," arXiv:1907.06986.

Patterson, S. and Teh, Y. W. (2013), "Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex," in Advances in Neural Information Processing Systems 26, eds. Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., Curran Associates, Inc., pp. 3102–3110.

Polson, N. and Rockova, V. (2018), "Posterior Concentration for Sparse Deep Learning," arXiv preprint arXiv:1803.09138.

Qian, N. (1999), "On the momentum term in gradient descent learning algorithms," Neural Networks, 12, 145–151.

Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017), "Non-convex Learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis," Proceedings of Machine Learning Research, 65, 1–30.

Ruder, S. (2016), "An overview of gradient descent optimization algorithms," CoRR, abs/1609.04747.

Sato, I. and Nakagawa, H. (2014), "Approximation Analysis of Stochastic Gradient Langevin Dynamics by using Fokker-Planck Equation and Ito Process," in ICML.

Song, Q., Sun, Y., Ye, M., and Liang, F. (2020), "Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection," arXiv:2002.02919v1.

Staib, M., Reddi, S., Kale, S., Kumar, S., and Sra, S. (2019), "Escaping saddle points with adaptive gradient methods," in ICML.

Sutton, R. S. (1986), "Two Problems with Backpropagation and Other Steepest-Descent Learning Procedures for Networks," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Hillsdale, NJ: Erlbaum.

Teh, Y. W., Thiery, A. H., and Vollmer, S. J. (2016), "Consistency and fluctuations for stochastic gradient Langevin dynamics," The Journal of Machine Learning Research, 17, 193–225.

Tieleman, T. and Hinton, G. (2012), "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, 4, 26–31.

Vollmer, S. J., Zygalakis, K. C., and Teh, Y. W. (2016), "Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics," Journal of Machine Learning Research, 17, 1–48.

Welling, M. and Teh, Y. W. (2011), "Bayesian Learning via Stochastic Gradient Langevin Dynamics," in ICML.

Xu, P., Chen, J., Zou, D., and Gu, Q. (2018), "Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization," in NeurIPS.

Zeiler, M. D. (2012), "ADADELTA: An Adaptive Learning Rate Method," CoRR, abs/1212.5701.

Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. (2019), "Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning," arXiv preprint arXiv:1902.03932.