Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling
1st Symposium on Advances in Approximate Bayesian Inference, 2018
Rui Luo∗ (rui.luo@aig.com)
Qiang Zhang∗ (qiang.zhang@aig.com)
Yuanyuan Liu (yuanyuan.liu@aig.com)
American International Group, Inc.
∗ Equal contribution.
Abstract
We propose a new sampler that integrates the protocol of parallel tempering with the Nosé-Hoover (NH) dynamics. The proposed method can efficiently draw representative samples from complex posterior distributions with multiple isolated modes in the presence of noise arising from stochastic gradients. It potentially facilitates deep Bayesian learning on large datasets, where complex multimodal posteriors and mini-batch gradients are encountered.
1. Introduction
In Bayesian inference, one of the fundamental problems is to efficiently draw i.i.d. samples from the posterior distribution $\pi(\theta \mid \mathcal{D})$ given the dataset $\mathcal{D} = \{x\}$, where $\theta \in \mathbb{R}^D$ denotes the variable of interest. Given the prior distribution $\pi(\theta)$ and the per-datum likelihood $\ell(\theta; x)$, the posterior to be sampled can be formulated as
$$\pi(\theta \mid \mathcal{D}) \propto \pi(\theta) \prod_{x \in \mathcal{D}} \ell(\theta; x). \quad (1)$$
To facilitate posterior sampling, the framework of Markov chain Monte Carlo (MCMC) has been established; it has initiated a broad family of methods that generate Markov chains to propose new sample candidates and then apply acceptance tests in order to guarantee the condition of detailed balance. Methods like the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970), the Gibbs sampler (Geman and Geman, 1984), and hybrid/Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011) are famous representatives of the MCMC family, each adopting a different procedure for generating the Markov chain; each of those methods has achieved great success on various tasks in statistics and related fields.

Among MCMC methods, HMC in particular has attracted attention due to its exploitation of gradient information. In a typical HMC setting (Neal, 2011), the target posterior distribution $\pi(\theta \mid \mathcal{D})$ is embedded into a virtual physical system fixed at the standard temperature $T = 1$, with the potential energy defined in the form of
$$U(\theta) = -\log \pi(\theta \mid \mathcal{D}) = -\log \pi(\theta) - \sum_{x \in \mathcal{D}} \log \ell(\theta; x) + \mathrm{const}. \quad (2)$$
The variable of interest $\theta$ is interpreted as the position of the system in the phase space; an auxiliary variable $p \in \mathbb{R}^D$ is then introduced as the conjugate momentum, corresponding to the kinetic energy $p^\top M^{-1} p / 2$. By defining the total energy, i.e.
the Hamiltonian, as the sum of the potential and kinetic energy, the Hamiltonian dynamics that governs the physical system can therefore be derived from Hamilton's formalism. From the perspective of sampling, new sample candidates are proposed by simulating the Hamiltonian dynamics, in which the gradient of the potential $\nabla U(\theta)$ is utilized.

Despite possessing numerous advantages over its alternatives within the MCMC family, HMC still suffers from two major issues: 1. gradient noise arising from mini-batches may lead to a severe deviation of the dynamics from the desired orbit; 2. isolated modes may not be correctly sampled, or may even be left undiscovered. Unfortunately, when training deep neural networks on large datasets, these two problems arise simultaneously: deep neural networks lead to complex posterior distributions over the parameters, which may contain numbers of isolated modes, while efficient training on large datasets requires mini-batching, so the gradient is quite noisy as it is evaluated on a small fraction of the dataset.

It has long been known that the tempering mechanism is capable of helping the system get across high energy barriers and hence improves ergodicity (Marinari and Parisi, 1992; Earl and Deem, 2005). Recently, research on incorporating tempering into MCMC methods has provided a practical approach towards efficient multimodal posterior sampling (Graham and Storkey, 2017; Luo et al., 2018). In the meantime, advances in thermostatting techniques for molecular dynamics (Jones and Leimkuhler, 2011) have shed some light on adaptive control for noisy dynamics.
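To make the mini-batch noise issue concrete, here is a minimal sketch (our own illustration, not the paper's code) of the rescaled mini-batch gradient estimate of the potential, for a toy model with a standard-normal prior and unit-variance Gaussian likelihood; the function names are assumptions:

```python
import numpy as np

def grad_log_prior(theta):
    """Gradient of log N(theta; 0, I), a toy stand-in for log pi(theta)."""
    return -theta

def grad_log_lik(theta, x):
    """Gradient of log N(x; theta, I), a toy per-datum log-likelihood."""
    return x - theta

def stochastic_grad_U(theta, data, batch):
    """Mini-batch gradient of the potential U: the sum over the batch is
    rescaled by |D|/|S| so that the estimate is unbiased for the full
    gradient; the residual randomness is the noise HMC must cope with."""
    scale = len(data) / len(batch)
    g = -grad_log_prior(theta)
    for x in batch:
        g = g - scale * grad_log_lik(theta, x)
    return g
```

With the full dataset passed as the "mini-batch", the estimate coincides with the exact gradient; smaller batches trade accuracy for cost, producing exactly the gradient noise discussed above.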
In this paper, we propose a novel method that addresses the two issues mentioned above for HMC; it combines the protocol of parallel tempering (Swendsen and Wang, 1986; Sugita and Okamoto, 1999) with the dynamics of the Nosé-Hoover (NH) thermostat (Nosé, 1984; Hoover, 1985). Simulations demonstrate the advantages of our method, in both accuracy and efficiency, over the classic HMC (Neal, 2011) and one of its stochastic variants, the Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) (Ding et al., 2014).
2. Parallel-tempered Stochastic Gradient Hamiltonian Monte Carlo
The proposed method consists of two alternating subroutines: 1. the parallel dynamics simulation of system replicas, and 2. the configuration exchange between replicas. The first subroutine utilizes the Nosé-Hoover thermostat to adaptively detect and neutralize the noise within the mini-batch gradient; the second incorporates a mini-batch acceptance test to ensure detailed balance during exchanges.
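In outline, the alternation can be sketched as follows. This is our own illustrative code, not the paper's implementation: it assumes an Euler discretization, unit mass $M$ and thermal inertia $Q$, and an exact-energy exchange test (the mini-batch correction comes later in this section); all names are ours.

```python
import numpy as np

def run_sampler(grad_U, U, temps, theta0, n_iter=100, h=1e-3, rng=None):
    """Alternate (1) parallel Nose-Hoover simulation of R replicas and
    (2) a Barker exchange test between neighbouring temperature rungs."""
    if rng is None:
        rng = np.random.default_rng(0)
    R, D = len(temps), theta0.size
    thetas = np.tile(theta0.astype(float), (R, 1))
    ps = rng.normal(size=(R, D))
    xis = np.zeros(R)
    samples = []
    for _ in range(n_iter):
        # subroutine 1: Nose-Hoover dynamics (Eq. 4) with M = Q = 1
        for j in range(R):
            thetas[j] += h * ps[j]
            ps[j] -= h * (grad_U(thetas[j]) / temps[j] + xis[j] * ps[j])
            xis[j] += h * (ps[j] @ ps[j] - D)
        # subroutine 2: logistic (Barker) exchange test between a random
        # pair of neighbouring rungs, using the exact energy for simplicity
        j = int(rng.integers(R - 1))
        dE = (U(thetas[j]) - U(thetas[j + 1])) \
             * (temps[j + 1] - temps[j]) / (temps[j] * temps[j + 1])
        if rng.random() < 1.0 / (1.0 + np.exp(-dE)):
            thetas[[j, j + 1]] = thetas[[j + 1, j]]
        samples.append(thetas[0].copy())  # rung 1 (T = 1) targets the posterior
    return np.array(samples)
```

Only the positions on rung 1 are retained as posterior samples; the hotter replicas exist solely to ferry configurations across energy barriers.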
We define an increasing ladder $\{T_j\}_{j=1}^{R}$ of temperatures with $R$ rungs; the temperature ranges from the standard $T_1 = 1$ up to some higher temperature $T_R$. On each rung $j$, a replica $(\theta_j, p_j)$ of the physical system is initialized, and the actual potential energy for that replica is rescaled to $U(\theta_j)/T_j$.

As each datum $x$ within a mini-batch $\mathcal{S}$ is independently selected at random, the mini-batch gradient can be approximated by a Gaussian variable due to the Central Limit Theorem (CLT):
$$\nabla \tilde{U}(\theta) = -\nabla \log \pi(\theta) - \frac{|\mathcal{D}|}{|\mathcal{S}|} \sum_{x \in \mathcal{S} \subset \mathcal{D}} \nabla \log \ell(\theta; x). \quad (3)$$
To retain the correct trajectory in simulating the system dynamics, we leverage the NH thermostat because of its capability of adaptively controlling the gradient noise (Jones and Leimkuhler, 2011; Ding et al., 2014). Following the formulation of Hoover (1985), for each replica $(\theta_j, p_j)$ we augment the system with an NH thermostat $\xi_j \in \mathbb{R}$ and modify the dynamics as:
$$\frac{\mathrm{d}\theta_j}{\mathrm{d}t} = M^{-1} p_j, \quad \frac{\mathrm{d}p_j}{\mathrm{d}t} = -\nabla \tilde{U}(\theta_j)/T_j - \xi_j p_j, \quad \frac{\mathrm{d}\xi_j}{\mathrm{d}t} = \left[ p_j^\top M^{-1} p_j - D \right] / Q, \quad (4)$$
where $M$ denotes the mass and $Q$ the thermal inertia. It can be proved via the Fokker-Planck equation (Risken and Haken, 1989) that the dynamics in Eq. (4) leads to a stationary distribution w.r.t. $\theta_j$ of
$$\pi_j(\theta_j) \propto e^{-U(\theta_j)/T_j}. \quad (5)$$
This guarantees that, during the simulation, one can readily recover the desired distribution at a certain temperature $T_j$ by simply retaining the position $\theta_j$ and discarding the momentum $p_j$ as well as the thermostat $\xi_j$. Note that for the replica on rung 1, the temperature is fixed at the standard $T_1 = 1$, and the position $\theta_1 = \theta$ is distributed as the target posterior $\pi_1(\theta_1) \propto e^{-U(\theta)/T_1} = e^{-U(\theta)} \propto \pi(\theta \mid \mathcal{D})$.

The principles of statistical physics suggest that high temperature helps physical systems get across energy barriers, which means replicas at higher temperatures are more likely to traverse among different modes of the distribution. As a consequence, however, the distribution sampled at a high temperature has a spread spectrum and is hence biased. To recover an unbiased distribution, we perform configuration exchanges between the replicas at higher temperatures and the one at the standard temperature.

Consider the configuration exchange between the replicas on rungs $j$ and $k$; as this is a non-physical process, the exchange has to satisfy the condition of detailed balance:
$$\pi_j(\theta_j)\, \pi_k(\theta_k)\, \alpha[(j,k) \to (k,j)] = \pi_j(\theta_k)\, \pi_k(\theta_j)\, \alpha[(k,j) \to (j,k)], \quad (6)$$
where the transition probability reads
$$\alpha[(j,k) \to (k,j)] = \frac{\pi_j(\theta_k)\, \pi_k(\theta_j)}{\pi_j(\theta_j)\, \pi_k(\theta_k) + \pi_j(\theta_k)\, \pi_k(\theta_j)} = \frac{1}{1 + e^{-\delta E}}, \quad (7)$$
and $\delta E = \left[ U(\theta_j) - U(\theta_k) \right] \left[ (T_k - T_j)/(T_j T_k) \right]$. It is straightforward to verify that Eq. (6) holds.
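The verification can also be carried out numerically; below is a small sketch (function names are ours) that evaluates both sides of Eq. (6) for arbitrary energies and temperatures:

```python
import math

def barker_swap_prob(U_j, U_k, T_j, T_k):
    """Barker (logistic) transition probability of Eq. (7)."""
    dE = (U_j - U_k) * (T_k - T_j) / (T_j * T_k)
    return 1.0 / (1.0 + math.exp(-dE))

def log_pi(U, T):
    """Unnormalised log-density of Eq. (5): proportional to exp(-U/T)."""
    return -U / T

# detailed balance, Eq. (6): forward and reverse probability flows match
U_j, U_k, T_j, T_k = 1.3, 2.7, 1.0, 4.0
lhs = math.exp(log_pi(U_j, T_j) + log_pi(U_k, T_k)) \
      * barker_swap_prob(U_j, U_k, T_j, T_k)
rhs = math.exp(log_pi(U_k, T_j) + log_pi(U_j, T_k)) \
      * barker_swap_prob(U_k, U_j, T_j, T_k)
```

Because the forward and reverse acceptance probabilities sum to one, the ratio of the two flows is exactly the ratio of the swapped and unswapped stationary densities, which is what Eq. (6) requires.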
Note that the transition probability $\alpha[(j,k) \to (k,j)]$ takes the logistic form; such a logistic acceptance test was developed by Barker (1965).

With mini-batching, the potential energy $\tilde{U}(\theta_j)$ becomes a random variable, and so does the difference $\tilde{U}(\theta_k) - \tilde{U}(\theta_j)$. By the CLT, $\delta E$ is asymptotically Gaussian with some variance $\sigma^2$. Seita et al. (2017) proposed a mini-batch version of Barker's logistic acceptance test in which $\delta E + C > 0$ must hold for the exchange to be carried out, where $C$ denotes an auxiliary correction random variable that aims to bridge the gap between the logistic distribution and the Gaussian. The probability density $p_C$ of this correction variable $C$ satisfies the convolution equation $p_C * p_{\mathcal{N}_\sigma} = p_L$; finding it is equivalent to solving the Gaussian deconvolution problem w.r.t. the standard logistic distribution.

By the convolution theorem for distributions, it is convenient to convert the Gaussian deconvolution into an inverse Fourier transform of the quotient of characteristic functions,
$$p_C(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{\phi_L(t)}{\phi_{\mathcal{N}_\sigma}(t)}\, e^{-ixt}\, \mathrm{d}t, \quad \text{since } \phi_C = \phi_L / \phi_{\mathcal{N}_\sigma}, \quad (8)$$
where $\phi_{\mathcal{N}_\sigma}$ and $\phi_L$ denote the characteristic functions of $\mathcal{N}(0, \sigma^2)$ and the standard logistic r.v., respectively. As the logistic distribution has much heavier tails than the Gaussian, the exact solution for $p_C$ does not exist: the "integrand" on the RHS of Eq. (8) is in fact not integrable. We can only approximate $p_C$ by introducing a damping kernel $\psi(t) = e^{-\gamma^2 t^4}$ of bandwidth $1/\gamma$ (see Fan, 1991) into Eq. (8):
$$\hat{p}_C(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \psi(t)\, \frac{\phi_L(t)}{\phi_{\mathcal{N}_\sigma}(t)}\, e^{-ixt}\, \mathrm{d}t = \frac{1}{2\pi} \int_{-\infty}^{\infty} \left[ \frac{\psi(t)}{\phi_{\mathcal{N}_\sigma}(t)} \right] \phi_L(t)\, e^{-ixt}\, \mathrm{d}t. \quad (9)$$

Using the Hermite polynomials $H_k$ (Abramowitz and Stegun, 1965), we now expand the quotient within the brackets of Eq. (9) as
$$\frac{\psi(t)}{\phi_{\mathcal{N}_\sigma}(t)} = e^{-\gamma^2 t^4 + \sigma^2 t^2 / 2} = \sum_{k=0}^{\infty} \frac{\gamma^k}{k!}\, H_k\!\left( \frac{\sigma^2}{4\gamma} \right) t^{2k}. \quad (10)$$
The correction distribution can then be approximated via Fourier's differentiation theorem:
$$\hat{p}_C(x) = \sum_{k=0}^{\infty} \frac{(-1)^k}{k!}\, H_k\!\left( \frac{\sigma^2}{4\gamma} \right) \gamma^k \left[ \frac{1}{2\pi} \int_{-\infty}^{\infty} (-it)^{2k}\, \phi_L(t)\, e^{-ixt}\, \mathrm{d}t \right] = \sum_{k=0}^{\infty} \frac{(-1)^k}{k!}\, H_k\!\left( \frac{\sigma^2}{4\gamma} \right) \gamma^k\, p_L^{(2k)}(x), \quad (11)$$
where $p_L^{(2k)}$ denotes the $2k$-th derivative of the logistic density, which can be efficiently calculated in a recursive fashion (Minai and Williams, 1993).
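The truncated series can be evaluated directly. The following is our own numerical sketch of Eq. (11), under the assumed damping kernel $\psi(t) = e^{-\gamma^2 t^4}$; the derivatives of the logistic density $p_L = s(1-s)$, with $s$ the sigmoid, are generated recursively as polynomials in $s$ using $\mathrm{d}(s^m)/\mathrm{d}x = m(s^m - s^{m+1})$:

```python
import math

def hermite_H(k, x):
    """Physicists' Hermite polynomial H_k(x) via the three-term recurrence
    H_{k+1}(x) = 2x H_k(x) - 2k H_{k-1}(x)."""
    h_prev, h = 1.0, 2.0 * x
    if k == 0:
        return h_prev
    for n in range(1, k):
        h_prev, h = h, 2.0 * x * h - 2.0 * n * h_prev
    return h

def logistic_deriv_coeffs(n):
    """Coefficients {m: a_m} of the n-th derivative of the logistic density
    p_L = s - s^2, written as sum_m a_m s^m in powers of the sigmoid s."""
    c = {1: 1.0, 2: -1.0}
    for _ in range(n):
        nxt = {}
        for m, a in c.items():
            nxt[m] = nxt.get(m, 0.0) + a * m
            nxt[m + 1] = nxt.get(m + 1, 0.0) - a * m
        c = nxt
    return c

def p_hat_C(x, sigma, gamma, K=8):
    """Truncated approximation of the correction density, Eq. (11)."""
    s = 1.0 / (1.0 + math.exp(-x))
    total = 0.0
    for k in range(K + 1):
        d2k = sum(a * s**m for m, a in logistic_deriv_coeffs(2 * k).items())
        total += ((-1.0)**k / math.factorial(k)) \
                 * hermite_H(k, sigma**2 / (4.0 * gamma)) * gamma**k * d2k
    return total
```

As $\sigma \to 0$ and $\gamma \to 0$, only the $k = 0$ term survives and $\hat{p}_C$ collapses onto the logistic density itself, consistent with $p_C * p_{\mathcal{N}_\sigma} = p_L$; the approximation error for finite $\sigma$ is governed by the bandwidth $1/\gamma$ (Fan, 1991).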
Figure 1: Experiment on sampling a mixture of 4 Gaussians.
Figure 2: Experiment on sampling a mixture of 5 Gaussians.
3. Experiment
We conduct two sets of experiments on synthetic distributions: the first is a mixture of 4 Gaussians, and the second is a Gaussian mixture with 5 isolated modes. The potential energy as well as its gradient is perturbed by zero-mean Gaussian noise of fixed variance, and none of the samplers under test has access to the actual parameters of that noise. We establish a ladder of temperatures with $R$ rungs ranging from the standard $T_1 = 1$ up to $T_R$; that is, $R$ replicas are simulated in parallel. The baselines are the classic HMC (Neal, 2011) and the adaptive variant SGNHT (Ding et al., 2014).

Figures 1 and 2 demonstrate that, in both synthetic test cases, our method accurately samples the target distributions with multiple isolated modes in the presence of noise within the mini-batch gradient, whereas both baselines fail: SGNHT manages to control the gradient noise but does not discover the isolated modes, while the classic HMC appears unable to correctly draw samples due to the deviated dynamics. Moreover, the subplot on the left of Fig. 1 illustrates the sampling trajectory of our method, indicating a good mixing property.
References
Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, volume 55. Courier Corporation, 1965.

A. A. Barker. Monte Carlo calculations of the radial distribution functions for a proton-electron plasma. Australian Journal of Physics, 18(2):119–134, 1965.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.

Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

David J. Earl and Michael W. Deem. Parallel tempering: theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005.

Jianqing Fan. On the optimal rates of convergence for nonparametric deconvolution problems. The Annals of Statistics, pages 1257–1272, 1991.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

Matthew M. Graham and Amos J. Storkey. Continuously tempered Hamiltonian Monte Carlo. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11–15, 2017, 2017.

W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

William G. Hoover. Canonical dynamics: equilibrium phase-space distributions. Physical Review A, 31(3):1695, 1985.

Andrew Jones and Ben Leimkuhler. Adaptive stochastic methods for sampling driven molecular systems. The Journal of Chemical Physics, 135(8):084125, 2011.

Rui Luo, Yaodong Yang, Jun Wang, and Yuanyuan Liu. Thermostat-assisted continuously-tempered Hamiltonian Monte Carlo for multimodal posterior sampling on large datasets. In Advances in Neural Information Processing Systems, 2018.

Enzo Marinari and Giorgio Parisi. Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19(6):451, 1992.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Ali A. Minai and Ronald D. Williams. On the derivatives of the sigmoid. Neural Networks, 6(6):845–853, 1993.

Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

Shuichi Nosé. A unified formulation of the constant temperature molecular dynamics methods. The Journal of Chemical Physics, 81(1):511–519, 1984.

H. Risken and H. Haken. The Fokker-Planck Equation: Methods of Solution and Applications, Second Edition. Springer, 1989.

Daniel Seita, Xinlei Pan, Haoyu Chen, and John F. Canny. An efficient minibatch acceptance test for Metropolis-Hastings. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11–15, 2017, 2017.

Yuji Sugita and Yuko Okamoto. Replica-exchange molecular dynamics method for protein folding. Chemical Physics Letters, 314(1–2):141–151, 1999.

Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 57(21):2607, 1986.