Continuously tempered Hamiltonian Monte Carlo
Matthew M. Graham
School of Informatics, University of Edinburgh
Amos J. Storkey
School of Informatics, University of Edinburgh
Abstract
Hamiltonian Monte Carlo (HMC) is a powerful Markov chain Monte Carlo (MCMC) method for performing approximate inference in complex probabilistic models of continuous variables. In common with many MCMC methods, however, the standard HMC approach performs poorly in distributions with multiple isolated modes. We present a method for augmenting the Hamiltonian system with an extra continuous temperature control variable which allows the dynamic to bridge between sampling a complex target distribution and a simpler unimodal base distribution. This augmentation both helps improve mixing in multimodal targets and allows the normalisation constant of the target distribution to be estimated. The method is simple to implement within existing HMC code, requiring only a standard leapfrog integrator. We demonstrate experimentally that the method is competitive with annealed importance sampling and simulated tempering methods at sampling from challenging multimodal distributions and estimating their normalising constants.
Applications of Markov chain Monte Carlo (MCMC) methods to perform inference in challenging statistical physics problems were among the earliest uses of the first modern electronic computers [39]. In the years since, MCMC methods have become a mainstay for performing approximate inference in complex probabilistic models in most fields of computational science. Statistical physics in particular has continued to play a key role in the development of MCMC methodology, both directly through the wider adoption of advances originally developed within the field such as tempering methods [18, 31, 43] and gradient-based dynamics [12], and more generally as a source of physical intuition for methods such as Gibbs sampling [14, 17].

The aim in MCMC methods is to construct a Markov chain which has as its unique stationary distribution a target distribution, such that samples from the chain can be used to form Monte Carlo estimates of expectations with respect to the target distribution. Several general purpose constructions have been developed for specifying valid Markov transition operators for arbitrary target distributions, including the seminal Metropolis–Hastings algorithm [22, 32] as well as alternatives such as slice [9, 36] and Gibbs sampling [14, 17]. While these methods give a basis for constructing chains that will asymptotically give correct inferences, producing algorithms that will give reasonable results in a practical amount of time is still a major challenge. For the restricted case of target distributions defined by densities that are differentiable functions of a real-valued vector variable,
Hamiltonian Monte Carlo (HMC) [12, 37], originally termed Hybrid Monte Carlo in [12], provides a framework for defining Markov chains that use the gradient of the target density to converge quickly to the target distribution and to make efficient large moves around it once it is reached.

Although HMC is able to efficiently explore contiguous regions of high probability density in a target distribution, as with most MCMC methods using local moves it struggles to move between isolated modes in multimodal target distributions [37]. Tempering methods [18, 31, 43], which augment the state space with an (inverse) temperature variable, are often used to improve exploration in multimodal distributions. As the temperature variable is typically discrete, however, it cannot be updated with a gradient-based HMC dynamic.

In this paper we present an alternative continuous tempering approach which instead augments the state with a continuous temperature variable. This allows use of HMC to jointly update both the temperature and original state, making the approach simple to use with existing HMC implementations. The proposed approach both improves exploration of multimodal target densities and also provides estimates of the typically unknown normalising constant of the target density, which is often an important inferential quantity in its own right for model comparison purposes [16].
Figure 1: Visualisations of continuous tempering (CT) in a bimodal univariate target density. (a) A two-component Gaussian mixture target density (blue curve) and Gaussian base density (green curve) with mean and variance matched to the target. (b) The extended potential energy on the target state 𝗑 and temperature control variable 𝗎 (contour plot, dark colours indicate low energy) and an example simulated Hamiltonian trajectory in the joint space (green curve). The temperature control variable bridges the base and target densities, lowering energy barriers in the target space. (c) Joint density on the target state 𝗑 and inverse temperature β (16) (contour plot, dark colours indicate high density) and samples from a CT HMC chain run in the joint space (circles; the size of each circle is proportional to 𝕡[β = 1 | 𝗑 = 𝑥] and so larger symbols indicate a greater weighting in estimates of expectations with respect to the target (21)). (d) Example target state sample histograms from running standard HMC in the original target density (top) and running HMC in the extended joint space (bottom).

HMC can be used to perform inference whenever the target distribution of interest is defined on a real-valued vector random variable 𝘅 ∈ ℝ^𝐷 = 𝒳 by a density function which is differentiable and has support almost everywhere in 𝒳. Distributions with bounded support can often be mapped to an equivalent unbounded density by performing a change of variables via a smooth bijective map [8, 25]. We will follow the common convention of defining the target density in a Boltzmann–Gibbs form in terms of a real-valued potential energy function 𝜙 : 𝒳 → ℝ and a (typically unknown) normalising constant 𝑍 by

𝕡[𝘅 = 𝒙] = (1/𝑍) exp[−𝜙(𝒙)] .  (1)

The key idea of HMC is to simulate a Hamiltonian dynamic in an extended state space to form long-range proposals for a Metropolis accept-reject step with a high probability of acceptance. The target vector 𝘅 is augmented with a momentum vector 𝗽 ∈ ℝ^𝐷.
Typically the momentum is chosen to be independent of the target state with marginal 𝕡[𝗽 = 𝒑] ∝ exp[−𝜏(𝒑)], with the joint density factorising as

𝕡[𝘅 = 𝒙, 𝗽 = 𝒑] = 𝕡[𝘅 = 𝒙] 𝕡[𝗽 = 𝒑] ∝ exp[−𝜙(𝒙) − 𝜏(𝒑)] .  (2)

By analogy with classical dynamics, 𝜏(𝒑) is referred to as the kinetic energy and ℎ(𝒙, 𝒑) = 𝜙(𝒙) + 𝜏(𝒑) is termed the Hamiltonian. By construction, marginalising the joint density over the momenta recovers 𝕡[𝘅 = 𝒙].

Each HMC update involves a discrete-time simulation of the canonical Hamiltonian dynamic

d𝒙/d𝑡 = ∂ℎ/∂𝒑ᵀ = ∂𝜏/∂𝒑ᵀ ,  d𝒑/d𝑡 = −∂ℎ/∂𝒙ᵀ = −∂𝜙/∂𝒙ᵀ ,  (3)

which conserves the Hamiltonian and is time-reversible and volume-preserving. Through choice of an appropriate symplectic integrator such as the popular leapfrog (Störmer–Verlet) scheme, the simulated discrete-time dynamic remains exactly time-reversible and volume-preserving and also approximately conserves the Hamiltonian even over long simulated trajectories [29]. The discrete-time dynamic is used to generate a new proposed state (𝒙′, 𝒑′) given the current state (𝒙, 𝒑), and this proposal is accepted or rejected in a Metropolis step with acceptance probability min{1, exp[ℎ(𝒙, 𝒑) − ℎ(𝒙′, 𝒑′)]}. Due to the approximate Hamiltonian conservation the acceptance probability is typically close to one, and so HMC is able to make long-range moves in high-dimensional state spaces while still maintaining high acceptance rates, a significant improvement over the behaviour typical of simpler Metropolis–Hastings methods in high dimensions.

The energy conservation property which gives this desirable behaviour, however, also suggests that standard HMC updates are unlikely to move between isolated modes in a target distribution. The Hamiltonian is approximately conserved over a trajectory, therefore 𝜙(𝒙′) − 𝜙(𝒙) ≈ 𝜏(𝒑) − 𝜏(𝒑′). Typically a quadratic kinetic energy 𝜏(𝒑) = 𝒑ᵀ𝑴⁻¹𝒑 / 2 is used, corresponding to a Gaussian marginal density on the momentum.
As this kinetic energy is bounded below by zero, the maximum change in potential energy over a trajectory is approximately equal to the initial kinetic energy. At equilibrium the momenta will have a Gaussian distribution, and so the kinetic energy a 𝜒² distribution with mean 𝐷/2 and variance 𝐷/2 [2, 37]. If potential energy barriers significantly larger than ∼𝐷 separate regions of the configuration state space, the HMC updates are unlikely to move across the barriers, meaning impractically long sampling runs will be needed for effective ergodicity.

A common approach in MCMC methods for dealing with multimodal target distributions is to introduce a concept of temperature. In statistical mechanics, the Boltzmann distribution on a configuration 𝘅 of a mechanical system with energy function 𝜙 and in thermal equilibrium with a heat bath at temperature 𝑇 is defined by a probability density exp[−𝛽𝜙(𝒙)]/𝑧(𝛽), where 𝛽 = (𝑘_B𝑇)⁻¹ is the inverse temperature, 𝑘_B is Boltzmann's constant and 𝑧(𝛽) is the partition function. At high temperatures (𝛽 → 0) the density function becomes increasingly flat across the target state space and, correspondingly, energy barriers between different regions of the state space become lower.

In the above statistical mechanics formulation, if 𝘅 ∈ ℝ^𝐷 the distribution in the limit 𝛽 → 0 would be an improper flat density across the target state space. More usefully from a statistical perspective, we can use an inverse temperature variable 𝛽 ∈ [0, 1] to geometrically bridge between a simple base distribution with normalised density exp[−𝜓(𝒙)] at 𝛽 = 0 and the target distribution at 𝛽 = 1:

𝜋(𝒙 | 𝛽) = exp[−𝛽𝜙(𝒙) − (1 − 𝛽)𝜓(𝒙)] ,  (4)
𝑧(𝛽) = ∫ exp[−𝛽𝜙(𝒙) − (1 − 𝛽)𝜓(𝒙)] d𝒙 .  (5)

Several approaches for varying the system temperature in MCMC methods have been proposed.
In simulated tempering [31] an ordered set of inverse temperatures is chosen,

{𝛽ₙ}ₙ₌₀^𝑁 : 0 = 𝛽₀ < 𝛽₁ < ⋯ < 𝛽_𝑁 = 1 ,  (6)

and a joint distribution defined as

𝕡[𝘅 = 𝒙, β = 𝛽ₙ] ∝ 𝜋(𝒙 | 𝛽ₙ) exp(𝑤ₙ) ,  (7)

where {𝑤ₙ}ₙ₌₀^𝑁 are a set of prior weights associated with each inverse temperature. Alternating MCMC updates of β given 𝘅 and 𝘅 given β are performed, with 𝘅 samples for which β = 1 converging in distribution to the target. Updates of the inverse temperature variable can propose moves to a limited set of neighbouring inverse temperatures or sample β independently from its conditional ℙ[β = 𝛽ₙ | 𝘅 = 𝒙]. The ratio of the marginal probabilities of β = 1 and β = 0 can be related to the usually unknown normalising constant 𝑍 for the target distribution by

𝑍 = exp(𝑤₀ − 𝑤_𝑁) ℙ[β = 1] / ℙ[β = 0] ,  (8)

allowing estimation of 𝑍 from a simulated tempering chain by computing the ratio of counts of β = 1 and β = 0.

As an alternative, [7] proposes to instead use a 'Rao-Blackwellised' estimator for 𝑍 based on the identity

ℙ[β = 𝛽ₙ] = ∫ ℙ[β = 𝛽ₙ | 𝘅 = 𝒙] 𝕡[𝘅 = 𝒙] d𝒙 ,  (9)

which indicates ℙ[β = 𝛽ₙ] can be estimated by averaging the conditional probabilities ℙ[β = 𝛽ₙ | 𝘅 = 𝒙] across samples of 𝘅 from the joint in (7). The authors empirically demonstrate this estimator can give significantly improved estimation accuracy over the simpler count-based estimator.

A problem for simulated tempering methods is that, as ℙ[β = 𝛽ₙ] ∝ 𝑧(𝛽ₙ) exp(𝑤ₙ) and the partition function can vary across several orders of magnitude, the chain can mix poorly between inverse temperatures.
This is often tackled with an iterative approach in which initial pilot runs are used to estimate ℙ[β = 𝛽ₙ] and the weights {𝑤ₙ}ₙ₌₀^𝑁 set so as to try to flatten out this marginal distribution [19].

An alternative is parallel tempering, also known as Metropolis-coupled MCMC or replica exchange [13, 18, 43], where multiple Markov chains on the target state are run in parallel, with the 𝑛th chain having an associated inverse temperature 𝛽ₙ defined as in (6). MCMC updates are performed on each chain which leave 𝜋(𝒙 | 𝛽ₙ) invariant. Interleaved with these updates, exchanges of the configuration states of the chains at adjacent inverse temperatures (𝛽ₙ, 𝛽ₙ₊₁) are proposed and accepted or rejected in a Metropolis–Hastings step. Parallel tempering sidesteps the issue with mixing between inverse temperatures by keeping the inverse temperatures for the chains fixed, at a cost of a significantly increased state size.

Annealed Importance Sampling (AIS) [35] is another thermodynamic ensemble method, closely related to Tempered Transitions [34] and often the 'go-to' method for normalising constant estimation in the machine learning literature, e.g. [40, 45]. In AIS an ordered set of inverse temperatures is defined as in (6), together with a corresponding series of transition operators {𝕋ₙ}ₙ₌₁^{𝑁−1} which have 𝜋(𝒙 | 𝛽ₙ) for the corresponding 𝑛 as their stationary distributions. Assuming the base distribution corresponding to 𝛽 = 0 can be sampled from independently, an independent state 𝒙⁽⁰⁾ from this base distribution is generated and then the transition operators applied in a fixed sequence 𝕋₁ … 𝕋_{𝑁−1} to generate a chain of intermediate states {𝒙⁽ⁿ⁾}ₙ₌₁^{𝑁−1}. The product of ratios

𝑟 = ∏ₙ₌₀^{𝑁−1} [𝜋(𝒙⁽ⁿ⁾ | 𝛽ₙ₊₁) / 𝜋(𝒙⁽ⁿ⁾ | 𝛽ₙ)]  (10)

is an importance weight for the final state 𝒙^(𝑁−1) with respect to the target distribution (1) and also an unbiased estimator for 𝑍.
By simulating multiple independent runs of this process, a set of importance weighted samples can be computed to estimate expectations with respect to the target (1), and the weights 𝑟 used to unbiasedly estimate 𝑍. Due to Jensen's inequality the unbiased estimate of 𝑍 corresponds to a stochastic lower bound on log 𝑍 [21]. In cases where an exact sample from the target distribution can be generated, running reverse AIS from the target to the base distribution can be used to calculate an unbiased estimate of 1/𝑍 and so a stochastic upper bound on log 𝑍 [21]. This combination of generating a stochastic lower bound on log 𝑍 from forward AIS runs and a stochastic upper bound from reverse AIS runs is termed bidirectional Monte Carlo.

In all three of AIS, parallel tempering and simulated tempering, the choice of the discrete inverse temperature schedule (6) is key to getting the methods to perform well in complex high-dimensional distributions [1, 3, 34]. To get reasonable performance it may be necessary to do preliminary pilot runs to guide the number of inverse temperatures and the spacing between them, adding to the computational burden and difficulty of using these methods in a black-box fashion. A natural question is therefore whether it is possible to use a continuously varying inverse temperature variable to side-step the need to choose a set of values.
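As a concrete illustration of the AIS procedure described above, the following is a minimal sketch on a hypothetical one-dimensional example; the base and target densities, the linear inverse temperature schedule and the random-walk Metropolis transition operators are all assumptions made for illustration, not the paper's experimental setup.

```python
import numpy as np

def log_base(x):
    # normalised N(0, 2^2) log-density: the -psi(x) of a hypothetical base
    return -0.5 * x ** 2 / 4.0 - 0.5 * np.log(2 * np.pi * 4.0)

def log_target_unnorm(x):
    # -phi(x): unnormalised two-mode density, with Z = 2 * sqrt(2 * pi) analytically
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

def log_bridge(x, beta):
    # log of the geometric bridge pi(x | beta) in (4)
    return beta * log_target_unnorm(x) + (1.0 - beta) * log_base(x)

def ais_log_weight(betas, n_steps=5, step=1.0, rng=None):
    # one AIS run: returns log r, the log importance weight in (10)
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(0.0, 2.0)  # exact sample from the base (beta = 0)
    log_r = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        log_r += log_bridge(x, b_next) - log_bridge(x, b_prev)
        for _ in range(n_steps):  # random-walk Metropolis leaving pi(.|b_next) invariant
            x_prop = x + step * rng.normal()
            if np.log(rng.uniform()) < log_bridge(x_prop, b_next) - log_bridge(x, b_next):
                x = x_prop
    return log_r

betas = np.linspace(0.0, 1.0, 101)
log_ws = np.array([ais_log_weight(betas, rng=np.random.default_rng(s)) for s in range(100)])
# unbiased estimate of Z is the mean of the weights; average in log space for stability
log_z_est = np.logaddexp.reduce(log_ws) - np.log(len(log_ws))
```

Here the target's normalising constant is analytically 2√(2π), so the estimate can be checked directly; in realistic problems 𝑍 is of course unknown.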
Path sampling [16] proposes this approach, defining a general path as a function parametrised by 𝛽 which continuously maps between the target density exp[−𝜙(𝒙)]/𝑍 at 𝛽 = 1 and a base density exp[−𝜓(𝒙)] at 𝛽 = 0, with the geometric bridge in (4) a particular example. A joint target 𝕡[𝘅 = 𝒙, β = 𝛽] ∝ 𝜋(𝒙 | 𝛽) 𝜌(𝛽) is defined, with 𝜌(𝛽) a user-chosen prior density on the inverse temperature variable analogous to the weights {𝑤ₙ}ₙ₌₀^𝑁 in simulated tempering. In [16] it is proposed to construct a Markov chain leaving the joint density invariant by alternating updates of 𝘅 given β and β given 𝘅. Samples from the joint system can be used to estimate 𝑍 via the thermodynamic integration identity

log 𝑍 = ∫ ∫ (𝕡[𝘅 = 𝒙, β = 𝛽] / 𝕡[β = 𝛽]) (∂ log 𝜋(𝒙 | 𝛽) / ∂𝛽) d𝒙 d𝛽 .

Adiabatic Monte Carlo [3] also proposes using a continuously varying inverse temperature variable, here specifically in the context of HMC. The original Hamiltonian system (𝘅, 𝗽) is further augmented with a continuous inverse temperature coordinate β ∈ [0, 1]. A contact Hamiltonian is defined on the augmented system,

ℎ_𝑐(𝒙, 𝒑, 𝛽) = 𝛽𝜙(𝒙) + (1 − 𝛽)𝜓(𝒙) + ½ 𝒑ᵀ𝑴⁻¹𝒑 + log 𝑧(𝛽) + ℎ₀ ,  (11)

this defining a corresponding contact Hamiltonian flow,

d𝒙/d𝑡 = ∂ℎ_𝑐/∂𝒑ᵀ ,  d𝒑/d𝑡 = (∂ℎ_𝑐/∂𝛽)𝒑 − ∂ℎ_𝑐/∂𝒙ᵀ ,  d𝛽/d𝑡 = ℎ_𝑐 − (∂ℎ_𝑐/∂𝒑)𝒑 .  (12)

The contact Hamiltonian flow restricted to the zero level-set of the contact Hamiltonian (which the initial state can always be arranged to lie in by appropriately choosing the arbitrary constant ℎ₀) generates trajectories which exactly conserve the contact Hamiltonian and the extended state space volume element, and correspond to the thermodynamical concept of an isentropic or reversible adiabatic process. For a quadratic kinetic energy, d𝛽/d𝑡 is always non-positive, and so forward simulation of the contact Hamiltonian flow generates non-increasing trajectories in 𝛽 (and backwards simulation generates non-decreasing trajectories in 𝛽).
In the ideal case this allows the inverse temperature range [0, 1] to be coherently traversed without the random walk exploration inherent to simulated tempering.

Simulating the contact Hamiltonian flow is non-trivial in practice, however: the contact Hamiltonian (11) depends on the log partition function log 𝑧(𝛽), the partial derivatives of which require computing expectations with respect to 𝜋(𝒙 | 𝛽), which for most problems is intractable to do exactly. Moreover the contact flow can encounter meta-stabilities whereby d𝛽/d𝑡 becomes zero and the flow halts at an intermediate 𝛽, meaning the flow no longer defines a bijection between 𝛽 = 0 and 𝛽 = 1. This can be ameliorated by regular resampling of the momenta; however, this potentially increases the random-walk behaviour of the overall dynamic.

An alternative extended Hamiltonian approach for simulating a system with a continuously varying inverse temperature was proposed recently in the statistical physics literature [20]. The inverse temperature of the system is indirectly set via an auxiliary variable, which we will term a temperature control variable 𝗎 ∈ ℝ. This control variable is mapped to an interval [𝑠, 1], 0 < 𝑠 < 1, via a smooth piecewise defined function 𝛽 : ℝ → [𝑠, 1], with the conditions that for a pair of thresholds (𝜃₁, 𝜃₂) with 0 < 𝜃₁ < 𝜃₂: 𝛽(𝑢) = 1 ∀ |𝑢| ≤ 𝜃₁, 𝛽(𝑢) = 𝑠 ∀ |𝑢| ≥ 𝜃₂, and 𝑠 < 𝛽(𝑢) < 1 for 𝜃₁ < |𝑢| < 𝜃₂. Unlike Adiabatic Monte Carlo, an additional momentum variable 𝗏 corresponding to 𝗎 is also introduced. Although seemingly a minor difference, this simplifies the implementation of the approach significantly, as the system retains a symplectic structure and can continue to be viewed within the usual Hamiltonian dynamics framework.
An extended Hamiltonian is then defined on the augmented system,

ℎ̃(𝒙, 𝑢, 𝒑, 𝑣) = 𝛽(𝑢)𝜙(𝒙) + 𝜔(𝑢) + ½ 𝒑ᵀ𝑴⁻¹𝒑 + 𝑣²/(2𝑚) ,  (13)

where 𝜔 is a 'confining potential' on 𝑢 and 𝑚 is the mass (marginal variance) associated with 𝗏. This extended Hamiltonian remains separable with respect to the extended configuration (𝘅, 𝗎) and extended momentum (𝗽, 𝗏), and so can be efficiently simulated using, for example, a standard leapfrog integrator. In [20] the extended Hamiltonian dynamics are integrated using a Langevin scheme without Metropolis adjustment and shown to improve mixing in several molecular dynamics problems.

Due to the condition 𝛽(𝑢) = 1 ∀ |𝑢| ≤ 𝜃₁, the set of sampled configuration states 𝘅 which have associated |𝗎| ≤ 𝜃₁ will (assuming the dynamic is ergodic and Metropolis adjustment were used) asymptotically converge in distribution to the target, and so can be used to estimate expectations without any importance re-weighting.

The 𝛽 function is required to be bounded below by some 𝑠 > 0 in [20] due to the base density being bridged to being an improper uniform density across 𝒳. The partition function 𝑧(𝛽) → ∞ as 𝛽 → 0 in this case, which would imply an infinite density for regions in the extended state space where 𝛽(𝗎) = 0. Even with a non-zero lower bound on 𝛽, the large variations in 𝑧[𝛽(𝗎)] across different 𝗎 values can lead to the dynamic poorly exploring the 𝗎 dimension. In [20] this issue is tackled by introducing an adaptive history-dependent biasing potential on 𝗎 to try to achieve a flat density across a bounded interval |𝗎| < 𝜃₂, using for example metadynamics [26]. The resulting non-Markovian updates bias the invariant distribution of the target state; however, this can be accounted for either by a re-weighting scheme [5], or by using a vanishing adaptation.
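Since the extended Hamiltonian (13) is separable in the extended configuration (𝘅, 𝗎) and extended momentum (𝗽, 𝗏), a standard leapfrog step applies unchanged. The following is a minimal sketch assuming identity mass matrices (𝑴 = 𝑰, 𝑚 = 1); all function names are illustrative.

```python
import numpy as np

def leapfrog_step(x, u, p, v, eps, phi, grad_phi, beta, dbeta, grad_omega):
    """One time-reversible, volume-preserving leapfrog step for the extended
    Hamiltonian (13), assuming identity masses (M = I, m = 1)."""
    # half step for both momenta: gradients of the extended potential
    # beta(u) * phi(x) + omega(u) with respect to x and u
    p = p - 0.5 * eps * beta(u) * grad_phi(x)
    v = v - 0.5 * eps * (dbeta(u) * phi(x) + grad_omega(u))
    # full step for the extended configuration
    x = x + eps * p
    u = u + eps * v
    # second half step for the momenta at the new configuration
    p = p - 0.5 * eps * beta(u) * grad_phi(x)
    v = v - 0.5 * eps * (dbeta(u) * phi(x) + grad_omega(u))
    return x, u, p, v
```

A Metropolis-adjusted chain can use a sequence of such steps as its proposal, exactly as in standard HMC, since the update remains reversible and volume-preserving and approximately conserves the extended Hamiltonian.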
As in [20] we define an extended Hamiltonian on an augmented state (𝘅, 𝗎) with associated momenta (𝗽, 𝗏),

ℎ̃(𝒙, 𝑢, 𝒑, 𝑣) = 𝛽(𝑢)[𝜙(𝒙) + log 𝜁] + [1 − 𝛽(𝑢)]𝜓(𝒙) − log |∂𝛽/∂𝑢| + ½ 𝒑ᵀ𝑴⁻¹𝒑 + 𝑣²/(2𝑚) .  (14)

Like the previous extended Hamiltonian approach of [20], this Hamiltonian is separable and the corresponding dynamic can be efficiently simulated with a leapfrog integrator. The reversible and volume-preserving simulated dynamic can then be used as a proposal generating mechanism on the joint space (𝘅, 𝗎, 𝗽, 𝗏) for a Metropolis–Hastings step as in standard HMC. We will term this approach of running HMC in the extended joint space joint continuous tempering.

In contrast to [20] we propose to use a smooth monotonically increasing map 𝛽 : ℝ → [0, 1] as the inverse temperature control function, with our default choice in all experiments being the logistic sigmoid 𝛽(𝑢) = [1 + exp(−𝑢)]⁻¹. As in the discussion earlier, 𝜓 is the negative logarithm of a simple normalised base density, which (as we will motivate in the next section) we will usually choose to be an approximation to the target density (1). Similarly the log 𝜁 term will be chosen to be an approximation to log 𝑍.

We can marginalise out the momenta from the joint distribution defined by this Hamiltonian, giving a joint density

𝕡[𝘅 = 𝒙, 𝗎 = 𝑢] ∝ |∂𝛽/∂𝑢| exp[−𝛽(𝑢)(𝜙(𝒙) + log 𝜁) − (1 − 𝛽(𝑢))𝜓(𝒙)] .  (15)

If we define a random variable β = 𝛽(𝗎) and use the change of variables formula for a density, we further have that

𝕡[𝘅 = 𝒙, β = 𝛽] ∝ exp[−𝛽𝜙(𝒙) − (1 − 𝛽)𝜓(𝒙)] 𝜁^{−𝛽} .  (16)

The bijectivity between 𝗎 and β is useful because, although when simulating a Hamiltonian dynamic we will generally wish to work with 𝗎 as it is unbounded, the conditional density on β given 𝘅 has a tractable normalised form,

𝕡[β = 𝛽 | 𝘅 = 𝒙] = exp[−𝛽Δ(𝒙)] Δ(𝒙) / (1 − exp[−Δ(𝒙)]) ,  (17)

with Δ(𝒙) = 𝜙(𝒙) + log 𝜁 − 𝜓(𝒙).
This corresponds to an exponential distribution with rate parameter Δ(𝒙) truncated to [0, 1]. As an alternative to the joint updates, another option is therefore to form a Markov chain which leaves (16) invariant by alternately sampling β independently from its conditional given 𝘅 and performing a transition which leaves the conditional on 𝘅 given β invariant, similar to the suggested approach in [16]. We will term this Gibbs sampling type procedure Gibbs continuous tempering.

Further, we can use (17) to write the marginal density on β as

𝕡[β = 𝛽] = 𝔼[ exp[−𝛽Δ(𝘅)] Δ(𝘅) / (1 − exp[−Δ(𝘅)]) ] .  (18)

By integrating the joint (16) over 𝒳 we also have that

𝕡[β = 𝛽] = (1/(𝐶𝜁^𝛽)) ∫ 𝜋(𝒙 | 𝛽) d𝒙 = 𝑧(𝛽)/(𝐶𝜁^𝛽)  ⇒  𝕡[β = 0] = 1/𝐶 ,  𝕡[β = 1] = 𝑍/(𝐶𝜁) ,  (19)

for an unknown normalising constant 𝐶 of the joint, and so for MCMC samples {𝒙⁽ˢ⁾, 𝛽⁽ˢ⁾}ₛ₌₁^𝑆 from the joint (16),

𝑍 = (𝕡[β = 1] / 𝕡[β = 0]) 𝜁 = lim_{𝑆→∞} ( ∑ₛ₌₁^𝑆 𝑤₁(𝒙⁽ˢ⁾) / ∑ₛ₌₁^𝑆 𝑤₀(𝒙⁽ˢ⁾) ) 𝜁 ,  (20)

with 𝑤₀(𝒙) = Δ(𝒙) / (1 − exp[−Δ(𝒙)]) and 𝑤₁(𝒙) = Δ(𝒙) / (exp[Δ(𝒙)] − 1). This can be seen to be a continuous analogue of the
Rao-Blackwellised estimator combining (8) and (9) used in [7]. Similarly we can calculate consistent estimates of expectations with respect to the target density 𝕡[𝘅 = 𝒙 | β = 1] = exp[−𝜙(𝒙)]/𝑍 as importance weighted sums,

𝔼[𝑓(𝘅) | β = 1] = ∫ 𝑓(𝒙) 𝕡[β = 1 | 𝘅 = 𝒙] 𝕡[𝘅 = 𝒙] d𝒙 / ∫ 𝕡[β = 1 | 𝘅 = 𝒙] 𝕡[𝘅 = 𝒙] d𝒙 = lim_{𝑆→∞} ∑ₛ₌₁^𝑆 𝑤₁(𝒙⁽ˢ⁾)𝑓(𝒙⁽ˢ⁾) / ∑ₛ₌₁^𝑆 𝑤₁(𝒙⁽ˢ⁾) .  (21)

We can also estimate expectations with respect to the base density 𝕡[𝘅 = 𝒙 | β = 0] = exp[−𝜓(𝒙)] using

𝔼[𝑓(𝘅) | β = 0] = lim_{𝑆→∞} ∑ₛ₌₁^𝑆 𝑤₀(𝒙⁽ˢ⁾)𝑓(𝒙⁽ˢ⁾) / ∑ₛ₌₁^𝑆 𝑤₀(𝒙⁽ˢ⁾) .  (22)

Often the base density will have known moments (e.g. mean and covariance of a Gaussian base density) which can be compared to the estimates calculated using (22) to check for convergence problems. Convergence of the estimates to the true moments is not a sufficient condition for convergence of the chain to the target joint density (16), but it is necessary.

By applying Hölder's and Jensen's inequalities we can bound 𝕡[β = 𝛽] (see Appendix A for details),

(1/𝐶)(𝑍/𝜁)^𝛽 exp(−𝛽 𝑑_{𝑏→𝑡}) ≤ 𝕡[β = 𝛽] ≤ (1/𝐶)(𝑍/𝜁)^𝛽 ,  (23)

where 𝑑_{𝑏→𝑡} indicates the Kullback–Leibler (KL) divergence from the base to the target distribution,

𝑑_{𝑏→𝑡} = ∫ exp[−𝜓(𝒙)] log( exp[−𝜓(𝒙)] / (exp[−𝜙(𝒙)]/𝑍) ) d𝒙 .  (24)

If 𝜁 = 𝑍 the upper bound is constant. If additionally we had 𝑑_{𝑏→𝑡} = 0, the bound becomes tight and we would have a flat marginal density on β, which, although not guaranteed to be optimal, we hope will improve mixing in the β dimension. In reality we do not know 𝑍, and we cannot choose a base distribution such that the KL divergence is zero, as we wish to use a simple density amenable to exploration. However we can see that, under the constraint of the base distribution allowing exploration, a potentially useful heuristic is to minimise the KL divergence to the target distribution.
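The weight functions 𝑤₀, 𝑤₁ and the estimators (20) and (21) are straightforward to implement from samples of the joint; a minimal sketch follows (function names are illustrative, and the Δ → 0 limit, in which both weights tend to one, is handled explicitly):

```python
import numpy as np

def ct_weights(deltas):
    """w0 and w1 from (20), for an array of delta(x) = phi(x) + log zeta - psi(x)."""
    d = np.asarray(deltas, dtype=float)
    small = np.abs(d) < 1e-12      # both weights tend to 1 as delta -> 0
    d_safe = np.where(small, 1.0, d)
    w0 = np.where(small, 1.0, d_safe / (1.0 - np.exp(-d_safe)))
    w1 = np.where(small, 1.0, d_safe / np.expm1(d_safe))
    return w0, w1

def ct_log_z_estimate(deltas, log_zeta):
    """Estimate log Z via (20): Z = (sum w1 / sum w0) * zeta."""
    w0, w1 = ct_weights(deltas)
    return np.log(w1.sum()) - np.log(w0.sum()) + log_zeta

def ct_target_expectation(f_vals, deltas):
    """Importance-weighted estimate of E[f(x) | beta = 1] via (21)."""
    _, w1 = ct_weights(deltas)
    return (w1 * np.asarray(f_vals)).sum() / w1.sum()
```

Note that 𝑤₀(𝒙)/𝑤₁(𝒙) = exp[Δ(𝒙)], so samples with large positive Δ contribute little weight to target expectations.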
Further, we want to find a 𝜁 as close to 𝑍 as possible.

Variational inference is an obvious route for tackling both problems, allowing us to fit a base density in a simple parametric family (e.g. Gaussian) by directly minimising the KL divergence (24) while also giving a lower bound on log 𝑍. In some cases we can use variational methods specifically aimed at the target distribution family. More generally, methods such as Automatic Differentiation Variational Inference (ADVI) [25] provide a black-box framework for fitting variational approximations to differentiable target densities. In some models such as variational autoencoders [24, 38], a parametric variational approximation to the target density of interest (e.g. posterior on latent space) is fitted during training of the original model, in which case it provides a natural choice for the base distribution, as observed in [45].

A potential problem is that the classes of target distribution that we are particularly interested in applying our approach to (those with multiple isolated modes) are precisely the same distributions that simple variational approximations will tend to fit poorly: the divergence (24) being minimised favours 'mode-seeking' solutions [4], which usually fit only one mode well. This both limits how small the divergence from the base to target can be made and, crucially, is also undesirable because we wish the base distribution to allow the dynamic to move between modes in the target.

One option would be to instead minimise the reversed form of the KL divergence, from the target to the base distribution, which tends to produce 'mode-covering' solutions that match global moments (and can also be used to produce an analogous lower bound on the marginal density to that for 𝑑_{𝑏→𝑡} in (23), as shown in Appendix A). Methods such as expectation propagation (EP) [33] do allow moment-matching approximations to be found and may be a good option in some cases.
Another possibility would be to use an alternative divergence in a variational framework that favours mode-covering solutions, with methods using the 𝜒²-divergence [11] and Rényi divergence [30] having been recently proposed in this context.

A further alternative is to fit multiple local variational approximations {𝑞ᵢ(𝒙)}ᵢ₌₁^𝐿 by minimising the variational objective from multiple random parameter initialisations (discarding any duplicate solutions as measured by some distance tolerance between the variational parameters), each approximating a single mode well. We can then combine these local approximations into a global approximation 𝑞(𝒙). One option is to use a mixture of the local approximations,

𝑞(𝒙) = (1/𝜁) ∑ᵢ₌₁^𝐿 exp(𝓁ᵢ) 𝑞ᵢ(𝒙) ,  𝜁 = ∑ᵢ₌₁^𝐿 exp(𝓁ᵢ) ,  (25)

with 𝓁ᵢ the final variational objective value for 𝑞ᵢ (log 𝑍 minus the KL divergence from 𝑞ᵢ to the target distribution). If the local approximations have non-overlapping support, this will lead to a global approximation which is guaranteed to be at least as close in KL divergence as any of the local approximations, and a log 𝜁 which is at least as tight a lower bound on log 𝑍 as any of the individual 𝓁ᵢ [47].

Often we may wish to use local approximations with overlapping support (e.g. Gaussian) where the guarantee does not apply. However, for target distributions with multiple isolated modes the 'overlap' (regions with high density under multiple 𝑞ᵢ) between local Gaussian approximations to each mode will often be minimal, and so the method is still a useful heuristic.

A mixture distribution is unlikely to itself be a good choice of base distribution, however, as it will tend to be multimodal. We therefore instead propose to use a base distribution with moments matched to the fitted mixture distribution, e.g.
a Gaussian exp[−𝜓(𝒙)] with mean and covariance matched to the mean and covariance of the mixture 𝑞(𝒙).

For complex target distributions it will sometimes remain difficult to find a reasonable 𝜓 and log 𝜁 which well approximate 𝜙 and log 𝑍 using approaches such as those just discussed. If the KL divergence from the base to target distribution is high, the bounds on the marginal in (23) become increasingly loose, and in practice this tends to lead to low densities on intermediate inverse temperatures. This reduces the ability of the MCMC dynamic to move between the inverse temperatures corresponding to the target and base distributions, and so the mixing gain from the augmentation. Similarly, from (19) the density ratio 𝕡[β = 1]/𝕡[β = 0] is exp(log 𝑍 − log 𝜁), and so for large |log 𝑍 − log 𝜁| the dynamic will tend to spend most of the time close to β = 1 if log 𝑍 > log 𝜁, or close to β = 0 if log 𝑍 < log 𝜁. In the former case there will be limited gain from using the extended space over running a chain on the original target density, while in the latter the variance of the estimator in (21) will be comparable to directly using an importance sampling estimator for the target with samples from the base density.

A natural approach to try to ameliorate these effects is to use an iterative 'bootstrap' method. An initial 𝜓 and log 𝜁 are chosen, for example using a Gaussian variational approximation to the target density as described above. A Markov chain which leaves the joint density (16) invariant is then simulated. The samples from this chain can then be used both to compute an estimate for log 𝑍 using (20) and to compute updated estimates of the target density mean and covariance using (21). A new Gaussian base density can then be specified with the updated mean and covariance estimates, and log 𝜁 chosen to be the updated log 𝑍 estimate. Samples of the new joint density can then be used to update log 𝜁 and 𝜓 again, and so on.
This is analogous to the iterative approaches often used in simulated tempering to choose the prior weights on the inverse temperature values [7, 19].

To validate the proposed approach we performed a series of experiments comparing our proposed continuous tempering methods to existing approaches. Code for running all experiments is available at https://git.io/cthmc.

We start with an illustrative example of the gains of the proposed approach over standard HMC in target densities with isolated modes. We use a one-dimensional Gaussian mixture density with two separated Gaussian components as the target density, shown by the blue curve in Figure 1a. Although performance in this toy univariate model is not necessarily reflective of that in more realistic higher-dimensional models, it has the advantage of allowing the joint density on 𝗑 and β = 𝛽(𝗎) to be directly visualised. For the base density exp[−𝜓(𝑥)] we use a univariate Gaussian with mean and variance matched to those of the target density (corresponding to the Gaussian density minimising the KL divergence from the target to base distribution), shown by the green curve in Figure 1a. We also set log 𝜁 = log 𝑍, and so the performance here represents a 'best-case' scenario for the continuous tempering approach.

The resulting potential energy (−log 𝕡[𝗑, 𝗎]) on the extended (𝗑, 𝗎) space is shown in Figure 1b. For positive temperature control values (and so inverse temperature values close to 1), the energy surface tends increasingly to the double-well potential corresponding to the target distribution, with a high energy barrier between the two modes. For negative temperature control values the energy surface tends towards the single quadratic well corresponding to the Gaussian base density.
The resulting joint energy surface allows paths between the values of the target state 𝗑 corresponding to the two modes which have much lower potential energy barriers than the barrier between the two modes in the original target space, allowing simulated Hamiltonian trajectories such as that shown in green to more easily explore the target state space. Samples from a HMC chain on the extended joint space are shown in Figure 1c, with the joint density on (𝗑, β) (16) shown in the background as a contoured heat map. It can be seen that the Hamiltonian dynamic is able to explore the joint space well, with good coverage of all of the high density regions. The size of the points in Figure 1c is proportional to 𝑤(𝑥) = 𝕡[β = 1 | 𝗑 = 𝑥] and so reflects the importance weights of the samples in the estimator for expectations with respect to the target in (21). Importantly, even points for which β is close to zero can contribute significantly to the expectations if the corresponding 𝗑 value is probable under the target: this is in contrast to the extended Hamiltonian approach of [20], where only the subset of points corresponding to β = 1 is used to compute expectations. The final panel, Figure 1d, shows empirical histograms of the target variable 𝗑 estimated from samples of a chain on the extended space (joint continuous tempering, bottom) and standard HMC on the original target space (top). As can be seen, the standard HMC approach gets stuck in one mode and thus does not assign any mass to the other mode in the histogram, unlike the tempered chain, which identifies both modes and accurately estimates their relative masses.

For a second, more challenging test case, we performed inference in Gaussian mixture relaxations of a set of ten synthetic Boltzmann machine distributions [46].
The parameters of the Boltzmann machine distributions were randomly generated so that the corresponding relaxations are highly multimodal and so challenging to explore well. The moments of the relaxation distributions can be calculated from the moments of the original discrete Boltzmann machine distribution, which for models with a small number of binary units 𝐷_𝐵 (30 in our experiments) can be computed exactly by exhaustive iteration across the 2^𝐷_𝐵 discrete states. This allows ground truth moments to be calculated against which convergence can be checked. The parametrisation used is described in Appendix B. A Gaussian base density and approximate normalising constant 𝜁 were fit to each of the 10 relaxation target densities by matching moments to a mixture of variational Gaussian approximations (individually fitted using a mean-field approach based on the underlying Boltzmann machine distribution) as described in Section 6. Plots showing the root mean squared error (RMSE) in estimates of log 𝑍 and the mean and covariance of the relaxation distribution against computational run time for different sampling methods are shown in Figure 2. The RMSE values are normalised by the RMSEs of the corresponding estimated moments used in the base density (and log 𝜁) such that values below unity indicate an improvement in accuracy over the variational approximation. The curves shown are RMSEs averaged over 10 independent runs for each of the 10 generated parameter sets, with the filled regions indicating ±3 standard errors of the mean. The free parameters of all methods were hand-tuned on one parameter set and these values then fixed across all runs.
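The exhaustive moment computation just described can be sketched for a small model. The random parameters and the simple diagonal shift used to form 𝑸 below are illustrative stand-ins; the experiments instead use the semi-definite programme described in Appendix B.

```python
import numpy as np

# Sketch of computing exact Boltzmann machine moments by enumerating all
# 2^{D_B} signed binary states, then mapping them to relaxation moments via
# E[x] = Q^T E[s] and E[x x^T] = Q^T E[s s^T] Q + I. The parameters and the
# crude diagonal shift (not the SDP of Appendix B) are illustrative only.
rng = np.random.RandomState(0)
D_B = 8
W = 0.1 * rng.standard_normal((D_B, D_B))
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0.0)
b = 0.1 * rng.standard_normal(D_B)

# diagonal shift making W + D positive definite, so Q Q^T = W + D (here D = D_B)
shift = -np.linalg.eigvalsh(W).min() + 0.01
Q = np.linalg.cholesky(W + shift * np.eye(D_B))

# enumerate all signed binary states and their probabilities
S = np.array([[1 - 2 * ((i >> j) & 1) for j in range(D_B)]
              for i in range(2 ** D_B)], dtype=float)
logp = 0.5 * np.einsum('ni,ij,nj->n', S, W, S) + S @ b
p = np.exp(logp - logp.max())
p /= p.sum()

E_s = p @ S                                # E[s]
E_ss = np.einsum('n,ni,nj->ij', p, S, S)   # E[s s^T]
E_x = Q.T @ E_s                            # exact relaxation mean
E_xx = Q.T @ E_ss @ Q + np.eye(D_B)        # exact relaxation second moment
```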
All methods used a shared Theano [44] implementation running on an Intel Core i5-2400 quad-core CPU for the underlying HMC updates, and so run times are roughly comparable. For simulated tempering (ST), Rao-Blackwellised estimators were used as described in [7], with HMC-based updates of the target state 𝘅 | β interleaved with independent sampling of β | 𝘅, and 1000 𝛽_𝑛 values used. For AIS, HMC updates were used for the transition operators, and separate runs with 1000, 5000 and 10000 𝛽_𝑛 values were used to obtain estimates at different run times. For the tempering approaches, run times correspond to increasing numbers of MCMC samples.

The two continuous tempering (CT) approaches, Gibbs CT and joint CT, both dominate in terms of having lower average RMSEs in all three moment estimates across all run times, with joint CT showing marginally better performance on estimates of log 𝑍 and 𝔼[𝘅]. The tempering approaches seem to outperform AIS here, possibly as the highly multimodal nature of the target densities favours the ability of tempered dynamics to move up and down the inverse temperature scale and so in and out of modes in the target density, unlike AIS, where the deterministic temperature traversal is more likely to mean the runs end up being confined to a single mode after the initial transitions for low 𝛽_𝑛.

As a third experiment, we apply our joint continuous tempering approach to perform Bayesian inference in a hierarchical regression model for predicting indoor radon measurements [15]. To illustrate the ease of integrating our approach in existing HMC-based inference software, this experiment was performed with the Python package PyMC3 [42], with its ADVI feature used to fit the base density and its implementation of the adaptive
No U-Turn Sampler (NUTS) [23] HMC variant used to sample from the extended space. The regression target in the dataset is measurements of the amount of radon gas 𝗒^(𝑖) in 𝑁 = 919 households. Two continuous regressors 𝒙^(𝑖) and one categorical regressor 𝑐^(𝑖) are provided per household. A multilevel regression model defined by the factor graph in Figure 3a was used. The model includes five scalar parameters (σ_𝛼, µ_𝛼, σ_𝛽, µ_𝛽, 𝜖), an 85-dimensional intercept vector α and a two-dimensional regressor coefficients vector β, giving 92 parameters in total.

As an example task, we consider inferring the marginal likelihood of the data under the model. Estimation of the marginal likelihood from MCMC samples of the target density alone is non-trivial, with approaches such as the harmonic mean estimator having high variance. Here we try to establish whether our approach can be used in a black-box fashion to compute a reasonable estimate of the marginal likelihood. As our 'ground truth' we use a large batch of long AIS runs (averaged across 100 runs of 10000 inverse temperatures) on a separate Theano implementation of the model. We use ADVI to fit a diagonal-covariance Gaussian variational approximation to the target density and use this as the base density. NUTS chains, initialised at samples from the base density, were then run on the extended space for 2500 iterations. The samples from these chains were used to compute estimates of the normalising constant (marginal likelihood) using the estimator (20). The results are shown in Figure 3. It can be seen that estimates from the NUTS chains in the extended continuously tempered space quickly converge to a marginal likelihood estimate very close to the AIS estimate, and significantly improve over the final lower bound on the marginal likelihood that ADVI converges to.
[Figure 2 panels: (a) log 𝑍; (b) 𝔼[𝘅]; (c) 𝔼[𝘅𝘅ᵀ] − 𝔼[𝘅]𝔼[𝘅]ᵀ; relative RMSE against run time for AIS, ST, Gibbs CT and joint CT.]

Figure 2:
Root Mean Squared Errors (RMSEs) in empirical moments estimated from MCMC samples against run time for various thermodynamic ensemble MCMC methods run on Gaussian Boltzmann machine relaxation target distributions. All RMSEs are relative to the RMSE of the corresponding approximate moments calculated using the moment-matched variational mixtures, so values below one represent an improvement on the deterministic approximation. For AIS, points across the time axis represent increasing numbers of inverse temperatures: (1, 5, 10) × 10³. For ST, Gibbs CT and joint CT, curves show RMSEs for expectations calculated with increasing numbers of samples from the chains. All curves / points show the mean across 10 runs for each of 10 generated parameter sets. Filled regions / error bars show ±3 standard errors of the mean.

[Figure 3 panels: (a) Factor graph of hierarchical model; (b) Log marginal likelihood estimates.]
Figure 3: (a) Factor graph for the hierarchical regression model for the radon data. Factor notation: 𝒩(𝜇, 𝜎) denotes a normal distribution with mean 𝜇 and (co)variance 𝜎; 𝒞₊(𝛾) a half-Cauchy distribution with scale 𝛾; 𝛿(𝑧) a Dirac delta located at 𝑧. (b) Log marginal likelihood estimates against run time for the hierarchical regression model. The black dashed line shows the estimated log marginal likelihood from a long AIS run, which is used as a proxy for the ground truth. The noisy blue curve shows the ADVI evidence lower bound objective over training, and the red dot-dashed line the final converged value used for log 𝜁. The green curve shows log marginal likelihood estimates using samples from NUTS chains running on the extended joint density in the estimator (20), with the run time corresponding to increasing numbers of samples being included in the estimator (offset by the initial ADVI run time). The curve shows the mean over 10 independent runs and the filled region ±3 standard errors of the mean.

For our final experiments, we compare the efficiency of our continuous tempering approaches to simulated tempering and AIS for marginal likelihood estimation in decoder-based generative models for images. Use of AIS in this context was recently proposed in [45]. Specifically, we estimate the joint marginal likelihood of 1000 generated binary images under the Bernoulli decoder distributions of two importance weighted autoencoder (IWAE) [6] models. Each IWAE model has one stochastic hidden layer and a 50-dimensional latent space, with the two models trained on binarised versions of the MNIST [28] and Omniglot [27] datasets using the code at https://github.com/yburda/iwae. The generated images used in the experiments are shown in Appendix C. By performing inference on the per-image posterior densities on the latent representation given each image, the joint marginal likelihood of the images can be estimated as the product of estimates of the normalising constants of the individual posterior densities.
The use of generated images allows bidirectional Monte Carlo (BDMC) [21] to be used to 'sandwich' the marginal likelihood with both stochastic upper and lower bounds formed with long forward and backward AIS runs (averages over 16 independent runs with 10000 inverse temperatures, as used in [45]). As the per-image latent representations are conditionally independent given the images, chains on all the posterior densities can be run in parallel, with the experiments in this section run on an NVIDIA Tesla K40 GPU to exploit this inherent parallelism. The encoder of the trained IWAE
[Figure 4 panels: (a) MNIST log marginal likelihood estimates; (b) Omniglot log marginal likelihood estimates; estimates against time for ST, Gibbs CT, joint CT and AIS, with BDMC upper / lower bounds.]
Figure 4: Estimates of the log joint marginal likelihood of 1000 generated images under the Bernoulli decoder distributions of two IWAE models trained on the MNIST and Omniglot datasets against computation time. The black / red dashed lines show stochastic upper / lower bounds calculated using long BDMC runs. For AIS, points across the time axis represent increasing numbers of inverse temperatures: (50, …). For ST, Gibbs CT and joint CT, curves show estimates calculated with increasing numbers of samples from the chains. All curves / points show the mean across 10 runs.

models is an inference network which outputs the mean and diagonal covariance of a Gaussian variational approximation to the posterior density on the latent representation given an image, and so was used to define per-image Gaussian base densities as suggested in [45]. Similarly, the per-image log 𝜁 values were set using importance-weighted variational lower bound estimates of the per-image marginal likelihoods. The results are shown in Figure 4, with the curves / points showing average results across 10 independent runs and the filled regions / bars ±3 standard errors of the mean for the estimates. Here Gibbs CT and AIS perform similarly, with joint CT converging less quickly and simulated tempering significantly less efficient. The quick convergence of AIS and Gibbs CT here suggests the posterior densities are relatively easy for the dynamics to explore and well matched by the Gaussian base densities, limiting the gains from any more coherent exploration of the extended space by the joint CT updates. The higher per-leapfrog-step costs of the HMC updates in the extended space therefore mean the joint CT approach is less efficient overall here.
The poorer performance of simulated tempering here is in part due to the generation of the discrete random indices becoming a bottleneck in the GPU implementation. A possible reason for the better relative performance of AIS here compared to the experiments in Section 7.2 is its more effective utilisation of the parallel compute cores available when running, for example, on a GPU. Multiple AIS chains can be run for each data point and the resulting unbiased estimates of each data point's marginal likelihood averaged (reducing the per-data-point variance) before taking their product for the joint marginal likelihood estimate. While it is also possible to run multiple tempered chains per data point and similarly combine the estimates, empirically we found that greater gains in estimation accuracy came from running a single longer chain rather than multiple shorter chains of total length equivalent to the longer chain. This can be explained by the initial warm-up transients of each shorter chain having a greater biasing effect on the overall estimate compared to running longer chains.

The approach we have presented is a simple but powerful extension to existing tempering methods which can both help exploration of distributions with multiple isolated modes and allow estimation of the normalisation constant of the target distribution. A key advantage of the joint continuous tempering method is the simplicity of its implementation: it simply requires running HMC in an extended state space and so can easily be used, for example, within existing probabilistic programming software which uses HMC-based methods for inference, such as PyMC3 [42] and Stan [8].
By updating the inverse temperature jointly with the original target state, it is also possible to leverage adaptive HMC variants such as NUTS [23] to perform tempered inference in a 'black-box' manner, without the need to separately tune the updates of the inverse temperature variable.

The Gibbs continuous tempering method also provides a relatively black-box framework for tempering. Compared to simulated tempering, it removes the requirement to choose the number and spacing of discrete inverse temperatures, and also replaces generation of a discrete random variate from a categorical distribution when updating β given 𝘅 (which, as seen in Section 7.4, can become a computational bottleneck) with generation of a truncated exponential variate (which can be performed efficiently by inverse transform sampling). Compared to the joint continuous tempering approach, the Gibbs approach is less simple to directly integrate into existing HMC implementations due to the separate β updates, but it eliminates the need to tune the temperature control mass value 𝑚 and achieved similar or better sampling efficiency in the experiments in Section 7.

Our proposal to use variational approximations within an MCMC framework can be viewed within the context of several existing approaches which suggest combining variational and MCMC inference methods. Variational MCMC [10] proposes using a variational approximation as the basis for a proposal distribution in a Metropolis–Hastings MCMC method.
MCMC and Variational Inference: Bridging the Gap [41] includes parametrised MCMC transitions within a (stochastic) variational approximation and optimises the variational bound over these (and a base distribution's) parameters. Here we exploit cheap (relative to running a long MCMC chain) but biased variational approximations to a target distribution and its normalising constant, and propose using them within an MCMC method which gives asymptotically exact results, to help improve sampling efficiency.
Acknowledgements
References

[1] G. Behrens, N. Friel, and M. Hurn. Tuning tempered transitions. Statistics and Computing, 2012.
[2] M. Betancourt. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In Geometric Science of Information, 2013.
[3] M. Betancourt. Adiabatic Monte Carlo. arXiv preprint arXiv:1405.3489, 2014.
[4] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[5] M. Bonomi, A. Barducci, and M. Parrinello. Reconstructing the equilibrium Boltzmann distribution from well-tempered metadynamics. Journal of Computational Chemistry, 2009.
[6] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[7] D. Carlson, P. Stinson, A. Pakman, and L. Paninski. Partition functions from Rao-Blackwellized tempered sampling. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[8] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.
[9] P. Damlen, J. Wakefield, and S. Walker. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999.
[10] N. De Freitas, P. Højen-Sørensen, M. I. Jordan, and S. Russell. Variational MCMC. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001.
[11] A. B. Dieng, D. Tran, R. Ranganath, J. Paisley, and D. M. Blei. The χ-divergence for approximate inference. arXiv preprint arXiv:1611.00328, 2016.
[12] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 1987.
[13] D. J. Earl and M. W. Deem. Parallel tempering: theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 2005.
[14] A. E. Gelfand and A. F. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 1990.
[15] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models, chapter 12: Multilevel Linear Models: the Basics. Cambridge University Press, 2006.
[16] A. Gelman and X.-L. Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 1998.
[17] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.
[18] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Computer Science and Statistics, 1991.
[19] C. J. Geyer and E. A. Thompson. Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 1995.
[20] G. Gobbo and B. J. Leimkuhler. Extended Hamiltonian approach to continuous tempering. Physical Review E, 2015.
[21] R. Grosse, Z. Ghahramani, and R. P. Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015.
[22] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970.
[23] M. D. Hoffman and A. Gelman. The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 2014.
[24] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[25] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.
[26] A. Laio and M. Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 2002.
[27] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[29] B. Leimkuhler and S. Reich. Simulating Hamiltonian Dynamics. Cambridge University Press, 2004.
[30] Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, 2016.
[31] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhysics Letters, 1992.
[32] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 1953.
[33] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001.
[34] R. M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 1996.
[35] R. M. Neal. Annealed importance sampling. Statistics and Computing, 2001.
[36] R. M. Neal. Slice sampling. Annals of Statistics, 2003.
[37] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2011.
[38] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[39] C. Robert and G. Casella. A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, 2011.
[40] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[41] T. Salimans, D. P. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
[42] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2016.
[43] R. H. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 1986.
[44] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, 2016.
[45] Y. Wu, Y. Burda, R. Salakhutdinov, and R. Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.
[46] Y. Zhang, Z. Ghahramani, A. J. Storkey, and C. A. Sutton. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, 2012.
[47] O. Zobay. Mean field inference for the Dirichlet process mixture model. Electronic Journal of Statistics, 2009.
A Bounding the inverse temperature marginal density
We have a joint density on $(\mathbf{x}, \upbeta)$

\[
\mathbb{p}[\mathbf{x} = \boldsymbol{x}, \upbeta = \beta] = \frac{1}{C} \exp\left[-\beta \phi(\boldsymbol{x}) - \beta \log \zeta - (1 - \beta)\psi(\boldsymbol{x})\right]. \tag{26}
\]

The resulting marginal density on $\upbeta$ is

\[
\mathbb{p}[\upbeta = \beta] = \int \mathbb{p}[\mathbf{x} = \boldsymbol{x}, \upbeta = \beta] \,\mathrm{d}\boldsymbol{x} = \frac{1}{C\zeta^{\beta}} \int \exp\left[-\beta\phi(\boldsymbol{x}) - (1-\beta)\psi(\boldsymbol{x})\right] \mathrm{d}\boldsymbol{x}. \tag{27}
\]

To derive an upper bound on $\mathbb{p}[\upbeta = \beta]$ we use Hölder's inequality

\[
\int g(\boldsymbol{x})\, h(\boldsymbol{x}) \,\mathrm{d}\boldsymbol{x} \leq \left(\int |g(\boldsymbol{x})|^{1/a} \,\mathrm{d}\boldsymbol{x}\right)^{a} \left(\int |h(\boldsymbol{x})|^{1/(1-a)} \,\mathrm{d}\boldsymbol{x}\right)^{1-a}, \tag{28}
\]

where $a \in [0, 1]$ and $g$ and $h$ are measurable functions. We also use the definitions

\[
\int \exp[-\phi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x} = Z \quad\text{and}\quad \int \exp[-\psi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x} = 1. \tag{29}
\]

From (27) we have that

\[
\mathbb{p}[\upbeta = \beta] = \frac{1}{C\zeta^{\beta}} \int \left(\exp[-\phi(\boldsymbol{x})]^{\beta}\right)\left(\exp[-\psi(\boldsymbol{x})]^{1-\beta}\right) \mathrm{d}\boldsymbol{x}. \tag{30}
\]

Applying Hölder's inequality (28) with $g(\boldsymbol{x}) = \exp[-\phi(\boldsymbol{x})]^{\beta}$, $h(\boldsymbol{x}) = \exp[-\psi(\boldsymbol{x})]^{1-\beta}$ and $a = \beta$,

\[
\mathbb{p}[\upbeta = \beta] \leq \frac{1}{C\zeta^{\beta}} \left(\int \left|\exp[-\phi(\boldsymbol{x})]^{\beta}\right|^{1/\beta} \mathrm{d}\boldsymbol{x}\right)^{\beta} \left(\int \left|\exp[-\psi(\boldsymbol{x})]^{1-\beta}\right|^{1/(1-\beta)} \mathrm{d}\boldsymbol{x}\right)^{1-\beta} \tag{31}
\]
\[
= \frac{1}{C\zeta^{\beta}} \left(\int \exp[-\phi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x}\right)^{\beta} \left(\int \exp[-\psi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x}\right)^{1-\beta}. \tag{32}
\]

Substituting the definitions in (29) gives

\[
\mathbb{p}[\upbeta = \beta] \leq \frac{1}{C} \left(\frac{Z}{\zeta}\right)^{\beta}. \tag{33}
\]

To derive a lower bound on $\mathbb{p}[\upbeta = \beta]$, we use Jensen's inequality

\[
\varphi\left(\int g(\boldsymbol{x})\, q(\boldsymbol{x}) \,\mathrm{d}\boldsymbol{x}\right) \geq \int \varphi(g(\boldsymbol{x}))\, q(\boldsymbol{x}) \,\mathrm{d}\boldsymbol{x}, \tag{34}
\]

for a concave function $\varphi$, normalised density $q$ with $\int q(\boldsymbol{x}) \,\mathrm{d}\boldsymbol{x} = 1$, and measurable $g$. The logarithm of (27) gives

\[
\log \mathbb{p}[\upbeta = \beta] + \beta \log \zeta + \log C = \log\left(\int \exp(-\beta[\phi(\boldsymbol{x}) - \psi(\boldsymbol{x})]) \exp[-\psi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x}\right). \tag{35}
\]

Applying Jensen's inequality (34) with $\varphi = \log$, $q = \exp(-\psi)$ and $g = \exp[-\beta(\phi - \psi)]$,

\[
\log \mathbb{p}[\upbeta = \beta] + \log C + \beta \log \zeta \geq \beta \int [\psi(\boldsymbol{x}) - \phi(\boldsymbol{x})] \exp[-\psi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x} \tag{36}
\]
\[
= \beta \int \left(\log Z - \log Z - \log \exp[-\psi(\boldsymbol{x}) + \phi(\boldsymbol{x})]\right) \exp[-\psi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x} \tag{37}
\]
\[
= \beta \log Z - \beta \int \exp[-\psi(\boldsymbol{x})] \log\left(\frac{\exp[-\psi(\boldsymbol{x})]}{\exp[-\phi(\boldsymbol{x})] / Z}\right) \mathrm{d}\boldsymbol{x}. \tag{38}
\]

Recognising the integral in the last line as the Kullback–Leibler (KL) divergence $d_{b \to t}$ from the base density $\exp[-\psi(\boldsymbol{x})]$ to the target density $\exp[-\phi(\boldsymbol{x})] / Z$,

\[
d_{b \to t} = \int \exp[-\psi(\boldsymbol{x})] \log\left(\frac{\exp[-\psi(\boldsymbol{x})]}{\exp[-\phi(\boldsymbol{x})] / Z}\right) \mathrm{d}\boldsymbol{x}, \tag{39}
\]

and taking the exponential of both sides and rearranging, we have

\[
\mathbb{p}[\upbeta = \beta] \geq \frac{1}{C} \left(\frac{Z}{\zeta}\right)^{\beta} \exp\left(-\beta\, d_{b \to t}\right). \tag{40}
\]

By instead noting that (27) can be rearranged into the form

\[
\log \mathbb{p}[\upbeta = \beta] + \log C + \beta \log \zeta - \log Z = \log\left(\int \exp(-(1-\beta)[\psi(\boldsymbol{x}) - \phi(\boldsymbol{x})]) \frac{1}{Z} \exp[-\phi(\boldsymbol{x})] \,\mathrm{d}\boldsymbol{x}\right), \tag{41}
\]

by an equivalent series of steps we can also derive a bound using the reversed form of the KL divergence

\[
d_{t \to b} = \int \frac{1}{Z} \exp[-\phi(\boldsymbol{x})] \log\left(\frac{\exp[-\phi(\boldsymbol{x})] / Z}{\exp[-\psi(\boldsymbol{x})]}\right) \mathrm{d}\boldsymbol{x} \tag{42}
\]

from the target to the base distribution, giving

\[
\mathbb{p}[\upbeta = \beta] \geq \frac{1}{C} \left(\frac{Z}{\zeta}\right)^{\beta} \exp\left[-(1 - \beta)\, d_{t \to b}\right]. \tag{43}
\]

B Gaussian mixture Boltzmann machine relaxation
We define a Boltzmann machine distribution on a signed binary state $\mathbf{s} \in \{-1, +1\}^{D_B}$ as

\[
\mathbb{P}[\mathbf{s} = \boldsymbol{s}] = \frac{1}{Z_B} \exp\left(\tfrac{1}{2} \boldsymbol{s}^\mathsf{T} \boldsymbol{W} \boldsymbol{s} + \boldsymbol{s}^\mathsf{T} \boldsymbol{b}\right), \qquad Z_B = \sum_{\boldsymbol{s} \in \{-1,+1\}^{D_B}} \exp\left(\tfrac{1}{2} \boldsymbol{s}^\mathsf{T} \boldsymbol{W} \boldsymbol{s} + \boldsymbol{s}^\mathsf{T} \boldsymbol{b}\right). \tag{44}
\]

We introduce an auxiliary real-valued vector random variable $\mathbf{x} \in \mathbb{R}^{D}$ with a Gaussian conditional distribution

\[
\mathbb{p}[\mathbf{x} = \boldsymbol{x} \,|\, \mathbf{s} = \boldsymbol{s}] = \frac{1}{(2\pi)^{D/2}} \exp\left[-\tfrac{1}{2}\left(\boldsymbol{x} - \boldsymbol{Q}^\mathsf{T}\boldsymbol{s}\right)^\mathsf{T}\left(\boldsymbol{x} - \boldsymbol{Q}^\mathsf{T}\boldsymbol{s}\right)\right], \tag{45}
\]

with $\boldsymbol{Q}$ a $D_B \times D$ matrix such that $\boldsymbol{Q}\boldsymbol{Q}^\mathsf{T} = \boldsymbol{W} + \boldsymbol{D}$ for some diagonal $\boldsymbol{D}$ which makes $\boldsymbol{W} + \boldsymbol{D}$ positive semi-definite. In our experiments, based on the observation in [46] that minimising the maximum eigenvalue of $\boldsymbol{W} + \boldsymbol{D}$ decreases the maximal separation between the Gaussian components in the relaxation, we set $\boldsymbol{D}$ as the solution to the semi-definite programme

\[
\min_{\boldsymbol{D}} \left[\lambda_{\text{max}}(\boldsymbol{W} + \boldsymbol{D})\right] \;:\; \boldsymbol{W} + \boldsymbol{D} \succeq 0, \tag{46}
\]

where $\lambda_{\text{max}}$ denotes the maximal eigenvalue. In general the optimised $\boldsymbol{W} + \boldsymbol{D}$ lies on the semi-definite cone and so has rank less than $D_B$, hence a $\boldsymbol{Q}$ can be found such that $D < D_B$. The resulting joint distribution on $(\mathbf{x}, \mathbf{s})$ is

\[
\mathbb{p}[\mathbf{x} = \boldsymbol{x}, \mathbf{s} = \boldsymbol{s}] = \frac{1}{(2\pi)^{D/2} Z_B} \exp\left[-\tfrac{1}{2}\boldsymbol{x}^\mathsf{T}\boldsymbol{x} + \boldsymbol{s}^\mathsf{T}\boldsymbol{Q}\boldsymbol{x} - \tfrac{1}{2}\boldsymbol{s}^\mathsf{T}\boldsymbol{Q}\boldsymbol{Q}^\mathsf{T}\boldsymbol{s} + \tfrac{1}{2}\boldsymbol{s}^\mathsf{T}\boldsymbol{W}\boldsymbol{s} + \boldsymbol{s}^\mathsf{T}\boldsymbol{b}\right] \tag{47}
\]
\[
= \frac{1}{(2\pi)^{D/2} Z_B} \exp\left[-\tfrac{1}{2}\boldsymbol{x}^\mathsf{T}\boldsymbol{x} + \boldsymbol{s}^\mathsf{T}(\boldsymbol{Q}\boldsymbol{x} + \boldsymbol{b}) - \tfrac{1}{2}\boldsymbol{s}^\mathsf{T}\boldsymbol{D}\boldsymbol{s}\right] \tag{48}
\]
\[
= \frac{1}{(2\pi)^{D/2} Z_B} \exp\left(-\tfrac{1}{2}\operatorname{Tr}(\boldsymbol{D})\right) \exp\left[-\tfrac{1}{2}\boldsymbol{x}^\mathsf{T}\boldsymbol{x}\right] \prod_{i=1}^{D_B} \exp\left[s_i\left(\boldsymbol{q}_i^\mathsf{T}\boldsymbol{x} + b_i\right)\right], \tag{49}
\]

where $\{\boldsymbol{q}_i^\mathsf{T}\}_{i=1}^{D_B}$ are the $D_B$ rows of $\boldsymbol{Q}$. We can marginalise over the binary state $\mathbf{s}$ as each $\mathsf{s}_i$ is conditionally independent of all the others given $\mathbf{x}$ in the joint distribution. This gives the Boltzmann machine relaxation density on $\mathbf{x}$

\[
\mathbb{p}[\mathbf{x} = \boldsymbol{x}] = \frac{2^{D_B}}{(2\pi)^{D/2} Z_B} \exp\left(-\tfrac{1}{2}\operatorname{Tr}(\boldsymbol{D})\right) \exp\left(-\tfrac{1}{2}\boldsymbol{x}^\mathsf{T}\boldsymbol{x}\right) \prod_{i=1}^{D_B} \left[\cosh\left(\boldsymbol{q}_i^\mathsf{T}\boldsymbol{x} + b_i\right)\right], \tag{50}
\]

which is a specially structured Gaussian mixture density with $2^{D_B}$ components.
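As a numerical sanity check of the relaxation density, consider a hypothetical two-unit machine with weight $w$: taking $\boldsymbol{D} = w\boldsymbol{I}$ makes $\boldsymbol{W} + \boldsymbol{D}$ rank one, so $D = 1$ and the density can be integrated by quadrature and should integrate to one. The parameter values below are arbitrary.

```python
import numpy as np

# Sanity check that the relaxation density (50) is normalised, for a
# hypothetical 2-unit Boltzmann machine. With W = [[0, w], [w, 0]], the choice
# D = w I gives W + D = w 1 1^T of rank one, so Q can be D_B x 1 (i.e. D = 1)
# and the relaxation density is univariate.
D_B, w = 2, 0.4
b = np.array([0.3, -0.2])
W = np.array([[0.0, w], [w, 0.0]])
Dmat = w * np.eye(D_B)
Q = np.sqrt(w) * np.ones((D_B, 1))
assert np.allclose(Q @ Q.T, W + Dmat)

# exact Z_B by enumerating the 2^{D_B} signed binary states
states = [np.array([s0, s1]) for s0 in (-1, 1) for s1 in (-1, 1)]
Z_B = sum(np.exp(0.5 * s @ W @ s + s @ b) for s in states)

def density(x):
    # the relaxation density for scalar x (D = 1)
    prefactor = (2 ** D_B * np.exp(-0.5 * np.trace(Dmat))
                 / (np.sqrt(2 * np.pi) * Z_B))
    coshs = np.cosh(Q[0, 0] * x + b[0]) * np.cosh(Q[1, 0] * x + b[1])
    return prefactor * np.exp(-0.5 * x ** 2) * coshs

x = np.linspace(-10.0, 10.0, 20001)
dens = density(x)
dx = x[1] - x[0]
print(dx * (dens.sum() - 0.5 * (dens[0] + dens[-1])))  # trapezoid rule, ~1
```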
If we define $\mathbb{p}[\mathbf{x} = \boldsymbol{x}] = \frac{1}{Z} \exp[-\phi(\boldsymbol{x})]$ with

\[
\phi(\boldsymbol{x}) = \tfrac{1}{2}\boldsymbol{x}^\mathsf{T}\boldsymbol{x} - \sum_{i=1}^{D_B} \left[\log\cosh\left(\boldsymbol{q}_i^\mathsf{T}\boldsymbol{x} + b_i\right)\right], \tag{51}
\]

then the normalisation constant $Z$ of the relaxation density can be related to the normalising constant of the corresponding Boltzmann machine distribution by

\[
\log Z = \log Z_B + \tfrac{1}{2}\operatorname{Tr}(\boldsymbol{D}) + \tfrac{D}{2}\log(2\pi) - D_B \log 2. \tag{52}
\]

It can also be shown that the first and second moments of the relaxation distribution are related to the first and second moments of the corresponding Boltzmann machine distribution by

\[
\mathbb{E}[\mathbf{x}] = \int \boldsymbol{x} \sum_{\boldsymbol{s}} \left(\mathbb{p}[\mathbf{x} = \boldsymbol{x} \,|\, \mathbf{s} = \boldsymbol{s}]\, \mathbb{P}[\mathbf{s} = \boldsymbol{s}]\right) \mathrm{d}\boldsymbol{x} = \sum_{\boldsymbol{s}} \left(\int \boldsymbol{x}\, \mathcal{N}\!\left(\boldsymbol{x};\, \boldsymbol{Q}^\mathsf{T}\boldsymbol{s}, \boldsymbol{I}\right) \mathrm{d}\boldsymbol{x}\; \mathbb{P}[\mathbf{s} = \boldsymbol{s}]\right) = \mathbb{E}\left[\boldsymbol{Q}^\mathsf{T}\mathbf{s}\right] = \boldsymbol{Q}^\mathsf{T}\mathbb{E}[\mathbf{s}] \tag{53}
\]

and

\[
\mathbb{E}\left[\mathbf{x}\mathbf{x}^\mathsf{T}\right] = \sum_{\boldsymbol{s}} \left(\int \boldsymbol{x}\boldsymbol{x}^\mathsf{T} \mathcal{N}\!\left(\boldsymbol{x};\, \boldsymbol{Q}^\mathsf{T}\boldsymbol{s}, \boldsymbol{I}\right) \mathrm{d}\boldsymbol{x}\; \mathbb{P}[\mathbf{s} = \boldsymbol{s}]\right) = \mathbb{E}\left[\boldsymbol{Q}^\mathsf{T}\mathbf{s}\mathbf{s}^\mathsf{T}\boldsymbol{Q} + \boldsymbol{I}\right] = \boldsymbol{Q}^\mathsf{T}\mathbb{E}\left[\mathbf{s}\mathbf{s}^\mathsf{T}\right]\boldsymbol{Q} + \boldsymbol{I}. \tag{54}
\]

The weight parameters $\boldsymbol{W}$ of the Boltzmann machine distributions used in the experiments in Section 7.2 were generated using an eigendecomposition-based method. A uniformly distributed (with respect to the Haar measure) random orthogonal matrix $\boldsymbol{R}$ was sampled. A vector of eigenvalues $\boldsymbol{e}$ was generated by sampling independent zero-mean unit-variance normal variates $n_i \sim \mathcal{N}(\,\cdot\,; 0, 1)$ $\forall i \in \{1, \dots, D_B\}$ and then setting $e_i = s_1 \tanh(s_2 n_i)$ $\forall i \in \{1, \dots, D_B\}$, with $s_1 = 6$ and $s_2 = 2$ in the experiments. This generates eigenvalues concentrated near $\pm s_1$, which was empirically observed to lead to systems that tended to be highly multimodal. A symmetric matrix $\boldsymbol{V} = \boldsymbol{R}\operatorname{diag}(\boldsymbol{e})\boldsymbol{R}^\mathsf{T}$ was then computed and the weights $\boldsymbol{W}$ set such that $W_{i,j} = V_{i,j}$ $\forall i \neq j$ and $W_{i,i} = 0$ $\forall i \in \{1, \dots, D_B\}$. The biases $\boldsymbol{b}$ were generated as independent zero-mean normal variates. An example of a two-dimensional projection of independent samples from a Boltzmann machine relaxation density with $D = 27$ ($D_B = 28$), and $\boldsymbol{W}$ and $\boldsymbol{b}$ generated as just described, is shown in Figure 5. As can be seen, even when projected down to two dimensions the resulting density shows multiple separated modes.
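The weight generation procedure just described can be sketched as follows; the standard QR-with-sign-correction recipe is used here to sample the Haar-distributed orthogonal matrix, with $s_1 = 6$ and $s_2 = 2$ as stated.

```python
import numpy as np

# Sketch of the eigendecomposition-based weight generation described above,
# with s1 = 6, s2 = 2 and D_B = 30 as in the experiments.
rng = np.random.RandomState(0)
D_B, s1, s2 = 30, 6.0, 2.0

# Haar-distributed random orthogonal matrix via QR decomposition of a Gaussian
# matrix, with columns sign-corrected so the distribution is uniform with
# respect to the Haar measure
A = rng.standard_normal((D_B, D_B))
Rq, Rr = np.linalg.qr(A)
R = Rq * np.sign(np.diag(Rr))

# eigenvalues concentrated near +/- s1, giving highly multimodal relaxations
e = s1 * np.tanh(s2 * rng.standard_normal(D_B))
V = R @ np.diag(e) @ R.T

W = V - np.diag(np.diag(V))  # W_ij = V_ij for i != j, zero diagonal
assert np.allclose(W, W.T) and np.allclose(np.diag(W), 0.0)
```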