Adaptive Path Sampling in Metastable Posterior Distributions
Yuling Yao*† Collin Cademartori*† Aki Vehtari‡ Andrew Gelman§

Abstract
The normalizing constant plays an important role in Bayesian computation, and there is a large literature on methods for computing or approximating normalizing constants that cannot be evaluated in closed form. When the normalizing constant varies by orders of magnitude, methods based on importance sampling can require many rounds of tuning. We present an improved approach using adaptive path sampling, iteratively reducing gaps between the base and target. Using this adaptive strategy, we develop two metastable sampling schemes. They are automated in Stan and require little tuning. For a multimodal posterior density, we equip simulated tempering with a continuous temperature. For a funnel-shaped entropic barrier, we adaptively increase mass in bottleneck regions to form an implicit divide-and-conquer. Both approaches empirically perform better than existing methods for sampling from metastable distributions, including higher accuracy and computation efficiency.
Keywords: importance sampling, Markov chain Monte Carlo, normalizing constant, path sampling, posterior metastability, simulated tempering.
1. The normalizing constant and posterior metastability
In Bayesian computation, the posterior distribution is often available as an unnormalized density q(θ). The unknown and often analytically intractable integral ∫ q(θ) dθ is called the normalizing constant of q. Many statistical problems involve estimating the normalizing constant, or ratios of normalizing constants among several densities. For example, the marginal likelihood of a statistical model with likelihood p(y | θ) and prior p(θ) is the normalizing constant of p(y | θ)p(θ): p(y) = ∫ p(θ, y) dθ. The Bayes factor of two models p_1(y | θ_1), p_1(θ_1) and p_2(y | θ_2), p_2(θ_2) requires the ratio of the normalizing constants of the densities p_1(y, θ_1) and p_2(y, θ_2).

We are also often interested in the normalizing constant as a function of parameters. In a posterior density p(θ_1, θ_2, ..., θ_d | y), the marginal density of coordinate θ_1 is proportional to ∫···∫ p(θ_1, θ_2, ..., θ_d | y) dθ_2 ··· dθ_d, the normalizing constant of the posterior density with respect to all remaining parameters. Accurate normalizing constant estimation implies that we have well explored the region containing most of the posterior mass, and hence that we can also accurately estimate posterior expectations of many other functionals.

In simulated tempering and annealing, we augment the distribution q(θ) with an inverse temperature λ and sample from p(θ, λ) ∝ q(θ)^λ. Successful tempering requires fully exploring the space of λ, which in turn requires evaluating the normalizing constant as a function of λ: z(λ) = ∫ q(θ)^λ dθ. Similar tasks arise in model selection and averaging over a series of statistical models indexed by a continuous tuning parameter λ: p(θ, y | λ).
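As a toy illustration of this tempered z(λ) (our own sketch, not from the paper): for a standard normal kernel q(θ) = exp(−θ²/2), the normalizing constant has the closed form z(λ) = ∫ exp(−λθ²/2) dθ = sqrt(2π/λ), which a one-dimensional quadrature reproduces:

```python
import numpy as np

def z_quadrature(lam):
    # z(lam) = \int q(theta)^lam d(theta) for q(theta) = exp(-theta^2/2),
    # computed by the trapezoidal rule on a wide grid.
    grid = np.linspace(-30.0, 30.0, 20001)
    y = np.exp(-lam * grid**2 / 2)
    dx = grid[1] - grid[0]
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

for lam in (0.25, 0.5, 1.0):
    print(lam, z_quadrature(lam), np.sqrt(2 * np.pi / lam))
```

In one dimension such quadrature is essentially exact; the difficulty addressed in this paper is that θ is multidimensional, so z(λ) must instead be estimated from simulation draws.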
In cross validation, we attach to each data point y_i a λ_i and augment the model p(y_i | θ)p(θ) to q(λ, y, θ) = Π_i p(y_i | θ)^{λ_i} p(θ), such that the pointwise leave-one-out log predictive density log p(y_i | y_{−i}) becomes the log normalizing constant log ∫ p(y_i | θ) p(θ | y, λ_i = 0, λ_j = 1, ∀ j ≠ i) dθ.

*joint first author. †Department of Statistics, Columbia University. ‡Helsinki Institute of Information Technology; Aalto University, Department of Computer Science. §Department of Statistics and Political Science, Columbia University.

In all of these problems we are given an unnormalized density q(θ, λ), where θ ∈ Θ is a multidimensional sampling parameter and λ ∈ Λ is a free parameter, and we need to evaluate the integral at any λ:

z : Λ → R, z(λ) = ∫_Θ q(θ, λ) dθ, ∀ λ ∈ Λ. (1)

z(·) is a function of λ. For convenience, throughout the paper we will call z(·) the normalizing constant without emphasizing that it is a function. We also use the notations q and p to distinguish unnormalized and normalized densities.

In most applications, it is enough to compute z(λ) up to a multiplicative factor that is free of λ, or equivalently the ratio of this integral at any λ with respect to a fixed reference point λ_0:

z̃(λ) = z(λ)/z(λ_0), ∀ λ ∈ Λ. (2)

Two accessible but conceptually orthogonal approaches stand out for the computation of (1) and (2). Viewing (1) as an expectation with respect to the conditional density θ | λ ∝ q(θ, λ), we can numerically integrate (1) using quadrature, where the simplest scheme is linear interpolation, and the log ratio in (2) can be computed from a first-order Taylor series expansion,

log ( z(λ)/z(λ_0) ) ≈ (λ − λ_0) d/dλ log z(λ)|_{λ=λ_0} = (λ − λ_0) (1/z(λ_0)) ∫_Θ ( ∂/∂λ q(θ, λ)|_{λ=λ_0} ) dθ. (3)

In contrast, we can sample from the conditional density θ | λ_0 ∝ q(θ, λ_0), and compute (1) by importance sampling,

z(λ)/z(λ_0) ≈ (1/S) Σ_{s=1}^S q(θ_s, λ)/q(θ_s, λ_0), θ_{s=1,...,S} ∼ q(θ, λ_0). (4)

How should we choose between estimates (3) and (4)? There is no definite answer. For example, in variational Bayes with model parameter θ and variational parameter λ, the gradient part of (3) is the score function estimator based on the log-derivative trick, while the gradient of (4) under a location-scale family in q(θ_s | λ) is the Monte Carlo gradient estimator using the reparametrization trick; but it is in general unknown which is better.

On the other hand, both (3) and (4) impose severe scalability limitations on the dimension of λ and θ. Essentially they extrapolate either from the density conditional on λ_0 or from simulation draws from q(θ | λ_0) to make inferences about the density conditional on λ, hence depending crucially on how close the two conditional distributions θ | λ and θ | λ_0 are. Even under a normal approximation, the Kullback-Leibler (KL) divergence between these densities scales linearly with the dimension of θ. Thus it takes at least O(exp(dim(θ))) posterior draws to make (4) practically accurate. The reliability of (3) is further contingent on how flat the Hessian of z(λ) is, which is even more intractable to estimate. In practice, these methods can fail silently, especially when implemented as a black-box step in a larger algorithm without diagnostics.

From the Bayesian computation perspective, the log normalizing constant is interpreted as the analog of the "free energy" in physics (up to a multiplicative constant), in line with the interpretation of −log q(θ) as the "potential energy" in Hamiltonian Monte Carlo sampling.
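To make the contrast concrete, here is a small numerical comparison of (3) and (4) on a hypothetical one-dimensional Gaussian family q(θ, λ) = exp(−λθ²/2), where z(λ) = sqrt(2π/λ) is known exactly (our own sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Family q(theta, lam) = exp(-lam*theta^2/2): exact z(lam1)/z(lam0) = sqrt(lam0/lam1).
lam0, lam1, S = 1.0, 1.5, 100_000
theta = rng.normal(0.0, 1.0 / np.sqrt(lam0), S)   # draws from theta | lam0

# Estimate (4): importance sampling with draws from theta | lam0.
is_est = np.mean(np.exp(-(lam1 - lam0) * theta**2 / 2))

# Estimate (3): first-order Taylor expansion of log z around lam0, using the
# Monte Carlo average of d/dlam log q(theta, lam) = -theta^2/2.
taylor_est = np.exp((lam1 - lam0) * np.mean(-theta**2 / 2))

print(is_est, taylor_est, np.sqrt(lam0 / lam1))
```

Here (4) is unbiased but its variance grows quickly with the gap λ_1 − λ_0 (and with dim(θ)), while (3) carries an O((λ_1 − λ_0)²) linearization bias. Path sampling, developed in Section 2, removes the single-point extrapolation by integrating many small gaps.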
Figure 1:
An illustration of metastability in a bivariate distribution p(θ_1, θ_2). (A) With a mixture of two Gaussian distributions, the energetic barrier prevents rapid mixing between modes. (B) With a lower inverse temperature λ, the energetic barrier becomes flatter in the conditional distribution p(θ | λ). (C) With an entropic barrier, the left and right parts of the distribution are only connected through a narrow tunnel, where the Markov chain will behave like a random walk. (D) Adding more density on the neck increases the transition probability, while leaving the conditional distribution p(θ_2 | θ_1) invariant.

A potential energy is called metastable if the corresponding probability measure has some regions of high probability that are separated by low-probability connections. The following are two types of metastability, both of which cause Markov chain Monte Carlo algorithms difficulty in moving between regions. This pathology is often identifiable by a large R̂ between chains and a low effective sample size in generic sampling (Vehtari et al., 2020).

1. In an energetic barrier, the density is isolated in multiple modes, and the transition probability is low between modes. In particular, the Hamiltonian (the sum of the potential and kinetic energy) is preserved in every Hamiltonian Monte Carlo trajectory. Hence, a transition across modes is unlikely unless the kinetic energy is exceptionally large.

2. In an entropic barrier, or funnel, the typical set is connected only through narrow and possibly twisted ridges. This barrier is amplified when the dimension of θ increases, in that a random walk in high-dimensional space can hardly find the correct direction.

As the name suggests, we have the freedom to adaptively tune the free energy of the sampling distribution to remove the metastability therein. The basic strategy is to augment the original metastable density q(θ) with an auxiliary variable λ, obtaining some density on the extended space q(θ, λ).
For energetic barriers, we can take λ to be an inverse temperature variable, a power transformation of the posterior density. The energetic barrier is flattened by a lower inverse temperature. Temperature-based approaches cannot eliminate entropic barriers in the same way, but the transition can be boosted by adding probability mass to the neck region. Figure 1 gives a graphical illustration.

While the normalizing constant is not itself statistically meaningful in sampling from the augmented density q(θ, λ), computing it serves as an essential intermediate step for constructing the otherwise intractable joint density. Unlike in usual Monte Carlo methods where the target distribution is given and static, here, as in an augmented system, we have the flexibility to either sample θ from some conditional distribution θ | λ at a discrete sequence of λ, or from some joint distribution of (θ, λ), treating λ continuously. While continuous joint sampling has a finer-grained expressiveness for approximating the normalizing constant, it is harder to access the conditional distribution, which is ultimately what we need when λ is an augmented parameter.

To facilitate normalizing constant estimation and sampling in metastable distributions, the full problem contains three tasks:

1. Determining what distribution we should sample θ and λ from.

2. Estimating the normalizing constants efficiently with the generated simulation draws.

3. Diagnosing the reliability of the sampling and estimation, particularly distinguishing between an informative extrapolation and a noisy random guess, and deciding when and where to adaptively resample.

In Section 2, we introduce a practical solution to all three problems. It extends the idea of path sampling (Gelman and Meng, 1998) to an adaptive design, which performs both continuously-ranged normalizing constant estimation and direct sampling of the conditional density.
Applying this strategy to metastable sampling, we demonstrate in Sections 2.2 and 2.3 that our proposed adaptive path sampling method enables efficient sampling across both energetic and entropic bottlenecks, and as a byproduct provides normalizing constant estimation and convergence diagnostics. In Section 3, we compare the proposed adaptive sampling to other approaches and show that it is the infinitely dense limit of the basic strategies (3) and (4). We experimentally illustrate the advantage of the proposed methods in Section 4. For the purpose of log normalizing constant estimation and continuous tempering, we have automated our method in the general-purpose software Stan (Stan Development Team, 2020), and illustrate the practical implementation in the Appendix.
2. Proposed method
2.1. Adaptive path sampling

To begin with, we outline an adaptive path sampling algorithm for the general problem of normalizing constant (ratio) estimation (2) in a λ-augmented system q(θ, λ), θ ∈ Θ, λ ∈ Λ. We further elaborate its application in the context of metastable sampling in Sections 2.2 and 2.3.

Instead of sampling λ directly, we consider a transformed sampling parameter a through λ = f(a), where the link function f : A → Λ is continuously differentiable and A is the support of a. For simplicity, we will use A = [0, 1] in this section. The actual sampling takes place in the A × Θ space. If there is an interval I ⊂ A which f maps to a fixed value λ_I, we can directly obtain conditional draws from θ | λ_I by {θ_i : a_i ∈ I}, while not suffering from discretization errors of the normalizing constant z(λ) = ∫_Θ q(θ, λ) dθ. We denote the conditional density π_λ := p(θ | λ) = q(θ, λ)/z(λ).

The general algorithm then iterates the following four steps.

Step 1. Joint sampling with invariant conditional densities. To start, we sample S joint simulation draws (θ_i, a_i), i = 1, ..., S, from the joint density

p(θ, a) ∝ c^{−1}(λ) q(θ, λ), λ = f(a), (5)

where c(λ) is a parametric pseudo-prior, constructed from a series of kernels {γ_j(λ)}_{j=1}^J and regression coefficients {β_j^c}_{j=0}^J (which will be updated adaptively throughout the algorithm),

log c(λ) = β_0^c λ + Σ_{j=1}^J β_j^c γ_j(λ). (6)

By default, we initialize at a constant function β^c = 0, i.e., c(λ) ≡ 1. No matter what the pseudo-prior c(λ) is, the conditional distributions θ | a ∝ q(θ, λ = f(a)) in the joint simulation draws are invariant. This motivates adaptively changing the pseudo-prior c(λ).

For the joint sampling task (5), we will typically use dynamic Hamiltonian Monte Carlo (HMC) (Hoffman and Gelman, 2014; Betancourt, 2017) in Stan, which only requires the unnormalized log density log q(θ, λ) − log c(λ) as input.

Step 2. Estimating the log normalizing constant from joint draws.
Thermodynamic integration (Gelman and Meng, 1998, see also Appendix A.1) is based on the identity

d/da log z(f(a)) = E_{θ | f(a)} ( ∂/∂a log q(θ, f(a)) ), (7)

where the expectation is over the invariant conditional distribution θ | a ∝ q(θ, f(a)).

The ratio of normalizing constants can be computed by integrating both sides of (7). To do this, we rank all the sampled draws according to their a coordinate, a_(1) < a_(2) < ··· < a_(s*), and compute the pointwise gradients

U(i) = ( Σ_{j: a_j = a_(i)} ∂/∂a log q(θ, f(a)) |_{θ_j, a_j} ) / #{j : a_j = a_(i)}. (8)

When there is no tie, the gradient estimate (8) essentially approximates the intractable pointwise integral in (7), E_{θ | f(a_s)} ( ∂/∂a log q(θ, f(a)) ), by one Monte Carlo draw ∂/∂a log q(θ, f(a)) |_{θ_s, a_s}, a common technique in stochastic approximation.

The integral of the right-hand side of (7) is then computed from these expectation estimates and the trapezoidal rule. For any a* ∈ A, we find its covering interval a* ∈ [a_(i*), a_(i*+1)), and compute its normalizing constant with reference to z(f(0)) by

log ( z(f(a*)) / z(f(0)) ) = ∫_0^{a*} d/da log z(f(a)) da = ∫_0^{a*} E_{θ | f(a)} ( ∂/∂a log q(θ, f(a)) ) da
≈ (1/2)(a_(1) − 0)(U(1) + U(0)) + (1/2) Σ_{j=1}^{i*−1} (a_(j+1) − a_(j))(U(j+1) + U(j)) + (1/2)(a* − a_(i*))(U(i*) + U_{a*}), (9)

where U_{a*} and U(0) are obtained by extrapolating U.

Step 3. Parametric regularization and adaptive updates.
When the normalizing constant z(λ) is only required up to a multiplicative factor, we can assume z(f(0)) = 1. In Section 2.3, we show how to remove the fixed reference by additional self-normalization when the exact normalizing constant is needed.

Equation (9) yields an unbiased estimate of log z(·). However, due to the stochastic approximation, (9) has nonignorable variance in regions where not enough a_i are sampled. For smoothness and regularization, we approximate log z(·) in some parametric family according to the L2 distance criterion, min ∫ ( log z(λ) − ( β_0 λ + Σ_{j=1}^J β_j γ_j(λ) ) )² dλ. We compute this objective function on a uniform grid of length I, {λ_i* = i/I, 1 ≤ i ≤ I}, compute each log z(λ_i*) using estimate (9), and solve the least squares regression

β^z = arg min_β Σ_{i=1}^I ( log z(λ_i*) − β_0 λ_i* − Σ_{j=1}^J β_j γ_j(λ_i*) )². (10)

This parametric estimation serves two goals. First, it provides us with a functional form for the pseudo-prior, which we use in the next step to adaptively modify the sampling distribution. Second, the regression estimate (10) smooths finite-sample noise in (9), which is a bias-variance tradeoff. We update the functional form of the pseudo-prior by β^c := β^z, or equivalently c(·) := z(·).

Step 4. Diagnostics, stopping condition, and mixing.
The marginal distribution of a under the sampling distribution (5) satisfies p(a) ∝ z(λ)/c(λ), λ = f(a). If z(λ) were accurately computed, the one-step adaptation c := z would result in a uniform marginal distribution on a, which is the basis of our diagnostics.

Notably, the sampled marginal density p(a) can itself be estimated as a normalizing constant, p(a) ∝ ∫_Θ q(θ, f(a)) c^{−1}(f(a)) dθ; thus we use a similar estimate as (9), only modifying the gradient U(i) to

U^p(i) = ( Σ_{j: a_j = a_(i)} ∂/∂a ( log q(θ, f(a)) − log c(f(a)) ) |_{θ_j, a_j} ) / #{j : a_j = a_(i)}. (11)

The sample estimates of z(λ) and p(λ) possess finite-sample Monte Carlo error and are prone to over-extrapolation in regions with few a draws. Therefore, we repeat Steps 1–3 until p(a) is "functionally close enough" to a uniform density. However, running until complete uniformity is inefficient in practice and unlikely to be attained in theory, as the actual log normalizing constant z(·) will not fall into the parametric family exactly.

Our adaptation step z → c can be viewed as an importance sampling procedure from the joint proposal c(f(a))^{−1} q(θ, f(a)) to the joint target z(f(a))^{−1} q(θ, f(a)). The importance ratio is r(a) = c(f(a))/z(f(a)) ∝ 1/p(a), which only depends on the marginal of a, and the normalizing constant estimate (9) can be equivalently expressed by the importance sampling estimate z(λ)^{−1} = c(λ)^{−1} r(a) when the marginal p(a) is estimated from path sampling.

To assess the accuracy of the final estimate, we use a Pareto-k̂ diagnostic adapted from Pareto smoothed importance sampling (PSIS, Vehtari et al., 2019). We fit the importance ratios r_i = 1/p(a_i) to a generalized Pareto distribution, estimating its right tail shape parameter k̂.
As already applied in other computation diagnostics (e.g., Yao et al., 2018), k̂ quantifies the Rényi divergence between the sampled density c(a)^{−1} q(θ, a) and the target z(a)^{−1} q(θ, a) that has a uniform marginal on a. When k̂ < 0.7, the normalizing constant z(λ), viewed pointwise as an importance sampling estimate, is ensured to be reliable with a practical number of simulation draws, and we terminate sampling. The k̂ threshold can be chosen smaller to make the decision more conservative.

Otherwise, we perform further sampling with the updated pseudo-prior c. Crucially, the path sampling estimate (9) is always unbiased for the log normalizing constant under any sampling distribution as long as θ | a is left invariant. Thus, we save all previously sampled draws {a_s, θ_s} and mix them with the newly sampled draws in the normalizing constant estimation (9) during each adaptation. This remixing step corresponds to a divide-and-conquer strategy that we will further exploit to sample from a metastable distribution with entropic barriers (Section 2.3).

2.2. Continuous simulated tempering

Given a statistical model, we can evaluate the posterior joint density p(θ, y) = prior × likelihood. In this section, we suppress the dependence on data y and denote the unnormalized posterior distribution from which we want to sample as q(θ) := p(θ, y), θ ∈ Θ. When q(θ) exhibits severe multimodality, Markov chain Monte Carlo (MCMC) algorithms have difficulty moving between modes. The state-of-the-art dynamic Hamiltonian Monte Carlo sampler for a bimodal density has a mixing rate as slow as random-walk Metropolis (Mangoubi et al., 2018), and even optimal tuning and Riemannian metrics do not help.
Although it is possible to evade multimodal sampling using other post-processing and re-weighting strategies (e.g., Yao et al., 2020), we aim here to sample from the exact posterior density.

To ease the energetic barrier between modes, we consider a distribution bridging between the target q(θ) and a base distribution ψ(θ) through a geometrically tempered path:

p(θ | λ) = (1/z(λ)) q(θ)^λ ψ(θ)^{1−λ}, λ ∈ [0, 1], θ ∈ Θ,

where λ is the augmented inverse temperature, z(λ) is the normalizing constant z(λ) = ∫_Θ q(θ)^λ ψ(θ)^{1−λ} dθ, and ψ(θ) is a proper base probability density, typically a simple initial guess or the prior, that is easy to sample from. When ψ is known exactly, z(0) is 1. We will discuss the choice of base distribution later. p(θ | λ) is the target density when λ = 1, and becomes "flattened" for a smaller λ.

Figure 2:
The link function λ = f(a). The flat area between 0.8 and 1.2 allows the continuous sampler to have a region where there are exact draws from the target distribution.

To ensure that the joint sampler can access the base and target distributions with nonzero probabilities, we define a link function λ = f(a) : [0, 2] → [0, 1] that is symmetric, f(a) = f(2 − a), and flat at the two ends: f(a) = 0 when 0 ≤ a ≤ a_min or 2 − a_min ≤ a ≤ 2, and f(a) = 1 when a_max ≤ a ≤ 2 − a_max. This is easy to satisfy using a piecewise polynomial; see Figure 2 for an illustration. In the experiment section we use a_max = 0.8, and the concrete definition, including the value of a_min, is in Appendix A.2.

An a-trajectory from 0 to 2 corresponds to a complete λ tour from 0 to 1 (cooling) and back down to 0 (heating). This formulation allows the sampler to cycle back and forth through the space of λ continuously, while ensuring that some of the simulation draws (those with a between 0.8 and 1.2) are drawn from the exact target distribution with λ = 1.

To run simulated tempering, we apply the adaptive path sampling method (Steps 1–4 in Section 2.1) to the joint distribution,

p(θ, a) ∝ c^{−1}(f(a)) q(θ)^{f(a)} ψ(θ)^{1−f(a)}, a ∈ [0, 2], θ ∈ Θ.

During each adaptation, we sample from this joint density, use all existing draws (including those from previous adaptations) to obtain the path-sampling estimate of the log normalizing constant, log z, parametrically regularize it, and update log c by log ẑ. Because f(·) is designed to be symmetric, z(f(a)) = z(f(2 − a)), we flip all a_s to 2 − a_s for a_s > 1 in the z estimation (9). In addition, the pointwise gradient U in (8) simplifies to

U(i) = ( Σ_{j: a_j = a_(i)} f′(a_(i)) ( log q(θ_j) − log ψ(θ_j) ) ) / #{j : a_j = a_(i)}.

If the base density ψ(θ) is chosen to be the prior in the model, this gradient is simply the product of f′(a_(i)) and the log likelihood.

When the marginal distribution of a is close to uniform, which is monitored by the Pareto-k̂ diagnostic, a path has been constructed from the base to the target. We then collect all draws in the final adaptation with temperature λ = 1, i.e., {θ_i : f(a_i) = 1}. These are the desired draws from the target distribution q(θ). As a byproduct, we obtain the log normalizing constant estimate log z, and z(1) equals the marginal likelihood if ψ is chosen as the prior. The full tempering method is summarized in Algorithm 1.

Algorithm 1:
Continuous tempering with path sampling.
Input: ψ(θ), q(θ): the base and (unnormalized) target density. Output: draws from the target distribution; log z(·): log normalizing constant.
Initialize pseudo-prior log c(·) = 0;
repeat
Sample {a_s, θ_s}, s = 1, ..., S, from the joint density p(θ, a) ∝ c^{−1}(f(a)) q(θ)^{f(a)} ψ(θ)^{1−f(a)};
Flip a_s := 2 − a_s for all a_s > 1;
Estimate log z(·) by path sampling (9), from draws in all adaptations;
Estimate log p(·) by path sampling (9) and gradients (11), from the current adaptation;
Update log c(·) ← log z(·);
until Pareto-k̂, the estimated tail shape parameter of the ratios 1/p(a_s), is smaller than 0.7;
Collect the sample {θ_i : f(a_i) = 1} as posterior draws from the target distribution.

2.3. Implicit divide-and-conquer

The proposed continuous simulated tempering algorithm in Section 2.2 alleviates metastability due to energetic barriers. However, tempering is not effective at overcoming purely entropic barriers. In such cases, instead of augmenting the density with an additional temperature variable, we increase the density in the bottleneck region to encourage transitions between metastable regions.

In many models, it is known that certain marginal distributions are problematic. For example, in hierarchical models, the centered parameterization effectively creates left truncation on the group-level standard deviation τ, as the sampler hardly enters the region τ ≈ 0. We write the joint density as q(θ, τ), where τ is the targeted problematic margin and θ is all remaining parameters. In other cases, these problematic marginals can be identified by various MCMC diagnostics such as trace plots.

A conservative solution is to sample τ first from some "wider" proposal distribution, then sample θ given τ in a Gibbs fashion, and finally adjust for the extra-wide proposal by importance sampling. Here we propose an alternative strategy based on path sampling that does not require the Gibbs step or any closed-form conditional density.
The method is readily available using the joint sampler in Stan. We first sample some simulation draws from the joint posterior distribution of all parameters, q̃(θ, τ) = q(θ, τ), without requiring a complete exploration. The marginal density of τ in the original model, p(τ), is the normalizing constant p(τ) ∝ ∫_Θ q(θ, τ) dθ. Hence, we compute log p(τ) over all sampled τ using the path sampling formula (9), and modify the sampling distribution by adding the bias term

log q̃(θ, τ) := log q(θ, τ) + log ( p_targ(τ) / p(τ) ),

where p_targ(τ) is a desired marginal density. It can be fixed at any distribution that covers the true posterior marginal of τ, such as its prior. We discuss other choices at the end of this section.

We then sample new draws (θ, τ) from this adapted sampling distribution. Since we only change the marginal distribution of τ between adaptations, the conditional densities θ | τ remain invariant. We mix the new sample with the ones from all previous adaptations and use this cumulative sample to estimate the aggregated marginal density p(τ) using path sampling at each adaptation. We iterate the above procedure until we reach approximate marginal convergence to p_targ(τ), which is further quantified by the Pareto-k̂ diagnostic.

τ is often undersampled in some regions in early iterations. In these regions, the path-sampling-estimated p(τ) is less reliable, and the log ( p_targ(τ)/p(τ) ) term will cause the sampler to focus on these undersampled regions in the next iteration (and mostly ignore regions that were already sufficiently sampled). This adaptive behavior is what makes this algorithm an implicit divide-and-conquer-type procedure. Algorithm 2 shows the complete setup for this procedure.

Once we have obtained our estimate of p(τ) from the divide-and-conquer algorithm, we can use importance sampling to compute marginal posterior expectations and quantiles.
Alternatively, the posterior expectation of any integrable h(τ) is the normalizing constant of h(τ)p(τ), which we can evaluate using path sampling and the (one-dimensional) quadrature (9). In our experiments, we find this latter approach to be more robust. Furthermore, as a byproduct of our marginal density estimate, we can evaluate the marginal distribution function out to arbitrary distances. This allows us to estimate quantiles with extremely small tail probabilities that may be more difficult to estimate with Monte Carlo draws.

In addition, the path sampling estimate of p(τ) can be used to diagnose poor sampling behavior in standard HMC by comparing this estimate with the obtained empirical distribution.

Algorithm 2:
Implicit divide-and-conquer scheme for metastable distributions
Input: q(τ, θ): the (unnormalized) joint density; τ: the problematic margin; θ: all remaining parameters; p_targ(τ): the targeted marginal of τ. Output: p(τ): marginal density of τ in the original joint density.
Initialize the sampling distribution q̃ = q(θ, τ); j = 1;
repeat
Generate a sample {τ_s, θ_s} from q̃(τ, θ);
Mix these draws with those from all previous adaptations, {τ_s, θ_s}_{s=1}^S;
Compute p(τ), the marginal density of q(τ, θ), using path sampling (9), gradients (11), and all draws;
Smooth the estimated p(τ) by regression (10), and record p_j(·) := p(·);
Update the sampling density q̃(τ, θ) := q(τ, θ) p_targ(τ)/p(τ); j := j + 1;
until the ratios r = {p_j(τ)/p_{j−1}(τ)} have k̂ < 0.7;

The optimal marginal distribution of a. Lastly, the adaptive path sampling estimate does not depend on the marginal distribution of a and λ, so this marginal distribution can be determined by user specification. By default in (9) we use the update rule c(·) ← z(·), which enforces a uniform marginal distribution on a. More generally, by updating c(f(·)) ← z(f(·))/p_prior(·), the final marginal distribution of a will approach p_prior. The choice of p_prior is subject to an efficiency-robustness trade-off. Gelman and Meng (1998) showed that the generalized Jeffreys prior p_opt(λ) ∝ sqrt( E_{θ|λ} U²(θ, λ) ) minimizes the variance of the estimated log normalizing constant, where U(θ, λ) = ∂/∂λ log q(θ, λ). With a slight twist, in continuous tempering (Section 2.2), we can prove that another optimal prior, p_opt(a) ∝ f′(a) sqrt( Var_{θ∼p(θ|a)} U(θ, a) ), ensures a smooth KL gap between two adjacent tempered distributions in the posterior sample, KL(π_a, π_{a+δa}) ≈ constant for a_min < a < a_max (Appendix A.5). In discrete tempering, this constant KL gap is related to a constant acceptance rate in the neighboring Gibbs update. However, due to its dependence on the unknown normalizing constant (and higher orders), these two efficiency-optimal priors require additional tuning and adaptations. In continuous tempering, we prefer the simple uniform a margin for robustness, as it guarantees a complete λ tour in the joint path. In implicit divide-and-conquer, if efficiency is more of a concern, we recommend adaptively updating the target marginal to match the efficiency-optimal one, p_targ(τ) ← p_opt(τ) ∝ sqrt( E_{θ∼q(θ,τ)} ( ∂/∂τ log q(θ, τ) )² ), which can be further stochastically approximated by joint Monte Carlo draws. This additional adaptation is optional and the estimation of p_opt is not required to be precise.
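The path sampling estimate (9) at the heart of both algorithms is simple to implement. The following self-contained sketch (our own toy example with an analytically known answer, not the paper's Stan implementation) estimates a log normalizing constant ratio from ranked joint draws:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy family q(theta, a) = exp(-(1 + a)*theta^2/2) with identity link f(a) = a,
# so the exact ratio is log z(1) - log z(0) = -0.5*log(2).
S = 40_000
a = rng.uniform(0.0, 1.0, S)                       # uniform marginal on a
theta = rng.normal(0.0, 1.0 / np.sqrt(1.0 + a))    # invariant conditional theta | a

# Pointwise gradients U as in (8): d/da log q(theta, a) = -theta^2/2,
# one Monte Carlo draw each, ranked by the a coordinate.
order = np.argsort(a)
a_sorted, U = a[order], -theta[order] ** 2 / 2

# Trapezoidal rule (9) over the ranked draws.
widths = np.diff(a_sorted)
log_z = np.concatenate([[0.0], np.cumsum(widths * (U[1:] + U[:-1]) / 2)])

est, exact = log_z[-1], -0.5 * np.log(2.0)
print(est, exact)
```

In the adaptive algorithm the joint draws come from HMC under the pseudo-prior rather than direct simulation, and the near-uniform marginal on a is achieved by the adaptation c ← z rather than assumed, but the estimation step is exactly this ranked trapezoidal sum.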
3. Related work: From importance sampling to adaptive importance sampling to Rao-Blackwellization to path sampling to adaptive path sampling
There is a large literature on methods for computing or approximating normalizing constants that cannot be evaluated in closed form. We refer to Gelman and Meng (1998) and Lelievre et al. (2010) for comprehensive reviews on normalizing constants.
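As a running numerical illustration for the estimators reviewed in this section, the following sketch (our own, on a hypothetical Gaussian family) chains simple importance sampling ratios along a discrete temperature ladder; each method surveyed below refines some piece of this basic construction:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian family q(theta, lam) = exp(-lam*theta^2/2), with exact
# log z(lam_K) - log z(lam_0) = 0.5*log(lam_0/lam_K).
ladder = np.linspace(1.0, 8.0, 15)    # lam_0 < lam_1 < ... < lam_K
S = 20_000

log_ratio = 0.0
for lam_j, lam_next in zip(ladder[:-1], ladder[1:]):
    theta = rng.normal(0.0, 1.0 / np.sqrt(lam_j), S)  # draws from theta | lam_j
    # One ladder step: z_{lam_{j+1}} / z_{lam_j} by importance sampling.
    log_ratio += np.log(np.mean(np.exp(-(lam_next - lam_j) * theta**2 / 2)))

print(log_ratio, 0.5 * np.log(ladder[0] / ladder[-1]))
```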
Adaptive importance sampling.
The basic importance sampling strategy (4) treats the normalizing constant z(λ) as a conditional expectation with respect to the random variable θ, whose distribution is parameterized by λ. Chatterjee and Diaconis (2018) proved that, under certain conditions, the number of simulation draws required for the importance sampling estimate (4) of z(λ) to have small error with high probability is roughly exp(KL(π_λ || π_{λ_0})).

When this KL gap is too large, a remedy is to add more discrete ladders λ_0 < λ_1 < ... < λ_K = λ and use adaptive importance sampling. At the (j+1)-th stage, we sample θ_{j1}, ..., θ_{jS} from π_{λ_j}, and the importance sampling estimate gives z_{λ_{j+1}}/z_{λ_j} ≈ (1/S) Σ_{i=1}^S q(θ_{ji}, λ_{j+1})/q(θ_{ji}, λ_j). The final estimate of the normalizing constant is

z_{λ_K}/z_{λ_0} = Π_{j=0}^{K−1} z_{λ_{j+1}}/z_{λ_j} ≈ Π_{j=0}^{K−1} ( (1/S) Σ_{i=1}^S q(θ_{ji}, λ_{j+1})/q(θ_{ji}, λ_j) ). (12)

Bridge sampling.
Importance sampling is restricted to sampling from the conditional density $\theta \mid \lambda$ for a constant $\lambda$ at each run—a slice in the $(\theta, \lambda)$ joint space. Bridge sampling simultaneously draws $\theta_{j1}, \ldots, \theta_{jS_j}$ from $\pi_{\lambda_j}$ and $\theta_{(j+1)1}, \ldots, \theta_{(j+1)S_{j+1}}$ from $\pi_{\lambda_{j+1}}$. Given a sequence of integrable functions $\{\alpha_j(\cdot)\}_{j=1}^{J}$, we estimate the ratio of normalizing constants via the bridge sampling (Meng and Wong, 1996) estimate:
\[
\frac{z_{\lambda_{j+1}}}{z_{\lambda_j}} = \frac{\mathbb{E}_{\pi_{\lambda_j}}\left(\alpha_j(\theta)\, q(\theta, \lambda_{j+1})\right)}{\mathbb{E}_{\pi_{\lambda_{j+1}}}\left(\alpha_j(\theta)\, q(\theta, \lambda_j)\right)} \approx \frac{\sum_{i=1}^{S_j} q(\theta_{ji}, \lambda_{j+1})\, \alpha_j(\theta_{ji}) / S_j}{\sum_{i=1}^{S_{j+1}} q(\theta_{(j+1)i}, \lambda_j)\, \alpha_j(\theta_{(j+1)i}) / S_{j+1}}. \tag{13}
\]
Under some independence assumptions, the optimal choice of $\alpha_j(\theta)$ to minimize the variance of the multistage bridge sampling estimate (13) is $\alpha_j^{\mathrm{opt}}(\cdot) = n_j z_j^{-1} / \sum_m n_m z_m^{-1} q_m(\cdot)$ (Meng and Wong, 1996; Shirts and Chodera, 2008).

Rao-Blackwellization. When some $\lambda_k$ are rarely seen, the inverse probability weighting is unstable. Motivated by the identity $p(\lambda) = \mathbb{E}_\theta\, p(\lambda \mid \theta)$, Carlson et al. (2016) proposed a Rao-Blackwellized (Robert and Casella, 2013) estimate of the normalizing constant, essentially replacing the empirical marginal probability mass function $\Pr(\lambda)$ by a Rao-Blackwellized estimate $\Pr(\lambda = \lambda_k) = \frac{1}{S}\sum_{s=1}^{S} \left( q(\theta_s, \lambda_k) / \sum_{k'} q(\theta_s, \lambda_{k'}) \right)$. We show in the appendix that this estimate is equivalent to multistate bridge sampling (13) with an empirical approximation of the theoretical optimum $\alpha_j^{\mathrm{opt}}$ defined above.

Non-equilibrium methods.
The importance sampling and bridge sampling estimates require $n_j > 0$ draws from each conditional $\theta \mid \lambda_j$. In non-equilibrium methods, we start with a simulation draw $\theta_0$ from the equilibrium $\theta \mid \lambda_0$, evolve it through a sequence of transitions that keep $\pi_k$, $k = 1, \ldots, K$, invariant at each step, and collect one non-equilibrium trajectory $(\theta_0, \ldots, \theta_{K-1})$. Notably, we do not draw from $\pi_K$ directly, and $\theta_j$, $j \geq 1$, is not $\pi_j$ distributed. We still obtain an unbiased estimate (Jarzynski, 1997; Neal, 2001):
\[
\frac{z_{\lambda_K}}{z_{\lambda_0}} = \mathbb{E}\left(\exp W(\theta_0, \ldots, \theta_{K-1})\right), \quad W(\theta_0, \ldots, \theta_{K-1}) = \log \prod_{j=0}^{K-1} \frac{q(\theta_j, \lambda_{j+1})}{q(\theta_j, \lambda_j)}, \tag{14}
\]
where the expectation is over all initial draws and trajectories.

Thermodynamic integration.
The path sampling estimate (9) is unbiased for the log normalizing constant (ratio) $\log z$, while other importance sampling based algorithms are unbiased on the scale of the normalizing constant $z$. In our adaptive procedure, since we update the logarithm of the joint density and compute its gradient in the next Hamiltonian Monte Carlo run, the unbiasedness of $\log z$ is more relevant. In statistical physics, the quantity $W$ in (14) is interpreted as the virtual work induced on the system. Jensen's inequality leads to $\mathbb{E} W \leq \log z(\lambda_K) - \log z(\lambda_0)$. This is a microscopic analog of the second law of thermodynamics: the work entering the system is always larger than the free energy change, unless the switching is processed infinitely slowly, which corresponds to thermodynamic integration.

The thermodynamic integration equality (7) was first introduced by Kirkwood (1935) in statistical physics, and further refined or applied by Ogata (1989); Neal (1993); Gelman and Meng (1998); Rischard et al. (2018) in the context of normalizing constant computation. However, the typical use of thermodynamic integration requires discretizing $\lambda$ into a fixed quadrature ladder $\lambda_0 < \cdots < \lambda_K$, on which the gradient $U_j$ is computed using many draws from $\theta \mid \lambda_j$. These ladders involve further manual tuning (Schlitter, 1991; Blondel, 2004) to control the variance of $U$, and the discretization error (bias) in the numerical integration is non-vanishing on a finite discrete ladder, to a large extent compromising the unbiasedness of the $\log z$ estimate.

Our present paper also extends the general discussion of path sampling in Gelman and Meng (1998), with added steps for parametric regularization, iterative adaptation, and diagnostics. Essentially, we use stochastic approximation (Robbins and Monro, 1951) to compute the pointwise gradient (7).
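To make the contrast concrete, the following sketch performs plain grid-based thermodynamic integration on a toy geometric path between two zero-mean normals. All settings here are illustrative (the paper's method replaces the fixed ladder and exact conditional draws with adaptive path sampling in Stan); the true log ratio is $\log z(1) - \log z(0) = \log(s_1/s_0)$:

```python
import math, random

random.seed(0)

# Geometric path between two zero-mean normals: q(θ|λ) = exp(-θ²/(2 v(λ))),
# with 1/v(λ) = (1-λ)/s0² + λ/s1², so z(λ) = sqrt(2π v(λ)) and the true
# log ratio is log z(1) - log z(0) = log(s1/s0).
s0, s1 = 1.0, 2.0

def v(lam):
    return 1.0 / ((1 - lam) / s0**2 + lam / s1**2)

def U(theta, lam):
    # U(θ, λ) = ∂/∂λ log q(θ|λ); for this path it does not depend on λ
    return -0.5 * theta**2 * (1 / s1**2 - 1 / s0**2)

def thermodynamic_integration(n_grid=21, n_draws=20000):
    # Fixed quadrature ladder with exact conditional draws θ|λ ~ normal(0, sqrt(v(λ)))
    lams = [i / (n_grid - 1) for i in range(n_grid)]
    means = []
    for lam in lams:
        sd = math.sqrt(v(lam))
        means.append(sum(U(random.gauss(0, sd), lam)
                         for _ in range(n_draws)) / n_draws)
    # Trapezoidal rule for ∫₀¹ E_{θ|λ} U dλ = log z(1) - log z(0)
    h = 1.0 / (n_grid - 1)
    return h * (sum(means) - 0.5 * (means[0] + means[-1]))

log_ratio = thermodynamic_integration()
```

Even in this idealized setting the estimate carries a non-vanishing discretization bias from the finite ladder, which is the motivation for the continuous scheme in the main text.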
We employ a carefully designed link function $f$ that allows direct access to simulation draws from chosen conditional distributions $\theta \mid \lambda$, while also eliminating the discretization bias. These extensions facilitate the continuous tempering scheme by providing direct access to draws from the target distribution.

Path sampling as the continuous limit of importance sampling and Taylor expansion.
In Section 1.1, we describe two separate approaches: the importance sampling estimate (4) and the Taylor series expansion (3). They reach the same first-order limit when the proposal is infinitely close to the target. That is, for any fixed $\lambda_0$, as $\delta = |\lambda_1 - \lambda_0| \to 0$,
\[
\frac{1}{\delta} \log \mathbb{E}_{\lambda_0}\left( \frac{q(\theta \mid \lambda_1)}{q(\theta \mid \lambda_0)} \right) = \int_\Theta \frac{\partial}{\partial \lambda} \log q(\theta \mid \lambda_0)\, p(\theta \mid \lambda_0)\, d\theta + o(1) = \frac{1}{\delta}\, \mathbb{E}_{\lambda_0}\left( \log \frac{q(\theta \mid \lambda_1)}{q(\theta \mid \lambda_0)} \right) + o(1).
\]
The path sampling estimate $\int_{\lambda_0}^{\lambda_1} \int_\Theta \frac{\partial}{\partial \lambda} \log q(\theta \mid \lambda)\, p(\theta \mid \lambda)\, d\theta\, d\lambda$ that we employ is the integral of the dominant term in the middle. In this sense, path sampling is the continuous limit of both the importance sampling and the Taylor expansion approaches. For a more rigorous derivation, see Appendix A.4.

More generally, thermodynamic integration (7) can be viewed as the $K \to \infty$ limit of equilibrium bridge sampling, Rao-Blackwellization, and annealed importance sampling, but without having to fit the conditional model infinitely many times (for rigorous proofs, see Appendix A.3 and A.4; see also discussions in Gelman and Meng, 1998; Carlson et al., 2016; Lelievre et al., 2010). We will further elaborate in the simulated tempering context that such a continuous extension is desirable, as otherwise the necessary number of interpolating temperatures $K$ soon blows up as the dimension of $\theta$ increases.

Simulated tempering.
Simulated tempering and its variants provide an accessible approach to sampling from a multimodal distribution. We augment the state space $\Theta$ with an auxiliary inverse temperature parameter $\lambda$, and employ a sequence of interpolating densities, typically through a power transformation $p_j \propto p(\theta \mid y)^{\lambda_j}$ on the ladder $0 \le \lambda_0 < \lambda_1 < \cdots < \lambda_K = 1$, such that $p_K$ is the distribution we want to sample from and $p_0$ is a (proper) base distribution. At a smaller $\lambda$, the between-mode energy barriers in $p(\theta \mid \lambda)$ collapse and the Markov chains mix more easily. This dynamic makes the sampler more likely to fully explore the target distribution at $\lambda = 1$.

Discrete simulated tempering (Marinari and Parisi, 1992; Neal, 1993; Geyer and Thompson, 1995) samples from the joint distribution $p(\lambda, \theta) \propto q(\theta)^\lambda / c(\lambda)$ using a Gibbs scheme, where $c(\lambda)$ is a pseudo-prior that is often iteratively assigned to be $\hat z(\lambda)$: an estimate of the normalizing constant of $q(\theta)^\lambda$, which may be obtained by any of the methods discussed above. Each Gibbs swap involves sampling $\theta \mid \lambda$ with a one- or multi-step Metropolis update that keeps $p(\theta \mid \lambda)$ invariant, and a random walk in $\lambda$ that leaves $\lambda \mid \theta$ invariant. The number of Metropolis updates, the number of temperature samples, and the temperature spacing all involve intensive user tuning.

In simulated annealing (Neal, 1993; Morris et al., 1998), we evolve $\lambda$ through a fixed schedule, and update $\theta$ using a Markov chain targeting the current conditional distribution $p(\theta \mid \lambda)$. Annealed importance sampling (Neal, 2001) adjusts for the non-equilibrium in the $\theta \mid \lambda$ update by the importance weight defined in (14).

However, as we mentioned above, discrete tempering schemes are sensitive to the choice of the $\lambda$ ladder. The Markov chain must usually proceed by making small changes between the neighboring distributions.
Following the discussion in Betancourt (2015), if some pair $\pi_{\lambda_j}$ and $\pi_{\lambda_{j-1}}$ have a large KL divergence, the error in importance sampling (12) based algorithms will be dominated by the $j$-th term. Under the optimal design, the Kullback–Leibler (KL) divergence between neighboring distributions should be roughly constant:
\[
\mathrm{constant} \simeq \mathrm{KL}\left(\pi_\lambda,\, \pi_{\lambda + \delta\lambda}\right) = \int \log\left(\frac{\pi_\lambda}{\pi_{\lambda+\delta\lambda}}\right) d\pi_\lambda = \log\left(\frac{z(\lambda+\delta\lambda)}{z(\lambda)}\right) - (\delta\lambda)\, \frac{d}{d\lambda} \log z(\lambda), \tag{15}
\]
which can hardly be achieved even adaptively due to the reliance on the unknown $\log z$ and its derivative.

Further, even with a constant KL gap, the discrete ladder imposes dimension limitations. For example, in simulated tempering, a transition $(\theta, \lambda_i) \to (\theta, \lambda_j)$ between two temperatures $\lambda_i$ and $\lambda_j$ in the Gibbs update is only likely if there is significant overlap between the potential energy distributions $\log q(\theta \mid \lambda_i)$ and $\log q(\theta \mid \lambda_j)$. Under a normal approximation with $\dim(\Theta) = d$, the width of the energy distribution scales as $O(d^{1/2})$, while the distance between two adjoining energy distributions is $O(d/K)$. This leads to a best-case bound on the necessary number of interpolating densities of $K = O(d^{1/2})$. In practice, $K$ is recommended to grow like $O(d)$ (Madras and Zheng, 2003). More theoretical discussion shows that $K = O(d)$ is often needed to ensure a polynomial bound on the adjacent temperature overlap and rapid mixing (Woodard et al., 2009). Since the update on $(\lambda_k)_{k=1}^K$ behaves as a random walk, even when the $\theta \mid \lambda$ update takes zero time, the relaxation time of a diffusion process with discrete state space $(\lambda_k)_{k=1}^K$ often scales as $O(K^2)$, soon becoming unaffordable as $K$ grows.

Toward continuous tempering.
Gobbo and Leimkuhler (2015) designed a continuous tempering scheme by adding a single auxiliary variable $a \in \mathbb{R}$ to the system. The inverse temperature $f(a)$ is defined by a smooth function such that $f(a) = 1$ for $|a| < \Delta$ and $f(a) = 0.15$ for $|a| > \Delta^*$. The Hamiltonian of the system is modified to be $\hat H(\theta, p, a, p_a) = f(a) H(\theta, p) + p_a^2/(2 m_a) + \log z(a)$, where $p_a$ is the momentum of $a$, and $\log z(a)$ is adaptively updated using importance sampling, in order to force $a$ to be uniformly distributed in the interval $[-\Delta^*, \Delta^*]$. Similarly, Betancourt (2015) introduced adiabatic Monte Carlo, where a contact Hamiltonian is defined on the augmented Hamiltonian system $\hat H(\theta, p, \lambda) = -\lambda \log q(\theta) + \frac{1}{2} p^T M^{-1} p + \log z(\lambda)$. These methods are shown to outperform discrete tempering, but they require a modified Hamiltonian and are sensitive to task-specific implementation and tuning.

Graham and Storkey (2017) formulated a continuous tempering on the joint density $p(\theta, \lambda) \propto \zeta^{-\lambda} q(\theta)^\lambda \psi(\theta)^{1-\lambda}$. This can be viewed as a special case of our proposed method that restricts the parametric form of our normalizing constant estimate to a single-parameter exponential function $z(\lambda) = \zeta^\lambda$. This method does not directly acquire simulation draws from $\theta \mid \lambda = 1$ or $0$, and integrals under the target distribution are evaluated through importance weighting.
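A quick way to see why the single-parameter form $z(\lambda) = \zeta^\lambda$ (log-linear in $\lambda$) is restrictive: even for a geometric bridge between two Gaussians, the true $\log z(\lambda)$ is far from linear. The check below uses illustrative values of our own choosing (base $\psi = \mathrm{normal}(0, 1)$, unnormalized target shape $q(\theta) = \exp(-(\theta-\mu)^2/(2 s^2))$) and verifies a closed form against quadrature:

```python
import math

# Geometric bridge between a normalized base ψ = normal(0, 1) and an
# unnormalized target q(θ) = exp(-(θ-μ)²/(2 s²)); toy, illustrative choices.
mu, s = 3.0, 0.5

def log_z(lam):
    # Closed form of log ∫ q(θ)^λ ψ(θ)^{1-λ} dθ, by completing the square
    P = lam / s**2 + (1 - lam)          # combined precision
    m = (lam * mu / s**2) / P           # combined mean
    return (-(1 - lam) / 2 * math.log(2 * math.pi)
            + 0.5 * math.log(2 * math.pi / P)
            - 0.5 * (lam * mu**2 / s**2 - P * m**2))

def log_z_quadrature(lam, lo=-10.0, hi=10.0, n=20000):
    # Trapezoidal check of the closed form
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        theta = lo + i * h
        log_integrand = (lam * (-(theta - mu)**2 / (2 * s**2))
                         + (1 - lam) * (-theta**2 / 2 - 0.5 * math.log(2 * math.pi)))
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(log_integrand)
    return math.log(total * h)

# Deviation of log z(1/2) from the log-linear interpolation ζ^λ would force
midpoint_gap = log_z(0.5) - 0.5 * (log_z(0.0) + log_z(1.0))
```

With these values the midpoint of $\log z$ misses the straight line between $\log z(0)$ and $\log z(1)$ by more than one nat, so no choice of $\zeta$ can match the whole path.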
4. Experiments
To demonstrate the advantages of the proposed method, we present a series of experiments. In Section 4.1, we use a conjugate model to compare the accuracy of normalizing constant estimation. In Section 4.2, we show that continuous tempering with path sampling is scalable to high dimensional multimodal posteriors. In Sections 4.3 and 4.4, we highlight the computational efficiency of the proposed method, including higher effective sample size and quicker mixing time. Lastly, we validate the implicit divide-and-conquer on a funnel-shaped posterior in Section 4.5.
We start with an example adapted from Betancourt (2015), in which the true normalizing constant can be evaluated analytically. Consider a model with a binomial likelihood $y \sim \mathrm{Binomial}(\theta, n)$ and a Beta prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$. Along a geometric bridge between the prior and posterior, the conditional unnormalized tempered posterior density at an inverse temperature $\lambda$ is
\[
q(\theta \mid \lambda) = \mathrm{Binomial}(y \mid \theta, n)^\lambda\, \mathrm{Beta}(\theta \mid \alpha, \beta), \quad \theta \in [0, 1],\ \lambda \in [0, 1].
\]

Figure 3: (a) The analytic log normalizing constant. (b) The base and the target. (c) The joint distribution of $(\lambda, \theta)$ when the marginal of $\lambda$ is uniform. In the hard case the target and base are separated and the log normalizing constant changes rapidly.
Figure 4: Estimation of the log normalizing constant in the easy Beta-binomial example during the first 20 adaptations. All methods eventually approximate the true function, while the proposed continuous tempering with path sampling has the fastest convergence rate.
Due to conjugacy, the normalized distribution has the closed form expression $p(\theta \mid \lambda) = \mathrm{Beta}(\lambda y + \alpha,\ \lambda(n - y) + \beta)$, and the true normalizing constant is
\[
z(\lambda) = \int_0^1 q(\theta \mid \lambda)\, d\theta = \left( \frac{\Gamma(n+1)}{\Gamma(y+1)\,\Gamma(n-y+1)} \right)^{\lambda} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \frac{\Gamma(\lambda y + \alpha)\, \Gamma(\lambda(n-y) + \beta)}{\Gamma(\lambda n + \alpha + \beta)}.
\]
We compare four sample-based estimates of the log normalizing constant: (i) continuous tempering with adaptive path sampling (the proposed method) with joint density $p(a, \theta) \propto q(\theta \mid f(a)) / c(f(a))$; (ii) continuous tempering, but replacing the path sampling estimate of the marginal density $p(a)$ by an empirical estimate and using importance sampling $\hat z(a) = p(a)\, c(a)$ to compute and update $z$; (iii) discrete simulated tempering with importance sampling estimation, estimating the marginal probability mass function $\Pr(\lambda = \lambda_i)$, $\lambda_1 < \lambda_2 < \cdots < \lambda_K = 1$, by a Monte Carlo average; (iv) discrete tempering with a Rao-Blackwellized (Carlson et al., 2016) estimate of the probability mass function.

In the first setting (left half of Figure 3), we set $\alpha = 2$, $\beta = 1$, $y = 60$, $n = 80$. Figure 4 presents the log normalizing constant estimates in the first 20 adaptations. All methods start with a flat initial guess $\log z(\lambda) = 0$. In continuous tempering, we draw 3000 joint $(a, \theta)$ draws in each adaptation. The first half of the draws are treated as warm-up and discarded (as we also do in the discrete setting). We choose a length $I = 100$ approximation grid in the parametric adaptation step (6). For discrete tempering, we use an evenly spaced ladder $\lambda_i$, $i = 1, \ldots, K$, with 150 $\lambda$ draws per adaptation, each followed by 100 HMC updates in $\theta$. These numbers ensure the continuous tempering has a smaller computation cost (3000 joint draws) than the discrete implementations (100 HMC updates on $\theta$ ×
150 updates on $\lambda_i$) per adaptation, so as to make the efficiency comparison convincing.

Figure 5: Comparison of four tempering methods in the hard Beta-binomial example. Row 1: Starting from a flat guess, only the proposed method converges to the true (red curve) value of the log normalizing constant after 8 adaptations. Rows 2–4 compare the first 150 draws of the joint distribution of parameters and temperatures in adaptations 3, 6, and 10. The proposed method fully explores the joint space efficiently, while the two discrete schemes exhibit random walk behavior in temperature updates, and cannot adapt to the rapidly changing regions near $\lambda = 0$.

Figure 6: $L_2$ errors of the log normalizing constant estimate $\log z(\lambda)$ from four methods during the first 20 adaptations. Only continuous tempering with path sampling gives a monotonically decreasing error which shrinks to zero in a practical amount of time.
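As a sanity check on this setup, the closed-form $z(\lambda)$ can be verified against one-dimensional quadrature of $q(\theta \mid \lambda)$. The sketch below uses the easy-case settings; it is only a numerical check, not part of the sampling algorithm:

```python
import math

# Easy-case Beta-binomial settings from the text
alpha, beta, y, n = 2.0, 1.0, 60, 80

def log_z_closed(lam):
    # log z(λ) = λ log C(n, y) + log(Γ(α+β)/(Γ(α)Γ(β)))
    #            + log(Γ(λy+α) Γ(λ(n-y)+β) / Γ(λn+α+β))
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return (lam * log_choose
            + math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
            + math.lgamma(lam * y + alpha) + math.lgamma(lam * (n - y) + beta)
            - math.lgamma(lam * n + alpha + beta))

def log_z_quadrature(lam, m=20000):
    # Riemann/trapezoid integration of q(θ|λ) = Binomial(y|θ,n)^λ Beta(θ|α,β)
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    h = 1.0 / m
    total = 0.0
    for i in range(1, m):  # interior points; endpoint contributions are negligible
        th = i * h
        lq = (lam * (log_choose + y * math.log(th) + (n - y) * math.log(1 - th))
              + (alpha - 1) * math.log(th) + (beta - 1) * math.log(1 - th) - log_B)
        total += math.exp(lq)
    return math.log(total * h)
```

At $\lambda = 0$ the formula correctly returns $\log z(0) = 0$, since the Beta prior is normalized.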
In this easy case, all methods eventually approximate the truth (red curve), among which our proposed continuous tempering with path sampling has the quickest convergence rate, almost recovering the truth within the first adaptation.

In the hard case (right half of Figure 3), we set $\alpha = 9$, a small $\beta$, $y = 115$, and $n = 550$. The base and target are two isolated spikes, and the log normalizing constant changes rapidly, especially for small $\lambda$. Figure 5 compares the four log normalizing constant estimation methods in the first 20 adaptations. Continuous tempering with path sampling nearly converges to the true value after the 8th adaptation. Rao-Blackwellization achieves a reasonable discrete approximation at the 13th adaptation, but the result is unstable due to discretization error. Indeed, Figure 6 displays the $L_2$ error of $\log z$: only the proposed method has a monotonically decreasing error with more adaptations. The two empirical importance sampling based methods are both too noisy to be useful.

Rows 2–4 in Figure 5 show the first 150 joint draws in adaptations 3, 6, and 10 from all four methods. Our proposed method fully explores the joint space by adaptation 10. Here, the joint HMC jumps are gradient-guided and make subtle jumps in regions of rapid change (where $\lambda \approx 0$).

Next we consider the problem of sampling from Gaussian mixtures. For visual illustration we begin with the simple case of a mixture of two one-dimensional Gaussians. Figure 7 shows five out of the ten total iterations for which we ran our tempering algorithm. At initialization, the joint sampler can only see a thin slice of $a$ in the region around the base, with a large resulting Pareto-$\hat k$ diagnostic. With more adaptations, more temperatures are sampled, and the path sampling estimate relies less on extrapolation. After four adaptations, the sampler has fully explored $a \in [0, 1]$, and $f(a) = 1$ retrieves the target distribution. The accuracy is confirmed by $\hat k < 0.7$, or by a visual check of a nearly uniform marginal distribution of $a$.

Next we consider a 10-component mixture of Gaussians. We generate the individual Gaussian components with variance a tenth of the minimum distance between any two of the mode centers, to ensure separation between the modes. We perform this sampling in a range of dimensions from 10 to 100 (with the number of components fixed). We compare the proposed tempering with path sampling against two benchmark algorithms: (i) Rao-Blackwellized discrete tempering (Carlson et al., 2016), and (ii) continuous tempering with a log-linear normalizing constant assumption (Graham and Storkey, 2017). We do not include other empirical importance sampling based tempering schemes, as they are strictly worse than these two benchmarks. For all methods, we use an over-dispersed normal base distribution.

In order to assess whether a given tempering procedure has succeeded, we count the proportion of draws in each target mixture component for each tempering method, and compare this distribution to the actual equal-weight target. The results are averaged over five independent runs. Figure 8 displays the evolution of the mean absolute difference between the sampled mode proportions in each adaptation of each algorithm and the target in the 100-dimensional case. The left panel of Figure 8 labels each point with the total CPU time needed to complete each adaptation iteration. Not only does the path sampling algorithm achieve more uniform mode exploration, but it does so in significantly less computation time than the Rao-Blackwellized procedure. This is partly a symptom of the HMC-in-Gibbs implementation of the Rao-Blackwellized algorithm.

Figure 7: Path sampling-based tempering for a Gaussian mixture target, plotting adaptations 1, 2, 4, 6, and 10.
The first row shows the joint simulation draws; the second row displays the estimate of the log normalizing constant; the third row displays the marginal density of the temperature and cumulative draws of the temperature variable, with Pareto-$\hat k$ diagnostics printed at the bottom left of each panel (good if $\hat k < 0.7$); and the fourth row shows the draws from the target density.

Figure 8:
Mean absolute error of target simulation draws from adaptive path sampling based tempering (proposed) and two benchmarks: log-linear continuous tempering and Rao-Blackwellization. Results are averaged over 5 repeated runs. Left: Sampling error and time at each adaptation for a Gaussian mixture target with 100 dimensions and 10 components. Numerical labels indicate the number of seconds of CPU time before each adaptation completed. Middle: Comparison of sampling errors as the dimension ranges from $d = 10$ to $100$. The log-linear tempering is supplied with the true normalizing constant while the other two use a uniform initialization. Right: The estimated log normalizing constant $\log z(f(a))$ from path sampling. $\log z(f(a))$ is neither monotone nor linear, which explains the undesired performance of the log-linear tempering even when it is supplied with the ground-truth normalizing constant.

In Figure 8, we do not see the mode exploration degrade with increasing dimension for path sampling. This is because the log normalizing constant itself scales mildly with the dimension in Gaussian mixtures. In general, the number of dimensions is not the limiting factor of path sampling, but it can inflate the log normalizing constant under a severe base-target mismatch. We further discuss dimension scalability in Appendix C.

Next we consider sampling from a flower-shaped distribution as previously used in Sejdinovic et al. (2014) and Nemeth et al. (2019). This is a two-dimensional distribution with probability density spread throughout multiple “petals”. Its probability density function is given by
\[
p(x, y \mid \sigma, r, A, \omega) \propto \exp\left( -\frac{1}{2\sigma^2} \left( \sqrt{x^2 + y^2} - r - A \cos\left(\omega \arctan(y, x)\right) \right)^2 \right).
\]
The probability mass is pinched into narrow regions between petals, slowing exploration of the target.
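For reference, the unnormalized log density is straightforward to code. The parameter values below ($r = 10$, $A = 6$, $\omega = 6$, $\sigma = 1$) are our illustrative choices, not necessarily those used in the experiments:

```python
import math

def flower_log_density(x, y, sigma=1.0, r=10.0, A=6.0, omega=6.0):
    # Unnormalized log density: the mass concentrates near the radius
    # r + A cos(ω · angle), which traces out ω "petals"
    radius = math.hypot(x, y)
    angle = math.atan2(y, x)
    dev = radius - r - A * math.cos(omega * angle)
    return -dev**2 / (2 * sigma**2)

# A point on a petal tip: angle 0, so the target radius is r + A = 16
on_petal = flower_log_density(16.0, 0.0)
# Same radius but rotated between petals (angle π/ω, target radius r - A = 4)
off_petal = flower_log_density(16.0 * math.cos(math.pi / 6),
                               16.0 * math.sin(math.pi / 6))
```

The huge log-density drop between the two probe points at the same radius is exactly the "pinching" that slows a standard HMC chain moving from petal to petal.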
For this reason, the flower distribution is challenging to sample from using standard HMC. Figure 9 plots draws from the flower distribution with 6 petals using plain HMC and using continuous tempering with path sampling. As before, we use a normal base distribution. Both algorithms were run for enough time to generate a similar number of draws from the target. Standard HMC clearly fails to adequately explore the petals of the flower distribution, while path sampling-based continuous tempering succeeds in generating draws from each of the petals in roughly equal proportions.

Figure 9:
Draws (red dots) from the flower target generated by plain HMC (left) and continuoustempering with path sampling (right). The blue contours represent the underlying target density.
Figure 10: Comparison of the computational cost of path sampling versus standard HMC for a flower distribution with 40 petals. The x-axis displays the time in seconds, and the y-axis shows the achieved $\hat R$ value for the $x$ coordinate (left) and $y$ coordinate (right) of the flower distribution. The dashed grey line displays the 1.05 cutoff below which we consider our chains to have mixed sufficiently. All results are averaged over 15 replications.

We further compare the total mixing time. Figure 10 plots the computed $\hat R$ (Vehtari et al., 2020) for the $x$ and $y$ coordinates of a challenging flower distribution with 40 petals against computation time. Continuous tempering with path sampling crosses the $\hat R = 1.05$ threshold for adequate mixing in about a quarter of the time (including all adaptations) required for standard HMC to reach that threshold, although the latter can eventually approximate the target with many more draws. Thus, despite the much lower time cost per sample for standard HMC, path sampling still manages to be more efficient in terms of total mixing time. This disparity will only become more apparent as the difficulty of the target density increases.
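The mixing criterion above can be sketched with a basic split-$\hat R$. Note this simplified version omits the rank-normalization and folding refinements of Vehtari et al. (2020):

```python
import math, random, statistics

def split_rhat(chains):
    # Basic split-R-hat: split each chain in half, compare within- and
    # between-chain variances (a simplified sketch of the diagnostic)
    halves = []
    for c in chains:
        m = len(c) // 2
        halves.append(c[:m])
        halves.append(c[m:2 * m])
    n = len(halves[0])
    means = [statistics.fmean(h) for h in halves]
    within = statistics.fmean([statistics.variance(h) for h in halves])
    between = n * statistics.variance(means)
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)

random.seed(7)
# Four well-mixed chains targeting the same distribution
mixed = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Four chains stuck in two separated modes, as plain HMC does on the flower target
stuck = [[random.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 5, 5)]
```

The stuck chains produce a large $\hat R$ because the between-chain variance dwarfs the within-chain variance, which is how the non-mixing of standard HMC shows up in Figure 10.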
The proposed path sampling and continuous tempering can enhance computational efficiency even when there is no direct posterior multimodality. In particular, the HMC mixing time scales polynomially with dimension, hence fitting a slightly larger model can be cheaper than fitting two models separately. Consider a sparse regression with the regularized horseshoe prior (Piironen and Vehtari, 2017a,b), an effective tool for Bayesian sparse regression. Letting $y_n$ be a binary outcome and $x_{n \times D}$ be predictors, the regression coefficient $\beta$ is equipped with a regularized horseshoe prior:
\[
\beta_d \mid \tau, \zeta, \gamma \sim \mathrm{normal}\left(0,\ \tau \gamma \zeta_d \left(\gamma^2 + \tau^2 \zeta_d^2\right)^{-1/2}\right), \quad \tau \sim \mathrm{Cauchy}^+\left(0, \tau_0\right), \quad \zeta_d \sim \mathrm{Cauchy}^+(0, 1), \quad d = 1, \ldots, D,
\]
where the scale $\tau_0$ is of order $1/((D-1)\sqrt{n})$. To take into account the model freedom between the logit link $\Pr(y_i = 1 \mid \beta, \mathrm{logit}) = \mathrm{logit}^{-1}\left(\sum_{d=1}^D \beta_d x_{id}\right)$ and the probit link $\Pr(y_i = 1 \mid \beta, \mathrm{probit}) = \Phi\left(\sum_{d=1}^D \beta_d x_{id}\right)$ in the likelihood, we construct a tempered path between the logit and probit models:
\[
p(a, \beta, \tau, \zeta, \gamma) \propto \frac{1}{c(a)} \prod_{i=1}^n \left( \Pr(y = y_i \mid \beta, \mathrm{logit})^{1-\lambda}\, \Pr(y = y_i \mid \beta, \mathrm{probit})^{\lambda} \right) p_{\mathrm{prior}}(\beta, \tau, \zeta, \gamma), \quad \lambda = f(a),
\]
where $p_{\mathrm{prior}}(\beta, \tau, \zeta, \gamma)$ encodes the regularized horseshoe prior.

In the first experiment, we generate $n = 40$ data points and $D = 100$ covariates with a maximum pairwise correlation of 0.5. Among all covariates, only the first three have nonzero coefficients $\beta_1, \beta_2, \beta_3$. Furthermore, $y$ is chosen to have a logit link in the true data generating process. We vary $a_{\min}$ in the link function $\lambda = f(a)$ as defined in Equation (17), such that when $0 < a < a_{\min}$, $\lambda = 0$ and path sampling draws from the logit model, and when $1 > a > a_{\max} = 1 - a_{\min}$, $\lambda = 1$, it draws

Figure 11:
The smallest unit-time tail effective sample size among all regression coefficients in the horseshoe regression with both Bernoulli logit and probit likelihoods. The green line refers to two separate model fits and the red line corresponds to a joint tempered path sampling counting all adaptation time. Left two panels: we vary $a_{\min}$, the proportion of logit and probit sampling in the final joint sample. Right two panels: we fix $a_{\min} = 0.35$, $a_{\max} = 0.65$ and vary the input-dimension/sample-size ratio from 2 to 8. In most cases, the joint path sampler generates a larger tail ESS per second for both the probit and logit samples, in addition to all intermediate models fitted simultaneously.

from the probit model. We run path sampling with two adaptations, $S = 500$ draws in the first adaptation, and $S = 3000$ in the second. We then compute the tail effective sample size (tail-ESS, Vehtari et al., 2020) for the probit and logit models in the aggregated draws, divided by the total sampling time (the sum over 4 chains). For comparison, we also fit these two models separately and count their effective sample sizes. To reflect the bottleneck of the computation, we monitor the smallest tail effective sample size among all regression coefficients $\beta_d$. Path sampling between two models expands the model continuously such that both individual models are special cases of the augmented model. Because the posterior distributions $\beta \mid y$ under the probit and logit links are similar, a connected path between them stabilizes the tail of posterior sampling, resulting in a larger unit-time tail effective sample size, as shown in the left two panels of Figure 11.

In the second experiment, we fix $a_{\min} = 0.35$, $a_{\max} = 0.65$, and vary the input dimension $D = n\rho$, $2 \le \rho \le 8$. Given that $a_{\max} - a_{\min} = 30\%$ of the sampling time is spent on intermediate models, it is remarkable that in most cases the joint sample from path sampling renders a tail effective sample size for the individual models no smaller than fitting them separately, as verified in the right two panels of Figure 11. The simulation results are averaged over 10 repeated runs. As a caveat, a joint path between two arbitrary models is not always more efficient than separate fits. Figure 11 displays the minimal tail effective sample size among all regression coefficients $\beta_d$; when averaged over all $\beta_d$, the mean tail effective sample size in path sampling is smaller than for the individual fits in this example, which can be viewed as a parameter-wise efficiency-robustness trade-off.

We apply the implicit divide-and-conquer algorithm to the hierarchical model and eight-school dataset (Gelman et al., 2013). In this problem, we are given eight observed means $y_i$ and standard deviations $s_i$, which are related to each other through the following hierarchical model with unknown parameters $\theta, \mu, \tau$:
\[
\text{centered parameterization:} \quad y_i \sim \mathrm{normal}(\theta_i, s_i), \quad \theta_i \overset{\mathrm{iid}}{\sim} \mathrm{normal}(\mu, \tau).
\]
The centered parametrization in hierarchical models often behaves as an implicit left-truncation due to entropic barriers. In particular, as $\tau \to$
0, the mass of the conditional posterior $p(\theta \mid \tau, y, s)$ concentrates around $\mu$, creating a funnel shape in the posterior of $\theta$ that is challenging to traverse. In this dataset, the left truncation can be overcome by switching to a non-centered parametrization that introduces auxiliary variables $\tilde\theta$:
\[
\text{non-centered parameterization:} \quad \theta_i = \mu + \tau \tilde\theta_i, \quad \tilde\theta_i \overset{\mathrm{iid}}{\sim} \mathrm{normal}(0, 1).
\]
However, reparametrization is often restricted to location-scale families, and choosing between the centered and non-centered parametrizations remains difficult for real data (Gorinova et al., 2020).

Figure 12: Comparison of log estimation errors of the left and right tail quantile estimates using $S = 4000$ posterior draws from HMC (centered and non-centered parametrizations) and implicit divide-and-conquer. The x-axis displays the left tail probability. The y-axis displays the log absolute error of the quantile estimate. The proposed method (red curve) achieves the highest accuracy.

Figure 13:
Comparison of log estimation errors of the first to fourth posterior moments of $\tau$ using $S = 4000$ draws from HMC (centered and non-centered parametrizations) or implicit divide-and-conquer. The latter returns the marginal density estimate, which is further equipped with either importance sampling or numerical integration. The x-axis displays the estimated moment, and the y-axis displays the log absolute error of the moment estimate.

We apply the proposed implicit divide-and-conquer algorithm to the sample drawn from the original centered parametrization. It successively pushes the marginal of $\tau$ towards a desired target marginal $p_{\mathrm{targ}}(\tau)$. In our case, we use the efficiency-optimal prior of the form $p(\tau) \propto \sqrt{\mathbb{E}_\tau\, U(\theta, \mu, \tau)^2}$, where the $U(\theta, \mu, \tau)$ functions are the gradients of the log posterior given in (8). This target cannot be computed explicitly, so we also estimate it at each adaptation using a simple window-based averaging scheme over the $\tau$ draws. We adaptively repeat the algorithm until we meet the convergence criterion.

Figure 12 displays the log errors for quantile estimates in the right and left tails of the marginal posterior of $\tau$ for HMC with (a) centered and (b) non-centered parametrizations, and (c) the implicit divide-and-conquer. We set the posterior draw size to $S = 4000$ (after thinning) in all three samples. In these experiments, we use estimates from the non-centered parametrization with a much larger number of draws as our ground truth. In the right tail, the implicit divide-and-conquer runs outperform the Monte Carlo estimates across all but the smallest quantile. For the left tail, we estimate quantiles at a range of small left tail probabilities.
1. In this case, path sampling dominates all other methods, with a significant improvementin the extreme quantiles.The estimated marginal density p ( τ ) enables expectation computation. We examine the per-formance for marginal moment estimation E ( τ m ) using both importance sampling and numericalintegration (the same trapezoidal rule as in (9)). Figure 13 displays the log errors of estimates forthe first four moments of the marginal posterior distribution of τ . The moments computed fromsimulation draws of the implicit divide-and-conquer, using either numerical integration scheme orimportance sampling, have a lower error than Monte Carlo estimates from both HMC parametriza-tions.Overall, we see from Figures 12 and 13 that the proposed implicit divide-and-conquer pro-vides good performance for a range of posterior estimation tasks, often out-performing all otherapproaches and never generating estimates with errors large enough to be unusable in any of thetested problems. This is remarkable since the implicit divide-and-conquer scheme generates itsunderlying HMC samples using the centered parametrization during each adaptation.The implicit divide-and-conquer scheme demands more computation time than one-run HMC,but it avoids the need for a problem-specific reparametrization, so it can be applied in cases whereother approaches may not be available. Furthermore, because it comes with a stopping criterion, wecan ensure that the extra computation time arrives at a sufficiently accurate result by termination.
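The numerical-integration route can be sketched as follows: with the estimated marginal density tabulated on a grid, moments follow from the trapezoidal rule. A minimal Python sketch, where a half-normal density stands in for the path sampling estimate of p(τ) (the grid and density here are illustrative, not the paper's actual output):

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal rule on a grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Hypothetical stand-in for the estimated marginal density of tau:
# an unnormalized half-normal(0, 1) tabulated on a grid.
tau = np.linspace(1e-6, 5.0, 2001)
density_unnorm = np.exp(-0.5 * tau**2)

# Normalize with the same quadrature rule used for the moments.
p_tau = density_unnorm / trapz(density_unnorm, tau)

def marginal_moment(m):
    """E(tau^m) by trapezoidal quadrature over the grid."""
    return trapz(tau**m * p_tau, tau)
```

For this half-normal stand-in the quadrature can be checked against closed forms: marginal_moment(1) ≈ sqrt(2/π) ≈ 0.798 and marginal_moment(2) ≈ 1.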
5. Discussion
In continuous simulated tempering, we initialize the pseudo prior at c = 1. When the underlying normalizing constant z(λ) is several orders of magnitude smaller than z(0) for all λ > 0, the sampler starting from this non-informative initialization will get stuck at λ = 0. Because f′(a) = 0 in the base, path sampling will fail to update. In this situation, we update the slope b in (6) to be the importance sampling estimate (4):

  b ← log ẑ_IS(1) = log( (1/S) Σ_{s=1}^S q(θ_s, λ = 1) / q(θ_s, λ = 0) ).

This estimate ẑ_IS is unlikely to be accurate, so we use it only for initialization to avoid local convergence.

In Section 2.2, we merely require the base measure ψ to have the same support as the target q. A natural candidate for ψ is the prior, as we have used throughout the experiments. Ideally, the base measure should balance being easy to sample from against being close enough to the target distribution. This is not a challenge unique to our method: all (discrete) simulated tempering and, more generally, importance sampling methods need to construct a good base measure. The current paper treats the base measure as an extra input that the user has to specify. Based on the discussion of how to construct and optimize the proposal distribution in adaptive importance sampling (Geweke, 1989; Owen and Zhou, 2000; Bugallo et al., 2017; Paananen et al., 2020), it may be possible to obtain better performance by optimizing the choice of base measure, which is itself often an iterative problem.

That said, we are unlikely to be able to automatically construct a good base density for importance sampling in high dimensional problems such that the KL divergence between the base and the target is bounded; otherwise, other advanced sampling methods would not be required. Fortunately, the adaptive path sampling estimate is less sensitive to base-target discrepancy.
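Numerically, the ratios q(θ_s, λ=1)/q(θ_s, λ=0) can span many orders of magnitude, so the slope update above is best computed on the log scale with a log-sum-exp. A minimal Python sketch, with illustrative Gaussian stand-ins for the base and target densities (not the models used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_q(theta, lam):
    """Log unnormalized density on a geometric path between a normal(0, 3)
    base (lam = 0) and a normal(2, 0.5) target (lam = 1); both densities
    are illustrative stand-ins."""
    log_base = -0.5 * (theta / 3.0) ** 2
    log_target = -0.5 * ((theta - 2.0) / 0.5) ** 2
    return (1.0 - lam) * log_base + lam * log_target

# S draws from the base measure (lam = 0).
S = 20_000
theta = rng.normal(0.0, 3.0, size=S)

# b <- log z_IS(1) = log( (1/S) * sum_s q(theta_s, 1) / q(theta_s, 0) ),
# evaluated stably on the log scale via log-sum-exp.
log_ratio = log_q(theta, 1.0) - log_q(theta, 0.0)
m = np.max(log_ratio)
b = m + np.log(np.mean(np.exp(log_ratio - m)))
```

For these stand-ins the truth is available in closed form, log(z(1)/z(0)) = log((0.5√(2π))/(3√(2π))) = log(1/6) ≈ −1.79, which the estimate recovers to within Monte Carlo error.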
As we have seen in experiments, our path sampling-based method still yields accurate log normalizing constant estimates and smooth joint sampling, even starting from a crudely chosen base distribution where importance sampling and bridge sampling have failed.
Figure 14:
An illustration of the different requirements of normalizing constant estimation and simulated tempering. In the left two panels, the blue curve poorly fits the true log normalizing constant ("log z est. is bad"), but there is enough mass everywhere in the resulting marginal density to ensure a complete path in simulated tempering (k̂ = −1.73 < 0.7). In the right two panels, with a much larger scale, the log z estimate is off by 1% at a single point, and the resulting path becomes a point mass in λ (k̂ = ∞ ≫ 0.7). In continuous tempering, such failure is diagnosed by a large k̂.

In accordance with the error analysis in Section 3 and the simulation results, the proposed adaptive path sampling scales better to high dimensions than existing counterparts. However, path sampling-based continuous tempering can still fail in high dimensional posteriors. In essence, simulated tempering depends on, but can be more difficult than, normalizing constant estimation. On the one hand, simulated tempering does not require a precise estimate of z(λ) (left half of Figure 14; see also Geyer and Thompson, 1995), as long as there is enough posterior marginal density p(a) or p(λ) everywhere; but this will only happen when the scale of z(λ) is small.

In Appendix C, we provide a failure mode example in a latent Dirichlet allocation model, where the log normalizing constant has a scale of order 10⁴, and path sampling manages to estimate it with pointwise errors of order 10², about 1%. For fitting a curve, a 1% error is accurate enough. But a pointwise 1% error in the log normalizing constant amounts to inflating the marginal density p(a) in the joint sampling (5) by a factor exp(10²) at that point, which effectively becomes a point mass and makes path tempering get stuck in one region. Such failures happen in discrete tempering too, and are identifiable by our k̂ diagnostic.
In other words, successful simulated tempering requires estimating the pointwise normalizing constant to multiplicative precision; see Figure 14 for an illustration.

The absolute scale of the log normalizing constant is comparable to the KL divergence between the base and the target. In a prior-posterior tempering path, the log normalizing constant at λ = 1 is the log marginal likelihood. It grows linearly with both the sample size and how closely the model fits the data (the log likelihood). When the model is poor in prediction (as in the latent Dirichlet allocation example), the log normalizing constant soon escapes the estimation accuracy that any estimate can achieve, an analog of the "folk theorem of statistical computing": When you have computational problems, often there's a problem with your model (Gelman, 2008).

Furthermore, we present a geometric path for continuous tempering. It is not clear whether modifying the free energy by a density power transformation is the best way to remove metastability, although most existing tempering methods adopt this form. If there is a rapid phase transition at some critical temperature, any power-transformation tempering will be prohibitively slow (Bhatnagar and Randall, 2004). Fortunately, the general framework in Section 2.1 permits arbitrary path formulations, and is easily implemented in Stan by replacing the closed form gradient U = log likelihood with automatic differentiation. We leave this more flexible tempering path for future research.

We have developed a method that integrates adaptive path sampling and tempering and have applied it to several examples. The procedure is highly automated, and we have implemented it in an R package using the general-purpose Bayesian inference engine Stan, returning both the desired posterior sample and the estimated log normalizing constant at convergence.
See Appendix B for software details.

In Bayesian computation, the ultimate goal is not to stop at the posterior simulation draws, but to use them to check and improve the model in a workflow. If data come from an identifiable model, then with reasonable sample size we can expect to distinguish among parameters and obtain a well behaved posterior distribution (eventually achieving asymptotic normality). From this perspective, multimodal posteriors should be unlikely with large sample sizes and may represent data that do not fit the model. Hence, it is crucial to check the model fit even after the target posterior is obtained from our proposed sampling algorithms.

Finally, although we have argued for its relative advantage over existing methods, we do not think the proposed method based on adaptive sampling and tempering can solve all metastable sampling problems, whether because of a badly-chosen base measure or the dimension and sample size limitations imposed by tempering itself. In such cases, the Pareto-k̂ diagnostic is still useful for understanding why the method fails. Besides refining the base measure and potentially modifying the model, another alternative strategy for metastable sampling is to use cross validation and multi-chain stacking (Yao et al., 2020) to combine the non-mixed simulation draws, which in effect changes the target distribution.

Acknowledgements
We thank Ben Bales, Rok Češnovar, and Måns Magnusson for helpful comments, and the U.S. National Science Foundation, Institute of Education Sciences, Office of Naval Research, Sloan Foundation, and Schmidt Futures for partial support.

Replication R and Stan code for the experiments is available at https://github.com/yao-yl/path-tempering. Correspondence to Y.Y.

References
Betancourt, M. (2015). Adiabatic Monte Carlo. arXiv:1405.3489.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
Bhatnagar, N. and Randall, D. (2004). Torpid mixing of simulated tempering on the Potts model. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 478–487.
Blondel, A. (2004). Ensemble variance in free energy calculations by thermodynamic integration: theory, optimal "alchemical" path, and practical solutions. Journal of Computational Chemistry, 25:985–993.
Bugallo, M. F., Elvira, V., Martino, L., Luengo, D., Miguez, J., and Djuric, P. M. (2017). Adaptive importance sampling: The past, the present, and the future. IEEE Signal Processing Magazine, 34:60–79.
Carlson, D., Stinson, P., Pakman, A., and Paninski, L. (2016). Partition functions from Rao-Blackwellized tempered sampling. In Proceedings of the 33rd International Conference on Machine Learning.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32.
Chatterjee, S. and Diaconis, P. (2018). The sample size required in importance sampling. Annals of Applied Probability, 28:1099–1135.
Gelman, A. (2008). The folk theorem of statistical computing. https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. CRC Press, New York, 3rd edition.
Gelman, A. and Meng, X.-L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13:163–185.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57:1317–1339.
Geyer, C. J. and Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90:909–920.
Gobbo, G. and Leimkuhler, B. J. (2015). Extended Hamiltonian approach to continuous tempering. Physical Review E, 91:061301.
Gorinova, M. I., Moore, D., and Hoffman, M. D. (2020). Automatic reparameterisation of probabilistic programs. In International Conference on Machine Learning.
Graham, M. M. and Storkey, A. J. (2017). Continuously tempered Hamiltonian Monte Carlo. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Hoffman, M. D. and Gelman, A. (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623.
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Physical Review Letters, 78:2690.
Kirkwood, J. G. (1935). Statistical mechanics of fluid mixtures. Journal of Chemical Physics, 3:300–313.
Lelievre, T., Rousset, M., and Stoltz, G. (2010). Free Energy Computation: A Mathematical Perspective. Imperial College Press, London.
Madras, N. and Zheng, Z. (2003). On the swapping algorithm. Random Structures & Algorithms, 22:66–97.
Mangoubi, O., Pillai, N. S., and Smith, A. (2018). Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities? arXiv:1808.03230.
Marinari, E. and Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhysics Letters, 19:451.
Meng, X.-L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6:831–860.
Morris, G. M., Goodsell, D. S., Halliday, R. S., Huey, R., Hart, W. E., Belew, R. K., and Olson, A. J. (1998). Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry, 19:1639–1662.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical report CRG-TR-93-1, Department of Computer Science, University of Toronto.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11:125–139.
Nemeth, C., Lindsten, F., Filippone, M., and Hensman, J. (2019). Pseudo-extended Markov chain Monte Carlo. In Advances in Neural Information Processing Systems, pages 4312–4322.
Ogata, Y. (1989). A Monte Carlo method for high dimensional integration. Numerische Mathematik, 55:137–157.
Owen, A. and Zhou, Y. (2000). Safe and effective importance sampling. Journal of the American Statistical Association, 95:135–143.
Paananen, T., Piironen, J., Bürkner, P.-C., and Vehtari, A. (2020). Implicitly adaptive importance sampling. arXiv:1906.08850.
Piironen, J. and Vehtari, A. (2017a). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics.
Piironen, J. and Vehtari, A. (2017b). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11:5018–5051.
R Core Team (2020). R: A language and environment for statistical computing.
Rischard, M., Jacob, P. E., and Pillai, N. (2018). Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation. arXiv:1810.01382.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.
Robert, C. and Casella, G. (2013). Monte Carlo Statistical Methods. Springer Science & Business Media, New York, 2nd edition.
Schlitter, J. (1991). Methods for minimizing errors in linear thermodynamic integration. Molecular Simulation, 7:105–112.
Sejdinovic, D., Strathmann, H., Garcia, M. L., Andrieu, C., and Gretton, A. (2014). Kernel adaptive Metropolis-Hastings. In International Conference on Machine Learning, pages 1665–1673.
Shirts, M. R. and Chodera, J. D. (2008). Statistically optimal analysis of samples from multiple equilibrium states. Journal of Chemical Physics, 129:124105.
Stan Development Team (2020). Stan modeling language user's guide. Version 2.23.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Bürkner, P.-C. (2020). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis. In press.
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2019). Pareto smoothed importance sampling. arXiv:1507.02646.
Woodard, D. B., Schmidler, S. C., and Huber, M. (2009). Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Annals of Applied Probability, 19:617–640.
Yao, Y., Vehtari, A., and Gelman, A. (2020). Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors. arXiv:2006.12335.
Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning.

Appendix A. Derivation of equations
A.1. Thermodynamic integration identity
Let q : R^{d+1} → R be an unnormalized density function. Let the normalizing constant (function) z : R → R be defined by z(λ) = ∫_{R^d} q(λ, θ) dθ, where θ ∈ R^d. Calculus shows

  d/dλ log z(λ) = ( d/dλ ∫_{R^d} q(λ, θ) dθ ) / ∫_{R^d} q(λ, θ) dθ
               = ∫_{R^d} (∂/∂λ) q(λ, θ) dθ / z(λ)
               = ∫_{R^d} ( ∂/∂λ log q(λ, θ) ) q(λ, θ) dθ / z(λ)
               = ∫_{R^d} ( ∂/∂λ log q(λ, θ) ) ( q(λ, θ) / z(λ) ) dθ.   (16)

That is, d/dλ log z(λ) = E_{θ|λ}( ∂/∂λ log q(λ, θ) ), which leads by the chain rule to the thermodynamic integration identity (3) we use.

At the second equality of (16), we assume the legitimacy of interchanging the derivative and the integral. One sufficient condition is that there exists a sufficiently large constant M such that ∫_{R^d} | (∂/∂λ) q(λ, θ) | dθ < M for all λ; the interchangeability then follows from the dominated convergence theorem.

A.2. The link function in continuous tempering

In continuous tempering, we choose the following piecewise polynomial link function (see Figure 2 for a visualization):

  f(a) = 0,   0 ≤ a < a_min,
         −2( (a − a_min)/(a_max − a_min) )³ + 3( (a − a_min)/(a_max − a_min) )²,   a_min ≤ a < a_max,
         1,   a_max ≤ a < 2 − a_max,
         −2( (2 − a_min − a)/(a_max − a_min) )³ + 3( (2 − a_min − a)/(a_max − a_min) )²,   2 − a_max ≤ a < 2 − a_min,
         0,   2 − a_min ≤ a ≤ 2.   (17)

It has a continuous first order derivative. In experiments and the default software implementation, we set a_min = 0.1 and a_max = 0.9.

A.3. Rao-Blackwellization as the optimal multistate bridge sampling
In this subsection we follow the notation of bridge sampling discussed in Lelievre et al. (2010). Let q_i = q(θ | λ_i) be the unnormalized density at a sequence of discrete temperatures λ_i, i = 1, . . . , I, and Z = (z_2, . . . , z_I)ᵀ the (ratios of) normalizing constants (assuming z_1 = 1). With an arbitrary list of Θ → R functions α_{i,j}, we define an (I − 1) × (I − 1) matrix A and an (I − 1) vector B:

  A = (  a_2    −b_23   . . .   −b_2I
        −b_32    a_3    . . .   −b_3I
         . . .   . . .  . . .   . . .
        −b_I2   −b_I3   . . .    a_I  ),     B = ( b_21, b_31, . . . , b_I1 )ᵀ,

with each entry a, b a shorthand for

  b_ij = E_{π_j}( α_ij q_i ),    a_i = Σ_{j=1, j≠i}^I E_{π_i}( α_ij q_j ).

In discrete tempering, each time we sample from the joint density p(θ, λ) ∝ c(λ) q(θ, λ). Since λ is discrete here, we denote c_m = c(λ_m) and q_m = q(θ, λ = λ_m), both of which are given in each adaptation. Carlson et al. (2016) considered the Rao-Blackwellized estimate of the normalizing constant:

  ẑ_k^RB = Σ_{l=1}^N q_k(θ_l) / Σ_{j=1}^I c_j q_j(θ_l),   k = 1, . . . , I.

Proposition 1. The Rao-Blackwellized estimate can be derived from multistate bridge sampling by choosing

  α_ij(θ) = n_j ẑ_j^{−1} / Σ_m c_m q_m(θ),

which is further an empirical estimate of the optimal bridge sampling functions.

Proof. We rearrange the multistate bridge sampling identities z_i b_ji = z_j b_ij, ∀ i, j, into the matrix form AZ = B. The optimal choice of bridge functions is

  α_ij(θ) = n_j z_j^{−1} / Σ_m n_m z_m^{−1} q_m(θ),

which in practice, starting from some initial guess ẑ, can be approximated by

  α_ij(θ) = n_j ẑ_j^{−1} / Σ_m n_m ẑ_m^{−1} q_m(θ).

Denote S(θ) = Σ_m c_m q_m(θ); then α_ij(θ) = n_j ẑ_j^{−1} / S(θ). We estimate the matrices A and B by their empirical means:

  â_i ẑ_i = n_i^{−1} Σ_{k=1}^{n_i} ẑ_i Σ_{j=1, j≠i}^I α_ij(θ_{i,k}) q_j(θ_{i,k})
          = ẑ_i n_i^{−1} Σ_{k=1}^{n_i} ( Σ_{j=1, j≠i}^I n_j ẑ_j^{−1} q_j(θ_{i,k}) ) / S(θ_{i,k})
          = ẑ_i − Σ_{k=1}^{n_i} q_i(θ_{i,k}) / S(θ_{i,k}),

  Σ_{j=1, j≠i}^I b̂_ij ẑ_j = Σ_{j=1, j≠i}^I ẑ_j n_j^{−1} Σ_{k=1}^{n_j} n_j ẑ_j^{−1} q_i(θ_{j,k}) / S(θ_{j,k})
                           = Σ_{j=1, j≠i}^I Σ_{k=1}^{n_j} q_i(θ_{j,k}) / S(θ_{j,k}).

Combining these two parts, we obtain the final estimate

  ẑ_i = Σ_{j=1, j≠i}^I Σ_{k=1}^{n_j} q_i(θ_{j,k}) / S(θ_{j,k}) + Σ_{k=1}^{n_i} q_i(θ_{i,k}) / S(θ_{i,k})
      = Σ_{m=1}^N q_i(θ_m) / Σ_{j=1}^I c_j q_j(θ_m),

which is identical to the Rao-Blackwellized estimate.

A.4. Path sampling as the limit of bridge sampling and annealed importance sampling
Proposition 2.
Path sampling can be viewed as the continuous limit of bridge sampling (13) and annealed importance sampling (14) when the intermediate states (0 = λ_0 < λ_1 < . . . < λ_{L+1} = 1) become infinitely dense, such that max_l δ_l → 0, where δ_l = λ_{l+1} − λ_l is the neighboring spacing.

Proof. The proof is similar to the reasoning in Gelman and Meng (1998). Notably, bridge sampling and annealed importance sampling work on the scale of the normalizing constant z, essentially computing z(1)/z(0) by Π_{l=0}^L ( z(λ_{l+1}) / z(λ_l) ), or equivalently, log z(1)/z(0) = Σ_{l=0}^L ( log z(λ_{l+1}) − log z(λ_l) ), with each ratio estimated by importance sampling:

  z(λ_{l+1}) / z(λ_l) = ∫_Θ ( q(θ | λ_{l+1}) / q(θ | λ_l) ) p(θ | λ_l) dθ.

In general, log E( q(θ | λ_{l+1}) / q(θ | λ_l) ) ≠ E( log q(θ | λ_{l+1}) − log q(θ | λ_l) ), where the expectation is taken over θ ∼ p(θ | λ_l). However, this difference approaches zero when we have a fine ladder. For a fixed λ_l, let

  G_l(ξ) = log ∫_Θ ( q(θ | λ_l + ξ) / q(θ | λ_l) ) p(θ | λ_l) dθ.

It satisfies G_l(0) = 0, and its derivative is

  G′_l(ξ) = d/dξ ( log ∫_Θ ( q(θ | λ_l + ξ) / q(θ | λ_l) ) p(θ | λ_l) dθ )
          = ( z(λ_l) / z(λ_l + ξ) ) ∫_Θ (∂/∂ξ) ( q(θ | λ_l + ξ) / q(θ | λ_l) ) p(θ | λ_l) dθ.

At ξ = 0, G′_l(0) becomes identical to the path sampling gradient in (7), as

  G′_l(0) = ∫_Θ ( ∂/∂λ log q(θ | λ_l) ) p(θ | λ_l) dθ.

By Taylor series expansion, G_l(ξ) = G_l(0) + ξ ∫_Θ ( ∂/∂λ log q(θ | λ_l) ) p(θ | λ_l) dθ + o(ξ). Hence, in the limit as max_l δ_l → 0, the importance sampling based estimate can be rearranged into

  log z(1)/z(0) = Σ_{l=0}^L G_l(δ_l)
               = Σ_{l=0}^L ( δ_l ∫_Θ ( ∂/∂λ log q(θ | λ_l) ) p(θ | λ_l) dθ + o(δ_l) )
               = ∫_0^1 ∫_Θ ( ∂/∂λ log q(θ | λ) ) p(θ | λ) dθ dλ + o(1),

where the dominant term equals the path sampling estimate, and the remainder approaches 0 in the dense limit δ_l → 0, ∀ l, since Σ_l δ_l = 1.

A.5. On the choice of prior p(a)

In path sampling, the final marginal distribution of a relies on user specification. By default we update c(·) ← z(·), which enforces a uniform marginal distribution. More generally, by updating c(·) ← z(·)/p_prior(·), the final marginal distribution of a will approach p_prior. In this section we discuss the choice of p_prior beyond uniformity. The choice of p_prior is subject to an efficiency-robustness trade-off. There are three separate goals to pursue via prior tuning:

1. Robustness.
Because a uniform a ensures the Markov chain has explored the whole temperature space, it is a conservative choice and we use it as the default in adaptations:

  p(a) ∝ 1.

2. Minimal variance of log z. On the other end of the spectrum, we can ask for efficiency. Gelman and Meng (1998) proved that the generalized Jeffreys prior minimizes the variance of the estimated log normalizing constant:

  p_opt(λ) ∝ sqrt( E_{θ|λ} U(θ, λ)² ),   where U(θ, λ) = ∂/∂λ log q(θ, λ).

3. Smooth transition in the joint sampling.
From the perspective of successful sampling, we could ask for

  KL( π_a, π_{a+δa} ) ≈ constant,

which is related to a constant acceptance rate in the discrete Gibbs update (when the discrete temperature update is restricted to either a one-step jump λ_k → λ_{k±1} or remaining unchanged). This constant acceptance rate is often used as a tuning target in discrete tempering (Geyer and Thompson, 1995). The next proposition gives a closed form optimal prior that ensures this constant KL gap.
The desired prior to achieve a smooth KL gap is

  p_opt(a) ∝ f′(a) sqrt( Var_{θ∼p(θ|a)} U(θ, a) ),   ∀ a_min < a < a_max.

This is similar to the minimal-variance prior above, with a slight twist from the second moment to the variance.

Proof.
It is easy to verify that

  KL( π_a, π_{a+δa} ) = ∫ log( π_a / π_{a+δa} ) dπ_a
                      = (1/2) (δa)² ( d²/da² log z(a) − ( f″(a)/f′(a) ) d/da log z(a) ) + o( (δa)² ).

Assuming we have already sampled from the joint stationary distribution, the gap between two neighboring order statistics reflects how dense the local density is, i.e., δa ∝ 1/p(a). Further, the two derivative terms can be expressed as expectations:

  d/da log z(a) = f′(a) E_{θ∼p(θ|a)}( log ψ − log q ),
  d²/da² log z(a) = f″(a) E_{θ∼p(θ|a)}( log ψ − log q )
                  + f′(a)² E_{θ∼p(θ|a)}( ( log ψ − log q )² ) − ( f′(a) E_{θ∼p(θ|a)}( log ψ − log q ) )²,

which further simplifies to

  d²/da² log z(a) − ( f″(a)/f′(a) ) d/da log z(a)
    = f′(a)² E_{θ∼p(θ|a)}( ( log ψ − log q )² ) − ( f′(a) E_{θ∼p(θ|a)}( log ψ − log q ) )².

Putting it all together, the constant KL gap will be achieved by

  p(a) ∝ ( d²/da² log z(a) − ( f″(a)/f′(a) ) d/da log z(a) )^{1/2} = f′(a) ( Var_{θ∼p(θ|a)}( log ψ − log q ) )^{1/2}.

It is also evident that under this prior, both KL( π_a, π_{a+δa} ) and the reverse jump KL( π_{a+δa}, π_a ) approximate (different) constants along the trajectory.

Due to dependence on the unknown normalizing constant (and higher order derivatives), these two efficiency-optimal priors require additional tuning and adaptations. In general, we still prefer to use the simple uniform margin for robustness.
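The minimal-variance prior above can be tabulated by Monte Carlo over a grid of λ. A minimal Python sketch on an illustrative Gaussian geometric path (base and target chosen for this example only), where sqrt(E_{θ|λ} U²) has a closed form so the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative geometric path between base psi = exp(-theta^2 / 2)
# and target q1 = exp(-theta^2 / (2 s^2)), so that
#   U(theta, lam) = log q1 - log psi = -(theta^2 / 2) (1/s^2 - 1),
#   theta | lam ~ normal(0, v_lam) with 1 / v_lam = (1 - lam) + lam / s^2.
s = 2.0

def p_opt_unnorm(lam, n_draws=100_000):
    """Monte Carlo estimate of p_opt(lam), proportional to sqrt(E U^2)."""
    v = 1.0 / ((1.0 - lam) + lam / s**2)
    theta = rng.normal(0.0, np.sqrt(v), size=n_draws)
    u = -0.5 * theta**2 * (1.0 / s**2 - 1.0)
    return np.sqrt(np.mean(u**2))

lam_grid = np.linspace(0.0, 1.0, 11)
p_opt = np.array([p_opt_unnorm(l) for l in lam_grid])

# Closed form for this path: sqrt(E U^2) = (sqrt(3)/2) |1/s^2 - 1| v_lam,
# so p_opt(1) / p_opt(0) should approach v_1 / v_0 = s^2 = 4.
ratio = p_opt[-1] / p_opt[0]
```

The same grid-based tabulation applies to any path for which draws from p(θ | λ) are available, with U evaluated by automatic differentiation when no closed form exists.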
Nevertheless, our method enables any user-specified prior choice, and in Section 4.5 we illustrate that this adaptively estimated and assigned efficiency-optimal prior further reduces the variance in the implicit divide-and-conquer scheme.

A.6. Choice of regression kernels

In regularized path sampling (Section 2), we regularize log z and log c via a parametric regression form:

  min_β Σ_{i=1}^I ( log z( f(a*_i) ) − β_0 f(a*_i) − Σ_{j=1}^J β_j γ_j( f(a*_i) ) )².   (18)

In other words, we approximate log z(f(a)) by β_0 f(a) + Σ_{j=1}^J β_j γ_j(f(a)). This also enables a closed form expression for the gradients c′ and z′ that are used in estimate (11). The term log z(f(a)) will be a constant function wherever f(a) is constant. In our experiments, we have tried a sequence of Gaussian kernels,

  γ_j(λ) = exp( −(λ − λ_j^kernel)² / (σ^kernel)² ),   λ_j^kernel = j/(J+1),   σ^kernel = 1/J,   1 ≤ j ≤ J,

logit kernels,

  γ_j(λ) = 1 / ( 1 + exp( −(λ − λ_j^kernel)/σ^kernel ) ),   λ_j^kernel = j/(J+1),   σ^kernel = 1/J,   1 ≤ j ≤ J,

and cubic splines.

We have not found the kernel choice to have a large impact on the final tempering procedure or log normalizing constant. For the experiments in Section 4, we used a combination of the Gaussian and logit kernels at J = 10 points within the path sampling adaptation steps for speed and simplicity. After running our final adaptation, we then smoothed the final estimate of the normalizing constant or marginal posterior using cubic splines, since these introduce less pseudo-periodic behavior than the Gaussian kernels.

To be clear, although z(·) can be further used in density estimation, the kernels are used here for regularization and functional approximation, and are not relevant to kernel density estimation. The latter is noisy and contingent on various smoothness assumptions. In contrast, (18) is a linear regression problem, as log z(f(a*_i)) is known from the path sampling estimate.
Besides, the choice of inner kernel points should not be confused with the tempering ladder. Here we sample a and λ continuously and evaluate z(λ) for all λ ∈ [0, 1].

Appendix B. Software implementation in
Stan
To provide an R (R Core Team, 2020) interface for path sampling and continuous tempering, we created a package pathtemp, with the underlying execution inside the general-purpose Bayesian inference engine Stan (Carpenter et al., 2017; Stan Development Team, 2020). The source code is available at https://github.com/yao-yl/path-tempering. The procedure is highly automated and requires minimal tuning. To install the package, call

  devtools::install_github("yao-yl/path-tempering/package/pathtemp", upgrade="never")
We demonstrate the practical implementation of continuous tempering on a Cauchy mixture example. Consider the following Stan model:

  data {
    real y;
  }
  parameters {
    real theta;
  }
  model {
    y ~ cauchy(theta, 0.2);
    -y ~ cauchy(theta, 0.2);
  }
Yao et al. (2020) have analyzed the posterior behavior of this mixture model. With a moderately large input data y, the posterior distribution of θ will asymptotically concentrate at two points close to ±y. As a result, Stan cannot fully sample from this two-point spike even with a large number of iterations.

To run continuous tempering, a user can specify any base model, say θ ∼ normal(0, 5), in an alternative model block, as if it were a regular model:

  ...
  model { // keep the original model
    y ~ cauchy(theta,0.2);
    -y ~ cauchy(theta,0.2);
  }
  alternative model { // add a new block of the base measure (e.g., the prior)
    theta ~ normal(0,5);
  }

After saving this code to a Stan file cauchy.stan, we run the function code_temperature_augment(), which automatically constructs a tempered path between the original model and the alternative model, and generates a working model named cauchy_augmented.stan:

  library(pathtemp)
  update_model <- stan_model("solve_tempering.stan")
  file_new <- code_temperature_augment("cauchy.stan")
  > output:
  > A new stan file has been created: cauchy_augmented.stan.

We have automated path sampling and its adaptation into a function path_sample(). Adaptive path sampling is then realized by compiling the augmented model and calling path_sample():

  sampling_model <- stan_model(file_new)
The returned value path_sample_fit provides access to the posterior draws of θ from the target density and base density, the joint path in the (θ, a) space in the final adaptation, and the estimated log normalizing constant log z(λ):

  sim_cauchy <- extract(path_sample_fit$fit_main)
  in_target <- sim_cauchy$lambda==1
  in_prior <- sim_cauchy$lambda==0

The output is presented in Figure 15.

Figure 15:
Output from the Cauchy code example: posterior draws of θ from the target density and base density, the joint path in the (θ, a) space in the final adaptation, and the estimated log normalizing constant log z(λ). The visualization is based on 6 adaptations and S = 1500 posterior draws.

Second, this automated procedure enables fitting two models together. The following Stan code fits a regression with both the probit and logit link. A path between them effectively expands the model continuously, such that both individual models are special cases of the augmented model. The computational efficiency is enhanced, as we are fitting one slightly larger model rather than fitting two models. In addition, the log normalizing constant tells us which model fits the data better, which is related to but distinct from the log Bayes factor in model comparisons. The output of one run is presented in Figure 16. The data favor the logit link accordingly.

Figure 16:
Output from the logit-probit code example: the posterior draws of β from the probit and logit link, the joint path in the (β, a) space in the final adaptation, and the estimated log normalizing constant log z(λ). The visualization is based on 2 adaptations and S = 1500 posterior draws.

  data {
    int n;
    int y[n];
    real x[n];
  }
  parameters {
    real beta;
  }
  model {
    beta ~ normal(0,2);
    y ~ bernoulli_logit(beta * x); // logistic regression
  }
  alternative model {
    beta ~ normal(0,1); // can be a different prior
    y ~ bernoulli(Phi(beta * x)); // probit regression
  }

Appendix C. A failure mode: simulated tempering on latent Dirichlet allocation
Here we present a failure mode of simulated tempering on a high dimensional latent Dirichlet allocation (LDA) model, which is widely used in natural language processing, computer vision, and population genetics. In the model, the j-th document (1 ≤ j ≤ J) is drawn from the l-th topic (1 ≤ l ≤ L) with probability θ_jl, where the topic is defined by a vector of probabilities φ_l over the vocabulary, such that each word in a document from topic l is independently drawn from a multinomial distribution with probability φ_l. We apply the LDA model to texts in the novel Pride and Prejudice. After removing frequent and rare words, the book contains 2025 paragraphs and 32877 words, with a total unique vocabulary size of 1495. We randomly split the words in the data into a training and a test set. The dimensions of the parameters θ and φ grow with the number of topics L as 2025 × L and L × 1495, respectively. In our experiment, we use L = 5 topics.

LDA is prone to a multimodal posterior distribution. Variational inference often does not replicate across multiple runs or data shuffles, which can create the appearance of random results for the user and reduces the predictive power. Yao et al. (2020) reported posterior multimodality in this dataset and model setting. We use continuous tempering to sample from the joint distribution proportional to the joint density c^{-1}(λ) prior^{1−λ} × likelihood^{λ}. For initialization, we start by simulating draws θ_i^{prior} from the prior, where θ now denotes all parameters in the model, and assign the importance sampling estimate log(1/S Σ_{i=1}^{S} p(y | θ_i^{prior})) to the initial slope coefficient b, which is close to − × in our first run.

After 10 adaptations and 3000 joint HMC iterations per adaptation, the sampler still explores only a thin range of the transformed temperature a, both in the last adaptation and in all 10 adaptations combined, as shown in the two histograms in Figure 17.
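As a minimal sketch of the initialization step above (not the paper's actual implementation; the helper name init_slope and the toy inputs are hypothetical), the importance sampling estimate log(1/S Σ_i p(y | θ_i^prior)) can be computed stably on the log scale with a log-sum-exp trick, which matters here because the pointwise log likelihoods are on the order of minus tens of thousands and exponentiating them directly would underflow:

```python
import numpy as np

def init_slope(log_lik_prior_draws):
    """Initialize the tempering slope coefficient b with the importance
    sampling estimate log(1/S * sum_i p(y | theta_i^prior)), computed
    stably from the pointwise log likelihoods via log-sum-exp."""
    l = np.asarray(log_lik_prior_draws, dtype=float)
    m = np.max(l)
    # log(1/S * sum_i exp(l_i)) = m + log(mean(exp(l_i - m)))
    return m + np.log(np.mean(np.exp(l - m)))

# Toy check: with identical log likelihoods the estimate equals that value,
# even at scales where exp() would underflow to zero.
print(init_slope([-5.0, -5.0, -5.0]))          # prints -5.0
print(init_slope([-100000.0, -100000.0]))      # prints -100000.0
```

The shift by the maximum keeps every exponentiated term at or below 1, so the estimate remains finite even when the raw likelihoods p(y | θ_i^prior) are far below machine precision.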
This marginal pattern fails the Pareto-k̂ diagnostic, suggesting the joint tempering path is unreliable.

Figure 17: Distribution of the transformed temperature a over all adaptations (left) and over the last adaptation (middle), and the estimated log normalizing constant (right).

In adaptive path sampling, the log normalizing constant is computed by integrating the pointwise gradient u_i = log p(y | a_i, θ_i) (dλ/da)|_{a=a_i}. We examine these three terms along all sampled a_i in Figure 18: the log likelihood log p(y | a_i, θ_i), the pointwise gradient u_i, and the derivative (dλ/da)|_{a=a_i}. The variation in the log likelihood log p(y | a_i, θ_i) could be a concern, since we approximate its pointwise expectation by a single Monte Carlo draw. However, this pointwise variation is small compared with the absolute scale of the log likelihood. Hence, the gradient seems to have been computed with small Monte Carlo error (panel 2). Except that, if we zoom in, pointwise variation can still be found near the lower end of a (panel 4). This situation is analogous to curve fitting in data analysis: we often do not care about the absolute scale of the curve. Indeed, we often transform all inputs to the unit scale in regressions, implicitly assuming that approximating a process N(1 : 10,
1) with pointwise errors N(0, 0.1) is operationally equivalent to approximating the multiplicatively inflated process N(100 : 1000, 100) with pointwise errors N(0, 10).

Measured by its relative error, the path sampling estimate of the log normalizing constant is stable after 10 adaptations (last panel of Figure 17). However, because the log normalizing constant enters the joint density additively, the absolute scale of the approximation error does matter for the purpose of tempering. An approximation error that is tiny compared to the range of the whole log z curve is inevitable if we use any parametric regularization to fit the curve; but here such an error is comparable with the scale of the log likelihood.

Figure 18: (1) The pointwise log likelihood log p(y | θ, a) among all simulation draws as a function of a; (2) the computed pointwise gradient u; (3) the derivative dλ/da; (4) zoom-in of panel 2, showing only the lower end.

That is, if we start with the exactly known log normalizing constant log z(a) = O(10^4), we will obtain an exactly uniform margin of a in the posterior. But if at one single temperature point a we add a 1% noise, log ẑ(a) += 0.
01 log z(a), we will create an exp(100) ≈ 10^{43} bump in the marginal density p(a): essentially a point mass at a. This pitfall does not mean our method is particularly flawed. In fact, discrete tempering methods are even worse, as (a) in the best case they estimate a coarse discrete ladder, while here every point matters, and (b) they work on the scale of the normalizing constant z directly, and it is hopeless to estimate a quantity on the scale of exp(−−