Parallel Tempering on Optimized Paths
Saifuddin Syed, Vittorio Romaniello, Trevor Campbell, Alexandre Bouchard-Côté
Abstract
Parallel tempering (PT) is a class of Markov chain Monte Carlo algorithms that constructs a path of distributions annealing between a tractable reference and an intractable target, and then interchanges states along the path to improve mixing in the target. The performance of PT depends on how quickly a sample from the reference distribution makes its way to the target, which in turn depends on the particular path of annealing distributions. However, past work on PT has used only simple paths constructed from convex combinations of the reference and target log-densities. This paper begins by demonstrating that this path performs poorly in the setting where the reference and target are nearly mutually singular. To address this issue, we expand the framework of PT to general families of paths, formulate the choice of path as an optimization problem that admits tractable gradient estimates, and propose a flexible new family of spline interpolation paths for use in practice. Theoretical and empirical results both demonstrate that our proposed methodology breaks previously-established upper performance limits for traditional paths.
1. Introduction
Markov chain Monte Carlo (MCMC) methods are widely used to approximate intractable expectations with respect to unnormalized probability distributions over general state spaces. For hard problems, MCMC can suffer from poor mixing. For example, faced with well-separated modes, MCMC methods often get trapped exploring local regions of high probability. Parallel tempering (PT) is a widely applicable methodology (Geyer, 1991) to tackle poor mixing of MCMC algorithms.

Suppose we seek to approximate an expectation with respect to an intractable target distribution π. Denote by π_0 a reference distribution defined on the same space, which is assumed to be tractable in the sense of the availability of an efficient sampler. This work is motivated by the case where π_0 and π are nearly mutually singular. A typical case is where the target is a Bayesian posterior distribution, the reference is the prior (for which i.i.d. sampling is typically possible), and the prior is misspecified.

PT methods are based on a specific continuum of distributions π_t ∝ π_0^{1−t} π^t, t ∈ [0, 1], bridging π_0 and π. This path of intermediate distributions is known as the power posterior path in the literature, but in our framework it will be more natural to think of these continua as linear paths, as they linearly interpolate between log-densities. PT algorithms discretize the path at some 0 = t_0 < · · · < t_N = 1 to obtain a sequence of distributions π_{t_0}, π_{t_1}, . . . , π_{t_N}. See Figure 1 (top) for an example of a linear path for two nearly mutually singular Gaussian distributions.

Given the path discretization, PT involves running N + 1 MCMC chains that together target the product distribution π_{t_0} π_{t_1} · · · π_{t_N}.

(* Equal contribution. Department of Statistics, University of British Columbia, Vancouver, Canada. Correspondence to: Saifuddin Syed, Vittorio Romaniello.)
Based on the assumption that chain 0 (targeting the reference π_0) can be sampled efficiently, PT uses swap-based interactions between neighbouring chains to propagate the exploration done in the reference chain into improved exploration in the chain of interest. By designing these swaps as Metropolis–Hastings moves, PT guarantees that the marginal distribution of the Nth chain converges to π; and in practice, the rate of convergence is often much faster compared to running a single chain (Woodard et al., 2009). PT algorithms are extensively used in hard sampling problems arising in statistics, physics, computational chemistry, phylogenetics, and machine learning (Desjardins et al., 2014; Ballnus et al., 2017; Kamberaj, 2020; Müller & Bouckaert, 2020).

Notwithstanding empirical and theoretical successes, existing PT algorithms also have well-understood theoretical limitations. Earlier work focusing on the theoretical analysis of reversible variants of PT has shown that adding too many intermediate chains can actually deteriorate performance (Lingenheil et al., 2009; Atchadé et al., 2011). Recent work (Syed et al., 2019) has shown that a nonreversible variant of PT (Okabe et al., 2001) is guaranteed to dominate its classical reversible counterpart, and moreover that in the nonreversible regime adding more chains is guaranteed to improve performance. However, even with these more efficient nonreversible PT algorithms, Syed et al. (2019) established that the improvement brought by higher parallelism will asymptote to a fundamental limit known as the global communication barrier.

In this work, we show that by generalizing the class of paths interpolating between π_0 and π from linear to nonlinear, the global communication barrier can be broken, leading to substantial performance improvements.
Importantly, the nonlinear path used to demonstrate this breakage is computed using a practical algorithm that can be used in any situation where PT is applicable. An example of a path optimized using our algorithm is shown in Figure 1 (bottom). We also present a detailed theoretical analysis of parallel tempering algorithms based on nonlinear paths. Using this analysis we prove that the performance gains obtained by going from linear to nonlinear path PT algorithms can be arbitrarily large. Our theoretical analysis also motivates a principled objective function used to optimize over a parametric family of paths.

Literature review
Beyond parallel tempering, several methods to approximate intractable integrals rely on a path of distributions from a reference to a target distribution, and there is a rich literature on the construction and optimization of nonlinear paths for annealed importance sampling type algorithms (Gelman & Meng, 1998; Rischard et al., 2018; Grosse et al., 2013; Brekelmans et al., 2020). These algorithms are highly parallel; however, for challenging problems, even when combined with adaptive step size procedures (Zhou et al., 2016), they typically suffer from particle degeneracy (Syed et al., 2019, Sec. 7.4). Moreover, these methods use different path optimization criteria which are not well motivated in the context of parallel tempering. The most closely related work is Tawn et al. (2020), which considers a specific example of a nonlinear path, distinct from the one considered here. Moreover, the construction of the nonlinear path in this previous work requires knowledge of the location of the modes of π, and hence makes the overall algorithm much less broadly applicable than standard PT.
2. Background
In this section, we provide a brief overview of parallel tempering (PT) (Geyer, 1991), as well as recent results on nonreversible communication (Okabe et al., 2001; Sakai & Hukushima, 2016; Syed et al., 2019). Define a reference unnormalized density function π_0 for which sampling is tractable, and an unnormalized target density function π_1 for which sampling is intractable; the goal is to obtain samples from π_1. Define a path of distributions π_t ∝ π_0^{1−t} π_1^t for t ∈ [0, 1] from the reference to the target. (We assume throughout that all distributions share a common state space X, and we will often suppress the argument of (log-)density functions, writing π_t instead of π_t(x), for notational brevity.) Finally, define the annealing schedule T_N to be a monotone sequence in [0, 1] satisfying

T_N = (t_n)_{n=0}^N,  0 = t_0 ≤ t_1 ≤ · · · ≤ t_N = 1,
‖T_N‖ = max_{n ∈ {0, . . . , N−1}} (t_{n+1} − t_n).

The core idea of parallel tempering is to construct a Markov chain (X_m^0, . . . , X_m^N), m = 1, 2, . . . that (1) has invariant distribution π_{t_0} · π_{t_1} · · · π_{t_N}, such that we can treat the marginal chain X_m^N as samples from the target π_1, and (2) swaps components of the state vector such that independent samples from component 0 (i.e., the reference π_0) traverse along the annealing path and aid mixing in component N (i.e., the target π_1). This is possible to achieve by iteratively performing a local exploration move followed by a communication move, as shown in Algorithm 1.

Local Exploration
Given (X̃_{m−1}^0, . . . , X̃_{m−1}^N), we obtain an intermediate state (X̃_m^0, . . . , X̃_m^N) by updating the nth component using any MCMC move targeting π_{t_n}, for n = 0, . . . , N. This move can be performed in parallel across components since each is updated independently.

Communication
Given the intermediate state (X̃_m^0, . . . , X̃_m^N), we apply pairwise swaps of components n and n + 1, n ∈ S_m ⊂ {0, . . . , N − 1}, for a swapped index set S_m. Formally, a swap is a move from (x^0, . . . , x^n, x^{n+1}, . . . , x^N) to (x^0, . . . , x^{n+1}, x^n, . . . , x^N), which is accepted with probability

α_n = 1 ∧ [π_{t_n}(x^{n+1}) π_{t_{n+1}}(x^n)] / [π_{t_n}(x^n) π_{t_{n+1}}(x^{n+1})].   (1)

Since each swap only depends on components n, n + 1, one can perform all of the swaps in S_m in parallel, as long as n ∈ S_m implies (n + 1) ∉ S_m. The largest collections of such non-interfering swaps are S_m ∈ {S_even, S_odd}, where S_even and S_odd are the even and odd subsets of {0, . . . , N − 1} respectively. In non-reversible PT (NRPT) (Okabe et al., 2001), the swap set S_m at each step m is set to

S_m = S_even if m is even,  S_m = S_odd if m is odd.

Based on simplifying assumptions on the local exploration moves, recent work has shown that if r(t, t′) is the probability of a swap being rejected between the chains at t and t′, then the round trip rate τ(T_N), i.e., the frequency at which a new sample from the reference π_0 percolates to the target π_1 and then back to π_0, is (Syed et al., 2019, Cors. 1, 2)

τ(T_N) = (2 + 2 Σ_{n=0}^{N−1} r(t_n, t_{n+1}) / (1 − r(t_n, t_{n+1})))^{−1}.   (2)

Algorithm 1 NRPT

Require: initial state x_0, path π_t, schedule T_N, number of scans M
  r_n ← 0 for all n ∈ {0, . . . , N − 1}
  for m = 1 to M do
    x̃_m ← LocalExploration(x_{m−1})
    S_m ← S_even if m is even, otherwise S_m ← S_odd
    for n = 0 to N − 1 do
      α_n ← 1 ∧ [π_{t_n}(x̃_m^{n+1}) π_{t_{n+1}}(x̃_m^n)] / [π_{t_n}(x̃_m^n) π_{t_{n+1}}(x̃_m^{n+1})]
      r_n ← r_n + (1 − α_n)
      U_n ∼ Unif(0, 1)
      if n ∈ S_m and U_n ≤ α_n then
        (x̃_m^n, x̃_m^{n+1}) ← (x̃_m^{n+1}, x̃_m^n)
      end if
    end for
    x_m ← x̃_m
  end for
  r_n ← r_n / M for all n ∈ {0, . . . , N − 1}
Return: {x_m}_{m=1}^M, {r_n}_{n=0}^{N−1}

Further, if T_N is refined so that ‖T_N‖ → 0 as N → ∞, we find that the asymptotic (in N) round trip rate is

τ_∞ = lim_{N→∞} τ(T_N) = (2 + 2Λ)^{−1},   (3)

where Λ ≥ 0 is a constant associated with the pair π_0, π_1 called the global communication barrier (Syed et al., 2019). Note that Λ does not depend on the number of chains N or the discretization schedule T_N.
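To make Algorithm 1 concrete, the following is a minimal NumPy sketch of NRPT for a one-dimensional linear Gaussian path. The toy path, the choice of random-walk Metropolis as the local exploration kernel, and all function names are illustrative assumptions of this sketch, not details fixed by the text:

```python
import numpy as np

def nrpt(log_dens, x, M, step=1.0, seed=0):
    """Minimal NRPT (Algorithm 1): local exploration followed by a
    deterministic even/odd communication sweep. `log_dens` is a list of
    N+1 unnormalized log-densities and `x` the initial state of the N+1
    chains. Returns the average rejection rate of each of the N swap pairs."""
    rng = np.random.default_rng(seed)
    N = len(x) - 1
    r = np.zeros(N)
    x = np.array(x, dtype=float)
    for m in range(1, M + 1):
        # Local exploration: one random-walk Metropolis move per chain
        # (any pi_{t_n}-invariant kernel would do here)
        for n in range(N + 1):
            prop = x[n] + step * rng.standard_normal()
            if np.log(rng.uniform()) < log_dens[n](prop) - log_dens[n](x[n]):
                x[n] = prop
        # Communication: even pairs on even scans, odd pairs on odd scans
        for n in range(N):
            A = (log_dens[n](x[n + 1]) + log_dens[n + 1](x[n])
                 - log_dens[n](x[n]) - log_dens[n + 1](x[n + 1]))
            alpha = np.exp(min(0.0, A))  # 1 AND exp(A): Equation (1)
            r[n] += 1.0 - alpha
            if n % 2 == m % 2 and rng.uniform() < alpha:
                x[n], x[n + 1] = x[n + 1], x[n]
    return r / M

# Toy linear path between N(-1, 1) and N(1, 1): pi_t = N(2t - 1, 1)
ts = np.linspace(0.0, 1.0, 11)
log_dens = [lambda y, t=t: -0.5 * (y - (2 * t - 1)) ** 2 for t in ts]
rates = nrpt(log_dens, x=2 * ts - 1, M=4000)
```

The estimated rejection rates returned here are exactly the quantities r(t_n, t_{n+1}) that enter the round trip rate in Equation (2).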
3. General annealing paths
In the following, we use the terminology annealing path to describe a continuum of distributions interpolating between π_0 and π_1; this definition will be formalized shortly. The previous work reviewed in the last section assumes that the annealing path has the form π_t ∝ π_0^{1−t} π_1^t, i.e., that the annealing path linearly interpolates between the log-densities. A natural question is whether using other paths could lead to an improved round trip rate. In this work we show that the answer to this question is positive: the following proposition demonstrates that the traditional path π_t ∝ π_0^{1−t} π_1^t suffers from an arbitrarily suboptimal global communication barrier even in simple examples with Gaussian reference and target distributions.

Proposition 1.
Suppose the reference and target distributions are π_0 = N(μ_0, σ²) and π_1 = N(μ_1, σ²), and define z = |μ_1 − μ_0|/σ. Then as z → ∞,

1. the linear path π_t ∝ π_0^{1−t} π_1^t has τ_∞ = Θ(1/z), and
2. there exists a path of Gaussian distributions with τ_∞ = Ω(1/log z).

Figure 1.
Two annealing paths between a Gaussian reference π_0 (light blue) and a Gaussian target π_1 (dark blue): the traditional linear path (top) and an optimized nonlinear path (bottom). While the distributions in the linear path are nearly mutually singular, those in the optimized path overlap substantially, leading to faster round trips.

Therefore, upon decreasing the variance of the reference and target while holding their means fixed, the traditional linear annealing path obtains an exponentially smaller asymptotic round trip rate than the optimal path of Gaussians. Figure 1 provides an intuitive explanation. The standard path (top) corresponds to a set of Gaussian distributions whose means interpolate between the reference and target. If one reduces the variance of the reference and target, the variance of the distributions along the path shrinks as well. For any fixed N, these distributions become nearly mutually singular, leading to arbitrarily low round trip rates. The solution to this issue (bottom) is to allow the distributions along the path to have increased variances, thereby maintaining mutual overlap and the ability to swap components with a reasonable probability. This motivates the need to design more general annealing paths. In the following, we introduce the precise general definition of an annealing path, an analysis of path communication efficiency in parallel tempering, and a rigorous formulation of, and solution to, the problem of tuning path parameters to maximize communication efficiency.

Let P(X) be the set of probability densities with full support on a state space X.
For any collection of densities π_t ∈ P(X) with index t ∈ [0, 1], associate to each a log-density function W_t such that

π_t(x) = (1/Z_t) exp(W_t(x)),  x ∈ X,   (4)

where Z_t = ∫_X exp(W_t(x)) dx is the normalizing constant. Definition 1 provides the conditions necessary to form a path of distributions from a reference π_0 to a target π_1 that are well-behaved for use in parallel tempering.

Definition 1. An annealing path for a target density π is a map [0, 1] → P(X), denoted t ↦ π_t, such that:

1. the right endpoint is the target distribution, π_1 = π;
2. for each x ∈ X, the log-density W_t(x): [0, 1] → R is piecewise twice continuously differentiable in t; and
3. there exist V_1, V_2: X → [0, ∞) such that for all x ∈ X and all t ∈ [0, 1] with defined derivatives,

|dW_t/dt (x)| ≤ V_1(x),  |d²W_t/dt² (x)| ≤ V_2(x),  and  sup_{t ∈ [0,1]} E_{π_t}[V_1² + V_2] < ∞.

There are many ways to move beyond the standard linear path π_t ∝ π_0^{1−t} π_1^t. For example, consider a nonlinear path π_t ∝ π_0^{η_0(t)} π_1^{η_1(t)}, where η_i: [0, 1] → R are functions such that η_0(0) = η_1(1) = 1 and η_0(1) = η_1(0) = 0. An example of such functions is the cubic Bézier curve

η_0(t) = (1 − t)((1 − t)² + 3(1 − t)t + 3t²),
η_1(t) = t(3(1 − t)² + 3(1 − t)t + t²).

As long as for all t ∈ [0, 1], π_t is a normalizable density and the appropriate moment bounds exist per Definition 1, this is a valid annealing path. Further, note that the path parameter does not necessarily have to appear as an exponent: consider for example the mixture path π_t ∝ (1 − t)π_0 + tπ_1. Section 4 provides a more detailed example based on linear splines.

Given a particular annealing path satisfying Definition 1, we require a method to characterize the round trip rate performance of parallel tempering based on that path.
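As a quick concreteness check on the Bézier example above, the exponents can be evaluated directly. The helper names below are hypothetical; the endpoint conditions η_0(0) = η_1(1) = 1 and η_0(1) = η_1(0) = 0 are the ones required of a valid nonlinear path of this form:

```python
def eta_bezier(t):
    """Cubic Bezier annealing exponents: eta0 runs 1 -> 0, eta1 runs 0 -> 1.
    Note eta0(t) + eta1(t) need not equal 1 in the interior of [0, 1], so
    the path leaves the family of linear (power-posterior) paths."""
    e0 = (1 - t) * ((1 - t) ** 2 + 3 * (1 - t) * t + 3 * t ** 2)
    e1 = t * (3 * (1 - t) ** 2 + 3 * (1 - t) * t + t ** 2)
    return e0, e1

def log_density_unnormalized(x, t, W0, W1):
    """Unnormalized log-density W_t(x) = eta0(t) W0(x) + eta1(t) W1(x)."""
    e0, e1 = eta_bezier(t)
    return e0 * W0(x) + e1 * W1(x)
```

At t = 1/2 both exponents equal 0.875, so they sum to more than one there; the midpoint distribution therefore differs from every distribution on the linear path.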
The results presented in this section form the basis of the objective function used to optimize over paths, as well as the foundation for the proof of Proposition 1.

We start with some notation for the rejection rates involved when Algorithm 1 is used with nonlinear paths (Equation (4)). For t, t′ ∈ [0, 1] and x ∈ X, define the rejection rate function r: [0, 1]² → [0, 1] to be

r(t, t′) = 1 − E[exp(min{0, A_{t,t′}(X, X′)})],
A_{t,t′}(x, x′) = (W_{t′}(x) − W_t(x)) − (W_{t′}(x′) − W_t(x′)),

where X ∼ π_t and X′ ∼ π_{t′} are independent. Assuming that all chains have reached stationarity, and all chains undergo efficient local exploration, i.e., W_t(X) and W_t(X̃) are independent when X ∼ π_t and X̃ is generated by local exploration from X, the round trip rate for a particular schedule T_N has the form given earlier in Equation (2). This statement follows from (Syed et al., 2019, Cor. 1) without modification, because the proof of that result does not depend on the form of the acceptance ratio.

Our next objective is to characterize the asymptotic communication efficiency of a nonlinear path in the regime where N → ∞ and ‖T_N‖ → 0, which establishes its fundamental ability to take advantage of parallel computation. In other words, we require a generalization of the asymptotic result in Equation (3). In this case, previous theory relies on the particular form of the acceptance ratio for linear paths (Syed et al., 2019); in the remainder of this section, we provide a generalization of the asymptotic result for nonlinear paths by piecewise approximation with a linear spline.

For t, t′ ∈ [0, 1], define Λ(t, t′) to be the global communication barrier for the linear secant connecting π_t and π_{t′},

Λ(t, t′) = (1/2) ∫_0^1 E[|A_{t,t′}(X_s, X′_s)|] ds,   (5)

where X_s, X′_s are i.i.d. with density proportional to exp((1 − s)W_t + sW_{t′}) and normalizing constant Z_{t,t′}(s). Lemma 1 shows that the global communication barrier Λ(t, t′) along the secant of the path connecting t to t′ is a good approximation of the true rejection rate r(t, t′), with O(|t − t′|³) error as t → t′.

Lemma 1.
For any annealing path, there exists a constant C < ∞, independent of t, t′, such that for all t, t′ ∈ [0, 1],

|r(t, t′) − Λ(t, t′)| ≤ C|t − t′|³.   (6)

A direct consequence of Lemma 1 is that for any fixed schedule T_N,

|Σ_{n=0}^{N−1} r(t_n, t_{n+1}) − Λ(T_N)| ≤ C‖T_N‖²,   (7)

where Λ(T_N) = Σ_{n=0}^{N−1} Λ(t_n, t_{n+1}). Intuitively, in the ‖T_N‖ ≈ 0 regime where rejection rates are low,

r(t_n, t_{n+1}) / (1 − r(t_n, t_{n+1})) ≈ r(t_n, t_{n+1}),

and we have that

τ(T_N) ≈ (2 + 2Λ(T_N))^{−1}.

Therefore, Λ(T_N) characterizes the communication efficiency of the path in a way that naturally extends the global communication barrier from the linear path case. Theorem 2 provides the precise statement: the convergence is uniform in T_N and depends only on ‖T_N‖, and Λ(T_N) itself converges to a constant Λ in the asymptotic regime. We refer to Λ, defined below in Equation (9), as the global communication barrier for the general annealing path.

Theorem 2.
If there exists an ε > 0 such that sup_{t ∈ [0,1]} E_{π_t}[(1 + V_2) exp(εV_1)] < ∞, then

lim_{ε→0} sup_{T_N: ‖T_N‖ ≤ ε} |(2 + 2Λ(T_N))^{−1} − τ(T_N)| = 0   (8)

and

lim_{ε→0} sup_{T_N: ‖T_N‖ ≤ ε} |Λ(T_N) − Λ| = 0,   (9)

where Λ = ∫_0^1 λ(t) dt for an instantaneous rejection rate function λ: [0, 1] → [0, ∞) given by

λ(t) = lim_{Δt→0} r(t + Δt, t)/|Δt| = (1/2) E[|dW_t/dt (X_t) − dW_t/dt (X′_t)|],  X_t, X′_t i.i.d. ∼ π_t.

The integrability condition is required to control the tail behaviour of distributions formed by linearized approximations to the path π_t. This condition is satisfied by a wide range of annealing paths, e.g., the linear spline paths proposed in this work in Section 4, since in that case V_2 = 0.

It is often the case that there is a set of candidate annealing paths in consideration for a particular target π. For example, if a path has tunable parameters φ ∈ Φ that govern its shape, we can generate a collection of annealing paths that all target π by varying the parameter φ. We call such collections an annealing path family.

Definition 3. An annealing path family for π is a set of annealing paths π_t that target π.

There are many ways to construct useful annealing path families. For example, if one is provided a parametric family of variational distributions {q_φ: φ ∈ Φ} for some parameter space Φ, one can construct the annealing path family of linear paths q_φ^{1−t} π^t from a variational reference to the target π. Similarly, building on the Bézier curve path from earlier, one can consider a parametrized Bézier curve

η_0(t) = (1 − t)((1 − t)² + 3φ_1(1 − t)t + 3φ_2 t²),
η_1(t) = t(3φ_3(1 − t)² + 3φ_4(1 − t)t + t²),

with tunable parameter φ = (φ_1, φ_2, φ_3, φ_4) ∈ R⁴.
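Before turning to path families, the instantaneous rejection rate in Theorem 2 can be checked numerically whenever π_t can be sampled directly. The sketch below is an illustrative, assumption-laden example (exact sampling, equal-variance Gaussian endpoints, linear path) for which λ(t) = |μ_1 − μ_0|/(σ√π) in closed form, so Λ grows linearly in z = |μ_1 − μ_0|/σ, consistent with the Θ(1/z) round trip rate in Proposition 1:

```python
import numpy as np

def lambda_linear_gaussian(t, mu0, mu1, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of lambda(t) = 0.5 E|dW_t/dt(X) - dW_t/dt(X')|
    for the linear path between N(mu0, sigma^2) and N(mu1, sigma^2), where
    pi_t = N((1 - t) mu0 + t mu1, sigma^2) and
    dW_t/dt(y) = (mu1 - mu0)(2y - mu0 - mu1) / (2 sigma^2)."""
    rng = np.random.default_rng(seed)
    m_t = (1 - t) * mu0 + t * mu1
    x, xp = rng.normal(m_t, sigma, size=(2, n))  # i.i.d. draws from pi_t
    dW = lambda y: (mu1 - mu0) * (2 * y - mu0 - mu1) / (2 * sigma ** 2)
    return 0.5 * float(np.mean(np.abs(dW(x) - dW(xp))))
```

For μ_0 = −1, μ_1 = 1, σ = 1 the estimate is close to 2/√π ≈ 1.13 at every t, so for this path λ is constant and the barrier Λ ≈ 1.13, as Theorem 2's integral formula predicts.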
If Φ is the subset of φ ∈ R⁴ such that for all t ∈ [0, 1], π_t is a normalizable density with the appropriate moment bounds per Definition 1, this is a valid annealing path family.

Since every path in an annealing path family has the desired target distribution π, we are free to optimize the path over the tuning parameter space φ ∈ Φ in addition to optimizing the schedule T_N.

Algorithm 2 PathOptNRPT

Require: initial state x_0, path family π_t^φ, parameter φ, N, M, S, learning rate γ
  T_N ← (0, 1/N, 2/N, . . . , 1)
  for s = 1 to S do
    {x_m}_{m=1}^M, (r_n)_{n=0}^{N−1} ← NRPT(x_0, π_t^φ, T_N, M)
    λ^φ ← CommunicationBarrier(T_N, {r_n})
    T_N ← UpdateSchedule(λ^φ, N)
    φ ← φ − γ ∇_φ Σ_{n=0}^{N−1} SKL(π_{t_n}^φ, π_{t_{n+1}}^φ)
    x_0 ← x_M
  end for
Return: φ, T_N

Motivated by the analysis of Section 3.2, a natural objective function for this optimization to consider is the non-asymptotic round trip rate:

φ*, T*_N = argmax_{φ ∈ Φ, T_N} τ^φ(T_N)   (10)
         = argmin_{φ ∈ Φ, T_N} Σ_{n=0}^{N−1} r^φ(t_n, t_{n+1}) / (1 − r^φ(t_n, t_{n+1})),   (11)

where now the round trip rate and rejection rates depend both on the schedule and the path parameter, denoted by superscript φ. We solve this optimization using an approximate coordinate-descent procedure, iterating between an update of the schedule T_N for a fixed path parameter φ ∈ Φ, followed by a gradient step in φ based on a surrogate objective function and a fixed schedule. This is summarized in Algorithm 2. We outline the details of the schedule and path tuning procedures in the following.

Tuning the schedule
Fix the value of φ, which fixes the path. We adapt a schedule tuning algorithm from past work to update the schedule T_N = (t_n)_{n=0}^N (Syed et al., 2019, Section 5.1). Based on the same argument as this previous work, we obtain that when ‖T_N‖ ≈ 0, the non-asymptotic round trip rate is maximized when the rejection rates are all equal. The schedule that approximately achieves this satisfies

∀n ∈ {1, . . . , N − 1}:  ∫_0^{t_n} λ^φ(s) ds = (n/N) Λ^φ.   (12)

We use Monte Carlo estimates of the rejection rates r^φ(t_n, t_{n+1}) in combination with Theorem 2 to approximate t ↦ ∫_0^t λ^φ(s) ds for t ∈ [0, 1], and then use bisection search to solve for each t_n according to Equation (12).

Optimizing the path
Fix the schedule T_N; we now want to improve the path itself by modifying φ. (We assume that the optimization over φ ends after a finite number of iterations, to sidestep the potential pitfalls of adaptive MCMC methods (Andrieu & Moulines, 2006).) However, this is not as simple as taking a gradient step for the objective in Equation (10). In particular, it is typical in early iterations, when a poor path is initialized, that the rejection rates r^φ(t_n, t_{n+1}) ≈ 1. As demonstrated empirically in Appendix F, gradient estimates in this regime exhibit a low signal-to-noise ratio that precludes their use for optimization.

Consider instead the global communication barrier Λ^φ(T_N) for the linear spline approximation to the path; Theorem 2 guarantees that as long as ‖T_N‖ is small enough, one can optimize Λ^φ(T_N) in place of the round trip rate τ^φ(T_N). First note that by Jensen's inequality,

Λ^φ(T_N)² / N ≤ Σ_{n=0}^{N−1} Λ^φ(t_n, t_{n+1})².

Next, we apply Jensen's inequality again to the definition of Λ^φ(t_n, t_{n+1}) from (5), which shows that

Λ^φ(t_n, t_{n+1})² ≤ (1/4) ∫_0^1 E[A_{t_n,t_{n+1}}(X_s, X′_s)²] ds,

where X_s, X′_s are as defined in (5), drawn from the linear path between π_{t_n}^φ and π_{t_{n+1}}^φ. Finally, we note that the inner expectation is the path integral of the Fisher information metric along the linear path and evaluates to the symmetric KL divergence (Dabak & Johnson, 2002, Result 4),

∫_0^1 E[A_{t_n,t_{n+1}}(X_s, X′_s)²] ds = 2 SKL(π_{t_n}^φ, π_{t_{n+1}}^φ).

Therefore we have

2 Λ^φ(T_N)² / N ≤ Σ_{n=0}^{N−1} SKL(π_{t_n}^φ, π_{t_{n+1}}^φ).   (13)

The slack in this inequality could potentially depend on φ even in the large N regime.
Therefore, during optimization, we recommend monitoring the value of the original objective function (Equation (10)) to ensure that the optimization of the surrogate SKL objective indeed improves it, and hence the round trip rate performance of PT via Equation (2). In the experiments we display the values of both objective functions.
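Two of the subroutines appearing in Algorithm 2 can be sketched concretely. The first inverts the estimated cumulative barrier so that each leg carries an equal share of it (Equation (12)); the second evaluates the SKL surrogate of Equation (13) in the special case of a path of univariate Gaussians, where the divergence has a closed form. The function names and the Gaussian specialization are assumptions of this sketch rather than details fixed by the text:

```python
import numpy as np

def update_schedule(ts, rejections):
    """Choose a new schedule so each leg carries an equal share of the
    estimated barrier (Equation (12)). `rejections` holds the estimated
    rejection rate of each of the N legs of the current schedule `ts`;
    for fine schedules their running sum approximates t -> int_0^t lambda."""
    ts = np.asarray(ts, dtype=float)
    N = len(ts) - 1
    cum = np.concatenate([[0.0], np.cumsum(rejections)])  # cumulative barrier
    targets = cum[-1] * np.arange(N + 1) / N
    # cum is nondecreasing, so interpolating with the axes swapped inverts
    # it at the equally spaced targets (the role of the bisection search)
    return np.interp(targets, cum, ts)

def skl_gauss(m1, s1, m2, s2):
    """Closed-form symmetric KL divergence between N(m1, s1^2), N(m2, s2^2)."""
    d2 = (m1 - m2) ** 2
    return (s1 ** 2 + d2) / (2 * s2 ** 2) + (s2 ** 2 + d2) / (2 * s1 ** 2) - 1.0

def skl_surrogate(ts, mean, std):
    """Right-hand side of Equation (13) for a Gaussian path pi_t whose
    mean and standard deviation are given by mean(t) and std(t)."""
    return sum(skl_gauss(mean(a), std(a), mean(b), std(b))
               for a, b in zip(ts[:-1], ts[1:]))
```

Shrinking the endpoint variances of a linear Gaussian path blows up each SKL term, while paths that inflate the intermediate variance keep the sum moderate; this sum is the quantity decreased by the gradient step in Algorithm 2.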
4. Spline annealing path family
In this section, we develop a family of annealing paths, the spline annealing path family, that offers a practical and flexible improvement upon the traditional linear paths considered in past work. We first define a general family of annealing paths based on the exponential family, and then provide the specific details of the spline family with a discussion of its properties. Empirical results in Section 5 demonstrate that the spline annealing path family resolves the problematic Gaussian annealing example in Figure 1.
We begin with the practical desiderata for an annealing path family given a fixed reference π_0 and target π_1. First, the traditional linear path π_t ∝ π_0^{1−t} π_1^t should be a member of the family, such that one can achieve at least the round trip rate provided by that path. Second, the family should be broadly applicable and not depend on particular details of either π_0 or π_1. Finally, using the Gaussian example from Figure 1 and Proposition 1 as insight, the family should enable the path to smoothly vary from π_0 to π_1 while inflating or deflating the variance as necessary.

These desiderata motivate the design of the exponential annealing path family, in which each annealing path takes the form

π_t ∝ π_0^{η_0(t)} π_1^{η_1(t)} = exp(η(t)^T W(x)),

for some function η(t) = (η_0(t), η_1(t)) and reference/target log-densities W(x) = (W_0(x), W_1(x)). The Bézier curves from Section 3 are an example. Intuitively, η_0(t) and η_1(t) represent the level of annealing for the reference and target respectively along the path. Proposition 2 shows that a broad collection of functions η indeed constructs a valid annealing path family including the linear path. Without loss of generality, we assume any piecewise twice continuously differentiable function η: [0, 1] → R² is reparameterized such that for t ∈ [0, 1] where η′(t) exists, ‖η′(t)‖ = L, where L is the length of the path.

Proposition 2.
Let Ω ⊆ R² be the set

Ω = {ξ ∈ R²: ∫ exp(ξ^T W(x))(1 + ‖W(x)‖²) dx < ∞}.

Suppose (0, 1) ∈ Ω and A_M is the set of piecewise twice continuously differentiable functions η: [0, 1] → Ω such that η(1) = (0, 1) and sup_t ‖η′′(t)‖ ≤ M. Then

{π_t(x) ∝ exp(η(t)^T W(x)): η ∈ A_M}

is an annealing path family for the target distribution π_1. Moreover, if (1, 0) ∈ Ω, then A_M contains the linear path π_t ∝ π_0^{1−t} π_1^t.

Proposition 2 reduces the problem of designing a general family of paths of probability distributions to the much simpler task of designing paths in R². We argue that a good choice can be constructed using linear spline paths connecting the knots φ = (φ_0, . . . , φ_K) ∈ (R²)^{K+1}, i.e., for all k ∈ {1, . . . , K} and t ∈ [(k−1)/K, k/K],

η^φ(t) = (k − Kt) φ_{k−1} + (Kt − k + 1) φ_k.

(A natural extension of this discussion would include parametrized variational reference distribution families; for simplicity we restrict to a fixed reference.)
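For Gaussian endpoints, membership in Ω can be read off in closed form: with W_i(x) = −(x − μ_i)²/(2σ_i²), the function exp(ξ^T W(x)) is Gaussian-shaped with precision ξ_0/σ_0² + ξ_1/σ_1², and it is normalizable with the required finite moments exactly when this precision is positive. The hypothetical sketch below returns the mean and variance of the path distribution at a point ξ = (ξ_0, ξ_1), making the variance-inflation mechanism of Figure 1 explicit: with equal endpoint variances, points with ξ_0 + ξ_1 < 1 have inflated variance.

```python
def gaussian_path_params(xi, mu, sigma):
    """Mean and variance of the density proportional to exp(xi^T W(x)) for
    Gaussian endpoint log-densities W_i(x) = -(x - mu[i])^2 / (2 sigma[i]^2).
    Raises ValueError when xi lies outside Omega (non-normalizable)."""
    prec = xi[0] / sigma[0] ** 2 + xi[1] / sigma[1] ** 2  # combined precision
    if prec <= 0:
        raise ValueError("xi is outside Omega: exp(xi^T W) is not normalizable")
    mean = (xi[0] * mu[0] / sigma[0] ** 2 + xi[1] * mu[1] / sigma[1] ** 2) / prec
    return mean, 1.0 / prec
```

For example, with μ = (−1, 1) and σ = (1, 1), the point ξ = (1, 0) recovers the reference N(−1, 1), while ξ = (1/4, 1/4) gives a centred distribution with variance 2, i.e., an inflated intermediate distribution of the kind Figure 1 (bottom) relies on.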
Let Ω be defined as in Proposition 2. The K-knot spline annealing path family is defined as the set of K-knot linear spline paths such that φ_0 = (1, 0), φ_K = (0, 1), and φ_k ∈ Ω for all k.

Validity:
Since Ω is a convex set per the proof of Proposition 2, we are guaranteed that η^φ([0, 1]) ⊆ Ω, and so this collection of annealing paths is a subset of the exponential annealing family and hence forms a valid annealing path family targeting π_1.

Convexity:
Furthermore, convexity of Ω implies that tuning the knots φ ∈ Ω^{K+1} involves optimization within a convex constraint set. In practice, we also enforce that the knots are monotone in each component, i.e., the first component monotonically decreases from 1 to 0 and the second increases from 0 to 1, such that the path of distributions always moves from the reference to the target. Because monotonicity constraint sets are linear and hence convex, the overall monotonicity-constrained optimization problem has a convex domain.

Flexibility:
Assuming the family is nonempty, it trivially contains the linear path. Further, given a large enough number of knots K, the spline annealing family well-approximates subsets of the exponential annealing family for fixed M > 0. In particular,

sup_{η ∈ A_M} inf_{φ ∈ Ω^{K+1}} ‖η^φ − η‖_∞ ≤ M/K².   (14)

Figure 2 provides an illustration of the behaviour of optimized spline paths for a Gaussian reference and target. The path takes a convex curved shape; starting at the bottom right point of the figure (reference), this path corresponds to increasing the variance of the reference, shifting the mean from reference to target, and finally decreasing the variance to match the target. With more knots, this process happens more smoothly.

Appendix D provides an explicit derivation of the stochastic gradient estimates we use to optimize the knots of the spline annealing path family. It is also worth making two practical notes. First, we cannot use the usual 2-norm projection to enforce the monotonicity constraint, because this causes an undesirable collapse of knots. Therefore, we maintain monotonicity as follows: after each stochastic gradient step, we identify a monotone subsequence of knots containing the endpoints, remove the nonmonotone jumps, and linearly interpolate between the knot subsequence with an even spacing. Second, we take gradient steps in a log-transformed space such that knot components are always strictly positive.

Figure 2. Optimized spline paths for increasing numbers of knots K, for the family generated by a Gaussian reference π_0 and target π_1.
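The spline path η^φ is also simple to evaluate; the following sketch (with made-up knot values rather than optimized ones) implements the piecewise-linear formula from this section:

```python
import numpy as np

def eta_spline(t, knots):
    """Evaluate the K-knot linear spline path eta^phi(t) for t in [0, 1].

    knots : sequence of K+1 points in R^2 with knots[0] = (1, 0) and
            knots[K] = (0, 1); segment k covers t in [(k-1)/K, k/K], where
            eta^phi(t) = (k - K t) phi_{k-1} + (K t - k + 1) phi_k.
    """
    knots = np.asarray(knots, dtype=float)
    K = len(knots) - 1
    k = min(max(int(np.ceil(t * K)), 1), K)  # segment index in {1, ..., K}
    w = k - t * K                            # weight on knot k-1
    return w * knots[k - 1] + (1 - w) * knots[k]
```

With the interior knot (0.7, 0.7), for instance, the two exponents sum to 1.4 at the midpoint, so the midpoint distribution leaves the linear path, exactly the extra freedom the optimization exploits.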
5. Experiments
In this section, we study the empirical performance of nonreversible PT based on the spline annealing path family from Section 4, for several values of the number of knots K, with knots and schedule optimized using the tuning method from Section 3. We compare this method to two PT methods based on standard linear paths: nonreversible PT with an adaptive schedule ("NRPT+Linear") (Syed et al., 2019), and reversible PT ("Reversible+Linear") (Atchadé et al., 2011).

We use the terminology "scan" to denote one iteration of the for loop in Algorithm 2. The computational cost of a scan is comparable for all the methods, since the bottleneck is the local exploration step shared by all methods. These experiments demonstrate two primary conclusions: (1) tuned nonlinear paths provide a substantially higher round trip rate compared to linear paths for the examples considered; and (2) the symmetric KL sum objective (Equation (13)) is a good proxy for the round trip rate as a tuning objective.

We run the following benchmark problems; see the supplement for details. Gaussian: a synthetic setup in which the reference and target are two well-separated, nearly mutually singular Gaussian distributions. For this example we used N = 50 parallel chains and fixed the computational budget to 45000 samples. For Algorithm 2, the computational budget was divided equally over 150 scans, meaning 300 samples were used for every gradient update. "Reversible+Linear" performed 45000 local exploration steps with a communication step after every iteration, while for "NRPT+Linear" the computational budget was used to adapt the schedule. The gradient updates were performed using Adagrad with a learning rate equal to 0.2. Beta-binomial model: a conjugate Bayesian model with a Beta prior π_0(p) and likelihood L(x | p) = Binomial(x | n, p). We simulated data x_1, . . . , x_n from a Binomial distribution with 100 trials, resulting in a Beta posterior. The prior and posterior are heavily concentrated at 0.2 and 0.7 respectively.
Figure 3. Top: cumulative round trips averaged over 10 runs for the spline path with $K = 2, \dots, 10$ (solid blue), NRPT using a linear path (dashed green), and reversible PT with a linear path (dash/dot red). The slope of the lines represents the round trip rate. We observe large gains going from linear to nonlinear paths ($K > 1$). For all values of $K > 1$, the optimized spline path substantially improves on the theoretical upper bound on the round trip rate possible using a linear path (dotted black). Bottom: non-asymptotic communication barrier from Equation 11 (solid blue) and symmetric KL (dashed orange) as a function of iteration for one run of PathOptNRPT + Spline ($K = 4$ knots). We used the same settings as for the Gaussian example.
Mixture model: a Bayesian Gaussian mixture model with mixture proportions $w_1, w_2$, mixture component densities $N(\mu_i, \sigma^2)$ (fixed component variance) for mean parameters $\mu_1, \mu_2$, and a binary cluster label for each data point. We place a $\mathrm{Dir}(1, 1)$ prior on the proportions, and a normal prior centered at 150 on each of the two mean parameters. We simulated $n = 1000$ data points from a two-component mixture of Gaussians centered at 100 and 200. Exploring the full posterior distribution is challenging in this context due to a combination of misspecification of the prior and label switching. In this example we used $N = 35$ chains and fixed the computational budget to 25000 samples, divided into 50 scans using 500 samples each. We optimized the path using Adagrad with a learning rate of 0.3. For all the experiments we performed one local exploration step before each communication step.

The results of these experiments are shown in Figure 3. Examining the top row, which shows the number of round trips as a function of the number of scans, one can see that PT using the spline annealing family outperforms PT using the linear annealing path across all numbers of knots tested. Moreover, the slope of these curves demonstrates that PT with the spline annealing family exceeds the theoretical upper bound on the round trip rate for the linear annealing path (from Equation (8)). The largest gain is obtained in going from $K = 1$ (linear) to $K = 2$. For the Gaussian and beta-binomial examples, increasing the number of knots further leads to marginal improvements. In the case of the Gaussian example, note that since the global communication barrier $\Lambda$ for the linear path is much larger than $N$, algorithms based on linear paths incurred rejection rates of nearly one for most chains, resulting in no round trips. For the mixture example, we note that once one adds a large number ($K = 10$) of knots, the path optimization proceeds more slowly.

The bottom row of Figure 3 shows the value of the surrogate SKL objective and the non-asymptotic communication barrier from Equation 11.
In particular, these figures demonstrate that the SKL provides a surrogate objective that is a reasonable proxy for the non-asymptotic communication barrier, but does not exhibit as large an estimation variance in early iterations, when there are pairs of chains with rejection rates close to one.
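The SKL objective of Equation 13 is available in closed form on the Gaussian benchmark: the linear path's marginals are $\pi_t = N(\mu_t, \sigma^2)$ with $\mu_t = (1-t)\mu_0 + t\mu_1$ (Appendix A), and for unit-scale Gaussians $\mathrm{SKL}(N(a, s^2), N(b, s^2)) = (a-b)^2/s^2$. A sketch evaluating the objective on an equally spaced schedule:

```python
# SKL objective (Eq. 13) for the linear Gaussian path on an equally spaced schedule.
mu0, mu1, sigma, N = -1.0, 1.0, 0.01, 50
ts = [n / N for n in range(N + 1)]
means = [(1 - t) * mu0 + t * mu1 for t in ts]  # pi_t = N(mu_t, sigma^2)

# SKL(N(a, s^2), N(b, s^2)) = (a - b)^2 / s^2, summed over adjacent chains:
skl_sum = sum((means[n + 1] - means[n]) ** 2 / sigma**2 for n in range(N))
print(f"SKL objective on the equally spaced schedule: {skl_sum:.1f}")
```

With these values the sum equals $4/(N\sigma^2) = 800$; the tuning procedure of Section 3 works by driving this quantity down over paths and schedules.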
6. Discussion
In this work, we identified the use of linear paths of distributions as a major bottleneck in the performance of parallel tempering algorithms. To address this limitation, we have provided a theory of parallel tempering on nonlinear paths, a methodology to tune parametrized paths, and finally a practical, flexible family of paths based on linear splines. Future work in this line of research includes extensions to estimate normalization constants, as well as the development of techniques and theory surrounding the use of variational reference distributions.
References
Andrieu, C. and Moulines, E. On the ergodicity properties of some adaptive MCMC algorithms. Annals of Applied Probability, 16(3):1462-1505, 2006.

Atchadé, Y. F., Roberts, G. O., and Rosenthal, J. S. Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4):555-568, 2011.

Ballnus, B., Hug, S., Hatz, K., Görlitz, L., Hasenauer, J., and Theis, F. J. Comprehensive benchmarking of Markov chain Monte Carlo methods for dynamical systems. BMC Systems Biology, 11(1):63, 2017.

Brekelmans, R., Masrani, V., Bui, T., Wood, F., Galstyan, A., Steeg, G. V., and Nielsen, F. Annealed importance sampling with q-paths. arXiv:2012.07823, 2020.

Costa, S. I., Santos, S. A., and Strapasson, J. E. Fisher information distance: A geometrical reading. Discrete Applied Mathematics, 197:59-69, 2015.

Dabak, A. G. and Johnson, D. H. Relations between Kullback-Leibler distance and Fisher information. Journal of The Iranian Statistical Society, 5:25-37, 2002.

Desjardins, G., Luo, H., Courville, A., and Bengio, Y. Deep tempering. arXiv:1410.0123, 2014.

Gelman, A. and Meng, X.-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pp. 163-185, 1998.

Geyer, C. J. Markov chain Monte Carlo maximum likelihood. Interface Proceedings, 1991.

Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems, 2013.

Kamberaj, H. Molecular Dynamics Simulations in Statistical Physics: Theory and Applications. Springer, 2020.

Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1-3):80-84, 2009.

Müller, N. F. and Bouckaert, R. R. Adaptive parallel tempering for BEAST 2. bioRxiv, 2020. doi: 10.1101/603514.

Okabe, T., Kawata, M., Okamoto, Y., and Mikami, M. Replica-exchange Monte Carlo method for the isobaric-isothermal ensemble. Chemical Physics Letters, 335(5-6):435-439, 2001.

Predescu, C., Predescu, M., and Ciobanu, C. V. The incomplete beta function law for parallel tempering sampling of classical canonical systems. The Journal of Chemical Physics, 120(9):4119-4128, 2004.

Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018.

Rischard, M., Jacob, P. E., and Pillai, N. Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation. arXiv:1810.01382, 2018.

Sakai, Y. and Hukushima, K. Irreversible simulated tempering. Journal of the Physical Society of Japan, 85(10):104002, 2016.

Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: an embarrassingly parallel MCMC scheme. arXiv:1905.02939, 2019.

Tawn, N. G., Roberts, G. O., and Rosenthal, J. S. Weight-preserving simulated tempering. Statistics and Computing, 30(1):27-41, 2020.

Woodard, D. B., Schmidler, S. C., and Huber, M. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, 19(2):617-640, 2009.

Zhou, Y., Johansen, A. M., and Aston, J. A. Toward automatic model comparison: An adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701-726, 2016.
A. Proof of Proposition 1
Define $\pi_0 = N(\mu_0, \sigma^2)$ and $\pi_1 = N(\mu_1, \sigma^2)$ with $W_i(x) \propto -\frac{1}{2\sigma^2}(x - \mu_i)^2$ (throughout we use the proportionality symbol $\propto$ with log-densities to indicate an unspecified additive constant). Suppose $\pi_t$ is the linear path $\pi_t(x) \propto \exp(W_t)$ where $W_t = (1-t)W_0 + tW_1$. Note that as a function of $x$,

$$W_t(x) \propto -\frac{1-t}{2\sigma^2}(x - \mu_0)^2 - \frac{t}{2\sigma^2}(x - \mu_1)^2 \propto -\frac{1}{2\sigma^2}(x - \mu_t)^2, \qquad \mu_t = (1-t)\mu_0 + t\mu_1,$$

and thus $\pi_t = N(\mu_t, \sigma^2)$. Taking a derivative of $W_t$, we find that

$$\frac{dW_t}{dt} = \frac{(\mu_1 - \mu_0)\left(x - \frac{\mu_0 + \mu_1}{2}\right)}{\sigma^2}.$$

We will now compute $\lambda(t)$. If $X_t, X'_t \sim \pi_t$, then

$$\lambda(t) = \frac{1}{2}\,\mathbb{E}\left[\left|\frac{dW}{dt}(X_t) - \frac{dW}{dt}(X'_t)\right|\right] = \frac{|\mu_1 - \mu_0|}{2\sigma^2}\,\mathbb{E}\left[|X_t - X'_t|\right] = \frac{|\mu_1 - \mu_0|}{2\sigma}\,\mathbb{E}\left[|Z - Z'|\right],$$

where $Z, Z' \sim N(0, 1)$. Thus $Z - Z' \sim N(0, 2)$, and $|Z - Z'|$ has a folded normal distribution with expectation $2/\sqrt{\pi}$. This implies $\lambda(t) = z/\sqrt{\pi}$ where $z = |\mu_1 - \mu_0|/\sigma$, and $\Lambda = \int_0^1 \lambda(t)\,dt = z/\sqrt{\pi}$. By Theorem 2, the asymptotic round trip rate $\tau^{\mathrm{linear}}_\infty$ for the linear path satisfies

$$\tau^{\mathrm{linear}}_\infty = \frac{1}{2 + 2\Lambda} = \frac{1}{2 + 2z/\sqrt{\pi}} = \Theta\left(\frac{1}{z}\right).$$

We will now establish an upper bound on the communication barrier $\Lambda$ for a general path $\pi_t$. If $X_t, X'_t \sim \pi_t$, then Theorem 2 and Jensen's inequality imply the following:

$$\Lambda = \int_0^1 \frac{1}{2}\,\mathbb{E}\left[\left|\frac{dW}{dt}(X_t) - \frac{dW}{dt}(X'_t)\right|\right] dt \le \int_0^1 \frac{1}{2}\sqrt{\mathbb{E}\left[\left(\frac{dW}{dt}(X_t) - \frac{dW}{dt}(X'_t)\right)^2\right]}\, dt = \frac{1}{\sqrt{2}}\int_0^1 \sqrt{\mathrm{Var}_{\pi_t}\left[\frac{dW_t}{dt}\right]}\, dt = \frac{1}{\sqrt{2}}\Lambda_F,$$

where $\Lambda_F$ is the length of the path $\pi_t$ under the Fisher information metric.
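The folded-normal expectation used above, $\mathbb{E}|Z - Z'| = 2/\sqrt{\pi}$ for independent standard normals, is easy to confirm by simulation (a quick sketch):

```python
import math
import random

random.seed(0)
n = 200_000
# |Z - Z'| for independent standard normals is folded normal with mean 2/sqrt(pi).
mc_mean = sum(abs(random.gauss(0, 1) - random.gauss(0, 1)) for _ in range(n)) / n
exact = 2 / math.sqrt(math.pi)
print(f"Monte Carlo: {mc_mean:.4f}, exact 2/sqrt(pi): {exact:.4f}")
```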
The geodesic path of Gaussians between $\pi_0$ and $\pi_1$ that minimizes $\Lambda_F$ satisfies (Costa et al., 2015, Eq. 11, Sec. 2)

$$\Lambda_F = 2\sqrt{2}\,\log\left(\frac{z}{2\sqrt{2}} + \sqrt{1 + \frac{z^2}{8}}\right). \qquad (15)$$

Again, by Theorem 2, the asymptotic round trip rate $\tau^{\mathrm{geodesic}}_\infty$ for the geodesic path satisfies

$$\tau^{\mathrm{geodesic}}_\infty = \frac{1}{2 + 2\Lambda} \ge \frac{1}{2 + \sqrt{2}\,\Lambda_F} = \Theta\left(\frac{1}{\log z}\right).$$

B. Proof of Lemma 1
Definition 4.
Given a path $\pi_t$ and a measurable function $f$, we denote $\|f\|_\pi = \sup_t \mathbb{E}_{\pi_t}[f]$.

Following the computation in Predescu et al. (2004, Proposition 1), we have

$$r(t, t') = 1 - \frac{\mathbb{E}\left[\exp\left(-|A_{t,t'}(\tilde{X}_{1/2}, \tilde{X}'_{1/2})|/2\right)\right]}{\mathbb{E}\left[\exp\left(-A_{t,t'}(\tilde{X}_{1/2}, \tilde{X}'_{1/2})/2\right)\right]}, \qquad (16)$$

where $\tilde{X}_s, \tilde{X}'_s \sim \tilde{\pi}_s = Z(s)^{-1}\exp((1-s)W_t + sW_{t'})$ and

$$A_{t,t'}(x, x') = (W_{t'}(x) - W_t(x)) - (W_{t'}(x') - W_t(x')).$$

In particular, the path of distributions $\tilde{\pi}_s$ for $s \in [0, 1]$ is the linear path between $\pi_t$ and $\pi_{t'}$.

Lemma 2. For all $k \ge 1$, there is a (possibly infinite) constant $\tilde{C}_k$ independent of $t, t', \mathcal{T}_N$ such that

$$\sup_s \mathbb{E}\left[|A_{t,t'}(\tilde{X}_s, \tilde{X}'_s)|^k\right] \le \tilde{C}_k |t' - t|^k,$$

where $\tilde{X}_s, \tilde{X}'_s \sim \tilde{\pi}_s$.

Proof. The mean-value theorem and condition 3 of Definition 1 imply that $W_t(x)$ is Lipschitz in $t$:

$$|W_t(x) - W_{t'}(x)| \le V_1(x)\,|t' - t|.$$

The triangle inequality therefore implies

$$|A_{t,t'}(x, x')| \le (V_1(x) + V_1(x'))\,|t' - t|.$$

By taking expectations and using the fact $|a + b|^k \le 2^{k-1}(|a|^k + |b|^k)$, we have that

$$\mathbb{E}\left[|A_{t,t'}(\tilde{X}_s, \tilde{X}'_s)|^k\right] \le 2^k\,\mathbb{E}_{\tilde{\pi}_s}\left[V_1^k\right]\,|t' - t|^k \le 2^k\,\|V_1^k\|_{\tilde{\pi}}\,|t - t'|^k.$$

We now begin the proof of Lemma 1. Define $\tilde{\lambda}(s) = \frac{1}{2}\mathbb{E}[|A_{t,t'}(\tilde{X}_s, \tilde{X}'_s)|]$ for $\tilde{X}_s, \tilde{X}'_s \sim \tilde{\pi}_s$. Then a third-order Taylor expansion of Equation (16) (Predescu et al., 2004), which contains terms of the form $\mathbb{E}[|A_{t,t'}(\tilde{X}_s, \tilde{X}'_s)|^k]$ that can be controlled via Lemma 2, yields

$$r(t, t') = \tilde{\lambda}(1/2) + R(t, t'), \qquad |R(t, t')| \le C'\,|t - t'|^3,$$

for some finite constant $C'$ independent of $t, t'$. By Syed et al. (2019, Prop. 2, Appendix C) we have that, in addition, $\tilde{\lambda}(s)$ is in $C^2([0, 1])$, and thus there is a constant $C''$ independent of $t, t'$ such that

$$\sup_s \left|\frac{d^2\tilde{\lambda}}{ds^2}\right| \le C''\,|t - t'|^3.$$

The error bound for the midpoint rule implies

$$\left|\tilde{\lambda}(1/2) - \int_0^1 \tilde{\lambda}(s)\,ds\right| \le \frac{1}{24}\sup_s\left|\frac{d^2\tilde{\lambda}}{ds^2}\right| \le \frac{C''}{24}\,|t - t'|^3.$$

The result follows: there is a finite constant $C$ independent of $t, t'$ such that

$$|r(t, t') - \Lambda(t, t')| = \left|r(t, t') - \int_0^1 \tilde{\lambda}(s)\,ds\right| \le C\,|t' - t|^3.$$

C. Proof of Theorem 2
We first note that without loss of generality we can place an artificial schedule point $t_n$ at each of the finitely many discontinuities in $W_t$ or its first/second derivative. Thus we assume that $W_t$ is $C^2$ on each interval $[t_{n-1}, t_n]$. Later in the proof it will become clear that the contribution of these artificial schedule points becomes negligible as $\|\mathcal{T}_N\| \to 0$.

Given a schedule $\mathcal{T}_N$, define the path $\tilde{\pi}_t = \tilde{Z}_t^{-1}\exp(\tilde{W}_t)$ with log-likelihood $\tilde{W}_t$ satisfying, for each segment $t_{n-1} \le t \le t_n$,

$$\tilde{W}_t = W_{t_{n-1}} + \frac{\Delta W_n}{\Delta t_n}(t - t_{n-1}),$$

where $\Delta W_n = W_{t_n} - W_{t_{n-1}}$ and $\Delta t_n = t_n - t_{n-1}$. In particular, $\tilde{W}_t$ agrees with $W_t$ for $t \in \mathcal{T}_N$, linearly interpolates between $W_{t_{n-1}}$ and $W_{t_n}$ for $t \in [t_{n-1}, t_n]$, and for all $x$,

$$|\tilde{W}_t(x) - W_t(x)| \le \frac{1}{2}\sup_{t \in [t_{n-1}, t_n]}\left|\frac{dW_t}{dt}(x)\right|\Delta t_n. \qquad (17)$$

The following lemma shows that the normalization constant of, and expectations under, $\tilde{\pi}_t$ are comparable to the same for $\pi_t$, with an error bound that depends on $\|\mathcal{T}_N\|$ and converges to 0 as $\|\mathcal{T}_N\| \to 0$.

Lemma 3. For measurable functions $f$ and $s > 0$, let $E_t(f, s) = \mathbb{E}_{\pi_t}[|f|\,e^{sV_1}]$, and define $E_t(s) = E_t(1, s)$ for brevity.

(a) For any schedule $\mathcal{T}_N$,

$$\left|\frac{\tilde{Z}_t}{Z_t} - 1\right| \le E_t(\|\mathcal{T}_N\|) - 1,$$

and if $\|\mathcal{T}_N\|$ is small enough that $E_t(\|\mathcal{T}_N\|) < 2$,

$$\left|\frac{Z_t}{\tilde{Z}_t} - 1\right| \le \frac{E_t(\|\mathcal{T}_N\|) - 1}{2 - E_t(\|\mathcal{T}_N\|)}.$$

(b) For any schedule $\mathcal{T}_N$ and measurable function $f$, if $\|\mathcal{T}_N\|$ is small enough that $E_t(\|\mathcal{T}_N\|) < 2$,

$$|\mathbb{E}_{\tilde{\pi}_t}[f] - \mathbb{E}_{\pi_t}[f]| \le \frac{E_t(\|\mathcal{T}_N\|) - 1}{2 - E_t(\|\mathcal{T}_N\|)}\,E_t(f, \|\mathcal{T}_N\|) + E_t(f, \|\mathcal{T}_N\|) - E_t(f, 0).$$

Proof. (a) We rewrite the expression

$$\frac{\tilde{Z}_t}{Z_t} = \frac{1}{Z_t}\int_{\mathcal{X}} e^{\tilde{W}_t(x)}\,dx = \int_{\mathcal{X}} e^{\tilde{W}_t(x) - W_t(x)}\,\pi_t(x)\,dx = 1 + \int_{\mathcal{X}}\left(e^{\tilde{W}_t(x) - W_t(x)} - 1\right)\pi_t(x)\,dx.$$

Thus, using the inequality $|e^x - 1| \le e^{|x|} - 1$,

$$\left|\frac{\tilde{Z}_t}{Z_t} - 1\right| \le \int_{\mathcal{X}}\left(e^{|\tilde{W}_t(x) - W_t(x)|} - 1\right)\pi_t(x)\,dx \le \int_{\mathcal{X}}\left(e^{V_1(x)\|\mathcal{T}_N\|} - 1\right)\pi_t(x)\,dx = E_t(\|\mathcal{T}_N\|) - 1.$$

The bound on $|Z_t/\tilde{Z}_t - 1|$ arises from straightforward algebraic manipulation of the above bound.
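Lemma 3(a) can be illustrated numerically. For a linear path the piecewise-linear interpolation $\tilde{W}$ is exact, so the sketch below uses a toy nonlinear exponential-family path between $N(0,1)$ and $N(2,1)$ (illustrative values, not from the paper), computes $\tilde{Z}_t/Z_t$ by quadrature at the midpoint of one segment, and checks it against $E_t(\|\mathcal{T}_N\|) - 1$ with $V_1$ taken as the segmentwise bound on $|dW_t/dt|$:

```python
import math

# Toy nonlinear exponential-family path between N(0,1) and N(2,1):
# W_t(x) = -(1-t)^2 x^2 / 2 - t^2 (x-2)^2 / 2 (annealing coefficients nonlinear in t)
def W(t, x):
    return -0.5 * (1 - t) ** 2 * x**2 - 0.5 * t**2 * (x - 2) ** 2

def dWdt(t, x):
    return (1 - t) * x**2 - t * (x - 2) ** 2

dx = 0.01
xs = [-10 + dx * i for i in range(2201)]  # quadrature grid on [-10, 12]

def trapz(fvals):
    return dx * (sum(fvals) - 0.5 * (fvals[0] + fvals[-1]))

t0, t1 = 0.4, 0.6  # one schedule segment; mesh size h = 0.2
t, h = 0.5, t1 - t0
Z = trapz([math.exp(W(t, x)) for x in xs])
# piecewise-linear interpolation of t -> W_t, evaluated at the segment midpoint
Zt = trapz([math.exp(0.5 * (W(t0, x) + W(t1, x))) for x in xs])

# V1(x): bound on |dW_s/dt(x)| over the segment (dW/dt is linear in t here)
V1 = [max(abs(dWdt(t0, x)), abs(dWdt(t1, x))) for x in xs]
Et = trapz([math.exp(h * v + W(t, x)) for v, x in zip(V1, xs)]) / Z

lhs = abs(Zt / Z - 1)
rhs = Et - 1
print(f"|Z~_t/Z_t - 1| = {lhs:.4f} <= E_t(h) - 1 = {rhs:.4f}")
```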
(b) We begin by rewriting $\mathbb{E}_{\tilde{\pi}_t}[f]$:

$$\mathbb{E}_{\tilde{\pi}_t}[f] - \mathbb{E}_{\pi_t}[f] = \frac{1}{\tilde{Z}_t}\int_{\mathcal{X}} f(x)\,e^{\tilde{W}_t(x)}\,dx - \mathbb{E}_{\pi_t}[f] = \int_{\mathcal{X}} f(x)\left(\frac{Z_t}{\tilde{Z}_t}e^{\tilde{W}_t(x) - W_t(x)} - 1\right)\pi_t(x)\,dx$$

$$= \left(\frac{Z_t}{\tilde{Z}_t} - 1\right)\int_{\mathcal{X}} f(x)\,e^{\tilde{W}_t(x) - W_t(x)}\,\pi_t(x)\,dx + \int_{\mathcal{X}} f(x)\left(e^{\tilde{W}_t(x) - W_t(x)} - 1\right)\pi_t(x)\,dx.$$

Therefore, again using $|e^x - 1| \le e^{|x|} - 1$ and the previous bound,

$$|\mathbb{E}_{\tilde{\pi}_t}[f] - \mathbb{E}_{\pi_t}[f]| \le \frac{E_t(\|\mathcal{T}_N\|) - 1}{2 - E_t(\|\mathcal{T}_N\|)}\,E_t(f, \|\mathcal{T}_N\|) + E_t(f, \|\mathcal{T}_N\|) - E_t(f, 0).$$

By changing variables via $t = t_{n-1} + s\,\Delta t_n$ in (5), we can rewrite $\Lambda(t_{n-1}, t_n)$ as

$$\Lambda(t_{n-1}, t_n) = \int_{t_{n-1}}^{t_n} \frac{1}{2}\,\mathbb{E}\left[\left|\frac{\Delta W_n}{\Delta t_n}(\tilde{X}_t) - \frac{\Delta W_n}{\Delta t_n}(\tilde{X}'_t)\right|\right]dt,$$

where $\tilde{X}_t, \tilde{X}'_t \sim \tilde{\pi}_t$. Note that by construction, for $t \in (t_{n-1}, t_n)$, $\frac{d\tilde{W}_t}{dt}$ exists and equals $\frac{\Delta W_n}{\Delta t_n}$. So by summing over $n$ we get

$$\Lambda(\mathcal{T}_N) = \sum_{n=1}^N \Lambda(t_{n-1}, t_n) = \int_0^1 \frac{1}{2}\,\mathbb{E}\left[\left|\frac{d\tilde{W}_t}{dt}(\tilde{X}_t) - \frac{d\tilde{W}_t}{dt}(\tilde{X}'_t)\right|\right]dt = \int_0^1 \tilde{\lambda}(t)\,dt.$$

If we can show that $\sup_t |\tilde{\lambda}(t) - \lambda(t)|$ converges to 0 as $\|\mathcal{T}_N\| \to 0$, then by the dominated convergence theorem $\Lambda(\mathcal{T}_N)$ converges to $\Lambda$ uniformly as $\|\mathcal{T}_N\| \to 0$. (We say $a(\mathcal{T}_N)$ converges uniformly to $a$ if for all $\epsilon > 0$ there exists $\delta > 0$ such that $\|\mathcal{T}_N\| < \delta$ implies $|a(\mathcal{T}_N) - a| < \epsilon$.) The round trip rate then uniformly converges to $(2 + 2\Lambda)^{-1}$ by Theorem 3 of Syed et al. (2019).

Adding and subtracting $\mathbb{E}\left[\left|\frac{d\tilde{W}_t}{dt}(X_t) - \frac{d\tilde{W}_t}{dt}(X'_t)\right|\right]$ within the absolute difference $|\tilde{\lambda}(t) - \lambda(t)|$ and using the triangle inequality, it can be shown that we require bounds on

$$J_{1,t} = \int \pi_t(x)\pi_t(y)\,\Big|\,\big|\tfrac{d\tilde{W}_t}{dt}(x) - \tfrac{d\tilde{W}_t}{dt}(y)\big| - \big|\tfrac{dW_t}{dt}(x) - \tfrac{dW_t}{dt}(y)\big|\,\Big|$$

and

$$J_{2,t} = \int |\pi_t(x)\pi_t(y) - \tilde{\pi}_t(x)\tilde{\pi}_t(y)|\left|\frac{d\tilde{W}_t}{dt}(x) - \frac{d\tilde{W}_t}{dt}(y)\right|.$$

For the first term, the mean value theorem implies that there exist $s, s' \in [t_{n-1}, t_n]$ (potentially functions of $x$ and $y$, respectively) such that

$$J_{1,t} = \int \pi_t(x)\pi_t(y)\,\Big|\,\big|\tfrac{dW_s}{dt}(x) - \tfrac{dW_{s'}}{dt}(y)\big| - \big|\tfrac{dW_t}{dt}(x) - \tfrac{dW_t}{dt}(y)\big|\,\Big|.$$

Split the integral into the set $A$ of $x, y \in \mathcal{X}$ where the first term in the absolute value is larger; the same analysis with the same result applies in the other case on $A^c$. Here, Taylor's theorem and the triangle inequality yield

$$\left|\frac{dW_s}{dt}(x) - \frac{dW_{s'}}{dt}(y)\right| \le \left|\frac{dW_t}{dt}(x) - \frac{dW_t}{dt}(y)\right| + (V_2(x) + V_2(y))\,\|\mathcal{T}_N\|.$$

Using this and the same procedure for $A^c$, we have that

$$J_{1,t} \le \int \pi_t(x)\pi_t(y)(V_2(x) + V_2(y))\,\|\mathcal{T}_N\| = 2\,\mathbb{E}_{\pi_t}[V_2]\,\|\mathcal{T}_N\|.$$

This converges to 0 as $\|\mathcal{T}_N\| \to 0$.

For the second term $J_{2,t}$, we can again use the mean value theorem to find $s, s' \in [t_{n-1}, t_n]$ such that

$$J_{2,t} = \int |\pi_t(x)\pi_t(y) - \tilde{\pi}_t(x)\tilde{\pi}_t(y)|\left|\frac{dW_s}{dt}(x) - \frac{dW_{s'}}{dt}(y)\right|,$$

and therefore, via the triangle inequality, symmetry, and the $V_1(x)$ bound on the first path derivative,

$$J_{2,t} \le 2\int V_1(x)\,|\pi_t(x)\pi_t(y) - \tilde{\pi}_t(x)\tilde{\pi}_t(y)|.$$

We then add and subtract $\pi_t(x)\tilde{\pi}_t(y)$ within the absolute value and use the triangle inequality again to find that

$$J_{2,t} \le 2\int (V_1(x) + \mathbb{E}_{\pi_t}[V_1])\,|\pi_t(x) - \tilde{\pi}_t(x)| = 2\int \pi_t(x)\,(V_1(x) + \mathbb{E}_{\pi_t}[V_1])\left|1 - \frac{\tilde{\pi}_t(x)}{\pi_t(x)}\right|.$$

Note that by the triangle inequality and the bound $|e^x - 1| \le e^{|x|} - 1$,

$$\left|1 - \frac{\tilde{\pi}_t(x)}{\pi_t(x)}\right| \le \left|\frac{Z_t}{\tilde{Z}_t} - 1\right| e^{\|\mathcal{T}_N\| V_1(x)} + e^{\|\mathcal{T}_N\| V_1(x)} - 1.$$

Assume that $\|\mathcal{T}_N\|$ is small enough that $E_t(\|\mathcal{T}_N\|) < 2$, and let $f = V_1 + \mathbb{E}_{\pi_t}[V_1]$. Then by Lemma 3,

$$J_{2,t} \le 2\left[\frac{E_t(\|\mathcal{T}_N\|) - 1}{2 - E_t(\|\mathcal{T}_N\|)}\,E_t(f, \|\mathcal{T}_N\|) + E_t(f, \|\mathcal{T}_N\|) - E_t(f, 0)\right].$$

By assumption we know that $E_t(f, s)$ is finite for some $s$ small enough. Therefore, as $\|\mathcal{T}_N\| \to 0$, by monotone convergence $E_t(f, \|\mathcal{T}_N\|) \to E_t(f, 0)$, and in particular $E_t(\|\mathcal{T}_N\|) \to 1$. Therefore $J_{1,t} + J_{2,t} \to 0$ as $\|\mathcal{T}_N\| \to 0$ and the proof is complete.

D. Objective and Gradient
Here we derive the gradient used to optimize the surrogate SKL objective in Equation 13. First we derive the gradient of the expectation of a general function in Section D.1. Next, in Section D.2, we show the result for the specific case of expectations of linear functions with respect to distributions in the exponential family. Lastly, we show how the result is related to our SKL objective in Sections D.3 and D.4.
D.1. Derivative of parameter-dependent expectation
Here we consider the problem of computing

$$g_\phi = \nabla_\phi \int_{\mathcal{X}} \pi_\phi(x)\,J_\phi(x)\,dx,$$

where $\pi_\phi(x) = Z(\phi)^{-1}\exp(W_\phi(x))$, $Z(\phi) = \int_{\mathcal{X}}\exp(W_\phi(x))\,dx$, and $J_\phi(x)$ is a function depending on $\phi$. Assuming we can interchange the gradient and the expectation, and using the product rule, we can rewrite

$$g_\phi = \int_{\mathcal{X}}\left(J_\phi(x)\,\nabla_\phi \pi_\phi(x) + \pi_\phi(x)\,\nabla_\phi J_\phi(x)\right)dx.$$

Using $\nabla_\phi \pi_\phi(x) = \pi_\phi(x)\,\nabla_\phi\log\pi_\phi(x)$,

$$g_\phi = \int_{\mathcal{X}}\pi_\phi(x)\left(J_\phi(x)\,\nabla_\phi\log\pi_\phi(x) + \nabla_\phi J_\phi(x)\right)dx.$$

From the definition of $\pi_\phi(x)$, we can evaluate the score function as

$$\nabla_\phi\log\pi_\phi(x) = -\nabla_\phi\log Z(\phi) + \nabla_\phi W_\phi(x) = -\mathbb{E}[\nabla_\phi W_\phi(x)] + \nabla_\phi W_\phi(x).$$

Substituting this into $g_\phi$, we obtain

$$g_\phi = \int_{\mathcal{X}}\pi_\phi(x)\,J_\phi(x)\left(-\mathbb{E}[\nabla_\phi W_\phi(x)] + \nabla_\phi W_\phi(x)\right)dx + \int_{\mathcal{X}}\pi_\phi(x)\,\nabla_\phi J_\phi(x)\,dx$$

$$= -\mathbb{E}[J_\phi(x)]\,\mathbb{E}[\nabla_\phi W_\phi(x)] + \mathbb{E}[J_\phi(x)\,\nabla_\phi W_\phi(x)] + \mathbb{E}[\nabla_\phi J_\phi(x)] = \mathrm{Cov}[\nabla_\phi W_\phi(x), J_\phi(x)] + \mathbb{E}[\nabla_\phi J_\phi(x)].$$

D.2. Exponential family and linear function
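The covariance identity derived above can be checked against a closed form in a simple instance of the exponential-family setting of this section. Take $\pi_\phi = N(\phi, 1)$, so $\nabla_\phi W_\phi(x) = x - \phi$, and $J_\phi(x) = \phi x$ (toy choices for illustration, not from the paper); then $\mathbb{E}[J_\phi] = \phi^2$ and the exact gradient is $2\phi$. A Monte Carlo sketch:

```python
import random

random.seed(1)
phi = 1.5
n = 400_000
xs = [random.gauss(phi, 1.0) for _ in range(n)]

# pi_phi = N(phi, 1): grad_phi W_phi(x) = x - phi;  J_phi(x) = phi * x.
mean_J = sum(phi * x for x in xs) / n
mean_gW = sum(x - phi for x in xs) / n
cov_term = sum((x - phi) * (phi * x) for x in xs) / n - mean_J * mean_gW
grad_J_term = sum(xs) / n  # E[grad_phi J_phi] = E[X]

g_est = cov_term + grad_J_term  # Cov[grad W, J] + E[grad J]
print(f"estimate {g_est:.3f} vs exact 2*phi = {2 * phi:.3f}")
```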
The gradient derived in the previous section can easily be applied to expectations of functions linear in $\phi$ under distributions in the exponential family. Let $J_\phi(x) = \xi_J(\phi)^T J(x)$ be a function linear in $\phi$ and suppose $W_\phi(x) = \xi_W(\phi)^T W(x)$ for some functions $\xi_J: \mathbb{R}^d \to \mathbb{R}^n$, $J: \mathcal{X} \to \mathbb{R}^n$ and $\xi_W: \mathbb{R}^d \to \mathbb{R}^m$, $W: \mathcal{X} \to \mathbb{R}^m$. Then

$$g_\phi = \mathrm{Cov}[\nabla_\phi W_\phi(x), J_\phi(x)] + \mathbb{E}[\nabla_\phi J_\phi(x)] = \nabla_\phi\xi_W(\phi)^T\,\mathrm{Cov}[W(x), J^T(x)]\,\xi_J(\phi) + \nabla_\phi\xi_J(\phi)^T\,\mathbb{E}[J(x)],$$

where $\nabla_\phi\xi(\phi)^T$ is the transposed Jacobian of $\xi$.

D.3. Symmetric KL: general case
Next we show that the symmetric KL divergence of Equation 13 can be rewritten as a sum of expectations of functions parametrized by $\phi$, hence falling into the framework presented above. For path parameter $\phi$, the symmetric KL divergence is

$$\mathcal{L}^{\mathrm{SKL}}(\phi) = \sum_{n=0}^{N-1}\mathrm{SKL}\left(\pi^\phi_{t_n}, \pi^\phi_{t_{n+1}}\right) = \sum_{n=0}^{N-1}\mathbb{E}\left[\log\frac{\pi^\phi_{t_{n+1}}(X_{n+1})}{\pi^\phi_{t_n}(X_{n+1})} + \log\frac{\pi^\phi_{t_n}(X_n)}{\pi^\phi_{t_{n+1}}(X_n)}\right],$$

where $X_n \sim \pi^\phi_{t_n}$. After cancellation of the normalization constants we obtain

$$\mathcal{L}^{\mathrm{SKL}}(\phi) = \sum_{n=0}^{N-1}\mathbb{E}\left[W^\phi_{t_{n+1}}(X_{n+1}) - W^\phi_{t_n}(X_{n+1}) + W^\phi_{t_n}(X_n) - W^\phi_{t_{n+1}}(X_n)\right].$$

Collecting expectations under the same distribution and rearranging terms,

$$\mathcal{L}^{\mathrm{SKL}}(\phi) = \mathbb{E}\left[W^\phi_{t_0}(X_0) - W^\phi_{t_1}(X_0)\right] + \sum_{n=1}^{N-1}\mathbb{E}\left[2W^\phi_{t_n}(X_n) - W^\phi_{t_{n+1}}(X_n) - W^\phi_{t_{n-1}}(X_n)\right] + \mathbb{E}\left[W^\phi_{t_N}(X_N) - W^\phi_{t_{N-1}}(X_N)\right].$$

Defining, for $n = 1, \dots, N-1$,

$$J^\phi_0(x) = W^\phi_{t_0}(x) - W^\phi_{t_1}(x), \qquad J^\phi_n(x) = 2W^\phi_{t_n}(x) - W^\phi_{t_{n+1}}(x) - W^\phi_{t_{n-1}}(x), \qquad J^\phi_N(x) = W^\phi_{t_N}(x) - W^\phi_{t_{N-1}}(x),$$

we have that

$$\mathcal{L}^{\mathrm{SKL}}(\phi) = \sum_{n=0}^N \mathbb{E}\left[J^\phi_n(X_n)\right] \qquad\text{and}\qquad \nabla_\phi\mathcal{L}^{\mathrm{SKL}}(\phi) = \sum_{n=0}^N \nabla_\phi\mathbb{E}\left[J^\phi_n(X_n)\right],$$

where $\nabla_\phi\mathbb{E}[J^\phi_n(X_n)]$ can be computed using the formula derived in Section D.1.

D.4. Symmetric KL: exponential family case
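The telescoped decomposition $\mathcal{L}^{\mathrm{SKL}} = \sum_n \mathbb{E}[J_n(X_n)]$ can be verified numerically in an exponential-family example of the kind considered in this section: for the linear path between $N(0,1)$ and $N(m,1)$, $\pi_t = N(tm, 1)$ and $\mathrm{SKL}(\pi_a, \pi_b) = (a-b)^2 m^2$, so with schedule $\{0, 1/2, 1\}$ the objective equals $m^2/2$. A sketch (toy example, not the paper's spline family):

```python
import random

random.seed(2)
m = 3.0
n = 200_000

# Linear path between N(0,1) and N(m,1): W_t(x) = -x^2/2 + t*m*x - t*m^2/2,
# so W_a(x) - W_b(x) = (a - b)*(m*x - m^2/2) and pi_t = N(t*m, 1).
def Wdiff(a, b, x):
    return (a - b) * (m * x - m**2 / 2)

ts = [0.0, 0.5, 1.0]
chains = [[random.gauss(t * m, 1.0) for _ in range(n)] for t in ts]

# J_0 = W_{t0} - W_{t1},  J_1 = 2 W_{t1} - W_{t2} - W_{t0},  J_2 = W_{t2} - W_{t1}
J0 = sum(Wdiff(ts[0], ts[1], x) for x in chains[0]) / n
J1 = sum(Wdiff(ts[1], ts[2], x) + Wdiff(ts[1], ts[0], x) for x in chains[1]) / n
J2 = sum(Wdiff(ts[2], ts[1], x) for x in chains[2]) / n

skl_sum = J0 + J1 + J2
print(f"sum of E[J_n] = {skl_sum:.3f} vs closed form m^2/2 = {m**2 / 2:.3f}")
```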
For the spline family introduced in Section 4, the distributions $\pi^\phi_{t_n}$ are in the exponential family, with

$$W^\phi_{t_n}(x) = \eta^\phi(t_n)^T W(x), \qquad n = 0, \dots, N.$$

It follows that the functions $J^\phi_n$ are linear in $\phi$, with $J^\phi_n(x) = z_n^{\phi\,T} W(x)$ for $n = 0, \dots, N$, where

$$z^\phi_0 = \eta^\phi(t_0) - \eta^\phi(t_1), \qquad z^\phi_n = 2\eta^\phi(t_n) - \eta^\phi(t_{n+1}) - \eta^\phi(t_{n-1})\ \ (n = 1, \dots, N-1), \qquad z^\phi_N = \eta^\phi(t_N) - \eta^\phi(t_{N-1}).$$

Given this relation, the stochastic gradient of Equation 13 can be evaluated using $s$ samples from parallel tempering through the formula in Section D.2, defining

$$X = (X_0, \dots, X_N), \qquad W(X) = [W_0(X_0), W_1(X_0), \dots, W_0(X_N), W_1(X_N)]^T, \qquad J(X) = W(X),$$

$$\xi_W(\phi) = [\eta^\phi_0(t_0), \eta^\phi_1(t_0), \dots, \eta^\phi_0(t_N), \eta^\phi_1(t_N)]^T, \qquad \xi_J(\phi) = [z^\phi_{0,0}, z^\phi_{0,1}, \dots, z^\phi_{N,0}, z^\phi_{N,1}]^T,$$

where $X$ collects the $s$ samples from parallel tempering at each of the $N+1$ chains, $W(X)$ evaluates the samples elementwise at the reference and target log-densities $W_0$ and $W_1$, $\xi_W(\phi)$ is the vector of annealing coefficients, and $\xi_J(\phi)$ is the vector of coefficients defining $J^\phi = [J^\phi_0, \dots, J^\phi_N]$.

E. Proof of Proposition 2
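The key convexity fact used in this proof is that $\xi \mapsto Z(\xi) = \int \exp(\xi^T W(x))\,G(x)\,dx$ is log-convex, which follows from Hölder's inequality. A quick numerical illustration with $W = (W_0, W_1)$ the log-densities of $N(0,1)$ and $N(2,1)$ (toy values, not from the paper):

```python
import math

# W_0, W_1: log-densities (up to constants) of N(0,1) and N(2,1).
def Z(xi0, xi1, dx=0.005, lo=-15.0, hi=17.0):
    """Z(xi) = integral of exp(xi0*W0(x) + xi1*W1(x)) dx; needs xi0 + xi1 > 0."""
    s, x = 0.0, lo
    while x <= hi:
        s += math.exp(-0.5 * xi0 * x**2 - 0.5 * xi1 * (x - 2) ** 2) * dx
        x += dx
    return s

xi_a, xi_b, lam = (1.0, 0.0), (0.2, 1.3), 0.4
xi_mid = tuple(lam * a + (1 - lam) * b for a, b in zip(xi_a, xi_b))

lhs = Z(*xi_mid)
rhs = Z(*xi_a) ** lam * Z(*xi_b) ** (1 - lam)
print(f"Z(convex combination) = {lhs:.4f} <= {rhs:.4f} (log-convexity via Hoelder)")
```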
For this annealing path family, $W_t(x) = \eta(t)^T W(x)$, where $\eta$ is parametrized so that $\|\eta'(t)\| \le L$. Therefore the piecewise twice continuous differentiability of $\eta(t)$ and the endpoint conditions imply that conditions 1 and 2 of Definition 1 are satisfied. Next, note that

$$\left|\frac{dW_t}{dt}\right| = |\eta'(t)^T W(x)| \le L\,\|W(x)\|, \qquad \left|\frac{d^2W_t}{dt^2}\right| = |\eta''(t)^T W(x)| \le M\,\|W(x)\|,$$

and thus we set $V_1(x) = V_2(x) = (L + M)\|W(x)\|$. Since $\eta([0,1]) \subseteq \Omega$, and noting that $\mathbb{E}[\|W(X)\|^2] < \infty$ implies $\mathbb{E}[\|W(X)\|] < \infty$ by Jensen's inequality, the path satisfies condition 3 of Definition 1. Finally, note that $\Omega$ is a convex subset of $\mathbb{R}^2$: for any nonnegative function $G(x)$, vectors $\xi_0, \xi_1 \in \mathbb{R}^2$, and $\lambda \in [0, 1]$,

$$\exp\left((\lambda\xi_0 + (1-\lambda)\xi_1)^T W(x)\right)G(x) = \left(\exp(\xi_0^T W)\,G(x)\right)^\lambda\left(\exp(\xi_1^T W)\,G(x)\right)^{1-\lambda},$$

and so Hölder's inequality $\int f^\lambda g^{1-\lambda} \le \left(\int f\right)^\lambda\left(\int g\right)^{1-\lambda}$ yields log-convexity (and hence convexity). Therefore, as long as the endpoints $\eta(0) = (1, 0)$ and $\eta(1) = (0, 1)$ are in $\Omega$, which is the case if the family is nonempty, any convex combination of $(1, 0)$ and $(0, 1)$ is also in $\Omega$, and therefore the linear path is included.

F. Empirical support for the SKL surrogate objective function
Two objective functions were discussed in Section 3: one based on rejection rate statistics, i.e. Equation (10), and the symmetric KL divergence (SKL). In this section we perform controlled experiments comparing the signal-to-noise ratio of Monte Carlo estimators of the gradient of these two objectives. Let $G$ denote a Monte Carlo estimator of a partial derivative with respect to one of the parameters $\phi_i$. Refer to Appendix D for details on the stochastic gradient estimators. In this experiment we use i.i.d. samples so that the Monte Carlo estimators are unbiased, justifying the use of the variance as a notion of noise. Hence, following Rainforth et al. (2018), we define the signal-to-noise ratio by $\mathrm{SNR} = |\mathbb{E}[G]/\sigma[G]|$, where $\sigma[G]$ denotes the standard deviation of $G$. We use two chains, with one set to a standard Gaussian and the other to a Gaussian with mean $\phi$ and unit variance. We show the value of the two objective functions in Figure 4 (left). The label "Rejection" refers to the expected rejection of the swap proposal between the two chains, $r$. We also show the square root of half of the SKL ("SqrtHalfSKL"), to quantify the tightness of the bound in Equation (13), while "Ineff" shows the rejection odds, $r/(1-r)$, called the inefficiency in Syed et al. (2019).

Signal-to-noise ratio estimates were computed over a grid of values of the parameter $\phi_i$. Each gradient estimate uses 50 samples, and to approximate the signal-to-noise ratio the estimation was repeated 1000 times for each $\phi_i$ and objective function.

Figure 4. Left: objective functions for path optimization in a controlled experiment as a function of a variational parameter $\phi$. Right: signal-to-noise ratio of the corresponding gradient estimators on the same range of parameters.

The results are shown in Figure 4 (right), and show that in the regime of small rejection, the gradient estimator based on the rejection objective has a superior signal-to-noise ratio compared to its SKL counterpart. However, as $\phi$ increases and the two distributions become farther apart, the situation is reversed, providing empirical support for the surrogate objective for challenging path optimization problems.

G. Experimental details
All the experiments were conducted comparing reversible PT, non-reversible PT, and non-reversible PT based on the spline family for several values of $K$. Every method was initialized at the linear path with an equally spaced schedule, i.e. $\pi_{t_n} \propto \pi_0^{1 - n/N}\pi_1^{n/N}$ with $N$ the number of parallel chains. All methods performed one local exploration step before a communication step. To ensure a fair comparison of the different algorithms, we fixed the computational budget to a pre-determined number of samples in each experiment. Reversible PT used the budget to perform local exploration steps followed by communication steps. In non-reversible PT the computational budget was used to tune the schedule according to the procedure described in Syed et al. (2019, Section 5.1). For non-reversible PT with path optimization, the computational budget was divided equally over a fixed number of scans of Algorithm 2, where a scan corresponds to one iteration of the for loop.

Optimization of the spline annealing path family was performed using the SKL surrogate objective of Equation 13. Adagrad was used for the optimization. The gradient was scaled elementwise by its absolute value plus the value of the knot component. Such scaling was necessary to limit the gradient to the interval $[-1, 1]$, stabilizing the optimization and avoiding possible exploding gradients due to the transformation to log space. To mitigate variance in the results due to randomness, we performed 10 runs of each method and averaged the results across the runs.

G.1. Gaussian
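The closed form for $\pi_t$ quoted below follows from completing the square in $\eta_0(t)W_0 + \eta_1(t)W_1$; a sketch verifying its mean and standard deviation by quadrature (the $\eta$ values are arbitrary test inputs, not tuned values):

```python
import math

# pi_t proportional to exp(eta0(t)*W0 + eta1(t)*W1), with W_i the log-density
# (up to a constant) of N(mu_i, sigma^2); mu0 = -1, mu1 = 1 as in the experiment.
mu0, mu1, sigma = -1.0, 1.0, 0.01
eta0, eta1 = 0.7, 0.5  # arbitrary test values of the annealing coefficients

def logw(x):
    return (-0.5 * eta0 * (x - mu0) ** 2 / sigma**2
            - 0.5 * eta1 * (x - mu1) ** 2 / sigma**2)

# predicted closed form: N((eta1 - eta0)/(eta0 + eta1), (sigma/sqrt(eta0 + eta1))^2)
pred_mean = (eta1 - eta0) / (eta0 + eta1)
pred_sd = sigma / math.sqrt(eta0 + eta1)

xs = [pred_mean + (i - 4000) * pred_sd / 500 for i in range(8001)]  # +- 8 sd grid
ws = [math.exp(logw(x) - logw(pred_mean)) for x in xs]              # stable weights
Z = sum(ws)
mean = sum(w * x for w, x in zip(ws, xs)) / Z
var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / Z

print(f"mean {mean:.5f} (predicted {pred_mean:.5f}), "
      f"sd {math.sqrt(var):.5f} (predicted {pred_sd:.5f})")
```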
This experiment optimized the path between the reference $\pi_0 = N(-1, 0.01^2)$ and the target $\pi_1 = N(1, 0.01^2)$. We used $N = 50$ parallel chains initialized at a state sampled from a standard normal distribution. In this setting, $\pi_t$ has a closed form, which can be shown to be

$$\pi_t = N\left(\frac{\eta_1(t) - \eta_0(t)}{\eta_0(t) + \eta_1(t)},\ \left(\frac{0.01}{\sqrt{\eta_0(t) + \eta_1(t)}}\right)^2\right);$$

therefore, in the local exploration step of parallel tempering we sampled i.i.d. from $\pi_t$. The computational budget was fixed at 45000 samples. Non-reversible PT with optimized path divided the budget into 150 scans. Therefore, for every gradient step in Algorithm 2, the gradient was estimated with 300 samples. We used 0.2 as the learning rate for Adagrad.

G.2. Beta-binomial model
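The conjugacy computations for this model can be checked directly (a sketch; $S = 140000$ is the success total implied by the stated posterior, and $R = 2000 \times 100$):

```python
import math

# Tempered Beta-binomial posterior: pi_t(p) = Beta(a(t), b(t)) with
# a(t) = 179*eta0 + (180 + S - 1)*eta1 + 1,  b(t) = 839*eta0 + (840 + R - S - 1)*eta1 + 1.
S, R = 140_000, 2000 * 100

def beta_params(eta0, eta1):
    a = 179 * eta0 + (180 + S - 1) * eta1 + 1
    b = 839 * eta0 + (840 + R - S - 1) * eta1 + 1
    return a, b

# endpoints recover the prior and the posterior
assert beta_params(1, 0) == (180, 840)
assert beta_params(0, 1) == (140_180, 60_840)

a, b = beta_params(0, 1)
mean = a / (a + b)
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(f"posterior mean {mean:.3f}, sd {sd:.4f}")
```

The printed values match the concentration figures quoted in this section (mean about 0.697, standard deviation about 0.001).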
The second experiment was performed on a conjugate Bayesian model. The model prior was $\pi_0(p) = \mathrm{Beta}(180, 840)$. The likelihood was $L(x \mid p) = \mathrm{Binomial}(x \mid n, p)$. We simulated $x_1, \dots, x_{2000} \sim \mathrm{Binomial}(100, 0.7)$, resulting in a posterior distribution $\pi_1(p) = \mathrm{Beta}(140180, 60840)$. The prior is concentrated at 0.176 with a standard deviation of 0.0119. The posterior distribution is concentrated at 0.697 with a standard deviation of 0.001. We used $N = 50$ parallel chains initialized at 0.5. Also in this experiment it is possible to compute $\pi_t$ in closed form. Let $S = \sum_i x_i$ and $R = 2000 \times 100$; then

$$\pi_t(p) = \mathrm{Beta}\big(179\,\eta_0(t) + (180 + S - 1)\,\eta_1(t) + 1,\ \ 839\,\eta_0(t) + (840 + R - S - 1)\,\eta_1(t) + 1\big).$$

Hence, in the local exploration step of parallel tempering we sampled i.i.d. from $\pi_t$. The computational budget was fixed at 45000 samples. Non-reversible PT with optimized path divided the budget into 150 scans. Therefore, for every gradient step in Algorithm 2, the gradient was estimated with 300 samples. We used 0.2 as the learning rate for Adagrad.

G.3. Mixture model

The third experiment was a Bayesian Gaussian mixture model with mixture proportions $w_1, w_2$, mixture component densities $N(\mu_i, \sigma^2)$ (fixed component variance) for mean parameters $\mu_1, \mu_2$, and a binary cluster label for each data point. We placed a $\mathrm{Dir}(1, 1)$ prior on the proportions, and a normal prior centered at 150 on each of the two mean parameters. We simulated $n = 1000$ data points from a two-component mixture of Gaussians centered at 100 and 200. We used $N = 35$