A fresh take on 'Barker dynamics' for MCMC
Max Hird, Samuel Livingstone & Giacomo Zanella
Abstract
We study a recently introduced gradient-based Markov chain Monte Carlo method based on ‘Barker dynamics’. We provide a full derivation of the method from first principles, placing it within a wider class of continuous-time Markov jump processes. We then evaluate the Barker approach numerically on a challenging ill-conditioned logistic regression example with imbalanced data, showing in particular that the algorithm is remarkably robust to irregularity (in this case a high degree of skew) in the target distribution.

Max Hird
Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK. e-mail: [email protected]

Samuel Livingstone
Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, UK. e-mail: [email protected]

Giacomo Zanella
Department of Decision Sciences, BIDSA and IGIER, Bocconi University, Via Roentgen 1, 20136 Milan, Italy. e-mail: [email protected]

Preprint submitted to Proceedings of MCQMC2020 conference.

1 Introduction

For over half a century now Markov chain Monte Carlo has been used to sample from and compute expectations with respect to unnormalised probability distributions [16]. The idea is to construct a Markov chain for which a distribution of interest is invariant. Provided that the chain is π-irreducible and aperiodic (see e.g. [20]), the distribution of X_n, the nth point in the chain, will approach the invariant distribution as n → ∞, and ergodic averages from the chain can be used to approximate desired integrals.
Restricting attention to X ⊂ R^d, one way to confirm that a distribution on X with density π(x) is invariant for a Markov chain with transition kernel Q(x, A) := ∫_A q(x, y) dy is to establish that the equation

t(x, y) := π(y)q(y, x) / (π(x)q(x, y)) = 1    (1)

holds for all x, y ∈ X such that π(x)q(x, y) > 0, and that π(y)q(y, x) = 0 otherwise. These are known as the detailed balance equations. The celebrated Metropolis–Hastings algorithm [16, 10] is built on the idea of coercing a Markov chain into having a specified invariant distribution. This is achieved through what will be called a balancing function in this article. Consider the scenario in which π is not invariant for Q, meaning (1) does not hold. A new kernel can be created which in fact does satisfy equation (1) by setting

p(x, y) := g(t(x, y)) q(x, y),    (2)

where g(t) satisfies

g(t) = t g(1/t)    (3)

whenever t > 0, and g(0) := 0. By noting that π(x)q(x, y)t(x, y) = π(y)q(y, x) and t(y, x) = 1/t(x, y), it is easily seen that

π(x)p(x, y) = π(x)q(x, y)g(t(x, y))    (4)
            = π(x)q(x, y)t(x, y)g(1/t(x, y))
            = π(y)q(y, x)g(t(y, x))
            = π(y)p(y, x),

as required.

The problem, however, with taking the above strategy is that there is no guarantee that ∫ p(x, y) dy = 1; in fact this is extremely unlikely to be the case. More steps must be taken, therefore, to create a Markov process. The Metropolis–Hastings solution is to restrict to balancing functions that satisfy g(t) ≤ 1 for all t ∈ [0, ∞). This ensures that the kernel K(x, A) := ∫_A p(x, y) dy satisfies K(x, X) ≤ 1. The remaining probability mass can then be found by simply adding a rejection step, meaning that with probability 1 − K(x, X) the chain remains at its current point x.

There is, however, another way to create a Markov process from p(x, y), without resorting to the Metropolis–Hastings approach. This consists of defining a continuous-time Markov jump process in which jumps from the point x occur with intensity

λ(x) = ∫ p(x, y) dy,    (5)

and the jump location y is sampled from a distribution with density p(x, y)/λ(x). The function p(x, y) then describes the rate at which the process jumps from x to y, and (4) indicates that the process is π-invariant. The challenge associated with this second approach is that the integral (5) will often be intractable, meaning that simulating the process is not straightforward. Here we describe a solution that is outlined in the recent contribution [14], through a judicious choice of g and a suitable approximation to t(x, y). It should of course be noted that in the case of finite X then (5) becomes a sum, and so provided that this is well-defined the process can be exactly simulated. We do not consider this setting here, but direct the interested reader to [18], in which the approach is elegantly described, following earlier work in [23].

In the next section we discuss Barker's accept-reject rule, an early Markov chain Monte Carlo method from which we draw inspiration, before covering the general approach to the design of Markov jump processes with a prescribed invariant distribution in Section 3. It is here that we derive a Markov process that approximately preserves a given distribution, and show that the Barker balancing function is the only choice giving rise to such a process. In Section 4 we reveal the Barker proposal scheme, in which this new process is used as a proposal mechanism within a Metropolis–Hastings algorithm.
In Section 5 we discuss the merits of using this new algorithm, by comparing it to suitable alternatives both theoretically and numerically, in the latter case using a challenging logistic regression example with imbalanced categorical data.

2 Barker's accept-reject rule

Readers who are familiar with the Metropolis–Hastings algorithm will naturally gravitate towards the choice of balancing function g(t) = min(1, t) in (2), resulting in the familiar Hastings acceptance probability

g_H(t(x, y)) = min( 1, π(y)q(y, x) / (π(x)q(x, y)) ).    (6)

It should be noted, however, that several other choices of g are possible. One alternative proposed by Barker in [3] is g(t) = t/(1 + t), resulting in the acceptance probability

g_B(t(x, y)) = π(y)q(y, x) / (π(x)q(x, y) + π(y)q(y, x))    (7)

after multiplying the numerator and denominator by π(x)q(x, y). In the case q(x, y) = q(y, x) this further reduces to π(y)/(π(x) + π(y)). Note that both g_H and g_B satisfy g ≤ 1. The reason g_H is preferred to g_B in the context of the Metropolis–Hastings algorithm is due to the work of Peskun [17] and Tierney [22], which established that for the same choice of q(x, y) the acceptance rate g_H will result in Markov chains that produce ergodic averages with smallest asymptotic variance. A key part of the argument is that g_H will maximise the probability of moving from x to y, for any y ≠ x. When comparing (6) and (7) this is easy to see, as

g_B(t(x, y)) = π(y)q(y, x) / (π(x)q(x, y) + π(y)q(y, x)) ≤ π(y)q(y, x) / (π(x)q(x, y))

whenever π(x)q(x, y) > 0. Combining this with the fact that g_B ≤ 1 shows that g_B ≤ g_H for every value of t(x, y).

It is important to emphasise, however, that the above discussion and conclusions about the optimality of g_H are confined to the scenario in which (2) is used to create a Metropolis–Hastings algorithm. It no longer applies to the setting in which the function p(x, y) is used to define a Markov jump process with transition rates given by (5). Furthermore, in this case the stipulation that g ≤ 1 is no longer necessary. We are therefore free to consider not only g_H but also alternatives such as g_B and others when designing jump processes of the type described in Section 1.

3 Markov jump processes with a prescribed invariant distribution

The dynamics of a jump process for which transitions from x to y occur at the rate p(x, y) are as follows: if the current point at time t is X_t = x ∈ X, the process will remain at x for an exponentially distributed period of time τ ∼ Exp(λ(x)), with λ(x) defined in (5), before moving to the next point using the Markov ‘jump’ kernel J, defined for any event A as

J(x, A) := ∫_A p(x, y)/λ(x) dy.    (8)

In order to simulate such a process, we must therefore be able to compute λ(x) := ∫ p(x, y) dy, and also to simulate from J(x, ·), for any x ∈ X.

Note, however, that if λ(x) were constant, we could simply use the jump kernel J directly and simulate a discrete-time Markov chain. To see this, note that in general the jump kernel will have invariant density proportional to λ(x)π(x), as the equation

λ(x)π(x) p(x, y)/λ(x) = λ(y)π(y) p(y, x)/λ(y)

simplifies to π(x)p(x, y) = π(y)p(y, x), which holds by design. If λ(x) = λ then the above equation simply shows that J is π-reversible. We can therefore either simulate the continuous-time process with constant jump rate λ, or just ignore this step and take J as the kernel of a discrete-time Markov chain. In Subsection 3.1, we show that making a careful approximation to t(x, y) allows just such a constant jump rate process to be found.
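On a finite state space the integrals above become sums, so the whole construction can be checked directly. The sketch below (our own toy illustration; the target π, the proposal q and the use of the Barker balancing function g_B(t) = t/(1 + t) from Section 2 are arbitrary choices) builds the rates p(x, y) = g(t(x, y))q(x, y), the jump intensities λ(x) and the jump kernel J, then verifies that the rates satisfy detailed balance and that J leaves a density proportional to λ(x)π(x) invariant.

```python
# Toy finite state space {0, 1, 2}: pi and q are arbitrary illustrative choices.
pi = [0.2, 0.5, 0.3]                       # target probabilities
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.6, 0.1]]                      # proposal kernel q(x, y)

def g_B(t):                                # Barker balancing function
    return t / (1.0 + t)

n = len(pi)
# t(x, y) = pi(y)q(y, x) / (pi(x)q(x, y)), then p(x, y) = g_B(t(x, y)) q(x, y)
p = [[g_B(pi[y] * q[y][x] / (pi[x] * q[x][y])) * q[x][y] for y in range(n)]
     for x in range(n)]
lam = [sum(p[x]) for x in range(n)]        # jump intensities lambda(x)
J = [[p[x][y] / lam[x] for y in range(n)] for x in range(n)]   # jump kernel

# The rates satisfy detailed balance: pi(x)p(x, y) = pi(y)p(y, x).
for x in range(n):
    for y in range(n):
        assert abs(pi[x] * p[x][y] - pi[y] * p[y][x]) < 1e-12

# The jump chain J leaves mu(x), proportional to lambda(x)pi(x), invariant.
Z = sum(lam[x] * pi[x] for x in range(n))
mu = [lam[x] * pi[x] / Z for x in range(n)]
muJ = [sum(mu[x] * J[x][y] for x in range(n)) for y in range(n)]
assert all(abs(muJ[y] - mu[y]) < 1e-12 for y in range(n))
```

The same check with g_H in place of g_B also passes: detailed balance of the rates holds for any valid balancing function, while the constant-λ property of the next subsection is specific to g_B.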
3.1 A first-order approximation of t(x, y)

Recalling that p(x, y) = g(t(x, y))q(x, y), the jump rate λ(x) is the integral

∫ g(t(x, y)) q(x, y) dy,

which in general will not be tractable. A natural starting point for simplifying the problem is to restrict to the family of transition densities for which q(x, y) = q(y, x), and further to the random walk case q(x, y) = q(y − x). When this choice is made then t(x, y) = π(y)/π(x). Restricting for now to X ⊂ R, a first-order approximation of this ratio can be constructed using a Taylor series expansion as

π(y)/π(x) = exp{log π(y) − log π(x)} ≈ exp{(y − x)∇log π(x)}

for y suitably close to x. The purpose of using this approximation is that y now only enters the expression through the difference term z := y − x, and furthermore it holds that

t*_x(z) := e^{z∇log π(x)} = 1/e^{−z∇log π(x)} = 1/t*_x(−z).    (9)

Since q(x, y) = q(z) we can therefore express the entire integral as

λ*(x) = ∫_{−∞}^{∞} g(t*_x(z)) q(z) dz.

By first writing λ*(x) as the sum of two integrals over the disjoint regions (−∞, 0] and [0, ∞), then switching the limits of integration through a change of variables in the first of these and finally re-combining, we arrive at the expression

λ*(x) = ∫_{−∞}^{0} g(t*_x(z)) q(z) dz + ∫_{0}^{∞} g(t*_x(z)) q(z) dz
      = ∫_{0}^{∞} [ g(t*_x(−z)) q(−z) + g(t*_x(z)) q(z) ] dz
      = ∫_{0}^{∞} [ g(t*_x(−z)) + g(t*_x(z)) ] q(z) dz,

where the last line follows from the fact that q(z) = q(−z). Using (9) and then the balancing property (3) reveals that

g(t*_x(−z)) = g(1/t*_x(z)) = g(t*_x(z))/t*_x(z),

meaning that, writing t* := t*_x(z), the term in square brackets inside the integral can be written (1 + 1/t*) g(t*).
Note that if this expression were in fact equal to a constant, then λ*(x) would become tractable, and furthermore it would not depend on x. The Barker rule is the unique (up to constant multiple) choice of balancing function for which this property holds. To see this, note that for any c ≠ 0

(1 + 1/t*) g(t*) = c  ⟺  g(t*) = c t*/(1 + t*).

Setting c = 1 reveals the choice g_B(t*) = t*/(1 + t*), and furthermore

λ*(x) = ∫_{0}^{∞} q(z) dz = 1/2,

using the facts that q(z) = q(−z) and ∫ q(z) dz = 1. In fact the choice of c is irrelevant here as it simply acts as a constant multiple to the jump rate and does not enter into the jump kernel expression. We refer to the resulting Markov process as Barker dynamics.

3.2 Skew-symmetric distributions

The family of skew-symmetric distributions on R has densities of the form

2 F(βz) φ(z),    (10)

where φ is a symmetric probability density function, F is a cumulative distribution function such that F(0) = 1/2 and F′ is a symmetric density, and β ∈ R [2]. Choosing β > 0 induces positive skew, while β = 0 recovers the symmetric density φ. The term βz can be replaced with more general functions of z, but the above suffices for our needs.

The jump kernel (8) with symmetric choice of q, the approximation t* in (9) and the Barker balancing function g_B leads to the Markov kernel

J*(x, A) := 2 ∫_A g_B( exp{(y − x)∇log π(x)} ) q(y − x) dy    (11)

for any event A. Writing F_L(z) := 1/(1 + e^{−z}), the cumulative distribution function of the logistic distribution, and noting that g_B(e^z) = F_L(z), the associated transition density can be written

j*(x, x + z) = 2 F_L(β_x z) q(z),

where β_x := ∇log π(x). We see, therefore, that the resulting transition is skew-symmetric, with the level of skew at the current state x determined by ∇log π(x). Because of this, a convenient algorithm for drawing samples from this transition kernel exists, and consists of the following:

1. Draw ξ ∼ q(·)
2. Set b = 1 with probability F_L(β_x ξ), otherwise set b = −1
3. Set z = bξ
4. Return x + z.

The resulting draw is from the kernel J*(x, ·). To see this, note that the probability density associated with any z is

j*(x, x + z) = q(z) F_L(β_x z) + q(−z)(1 − F_L(−β_x z)),

which gives the density associated with either drawing ξ = z and setting b = 1, or drawing ξ = −z and setting b = −1. After noting that q(z) = q(−z) and 1 − F_L(−β_x z) = F_L(β_x z) by the symmetry of the logistic distribution, this simplifies to

j*(x, x + z) = 2 F_L(β_x z) q(z),

as required. Figure 1 illustrates the inner workings of such a transition. It is natural to consider whether other choices of skewing function derived from other balancing functions can be used to produce such a Markov transition. It is shown in Appendix F of [14], however, that this is not possible; more precisely, it is shown that g_B is the unique choice of balancing function leading to a skew-symmetric transition kernel when the first-order approximation t*(x, y) is used in place of t(x, y). This is in fact evident from the calculations of Subsection 3.1.
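The four-step draw above is straightforward to implement. The sketch below (our own illustration; a standard Gaussian innovation q and an arbitrary fixed gradient value β_x are assumed) also checks the identity g_B(t) + g_B(1/t) = 1 from Subsection 3.1, which is what makes the jump rate constant, and confirms that draws favour the gradient direction when β_x > 0.

```python
import math
import random

F_L = lambda z: 1.0 / (1.0 + math.exp(-z))   # logistic CDF; g_B(e^z) = F_L(z)
g_B = lambda t: t / (1.0 + t)                # Barker balancing function

# g_B(t) + g_B(1/t) = 1: the identity behind the constant jump rate of 1/2.
for t in (0.1, 1.0, 7.3):
    assert abs(g_B(t) + g_B(1.0 / t) - 1.0) < 1e-12

def barker_draw(x, beta_x, rng):
    xi = rng.gauss(0.0, 1.0)                           # 1. draw xi ~ q
    b = 1 if rng.random() < F_L(beta_x * xi) else -1   # 2. skewing variable b
    return x + b * xi                                  # 3.-4. return x + b*xi

# With beta_x > 0, moves in the direction of the gradient should dominate.
rng = random.Random(0)
x, beta_x = 0.0, 2.0
up = sum(barker_draw(x, beta_x, rng) > x for _ in range(5000))
assert up > 2500
```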
Fig. 1 A diagram of a typical draw from the transition kernel (11) using the algorithm outlined in Subsection 3.2. The black ball x is the current state, and the sizes of the blue balls indicate the probability of moving to that point, given that the innovation drawn in step 1 is ξ = 0.2. A move in the direction of the gradient is clearly more probable.

4 The Barker proposal on X ⊂ R^d

The culmination of Section 3 is a Markov transition kernel (11) and an algorithm in Subsection 3.2 to draw samples from this kernel. Note, however, that this transition kernel will not in general have equilibrium distribution π, owing to the first-order approximation used in Subsection 3.1. In some cases it might be reasonable to simply ignore this fact and use the method regardless, in the hope that any approximation error is small (the authors will discuss this approach in forthcoming work). The resolution we will adopt here, however, is to use the transition as a proposal within a Metropolis–Hastings algorithm. Note that the transition density can be written j*(x, y) ∝ q(y − x)/(1 + e^{(x−y)∇log π(x)}) with q(y − x) = q(x − y), meaning that the Metropolis–Hastings acceptance probability becomes

α(x, y) = min( 1, [π(y)(1 + e^{(x−y)∇log π(x)})] / [π(x)(1 + e^{(y−x)∇log π(y)})] ).

We have also restricted attention thus far to the one-dimensional setting; extending to a d-dimensional transition kernel for d > 1 can be done by taking a d-dimensional symmetric and centred density q. There are, however, many different ways to introduce the skewing mechanism into a d-dimensional distribution, which is done in one dimension through the variable b ∈ {−1, 1}. We consider two here, which we believe to be natural generalisations, and of which one is in fact clearly preferable to the other.

The first is to simply introduce the same variable b, and after drawing ξ ∼ q(·), set P[b = 1] = F_L(β_x^T ξ), where β_x := ∇log π(x) as in Subsection 3.2.
The only difference between this and the one-dimensional case is that now β_x and ξ are d-dimensional vectors, meaning the scalar product is replaced by an inner product. This procedure is a single global skewing of the initial symmetric distribution q.

It turns out, however, that a much more favourable approach is to skew each dimension individually. This involves defining b ∈ {−1, 1}^d, and setting P[b_i = 1] = F_L(β_{x,i} ξ_i) for i ∈ {1, ..., d}, where β_{x,i} := ∂log π(x)/∂x_i, the ith partial derivative of log π(x). This approach allows a much more flexible level of skewing to be applied to the base distribution q. In fact, once the initial ξ ∼ q(·) is drawn, the first approach only considers two possible candidate moves: x + ξ and x − ξ. In a high-dimensional setting it may be that neither of these candidate moves is particularly favourable in terms of helping the chain mix. By contrast, the second approach allows for 2^d possible moves after ξ has been sampled. Figure 2 illustrates how this increased flexibility can result in much more favourable transitions.

Fig. 2 A typical draw from the two different multi-dimensional transition kernels described in Section 4 when d = 2. The black ball x is the current state, and the sizes of the blue balls indicate the probability of moving to each candidate point after the initial innovation ξ has been drawn. Using the first variant (left-hand side) only two moves are possible, neither of which moves the chain closer to the high-probability region of π. By contrast, using the second variant (right-hand side) 2^d moves are possible, and the most likely of these will move the chain in a favourable direction.

One can make the comparison between the two approaches more concrete. In [14], it is shown that the asymptotic variance of the first d-dimensional version of the Barker proposal will be at least half as large as that of a random walk Metropolis algorithm. As such, using scaling arguments based on limiting diffusion approximations, it can be shown that O(d) iterations of the algorithm are needed to achieve estimates of a fixed level of precision as d → ∞. By contrast, in the same work it is shown that only O(d^{1/3}) iterations are needed for the second version to achieve the same goal. This is akin to the Metropolis-adjusted Langevin algorithm, another popular gradient-based Metropolis–Hastings algorithm (e.g. [19]). When referring to the Barker method in d dimensions, from this point forward we will exclusively refer to the second approach described in this section. A single transition of the resulting d-dimensional Metropolis–Hastings algorithm with current state x is given below.

1. Draw ξ ∼ q(·)
2. For i ∈ {1, ..., d} set b_i = 1 with probability F_L(β_{x,i} ξ_i) = (1 + e^{−β_{x,i} ξ_i})^{−1}, otherwise set b_i = −1, where β_{x,i} := ∂log π(x)/∂x_i
3. Set y := x + b · ξ, where a · b = (a_1 b_1, a_2 b_2, ..., a_d b_d) denotes the element-wise product of two vectors a = (a_1, ..., a_d) and b = (b_1, ..., b_d) in R^d
4. Set the next state to be y with probability

α_d(x, y) = min( 1, (π(y)/π(x)) ∏_{i=1}^{d} (1 + e^{(x_i − y_i)β_{x,i}}) / (1 + e^{(y_i − x_i)β_{y,i}}) ),

otherwise remain at x.

We note that the algorithm requires the same ingredients as MALA, and has the same computational cost per iteration, which is dominated by the calculation of the gradient and target distribution. A simple function to run the Barker proposal in the R programming language is provided at https://github.com/gzanella/barker .

5 The merits of the Barker proposal

Gradient-based MCMC methods are typically used because they perform well in high-dimensional settings. The Barker algorithm is no exception here, achieving the same O(d^{−1/3}) asymptotic efficiency as the popular Metropolis-adjusted Langevin algorithm (MALA) for suitably regular problems, where d represents the dimension of the state space [14]. The design of the Barker scheme, however, does differ from other gradient-based schemes such as MALA and Hamiltonian Monte Carlo (HMC). In both of the latter well-known approaches the gradient is incorporated through a deterministic drift, which depends linearly on ∇log π(x). In MALA, for example, if the current point is x the proposal will be

y = x + (h²/2)∇log π(x) + hξ,

where ξ ∼ N(0, 1) and h > 0. When the gradient is suitably regular and h well-chosen this transition can be very desirable; for example if π is a standard Gaussian then the proposal becomes y = (1 − h²/2)x + hξ, leading to dynamics in which the chain drifts towards the centre of the space very quickly provided that h < 2. In the same setting, however, it is immediately clear that choosing h > 2 gives |1 − h²/2| > 1, so that the proposal mean oscillates and grows in magnitude at every iteration, highlighting how sensitive the dynamics are to the choice of h in this example.

The above case is indicative of a much more general phenomenon that is well-known to practitioners, namely that popular gradient-based methods often produce fast-mixing Markov chains on a particular class of problems and provided that the tuning parameters of the algorithm are well-chosen, but that this class of problems is smaller than ideal, and that performance degrades rapidly when a poor choice of tuning parameters is made. This phenomenon is not only restricted to settings in which the MALA proposal becomes unstable (as in the Gaussian case), and means that it is also often difficult to tune the methods adaptively during the course of the simulation, an issue that is discussed in [14]. In that work the authors focus on characterising robustness to tuning, providing a mathematical argument to show that for MALA and HMC performance is much more sensitive to the choice of proposal tuning parameters than for the Barker proposal.

One scenario in which gradient-based algorithms can perform poorly is when the distribution of interest π exhibits considerable skew. To explore this phenomenon we first consider a simple one-dimensional model problem, before performing a more comprehensive numerical study on a challenging ill-conditioned logistic regression example. We will show that in both of these cases the Barker algorithm is considerably more robust to the level of skewness exhibited than other gradient-based schemes. In essence the challenge is that a suitable step-size in one region of the state space is unsuitable in a different region, in this case owing to the skewed nature of the target density. An algorithm that is more robust to step-size choice therefore has some advantages in this setting.
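The Gaussian example above, and the contrast with the Barker scheme, can be reproduced in a few lines (our own sketch; the step sizes, chain lengths and thresholds are arbitrary choices).

```python
import math
import random

# Step-size sensitivity sketch (our own illustration) for a standard Gaussian
# target, where grad log pi(x) = -x. The raw MALA recursion
# y = (1 - h^2/2) x + h xi contracts towards the origin for h < 2 but
# oscillates and explodes for h > 2.
random.seed(2)

def mala_drift(x, h, steps=100):
    for _ in range(steps):
        x = (1.0 - h * h / 2.0) * x + h * random.gauss(0.0, 1.0)
    return x

assert abs(mala_drift(1.0, h=1.0)) < 20.0    # stable: |1 - h^2/2| < 1
assert abs(mala_drift(1.0, h=2.5)) > 1e6     # unstable: |1 - h^2/2| > 1

# A Barker chain run with the same, far too large, scale remains stable.
F_L = lambda z: 1.0 / (1.0 + math.exp(-z))

def log1pexp(a):                             # numerically safe log(1 + e^a)
    return a + math.log1p(math.exp(-a)) if a > 30 else math.log1p(math.exp(a))

def barker_step(x, sigma):
    beta = -x                                # grad log pi for the N(0,1) target
    xi = random.gauss(0.0, sigma)
    y = x + (xi if random.random() < F_L(beta * xi) else -xi)
    log_a = (-y * y / 2 + x * x / 2
             + log1pexp((x - y) * beta) - log1pexp((y - x) * (-y)))
    return y if random.random() < math.exp(min(0.0, log_a)) else x

x, biggest = 1.0, 0.0
for _ in range(2000):
    x = barker_step(x, sigma=2.5)
    biggest = max(biggest, abs(x))
assert biggest < 50.0                        # no blow-up despite the large scale
```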
A model problem.
Consider a family of probability distributions π_η on R indexed by η ≥ 0. The family will be constructed in such a way that as η grows the degree of skew of π_η is increased. Denote the current state as x ∈ R and the proposal as y ∈ R, such that the signs of ∇log π_η(x) and ∇log π_η(y) are different. This represents the case (shown in Figure 3) in which there is a mode in between x and y. We suppose for simplicity that as η → ∞ the quantities π_η(x), π_η(y) and ∇log π_η(x) remain unchanged, but |∇log π_η(y)| → ∞. By definition, therefore, η plays no role in determining how probable it is to generate y given the current point x for either the Barker or MALA algorithms. It does, however, play a pivotal role in how probable the reverse move is, meaning the transition densities q^M_η(y, x) and q^B_η(y, x) which are used in the MALA and Barker acceptance rates. In the MALA case the probability of accepting the proposal y is given by

α^M_η(x, y) = min( 1, [π_η(y) q^M_η(y, x)] / [π_η(x) q^M_η(x, y)] ).

Note that

−log q^M_η(y, x) = (1/(2h²)) (x − y − (h²/2)∇log π_η(y))² + (1/2) log(2πh²) → ∞

as η → ∞, meaning α^M_η(x, y) approaches zero in the same limit. Now considering the same quantity for the Barker algorithm, the negative log-density of the reverse move is

−log q^B_η(y, x) = log( 1 + exp{(y − x)∇log π_η(y)} ) + C

for some C < ∞. Note that (y − x) and ∇log π_η(y) are of opposite sign (because there is a mode between x and y), meaning their product will be negative. Because of this, for large enough |∇log π_η(y)|

−log q^B_η(y, x) ≈ exp{(y − x)∇log π_η(y)} + C,

using that log(1 + x) ≈ x when |x| ≪ 1. The first term approaches 0 as η → ∞, meaning α^B_η(x, y) → α^B_* for some constant α^B_* > 0, which suggests that the Barker algorithm will be less severely influenced by the heterogeneity in the size of the target gradient. Figure 3 provides some more intuition for the contrasting behaviour between the two methods.
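The limiting argument can also be checked numerically (our own illustration: the particular values of x, y, h and the fixed gradient at x are arbitrary, and the target densities at x and y are taken equal so that they cancel in both acceptance ratios).

```python
import math

# Fix x = 0, y = 1, h = 0.5, pi(x) = pi(y), gradient 1 at x, and a gradient at
# y of opposite sign whose magnitude eta grows without bound.
h, gx = 0.5, 1.0

def alpha_mala(eta):
    gy = -eta
    fwd = -(1 - 0 - (h * h / 2) * gx) ** 2 / (2 * h * h)   # log q(x -> y)
    rev = -(0 - 1 - (h * h / 2) * gy) ** 2 / (2 * h * h)   # log q(y -> x)
    return min(1.0, math.exp(rev - fwd))

F_L = lambda z: 1.0 / (1.0 + math.exp(-z))

def alpha_barker(eta):
    gy = -eta
    return min(1.0, F_L((0 - 1) * gy) / F_L((1 - 0) * gx))

assert alpha_mala(50.0) < 1e-6     # MALA acceptance vanishes as eta grows
assert alpha_barker(50.0) > 0.5    # Barker acceptance stays bounded away from 0
```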
Fig. 3 The forward (blue) and reverse (green) proposal densities associated with two points separated by a mode, for an example target distribution that exhibits skew. In the MALA case the current point x is quite unlikely under the reverse proposal density (green curve), whereas for the Barker algorithm this is not the case.

A logistic regression example with imbalanced data. Skewed posterior distributions appear in many common modelling settings, but it is perhaps surprising that even seemingly simple logistic regression models can exhibit such a degree of skew that they pose a significant challenge to MCMC methods. This is despite the fact that the posterior distribution is strongly log-concave and the gradient is Lipschitz, meaning that several favourable results on the mixing properties of classical gradient-based algorithms can be established (e.g. [7, 6, 8]).

We consider an example using the arrhythmia dataset from the UCI machine learning repository, available at https://archive.ics.uci.edu/ml/datasets/arrhythmia . The dataset consists of 452 observations of 279 different covariates. The modelling task is to detect the presence or absence of cardiac arrhythmia. The data presents a challenge as there are many imbalanced categorical covariates with only a few observations in certain categories.

The number of predictors compared to the size of the dataset makes the problem highly ill-conditioned. To combat this we selected 25 imbalanced covariates and 25 others, meaning 50 covariates in total for our problem. The 25 imbalanced predictors were chosen from among the categorical covariates for which one category appeared two or fewer times in the dataset, whereas the remaining 25 were chosen from the remaining set. Despite this pre-processing the problem is still highly ill-conditioned and the maximum likelihood estimator is undefined, making a Bayesian approach very natural for the problem. We also note that despite the reduced number of covariates the final problem is still of large enough dimension that simpler fitting methods will be ineffective, in line with the recommendations of [5]. The choice of dataset was inspired by [11], in which the authors highlight that imbalanced categorical data can cause problems for Markov chain Monte Carlo methods. In this case the result is a logistic regression posterior distribution with a pronounced level of skewness in certain dimensions, as shown in Figure 4.
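Both MALA and the Barker proposal require only the log-posterior and its gradient, which for this model take a standard form. A minimal sketch (synthetic toy data, not the arrhythmia dataset; the variable names are our own) with independent zero-mean Gaussian priors:

```python
import math

# Log-posterior and gradient for Bayesian logistic regression with independent
# N(0, s2) priors. X_data and y_data are a tiny synthetic illustration.
X_data = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.3], [1.0, 0.1]]
y_data = [1, 0, 1, 0]
s2 = 25.0

def log_post(beta):
    lp = -sum(b * b for b in beta) / (2 * s2)            # Gaussian prior
    for xi, yi in zip(X_data, y_data):
        eta = sum(b * x for b, x in zip(beta, xi))
        lp += yi * eta - math.log(1.0 + math.exp(eta))   # Bernoulli log-likelihood
    return lp

def grad_log_post(beta):
    g = [-b / s2 for b in beta]                          # prior gradient
    for xi, yi in zip(X_data, y_data):
        eta = sum(b * x for b, x in zip(beta, xi))
        r = yi - 1.0 / (1.0 + math.exp(-eta))            # residual y - sigmoid(eta)
        for j in range(len(g)):
            g[j] += r * xi[j]
    return g

# Check the analytic gradient against central finite differences.
beta = [0.3, -0.7]
g = grad_log_post(beta)
for j in range(2):
    e = [0.0, 0.0]
    e[j] = 1e-6
    num = (log_post([b + d for b, d in zip(beta, e)])
           - log_post([b - d for b, d in zip(beta, e)])) / 2e-6
    assert abs(num - g[j]) < 1e-6
```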
Fig. 4 Example marginal distributions for selected covariates using the output of the Barker algorithm, illustrating varying degrees of skew in different dimensions. The plots are shown for the raw data.
For the Barker scheme we choose a Gaussian q(·) (although we note anecdotally that ongoing work suggests that other choices may be preferable). With the goal of minimising the degree of hand-tuning needed for each algorithm, we used an adaptive approach to choosing algorithmic tuning parameters, precisely Algorithm 4 of [1], which consists of a Robbins–Monro scheme for learning a single global scale λ and a covariance matrix Σ, which combine to form the pre-conditioning matrix λΣ. We set the Robbins–Monro learning rate at time t to be t^{−0.6}. The matrix Σ can be dense or restricted to diagonal; the former allows correlations to be better navigated by the sampler, but the diagonal approach means fewer parameters must be learned during the simulation. A weakly-informative independent Gaussian prior with zero mean and variance 25 was chosen for each model parameter. It is also sometimes recommended in logistic regression problems to first standardise the covariates, transforming each to have zero mean and unit variance. This can have the effect of making the posterior more regular and as a consequence the inference less challenging, but is not always done by practitioners.

The above considerations led us to four different testing scenarios for each algorithm: dense Σ with raw data, dense Σ with standardised data, diagonal Σ with raw data and diagonal Σ with standardised data. For each of these scenarios we compared the Barker proposal scheme with MALA, a classical gradient-based alternative, as a simple illustration of the different patterns of behaviour that the two algorithms can exhibit.

Trace plots showing the performance of the MALA and Barker algorithms in each scenario are shown in Figure 5. It is immediately clear that MALA struggles to reach equilibrium in 3 out of 4 scenarios, only really performing reasonably when Σ is diagonal and the data is standardised.
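The kind of Robbins–Monro adaptation described above can be sketched as follows (our own simplified version, not Algorithm 4 of [1] verbatim; the target acceptance rate alpha_star is an assumption for illustration).

```python
import math

# Simplified Robbins-Monro adaptation sketch: after iteration t the global
# scale lambda is nudged so the acceptance rate approaches alpha_star, and a
# diagonal Sigma is estimated from running first and second moments.
def adapt(state, alpha, x, t, alpha_star=0.4):
    lam, mean, m2 = state
    gamma = t ** -0.6                          # learning rate, as in the text
    lam = lam * math.exp(gamma * (alpha - alpha_star))
    mean = [m + gamma * (xi - m) for m, xi in zip(mean, x)]
    m2 = [s + gamma * ((xi - m) ** 2 - s) for s, m, xi in zip(m2, mean, x)]
    return lam, mean, m2

# If every proposal is rejected (alpha = 0), the scale must shrink.
state = (1.0, [0.0, 0.0], [1.0, 1.0])
for t in range(1, 200):
    state = adapt(state, 0.0, [0.0, 0.0], t)
assert state[0] < 1.0
```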
As expected, standardising the data aids performance, but it is perhaps surprising that the sampler also struggles in the dense Σ setting. By comparison, visually the Barker algorithm behaves reasonably in all scenarios. To evaluate the samplers at equilibrium and once the adaptation has stabilised, we examine in Table 1 effective sample sizes for each scenario in which equilibrium is visually reached after 30,000 iterations. The effective sample sizes show that performance once equilibrium is reached is largely comparable between the two schemes when the scenario is favourable. The key strengths of the Barker approach in this example are its robustness to lack of standardisation and robustness to different adaptation strategies.

Fig. 5 A selection of trace plots from the MALA and Barker algorithms for the logistic regression example. 1st row: raw data with dense Σ; 2nd row: standardised data with dense Σ; 3rd row: raw data with diagonal Σ; 4th row: standardised data with diagonal Σ.

Table 1 Minimum and median effective sample sizes (ESS) for the logistic regression example.

Dataset        Algorithm         ESS (min., med.)
Raw            Barker (dense)    38.82, 156.67
Raw            Barker (diag.)    65.55, 164.67
Raw            MALA (dense)      n/a
Raw            MALA (diag.)      n/a
Standardised   Barker (dense)    53.36, 98.44
Standardised   Barker (diag.)    44.19, 101.51
Standardised   MALA (dense)      n/a
Standardised   MALA (diag.)      37.21, 87.14
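For reference, effective sample sizes of the kind reported in Table 1 can be estimated from the chain autocorrelations; the sketch below uses a simple truncation rule (our own choice, not the exact routine used to produce the table).

```python
import random

# ESS estimate n / (1 + 2 * sum of autocorrelations), summed until the
# autocorrelation becomes negligible.
def ess(chain):
    n = len(chain)
    mu = sum(chain) / n
    var = sum((x - mu) ** 2 for x in chain) / n
    if var == 0:
        return float(n)
    s = 0.0
    for lag in range(1, n):
        rho = sum((chain[i] - mu) * (chain[i + lag] - mu)
                  for i in range(n - lag)) / (n * var)
        if rho < 0.05:          # truncate once autocorrelation is negligible
            break
        s += rho
    return n / (1.0 + 2.0 * s)

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(500)]
# A strongly autocorrelated AR(1) chain mixes slower, so its ESS is lower.
ar = [0.0]
for _ in range(499):
    ar.append(0.9 * ar[-1] + random.gauss(0, 1))
assert ess(ar) < ess(iid)
```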
6 Discussion

We have given a pedagogical treatment of the Barker proposal scheme, a new gradient-based MCMC algorithm that we argue has some desirable features when compared to classical gradient-based alternatives, namely its robustness (in a very general sense). There are numerous ways in which classical schemes such as MALA and HMC can be made more robust in different settings (e.g. [21, 4, 12, 15]), but these often introduce additional tuning parameters and can suffer from other issues, meaning that the quality of performance becomes very problem-specific. Another alternative is second-order methods that incorporate the Hessian of log π(x) in some way (e.g. [9, 13]), but generally the cost of their implementation is large, and can grow cubically with dimension. Based on the simplicity, scaling properties and robustness of the Barker proposal we argue that there are likely to be many realistic scenarios in which it proves useful, and in addition there is much room for the development of further algorithms within the general framework discussed in Section 3.

References

[1] Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Statistics and Computing 18(4), 343–373 (2008)
[2] Azzalini, A., Regoli, G.: Some properties of skew-symmetric distributions. Annals of the Institute of Statistical Mathematics 64(4), 857–879 (2012)
[3] Barker, A.A.: Monte Carlo calculations of the radial distribution functions for a proton-electron plasma. Australian Journal of Physics 18(2), 119–134 (1965)
[4] Brosse, N., Durmus, A., Moulines, É., Sabanis, S.: The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications 129(10), 3638–3663 (2019)
[5] Chopin, N., Ridgway, J.: Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science 32(1), 64–87 (2017)
[6] Dalalyan, A.S., Karagulyan, A.: User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications 129(12), 5278–5311 (2019)
[7] Durmus, A., Moulines, E.: Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability 27(3), 1551–1587 (2017)
[8] Dwivedi, R., Chen, Y., Wainwright, M.J., Yu, B.: Log-concave sampling: Metropolis–Hastings algorithms are fast! In: Conference on Learning Theory, pp. 793–797. PMLR (2018)
[9] Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2), 123–214 (2011)
[10] Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
[11] Johndrow, J.E., Smith, A., Pillai, N., Dunson, D.B.: MCMC for imbalanced categorical data. Journal of the American Statistical Association 114(527), 1394–1403 (2019)
[12] Livingstone, S., Faulkner, M.F., Roberts, G.O.: Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika 106(2), 303–319 (2019)
[13] Livingstone, S., Girolami, M.: Information-geometric Markov chain Monte Carlo methods using diffusions. Entropy 16(6), 3074–3102 (2014)
[14] Livingstone, S., Zanella, G.: The Barker proposal: combining robustness and efficiency in gradient-based MCMC. arXiv preprint arXiv:1908.11812 (2019)
[15] Lu, X., Perrone, V., Hasenclever, L., Teh, Y.W., Vollmer, S.: Relativistic Monte Carlo. In: Artificial Intelligence and Statistics, pp. 1236–1245. PMLR (2017)
[16] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21(6), 1087–1092 (1953)
[17] Peskun, P.H.: Optimum Monte Carlo sampling using Markov chains. Biometrika 60(3), 607–612 (1973)
[18] Power, S., Goldman, J.V.: Accelerated sampling on discrete spaces with non-reversible Markov processes. arXiv preprint arXiv:1912.04681 (2019)
[19] Roberts, G.O., Rosenthal, J.S.: Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60(1), 255–268 (1998)
[20] Roberts, G.O., Rosenthal, J.S.: General state space Markov chains and MCMC algorithms. Probability Surveys 1, 20–71 (2004)
[21] Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
[22] Tierney, L.: A note on Metropolis–Hastings kernels for general state spaces. Annals of Applied Probability 8(1), 1–9 (1998)
[23] Zanella, G.: Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association 115, 852–865 (2020)