A general perspective on the Metropolis–Hastings kernel
Christophe Andrieu∗, Anthony Lee∗ and Sam Livingstone†
January 1, 2021
∗School of Mathematics, University of Bristol, U.K. †Department of Statistical Science, University College London, U.K.
Abstract
Since its inception the Metropolis–Hastings kernel has been applied in sophisticated ways to address ever more challenging and diverse sampling problems. Its success stems from the flexibility brought by the fact that its verification and sampling implementation rests on a local “detailed balance” condition, as opposed to a global condition in the form of a typically intractable integral equation. While checking the local condition is routine in the simplest scenarios, this proves much more difficult for complicated applications involving auxiliary structures and variables. Our aim is to develop a framework making establishing correctness of complex Markov chain Monte Carlo kernels a purely mechanical or algebraic exercise, while making communication of ideas simpler and unambiguous by allowing a stronger focus on essential features – a choice of embedding distribution, an involution and occasionally an acceptance function – rather than the induced, boilerplate structure of the kernels that often tends to obscure what is important. This framework can also be used to validate kernels that do not satisfy detailed balance, i.e. which are not reversible, but a modified version thereof.
Introduction
Assume one is interested in sampling from a probability distribution π, defined on some probability space (Z, 𝒵). A Markov chain Monte Carlo (MCMC) algorithm consists of simulating a realization of a time-homogeneous Markov chain (Z_0, Z_1, Z_2, . . .), of say kernel P, with the property that the distribution of Z_n becomes arbitrarily close to π as n → ∞ irrespective of the distribution of Z_0. A property the kernel P, or its components in the case of mixtures or compositions of kernels, must satisfy is to leave the distribution π invariant, that is π should be a fixed point of the Markov kernel. This is often referred to as a “global balance” condition in the physics literature and is most often not tractable to verify. Instead one can consider the stronger “detailed balance” condition, or reversibility, a more tractable property due to its local character, which has led in particular to the celebrated Metropolis–Hastings (MH) kernel (Metropolis et al., 1953; Hastings, 1970), the cornerstone of MCMC simulations, and a multitude of successful variations. It is difficult to overstate the importance of detailed balance when discussing the widespread application of MH kernels: one can view such a kernel as being defined by a pair (π, Q), where Q is a proposal Markov kernel, and the algorithm requires only simulation according to Q and computing densities associated with π and Q. This ease of use has led to MH algorithms being used in increasingly sophisticated contexts, leading to sometimes spectacular practical improvements but also increased complexity when establishing correctness (which we will take throughout to mean ensuring that π is left invariant by P) and communicating their structure. The aim of this paper is to develop a simple and general framework to address these issues. In particular, the proposed framework defines an invariant MH kernel Π using a triple (µ, φ, a), where µ is the invariant distribution of Π, φ is an involution and a is an acceptance function, and retains similar ease-of-use properties to those described above: one is required only to be able to simulate from an appropriate conditional distribution of µ, calculate φ and ratios of densities involving µ and φ.
We consider a framework, extending Tierney (1998), for defining a µ-reversible Markov kernel Π of the Metropolis–Hastings type, which only requires the specification of a triplet (µ, φ, a) where µ is a probability measure on some space (E, ℰ), φ: E → E an involution, and a: ℝ₊ → [0, 1] an acceptance function. As we shall see, this covers most scenarios of interest where sampling from π as above is sought, by letting π be a marginal of µ. More specifically, for ξ = (ξ_0, ξ_−) ∼ µ such that ξ_0 ∼ π, ξ_− is a set of instrumental random variables involved in the design of MH kernels – often referred to as “proposals” for standard algorithms, but we refrain from using this reductive terminology. Then the involution φ is applied, defining ξ′ := φ(ξ), and ξ′_0 is the next state of the Markov chain with a probability entirely determined by the triplet (µ, φ, a), or the Markov chain remains at ξ_0. What is remarkable is that a correct algorithm is mathematically entirely determined by this triplet – in particular there is, again at a theoretical level, no need to determine an expression for the “acceptance ratio”: it exists! Practical implementation requires determining a tractable expression for the acceptance ratio, which is, fundamentally, of a measure theoretic nature.
Measure theoretic arguments are often overlooked in the literature and indeed do not need to be considered in detail in most simple scenarios. However this is not the case for more involved applications, where such issues can lead to excruciating and ad hoc contortions, and we have made an effort here not to ignore them. We hope to convince the reader that doing so is truly valuable and brings both generality and clarity to the arguments. The background required is minimal and we provide key results in the text: extensive knowledge of measure theory is not a prerequisite to read the manuscript.
As we shall see, we focus primarily on the choice of (µ, φ) since the choice of a is, at least theoretically, independent of the choice of (µ, φ) and can be determined optimally thanks to the results of Peskun (1973) and Tierney (1998) in the reversible setup and Christophe Andrieu and Livingstone (2019) for nonreversible extensions. We revisit numerous examples, some particularly simple for pedagogical purposes, but also dedicate full sections (Sections 5 and 7) to popular examples which, we know, have baffled more than one researcher before. This includes the No U-Turn Sampler (Hoffman and Gelman, 2014), the extra-chance algorithm (Sohl-Dickstein, Mudigonda, and DeWeese, 2014; Campos and J. Sanz-Serna, 2015) and event chain algorithms (Michel, 2016). In fact, we provide generalizations and in some cases completely novel versions of these algorithms.
We neither address the issues of convergence to equilibrium or of ergodic averages, nor answer the question of what is the best possible involution. These are completely separate issues, but we note that the ideas of Thin, Kotelevskii, et al. (2020), or more generally adaptive MCMC (Christophe Andrieu and Thoms, 2008), could be used for the latter purpose, while Durmus, Moulines, and Saksman (2017) and Thin, Durmus, et al. (2020) provide some ideas concerning general results to establish irreducibility and aperiodicity, the additional sufficient ingredients needed to ensure convergence. There are in our view too many degrees of freedom involved in the choice of good involutions, auxiliary variables and their distributions, and we do not believe that a theorem can, yet, replace intuition, creativity and common sense when designing good MCMC schemes. Our aim here is rather to make checking that one's intuition is correct a purely algebraic exercise, removing in particular the need to revisit common points every time the question of correctness arises, while helping with efficient and unambiguous communication of potentially very complex schemes – see Christophe Andrieu, Arnaud Doucet, Yıldırım, et al. (2020) for an attempt at implementing this point of view.
We limit probabilistic arguments and notation to a minimum and, in contrast with accepted common wisdom, most often use lower case fonts for both random variables and their realizations in order to alleviate notation. We hope this does not cause confusion.
This work is strongly influenced by Tierney (1998), where the possibility of using involutions as “deterministic proposals” is suggested, but not developed as a unifying tool as in the present paper, and the treatment of densities therein is the direct source of inspiration for our own treatment. The papers Fang, J.-M. Sanz-Serna, and Skeel (2014) and Campos and J. Sanz-Serna (2015) were complementary, and revealed to us the importance and generality of the involution point of view, both in the reversible and nonreversible setups, although not always in an explicit manner.
A statement of the main abstract result (Theorem 3) was given in Christophe Andrieu and Livingstone (2019, Proposition 3.5) and presented in a series of lectures organized at the Higher School of Economics in St. Petersburg in August 2019 (Christophe Andrieu, 2019), together with various applications, while a preliminary version of the results concerned with NUTS was presented at BayesComp 2020 in Florida in January 2020. We have recently become aware of Graham (2018, p. 64), where the possibility of using an involution as an update was suggested, drawing on an analogy to Green (1995), but not developed. In fact the involutive framework underpins Green (1995) but is not made explicit. The term “Involutive MCMC”, perhaps a tautology, was coined in Neklyudov et al. (2020), where classical algorithms are revisited in turn following this perspective, but no connection to earlier literature was made; we also note Cusumano-Towner, Lew, and Mansinghka (2020) with earlier claims and the interesting very recent contribution by Glatt-Holtz, Krometis, and Mondaini (2020). Thin, Kotelevskii, et al. (2020) exploit this type of representation of the MH kernel to design normalising flows, and Thin, Durmus, et al. (2020) establish necessary conditions mirroring Tierney (1998) in the skew detailed balance scenario, but also general conditions ensuring irreducibility and aperiodicity.

Notation
• All real-valued functions we consider are Borel measurable. If µ is a measure on (E, ℰ) and f: E → ℝ is a µ-integrable function then we denote the integral µ(f) := ∫_E f(x) µ(dx).
• min{a, b} = a ∧ b, max{a, b} = a ∨ b.
• f · g is the pointwise product f · g = x ↦ f(x)g(x), and f/g = x ↦ f(x)/g(x).
• For a set A ⊂ E, 𝟙_A is the indicator function of the set A, i.e. 𝟙_A(x) = 1 if x ∈ A and 0 otherwise. We also use the notation I{x ∈ A} := 𝟙_A(x) when the definition of A is explicit and long.
• 𝟙 is used to denote the constant function x ↦ 1; usage is clear from context.
• For a given x, δ_x is the Dirac measure at x: δ_x(A) = 𝟙_A(x).
• If (E, ℰ) and (F, ℱ) are measurable spaces, the product measurable space is (E × F, ℰ ⊗ ℱ), where ℰ ⊗ ℱ is the product σ-algebra σ({A × B : A ∈ ℰ, B ∈ ℱ}). If µ is a measure on (E, ℰ) and ν a measure on (F, ℱ) then their product measure on (E × F, ℰ ⊗ ℱ) is µ ⊗ ν, where (µ ⊗ ν)(A, B) = µ(A)ν(B), and we define recursively µ^{⊗n} = µ^{⊗(n−1)} ⊗ µ for n ∈ ℕ*.
• If µ is a measure on (E, ℰ) then the restriction of µ to C ∈ ℰ is the measure µ_C on (E, ℰ) satisfying µ_C(A) := µ(A ∩ C) for any A ∈ ℰ.
• If µ(dx, dy) is a probability measure, we write µ^x to refer to a conditional probability measure for Y given X = x (on a Polish space).
• A cycle of two Markov kernels P: E × ℰ → [0, 1] and Q: E × ℰ → [0, 1] is the Markov kernel PQ(x, A) = ∫ P(x, dy) Q(y, A), x ∈ E, A ∈ ℰ.
• We adopt the standard conventions for products and sums that for b < a, ∏_{i=a}^b · = 1 and ∑_{i=a}^b · = 0, whatever the nature of the argument.
• For x ∈ ℝ, sgn(x) ∈ {−1, 0, 1} is the sign of x.
• We define ℕ = {0, 1, . . .} and ℕ* = {1, 2, . . .}.
• We define ⟦i, j⟧ = {i, i + 1, . . . , j} for integers i ≤ j, and ⟦i⟧ = ⟦1, i⟧ for i ∈ ℕ*.

Assume one is interested in sampling from a probability distribution π, defined on some probability space (Z, 𝒵).
A Markov chain Monte Carlo (MCMC) algorithm consists of simulating a realization {Z_i; i ≥ 0} of a Markov chain such that P(Z_n ∈ A) → π(A), A ∈ 𝒵, as n → ∞ and/or, for functions f ∈ L_1(Z, π),

lim_{n→∞} (1/n) ∑_{i=1}^n f(Z_i) = π(f).

With P denoting the transition probability of the Markov chain, a minimal requirement is that π is left invariant by P. That is, the “global balance” condition holds:

∫ π(dz) P(z, A) = π(A), A ∈ 𝒵. (1)

It is very difficult to verify (1) directly, complicating the design of Markov kernels satisfying this property. A successful approach often consists instead of verifying the stronger, local property of “detailed balance”, or π-reversibility.

Definition 1 (Reversible Markov kernel). For a finite measure µ on (E, ℰ), a Markov kernel P: E × ℰ → [0, 1] is µ-reversible if the measures µ(dξ)P(ξ, dξ′) and µ(dξ′)P(ξ′, dξ) are identical. That is, if

∫_A µ(dξ) P(ξ, B) = ∫_B µ(dξ) P(ξ, A), A, B ∈ ℰ.

It is straightforward to deduce that (1) holds if P is π-reversible, by taking A = E in the definition.

Remark 1. The definition of µ-reversibility is equivalent to: for all measurable F, G: E → [0, 1],

∫ F(ξ) G(ξ′) µ(dξ) P(ξ, dξ′) = ∫ G(ξ) F(ξ′) µ(dξ) P(ξ, dξ′). (2)

In particular, we recover the definition with F = 𝟙_A and G = 𝟙_B, and for the other direction we use the identity µ(dξ)P(ξ, dξ′) = µ(dξ′)P(ξ′, dξ).

Metropolis–Hastings (MH) kernels are a flexible class of π-reversible Markov kernels for which simulation of the corresponding Markov chain can often be implemented on a computer. A textbook derivation is as follows. Assume that Z = ℝ^d and let {Q(z, ·), z ∈ Z} be a family of probability distributions on (Z, 𝒵) from which it is easy to sample. Assume for presentational simplicity that for any z ∈ Z, π and Q(z, ·) have strictly positive densities with respect to the Lebesgue measure, denoted ϖ and q(z, ·). The MH kernel defined by π and Q is given by

P(z, dz′) = α(z, z′) q(z, z′) dz′ + s(z) δ_z(dz′),

where α(z, z′) = 1 ∧ r(z, z′), s(z) = 1 − ∫ α(z, z′) q(z, z′) dz′ and

r(z, z′) = ϖ(z′) q(z′, z) / (ϖ(z) q(z, z′)).

Letting ρ(z, z′) := ϖ(z) q(z, z′), verifying π-reversibility can be reduced to checking that for f, g: Z → [0, 1],

∫ f(z) g(z′) ρ(z, z′) α(z, z′) dz dz′ = ∫ g(z) f(z′) ρ(z, z′) α(z, z′) dz dz′, (3)

since

∫ f(z) g(z′) ϖ(z) s(z) δ_z(dz′) dz = ∫ f(z) g(z) ϖ(z) s(z) dz = ∫ g(z) f(z′) ϖ(z) s(z) δ_z(dz′) dz.

It is a standard exercise to show that ρ(z, z′) α(z, z′) = ρ(z′, z) α(z′, z) and conclude that (3) holds. We outline now a less direct way, which however has the benefit of highlighting important generic properties required. Define ξ := (z, z′), dξ = dz dz′, φ(z, z′) = (z′, z), F(z, z′) = f(z) and G(z, z′) = g(z); then (3) can be re-expressed as

∫ F(ξ) G∘φ(ξ) ρ(ξ) α(ξ) dξ = ∫ F∘φ(ξ) G(ξ) ρ(ξ) α(ξ) dξ. (4)

Further notice that the acceptance ratio is of the form r(ξ) = ρ∘φ/ρ(ξ) and that, using φ∘φ = Id,

r∘φ(ξ) = (ρ∘φ∘φ / ρ∘φ)(ξ) = (ρ / ρ∘φ)(ξ) = 1/r(ξ),

therefore implying

r(ξ) α∘φ(ξ) = r(ξ) [1 ∧ r∘φ(ξ)] = r(ξ) [1 ∧ 1/r(ξ)] = α(ξ). (5)
We now show that (4) holds for any measurable F, G: E → [0, 1]:

∫ F(ξ) G∘φ(ξ) ρ(ξ) α(ξ) dξ = ∫ F(ξ) G∘φ(ξ) ρ(ξ) r(ξ) α∘φ(ξ) dξ
= ∫ F(ξ) G∘φ(ξ) ρ∘φ(ξ) α∘φ(ξ) dξ
= ∫ F∘φ(ξ′) G(ξ′) ρ(ξ′) α(ξ′) dξ′,

where we have used α(ξ) = r(ξ) α∘φ(ξ), r(ξ) = ρ∘φ/ρ(ξ), the change of variable ξ′ = φ(ξ), and the fact that φ is an involution with Jacobian |det φ′(ξ)| = 1 (see Theorem 4). This therefore implies (3) and in turn that P is π-reversible. In fact, letting µ(dξ) := ρ(ξ) dξ, we notice that this establishes µ-reversibility of an MH kernel targeting the extended probability distribution µ.
This presentation has the advantage of highlighting a set of generic properties sufficient to establish π-reversibility:
(a) the distribution π is a marginal of a probability distribution µ,
(b) the proposed state is obtained by applying an involution φ to ξ,
(c) it holds that α(ξ) µ(dξ) = α∘φ(ξ) µ^φ(dξ), with µ^φ the probability distribution of ξ′ = φ^{−1}(ξ) = φ(ξ),
suggesting that more general choices of µ, φ and α can also define π-reversible Markov kernels. It can be shown (Theorem 4) that the first two properties automatically imply the mathematical existence of α such that the third property holds, highlighting the fundamental rôle played by the involutory nature of φ. Practical implementation of the algorithm requires two additional properties of µ: the existence of a tractable probability density to compute α, and ease of sampling from the conditional distribution in µ(dξ) = π(dξ_0) µ^{ξ_0}(dξ_−). The clear benefit of this approach is that establishing correctness becomes a purely mechanical, or “algebraic”, exercise, therefore improving clarity of arguments and facilitating communication.
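These identities are easy to check numerically. The following minimal Python sketch (ours, not part of the paper; all names are hypothetical) instantiates the textbook case with a standard normal target and a Gaussian random walk proposal, and verifies r∘φ = 1/r and ρ·α = (ρ∘φ)·(α∘φ) at randomly drawn points ξ = (z, z′).

```python
import numpy as np

rng = np.random.default_rng(0)

log_varpi = lambda z: -0.5 * z ** 2            # unnormalized log target density (standard normal)
log_q = lambda z, zp: -2.0 * (zp - z) ** 2     # log proposal density q(z, z'): N(z, 0.5^2), up to a constant

def log_rho(xi):                               # density of mu(d xi) = pi(dz) Q(z, dz')
    z, zp = xi
    return log_varpi(z) + log_q(z, zp)

phi = lambda xi: (xi[1], xi[0])                # involution: swap current and proposed states

r = lambda xi: np.exp(log_rho(phi(xi)) - log_rho(xi))   # acceptance ratio rho∘phi / rho
alpha = lambda xi: min(1.0, r(xi))             # Metropolis acceptance function a(r) = 1 ∧ r

for _ in range(5):
    xi = tuple(rng.normal(size=2))
    assert np.isclose(r(phi(xi)), 1.0 / r(xi))           # r∘phi = 1/r
    assert np.isclose(np.exp(log_rho(xi)) * alpha(xi),
                      np.exp(log_rho(phi(xi))) * alpha(phi(xi)))  # rho·alpha = (rho∘phi)·(alpha∘phi)
print("identities hold at sampled points")
```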
In order to gain generality and clarity we will appeal to a very small number of standard measure theoretic notions and results related to change of variables and Radon–Nikodym derivatives. Although it is always a good idea to check the proof of classical results, there is no need to do so in order to understand the content of this manuscript.

Definition 2 (Pushforward). Let µ be a measure on (E, ℰ) and ϕ: (E, ℰ) → (F, ℱ) a measurable function. The pushforward of µ by ϕ is defined by µ^ϕ(A) = µ(ϕ^{−1}(A)), A ∈ ℱ, where ϕ^{−1}(A) = {x ∈ E : ϕ(x) ∈ A} is the preimage of A under ϕ.

For example, if µ is a probability distribution then µ^ϕ is the probability measure associated with ϕ(X) when X ∼ µ.

Definition 3 (Dominating and equivalent measures). For two measures µ and ν on the same measurable space (E, ℰ),
(a) µ is said to dominate ν if for all measurable A ∈ ℰ, ν(A) > 0 ⇒ µ(A) > 0 – this is denoted µ ≫ ν;
(b) µ and ν are equivalent, written µ ≡ ν, if µ ≫ ν and ν ≫ µ.

We will need the notion of Radon–Nikodym derivative:

Theorem 1 (Radon–Nikodym). Let µ and ν be σ-finite measures on (E, ℰ). Then ν ≪ µ if and only if there exists an essentially unique, measurable, non-negative function f such that

∫_A f(ξ) µ(dξ) = ν(A), A ∈ ℰ.

Therefore we can view dν/dµ := f as the density of ν w.r.t. µ, and in particular, if g is integrable w.r.t. ν, then

∫ g(ξ) (dν/dµ)(ξ) µ(dξ) = ∫ g(ξ) ν(dξ).

This is covered by Billingsley (1995, Theorems 32.2 & 16.11). If µ is a measure and f a non-negative, measurable function then µ · f is the measure (µ · f)(A) = ∫ 𝟙_A(x) f(x) µ(dx), i.e. the measure ν = µ · f such that the Radon–Nikodym derivative dν/dµ = f.

Theorem 2 (Change of variables). A function f: F → ℝ is integrable w.r.t. µ^ϕ if and only if f∘ϕ is integrable w.r.t. µ, in which case

∫_F f(ξ) µ^ϕ(dξ) = ∫_E f∘ϕ(ξ) µ(dξ). (6)

This can be found in Billingsley (1995, Theorem 16.13).
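As a concrete illustration of Definition 2 and Theorem 2 (ours, not from the paper): for µ the standard normal distribution and ϕ = exp, the pushforward µ^ϕ is the standard lognormal distribution, so (6) equates a Monte Carlo average of f∘ϕ under µ with the integral of f against the lognormal density. A quick numerical check, with a bounded test function of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda y: 1.0 / (1.0 + y)                  # bounded test function on (0, inf)

# Left-hand side of (6): integral of f against mu^phi, computed by quadrature
# using the standard lognormal density (the pushforward of N(0, 1) by exp).
y = np.linspace(1e-6, 60.0, 400_000)
dy = y[1] - y[0]
lognormal_pdf = np.exp(-0.5 * np.log(y) ** 2) / (y * np.sqrt(2.0 * np.pi))
lhs = np.sum(f(y) * lognormal_pdf) * dy

# Right-hand side of (6): Monte Carlo average of f∘phi under mu = N(0, 1).
x = rng.normal(size=10 ** 6)
rhs = f(np.exp(x)).mean()

print(lhs, rhs)                                # both ≈ 0.5, equal up to Monte Carlo error
```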
The following result is central to the design of MH-based MCMC, formalizes the observations made in Section 2 and generalizes parts of Tierney (1998, Proposition 1 and Theorem 2), concerned with the specific involution φ(z, z′) = (z′, z) and a particular form of distribution µ. We do not pursue necessary conditions here, to keep the presentation brief and focused on practical consequences: Tierney (1998) discusses such issues, while Thin, Durmus, et al. (2020) revisit these issues in a particular nonreversible setup (see Section 4). The proof can be found in Appendix A. This result mirrors Christophe Andrieu and Livingstone (2019, Proposition 3.5).

Theorem 3. Let µ be a finite measure on (E, ℰ) and φ: E → E an involution. Then
(a) there exists a set S = S(µ, µ^φ) ∈ ℰ such that
(i) φ(S) = S,
(ii) with µ_S(A) := µ(A ∩ S) for any A ∈ ℰ we have µ^φ_S ≡ µ_S,
(iii) µ and µ^φ are mutually singular on S^∁, i.e. there exist sets A, B ∈ ℰ such that A ∩ B = ∅, A ∪ B = S^∁ and µ(A) = µ^φ(B) = 0;
(b) defining, for ξ ∈ E,

r(ξ) := { dµ^φ_S/dµ_S(ξ) if ξ ∈ S, 0 otherwise, } (7)

and letting a: [0, ∞) → [0, 1] be such that

a(r) = { 0 if r = 0, r a(1/r) if r > 0, }

we have that
(i) for ξ ∈ E,

α(ξ) := a∘r(ξ) = { r(ξ) · α∘φ(ξ) if ξ ∈ S, 0 otherwise, }

(ii) for any measurable F, G: E → [0, 1],

∫ F(ξ) G∘φ(ξ) α(ξ) µ(dξ) = ∫ F∘φ(ξ) G(ξ) α(ξ) µ(dξ),

(iii) the Markov kernel Π defined by Π(ξ, {φ(ξ)}) = α(ξ) = 1 − Π(ξ, {ξ}) is µ-reversible.

Remark 2. The condition on a is satisfied by a(r) = 1 ∧ r (corresponding to the Metropolis–Hastings acceptance probability), and also by a(r) = r/(1 + r) (Barker's acceptance probability; see Example 3), therefore ensuring the existence of Π and P.

In practice one is interested in the component ξ_0 of µ, which is distributed according to π. In fact, the Markov kernel Π in Theorem 3 can be used to define a π-invariant Markov kernel P. The proof can be found in Appendix A.
Proposition 1. Let π be a probability distribution on (Z, 𝒵) and let µ be a probability distribution on (E, ℰ) such that µ(dξ) := π(dξ_0) µ^{ξ_0}(dξ_−), where µ^{ξ_0} denotes the conditional distribution of ξ_− given ξ_0 under µ. Then the Markov kernel

P(ξ_0, A) := ∫ 𝟙_A(ξ′_0) µ^{ξ_0}(dξ_−) Π(ξ; dξ′), A ∈ 𝒵,

is π-reversible.

An algorithmic description of P is given in Alg. 1, highlighting the practical requirement that sampling from µ^{ξ_0}(·) for ξ_0 ∈ Z should be tractable. The implication of these results should be clear. If sampling from π is of interest, any choice of µ of the form

µ(dξ) = π(dξ_0) µ^{ξ_0}(dξ_−), (8)

together with an involution φ and an acceptance function a, defines a π-reversible Markov kernel/chain. It turns out that all MH-type kernels we are aware of, including advanced and complex implementations, can be described and immediately justified using this framework.

Algorithm 1 To sample from P(ξ_0, ·)
(a) Given ξ_0, sample ξ_− ∼ µ^{ξ_0}.
(b) Compute ξ′ = φ(ξ).
(c) With probability α(ξ) return ξ′_0, otherwise return ξ_0.

Remark 3. The framework specified is very flexible: to define a π-reversible Markov kernel P, whose simulation is described in Algorithm 1, it is sufficient to define a triple (µ, φ, a) such that π is the ξ_0-marginal of µ. This is analogous to the definition of a traditional Metropolis–Hastings kernel via the choice (π, Q) in Section 2. Importantly, the nature of ξ_− is a priori arbitrary and does not have to coincide with that of ξ_0, therefore providing great freedom. In general, the association is not unique: there are several (µ, φ, a) triples corresponding to the same Markov kernel P. In the sequel we will focus primarily on the measure–involution pair (µ, φ), since the choice of a can be taken independently of the choice of (µ, φ) from a theoretical perspective. In the sequel we will consider Markov kernels Π as in Theorem 3, or derivatives such as P in Proposition 1, as Metropolis–Hastings type kernels.

Remark 4. In the context of Proposition 1 it is natural to ask whether theoretical properties, such as optimality in terms of asymptotic variance of Π, translate into optimality for P. The answer is yes and follows by application of the results of Maire, Douc, and Olsson (2014), later extended in Christophe Andrieu and Livingstone (2019) to the nonreversible scenario treated in Section 4.

We now provide examples of commonly used Markov kernels, which can be recognized by the particular form of µ and φ, and for which expressions of the corresponding acceptance ratios are left to Section 3.2. This highlights the fact that the acceptance ratio is a function depending only on µ and φ.
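Before turning to the examples, here is a generic Python sketch of Alg. 1 (ours; the decomposition into a conditional sampler for µ^{ξ_0}, an involution phi and a log density log_rho is our hypothetical bookkeeping, not the paper's notation). It assumes the common situation of Proposition 2 below, in which the reference measure is invariant under φ so that r = ρ∘φ/ρ; the usage lines instantiate the random walk Metropolis of Example 2.

```python
import math
import random

def mh_step(xi0, sample_aux, phi, log_rho, a=lambda r: min(1.0, r)):
    """One draw from P(xi0, ·) of Alg. 1 for a triple (mu, phi, a).

    sample_aux(xi0) draws xi_minus ~ mu^{xi0}; phi is an involution on the
    extended space; log_rho is log d(mu)/d(lambda) for a reference measure
    lambda assumed invariant under phi, so that r = rho∘phi / rho.
    """
    xi = (xi0, sample_aux(xi0))                    # (a) xi = (xi_0, xi_-) with xi_- ~ mu^{xi_0}
    xi_prop = phi(xi)                              # (b) deterministic proposal via the involution
    r = math.exp(log_rho(xi_prop) - log_rho(xi))   # acceptance ratio (7) under the assumption above
    return xi_prop[0] if random.random() < a(r) else xi0   # (c) accept or stay

# Usage: random walk Metropolis for a standard normal target, with
# xi = (z, v), v ~ N(0, 1) and phi(z, v) = (z + v, -v) (Example 2).
log_rho = lambda xi: -0.5 * xi[0] ** 2 - 0.5 * xi[1] ** 2
phi = lambda xi: (xi[0] + xi[1], -xi[1])
sample_aux = lambda z: random.gauss(0.0, 1.0)

z = 0.0
for _ in range(10_000):
    z = mh_step(z, sample_aux, phi, log_rho)
```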
Example 1. The textbook presentation of the MH kernel considered in the introduction corresponds to the choice of a family of conditional probability distributions {µ^z(·) = Q(z, ·), z ∈ Z} on (Z, 𝒵), ξ = (z, z′) ∈ E = Z × Z, φ(z, z′) = (z′, z) and a(r) = 1 ∧ r.
Example 2. The Random Walk Metropolis (RWM) can be thought of as corresponding to the choice {µ^z(·) = κ(·), z ∈ Z} for some probability distribution κ on (Z, 𝒵), ξ = (z, v) and φ(ξ) = (z + v, −v). Alternatively, one may express the RWM as a special case of Example 1, so that {µ^z(·) = Q(z, ·), z ∈ Z}, ξ = (z, z′) and φ(z, z′) = (z′, z).

Example 3 (Metropolis–Hastings, Barker, etc.). Let µ be as in Example 1, and let ξ = (z, z′) with ξ_0 = z. Then Alg. 1 corresponds to simulating from the Metropolis–Hastings (resp. Barker) kernel when α = a∘r, with a(v) = 1 ∧ v (resp. a(v) = v/(1 + v)). This corresponds to the presentation adopted by Tierney (1998) and commonly adopted in the literature.

The requirement that φ be an involution may appear restrictive, but in fact for a given invertible function one can define a corresponding involution by extending the space.

Remark 5. Let µ be a measure admitting π as a marginal and φ: E → E be invertible, but not an involution. Then (µ̄, φ̄) is a corresponding measure–involution pair, where µ̄(dξ, dv) := µ(dξ) I{v ∈ {−1, 1}}/2 and φ̄(ξ, v) := (φ^v(ξ), −v) on Ē := E × {−1, 1}. Since µ̄ admits µ as a marginal, it also admits π as a marginal.

Example 4 (Ordered overrelaxation (Neal, 1998)). A Gibbs sampler can be thought of as a MH update where conditional distributions of the target distribution π on (Z, 𝒵) are used in the proposal mechanism. To fix ideas assume Z = X × Y where Y ⊂ ℝ, and let η be one such conditional distribution on (Y, 𝒴) from which sampling is tractable. The goal of the method is to develop a numerical implementation of the following remark. Let (x, y_0) ∈ Z and let F_η be the cumulative distribution function (cdf) corresponding to η, where x is implicit. Then y′ = F_η^{−1}(1 − F_η(y_0)) is antithetic to y_0 – in fact for u ∼ Uniform(0, 1) the pair (F_η^{−1}(u), F_η^{−1}(1 − u)) is the lower bound in the Fréchet class of bivariate distributions of marginals η. The numerical approximation of this remark exploits the link between empirical cdfs and order statistics. One can sample multiple times independently from η, leading to the probability distribution, for n ∈ ℕ*, on (Z × Y^n, 𝒵 ⊗ 𝒴^{⊗n}),

µ = π ⊗ η^{⊗n}.

Let z = (x, y_0) and let σ: ⟦0, n⟧ → ⟦0, n⟧ be the ξ := (z, y_1, . . . , y_n)-dependent permutation such that y_{σ(0)} ≤ ··· ≤ y_{σ(n)}, and let r ∈ ⟦0, n⟧ be the integer such that σ(r) = 0, i.e. y_0 is the r-th rank order statistic among y_0, y_1, . . . , y_n. Now we consider the following involution, for r ∈ ⟦1, n − 1⟧,

φ(z, y_1, . . . , y_n) = (x, y_{σ(n−r)}, y_1, . . . , y_{σ(n−r)−1}, y_0, y_{σ(n−r)+1}, . . . , y_n),

with straightforward adaptation if r ∈ {0, n}. It should be clear, from the exchangeability conditional upon x ∈ X, that for ξ ∈ S(µ, µ^φ), r(ξ) = 1. One can naturally replace η with a proposal distribution of our choosing, but the acceptance ratio is then not identically equal to 1. A sketch of this update is given below.
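A minimal Python sketch of this update (ours), taking η = N(0, 1) for illustration: draw n fresh points from η, find the rank r of the current value among all n + 1 points, and move to the point of complementary rank n − r. By the exchangeability argument above the acceptance ratio equals 1, so no accept/reject step is needed.

```python
import random

def ordered_overrelaxation(y0, sample_eta, n=15):
    """One ordered overrelaxation update (Example 4): returns the value of
    complementary rank n - r, where r is the rank of y0 among the n + 1
    points (ties occur with probability zero for a continuous eta)."""
    pts = sorted([y0] + [sample_eta() for _ in range(n)])
    r = pts.index(y0)
    return pts[n - r]

# Successive values are negatively correlated, which is the point of the method.
sample_eta = lambda: random.gauss(0.0, 1.0)
y, trace = 0.5, []
for _ in range(10):
    y = ordered_overrelaxation(y, sample_eta)
    trace.append(round(y, 3))
print(trace)
```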
Adopting this point of view makes establishing reversibility routine, even in complex scenarios. However practical implementation of the update requires an explicit expression for the acceptance ratio r in (7), not provided by the results above.

Remark 6. Alg. 1 is conceptually simple, but in practice it may be expedient to avoid a direct implementation. What is actually required to simulate from P(ξ_0, ·) is to sample a Bernoulli(α(ξ)) random variable, where ξ_− ∼ µ^{ξ_0}, and to compute φ(ξ). In particular, it may not be necessary to simulate or store ξ_− in its entirety to perform these tasks, e.g. when ξ_− is large or even infinite-dimensional. Some examples are provided in Section 4.

We will primarily focus on Alg. 1 in the sequel. Hence, for examples and applications of this framework we will identify an appropriate (µ, φ), hence defining Π in Theorem 3 up to the choice of a. The corresponding π-reversible Markov kernel is then defined by P in Proposition 1. There are, of course, other µ-invariant kernels that can be constructed using Π. For example, letting R define the refreshment kernel

R(ξ, dξ′) = δ_{ξ_0}(dξ′_0) µ^{ξ_0}(dξ′_−),

Alg. 1 corresponds to tracking the ξ_0-coordinate of RΠ(ξ, ·). One could instead define a µ-invariant kernel as γR + (1 − γ)Π for some γ ∈ (0, 1). Even more generally, one could replace R with another Markov kernel that only leaves the conditional distribution µ^{ξ_0} invariant. The cycle RΠ is then µ-invariant and would sometimes be referred to as a Metropolis-within-Gibbs (MwG) kernel, although we note that in this case the corresponding ξ_0-coordinate of the µ-invariant Markov chain would in general not be Markov. More generally we will refer to an algorithm involving a mixture (“random scan”) or cycle (“deterministic scan”) of kernels targeting the same distribution as a MwG, a widely accepted misnomer.

In order to compute the acceptance ratio r in Theorem 3, one must identify S and have an expression for dµ^φ_S/dµ_S. We show below how to phrase these objects in terms of a density ρ = dµ/dλ, where λ is an appropriate reference measure. Such a density is often available a priori in practice.

Proposition 2. Let µ be a finite measure on (E, ℰ), φ: E → E an involution, let λ ≫ µ be a σ-finite measure satisfying λ ≡ λ^φ and let ρ = dµ/dλ. Then we can take S = S(µ, µ^φ) to be S = {ξ : ρ(ξ) ∧ ρ∘φ(ξ) > 0} and

r(ξ) = { (ρ∘φ/ρ)(ξ) (dλ^φ/dλ)(ξ) if ξ ∈ S, 0 otherwise, } (9)

in Theorem 3.

The proof can be found in Appendix B.2. In many situations λ will be the Lebesgue or counting measure, but it can also be a product of these, or an infinite-dimensional probability measure such as a Gaussian measure (Hairer, Stuart, and Vollmer, 2014) or the law of a Markov chain (this is treated in Subsection 5.2). Computing (9) additionally involves computing the density dλ^φ/dλ.

Remark 7. If in Proposition 2 λ is invariant under φ, i.e. λ = λ^φ, then r = ρ∘φ/ρ. In theory, it is always possible to find a reference measure invariant under φ; e.g. one could instead of λ take λ̄ := λ + λ^φ or even λ̄ := µ + µ^φ, which underpins the proof of Theorem 3. However, it may not be straightforward or natural to compute the density dµ/dλ̄, while there is often a natural choice of λ for which dµ/dλ can be computed.

A standard scenario is when λ is the Lebesgue measure on E = ℝ^d and φ is a diffeomorphism, in which case dλ^φ/dλ corresponds to the absolute value of the determinant of the Jacobian, since then for any λ-integrable f (see Theorem B.2 in Appendix B)

∫ f∘φ(ξ) |det φ′(ξ)| λ(dξ) = ∫ f(ξ) λ(dξ) = ∫ f∘φ(ξ) λ^φ(dξ),

while for an arbitrary, measurable, non-negative g: E → ℝ we can take f = g∘φ^{−1}, so that g = f∘φ and hence

∫ g(ξ) |det φ′(ξ)| λ(dξ) = ∫ g(ξ) λ^φ(dξ).

The example of the introduction corresponds to this scenario, but where in addition φ is an involution and the reference measure is invariant under φ.
Remark 8. There are several ways one can determine dλ^φ/dλ in common situations. For example:
(a) Let E = ℝ^d with ξ = (z_1, . . . , z_d), and φ be an involution that permutes its input, i.e. φ(z_1, . . . , z_d) = (z_{σ(1)}, . . . , z_{σ(d)}) for some permutation σ of {1, . . . , d}. Then since φ′(ξ) is the corresponding permutation matrix and all permutation matrices have a determinant in {−1, 1}, we obtain |det φ′(ξ)| = 1. So if λ is the Lebesgue measure on ℝ^d then λ^φ = λ.
(b) Let µ be a measure with countable support X, and let λ be the counting measure on E = X ∪ φ(X) = X ∪ {φ(x) : x ∈ X}. Then for an arbitrary measurable A ⊆ E we have λ^φ(A) = λ(φ^{−1}(A)) = |A| = λ(A), since φ is an involution. Hence λ = λ^φ, so dλ^φ/dλ = 1.

In some of our applications, µ has continuous and discrete components, and a density with respect to a product of a Lebesgue measure and a counting measure. When the involution for the discrete component does not depend on the continuous component, we have the following result.
Lemma 1. Let λ_X be the Lebesgue measure on X, λ_Y the counting measure on Y, and g: Y → Y be an involution with g(Y) ⊆ Y. Let f: X × Y → X be a function such that

φ(x, y) = (f(x, y), g(y))

is an involution, and let λ = λ_X × λ_Y. Then dλ^φ/dλ(x, y) = |det f′_y(x)|, where f′_y denotes the Jacobian of x ↦ f(x, y).
We are now in a position to provide expressions for the acceptance ratios in Examples 1–2.
Example 5 (Metropolis–Hastings acceptance ratio). Let π and {Q(z, ·), z ∈ Z} be probability measures on (Z, 𝒵) such that, with ν the Lebesgue or counting measure, we have ν ≫ π and ν ≫ Q(z, ·) for each z ∈ Z. Let ϖ(z) = dπ/dν(z) and q(z, z′) = dQ(z, ·)/dν(z′) for all (z, z′) ∈ Z². With ξ = (z, z′) we let µ(dξ) = π(dz)Q(z, dz′) and φ(z, z′) = (z′, z). Then with λ^φ = λ := ν × ν we obtain ρ(ξ) = ϖ(z)q(z, z′) and ρ∘φ(ξ) = ϖ(z′)q(z′, z), and the acceptance ratio is, for ξ ∈ S(µ, µ^φ) = {ξ : ρ(ξ) ∧ ρ∘φ(ξ) > 0},

r(ξ) = ρ∘φ(ξ)/ρ(ξ) = ϖ(z′)q(z′, z) / (ϖ(z)q(z, z′)).

Example 6 (Random walk Metropolis ratio). The setup is similar to above, but we assume that Z = ℝ^d, ν is the Lebesgue measure, q(z, v) = q(v) := dQ/dν(v) for (z, v) ∈ Z² and q(v) = q(−v) for v ∈ Z. Here λ = ν × ν, ξ = (z, v) ∈ Z², φ(ξ) = (z + v, −v) and |det φ′(ξ)| = 1, leading to

r(ξ) = ρ∘φ(ξ)/ρ(ξ) = ϖ(z + v)q(−v) / (ϖ(z)q(v)) = ϖ(z + v)/ϖ(z).

It is possible to consider the setting where ξ = ξ_0 and φ is an involution with non-unit Jacobian. Such situations are related, e.g., to the Monte Carlo Markov kernels based on deterministic transformations proposed by Dutta and Bhattacharya (2014).
Example 7. Assume ρ = dµ/dλ with {ξ ∈ E : ρ(ξ) > 0} = (0, 1), λ the Lebesgue measure on ℝ, and let φ(ξ) = 1/(2ξ). One can deduce that λ and λ^φ are equivalent with dλ^φ/dλ(ξ) = |φ′(ξ)| = 1/(2ξ²). We obtain

S = S(µ, µ^φ) = {ξ ∈ E : ρ(ξ) ∧ ρ∘φ(ξ) > 0} = (0, 1) ∩ φ^{−1}((0, 1)) = (1/2, 1).

Therefore r(ξ) = ρ∘φ(ξ)/(2ξ² ρ(ξ)) for ξ ∈ S and r(ξ) = 0 otherwise.
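This example can be checked empirically. In the Python sketch below (ours) we take ρ to be the Beta(2, 2) density, which is positive exactly on (0, 1), draw ξ ∼ µ, apply one step of the kernel Π of Theorem 3 and verify that the distribution is unchanged; note the Jacobian factor 1/(2ξ²) and that moves are only proposed within S = (1/2, 1).

```python
import numpy as np

rng = np.random.default_rng(2)

rho = lambda x: 6.0 * x * (1.0 - x)        # Beta(2, 2) density: positive exactly on (0, 1)
phi = lambda x: 1.0 / (2.0 * x)            # the involution of Example 7

def r(x):
    y = phi(x)
    if 0.0 < y < 1.0:                      # equivalently x in S = (1/2, 1)
        return rho(y) / rho(x) / (2.0 * x * x)   # (rho∘phi / rho) · d(lambda^phi)/d(lambda)
    return 0.0

# One-step invariance check: xi ~ mu, one step of Pi, distribution unchanged.
xi = rng.beta(2.0, 2.0, size=200_000)
acc = rng.random(xi.size) < np.minimum(1.0, np.array([r(x) for x in xi]))
xi1 = np.where(acc, 1.0 / (2.0 * xi), xi)
print(xi.mean(), xi1.mean())               # both ≈ 1/2: mu is preserved
```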
Example 8. Consider π a probability measure on ℝ dominated by the Lebesgue measure and ϕ(x) = x³, which is invertible but not an involution. Then, following Remark 5, we can extend the space to E = ℝ × {−1, 1} and define ξ = (x, k), µ(dξ) = π(dx) I(k ∈ {−1, 1})/2, λ to be the product of the Lebesgue measure and the counting measure, and φ(ξ) = φ(x, k) = (ϕ^k(x), −k). Following Lemma 1, we obtain

dλ^φ/dλ(x, k) = |det (ϕ^k)′(x)| = { 3x² if k = 1, |x|^{−2/3}/3 if k = −1. }
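A Python sketch of a π-invariant kernel built from Example 8 (ours), for a standard normal target: the cubing move applies ϕ or ϕ^{−1} with equal probability and corrects with the Jacobian factor above. Since on its own this move only explores the orbit x, x³, x⁹, . . . , we alternate it with a random walk move for ergodicity; this mixture is our choice for the illustration, not part of the example.

```python
import math
import random

log_varpi = lambda x: -0.5 * x * x               # log target density, up to a constant

def cube_move(x):
    """Involutive move of Example 8: propose phi^k(x) with k uniform on {-1, 1},
    accepting with the Jacobian-corrected ratio given by Lemma 1."""
    k = random.choice([-1, 1])
    if k == 1:
        y = x ** 3
        log_jac = math.log(3.0 * x * x) if x != 0.0 else -math.inf
    else:
        y = math.copysign(abs(x) ** (1.0 / 3.0), x)
        log_jac = math.log(abs(x) ** (-2.0 / 3.0) / 3.0) if x != 0.0 else math.inf
    log_r = log_varpi(y) - log_varpi(x) + log_jac
    return y if random.random() < math.exp(min(0.0, log_r)) else x

def rwm_move(x):                                 # mixed in for ergodicity
    y = x + random.gauss(0.0, 1.0)
    return y if random.random() < math.exp(min(0.0, log_varpi(y) - log_varpi(x))) else x

x, xs = 1.0, []
for _ in range(50_000):
    x = rwm_move(cube_move(x))
    xs.append(x)
print(sum(xs) / len(xs))                         # ≈ 0 for the standard normal target
```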
A slightly more general version of Example 5 above can be used when Q is reversible w.r.t. some measure.

Example 9. Let π be a probability measure on (Z, 𝒵), ν be a reference measure such that π ≪ ν, and assume that Q is ν-reversible. Then with ξ = (z, z′) and ϖ = dπ/dν,

µ(dξ) = π(dz) Q(z, dz′) = ϖ(z) ν(dz) Q(z, dz′),

that is, ρ(ξ) = ϖ(z) with λ(dξ) = ν(dz)Q(z, dz′), and by assumption λ^φ = λ for φ(z, z′) = (z′, z), (z, z′) ∈ Z². Therefore

r(ξ) = (ρ∘φ/ρ)(ξ) = ϖ(z′)/ϖ(z), ξ ∈ S = {(z, z′) ∈ Z² : ϖ(z) ∧ ϖ(z′) > 0}.

In many common RWM kernels, ν is the Lebesgue (resp. counting) measure on a continuous (resp. discrete) state space.

Example 10 (Simplified Neal tempering). Let π be a multimodal distribution on (Z, 𝒵). A strategy proposed by Neal (1996) to mitigate the effect of multimodality consists of using an instrumental distribution π̃ ≡ π, also defined on (Z, 𝒵), related to π but less multimodal, to improve the rate of moves between the modes of π. More specifically, define

µ(dξ) = π(dξ_1) Q(ξ_1, dξ_2) Q̃(ξ_2, dξ_3) Q(ξ_3, dξ_4),

where Q̃ (resp. Q) is π̃-reversible (resp. π-reversible), and consider the involution on E = Z⁴ such that φ(ξ_1, ξ_2, ξ_3, ξ_4) = (ξ_4, ξ_3, ξ_2, ξ_1). Using these properties, we obtain

µ^φ(dξ) = π(dξ_4) Q(ξ_4, dξ_3) Q̃(ξ_3, dξ_2) Q(ξ_2, dξ_1)
= π(dξ_3) Q(ξ_3, dξ_4) Q̃(ξ_3, dξ_2) Q(ξ_2, dξ_1)
= (dπ/dπ̃)(ξ_3) π̃(dξ_3) Q̃(ξ_3, dξ_2) Q(ξ_2, dξ_1) Q(ξ_3, dξ_4)
= (dπ/dπ̃)(ξ_3) π̃(dξ_2) Q̃(ξ_2, dξ_3) Q(ξ_2, dξ_1) Q(ξ_3, dξ_4)
= (dπ/dπ̃)(ξ_3) (dπ̃/dπ)(ξ_2) π(dξ_2) Q(ξ_2, dξ_1) Q̃(ξ_2, dξ_3) Q(ξ_3, dξ_4)
= (dπ/dπ̃)(ξ_3) (dπ̃/dπ)(ξ_2) π(dξ_1) Q(ξ_1, dξ_2) Q̃(ξ_2, dξ_3) Q(ξ_3, dξ_4)
= (dπ/dπ̃)(ξ_3) (dπ̃/dπ)(ξ_2) µ(dξ).

It follows that we can take S = {ξ ∈ E : (dπ̃/dπ)(ξ_2)(dπ/dπ̃)(ξ_3) > 0}, and we obtain

r(ξ) = (dπ̃/dπ)(ξ_2) (dπ/dπ̃)(ξ_3), ξ ∈ S.

In this case, we can think of λ = µ ≡ µ^φ = λ^φ, ρ ≡ 1 and dλ^φ/dλ = r(ξ) on S. In practice, computation of the acceptance ratio may be facilitated by convenient densities for π̃ and π with respect to a common dominating measure. The above can be viewed as the justification for the tempered transitions kernel introduced by Neal (1996), where several instrumental distributions are used; these ideas are also related to the methodology in Neal (2005).

Example 11 (Penalty method, Ceperley and Dewing (1999)). In this scenario µ(d(z, w)) = π(dz) Q_z(dw), ξ = (z, w) ∈ Z × W with W ⊂ ℝ*₊, φ(z, w) = (φ̃(z), 1/w) for φ̃: Z → Z an involution, and

w · Q_z(dw) = Q^{1/·}_{φ̃(z)}(dw) for w > 0, where for any f: W → [0, 1], ∫ f(w) Q^{1/·}_z(dw) := ∫ f(w^{−1}) Q_z(dw),

therefore implying, for ξ ∈ S,

r(ξ) = (dπ^{φ̃}/dπ)(z) · Q^{1/·}_{φ̃(z)}(dw)/Q_z(dw) = (dπ^{φ̃}/dπ)(z) · w.

The motivation for this setup is the situation where a noisy version of the acceptance ratio dπ^{φ̃}/dπ(z) is available, where the noise is additive in the log-domain, corresponding to noisy energies in physics. The condition on Q_z is satisfied by the random variable W := exp(−σ_z²/2 + σ_z N) with N ∼ N(0, 1), for z ↦ σ_z such that σ_z = σ_{φ̃(z)}, because

∫ f(w) Q_z(dw) = ∫ f[exp(−σ_z²/2 + σ_z x)] N(dx; 0, 1)
= ∫ f[exp(σ_z²/2 − σ_z(x + σ_z))] N(dx; 0, 1)
= ∫ f[exp(σ_z²/2 − σ_z x)] N(dx; σ_z, 1)
= ∫ f[exp(σ_{φ̃(z)}²/2 − σ_{φ̃(z)} x)] · exp(−σ_{φ̃(z)}²/2 + σ_{φ̃(z)} x) N(dx; 0, 1)
= ∫ f(w^{−1}) w · Q_{φ̃(z)}(dw)
= ∫ f(w) w^{−1} · Q^{1/·}_{φ̃(z)}(dw).
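The lognormal condition is easy to test by simulation. The Python check below (ours, with a test function of our choosing) verifies ∫ f(w) w Q_z(dw) = ∫ f(1/w) Q_{φ̃(z)}(dw) by Monte Carlo, using that σ_z = σ_{φ̃(z)} so that the two noise distributions coincide.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma = 0.7                                   # noise scale; sigma_z = sigma_{phi(z)} by assumption
w = np.exp(sigma * rng.normal(size=10 ** 6) - 0.5 * sigma ** 2)   # noise W with E[W] = 1

f = lambda w: 1.0 / (1.0 + w)                 # bounded test function

lhs = (f(w) * w).mean()                       # ∫ f(w) w Q_z(dw)
rhs = f(1.0 / w).mean()                       # ∫ f(1/w) Q_{phi(z)}(dw), same law for W here
print(lhs, rhs)                               # equal up to Monte Carlo error
```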
One can also consider, with z ↦ ω_z such that ω_z = ω_{φ̃(z)} > 0,

Q_z(dw) = 1/(1 + ω_z) δ_{ω_z}(dw) + ω_z/(1 + ω_z) δ_{1/ω_z}(dw),

because

∫ f(w) Q_z(dw) = 1/(1 + ω_z) f(ω_z) + ω_z/(1 + ω_z) f(ω_z^{−1})
= ω_z^{−1} · ω_z/(1 + ω_z) · f(ω_z) + ω_z · 1/(1 + ω_z) · f(ω_z^{−1})
= ∫ f(w) w^{−1} Q^{1/·}_{φ̃(z)}(dw).

Example 12 (Reversible jump MCMC, Green (1995)). Here we are concerned with the situation where X is a disjoint union, for example X = ⊔_{i∈ℕ} {i} × X_i with, for i ∈ ℕ, (X_i, 𝒳_i) a measurable space and 𝒳 a sigma-algebra associated to X; see Fremlin (2010, 214K) for a construction. Here the probability distribution of interest is π(i, dx_i); that is, for i ∈ ℕ, π(i, ·): 𝒳_i → ℝ₊ is a finite measure and Σ_{i∈ℕ} π(i, X_i) = 1. The idea of Green (1995) to circumvent the possibly differing natures of the X_i's is to introduce the following space and probability embeddings:
(a) E := ⊔_{i,j∈ℕ} {(i, j)} × X_i × U_{ij}, such that for (i, j) ∈ ℕ² there exist measurable bijections X_i × U_{ij} → X_j × U_{ji} between the measurable spaces (X_i × U_{ij}, 𝒳_i ⊗ 𝒰_{ij}) and (X_j × U_{ji}, 𝒳_j ⊗ 𝒰_{ji});
(b) for (i, j) ∈ ℕ² one chooses such mappings φ_{ij} = φ_{ji}^{−1} and defines φ: E → E by φ(i, j, x_i, u_{ij}) := (j, i, φ_{ij}(x_i, u_{ij}));
(c) the probability distribution π is embedded in µ(i, j, d(x_i, u_{ij})) = π(i, dx_i) µ_i(j, du_{ij} | x_i).
This can be viewed as a natural generalization of Remark 5.

Remark 9. In light of Example 3 and its relation to the framework in Tierney (1998), it is natural to ask whether the framework considered here is more powerful in terms of its ability to express and validate Markov kernels. In fact it is not, but it is perhaps more natural to use since one does not introduce additional auxiliary variables in µ. In particular, for a given choice of µ and φ, one can always embed (ξ, φ(ξ)) in the extended space E × E with distribution µ̃(dx, dy) = µ(dx) δ_{φ(x)}(dy), and use the involution φ̃(x, y) = (y, x). The µ̃-reversibility then follows from Theorem 3. For an expression for the acceptance ratio, it is then convenient to consider the φ̃-invariant reference measure υ := λ̃ + λ̃^{φ̃}, where λ̃(dx, dy) := λ(dx) δ_{φ(x)}(dy). We obtain that dµ̃/dυ(x, φ(x)) = ρ(x){1 + dλ^φ/dλ(x)}^{−1}, where ρ = dµ/dλ, and that for x in the same S = S(µ, µ^φ),

r((x, φ(x))) = [dµ̃^{φ̃}/dυ] / [dµ̃/dυ] (x, φ(x)) = (ρ∘φ/ρ)(x) · [1 + dλ^φ/dλ(x)] / [1 + dλ/dλ^φ(x)] = (ρ∘φ/ρ)(x) (dλ^φ/dλ)(x),

as in Proposition 2.

Reversibility plays a central role in the design of MCMC algorithms but is not necessarily a desirable property. In fact, it has been shown that nonreversible Markov chains can converge more quickly in some cases (Diaconis, Holmes, and Neal, 2000), and their ergodic averages can have smaller asymptotic variance in comparison to a suitable reversible counterpart (Neal, 2004; Sun, Schmidhuber, and Gomez, 2010; Chen and Hwang, 2013; Christophe Andrieu, 2016). This can be intuitively attributed to the fact that reversible processes tend to backtrack and/or move in a diffusive way, suggesting slower exploration of the target distribution in comparison to nonreversible processes, which move in a more systematic way through the state space.
We discuss here a popular class of nonreversible MH type updates which can be understood as the cycle of two µ-reversible Markov kernels. This type of nonreversibility is referred to as (µ, S)-reversibility in the literature (Christophe Andrieu and Livingstone, 2019) and was first discussed in Yaglom (1949) as a generalisation of deterministic time-reversible systems. The necessity of some of the conditions below is discussed in Thin, Durmus, et al. (2020).
Proposition 3. Let µ be a probability distribution on (E, ℰ) and φ, σ: E → E be involutions, with σ such that µ^σ = µ. Let
(i) Π be the µ-reversible Markov kernel using φ and acceptance function a(r) = 1 ∧ r,
(ii) S be such that for ξ ∈ E, S(ξ, {σ(ξ)}) = 1 (that is, for ξ ∈ E, S(ξ, dξ′) = δ_{σ(ξ)}(dξ′)),
(iii) λ ≫ µ be such that λ ≡ λ^φ and λ^σ = λ.
Let ψ := σ∘φ and Ψ be such that for ξ ∈ E, Ψ(ξ, {ψ(ξ)}) = 1 (that is, Ψ(ξ, dξ′) = δ_{ψ(ξ)}(dξ′)). Then:
(a) ψ^{−1} = σ∘ψ∘σ, λ^φ = λ^{ψ^{−1}}, and ρ∘σ = ρ;
(b) the µ-invariant cycle Π₂ := ΠS is given by

Π₂(ξ, dξ′) = a∘r(ξ) · Ψ(ξ, dξ′) + [1 − a∘r(ξ)] S(ξ, dξ′),

where, with S = {ξ ∈ E : ρ(ξ) ∧ [ρ∘ψ(ξ) dλ^{ψ^{−1}}/dλ(ξ)] > 0},

r(ξ) = { (ρ∘ψ/ρ)(ξ) (dλ^{ψ^{−1}}/dλ)(ξ) if ξ ∈ S, 0 otherwise; }

(c) in fact Π₂ is (µ, S)-reversible (or satisfies the modified, or skew, detailed balance); that is, for ξ, ξ′ ∈ E,

µ(dξ) Π₂(ξ, dξ′) = µ(dξ′) SΠ₂S(ξ′, dξ);

(d) let µ(dξ) := π(dξ_0) µ^{ξ_0}(dξ_−), where µ^{ξ_0} denotes the conditional distribution of ξ_− given ξ_0 under µ, and assume Π₂ to be (µ, S)-reversible, where S is associated with σ(ξ_0, ξ_−) = (S_0(ξ_0), ξ_−) for S_0 an involution; then the Markov kernel

P₂(ξ_0, A) := ∫ 𝟙_A(ξ′_0) µ^{ξ_0}(dξ_−) Π₂(ξ; dξ′), A ∈ 𝒵,

is (π, S_0)-reversible;
(e) let Π′₂ := SΠ; then with ψ′ := φ∘σ and Ψ′(ξ, dξ′) = δ_{ψ′(ξ)}(dξ′), Properties (a)–(d) hold with Π₂, ψ, Ψ and r replaced with Π′₂, ψ′, Ψ′ and r′ := r∘σ.

Corollary 1. If ψ in Proposition 3 preserves λ then one has r(ξ) = ρ∘ψ/ρ(ξ) on S = {ξ ∈ E : ρ(ξ) ∧ ρ∘ψ(ξ) > 0}. Indeed, if λ^ψ = λ then λ^{ψ^{−1}} = (λ^ψ)^{ψ^{−1}} = λ^{ψ^{−1}∘ψ} = λ also.
Corollary 2. In many situations, nonreversible kernels are given in the form of Π₂ or Π′₂, where ψ, ψ′: E → E are invertible mappings with the property that ψ^{−1} = σ∘ψ∘σ for σ an involution leaving µ and λ invariant, and similarly for ψ′. This time-reversal feature ensures that we are in the setup of Proposition 3, since indeed in this setup φ := σ∘ψ (or φ̃ := ψ∘σ) is an involution, therefore defining Π satisfying the right property. In particular we always have the decomposition Π₂ = SΠ̃ = ΠS, where Π and Π̃ satisfy detailed balance (Christophe Andrieu and Livingstone, 2019, Theorem 4).

Remark 10. Proposition 3 highlights the fundamental difference between reversible and this type of nonreversible kernels. Without refreshment of ξ_−, the reversible Markov chain started at ξ oscillates between ξ and φ(ξ) due to the involutive property, while the nonreversible chain can in principle explore a larger subset of states ψ^k(ξ), k ∈ ℕ, although rejection leads to backtracking. This fundamental qualitative behaviour is exploited in more general and realistic setups, even when ξ_− is refreshed.

Remark 11. In the same way the results of Maire, Douc, and Olsson (2014) can be used in the context of Proposition 1 (see Remark 4), one can, for example, deduce optimality properties of Π₂ from those of Π by using Christophe Andrieu and Livingstone (2019).

In practice, a number of deterministic transformations ψ are used to define π-invariant Markov kernels. The validity of such kernels often rests primarily on showing that the transformation is measure-preserving, typically with the measure being the Lebesgue measure. We give here some examples where π is a probability measure associated with a position variable x ∈ ℝ^d and a velocity variable v ∈ ℝ^d. A general class of nonreversible MH kernels relies on the choices ξ = (x, v) ∈ E = X × V, σ(x, v) = (x, −v) and µ(dx, dv) = π(dx)κ(dv), where κ is such that µ^σ = µ. In order to keep the presentation simple we will assume that X = V = ℝ^d and that µ has a density ρ(x, v) = ϖ(x)κ(v) with respect to the Lebesgue measure on ℝ^{2d}. Note that the Lebesgue measure is invariant by σ since its Jacobian is 1.
Lemma 2. Let x ∈ X ⊆ ℝ^d and y ∈ Y ⊆ ℝ^{d′}, and ψ: X × Y → X × Y be defined as ψ(x, y) = (x, y + f(x)) for some function f: X → Y. Then ψ preserves the Lebesgue measure λ on X × Y.

Example 13 (Guided random walk (GRW), Gustafson 1998). Let ψ(x, v) = (x + v, v); then φ = σ∘ψ = (x + v, −v) is an involution, and is in fact the involution used to define the random walk Metropolis. Then ψ preserves λ by Lemma 2. Hence, using that κ(−v) = κ(v), for ξ ∈ S,

r(x, v) = (ρ∘ψ/ρ)(x, v) = ϖ(x + v)/ϖ(x),

which coincides with the acceptance ratio of the RWM. In fact P(x, dx′) := ∫ κ(dv) Π(x, v; dx′) is the π-reversible RWM Markov kernel. The GRW, of transition Π₂, differs in that it is µ-invariant but not reversible, and has the property that it introduces memory in the velocity component of the process. On its own Π₂ does not lead to an ergodic chain and must be combined with other updates, e.g. occasionally sampling v afresh from κ.
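A Python sketch of the GRW kernel Π₂ (ours): each step either advances to ψ(x, v) = (x + v, v) or, on rejection, flips the velocity, and v is occasionally refreshed from κ (here N(0, 0.5²), our choice) since Π₂ alone is not ergodic.

```python
import math
import random

log_varpi = lambda x: -0.5 * x * x            # log target density (standard normal)

def grw_step(x, v):
    """One step of the guided random walk (Example 13): the nonreversible
    kernel Pi_2 = Pi S moves to psi(x, v) = (x + v, v) on acceptance and
    to sigma(x, v) = (x, -v) on rejection."""
    if random.random() < math.exp(min(0.0, log_varpi(x + v) - log_varpi(x))):
        return x + v, v                       # persist in the current direction
    return x, -v                              # backtrack: flip the velocity

x, v, xs = 0.0, 0.5, []
for i in range(100_000):
    if i % 50 == 0:                           # occasional velocity refreshment
        v = random.gauss(0.0, 0.5)
    x, v = grw_step(x, v)
    xs.append(x)
print(sum(xs) / len(xs))                      # ≈ 0
```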
Before covering Hamiltonian Monte Carlo (HMC), and in particular the common variant using the velocity Verlet, or leapfrog, integrator, we note that transformations ψ satisfying ψ^{−1} = σ∘ψ∘σ are particularly intuitive in that the iterated maps ψ∘···∘ψ can be “reversed”.

Remark 12. Let ψ^0 = Id and ψ^k = ψ∘ψ^{k−1} for k ∈ ℕ. If ψ satisfies ψ^{−1} = σ∘ψ∘σ, then ψ is time-reversible in the sense that φ_k = σ∘ψ^k is an involution for any k ∈ ℕ. Indeed, we have

Id = ψ^{−k}∘ψ^k = (σ∘ψ∘σ)^k∘ψ^k = σ∘ψ^k∘σ∘ψ^k.

Lemma 3. Let x, v ∈ ℝ^d and ψ: ℝ^{2d} → ℝ^{2d} be ψ = ψ_B∘ψ_A∘ψ_B, where ψ_B = (x, v) ↦ (x, v + ı(x)) and ψ_A = (x, v) ↦ (x + ȷ(v), v) for some functions ı: X → V and ȷ: V → X, where ȷ(−v) = −ȷ(v) for v ∈ ℝ^d. Then ψ_A, ψ_B and ψ preserve the Lebesgue measure on ℝ^{2d} and ψ^{−1} = σ∘ψ∘σ, so that ψ is time-reversible in the sense of Remark 12.

Example 14 (HMC – leapfrog integrator). Let µ have density ρ = ϖ ⊗ κ w.r.t. λ, the Lebesgue measure on E = ℝ^{2d}. Consider the function h = ψ_B∘ψ_A∘ψ_B as in Lemma 3 with ı(x) = ε∇ log ϖ(x) and ȷ(v) = ε∇ log κ(v). Let Π₂ = ΠS be the nonreversible kernel in Proposition 3 with ψ = h^k for some k ∈ ℕ, and acceptance ratio

r(ξ) = (ρ∘ψ/ρ)(ξ), ξ ∈ S.

This kernel is a version of the HMC kernel with leapfrog integrator (see Remark 13 below). It has desirable properties, but it is also clear that the µ-invariance of Π₂ applies for a much broader class of ı and ȷ, as implied by the appeal to Lemma 3. For example, it is well known that one could replace ϖ in ı with some approximate density (see, e.g., Neal et al., 2011, Section 5.5), i.e. run the “leapfrog integrator” for a different density but accept or reject using ρ = ϖ ⊗ κ. In order to preserve persistence of motion (and nonreversibility) this update is typically combined with partial refreshment of the velocity. As discussed below, full refreshment leads to a reversible algorithm.
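A Python sketch of this kernel (ours) for a one-dimensional standard normal target with κ = N(0, 1). We use the usual velocity Verlet convention, with half-steps of size ε/2 for the velocity and sign choices that may be parametrized differently above, and we fully refresh the velocity, giving the reversible variant discussed in Remark 13 below.

```python
import math
import random

log_varpi = lambda x: -0.5 * x * x                 # log target density
grad_log_varpi = lambda x: -x
eps, k = 0.2, 10                                   # step size and number of leapfrog steps

def leapfrog(x, v):
    """One velocity Verlet step of the form psi_B ∘ psi_A ∘ psi_B (Lemma 3)."""
    v += 0.5 * eps * grad_log_varpi(x)
    x += eps * v
    v += 0.5 * eps * grad_log_varpi(x)
    return x, v

def hmc_step(x):
    v = random.gauss(0.0, 1.0)                     # full refreshment: reversible variant
    xn, vn = x, v
    for _ in range(k):
        xn, vn = leapfrog(xn, vn)
    log_r = (log_varpi(xn) - 0.5 * vn ** 2) - (log_varpi(x) - 0.5 * v ** 2)
    return xn if random.random() < math.exp(min(0.0, log_r)) else x

x, xs = 0.0, []
for _ in range(20_000):
    x = hmc_step(x)
    xs.append(x)
print(sum(xs) / len(xs), sum(t * t for t in xs) / len(xs))   # ≈ 0 and ≈ 1
```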
Remark 13. It is often the case, as was the case in part of the seminal paper of Horowitz (1991), that the kernel considered is reversible. Indeed in those works the kernel considered is, for (x, A) ∈ X × 𝒳,

P(x, A) := ∫ κ(dv) Π₂(x, v; d(y, w)) I{(y, w) ∈ A × V},

that is, the velocity is refreshed at each iteration, and with ξ = (x, v), x′ the position component of ξ′, and f, g: X → [0, 1],

∫ f(x) g(x′) π(dx) P(x, dx′) = ∫ f(x) g(x′) µ(dξ) Π₂(ξ, dξ′)
= ∫ f(x) g(x′) µ(dξ′) SΠ₂S(ξ′, dξ)
= ∫ f(x) g(x′) µ(dξ′) Π₂(ξ′, dξ)
= ∫ g(x) f(x′) π(dx) P(x, dx′),

where we have used the (µ, S)-reversibility of Π₂, µ^S = µ, and the fact that S leaves the position unchanged, so that functions of the position alone are invariant under S.

Example 15 (MALA and generalized MALA). Standard reversible normal MALA (Besag, 1994) (i.e. κ is the standard normal distribution) corresponds to one iteration of the HMC leapfrog integrator with full refreshment of the velocity at each iteration, and indeed here

ψ(x, v) = ( x + (ε²/2) ı(x) + ε v, −v − (ε/2)[ı(x) + ı(x + (ε²/2) ı(x) + ε v)] )

for ı(x) = ∇_x log ϖ(x) and ε > 0. In Poncet (2017) it is proposed to consider ı: X × {−1, 1} → X with ı(x, s) = ∇_x log ϖ(x) + sγ(x) for γ: X → X. A naïve idea would be to take ψ_B = (x, v, s) ↦ (x, v + ı(x, s), s) and ψ_A = (x, v, s) ↦ (x + ȷ(v), v, s) with σ(x, v, s) = (x, v, −s), which is shown to have poor properties; this leads to the development of a scheme relying on an implicit integration scheme.

Example 16 (Hyperplane reflection). If λ_V is the Lebesgue measure on ℝ^d, the involution b(x, v) = (x, v − 2{n(x)ᵀv}n(x)) preserves λ = λ_X × λ_V, where n: ℝ^d → ℝ^d satisfies ‖n(x)‖² = n(x)ᵀn(x) = 1 for all x ∈ ℝ^d. Indeed, we can write the v-component of b(x, v) as

v − 2(n(x)ᵀv)n(x) = (Id − 2n(x)n(x)ᵀ)v,

and we see that B_x = Id − 2n(x)n(x)ᵀ is a matrix with B_x² = Id, and so |det B_x| = 1. Since b does not move the x-component, it follows that b is λ-preserving. If λ = λ_X × λ_V and λ_V is instead the uniform measure on the sphere 𝕊^{d−1} = {v ∈ ℝ^d : vᵀv = 1}, then B_xᵀ = B_x, so ‖B_x v‖² = (B_x v)ᵀ B_x v = vᵀv = ‖v‖², so B_x preserves the norm ‖·‖. Letting Leb denote the Lebesgue measure on ℝ^d, and noting from the argument above that b preserves Leb, we can then conclude that b is λ-preserving as above, because for any measurable A ⊆ 𝕊^{d−1}, λ_V(A) ∝ Leb({tv : v ∈ A, t ∈ [0, 1]}).

A natural question is whether the requirement that σ: E → E be an involution can be relaxed to invertibility only. More precisely, let ψ, σ: Z → Z be invertible with ψ^{−1} = σ^{−1}∘ψ∘σ – such a structure is known as a time-reversal symmetry when ψ is the flow of a dynamical system with this property (Lamb and J. A. Roberts, 1998). Let Ψ(z, dz′) = δ_{ψ(z)}(dz′), S^±(z, dz′) = δ_{σ^{±1}(z)}(dz′) and µ a probability distribution on (Z, 𝒵) such that µ^{σ^{±1}} = µ. Can one define a deterministic MH type kernel leaving µ invariant? Fang, J.-M. Sanz-Serna, and Skeel (2014) provide us with an answer, see below. Our answer consists of embedding this problem in the (µ, S)-reversible framework. Let E = Z × U where U = {−1, 1}, define µ̄(d(z, u)) := µ(dz) I{u ∈ U}/2, and consider the mappings σ̄, ψ̄: Z × U → Z × U such that, for f: Z × U → ℝ, f∘σ̄(z, u) = f(σ^u(z), −u) and ψ̄(z, u) := (σ^{−u}∘ψ^u(z), u). For any (z, u) ∈ Z × U we have that σ̄∘σ̄(z, u) = (σ^{−u}∘σ^u(z), u) = (z, u), that is, σ̄ is an involution, and one can check that ψ̄^{−1}(z, u) = (ψ^{−u}∘σ^u(z), u). Noting that ψ^{−1} = σ^{−1}∘ψ∘σ is equivalent to ψ^{−u} = σ^{−u}∘ψ^u∘σ^u for u ∈ U, we have for (z, u) ∈ Z × U,

σ̄∘ψ̄∘σ̄(z, u) = σ̄∘ψ̄(σ^u(z), −u) = σ̄(σ^u∘ψ^{−u}∘σ^u(z), −u) = (ψ^{−u}∘σ^u(z), u) = ψ̄^{−1}(z, u).

Moreover, for f: Z × U → ℝ we have µ̄^{σ̄} = µ̄, since

∫ f(z, u) µ̄^{σ̄}(d(z, u)) = ∫ f∘σ̄(z, u) µ̄(d(z, u))
= ∫ f(σ^u(z), −u) µ(dz) I{u ∈ U}/2
= ∫ f(z, −u) µ^{σ^u}(dz) I{u ∈ U}/2
= ∫ f(z, u) µ(dz) I{u ∈ U}/2
= ∫ f(z, u) µ̄(d(z, u)).

We are therefore back in the (µ, S)-reversible setup and, with ρ = dµ/dλ for some λ ≫ µ satisfying λ^{σ^{±1}} = λ, ᾱ := 1 − α and

α(z, u) := a( (ρ∘ψ̄/ρ)(z, u) · (dλ^{ψ̄^{−1}}/dλ)(z, u) ) = a( (ρ∘ψ^u(z)/ρ(z)) · (dλ^{ψ^{−u}}/dλ)(z) ),

we can define the kernel

Π₂f(z, u) = α(z, u) · f∘ψ̄(z, u) + ᾱ(z, u) · f∘σ̄(z, u) = α(z, u) · f(σ^{−u}∘ψ^u(z), u) + ᾱ(z, u) · f(σ^u(z), −u).

The kernel Π̃: Z × 𝒵 → [0, 1] proposed by Fang, J.-M. Sanz-Serna, and Skeel (2014) is, for g: Z → ℝ,

Π̃g(z) = α(z, 1) g∘ψ(z) + ᾱ(z, 1) g∘σ(z) = α(z, 1) g̃∘σ̄∘ψ̄(z, 1) + ᾱ(z, 1) g̃∘σ̄(z, 1),

where g̃: Z × U → ℝ is such that g̃(z, 1) = g̃(z, −1) := g(z); Π̃ can therefore be thought of as being Π₂ but used for the value u = 1 only – one could equally have chosen u = −1, naturally. One can check that this kernel satisfies global balance for µ directly (Fang, J.-M. Sanz-Serna, and Skeel, 2014). The kernel Π̃ does not satisfy detailed or skew detailed balance, but noting that φ̄ := σ̄∘ψ̄ is an involution and letting Φ be the kernel Φ(z, u; ·) := δ_{φ̄(z,u)}(·), that is Φf(z, u) = f(ψ^u(z), −u), we use “reversibility” (see the proof of Proposition 3) to show that

(1/2) ∫ α(z, 1) g∘ψ(z) µ(dz) = ∫ 𝟙_{Z×{1}}(z, u) g̃∘φ̄(z, u) α(z, u) µ̄(d(z, u))
= ∫ 𝟙_{Z×{1}}(z, u) Φg̃(z, u) α(z, u) µ̄(d(z, u))
= ∫ g̃(z, u) Φ𝟙_{Z×{1}}(z, u) α(z, u) µ̄(d(z, u))
= ∫ g̃(z, u) 𝟙_{Z×{−1}}(z, u) α(z, u) µ̄(d(z, u))
= (1/2) ∫ g(z) α(z, −1) µ(dz),

and, with Φ replaced with the kernel associated with σ̄ (see the proof of Proposition 3),

(1/2) ∫ ᾱ(z, 1) g∘σ(z) µ(dz) = ∫ 𝟙_{Z×{1}}(z, u) g̃∘σ̄(z, u) ᾱ(z, u) µ̄(d(z, u))
= ∫ g̃(z, u) 𝟙_{Z×{1}}∘σ̄(z, u) ᾱ(z, u) µ̄(d(z, u))
= ∫ g̃(z, u) 𝟙_{Z×{−1}}(z, u) ᾱ(z, u) µ̄(d(z, u))
= (1/2) ∫ g(z) ᾱ(z, −1) µ(dz).

Therefore for any g: Z → [0, 1],

∫ Π̃g(z) µ(dz) = ∫ [α(z, −1) + ᾱ(z, −1)] g(z) µ(dz) = ∫ g(z) µ(dz),

and we conclude.

Markov chain proposals, stopping times and processes & NUTS
In some scenarios it is desirable for µ to involve the simulation of a stopped process. In particular, this allows the amount of simulation required to produce a suitable proposal to be random, and ideally to be appropriately adapted to features of the target distribution and the current point. As mentioned in Remark 3, the specification of (µ, φ) is not unique for a given Markov kernel, so there is some flexibility in precisely how stopping times and stopped processes are captured in ξ and described by µ. In particular, one often has flexibility in allowing ξ to be infinite-dimensional and to contain a realization of the original process as well as the stopping time, or for ξ to be finite-dimensional and to contain only the stopped process. In the former case, one will need to adopt an indirect implementation as per Remark 6. We illustrate the former approach on a simple example with i.i.d. proposals.

Let ξ = (n, Z, k) ∈ ℕ × Z^ℕ × ℕ. Assume that ν ≫ π and let ϖ(z) = dπ/dν(z) – a common situation is when π and ν have densities w.r.t. some common dominating measure and ϖ(z) = π(z)/ν(z), if we keep the same notation for these densities. Assume that the distribution of Z under µ is such that z_0 ∼ π and, for i ∈ ℕ*, z_i iid∼ ν. Let (s_n)_{n∈ℕ*} be a sequence of functions such that s_n: Z^ℕ → {0, 1} depends only on the first n + 1 members of its argument; i.e. s_n(Z) depends only on z_0, . . . , z_n. Define the stopping time, for Z ∈ Z^ℕ,

τ = τ(Z) := inf{n ≥ 1 : s_n(Z) = 1}. (10)

For example, one could choose s_n(Z) = I{∑_{i=0}^n ϖ(z_i) > c} for some constant c > 0, or s_n(Z) = I{ESS(z_0, . . . , z_n) > c} with

ESS(z_0, . . . , z_n) := {∑_{i=0}^n ϖ(z_i)}² / ∑_{i=0}^n ϖ(z_i)²,

heuristically to ensure that sufficiently many samples have been drawn and that one can be chosen to produce a sample approximately drawn from π. For Z ∈ Z^ℕ, let n := τ(Z) and k ∼ ς(·; n, Z), where ς(·; n, Z) is an arbitrary categorical distribution taking values in ⟦0, n⟧ and with probabilities depending on z_0, . . . , z_n only. For k ∈ ℕ, let σ_k: Z^ℕ → Z^ℕ be the swapping function such that, with Z′ = σ_k(Z), z′_0 = z_k, z′_k = z_0 and z′_j = z_j for j ∉ {0, k}. Clearly, σ_k is an involution and we consider

φ(n, Z, k) := (τ∘σ_k(Z), σ_k(Z), k),

which is an involution since

φ∘φ(n, Z, k) = (τ∘σ_k∘σ_k(Z), σ_k∘σ_k(Z), k) = (n, Z, k).

Letting ν^{⊗∞} denote the probability measure associated with an infinite sequence of independent ν-distributed random variables, we have for n ∈ ℕ and k ∈ ⟦0, n⟧,

µ(n, dZ, k) = ϖ(z_0) ν^{⊗∞}(dZ) ς(k; n, Z) s_n(Z) ∏_{i=1}^{n−1} {1 − s_i(Z)},

where we note that z_0 is marginally distributed according to π, which together with φ above defines the kernel outlined in Alg. 2. One can check that the acceptance ratio is, with (n′, Z′, k′) = φ(n, Z, k) = (τ∘σ_k(Z), σ_k(Z), k) and for ξ ∈ S = {(n, Z, k) : n = τ(Z), 0 ≤ k ≤ τ(Z) ∧ τ∘σ_k(Z)},

r(ξ) = ϖ(z′_0) ς(k′; n′, Z′) / (ϖ(z_0) ς(k; n, Z)) = ϖ(z_k) ς(k; n′, Z′) / (ϖ(z_0) ς(k; n, Z)).

Although theoretically convenient, the algorithm described in Alg. 2 is not very practical due to the requirement to sample the infinite-dimensional Z.
However the definitions of ( s n ) n ∈ N ∗ , τ ( Z ) , k and22 lgorithm 2 Impractical algorithm(a) Simulate (lazily) z i iid ∼ ν , for i ∈ N ∗ .(b) Set n ← τ ( Z ) .(c) Simulate k ∼ ς ( · | n, Z ) .(d) Set Z ′ ← σ k ( Z ) and n ′ ← τ ◦ σ k ( Z ) .(e) With probability a (cid:18) ̟ ( z k ) ς ( k ; n ′ , Z ′ ) ̟ ( z ) ς ( k ; n, Z ) (cid:19) , output z k , otherwise output z . σ k ( Z ) are such that Z is not required in its entirety to simulate from the kernel, which can be achievedwith finite computation provided τ ( Z ) < ∞ . This is described in Alg. 3, with a slight abuse of notationsince s k and σ k are defined on Z N . We will refer to this as a “lazy” implementation or simulationand adopt the presentation in Alg. 2 for brevity. In particular, the explicit lazy implementation inAlg. 3 involves simulating only those components of Z and Z ′ that are required to implement Alg. 2,the details of which are fairly straightforward and tend to obscure the simplicity of the approach.Note that throughout we give the expression for the acceptance ratio on S only in order to alleviatepresentation.If we choose ( s n ) n ∈ N ∗ such that τ ( Z ) = 1 and ς (1; 1 , Z ) = 1 for all Z ∈ Z N then this reduces to theindependent MH (IMH), but of course in general it allows more than one candidate sample from ν tobe simulated. We refer to the kernel in Alg. 2 as an adaptive IMH kernel for this reason. If we let ς ( k ; n, Z ) ∝ ̟ ( z k ) J ,n K ( k ) then we obtain r ( ξ ) = P ni =0 ̟ ( z i ) P n ′ i =0 ̟ ( z ′ i ) J ,n ∧ n ′ K ( k ) An important point is that n ′ may not equal n in general, requiring in particular additional simulationswhen n ′ > n . By choosing ( s n ) n ∈ N and ς ( · ; n, Z ) appropriately one can ensure that n = n ′ for all Z ∈ Z N .This has the appeal that there is no need to perform additional simulations once z , . . . , z τ are realized,and can also mean that the acceptance ratio is one The following lemma provides sufficient conditionsfor this equality to hold. Lemma 4.
Let Z ∈ Z N be such that, with the definition in (10), τ ( Z ) < ∞ and assume further that ( s k ) k ∈ N ∗ satisfies(a) k s k ( Z ) is non-decreasing;(b) s k ( Z ) = s k ◦ σ l ( Z ) for all l ∈ J k K ;Then τ ◦ σ l ( Z ) = τ ( Z ) for l ∈ J τ ( Z ) − K . Example 17.
For k ∈ N , let s k ( Z ) := I { P ki =0 ̟ ( z i ) > c } for some constant c > and let Z ∈ Z N satisfy τ ( Z ) < ∞ . Clearly k s k ( Z ) is non-decreasing and condition (a) in Lemma 4 holds. Assumethat for n ∈ N , s ( Z ) = · · · = s n − ( Z ) = 0 while s n ( Z ) = 1 . It follows that for l ∈ J , n − K ,if k ∈ J n − K then s k ◦ σ l ( Z ) ≤ I { P n − i =0 ̟ ( z i ) > c } = s n − ( Z ) = s k ( Z ) = 0 while if k ≥ n then s k ◦ σ l ( Z ) ≥ I { P ni =0 ̟ ( z i ) > c } = s n ( Z ) = 1 = s k ( Z ) so condition (b) in Lemma 4 holds. If we take23 lgorithm 3 Practical, lazy implementation(a) Set i ← , simulate z ∼ ν .(b) While s i ( z i ) = 0 (i) Set i ← i + 1 .(ii) Simulate z n ∼ ν .(c) Set n ← i = τ ( Z ) .(d) Simulate k ∼ ς ( · ; n, Z ) and set z ′ n = σ k ( z n ) .(e) Set i ← .(f) While s i ( z ′ i ) = 0 (i) Set i ← i + 1 .(ii) If i > n , simulate z ′ i = z i ∼ ν .(g) Set n ′ ← i = τ ◦ σ k ( Z ) = τ ( Z ′ ) .(h) With probability a (cid:18) ̟ ( z k ) ς ( k ; n ′ , Z ′ ) ̟ ( z ) ς ( k ; n, Z ) (cid:19) , output z k , otherwise output z . ς ( k ; n, Z ) ∝ ̟ ( z k ) J ,n − K ( k ) then the conclusion of Lemma 4 holds and we obtain for k ∈ J , n − K ,with ξ = ( n, Z , k ) and φ ( n, Z , k ) := ( τ ◦ σ k ( Z ) , σ k ( Z ) , k ) , r ( ξ ) = P n − i =0 ̟ ( z i ) P n − i =0 ̟ ( z ′ i ) = P n − i =0 ̟ ( z i ) P n − i =0 ̟ ( z i ) = 1 . In contrast, for the choice s n ( Z ) = I { ESS( z , . . . , z n ) > c } , condition (a) in Lemma 4 is not satisfiedand there is no reason for n = n ′ to hold for all Z ∈ Z N . We expand on the idea of Lemma 4 to validatethe NUTS and stopping-time MTM in Sections 5 and 6. Here we demonstrate how one can verify that Markov chain proposals can be used to define π -invariantMarkov kernels. First we show how to deal with a distribution µ involving a doubly-infinite Markovchain as well as a proposal index. Then we consider the more involved but practical scenario where theproposal index is selected from a window of random size, adapted according to user-defined constraintfunctions. A special case of this framework, and indeed the inspiration for the generalization here,is when the Markov chain is a deterministic dynamical system is the No U-Turn Sampler (NUTS) ofHoffman and Gelman (2014).Let π and ν be measures on ( Z , Z ) where π is a probability, ν ≫ π and let ̟ := d π/ d ν . In order topresent our algorithm we require the definition of a two sided Markov chain, from which a proposalstate is chosen within a MH kernel update. Definition 4 (Two-sided ( k, π, Q, Q ∗ ) -Markov chain probability measure Λ k ) . Let π be a probabilitymeasure on (cid:0) Z , Z (cid:1) and Q , Q ∗ : Z × Z → [0 , be transition kernels. Then for k ∈ Z , denote by Λ k (cid:0) Z Z , Z ⊗ Z (cid:1) associated with the Markov chain Z such that Z k ∼ π , and for i ∈ N ∗ , Z i + k | { Z i + k − = z } ∼ Q ( z, · ) and Z k − i | { Z k − i +1 = z } ∼ Q ∗ ( z, · ) .We define Q ( z , · ) to be the probability measure for Z ∼ Λ conditional on a fixed z .For any Z ∈ Z Z we let ς ( · ; Z ) be a probability distribution on (cid:0) Z , P ( Z ) (cid:1) and we are interested in theupdate outlined in Alg. 4, where for i ∈ Z , θ i : Z Z → Z Z is the shift function defined via θ i ( Z ) j = z i + j and to ease the presentation of the algorithms, we write that one should “lazily” simulate a realizationof a double-infinite Markov chain, by which we mean that only a finite number of states of the Markovchain should be required to perform the rest of the algorithm. 
The simulation of Z is naturally notpractical and a stopping criterion is required, while making sense of the acceptance ratio and itsexpression require an additional assumption on ( ν, Q, Q ∗ ) . These are the topics of the remainder ofthe subsection. Algorithm 4
To sample from P (general)MC ( z , · ) (a) Lazily simulate Z ∼ Q ( z , · ) .(b) Simulate k ∼ ς ( · ; Z ) .(c) With probability a ̟ ( z k ) ς (cid:0) − k ; θ k ( Z ) (cid:1) ̟ ( z ) ς ( k ; Z ) ! , output z k . Otherwise output z .We first introduce an assumption on ( ν, Q, Q ∗ ) justifying the form of the acceptance ratio in fullgenerality. Definition 5 (Reversible triplet ( ν, Q, Q ∗ ) ) . Let ν be a measure on ( Z , Z ) and Q, Q ∗ : Z × Z → [0 , be two Markov kernels. We say that ( ν, Q, Q ∗ ) is a reversible triplet if for z, z ′ ∈ Z , ν (d z ) Q ( z, d z ′ ) = ν (d z ′ ) Q ∗ ( z ′ , d z ) , (11)This implies in particular that Q is ν -invariant, whether ν is a probability measure or not, and Q ∗ isthe time-reversal of Q . In operator theoretic language Q ∗ is the ν − adjoint of Q for the inner product (cid:10) f, g (cid:11) = R f g d ν on L ( ν ) . Importantly for practical purposes, we observe that (11) accommodatesinvertible mappings that leave ν invariant: Proposition 4.
Let ν be a measure on ( Z , Z ) and ψ : Z → Z be invertible and such that ν ψ = ν .Then (11) holds with Q ( z, d z ′ ) = Ψ( z, d z ′ ) := δ ψ ( z ) (d z ′ ) and Q ∗ ( z, d z ′ ) = Ψ ∗ ( z, d z ′ ) := δ ψ − ( z ) (d z ′ ) . The assumption that ( ν, Q, Q ∗ ) is a reversible triplet implies that for any k ∈ Z , Λ and Λ k areequivalent on a suitable restriction of Z Z , with a simple Radon–Nikodym derivative involving ̟ only.This is the property used in Alg. 4 to propose that a chain Z distributed according to Λ is mappedto a chain distributed according to Λ k . Lemma 5.
Let π and ν be measures on ( Z , Z ) where π is a probability and ν ≫ π and let ̟ := d π/ d ν .Assume that ( ν, Q, Q ∗ ) is a reversible triplet. For any k ∈ Z let Λ k be the two-sided ( k, π, Q, Q ∗ ) -Markovchain probability measure and S k := { Z ∈ Z Z : ̟ ( z ) ∧ ̟ ( z k ) > } . Then for any k ∈ Z and Z ∈ S k , dΛ kS k dΛ S k ( Z ) = ̟ ( z k ) ̟ ( z ) .
25e can now establish correctness of Alg. 4.
Corollary 3.
For any Z ∈ Z Z let ς ( · ; Z ) be a probability distribution on (cid:0) Z , P ( Z ) (cid:1) , ξ := ( Z , k ) ∈ Z Z × Z , µ (d Z , k ) = Λ (d Z ) ς ( k ; Z ) and define the involution φ ( Z , k ) := ( θ k ( Z ) , − k ) . Then for ξ ∈ S = (cid:8) ( Z , k ) : ̟ ( z ) ∧ ̟ ( z k ) ∧ ς ( k ; Z ) ∧ ς (cid:0) − k ; θ k ( Z ) (cid:1) > (cid:9) , we have d µ φS d µ S ( ξ ) = ̟ ( z k ) ̟ ( z ) ς (cid:0) − k ; θ k ( Z ) (cid:1) ς ( k ; Z ) , and apply Theorem 3. Alg. 4 is in general not practical due to the requirement of simulation from Q ( z , · ) , a prerequisiteto sample from ς ( · ; Z ) . Key to this is to make the dependence of ς ( · ; Z ) on Z ∈ Z Z “finite”, that isdependent on a finite number of coordinates of Z in order to ensure a finite amount of computation.Numerous options are possible and we outline two here. The first one is purely deterministic. Example 18.
Let τ ∈ N and assume that for any Z ∈ Z Z the probability ς ( · ; Z ) is entirely determinedby the τ + 1 states z − τ , . . . , z τ and of support J − τ, τ K . In this case simulating k ∼ ς ( · ; Z ) only requiressimulation of this subsequence. However in order to compute the acceptance ratio it is required tosimulate what is unrealized in the subsequence z k − τ , . . . , z k + τ , that is z i for i ∈ J k − τ, k + τ K ∩ J − τ, τ K ∁ . An example for ς ( · ; Z ) is ς ( k ; Z ) ∝ J − τ,τ K ( k ) ̟ ( z k ) , in which case the acceptance ratio is P k + τi = k − τ ̟ ( z i ) / P τi = − τ ̟ ( z i ) on S .The above example, in the context of HMC, gives a simple version of what is described in Neal (1994),which can of course be embellished in various ways. It is also possible to adapt τ to the realization Z . Example 19.
It is possible to make τ a function Z τ ( Z ) ∈ N in Example 18, more precisely astopping time adapted to sequences of the form z − i , z − i +1 , . . . , z , . . . , z i − , z i such that with n = τ ( Z ) ,sampling k ∼ ς ( · ; Z ) is entirely determined by z − n , . . . , z , . . . , z n . This leads to the same need foradditional simulation i.e. z i for i ∈ J k − τ ◦ θ k ( Z ) , k + τ ◦ θ k ( Z ) K ∩ J − τ ( Z ) , τ ( Z ) K ∁ where we noticethe need to determine the value of the stopping time value for the sequence θ k ( Z ) , also requiredfor the computation of the acceptance ratio P k + τ ◦ θ k ( Z ) i = k − τ ◦ θ k ( Z ) ̟ ( z i ) / P τ ( Z ) i = − τ ( Z ) ̟ ( z i ) on S , for the choice ς ( k ; Z ) ∝ J − τ ( Z ) ,τ ( Z ) K ( k ) ̟ ( z k ) .In the next section we explore a general technique of ensuring that both windows of states coincide,therefore leading to simplified algorithms. Remark . One could considerably weaken the condition (11) in Lemma 5 to ν (d z ) Q ( z, d z ′ ) = ν ∗ (d z ′ ) Q ∗ ( z ′ , d z ) , (12)where ν and ν ∗ are equivalent but not necessarily equal, at the expense of simplicity. In this case, weobtain for Z ∈ S = { Z : ̟ ( z ) ∧ ̟ ( z k ) > } , dΛ kS dΛ S ( Z ) = ̟ ( z k ) ̟ ( z ) F k ( Z ) , where F k ( Z ) = Q ki =1 d ν d ν ∗ ( z i ) k > , Q i = k +1 d ν ∗ d ν ( z i ) k < , k = 0 . When ν = ν ψ , that is ψ is ν − preserving, then d ν ψ / d ν = 1 . The generality of (12) is natural in thecontext of deterministic, invertible maps ψ that are not measure preserving but such that ν ψ ≡ ν . Inparticular, the analogue of Proposition 4 holds with ν ∗ = ν ψ .26 emma 6. Let ψ : Z → Z be invertible and such that ν ψ ≡ ν . Then with Ψ( z, d z ′ ) = δ ψ ( z ) (d z ′ ) and Ψ ∗ ( z, d z ′ ) = δ ψ − ( z ) (d z ′ ) then with ν ∗ = ν ψ , ν (d z )Ψ( z, d z ′ ) = ν ∗ (d z ′ )Ψ ∗ ( z ′ , d z ) . In the specific case that ν is the Lebesgue measure and ψ − is a diffeomorphism, then d ν ψ d ν ( z ) = (cid:12)(cid:12) det( ψ − ) ′ ( z ) (cid:12)(cid:12) is the Jacobian. In Examples 18 and 19, the two windows around z and z k are typically different when k = 0 . Wenow explain how to devise an instance of the framework where the windows around z and z k areidentical by construction. The main idea consists of introducing an auxiliary variable ℓ that can bethought of as determining the left index of the realized window. To be precise, let m ∈ N be the fixedsize of the window to be realized and ξ := ( Z , ℓ, k ) ∈ Z Z × N × N where ℓ ∼ Uniform( J , m − K ) and k ∼ ς ( k ; ℓ, Z ) = ς ( k ; ℓ, z − ℓ , . . . , z r ) with r := m − − ℓ , that is here µ (d ξ ) := Λ (d Z ) ς ( k ; ℓ, Z ) I (cid:8) ℓ ∈ J , m − K (cid:9) /m . Now define the involution φ ( ξ ) = ( θ k ( Z ) , ℓ + k, − k ) , then, observing that by construction θ k ( Z ) − ( ℓ + k ):( r − k ) = ( z − ℓ , . . . , z r ) , we obtain the acceptance ratio r ( Z , ℓ, k ) = ̟ ( z k ) ς ( − k ; ℓ + k, z − ℓ , . . . , z r ) ̟ ( z ) ς ( k ; ℓ, z − ℓ , . . . , z r ) , on S = { ξ : ̟ ( z k ) ∧ ̟ ( z ) ∧ ς ( − k ; ℓ + k, z − ℓ , . . . , z r ) ∧ ς ( k ; ℓ, z − ℓ , . . . , z r ) > } . For example, if ς ( k ; ℓ, z − ℓ , . . . , z r ) ∝ J − ℓ,r K ( k ) ̟ ( z k ) then the acceptance ratio is for all k ∈ J − ℓ, r K such that ς ( k ; ℓ, z − ℓ , . . . , z r ) > . The resulting algorithm is presented in Alg. 5. Algorithm 5
To sample from P (general)MC ( z , · ) (a) Lazily simulate Z ∼ Q ( z , · ) .(b) Simulate ℓ ∼ Uniform( J , m − K ) and k ∼ ς ( k ; ℓ, z − ℓ , . . . , z r ) .(c) With probability a (cid:18) ̟ ( z k ) ς ( − k ; ℓ + k, z − ℓ , . . . , z r ) ̟ ( z ) ς ( k ; ℓ, z − ℓ , . . . , z r ) (cid:19) , output z k . Otherwise output z .In order to introduce NUTS-like kernels, it is helpful at this point to consider the case where m = 2 n for some n ∈ N and we shall reparameterize ℓ as a sequence of n bits. That is, we define b ∈{ , } n to be a sequence of independent Bernoulli(1 / random variates and write ℓ = P nj =1 b j j − ,so that the distribution of ℓ is indeed Uniform( J , m − K ) . In order to specify the involution in thisreparameterization we define β : J , n − K → { , } n to be the function that computes the “reversed”binary representation of its input with n bits, e.g. β (13) = (1 , , , , , which has the property that β ◦ ℓ ( b ) = b . Finally, we specify φ ( Z , b, k ) = ( θ k ( Z ) , β ( ℓ ( b ) + k ) , − k ) . The intuition is that given Z , thebinary string b defines a particular window z − ℓ , . . . , z r around z and for a given k ∈ J − ℓ, r K , there isa corresponding binary string β (cid:0) ℓ ( b ) + k (cid:1) that defines the same window, but around z k .27 lgorithm 6 To sample from P (NUTS)MC ( z , · ) NUTS-like algorithm(a) Lazily simulate Z ∼ Q ( z , · ) and b .(b) Sample ( ℓ, r ) :(i) Set n ← and ℓ ← r ← .(ii) Set n ← n + 1 .(iii) Set ℓ n ← ℓ n − + b n n − and r n ← r n − + (1 − b n )2 n − .(iv) If s n ( Z , b ) = 0 , go to 2(b)(v) Set ℓ ← ℓ n − , r ← r n − .(c) Sample k ∼ ς ( · | Z , ℓ, r ) .(d) With probability a (cid:18) ̟ ( z k ) ς ( − k | θ k ( Z ) , ℓ + k, r − k ) ̟ ( z ) ς ( k | Z , ℓ, r ) (cid:19) , output z k . Otherwise output z . Let π and ν be measures on ( Z , Z ) where π is a probability and ν ≫ π and let ̟ := d π/ d ν .Define for n ∈ N , ℓ n ( b ) := n X j =1 b j j − , r n ( b ) := 2 n − − ℓ n ( b ) . Let ( s n ) n ∈ N be a sequence of functions where s n : Z Z × { , } N → { , } depends only on windows ofstates in a way that is made clear below. For ( Z , b ) ∈ Z N × { , } N define the stopping time τ ( b, Z ) := inf { n ≥ s n ( Z , b ) = 1 } Specifically, we require that s n ( Z , b ) is a function of the vector ( z − ℓ n ( b ) , . . . , z r n ( b ) ) .Note that ℓ n ( b ) and r n ( b ) can be computed recursively, which suggests step (b) in Alg. 6 where for asequence random variables b = ( b , b , . . . ) ∈ { , } N , b i iid ∼ Bernoulli(1 / one finds τ := τ ( b, Z ) andthe final window is defined by ℓ = ℓ τ − ( b ) and r = r τ − ( b ) , i.e. the most recently added states areignored. The reason for this will become clearer below, but is essentially analogous to the argumentin Example 17. We now turn to the specification of ( s n ) n ∈ N . For ( n, b ) ∈ N × { , } N define m n ( b ) :=2 n − − − ℓ n ( b ) , so that J − ℓ n ( b ) , m n ( b ) K and J m n ( b ) + 1 , r n ( b ) K are integer sequences of length n − ,one of which is J ℓ n − ( b ) , r n − ( b ) K :• if b n = 0 then ℓ n − ( b ) = ℓ n ( b ) and r n − ( b ) = r n ( b ) − n − = 2 n − − ℓ n ( b ) − n − = m n ( b ) ,• if b n = 1 then ℓ n − ( b ) = ℓ n ( b ) − n − = − ( m n ( b ) + 1) and r n − ( b ) = r n ( b ) Let for n ∈ N ∗ s n ( Z , b ) = f n − ( z − ℓ n ( b ) , . . . , z m n ( b ) ) ∨ f n − ( z m n ( b )+1 , . . . 
, z r n ( b ) ) , (cid:8) f k : Z k → { , } , k ∈ N (cid:9) is defined recursively via f k ( z k ) = ( g k (cid:0) z k (cid:1) ∨ f k − (cid:0) z k − (cid:1) ∨ f k − (cid:0) z (2 k − +1):2 k (cid:1) k ∈ N ∗ g (cid:0) z (cid:1) k = 0 , for some functions (cid:8) g k : Z k → { , } , k ∈ N (cid:9) , which encode the condition for stopping while thefunctions f k − ( z k − ) and f k − ( z (2 k − +1):2 k ) report whether stopping was triggered in either of thetwo main subtrees. Example 20 (HMC-NUTS) . Assume the setup of Example 14 that is with z = ( x, v ) ∈ X × V we target π ( x, v ) ∝ γ ( x ) κ ( v ) and let Q ( z, d z ′ ) = Ψ( z, d z ′ ) = δ ψ ( z ) (d z ′ ) for ψ the leapfrog mapping and assumethat the dominating measure satisfies ν (d z )Ψ( z, d z ′ ) = ν (d z ′ )Ψ ∗ ( z ′ , d z ) with Ψ ∗ ( z, d z ′ ) = δ ψ − ( z ) (d z ′ ) .A possible choice is for k ∈ N ∗ and ℓ, r ∈ Z such that ℓ − r + 1 = 2 k , g k ( z ℓ . . . , z r ) = I (cid:8) ( x r − x ℓ ) ⊤ v ℓ > (cid:9) ∨ I (cid:8) ( x r − x ℓ ) ⊤ v r < (cid:9) ∨ I (cid:8) max i,j ∈ J ℓ,r K π ( z i ) /π ( z j ) > ∆ max (cid:9) , (13)for some ∆ max > , and g ≡ . The first two indicators correspond to the choice made in Hoffmanand Gelman (2014), the motivation for NUTS–see Appendix D for some details. The last indicator isour own suggestion to address numerical errors, since the setup considered by Hoffman and Gelman(2014) corresponds to Example (21) below and numerical errors are addressed in a slightly differentway. Definition 6 (Slice sampler Besag et al. (1995) and Neal (2003)) . Given a target distribution π , withdensity ̟ w.r.t. some dominating measure ν , one can define an extended target distribution ˜ π via thedecomposition ̟ ( z, u ) = ̟ ( z ) I { u ≤ ̟ ( z ) } /̟ ( z ) = I { u ≤ ̟ ( z ) } , where ̟ is the density of ˜ π w.r.t. the product of ν and the Lebesgue measure, and in which conditionalon z , u is uniformly distributed on [0 , ̟ ( z )] . A MwG Markov kernel leaving ˜ π invariant consistsof sampling u uniformly on [0 , ̟ ( z )] and then applying any Markov kernel leaving the conditionaldistribution π u of z given u invariant: the uniform distribution on the “slice” { z ∈ Z : ̟ ( z ) ≥ u } . Forthe purposes of this work, we may seek to define a sophisticated π u -invariant Markov kernel. Example 21 (sliced-HMC-NUTS) . This is what Hoffman and Gelman (2014) refer to as the “simpli-fied” NUTS algorithm. Here the overall target distribution has density η ( x, v, u ) ∝ I { u ≤ γ ( x ) κ ( v ) } and the algorithm consists of a MwG alternating between updating u given z = ( x, v ) and vice versa.Given u ∈ [0 , γ ⊗ κ ( z )] , we focus on sampling from π u ( x, v ) ∝ I { u ≤ γ ( x ) κ ( v ) } . In this scenario, inaddition to (13) it is suggested to use, for k ∈ N ∗ and ℓ, r ∈ Z such that ℓ − r + 1 = 2 k , g k ( z ℓ . . . , z r ) = I (cid:8) ( x r − x ℓ ) ⊤ v ℓ > (cid:9) ∨ I (cid:8) ( x r − x ℓ ) ⊤ v r < (cid:9) , and g ( z ) = I { log γ ⊗ κ ( z ) < log u − ∆ max } , for ∆ max ≥ in order to stop computation when the error arising from the numerical integration ofHamilton’s dynamic leads to an “astronomically” large error. Lemma 7.
The functions { s n , n ∈ N } satisfy for any ( n, Z , b ) ∈ N × Z Z × { , } N ∗ (a) s n ( Z , b ) depends only on the order and the values of z − ℓ n ( b ) , . . . , z r n ( b ) , and not on how they areindexed;(b) s n ( Z , b ) ≥ s n − ( Z , b ) for n ≥ ,(c) s n ( Z , b ) ≥ g n − ( z − ℓ n − ( b ) , . . . , z r n − ( b ) ) . emark . Part (c) of Lemma 7 is useful for computational reasons: if one observes that s n ( Z , b ) = 0 but g n ( z − ℓ n ( b ) , . . . , z r n ( b ) ) = 1 then one can stop and take τ ( Z , b ) = n + 1 without further simulationbeing needed.Let β k : J , k − K → { , } k be the reversed binary representation of i ∈ J , k − K with k bits, sothat e.g., β (13) = (1 , , , , . This function has the property that β k ◦ ℓ k ( b ) = b k . For b ∈ { , } N ,define ¯ b i := ( b i +1 , b i +2 , . . . ) for i ∈ N . We can now present the result that allows one to relate thestopped Markov processes, which is analogous to Lemma 4 in this setting. Lemma 8.
Let n ∈ N , ℓ, r ∈ N ∗ be such that ℓ + r + 1 = 2 n − , k ∈ J − ℓ, r K , Z ∈ Z Z and Z ′ := θ k ( Z ) .Then for any b = ( b n − , ¯ b n − ) ∈ { , } N ∗ such that τ ( Z , b ) = n , ℓ n − ( b ) = ℓ and r n − ( b ) = r , thereexists a unique b ′ n − = β n − ( ℓ + k ) such that with b ′ = ( b ′ n − , ¯ b n − ) ,(a) ℓ n − ( b ′ ) = ℓ + k ,(b) r n − ( b ′ ) = r − k ,(c) ( z ′− ( ℓ + k ) , . . . , z ′ r − k ) = ( z − ℓ , . . . , z r ) ,(d) and τ ( Z ′ , b ′ ) = n . Corollary 4.
For n ∈ N ∗ and k ∈ J − ℓ n − ( b ) , r n − ( b ) K let χ n − ,k : { , } N ∗ → { , } N ∗ such that χ n − ,k ( b ) := (cid:0) β n − ( ℓ n − ( b ) + k ) , ¯ b (cid:1) . Then ( b, k ) ( χ n − ,k ( b ) , − k ) is an involution since b ′ n − = β n − (cid:0) ℓ n − ( b ) + k (cid:1) , ℓ n − ( b ′ ) = ℓ n − ( b ) + k and b n − = β n − (cid:0) ℓ n − ( b ) + k − k (cid:1) . For n ∈ N ∗ let s ( n, b ) : Z Z × N × { , } N ∗ → { , } s ( Z , n, b ) := s n ( Z , b ) n − Y i =1 [1 − s i ( Z , b )] ,ξ := ( Z , n, b, k ) ∈ Z Z × N × { , } N × Z and µ (d Z , n, b, k ) := Λ (d Z )Υ( b ) s ( Z , n, b ) ς (cid:0) k ; ℓ n − ( b ) , r n − ( b ) , Z (cid:1) . From Lemma 8 for ( n, b ) ∈ N × { , } N s ( n, b ) = 1 implies that s ( n, χ n − ,k ( b )) = 1 for any k ∈ J − ℓ n − ( b ) , r n − ( b ) K . Then with the involution φ ( Z , n, b, k ) = ( θ k ( Z ) , n, χ n − ,k ( b ) , − k ) we define S := { ( Z , n, b, k ) ∈ Ξ : ̟ ( z k ) ∧ ̟ ( z ) ∧ s ( Z , n, b ) ∧ s (cid:0) Z , n, χ n − ,k ( b ) (cid:1) ∧ ς ( k ; ℓ n − ( b ) , r n − ( b ) , Z ) ∧ ς ( − k ; ℓ n − ( b ) + k, r n − ( b ) − k, θ k ( Z )) > } , Now for any ξ ∈ S we obtain an expression for the acceptance ratio r ( ξ ) = dΛ k dΛ ( Z ) ς ( − k ; θ k ( Z ) , ℓ + k, r − k ) ς ( k ; Z , ℓ, r )= ̟ ( z k ) ̟ ( z ) ς ( − k ; θ k ( Z ) , ℓ + k, r − k ) ς ( k ; Z , ℓ, r ) . A natural choice of ς ( k ; Z , ℓ, r ) is ς ( k ; Z , ℓ, r ) ∝ ̟ ( z k ) I { k ∈ J − ℓ, r K } , in which case r ( Z , ℓ, r, k ) = 1 forany k ∈ J − ℓ, r K (This is the same in the slice setting, where it corresponds to choosing uniformly from30oints in the slice). One can always improve this slightly (Peskun) by excluding k = 0 , and having anacceptance ratio that is not in general. That is, taking ς ( k ; Z , ℓ, r ) ∝ ( ̟ ( z k ) I { k ∈ J − ℓ, r K \ { } ) ∃ i ∈ J − ℓ, r K \ { } : ̟ ( z i ) > , I { k = 0 } otherwise . in which case r ( ξ ) = 1 for ξ ∈ S . Our understanding from Betancourt (2017), is that Stan uses this(unsliced) “multinomial” (i.e. categorical) sampling, but the exact expression for ς ( k ; Z , ℓ, r ) is notclear. A simple multiple-try Metropolis (MTM) kernel (Liu, Liang, and Wong, 2000) involves n ∈ N proposalsconditional upon the input, from which one is chosen as a candidate to move to. The acceptanceprobability then involves simulating n − proposals from this candidate. The kernel is presented inAlg. 7. We can write ξ = ( Z , Z , k, ℓ ) where Z , Z ∈ Z n , k, ℓ ∈ J , n K , and here ξ = z ,k . For simplicitywe will assume that the target π and proposals Q ( z, · ) , z ∈ Z , have densities ̟ and q ( z, · ) w.r.t. acommon reference measure. We can write, ρ ( Z , Z , k, ℓ ) = I { k ∈ J , n K } n ̟ ( z ,k ) ( n Y i =1 q ( z ,k , z ,i ) ) ς ( ℓ ; z ,k , Z ) n Y i =1 ,i = k q ( z ,ℓ , z ,i ) . The associated involution is φ ( Z , Z , k, ℓ ) := ( Z , Z , ℓ, k ) , so that the acceptance ratio is, for ξ ∈ Sr ( ξ ) = ρ ◦ φρ ( ξ ) = ̟ ( z ,ℓ ) q ( z ,ℓ , z ,k ) ς ( k ; z ,ℓ , Z ) ̟ ( z ,k ) q ( z ,k , z ,ℓ ) ς ( ℓ ; z ,k , Z ) . In practice, one often chooses ς ( ℓ ; z ,k , Z ) ∝ w ( z ,ℓ , z ,k ) where w is a weight function. In particular,Liu, Liang, and Wong (2000) suggest to use w ( z, z ′ ) = ̟ ( z ) q ( z, z ′ ) λ ( z, z ′ ) , where λ ( z, z ′ ) = λ ( z ′ , z ) for all z, z ′ ∈ Z . Then the acceptance ratio can be expressed as r ( ξ ) = P ni =1 w ( z ,i , z ,k ) P ni =1 w ( z ,i , z ,ℓ ) , for ξ ∈ S . To illustrate, a possible choice is to take λ ( z, z ′ ) = { q ( z, z ′ ) q ( z ′ , z ) } − , in which case w ( z, z ′ ) = ̟ ( z ) /q ( z ′ , z ) . 
We consider now locally adaptive selection of the number of samples n in MTM. In particular, theapproach taken in Section 6.1 needs to be slightly adapted and then the stopping time random variablesintroduced, one for each of the Z := ( z , , z , , . . . ) ∈ Z N ∗ samples and the Z := ( z , , z , , . . . ) ∈ Z N ∗ samples and for ( m, n ) ∈ { , } × N ∗ let Z m,n := ( z m, , . . . , z m,n ) . The kernel is presented in Alg. 8.We define ξ := ( Z , Z , m, n, k, ℓ ) where m, n ∈ N , ( Z , Z ) ∈ Z N ∗ × Z N ∗ and ( k, ℓ ) ∈ J m K × J n K . Welet ξ = z , . Let σ k : Z N ∗ → Z N ∗ be the swapping function such that, with Z ′ = σ k ( Z ) , z ′ = z k ,31 lgorithm 7 Standard MTM kernel(a) Given z , sample k ∼ Uniform (cid:0) J , n K (cid:1) and set z ,k = z (b) Sample z ,i iid ∼ Q ( z ,k , · ) for i ∈ J n K (c) Sample ℓ ∼ ς ( · ; z ,k , Z ) (d) Sample z ,i iid ∼ Q ( z ,ℓ , · ) for i ∈ J n K \ { k } (e) With probability a (cid:18) ̟ ( z ,ℓ ) q ( z ,ℓ , z ,k ) ς ( k ; z ,ℓ , Z ) ̟ ( z ,k ) q ( z ,k , z ,ℓ ) ς ( ℓ ; z ,k , Z ) (cid:19) return z ,ℓ , otherwise z ,k . z ′ k = z and z ′ j = z j for j / ∈ { , k } . For any i ∈ N ∗ let s i : Z × Z N ∗ → { , } be such that i s i ( z, Z ′ ) is non-decreasing and s i ( z, Z ′ ) = s i (cid:0) z, σ ℓ ( Z ′ ) (cid:1) for any ℓ ∈ J i − K . For example, one could choose s i ( z, Z ′ ) = I nP ij =1 w ( z, z ′ i ) ≥ c o for all i ∈ N ∗ and for some c > , where w is a weight function asdescribed in the previous subsection.The “forward” stopping times of interest are, for z ∈ Z and Z , Z ∈ Z N τ ( z, Z ) = inf { n ≥ s n ( z, Z ) = 1 } and τ ( z, Z ) = inf { n ≥ s n ( z, Z ) = 1 } , and we define the { , } -valued functions, for n ∈ N ∗ , s ( n, z, Z ) = s n ( z, Z ) n − Y i =1 (cid:2) − s i ( z, Z ) (cid:3) . For i ∈ { , } the quantity s ( n, z − i, , Z i ) can be thought of as the probability that τ i = n given thevalues z − i, and Z i,n . We define Q ( z, d Z ) to correspond to the distribution of z i iid ∼ Q ( z, · ) for i ∈ N ∗ and Q ( z, d Z ) such that for i ≥ z i iid ∼ Q ( z, · ) and Q ( z, d z ) = π (d z ) , that is under µ below z , ∼ π , µ (cid:0) d( Z , Z ) , m, n, k, ℓ (cid:1) := Q ( z , , d Z ) s ( n, z , , Z ) ς ( ℓ ; z , , Z , n − Q ( z ,ℓ , d Z ) s ( m, z ,ℓ , Z ) I (cid:8) k ∈ J m − K (cid:9) m − , which is the distribution of a process that simulates the stopped processes described above, and chooses ℓ and k , respectively, from a categorical distribution on J n − K and a uniform distribution on J m − K .We define φ ( Z , Z , m, n, k, ℓ ) = ( σ ℓ ( Z ) , σ k ( Z ) , n, m, ℓ, k ) . It is straightforward to verify that φ is aninvolution. What is more interesting is, assuming densities as in the previous subsection, that for ξ ∈ Sr ( ξ ) = ̟ ( z ,ℓ ) q ( z ,ℓ , z ,k ) ς ( k ; z ,ℓ , σ k ( Z ) , m − ̟ ( z , ) q ( z ,k , z ,ℓ ) ς ( ℓ ; z , , Z , n − , one can apply Lemma 4 (reindexing to take into account 1-indexing as opposed to 0-indexing) twiceto determine that s ( n, z , , Z ) = 1 implies s (cid:0) n, z ,k , σ ℓ ( Z ) (cid:1) = 1 and s ( m, z ,ℓ , Z ) = 1 implies s (cid:0) m, z ,ℓ , σ k ( Z ) (cid:1) = 1 . Remark . It is possible, of course, to specify functions s i that do not satisfy the conditions above. Inthis case, the reverse and the forward stopping time probabilities are not necessarily equal, and theirratios will appear in the acceptance ratio. 
32 lgorithm 8 Locally adaptive MTM kernel(a) Given z set z , = z (b) Sample z ,i iid ∼ Q ( z , , · ) for i ≥ lazily and obtain n = τ ( z , , Z ) (c) Sample ℓ ∼ ς ( · ; z , , Z , n ) (d) Sample z ,i iid ∼ Q ( z ,ℓ , · ) for i ≥ lazily and obtain m = τ ( z ,ℓ , Z ) (e) Sample k ∼ Uniform (cid:0) J m − K (cid:1) .(f) With probability a (cid:18) ̟ ( z ,ℓ ) q ( z ,ℓ , z ,k ) ς ( k ; z ,ℓ , σ k ( Z ) , m − ̟ ( z , ) q ( z ,k , z ,ℓ ) ς ( ℓ ; z , , Z , n − (cid:19) return z ,ℓ , otherwise z , . It is relatively straightforward to adapt the MTM kernels above to the pseudo-marginal setting (Beau-mont, 2003; Christophe Andrieu and G. O. Roberts, 2009). It is also possible to extend the examplebelow to the situation where one uses stopping times to determine the number of simulations, and alsoto the particle MCMC (Christophe Andrieu, Arnaud Doucet, and Holenstein, 2010) setting, as is donein Anthony Lee (2011) which also contains an earlier version of the stopping time framework detailedin Section 6.2. Anthony Lee (2012) and Del Moral et al. (2015) provide some examples of each insimple scenarios.
Example 22 (Pseudo-marginal MTM) . In particular, in this setting one targets a distribution withdensity ̟ ( z ) w.r.t. some measure ν where ̟ : Z → R + cannot be calculated but for any z ∈ Z one cansimulate a random variable w ∼ Q z with expectation ̟ ( z ) . We introduce the auxiliary distributionwith density π (d z, d w ) = ν (d z ) wQ z (d w ) , such that π (d z ) = ν (d z ) R wQ z (d w ) = ν (d z ) ̟ ( z ) . Letting Q be a Markov kernel evolving on Z , W = ( w , . . . , w n ) ∈ R n and W ′ = ( w ′ , . . . , w ′ n ) ∈ R n we consider the choice ξ = ( z, z ′ , W , W ′ , k, ℓ ) ∈ Z × Z × R n × R n , ξ = ( z, w k ) and µ (d z, d z ′ , d W , d W ′ , k, ℓ ) = I { k ∈ J n K } n ν (d z ) w k Q ⊗ nz (d W ) Q ( z, d z ′ ) Q ⊗ nz ′ (d W ′ ) ς ( ℓ ; W ′ ) , where ς ( k ; W ) ∝ w k J n K ( k ) . The involution can be chosen to be φ ( z, z ′ , W , W ′ , k, ℓ ) = ( z ′ , z, W ′ , W , ℓ, k ) ,giving the acceptance ratio r ( ξ ) = p ( z ′ ) q ( z ′ , z ) p ( z ) q ( z, z ′ ) · P ni =1 w ′ i P ni =1 w i , where for simplicity we assume that { Q ( z, · ); z ∈ Z } and ν have densities { q ( z, · ); z ∈ Z } and p w.r.t.some dominating reference measure. We can view the averages of the w i (resp. w ′ i ) as approximationsof ̟ ( z ) (resp. ̟ ( z ′ ) ) and the value of n controls the variability of the approximation. The corre-sponding Markov kernel is subtly different from the standard pseudo-marginal approach, in that hereone simulates k ∼ Uniform( J n K ) and W − k ∼ Q ⊗ n − z rather than having these variables fixed. In somesense, one can view the standard pseudo-marginal kernel as a MwG approach where one fixes ( k, W ) ,rather than only fixing ξ . 33he next example is an interesting variant in which a shared stopping time is defined, and which hasbeen shown to inherit desirable properties from the limiting MH kernel associated with the pair ( π, Q ) as n → ∞ but which would naturally require computation of f under conditions where the kernel ofExample 22 with any fixed n would not (Anthony Lee and Łatuszyński, 2014). Example 23 (One-hit kernel of A. Lee, C. Andrieu, and A. Doucet, 2012) . Consider the setting ofExample 22 but where Q z is a Bernoulli( ̟ ( z )) distribution with ̟ : Z → [0 , . In this case, w = 1 under π . We want here to adapt the number of simulations so that the acceptance ratio is a reasonableapproximation of the limiting acceptance ratio as n → ∞ , but does not require an excessive numberof simulations. In particular, using a fixed number of simulations may lead to acceptance ratios witha large variance and hence a Markov chain that can get “stuck” for long periods when in regions of thestate space with very small ̟ ( z ) . Let ξ = ( z, z ′ , W , W ′ , n ) ∈ Z × Z × { , } N ∗ × { , } N ∗ × N ∗ , ξ = ( z, w ) and let for z ∈ Z , F z be the probability measure associated with an infinite sequence of independent Q z -distributed random variables. The idea is given z, z ′ ∈ Z and w = 1 we wish to simulate w i iid ∼ Q z , i ∈ N , i ≥ and w ′ i iid ∼ Q z ′ , i ∈ N ∗ independently until there is one further “hit”, i.e. w i and/or w ′ i isequal to . So we define τ ( ξ ) := inf { n ≥ s n ( W , W ′ ) = 1 } , where s n ( W , W ′ ) = I { P ni =1 w i + P ni =1 w ′ i ≥ } . We then define µ (d z, d z ′ , d W , d W ′ , n ) := ν (d z ) w Q ( z, d z ′ ) F z (d W ) F z ′ (d W ′ ) s ( n, W , W ′ ) , where for n ∈ N ∗ , s ( n, W , W ′ ) = s n ( W , W ′ ) Q n − i =1 [1 − s i ( W , W ′ )] so that if ξ = ( z, z ′ , W , W ′ , n ) ∼ µ then n = τ ( ξ ) . 
We define the involution φ ( z, z ′ , W , W ′ , n ) = ( z ′ , z, σ n ( W ′ ) , σ n ( W ) , n ) , where for i ∈ N ∗ , σ i : { , } N ∗ → { , } N ∗ is the permutation that swaps its − st and i − th inputs. Wecan obtain, for ξ such that p ( z ) q ( z, z ′ ) > , r ( ξ ) = p ( z ′ ) q ( z ′ , z ) p ( z ) q ( z, z ′ ) I { w = w ′ n = 1 , τ ( ξ ) = τ ◦ φ ( ξ ) = n } , i.e. it is essential that w ′ n = 1 and that the stopping time is preserved by the involution. Weobserve that if τ ( ξ ) = 1 , then necessarily w ′ = 1 and τ ◦ φ ( ξ ) = 1 = τ ( ξ ) , so the indicator aboveis . Now consider τ ( ξ ) > . If w τ ( ξ ) = 0 then ( w , w , . . . , w τ ( ξ ) ) = (1 , , . . . , , while necessarily ( w ′ , . . . , w ′ τ ( ξ ) ) = (0 , . . . , , so τ ◦ φ ( ξ ) = τ ( ξ ) and the indicator above is . If on the other hand w τ ( ξ ) = 1 then ( w , w , . . . , w τ ( ξ ) ) = (1 , , . . . , so even if w ′ τ ( ξ ) = 1 we have τ ◦ φ ( ξ ) = 1 = τ ( ξ ) andso r ( ξ ) = 0 . 34 lgorithm 9 Stochastic delayed rejection(a) Given z ∈ Z , set k ← and z ← z .(b) Set k ← k + 1 and simulate z k ∼ Q k ( Z k − , · ) .(c) With probability α k ( Z k ) output φ k ( Z k ) , otherwise go to 2. In delayed rejection, several sources of randomness and involutions are considered in turn until one isaccepted.
Let π be a probability distribution on ( Z , Z ) . For k ∈ N ∗ let ( Z k , Z k ) be measurable spaces, for Z k − ∈ Z k − := Z × Z × · · · Z k − let Q k ( Z k − , · ) be probability distributions on ( Z k , Z k ) . Define η k (d Z k ) := π (d z ) Q ki =1 Q i ( Z i − ; d z i ) , for k ∈ N ∗ let φ k : Z k +1 → Z k +1 be involutions and let α k = a k ◦ r k for an acceptance function a k and β k ( Z k ) = β k − ( Z k − )[1 − α k − ( Z k − )] with β ≡ and α ≡ where r k ( Z k ) := β k ◦ φ k β k ( Z k ) d η φkk,Sk d η k,Sk ( Z k ) Z k ∈ S k otherwise , with S k := S ( η k , η φ k k ) ∩ { Z k ∈ Z k : β k ( Z k ) ∧ β k ◦ φ k ( Z k ) > } , where S ( η k , η φ k k ) is as in Theorem 3. Thedelayed rejection algorithm is described in Alg. 9 and its justification follows from the following: Proposition 5.
With the notation above, define the probability distribution of marginal π , µ ( k, d Z k ) : = α k ( Z k ) β k ( Z k ) η k (d Z k ) , on E = { ( k, Z k ) ∈ { k } × Z k : k ∈ N } and for any ξ = ( k, Z k ) ∈ E the involution φ ( ξ ) = φ ( k, Z k ) = (cid:0) k, φ k ( Z k ) (cid:1) . Then, r ( ξ ) = ( ξ ∈ S ( µ, µ φ )0 otherwise . Although β k ( Z k ) can be updated as the algorithm progresses, the computation of β k ◦ φ k ( Z k ) can beexpensive. Indeed, letting ζ k := φ k ( Z k ) for k ∈ N ∗ , we have β k ◦ φ k ( Z k ) = β k ( ζ k ) = k − Y i =1 [1 − α i ( ζ i )] , where for i ∈ J k − K α i ( ζ i ) = a i ◦ r i ( ζ i ) which may need to be computed afresh for each value of k ingeneral. We will see in Subsection 7.2 an interesting scenario where this is not the case. Example 24 (Delayed-rejection of Tierney and Mira, 1999) . Assume Q i has density q i with respectto the Lebesgue or counting measure for each i ∈ J n K for some n ∈ N ∗ and φ i ( z , z , . . . , z i − , z i ) =( z i , z i − , . . . , z , z ) for i ∈ J n − K “reverse time” and φ n = Id (which ensures finite computations). Example 25 (Generalized delayed-rejection of Green and Mira, 2001) . In this scenario involutionsother than those of Example 24 can be used. As an example, one can choose Q ( x, d y ) on X × Y , Q ( x, y ; d z, d w ) on X × Y × Z ⊗ W and Q arbitrary. Assume Q i has density q i for each i ∈ { , } ,we may choose φ ( x, y ) = ( y, x ) , φ ( x, y, z, w ) = ( z, w, x, y ) and φ ( x, y, z, w, . . . ) = ( x, y, z, w, . . . ) .35 lgorithm 10 Deterministic delayed rejection(a) Given z ∈ Z , set k ← .(b) Set k ← k + 1 .(c) With probability α k ( z ) output φ k ( z ) otherwise go to 2. Delayed rejection can be usefully applied to sample from π defined on ( Z , Z ) using purely deterministicproposals. It is possible to use the framework above, but it is more convenient notationally andconceptually to instead consider E := { ( k, z ) : k ∈ N , z ∈ Z } , the embedding distribution, for k ∈ N ∗ , µ ( k, d z ) = α k ( z ) β k ( z ) π (d z ) , involutions φ k : Z → Z and as before, for each k ∈ N , let α k = a k ◦ r k , β k ( z ) = β k − ( z )[1 − α k − ( z )] with β ≡ and α ≡ , where r k ( z ) = d π φkSk d π Sk ( z ) β k ◦ φ k β k ( z ) z ∈ S k otherwise , with S k = S ( π, π φ k ) ∩ { z ∈ Z : β k ( z ) ∧ β k ◦ φ k ( z ) > } . An algorithmic presentation of delayed rejectionwith deterministic proposals is given in Alg. 10. Its justification follows along the same lines as above,and as before one can choose φ n = Id for some n ∈ N ∗ to ensure that the stage n “proposal” is accepted.As in the stochastic scenario, the computation of β k ◦ φ k ( z ) can be expensive since this requires inparticular the computation of r ◦ φ k ( z ) , . . . , r k − ◦ φ k ( z ) . Assume for simplicity that π has a density ̟ with respect to some measure ν invariant under φ i for i ∈ J k − K , then we remark that on S k r i ◦ φ k ( z ) = ̟ ◦ φ i ◦ φ k ̟ ◦ φ k ( z ) β i ◦ φ i ◦ φ k β i ◦ φ k ( z ) . Then, if for i ≤ k the identity ̟ ◦ φ i ◦ φ k = ̟ ◦ φ k − i holds, we see that no new evaluation of theprobability density is required, which is to be contrasted with the general setup in Subsection 7.1.This identity holds when for ψ : Z → Z , assumed invertible, is such that for an involution σ : Z → Z , σ ◦ ψ ◦ σ = ψ − one considers the involutions φ i = σ ◦ ψ i and has the property ̟ ◦ σ = ̟ , since φ i ◦ φ k = σ ◦ ψ i ◦ σ ◦ ψ k = ψ − i ◦ ψ k = ψ k − i = σ ◦ φ k − i . 
This is the setup considered in Sohl-Dickstein, Mudigonda, and DeWeese (2014) and Campos and J.Sanz-Serna (2015) where an additional twist, detailed in the next subsection, is used.
Example 26 (DR deterministic) . Consider π defined on (cid:0) X × V , X ⊗ V (cid:1) of density ̟ ( x, v ) = γ ( x ) κ ( v ) and let φ i := σ ◦ ψ i for i ∈ J n − K and φ n = Id with ψ ( x, v ) := ( x + v, v ) and σ ( x, v ) = ( x, − v ) . This canbe useful when trying to traverse a region of low probability. As an example, let X × V = J K × {− , } and ̟ (1 ,
1) = ̟ (1 , − > , ̟ (3 ,
1) = ̟ (3 , − > but ̟ (2 ,
1) = ̟ (2 , −
1) = 0 . In this scenario it isa good idea to choose n = 3 rather than the standard n = 2 choice. The introduction of an auxiliary slice variable can mitigate the computational cost of the delayedrejection approach, and some recently proposed algorithms Sohl-Dickstein, Mudigonda, and DeWeese362014) and Campos and J. Sanz-Serna (2015) can be viewed as following this principle. In particular,we can define ̟ ( z, u ) = ̟ ( z ) I { u ≤ ̟ ( z ) } /̟ ( z ) = I { u ≤ ̟ ( z ) } , and use a slice sampler (Neal, 2003) (see Definition 6), that is a MwG alternating between updating u given z and vice-versa, that is sampling uniformly from the “slice” { z ∈ Z : ̟ ( z ) ≥ u } , for a fixed u ∈ R + . One may use any Markov kernel that leaves this distribution invariant and we naturally focuson MH type updates. Example 27 (Extra chance slice) . For some fixed u , let ̟ u ( z ) ∝ I { u ≤ ̟ ( z ) } . Let φ i ( z ) = σ ◦ ψ i ( z ) for i ∈ J n − K and φ n = Id . If a i ( r ) = a ( r ) = 1 ∧ r , we find that (see Appendix C for a proof) for z ∈ Z , for k ∈ J n K and the convention ∨ i =1 = 0 , r k ( z ) = I {∨ k − i =1 ̟ ◦ φ i ( z ) < u ≤ ̟ ( z ) ∧ ̟ ◦ φ k ( z ) } while β k ( z ) = I { ̟ ( z ) ∧ ∨ k − i =1 ̟ ◦ φ i ( z ) < u } . Hence, one accepts as soon as ̟ ◦ φ k ( z ) ≥ u or onereaches the identity involution φ n = Id . One notices that for ̟ ( z ) > and u ∼ Uniform(0 , ̟ ( z )) then u := u/̟ ( z ) ∼ Uniform(0 , and one can rewrite r k ( z ) = I n ∨ k − i =1 ̟ ◦ φ i ( z ) /̟ ( z ) < u ≤ ∧ (cid:0) ̟ ◦ φ k ( z ) /̟ ( z ) (cid:1)o . The overall slice sampler therefore looks like a standard MH algorithm targetting ̟ , where given u ∼ Uniform(0 , one scans the states φ i ( z ) for i ∈ J n K until the right hand side inequality issatisfied or n is reached. When n = 2 we recover the standard MH algorithm targetting ̟ and withdeterministic proposal, corresponding to a remark going as far back as Higdon (1998).In the context of HMC samplers this can be a way of taking into account the oscillatory nature of theenergy i H ◦ ψ i ( x, v ) under the leapfrog dynamics. More specifically we may have ̟ ◦ φ k ( z ) ≥ u even though ̟ ◦ φ i ( z ) < u for i ∈ J k − K . Note that ψ may involve several steps of the numericalintegrator (which preserves Lebesgue measure and is time-reversible). Example 28.
The “sequential-proposal Metropolis(–Hastings) algorithm” of Park and Atchadé (2020)shares the precise structure of Campos and J. Sanz-Serna (2015) albeit in the scenario where the statesare proposed randomly, but this connection was not made by the authors.
Let b be a volume preserving “bounce” involution, e.g. with b v ( x, v ) := v − h v, n ( x ) i n ( x ) for somefunction n : R d → R d such that for all x ∈ X , k n ( x ) k = 1 we let b( x, v ) := (cid:0) x, b v ( x, v ) (cid:1) for ( x, v ) ∈ X × V .To fix ideas, for the two following examples the scenario where ψ ( x, v ) = ( x + v, v ) corresponds to thealgorithms of Sherlock and Thiery (2017) and Vanetti et al. (2017). Similar ideas are briefly alludedto in Neal (2003). Example 29 (Bouncy I - Sherlock and Thiery (2017)) . Let φ = σ ◦ ψ and φ = φ ◦ b ◦ φ . Note that φ is an involution since φ and b are involutions. We have the convenient property that φ ◦ φ = b ◦ φ (since φ is an involution), and b ◦ φ ( z ) will already have been computed to produce φ ( z ) . We have b ◦ φ ( x, v ) = (cid:0) x + v, b v ( x + v, − v ) (cid:1) and φ ( x, v ) = (cid:0) x + v + b v ( x + v, − v ) , − b v ( x + v, − v ) (cid:1) and thereforefor ξ ∈ S the acceptance ratio is r ( ξ ) = ¯ α ◦ φ ( x, v )( γ ⊗ κ ) ◦ φ ( x, v )¯ α ( x, v ) γ ⊗ κ ( x, v ) = γ (cid:0) x + v + b v ( x + v, − v ) (cid:1) γ ( x ) since, r ◦ φ ( x, v ) = γ ( x + v ) κ ◦ b v ( x + v, − v ) γ ( x ) κ ( v ) = r ( x, v ) . xample 30 (Bouncy II - Vanetti et al. (2017)) . Let φ = σ ◦ ψ and φ = b , where b is an involution.Here more computations are required since φ ◦ φ = φ ◦ b and φ ◦ b( z ) is not typically computed asa by-product of computing φ ( z ) or φ ( z ) . Here for ξ = ( x, v ) ∈ Sr ( ξ ) = ¯ α ◦ φ ( x, v )( γ ⊗ κ ) ◦ φ ( x, v )¯ α ( x, v ) γ ⊗ κ ( x, v ) = ¯ α ◦ φ ( x, v )¯ α ( x, v ) where r ◦ φ ( x, v ) = γ (cid:0) x + b v ( x, v ) (cid:1) κ (cid:0) − b v ( x + v, − v ) (cid:1) γ ( x ) κ ( v ) = γ (cid:0) x + b v ( x, v ) (cid:1) γ ( x ) . In a lineage of contributions Jaster (1999), Bernard, Krauth, and Wilson (2009), Michel, Kapfer,and Krauth (2014), Michel, Mayer, and Krauth (2015), and Michel (2016) efficient continuous timenonreversible Markov process Monte Carlo (MPMC) algorithms have been developed to sample frommodels arising in statistical physics. We show here that it is possible to develop discrete time andexact counterparts of those, that is algorithms of finite run time without any approximation but themachine numerical precision limit and are ensured to leave the desired distribution invariant. Morespecifically let x = ( x , . . . , x m ) for some m ∈ N where x , . . . , x m ∈ X . For example X might be abounded subset of R d for some d or as is a common in the physics literature a torus. This can bethought of as the positions of m particles, modelled as spheres. We also define a velocity variable v ∈ V ⊂ R d . The target distribution has density with respect to some measure ν , which can be theproduct of the Lebesgue or Hausdorff or counting measure depending on the scenario considered, ̟ ( x, v ) = κ ( v ) γ ( x ) ∝ κ ( v ) Y ≤ i Andrieu, Christophe (2016). “On random-and systematic-scan samplers”. In: Biometrika Linkto slides. eprint: https://cs.hse.ru/hdilab/sihdm/2019/ .Andrieu, Christophe, Arnaud Doucet, and Roman Holenstein (2010). “Particle markov chain montecarlo methods”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) ArXiv e-prints .Andrieu, Christophe and Samuel Livingstone (June 14, 2019). “Peskun-Tierney ordering for Markovchain and process Monte Carlo: beyond the reversible scenario”. In: arXiv: http://arxiv.org/abs/1906.06197v1 [math.PR] .Andrieu, Christophe and Gareth O Roberts (2009). “The pseudo-marginal approach for efficient MonteCarlo computations”. 
In: The Annals of Statistics , pp. 697–725.Andrieu, Christophe and Johannes Thoms (2008). “A tutorial on adaptive MCMC”. In: Statistics andcomputing Genetics issn : 0016-6731. eprint: . url : .Bernard, Etienne P, Werner Krauth, and David B Wilson (2009). “Event-chain Monte Carlo algorithmsfor hard-sphere systems”. In: Physical Review E Journal of theRoyal Statistical Society: Series B (Methodological) Statistical science ,pp. 3–41.Betancourt, Michael (2017). “A conceptual introduction to Hamiltonian Monte Carlo”. In: arXivpreprint arXiv:1701.02434 .Billingsley, Patrick (1995). Probability and measure . John Wiley & Sons.Campos, Cédric M and JM Sanz-Serna (2015). “Extra chance generalized hybrid Monte Carlo”. In: Journal of Computational Physics The Journal of chemical physics Statistics &Probability Letters Automating Invo-lutive MCMC using Probabilistic and Differentiable Programming . arXiv: .Del Moral, Pierre et al. (2015). “The alive particle filter and its use in particle Markov chain MonteCarlo”. In: Stochastic Analysis and Applications Annals of Applied Probability , pp. 726–752.Durmus, Alain, Arnaud Guillin, and Pierre Monmarché (July 2018). “Piecewise Deterministic MarkovProcesses and their invariant measure”. In: arXiv e-prints , arXiv:1807.05421, arXiv:1807.05421.arXiv: .Durmus, Alain, Eric Moulines, and Eero Saksman (2017). “On the convergence of hamiltonian montecarlo”. In: arXiv preprint arXiv:1705.00166 .Durrett, Rick (2019). Probability: Theory and Examples . 5th ed. Cambridge University Press.Dutta, Somak and Sourabh Bhattacharya (2014). “Markov chain Monte Carlo based on deterministictransformations”. In: Statistical Methodology 16, pp. 100–116.Engelbert, H. and A. Shiryaev (1980). “On absolute continuity and singularity of probability measures”.eng. In: Banach Center Publications url : http://eudml.org/doc/209108 .41ang, Youhan, Jesus-Maria Sanz-Serna, and Robert D Skeel (2014). “Compressible generalized hybridMonte Carlo”. In: The Journal of chemical physics Real analysis: modern techniques and their applications . Vol. 40. John Wiley& Sons.Fremlin, DH (Jan. 2010). Measure Theory . Second Edition. Vol. 2. https://wiki.math.ntnu.no/_media/tma4225/2011/fremlin-vol2.pdf :Torres Fremlin.Glatt-Holtz, Nathan E., Justin A. Krometis, and Cecilia F. Mondaini (2020). On the accept-rejectmechanism for Metropolis-Hastings algorithms . arXiv: .Graham, Matthew McKenzie (2018). “Auxiliary Variable Markov Chain Monte Carlo Methods”. In: https: // matt-graham. github. io/ files/ phd_ thesis. pdf .Green, Peter J (1995). “Reversible jump Markov chain Monte Carlo computation and Bayesian modeldetermination”. In: Biometrika Biometrika Statistics and computing The Annals of Applied Probability EPL (Europhysics Letters) Biometrika Journal of the American Statistical Association doi : .eprint: https://amstat.tandfonline.com/doi/pdf/10.1080/01621459.1998.10473712 . url : https://amstat.tandfonline.com/doi/abs/10.1080/01621459.1998.10473712 .Hoffman, Matthew D and Andrew Gelman (2014). “The No-U-Turn sampler: adaptively setting pathlengths in Hamiltonian Monte Carlo.” In: J. Mach. Learn. Res. Physics Letters B Physica A:Statistical Mechanics and its Applications Physica D: Nonlinear Phenomena J. R. Stat. Soc. Ser. B Stat. Methodol. Proceedings of the 2012 Winter Simulation Conference (WSC) . IEEE, pp. 1–12.Lee, Anthony and Krzysztof Łatuszyński (2014). “Variance bounding and geometric ergodicity ofMarkov chain Monte Carlo kernels for approximate Bayesian computation”. 
In: Biometrika Journal of the American Statistical Association TheAnnals of Statistics The journal of chemical physics url : .Michel, Manon, Sebastian C Kapfer, and Werner Krauth (2014). “Generalized event-chain Monte Carlo:Constructing rejection-free global-balance algorithms from infinitesimal steps”. In: The Journal ofchemical physics EPL (Europhysics Letters) Journal of Computational Physics Statistics andcomputing Learning in graphical models . Springer, pp. 205–228.– (2003). “Slice sampling”. In: Annals of statistics , pp. 705–741.– (2004). “Improving asymptotic variance of MCMC estimators: Non-reversible chains are better”. In: arXiv preprint math/0407281 .– (2005). “Taking bigger Metropolis steps by dragging fast variables”. In: arXiv preprint math/0502099 .Neal, Radford M et al. (2011). “MCMC using Hamiltonian dynamics”. In: Handbook of Markov ChainMonte Carlo 2, pp. 113–162.Neklyudov, Kirill et al. (2020). “Involutive mcmc: a unifying framework”. In: International Conferenceon Machine Learning . PMLR, pp. 7273–7282.Park, Joonha and Yves Atchadé (2020). “Markov chain Monte Carlo algorithms with sequential pro-posals”. In: Statistics and Computing Biometrika issn : 00063444. url : .Poncet, Romain (2017). “Generalized and hybrid MCMC overdamped Langevin algorithms”. In: arXivpreprint arXiv:1701.05833 .Sherlock, Chris and Alexandre H Thiery (2017). “A Discrete Bouncy Particle Sampler”. In: arXivpreprint arXiv:1707.05200 .Sohl-Dickstein, Jascha, Mayur Mudigonda, and Michael R DeWeese (2014). “Hamiltonian Monte Carlowithout detailed balance”. In: arXiv preprint arXiv:1409.5191 .Sun, Yi, Jürgen Schmidhuber, and Faustino Gomez (2010). “Improving the asymptotic performanceof Markov chain Monte-Carlo by inserting vortices”. In: Advances in Neural Information ProcessingSystems 23, pp. 2235–2243.Thin, Achille, Alain Durmus, et al. (2020). “Nonreversible MCMC from conditional invertible trans-forms: a complete recipe with convergence guarantees”.Thin, Achille, Nikita Kotelevskii, et al. (2020). MetFlow: A New Efficient Method for Bridging the Gapbetween Markov Chain Monte Carlo and Variational Inference . arXiv: .Tierney, Luke (Feb. 1998). “A note on Metropolis-Hastings kernels for general state spaces”. In: Ann.Appl. Probab. doi : . url : http://dx.doi.org/10.1214/aoap/1027961031 .Tierney, Luke and Antonietta Mira (1999). “Some adaptive Monte Carlo methods for Bayesian infer-ence”. In: Statistics in medicine arXiv preprintarXiv:1707.05296 .Yaglom, Akiva Moiseevich (1949). “On the statistical reversibility of Brownian motion”. In: Matem-aticheskii Sbornik Proofs Proof of Theorem 3. Let λ = µ + µ φ and define ρ = d µ/ d λ . We observe that λ φ = λ . Then for any A ∈ E , we find using Theorems 2–1, Z A ( ξ ) ρ ◦ φ ( ξ ) λ (d ξ ) = Z A ◦ φ ( ξ ) ρ ( ξ ) λ φ (d ξ )= Z A ◦ φ ( ξ ) µ (d ξ )= Z A ( ξ ) µ φ (d ξ ) , so ρ ◦ φ = d µ φ / d λ . Define S = { ξ ∈ E : ρ ( ξ ) ∧ ρ ◦ φ ( ξ ) > } , which satisfies φ ( S ) = S . Since S is theintersection of two measurable sets, it is measurable and so the restrictions of µ and µ φ to S are welldefined, and for ξ ∈ S , d µ φS d µ S ( ξ ) = ρ ◦ φρ ( ξ ) , d µ S d µ φS ( ξ ) = ρρ ◦ φ ( ξ ) , so µ S ≡ µ φS . Let A = { ξ : ρ ( ξ ) = 0 } and B = { ξ : ρ ( ξ ) > , ρ ◦ φ ( ξ ) = 0 } . We deduce that µ ( A ) = R A ρ ( ξ ) λ (d ξ ) = 0 and µ φ ( B ) = R B ρ ◦ φ ( ξ ) λ (d ξ ) = 0 . Since A ∩ B = ∅ , A ∪ B = S ∁ and µ ( A ) = µ φ ( B ) = 0 we conclude that µ and µ φ are mutually singular on S ∁ .For part b(i), ( ρ ◦ φ/ρ ) ◦ φ = ρ/ρ ◦ φ on S , since φ is an involution. 
Hence, r ◦ φ = 1 /r on S and since a (0) = 0 , α = r · α ◦ φ on S and α = 0 on S ∁ by the definition of α and condition on a .For part b(ii), combining the first part with a (0) = 0 , Theorems 2–1 and φ an involution, we obtain Z E F ( ξ ) G ◦ φ ( ξ ) α ( ξ ) µ (d ξ ) = Z S F ( ξ ) G ◦ φ ( ξ ) α ( ξ ) µ S (d ξ )= Z S F ( ξ ) G ◦ φ ( ξ ) r ( ξ ) α ◦ φ ( ξ ) µ S (d ξ )= Z S F ( ξ ) G ◦ φ ( ξ ) α ◦ φ ( ξ ) µ φS (d ξ )= Z S F ◦ φ ( ξ ) G ( ξ ) α ( ξ ) µ S (d ξ )= Z E F ◦ φ ( ξ ) G ( ξ ) α ( ξ ) µ (d ξ ) . For the part b(iii), we define the sub-Markov kernels T ( ξ, A ) = α ( ξ ) A ( φ ( ξ )) , ξ ∈ E, A ∈ E , and R ( ξ, A ) = { − α ( ξ ) } A ( ξ ) , ξ ∈ E, A ∈ E , so that Π = T + R . First we observe that Z F ( ξ ) G ( ξ ′ ) µ (d ξ ) R ( ξ, d ξ ′ ) = Z F ( ξ ) G ( ξ ) { − α ( ξ ) } µ (d ξ ) = Z G ( ξ ) F ( ξ ′ ) µ (d ξ ) R ( ξ, d ξ ′ ) . Z F ( ξ ) G ( ξ ′ ) µ (d ξ ) T ( ξ, d ξ ′ ) = Z F ( ξ ) G ◦ φ ( ξ ) α ( ξ ) µ (d ξ )= Z F ◦ φ ( ξ ) G ( ξ ) α ( ξ ) µ (d ξ )= Z G ( ξ ) F ( ξ ′ ) µ (d ξ ) T ( ξ, d ξ ′ ) . Proof of Proposition 1. Let f, g : Z → [0 , be measurable and let F, G : Ξ → [0 , such that F ( ξ ) = f ( ξ ) and G ( ξ ) = g ( ξ ) . Using reversibility of Π , we find Z f ( ξ ) g ( ξ ′ ) π (d ξ ) P ( ξ , d ξ ′ ) = Z f ( ξ ) g ( ξ ′ ) π (d ξ ) µ ξ (d ξ − )Π( ξ ; d ξ ′ )= Z F ( ξ ) G ( ξ ′ ) µ (d ξ )Π( ξ ; d ξ ′ )= Z G ( ξ ) F ( ξ ′ ) µ (d ξ )Π( ξ ; d ξ ′ )= Z g ( ξ ) f ( ξ ′ ) π (d ξ ) P ( ξ , d ξ ′ ) . Proof of Lemma 1. Since g is an involution and g ( Y ) ⊆ Y , we have λ Y = λ gY as explained in Remark 8.Let ψ be a non-negative function and λ (d x, d y ) = λ X (d x ) λ Y (d y ) . Then we find Z ψ ( x, y ) λ (d x, d y ) = Z ψ ( x, y ) λ Y (d y ) λ X (d x )= Z ψ ( x, y ) λ gY (d y ) λ X (d x )= Z ψ ( x, g ( y )) λ Y (d y ) λ X (d x )= Z ψ ( f ( x, y ) , g ( y )) (cid:12)(cid:12) det f ′ y ( x ) (cid:12)(cid:12) λ Y (d y ) λ X (d x )= Z ψ ◦ φ ( x, y ) (cid:12)(cid:12) det f ′ y ( x ) (cid:12)(cid:12) λ (d x, d y ) where f y ( x ) = x f ( x, y ) for each y ∈ Y . Since Z ψ ◦ φ ( x, y ) λ φ (d x, d y ) = Z ψ ( x, y ) λ (d x, d y ) , and for an arbitrary, measurable, non-negative g : E → R we can take ψ = g ◦ φ − to obtain g = ψ ◦ φ ,we obtain that d λ φ / d λ = (cid:12)(cid:12) det f ′ y ( x ) (cid:12)(cid:12) . Proof of Proposition 3. For part (a), the identity ψ − = σ ◦ ψ ◦ σ is verified by observing that, since ψ = σ ◦ φ , σ ◦ ψ ◦ σ = φ ◦ σ and that indeed φ ◦ σ ◦ ψ = ψ ◦ φ ◦ σ = Id . We then note that φ = σ ◦ ψ and so for A ∈ E , λ φ ( A ) = λ ( φ − ( A )) = λ ( φ ( A ) = λ ( σ ◦ ψ ( A )) = λ σ ( ψ ( A )) = λ ( ψ ( A )) = λ ψ − ( A ) . λ σ = λ , for f integrable w.r.t. µ σ , Z f ( ξ ) µ σ (d ξ ) = Z f ◦ σ ( ξ ) µ (d ξ )= Z f ◦ σ ( ξ ) ρ ( ξ ) λ (d ξ )= Z f ◦ σ ( ξ ) ρ ( ξ ) λ σ (d ξ )= Z f ( ξ ) ρ ◦ σ ( ξ ) λ (d ξ ) , and so d µ σ / d λ = ρ ◦ σ . Since µ σ = µ , we have ρ ◦ σ = ρ . Hence, ρ ◦ φ = ρ ◦ σ ◦ ψ = ρ ◦ ψ . We proceedto part (b), and note that Π( ξ, d ξ ′ ) = α ( ξ ) δ φ ( ξ ) (d ξ ′ ) + [1 − α ( ξ )] δ ξ (d ξ ′ ) , from which for ξ ∈ E and f : E → [0 , , with Φ f := f ◦ φ , Π S f = α ( ξ ) · Φ S f + [1 − α ( ξ )] S f and use that S = Id and Φ S f ( ξ ) = φ ( S f ) ( ξ ) = φ ( f ◦ σ ) ( ξ ) = f ◦ σ ◦ φ ( ξ ) . Using the identitiesfrom part (a), we can use the general acceptance ratio for Π from Proposition 2 to obtain, r ( ξ ) = ρ ◦ φρ ( ξ ) d λ φ d λ ( ξ ) = ρ ◦ ψρ ( ξ ) d λ ψ − d λ ( ξ ) , ξ ∈ S. 
For part (c), for f, g ∈ E → [0 , we use that Π S := Π satisfies detailed balance and µ S = µ Z f ( ξ ) g ( ξ ′ ) µ (d ξ ) Π ( ξ, d ξ ′ ) = Z f ( ξ ) S g ( ξ ′ ) µ (d ξ ) Π S ( ξ, d ξ ′ )= Z f ( ξ ) S g ( ξ ′ ) µ (d ξ ′ ) Π S ( ξ ′ , d ξ )= Z f ( ξ ) S g ( ξ ′ ) µ S (d ξ ′ ) Π S ( ξ ′ , d ξ )= Z f ( ξ ) g ( ξ ′ ) µ (d ξ ′ ) S Π S ( ξ ′ , d ξ ) . For part (e) we proceed as above and for ξ ∈ S notice that S (cid:20) ρ ◦ φρ d λ φ d λ (cid:21) ( ξ ) = ρ ◦ φ ◦ σρ ◦ σ ( ξ ) d λ φ d λ ◦ σ ( ξ )= ρ ◦ φ ◦ σρ ( ξ ) d λ φ ◦ σ d λ ( ξ ) , Z f ( ξ ) d λ φ d λ ◦ σ ( ξ ) λ (d ξ ) = Z f ( ξ ) d λ φ d λ ◦ σ ( ξ ) λ σ (d ξ )= Z f ( ξ ) d λ φ d λ ( ξ ) λ (d ξ )= Z f ◦ φ ( ξ ) λ (d ξ )= Z f ◦ φ ( ξ ) λ σ (d ξ )= Z f ◦ φ ◦ σ ( ξ ) λ (d ξ )= Z f ( ξ ) d λ φ ◦ σ d λ λ (d ξ ) , where we have used that λ φ ◦ σ ( A ) = λ (cid:0) σ ◦ φ ( A ) (cid:1) = λ (cid:0) φ ( A ) (cid:1) = λ φ (cid:0) A (cid:1) . Proof of Lemma 2. By the definition of ψ , for measurable A ∈ X × Y ψ − ( A ) = { ( x, y ) : ( x, y + f ( x )) ∈ A. Let λ X and λ Y be, respectively, the Lebesgue measures on R d X and R d Y . Using the translation-invariance of the Lebesgue measure, we obtain that for arbitrary, measurable A , λ ( ψ − ( A )) = Z R dX Z { y :( x,y + f ( x )) ∈ A } λ Y (d y ) λ X (d x )= Z R dX Z { y :( x,y ) ∈ A } λ Y (d y ) λ X (d x )= λ ( A ) , from which we can conclude that λ ψ = λ . Proof of Lemma 3. Let λ denote the Lebesgue measure on R d . By Lemma 2, ψ A and ψ B each preserve λ , and hence ψ preserves λ as a composition of λ -preserving maps. We observe that σ ◦ ψ ( x, v ) = (cid:0) x + ( v + ı ( x )) , − v − ı ( x ) − ı [ x + ( v + ı ( x )] (cid:1) , so that ψ B ◦ σ ◦ ψ ( x, v ) = (cid:0) x + [ v + ı ( x )] , − v − ı ( x ) (cid:1) , and using ( − v ′ ) = − ( v ′ ) , ψ A ◦ ψ B ◦ σ ◦ ψ ( x, v ) = (cid:0) x, − v − ı ( x ) (cid:1) . It follows that ψ ◦ σ ◦ ψ ( x, v ) = ( x, − v ) , and so σ ◦ ψ ◦ σ ◦ ψ ( x, v ) = ( x, v ) , from which we conclude that ψ − = σ ◦ ψ ◦ σ . Proof of Lemma 4. Let n = τ ( Z ) , so s k ( Z ) = 0 for k ∈ J n − K and s n ( Z ) = 1 . From (10) it is sufficientto show that s k ◦ σ l ( Z ) = s k ( Z ) for ( k, l ) ∈ J n K × J n − K = { ≤ l ≤ k ≤ n, l < n − } ∪ { ≤ k < l ≤ n − } =: S ∪ S , to establish the result. From condition (b), for ( k, l ) ∈ S , s k ◦ σ l ( Z ) = s k ( Z ) and in particular s n − ◦ σ l ( Z ) = 0 for l ∈ J n − K while s n ◦ σ l ( Z ) = 1 . From condition (a) and then condition (b), for ( k, l ) ∈ S , s k ◦ σ l ( Z ) ≤ s l ◦ σ l ( Z ) = s l ( Z ) = 0 and so s k ◦ σ l ( Z ) = 0 = s k ( Z ) .47 roof of Proposition 4. From Theorem 2 and the fact that ν ψ = ν , for any f, g : Z → [0 , Z f ( z ) g ( z ′ ) ν (d z ) Q ( z, d z ′ ) = Z f ( z ) g ◦ ψ ( z ) ν (d z )= Z f ◦ ψ − ( z ) g ( z ) ν ψ (d z )= Z f ◦ ψ − ( z ) g ( z ) ν (d z )= Z f ( z ′ ) g ( z ) ν (d z ) Q ∗ ( z, d z ′ ) , from which one can conclude. Proof of Lemma 5. 
Proof of Lemma 5. Let $k \in \mathbb{Z}$; then the probability measure $\Lambda^k$ has finite-dimensional distributions satisfying, for $n \ge |k|$,
\[
\Lambda^k_n(\mathrm{d}Z) = \pi(\mathrm{d}z_k) \prod_{i=k+1}^{n} Q(z_{i-1}, \mathrm{d}z_i) \prod_{i=-n}^{k-1} Q^*(z_{i+1}, \mathrm{d}z_i),
\]
which also guarantees the existence of $\Lambda^k$ by Kolmogorov's extension theorem (Billingsley, 1995). Notice that for $n \ge |k|$ and $Z \in S_k$, since $(\nu, Q, Q^*)$ is a reversible triplet,
\[
\nu(\mathrm{d}z_k) \prod_{i=k+1}^{n} Q(z_{i-1}, \mathrm{d}z_i) \prod_{i=-n}^{k-1} Q^*(z_{i+1}, \mathrm{d}z_i)
= \nu(\mathrm{d}z_{k - \operatorname{sign}(k)}) \prod_{i=k+1-\operatorname{sign}(k)}^{n} Q(z_{i-1}, \mathrm{d}z_i) \prod_{i=-n}^{k-1-\operatorname{sign}(k)} Q^*(z_{i+1}, \mathrm{d}z_i),
\]
implying, since $\pi(\mathrm{d}z) = \varpi(z)\, \nu(\mathrm{d}z)$ and by iterating the identity above,
\begin{align*}
\Lambda^k_n(\mathrm{d}Z) &= \varpi(z_k)\, \nu(\mathrm{d}z_k) \prod_{i=k+1}^{n} Q(z_{i-1}, \mathrm{d}z_i) \prod_{i=-n}^{k-1} Q^*(z_{i+1}, \mathrm{d}z_i) \\
&= \varpi(z_k)\, \nu(\mathrm{d}z_0) \prod_{i=1}^{n} Q(z_{i-1}, \mathrm{d}z_i) \prod_{i=-n}^{-1} Q^*(z_{i+1}, \mathrm{d}z_i) \\
&= \frac{\varpi(z_k)}{\varpi(z_0)}\, \pi(\mathrm{d}z_0) \prod_{i=1}^{n} Q(z_{i-1}, \mathrm{d}z_i) \prod_{i=-n}^{-1} Q^*(z_{i+1}, \mathrm{d}z_i) \\
&= \frac{\varpi(z_k)}{\varpi(z_0)}\, \Lambda^0_n(\mathrm{d}Z),
\end{align*}
from which we conclude by application of Durrett (2019, Theorem 4.3.5), which is a mild generalization of Engelbert and Shiryaev (1980).

Proof of Lemma 6. For $f, g \colon \mathsf{Z} \to [0, \infty]$ we have
\begin{align*}
\int f(z)\, g(z')\, \nu(\mathrm{d}z)\, \Psi(z, \mathrm{d}z')
&= \int f(z)\, g \circ \psi(z)\, \nu(\mathrm{d}z) \\
&= \int f \circ \psi^{-1}(z)\, g(z)\, \nu^{\psi}(\mathrm{d}z) \\
&= \int f(z')\, g(z)\, \nu^{\psi}(\mathrm{d}z)\, \Psi^*(z, \mathrm{d}z').
\end{align*}

Proof of Lemma 7. Part (a) is clear from the definition of $s_n$. To establish parts (b) and (c) we use the decomposition
\begin{align*}
s_n(Z, b) ={}& g_{n-1}(z_{-\ell_n(b)}, \ldots, z_{m_n(b)}) \vee f_{n-1}(z_{-\ell_n(b)}, \ldots, z_{-\ell_n(b)+2^{n-1}-1}) \vee f_{n-1}(z_{-\ell_n(b)+2^{n-1}}, \ldots, z_{m_n(b)}) \\
&\vee g_{n-1}(z_{m_n(b)+1}, \ldots, z_{r_n(b)}) \vee f_{n-1}(z_{m_n(b)+1}, \ldots, z_{m_n(b)+2^{n-1}}) \vee f_{n-1}(z_{m_n(b)+2^{n-1}+1}, \ldots, z_{r_n(b)}).
\end{align*}
It follows that $s_n(Z, b) \ge s_{n-1}(Z, b)$, since if $b_n = 0$ then $\llbracket -\ell_n(b), m_n(b) \rrbracket = \llbracket -\ell_{n-1}(b), r_{n-1}(b) \rrbracket$ and so
\[
s_{n-1}(Z, b) = f_{n-1}(z_{-\ell_n(b)}, \ldots, z_{-\ell_n(b)+2^{n-1}-1}) \vee f_{n-1}(z_{-\ell_n(b)+2^{n-1}}, \ldots, z_{m_n(b)}),
\]
and if $b_n = 1$ then $\llbracket m_n(b)+1, r_n(b) \rrbracket = \llbracket -\ell_{n-1}(b), r_{n-1}(b) \rrbracket$ and so
\[
s_{n-1}(Z, b) = f_{n-1}(z_{m_n(b)+1}, \ldots, z_{m_n(b)+2^{n-1}}) \vee f_{n-1}(z_{m_n(b)+2^{n-1}+1}, \ldots, z_{r_n(b)}).
\]
For the same reasons $s_n(Z, b) \ge g_{n-1}(z_{-\ell_{n-1}(b)}, \ldots, z_{r_{n-1}(b)})$.

Proof of Lemma 8. The uniqueness of $b'_{1:n-1}$ follows from the fact that $\ell$ is uniquely determined by $b_{1:n-1}$ and $b'_{1:n-1}$ is uniquely determined by $\ell + k$. We have $\ell_{n-1}(b') = \ell + k$ and $r_{n-1}(b') = r - k$ by construction. Since $z'_i = z_{k+i}$ for $i \in \mathbb{Z}$, it follows that $z'_{-\ell_{n-1}(b')} = z_{k - \ell - k} = z_{-\ell}$ and $z'_{r_{n-1}(b')} = z_{k + r - k} = z_r$, so that indeed $(z'_{-(\ell+k)}, \ldots, z'_{r-k}) = (z_{-\ell}, \ldots, z_r)$. Since $b'_n = b_n$ and $\ell_{n-1}(b') = \ell_{n-1}(b) + k$, we also have $\ell_n(b') = \ell_{n-1}(b') + b_n 2^{n-1} = \ell_n(b) + k$ and similarly $r_n(b') = r_n(b) - k$, so $(z'_{-\ell_n(b')}, \ldots, z'_{r_n(b')}) = (z_{-\ell_n(b)}, \ldots, z_{r_n(b)})$. Since $\tau(Z, b) = n$, $s_n(Z, b) = 1$ and $s_{n-1}(Z, b) = 0$. Since $s_{n-1}(Z, b)$ and $s_n(Z, b)$ depend only on the values and the order of their inputs, and not on the way they are indexed, we have $s_{n-1}(Z, b) = s_{n-1}(Z', b') = 0$ and $s_n(Z, b) = s_n(Z', b') = 1$. To conclude that $\tau(Z', b') = n$, it remains only to show that $s_i(Z', b') = 0$ for all $i \in \llbracket 1, n-1 \rrbracket$, but this is implied by Lemma 7-(b).
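The finite-dimensional construction in the proof of Lemma 5 is straightforward to simulate. The sketch below, under our own toy choices, draws a window $(z_{-n}, \ldots, z_n)$ from $\Lambda^0_n$ for a stationary AR(1) chain, a reversible example for which $Q^* = Q$.

```python
import numpy as np

def sample_two_sided(n, a=0.9, seed=2):
    """Draw (z_{-n}, ..., z_n) ~ Lambda^0_n for a toy reversible triplet (nu, Q, Q*).

    Here Q is the AR(1) kernel z' = a z + sqrt(1 - a^2) e with e ~ N(0, 1),
    whose invariant distribution is pi = N(0, 1); by reversibility Q* = Q.
    """
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - a ** 2)
    step = lambda z: a * z + s * rng.normal()
    Z = {0: rng.normal()}                  # z_0 ~ pi
    for i in range(1, n + 1):              # forward:  z_i ~ Q(z_{i-1}, .)
        Z[i] = step(Z[i - 1])
    for i in range(-1, -n - 1, -1):        # backward: z_i ~ Q*(z_{i+1}, .)
        Z[i] = step(Z[i + 1])
    return Z

Z = sample_two_sided(5)                    # keys -5, ..., 5
```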
Proof of Proposition 5. Let $\xi = (k, Z_k) \in S_k \cap \{(k, Z_k) : Z_k \in \mathsf{Z}_k,\ \alpha_k(Z_k) \wedge \alpha_k \circ \phi_k(Z_k) > 0\}$. Then
\[
r(\xi) = \frac{\mathrm{d}\eta^{\phi_k}_{k, S_k}}{\mathrm{d}\eta_{k, S_k}}(Z_k)\, \frac{\beta_k \circ \phi_k}{\beta_k}(Z_k)\, \frac{\alpha_k \circ \phi_k}{\alpha_k}(Z_k) = r_k(Z_k)\, \frac{\alpha_k \circ \phi_k}{\alpha_k}(Z_k) = 1,
\]
where we have used that $\alpha_k = r_k \cdot \alpha_k \circ \phi_k$ on $S_k$ by part b(i) of Theorem 3, applied with $\mu = \eta_k \cdot \beta_k$ and $\phi = \phi_k$.

B Measure theory tools

B.1 Standard results

Theorem 4 (Change of variables formula for Lebesgue measure). Let $\phi$ be a continuously differentiable, invertible function. If $f \colon \mathbb{R}^d \to \mathbb{R}$ is integrable then, with $\lambda$ the Lebesgue measure,
\[
\int_{\mathbb{R}^d} f \circ \phi(\xi)\, |\det \phi'(\xi)|\, \lambda(\mathrm{d}\xi) = \int_{\mathbb{R}^d} f(\xi)\, \lambda(\mathrm{d}\xi),
\]
where $\phi'(\xi)$ is the Jacobian matrix with entries $\phi'(\xi)_{ij} = \partial \phi_i / \partial \xi_j(\xi)$. This is covered by Billingsley (1995, Theorem 17.2).

Example 31 (Jacobian of a linear mapping). Consider the Lebesgue measure on $\bigl(\mathbb{R}, \mathcal{B}(\mathbb{R})\bigr)$ such that for any $a, b \in \mathbb{R}$, $a < b$, $\lambda\bigl((a, b]\bigr) = b - a$, and consider the scenario $\phi(\xi) = \alpha \xi$ where, without loss of generality, $\alpha > 0$. Recalling the definition $\lambda^{\phi^{-1}}(A) := \lambda\bigl(\phi(A)\bigr)$ for any $A \in \mathcal{B}(\mathbb{R})$, we have for $a, b \in \mathbb{R}$, $a < b$,
\[
\lambda^{\phi^{-1}}\bigl((a, b]\bigr) = \lambda\bigl((\alpha a, \alpha b]\bigr) = \alpha (b - a) = \alpha \lambda\bigl((a, b]\bigr),
\]
that is, $\lambda^{\phi^{-1}} = \alpha \lambda$ and $\lambda \equiv \lambda^{\phi^{-1}}$. We deduce on the one hand that
\[
\int f(\xi)\, \lambda(\mathrm{d}\xi) = \int f \circ \phi(\xi)\, \lambda^{\phi^{-1}}(\mathrm{d}\xi) = \int f \circ \phi(\xi)\, \alpha\, \lambda(\mathrm{d}\xi) = \int f \circ \phi(\xi)\, \det \phi'(\xi)\, \lambda(\mathrm{d}\xi),
\]
and using the Radon-Nikodym theorem (Theorem 1) we also have
\[
\int f(\xi)\, \lambda(\mathrm{d}\xi) = \int f \circ \phi(\xi)\, \lambda^{\phi^{-1}}(\mathrm{d}\xi) = \int f \circ \phi(\xi)\, \frac{\mathrm{d}\lambda^{\phi^{-1}}}{\mathrm{d}\lambda}(\xi)\, \lambda(\mathrm{d}\xi),
\]
and we deduce that, $\lambda$-almost everywhere,
\[
\frac{\mathrm{d}\lambda^{\phi^{-1}}}{\mathrm{d}\lambda}(\xi) = |\det \phi'(\xi)|.
\]
This result can be generalised to the multivariate scenario, but also to nonlinear invertible and smooth mappings $\phi \colon \mathbb{R}^d \to \mathbb{R}^d$ by local linearisation.

B.2 Proofs

Proof of Proposition 2. First observe that $\mathrm{d}\mu^{\phi}/\mathrm{d}\lambda^{\phi} = \rho \circ \phi$: for any $A \in \mathscr{E}$,
\[
\int \mathbb{I}_A(\xi)\, \rho \circ \phi(\xi)\, \lambda^{\phi}(\mathrm{d}\xi) = \int \mathbb{I}_A \circ \phi(\xi)\, \rho(\xi)\, \lambda(\mathrm{d}\xi) = \int \mathbb{I}_A \circ \phi(\xi)\, \mu(\mathrm{d}\xi) = \int \mathbb{I}_A(\xi)\, \mu^{\phi}(\mathrm{d}\xi).
\]
Then we find, for $A \in \mathscr{E}$, $A \subseteq S$,
\begin{align*}
\int \mathbb{I}_A(\xi)\, \frac{\rho \circ \phi}{\rho}(\xi)\, \frac{\mathrm{d}\lambda^{\phi}}{\mathrm{d}\lambda}(\xi)\, \mu_S(\mathrm{d}\xi)
&= \int \mathbb{I}_A(\xi)\, \frac{\rho \circ \phi}{\rho}(\xi)\, \frac{\mathrm{d}\lambda^{\phi}}{\mathrm{d}\lambda}(\xi)\, \mu(\mathrm{d}\xi) \\
&= \int \mathbb{I}_A(\xi)\, \frac{\rho \circ \phi}{\rho}(\xi)\, \frac{\mathrm{d}\lambda^{\phi}}{\mathrm{d}\lambda}(\xi)\, \rho(\xi)\, \lambda(\mathrm{d}\xi) \\
&= \int \mathbb{I}_A(\xi)\, \rho \circ \phi(\xi)\, \lambda^{\phi}(\mathrm{d}\xi) \\
&= \int \mathbb{I}_A(\xi)\, \mu^{\phi}_S(\mathrm{d}\xi),
\end{align*}
so that indeed $\mathrm{d}\mu^{\phi}_S / \mathrm{d}\mu_S = \frac{\rho \circ \phi}{\rho} \cdot \frac{\mathrm{d}\lambda^{\phi}}{\mathrm{d}\lambda}$. The proof that $\mu$ and $\mu^{\phi}$ are mutually singular on $S^{\complement}$ follows the same arguments as in the proof of Theorem 3.
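Example 31 is easy to confirm numerically. The following sketch, with our own toy choices of $\alpha$ and $f$, checks the identity $\int f \circ \phi\, |\det \phi'|\, \mathrm{d}\lambda = \int f\, \mathrm{d}\lambda$ on a fine grid.

```python
import numpy as np

# numerical check of Example 31 with phi(xi) = alpha * xi on R, alpha = 2.5,
# and an integrable test function f (a Gaussian); both choices are arbitrary
alpha = 2.5
xi = np.linspace(-20.0, 20.0, 200001)
dx = xi[1] - xi[0]
f = lambda t: np.exp(-0.5 * t ** 2)

lhs = np.sum(f(alpha * xi) * alpha) * dx   # int f(phi(xi)) |det phi'(xi)| d(xi)
rhs = np.sum(f(xi)) * dx                   # int f(xi) d(xi)
assert abs(lhs - rhs) < 1e-8               # both approximate sqrt(2 * pi)
```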
C X-tra chance proof

This is a proof of the claims in Remark 27. Fix $u \in \mathbb{R}_+$; we show the result by induction. First we have $\beta_1(z) = 1$, $\beta_1 \circ \phi(z) = 1$ and, by considering $z \in S(\varpi_u, \varpi^{\phi}_u)$ and $z \in S^{\complement}(\varpi_u, \varpi^{\phi}_u)$ separately, and Theorem 6, we obtain
\[
r_1(z) = \mathbb{I}\{u \le \varpi(z)\}\, \mathbb{I}\{u \le \varpi \circ \phi(z)\} = \mathbb{I}\{u \le \varpi(z) \wedge \varpi \circ \phi(z)\},
\]
and therefore, with $a(r) = 1 \wedge r$, we deduce
\[
\beta_2(z) = \beta_1(z)\bigl[1 - \mathbb{I}\{u \le \varpi(z) \wedge \varpi \circ \phi(z)\}\bigr] = \mathbb{I}\{\varpi(z) \wedge \varpi \circ \phi(z) < u\}.
\]
Assume that for some $k \in \llbracket 2, n \rrbracket$ and any $z \in \mathsf{Z}$,
\[
\beta_k(z) = \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u\Bigr\}.
\]
From the assumption $\varpi \circ \phi^i \circ \phi^k = \varpi \circ \phi^{k-i}$ this implies
\[
\beta_k \circ \phi^k(z) = \mathbb{I}\Bigl\{\varpi \circ \phi^k(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^{k-i}(z) < u\Bigr\} = \mathbb{I}\Bigl\{\varpi \circ \phi^k(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u\Bigr\}.
\]
Therefore, proceeding as for $r_1(z)$ above and taking advantage of the fact that $\beta_k(z), \beta_k \circ \phi^k(z) \in \{0, 1\}$, we obtain
\begin{align*}
r_k(z) &= \beta_k(z)\, \beta_k \circ \phi^k(z)\, \mathbb{I}\{u \le \varpi(z) \wedge \varpi \circ \phi^k(z)\} \\
&= \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u \le \varpi(z)\Bigr\}\, \mathbb{I}\Bigl\{\varpi \circ \phi^k(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u \le \varpi \circ \phi^k(z)\Bigr\} \\
&= \mathbb{I}\Bigl\{\bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u \le \varpi(z)\Bigr\}\, \mathbb{I}\Bigl\{\bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u \le \varpi \circ \phi^k(z)\Bigr\} \\
&= \mathbb{I}\Bigl\{\bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u \le \varpi(z) \wedge \varpi \circ \phi^k(z)\Bigr\}.
\end{align*}
When $\bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < \varpi(z) \wedge \varpi \circ \phi^k(z)$,
\[
1 - r_k(z) = \mathbb{I}\Bigl\{u \le \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z)\Bigr\} + \mathbb{I}\{\varpi(z) \wedge \varpi \circ \phi^k(z) < u\},
\]
and therefore
\begin{align*}
\beta_{k+1}(z) &= \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u\Bigr\} \Bigl[\mathbb{I}\{\varpi(z) \wedge \varpi \circ \phi^k(z) < u\} + \mathbb{I}\Bigl\{u \le \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z)\Bigr\}\Bigr] \\
&= \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k} \varpi \circ \phi^i(z) < u\Bigr\} + \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u \le \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z)\Bigr\} \\
&= \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k} \varpi \circ \phi^i(z) < u\Bigr\},
\end{align*}
where we have used that $\mathbb{I}\bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u\bigr\} = \mathbb{I}\bigl\{\bigvee_{i=1}^{k-1} \bigl(\varpi(z) \wedge \varpi \circ \phi^i(z)\bigr) < u\bigr\}$. When $\bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) \ge \varpi(z) \wedge \varpi \circ \phi^k(z)$, using the same argument,
\[
\beta_{k+1}(z) = \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k-1} \varpi \circ \phi^i(z) < u\Bigr\} = \mathbb{I}\Bigl\{\bigvee_{i=1}^{k} \bigl(\varpi(z) \wedge \varpi \circ \phi^i(z)\bigr) < u\Bigr\} = \mathbb{I}\Bigl\{\varpi(z) \wedge \bigvee_{i=1}^{k} \varpi \circ \phi^i(z) < u\Bigr\},
\]
which completes the proof.

D NUTS motivation

The criterion consists of stopping when $\|x_r - x_\ell\|^2$ reaches a stationary point, in the hope that it is a maximum. This requires the computation of a differential, that is, the first-order linear approximation of the variations of $\|x_r - x_\ell\|^2$ when $x_\ell$ (resp. $x_r$) is perturbed linearly, $x_\ell + \epsilon v_\ell$ (resp. $x_r + \epsilon v_r$). This leads to
\begin{align*}
\|x_r - (x_\ell + \epsilon v_\ell)\|^2 - \|x_r - x_\ell\|^2 &= -2\epsilon (x_r - x_\ell)^{\top} v_\ell + \epsilon^2 \|v_\ell\|^2, \\
\|x_r + \epsilon v_r - x_\ell\|^2 - \|x_r - x_\ell\|^2 &= 2\epsilon (x_r - x_\ell)^{\top} v_r + \epsilon^2 \|v_r\|^2.
\end{align*}
As $\epsilon \downarrow 0$ the dominant, linear, term has a negative coefficient if and only if
\[
\lim_{\epsilon \downarrow 0} \epsilon^{-1}\bigl\{\|x_r - (x_\ell + \epsilon v_\ell)\|^2 - \|x_r - x_\ell\|^2\bigr\} < 0 \iff (x_r - x_\ell)^{\top} v_\ell > 0,
\]
and
\[
\lim_{\epsilon \downarrow 0} \epsilon^{-1}\bigl\{\|x_r + \epsilon v_r - x_\ell\|^2 - \|x_r - x_\ell\|^2\bigr\} < 0 \iff (x_r - x_\ell)^{\top} v_r < 0.
\]
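In code, the resulting stopping rule is a pair of inner products. The sketch below is ours, not the paper's; it flags that the squared diameter $\|x_r - x_\ell\|^2$ is locally decreasing in at least one direction, with the sign conventions of the two limits above.

```python
import numpy as np

def diameter_decreasing(x_l, v_l, x_r, v_r):
    """True when ||x_r - x_l||^2 is locally decreasing in at least one direction.

    From the limits above: perturbing x_l along v_l decreases the squared
    diameter iff (x_r - x_l) . v_l > 0, and perturbing x_r along v_r
    decreases it iff (x_r - x_l) . v_r < 0.
    """
    d = x_r - x_l
    return np.dot(d, v_l) > 0 or np.dot(d, v_r) < 0
```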
E Event chain algorithms

We briefly describe standard event chain processes for soft potentials and pairwise interactions. Define $x = (x_1, x_2, \ldots, x_m) \in \mathsf{X}^m$ with $\mathsf{X} = \mathbb{T}^d := [-1/2, 1/2)^d \simeq \mathbb{R}^d / \mathbb{Z}^d$ and $v \in \mathsf{V} \subset \mathbb{R}^d$. The target distribution of interest has density
\[
\gamma(x, v, i) = \gamma(x)\, \kappa(v)\, \mathbb{I}\{i \in \llbracket m \rrbracket\} \propto \exp\bigl(-U(x)\bigr)\, \kappa(v)\, \mathbb{I}\{i \in \llbracket m \rrbracket\},
\]
where $\gamma$ has density with respect to the measure induced by the Lebesgue measure on $[-1/2, 1/2)^d$, see Folland (1999, Chapter 6), and $\kappa$ is the density with respect to the Lebesgue measure on $\mathsf{V} = \mathbb{R}^d$ or the Hausdorff measure on $\mathsf{V} = \mathbb{S}^{d-1}$. It is further assumed that $\kappa(-v) = \kappa(v)$ for $v \in \mathsf{V}$, and we focus on the scenario involving pairwise interactions,
\[
U(x) := \sum_{1 \le i < j \le m} V(x_i - x_j).
\]

Lemma 9. Let $V \colon \mathsf{X} = \mathbb{T}^d \to \mathbb{R}_+$ be continuously differentiable and such that $V(x_*) = V(-x_*)$ for all $x_* \in \mathsf{X}$. Then for $i, j \in \llbracket m \rrbracket$, $i \neq j$,
(a) $\lambda(x, v, i) - S\lambda(x, v, i) = \bigl\langle \nabla_x U(x), e_i \otimes v \bigr\rangle$,
(b) $S\lambda_j(x, v, i) = \lambda_i(x, v, j)$.

Proof. The first relation follows, for $i \in \llbracket m \rrbracket$, from
\[
\sum_{j \neq i} \bigl[\lambda_j(x, v, i) - S\lambda_j(x, v, i)\bigr] = \sum_{j \neq i} \bigl\langle \nabla V(x_i - x_j), v \bigr\rangle = \Bigl\langle \sum_{j \neq i} \nabla V(x_i - x_j), v \Bigr\rangle = \bigl\langle \nabla_x U(x), e_i \otimes v \bigr\rangle.
\]
The second property follows from the assumption $V(x_*) = V(-x_*) = V \circ s(x_*)$, where $s(x_*) = -x_*$. Indeed, in this scenario the chain rule leads to $\nabla V(x_*) = \nabla (V \circ s)(x_*) = -(\nabla V) \circ s(x_*) = -\nabla V(-x_*)$ and consequently, for $i, j \in \llbracket m \rrbracket$, $i \neq j$,
\[
S\lambda_j(x, v, i) = \bigl\langle -\nabla V(x_i - x_j), v \bigr\rangle_+ = \bigl\langle \nabla V(x_j - x_i), v \bigr\rangle_+ = \lambda_i(x, v, j).
\]
We now prove $\mu(Lf) = 0$. We can clearly ignore the refreshment component of the generator. An integration by parts and Lemma 9 establish that
\begin{align*}
\int \bigl\langle \nabla_x f(x, v, i), e_i \otimes v \bigr\rangle\, \mu\bigl(\mathrm{d}(x, v, i)\bigr)
&= \int \bigl\langle \nabla_i f(x, v, i), v \bigr\rangle\, \mu\bigl(\mathrm{d}(x, v, i)\bigr) \\
&= \int f(x, v, i)\, \bigl\langle \nabla_i U(x), v \bigr\rangle\, \mu\bigl(\mathrm{d}(x, v, i)\bigr) \\
&= \int f(x, v, i)\, \bigl[\lambda(x, v, i) - S\lambda(x, v, i)\bigr]\, \mu\bigl(\mathrm{d}(x, v, i)\bigr) \\
&= \int f(x, v, i) \sum_{j \neq i} \bigl[\lambda_j(x, v, i) - S\lambda_j(x, v, i)\bigr]\, \mu\bigl(\mathrm{d}(x, v, i)\bigr) \\
&= \int \sum_{j \neq i} \lambda_j(x, v, i)\, \bigl[f(x, v, i) - f(x, v, j)\bigr]\, \mu\bigl(\mathrm{d}(x, v, i)\bigr),
\end{align*}
while
\[
\int \lambda(x, v, i) \cdot \bigl[Rf(x, v, i) - f(x, v, i)\bigr]\, \mu\bigl(\mathrm{d}(x, v, i)\bigr) = \int \sum_{j \neq i} \lambda_j(x, v, i)\, \bigl[f(x, v, j) - f(x, v, i)\bigr]\, \mu\bigl(\mathrm{d}(x, v, i)\bigr).
\]
Summing the two identities yields $\mu(Lf) = 0$, as claimed.
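The event rates and the jump kernel $R$ appearing in the last display are directly computable. The sketch below is ours, using a toy pair potential $V(r) = \|r\|^2/2$ (even and smooth, ignoring periodic wrapping on the torus); all names are illustrative.

```python
import numpy as np

def rates(x, v, i, grad_V):
    """lambda_j(x, v, i) = <grad V(x_i - x_j), v>_+ for j != i, as in Lemma 9."""
    m = x.shape[0]
    lam = np.zeros(m)
    for j in range(m):
        if j != i:
            lam[j] = max(float(np.dot(grad_V(x[i] - x[j]), v)), 0.0)
    return lam

# jump kernel R: at an event, the active index moves from i to j w.p. lambda_j / lambda
rng = np.random.default_rng(3)
grad_V = lambda r: r                   # V(r) = ||r||^2 / 2, so grad V(r) = r
x = rng.normal(size=(4, 2))            # m = 4 particles in dimension d = 2
v = rng.normal(size=2)
i = 0
lam = rates(x, v, i, grad_V)
if lam.sum() > 0.0:
    j = rng.choice(len(lam), p=lam / lam.sum())
```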