Maximal Couplings of the Metropolis–Hastings Algorithm
John O’Leary ∗ Guanyang Wang ∗ Pierre E. Jacob
Harvard University          Rutgers University          Harvard University
[email protected]          [email protected]          [email protected]
Abstract
Couplings play a central role in the analysis of Markov chain Monte Carlo algorithms and appear increasingly often in the algorithms themselves, e.g. in convergence diagnostics, parallelization, and variance reduction techniques. Existing couplings of the Metropolis–Hastings algorithm handle the proposal and acceptance steps separately and fall short of the upper bound on one-step meeting probabilities given by the coupling inequality. This paper introduces maximal couplings which achieve this bound while retaining the practical advantages of current methods. We consider the properties of these couplings and examine their behavior on a selection of numerical examples.
1 Introduction

Markov chain Monte Carlo (MCMC) methods offer a powerful framework for approximating integrals over a wide range of probability distributions (Brooks et al., 2011). The Metropolis–Hastings (MH) family of algorithms has proved to be especially popular, from its original forms (Metropolis et al., 1953; Hastings, 1970) to modern incarnations such as Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 1993, 2011), the Metropolis-adjusted Langevin algorithm (Roberts and Tweedie, 1996), and particle MCMC (Andrieu et al., 2010). In many settings MH methods present an attractive mix of effectiveness, flexibility, and transparency.

Couplings of MCMC transition kernels have long played a role in the analysis and diagnosis of their convergence (Rosenthal, 1995; Johnson, 1996, 1998; Jerrum, 1998; Rosenthal, 2002; Biswas et al., 2019), as a way of obtaining perfect samples or unbiased estimators (Propp and Wilson, 1996; Neal, 1999; Glynn and Rhee, 2014; Heng and Jacob, 2019; Jacob et al., 2020).

∗ The first two authors contributed equally to this work.
We write x ∧ y = min(x, y), x ∨ y = max(x, y), Unif for the uniform distribution on [0, 1), and Bern(α) for the Bernoulli distribution on {0, 1} with P(Bern(α) = 1) = α.

Let P be a Markov transition kernel with stationary distribution π on (X, F), a Polish space equipped with the standard Borel σ-algebra. For x ∈ X and A ∈ F, P(x, A) denotes the probability of a transition from x to A. We focus on MH-like kernels P, characterized by the property that we can obtain X ∼ P(x, ·) by drawing a proposal x′ ∼ Q(x, ·) and an acceptance indicator B ∼ Bern(a(x, x′)) and setting X = Bx′ + (1 − B)x. We assume that for all x ∈ X, Q(x, ·) has density q(x, ·) with respect to a common base measure on (X, F). We also assume the proposal distribution is non-atomic, so that Q(x, {y}) = 0 for all x, y ∈ X. The acceptance rate under MH will be a(x, x′) = 1 ∧ [π(x′) q(x′, x)] / [π(x) q(x, x′)], and we allow for alternatives such as Barker's algorithm (Barker, 1965). For x′ ≠ x we define f(x, x′) := q(x, x′) a(x, x′) and r(x) := 1 − ∫ f(x, x′) dx′, so that P(x, ·) has density f(x, x′) except for an atom where P(x, {x}) = r(x). See Figure 1 for an illustration of a pair of proposal and transition distributions.

Figure 1: Proposal densities q and MH transition densities f for a Gaussian target and Gaussian proposals at two states x and y. P(x, ·) and P(y, ·) also contain point masses with weights r(x) and r(y); here r(x) ≈ 0.69.

A probability distribution γ on X × X is a coupling of distributions µ and ν on (X, F) if γ(A × X) = µ(A) and γ(X × A) = ν(A) for any A ∈ F. Let Γ(µ, ν) be the set of all couplings of µ and ν. We say that a joint kernel ¯P is a coupling of P with itself and write ¯P ∈ Γ(P, P) if ¯P((x, y), ·) ∈ Γ(P(x, ·), P(y, ·)) for any x, y ∈ X. Similar definitions apply to couplings ¯Q of proposal distributions and couplings ¯B of acceptance indicators.

The coupling inequality (Levin et al., 2017, Proposition 4.7) states that the meeting probability satisfies P_{(X,Y)∼γ}(X = Y) ≤ 1 − ‖µ − ν‖_TV for any coupling γ ∈ Γ(µ, ν). Here ‖µ − ν‖_TV = sup_{A∈F} |µ(A) − ν(A)| is the total variation distance. A coupling that achieves this bound is said to be maximal, and we write Γ_max(µ, ν) ⊂ Γ(µ, ν) for the set of maximal couplings of µ and ν.

Figure 2: Draws from the coupling ¯P_SQ((x, y), ·) using ¯Q_MI and the same parameters as in Figure 1. The grey lines indicate X = x, Y = y, and X = Y, while the histograms show the marginal distributions of X and Y.

2 The Coupling ¯P_SQ

We begin by describing the state-of-the-art coupling of MH transition kernels, first introduced in Johnson (1998). Recall that draws from an MH-like kernel P(x, ·) can be obtained via a proposal x′ ∼ Q(x, ·) which is accepted for X with probability a(x, x′). Thus a simple way to couple P with itself is to draw coupled proposals (x′, y′) ∼ ¯Q((x, y), ·) and accept or reject these according to coupled indicators (B_x, B_y) ∼ ¯B((x, y), (x′, y′)), where ¯Q ∈ Γ(Q, Q) and ¯B((x, y), (x′, y′)) ∈ Γ(Bern(a(x, x′)), Bern(a(y, y′))). We refer to this coupling as ¯P_SQ and summarize it in Algorithm 1.

For the chains to meet in finite time we need P(X = Y | x, y) > 0 for at least some state pairs (x, y), which in turn requires P(x′ = y′ | x, y) > 0. To obtain this we take the joint proposal distribution ¯Q to be a maximal coupling of Q with itself. We can draw from such a coupling by sampling x′ ∼ Q(x, ·), using this as a rejection sampling proposal for Q(y, ·), and drawing y′ in a specified way if this x′ = y′ proposal is rejected. This approach achieves the coupling inequality upper bound P(x′ = y′ | x, y) = ∫ q(x, z) ∧ q(y, z) dz.
Algorithm 1 Draw (X, Y) ∼ ¯P_SQ((x, y), ·)
1. Draw (x′, y′) ∼ ¯Q((x, y), ·)
2. Draw (B_x, B_y) ∼ ¯B((x, y), (x′, y′))
3. Set X = B_x x′ + (1 − B_x) x and Y = B_y y′ + (1 − B_y) y
4. Return (X, Y)
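As a concrete illustration of Algorithm 1, the following Python sketch draws once from ¯P_SQ. It is a minimal sketch under stated assumptions rather than a reference implementation: `Qbar`, `accept_prob`, and the generator `rng` are hypothetical user-supplied inputs, and the acceptance indicators are coupled through a common uniform, the maximal coupling ¯B discussed after Algorithm 2 below.

```python
import numpy as np

def sample_P_SQ(x, y, Qbar, accept_prob, rng):
    """One draw (X, Y) ~ P_SQ((x, y), .), following Algorithm 1.

    Qbar(x, y, rng) -> (x', y'): a draw from a coupling of Q(x, .) and Q(y, .).
    accept_prob(x, xp) -> a(x, x'): the MH acceptance probability.
    """
    xp, yp = Qbar(x, y, rng)              # Step 1: coupled proposals
    u = rng.uniform()                     # common uniform: maximal coupling of the indicators
    b_x = u <= accept_prob(x, xp)         # Step 2: coupled accept/reject decisions
    b_y = u <= accept_prob(y, yp)
    X = xp if b_x else x                  # Step 3
    Y = yp if b_y else y
    return X, Y                           # Step 4
```

Here `Qbar` could be, for instance, the ¯Q_MI sampler sketched after Algorithm 2 below.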
Algorithm 2 Draw (x′, y′) ∼ ¯Q_MI((x, y), ·)
1. Draw x′ ∼ Q(x, ·) and U ∼ Unif
2. If U q(x, x′) ≤ q(y, x′), set y′ = x′
3. Else
   (a) Draw ỹ ∼ Q(y, ·) and V ∼ Unif
   (b) If V q(y, ỹ) > q(x, ỹ), set y′ = ỹ
   (c) Else go to 3(a)
4. Return (x′, y′)

A simple example of this method is the maximal coupling with independent residuals ¯Q_MI, introduced in Vaserstein (1969) and called the γ-coupling in Lindvall (1992, chap. 1.5). It has the property that x′ and y′ are independent when x′ ≠ y′. Algorithm 2 describes how to draw from this coupling, and Figure 2 illustrates a set of draws (X, Y) ∼ ¯P_SQ((x, y), ·) based on proposals from ¯Q_MI.

When X = R^d and Q is spherically symmetric, the maximal coupling with reflection residuals, ¯Q_MR, is often a better alternative. This coupling was introduced in the context of Hamiltonian and Langevin methods (Bou-Rabee et al., 2020; Eberle et al., 2019) and has its origins in the analysis of continuous-time processes (Lindvall and Rogers, 1986; Eberle, 2011). ¯Q_MR is identical to ¯Q_MI for x′ = y′. When x′ ≠ y′ the coupling ¯Q_MR sets y′ = T_xy(x′), where T_xy(x′) = y + (I − 2ee⊤)(x′ − x) and e = (y − x)/‖y − x‖. We define T_yx(y′) similarly. Note that the transformation z → (I − 2ee⊤)z reflects the e component of z while leaving the e⊥ component fixed.

A standard choice for the acceptance indicators (B_x, B_y) ∼ ¯B((x, y), (x′, y′)) is the unique maximal coupling of Bern(a(x, x′)) and Bern(a(y, y′)), which can be realized by drawing U ∼ Unif and setting B_x = 1(U ≤ a(x, x′)) and B_y = 1(U ≤ a(y, y′)). Among couplings of these distributions, this one yields the maximal probability P(B_x = B_y = 1 | x′, y′) = a(x, x′) ∧ a(y, y′). Other objectives, such as the minimization of E[‖X − Y‖ | x′, y′], are also possible.

The coupling ¯P_SQ obtained when ¯Q is a maximal coupling of proposal distributions and ¯B is the maximal coupling of acceptance indicators can yield a relatively high chance of X = Y given the current state pair (x, y), but it typically falls short of the theoretical upper bound. Under ¯P_SQ, the probability of X = Y is ∫ (q(x, z) ∧ q(y, z)) (a(x, z) ∧ a(y, z)) dz. However, Lemma 1 implies that the coupling inequality bound is ∫ (q(x, z) a(x, z)) ∧ (q(y, z) a(y, z)) dz, which is always an equal or larger quantity. Figure 3 illustrates the gap between the meeting probabilities under ¯P_SQ and under any maximal coupling ¯P. Note that these probabilities will coincide when either q(x, z) or a(x, z) does not depend on x, e.g. for the independence sampler.

Figure 3: Meeting densities based on the same parameters as in Figure 1. The green line shows the density q(x, z) ∧ q(y, z) of x′ = y′ = z under a maximal coupling ¯Q. The purple line shows the density (q(x, z) ∧ q(y, z))(a(x, z) ∧ a(y, z)) of X = Y = z under ¯P_SQ, assuming the use of a maximal coupling ¯B. Finally, the orange line shows the density f(x, z) ∧ f(y, z) of X = Y = z under a maximal coupling ¯P. The inequalities suggested here hold in general.
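The sketches below illustrate these proposal couplings; both are illustrative sketches rather than the authors' code. The first implements Algorithm 2 for generic proposal densities, with `q_density` and `q_sample` as assumed inputs; the second implements the reflection construction just described for the spherically symmetric case Q(z, ·) = N(z, σ²I_d).

```python
import numpy as np

def sample_Qbar_MI(x, y, q_density, q_sample, rng):
    """Maximal coupling of Q(x, .) and Q(y, .) with independent residuals (Algorithm 2)."""
    xp = q_sample(x, rng)                                  # Step 1
    if rng.uniform() * q_density(x, xp) <= q_density(y, xp):
        return xp, xp                                      # Step 2: proposals meet
    while True:                                            # Step 3: sample the y residual
        yp = q_sample(y, rng)
        if rng.uniform() * q_density(y, yp) > q_density(x, yp):
            return xp, yp

def sample_Qbar_MR(x, y, sigma, rng):
    """Maximal coupling with reflection residuals for Q(z, .) = N(z, sigma^2 I_d).

    Assumes x and y are distinct numpy arrays of the same length.
    """
    e = (y - x) / np.linalg.norm(y - x)
    z = rng.normal(size=len(x))
    xp = x + sigma * z
    # Accept x' for y with probability q(y, x') / q(x, x'), computed on the log scale.
    log_ratio = 0.5 * np.sum(z ** 2) - 0.5 * np.sum(((xp - y) / sigma) ** 2)
    if np.log(rng.uniform()) <= log_ratio:
        return xp, xp                                      # proposals meet
    yp = y + sigma * (z - 2.0 * np.dot(e, z) * e)          # reflect: y' = y + (I - 2ee')(x' - x)
    return xp, yp
```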
Lemma 1. Let P be an MH-like transition kernel as defined above. Then for x ≠ y, ‖P(x, ·) − P(y, ·)‖_TV = 1 − ∫ f(x, z) ∧ f(y, z) dz.

Proof. Let C_xy = {z : f(x, z) > f(y, z)} ∪ {x} \ {y} and similarly for C_yx. We have ‖P(x, ·) − P(y, ·)‖_TV = sup_A |P(x, A) − P(y, A)|, and

|P(x, A) − P(y, A)| = |∫_A (f(x, z) − f(y, z)) dz + 1(x ∈ A) r(x) − 1(y ∈ A) r(y)|.

We must have either A = C_xy or A = C_yx in the supremum above. Taking A = C_xy gives ∫ (f(x, z) − f(x, z) ∧ f(y, z)) dz + r(x) = 1 − ∫ f(x, z) ∧ f(y, z) dz, and A = C_yx yields the same value, so ‖P(x, ·) − P(y, ·)‖_TV = 1 − ∫ f(x, z) ∧ f(y, z) dz.

We want to maximize the coupling probability at each step and reduce typical meeting times, so the factors above lead us to ask if there are couplings ¯P which are both implementable and maximal. Our contribution answers this question in the affirmative.

3 Maximal Couplings of MH Kernels

We now introduce a collection of implementable couplings ¯P ∈ Γ_max(P, P) starting from an arbitrary MH-like transition kernel P. We consider two approaches. In Sections 3.1 and 3.2 we show how to couple Markov transition kernels without explicitly coupling their underlying proposal or acceptance distributions. We refer to these as full-kernel couplings. In Section 3.3, we modify Algorithm 1 to maximize the probability of accepting proposals x′ = y′ at the expense of decreasing the acceptance probabilities when x′ ≠ y′.

Figure 4: Rejection sampling regions for Algorithms 3 and 4, based on the same parameters as Figure 1. For both options, meeting occurs on samples drawn from region C.

3.1 ¯P_MI, a Full-kernel Coupling with Independent Residuals

Our first full-kernel coupling, which we will refer to as ¯P_MI, is inspired by the procedure described in Algorithm 2 for drawing from the coupling ¯Q_MI of proposal distributions. A key difference between coupling proposal distributions and transition kernels is that by assumption the proposal distributions are non-atomic and absolutely continuous with respect to an underlying measure, while MH-like kernels P can have a point mass at the current state. Therefore our rejection sampling procedure must be modified to yield draws X and Y with the correct marginal distributions. Algorithm 3 contains this modification, which we illustrate in Figure 4.
Algorithm 3 Draw (X, Y) ∼ ¯P_MI((x, y), ·)
1. Draw X ∼ P(x, ·) and U ∼ Unif
2. If X ≠ x and U f(x, X) ≤ f(y, X), set Y = X
3. Else
   (a) Draw ỹ ∼ P(y, ·) and V ∼ Unif
   (b) If ỹ = y, set Y = ỹ
   (c) If ỹ ≠ y and V f(y, ỹ) > f(x, ỹ), set Y = ỹ
   (d) Else go to 3(a)
4. Return (X, Y)

In Figure 4, the blue and red curves give the continuous parts of P(x, ·) and P(y, ·), respectively. We think of the algorithm as sampling uniformly from the region under the graph along with point masses at x and y. If the X draw falls in C we can use the same point for both chains, and otherwise we use rejection sampling to obtain a Y draw from A ∪ A′ ∪ {y}. Meetings occur exactly on draws from C, and thus a meeting occurs with probability ∫ f(x, z) ∧ f(y, z) dz. See Appendix A.1 for a detailed proof that ¯P_MI((x, y), ·) is a maximal coupling of P(x, ·) and P(y, ·) and an analysis of the computational cost of this algorithm.
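A minimal Python sketch of Algorithm 3 for scalar states follows. The helpers `P_sample` (one draw from the MH kernel) and `f_density` (the continuous part f = q a) are assumed inputs, and exact floating-point equality is used to detect the atom at the current state, which is valid here because a rejected MH proposal returns the current state unchanged.

```python
import numpy as np

def sample_P_MI(x, y, P_sample, f_density, rng):
    """One draw (X, Y) ~ P_MI((x, y), .), following Algorithm 3 (scalar states).

    P_sample(x, rng): one draw from the MH kernel P(x, .).
    f_density(x, z): the continuous part f(x, z) = q(x, z) a(x, z) of P(x, .).
    """
    X = P_sample(x, rng)                                   # Step 1
    if X != x and rng.uniform() * f_density(x, X) <= f_density(y, X):
        return X, X                                        # Step 2: the draw fell in region C
    while True:                                            # Step 3: rejection-sample Y
        yt = P_sample(y, rng)
        if yt == y:
            return X, yt                                   # the atom at y is always accepted
        if yt != y and rng.uniform() * f_density(y, yt) > f_density(x, yt):
            return X, yt
```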
3.2 ¯P_MR, a Full-kernel Coupling with Reflection Residuals

All maximal couplings of a given P have equal meeting probability, but some perform better than others. Although P(X = Y | x, y) is maximal under ¯P_MI((x, y), ·), X and Y are independent when X ≠ Y. This creates a strong tendency for ‖Y − X‖ to grow with the dimension of the state space, as seen in the experiments of Jacob et al. (2020). Even under a maximal coupling, P(X = Y | x, y) can be small until x and y are close. Thus the time for a pair of coupled chains to meet depends on both the meeting probability from each state pair and the degree of contraction between chains when meeting does not occur.

The advantages of using ¯Q_MR rather than ¯Q_MI in Algorithm 1 appear to be due to this consideration; see Section 4.2 below for numerical justifications. It also motivates the following full-kernel maximal coupling with reflection residuals, which we refer to as ¯P_MR and describe in detail in Algorithm 4. For this algorithm define f_m^xy(z) := f(x, z) ∧ f(y, z), f̃_r^xy(x′) := f(x, x′) − f_m^xy(x′), and likewise for f̃_r^yx(y′). Finally set f̃_t^yx(y′) := f̃_r^yx(y′) − f̃_r^yx(y′) ∧ f̃_r^xy(T_yx(y′)), the y residual after reflection, evaluated at y′.

Algorithm 4 Draw (X, Y) ∼ ¯P_MR((x, y), ·)
1. Draw X ∼ P(x, ·) and U ∼ Unif
2. If X ≠ x and U f(x, X) ≤ f(y, X), set Y = X
3. Else
   (a) Set ỹ = T_xy(X) and draw V ∼ Unif
   (b) If X ≠ x and V f̃_r^xy(X) ≤ f̃_r^yx(ỹ), set Y = ỹ
   (c) Else
      i. Draw ỹ ∼ P(y, ·) and W ∼ Unif
      ii. If ỹ = y, set Y = ỹ
      iii. If ỹ ≠ y and W f(y, ỹ) ≤ f̃_t^yx(ỹ), set Y = ỹ
      iv. Else go to 3(c)i
4. Return (X, Y)

Figure 4 provides some intuition into the behavior of ¯P_MR. The first step of ¯P_MR is identical to ¯P_MI in that it attempts to draw from the region C, and a meeting occurs if this is successful. Otherwise ¯P_MR proposes the reflected point T_xy(X) for Y, which succeeds if this point falls in the region A′. If all else fails, we use rejection sampling to obtain a Y draw from A ∪ {y}. See Appendix A.2 for a detailed proof of the validity and maximality, and an analysis of the computational cost of ¯P_MR.

3.3 ¯P_C

While the coupling ¯P_MR sometimes outperforms ¯P_MI, it also has a few limitations. First, the reflection proposal strategy requires a high degree of reflection symmetry between the distributions of X and Y conditional on X ≠ Y to be successful. For example, in the setting of Figure 4 the reflection proposal has more than a 50% chance of being rejected. Without such a symmetry, X and Y will tend to be conditionally independent, resulting in poor contraction between chains when meeting does not occur.

Second, any full-kernel coupling is constrained to work directly with the complicated and irregular geometry of P rather than the simple and often tractable form of the proposal distribution Q. In contrast to the range of couplings and optimal transport strategies available when a standard distribution is used for Q, it appears to be more difficult to design high-performance couplings directly in terms of the associated transition kernels P.

This motivates the next coupling, which allows the use of any coupling of Q while still possibly achieving the maximal coupling probability identified in Lemma 1. We refer to this new algorithm as ¯P_C, since it resembles ¯P_SQ up to the conditional use of one or another Bernoulli coupling ¯B depending on whether or not a meeting is proposed. Together with the definitions below, Algorithm 5 shows how to draw from ¯P_C. We can choose a maximal ¯Q such as ¯Q_MI or ¯Q_MR to obtain a maximal ¯P, although we also obtain a valid ¯P ∈ Γ(P, P) for non-maximal choices of ¯Q.

To define ¯P_C, let q_m^xy(·) be the density of ¯Q((x, y), ·) on the diagonal ∆ = {(z, z) : z ∈ X}. We establish the existence and properties of q_m^xy in Lemma 2 below. Define the proposal residual q_r^xy(x′) := q(x, x′) − q_m^xy(x′) and the transition kernel residual f_r^xy(x′) := 0 ∨ (f(x, x′) − q_m^xy(x′)), and similarly for q_r^yx(y′) and f_r^yx(y′). We illustrate these functions in Figure 5.
Algorithm 5 Draw (X, Y) ∼ ¯P_C((x, y), ·)
1. Draw (x′, y′) ∼ ¯Q((x, y), ·)
2. If x′ = y′ set ¯B = ¯B_1, else set ¯B = ¯B_2
3. Draw (B_x, B_y) ∼ ¯B((x, y), (x′, y′))
4. Set X = B_x x′ + (1 − B_x) x and Y = B_y y′ + (1 − B_y) y
5. Return (X, Y)
Figure 5: Distributions used in Algorithm 5, based on the same parameters as Figure 1. Upper pane (the x′ = y′ case): draws x′ = y′ follow q_m^xy(z) = q(x, z) ∧ q(y, z) and are used as rejection sampling proposals for f(x, ·) = q(x, ·) a(x, ·) and f(y, ·) = q(y, ·) a(y, ·). Lower pane (the x′ ≠ y′ case): draws x′ ≠ y′ follow the proposal residuals q_r^xy, q_r^yx and are used as rejection sampling proposals for the transition kernel residuals f_r^xy, f_r^yx.

Lemma 2. If ¯Q ∈ Γ(Q, Q) then there exists a density q_m^xy for ¯Q((x, y), ·) on ∆. If ¯Q is a maximal coupling, then q_m^xy(z) = q(x, z) ∧ q(y, z) for almost all z.

Proof. Let λ denote the base measure on (X, F) and let λ_∆ be its push-forward to ∆ by the map z ↦ (z, z). Since X is a Polish space, we have A_∆ := {(z, z) : z ∈ A} ∈ F ⊗ F for any A ∈ F. Thus ¯Q((x, y), ·) induces a sub-probability ¯Q_∆ on ∆. Also ¯Q_∆ ≪ λ_∆, since if λ_∆(A_∆) = λ(A) = 0 then ¯Q_∆(A_∆) ≤ ¯Q((x, y), A × X) = Q(x, A) = 0. The Radon–Nikodym theorem then guarantees the existence of an integrable function q_m^xy : X → [0, ∞) such that ¯Q((x, y), A_∆) = ∫_A q_m^xy(z) dz.

Next, we claim q_m^xy(z) ≤ q(x, z) ∧ q(y, z) for λ-almost all z. Let A := {z : q_m^xy(z) > q(x, z) ∧ q(y, z)} ∈ F, A_x := {z : q_m^xy(z) > q(x, z)} ∈ F, and likewise for A_y. Since A = A_x ∪ A_y, λ(A) > 0 implies λ(A_x) > 0 or λ(A_y) > 0. If λ(A_x) > 0, then Q(x, A_x) = ∫_{A_x} q(x, z) dz < ∫_{A_x} q_m^xy(z) dz = ¯Q((x, y), (A_x)_∆) ≤ ¯Q((x, y), A_x × X) = Q(x, A_x), a contradiction. The case λ(A_y) > 0 is similar, so λ(A) = 0.

Finally, if ¯Q is maximal, a total variation computation similar to that of Lemma 1 shows that ¯Q((x, y), ∆) = ∫ q(x, z) ∧ q(y, z) dz. Combining this with the above implies q_m^xy(z) = q(x, z) ∧ q(y, z) for λ-almost all z.

Resuming our definition of ¯P_C, we set acceptance probabilities b_xy(x′) := 1 ∧ (f(x, x′)/q_m^xy(x′)) if q_m^xy(x′) > 0 and b_xy(x′) := 1 otherwise, and c_xy(x′) := f_r^xy(x′)/q_r^xy(x′) if q_r^xy(x′) > 0 and c_xy(x′) := 1 otherwise, and likewise for b_yx(y′) and c_yx(y′). When x′ = y′ we require that the acceptance indicator pair (B_x, B_y) follows the maximal coupling ¯B_1 of Bern(b_xy(x′)) and Bern(b_yx(y′)). When x′ ≠ y′ we require that (B_x, B_y) follows any coupling ¯B_2 ∈ Γ(Bern(c_xy(x′)), Bern(c_yx(y′))). Simulation results suggest that maximal couplings for ¯B_2 perform well, but optimal transport couplings which aim to minimize ‖X − Y‖ are also attractive in this setting.

When ¯Q proposes a meeting x′ = y′, we think of ¯P_C as using these values as rejection sampling proposals for the transition densities f(x, ·) and f(y, ·). This results in a higher marginal acceptance rate than we would have under MH. On the other hand, when x′ ≠ y′ we think of this method as falling back to rejection sampling of the transition kernel residuals f_r from the proposal residuals q_r, both defined after the removal of q_m^xy. This produces a lower marginal acceptance rate than MH, exactly counterbalancing the above. In Proposition 1 we show that ¯P_C((x, y), ·) is a maximal coupling of P(x, ·) and P(y, ·).
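The following Python sketch draws once from ¯P_C, assuming a maximal ¯Q so that q_m^xy(z) = q(x, z) ∧ q(y, z) by Lemma 2. It uses a common uniform both for the maximal coupling ¯B_1 and, when x′ ≠ y′, for ¯B_2 (any coupling is valid in that case). The inputs `Qbar`, `q_density`, and `accept_prob` are assumptions, as in the earlier sketches.

```python
import numpy as np

def sample_P_C(x, y, Qbar, q_density, accept_prob, rng):
    """One draw (X, Y) ~ P_C((x, y), .), following Algorithm 5 with a maximal Qbar."""
    f = lambda w, z: q_density(w, z) * accept_prob(w, z)   # f(w, z) = q(w, z) a(w, z)
    xp, yp = Qbar(x, y, rng)                               # Step 1
    u = rng.uniform()
    if np.all(xp == yp):
        # Meeting proposed: maximal coupling B1 of Bern(b_xy) and Bern(b_yx),
        # with q_m(z) = q(x, z) ^ q(y, z) by Lemma 2 since Qbar is maximal.
        z = xp
        qm = min(q_density(x, z), q_density(y, z))
        b_x = min(1.0, f(x, z) / qm) if qm > 0 else 1.0
        b_y = min(1.0, f(y, z) / qm) if qm > 0 else 1.0
        B_x, B_y = u <= b_x, u <= b_y
    else:
        # No meeting proposed: couple Bern(c_xy) and Bern(c_yx); any coupling B2
        # is valid, and a common uniform is used here for simplicity.
        qm_x = min(q_density(x, xp), q_density(y, xp))
        qm_y = min(q_density(x, yp), q_density(y, yp))
        qr_x = q_density(x, xp) - qm_x                     # proposal residual q_r^xy(x')
        qr_y = q_density(y, yp) - qm_y                     # proposal residual q_r^yx(y')
        fr_x = max(0.0, f(x, xp) - qm_x)                   # kernel residual f_r^xy(x')
        fr_y = max(0.0, f(y, yp) - qm_y)                   # kernel residual f_r^yx(y')
        c_x = fr_x / qr_x if qr_x > 0 else 1.0
        c_y = fr_y / qr_y if qr_y > 0 else 1.0
        B_x, B_y = u <= c_x, u <= c_y
    X = xp if B_x else x                                   # Steps 3-4
    Y = yp if B_y else y
    return X, Y
```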
Proposition 1. For any ¯Q ∈ Γ(Q, Q), the output (X, Y) of Algorithm 5 will follow a coupling ¯P ∈ Γ(P, P). If ¯Q is maximal then ¯P will be as well.

Proof. At any point x′ ≠ x, X will have density

q_m^xy(x′) b_xy(x′) + q_r^xy(x′) c_xy(x′) = (q_m^xy(x′) ∧ f(x, x′)) + (q_r^xy(x′) ∧ f_r^xy(x′)) = (q_m^xy(x′) ∧ f(x, x′)) + f_r^xy(x′) = f(x, x′).

The second equality holds because f_r^xy(x′) ≤ q_r^xy(x′). This in turn holds because f(x, x′) ≤ q(x, x′) and q_m^xy(x′) ≤ q(x, x′) by Lemma 2, so

f_r^xy(x′) = (f(x, x′) ∨ q_m^xy(x′)) − q_m^xy(x′) ≤ q(x, x′) − q_m^xy(x′) = q_r^xy(x′).

Integrating the density of X over all x′ ≠ x yields P(X = x) = 1 − ∫ f(x, x′) dx′ = r(x), so we conclude X ∼ P(x, ·). A similar argument shows that Y ∼ P(y, ·). Thus (X, Y) follows the desired coupling.

If ¯Q is maximal then by Lemma 2 the probability density at the proposal (z, z) ∼ ¯Q((x, y), ·) will be q_m^xy(z) = q(x, z) ∧ q(y, z). By the definition of ¯B_1, the probability of accepting a proposal x′ = y′ = z for both X and Y will be

(1 ∧ f(x, z)/q_m^xy(z)) ∧ (1 ∧ f(y, z)/q_m^yx(z)) = (f(x, z) ∧ f(y, z)) / (q(x, z) ∧ q(y, z)).

Combining this with the proposal density implies that the overall coupled transition kernel density at X = Y = z will be f(x, z) ∧ f(y, z). Thus P(X = Y) = ∫ f(x, z) ∧ f(y, z) dz, and by Lemma 1 we conclude that ¯P is maximal.

Figure 6: Distribution of meeting times for the example described in Section 4.1. The four maximal couplings yield almost identical distributions of shorter meeting times, while the two ¯P_SQ methods yield almost identical distributions of longer ones. In this example the requirement that ¯P_SQ accept meeting and non-meeting proposals at exactly the MH rate puts it at a significant disadvantage relative to the maximal couplings.

We observe that ¯P_C matches the flexibility and computational efficiency of ¯P_SQ while offering a higher meeting probability at each iteration. The ability to choose an arbitrary ¯Q ∈ Γ(Q, Q) is also a significant advantage of ¯P_C over the full-kernel couplings. The extra effort required to construct, validate, and draw from ¯P_MR relative to ¯P_MI shows how challenging such refinements can be. Finally, we note that ¯P_C can be more computationally efficient than the full-kernel couplings ¯P_MI and ¯P_MR, in that it avoids the 'while' loops of Step 3 of Algorithms 3 and 4.

4 Numerical Examples

4.1 A Toy Example

For our first example we consider a toy model that emphasizes the differences between ¯P_SQ and the maximal couplings ¯P_MI, ¯P_MR, and ¯P_C. We assume an Exponential target distribution π = Expo(1) and draw proposals from Q(z, ·) = N(z + κ, σ²) with κ > 0. We then accept or reject these proposals at the usual MH rate, so a(z, z′) = 1 ∧ exp((z − z′)(2κ/σ² + 1)) for z′ ≥ 0. Our assumption on κ implies a(x, z) ∧ a(y, z) = a(x, z) if x ≤ y, and a(x, z) = 1 if z ≤ x. Thus Q will tend to propose increasing values while a favors decreasing ones. This tension between the proposal and target distributions is characteristic of MH kernels that do not mix rapidly.

We construct the transition kernel couplings ¯P_SQ with ¯Q_MI, ¯P_MI, ¯P_MR, and ¯P_C with ¯Q_MI as described in Sections 2 and 3. We also define a simple generalization of the maximal coupling with reflection residuals, such that either x′ = y′ or y′ − (y + κ) = (x + κ) − x′. We use the resulting ¯Q_MR as an alternative proposal kernel coupling for ¯P_SQ and ¯P_C. We set κ = σ = 3, draw initial values independently from the target π, and run 10,000 replications for each coupling option. For each replication we record the meeting time τ = min(t ≥ 1 : X_t = Y_t). As described in Section 1, such meeting times are of theoretical and practical importance, and they also make a good measure of coupling performance. We summarize the average behavior of these meeting times in Table 1 and present their full distribution in Figure 6.

Table 1: Average meeting time and standard error for each coupling in the Section 4.1 example (¯P_SQ with ¯Q_MI, ¯P_SQ with ¯Q_MR, ¯P_MI, ¯P_MR, ¯P_C with ¯Q_MI, and ¯P_C with ¯Q_MR).

We find that both non-maximal couplings deliver average meeting times around 75 iterations, while the four maximal couplings deliver meeting times around 61 iterations. We recall that for a given state pair (x, y), the maximal couplings produce one value of P(X = Y | x, y) and the two ¯P_SQ couplings produce another. The observed clustering of algorithms is consistent with the idea that in this example meeting times are driven by these one-step meeting probabilities rather than by behavior when meeting does not occur, which varies significantly by algorithm.

For a better understanding of these differences, we contrast the behavior of ¯P_SQ and ¯P_C. Although we use the same underlying proposal coupling (x′, y′) ∼ ¯Q((x, y), ·) in each case, the two MH transition kernel couplings differ in that ¯P_SQ accepts its proposals at exactly the MH rate while ¯P_C uses a higher acceptance probability when x′ = y′ and a lower one when x′ ≠ y′. In this example, most proposals have a relatively low MH acceptance probability to begin with. Thus ¯P_C meets more quickly by concentrating the little acceptance probability available on the draws where they can result in a meeting X = Y, while ¯P_SQ is forced to accept its proposals at the same relatively low rate whether or not a meeting is proposed.

Figure 7: Average meeting times for a range of transition kernel couplings, as described in Section 4.2. Coupling strategies that involve reflections of the proposals appear to outperform the others; among the others, maximal couplings seem to perform better than non-maximal ones.

The above leads us to expect that maximal couplings might provide the greatest advantage over the status quo when low acceptance probabilities are typical, either due to the presence of a challenging target, an imperfect proposal distribution, or both.
This example emphasizes simplicity over realism, but we would expect its principles to hold more broadly, especially in cases where mixing is relatively slow.
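A sketch of this experiment follows, reusing the `sample_Qbar_MI` and `sample_P_C` sketches above; the replication count is reduced for illustration, and the printed average should be read only qualitatively.

```python
import numpy as np

kappa, sigma = 3.0, 3.0

def q_density(z, zp):                       # Q(z, .) = N(z + kappa, sigma^2)
    return np.exp(-0.5 * ((zp - z - kappa) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def q_sample(z, rng):
    return z + kappa + sigma * rng.normal()

def accept_prob(z, zp):                     # MH rate for the target pi = Expo(1)
    if zp < 0:
        return 0.0
    return min(1.0, np.exp((z - zp) * (2 * kappa / sigma ** 2 + 1)))

def Qbar(x, y, rng):
    return sample_Qbar_MI(x, y, q_density, q_sample, rng)

rng = np.random.default_rng(1)
meeting_times = []
for _ in range(100):                        # the paper uses 10,000 replications
    x, y = rng.exponential(), rng.exponential()   # independent draws from pi
    t = 0
    while x != y:
        x, y = sample_P_C(x, y, Qbar, q_density, accept_prob, rng)
        t += 1
    meeting_times.append(t)
print(np.mean(meeting_times))
```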
4.2 Scaling with Dimension

For our second example we consider MH on R^d with a target distribution π = N(0, I_d) and proposals Q(z, ·) = N(z, σ_d² I_d). Following e.g. Roberts et al. (1997) and Christensen et al. (2005) we set σ_d² = ℓ²/d with ℓ = 2.38. This example allows us to consider the role of the dimension and examine differences between the couplings when meeting probabilities are just one important aspect of their behavior.

As above, our main object of interest in this example is the number of iterations required for a pair of coupled chains to meet. We initialize chains on independent draws from π, consider dimensions d = 1, …, 10, and run 1,000 replications for each algorithm. We use a maximal coupling of acceptance indicators for ¯P_SQ and ¯P_C, which appears to yield the best results among the simple acceptance indicator couplings.

We present the results of this experiment in Figure 7. There we observe that ¯P_SQ using ¯Q_MI, ¯P_C using ¯Q_MI, and ¯P_MI seem to yield meeting times that increase exponentially in dimension, although the maximal couplings outperform the non-maximal coupling. This blow-up is expected since these options involve independent or weakly dependent behavior when X ≠ Y. The coupling ¯P_MR delivers somewhat better behavior, with ¯P_SQ and ¯P_C with ¯Q_MR delivering the best performance. In higher dimensions it appears that the ¯Q_MR version of ¯P_SQ may outperform its ¯P_C counterpart, an interesting and perhaps counterintuitive result.

To understand these differences in meeting times, we must consider what happens under each coupling ¯P when a meeting does not occur. For Q(z, ·) = N(z, σ_d² I_d), any ¯Q ∈ Γ_max(Q, Q) will yield P(x′ = y′ | x, y) = P(χ²_1 ≥ ‖y − x‖²/(4σ_d²)); see e.g. Pollard (2005, chap. 3.3). Since a meeting cannot occur unless one is proposed, this expression is an upper bound on P(X = Y | x, y). Thus we should expect the probability of X = Y to fall off rapidly in ‖y − x‖, so that couplings which do not promote contraction between chains in the absence of meeting will yield rapidly increasing meeting times.

Figure 8: Distance between coupled MH chains by iteration, as described in Section 4.2. Couplings in which X ≠ Y implies X, Y independent or almost independent display little contraction, while those based on the maximal reflection coupling of proposal distributions ¯Q_MR display strong contraction.

Figure 8 is consistent with these observations. As above we run 1,000 replications per algorithm, initializing each chain with an independent draw from the target distribution. We set d = 100 to isolate the effects of contraction from those of meeting and then track the average distance between chains as a function of iteration.

We find that ¯P_MI, ¯P_SQ with ¯Q_MI, and ¯P_C with ¯Q_MI produce little or no contraction within the distance achieved by independent draws from the target. They then remain far enough apart that the probability of meeting is negligible. The coupling ¯P_MR displays contraction up to a point. Finally we observe that both the maximal and non-maximal couplings based on ¯Q_MR contract rapidly to within a radius where meeting can occur. ¯P_SQ appears to contract more rapidly than ¯P_C, which is consistent with its slightly stronger performance in high dimensions.
5 Discussion

Couplings play a central role in the analysis of MCMC convergence and increasingly appear in new methods and estimators. Until now, no general-purpose algorithm has been available to sample from a maximal coupling of MH transition kernels. We fill this gap by introducing three such algorithms, which are implementable under the standard assumptions that one can draw proposals from a distribution Q and compute the density q and acceptance rate a at any points of the state space. The proposed couplings are simple to apply and can be used with a variety of MH strategies including the Metropolis-adjusted Langevin algorithm, pseudo-marginal methods, and the MH-within-Gibbs algorithm.

The experiments in Section 4.1 show that the gains from using these methods can be large, especially when there is a tension between the proposal density q and the acceptance rate a. On the other hand, the example of Section 4.2 shows that maximality is sometimes less important than other properties of a coupling, such as the contraction behavior when a meeting does not occur. The two examples considered here are simple ones, and experiments with a wider range of MH algorithms and target distributions would clarify the strengths and weaknesses of the proposed couplings.

This work raises several questions. First, it is not known whether all couplings of MH kernels might be represented in the form of Algorithm 5 for some appropriate choice of proposal and acceptance couplings; see Nüsken and Pavliotis (2019) for the treatment of a similar question in the setting of continuous-time Markov chains. It would also be interesting to consider the use of sub-maximal proposal distribution couplings in Algorithm 5, as suggested in the comment of Gerber and Lee (2020) on Jacob et al. (2020).

Both meeting probabilities and contraction rates influence meeting times, and one might wonder about deriving maximally contractive couplings in analogy to the present work on meeting times. Reflection couplings seem particularly effective and are known to be optimal in special cases (Lindvall and Rogers, 1986). In other cases, synchronous or 'common random number' couplings yield strong contraction (e.g. Diaconis and Freedman, 1999). In most other scenarios the user must construct a coupling tailored to the problem at hand. The methods proposed here represent a step forward in coupling design, but many important questions remain.

From a more theoretical point of view, while the MH kernel P is known to be the projection of the proposal Q onto the set of π-reversible kernels in a certain metric (Billera and Diaconis, 2001), it is not known how ¯Q relates to the set of maximal couplings of π-reversible kernels. In particular, it would be interesting to know if the strategy proposed in Section 3.3 corresponds to a projection of ¯Q onto that set.

Finally, the coupling strategies mentioned above are all Markovian. In some cases non-Markovian couplings are known to deliver more satisfactory performance than Markovian ones (e.g. Smith, 2014). The design of practical non-Markovian couplings for MCMC is a topic deserving further attention.

Acknowledgements
The authors would like to thank Jun Yang, Persi Diaconis, and Niloy Biswas for helpful discussions. Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grants DMS-1712872 and DMS-1844695.
References
Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.

Barker, A. A. (1965). Monte Carlo calculations of the radial distribution functions for a proton–electron plasma. Australian Journal of Physics, 18(2):119–134.

Billera, L. J. and Diaconis, P. (2001). A geometric interpretation of the Metropolis–Hastings algorithm. Statistical Science, pages 335–339.

Biswas, N., Jacob, P. E., and Vanetti, P. (2019). Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7391–7401.

Bou-Rabee, N., Eberle, A., and Zimmer, R. (2020). Coupling and convergence for Hamiltonian Monte Carlo. Annals of Applied Probability, 30(3):1209–1250.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.

Christensen, O. F., Roberts, G. O., and Rosenthal, J. S. (2005). Scaling limits for the transient phase of local Metropolis–Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Diaconis, P. and Freedman, D. (1999). Iterated random functions. SIAM Review, 41(1):45–76.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Eberle, A. (2011). Reflection coupling and Wasserstein contractivity without convexity. Comptes Rendus Mathematique.

Eberle, A., Guillin, A., and Zimmer, R. (2019). Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010.

Gerber, M. and Lee, A. (2020). Discussion on the paper by Jacob, O'Leary, and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):584–585.

Glynn, P. W. and Rhee, C. H. (2014). Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability.

Goodman, J. B. and Lin, K. K. (2009). Coupling control variates for Markov chain Monte Carlo. Journal of Computational Physics, 228(19):7127–7136.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Heng, J. and Jacob, P. E. (2019). Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287–302.

Jacob, P. E., O'Leary, J., and Atchadé, Y. F. (2020). Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Jerrum, M. (1998). Mathematical foundations of the Markov chain Monte Carlo method. In Probabilistic Methods for Algorithmic Discrete Mathematics, pages 116–165. Springer.

Johnson, V. E. (1996). Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166.

Johnson, V. E. (1998). A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248.

Levin, D. A., Peres, Y., and Wilmer, E. L. (2017). Markov Chains and Mixing Times, volume 107. American Mathematical Society.

Lindvall, T. (1992). Lectures on the Coupling Method. John Wiley and Sons, Inc., New York.

Lindvall, T. and Rogers, L. C. G. (1986). Coupling of multidimensional diffusions by reflection. The Annals of Probability.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

Middleton, L., Deligiannidis, G., Doucet, A., and Jacob, P. E. (2019). Unbiased smoothing using particle independent Metropolis–Hastings. In Proceedings of Machine Learning Research, volume 89, pages 2378–2387. PMLR.

Middleton, L., Deligiannidis, G., Doucet, A., and Jacob, P. E. (2020). Unbiased Markov chain Monte Carlo for intractable target distributions. Electronic Journal of Statistics, 14(2):2842–2891.

Neal, R. and Pinto, R. (2001). Improving Markov chain Monte Carlo estimators by coupling to an approximating chain. Technical report, Department of Statistics, University of Toronto.

Neal, R. M. (1993). Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, pages 475–482.

Neal, R. M. (1999). Circularly-coupled Markov chain sampling. Technical report, Department of Statistics, University of Toronto.

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2.

Nüsken, N. and Pavliotis, G. A. (2019). Constructing sampling schemes via coupling: Markov semigroups and optimal transport. SIAM/ASA Journal on Uncertainty Quantification, 7(1):324–382.

Piponi, D., Hoffman, M., and Sountsov, P. (2020). Hamiltonian Monte Carlo swindles. In Proceedings of Machine Learning Research, volume 108, pages 3774–3783. PMLR.

Pollard, D. (2005). Asymptopia. Manuscript, Yale University, Dept. of Statistics, New Haven, Connecticut.

Propp, J. G. and Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1-2):223–252.

Roberts, G. O., Gelman, A., and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability.

Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.

Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association.

Rosenthal, J. S. (2002). Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability.

Smith, A. (2014). A Gibbs sampler on the n-simplex. The Annals of Applied Probability, 24(1):114–130.

Vaserstein, L. N. (1969). Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72.
A Appendix
In Appendix A.1, we prove that the coupling ¯P_MI((x, y), ·) described in Algorithm 3 is a maximal coupling of P(x, ·) and P(y, ·) and analyze its computational cost. Similarly, in Appendix A.2 we prove that the ¯P_MR described in Algorithm 4 is valid and maximal and analyze its cost.

A.1 Validity, maximality, and computation cost of ¯P_MI

We prove that Algorithm 3 defines a coupling of the correct marginal distributions and that it attains the maximum one-step meeting probability.
Proposition 2.
The draws (X, Y) produced by Algorithm 3 follow a maximal coupling of P(x, ·) and P(y, ·), with the property that X and Y are conditionally independent given X ≠ Y. Moreover, the coupling probability is maximized among all possible couplings, so that ¯P_MI((x, y), X = Y) = 1 − ‖P(x, ·) − P(y, ·)‖_TV.
It suffices to show that Y has marginal distribution P(y, ·) and that the coupling probability equals 1 − ‖P(x, ·) − P(y, ·)‖_TV. We define the residual measure of y as P̃_yx(·) ∝ r(y) δ_y(·) + f̃_r^yx(·), where δ_y is the point mass at state y and f̃_r^yx(y′) := f(y, y′) − f(y, y′) ∧ f(x, y′) is the 'unnormalized' y-residual density evaluated at y′. These definitions are consistent with the ones given in Section 3.2. From this point of view, Step 3 can be seen as a standard rejection sampler with proposal measure P(y, ·) and target measure P̃_yx(·). For any y′ ≠ y, let f_yx(y′) denote the transition density of the Y-chain from y to y′. Then f_yx(y′) can be written as

f_yx(y′) = f_yx(y′, Step 2) + f_yx(y′, Step 3).

The first term works out to f_yx(y′, Step 2) = f(x, y′) ∧ f(y, y′), while the second term can be computed as

f_yx(y′, Step 3) = P(Step 3) f_yx(y′ | Step 3) = (1 − ∫ f(x, z) ∧ f(y, z) dz) (1/c(x, y)) f̃_r^yx(y′).

Here c(x, y) = 1 − ∫ f(y, z) ∧ f(x, z) dz is the normalizing constant of P̃_yx(·), which also equals r(y) + ∫ f̃_r^yx(z) dz. Putting all the terms together, we have f_yx(y′) = f(x, y′) ∧ f(y, y′) + f̃_r^yx(y′) = f(y, y′) as desired, which justifies the validity of Algorithm 3.

We can also observe that the coupling probability equals the probability that Algorithm 3 stops at Step 2. Therefore, the coupling probability satisfies

¯P_MI((x, y), X = Y) = ∫ f(x, y′) ∧ f(y, y′) dy′ = 1 − ‖P(x, ·) − P(y, ·)‖_TV.

We conclude that Algorithm 3 maximizes the coupling probability in one step.

We analyze the computation cost of Algorithm 3 as follows. To draw one sample from Algorithm 3, one needs to run Step 1 once with probability 1, Step 2 once with probability 1 − r(x), and Step 3 N times, where N is a random variable which equals 0 with probability 1 − ‖P(x, ·) − P(y, ·)‖_TV and otherwise follows a Geometric distribution with success probability ‖P(x, ·) − P(y, ·)‖_TV. Meanwhile, Step 1 contains one draw from P, Step 2 contains two evaluations, and each execution of Step 3 contains one draw from P and either 0 or 2 evaluations, with probability r(y) and 1 − r(y) respectively.

Therefore, the expected number of draws from P is 2 and the expected number of evaluations is 4 − r(x) − r(y). Taking account of the fact that each draw from P contains one draw from q and one evaluation of the acceptance ratio, Algorithm 3 requires 2 draws from q and 6 − r(x) − r(y) evaluations in expectation. The variance of the computing cost depends on the total variation distance between P(x, ·) and P(y, ·), and goes to infinity as ‖P(x, ·) − P(y, ·)‖_TV goes to zero. This can motivate the consideration of sub-maximal coupling strategies, as described in the comment of Gerber and Lee (2020) on Jacob et al. (2020).
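As an informal numerical check of Proposition 2, the sketch below (reusing `sample_P_MI` from the Section 3.1 sketch) compares the empirical meeting frequency of Algorithm 3 with a quadrature estimate of ∫ f(x, z) ∧ f(y, z) dz for an illustrative Gaussian random walk example. The parameters and the SciPy dependency are assumptions, not part of the paper.

```python
import numpy as np
from scipy.integrate import quad

sigma, x0, y0 = 1.0, 1.0, 4.0               # illustrative parameters

def q_density(z, zp):                       # random walk proposal N(z, sigma^2)
    return np.exp(-0.5 * ((zp - z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def accept_prob(z, zp):                     # MH rate for the target pi = N(0, 1)
    return min(1.0, np.exp(0.5 * (z ** 2 - zp ** 2)))

def f_density(z, zp):
    return q_density(z, zp) * accept_prob(z, zp)

def P_sample(z, rng):                       # one MH transition from z
    zp = z + sigma * rng.normal()
    return zp if rng.uniform() <= accept_prob(z, zp) else z

rng = np.random.default_rng(0)
n, meets = 20000, 0
for _ in range(n):
    X, Y = sample_P_MI(x0, y0, P_sample, f_density, rng)
    meets += (X == Y)
bound, _ = quad(lambda z: min(f_density(x0, z), f_density(y0, z)), -12.0, 12.0)
print(meets / n, bound)                     # should agree up to Monte Carlo error
```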
A.2 Validity, maximality, and computation cost of ¯P_MR

We prove that Algorithm 4 defines a valid coupling and attains the maximal coupling probability.

Proposition 3.