Couplings of the Random-Walk Metropolis algorithm
John O’Leary∗
February 4, 2021
Abstract
Couplings play a central role in contemporary Markov chain Monte Carlo methods and in the analysis of their convergence to stationarity. In most cases, a coupling must induce relatively fast meeting between chains to ensure good performance. In this paper we fix attention on the random walk Metropolis algorithm and examine a range of coupling design choices. We introduce proposal and acceptance step couplings based on geometric, optimal transport, and maximality considerations. We consider the theoretical properties of these choices and examine their implications for the meeting time of the chains. We conclude by extracting a few general principles and hypotheses on the design of effective couplings.
In commemorating the 50th anniversary of the Metropolis–Hastings (MH) algorithm, Dunson and Johndrow [2020] point to the unbiased estimation method of Jacob et al. [2020] as a leading strategy for the parallelization of Markov chain Monte Carlo (MCMC) algorithms. However, they note a challenge: while it is usually easy to find a transition kernel coupling with the properties needed for this approach, that choice is rarely unique, and the wrong selection can result in low estimator efficiency. The design of efficient couplings is, as they write, “an exciting direction that we expect will see growing attention among practitioners.” In this study we take up this important question.

From the early days of Markov chain theory [e.g. Doeblin, 1938, Harris, 1955, Pitman, 1976, Aldous, 1983, Rosenthal, 1995], couplings have played a key role in the analysis of convergence to stationarity. In recent years they have also been used to formulate MCMC diagnostics [Johnson, 1996, 1998, Biswas et al., 2019], variance reduction methods [Neal and Pinto, 2001, Goodman and Lin, 2009, Piponi et al., 2020], and new sampling and estimation strategies [Propp and Wilson, 1996, Fill, 1997, Neal, 1999, Flegal and Herbei, 2012, Glynn and Rhee, 2014, Jacob et al., 2020, Heng and Jacob, 2019]. Couplings that produce smaller meeting times generally yield better results in the form of tighter bounds, more variance reduction, greater computational efficiency, or more precise estimators.

∗ Department of Statistics, Harvard University, Cambridge, MA, USA. Email: [email protected]

Thus, the design of efficient couplings has been an important question for almost the entire history of the coupling method. When a coupling is not required to be co-adapted to the chains in question, simple arguments show that a maximal coupling of the chains exists and results in meeting at the fastest rate allowed by the coupling inequality [Griffeath, 1975, Goldstein, 1979].
However, when the coupling must be implementable and Markovian, maximal couplings are known only in special cases [Burdzy and Kendall, 2000, Hsu and Sturm, 2013, Böttcher, 2017]. Markovian couplings are easy to work with and are required for many of the applications above, but they are rarely maximal.

In this study we consider transition kernel couplings of the Random Walk Metropolis (RWM) algorithm [Metropolis et al., 1953], which is perhaps the oldest, simplest, and best-understood MCMC method. Transition kernel couplings [Douc et al., 2018, chap. 19] are Markovian by construction. Explicit and implementable couplings of the RWM kernel seem to originate with Johnson [1998]. These methods were taken up in Jacob et al. [2020], which found that apparently minor differences in coupling design can have significant implications for meeting times, especially in relation to the dimension of the state space. In this paper we continue this line of inquiry and take a pragmatic approach to the question of coupling design. We ask: what options are available for coupling the RWM kernel, how do these choices affect meeting times, and what lessons can we learn from this simple case?

We begin by introducing the essential ingredients of an RWM kernel coupling. First, we consider proposal distribution couplings, devoting some attention to maximal couplings of the multivariate normal distribution. Next, we turn to coupling at the accept/reject step. Any coupling of the RWM kernel can be realized as a proposal coupling followed by an acceptance step coupling [O’Leary and Wang, 2021], so this focus on separate proposal and acceptance couplings involves no loss of generality. We conclude with a range of simulation exercises to understand how various coupling design options affect meeting times, and with some stylized facts and advice on the construction of efficient couplings for the RWM algorithm and beyond.
Throughout the following we write a ∧ b := min(a, b), a ∨ b := max(a, b), and L(Z) for the law of a random variable Z. We write Bern(α) for the Bernoulli distribution on {0, 1} with P(Bern(α) = 1) = α, N(µ, Σ) for the multivariate normal distribution with mean µ and covariance matrix Σ, and N(z; µ, Σ) for the density of this distribution evaluated at a point z.

Fix a target distribution π on a state space (R^d, B), where B is the Borel σ-algebra on R^d. Let Q : R^d × B → [0, 1] be a proposal kernel. Thus Q(x, ·) is a probability measure for all x ∈ R^d and Q(·, A) : R^d → [0, 1] is measurable for all A ∈ B. We interpret Q(x, A) as the probability of proposing some point x̃ ∈ A when the current state is x. Assume π has density π(·) and Q(x, ·) has density q(x, ·) for x ∈ R^d, all with respect to Lebesgue measure. The MH acceptance ratio [Hastings, 1970] is then defined as

a(x, x̃) := 1 ∧ [q(x̃, x) π(x̃)] / [q(x, x̃) π(x)].

In this study we focus on the RWM algorithm with multivariate normal proposal increments, so Q(x, ·) = N(x, σ_d² I_d) for all x ∈ R^d. In this case q(x, x̃) = q(x̃, x) for all x, x̃ ∈ R^d, and a(x, x̃) = 1 ∧ (π(x̃)/π(x)).

We construct an MH chain (X_t) as follows. First we initialize the chain with a draw X_0 from an arbitrary distribution π_0 on (R^d, B). At each iteration t we draw x̃ ∼ Q(x, ·), where x = X_t is the current state of the chain. We then draw an acceptance indicator b_x ∼ Bern(a(x, x̃)) and set X_{t+1} := b_x x̃ + (1 − b_x) x. It is often convenient to realize the acceptance indicator draw by taking U ∼ Unif(0, 1) and b_x := 1(U ≤ a(x, x̃)). The chain (X_t) defined above will have a transition kernel P defined by P(x, A) := P(X_{t+1} ∈ A | X_t = x) for x ∈ R^d and A ∈ B.

For any probability measures µ and ν on (R^d, B), we say that a probability measure γ on (R^d × R^d, B ⊗ B) is a coupling of µ and ν if γ(A × R^d) = µ(A) and γ(R^d × A) = ν(A) for all A ∈ B. We write Γ(µ, ν) for the set of all such couplings of µ and ν. Next suppose that (X_t) and (Y_t) are both Markov chains defined on the same probability space and that both evolve according to the RWM transition kernel P defined above. We say that (X_t, Y_t) follows a transition kernel coupling P̄ based on P if there exists a joint kernel P̄ : (R^d × R^d) × (B ⊗ B) → [0, 1] with P̄((x, y), ·) ∈ Γ(P(x, ·), P(y, ·)) for all x, y ∈ R^d. We write Γ(P, P) for the set of all such kernel couplings. The limitation to couplings of (X_t) and (Y_t) that can be expressed in the form above is not a trivial one, as described further in Kumar and Ramesh [2001].

We write τ = min{t : X_t = Y_t} for the first time the chains meet. Couplings P̄ with the property that P(τ < ∞) = 1 are called successful. To obtain successful couplings, we generally need a proposal kernel coupling with P(x̃ = ỹ | x, y) > 0 when (X_t, Y_t) = (x, y). This will lead us to consider maximal couplings of the proposal distributions, which achieve the highest possible probability P(x̃ = ỹ | x, y) for each x, y ∈ R^d. Couplings P̄ with the property that X_t = Y_t for all t ≥ τ are called sticky and are also our subject of interest here. Rosenthal [1997] and Dey et al. [2017] point out that stickiness is a non-trivial property, even for Markovian couplings. However, the couplings we consider can always be made sticky by requiring x̃ = ỹ ∼ Q(x, ·) = Q(y, ·) and V = U ∼ Unif(0, 1) if x = y.

In this paper we consider a range of coupling options for the RWM transition kernel. Our goal is to understand the implications of these options for the distribution of τ and especially for the value of its mean E[τ]. The average meeting time serves as a convenient summary of the meeting rate and plays a specific role in the efficiency of the estimators described in Jacob et al. [2020]. We will focus on transition kernel couplings that arise by separately coupling the proposals (x̃, ỹ) and the uniform draws (U, V) underlying the acceptance indicators (b_x, b_y) = (1(U ≤ a(x, x̃)), 1(V ≤ a(y, ỹ))).
This strategy is fully general except with respect to the acceptance indicator coupling, as noted in O’Leary and Wang [2021].

When it is unlikely to cause confusion, we write (x, y) = (X_t, Y_t) for the current state of a pair of coupled MH chains, (x̃, ỹ) = (x + ξ, y + η) for the proposals, and (X, Y) = (X_{t+1}, Y_{t+1}) for the next state pair. We write r = ‖x − y‖ for the Euclidean distance between x and y, m = (x + y)/2 for their midpoint, and e = (y − x)/r for the unit vector pointing from x to y, an important direction in many of the constructions described below. Finally, for any z ∈ R^d we write z_e = e^⊤ z ∈ R for the e component of z and z_⊥ = (I_d − ee^⊤) z ∈ R^d for the projection of z onto the subspace orthogonal to e. Thus we can express any vector z ∈ R^d as z = e z_e + z_⊥. See Figure 1 for an illustration of these quantities.

Figure 1: Coupled chain notation and geometry. We denote current points by x, y ∈ R^d, their separation by r ≥ 0, and their midpoint by m ∈ R^d. The unit vector e points in the direction from x to y, m_e ∈ R gives the e component of m, and m_⊥ = x_⊥ = y_⊥ ∈ R^d gives the component of these three vectors which is orthogonal to e.

To obtain finite meeting times we generally need P(x̃ = ỹ | x, y) > 0 for pairs (x, y) with x ≠ y. One solution is to draw (x̃, ỹ) from a maximal coupling Q̄ ∈ Γ(Q, Q). A coupling of x̃ ∼ Q(x, ·) and ỹ ∼ Q(y, ·) is said to be maximal if it achieves the upper bound given by the coupling inequality,

P(x̃ = ỹ | x, y) ≤ 1 − ‖Q(x, ·) − Q(y, ·)‖_TV, where ‖Q(x, ·) − Q(y, ·)‖_TV = sup_{A ∈ B} |Q(x, A) − Q(y, A)|.

See Thorisson [2000, chap. 1.4] or Levin et al. [2017, chap. 4] for discussion of this bound and its applications. For any probability distributions µ and ν on (R^d, B), we write Γ_max(µ, ν) for the set of all maximal couplings of µ and ν. Maximal couplings of the proposal distribution make an appealing starting point, but note that their use is neither necessary nor sufficient to maximize P(X = Y | x, y). Gerber and Lee [2020] also observe that the variance of the computational cost of drawing from a maximal coupling can blow up as r = ‖y − x‖ → 0.

Lemma 3.1.
Let Q̄((x, y), ·) be a maximal coupling of Q(x, ·) and Q(y, ·), distributions with densities q(x, ·) and q(y, ·) on R^d. If (x̃, ỹ) ∼ Q̄((x, y), ·), then for all A ∈ B,

P(x̃ ∈ A, x̃ = ỹ | x, y) = P(ỹ ∈ A, x̃ = ỹ | x, y) = ∫_A q(x, z) ∧ q(y, z) dz,
P(x̃ ∈ A, x̃ ≠ ỹ | x, y) = ∫_A 0 ∨ (q(x, z) − q(y, z)) dz,
P(ỹ ∈ A, x̃ ≠ ỹ | x, y) = ∫_A 0 ∨ (q(y, z) − q(x, z)) dz.

We can obtain P(x̃ = ỹ | x, y) by evaluating the first equation at A = R^d. This meeting probability takes a particularly simple form for multivariate normal distributions, as we see in the following extension of Pollard [2005, chap. 3.3]:

Lemma 3.2.
If (x̃, ỹ) follows any maximal coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d), then

P(x̃ = ỹ | x, y) = P(χ²₁ ≥ ‖y − x‖² / (4σ_d²)).

Proof. Recall that we write N(z; µ, Σ) for the density of N(µ, Σ) and that we have defined m = (y + x)/2, r = ‖y − x‖, e = (y − x)/r, z_e = e^⊤ z, and z_⊥ = (I_d − ee^⊤) z for all z ∈ R^d. In general, N(z; µ, σ² I_d) = N(z_e; µ_e, σ²) N(z_⊥; µ_⊥, σ² I_{d−1}). We can also decompose x and y into e and e^⊥ parts according to x = m − e r/2 = e(m_e − r/2) + m_⊥ and y = m + e r/2 = e(m_e + r/2) + m_⊥. Combining these expressions with Lemma 3.1 yields the desired conclusion:

P(x̃ = ỹ | x, y) = ∫ N(z; x, σ_d² I_d) ∧ N(z; y, σ_d² I_d) dz
= ∫∫ [N(z_e; m_e − r/2, σ_d²) ∧ N(z_e; m_e + r/2, σ_d²)] N(z_⊥; m_⊥, σ_d² I_{d−1}) dz_⊥ dz_e
= ∫_{−∞}^{∞} N(z_e; −r/2, σ_d²) ∧ N(z_e; r/2, σ_d²) dz_e
= 2 ∫_0^∞ N(z_e; −r/2, σ_d²) dz_e = 2 P(N(0, σ_d²) ≥ r/2) = P(χ²₁ ≥ r²/(4σ_d²)).

An important implication of Lemma 3.2 is that as we increase the dimension d, the separation r = ‖y − x‖ needed to hold P(x̃ = ỹ | x, y) constant must vary in proportion to σ_d. Under the typical RWM assumption that σ_d² = ℓ²/d, this means r must shrink at a rate 1/√d to maintain a constant probability of meeting proposals. This inverse square-root condition plays a crucial role in determining the dimension scaling behavior of different couplings, as we will observe in the simulations of Section 6. Note that the meeting probability derived in Lemma 3.2 also admits the following useful inequalities:

Lemma 3.3.
Under any maximal coupling of x̃ ∼ N(x, σ_d² I_d) and ỹ ∼ N(y, σ_d² I_d), we have

1 − ‖y − x‖ / (√(2π) σ_d) ≤ P(x̃ = ỹ | x, y) ≤ 4σ_d² / ‖y − x‖²

and

P(x̃ = ỹ | x, y) ≤ (1 − 2s)^{−1/2} exp(−s ‖y − x‖² / (4σ_d²)) for s ∈ (0, 1/2).

Proof.
For the lower bound, let φ(z) = N(z; 0, 1) be the standard normal density and let Φ(z) be the corresponding cumulative distribution function. Since Φ(0) = 1/2 and φ(z) ≤ 1/√(2π) for all z, for a > 0 we have Φ(a) = ∫_{−∞}^a φ(z) dz ≤ 1/2 + a/√(2π). This expression rearranges to 1 − √(2/π) a ≤ 2(1 − Φ(a)), and plugging in a = r/(2σ_d) yields the desired lower bound. The first upper bound follows directly from Markov’s inequality: P(x̃ = ỹ | x, y) = P(χ²₁ ≥ r²/(4σ_d²)) ≤ E[χ²₁] / (r²/(4σ_d²)) = 4σ_d²/r². The second upper bound is due to Chernoff’s inequality, P(χ²₁ ≥ a) ≤ e^{−sa} E[e^{s χ²₁}] for all s > 0. Here E[e^{s χ²₁}] = 1/√(1 − 2s) for s < 1/2, so plugging in a = r²/(4σ_d²) yields the desired expression.

Figure 2: Meeting probability and bounds for maximal couplings of normal distributions on R. The red line (‘Exact’) gives P(x̃ = ỹ | x, y) when (x̃, ỹ) follows a maximal coupling of N(x, 1) and N(y, 1), as a function of r = ‖y − x‖. The exact meeting probabilities involve the complementary CDF of the χ²₁ distribution, so it can be analytically more convenient to use the given bounds.

In Figure 2 we plot the value of P(x̃ = ỹ | x, y) as derived in Lemma 3.2 along with the upper and lower bounds from Lemma 3.3. We observe that while the lower bound is tight at r = 0, the upper bounds only become tight in the limit as r → ∞. When needed, sharper upper and lower bounds can be obtained from more precise Gaussian tail inequalities, see e.g. Abramowitz et al. [1988, chap. 7] and Duembgen [2010].

We close by noting that if we can produce draws from one maximal coupling, we can often transform these into draws from a maximal coupling of a related pair of distributions. Recall that for any measure µ on (R^d, B) and measurable function f : R^d → R^d, the pushforward measure f⋆µ on (R^d, B) is defined by f⋆µ(A) := µ(f^{−1}(A)) for all A ∈ B. Also, if x ∼ µ then f(x) ∼ f⋆µ. Thus we have the following:

Lemma 3.4.
Suppose µ and ν are probability measures on (R^d, B) and let f : R^d → R^d be a homeomorphism. Then (x̃, ỹ) follows a maximal coupling of µ and ν if and only if (f(x̃), f(ỹ)) follows a maximal coupling of f⋆µ and f⋆ν.

Proof. First, (x̃, ỹ) ∈ Γ(µ, ν) if and only if (f(x̃), f(ỹ)) ∈ Γ(f⋆µ, f⋆ν), since f is a bijection. Also

‖f⋆µ − f⋆ν‖_TV = sup_{A ∈ B} |µ(f^{−1}(A)) − ν(f^{−1}(A))| = sup_{B′ ∈ B′} |µ(B′) − ν(B′)| = ‖µ − ν‖_TV.

Here B′ = {f^{−1}(A) : A ∈ B}, and B′ = B since f is a homeomorphism. Since f is a bijection, we also have P(x̃ = ỹ) = P(f(x̃) = f(ỹ)). Thus (x̃, ỹ) achieves the coupling inequality bound exactly when (f(x̃), f(ỹ)) does, and we have shown that the former pair follows a maximal coupling of µ and ν if and only if the latter follows a maximal coupling of f⋆µ and f⋆ν.

Lemma 3.4 allows us to efficiently draw from and analyze the maximal independent coupling of distributions like N(x, Σ) and N(y, Σ) in terms of the maximal independent coupling of N(0, I_d) and N(Σ^{−1/2}(y − x), I_d). It can also be useful in the design of couplings when the proposal kernel arises from a deterministic but well-behaved function of a multivariate normal random variable, as in the case of Hamiltonian Monte Carlo [Duane et al., 1987, Neal, 1993, 2011].

In this section we describe a range of proposal kernel couplings Q̄ based on the RWM proposal kernel Q(z, ·) = N(z, σ_d² I_d) on R^d. If (x̃, ỹ) = (x + ξ, y + η) ∼ Q̄((x, y), ·), then marginally ξ, η ∼ N(0, σ_d² I_d). These increments can exhibit a complex dependence pattern, and (ξ, η) need not be jointly multivariate normal. The simplest option, however, is the independent coupling, with ξ, η iid ∼ N(0, σ_d² I_d). One step more complex is the synchronous or ‘common random numbers’ coupling, with ξ = η ∼ N(0, σ_d² I_d).
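The elementary couplings of this section (the independent and synchronous couplings above, and the reflection and full-reflection variants defined next) can be sketched as follows. This is a minimal illustration under our own function names, not a reference implementation; it simply draws a coupled pair of increments (ξ, η), each marginally N(0, σ² I_d).

```python
import math
import random

def reflect(xi, e):
    """Householder reflection (I - 2 e e^T) xi: flips the e-component of xi
    and leaves the component orthogonal to e unchanged."""
    c = 2.0 * sum(a * b for a, b in zip(e, xi))
    return [a - c * b for a, b in zip(xi, e)]

def increment_pair(x, y, sigma, kind, rng=random):
    """Draw coupled RWM proposal increments (xi, eta), each marginally
    N(0, sigma^2 I_d), under one of the four elementary couplings."""
    d = len(x)
    xi = [rng.gauss(0.0, sigma) for _ in range(d)]
    if kind == "independent":
        eta = [rng.gauss(0.0, sigma) for _ in range(d)]
    elif kind == "synchronous":        # common random numbers: eta = xi
        eta = list(xi)
    elif kind == "reflection":         # eta = (I - 2 e e^T) xi
        r = math.dist(x, y)
        e = [(b - a) / r for a, b in zip(x, y)]
        eta = reflect(xi, e)
    elif kind == "full-reflection":    # eta = -xi
        eta = [-a for a in xi]
    else:
        raise ValueError(kind)
    return xi, eta
```

Note that all four choices return η as a deterministic or independent function of ξ, so if x ≠ y the proposals x + ξ and y + η coincide with probability zero, consistent with the remark below that these couplings alone cannot be successful.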
As noted in Givens and Shortt [1984] and Knott and Smith [1984], the synchronous coupling minimizes the expected squared distance E[‖x̃ − ỹ‖²] among all joint distributions with x̃ ∼ N(x, σ_d² I_d) and ỹ ∼ N(y, σ_d² I_d). We comment further on optimal transport couplings below.

Another slightly more complex option is the simple reflection coupling, in which ξ ∼ N(0, σ_d² I_d) and η = (I_d − 2ee^⊤) ξ. With the notation z_e = e^⊤ z and z_⊥ = (I_d − ee^⊤) z for any z ∈ R^d, we note that the reflection coupling yields η_e = −ξ_e and η_⊥ = ξ_⊥. Thus η is the reflection of ξ over the hyperplane H = {z : ‖z − x‖ = ‖z − y‖} = {z : z_e = m_e}. Taking this geometric logic a step further, we can also consider the full-reflection coupling in which ξ ∼ N(0, σ_d² I_d) and η = −ξ. This coupling maximizes E[‖ỹ − x̃‖²] just as the synchronous coupling minimizes it.

The independent, synchronous, reflection, and full-reflection couplings are easy to draw from and straightforward to analyze. They also differ dramatically in the covariance and transport properties that they establish between x̃ and ỹ, their interactions with various accept/reject procedures, and thus the degree of contraction they produce between coupled chains. However, each of these couplings has the property that if x ≠ y then x̃ ≠ ỹ almost surely. This implies X ≠ Y, so exclusive reliance on these couplings cannot yield P(τ < ∞) = 1 unless X_0 = Y_0.

Algorithm 1 Draw from the maximal independent coupling of Q(x, ·) and Q(y, ·)
1. Draw x̃ ∼ Q(x, ·) and W_x ∼ Unif(0, 1)
2. If W_x q(x, x̃) ≤ q(y, x̃), set ỹ = x̃
3. Else:
   (a) Draw ỹ′ ∼ Q(y, ·) and W_y ∼ Unif(0, 1)
   (b) If W_y q(y, ỹ′) > q(x, ỹ′), set ỹ = ỹ′
   (c) Else go to 3(a)
4. Return (x̃, ỹ)

Suppose (x̃, ỹ) ∼ Q̄((x, y), ·) for some Q̄ ∈ Γ_max(Q, Q). One consequence of Lemma 3.1 is that all maximal couplings exhibit the same distribution of x̃ and ỹ given x̃ = ỹ. In particular, each of these variables will have conditional density q_mxy(z) := [q(x, z) ∧ q(y, z)] / ∫ q(x, w) ∧ q(y, w) dw. We refer to the distributions of x̃ and ỹ given x̃ ≠ ỹ as the residuals of Q̄((x, y), ·). In light of the above, we differentiate between various maximal couplings according to the behavior of these residuals, i.e. according to the distribution of (x̃, ỹ) conditional on x̃ ≠ ỹ.

The first and perhaps most famous maximal coupling was introduced by Vaserstein [1969] and termed the γ-coupling by Lindvall [1992]. It is the unique maximal coupling with the property that x̃ and ỹ are independent when x̃ ≠ ỹ. Thus we call this the maximal coupling with independent residuals, or simply the maximal independent coupling. When Q(z, ·) = N(z, σ_d² I_d) for z ∈ R^d, Lemmas 3.1 and 3.2 imply that this coupling approximates the independent coupling of Q(x, ·) and Q(y, ·) as r = ‖y − x‖ → ∞.

The references above prove that one can draw from the maximal independent coupling by using the rejection sampling procedure described in Algorithm 1. This method is simple and versatile, although it suffers from a loss of efficiency as a function of dimension. In our setting, each normal density evaluation requires O(d) computations, and these costs can be a factor in algorithmic performance in high dimensions or when the number of iterations required to obtain a valid ỹ draw is large. Algorithm 2 offers an alternative, which exploits the symmetries and factorization properties of the multivariate normal distribution.
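As a concrete sketch, Algorithm 1 can be written as follows for one-dimensional normal proposals. The function names are ours; the meeting frequency can be checked against Lemma 3.2, under which P(x̃ = ỹ | x, y) = P(χ²₁ ≥ r²/(4σ²)) = erfc(r/(2σ√2)).

```python
import math
import random

def npdf(z, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def maximal_independent(x, y, sigma, rng=random):
    """Algorithm 1 for Q(z, .) = N(z, sigma^2): rejection sampler for the
    maximal coupling with independent residuals (Vaserstein's gamma-coupling)."""
    # Steps 1-2: propose from Q(x, .) and accept a meeting w.p. min(1, q(y,.)/q(x,.))
    xt = rng.gauss(x, sigma)
    if rng.random() * npdf(xt, x, sigma) <= npdf(xt, y, sigma):
        return xt, xt                  # proposals meet: x-tilde = y-tilde
    # Step 3: otherwise draw y-tilde from the residual of Q(y, .) by rejection
    while True:
        yt = rng.gauss(y, sigma)
        if rng.random() * npdf(yt, y, sigma) > npdf(yt, x, sigma):
            return xt, yt              # proposals do not meet
```

The d-dimensional Algorithm 2, discussed next, reduces the number of full-dimensional density evaluations by working along the e-axis.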
It provides a more efficient way to draw from the maximal coupling of these distributions, and it also lends itself to extensions and variations as we consider below. Lemma 4.1 establishes the validity of this algorithm.

Algorithm 2 Draw from the maximal independent coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d)
1. Compute e = (y − x)/‖y − x‖, m = (y + x)/2, x_e = e^⊤ x, and y_e = e^⊤ y
2. Draw (x̃_e, ỹ_e) from the maximal independent coupling of N(x_e, σ_d²) and N(y_e, σ_d²) using Algorithm 1
3. Independently draw x̃′ ∼ N(x, σ_d² I_d) and ỹ′ ∼ N(y, σ_d² I_d)
4. Set x̃_⊥ = (I_d − ee^⊤) x̃′ and ỹ_⊥ = (I_d − ee^⊤) ỹ′
5. Set x̃ = e x̃_e + x̃_⊥. If x̃_e = ỹ_e set ỹ = e ỹ_e + x̃_⊥, else set ỹ = e ỹ_e + ỹ_⊥
6. Return (x̃, ỹ)

Lemma 4.1. The output of Algorithm 2 is distributed according to the maximal independent coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d).

Proof. First we show that the output (x̃, ỹ) of Algorithm 2 follows a coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d). For x̃ we have x̃_e ∼ N(e^⊤ x, σ_d²) and x̃_⊥ ∼ N((I_d − ee^⊤) x, (I_d − ee^⊤) σ_d²) with independence between x̃_e and x̃_⊥. Thus x̃ = e x̃_e + x̃_⊥ ∼ N(x, σ_d² I_d). For ỹ, note that ỹ_⊥ ∼ N(y_⊥, (I_d − ee^⊤) σ_d²) whether or not x̃_e = ỹ_e. This is trivial when x̃_e ≠ ỹ_e. When x̃_e = ỹ_e we have

E[ỹ_⊥ | x, y, x̃_e = ỹ_e] = (I_d − ee^⊤) x = (I_d − ee^⊤)(m − (r/2) e) = (I_d − ee^⊤) m = (I_d − ee^⊤)(m + (r/2) e) = y_⊥.

Also, ỹ_e ∼ N(e^⊤ y, σ_d²) and ỹ_⊥ are independent, so we conclude ỹ ∼ N(y, σ_d² I_d).

Next, we show that (x̃, ỹ) follows a maximal coupling. By Lemma 3.2, draws from a maximal coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d) must meet with probability P(χ²₁ ≥ ‖y − x‖²/(4σ_d²)). By construction we have x̃ = ỹ if and only if x̃_e = ỹ_e. Applying Lemma 3.2 to the maximal coupling of N(x_e, σ_d²) and N(y_e, σ_d²) shows that meeting occurs with probability P(χ²₁ ≥ (y_e − x_e)²/(4σ_d²)). We also have y_e − x_e = e^⊤(y − x) = (y − x)^⊤(y − x)/‖y − x‖ = ‖y − x‖, so meeting occurs at the maximal rate.

Finally, we observe that x̃ and ỹ are independent conditional on x̃ ≠ ỹ. This holds for x̃_e and ỹ_e since these are drawn from a maximal independent coupling on R, and it holds for x̃_⊥ and ỹ_⊥ since in the relevant case these are defined using independent random variables. Thus Algorithm 2 produces draws from the maximal independent coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d).

Overall, the meeting time associated with a transition kernel coupling depends on that coupling’s probability of producing a meeting at each step together with the dynamics of the chains conditional on not meeting. It is often a good idea to control the variance of ỹ − x̃ when x̃ ≠ ỹ, to reduce the tendency of the chains to push apart when meeting does not occur. This motivates what we call the maximal coupling with semi-independent residuals, or the maximal semi-independent coupling, which we define in Algorithm 3. This algorithm differs from the maximal independent coupling in that it has x̃_⊥ = ỹ_⊥ whether or not x̃_e = ỹ_e. The validity of this algorithm follows from essentially the same argument as that of Lemma 4.1.

Algorithm 3 Draw from the maximal semi-independent coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d)
1. Compute e = (y − x)/‖y − x‖, m = (y + x)/2, x_e = e^⊤ x, and y_e = e^⊤ y
2. Draw (x̃_e, ỹ_e) from the maximal independent coupling of N(x_e, σ_d²) and N(y_e, σ_d²) using Algorithm 1
3. Draw z̃ ∼ N(m, σ_d² I_d), set z_⊥ = (I_d − ee^⊤) z̃, x̃ = e x̃_e + z_⊥, and ỹ = e ỹ_e + z_⊥
4. Return (x̃, ỹ)

Algorithm 4 Draw from the maximal optimal transport coupling of N(x, σ²) and N(y, σ²)
1. Draw x̃ ∼ N(x, σ²) and W_x ∼ Unif(0, 1)
2. If W_x N(x̃; x, σ²) ≤ N(x̃; y, σ²), set ỹ = x̃
3. Else set ỹ = t_xy(x̃) using the transport map t_xy as defined in Lemma 4.2
4. Return (x̃, ỹ)

As noted above, the joint distribution of (x̃, ỹ) given x̃ ≠ ỹ plays an important role in determining the distribution of meeting times. This is especially important since Lemma 3.2 shows that the probability P(x̃ = ỹ | x, y) of meeting proposals must be small until the chains are relatively close. Thus it is natural to consider not just ways to limit the variance of ỹ − x̃ when x̃ ≠ ỹ, but methods for making this quantity as small as possible.

Given a metric δ on R^d, we say that Q̄((x, y), ·) ∈ Γ(Q(x, ·), Q(y, ·)) is an optimal transport coupling if Q̄((x, y), ·) minimizes E_{(x̃, ỹ) ∼ Q̃}[δ(x̃, ỹ)] among all couplings Q̃ ∈ Γ(Q(x, ·), Q(y, ·)). In this study we set δ(x̃, ỹ) = ‖ỹ − x̃‖². Below, we show how to construct an optimal transport coupling between the residuals of a maximal coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d). Optimal transport couplings are not usually available in closed form, but the symmetries of the multivariate normal distribution present an opportunity. We begin with the following result in one dimension:

Lemma 4.2.
Suppose Q̄((x, y), ·) ∈ Γ_max(N(x, σ²), N(y, σ²)), and define the residual distributions µ(A) := P(x̃ ∈ A | x̃ ≠ ỹ, x, y) and ν(A) := P(ỹ ∈ A | x̃ ≠ ỹ, x, y), where (x̃, ỹ) ∼ Q̄((x, y), ·) and A ∈ B. Let Φ_x and Φ_y be the cumulative distribution functions of µ and ν on R, and define the transport map t_xy(x̃) := Φ_y^{−1}(Φ_x(x̃)). If x̃ ∼ µ, then (x̃, t_xy(x̃)) is an optimal transport coupling of µ and ν. Also, Φ_x and Φ_y have the functional forms given in the proof below.

Proof.
The main result is due to the cumulative distribution function characterization of optimal transport maps for non-atomic distributions on R; see e.g. Rachev and Rüschendorf [1998, chap. 3.1]. If x̃ ∼ µ and ỹ ∼ ν, then by Lemma 3.1, x̃ and ỹ have the following CDFs:

Φ_x(x̃) = [F_x(x̃ ∧ m) − F_y(x̃ ∧ m)] / [F_x(m) − F_y(m)] if x < y,
Φ_x(x̃) = 1 − [F_x(x̃ ∨ m) − F_y(x̃ ∨ m)] / [F_x(m) − F_y(m)] if x ≥ y,

Φ_y(ỹ) = [F_y(ỹ ∧ m) − F_x(ỹ ∧ m)] / [F_y(m) − F_x(m)] if y < x,
Φ_y(ỹ) = 1 − [F_y(ỹ ∨ m) − F_x(ỹ ∨ m)] / [F_y(m) − F_x(m)] if y ≥ x.

Here m = (x + y)/2 and F_z(·) is the CDF of N(z, σ²) for z ∈ R.

We say that Q̄ is a maximal coupling with optimal transport residuals, or a maximal optimal transport coupling, if Q̄((x, y), ·) ∈ Γ_max(Q(x, ·), Q(y, ·)) and if the residuals of Q̄((x, y), ·) follow an optimal transport coupling. The result above suggests an algorithm for drawing from the maximal optimal transport coupling of one-dimensional normal distributions. See Algorithm 4 for the details of this method and Lemma 4.3 for a proof of its validity.

Algorithm 5 Draw from the maximal optimal transport coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d)
1. Compute e = (y − x)/‖y − x‖, m = (y + x)/2, x_e = e^⊤ x, and y_e = e^⊤ y
2. Draw (x̃_e, ỹ_e) from the maximal optimal transport coupling of N(x_e, σ_d²) and N(y_e, σ_d²) using Algorithm 4
3. Draw z̃ ∼ N(m, σ_d² I_d), set z_⊥ = (I_d − ee^⊤) z̃, x̃ = e x̃_e + z_⊥, and ỹ = e ỹ_e + z_⊥
4. Return (x̃, ỹ)

Lemma 4.3.
The output of Algorithm 4 follows a maximal optimal transport coupling of N(x, σ²) and N(y, σ²).

Proof. We have x̃ ∼ N(x, σ²) by construction. By Lemma 3.1 and the validity of Algorithm 1, we have P(ỹ ∈ A, ỹ = x̃) = ∫_A N(z; y, σ²) ∧ N(z; x, σ²) dz for A ∈ B. Lemmas 3.1 and 4.2 also imply P(ỹ ∈ A, ỹ ≠ x̃) = ∫_A (N(z; y, σ²) − N(z; x, σ²)) ∨ 0 dz for A ∈ B. Together these imply ỹ ∼ N(y, σ²), so (x̃, ỹ) follows some coupling of N(x, σ²) and N(y, σ²). Finally, we note that Algorithm 4 has exactly the same probability of x̃ = ỹ as Algorithm 1 does when Q(z, ·) = N(z, σ²). We know that the coupling implemented in Algorithm 1 is maximal, so we conclude that the present one is as well.

Finally, we combine the result of Lemma 4.3 with the logic of Algorithm 3 to obtain an algorithm for drawing from the maximal coupling with optimal transport residuals on R^d. See Algorithm 5 for a statement of this method and Lemma 4.4 for a proof of its validity.

Lemma 4.4.
The output of Algorithm 5 follows a maximal optimal transport coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d).

Proof. The proof that (x̃, ỹ) follows a maximal coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d) is almost identical to the argument of Lemma 4.1, except we now use the same draw for ỹ_⊥ = x̃_⊥ rather than independent draws x̃_⊥, ỹ_⊥ ∼ N(m_⊥, (I_d − ee^⊤) σ_d²). To see that (x̃, ỹ) follows an optimal transport coupling conditional on x̃ ≠ ỹ, we apply Theorem 2.1 of Knott and Smith [1984]. That result says that if we can write ỹ = T_xy(x̃) such that ỹ has the correct distribution and ∂T_xy(x̃)/∂x̃ is symmetric and positive definite, then (x̃, T_xy(x̃)) is an optimal transport coupling for δ(x̃, ỹ) = ‖ỹ − x̃‖². In this case we have ỹ = T_xy(x̃) = e t_xy(x̃_e) + x̃_⊥. Symmetry follows immediately and positive definiteness follows since t_xy is monotonically increasing.

Another coupling in the spirit of the previous section is the maximal coupling with reflection residuals, also called the maximal reflection coupling. It is the maximal analogue to the reflection coupling defined near the beginning of Section 4, and it has previously been considered in Eberle and Majka [2019], Bou-Rabee et al. [2020], and Jacob et al. [2020]. We say that (x̃, ỹ) ∼ Q̄((x, y), ·) is a maximal reflection coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d) if it is a maximal coupling and if x̃ ≠ ỹ implies η = (I_d − 2ee^⊤) ξ, where we define ξ = x̃ − x and η = ỹ − y. When x̃ = ỹ, the maximal reflection coupling yields the same distribution of (x̃, ỹ) as any other maximal coupling, and when x̃ ≠ ỹ it reflects the increments of each chain over the hyperplane equidistant between x and y.

This coupling is related to the reflection coupling of diffusions described in Lindvall and Rogers [1986], Eberle [2011], Hsu and Sturm [2013], and other studies. These continuous-time reflection couplings are sometimes maximal couplings of processes, in the strong sense that they produce the fastest meeting times allowed by the coupling inequality. We will see that using the maximal reflection coupling for RWM proposals also delivers good meeting time performance. In our setting this seems to arise from a felicitous interaction between reflection couplings and the Metropolis accept/reject step. Understanding the analogy between the continuous- and discrete-time settings remains an interesting open question, especially for reflection couplings.

As with the maximal independent and optimal transport couplings, we describe an efficient method for drawing from the maximal reflection coupling. We begin with Algorithm 6, which yields draws from the maximal reflection coupling on R. The validity of this algorithm is established in Bou-Rabee et al. [2020, sec. 2] and Jacob et al. [2020, sec. 4]. Algorithm 7 produces draws from the general form of this coupling on R^d, and we establish the validity of this algorithm in Lemma 4.5. For the algorithm and its validity proof, recall that we have defined z_e := e^⊤ z and z_⊥ = (I_d − ee^⊤) z for any z ∈ R^d.

Algorithm 6 Draw from the maximal reflection coupling of N(x, σ²) and N(y, σ²)
1. Draw x̃ ∼ N(x, σ²) and W_x ∼ Unif(0, 1)
2. If W_x N(x̃; x, σ²) ≤ N(x̃; y, σ²), set ỹ = x̃
3. Else set ξ = x̃ − x, η = −ξ, and ỹ = y + η
4. Return (x̃, ỹ)

Lemma 4.5.
The output (x̃, ỹ) of Algorithm 7 is distributed according to a maximal reflection coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d).

Proof. Essentially the same argument as in Lemmas 4.1 and 4.4 establishes that (x̃, ỹ) follows a maximal coupling. For the reflection condition we recall that y = m + (r/2) e and x = m − (r/2) e, which implies y = e y_e + m_⊥ and x = e x_e + m_⊥. Thus ỹ − y = e (ỹ_e − y_e) + ζ_⊥ and x̃ − x = e (x̃_e − x_e) + ζ_⊥. We also have (I_d − 2ee^⊤) e = −e and (I_d − 2ee^⊤)(I_d − ee^⊤) = I_d − ee^⊤. By the definition of Algorithm 6, ỹ_e − y_e = −(x̃_e − x_e) when x̃_e ≠ ỹ_e. Thus x̃ ≠ ỹ implies

(I_d − 2ee^⊤)(x̃ − x) = −e (x̃_e − x_e) + ζ_⊥ = e (ỹ_e − y_e) + ζ_⊥ = ỹ − y.

Algorithm 7
Draw from the maximal reflection coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d)
1. Compute e = (y − x)/∥y − x∥, m = (y + x)/2, x_∥ = e^⊤ x, y_∥ = e^⊤ y, and m_⊥ = (I_d − ee^⊤)m
2. Draw (x′_∥, y′_∥) from the maximal reflection coupling of N(x_∥, σ_d²) and N(y_∥, σ_d²), by the method of Algorithm 6
3. Draw ζ ∼ N(0, σ_d² I_d) and set X′ = x′_∥ e + m_⊥ + ζ_⊥ and Y′ = y′_∥ e + m_⊥ + ζ_⊥
4. Return (X′, Y′)

We conclude that (X′, Y′) satisfies the reflection condition when X′ ≠ Y′, and so the output of Algorithm 7 follows a maximal reflection coupling of N(x, σ_d² I_d) and N(y, σ_d² I_d).

It is also possible to choose among the coupling strategies described above – or any valid coupling of the proposal distributions – depending on the current state pair (x, y). As implied by Lemma 3.2, maximal couplings of Gaussian distributions have very little chance of producing x′ = y′ unless r = ∥y − x∥ is relatively small. Thus we can deploy one coupling method such as a maximal coupling when r is below some threshold and a different coupling when r is above it. This is reminiscent of the two-coupling strategies used in Smith [2014], Pillai and Smith [2017], and Bou-Rabee et al. [2020]. This approach can be deployed to produce faster meeting between chains. It can also simplify some theoretical arguments since, for example, the simple reflection coupling is easier to analyze than its maximal counterpart.

Recall that we write P for the MH transition kernel generated by the proposal kernel Q and acceptance rate function a. We construct our MH kernel coupling P̄ ∈ Γ(P, P) as follows. First, we draw (X_0, Y_0) such that X_0, Y_0 ∼ π_0 for some arbitrary initial distribution π_0. We begin each iteration t by drawing proposals (x′, y′) ∼ Q̄((x, y), ·), where (x, y) = (X_t, Y_t). Then we draw acceptance indicators (b_x, b_y) from some joint distribution B̄((x, y), (x′, y′)) on {0, 1}². Finally we set (X_{t+1}, Y_{t+1}) = (X, Y), where X = b_x x′ + (1 − b_x) x and Y = b_y y′ + (1 − b_y) y. For any x, y ∈ R^d and A ∈ B ⊗ B, we write P̄((x, y), A) := P((X, Y) ∈ A | x, y) for the resulting joint transition distribution.
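Algorithm 7 also admits a compact single-shot formulation that avoids the explicit parallel/perpendicular decomposition: draw the increment for one chain, accept the induced meeting with a density-ratio probability, and otherwise reflect the increment. The sketch below follows the construction described in Bou-Rabee et al. [2020] and Jacob et al. [2020]; the function name and interface are our own.

```python
import numpy as np

def reflection_maximal(x, y, sigma, rng):
    """One draw (X', Y') from the maximal reflection coupling of
    N(x, sigma^2 I_d) and N(y, sigma^2 I_d).  A sketch of the single-shot
    construction (cf. Bou-Rabee et al. 2020; Jacob et al. 2020)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = rng.standard_normal(x.shape)
    if np.array_equal(x, y):
        return x + sigma * z, y + sigma * z   # chains already agree: move together
    xp = x + sigma * z                        # X' = x + sigma Z with Z ~ N(0, I_d)
    delta = (x - y) / sigma                   # scaled displacement between states
    e = delta / np.linalg.norm(delta)
    # Propose a meeting with probability min(1, phi(Z + delta) / phi(Z));
    # for standard normal densities the log ratio simplifies as below.
    log_ratio = -0.5 * np.dot(z + delta, z + delta) + 0.5 * np.dot(z, z)
    if np.log(rng.uniform()) < log_ratio:
        return xp, xp                         # proposals meet: Y' = X'
    # Otherwise reflect the increment over the hyperplane orthogonal to e.
    yp = y + sigma * (z - 2.0 * np.dot(e, z) * e)
    return xp, yp
```

The meeting branch is the usual maximal-coupling rejection step, and in the non-meeting branch Y′ − y is the reflection of X′ − x over the hyperplane orthogonal to e, matching the reflection condition of Lemma 4.5.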
We want P̄((x, y), ·) ∈ Γ(P(x, ·), P(y, ·)), which implies a few constraints on the acceptance indicator coupling B̄((x, y), (x′, y′)).

Given any mapping B̄ from current state and proposal pairs (x, y), (x′, y′) to probabilities on {0, 1}², we can define joint acceptance rate functions a_x((x, y), (x′, y′)) := P(b_x = 1 | x, y, x′, y′) and a_y((x, y), (x′, y′)) := P(b_y = 1 | x, y, x′, y′), where (b_x, b_y) ∼ B̄((x, y), (x′, y′)). These definitions make B̄((x, y), (x′, y′)) a coupling of Bern(a_x((x, y), (x′, y′))) and Bern(a_y((x, y), (x′, y′))). We want the transition pair (X, Y) defined above to satisfy X ∼ P(x, ·) and Y ∼ P(y, ·) conditional on (x, y). In O'Leary and Wang [2021], this is shown to hold if (x′, y′) ∼ Q̄((x, y), ·) for some Q̄ ∈ Γ(Q, Q) and if

P(b_x = 1 | x, y, x′) = E[a_x((x, y), (x′, y′)) | x, y, x′] = a(x, x′) for Q(x, ·)-almost all x′
P(b_y = 1 | x, y, y′) = E[a_y((x, y), (x′, y′)) | x, y, y′] = a(y, y′) for Q(y, ·)-almost all y′.

These conditions are intuitive, but they allow for relatively complicated forms of a_x, a_y, and B̄. For example, this flexibility is used in O'Leary et al. [2020] to formulate an acceptance indicator coupling B̄ which yields a maximal transition kernel coupling P̄ ∈ Γ(P, P) any time it is used with a maximal proposal coupling Q̄ ∈ Γ(Q, Q). For now, we focus on acceptance indicator couplings with a_x((x, y), (x′, y′)) = a(x, x′) and a_y((x, y), (x′, y′)) = a(y, y′). The resulting acceptance couplings take a simple form, as described in the following result.

Lemma 5.1.
Suppose (b_x, b_y) ∼ B̄((x, y), (x′, y′)) ∈ Γ(Bern(a(x, x′)), Bern(a(y, y′))) for state pairs (x, y), (x′, y′) ∈ R^d × R^d. Then for some ρ_xy ∈ [0 ∨ (a(x, x′) + a(y, y′) − 1), a(x, x′) ∧ a(y, y′)],

P(b_x = 1, b_y = 1) = ρ_xy
P(b_x = 1, b_y = 0) = a(x, x′) − ρ_xy
P(b_x = 0, b_y = 1) = a(y, y′) − ρ_xy
P(b_x = 0, b_y = 0) = 1 − a(x, x′) − a(y, y′) + ρ_xy.

Proof.
Set ρ_xy := P(b_x = 1, b_y = 1). The values for P(b_x = 1, b_y = 0) and P(b_x = 0, b_y = 1) follow from the margin conditions, and then the value of P(b_x = 0, b_y = 0) follows from the requirement that Σ_{i,j ∈ {0,1}} P(b_x = i, b_y = j) = 1. The constraints on ρ_xy follow from the requirement that all of these joint probabilities must fall in [0, 1].

Independent indicators b_x ∼ Bern(a(x, x′)) and b_y ∼ Bern(a(y, y′)) imply ρ_xy = a(x, x′) a(y, y′). This satisfies the given bounds, since ρ_xy = a(x, x′) a(y, y′) ≤ a(x, x′) ∧ a(y, y′) and ρ_xy = a(x, x′) a(y, y′) ≥ a(x, x′) + a(y, y′) − 1, because (1 − a(x, x′))(1 − a(y, y′)) ≥ 0.

The chains can only meet at the next step if a meeting x′ = y′ is proposed and both proposals are accepted. This suggests that we should maximize the probability of (b_x, b_y) = (1, 1) by choosing ρ_xy = a(x, x′) ∧ a(y, y′). By Lemma 5.1, this also maximizes the probability of (b_x, b_y) = (0, 0) and of b_x = b_y. Thus, this ρ_xy corresponds to using the maximal coupling of Bern(a(x, x′)) and Bern(a(y, y′)), which is unique in this case. Simulation results suggest that some couplings tend to produce contraction between chains when both proposals are accepted, no change in the separation between chains when both are rejected, and an increase in separation when one proposal is accepted and the other is rejected. This further argues for drawing (b_x, b_y) from its maximal coupling conditional on (x, y) and (x′, y′).

Write A △ B = (A \ B) ∪ (B \ A) for the symmetric difference of A, B ∈ B. As described in Section 1, it is convenient to describe acceptance indicators and their couplings in terms of uniform random variables. In particular we have the following:
Lemma 5.2.
Fix a_x, a_y ∈ [0, 1]. Then B̄ ∈ Γ(Bern(a_x), Bern(a_y)) if and only if there exists a coupling Ū ∈ Γ(Unif, Unif) such that (b_x, b_y) ∼ B̄ for b_x = 1(U ≤ a_x) and b_y = 1(V ≤ a_y), where (U, V) ∼ Ū. In particular, P(b_x = b_y | x, y, x′, y′) is maximized when V = U and minimized when V = 1 − U.

Proof.
Suppose Ū ∈ Γ(Unif, Unif) and (b_x, b_y) are defined as in the statement above. Then b_x ∼ Bern(a_x), since P(b_x = 1) = P(U ≤ a_x) = a_x, and similarly for b_y. Thus the law of (b_x, b_y) is a coupling of Bern(a_x) and Bern(a_y). For the converse, by Lemma 5.1 any coupling B̄ will be characterized by ρ = P(b_x = b_y = 1). Thus we must find a coupling Ū ∈ Γ(Unif, Unif) such that if (U, V) ∼ Ū then P(U ≤ a_x, V ≤ a_y) = ρ. One such coupling is the distribution on [0, 1]² with density

f(u, v) = ρ / (a_x a_y)                                 if u ≤ a_x, v ≤ a_y
          (a_x − ρ) / (a_x (1 − a_y))                   if u ≤ a_x, v > a_y
          (a_y − ρ) / ((1 − a_x) a_y)                   if u > a_x, v ≤ a_y
          (1 − a_x − a_y + ρ) / ((1 − a_x)(1 − a_y))    if u > a_x, v > a_y.

Note that when U = V ∼ Unif, we obtain P(b_x = b_y = 1) = a_x ∧ a_y, the maximal value of ρ, and when V = 1 − U with U ∼ Unif we achieve P(b_x = b_y = 1) = 0 ∨ (a_x + a_y − 1), the minimal value of ρ. By Lemma 5.1, the probability of b_x = b_y is maximized when ρ is maximized and minimized when ρ is minimized.

While the acceptance indicator couplings described above are appealing in their simplicity, we may wonder if we can do better by adapting our choice of ρ_xy to the current state pair (x, y) or the proposals (x′, y′). In particular, we consider an 'optimal transport' approach to selecting ρ_xy, in which we aim to minimize the expected distance between the states X = b_x x′ + (1 − b_x) x and Y = b_y y′ + (1 − b_y) y after the joint accept/reject step. For each x, y, x′, y′ ∈ R^d, we solve

min_ρ { E[δ(X, Y) | x, y, x′, y′] : ρ ∈ [0 ∨ (a(x, x′) + a(y, y′) − 1), a(x, x′) ∧ a(y, y′)] }.

As above, we set δ(X, Y) = ∥Y − X∥. Note that this is a linear program with linear constraints in ρ, so for typical proposal couplings Q̄ the above will almost surely have a solution at either ρ = 0 ∨ (a(x, x′) + a(y, y′) − 1) or ρ = a(x, x′) ∧ a(y, y′). Qualitatively, the lower-bound solution will be optimal when δ(x, y′) and δ(x′, y) are small relative to δ(x, y) and δ(x′, y′). Below, we see that this is uncommon for the proposal distributions we consider.

We now consider a set of simulations on the relationship between coupling design and meeting times, with a focus on the role of dimension. High-dimensional target distributions are a common challenge in applications of MCMC, and previous studies such as Jacob et al. [2020] suggest that apparently similar couplings can produce unexpected and sometimes dramatic differences in meeting behavior with increasing dimension d. We expect that theory will someday provide definitive guidance on the design of couplings that scale well with dimension. For now, simulations like the following suggest the use of some couplings over others and offer a range of hypotheses for further analysis.

Our simulations target the standard multivariate normal distribution π_d = N(0, I_d) over a range of dimensions d. We focus on the RWM algorithm with proposal kernel Q(x, ·) = N(x, σ_d² I_d) with σ_d² = ℓ²/d and ℓ = 2.
38. This form of proposal variance yields an acceptance rate converging to the familiar 0.234 value as d → ∞, and diffusion limit arguments suggest that this choice produces rapid or even optimal mixing behavior at stationarity [Gelman et al., 1996, Roberts et al., 1997] and in the transient phase [Christensen et al., 2005, Jourdain et al., 2014].

Table 1: Average meeting times (1000 replications each, d = 10)

                                     Acceptance Coupling
                            V = U         Independent       V = 1 − U
Proposal Coupling          Avg τ  S.E.    Avg τ  S.E.      Avg τ  S.E.
Maximal Reflection           30   0.8       51   1.4         68   2.0
Maximal Semi-Independent     54   1.5       85   2.4        105   3.3
Maximal Optimal Transport   104   3.0      155   4.6        183   5.7
Maximal Independent         279   8.5      302   9.4        354  11.2

Meeting times typically depend on both the coupling and the marginal behavior of each chain. Our multivariate normal setting is a particularly simple one, but it allows us to isolate the effect of each coupling decision without concern about the mixing behavior of the marginal kernel.

We begin by considering a full grid of proposal and acceptance coupling combinations in dimension d = 10. We initialize each chain using an independent draw from the target distribution. We then iterate until meeting occurs and record the observed meeting time over 1000 replications with each pair of coupling options. The averages and standard errors of these meeting times τ appear in Table 1. We will consider each of these options in more detail below, but for now it is important to note a few facts. First, even in a low dimension like d = 10, we already observe more than an order of magnitude difference in the meeting times associated with the best and worst coupling combinations.
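For concreteness, the best-performing cell of Table 1 (maximal reflection proposals with the V = U acceptance coupling) can be simulated in a few lines. The sketch below is our own minimal reimplementation of the setup just described, not the code behind Table 1; the helper names are ours.

```python
import numpy as np

def reflection_maximal(x, y, sigma, rng):
    """Maximal reflection coupling of N(x, sigma^2 I_d) and N(y, sigma^2 I_d),
    in single-shot form (cf. Bou-Rabee et al. 2020; Jacob et al. 2020)."""
    z = rng.standard_normal(x.shape)
    if np.array_equal(x, y):
        return x + sigma * z, y + sigma * z
    xp = x + sigma * z
    delta = (x - y) / sigma
    e = delta / np.linalg.norm(delta)
    log_ratio = -0.5 * np.dot(z + delta, z + delta) + 0.5 * np.dot(z, z)
    if np.log(rng.uniform()) < log_ratio:
        return xp, xp                              # proposals meet
    return xp, y + sigma * (z - 2.0 * np.dot(e, z) * e)  # reflected increment

def meeting_time(d=10, ell=2.38, max_iter=10_000, seed=0):
    """One replication: coupled RWM chains targeting N(0, I_d) under the
    maximal reflection proposal coupling and the V = U acceptance coupling.
    Returns the meeting time tau, or None if the chains fail to meet."""
    rng = np.random.default_rng(seed)
    sigma = ell / np.sqrt(d)                       # sigma_d^2 = ell^2 / d
    logpi = lambda v: -0.5 * np.dot(v, v)          # log N(0, I_d) density, up to a constant
    x = rng.standard_normal(d)                     # independent draws from the target
    y = rng.standard_normal(d)
    for t in range(1, max_iter + 1):
        xp, yp = reflection_maximal(x, y, sigma, rng)
        log_u = np.log(rng.uniform())              # common uniform: V = U
        if log_u < logpi(xp) - logpi(x):
            x = xp
        if log_u < logpi(yp) - logpi(y):
            y = yp
        if np.array_equal(x, y):
            return t                               # chains have met
    return None
```

Averaging meeting_time over independent seeds gives an estimate comparable in spirit to the Maximal Reflection / V = U cell of Table 1.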
In the simulations below we generally evaluate high-performance couplings over a much wider range of dimensions than low-performance couplings to avoid unnecessary computational expense.

Second, among the acceptance couplings considered above, the V = U coupling consistently outperforms the independent U, V ∼ Unif coupling, which consistently outperforms the V = 1 − U coupling. These relationships hold for each choice of proposal coupling. A similar relationship exists among the proposal couplings, with the maximal reflection coupling delivering the best meeting times and the maximal independent coupling delivering the worst ones, again for any choice of acceptance indicator coupling. These robust and monotone relations also hold in higher and lower dimensions and with other initialization and proposal variance options. Although the meeting times shown here arise from a complex interplay of proposal and acceptance behavior, these simulations suggest that some options can be regarded as generally better or worse than others.

In this exercise and in many of the simulations below, we initialize chains using independent draws from the target distribution. When π_d = N(0, I_d), the initialization method does not seem to affect the relative performance of the couplings considered below. This may not hold for all targets, e.g. mixture distributions with well-separated modes. The development of couplings and initialization strategies to address especially challenging targets stands as an important topic for future research.

[Figure 3 panels: (a) maximal proposal distribution couplings (Max Independent, Max Optimal Transport, Max Semi-Independent, Max Reflection); (b) maximal and hybrid reflection couplings (Hybrid with r̄ = 0.05, 0.10, 0.15, 0.30, and Maximal).]
Figure 3: Scale behavior of the average meeting time of coupled RWM chains as a function of the dimension of the target distribution π = N(0, I_d), under the proposal distribution coupling options defined in Section 4. The chains are initialized at independent draws from the target and a V = U coupling is used at the Metropolis step. Left: scaling under four maximal couplings of the proposal distribution. Right: scaling under the maximal reflection coupling and simple/maximal reflection coupling hybrids for a range of cutoff parameters r̄.

We now consider the relationship between proposal couplings and meeting times in more detail. As above, we initialize each chain with an independent draw from the target distribution. Here we use the U = V ∼ Unif coupling at the Metropolis step, which maximizes the probability of making the same accept/reject decision for both chains. As illustrated in Table 1, this acceptance coupling seems to produce the fastest meeting times for a range of proposal couplings. For each proposal coupling and dimension, we run 1000 pairs of chains until meeting occurs. The average meeting times from this test appear in Figure 3.

In Figure 3a, we show the average meeting times for the maximal couplings with independent, optimal transport, semi-independent, and reflection residuals, as defined in Section 4. Figure 3b presents the corresponding results for the maximal reflection coupling and for hybrid couplings that deploy the maximal reflection coupling when r = ∥y − x∥ < r̄/√d but use the simple reflection coupling when the chains are further apart. We consider hybrid couplings with a range of values of the cutoff parameter r̄.

These results suggest that meeting times grow exponentially in dimension under the maximal coupling with independent residuals, close to linearly in dimension under the maximal reflection coupling, and somewhere in between for the other two maximal couplings.
The hybrid couplings and the maximal reflection coupling show a similar order of dependence on dimension, and the hybrid couplings display an inverse relationship between average meeting time and r̄. This reflects an increasing number of missed opportunities to meet under the hybrid couplings, since smaller values of r̄ result in more situations in which r ≥ r̄/√d even though r is small enough to produce a reasonable probability of meeting under a maximal coupling.

Any maximal coupling of proposal distributions produces meeting proposals with the same
[Figure 4: time series of average R_t over iterations t = 0 to 2500; legend: Max Reflection, Reflection, Max Semi-Indep, Semi-Indep, Max OT, Sync, Max Indep, Indep.]
Figure 4: Average distance between chains over 1000 replications as a function of the iteration t. Here d = 100, and as in Figure 3 these chains are initialized at independent draws from the target. At the meeting time τ, chains switch to a sticky coupling in which x′ = y′ and U = V. Maximal couplings and their non-maximal equivalents produce similar dynamics until R_t is small enough for there to be a reasonable chance of meeting. This effect is visible for the reflection couplings but not the other options, which stop contracting before the chains are close enough for these effects to make a difference.

probability, as a function of R_t = ∥Y_t − X_t∥. Thus the variation in average τ reflects differences in coupled chain dynamics conditional on not meeting. The degree of contraction between chains seems to play a particularly important role. As noted above, just after Lemma 3.2, the chains must be within a distance r_d = O(1/√d) to maintain a fixed probability of proposed meetings as d increases. Thus the combination of a proposal and acceptance coupling must generate contraction E[R_{t+1} − R_t | X_t, Y_t] < 0 down to separations of order r_d to avoid a fall-off in meeting probability as a function of dimension. The results above suggest that some proposal couplings do this better than others.

To visualize this behavior, we run 1000 pairs of coupled chains under a range of maximal and non-maximal couplings, as described in Section 4. We fix d = 100, initialize chains independently from the target, and use the U = V ∼ Unif coupling at the accept/reject step. We run all pairs of chains for 2500 iterations and use the sticky coupling described in Section 2 to maintain X_t = Y_t for t ≥ τ. Finally, we compute R_t = ∥Y_t − X_t∥ and plot the average distance over replications as a function of the iteration t. See Figure 4 for these results.

The dynamics produced in this exercise provide a compelling explanation for the meeting time behavior observed in Figure 3.
In the absence of meeting, each coupling seems to produce contraction down to a certain degree of separation between chains. For the maximal independent coupling, that separation appears to be almost exactly E[∥Y − X∥], the distance obtained by independent draws from the target distribution. The maximal optimal transport coupling and maximal semi-independent coupling produce contraction to within a smaller radius. The explosive increase in meeting times under these couplings suggests that these critical distances do not keep pace with the O(1/√d) rate noted above. By the same token, the maximal reflection coupling appears to produce sufficient contraction to eventually meet with high probability. Among the four maximal couplings considered here and their four non-maximal counterparts, only the maximal reflection coupling produces a high enough meeting probability to eventually diverge from its non-maximal counterpart.

We can also visualize these differences in drift directly, by creating pairs (X_t, Y_t) with a specific R_t = r, running a single step of the coupled MH kernel, and recording the resulting distance R_{t+1}. We show the output of such a test in Figure 5. Again we set d = 100 and use the U = V ∼ Unif coupling at the acceptance step. We consider a range of r values and initialize (x, y) to have e = (1, 0, . . . , 0), m = (1, . . . , 1), and thus ∥m∥ = √d. In this case we run 10,000 replications for each coupling and r value. Consistent with the results above, we find that the different proposal couplings display a range of contraction behavior as a function of the distance between the chains, although all are contractive when the chains are far apart and repulsive when the chains are close together.
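The one-step drift measurement just described can be sketched as follows. For brevity this sketch uses the simple (non-maximal) reflection proposal coupling with the V = U acceptance coupling, and a random midpoint rather than the fixed e and m of the experiment above; all names are our own.

```python
import numpy as np

def one_step_drift(r, d=100, ell=2.38, n_reps=2_000, seed=0):
    """Monte Carlo estimate of E[R_{t+1} - R_t | R_t = r] for RWM chains
    targeting N(0, I_d), under the simple (non-maximal) reflection proposal
    coupling and the V = U acceptance coupling.  A sketch: the random
    midpoint construction is our own simplification."""
    rng = np.random.default_rng(seed)
    sigma = ell / np.sqrt(d)                    # sigma_d^2 = ell^2 / d
    logpi = lambda v: -0.5 * np.dot(v, v)       # log N(0, I_d) density, up to a constant
    e1 = np.zeros(d)
    e1[0] = 1.0
    drifts = []
    for _ in range(n_reps):
        m = rng.standard_normal(d)              # midpoint, roughly stationary
        x = m + 0.5 * r * e1                    # construct a pair at distance r
        y = m - 0.5 * r * e1
        e = (y - x) / np.linalg.norm(y - x)
        z = rng.standard_normal(d)
        xp = x + sigma * z                      # simple reflection coupling
        yp = y + sigma * (z - 2.0 * np.dot(e, z) * e)
        log_u = np.log(rng.uniform())           # common uniform: V = U
        X = xp if log_u < logpi(xp) - logpi(x) else x
        Y = yp if log_u < logpi(yp) - logpi(y) else y
        drifts.append(np.linalg.norm(Y - X) - r)
    return float(np.mean(drifts))
```

Sweeping r and plotting the estimates reproduces the qualitative shape of the contraction curves discussed here.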
Except for the reflection coupling, the R_t value where each contraction line crosses the x-axis corresponds to the long-run average value of R_t, as one would expect for a chain close to a stable equilibrium around this point.

We conclude by noting that the meeting time, separation, and drift behavior illustrated in the plots above agrees with our expectations in some cases more than others. For instance, it is not surprising that using a maximal coupling with independent residuals in the proposal step produces poor contraction behavior as a function of dimension. Although the proposal variance σ_d² shrinks in d, this coupling produces almost independent values of x′ and y′ conditional on x′ ≠ y′, whose separation can be expected to increase linearly in the number of independent dimensions. Thus the flat line in Figure 4 agrees with intuition.

Each of the other three couplings has the property that x′_⊥ = y′_⊥ even when x′ ≠ y′, which limits the potential for variance from these components as a function of dimension. However, the relative performance of the maximal semi-independent, maximal optimal transport, and maximal reflection couplings is almost the opposite of what one might expect. The optimal transport coupling seems to produce the least contraction in spite of producing the smallest values of ∥x′ − y′∥ conditional on x′ ≠ y′. At the same time, the reflection coupling appears to produce the most contraction despite maximizing the variance of y′ − x′. These differences seem likely to stem from the interaction of these couplings with the acceptance step.

Next we consider couplings of the accept/reject step. As noted in Section 5, we focus on acceptance indicator couplings that accept both chains at exactly the MH rate for any pair of proposal states.
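Lemma 5.2 reduces this class of acceptance indicator couplings to a choice of joint uniforms (U, V). A minimal sketch of the three basic options (common, independent, and antithetic uniforms); the function name and labels are ours:

```python
import numpy as np

def coupled_indicators(ax, ay, method, rng):
    """Draw (b_x, b_y) with marginals Bern(ax) and Bern(ay) from one of the
    three basic uniform couplings of Lemma 5.2.  Labels are ours."""
    u = rng.uniform()
    if method == "common":            # V = U: maximizes P(b_x = b_y)
        v = u
    elif method == "antithetic":      # V = 1 - U: minimizes P(b_x = b_y)
        v = 1.0 - u
    elif method == "independent":     # V drawn independently of U
        v = rng.uniform()
    else:
        raise ValueError(f"unknown method: {method}")
    return int(u <= ax), int(v <= ay)
```

By Lemma 5.1, P(b_x = b_y = 1) equals ax ∧ ay under the common-uniform coupling, ax · ay under the independent coupling, and 0 ∨ (ax + ay − 1) under the antithetic coupling, which is easy to confirm by Monte Carlo.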
In light of Lemma 5.2, it is convenient to define acceptance indicators b_x = 1(U ≤ a(x, x′)) and b_y = 1(V ≤ a(y, y′)) in terms of underlying uniform random variables. We can realize three basic couplings by drawing U ∼ Unif and then either drawing V ∼ Unif independently, setting V = U, or setting V = 1 − U. The V = U coupling maximizes the probability of b_x = b_y, while the V = 1 − U coupling minimizes it. We also consider the
[Figure 5: average R_{t+1} − R_t against R_t; legend: Max Independent, Max Optimal Transport, Max Semi-Independent, Max Reflection.]
Figure 5: Average contraction as a function of the current distance between chains for four maximal couplings. For each point R_t = r we construct a state pair such that ∥Y_t − X_t∥ = r. We then run one RWM iteration with the specified proposal coupling and the U = V acceptance coupling, and record the resulting R_{t+1} = ∥Y_{t+1} − X_{t+1}∥. We compute averages over 10,000 replications for each coupling and r to obtain the curves shown here. The depicted drift behavior appears consistent with the meeting times and time series dynamics of Figures 3 and 4.

'optimal transport' acceptance indicator coupling described in Section 5. We recall that this option almost surely coincides with either the V = U or V = 1 − U coupling at each iteration, depending on which of these minimizes the expected distance E[∥Y − X∥ | x, y, x′, y′].

We present simulation results on these four acceptance step couplings in Figure 6. In each case we use the maximal reflection coupling of proposal distributions and initialize using independent draws from the target. In Figure 6a, we see that the V = U coupling produces meeting times that scale approximately linearly in dimension, while these increase more rapidly under the independent and V = 1 − U couplings. The log-linear plot in Figure 6b suggests that these latter meeting times may still be less than exponential in dimension.

The optimal transport and U = V couplings produce nearly identical results when applied to the maximal reflection coupling of proposal distributions. A closer look at this scenario reveals that the optimal transport coupling coincides with the U = V coupling in all 1000 replications when d > 1, with disagreement arising only in the d = 1 case.
We observe qualitatively identical behavior when the proposals are maximally coupled with optimal transport and semi-independent residuals.

Figure 7 shows that the acceptance step optimal transport coupling displays more complex behavior when the proposal distributions follow a maximal coupling with independent residuals. Here V = 1 − U is optimal in a fraction of iterations going to 50% as d increases. Nevertheless, the resulting meeting times are almost indistinguishable from the meeting times delivered by the V = U and V = 1 − U couplings. This suggests that under the maximal coupling with independent residuals, the rapid growth in meeting times is due to the proposal coupling more than any particular choice of acceptance indicator coupling.

As in the case of proposal couplings, we can also understand the meeting times associated

[Figure 6: average meeting time against dimension for accept/reject methods V = 1 − U, Independent, Optimal Transport, and V = U; (a) linear scale, (b) log scale.]
Figure 6: Scale behavior of the average meeting time of coupled MH chains as a function of the dimension of the target distribution π = N(0, I_d), under the acceptance step couplings defined in Section 5. Here the chains are initialized at independent draws from the target, the proposal distributions are related by a maximal reflection coupling, and averages are taken over 1000 replications for each dimension and coupling. Left: the V = 1 − U and independent U, V methods produce meeting times which grow rapidly in dimension, in contrast to the V = U and optimal transport couplings, which produce approximately linear behavior. Right: a log scale plot suggests that the V = 1 − U and independent methods grow at less than an exponential rate, which would appear as a straight line on this plot.

[Figure 7: (a) rate at which the optimal transport coupling selects V = U, against dimension; (b) average meeting time (log scale) for accept/reject methods V = 1 − U, Independent, Optimal Transport, and V = U.]
Figure 7: Behavior of the optimal transport acceptance step coupling as a function of the dimension of the target distribution π = N(0, I_d). Here the proposal distributions are related by a maximal coupling with independent residuals, and as usual the chains are initialized by independent draws from the target distribution, with 1000 replications per coupling and dimension. Left: the optimal transport coupling coincides with the V = 1 − U coupling at a rate approaching 50% as the dimension d of the target increases. Right: all four acceptance step couplings deliver similar, exponentially growing meeting times under the maximal independent coupling of proposal distributions.
[Figure 8: (a) average separation R_t by iteration t; (b) contraction R_{t+1} − R_t against the distance between chains, for acceptance couplings V = U, Independent, and V = 1 − U.]
Figure 8: Time series and drift function properties of the distance between chains under the three simple acceptance step couplings. Left: we repeat the experiment shown in Figure 4 for the acceptance step couplings. Only the V = U option allows the chains to get close enough to produce a significant probability of meeting. Right: we repeat the experiment shown in Figure 5 for these couplings. The V = U coupling displays more contraction and a smaller x-intercept than the other options. The behavior of this intercept as a function of d has a significant effect on meeting times.

with different acceptance indicator couplings in terms of the contraction between chains. In Figure 8a, we present the average distance between chains under the three simple acceptance indicator couplings. We run all of these using the maximal reflection coupling of proposal distributions. As in the proposal coupling case, we set d = 100, we initialize each chain with an independent draw from the target, and we use a sticky coupling to ensure X_t = Y_t for t ≥ τ. The U = V coupling is able to produce sufficient contraction for meeting to take place, while this seems out of reach for the independent and V = 1 − U couplings.

We also consider the effect of different acceptance couplings on the drift R_{t+1} − R_t as a function of the current distance between chains R_t. We use the maximal reflection coupling at the proposal step, and we run 10,000 replications for each value of R_t and coupling option. As in the case of proposal couplings, the time series behavior of R_t shown in Figure 8a is consistent with the relationship between R_t and E[R_{t+1} − R_t | X_t, Y_t] observed in Figure 8b.

Both of these tests support the impression that the choice of the acceptance coupling has a significant effect on the contraction properties of the resulting chains.
At a high level, it appears that the right combination of proposal and acceptance strategies can lead to powerful contraction between chains down to a point where meeting is reasonably probable under a maximal coupling. The combination of a maximal reflection proposal coupling and a maximal acceptance coupling has this property, while most other combinations do not, leading to a rapid growth in meeting times as a function of dimension.

Discussion
In the sections above we have identified a range of options for use in the design of RWM transition kernel couplings. Our analysis and simulations suggest a few principles for the choice of these elements, which we summarize as follows.

First, the coupling inequality imposes a significant constraint on the ability of any Q̄ ∈ Γ(Q, Q) to propose meetings. This suggests using a maximal or nearly maximal coupling to obtain meetings at the highest rate possible. A hybrid approach may also be practical in some cases. When a meeting is not proposed, it seems advantageous to minimize the degrees of freedom in the displacement y′ − x′ between proposals. These degrees of freedom accumulate in higher dimensions and eventually create a barrier to contraction between chains. This may explain the poor performance of the maximal independent coupling relative to the maximal semi-independent coupling.

Since the probability of a meeting is typically small until the chains are close together, it is important to construct a transition kernel coupling that yields strong and persistent contraction between chains. Surprisingly, the reflection couplings seem to do the best job of this among the proposal options considered above. These couplings do not have good contraction properties on their own, but they seem to set up a favorable interaction with the Metropolis step, especially with the U = V coupling. The precise nature of this interaction is an important open question. For now, it appears safe to recommend the reflection coupling for inducing contraction between chains.

The success of the reflection coupling raises two additional questions. First, we may consider the extent to which this behavior depends on the log-concavity of the target distribution. It seems reasonable to think that this coupling may not work as well with irregular targets.
With log-concave targets like π_d = N(0, I_d), we can also ask how close the MH transition kernel coupling based on a maximal coupling with reflection residuals at the proposal step comes to a maximal coupling with optimal transport residuals of the transition kernels themselves. This question seems amenable to both theoretical and numerical methods.

On the acceptance indicator side, the U = V coupling has a strong a priori appeal. This coupling gives the highest chance of turning a proposed meeting into an actual meeting. It also minimizes the probability of accepting one proposal and rejecting the other, which often leads to a jump in the distance between chains. While the U = V coupling dominates the other options in this study, we recall that we have focused our attention on the subset of acceptance indicator couplings in which the conditional acceptance rates a_x((x, y), (x′, y′)) and a_y((x, y), (x′, y′)) agree with the MH rates a(x, x′) and a(y, y′). The analysis of more general acceptance couplings deserves further attention.

We emphasized the simple case of a multivariate normal target distribution in the simulations above. It would be interesting to know the extent to which our conclusions generalize to more challenging examples such as targets with heavy tails, multi-modality, difficult geometries, and examples in large discrete state spaces. One might also extend the coupling strategies described above to other common MH algorithms such as HMC [Duane et al., 1987, Neal, 1993, 2011], the Metropolis-adjusted Langevin algorithm [Roberts and Tweedie, 1996], and particle MCMC [Andrieu et al., 2010].
We expect that couplings for these extensions would involve some of the same principles as above, but with more moving parts and fewer symmetries to exploit.

Perhaps the most important open questions in the area of coupling design concern the development of theoretical tools to relate proposal and acceptance options to meeting times. Such tools would enable a systematic understanding of the interaction between proposal and acceptance steps. This would also support work on how to pair these to produce as much contraction as possible between chains. One approach might exploit the drift and minorization methods of Rosenthal [1995, 2002], especially the pseudo-small set concept of Roberts and Rosenthal [2001]. The analyses and simulations above mark a step forward in our understanding of the options for coupling MH transition kernels. They suggest that some options might be better than others and hint at why.

Acknowledgements
The author thanks Pierre E. Jacob, Yves Atchadé, and Niloy Biswas for their helpful comments. He also gratefully acknowledges support by the National Science Foundation through grant DMS-1844695.
References
M. Abramowitz, I. A. Stegun, and R. H. Romer. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. American Association of Physics Teachers, 1988. ISBN 0002-9505.

D. Aldous. Random walks on finite groups and rapidly mixing Markov chains. In Séminaire de Probabilités XVII 1981/82, pages 243–297. Springer, 1983.

C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

N. Biswas, P. E. Jacob, and P. Vanetti. Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7391–7401, 2019.

B. Böttcher. Markovian maximal coupling of Markov processes. arXiv preprint arXiv:1710.09654, 2017.

N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. Annals of Applied Probability, 30(3):1209–1250, 2020.

K. Burdzy and W. S. Kendall. Efficient Markovian couplings: examples and counterexamples. Annals of Applied Probability, pages 362–409, 2000.

O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis–Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253–268, 2005.

D. Dey, P. Dutta, and S. Biswas. A note on faithful coupling of Markov chains. arXiv preprint arXiv:1710.10026, 2017.

W. Doeblin. Exposé de la théorie des chaînes simples constantes de Markov à un nombre fini d'états. Mathématique de l'Union Interbalkanique, 2(77-105):78–80, 1938.

R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018.

S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

L. Duembgen. Bounding standard Gaussian tail probabilities. arXiv preprint arXiv:1012.2063, 2010.

D. B. Dunson and J. Johndrow. The Hastings algorithm at fifty. Biometrika, 107(1):1–23, 2020.

A. Eberle. Reflection coupling and Wasserstein contractivity without convexity. Comptes Rendus Mathematique, 349(19-20):1101–1104, 2011.

A. Eberle and M. B. Majka. Quantitative contraction rates for Markov chains on general state spaces. Electronic Journal of Probability, 24, 2019.

J. A. Fill. An interruptible algorithm for perfect sampling via Markov chains. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 688–695, 1997.

J. M. Flegal and R. Herbei. Exact sampling for intractable probability distributions via a Bernoulli factory. Electronic Journal of Statistics, 6:10–37, 2012.

A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. Bayesian Statistics, 5(599-608):42, 1996.

M. Gerber and A. Lee. Discussion on the paper by Jacob, O'Leary, and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):584–585, 2020.

C. R. Givens and R. M. Shortt. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.

P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.

S. Goldstein. Maximal coupling. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 46(2):193–204, 1979.

J. B. Goodman and K. K. Lin. Coupling control variates for Markov chain Monte Carlo. Journal of Computational Physics, 228(19):7127–7136, 2009.

D. Griffeath. A maximal coupling for Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 31(2):95–106, 1975.

T. E. Harris. On chains of infinite order. Pacific Journal of Mathematics, 5(Suppl. 1):707–724, 1955.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

J. Heng and P. E. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287–302, 2019.

E. P. Hsu and K.-T. Sturm. Maximal coupling of Euclidean Brownian motions. Communications in Mathematics and Statistics, 1(1):93–104, 2013.

P. E. Jacob, J. O'Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020.

V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.

V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.

B. Jourdain, T. Lelièvre, and B. Miasojedow. Optimal scaling for the transient phase of Metropolis–Hastings algorithms: The longtime behavior. Bernoulli, 2014. doi: 10.3150/13-BEJ546.

M. Knott and C. S. Smith. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43(1):39–49, 1984.

V. S. A. Kumar and H. Ramesh. Coupling vs. conductance for the Jerrum–Sinclair chain. Random Structures and Algorithms, 2001. doi: 10.1002/1098-2418(200101)18:1<1::AID-RSA1>3.0.CO;2-7.

D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times, volume 107. American Mathematical Soc., 2017. ISBN 1470429624.

T. Lindvall. Lectures on the Coupling Method. Dover Books on Mathematics, 1992. ISBN 0-486-42145-7.

T. Lindvall and L. C. G. Rogers. Coupling of multidimensional diffusions by reflection. The Annals of Probability, 14(3):860–872, 1986.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

R. Neal and R. Pinto. Improving Markov chain Monte Carlo estimators by coupling to an approximating chain. Technical report, Department of Statistics, University of Toronto, 2001.

R. M. Neal. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, pages 475–482, 1993.

R. M. Neal. Circularly-coupled Markov chain sampling. Technical report, Department of Statistics, University of Toronto, 1999.

R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

J. O'Leary and G. Wang. Transition kernel couplings of the Metropolis–Hastings algorithm. arXiv preprint arXiv:2102.00366, 2021.

J. O'Leary, G. Wang, and P. E. Jacob. Maximal couplings of the Metropolis–Hastings algorithm. arXiv preprint arXiv:2010.08573, 2020.

N. S. Pillai and A. Smith. Kac's walk on n-sphere mixes in n log n steps. The Annals of Applied Probability, 27(1):631–650, 2017.

D. Piponi, M. Hoffman, and P. Sountsov. Hamiltonian Monte Carlo swindles. Proceedings of Machine Learning Research, 108:3774–3783, 2020.

J. Pitman. On coupling of Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35(4):315–322, 1976.

D. Pollard. Asymptopia. Yale University, Department of Statistics, 2005.

J. G. Propp and D. B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1-2):223–252, 1996.

S. T. Rachev and L. Rüschendorf. Mass Transportation Problems, volume 1. Springer Science & Business Media, 1998.

G. O. Roberts and J. S. Rosenthal. Small and pseudo-small sets for Markov chains. Stochastic Models, 17(2):121–145, 2001.

G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.

J. S. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995.

J. S. Rosenthal. Faithful couplings of Markov chains: now equals forever. Advances in Applied Mathematics, 18(3):372–381, 1997.

J. S. Rosenthal. Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7:123–128, 2002.

A. Smith. A Gibbs sampler on the n-simplex. The Annals of Applied Probability, 24(1):114–130, 2014.

H. Thorisson. Coupling, Stationarity, and Regeneration, volume 14 of Probability and Its Applications. Springer New York, 2000.

L. N. Vaserstein. Markov processes over denumerable products of spaces, describing large systems of automata.