Transition kernel couplings of the Metropolis–Hastings algorithm
John O’Leary* and Guanyang Wang†

February 3, 2021
Abstract
Couplings play a central role in the analysis of Markov chain convergence to stationarity and in the construction of novel Markov chain Monte Carlo diagnostics, estimators, and variance reduction techniques. The quality of the resulting bounds or methods typically depends on how quickly the coupling induces meeting between chains, a property sometimes referred to as its efficiency. The design of efficient Markovian couplings remains a difficult open question, especially for discrete time processes. In pursuit of this goal, in this paper we fully characterize the couplings of the Metropolis–Hastings (MH) transition kernel, providing necessary and sufficient conditions in terms of the underlying proposal and acceptance distributions. We apply these results to characterize the set of maximal couplings of the MH kernel, resolving open questions posed in O’Leary et al. [2020] on the structure and properties of these couplings. These results represent an advance in the understanding of the MH kernel and a step toward the formulation of efficient couplings for this popular family of algorithms.
1 Introduction

From the first results on Markov chain ergodicity [Doeblin, 1938, Harris, 1955] to the present day, the coupling method has played an important role in the analysis of convergence to stationarity. In recent years couplings have also gained prominence as a basis for sampling [Propp and Wilson, 1996, Fill, 1997, Neal, 1999, Flegal and Herbei, 2012], convergence diagnosis [Johnson, 1996, 1998, Biswas et al., 2019], variance reduction [Neal and Pinto, 2001, Goodman and Lin, 2009, Piponi et al., 2020], and unbiased estimation [Glynn and Rhee, 2014, Jacob et al., 2020, Heng and Jacob, 2019, Middleton et al., 2019, 2020]. In most cases, couplings that deliver smaller meeting times are associated with better results.

Famously, the distribution of meeting times that a coupling can produce between a pair of chains is constrained by the total variation distance between the marginal distributions of those chains [Aldous, 1983, Lindvall, 1992]. Couplings that achieve this bound are said to be maximal and exist in some generality [Griffeath, 1975, Pitman, 1976, Goldstein, 1979]. However, these couplings are typically not Markovian or co-adapted to the chains in question, while this is a natural starting point for the methodological applications noted above. Maximal Markovian couplings are known to exist – or known not to exist – for a few special cases [Burdzy and Kendall, 2000, Connor and Jacka, 2008, Kuwada, 2009, Hsu and Sturm, 2013, Kendall, 2015, Böttcher, 2017, Banerjee and Kendall, 2017] but remain elusive for practical, discrete-time algorithms. While these studies represent significant progress, they provide only limited guidance for the design of couplings for use in applications.

* Department of Statistics, Harvard University, Cambridge, USA. Email: [email protected]
† Department of Statistics, Rutgers University, New Brunswick, USA. Email: [email protected]
For the Metropolis–Hastings (MH) algorithm [Metropolis et al., 1953, Hastings, 1970], couplings can be divided into two categories. Some are defined directly in terms of the MH transition kernel [Rosenthal, 1996, 2002], while others involve a proposal coupling followed by a coupling of accept/reject steps [Johnson, 1998, Bou-Rabee et al., 2020, Jacob et al., 2020]. Both types of couplings have produced significant achievements, but it remains unclear whether one approach should be preferred to the other, a question that was recently highlighted in O’Leary et al. [2020].

The analysis of MH transition kernel couplings is thus an attractive topic from both a mathematical and an algorithmic perspective. Mathematically, couplings are often defined abstractly as a joint distribution with specific margins. A characterization theorem would provide a unifying and descriptive framework to specify the contents of the set of MH transition kernel couplings. Algorithmically, characterizing this set of couplings would shed light on the options available to practitioners and support the search for efficient and implementable coupling algorithms.

In this paper, we provide a complete characterization of the set of MH transition kernel couplings in terms of a coupling of the relevant proposal distributions followed by a coupled accept/reject step. We show that all MH transition kernel couplings can be expressed in this form, and we give simple conditions that ensure that a joint kernel of this type corresponds to an MH transition kernel coupling.
Our results hold for simple MH algorithms as well as modern refinements such as Hamiltonian Monte Carlo [Duane et al., 1987, Neal, 1993, 2011], the Metropolis-adjusted Langevin algorithm [Roberts and Tweedie, 1996], and particle MCMC [Andrieu et al., 2010]. They also hold for both continuous and discrete state spaces, ‘lazy’ implementations where the proposal distribution is not absolutely continuous with respect to the base measure, and methods like Barker’s algorithm [Barker, 1965] in which the acceptance rate function differs from the usual MH form.

With this characterization result in hand, we turn to the question of maximal couplings of the MH transition kernel, sometimes called one-step or step-by-step maximal couplings [Kartashov and Golomozyi, 2013, Pillai and Smith, 2019, Douc et al., 2018] or greedy couplings [Aldous and Fill, 1995]. We characterize maximal couplings of the MH transition kernel in terms of the properties of the associated proposal and acceptance couplings. We also resolve an open question of O’Leary et al. [2020] on the structure of these couplings and their relationship with maximal couplings of the MH proposal kernel.

In Section 2, we describe our notation and setting, state our main result on kernel couplings, and set out a series of definitions and lemmas to prove it. We introduce a running example on a two-point state space and return to it to illustrate concepts and questions encountered along the way. In Section 3, we highlight a few important properties of maximal couplings, consider the relationship between maximal proposal and transition kernel couplings, and prove our main result on the structure of maximal couplings of the MH transition kernel. Finally, in Section 4 we discuss our results and highlight important open questions and paths forward.

2 Metropolis–Hastings kernel couplings
Let (Z, G) and (W, H) be two measurable spaces. We say Θ : Z × H → [0, 1] is a Markov kernel if Θ(z, ·) is a probability measure on (W, H) for each z ∈ Z and Θ(·, A) : Z → [0, 1] is a measurable function for each A ∈ H. Let 2^S denote the power set of a set S, let Bern(α) denote the Bernoulli distribution on {0, 1} with P(Bern(α) = 1) = α for α ∈ [0, 1], and let a ∧ b := min(a, b) for any a, b ∈ R.

Suppose (X, F) is a Polish space with base measure λ. We take X to be our state space. Fix a Markov kernel Q : X × F → [0, 1] and an acceptance rate function a : X × X → [0, 1], such that a(x, ·) is Q(x, ·)-measurable and that a(x, x) = 1 for all x ∈ X. Given a current state x ∈ X and a measurable set A ∈ F, we interpret Q(x, A) as the probability of proposing a move from x to x′ ∈ A and a(x, x′) as the probability of accepting such a proposal. The Metropolis–Hastings (MH) case is of particular interest. There, we fix a target distribution π ≪ λ, assume Q(x, ·) ≪ λ for all x ∈ X, and set a(x, x′) := 1 ∧ [π(x′) q(x′, x) / (π(x) q(x, x′))] with π(·) := dπ/dλ and q(x, ·) := dQ(x, ·)/dλ. Note that our analysis also holds for alternative forms of a(x, x′) and when Q(x, ·) is not absolutely continuous with respect to λ. Thus the following applies equally to continuous and discrete state spaces, lazy algorithms in which Q(x, {x}) > 0, and implementations of e.g. Barker’s algorithm [Barker, 1965] in which a(x, x′) differs from the usual MH form.

We call B(x, x′) := Bern(a(x, x′)) the acceptance indicator distribution associated with an acceptance rate function a. We say that a probability kernel P : X × F → [0, 1] is generated by Q and B (or equivalently by Q and a) if x′ ∼ Q(x, ·) and b_x ∼ B(x, x′) imply X := b_x x′ + (1 − b_x) x ∼ P(x, ·) for all x ∈ X. We say that a transition kernel P is MH-like if it admits such a representation. Our assumption above that a(x, x) = 1 for all x ∈ X simplifies many proofs and involves no loss of generality: if ã is another acceptance rate function with a(x, x′) = ã(x, x′) for x′ ≠ x, then Q and ã generate the same transition kernel as Q and a.

Let µ and ν be finite measures on (X, F). A measure γ on (X × X, F ⊗ F) is called a coupling of µ and ν if γ(A × X) = µ(A) and γ(X × A) = ν(A) for all A ∈ F. We write Γ(µ, ν) for the set of couplings of µ and ν, sometimes called the Fréchet class of these measures [Kendall, 2017]. When µ and ν are probability measures, the coupling inequality states that P(X = Y) ≤ 1 − ‖µ − ν‖_TV if (X, Y) ∼ γ ∈ Γ(µ, ν). Here ‖µ − ν‖_TV = sup_{A ∈ F} |µ(A) − ν(A)| is the total variation distance. See e.g. Lindvall [1992, chap. 1.2] or Levin et al. [2017, chap. 4] for a further discussion of this important inequality. A coupling γ that achieves the above bound is said to be maximal, and we write Γ_max(µ, ν) for the set of these.

If Θ is a probability kernel on (Z, G) and (W, H) as defined above, we follow Douc et al. [2018, chap. 19] in calling ¯Θ : (Z × Z) × (H ⊗ H) → [0, 1] a kernel coupling if ¯Θ is a kernel on (Z × Z, G ⊗ G) and (W × W, H ⊗ H) and if ¯Θ((z1, z2), ·) ∈ Γ(Θ(z1, ·), Θ(z2, ·)) for all z1, z2 ∈ Z. We write Γ(Θ, Θ) for the set of all such kernel couplings. Similarly, we say that ¯Θ is a maximal kernel coupling if each ¯Θ((z1, z2), ·) is a maximal coupling of Θ(z1, ·) and Θ(z2, ·), and we write Γ_max(Θ, Θ) for the set of these. It is important to remember that a maximal kernel coupling ¯P ∈ Γ_max(P, P) does not usually correspond to a maximal coupling of Markov chains [Aldous, 1983]. The former achieves the highest probability of meeting at each individual step, while the latter produces the stochastically smallest meeting times allowed by the total variation distance between chains at each step.

With the definitions above, we call any ¯P ∈ Γ(P, P) a transition kernel coupling and any ¯Q ∈ Γ(Q, Q) a proposal kernel coupling. We say that ¯B is an acceptance indicator coupling if ¯B((x, y), (x′, y′)) is a joint distribution on {0, 1}² for all x, x′, y, y′ ∈ X. For consistency with our a(x, x) = 1 assumption, we require that if (b_x, b_y) ∼ ¯B((x, y), (x′, y′)) then x′ = x implies b_x = 1 and y′ = y implies b_y = 1. We call functions a_x, a_y : (X × X) × (X × X) → [0, 1] joint acceptance rate functions if a_x((x, y), ·) and a_y((x, y), ·) are ¯Q((x, y), ·)-measurable and if a_x((x, y), (x, y′)) = a_y((x, y), (x′, y)) = 1 for all x, y, x′, y′ ∈ X. We then say that ¯B is an acceptance indicator coupling based on (or associated with) a_x and a_y if

¯B((x, y), (x′, y′)) ∈ Γ( Bern(a_x((x, y), (x′, y′))), Bern(a_y((x, y), (x′, y′))) )

for all x, y, x′, y′ ∈ X.
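Before completing these definitions, it may help to see the coupling inequality and maximality concretely. The sketch below is our own illustration, not code from the paper: on a finite space it builds a maximal coupling by the standard overlap/residual recipe (place the overlap min(µ_i, ν_i) on the diagonal and couple the residual masses independently) and verifies the margins and the equality P(X = Y) = 1 − ‖µ − ν‖_TV.

```python
from fractions import Fraction as F

def tv(mu, nu):
    # ||mu - nu||_TV = (1/2) * sum_i |mu_i - nu_i| on a finite space
    return sum(abs(m - n) for m, n in zip(mu, nu)) / 2

def maximal_coupling(mu, nu):
    # Put the overlap min(mu_i, nu_i) on the diagonal, then couple the
    # residual masses independently; this attains P(X = Y) = 1 - TV.
    n = len(mu)
    overlap = [min(m, v) for m, v in zip(mu, nu)]
    mass = sum(overlap)
    gamma = [[F(0)] * n for _ in range(n)]
    for i in range(n):
        gamma[i][i] = overlap[i]
    if mass < 1:
        for i in range(n):        # row index: value of Y
            for j in range(n):    # column index: value of X
                gamma[i][j] += (nu[i] - overlap[i]) * (mu[j] - overlap[j]) / (1 - mass)
    return gamma

mu = [F(1, 2), F(1, 2)]   # e.g. P(1, .) in a two-point example
nu = [F(1, 4), F(3, 4)]   # e.g. P(2, .)
g = maximal_coupling(mu, nu)
assert [sum(row[j] for row in g) for j in range(2)] == mu   # X-margin
assert [sum(row) for row in g] == nu                        # Y-margin
# the coupling inequality P(X = Y) <= 1 - TV holds with equality here
assert g[0][0] + g[1][1] == 1 - tv(mu, nu)
```

The same overlap/residual idea underlies the maximal proposal couplings discussed later in the paper; here it only serves to make the definition of Γ_max(µ, ν) tangible.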
Finally, we say that a proposal coupling ¯Q and an acceptance indicator coupling ¯B generate a transition kernel coupling ¯P if (X, Y) ∼ ¯P((x, y), ·) whenever (x′, y′) ∼ ¯Q((x, y), ·), (b_x, b_y) ∼ ¯B((x, y), (x′, y′)), X := b_x x′ + (1 − b_x) x, and Y := b_y y′ + (1 − b_y) y, for all x, y ∈ X.

Johnson [1998] showed that one can start with a proposal coupling ¯Q ∈ Γ(Q, Q) and construct an acceptance indicator coupling ¯B such that ¯Q and ¯B generate a valid transition kernel coupling ¯P ∈ Γ(P, P). In that and subsequent work such as Jacob et al. [2020], ¯B is taken to be a coupling of Bern(a(x, x′)) and Bern(a(y, y′)), so that proposals from x to x′ and from y to y′ are accepted at exactly the MH rate. O’Leary et al. [2020] used a conditional form of ¯B to arrive at a maximal transition kernel coupling ¯P from a maximal proposal kernel coupling ¯Q. In this study, our first goal is to characterize the set of transition kernel couplings Γ(P, P) in terms of proposal and acceptance indicator couplings. In particular, we prove the following result:
Theorem 1.
Let P be an MH-like transition kernel on (X, F) generated by a proposal kernel Q and an acceptance rate function a. Then ¯P ∈ Γ(P, P) if and only if there exist a ¯Q ∈ Γ(Q, Q), joint acceptance rate functions a_x and a_y with E[a_x((x, y), (x′, y′)) | x′] = a(x, x′) for Q(x, ·)-almost all x′ and E[a_y((x, y), (x′, y′)) | y′] = a(y, y′) for Q(y, ·)-almost all y′, and an acceptance indicator coupling ¯B based on a_x and a_y such that ¯Q and ¯B generate ¯P.

This result shows that all MH transition kernel couplings ¯P are ‘natural’ in the sense that they arise from coupled proposals (x′, y′) ∼ ¯Q((x, y), ·) that are accepted or rejected according to a coupled acceptance indicator draw (b_x, b_y) ∼ ¯B((x, y), (x′, y′)). It also confirms that a short list of properties, such as the margins of the joint acceptance rate functions, results in a transition kernel ¯P that falls in Γ(P, P). Ultimately Theorem 1 allows us to analyze the relatively complicated set of all MH-like transition kernel couplings in terms of simpler and more tractable ingredients. We now introduce a simple example, which we will return to several times to motivate the constructions in this section:
Example 1.
Let X = {1, 2} and F = 2^X, and assume a current state pair (x, y) = (1, 2). For kernels on {1, 2}, it is convenient to represent kernels Θ : X × F → [0, 1] in the vector form Θ(z, ·) = (Θ(z, {1}), Θ(z, {2})) and couplings ¯Θ : (X × X) × (F ⊗ F) → [0, 1] in the matrix form

¯Θ((x, y), ·) =
x ↓ [ ¯Θ((x, y), (1, 1))   ¯Θ((x, y), (2, 1))
      ¯Θ((x, y), (1, 2))   ¯Θ((x, y), (2, 2)) ] ← y

The columns of this matrix represent the destination values for the x chain (e.g. values of X or x′) while the rows represent the equivalent for the y chain. The small ‘x’ and ‘y’ serve as a reminder of the current state pair.

In this example, we assume a uniform proposal distribution, a target of π = (1/3, 2/3), and an MH acceptance rate function a. Thus we have

Q(1, ·) = (1/2, 1/2)   a(1, ·) = (1, 1)    ⇒  P(1, ·) = (1/2, 1/2)
Q(2, ·) = (1/2, 1/2)   a(2, ·) = (1/2, 1)  ⇒  P(2, ·) = (1/4, 3/4).

Simple algebra shows that any proposal and transition coupling based on the kernels above must take the following forms, for some ρ ∈ [0, 1/2] and λ ∈ [0, 1/4]:

¯Q((1, 2), ·) =
x ↓ [ ρ         1/2 − ρ
      1/2 − ρ   ρ       ] ← y

¯P((1, 2), ·) =
x ↓ [ λ         1/4 − λ
      1/2 − λ   λ + 1/4 ] ← y

Theorem 1 implies that for any such ¯P, we can find a ¯Q and an acceptance indicator coupling ¯B such that ¯Q and ¯B generate ¯P. This is straightforward to confirm in this simple case. We will return to this example to illustrate the concepts below.

We will see that it is easy to prove that ¯P ∈ Γ(P, P) when ¯P is generated by a proposal coupling ¯Q and an acceptance coupling ¯B with the properties given in Theorem 1. However, the converse requires a deeper understanding of the structure of transition kernel couplings. The challenge is that for an arbitrary ¯P ∈ Γ(P, P), it is not trivial to construct a ¯Q and ¯B that generate ¯P. For measurable rectangles A_x × A_y ∈ F ⊗ F with x ∉ A_x and y ∉ A_y, there is a simple relationship between the probability of a proposed move from (x, y) to (x′, y′) ∈ A_x × A_y and the probability of a transition to (X, Y) ∈ A_x × A_y under a coupling ¯P((x, y), ·).
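The kernels of the two-point example can be verified mechanically. The sketch below is our own code, not from the paper: it computes the MH acceptance rates and the generated transition kernel from the stated Q and π, and checks the margins of a one-parameter family of transition couplings from the state pair (1, 2).

```python
from fractions import Fraction as F

# Two-point example: X = {1, 2}, uniform proposal, target pi = (1/3, 2/3).
pi = {1: F(1, 3), 2: F(2, 3)}
Q = {x: {xp: F(1, 2) for xp in (1, 2)} for x in (1, 2)}

def a(x, xp):
    # MH acceptance rate: a(x, x') = 1 ∧ pi(x') q(x', x) / (pi(x) q(x, x'))
    return min(F(1), pi[xp] * Q[xp][x] / (pi[x] * Q[x][xp]))

def P(x, xp):
    # transition kernel generated by Q and a: accepted moves to x',
    # plus (when x' = x) the total mass of rejected proposals
    accept = Q[x][xp] * a(x, xp)
    reject = sum(Q[x][z] * (1 - a(x, z)) for z in (1, 2)) if xp == x else 0
    return accept + reject

assert [a(1, 1), a(1, 2)] == [1, 1]
assert [a(2, 1), a(2, 2)] == [F(1, 2), 1]
assert [P(1, 1), P(1, 2)] == [F(1, 2), F(1, 2)]
assert [P(2, 1), P(2, 2)] == [F(1, 4), F(3, 4)]

# A one-parameter family of transition couplings from (x, y) = (1, 2):
# rows index the y-chain destination, columns the x-chain destination.
def P_bar(lam):
    return [[lam, F(1, 4) - lam], [F(1, 2) - lam, lam + F(1, 4)]]

for lam in (F(0), F(1, 8), F(1, 4)):
    M = P_bar(lam)
    assert [M[0][0] + M[1][0], M[0][1] + M[1][1]] == [P(1, 1), P(1, 2)]
    assert [sum(M[0]), sum(M[1])] == [P(2, 1), P(2, 2)]
```

Every member of this family has the margins P(1, ·) and P(2, ·), which is the content of the statement that the transition couplings here form a one-parameter set.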
However, transitions to sets like A_x × {y}, {x} × A_y, and {x} × {y} can arise from the rejection of proposed moves to larger sets, such as A_x × X, X × A_y, and X × X. It is a significant challenge to associate these transition probabilities with proposal probabilities in a consistent way. This issue is especially acute when moves from x to x′ = x can occur with positive probability, since in this case we also cannot be sure whether to associate a transition from (x, y) to (X, Y) with X = x with a rejected proposal with x′ ≠ x or an accepted one with x′ = x.

To overcome this challenge, we look for a general procedure to map from a transition kernel coupling ¯P ∈ Γ(P, P) to a joint distribution on proposals (x′, y′) and acceptance indicators (b_x, b_y), such that all of these are consistent with the given proposal kernel Q and acceptance rate function a. We construct such distributions using the following definition:

Definition 1.
We say that a proposal coupling ¯Q ∈ Γ(Q, Q) and a transition kernel coupling ¯P ∈ Γ(P, P) are related by a coupled acceptance mechanism Φ = (Φ_00, Φ_01, Φ_10, Φ_11) if each Φ_ij((x, y), ·) is a measure on (X × X, F ⊗ F) and for all A_x, A_y ∈ F we have

1. ¯Q((x, y), A_x × A_y) = (Φ_00 + Φ_01 + Φ_10 + Φ_11)((x, y), A_x × A_y)
2. ¯P((x, y), A_x × A_y) = Φ_11((x, y), A_x × A_y) + Φ_10((x, y), A_x × X) 1(y ∈ A_y) + Φ_01((x, y), X × A_y) 1(x ∈ A_x) + Φ_00((x, y), X × X) 1(x ∈ A_x) 1(y ∈ A_y)
3. Q(x, {x}) = (Φ_10 + Φ_11)((x, y), {x} × X) and Q(y, {y}) = (Φ_01 + Φ_11)((x, y), X × {y}).

[Figure 1: Schematic diagram of a coupled acceptance mechanism Φ relating a proposal coupling ¯Q and a transition kernel coupling ¯P. Here (x, y) represents the current state and A_x × A_y is a measurable rectangle in X × X. ¯Q((x, y), A_x × A_y) gives the probability of a proposal (x′, y′) ∈ A_x × A_y. The coupled acceptance mechanism Φ = (Φ_00, Φ_01, Φ_10, Φ_11) distributes this probability into contributions to the probability ¯P((x, y), ·) of a transition from (x, y) to the sets A_x × A_y, A_x × {y}, {x} × A_y, and {x} × {y}. The conditions of Definition 1 ensure that Φ agrees with both ¯Q and ¯P.]

As we will see, the existence of a coupled acceptance mechanism Φ relating ¯Q and ¯P is equivalent to the existence of an acceptance indicator coupling ¯B such that ¯Q and ¯B generate ¯P. Specifically, in Lemma 1 we show that if (X, Y) ∼ ¯P((x, y), ·) is generated by (x′, y′) ∼ ¯Q((x, y), ·) and (b_x, b_y) ∼ ¯B((x, y), (x′, y′)), then a coupled acceptance mechanism Φ relating ¯Q and ¯P exists and can be defined by Φ_ij((x, y), ·) := P((x′, y′) ∈ ·, b_x = i, b_y = j) for i, j ∈ {0, 1}.
Conversely, in Lemma 6 we show that if a coupled acceptance mechanism Φ relates ¯Q and ¯P, then there exists an acceptance indicator coupling ¯B such that ¯Q and ¯B generate ¯P.

Definition 1 captures the necessary relationships between a proposal coupling ¯Q and a transition kernel coupling ¯P. We think of Φ as subdividing the probability in ¯Q((x, y), ·) into four acceptance scenarios, depending on whether both, exactly one, or neither proposal is accepted. See Figure 1 for a depiction. Condition 1 requires that the probability assigned to these scenarios must add up to ¯Q everywhere on X × X. Condition 2 says that the resulting distribution over transitions must agree with ¯P. For instance, if x ∉ A_x and y ∉ A_y, then this condition requires the probability of a transition from (x, y) to (X, Y) ∈ A_x × A_y to coincide with Φ_11((x, y), A_x × A_y), which we interpret as the probability of proposing and accepting a move from (x, y) to A_x × A_y. Condition 3 concerns proposals to x′ = x and y′ = y, and says that Φ must agree with the marginal assumption that such proposals are always accepted.

The crux of Theorem 1 is to prove that every coupling of MH-like transition kernels ¯P arises from a proposal and acceptance coupling. Thus we begin by looking for coupled acceptance mechanisms Φ that distribute the probability ¯Q((x, y), A_x × A_y) of a proposal from (x, y) to (x′, y′) ∈ A_x × A_y into contributions to the probability of a transition from (x, y) to (X, Y) ∈ A_x × A_y, A_x × {y}, {x} × A_y, and {x} × {y}, in a consistent way. The following result establishes the necessity of the conditions of Definition 1 for this task.

Lemma 1.
Let ¯P ∈ Γ(P, P) be a coupling of MH transition kernels. If a proposal coupling ¯Q and an acceptance indicator coupling ¯B generate ¯P, then there exists a coupled acceptance mechanism Φ relating ¯Q and ¯P.

Proof. Fix (x, y), let (x′, y′) ∼ ¯Q((x, y), ·), and let (b_x, b_y) ∼ ¯B((x, y), (x′, y′)). For A ∈ F ⊗ F and i, j ∈ {0, 1}, define Φ_ij((x, y), A) := P((x′, y′) ∈ A, b_x = i, b_y = j | x, y). This Φ satisfies Condition 1 since ¯Q((x, y), A_x × A_y) = P((x′, y′) ∈ A_x × A_y | x, y). It also satisfies Condition 2 since for all A_x, A_y ∈ F we have

¯P((x, y), A_x × A_y) = P(X ∈ A_x, Y ∈ A_y | x, y)
= P(x′ ∈ A_x, y′ ∈ A_y, b_x = 1, b_y = 1 | x, y) + P(x′ ∈ A_x, y ∈ A_y, b_x = 1, b_y = 0 | x, y)
+ P(x ∈ A_x, y′ ∈ A_y, b_x = 0, b_y = 1 | x, y) + P(x ∈ A_x, y ∈ A_y, b_x = 0, b_y = 0 | x, y)
= Φ_11((x, y), A_x × A_y) + Φ_10((x, y), A_x × X) 1(y ∈ A_y)
+ Φ_01((x, y), X × A_y) 1(x ∈ A_x) + Φ_00((x, y), X × X) 1(x ∈ A_x) 1(y ∈ A_y).

For Condition 3, we have

(Φ_10 + Φ_11)((x, y), {x} × X) = P(x′ = x, b_x = 1) = a(x, x) Q(x, {x}) = Q(x, {x})
(Φ_01 + Φ_11)((x, y), X × {y}) = P(y′ = y, b_y = 1) = a(y, y) Q(y, {y}) = Q(y, {y}).

Thus Φ is a coupled acceptance mechanism relating ¯Q and ¯P.

The ¯Q condition in Definition 1 says that the probability shared out by Φ from any measurable rectangle A_x × A_y must add up to the probability of a proposal to that rectangle. Although the ¯P condition may appear to be complicated, it takes a more intuitive if less compact form when we express it in terms of measurable sets A_x, A_y ∈ F where x ∉ A_x and y ∉ A_y. In particular, we have the following:

Lemma 2.
Let ¯P ∈ Γ(P, P) and suppose Φ = (Φ_00, Φ_01, Φ_10, Φ_11) is a collection of measures on (X × X, F ⊗ F). Then Condition 2 of Definition 1 holds if and only if for all A_x, A_y ∈ F with x ∉ A_x and y ∉ A_y, we have

1. ¯P((x, y), A_x × A_y) = Φ_11((x, y), A_x × A_y)
2. ¯P((x, y), A_x × {y}) = Φ_11((x, y), A_x × {y}) + Φ_10((x, y), A_x × X)
3. ¯P((x, y), {x} × A_y) = Φ_11((x, y), {x} × A_y) + Φ_01((x, y), X × A_y)
4. ¯P((x, y), {x} × {y}) = Φ_11((x, y), {x} × {y}) + Φ_10((x, y), {x} × X) + Φ_01((x, y), X × {y}) + Φ_00((x, y), X × X).

Proof. The conditions above follow from Condition 2 of Definition 1 by evaluating that condition at {x}, A_x \ {x} and {y}, A_y \ {y}. For the converse, for any A_x, A_y ∈ F we have

¯P((x, y), A_x × A_y)
= ¯P((x, y), (A_x \ {x}) × (A_y \ {y})) + ¯P((x, y), (A_x \ {x}) × {y}) 1(y ∈ A_y)
+ ¯P((x, y), {x} × (A_y \ {y})) 1(x ∈ A_x) + ¯P((x, y), {x} × {y}) 1(x ∈ A_x) 1(y ∈ A_y).

Replacing these terms with Conditions 1–4 above yields Condition 2 of Definition 1.

Condition 1 of Lemma 2 corresponds to the intuition that the only way to transition from (x, y) to a point (X, Y) ∈ A_x × A_y with x ∉ A_x and y ∉ A_y is to propose and accept (x′, y′) = (X, Y). Similarly, Condition 2 says that the probability of transitioning to a point (X, Y) ∈ A_x × {y} must arise either through accepted proposals to such a point or by proposals (x′, y′) ∈ A_x × X, with x → x′ accepted and y → y′ rejected. The other two conditions follow a similar intuition.

Two questions immediately arise regarding Definition 1. First, one may wonder if Condition 3 is actually independent of Conditions 1 and 2. For A ∈ F with x ∉ A, we have

(Φ_10 + Φ_11)((x, y), A × X) = ∫_A a(x, x′) Q(x, dx′).

If this held for all A ∈ F it would imply (Φ_10 + Φ_11)((x, y), {x} × X) = a(x, x) Q(x, {x}) = Q(x, {x}), so that Condition 3 would follow from Condition 2. Second, we may ask whether a given ¯P ∈ Γ(P, P) can be related to more than one ¯Q ∈ Γ(Q, Q) by coupled acceptance mechanisms. The next example resolves both of these concerns.
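Before that example, Definition 1 can be exercised end-to-end in the running two-point setting. The sketch below is our own construction, not from the paper: it builds a coupled acceptance mechanism from the independent proposal coupling with independent acceptance decisions at the marginal MH rates, then checks Conditions 1–3 and reads off the induced transition coupling.

```python
from fractions import Fraction as F

# Running example at (x, y) = (1, 2): uniform proposal,
# a(1, .) = (1, 1), a(2, .) = (1/2, 1).
a = {1: {1: F(1), 2: F(1)}, 2: {1: F(1, 2), 2: F(1)}}
x, y = 1, 2
S = (1, 2)
Qbar = {(xp, yp): F(1, 4) for xp in S for yp in S}  # independent proposal coupling

def phi(i, j, xp, yp):
    # Phi_ij((x,y), {(x',y')}) = Qbar(x',y') P(b_x=i | x') P(b_y=j | y')
    # for independent accept/reject decisions at the marginal rates.
    px = a[x][xp] if i == 1 else 1 - a[x][xp]
    py = a[y][yp] if j == 1 else 1 - a[y][yp]
    return Qbar[(xp, yp)] * px * py

def phi_set(i, j, Ax, Ay):
    return sum(phi(i, j, xp, yp) for xp in Ax for yp in Ay)

# Condition 1: the four scenarios partition the proposal probability.
for xp in S:
    for yp in S:
        assert sum(phi(i, j, xp, yp) for i in (0, 1) for j in (0, 1)) == Qbar[(xp, yp)]

# Condition 2 defines the induced transition coupling P_bar on rectangles.
def P_bar(Ax, Ay):
    return (phi_set(1, 1, Ax, Ay)
            + phi_set(1, 0, Ax, S) * (y in Ay)
            + phi_set(0, 1, S, Ay) * (x in Ax)
            + phi_set(0, 0, S, S) * (x in Ax) * (y in Ay))

# Its margins recover P(1, .) = (1/2, 1/2) and P(2, .) = (1/4, 3/4).
assert [P_bar((1,), S), P_bar((2,), S)] == [F(1, 2), F(1, 2)]
assert [P_bar(S, (1,)), P_bar(S, (2,))] == [F(1, 4), F(3, 4)]

# Condition 3: proposals to the current state are always accepted.
assert phi_set(1, 0, (x,), S) + phi_set(1, 1, (x,), S) == F(1, 2)  # Q(x, {x})
assert phi_set(0, 1, S, (y,)) + phi_set(1, 1, S, (y,)) == F(1, 2)  # Q(y, {y})
```

In this particular mechanism Condition 3 happens to hold automatically; the example that follows shows that in general it does not follow from Conditions 1 and 2.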
Example 2.
We return to the setup of Example 1, in which X = {1, 2}, F = 2^X, (x, y) = (1, 2), Q(1, ·) = Q(2, ·) = (1/2, 1/2), P(1, ·) = (1/2, 1/2), and P(2, ·) = (1/4, 3/4). Suppose that we have the following proposal and transition kernel couplings:

¯Q((1, 2), ·) =
x ↓ [ 1/4   1/4
      1/4   1/4 ] ← y

¯P((1, 2), ·) =
x ↓ [ 0     1/4
      1/2   1/4 ] ← y

By way of Lemma 2, the first two conditions of Definition 1 pin down part of any coupled acceptance mechanism Φ relating ¯Q and ¯P: for instance, Φ_11((1, 2), {2} × {1}) must equal ¯P((1, 2), {2} × {1}) = 1/4, while Φ_11((1, 2), {2} × {2}) + Φ_10((1, 2), {2} × X) = 1/4 and Φ_11((1, 2), {1} × {2}) + Φ_10((1, 2), {1} × X) + Φ_01((1, 2), X × {2}) + Φ_00((1, 2), X × X) = 1/2, leaving several entries free. The resulting quantities (Φ_10 + Φ_11)((x, y), {x} × X) and (Φ_01 + Φ_11)((x, y), X × {y}) need not work out to Q(x, {x}) = Q(y, {y}) = 1/2 without imposing the third condition of Definition 1. Note also that even with that condition, a range of choices for the free entries produces a valid Φ relating ¯Q and ¯P. Thus we see that coupled acceptance mechanisms do not have to be unique.

To prove Theorem 1, we must show that any transition kernel coupling ¯P ∈ Γ(P, P) arises from some proposal coupling ¯Q ∈ Γ(Q, Q) and an acceptance indicator coupling ¯B with specific properties. We approach this problem in stages. In this subsection, we show that for any transition kernel coupling ¯P ∈ Γ(P, P) there exists a proposal coupling ¯Q ∈ Γ(Q, Q) and a coupled acceptance mechanism Φ relating ¯Q and ¯P in the sense of Definition 1. In the next subsection, we show that we can transform this Φ into an acceptance coupling ¯B with the desired properties, and finally we use these results to prove the main theorem.

We use the following functions and measures to construct a suitable ¯Q and Φ for a given ¯P:

α_0(x, A) := Q(x, A \ {x}) − P(x, A \ {x})
α_1(x, A) := Q(x, A ∩ {x}) + P(x, A \ {x})
µ(x, A) := α_0(x, A) / α_0(x, X) if α_0(x, X) > 0, and 1(x ∈ A) otherwise
β(x) := Q(x, {x}) / P(x, {x}) if P(x, {x}) > 0, and 1 otherwise.

Note that each of α_0, α_1, and µ is a sub-probability kernel as defined in Section 2.1. Also observe that these objects are all defined in terms of Q and P rather than any coupling. We establish the properties of these objects below.

Lemma 3.
For x ∈ X, A ∈ F, α_0(x, {x}) = 0, α_1(x, A) = Q(x, A) = P(x, A) if α_0(x, X) = 0, and β(x) ∈ [0, 1]. If x′ ∼ Q(x, ·) and b_x ∼ Bern(a(x, x′)) then α_i(x, A) = P(x′ ∈ A, b_x = i | x) for i ∈ {0, 1}, and β(x) = P(b_x = 1 | X = x) if P(x, {x}) > 0. Finally, each of α_0 and α_1 is a sub-probability kernel.

Proof. For all x ∈ X, α_0(x, {x}) = 0 by definition. Since α_0(x, X) = ∫ (1 − a(x, x′)) Q(x, dx′), α_0(x, X) = 0 implies a(x, x′) = 1 for Q(x, ·)-almost all x′. Thus

α_1(x, A) = Q(x, A ∩ {x}) + ∫_{A \ {x}} a(x, x′) Q(x, dx′) = Q(x, A).

In general P(x, A) = 1(x ∈ A) ∫ (1 − a(x, x′)) Q(x, dx′) + ∫_A a(x, x′) Q(x, dx′). If α_0(x, X) = 0 then the first term is zero and the second term is Q(x, A), so α_1(x, A) = P(x, A) as well. The general equality above also yields

P(x, {x}) = Q(x, {x}) + ∫_{{x}ᶜ} (1 − a(x, x′)) Q(x, dx′) ≥ Q(x, {x}).

Thus β(x) ∈ [0, 1] for all x ∈ X. Finally, if x′ ∼ Q(x, ·), b_x ∼ Bern(a(x, x′)), and we define X = b_x x′ + (1 − b_x) x ∼ P(x, ·), then for any A ∈ F,

P(x′ ∈ A, b_x = 0 | x) = P(x′ ∈ A | x) − P(x′ ∈ A, b_x = 1 | x)
= P(x′ ∈ A \ {x} | x) − P(X ∈ A \ {x} | x) = Q(x, A \ {x}) − P(x, A \ {x}).

Also

P(x′ ∈ A, b_x = 1 | x) = P(x′ ∈ A ∩ {x}, b_x = 1 | x) + P(x′ ∈ A \ {x}, b_x = 1 | x)
= P(x′ ∈ A ∩ {x} | x) + P(X ∈ A \ {x} | x) = Q(x, A ∩ {x}) + P(x, A \ {x}).

Thus both α_0(x, ·) and α_1(x, ·) define sub-probabilities on (X, F). The measurability of α_0(·, A) and α_1(·, A) for A ∈ F follows from the equivalent measurability property of Q and P.

Lemma 4.
For all x ∈ X and A ∈ F, α_1(x, A) + α_0(x, X) µ(x, A) = Q(x, A), and µ(x, {x}) = 0 if α_0(x, X) > 0. If x′ ∼ Q(x, ·) and b_x ∼ Bern(a(x, x′)) then µ(x, A) = P(x′ ∈ A | b_x = 0, x) whenever α_0(x, X) > 0. Finally, for any x ∈ X, µ(x, ·) is a probability measure on (X, F).

Proof. If α_0(x, X) > 0 then α_1(x, A) + α_0(x, X) µ(x, A) = Q(x, A) directly from the definitions of α_0 and α_1. If α_0(x, X) = 0, then α_1(x, A) = Q(x, A) by Lemma 3 and the second term drops out, so the same conclusion holds. If α_0(x, X) > 0 then µ(x, {x}) = α_0(x, {x}) / α_0(x, X) = 0. If x′ ∼ Q(x, ·), b_x ∼ Bern(a(x, x′)), and X = b_x x′ + (1 − b_x) x, then by Lemma 3 we have α_0(x, A) = P(x′ ∈ A, b_x = 0 | x). Thus if α_0(x, X) = P(b_x = 0 | x) > 0 then

µ(x, A) = P(x′ ∈ A, b_x = 0 | x) / P(b_x = 0 | x) = P(x′ ∈ A | b_x = 0, x).

Since µ(x, ·) is either an indicator of a measurable set or a well-defined conditional probability, in either case it defines a measure on (X, F).

Example 3. We return to the setup of Example 1 to build further intuition for α_0, α_1, µ, and β. For this example, we continue to assume X = {1, 2}, F = 2^X, Q(1, ·) = Q(2, ·) = (1/2, 1/2), a(1, ·) = (1, 1), a(2, ·) = (1/2, 1), P(1, ·) = (1/2, 1/2), and P(2, ·) = (1/4, 3/4). Thus α_0, α_1, β, and µ work out to

α_0(1, ·) = (0, 0)     α_1(1, ·) = (1/2, 1/2)   µ(1, ·) = (1, 0)
α_0(2, ·) = (1/4, 0)   α_1(2, ·) = (1/4, 1/2)   µ(2, ·) = (1, 0)
β(·) = (1, 2/3).

Following Lemma 3, we can see α_0(x, {x′}) and α_1(x, {x′}) as the probability of proposing a move from x to x′ and rejecting or accepting it, respectively. The fact that only α_0(2, 1) differs from zero and only α_1(2, 2) differs from P follows directly from the form of the acceptance rate function a. Each α_0(x, ·) defines a sub-probability on X, and it is natural to define µ, the normalized version of α_0, from each current state x when any entry of α_0(x, ·) is nonzero. This leads to the form of µ(2, ·), which says that given a current state x = 2 and a rejection, we can infer that x′ = 1 almost surely. There is no probability of rejection when x = 1, so the value of µ(1, ·) is best regarded as a placeholder to streamline the proof of Lemma 5.

Finally, the function β gives the ratio of Q to P for moves from and to the same point, at least when that ratio exists. In this example it says that if we observed a transition from x = 1 to itself, we must have proposed x′ = 1 and accepted that proposal. However, a transition from x = 2 to itself has a 1/3 probability of having come from a rejected proposal x′ = 1 and a 2/3 probability of having come from an accepted proposal x′ = 2. Note that if Q(x, {x}) = 0, as is often the case for MH on a continuous state space, then β(x) = 0 and we can infer that a move from x to itself must have come from a rejected move to x′ ≠ x.

Next, we present the main lemma used in our proof of Theorem 1.

Lemma 5.
For any coupling ¯ P ∈ Γ( P, P ) of MH kernels, there exists a proposal coupling ¯ Q ∈ Γ( Q, Q ) and a coupled acceptance mechanism Φ relating ¯ Q and ¯ P .Proof. The proof proceeds in three stages. First we define a collection of measures Φ, directlyin the case of Φ and via product measures for Φ , Φ , and Φ . Second, we define ¯ Q interms of Φ and show that ¯ Q ∈ Γ( Q, Q ). Finally, we show that Φ satisfies the conditions ofDefinition 1, and so agrees with ¯ Q , ¯ P , and a .For any x, y ∈ X and A ∈ F ⊗ F , letΦ (( x, y ) , A ) := ¯ P (cid:0) ( x, y ) , A ∩ ( { x } c × { y } c ) (cid:1) + ¯ P (cid:0) ( x, y ) , A ∩ ( { x } × { y } c ) (cid:1) β ( x )+ ¯ P (cid:0) ( x, y ) , A ∩ ( { x } c × { y } ) (cid:1) β ( y ) + ¯ P (cid:0) ( x, y ) , A ∩ ( { x } × { y } ) (cid:1) β ( x ) β ( y ) . This Φ (( x, y ) , · ) defines a measure on ( X × X , F ⊗ F ), and both Φ (( x, y ) , ( · ) × X ) andΦ (( x, y ) , X × ( · )) define measures on ( X , F ). Then for any x, y ∈ X and A x , A y ∈ F , letΨ (( x, y ) , A x ) := α ( x, A x ) − Φ (( x, y ) , A x × X )Ψ (( x, y ) , A y ) := α ( y, A y ) − Φ (( x, y ) , X × A y ) . We claim that Ψ (( x, y ) , · ) and Ψ (( x, y ) , · ) define sub-probabilities. For any A x ∈ F , α ( x, A x ) = ¯ P (cid:0) ( x, y ) , ( A x \{ x } ) × { y } c (cid:1) + ¯ P (cid:0) ( x, y ) , ( A x ∩ { x } ) × { y } c (cid:1) β ( x )+ ¯ P (cid:0) ( x, y ) , ( A x \{ x } ) × { y } (cid:1) + ¯ P (cid:0) ( x, y ) , ( A x ∩ { x } ) × { y } (cid:1) β ( x ) .
We observe analogous terms in the expansion of Φ_{11}:

Φ_{11}((x, y), A_x × X) = P̄((x, y), (A_x\{x}) × {y}^c) + P̄((x, y), (A_x ∩ {x}) × {y}^c) β(x)
  + P̄((x, y), (A_x\{x}) × {y}) β(y) + P̄((x, y), (A_x ∩ {x}) × {y}) β(x) β(y).

Thus we can rewrite Ψ_x as

Ψ_x((x, y), A_x) = ( P̄((x, y), (A_x ∩ {x}) × {y}) β(x) + P̄((x, y), (A_x\{x}) × {y}) ) (1 − β(y))
  ≤ P̄((x, y), A_x × {y}) ≤ P̄((x, y), X × {y}) = P(y, {y}).

Similarly, Ψ_y((x, y), A_y) = ( P̄((x, y), {x} × (A_y ∩ {y})) β(y) + P̄((x, y), {x} × (A_y\{y})) ) (1 − β(x)). We conclude that Ψ_x((x, y), ·), Ψ_y((x, y), ·) ∈ [0,
1] and each of these defines a sub-probability on (X, F). We also define Ψ_{xy}(x, y) := 1 − α(x, X) − α(y, X) + Φ_{11}((x, y), X × X). Then by similar algebraic manipulations as above, Ψ_{xy}(x, y) = P̄((x, y), {x} × {y})(1 − β(x))(1 − β(y)) ∈ [0, 1].

We next define Φ_{10}, Φ_{01}, and Φ_{00}. Let Φ_{10}((x, y), ·) be the product measure based on Ψ_x((x, y), ·) and µ(y, ·); let Φ_{01}((x, y), ·) be the product measure based on µ(x, ·) and Ψ_y((x, y), ·); and let Φ_{00}((x, y), ·) be the product measure based on µ(x, ·) and µ(y, ·), scaled by Ψ_{xy}(x, y). Thus for x, y ∈ X and A_x, A_y ∈ F we have

Φ_{10}((x, y), A_x × A_y) = Ψ_x((x, y), A_x) µ(y, A_y)
Φ_{01}((x, y), A_x × A_y) = µ(x, A_x) Ψ_y((x, y), A_y)
Φ_{00}((x, y), A_x × A_y) = µ(x, A_x) µ(y, A_y) Ψ_{xy}(x, y).

Since µ(x, X) = µ(y, X) = 1 by Lemma 4, we have Φ_{10}((x, y), A_x × X) = Ψ_x((x, y), A_x), Φ_{01}((x, y), X × A_y) = Ψ_y((x, y), A_y), and Φ_{00}((x, y), X × X) = Ψ_{xy}(x, y).

Now, define Q̄((x, y), A) := (Φ_{11} + Φ_{10} + Φ_{01} + Φ_{00})((x, y), A) for all A ∈ F ⊗ F. Then Q̄((x, y), ·) is a measure on (X × X, F ⊗ F) since all of the Φ_{ij}((x, y), ·) are. To see that Q̄ ∈ Γ(Q, Q), observe that for all A_x ∈ F, the identities of Lemmas 3 and 4 and the definitions above yield

Q̄((x, y), A_x × X) = (Φ_{11} + Φ_{10} + Φ_{01} + Φ_{00})((x, y), A_x × X)
= (Φ_{11} + Φ_{10})((x, y), A_x × X) + µ(x, A_x)(Φ_{01} + Φ_{00})((x, y), X × X)
= α(x, A_x) + µ(x, A_x) ᾱ(x, X) = Q(x, A_x).

Likewise, Q̄((x, y), X × A_y) = Q(y, A_y) for any A_y ∈ F. Thus we conclude that Q̄ ∈ Γ(Q, Q).

The Q̄ condition of Definition 1 is satisfied by construction. For the P̄ condition, we check the four cases described in Lemma 2. For the case of (A_x\{x}) × (A_y\{y}), by construction we have

Φ_{11}((x, y), (A_x\{x}) × (A_y\{y})) = P̄((x, y), (A_x\{x}) × (A_y\{y})).
For (A_x\{x}) × {y},

Φ_{11}((x, y), (A_x\{x}) × {y}) + Φ_{10}((x, y), (A_x\{x}) × X)
= Φ_{11}((x, y), (A_x\{x}) × {y}) + P(x, A_x\{x}) − Φ_{11}((x, y), (A_x\{x}) × X)
= P̄((x, y), (A_x\{x}) × X) − P̄((x, y), (A_x\{x}) × {y}^c)
= P̄((x, y), (A_x\{x}) × {y}).

Similarly, Φ_{11}((x, y), {x} × (A_y\{y})) + Φ_{01}((x, y), X × (A_y\{y})) = P̄((x, y), {x} × (A_y\{y})), and for {x} × {y} we have

Φ_{11}((x, y), {x} × {y}) + Φ_{10}((x, y), {x} × X) + Φ_{01}((x, y), X × {y}) + Φ_{00}((x, y), X × X)
= Φ_{11}((x, y), {x} × {y}) − Φ_{11}((x, y), {x} × X) − Φ_{11}((x, y), X × {y}) + Φ_{11}((x, y), X × X)
  + 1 + Q(x, {x}) − Q(x, {x}) − P(x, {x}^c) + Q(y, {y}) − Q(y, {y}) − P(y, {y}^c)
= 1 − P̄((x, y), {x}^c × X) − P̄((x, y), X × {y}^c) + P̄((x, y), {x}^c × {y}^c)
= P̄((x, y), {x} × {y}).

For the third condition of Definition 1, note that by the construction of Φ_{11} and the definition of α,

(Φ_{11} + Φ_{10})((x, y), {x} × X) = α(x, {x}) = Q(x, {x})
(Φ_{11} + Φ_{01})((x, y), X × {y}) = α(y, {y}) = Q(y, {y}).

Thus we conclude that Φ is a coupled acceptance mechanism relating P̄ and Q̄.

Example 4.
We now return to the setup of Example 1 to construct Φ and Q̄ as in the proof of Lemma 5. Recall that we assume X = {1, 2}, F = 2^X, (x, y) = (1, 2), Q(1, ·) = Q(2, ·) = ( / , / ), P(1, ·) = ( / , / ), and P(2, ·) = ( / , / ). Following Example 2, we assume

P̄((1, 2), ·) = x↓ [ / / / ] ←y.

Following the construction in the proof of Lemma 5 and the values of α, ᾱ, µ, and β computed in Example 3, we have

Φ_{11}((1, 2), ·) = [ / / / ]    Φ_{10}((1, 2), ·) = [ / / ].

Similar calculations show that both Φ_{01}((1, 2), ·) and Φ_{00}((1, 2), ·) consist entirely of zeros. Finally, since Q̄((x, y), ·) = (Φ_{11} + Φ_{10} + Φ_{01} + Φ_{00})((x, y), ·), we have

Q̄((1, 2), ·) = x↓ [ / / / / ] ←y.

We see that this proposal coupling has the marginal distributions Q(1, ·) = Q(2, ·) = ( / , / ).

Next, we show that if we have a coupled acceptance mechanism Φ relating Q̄ and P̄, then there exists an acceptance indicator coupling B̄ such that Q̄ and B̄ generate P̄. In the following, we write ∆^{n−1} for the set of multinomial distributions on a set with n elements.

Lemma 6.
Let P̄ ∈ Γ(P, P) and let Q̄ ∈ Γ(Q, Q) be a proposal coupling related to P̄ by a coupled acceptance mechanism Φ. Then there exist Q̄((x, y), ·)-measurable functions φ_{ij}((x, y), ·) for i, j ∈ {0, 1} such that φ = (φ_{11}, φ_{10}, φ_{01}, φ_{00})((x, y), (x′, y′)) ∈ ∆^3 for Q̄((x, y), ·)-almost all (x′, y′). If we define B̄ such that (b_x, b_y) ∼ B̄((x, y), (x′, y′)) has P(b_x = i, b_y = j | x, y, x′, y′) = φ_{ij}((x, y), (x′, y′)), then Q̄ and B̄ generate P̄.

Proof. The Q̄ condition in Definition 1 implies that Φ_{ij}((x, y), ·) ≪ Q̄((x, y), ·) for i, j ∈ {0, 1}. For each i, j we define the Radon–Nikodym derivative φ_{ij}((x, y), ·) := dΦ_{ij}((x, y), ·)/dQ̄((x, y), ·). The Radon–Nikodym derivative is linear and dQ̄((x, y), ·)/dQ̄((x, y), ·) = 1, so for Q̄((x, y), ·)-almost all (x′, y′) we have Σ_{i,j∈{0,1}} φ_{ij}((x, y), (x′, y′)) = 1. This implies that each φ_{ij} ≤ 1 and each φ_{ij} ≥
0, and so φ((x, y), (x′, y′)) ∈ ∆^3 for Q̄((x, y), ·)-almost all (x′, y′).

Now define a joint probability (b_x, b_y) ∼ B̄((x, y), (x′, y′)) on {0, 1}^2 such that for i, j ∈ {0, 1}, P(b_x = i, b_y = j | x, y, x′, y′) = φ_{ij}((x, y), (x′, y′)). Suppose (x′, y′) ∼ Q̄((x, y), ·) and (b_x, b_y) ∼ B̄((x, y), (x′, y′)) as defined above, and let (X, Y) := (x′ b_x + x(1 − b_x), y′ b_y + y(1 − b_y)). Then for any A_x, A_y ∈ F we have

P((X, Y) ∈ A_x × A_y, b_x = 1, b_y = 1) = Φ_{11}((x, y), A_x × A_y)
P((X, Y) ∈ A_x × A_y, b_x = 1, b_y = 0) = Φ_{10}((x, y), A_x × X) 1(y ∈ A_y)
P((X, Y) ∈ A_x × A_y, b_x = 0, b_y = 1) = Φ_{01}((x, y), X × A_y) 1(x ∈ A_x)
P((X, Y) ∈ A_x × A_y, b_x = 0, b_y = 0) = Φ_{00}((x, y), X × X) 1(x ∈ A_x) 1(y ∈ A_y).

It follows from these expressions and the definition of a coupled acceptance mechanism Φ that P((X, Y) ∈ A_x × A_y) = P̄((x, y), A_x × A_y) on all measurable rectangles A_x × A_y, and hence that P((X, Y) ∈ A) = P̄((x, y), A) for all A ∈ F ⊗ F. We conclude that Q̄ and B̄ generate P̄.

Lemma 7.
Let Φ be a coupled acceptance mechanism relating Q̄ and P̄, and let

a_x((x, y), ·) := d(Φ_{11} + Φ_{10})((x, y), ·)/dQ̄((x, y), ·)  and  a_y((x, y), ·) := d(Φ_{11} + Φ_{01})((x, y), ·)/dQ̄((x, y), ·).

Then E[a_x((x, y), (x′, y′)) | x′] = a(x, x′) and E[a_y((x, y), (x′, y′)) | y′] = a(y, y′), for Q(x, ·)-almost all x′ and Q(y, ·)-almost all y′, respectively.

Proof. As noted above, Φ_{ij}((x, y), ·) ≪ Q̄((x, y), ·) for i, j ∈ {0, 1}, so the joint acceptance rates a_x((x, y), ·) and a_y((x, y), ·) exist and are Q̄((x, y), ·)-measurable.

For all A ∈ F, the defining property of conditional expectation and Conditions 2 and 3 of Definition 1 imply

∫ 1(x′ ∈ A) E[a_x((x, y), (x′, y′)) | x′] Q̄((x, y), (dx′, dy′))
= ∫ 1(x′ ∈ A) a_x((x, y), (x′, y′)) Q̄((x, y), (dx′, dy′)) = (Φ_{11} + Φ_{10})((x, y), A × X)
= Q̄((x, y), ({x} ∩ A) × X) + P̄((x, y), (A\{x}) × X) = Q(x, {x} ∩ A) + P(x, A\{x})
= ∫_A a(x, x′) Q(x, dx′) = ∫ 1(x′ ∈ A) a(x, x′) Q̄((x, y), (dx′, dy′)).

By the essential uniqueness of the Radon–Nikodym derivative, there exists a measurable set Ã ∈ F with Q̄((x, y), Ã × X) = Q(x, Ã) = 1 and E[a_x((x, y), (x′, y′)) | x′] = a(x, x′) for all x′ ∈ Ã. A similar argument yields E[a_y((x, y), (x′, y′)) | y′] = a(y, y′) for Q(y, ·)-almost all y′.

Main result

With the above lemmas in hand, we can now prove the main result of this section.
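Before turning to the proof, the generative description underlying Theorem 1 — draw a coupled proposal pair from Q̄, then coupled acceptance indicators from B̄ — can be checked by exact enumeration on a small finite example. The sketch below is ours, not from the paper: the two-state target, the independent proposal coupling, and the common-random-number acceptance coupling are illustrative assumptions.

```python
from itertools import product

# Illustrative two-state MH setup (our own numbers, not the paper's example):
# target pi, symmetric proposal Q, and the standard MH acceptance rate.
pi = [1/3, 2/3]
Q = [[0.5, 0.5], [0.5, 0.5]]

def a(x, xp):
    """MH acceptance rate a(x, x') = min(1, pi(x')Q(x',x) / (pi(x)Q(x,x')))."""
    return min(1.0, pi[xp] * Q[xp][x] / (pi[x] * Q[x][xp]))

def P(x, X):
    """Marginal MH kernel: accepted moves plus rejection mass left at x."""
    p = Q[x][X] * a(x, X)
    if X == x:
        p += sum(Q[x][z] * (1 - a(x, z)) for z in range(2))
    return p

def Pbar(x, y):
    """Transition coupling induced by an independent proposal coupling
    Qbar = Q(x,.) x Q(y,.) and a common-random-number coupling Bbar of the
    acceptance indicators (bx, by), computed by exact enumeration."""
    T = [[0.0, 0.0], [0.0, 0.0]]  # T[X][Y]
    for xp, yp in product(range(2), repeat=2):
        w = Q[x][xp] * Q[y][yp]
        ax, ay = a(x, xp), a(y, yp)
        joint = {(1, 1): min(ax, ay),          # bx = 1{u < ax}, by = 1{u < ay}
                 (1, 0): max(ax - ay, 0.0),    # for a single uniform u
                 (0, 1): max(ay - ax, 0.0),
                 (0, 0): 1.0 - max(ax, ay)}
        for (bx, by), pb in joint.items():
            X = xp if bx else x
            Y = yp if by else y
            T[X][Y] += w * pb
    return T

# One direction of Theorem 1: the induced Pbar((x,y),.) has MH marginals.
T = Pbar(0, 1)
marg_x = [sum(T[X]) for X in range(2)]
marg_y = [sum(T[X][Y] for X in range(2)) for Y in range(2)]
```

Here the acceptance rates depend only on each chain's own proposal, so the conditional expectation conditions of the theorem hold trivially, and both marginals of the enumerated coupling recover the MH kernel exactly.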
Proof of Theorem 1.
For the ‘only if’ case, assume Q̄ ∈ Γ(Q, Q), that a_x((x, y), ·) and a_y((x, y), ·) are measurable and satisfy the conditional expectation conditions, and that

B̄((x, y), (x′, y′)) ∈ Γ( Bern(a_x((x, y), (x′, y′))), Bern(a_y((x, y), (x′, y′))) ).

Also suppose (x′, y′) ∼ Q̄((x, y), ·), (b_x, b_y) ∼ B̄((x, y), (x′, y′)), and let P̄((x, y), ·) be the law of (X, Y) := (b_x x′ + (1 − b_x) x, b_y y′ + (1 − b_y) y). Then for any A ∈ F,

P̄((x, y), A × X) = P(X ∈ A | x, y) = P(X ∈ A, b_x = 1 | x, y) + P(X ∈ A, b_x = 0 | x, y)
= P(x′ ∈ A, b_x = 1 | x, y) + P(b_x = 0 | x, y) 1(x ∈ A)
= E[a_x((x, y), (x′, y′)) 1(x′ ∈ A) | x, y] + r(x) 1(x ∈ A)
= P(x, A\{x}) + P(x, A ∩ {x}) = P(x, A).

A similar argument shows P̄((x, y), X × A) = P(y, A). Thus P̄ ∈ Γ(P, P) as desired.

For the ‘if’ case, take any P̄ ∈ Γ(P, P). By Lemma 5, there exists a proposal coupling Q̄ ∈ Γ(Q, Q) and a coupled acceptance mechanism Φ relating Q̄ and P̄. Then by Lemma 6, there exist Q̄((x, y), ·)-measurable functions φ_{ij}((x, y), ·) for i, j ∈ {0, 1} such that if we define (b_x, b_y) ∼ B̄((x, y), (x′, y′)) with P(b_x = i, b_y = j | (x, y), (x′, y′)) = φ_{ij}((x, y), (x′, y′)), then Q̄ and B̄ generate P̄.

In general, if (b_x, b_y) ∈ {0, 1}^2 follow P(b_x = i, b_y = j) = φ_{ij} for i, j ∈ {0, 1} with φ ∈ ∆^3, then there exists a coupling B̄ ∈ Γ( Bern(φ_{11} + φ_{10}), Bern(φ_{11} + φ_{01}) ) such that (b_x, b_y) ∼ B̄. This follows because

P(b_x = 1, b_y = 0) + P(b_x = 1, b_y = 1) = φ_{10} + φ_{11}
P(b_x = 0, b_y = 1) + P(b_x = 1, b_y = 1) = φ_{01} + φ_{11}.
In the present case, we can define Q̄((x, y), ·)-measurable acceptance rates a_x((x, y), ·) and a_y((x, y), ·) such that

a_x((x, y), (x′, y′)) := (φ_{11} + φ_{10})((x, y), (x′, y′))
a_y((x, y), (x′, y′)) := (φ_{11} + φ_{01})((x, y), (x′, y′)).

Lemma 7 shows that these have the desired conditional expectations, and by construction B̄((x, y), (x′, y′)) ∈ Γ( Bern(a_x((x, y), (x′, y′))), Bern(a_y((x, y), (x′, y′))) ). Thus we conclude that P̄ is generated by a proposal coupling Q̄ and an acceptance indicator coupling B̄ with the desired form.

Example 5.
Returning once more to our running example, we most recently showed that the transition kernel coupling P̄ was related to the proposal coupling Q̄ by the coupled acceptance mechanism Φ given below:

P̄((1, 2), ·) = [ / / / ]    Q̄((1, 2), ·) = [ / / / / ]
Φ_{11}((1, 2), ·) = [ / / / ]    Φ_{10}((1, 2), ·) = [ / / ]    Φ_{01}((1, 2), ·) = Φ_{00}((1, 2), ·) = 0.

Applying the same construction used in the proof of Theorem 1, we find that P̄((1, 2), ·) can be generated from Q̄((1, 2), ·) together with an acceptance indicator coupling B̄ such that if (b_x, b_y) ∼ B̄((x, y), ·), then

P(b_x = b_y = 1 | x, y, ·) = /    P(b_x = 1, b_y = 0 | x, y, ·) = /
P(b_x = 0, b_y = 1 | x, y, ·) = 0    P(b_x = b_y = 0 | x, y, ·) = 0.

We conclude that the given procedure indeed provides a method to find a proposal and acceptance coupling that reproduces the given MH transition coupling at the point (x, y) = (1, 2).

In an application of the results above, we now characterize the maximal couplings of MH-like transition kernels, sometimes called the greedy couplings of the associated chains. Given an MH-like transition kernel P, the identification of couplings P̄ ∈ Γ(P, P) that induce rapid meeting between chains is a challenging but important question for theoretical analysis and especially for methodology, where Markovian couplings are often the most straightforward to implement. Maximal couplings P̄ ∈ Γ_max(P, P) represent myopically optimal solutions to the meeting time minimization problem, achieving the largest possible meeting probability P(X = Y | x, y) at each state pair (x, y). A clear understanding of the structure of Γ_max(P, P) is valuable for designing practical couplings and provides a useful reference point for the question of efficient Markovian couplings of MH-like chains.

Write ∆ := {(z, z) : z ∈ X} for the diagonal of X × X, δ : X → ∆ for the map z ↦ (z, z), and A_∆ := δ(A) = {(z, z) : z ∈ A} for A ∈ F.
We have assumed that (X, F) is a Polish space, so ∆ ∈ F ⊗ F and δ is a measurable function. As noted above, the coupling inequality states that if µ and ν are probability measures on (X, F) and γ ∈ Γ(µ, ν), then P_{(X,Y)∼γ}(X = Y) ≤ 1 − ‖µ − ν‖_TV, and a coupling that achieves this upper bound is said to be maximal. We will see in the next section that maximality also imposes significant constraints on the form of a coupling γ ∈ Γ(µ, ν). We can use these properties to identify a coupling as maximal, which is often more direct than showing that it achieves the total variation bound.

Given probability measures µ and ν on (X, F), the Hahn–Jordan theorem [e.g. Dudley, 2002, chapter 5.6] states that there exists a measurable set S ∈ F and sub-probability measures µ_r and ν_r such that µ − ν = µ_r − ν_r and µ_r(S^c) = ν_r(S) = 0. The pair (S, S^c) is referred to as a Hahn decomposition for µ − ν, and the expression µ − ν = µ_r − ν_r is referred to as a Jordan decomposition. The Jordan decomposition is unique, while the Hahn decomposition is essentially unique in the sense that if R ∈ F is another set with µ_r(R^c) = ν_r(R) = 0, then (µ − ν)(S △ R) = 0. Here A △ B = (A\B) ∪ (B\A) denotes the symmetric difference of measurable sets.

In the Jordan decomposition of µ − ν, µ_r and ν_r are called the upper and lower variations of µ − ν, and µ ∧ ν := µ − µ_r = ν − ν_r is called the meet or infimum measure of µ and ν. µ ∧ ν is non-negative and has the defining property that if η is another measure on (X, F) with η(A) ≤ µ(A) ∧ ν(A) for all A ∈ F, then η(A) ≤ (µ ∧ ν)(A) for all A ∈ F. Note that by the definition of total variation, ‖µ − ν‖_TV = sup_{A∈F} |µ(A) − ν(A)| = µ_r(X) = ν_r(X) = 1 − (µ ∧ ν)(X). See e.g. Dshalalow [2012, chap. 5] or Aliprantis and Burkinshaw [1998, sec.
36] for more on the lattice-theoretic properties of the set of measures on (X, F).

For any measure µ on (X, F) let δ_*µ be the pushforward of µ by the diagonal map δ, so that δ_*µ(A) = µ(δ^{−1}(A)) for A ∈ F ⊗ F. This makes δ_*µ a measure on (X × X, F ⊗ F), with δ_*µ(A) = δ_*µ(A ∩ ∆) for A ∈ F ⊗ F, and δ_*µ(B_∆) = µ(B) for B ∈ F. With this notation, we have the following characterization of maximal couplings:

Lemma 8 (Douc et al. [2018], Theorem 19.1.6). Let µ − ν = µ_r − ν_r be the Jordan decomposition of a pair of probability measures µ and ν on (X, F). A coupling γ ∈ Γ(µ, ν) is maximal if and only if there exists a γ_r ∈ Γ(µ_r, ν_r) such that γ(A) = γ_r(A) + δ_*(µ ∧ ν)(A) for all A ∈ F ⊗ F.

Note that Γ(µ_r, ν_r) will be nonempty since µ_r(X) = ν_r(X) by the Jordan decomposition. Also γ_r(∆) = 0, since γ_r(X × X) = µ_r(X) = ‖µ − ν‖_TV = (γ − δ_*(µ ∧ ν))(X × X) = γ_r(∆^c). Finally, we observe that Lemma 8 implies the maximal coupling recognition result of Ernst et al. [2019, Lemma 20]. In particular, we have the following characterization of maximal couplings based on the Hahn decomposition:

Corollary 1 (Hahn maximality condition). Let µ and ν be probability measures on (X, F). A coupling γ ∈ Γ(µ, ν) is maximal if and only if there is an S ∈ F such that γ((S^c × X)\∆) = γ((X × S)\∆) = 0. Any (S, S^c) with this property will be a Hahn decomposition for µ − ν.

Proof. Let µ − ν = µ_r − ν_r be a Jordan decomposition, so that for some S ∈ F we have µ_r(S^c) = ν_r(S) = 0. If γ ∈ Γ(µ, ν) is maximal, Lemma 8 implies that γ(A) = γ_r(A) + δ_*(µ ∧ ν)(A) for all A ∈ F ⊗ F. Thus γ((S^c × X)\∆) = γ_r(S^c × X) = µ_r(S^c) = 0. Similarly, γ((X × S)\∆) = 0.

For the converse, let γ ∈ Γ(µ, ν) and γ((S^c × X)\∆) = γ((X × S)\∆) = 0.
For any B ∈ F,

µ(B) = γ(B × X) = γ((B × X)\∆) + γ(B_∆)
ν(B) = γ(X × B) = γ((X × B)\∆) + γ(B_∆).

By assumption, S contains the support of γ((· × X)\∆) and S^c contains the support of γ((X × ·)\∆). Thus µ(·) − ν(·) = γ((· × X)\∆) − γ((X × ·)\∆) is the Jordan decomposition of µ − ν and (S, S^c) is a Hahn decomposition. The uniqueness of the Jordan decomposition implies γ((B × X)\∆) = µ_r(B) and γ((X × B)\∆) = ν_r(B), which in turn yields γ_r(·) := γ(· \ ∆) ∈ Γ(µ_r, ν_r). We also have (µ ∧ ν)(B) = µ(B) − µ_r(B) = ν(B) − ν_r(B), so the above implies γ(B_∆) = (µ ∧ ν)(B) for all B ∈ F. Thus we conclude that for any A ∈ F ⊗ F, γ(A) = γ(A\∆) + γ(A ∩ ∆) = γ_r(A) + δ_*(µ ∧ ν)(A).

We will use the result above to establish conditions on the form of any P̄ ∈ Γ_max(P, P) in terms of proposal and acceptance couplings. First, we consider the relationship between the maximality of a transition kernel coupling P̄ and the maximality of a proposal coupling Q̄ that generates it.

Algorithm 1: Construction of Q̄ for Lemma 9
1. Draw (x_m, y_m) ∼ Q̄_m((x, y), ·) and (b_x, b_y) ∼ B̄_m((x, y), (x_m, y_m))
2. For z ∈ {x, y}:
   (a) If b_z = 1, set z′ = z_m. Else:
   (b) Draw (x̃, ỹ) ∼ Q̄_m((x, y), ·) and (b̃_x, b̃_y) ∼ B̄_m((x, y), (x̃, ỹ))
   (c) If (b̃_x, b̃_y) = (b_x, b_y), set z′ = z̃
   (d) Else: go to 2(b)
3. Return (x′, y′) and (b_x, b_y)

It may seem that if P̄ ∈ Γ_max(P, P) is generated by a proposal coupling Q̄ ∈ Γ(Q, Q) and some acceptance coupling B̄, then Q̄ might have to be a maximal coupling. The proposal-based maximal coupling of O'Leary et al. [2020] has this property, and it seems plausible that to maximize the probability of X = Y one might need to start by maximizing the probability of x′ = y′. However, the following shows that no special relationship exists between maximal proposal and transition couplings.
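Algorithm 1 can be transcribed directly as a sampler. The sketch below is ours: the function takes the proposal coupling and acceptance coupling as user-supplied samplers, and the toy samplers in the usage example (a two-point state space, independent proposals, common-random-number acceptance) are illustrative assumptions, not values from the paper.

```python
import random

def algorithm1(x, y, Qm_sampler, Bm_sampler):
    """One draw of (x', y', bx, by) following Algorithm 1.

    Qm_sampler(x, y)         -> proposal pair (xm, ym) ~ Qm((x,y), .)
    Bm_sampler(x, y, xm, ym) -> indicators (bx, by) ~ Bm((x,y), (xm,ym))
    Accepted coordinates keep their original proposal; each rejected
    coordinate is redrawn until fresh indicators reproduce (bx, by).
    """
    xm, ym = Qm_sampler(x, y)
    bx, by = Bm_sampler(x, y, xm, ym)
    xp, yp = xm, ym
    if bx == 0:  # steps 2(b)-(d) for z = x
        while True:
            xt, yt = Qm_sampler(x, y)
            if Bm_sampler(x, y, xt, yt) == (bx, by):
                xp = xt
                break
    if by == 0:  # steps 2(b)-(d) for z = y
        while True:
            xt, yt = Qm_sampler(x, y)
            if Bm_sampler(x, y, xt, yt) == (bx, by):
                yp = yt
                break
    return xp, yp, bx, by

# Toy samplers on X = {0, 1} (illustrative assumptions, not from the paper):
rng = random.Random(7)

def Qm_sampler(x, y):
    return rng.randint(0, 1), rng.randint(0, 1)  # independent proposals

def Bm_sampler(x, y, xm, ym):
    ax = 1.0 if xm == 0 else 0.5                 # acceptance rate by proposal
    ay = 1.0 if ym == 0 else 0.5
    u = rng.random()                             # common random number
    return int(u < ax), int(u < ay)

draws = [algorithm1(0, 1, Qm_sampler, Bm_sampler) for _ in range(20000)]
freq_x0 = sum(d[0] == 0 for d in draws) / len(draws)  # near Q(0, {0}) = 1/2
freq_y0 = sum(d[1] == 0 for d in draws) / len(draws)  # near Q(1, {0}) = 1/2
```

Consistent with the proof of Lemma 9, the redraw step leaves both proposal marginals unchanged, since each rejected coordinate is resampled from its conditional law given the acceptance pattern.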
Lemma 9.
Suppose the transition kernel coupling P̄ ∈ Γ_max(P, P) is generated by a proposal coupling Q̄_m ∈ Γ_max(Q, Q) and an acceptance coupling B̄_m, that Q̄_m((x, y), ∆^c) > 0 for some (x, y), and that at that (x, y), P(b_x = b_y = 1 | x, y) < 1, where (x_m, y_m) ∼ Q̄_m((x, y), ·) and (b_x, b_y) ∼ B̄_m((x, y), (x_m, y_m)). Then there exists a non-maximal coupling Q̄ ∈ Γ(Q, Q) and an acceptance indicator coupling B̄ such that Q̄ and B̄ also generate P̄.

In the following proof, we use Q̄_m to construct a Q̄ that agrees with Q̄_m on accepted proposals and independently redraws rejected ones. The hypotheses on the support of Q̄_m and (b_x, b_y) are very general but are needed to ensure that this procedure results in a coupling Q̄ that is not maximal.

Proof.
Let Q̄((x, y), ·) be the distribution of the (x′, y′) output of Algorithm 1. We claim that Q̄ ∈ Γ(Q, Q). For A ∈ F,

Q̄((x, y), A × X)
= P(x_m ∈ A, b_x = 1 | x, y) + Σ_{j∈{0,1}} P(b_x = 0, b_y = j | x, y) P(x̃ ∈ A | b̃_x = 0, b̃_y = j, x, y)
= P(x_m ∈ A, b_x = 1 | x, y) + Σ_{j∈{0,1}} P(x_m ∈ A, b_x = 0, b_y = j | x, y) = P(x_m ∈ A | x) = Q(x, A).

A similar argument shows Q̄((x, y), X × A) = Q(y, A). Now let (x, y) be such that Q̄_m((x, y), ∆^c) > 0 and P(b_x = b_y = 1 | x, y) < 1, so that with positive probability Algorithm 1 independently redraws x_m or y_m at (x, y). Thus P(x′ = y′ | x, y) < P(x_m = y_m | x, y), and so we conclude that Q̄ is not a maximal coupling.

Next, for i, j ∈ {0, 1} and A ∈ F ⊗ F, define Φ_{ij}((x, y), A) := P((x′, y′) ∈ A, b_x = i, b_y = j) using the full output of Algorithm 1. We observe that this is a coupled acceptance mechanism relating Q̄ and P̄. The first condition of Definition 1 is satisfied by construction. For the second condition, define X_m = b_x x_m + (1 − b_x) x and Y_m = b_y y_m + (1 − b_y) y. Since Q̄_m and B̄_m generate P̄, we must have (X_m, Y_m) ∼ P̄((x, y), ·). Thus for any A_x × A_y ∈ F ⊗ F,

Φ_{11}((x, y), A_x × A_y) + Φ_{10}((x, y), A_x × X) 1(y ∈ A_y)
+ Φ_{01}((x, y), X × A_y) 1(x ∈ A_x) + Φ_{00}((x, y), X × X) 1(x ∈ A_x) 1(y ∈ A_y)
= P(X_m ∈ A_x, Y_m ∈ A_y, b_x = 1, b_y = 1 | x, y) + P(X_m ∈ A_x, Y_m = y, b_x = 1, b_y = 0 | x, y)
+ P(X_m = x, Y_m ∈ A_y, b_x = 0, b_y = 1 | x, y) + P(X_m = x, Y_m = y, b_x = 0, b_y = 0 | x, y)
= P̄((x, y), A_x × A_y).
The third condition on Φ follows from the fact that b_x = 1 if x_m = x and b_y = 1 if y_m = y. Since Φ is a coupled acceptance mechanism relating Q̄ and P̄, Lemma 6 ensures the existence of a B̄ such that Q̄ and B̄ generate P̄.

The result above shows that if a maximal transition coupling P̄ ∈ Γ_max(P, P) is generated by a proposal coupling Q̄ and an acceptance coupling B̄, then Q̄ does not have to be maximal. The following example shows that some maximal couplings P̄ cannot be generated from any maximal coupling Q̄ ∈ Γ_max(Q, Q).

Example 6.
Assume X = {1, 2, 3}, F = 2^X, and (x, y) = (1, 2), with

Q(x = 1, ·) = (0, / , / )    P(x = 1, ·) = ( / , / , )
Q(y = 2, ·) = ( / , , / )    P(y = 2, ·) = ( / , , / ).

It is straightforward to verify that these transition distributions can be generated from the given proposals, and correspond to an MH scenario with Q(3, ·) = (0, 1,
0) and π = ( / , / , / ). By Lemma 8, any Q̄ ∈ Γ_max(Q, Q) and P̄ ∈ Γ_max(P, P) will take the following form:

Q̄((1, 2), ·) = x↓ [ / ←y / ]    P̄((1, 2), ·) = x↓ [ / ←y / ].

There does not exist an acceptance indicator coupling B̄ such that it and the unique maximal Q̄ given above generate P̄: taking (x′, y′) ∼ Q̄((1, 2), ·), (b_x, b_y) ∼ B̄((1, 2), (x′, y′)), X = b_x x′ + (1 − b_x) x, Y = b_y y′ + (1 − b_y) y, and (X, Y) ∼ P̄((1, 2), ·) yields a contradiction:

/ = P(X = 2, Y = 3 | x = 1, y = 2) ≤ P(x′ = 2, y′ = 3 | x = 1, y = 2) = 0.

Note that in line with Theorem 1, we can generate P̄((1, 2), ·) using the following proposal coupling and joint acceptance rates a_x and a_y:

Q̃((1, 2), ·) = x↓ [ / ←y / ]    with a_x((1, 2), (2, ·)) = / , a_y((1, 2), (2, ·)) = / , a_x((1, 2), (3, ·)) = / , a_y((1, 2), (3, ·)) = / .

One upshot of Lemma 9 and Example 6 is that a coupling P̄ ∈ Γ_max(P, P) requires a certain amount of proposal probability on the diagonal, but the maximality of Q̄ is neither necessary nor sufficient for Q̄ to be able to generate P̄.

Characterization of maximal kernel couplings

Next we turn to the main result of this section, which extends Theorem 1 to characterize the maximal couplings of an MH-like transition kernel in terms of proposal and acceptance indicator couplings. For each x, y ∈ X, let (S_xy, S_xy^c) be any Hahn decomposition for P(x, ·) − P(y, ·). Thus S_xy ∈ F and P(x, A) ≥ P(y, A) for any A ∈ F with A ⊂ S_xy. Recall that Hahn decompositions are unique only up to sets of Q̄((x, y), ·) measure zero. However, the conditions on S_xy below also hold only up to Q̄((x, y), ·)-null sets, and so the characterization below is unaffected by the particular choice of S_xy. Note that if P(x, ·) and P(y, ·) have Radon–Nikodym derivatives p(x, ·) and p(y, ·) with respect to a common dominating measure, then we can use S_xy = {z : p(x, z) ≥ p(y, z)} for these sets.

Theorem 2.
Let P be the MH-like transition kernel on (X, F) generated by a proposal kernel Q and an acceptance rate function a. Then P̄ ∈ Γ_max(P, P) if and only if P̄ is generated by Q̄ ∈ Γ(Q, Q) and an acceptance indicator coupling B̄ with the following properties: if (b_x, b_y) ∼ B̄((x, y), (x′, y′)), then

1. P(b_x = 1 | x, y, x′) = a(x, x′) for Q(x, ·)-almost all x′,
2. P(b_y = 1 | x, y, y′) = a(y, y′) for Q(y, ·)-almost all y′,

and for Q̄((x, y), ·)-almost all (x′, y′),

3. P(b_x = b_y = 1 | x, y, x′, y′) = 0 if x′ ≠ y′ and either x′ ∈ S_xy^c or y′ ∈ S_xy,
4. P(b_x = 1, b_y = 0 | x, y, x′, y′) = 0 if x′ ≠ y and either x′ ∈ S_xy^c or y ∈ S_xy,
5. P(b_x = 0, b_y = 1 | x, y, x′, y′) = 0 if y′ ≠ x and either x ∈ S_xy^c or y′ ∈ S_xy,
6. P(b_x = 0, b_y = 0 | x, y, x′, y′) = 0 if x ≠ y and either x ∈ S_xy^c or y ∈ S_xy.

Recall that by Corollary 1, the maximality of a coupling P̄ ∈ Γ(P, P) is equivalent to a condition on the support of each P̄((x, y), ·). Conditions 3–6 relate these support constraints to the behavior of a proposal coupling Q̄ and an acceptance indicator coupling B̄. See Figure 2 for an illustration of the acceptance scenarios considered in these conditions and a visual intuition for why certain ones must be ruled out for Q̄ and B̄ to generate a maximal P̄.

Proof of Theorem 2.
Suppose P̄ ∈ Γ_max(P, P). By Theorem 1, there exists a Q̄ ∈ Γ(Q, Q) and an acceptance indicator coupling B̄ based on joint acceptance rate functions a_x and a_y such that Q̄ and B̄ generate P̄, E[a_x((x, y), (x′, y′)) | x′] = a(x, x′) for Q(x, ·)-almost all x′, and E[a_y((x, y), (x′, y′)) | y′] = a(y, y′) for Q(y, ·)-almost all y′. Let (x′, y′) ∼ Q̄((x, y), ·) and (b_x, b_y) ∼ B̄((x, y), (x′, y′)). Then since B̄ is a coupling of Bern(a_x((x, y), (x′, y′))) and Bern(a_y((x, y), (x′, y′))), we have

P(b_x = 1 | x, y, x′, y′) = a_x((x, y), (x′, y′))    P(b_y = 1 | x, y, x′, y′) = a_y((x, y), (x′, y′)).

Thus the conditional expectation properties of a_x and a_y are equivalent to Conditions 1 and 2. Q̄ and B̄ generate P̄, so (X, Y) ∼ P̄((x, y), ·) where X = b_x x′ + (1 − b_x) x and Y = b_y y′ + (1 − b_y) y. Since P̄ is maximal, Corollary 1 implies 0 = P̄((x, y), (S_xy^c × X)\∆) = P(X ∈ S_xy^c, X ≠ Y | x, y). Breaking this up into the four possible acceptance scenarios (b_x, b_y) = (1, 1), (1, 0), (0, 1), (0,
0) yields

0 = E[1(x′ ≠ y′, x′ ∈ S_xy^c) P(b_x = b_y = 1 | x, y, x′, y′)]
0 = E[1(x′ ≠ y, x′ ∈ S_xy^c) P(b_x = 1, b_y = 0 | x, y, x′, y′)]
0 = E[1(y′ ≠ x, x ∈ S_xy^c) P(b_x = 0, b_y = 1 | x, y, x′, y′)]
0 = E[1(x ∈ S_xy^c \ {y}) P(b_x = b_y = 0 | x, y, x′, y′)].
Figure 2: Diagram of acceptance scenarios considered in Theorem 2. The support of a maximalcoupling ¯ P (( x, y ) , · ) is contained in the union of S xy × S cxy (gray box) and ∆ (the diagonal).Arrows illustrate the relationship of proposals ( x , y ) to transitions ( X, Y ) under different ac-cept/reject combinations, with transitions outside the support of ¯ P (( x, y ) , · ) forbidden almostsurely. Case 1: the maximality of ¯ P does not constrain the acceptance pattern of proposals( x , y ) ∈ S xy × S cxy . Case 2: proposals in ∆ can be fully accepted ( b x = b y = 1) or fully rejected( b x = b y = 0), but y must be accepted if x ∈ S cxy , and x must be accepted if y ∈ S xy . Case3: proposals in ( S xy × S cxy ) c ∩ ∆ c must be fully rejected unless y = x or x = y . Case 4: aproposal ( x , y ) outside the support of ¯ P (( x, y ) , · ) may be partially accepted ( b x = b y ) if ityields a transition to ( x, x ) or ( y, y ). 20n turn, these equations imply that for ¯ Q (( x, y ) , · )-almost all ( x , y ), P ( b x = b y = 1 | x, y, x , y ) = 0 if x = y and x ∈ S cxy P ( b x = 1 , b y = 0 | x, y, x , y ) = 0 if x = y and x ∈ S cxy P ( b x = 0 , b y = 1 | x, y, x , y ) = 0 if y = x and x ∈ S cxy P ( b x = b y = 0 | x, y, x , y ) = 0 if x = y and x ∈ S cxy . This shows that the first either/or case of each of Conditions 3-6 are satisfied. Since ¯ P ismaximal, Corollary 1 also implies 0 = ¯ P (cid:0) ( x, y ) , ( X × S xy ) \ ∆ (cid:1) . Proceeding as above shows thatthe second either/or cases are also satisfied. So we conclude that if ¯ P ∈ Γ max ( P, P ), then ¯ P isgenerated by a ¯ Q and ¯ B satisfying the six conditions stated above.For the converse, suppose that ¯ Q ∈ Γ( Q, Q ) and a proposal coupling ¯ B generate ¯ P and satisfythe given hypotheses. Since Conditions 1 and 2 are equivalent to the a x and a y conditions ofTheorem 1, we have ¯ P ∈ Γ( P, P ). 
Now let (x′, y′) ∼ Q̄((x, y), ·), (b_x, b_y) ∼ B̄((x, y), (x′, y′)), X = b_x x′ + (1 − b_x) x, and Y = b_y y′ + (1 − b_y) y. Q̄ and B̄ generate P̄, so

P̄((x, y), (S_xy^c × X)\∆) = P(X ∈ S_xy^c, X ≠ Y | x, y)
= P(x′ ∈ S_xy^c, x′ ≠ y′, b_x = 1, b_y = 1 | x, y) + P(x′ ∈ S_xy^c, x′ ≠ y, b_x = 1, b_y = 0 | x, y)
+ P(y′ ≠ x, b_x = 0, b_y = 1 | x, y) 1(x ∈ S_xy^c) + P(b_x = 0, b_y = 0 | x, y) 1(x ∈ S_xy^c) 1(x ≠ y) = 0.

The last equality follows directly from Conditions 3–6 of the theorem, with Condition 3 ensuring that the first term equals zero, Condition 4 ensuring that the second term equals zero, and so on. A similar argument yields P̄((x, y), (X × S_xy)\∆) = 0. By Corollary 1, P̄ is maximal if and only if there is a measurable set S ∈ F such that P̄((x, y), (S^c × X)\∆) = P̄((x, y), (X × S)\∆) = 0. The argument above shows that S_xy has these properties, so we conclude that P̄ ∈ Γ_max(P, P).

In this study, we have considered the form of kernel couplings in Γ(
P, P) and maximal kernel couplings in Γ_max(P, P) based on an MH-like transition kernel P. In Theorem 1, we found that when P is generated by a proposal kernel Q and an acceptance rate function a, then any P̄ ∈ Γ(P, P) must arise from a proposal coupling Q̄ ∈ Γ(Q, Q) and an acceptance indicator coupling B̄ whose properties are determined by a. We also showed the converse: any kernel P̄ generated this way must be an element of Γ(P, P). In Theorem 2, we took this a step further, and found that the maximal kernel couplings P̄ ∈ Γ_max(P, P) correspond to pairs of proposal and acceptance couplings with a somewhat more extended set of properties. While a direct analysis of Γ(
P, P) and Γ_max(P, P) is possible for very small state spaces, the results above offer a dramatic simplification of these sets in the general case.

Looking forward, we believe these characterization results should aid in the design of couplings for use in estimators, variance reduction techniques, convergence diagnostics, and other methods that involve Markovian couplings of the MH algorithm. We also expect these results to support the theoretical analysis of efficient kernel couplings for this important class of discrete-time Markov chains. For example, it may be possible to produce meeting time bounds or information on the spectrum of P̄ based on underlying data on Q̄ and B̄, using a refinement of the drift-and-minorization methods of Rosenthal [1995, 2002] or arguments like those of Atchadé and Perron [2007]. Better understanding of the set of kernel couplings may also facilitate optimization arguments like those of Boyd et al. [2004, 2006, 2009].

Finally, there appear to be significant opportunities to further understand the structural properties of Γ(P, P) and Γ_max(P, P), and to relate their properties to those of Γ(
Q, Q) and the acceptance couplings that relate them. As noted in the introduction, maximal Markovian couplings are known to exist in only a few special cases. In other cases they are known not to exist [Connor, 2007], and in some scenarios, meaningful bounds exist on the efficiency of any Markovian coupling [Burdzy and Kendall, 2000]. To our knowledge, these questions remain largely open for the MH-like case. A better understanding of the set of Markovian couplings should make it possible to identify better implementable couplings and to settle the question of whether efficient or maximal couplings are implementable for this important case.
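As a concrete companion to the small-state-space analysis mentioned above, the construction in Lemma 8 can be computed directly for discrete distributions. The sketch below is ours, with arbitrary illustrative three-point distributions: it places the meet measure µ ∧ ν on the diagonal, couples the residuals µ_r and ν_r independently, and confirms that the resulting coupling attains the coupling inequality with equality.

```python
# Maximal coupling of two discrete distributions mu, nu via Lemma 8:
# gamma = gamma_r + delta_*(mu /\ nu), i.e. put the meet measure on the
# diagonal and couple the residuals mu_r, nu_r (here independently).
# The three-point distributions are arbitrary illustrative values.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
n = len(mu)

meet = [min(m, v) for m, v in zip(mu, nu)]   # mu /\ nu
mu_r = [m - c for m, c in zip(mu, meet)]     # upper variation of mu - nu
nu_r = [v - c for v, c in zip(nu, meet)]     # lower variation of mu - nu
tv = sum(mu_r)                               # = sum(nu_r) = ||mu - nu||_TV

gamma = [[0.0] * n for _ in range(n)]
for z in range(n):
    gamma[z][z] += meet[z]                   # diagonal part delta_*(mu /\ nu)
if tv > 0:
    for i in range(n):
        for j in range(n):
            # gamma_r in Gamma(mu_r, nu_r); its mass avoids the diagonal
            # automatically because mu_r and nu_r have disjoint supports.
            gamma[i][j] += mu_r[i] * nu_r[j] / tv

meet_prob = sum(gamma[z][z] for z in range(n))   # P(X = Y) under gamma
```

By construction the row and column sums of gamma recover mu and nu, and the diagonal mass equals 1 − ‖µ − ν‖_TV, the largest meeting probability any coupling can achieve.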
Acknowledgments
The authors would like to thank Pierre E. Jacob, Persi Diaconis, and Qian Qin for helpful comments. John O'Leary gratefully acknowledges support by the National Science Foundation through grant DMS-1844695.
References
D. Aldous. Random walks on finite groups and rapidly mixing Markov chains. In Séminaire de Probabilités XVII 1981/82, pages 243–297. Springer, 1983.
D. Aldous and J. Fill. Reversible Markov Chains and Random Walks on Graphs. Berkeley, 1995.
C. D. Aliprantis and O. Burkinshaw. Principles of Real Analysis. Gulf Professional Publishing, 1998.
C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
Y. F. Atchadé and F. Perron. On the geometric ergodicity of Metropolis-Hastings algorithms. Statistics, 41(1):77–84, 2007.
S. Banerjee and W. S. Kendall. Rigidity for Markovian maximal couplings of elliptic diffusions. Probability Theory and Related Fields, 168(1-2):55–112, 2017.
A. A. Barker. Monte Carlo calculations of the radial distribution functions for a proton–electron plasma. Australian Journal of Physics, 18(2):119–134, 1965.
N. Biswas, P. E. Jacob, and P. Vanetti. Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7391–7401, 2019.
B. Böttcher. Markovian maximal coupling of Markov processes. arXiv preprint arXiv:1710.09654, 2017.
N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. Annals of Applied Probability, 30(3):1209–1250, 2020.
S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing Markov chain on a graph. SIAM Review, 46(4):667–689, 2004.
S. Boyd, P. Diaconis, J. Sun, and L. Xiao. Fastest mixing Markov chain on a path. The American Mathematical Monthly, 113(1):70–74, 2006.
S. Boyd, P. Diaconis, P. Parrilo, and L. Xiao. Fastest mixing Markov chain on graphs with symmetries. SIAM Journal on Optimization, 20(2):792–819, 2009.
K. Burdzy and W. S. Kendall. Efficient Markovian couplings: examples and counterexamples. Annals of Applied Probability, pages 362–409, 2000.
S. Connor. Coupling: Cutoffs, CFTP and Tameness. PhD thesis, University of Warwick, 2007.
S. Connor and S. Jacka. Optimal co-adapted coupling for the symmetric random walk on the hypercube. Journal of Applied Probability, 45(3):703–713, 2008.
W. Doeblin. Exposé de la théorie des chaînes simples constantes de Markov à un nombre fini d'états. Mathématique de l'Union Interbalkanique, 2(77-105):78–80, 1938.
R. Douc, E. Moulines, P. Priouret, and P. Soulier. Markov Chains. Springer, 2018.
J. H. Dshalalow. Foundations of Abstract Analysis. Springer Science & Business Media, 2012.
S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
R. Dudley. Real Analysis and Probability. Cambridge University Press, New York, 2002.
P. A. Ernst, W. S. Kendall, G. O. Roberts, and J. S. Rosenthal. MEXIT: Maximal un-coupling times for stochastic processes. Stochastic Processes and their Applications, 129(2):355–380, 2019.
J. A. Fill. An interruptible algorithm for perfect sampling via Markov chains. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 688–695, 1997.
J. M. Flegal and R. Herbei. Exact sampling for intractable probability distributions via a Bernoulli factory. Electronic Journal of Statistics, 6:10–37, 2012.
P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.
S. Goldstein. Maximal coupling. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 46(2):193–204, 1979.
J. B. Goodman and K. K. Lin. Coupling control variates for Markov chain Monte Carlo. Journal of Computational Physics, 228(19):7127–7136, 2009.
D. Griffeath. A maximal coupling for Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 31(2):95–106, 1975.
T. E. Harris. On chains of infinite order. Pacific Journal of Mathematics, 5(Suppl. 1):707–724, 1955.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
J. Heng and P. E. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287–302, 2019.
E. P. Hsu and K.-T. Sturm. Maximal coupling of Euclidean Brownian motions. Communications in Mathematics and Statistics, 1(1):93–104, 2013.
P. E. Jacob, J. O'Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020.
V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.
V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.
M. Kartashov and V. Golomozyi. Maximal coupling procedure and stability of discrete Markov chains. II. Theory of Probability and Mathematical Statistics, 87:65–78, 2013.
W. S. Kendall. Coupling, local times, immersions. Bernoulli, 21(2):1014–1046, 2015.
W. S. Kendall. Lectures on probabilistic coupling, 2017.
K. Kuwada. Characterization of maximal Markovian couplings for diffusion processes. Electronic Journal of Probability, 14:633–662, 2009.
D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017. ISBN 1470429624.
T. Lindvall. Lectures on the Coupling Method. Dover Books on Mathematics, 1992. ISBN 0-486-42145-7.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased smoothing using particle independent Metropolis-Hastings. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 2378–2387. PMLR, 16–18 Apr 2019.
L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. Electronic Journal of Statistics, 14(2):2842–2891, 2020. ISSN 1935-7524.
R. Neal and R. Pinto. Improving Markov chain Monte Carlo estimators by coupling to an approximating chain. Technical report, Department of Statistics, University of Toronto, 2001.
R. M. Neal. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, pages 475–482, 1993.
R. M. Neal. Circularly-coupled Markov chain sampling. Technical report, Department of Statistics, University of Toronto, 1999.
R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
J. O'Leary, G. Wang, and P. E. Jacob. Maximal couplings of the Metropolis–Hastings algorithm. arXiv preprint arXiv:2010.08573, 2020.
N. S. Pillai and A. Smith. Mixing times for a constrained Ising process on the two-dimensional torus at low density. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 55, pages 1649–1678. Institut Henri Poincaré, 2019.
D. Piponi, M. Hoffman, and P. Sountsov. Hamiltonian Monte Carlo swindles. Proceedings of Machine Learning Research, 108:3774–3783, 26–28 Aug 2020.
J. Pitman. On coupling of Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35(4):315–322, 1976.
J. G. Propp and D. B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1-2):223–252, 1996.
G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
J. S. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995.
J. S. Rosenthal. Analysis of the Gibbs sampler for a model related to James-Stein estimators. Statistics and Computing, 6(3):269–275, 1996.
J. S. Rosenthal. Quantitative convergence rates of Markov chains: A simple account.