Double Happiness: Enhancing the Coupled Gains of L-lag Coupling via Control Variates
Radu V. Craiu* and Xiao-Li Meng†

October 14, 2020
Abstract
The recently proposed L-lag coupling for unbiased MCMC (Biswas et al., 2019; Jacob et al., 2020) calls for a joint celebration by MCMC practitioners and theoreticians. For practitioners, it circumvents the thorny issue of deciding the burn-in period or when to terminate an MCMC iteration, and opens the door for safe parallel implementation. For theoreticians, it provides a powerful tool to establish elegant and easily estimable bounds on the exact error of the MCMC approximation at any finite number of iterations. A serendipitous observation about the bias-correcting term led us to introduce naturally available control variates into the L-lag coupling estimators. In turn, this extension enhances the coupled gains of L-lag coupling, because it results in more efficient unbiased estimators as well as a better bound on the total variation error of MCMC iterations, albeit the gains diminish with the numerical value of L. Specifically, the new bound is theoretically guaranteed to never exceed the one given previously. We also argue that L-lag coupling represents a long sought after coupling for the future, breaking a logjam of the coupling-from-the-past type of perfect sampling, by reducing the generally unachievable requirement of being perfect to being unbiased, a worthwhile trade-off for ease of implementation in most practical situations. The theoretical analysis is supported by numerical experiments that show tighter bounds and a gain in efficiency when control variates are introduced.
Keywords: Coupling from the Past; Maximum coupling; Median absolute deviation; Parallel implementation; Total variation distance; Unbiased MCMC.
We thank Pierre Jacob and his team for a series of tantalizing articles (e.g., Jacob et al., 2020, 2019; Heng and Jacob, 2019; Biswas et al., 2019) that revitalize the excitement we experienced (e.g., Murdoch and Meng, 2001; Meng, 2000; Craiu and Meng, 2011; Stein and Meng, 2013) from the Coupling From The Past (CFTP; Propp and Wilson, 1996, 1998) and, more generally, perfect sampling. This time, we are dealing more with a form of Coupling For The Future (CFTF), literally and figuratively, as we shall discuss. The clever "cross-time coupling" idea of Glynn and Rhee (2014), on which CFTF is based, allows us to move away from the CFTP framework, which engulfed the MCMC (Markov chain Monte Carlo) community just around the turn of the century because of its promise to provide perfect/exact MCMC samplers (see, for instance, the annotated bibliography of Wilson, 1998). However, research progress on perfect or exact samplers has slowed down significantly in the new millennium because they are very challenging, if not impossible, to develop for many routine Bayesian computational problems (see, for example, Murdoch and Meng, 2001, for a demonstration).

In its most basic form, a CFTP-type perfect sampler couples a Markov chain $\{X_t, t \ge 0\}$ with itself but from different starting points, and runs two or more chains until they coalesce at a time $\tau$. This apparent convergence does not guarantee in general that $X_\tau$ is from the desired stationary distribution $\pi(x)$. By shifting the entire chain to "negative time" (i.e., the past), $\{X_t, t \le 0\}$, Propp and Wilson (1996) have shown that if we follow this coalescent chain until it reaches the present time, that is, $t = 0$, then the resulting $X_0$ will be exactly from $\pi(x)$.

* Department of Statistical Sciences, University of Toronto
† Department of Statistics, Harvard University
Perhaps the most intuitive way to understand this scheme is to realize that running a chain from its infinite past ($t = -\infty$) to the present ($t = 0$) is mathematically equivalent to running the chain from the present ($t = 0$) to the infinite future ($t = +\infty$). The CFTP is a clever way to realize this seemingly impossible task, relying on the fact that if the coalescence occurs regardless of how we have chosen the starting point, then the chain has "forgotten" its origin and hence settled in the perfect asymptotic land.

But there is no free lunch. Being perfect is never easy, especially in the mathematical sense. No error of any kind is allowed, and this requirement has manifested in two ways that greatly limit the practicality of perfect sampling. First, constructing a perfect sampler, especially for distributions with continuous and unbounded state spaces (which are ubiquitous in routine statistical applications), is a very challenging task in general, despite its great success for problems with some special structures, such as certain monotonic properties (see, for example, Berthelsen and Møller, 2002; Corcoran and Tweedie, 2002; Huber, 2002, 2004; Ensor and Glynn, 2000; Murdoch and Takahara, 2006). Secondly, even if a perfect sampler is devised, it can be excruciatingly slow because it refuses to deliver an output until it can guarantee its perfection, and one must devise problem-specific strategies for speed-up (e.g., Thönnes, 1999; Dobrow and Fill, 2003; Møller, 1999; Corcoran and Schneider, 2005).

A relaxation of the exact sampling paradigm with important practical consequences has been proposed by Glynn and Rhee (2014) and Glynn (2016), who put forth strategies for exact estimation of integrals via Markov chain Monte Carlo. The contrast between exact sampling and exact estimation is a very large conceptual leap that allows one to bypass most of the difficulties of perfect sampling while maintaining some of its important benefits.
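To make the CFTP mechanism concrete, here is a minimal Python sketch (entirely our illustration, not code from any of the cited papers): a toy two-state chain is run from increasingly remote past times, reusing the same uniform variates for each time point, until all starting states coalesce by time $t = 0$. The transition matrix and all names are hypothetical assumptions.

```python
import random

P = [[0.7, 0.3], [0.4, 0.6]]   # toy two-state transition matrix (an assumption)

def step(state, u):
    # inverse-CDF update: a single uniform u determines the move from any state
    return 0 if u < P[state][0] else 1

def cftp(rng):
    # Coupling From The Past: reuse the SAME uniform u_t for the transition
    # at time -t while pushing the starting time further into the past,
    # until every possible starting state has coalesced by time 0.
    us = []                      # us[t-1] drives the transition at time -t
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        states = {0, 1}          # all possible states at time -T
        for t in range(T, 0, -1):
            states = {step(s, us[t - 1]) for s in states}
        if len(states) == 1:     # chain has "forgotten" its origin: perfect draw
            return states.pop()
        T *= 2

rng = random.Random(7)
draws = [cftp(rng) for _ in range(30000)]
freq1 = sum(draws) / len(draws)
# freq1 should be close to pi_1 = 3/7 ~ 0.4286, where pi = (4/7, 3/7)
# is the stationary distribution solving pi P = pi
```

Reusing the old uniforms when the start is pushed further back is what makes the output exact rather than approximate; regenerating fresh randomness at each restart would bias the draw.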
Building on the work of Glynn and co-authors, the L-lag coupling of Biswas et al. (2019) and Jacob et al. (2020) only aims to deliver unbiased estimators of $E[h(X_\pi)]$ for any (integrable) $h$, where $X_\pi$ denotes a random variable distributed according to $\pi$. An astute reader may immediately question whether this is really a weaker requirement, because the fact that $E[h(X_\pi)] = E[h(X_{\tilde\pi})]$ for all (integrable) $h$ would immediately imply $\pi(X) = \tilde\pi(X)$ (almost surely). But this is where the innovation of L-lag coupling lies, because it does not couple a chain with itself from two or more starting points [e.g., two extreme states as with monotone coupling; see Propp and Wilson (1996)], but rather two chains which have the same transition probability and start from the same starting point, or more generally the same initial distribution $\pi_0$, but are time-shifted by an integer lag, $L > 0$.

To illustrate, consider the case of $L = 1$, which was the focus of Jacob et al. (2020), where two chains $\mathbf{X} = \{X_t, t \ge 0\}$ and $\mathbf{Y} = \{Y_t, t \ge 0\}$ are coupled in such a way that both of them have the same transition kernel (and hence the same target stationary distribution), and that with probability one there exists a finite stopping time $\tau$ such that $X_t = Y_{t-1}$ for all $t \ge \tau$. This construction allows them to show that the following estimator based on both $\mathbf{X}$ and $\mathbf{Y}$,

$$H_k(\mathbf{X}, \mathbf{Y}) = h(X_k) + \sum_{j=k+1}^{\tau-1} \left[h(X_j) - h(Y_{j-1})\right], \quad (1.1)$$

is an unbiased estimator for $E[h(X_\pi)]$ for any $k \ge 0$ (under mild conditions). Heuristically, this is because the sum in (1.1) is the same as $\sum_{j=k+1}^{\infty} [h(X_j) - h(Y_{j-1})]$, since any term with $j \ge \tau$ must be zero by the coupling scheme. Furthermore, for the purpose of calculating expectations, we can replace $h(Y_{j-1})$ by $h(X_{j-1})$ for any $j$, because $X_{j-1}$ and $Y_{j-1}$ have identical distributions by construction.
But $h(X_k) + \sum_{j=k+1}^{\infty} [h(X_j) - h(X_{j-1})]$ is nothing but $\lim_{t\to\infty} h(X_t)$, which has the same distribution as $h(X_\pi)$. The cleverness of constructing an estimator based on both $\mathbf{X}, \mathbf{Y}$ to ensure $E[H_k(\mathbf{X}, \mathbf{Y})] = E[h(X_\pi)]$ for any $h$ bypasses the requirement that $X_\tau$ itself must be perfect. The series of illustrative and practical examples in Jacob et al. (2020), as well as Jacob et al. (2019), Heng and Jacob (2019) and Biswas et al. (2019), provides good evidence of the practicality of this approach.

The use of parallel computation for estimating $I = E_\pi[h(X)]$ recommends using $E[E_\pi[h(X) \mid \mathcal{U}_j]]$, where the inner expectation is the estimate obtained from the $j$th parallel process, $\mathcal{U}_j$, and the outer mean averages over all processes. However, if each inner mean is a biased estimator for $I$, then the accumulation of errors can be seriously misleading. This has been documented extensively in the Monte Carlo literature, for instance in Glynn and Heidelberger (1991) and Nelson (2016). Hence, unbiased MCMC designs allow one to take full advantage of parallel computation strategies without having to worry about the accumulation of bias as the number of parallel processes increases.

The expression (1.1) also opens a path to explore further improvements, and that is the starting point of our exploration. In Craiu and Meng (2020), we noticed that (1.1) can be expressed equivalently as

$$H_k(\mathbf{X}, \mathbf{Y}) = h(X_{(\tau-1) \vee k}) + \sum_{j=k}^{\tau-2} \left[h(X_j) - h(Y_j)\right], \quad (1.2)$$

where $A \vee B = \max\{A, B\}$. Expression (1.1) renders the insight underlying Jacob et al. (2020), which is that $H_k(\mathbf{X}, \mathbf{Y})$ achieves the desired unbiasedness by providing a time-forward bias correction to $h(X_k)$ whenever $\tau > k + 1$; hence Coupling for the Future. (No correction is needed when $\tau \le k + 1$.)
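As a concrete illustration of the $L = 1$ construction, the following Python sketch (our own toy example; the two-state chain, the common-uniform coupling, and all names are assumptions, not the authors' implementation) simulates coupled pairs and averages the estimator (1.1) with $k = 0$:

```python
import random

P = [[0.7, 0.3], [0.4, 0.6]]   # toy two-state transition matrix (an assumption)
PI_MEAN = 3 / 7                # E_pi[h(X)] for h(x) = x, since pi = (4/7, 3/7)

def step(state, u):
    # shared-uniform update: once X and Y meet, they move together forever
    return 0 if u < P[state][0] else 1

def h(x):
    return x

def H_k(k, rng):
    # one realization of (1.1): h(X_k) + sum_{j=k+1}^{tau-1} [h(X_j) - h(Y_{j-1})]
    xs, ys = [0], [0]                       # X_0 and Y_0 (same fixed start)
    xs.append(step(xs[-1], rng.random()))   # X_1: X runs one step ahead of Y
    while xs[-1] != ys[-1]:                 # tau = first t >= 1 with X_t = Y_{t-1}
        u = rng.random()
        xs.append(step(xs[-1], u))
        ys.append(step(ys[-1], u))
    tau = len(xs) - 1
    return h(xs[k]) + sum(h(xs[j]) - h(ys[j - 1]) for j in range(k + 1, tau))

rng = random.Random(1)
est = sum(H_k(0, rng) for _ in range(50000)) / 50000
# est should be close to PI_MEAN ~ 0.4286 even though X_0 = 0 is a fixed,
# non-stationary starting point: the correction terms remove the bias exactly
```

The point of the sketch is that no burn-in decision is ever made; the bias of the arbitrary start is cancelled by the coupled correction terms.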
The dual expression (1.2) indicates that $H_k(\mathbf{X}, \mathbf{Y})$ can also be viewed as a time-backward bias correction to $h(X_{\tau-1})$ for its imperfection, because $k < \tau - 1$.

Most intriguingly, each correcting term $\Delta_j \equiv h(X_j) - h(Y_j)$ in (1.2) has mean zero by the construction of $\{\mathbf{X}, \mathbf{Y}\}$. However, the sum $\sum_{j=k}^{\tau-2} [h(X_j) - h(Y_j)]$ does not have mean zero, because $\tau$ is random and it depends critically on $\{\mathbf{X}, \mathbf{Y}\}$. Indeed, if this sum had mean zero, then $X_{(\tau-1)\vee k}$ would have been a perfect draw from $\pi$, because then $E[h(X_{(\tau-1)\vee k})] = E[h(X_\pi)]$ for any (integrable) $h$, which would imply that $X_{(\tau-1)\vee k} \sim \pi$.

However, the fact that $E(\Delta_j) = 0$ suggests that we can use any linear combination of the $\Delta_j$'s as a control variate for $H_k(\mathbf{X}, \mathbf{Y})$. Using control variates to reduce estimation errors is a well-known technique in the literature on improving MCMC samplers and estimators via efficiency swindles, such as antithetic and control variates, Rao-Blackwellization, and so on, some of which we have explored in the past (e.g., Van Dyk and Meng, 2001; Craiu and Meng, 2001, 2005; Craiu and Lemieux, 2007; Yu and Meng, 2011). For example, for any finite constant $\eta > k + 1$, the estimator

$$H^*_k(\mathbf{X}, \mathbf{Y}; \eta) = H_k(\mathbf{X}, \mathbf{Y}) - \sum_{j=k}^{\eta-1} \Delta_j = h(X_{(\tau-1)\vee k}) + \sum_{j=k}^{\tau-2} \Delta_j - \sum_{j=k}^{\eta-1} \Delta_j, \quad (1.3)$$

shares the mean of $H_k(\mathbf{X}, \mathbf{Y})$, but can have a smaller variance with a judicious choice of $\eta$. Intuitively, this reduction of variance is possible because of the potential partial cancellation (on average) of the $\Delta_j$ terms in the last two summations in (1.3).

Indeed, Section 2 below investigates a more general class of control variates, and derives the optimal choice by establishing the minimal upper bound within the class on the total variation distance between the target $\pi$ and $\pi_k$, the distribution of $X_k$. This leads to both an improved theoretical bound over the one in Biswas et al.
(2019), as reported in Section 2, as well as a more efficient estimator than (1.1) via parallel implementation. Section 3 describes estimation methods and algorithms, and Section 4 provides examples and illustrations of both kinds of gains. Section 5 completes our first tour of CFTF with a discussion of future work.

The scheme of L-lag coupling extends the coupling of $\{X_k, Y_{k-1}\}$ to the more general form of coupling of $\{X_k, Y_{k-L}\}$ for some fixed $L \ge 1$, as detailed in Biswas et al. (2019). The significance of this extension can be best understood by expressing the L-lag coupling idea in its mathematically equivalent form of seeking $\tau_L$ such that $X_{k+L} = Y_k$ for all $k \ge \tau_L$, and letting $L \to \infty$ while keeping $k$ fixed. Intuitively, it is then clear that the larger the $L$, the closer is the distribution of $Y_{\tau_L}$ to the target, because $X_{\tau_L + L}$ converges to $X_\infty \sim \pi$ as $L \to \infty$, and $\mathbf{X}$ and $\mathbf{Y}$ share the same target $\pi$. Indeed, by extending (1.1) to a general $L$, Biswas et al. (2019) have shown that (under mild regularity conditions) the total variation distance between $\pi_k$, the distribution of $X_k$, and $\pi$ is bounded by a very simple function of $\tau_L$ and $(k, L)$:

$$d_{TV}(\pi_k, \pi) \le E[J_{k,L}], \quad \text{where} \quad J_{k,L} = \max\left\{0, \left\lceil \frac{\tau_L - L - k}{L} \right\rceil\right\}, \quad (2.1)$$

and $\lceil a \rceil$ denotes the smallest integer that is greater than or equal to $a$. We can clearly see the impact of increasing $L$ or $k$, since larger values of either make it more likely that $\tau_L - L - k \le 0$ and hence $J_{k,L} = 0$. Perhaps a crystal clear demonstration of this fact is when $\tau_L$ follows a geometric distribution with success probability $p$ and state space $\{L + i, i \ge 0\}$ (since $\tau_L \ge L$ by definition), or equivalently $\delta = \tau_L - (L - 1) \sim \text{Geo}(p)$. Then, letting $q = 1 - p$, we have (see Biswas et al., 2019)
$$d_{TV}(\pi_k, \pi) \le E[J_{k,L}] = \frac{q^{k+1}}{1 - q^L}. \quad (2.2)$$

We see that the bound is a decreasing function of both $k$ and $L$, though it decreases much faster with $k$, which controls the rate of convergence, than with $L$, which controls only the (constant) scaling factor. We also observe that the bound can be trivial since it can be larger than 1 for small $k$ and/or $L$, whereas $d_{TV}$ cannot, suggesting there is room for improvement. Nevertheless, (2.1) is a remarkable bound because it encodes all the intricacies relevant for the convergence speed of $\mathbf{X}$, including the choice of $X_0$, into a univariate (truncated) coupling time $J_{k,L}$. In the case of (2.2), the bound also immediately establishes the geometric ergodicity of $\mathbf{X}$, and provides a rather practical way to assess the bound by estimating $p$, or more generally by assessing $J_{k,L}$ directly, say from a parallel implementation (see Section 3).

It is perhaps even more remarkable to see that the left-hand side of (2.1) is a property of the marginal chain $\mathbf{X}$ (and equivalently of the $\mathbf{Y}$ chain), but its right-hand side depends on the construction of the joint chain $\{\mathbf{X}, \mathbf{Y}\}$. This suggests that we can seek improvement by better coupling. Furthermore, as we establish below, even without changing the coupling scheme we can still obtain better bounds by using more efficient estimators than (1.1).

For a general $L$, the forward-correction expression in (1.1) becomes (Biswas et al., 2019)

$$H_{k,L}(\mathbf{X}, \mathbf{Y}) = h(X_k) + \sum_{j=1}^{J_{k,L}} \left[h(X_{k+jL}) - h(Y_{k+(j-1)L})\right]. \quad (2.3)$$

It is easy to verify that the backward-correction expression (1.2) takes the form of

$$H_{k,L}(\mathbf{X}, \mathbf{Y}) = h(X_{k + L J_{k,L}}) + \sum_{j=0}^{J_{k,L} - 1} \left[h(X_{k+jL}) - h(Y_{k+jL})\right]. \quad (2.4)$$

Remark 1:
The (random) subscript in $X_{k + J_{k,L} L}$ cannot be reduced to $(\tau - L) \vee k$, the most obvious extension of the index $(\tau - 1) \vee k$ in (1.2). This is because $k + J_{k,L} L \ge (\tau - L) \vee k$, but the inequality can be strict when $\tau > k + L$. For example, if $\tau = L + k + M$, where $M$ is a positive integer less than $L$ (which does not exist when $L = 1$), then $k + J_{k,L} L = k + L$, but $(\tau - L) \vee k = k + M$.

Remark 2: Whereas (2.3) and (2.4) are equivalent as equalities, they can and will lead to different inequalities depending on how we bound their respective right-hand sides. This is both a bonus and a trap, as we shall discuss below.
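Both the pathwise equivalence of (2.3) and (2.4), which is a pure telescoping identity, and the arithmetic in Remark 1 can be checked mechanically. In the following Python sketch (our illustration only), arbitrary random sequences stand in for the coupled chains, since the identity does not depend on the coupling:

```python
import math
import random

def h(x):
    return math.sin(x) + 2.0   # an arbitrary test function

def H_forward(xs, ys, k, L, J):
    # (2.3): h(X_k) + sum_{j=1}^{J} [h(X_{k+jL}) - h(Y_{k+(j-1)L})]
    return h(xs[k]) + sum(h(xs[k + j * L]) - h(ys[k + (j - 1) * L])
                          for j in range(1, J + 1))

def H_backward(xs, ys, k, L, J):
    # (2.4): h(X_{k+LJ}) + sum_{j=0}^{J-1} [h(X_{k+jL}) - h(Y_{k+jL})]
    return h(xs[k + L * J]) + sum(h(xs[k + j * L]) - h(ys[k + j * L])
                                  for j in range(J))

rng = random.Random(0)
for k in range(3):
    for L in range(1, 4):
        for J in range(4):
            xs = [rng.gauss(0, 1) for _ in range(30)]
            ys = [rng.gauss(0, 1) for _ in range(30)]
            assert abs(H_forward(xs, ys, k, L, J) - H_backward(xs, ys, k, L, J)) < 1e-9

# Remark 1: tau = L + k + M with 0 < M < L gives J_{k,L} = ceil(M/L) = 1,
# so k + J*L = k + L, which is strictly larger than (tau - L) v k = k + M
k, L, M = 2, 5, 3
tau = k + L + M
J = max(0, math.ceil((tau - L - k) / L))
assert J == 1 and k + J * L == k + L and max(tau - L, k) == k + M
```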
For notational simplicity, we drop the variables $k, L$ from the notation of $J_{k,L}$, and we let $\Delta_{k,j} = h(X_{k+jL}) - h(Y_{k+jL})$. Then we know $\Delta_{k,j}$ has mean zero for any $\{k, j\}$ and $L$. This means that for any random sequence $\vec\eta \equiv \{\eta_j, j \ge 1\}$ such that (A) it is independent of $\{\mathbf{X}, \mathbf{Y}\}$, and (B) $\sum_{j \ge 1} E_{\vec\eta}|\eta_j| < \infty$, we can use $C_{\vec\eta} = \sum_{j \ge 1} \eta_j \Delta_{k,j}$ as a control variate for $H_{k,L} \equiv H_{k,L}(\mathbf{X}, \mathbf{Y})$, because $E[C_{\vec\eta}] = 0$. That is,

$$\tilde H^{(\vec\eta)}_{k,L}(\mathbf{X}, \mathbf{Y}) = H_{k,L}(\mathbf{X}, \mathbf{Y}) - \sum_{j \ge 1} \eta_j \Delta_{k,j} \quad (2.5)$$

will also be an unbiased estimator of $E[h(X_\pi)]$, potentially with a smaller variance than (2.4). The question is what will be a good choice of $\vec\eta$.

Instead of seeking the $\vec\eta$ that minimizes $\mathrm{Var}[\tilde H^{(\vec\eta)}_{k,L}]$, which is not an easy task and would also likely produce an $h$-dependent solution, we follow the argument used by Biswas et al. (2019) for obtaining their bound (2.1), and seek the $\vec\eta$ that minimizes a class of bounds on $d_{TV}(\pi_k, \pi)$ indexed by the $\vec\eta$'s satisfying (A) and (B). This will lead to a sharper bound than (2.1), because (2.1) corresponds to the special case $\vec\eta = 0$, which in general is not an optimal choice, as shown below.

We proceed by using the same argument as in Biswas et al. (2019) for proving (2.1), but using (2.5) instead of (2.3). We use (2.3), instead of its equivalent (2.4), inside (2.5) to ensure that our proof covers their proof as a special case. Specifically, the unbiasedness of (2.5) implies that for any $k \ge 0$,

$$E[h(X_\pi) - h(X_k)] = E\left[\sum_{j=1}^{J} \left[h(X_{k+jL}) - h(Y_{k+(j-1)L})\right] - \sum_{j \ge 1} \eta_j \Delta_{k,j}\right]$$
$$= E\left[\sum_{j \ge 1} \left[h(X_{k+jL}) - h(Y_{k+(j-1)L})\right] 1\{j \le J\} - \sum_{j \ge 1} \eta_j \left[h(X_{k+jL}) - h(Y_{k+jL})\right]\right] \quad (2.6)$$
$$= E\left[\sum_{j \ge 1} h(X_{k+jL})\left[1\{j \le J\} - \eta_j\right] + \sum_{j \ge 1} h(Y_{k+jL})\left[\eta_j - 1\{j+1 \le J\}\right] - h(Y_k)\, 1\{J \ge 1\}\right].$$

Taking absolute values inside the expectation then bounds $d_{TV}(\pi_k, \pi)$ by a function of $\vec\eta$, namely (2.7), which we minimize over $\vec\eta$. The minimization involves the randomized coupling time $\tilde J = J - \zeta$, where $\zeta \sim \text{Bernoulli}(0.5)$ is independent of everything else (see Algorithm 1 in Section 3), and its smallest integer median $m_{\tilde J}$.
For any $j > m_{\tilde J}$ we have $\Pr(\tilde J \ge j) \le 1 - \Pr(\tilde J \le m_{\tilde J}) \le 1/2$, because $\Pr(\tilde J \le m_{\tilde J}) \ge 1/2$ by the definition of $m_{\tilde J}$, implying that the optimal $\eta_j = 0$. Therefore we know the maximal number of non-zero $\eta_j$'s cannot exceed $m_{\tilde J}$. In turn, $m_{\tilde J}$ can be zero, but not $-1$, because $\Pr(\tilde J = -1) = 0.5 \Pr(J = 0) < 0.5$. This automatically implies that condition (B) is trivially satisfied. For this choice of $\vec\eta$, (2.7) yields our new bound for $d_{TV}(\pi_k, \pi)$,

$$B_{k,L} = \sum_{j \ge 1} \min\{S_j, 1 - S_j\} + 0.5 \Pr(J > 0) \quad (2.9)$$
$$= \sum_{j \ge 1} \min\{\Pr(J \ge j), \Pr(J \le j)\}. \quad (2.10)$$

In deriving the last equality we used the facts that $S_j + 0.5\Pr(J = j) = \Pr(J \ge j)$, that $1 - S_j + 0.5\Pr(J = j) = \Pr(J \le j)$ (here $S_j = \Pr(J > j) + 0.5\Pr(J = j)$), and that $\Pr(J > 0) = \sum_{j \ge 1} \Pr(J = j)$.

2.3 Understand and Compare the Bounds

It is immediate from expression (2.10) that our new bound cannot exceed the bound of Biswas et al. (2019) as given in (2.1), because (2.10) obviously cannot exceed $\sum_{j \ge 1} \Pr(J \ge j)$, which is $E[J]$. The next result reveals alternative forms for the new bound, providing additional insights, including the optimality of the choice $\eta_j = 1\{j \le m_{\tilde J}\}$. This is a significant improvement, since in the previous section we merely demonstrated that the optimal $\vec\eta$ cannot have more than $m_{\tilde J}$ non-zero $\eta_j$'s.

Theorem 2.1.
Under the same regularity conditions as in Biswas et al. (2019), we have

$$B_{k,L} = E\left|\tilde J_{k,L} - m_{\tilde J_{k,L}}\right| + \Pr(J_{k,L} > 0) - 0.5 \quad (2.11)$$
$$= E\left|J_{k,L} - m_{J_{k,L}}\right| + \Pr(J_{k,L} > 0) - S_{k,L} \quad (2.12)$$
$$= 0.5 \sum_{j \ge 1} \left[1 - \left|\Pr(\tau > k + (j+1)L) + \Pr(\tau > k + jL) - 1\right|\right] \quad (2.13)$$
$$\qquad + 0.5 \Pr(\tau > k + L), \quad (2.14)$$

where $S_{k,L} = \max\{\Pr(J_{k,L} > m_{J_{k,L}}), \Pr(J_{k,L} < m_{J_{k,L}})\} \le 0.5$, and $m_{\tilde J_{k,L}}$ and $m_{J_{k,L}}$ are respectively the smallest integer medians of $\tilde J_{k,L}$ and $J_{k,L}$.

Proof. To reduce the notational overload, we drop the $k, L$ from $J$, $\tilde J$, $m_J$ and $m_{\tilde J}$. We have already established that the optimal $\vec\eta$ must be of the form $\eta_j = 1\{j \le m\}$ for some $m \ge 0$. Note here that the use of $m = 0$ permits $\vec\eta = 0$ since $j \ge 1$, and it is also consistent with setting $\eta_0 = 1$. We can minimize the right-hand side of (2.7) with respect to such a class, that is, with respect to the choice of $m$. But it is easy to see that

$$\sum_{j \ge 1} E\left|1\{j \le \tilde J\} - \eta_j\right| = \sum_{j \ge 0} E\left|1\{j \le \tilde J\} - 1\{j \le m\}\right| - E\left[1 - 1\{0 \le \tilde J\}\right] = E\left|\tilde J - m\right| - \Pr(\tilde J < 0),$$

because $|1\{j \le \tilde J\} - 1\{j \le m\}| = 1\{\min\{\tilde J, m\} < j \le \max\{\tilde J, m\}\}$, and summing this indicator over $j \ge 0$ counts exactly $|\tilde J - m|$ integers. This is minimized by taking $m = m_{\tilde J}$, and substituting this choice back into (2.7), the right-hand side becomes
$$\Pr(J > 0) + E\left|\tilde J - m_{\tilde J}\right| - 0.5, \quad (2.17)$$

which proves (2.11).

In order to prove (2.12) we start from (2.10). Let

$$G(j) = \Pr(J \le j) - \Pr(J \ge j) = \Pr(J < j) - \Pr(J > j), \quad (2.18)$$

for $j \ge 0$. Then it is easy to verify that $G(j)$ is a monotone increasing function, so $G(j) - G(m_J)$ shares the same sign with $j - m_J$ for all $j \ne m_J$. It follows that the sum in (2.10) can be decomposed into three parts: $A = \sum_{j=1}^{m_J - 1} \Pr(J \le j)$, $B = 1\{m_J > 0\} \min\{\Pr(J \le m_J), \Pr(J \ge m_J)\}$, and $C = \sum_{j \ge m_J + 1} \Pr(J \ge j)$. When $m_J = 0$, $C = E[J]$, $B = 0$ because $1\{m_J > 0\} = 0$, and $A = 0$ by convention since $m_J - 1 < 1$. If $p_j = \Pr(J = j)$, then it is easy to see that whenever $m_J \ge 1$,

$$A = \sum_{j=1}^{m_J - 1} \sum_{h=1}^{j} p_h + (m_J - 1) p_0 = \sum_{h=1}^{m_J - 1} \sum_{j=h}^{m_J - 1} p_h + (m_J - 1) p_0 = \sum_{h=0}^{m_J - 1} (m_J - h) p_h - p_0; \quad (2.19)$$

$$C = \sum_{j=m_J+1}^{\infty} \sum_{h=j}^{\infty} p_h = \sum_{h=m_J+1}^{\infty} \sum_{j=m_J+1}^{h} p_h = \sum_{h=m_J+1}^{\infty} (h - m_J) p_h. \quad (2.20)$$

Noting that $(m_J - h) p_h = 0$ when $h = m_J$, we see that when $m_J \ge 1$,

$$A + B + C = E|J - m_J| - p_0 + \min\{\Pr(J \ge m_J), \Pr(J \le m_J)\}$$
$$= E|J - m_J| + \Pr(J > 0) + \min\{\Pr(J \ge m_J), \Pr(J \le m_J)\} - 1$$
$$= E|J - m_J| + \Pr(J > 0) - \max\{\Pr(J < m_J), \Pr(J > m_J)\}. \quad (2.21)$$

When $m_J = 0$, $A = B = 0$ and $C = \sum_{h \ge 1} h p_h = E[J]$, which is (2.12) because then $S_{k,L} = \Pr(J > 0)$, cancelling exactly the $\Pr(J > 0)$ term. This completes the proof of (2.12).

The proof of (2.14) follows also from (2.10), using the identity $\min\{a, b\} = 0.5[a + b - |a - b|]$ and the fact that $\Pr(J \ge j) + \Pr(J \le j) = 1 + \Pr(J = j)$ for any $j$. This leads to

$$\sum_{j \ge 1} \min\{\Pr(J \ge j), \Pr(J \le j)\} = 0.5 \sum_{j \ge 1} \left[1 + \Pr(J = j) - |\Pr(J \ge j) - \Pr(J \le j)|\right]$$
$$= 0.5 \sum_{j \ge 1} \left[1 - |\Pr(J > j) + \Pr(J > j - 1) - 1|\right] + 0.5 \Pr(J > 0).$$

Expression (2.14) then follows because $\{J > j\} = \{\tau > k + (j+1)L\}$.

The above result tells us that whenever $m_J = 0$, our bound is identical to the one given by Biswas et al. (2019). From (2.10), the two bounds are the same if and only if $G(1) \ge 0$, which is the same as $2p_0 \ge 1 - p_1$, where $p_j = \Pr(J = j)$. Clearly this inequality is satisfied when $m_J = 0$, that is, when $p_0 \ge 0.5$. It also implies that $m_J \le 1$, because for any $j < m_J$,

$$G(j) = \Pr(J < j) + \Pr(J \le j) - 1 \le 2\Pr(J < m_J) - 1 < 0, \quad (2.22)$$

since $\Pr(J \le m_J - 1) < 0.5$ by $m_J$ being the smallest integer median. Therefore we have

Theorem 2.2.
Under the same regularity conditions of Theorem 2.1, a sufficient and necessary condition for the bound in Theorem 2.1 to equal $E[J]$ is $2p_0 \ge 1 - p_1$.

Proof. The if and only if condition follows trivially from the identity $G(1) = 2p_0 + p_1 - 1$. Clearly $p_0 \ge 0.5$ implies $G(1) \ge 0$. The necessity but insufficiency of $m_J \le 1$ is seen from the fact that $m_J \le 1$ if and only if $p_0 + p_1 \ge 0.5$, which is implied by $G(1) \ge 0$ but does not imply it.

Remark 3.
Theorem 2.2 implies that $m_J = 0$ is a sufficient condition, and $m_J \le 1$ is a necessary condition, for the two bounds to be the same. But the condition $m_J = 1$ itself is not sufficient.

Remark 4.
An intriguing new insight provided by bound (2.12) is that not only does the average coupling time matter, the variation of the coupling time is important too. The $S_{k,L}$ term also suggests that even symmetry matters, since $S_{k,L}$ achieves its maximum when the distribution is symmetrical locally around the median.

Let $\zeta = \tau - t$, which is the number of steps needed after time $t$ in order to couple (assuming the coupling has not already happened by time $t$). Then the sufficient and necessary condition in Theorem 2.2 is the same as

$$\Pr(\zeta \le L) \ge \Pr(\zeta > 2L),$$

suggesting that the new bound is more useful when the distribution of $\zeta$ places more mass on the right side of the coupling interval $(L, 2L]$ than on its left side. The implication is that the improvement of the new bound, if any, will more likely come from those situations where either $t$ is small or $\tau$ is large (for fixed $L$), that is, when the mixing is poor.

We assume that
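The relations among $E[J]$, the new bound (2.10), the alternative form (2.12), and the equality condition of Theorem 2.2 can all be verified numerically for any toy distribution of $J$. A Python sketch (our illustration; the function names are hypothetical):

```python
def smallest_integer_median(pmf):
    # smallest m with Pr(J <= m) >= 1/2 (then Pr(J >= m) >= 1/2 holds too)
    cdf = 0.0
    for m, p in enumerate(pmf):
        cdf += p
        if cdf >= 0.5:
            return m

def bounds(pmf):
    # pmf[j] = Pr(J = j); returns (E[J], the new bound (2.10), the form (2.12))
    n = len(pmf)
    EJ = sum(j * p for j, p in enumerate(pmf))
    le = [sum(pmf[:j + 1]) for j in range(n)]    # Pr(J <= j)
    ge = [1 - le[j] + pmf[j] for j in range(n)]  # Pr(J >= j)
    new = sum(min(ge[j], le[j]) for j in range(1, n))                        # (2.10)
    m = smallest_integer_median(pmf)
    S = max(ge[m] - pmf[m], le[m] - pmf[m])      # max{Pr(J > m), Pr(J < m)}
    alt = sum(abs(j - m) * p for j, p in enumerate(pmf)) + (1 - pmf[0]) - S  # (2.12)
    return EJ, new, alt

# m_J = 0 (p_0 >= 1/2), so 2 p_0 >= 1 - p_1 and the two bounds coincide:
EJ, new, alt = bounds([0.6, 0.3, 0.1])          # all three equal 0.5 (up to rounding)

# heavy-tailed J: the condition of Theorem 2.2 fails and the new bound wins:
EJ2, new2, alt2 = bounds([0.1, 0.1, 0.2, 0.6])  # E[J] = 2.3, new bound = 1.2
```

The second example matches Remark 4: a large and spread-out coupling time is exactly where (2.10) improves on $E[J]$.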
$Q > 1$ coupled processes $\{(X^{(q)}_t, Y^{(q)}_t): 1 \le q \le Q\}$ are run in parallel and that, for all $1 \le q \le Q$, $\mathbf{X}^{(q)} = \{X^{(q)}_k\}_{k \ge 0}$ and $\mathbf{Y}^{(q)} = \{Y^{(q)}_k\}_{k \ge 0}$ have been successfully $L$-coupled. The latter implies that the chains $\mathbf{X}^{(q)}$ are run $L$ more steps than the $\mathbf{Y}^{(q)}$ chains, and there exist stopping times $\{\tau^{(q)}: q = 1, \ldots, Q\}$ such that $X^{(q)}_{t+L} = Y^{(q)}_t$ for all $t \ge \tau^{(q)}$. We will work with a modified version of (2.4) that incorporates control variates:

$$H^{*(q)}_{k,L}(\mathbf{X}^{(q)}, \mathbf{Y}^{(q)}) = h\left(X^{(q)}_{k + J^{(q)}_{k,L} L}\right) + \sum_{j=0}^{J^{(q)}_{k,L} - 1} \left[h(X^{(q)}_{k+jL}) - h(Y^{(q)}_{k+jL})\right] - \sum_{j=0}^{m_{\tilde J^{(q)}_{k,L}}} \left[h(X^{(q)}_{k+jL}) - h(Y^{(q)}_{k+jL})\right], \quad (3.1)$$

where $m_{\tilde J}$ denotes the smallest integer median of $\tilde J$. An unbiased estimator of $H^{*(q)}_{k,L}(\mathbf{X}^{(q)}, \mathbf{Y}^{(q)})$ is straightforward to produce, but additional care must be paid to maintain the independence between the estimator for $m_{\tilde J}$ and $(\mathbf{X}^{(q)}, \mathbf{Y}^{(q)})$. To satisfy the latter, we construct the estimator $m^{(q)}_{\tilde J}$ from all coupled processes but the $q$-th one, as described in Algorithm 1.

1. Compute $J^{(q)}_{k,L}$;
2. Sample independently $\zeta^{(q)} \sim \text{Bernoulli}(0.5)$ and set $\tilde J^{(q)}_{k,L} = J^{(q)}_{k,L} - \zeta^{(q)}$;
3. Set $m_{\tilde J^{(q)}_{k,L}} = \lfloor \text{med}(\{\tilde J^{(h)}_{k,L}: 1 \le h \le Q, h \ne q\}) \rfloor$, where $\text{med}(A)$ denotes the median of the values in set $A$ and $\lfloor\cdot\rfloor$ is the floor function.

Algorithm 1:
Algorithm for computing $m_{\tilde J^{(q)}_{k,L}}$ for a fixed $k$ and all $q \in \{1, 2, \ldots, Q\}$.

The time-averaging version of (2.4) is

$$H^{(q)}_{k:r;L}(\mathbf{X}^{(q)}, \mathbf{Y}^{(q)}) = \frac{1}{r-k+1} \sum_{t=k}^{r} \left[ h\left(X^{(q)}_{t + J^{(q)}_{t,L} L}\right) + \sum_{j=0}^{J^{(q)}_{t,L} - 1} \left[h(X^{(q)}_{t+jL}) - h(Y^{(q)}_{t+jL})\right] \right]$$
$$= \frac{1}{r-k+1} \sum_{t=k}^{r} h\left(X^{(q)}_{t + J^{(q)}_{t,L} L}\right) + \frac{1}{r-k+1} \sum_{t=k}^{r \wedge (\tau^{(q)} - L)} \sum_{j=0}^{J^{(q)}_{t,L} - 1} \left[h(X^{(q)}_{t+jL}) - h(Y^{(q)}_{t+jL})\right], \quad (3.2)$$

where $a \wedge b$ denotes the minimum of the pair $(a, b)$. The average estimator that includes the control-variate swindle is then

$$H^{*(q)}_{k:r;L}(\mathbf{X}^{(q)}, \mathbf{Y}^{(q)}) = H^{(q)}_{k:r;L}(\mathbf{X}^{(q)}, \mathbf{Y}^{(q)}) - \frac{1}{r-k+1} \sum_{t=k}^{r} \sum_{j=0}^{m_{\tilde J^{(q)}_{t,L}}} \left[h(X^{(q)}_{t+jL}) - h(Y^{(q)}_{t+jL})\right]. \quad (3.3)$$

Because each term $h(X^{(q)}_{t+jL}) - h(Y^{(q)}_{t+jL})$ in the control-variate sum above has mean zero, we expect the gain from the control-variate swindle to diminish as $r$ increases: the law of large numbers kicks in, and the overall control-variate term approaches zero. We will see this phenomenon in Section 4.

When estimating $B_{k,L}$ we can use either (2.11), (2.12), or (2.14). In all our numerical experiments we have used (2.12), computed via the steps described in Algorithm 2.

1. Compute $J^{(q)}_{k,L}$ and $m^{(q)}_{k,L}$ for all $q = 1, \ldots, Q$;
2. Compute the empirical means
$$e_{k,L} = \frac{1}{Q} \sum_{q=1}^{Q} \left| J^{(q)}_{k,L} - m^{(q)}_{k,L} \right|, \qquad p_{k,L} = \frac{1}{Q} \sum_{q=1}^{Q} 1\{J^{(q)}_{k,L} > 0\},$$
$$g_{k,L} = \frac{1}{Q} \sum_{q=1}^{Q} 1\{J^{(q)}_{k,L} > m^{(q)}_{k,L}\}, \qquad s_{k,L} = \frac{1}{Q} \sum_{q=1}^{Q} 1\{J^{(q)}_{k,L} < m^{(q)}_{k,L}\};$$
3.
Compute $\hat B_{k,L} = e_{k,L} + p_{k,L} - g_{k,L} \vee s_{k,L}$, where $a \vee b$ denotes the maximum of $a$ and $b$.

Algorithm 2: Algorithm for estimation of $B_{k,L}$ for any given $k$ and $L$.

In the next section we investigate the performance of the control-variate swindle and compare the new total variation bound with the bound (2.1) provided by Biswas et al. (2019).

4 Examples and Illustrations

When coupling two independent Metropolis samplers, the distribution of the coupling time $\tau$ is geometric. This is due to the fact that the maximal coupling procedure uses the same proposal for both transition kernels, and coupling occurs when both chains accept it. Nevertheless, as a theoretical illustration, we adopt $\delta = \tau - (L-1) \sim \text{Geo}(p)$ for its analytic tractability. Because $J = \max\{0, \lceil (\delta - k - 1)/L \rceil\}$, we see that

$$\Pr(J = 0) = \Pr(\delta \le k+1) = 1 - q^{k+1}, \qquad \Pr(J > j) = \Pr(\delta > k + 1 + Lj) = q^{k+1+Lj}, \quad j = 0, 1, \ldots, \quad (4.1)$$

where $q = 1 - p$. That is, $J$ is a mixture of (i) the Dirac point measure $\delta_{\{0\}}$ with mixture proportion $1 - q^{k+1}$, and (ii) a geometric distribution with probability of success $1 - q^L$, with weight $q^{k+1}$. This immediately implies that the bound given in Biswas et al. (2019) has the expression (2.2).

For our new bound, in this case it is easier to use expression (2.10) directly. Let $m$ be the largest integer such that $\Pr(J \ge m) \ge \Pr(J \le m)$, that is, the largest integer such that

$$q^{k+1+L(m-1)} + q^{k+1+Lm} \ge 1 \iff m = \left\lfloor \frac{L - k - 1}{L} - \frac{\log[1 + q^L]}{L \log(q)} \right\rfloor. \quad (4.2)$$

Clearly, when $m \le 0$, our $B_{k,L}(p)$ is the same as the old bound (2.2). When $m \ge 1$, we have

$$B_{k,L}(p) = \sum_{j=1}^{m} \Pr(J \le j) + \sum_{j \ge m+1} \Pr(J \ge j) \quad \{\text{by } (4.1)\}$$
$$= \sum_{j=1}^{m} \left[1 - q^{k+1+Lj}\right] + \sum_{j=m+1}^{\infty} q^{k+1+L(j-1)}$$
$$= m - \frac{q^{k+1+L}\left[1 - q^{mL}\right]}{1 - q^L} + \frac{q^{k+1+mL}}{1 - q^L} = m - \frac{q^{k+1+L}\left[1 - q^{mL} - q^{(m-1)L}\right]}{1 - q^L}. \quad (4.3)$$

We can also compute $B_{k,L}(p)$ directly from (2.14) as an infinite sum
$$B_{k,L}(p) = 0.5 \sum_{j \ge 1} \left[1 - \left|q^{k+1+L(j-1)} + q^{k+1+Lj} - 1\right|\right] + 0.5\, q^{k+1}. \quad (4.4)$$

Figure 1 compares the bounds (2.2) and (4.4) for different values of $p$, $L$ and $t$. One can see that the new bound is sharper, and only for larger values of $p$, corresponding to fast-mixing chains, are the two bounds indistinguishable. The horizontal line in Figure 1 marks the obvious bound, since $d_{TV} \le 1$. We also notice that for very small values of $p$ both bounds are vacuous, but the new bound has a larger range for being non-vacuous.

The simulations in the remaining two examples rely on the unbiasedmcmc package of Pierre Jacob that is available here: https://github.com/pierrejacob/unbiasedmcmc/tree/master/vignettes. Additional programs for implementing the new ideas in this paper are available as supplemental materials from the authors.

Figure 1: Comparison of the bound (2.1) provided by Biswas et al. (2019) (dashed black line) and the new bound given in (2.17) (solid red line). Note that for small values of p both bounds are vacuous.

Figure 2: Ising Model: Comparison of TV bounds for the PT algorithm for SSG for several values of L. The dashed black line shows the bound (2.1) derived in Biswas et al. (2019) and the solid red line shows the new bound given in (2.17).

The Ising model example follows the setup in Biswas et al. (2019). We consider a square lattice of pixels with values in $\{-1, 1\}$, and with periodic boundaries. A state of the system is then a configuration $x$ of pixel values, and the target probability is defined as $\pi_\beta(x) \propto \exp(\beta \sum_{i \sim j} x_i x_j)$, where $i \sim j$ means that $x_i$ and $x_j$ are pixel values in neighbouring sites. This illustration uses the parallel tempering algorithm (PT; see Swendsen and Wang, 1986) coupled with single-site Gibbs (SSG) updating. It is known that larger values of $\beta$ increase the dependence between neighbouring sites, and this "stickiness" leads to slow mixing of the SSG. The target of interest corresponds to the largest value of $\beta$ considered,
and we use 12 chains, each targeting a distribution $\pi_\beta(x)$ with equally spaced $\beta$ values. Figure 2 shows the total variation bounds: the one provided by (2.1) is shown in dashed black, and (2.17) is in solid red. The bounds are derived for a range of $k$ and several values of $L$. For smaller values of $L$ the patterns are similar, but the TV bounds are larger for smaller values of $k$. The new bound, (2.17), is computed from $Q = 50$ parallel runs and is averaged over independent replicates, while (2.1) is averaged over independent replicates of a single coupled process.

As in the previous example, the bound proposed here is smaller for relatively smaller values of $L$, but this happens mostly in the $k$-region where both bounds are vacuous. As $k$ increases, the bound (2.1) derived in Biswas et al. (2019) catches up to (2.17).

To compare both the bounds and the unbiased estimators, we follow Biswas et al. (2019) and consider a Bayesian logistic regression model for the German Credit data from Lichman (2013). The data consist of $n = 1000$ binary responses $\{Y_i: 1 \le i \le n\}$ and $d = 49$ covariates $\{x_i \in \mathbb{R}^d: 1 \le i \le n\}$. The response $Y_i$ indicates whether the $i$-th individual is fit to receive credit ($Y_i = 1$) or not ($Y_i = 0$). The logistic regression model frames the probabilistic dependence between the response and the covariates as $\Pr(Y_i = 1 \mid x_i) = [1 + \exp(-x_i^T \beta)]^{-1}$. The prior is set to $\beta \sim N(0, I_d)$. Sampling from the posterior distribution is done via the Polya-Gamma sampler of Polson et al. (2013), using the R programs made available by Biswas et al. (2019) at https://github.com/niloyb/LlagCouplings.

In Figure 3 we compare Biswas et al.'s bound (2.1) (dashed black line) with our bound (2.17) (solid red line). The bound in (2.1) is averaged over 2000 independent replicates. The new bound is computed from running 50 coupled processes in parallel and averaged over 40 replicates, thus totalling the same number of runs.
The difference between the two bounds is apparent for smaller values of L, when the new bound is sharper for small values of k, but the gain diminishes quickly as L increases.

Also of interest are the gains in efficiency for the Monte Carlo estimators when implementing the control variate swindle. Using 500 independent replicates of a single coupled process with lag L = 5, we obtain Monte Carlo estimates of the posterior means of the regression coefficients. In Figure 4 we present the relative reduction in variance (RRV), computed as RRV = Var_MCCV(β̂) / Var_MC(β̂), where β̂ is the posterior mean of the regression coefficients β ∈ R^d, and Var_MC and Var_MCCV denote the estimated Monte Carlo variances of β̂ obtained without and, respectively, with control variates. The left panel shows the RRV when using the single-run estimators (2.4) and (3.1), while the right panel plots the RRV for the mean estimators (3.2) and (3.3). We see clearly that the gain is rather significant for m = 1 (left panels) but diminishes when m = 30 (right panels), as we discussed in Section 3.

The idea of L-lag coupling has clearly opened multiple avenues for future research. The use of control variates is just one of them. Although the practical gain is small, or possibly even negative once we account for the added computation required by the control variates, the theoretical gain is very intriguing, because it is truly a free lunch: we obtain a theoretically superior bound without imposing any additional assumption. This naturally raises a question: is our bound the best possible without further conditions? We do not know. We do not even know how to study such a question theoretically because, as far as we are aware, this is the first time a better distance bound has been obtained by constructing a more efficient Monte Carlo estimator.
Whereas seeking other more efficient estimators seems to be a natural direction, we must keep in mind that they would incur additional computational cost, and hence they are not a free lunch. Therefore, there is no logical implication that we can keep getting more free theoretical lunch. One plausible direction is to go beyond linearly combining mean-zero control variates, though we have not been able to land anything useful. We certainly invite interested readers to seek better (theoretical) free lunches. But even without seeking better bounds, our current bounds already offer the opportunity to investigate fresh perspectives for optimizing an MCMC kernel using adaptive ideas, and we intend to pursue these in our second tour.

Figure 3: German Credit Data: comparison of TV bounds for the Polya-Gamma sampler for L ∈ { , , , }. The dashed black line shows the bound (2.1) derived in Biswas et al. (2019) and the solid red line shows the new bound given in (2.17). The bound (2.17) is obtained from running 50 coupled chains in parallel and averaging over 40 independently replicated experiments. The bound (2.1) is averaged over 2000 independent replicates.

Figure 4: German Credit Data: relative reduction in variance (RRV) for the 49 regression coefficients. Top panels: the lag is L = 5. Bottom panels: the lag is L = 20. Left panels: RRV is obtained from the single estimators without and, respectively, with control variates, using k = 5 in (2.4) and (3.1). Right panels: RRV is obtained from the average estimators without and, respectively, with control variates, using k = 5 and r = 30 in (3.2) and (3.3).

Acknowledgements

We are grateful to Pierre Jacob for many useful comments and for gracefully patient guidance through the package unbiasedmcmc, which allowed us to perform the simulations in the paper. We also thank Yves Atchadé, Tamas Papp, Christopher Sherlock and Lei Sun for very helpful discussions and comments.
This research has been partially supported by grants from NSERC of Canada (RVC) and NSF of USA (XLM).

References

Berthelsen, K. K. and Møller, J. (2002). A primer on perfect simulation for spatial point processes. Bull. Braz. Math. Soc. (N.S.)

Biswas, N., Jacob, P. E. and Vanetti, P. (2019). Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems.

Corcoran, J. N. and Schneider, U. (2005). Pseudo-perfect and adaptive variants of the Metropolis-Hastings algorithm with an independent candidate density. J. Stat. Comput. Simul.

Corcoran, J. N. and Tweedie, R. L. (2002). Perfect sampling from independent Metropolis-Hastings chains. J. Statist. Plann. Inference.

Craiu, R. V. and Lemieux, C. (2007). Acceleration of the multiple-try Metropolis algorithm using antithetic and stratified sampling. Statistics and Computing.

Craiu, R. V. and Meng, X.-L. (2001). Antithetic coupling for perfect sampling. In Bayesian Methods, with Applications to Science, Policy and Official Statistics - Selected Papers from ISBA 2000 (E. I. George, ed.).

Craiu, R. V. and Meng, X.-L. (2005). Multiprocess parallel antithetic coupling for backward and forward Markov chain Monte Carlo. The Annals of Statistics.

Craiu, R. V. and Meng, X.-L. (2011). Perfection within reach: exact MCMC sampling. Handbook of Markov Chain Monte Carlo.

Craiu, R. V. and Meng, X.-L. (2020). Discussion of “Unbiased Markov chain Monte Carlo with couplings” by Pierre E. Jacob, John O’Leary and Yves F. Atchadé. J. Royal Statist. Society, Series B, to appear.

Dobrow, R. P. and Fill, J. A. (2003). Speeding up the FMMR perfect sampling algorithm: a case study revisited. Random Structures Algorithms.

Ensor, K. B. and Glynn, P. W. (2000). Simulating the maximum of a random walk. Journal of Statistical Planning and Inference.

Glynn, P. W. (2016). Exact simulation vs exact estimation. In . IEEE.

Glynn, P. W. and Heidelberger, P. (1991).
Analysis of parallel replicated simulations under a completion time constraint. ACM Transactions on Modeling and Computer Simulation (TOMACS).

Glynn, P. W. and Rhee, C.-H. (2014). Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability.

Heng, J. and Jacob, P. E. (2019). Unbiased Hamiltonian Monte Carlo with couplings. Biometrika.

Huber, M. (2004). Perfect sampling using bounding chains. Ann. Appl. Probab.

Huber, M. L. (2002). A bounding chain for Swendsen-Wang. Random Structures and Algorithms.

Jacob, P. E., Lindsten, F. and Schön, T. B. (2019). Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association.

Jacob, P. E., O’Leary, J. and Atchadé, Y. F. (2020). Unbiased Markov chain Monte Carlo with couplings (with discussion). J. Royal Statist. Society, Series B, to appear.

Lichman, M. (2013). UCI Machine Learning Repository.

Meng, X.-L. (2000). Towards a more general Propp-Wilson algorithm: multistage backward coupling. In Monte Carlo Methods (N. Madras, ed.), vol. of Fields Institute Communications. American Mathematical Society.

Møller, J. (1999). Perfect simulation of conditionally specified models. J. Royal Statist. Society, Series B.

Murdoch, D. J. and Meng, X.-L. (2001). Towards perfect sampling for Bayesian mixture priors. In Bayesian Methods, with Applications to Science, Policy and Official Statistics - Selected Papers from ISBA 2000 (E. I. George, ed.).

Murdoch, D. J. and Takahara, G. (2006). Perfect sampling for queues and network models. ACM Transactions on Modeling and Computer Simulation.

Nelson, B. L. (2016). ‘Some tactical problems in digital simulation’ for the next 10 years. Journal of Simulation.

Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association.

Propp, J. G. and Wilson, D. B. (1996).
Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms.

Propp, J. G. and Wilson, D. B. (1998). How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph. J. Algorithms.

Stein, N. M. and Meng, X.-L. (2013). Practical perfect sampling using composite bounding chains: the Dirichlet-multinomial model. Biometrika.

Swendsen, R. H. and Wang, J.-S. (1986). Replica Monte Carlo simulation of spin-glasses. Physical Review Letters.

Thönnes, E. (1999). Perfect simulation of some point processes for the impatient user. Adv. in Appl. Probab.

Van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation. Journal of Computational and Graphical Statistics.

Wilson, D. B. (1998). Annotated bibliography of perfectly random sampling with Markov chains. In Microsurveys in Discrete Probability (D. Aldous and J. Propp, eds.), vol. of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society. Updated versions appear at .

Yu, Y. and Meng, X.-L. (2011). To center or not to center: that is not the question - an Ancillarity-Sufficiency Interweaving Strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics.