Convergence Rates of Two-Component MCMC Samplers
Qian Qin∗ and Galin L. Jones†
School of Statistics, University of Minnesota
June 29, 2020
Abstract
Component-wise MCMC algorithms, including Gibbs and conditional Metropolis-Hastings samplers, are commonly used for sampling from multivariate probability distributions. A long-standing question regarding Gibbs algorithms is whether a deterministic-scan (systematic-scan) sampler converges faster than its random-scan counterpart. We answer this question when the samplers involve two components by establishing an exact quantitative relationship between the L² convergence rates of the two samplers. The relationship shows that the deterministic-scan sampler converges faster. We also establish qualitative relations among the convergence rates of two-component Gibbs samplers and some conditional Metropolis-Hastings variants. For instance, it is shown that if a two-component conditional Metropolis-Hastings sampler is geometrically ergodic, then so are the associated Gibbs samplers.

∗[email protected]
†[email protected]. Partially supported by the National Science Foundation.

1 Introduction
Markov chain Monte Carlo (MCMC) algorithms are useful for sampling from complicated distributions (Brooks et al., 2011). Component-wise MCMC algorithms, such as Gibbs samplers and conditional Metropolis-Hastings (CMH) samplers, sometimes called Metropolis-within-Gibbs, are among the most useful in multivariate settings. We study the convergence rates of two-component Gibbs samplers and the mixed case where one of the components is updated using Metropolis-Hastings, paying particular attention to the relationship between the convergence rates of the Markov chains.

Investigating the convergence rates of the underlying Markov chains is important for ensuring a reliable simulation effort (Geyer, 1992; Flegal et al., 2008; Jones and Hobert, 2001). If the Markov chain converges sufficiently fast, then, under moment conditions, a central limit theorem will obtain (Chan and Geyer, 1994; Doss et al., 2014; Hobert et al., 2002; Jones, 2004; Robertson et al., 2020). Additionally, asymptotically valid Monte Carlo standard errors are available (Dai and Jones, 2017; Flegal and Jones, 2010; Jones et al., 2006; Vats et al., 2018, 2019).

Let Π(dx, dy) be a joint probability distribution having support X × Y, and let Π_{X|Y}(dx|y), y ∈ Y, and Π_{Y|X}(dy|x), x ∈ X, be the associated full conditional distributions. There are many potential component-wise MCMC algorithms having Π as their invariant distribution. When it is possible to simulate from the conditionals, it is natural to use a Gibbs sampler. One version is the deterministic-scan Gibbs (DG) sampler, which is now described.

Algorithm 1
Deterministic-scan Gibbs sampler

Input: the current value (X_n, Y_n) = (x, y).
1. Draw Y_{n+1} from Π_{Y|X}(·|x), and call the observed value y′.
2. Draw X_{n+1} from Π_{X|Y}(·|y′).
3. Set n = n + 1.

An alternative is the random-scan Gibbs (RG) sampler, which is described below.

Algorithm 2
Random-scan Gibbs sampler with selection probability r ∈ (0, 1)

Input: the current value (X_n, Y_n) = (x, y).
1. Draw U ∼ Bernoulli(r), and call the observed value u.
2. If u = 1, draw X_{n+1} from Π_{X|Y}(·|y), and set Y_{n+1} = y.
3. If u = 0, draw Y_{n+1} from Π_{Y|X}(·|x), and set X_{n+1} = x.
4. Set n = n + 1.

Two-component Gibbs samplers are surprisingly useful and widely applicable in the analysis of sophisticated Bayesian statistical models. In particular, they arise naturally in data augmentation settings (Hobert, 2011; Tanner and Wong, 1987; van Dyk and Meng, 2001).

There is abundant study of the convergence properties of Gibbs samplers, both in the general case (see Liu et al., 1994; Roberts and Polson, 1994; Liu et al., 1995) and for two-component Gibbs samplers in specific statistical settings; see, among many others, Diaconis et al. (2008), Doss and Hobert (2010), Ekvall and Jones (2019), Hobert and Geyer (1998), Johnson and Jones (2008), Johnson and Jones (2015), Jones and Hobert (2004), Khare and Hobert (2013), Marchev and Hobert (2004), Roy (2012), Tan and Hobert (2009), Wang and Roy (2018b), and Wang and Roy (2018a). However, there is not yet an answer to the following basic question: which converges faster, a deterministic- or random-scan Gibbs sampler?

There exist some qualitative results related to this question (see Johnson et al., 2013; Tan et al., 2013). For instance, Roberts and Rosenthal (1997) show that a random-scan Gibbs sampler is uniformly ergodic whenever an associated deterministic-scan Gibbs sampler is. There is also literature devoted to finding the convergence rates of various Gibbs samplers when Π is Gaussian, or approximately Gaussian (see, e.g., Amit, 1991, 1996; Amit and Grenander, 1991; Roberts and Sahu, 1997).
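To make the two scan orders concrete, here is a minimal sketch of Algorithms 1 and 2 for a bivariate Gaussian target with correlation GAMMA. The target, the constant GAMMA, and the function names are our illustrative choices, not notation from the paper.

```python
import numpy as np

# Assumed example target: bivariate standard normal with correlation GAMMA,
# so both full conditionals are available in closed form.
GAMMA = 0.8

def draw_x_given_y(y, rng):
    # X | Y = y ~ N(GAMMA * y, 1 - GAMMA^2)
    return GAMMA * y + np.sqrt(1.0 - GAMMA**2) * rng.standard_normal()

def draw_y_given_x(x, rng):
    # Y | X = x ~ N(GAMMA * x, 1 - GAMMA^2)
    return GAMMA * x + np.sqrt(1.0 - GAMMA**2) * rng.standard_normal()

def dg_step(x, y, rng):
    """One iteration of Algorithm 1 (deterministic scan): update Y, then X."""
    y_new = draw_y_given_x(x, rng)
    x_new = draw_x_given_y(y_new, rng)
    return x_new, y_new

def rg_step(x, y, rng, r=0.5):
    """One iteration of Algorithm 2 (random scan, selection probability r)."""
    if rng.uniform() < r:
        return draw_x_given_y(y, rng), y      # update X only
    return x, draw_y_given_x(x, rng)          # update Y only

rng = np.random.default_rng(0)
x = y = 0.0
dg_draws = []
for _ in range(20000):
    x, y = dg_step(x, y, rng)
    dg_draws.append((x, y))
dg_draws = np.array(dg_draws)
```

Note that each deterministic-scan iteration refreshes both coordinates, while each random-scan iteration refreshes exactly one; this asymmetry is what the comparison of convergence rates below quantifies.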
However, in general, the relationship between the convergence rates of deterministic- and random-scan Gibbs samplers has been poorly understood. A related question is addressed by Andrieu (2016), who shows that the DG sampler yields sample means with smaller asymptotic variances than its random-scan counterpart, assuming that, in the RG sampler, the selection probability is r = 1/2.

The L² convergence rate of a Markov chain is a number in [0, 1]; the smaller the rate, the faster the convergence. (A precise definition is given in Section 2.) Let ρ(P_DG) be the L² convergence rate of the DG sampler, and ρ(P_RG) that of the RG sampler. Then we show that

  ρ(P_RG) = [1 + √(1 − 4r(1 − r)[1 − ρ(P_DG)])] / 2.   (1)

There are some easy, but noteworthy, consequences of this result. Notice that (i) ρ(P_RG) ∈ [1/2, 1] while ρ(P_DG) ∈ [0, 1]; (ii) as one of ρ(P_DG) or ρ(P_RG) increases, so does the other; (iii) if ρ(P_DG) < 1, then ρ(P_RG) > ρ(P_DG), but ρ(P_DG) = 1 if and only if ρ(P_RG) = 1; and (iv) the optimal selection probability for P_RG is r = 1/2, in which case

  ρ(P_RG) = [1 + √ρ(P_DG)] / 2.

In Section 4 we will generalize this discussion and show that the DG sampler converges faster even after taking into account computation time.

Perhaps the most common type of MCMC sampler in applications is the conditional Metropolis-Hastings (CMH) sampler. These Markov chains arise when it is infeasible to sample from at least one of the conditional distributions associated with Π, so that a Metropolis-Hastings update must be used. Assume that Π_{Y|X} and Π_{X|Y}, respectively, admit density functions π_{Y|X} and π_{X|Y}. Let q(·|x, y), (x, y) ∈ X × Y, be a proposal density function on X. The deterministic-scan CMH (DC) sampler we study is now described.

Algorithm 3
Deterministic-scan CMH sampler

Input:
The current value (X_n, Y_n) = (x, y).
1. Draw Y_{n+1} from Π_{Y|X}(·|x), and call the observed value y′.
2. Draw a random element Z from q(·|x, y′), and call the observed value z.
3. With probability

  a(z; x, y′) = min{1, [π_{X|Y}(z|y′) q(x|z, y′)] / [π_{X|Y}(x|y′) q(z|x, y′)]},

set X_{n+1} = z; with probability 1 − a(z; x, y′), set X_{n+1} = x.
4. Set n = n + 1.

There is an obvious alternative random-scan CMH (RC) sampler.

Algorithm 4
Random-scan CMH sampler with selection probability r ∈ (0, 1)

Input:
The current value (X_n, Y_n) = (x, y).
1. Draw U ∼ Bernoulli(r), and call the observed value u.
2. If u = 1, draw a random element Z from q(·|x, y), and call the observed value z. With probability

  a(z; x, y) = min{1, [π_{X|Y}(z|y) q(x|z, y)] / [π_{X|Y}(x|y) q(z|x, y)]},

set X_{n+1} = z; with probability 1 − a(z; x, y), set X_{n+1} = x. Set Y_{n+1} = y.
3. If u = 0, draw Y_{n+1} from Π_{Y|X}(·|x), and set X_{n+1} = x.
4. Set n = n + 1.

Despite their utility, compared to Gibbs samplers there has been little investigation of CMH Markov chains (Fort et al., 2003; Herbei and McKeague, 2009; Johnson et al., 2013; Jones et al., 2014; Rosenthal and Rosenthal, 2015; Roberts and Rosenthal, 1997, 1998), and what there is tends not to focus on specific statistical models. For example, Johnson et al. (2013) show that if a deterministic-scan component-wise Markov chain is uniformly ergodic, then so is its random-scan counterpart, thus generalizing the result proved for Gibbs samplers by Roberts and Rosenthal (1997), which was described previously.

Both versions of Gibbs samplers are special cases of the respective versions of CMH samplers. Thus it is plausible that there should be some relationship among the convergence rates of the Markov chains of Algorithms 1–4. There are a few results in this direction. For example, there are sufficient conditions which ensure that if the RG Markov chain is geometrically ergodic, then so is the RC Markov chain (Jones et al., 2014). However, these relationships are not well understood in general, and the following question has not been addressed satisfactorily: if one of the four basic component-wise samplers is geometrically ergodic, then, in general, which of the remaining three are also geometrically ergodic?
Figure 1: Relationship among two-component Gibbs samplers and their CMH variants in terms of L² geometric ergodicity. [Diagram: the nodes DG, RG, DC, and RC, connected by solid and dashed arrows.]

We give an answer to this question by developing qualitative relationships among the convergence rates of the DG, RG, DC, and RC samplers, which are depicted in Figure 1. Here, we consider L² geometric ergodicity. A Markov chain is L² geometrically ergodic if its L² convergence rate is strictly less than 1. Under regularity conditions, L² geometric ergodicity is equivalent to the usual notion of geometric ergodicity defined in terms of the total variation distance (Roberts and Rosenthal, 1997; Roberts and Tweedie, 2001). (This equivalence will be made precise in Section 3.) In Figure 1, a solid arrow from one sampler to another means that, if the former is L² geometrically ergodic, then so is the latter. A dashed arrow means that L² geometric ergodicity of the former only implies that of the latter under appropriate conditions on the proposal density q(·|x, y). One of these conditions is (C1) in Section 5.

Figure 1 yields the following. The DG sampler is L² geometrically ergodic if and only if the RG sampler is. If the RC sampler is L² geometrically ergodic for some proposal density, then so are the DG and RG samplers. If the DC sampler is L² geometrically ergodic for some proposal density, then so is the RC sampler with the same proposal density. The relations depicted in Figure 1 hold regardless of the selection probabilities for the random-scan samplers.

The rest of this article is organized as follows. Section 2 contains some general theoretical background. In Section 3, we lay out some basic properties of the four types of samplers. In Section 4, we derive (1). In Section 5, we establish the relations shown in Figure 1. Finally, some technical details are relegated to the Appendices.

Let (Z, F) be a countably generated measurable space and let P be a Markov transition kernel (Mtk); that is, let P: Z × F → [0, 1] be such that for each z ∈ Z, P(z, ·) is a probability measure, and for each A ∈ F, P(·, A) is measurable. If {Z_n}_{n=0}^∞ is a Markov chain whose one-step dynamics are determined by P, then for z ∈ Z, A ∈ F, and each positive integer n and nonnegative integer j, the n-step kernel is given by

  P^n(z, A) = Pr(Z_{n+j} ∈ A | Z_j = z).

If ω is a probability measure on (Z, F) and A ∈ F, define

  (ωP)(A) = ∫ ω(dz) P(z, A).

Say ω is invariant for P if ωP = ω. If

  P(z, dz′) ω(dz) = P(z′, dz) ω(dz′),   (2)

then P is said to be reversible with respect to ω. Integrating both sides of the equality in (2) shows that ω is invariant for P.

For a measurable function f: Z → R and a probability measure μ: F → [0, 1], let

  (Pf)(z) = ∫ f(z′) P(z, dz′)  and  μf = ∫_Z f(z) μ(dz).

Assume that ω is invariant for P. Let L²(ω) be the set of measurable real functions f that are square integrable with respect to ω, and let L²₀(ω) be the set of functions f ∈ L²(ω) such that ωf = 0. For f, g ∈ L²(ω), define their inner product to be

  ⟨f, g⟩_ω = ∫_Z f(z) g(z) ω(dz),

and let ‖f‖_ω = ⟨f, f⟩_ω^{1/2}. Then (L²(ω), ⟨·,·⟩_ω) and (L²₀(ω), ⟨·,·⟩_ω) form two real Hilbert spaces. For any f ∈ L²₀(ω), we have Pf ∈ L²₀(ω). Thus, P can be regarded as a linear operator on L²₀(ω). Let

  ‖P‖_ω = sup_{f ∈ L²₀(ω), ‖f‖_ω = 1} ‖Pf‖_ω.

By the Cauchy-Schwarz inequality, ‖P‖_ω ≤ 1. When P is reversible with respect to ω, P, as an operator on L²₀(ω), is self-adjoint, so that ⟨Pf₁, f₂⟩_ω = ⟨f₁, Pf₂⟩_ω for f₁, f₂ ∈ L²₀(ω), and

  ‖P‖_ω = sup_{f ∈ L²₀(ω), ‖f‖_ω = 1} |⟨Pf, f⟩_ω|.

Moreover, if P is self-adjoint, then for each positive integer n, ‖P^n‖_ω = ‖P‖_ω^n (see, e.g., Helmberg, 2014, §30 Corollary 8.1, §31 Corollary 2.1). Say P is non-negative definite if it is self-adjoint and ⟨Pf, f⟩_ω ≥ 0 for each f ∈ L²₀(ω).

For two probability measures μ and ν on (Z, F), define their L² (or χ²) distance to be

  ‖μ − ν‖_ω = sup_{f ∈ L²₀(ω), ‖f‖_ω = 1} |μf − νf|.

Let L²∗(ω) be the set of probability measures μ such that dμ/dω ∈ L²(ω). The L² convergence rate of the Markov chain associated with P, denoted by ρ(P), is defined to be the infimum of ρ ∈ [0, 1] such that, for each μ ∈ L²∗(ω), there exists C_μ < ∞ such that, for each positive integer n,

  ‖μP^n − ω‖_ω < C_μ ρ^n.

When ρ(P) < 1, we say that the Markov chain is L² geometrically ergodic, or more simply, P is L² geometrically ergodic. The following is a direct consequence of Roberts and Rosenthal's (1997) Theorem 2.1.

Lemma 2.1. If P is reversible with respect to ω, then ρ(P) = ‖P‖_ω.

The following comparison lemma will be useful in conjunction with Lemma 2.1.

Lemma 2.2. Let P₀ and P₁ be Mtks on (Z, F) having a common stationary distribution ω. Suppose further that ‖P₀‖_ω < 1 and that there exists δ > 0 such that, for each z ∈ Z and A ∈ F, P₁(z, A) ≥ δP₀(z, A). Then ‖P₁‖_ω < 1.

Proof. Without loss of generality, assume that δ < 1. Let R(z, A) = (1 − δ)^{-1}(P₁(z, A) − δP₀(z, A)). Then R(z, A) defines an Mtk such that ωR = ω. By Cauchy-Schwarz, ‖R‖_ω ≤ 1. By the triangle inequality, ‖P₁‖_ω ≤ δ‖P₀‖_ω + (1 − δ)‖R‖_ω < 1.

Proposition 2.3. Let P₀ and P₁ be Mtks on (Z, F) such that for any 0 < r < 1, P_r = rP₁ + (1 − r)P₀ is reversible with respect to ω. If ρ(P_{r₀}) < 1 for some r₀ ∈ (0, 1), then ρ(P_r) < 1 for every r ∈ (0, 1).

Proof.
For each z ∈ Z and A ∈ F,

  P_r(z, A) ≥ min{r/r₀, (1 − r)/(1 − r₀)} P_{r₀}(z, A).

Since P_r is reversible with respect to ω for all r ∈ (0, 1), the result follows from Lemmas 2.1 and 2.2.

We begin by defining the Markov transition kernels for the four algorithms described in Section 1, along with some related Markov chains that will be useful later. Then we will turn our attention to some basic properties of the operators and total variation norms for these Markov chains.

Suppose (X × Y, F_X × F_Y) is a countably-generated measurable space with a joint probability distribution Π(dx, dy). Let Π_X(dx) and Π_Y(dy) be the associated marginal distributions, and Π_{X|Y}(dx|y) and Π_{Y|X}(dy|x) the full conditional distributions. To avoid trivial cases we make the following standing assumption.

Assumption 3.1. There exist A₁, A₂ ∈ F_X and B₁, B₂ ∈ F_Y such that A₁ ∩ A₂ = ∅, B₁ ∩ B₂ = ∅, and Π_X(A₁) > 0, Π_X(A₂) > 0, Π_Y(B₁) > 0, Π_Y(B₂) > 0.

If Assumption 3.1 does not hold, then F_X or F_Y may contain only sets of measure zero or one, and all the problems we study become essentially trivial.

Letting Π, Π_X, or Π_Y play the role of ω from Section 2, as appropriate, allows us to consider the Mtks defined in the sequel as linear operators on the appropriate Hilbert spaces. Assumption 3.1 ensures that L²₀(Π), L²₀(Π_X), and L²₀(Π_Y) contain non-zero elements. The Mtk for the DG sampler is

  P_DG((x, y), (dx′, dy′)) = Π_{X|Y}(dx′|y′) Π_{Y|X}(dy′|x).

P_DG has Π as its invariant distribution, but it is not reversible with respect to Π. If δ_x and δ_y are point masses at x and y, respectively, then the Mtk for the RG sampler is

  P_RG((x, y), (dx′, dy′)) = r Π_{X|Y}(dx′|y) δ_y(dy′) + (1 − r) Π_{Y|X}(dy′|x) δ_x(dx′).

It is well known that P_RG is reversible with respect to Π and hence has Π as its invariant distribution.

Now let P_MH denote the Metropolis-Hastings Mtk (Tierney, 1994, 1998) which is reversible with respect to the full conditional Π_{X|Y}. Then the Mtk for the DC sampler is

  P_DC((x, y), (dx′, dy′)) = P_MH(dx′|x, y′) Π_{Y|X}(dy′|x).

Note that P_DC has Π as its invariant distribution, but it is not reversible with respect to Π. The Mtk for the RC sampler is

  P_RC((x, y), (dx′, dy′)) = r P_MH(dx′|x, y) δ_y(dy′) + (1 − r) Π_{Y|X}(dy′|x) δ_x(dx′),

and it is again well known that P_RC is reversible with respect to Π and hence has Π as its invariant distribution.

It will be convenient to consider marginalized versions of the DG chain, which we now define. The X-marginal DG chain is defined on X, and its Mtk is

  P_XDG(x, dx′) = ∫_Y Π_{X|Y}(dx′|y) Π_{Y|X}(dy|x).
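When the state space is finite, P_XDG can be assembled directly from the joint pmf, and its reversibility with respect to Π_X checked numerically. A sketch (the 4 × 3 pmf below is an arbitrary illustration):

```python
import numpy as np

# Build P_XDG for a small discrete joint distribution Pi on a 4 x 3 grid.
rng = np.random.default_rng(1)
J = rng.random((4, 3))
J /= J.sum()                           # joint pmf of (X, Y)

pi_x = J.sum(axis=1)                   # marginal of X
pi_y = J.sum(axis=0)                   # marginal of Y
P_y_given_x = J / pi_x[:, None]        # rows: Pi_{Y|X}(. | x)
P_x_given_y = (J / pi_y[None, :]).T    # rows: Pi_{X|Y}(. | y)

# P_XDG(x, x') = sum_y Pi_{X|Y}(x' | y) Pi_{Y|X}(y | x)
P_XDG = P_y_given_x @ P_x_given_y

# Detailed-balance flux pi_x(x) P_XDG(x, x'); symmetry = reversibility w.r.t. pi_x
flux = pi_x[:, None] * P_XDG
```

The symmetry of `flux` reflects the fact that π_X(x) P_XDG(x, x′) = Σ_y Π(x, y)Π(x′, y)/π_Y(y), which is symmetric in (x, x′).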
Similarly, the Y-marginal DG chain is defined on Y, and has Mtk

  P_YDG(y, dy′) = ∫_X Π_{Y|X}(dy′|x) Π_{X|Y}(dx|y).

Note that P_XDG and P_YDG are reversible with respect to Π_X and Π_Y, respectively (Liu et al., 1995). Moreover, it is well known that the convergence properties of the marginal chains, P_XDG and P_YDG, are essentially those of the original DG chain (Robert, 1995; Roberts and Rosenthal, 2001).

There also exists an X-marginal version of the DC sampler (but not a Y-marginal version), with Mtk given by

  P_XDC(x, dx′) = ∫_Y P_MH(dx′|x, y) Π_{Y|X}(dy|x).

Jones et al. (2014) show that P_XDC is reversible with respect to Π_X and enjoys the same qualitative rate of convergence in total variation norm as the parent DC sampler.

3.2 Operator norms

It is clear that P_DG, P_RG, P_DC, and P_RC can be regarded as operators defined on L²₀(Π). Among them, P_RG and P_RC are self-adjoint. It can be checked that P_RG is non-negative definite (Rudolf and Ullrich, 2013). Also, P_XDG and P_XDC are self-adjoint operators on L²₀(Π_X), while P_YDG is a self-adjoint operator on L²₀(Π_Y). Moreover, P_XDG and P_YDG are non-negative definite (Liu et al., 1995).

Using Lemma 2.1 and the fact that the RG and RC chains are reversible with respect to Π, we have

  ρ(P_RG) = ‖P_RG‖_Π  and  ρ(P_RC) = ‖P_RC‖_Π.

Similar relations for the deterministic-scan samplers are given in the following lemma, whose proof is given in Appendix A.

Lemma 3.2. For each positive integer n,

  ‖P^n_DG‖_Π^{1/(n−1/2)} = ρ(P_DG) = ‖P_XDG‖_{Π_X} = ‖P_YDG‖_{Π_Y},
  ‖P^n_DC‖_Π^{1/(n−1)} ≤ ρ(P_DC) = ‖P_XDC‖_{Π_X} ≤ ‖P^n_DC‖_Π^{1/n}.

(‖P^n_DC‖_Π^{1/(n−1)} is interpreted as 0 when n = 1.)

Applying Lemmas 2.1 and 3.2 we obtain the following.

Corollary 3.3. ρ(P_DG) = ρ(P_XDG) = ρ(P_YDG) and ρ(P_DC) = ρ(P_XDC).

We will require one more result, which is due to Liu et al. (1994); see also Liu et al. (1995) and Vidav (1977). For g ∈ L²₀(Π_X) and h ∈ L²₀(Π_Y), let

  γ(g, h) = ∫_{X×Y} g(x) h(y) Π(dx, dy),

and let

  γ̄ = sup{γ(g, h) : g ∈ L²₀(Π_X), ‖g‖_{Π_X} = 1, h ∈ L²₀(Π_Y), ‖h‖_{Π_Y} = 1}.

We say that γ̄ ∈ [0, 1] is the maximal correlation between X and Y.

Lemma 3.4. γ̄² = ‖P_XDG‖_{Π_X} = ‖P_YDG‖_{Π_Y}.

3.3 Total variation

We consider the connection between L² geometric ergodicity and the usual notion of geometric ergodicity defined through the total variation norm, denoted by ‖·‖_TV. For the four component-wise Markov chains considered here we can use results from Roberts and Tweedie (2001) to show that these concepts are equivalent (see also Roberts and Rosenthal, 1997). A proof is provided in Appendix B.

Proposition 3.5. Let P denote the Mtk for any of the DG, RG, DC, and RC Markov chains. Suppose that P is ϕ-irreducible. Then P is L² geometrically ergodic if and only if it is Π-almost everywhere geometrically ergodic in the sense that, for Π-almost every (x, y), there exist C(x, y) < ∞ and t < 1 such that, for each positive integer n,

  ‖P^n((x, y), ·) − Π(·)‖_TV ≤ C(x, y) t^n.

4 ρ(P_DG) and ρ(P_RG)

The main result of this section follows; its proof is given in Section 4.1.
Theorem 4.1.

  ρ(P_RG) = [1 + √(1 − 4r(1 − r)[1 − ρ(P_DG)])] / 2.

We illustrate Theorem 4.1 in two examples.
Example 4.2. When Π is Gaussian, there are explicit formulas for ρ(P_DG) and ρ(P_RG) (Amit, 1996; Roberts and Sahu, 1997). In particular, when Π is a bivariate Gaussian and the correlation between X and Y is γ ∈ [−1, 1],

  ρ(P_DG) = γ²

(see, e.g., Diaconis et al., 2008). Meanwhile,

  ρ(P_RG) = [1 + √(1 − 4r(1 − r)(1 − γ²))] / 2

(Levine and Casella, 2008). This is in accordance with the general result in Theorem 4.1.

Example 4.3. When X × Y is a finite set, Π can be written in the form of a probability mass function (pmf), π(·, ·). For illustration, take X = Y = [5] = {1, 2, 3, 4, 5}, and generate the elements of π(i, j), (i, j) ∈ [5] × [5], via a Dirichlet distribution. The convergence rates of the DG and RG samplers can then be calculated using the spectra of their transition matrices. We repeat this experiment 20 times for different values of the selection probability. The results are displayed in Figure 2.

Figure 2: Relationship between ρ(P_DG) and ρ(P_RG) for discrete target distributions. In each subplot, 20 joint pmfs are randomly generated using Dirichlet distributions. Each circle corresponds to a joint pmf. The solid curves depict the relationship given in Theorem 4.1.

We now turn our attention to some of the implications of Theorem 4.1. Notice that, given the selection probability r ∈ (0, 1), ρ(P_RG) and ρ(P_DG) are monotonic functions of each other. Indeed, in light of Lemmas 3.2 and 3.4, given any selection probability, the convergence rates of the two types of Gibbs chains are completely determined by the maximal correlation between X and Y.
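A single replication of Example 4.3 can be carried out as follows: ρ(P_DG) is computed from the spectrum of the X-marginal DG matrix, ρ(P_RG) from the spectrum of the full random-scan transition matrix, and the two are checked against Theorem 4.1. This is our own sketch; the seed and r = 0.3 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
nx = ny = 4
J = rng.dirichlet(np.ones(nx * ny)).reshape(nx, ny)   # random joint pmf
pi_x, pi_y = J.sum(axis=1), J.sum(axis=0)
P_ygx = J / pi_x[:, None]          # Pi_{Y|X}(. | x), rows indexed by x
P_xgy = (J / pi_y[None, :]).T      # Pi_{X|Y}(. | y), rows indexed by y

# rho(P_DG): second-largest eigenvalue of the X-marginal DG matrix
rho_DG = np.sort(np.linalg.eigvals(P_ygx @ P_xgy).real)[-2]

def rho_RG(r):
    # Full RG transition matrix on the nx * ny joint states (x, y)
    P = np.zeros((nx * ny, nx * ny))
    for x in range(nx):
        for y in range(ny):
            s = x * ny + y
            for x2 in range(nx):
                P[s, x2 * ny + y] += r * P_xgy[y, x2]        # update X, keep y
            for y2 in range(ny):
                P[s, x * ny + y2] += (1 - r) * P_ygx[x, y2]  # update Y, keep x
    # P_RG is reversible and non-negative definite, so its eigenvalues are
    # real and lie in [0, 1]; the second-largest is the convergence rate.
    return np.sort(np.linalg.eigvals(P).real)[-2]

r = 0.3
lhs = rho_RG(r)
rhs = (1 + np.sqrt(1 - 4 * r * (1 - r) * (1 - rho_DG))) / 2   # Theorem 4.1
```

For an irreducible finite-state reversible chain, the second-largest eigenvalue (in absolute value) of the transition matrix is exactly the L² convergence rate, which is why the spectral computation above is legitimate.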
ρ(P_DG) = γ̄² = 0 if and only if ρ(P_RG) = max{r, 1 − r}; ρ(P_DG) = γ̄² = 1 if and only if ρ(P_RG) = 1; and when ρ(P_DG) = γ̄² ∈ (0, 1), ρ(P_RG) ∈ (max{r, 1 − r}, 1) and ρ(P_DG) < ρ(P_RG).

Let k∗ > 0 be such that ρ(P_RG)^{k∗} = ρ(P_DG), so that, roughly speaking, one iteration of the DG sampler is "worth" k∗ iterations of the RG sampler in terms of convergence rate. By Young's inequality,

  ρ(P_RG) = [1 + √(1 − 4r(1 − r) + 4r(1 − r)ρ(P_DG))] / 2
          ≥ [1 − 4r(1 − r) + 4r(1 − r)ρ(P_DG)]^{1/4}
          ≥ ρ(P_DG)^{r(1−r)}.

Therefore, k∗ ≥ 1/[r(1 − r)].

Let t₁ and t₂ be the time it takes to sample from Π_{X|Y} and Π_{Y|X}, respectively. For simplicity, assume that they are constants. Suppose that, within unit time, one can run k_D iterations of the DG sampler, and k_R iterations of the RG sampler. Then

  k_R / k_D ≈ (t₁ + t₂) / (rt₁ + (1 − r)t₂).

Since k∗ ≥ 1/[r(1 − r)],

  ρ(P_DG)^{k_D} = ρ(P_RG)^{k∗ k_D} ≈ exp{[log ρ(P_RG)] k∗ k_R (rt₁ + (1 − r)t₂)/(t₁ + t₂)} ≤ ρ(P_RG)^{k_R}.

(The final inequality holds because r(1 − r)(t₁ + t₂) ≤ rt₁ + (1 − r)t₂, so that k∗(rt₁ + (1 − r)t₂)/(t₁ + t₂) ≥ 1, together with log ρ(P_RG) ≤ 0.) In this sense, the DG sampler converges faster than its random-scan counterpart.
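These quantities are easy to check numerically on a small discrete example: γ̄ is the second-largest singular value of the standardized pmf matrix, ρ(P_DG) = γ̄² by Lemmas 3.2 and 3.4, and the bound k∗ ≥ 1/[r(1 − r)] can be verified against the exact rates. A sketch (seed and r are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(7)
J = rng.dirichlet(np.ones(25)).reshape(5, 5)    # random joint pmf, as in Example 4.3
pi_x, pi_y = J.sum(axis=1), J.sum(axis=0)

# Maximal correlation: second-largest singular value of the standardized pmf
# matrix; the top singular value is 1 and corresponds to constant functions.
M = J / np.sqrt(np.outer(pi_x, pi_y))
gamma_bar = np.linalg.svd(M, compute_uv=False)[1]

# rho(P_DG) = gamma_bar^2, confirmed via the spectrum of the X-marginal chain
P_XDG = (J / pi_x[:, None]) @ (J / pi_y[None, :]).T
rho_DG = np.sort(np.linalg.eigvals(P_XDG).real)[-2]

# One DG iteration is "worth" k* >= 1 / [r(1 - r)] RG iterations
r = 0.3
rho_RG = (1 + np.sqrt(1 - 4 * r * (1 - r) * (1 - rho_DG))) / 2   # Theorem 4.1
k_star = np.log(rho_DG) / np.log(rho_RG)
```

The SVD identity holds because the symmetrization of P_XDG equals M Mᵀ, so the eigenvalues of P_XDG are the squared singular values of M.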
Remark 4.4. Before we begin the proof, we note that the result of Theorem 4.1 is related to the theory of two projections (Böttcher and Spitkovsky, 2010). When r = 1/2, an alternative proof of Theorem 4.1 is available if we apply results on the norm of the sum of two projections (e.g., Duncan and Taylor, 1976, Theorem 7) along with Lemma 3.2.

4.1 Proof of Theorem 4.1

By Lemmas 3.2 and 3.4, ρ(P_DG) = γ̄², where γ̄ ∈ [0, 1] is the maximal correlation between X and Y. To prove Theorem 4.1, we need to connect ρ(P_RG) to γ̄. We begin with a preliminary result.

Lemma 4.5. ρ(P_RG) ≥ max{1 − r + rγ̄², r + (1 − r)γ̄²}.

Proof.
Let g ∈ L²₀(Π_X) be such that ‖g‖_{Π_X} = 1. Let f_g be such that f_g(x, y) = g(x) for each (x, y) ∈ X × Y, so that f_g ∈ L²₀(Π) and ‖f_g‖_Π = 1. By Cauchy-Schwarz,

  ‖P_RG‖_Π ≥ ⟨P_RG f_g, f_g⟩_Π
          = r ∫_{X×Y} (∫_X g(x′) Π_{X|Y}(dx′|y)) g(x) Π(dx, dy) + (1 − r)⟨g, g⟩_{Π_X}
          = r ⟨P_XDG g, g⟩_{Π_X} + 1 − r.   (3)

Recall that P_XDG is non-negative definite. This implies that

  ‖P_XDG‖_{Π_X} = sup{⟨P_XDG g′, g′⟩_{Π_X} : g′ ∈ L²₀(Π_X), ‖g′‖_{Π_X} = 1}.

(See, e.g., Helmberg, 2014, §14 Corollary 5.1.) Taking the supremum with respect to g in (3) yields

  ‖P_RG‖_Π ≥ 1 − r + r‖P_XDG‖_{Π_X} = 1 − r + rγ̄²,   (4)

where the last equality follows from Lemma 3.4. By an analogous argument,

  ‖P_RG‖_Π ≥ r + (1 − r)‖P_YDG‖_{Π_Y} = r + (1 − r)γ̄².   (5)

Recall that ρ(P_RG) = ‖P_RG‖_Π. The proof is completed by combining (4) and (5).

Our proof of Theorem 4.1 hinges on the fact that, for each f ∈ L²₀(Π) and (x, y) ∈ X × Y, P_RG f(x, y) can be written in the form g(x) + h(y), where

  g(x) = (1 − r) ∫_Y f(x, y′) Π_{Y|X}(dy′|x),  h(y) = r ∫_X f(x′, y) Π_{X|Y}(dx′|y).

As we will see, this allows us to restrict our attention to a well-behaved subspace of L²₀(Π) when studying the norm of P_RG.

For g ∈ L²₀(Π_X) and h ∈ L²₀(Π_Y), let g ⊕ h be the function on X × Y such that

  (g ⊕ h)(x, y) = g(x) + h(y)

for (x, y) ∈ X × Y (in a Π-almost everywhere sense). Let

  H = {g ⊕ h : g ∈ L²₀(Π_X), h ∈ L²₀(Π_Y)}.

Then H, equipped with the inner product ⟨·,·⟩_Π, is a subspace of L²₀(Π). For g ⊕ h ∈ H,

  ‖g ⊕ h‖²_Π = ‖g‖²_{Π_X} + ‖h‖²_{Π_Y} + 2γ(g, h),

where γ(g, h) is as defined in Section 3. It follows that

  (1 − γ̄)(‖g‖²_{Π_X} + ‖h‖²_{Π_Y}) ≤ ‖g ⊕ h‖²_Π ≤ (1 + γ̄)(‖g‖²_{Π_X} + ‖h‖²_{Π_Y}).   (6)

When γ̄ < 1, g ⊕ h = 0 if and only if g = 0 and h = 0. It follows that, whenever γ̄ < 1, for any f ∈ H, the decomposition f = g ⊕ h is unique.

To proceed, we present two technical results concerning H. Lemma 4.6 is proved in Appendix C, and Lemma 4.7 is a direct consequence of (6).

Lemma 4.6. If γ̄ < 1, then H is a Hilbert space.

Lemma 4.7. Let γ̄ < 1, and suppose that {g_n}_{n=1}^∞ and {h_n}_{n=1}^∞ are sequences in L²₀(Π_X) and L²₀(Π_Y), respectively. For g ∈ L²₀(Π_X) and h ∈ L²₀(Π_Y), lim_{n→∞}(g_n ⊕ h_n) = g ⊕ h if and only if lim_{n→∞} g_n = g and lim_{n→∞} h_n = h.

It is easy to check that, for every f ∈ L²₀(Π), P_RG f ∈ H. Define P_RG|_H to be P_RG restricted to H. The norm of P_RG|_H is

  ‖P_RG|_H‖_Π = sup_{f ∈ H, ‖f‖_Π = 1} ‖P_RG f‖_Π.

We then have the following lemma.
Lemma 4.8. ‖P_RG‖_Π = ‖P_RG|_H‖_Π.

Proof. It is clear that

  ‖P_RG‖_Π ≥ ‖P_RG|_H‖_Π.   (7)

Because the range of P_RG is in H, for any f ∈ L²₀(Π) and positive integer n,

  ‖P^n_RG f‖_Π = ‖(P_RG|_H)^{n−1} P_RG f‖_Π ≤ ‖P_RG|_H‖_Π^{n−1} ‖f‖_Π.

Note that we have used the fact that ‖P_RG‖_Π ≤ 1. Since P_RG is self-adjoint, for each positive integer n, ‖P^n_RG‖_Π = ‖P_RG‖_Π^n. It follows that

  ‖P_RG‖_Π = lim_{n→∞} ‖P^n_RG‖_Π^{1/n} ≤ lim_{n→∞} ‖P_RG|_H‖_Π^{(n−1)/n} = ‖P_RG|_H‖_Π.   (8)

Combining (7) and (8) yields the desired result.

We are now ready to prove the theorem.

Proof of Theorem 4.1.
When γ̄ = 1, the theorem follows from Lemma 4.5 and the fact that ρ(P_RG) = ‖P_RG‖_Π ≤ 1. Assume that γ̄ < 1. We first show that

  ρ(P_RG) ≤ [1 + √(1 − 4r(1 − r)(1 − γ̄²))] / 2.   (9)

It follows from Lemma 4.8 that ρ(P_RG) = ‖P_RG|_H‖_Π. Note that P_RG|_H is a non-negative definite operator on H. By Lemma D.1 in Appendix D, ρ(P_RG) is an approximate eigenvalue of P_RG|_H; that is, there exists a sequence of functions {g_n ⊕ h_n}_{n=1}^∞ in H such that ‖g_n ⊕ h_n‖_Π = 1 for each n, and

  lim_{n→∞} [P_RG(g_n ⊕ h_n) − ρ(P_RG)(g_n ⊕ h_n)] = 0.   (10)

For every positive integer n,

  P_RG(g_n ⊕ h_n) = [(1 − r)g_n + (1 − r)Q₁h_n] ⊕ (rQ₂g_n + rh_n),

where Q₁: L²₀(Π_Y) → L²₀(Π_X) and Q₂: L²₀(Π_X) → L²₀(Π_Y) are bounded linear transformations such that, for g ∈ L²₀(Π_X) and h ∈ L²₀(Π_Y),

  (Q₁h)(x) = ∫_Y h(y) Π_{Y|X}(dy|x),  (Q₂g)(y) = ∫_X g(x) Π_{X|Y}(dx|y).

By Lemma 4.7, (10) implies that

  lim_{n→∞} {[1 − r − ρ(P_RG)]g_n + (1 − r)Q₁h_n} = 0,  lim_{n→∞} {[r − ρ(P_RG)]h_n + rQ₂g_n} = 0.   (11)

Applying Q₁ to the second equality in (11) yields

  lim_{n→∞} {[r − ρ(P_RG)]Q₁h_n + rP_XDG g_n} = 0.

Subtracting (a multiple of) this from (a multiple of) the first equality in (11) gives

  lim_{n→∞} {[1 − r − ρ(P_RG)][r − ρ(P_RG)]g_n − r(1 − r)P_XDG g_n} = 0.   (12)

Similarly, applying Q₂ to the first equality in (11) and subtracting it from the second equality in (11) yields

  lim_{n→∞} {[1 − r − ρ(P_RG)][r − ρ(P_RG)]h_n − r(1 − r)P_YDG h_n} = 0.   (13)

By Lemma 3.4, ‖P_XDG g_n‖_{Π_X} ≤ γ̄²‖g_n‖_{Π_X} and ‖P_YDG h_n‖_{Π_Y} ≤ γ̄²‖h_n‖_{Π_Y}. It follows from (12) and (13) that

  lim sup_{n→∞} {[1 − r − ρ(P_RG)][r − ρ(P_RG)] − r(1 − r)γ̄²} ‖g_n‖_{Π_X} ≤ 0,
  lim sup_{n→∞} {[1 − r − ρ(P_RG)][r − ρ(P_RG)] − r(1 − r)γ̄²} ‖h_n‖_{Π_Y} ≤ 0.

In particular,

  lim sup_{n→∞} {[1 − r − ρ(P_RG)][r − ρ(P_RG)] − r(1 − r)γ̄²} (‖g_n‖_{Π_X} + ‖h_n‖_{Π_Y}) ≤ 0.   (14)

By the triangle inequality,

  ‖g_n‖_{Π_X} + ‖h_n‖_{Π_Y} = ‖g_n ⊕ 0‖_Π + ‖0 ⊕ h_n‖_Π ≥ ‖g_n ⊕ h_n‖_Π = 1.

It then follows from (14) that

  ρ(P_RG)² − ρ(P_RG) + r(1 − r) − r(1 − r)γ̄² ≤ 0,   (15)

so that ρ(P_RG) is at most the larger root of the corresponding quadratic. This proves (9).

Next, we show that

  ρ(P_RG) ≥ [1 + √(1 − 4r(1 − r)(1 − γ̄²))] / 2.   (16)

This will complete the proof, since, by Lemmas 3.2 and 3.4, ρ(P_DG) = γ̄². If γ̄ = 0, then (16) follows immediately from Lemma 4.5. Now assume γ̄ ∈ (0, 1). Recall that ρ(P_RG) = ‖P_RG‖_Π. It suffices to show that

  ‖P_RG‖_Π ≥ [1 + √(1 − 4r(1 − r)(1 − γ̄²))] / 2.   (17)

Recall that P_XDG is non-negative definite, and ‖P_XDG‖_{Π_X} = γ̄². Hence, γ̄² is an approximate eigenvalue of P_XDG. In other words, there exists a sequence of functions {ĝ_n}_{n=1}^∞ in L²₀(Π_X) such that ‖ĝ_n‖_{Π_X} = 1 for each n, and

  lim_{n→∞} (P_XDG ĝ_n − γ̄²ĝ_n) = 0.   (18)

Let

  a = [2r − 1 + √(1 − 4r(1 − r)(1 − γ̄²))] / [2(1 − r)γ̄²].

Consider the sequence of functions {ĝ_n ⊕ aQ₂ĝ_n}_n in H ⊂ L²₀(Π). It is easy to show that, for each n,

  P_RG(ĝ_n ⊕ aQ₂ĝ_n) = (1 − r)[ĝ_n + aP_XDG ĝ_n] ⊕ r(a + 1)Q₂ĝ_n
                     = (1 − r)(1 + aγ̄²)ĝ_n ⊕ r(a + 1)Q₂ĝ_n + (1 − r)a(P_XDG ĝ_n − γ̄²ĝ_n) ⊕ 0.   (19)

It is straightforward to verify that

  (1 − r)(1 + aγ̄²) = r(a + 1)/a = [1 + √(1 − 4r(1 − r)(1 − γ̄²))] / 2.

Hence, (19) can be written as

  P_RG(ĝ_n ⊕ aQ₂ĝ_n) − {[1 + √(1 − 4r(1 − r)(1 − γ̄²))] / 2}(ĝ_n ⊕ aQ₂ĝ_n) = (1 − r)a(P_XDG ĝ_n − γ̄²ĝ_n) ⊕ 0.

By (18), the right-hand side goes to 0 ∈ H as n → ∞. Moreover, by (6), ‖ĝ_n ⊕ aQ₂ĝ_n‖²_Π ≥ 1 − γ̄ > 0. Hence,

  ‖P_RG‖_Π ≥ lim sup_{n→∞} ‖P_RG(ĝ_n ⊕ aQ₂ĝ_n / ‖ĝ_n ⊕ aQ₂ĝ_n‖_Π)‖_Π = [1 + √(1 − 4r(1 − r)(1 − γ̄²))] / 2,

and (17) holds.

5 Qualitative Relationship among Convergence Rates
It follows from Theorem 4.1 that the RG sampler is L² geometrically ergodic if and only if the associated DG sampler is. (The "if" part is essentially proved by Roberts and Rosenthal, 1997.) Our objective for this section is to establish similar relations between other pairs of component-wise samplers introduced in Section 1, and eventually build Figure 1.

First, by Proposition 2.3, if the RG or RC sampler is L² geometrically ergodic for some selection probability, then it is L² geometrically ergodic for all selection probabilities. This allows us to treat the selection probabilities of the RG and RC samplers as arbitrary in what follows.

We now review some existing results. It is certainly not true that, in general, L² geometric ergodicity of the DG and RG samplers implies that of the DC or RC samplers. However, it is shown in Jones et al. (2014) that, under the following condition, (C1), the DC sampler is L² geometrically ergodic whenever the associated DG sampler is, and the RC sampler is L² geometrically ergodic whenever the associated RG sampler is.

(C1)  C = sup_{(x′,x,y) ∈ X×X×Y} π_{X|Y}(x′|y) / q(x′|x,y) < ∞.

(C1) is analogous to a commonly used condition for uniform ergodicity of full-dimensional Metropolis-Hastings samplers (Liu, 1996; Mengersen and Tweedie, 1996; Roberts and Tweedie, 1996; Smith and Tierney, 1996). Indeed, if P_MH is the Metropolis-Hastings Mtk which is reversible with respect to Π_{X|Y} with proposal density q, then under (C1), for each y ∈ Y, x ∈ X, and A ∈ F_X,

  P_MH(A|x,y) ≥ ∫_A min{ q(x|x′,y)/π_{X|Y}(x|y), q(x′|x,y)/π_{X|Y}(x′|y) } π_{X|Y}(x′|y) dx′ ≥ C⁻¹ Π_{X|Y}(A|y).    (20)

For a fixed y ∈ Y, the Markov chain on X defined by P_MH has stationary distribution Π_{X|Y}(·|y), and (20) implies that this chain is uniformly ergodic.
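In a discrete setting (with sums in place of integrals), the minorization in (20) can be verified directly. The target and proposal below are arbitrary illustrative choices; the check confirms that, with y fixed, the Metropolis-Hastings kernel dominates C⁻¹ times the target, elementwise.

```python
import numpy as np

# Discrete illustration of the minorization (20) under (C1): with y fixed,
# the MH kernel targeting pi_{X|Y}(.|y) dominates (1/C) * Pi_{X|Y}(.|y).
# The target p and proposal q below are arbitrary illustrative choices.
rng = np.random.default_rng(1)
m = 6
p = rng.random(m); p /= p.sum()          # target pi_{X|Y}(. | y)
q = rng.random((m, m)) + 0.1             # proposal densities q(x' | x, y)
q /= q.sum(axis=1, keepdims=True)

C = np.max(p[None, :] / q)               # the constant in (C1)

# Metropolis-Hastings transition matrix for this target and proposal.
P = np.zeros((m, m))
for x in range(m):
    for x2 in range(m):
        if x2 != x:
            accept = min(1.0, (p[x2] * q[x2, x]) / (p[x] * q[x, x2]))
            P[x, x2] = q[x, x2] * accept
    P[x, x] = 1.0 - P[x].sum()           # rejection mass stays at x

# Elementwise minorization: P(x, x') >= p(x') / C for every x, x'.
assert np.all(P >= p[None, :] / C - 1e-12)
```

The assertion holds for every pair (x, x′), including x′ = x, which is the whole-kernel form of the bound (20) and the engine behind the uniform ergodicity conclusion.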
Moreover, (C1) implies that if C can be calculated, then one can use an accept-reject sampler, at least in principle, to sample from Π_{X|Y}.

Next, we present a negative result: L² geometric ergodicity of the RC sampler does not necessarily imply that of the associated DC sampler. Indeed, a counterexample can be constructed as follows. Let X = Y = {1, 2}, and suppose that Π is the uniform distribution on X × Y. Then Π_{Y|X} and Π_{X|Y} are uniform distributions on Y and X, respectively. Let q(·|x,y) be defined with respect to the counting measure, and suppose that, for y ∈ {1, 2}, q(2|1,y) = q(1|2,y) = 1. In other words, q(·|x,y) always proposes the point in X that is different from x. The resulting RC chain is L² geometrically ergodic, but the associated DC chain is periodic, and it is easy to show that ρ(P_DC) = 1.

The relations that we have described so far are summarized in Figure 3a. As in Figure 1, a solid arrow from one Markov chain to another means that L² geometric ergodicity of the former implies that of the latter, while a dashed arrow means that the relation does not hold in general, but does under (C1). A dotted arrow from one sampler to another means that L² geometric ergodicity of the former does not imply that of the latter in general, and we have not yet addressed whether it does under (C1).

[Figure 3(a): arrows among the DG, RG, DC, and RC samplers.]
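The counterexample is small enough to verify by direct computation. In the sketch below the components take values in {0, 1} rather than {1, 2}; since the target is uniform, the proposed flip of the X-coordinate is always accepted, so the MH update of X is a deterministic flip.

```python
import numpy as np

# The RC-vs-DC counterexample: uniform target on {0,1}^2, proposal that
# always flips the X-coordinate (always accepted under the uniform target).
states = [(x, y) for x in range(2) for y in range(2)]
idx = {s: i for i, s in enumerate(states)}

K_mh = np.zeros((4, 4))   # MH update of X: deterministic flip
K_y = np.zeros((4, 4))    # Gibbs update of Y: uniform conditional
for (x, y), i in idx.items():
    K_mh[i, idx[(1 - x, y)]] = 1.0
    for y2 in range(2):
        K_y[i, idx[(x, y2)]] = 0.5

r = 0.5
P_DC = K_mh @ K_y                  # deterministic scan: flip X, refresh Y
P_RC = r * K_mh + (1 - r) * K_y    # random scan

def slem(P):                       # second-largest absolute eigenvalue
    return np.sort(np.abs(np.linalg.eigvals(P)))[-2]

print(slem(P_DC), slem(P_RC))      # 1.0 (periodic) and 0.5 (geometric)
```

The DC chain retains an eigenvalue of modulus one (its X-coordinate is periodic with period 2), while in this example the RC chain's rate is max(|1 − 2r|, r) < 1 for any r ∈ (0, 1).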
[Figure 3(b): the arrows of (a), with an additional solid arrow from DC to RC.]
[Figure 3(c): the arrows of (b), with the marginal chains XDG, YDG, and XDC added.]
Figure 3: Building the relations among the convergence rates of component-wise samplers.

Lemma 2.2 allows us to establish the following result, which shows that the RC sampler is L² geometrically ergodic whenever the DC sampler is. This allows us to draw a solid arrow from the DC sampler to the RC sampler in Figure 3b.

Proposition 5.1. If ρ(P_DC) < 1, then ρ(P_RC) < 1.

Proof. By Lemma 3.2, ‖P_DC‖_Π ≤ ρ(P_DC) < 1. For (x,y) ∈ X × Y and A ∈ F_X × F_Y,

  P²_RC((x,y), A) ≥ r(1−r) P_DC((x,y), A).

By Lemma 2.2, ‖P²_RC‖_Π < 1. Since P_RC is self-adjoint,

  ρ(P_RC) = ‖P_RC‖_Π = ‖P²_RC‖_Π^{1/2} < 1.

Proposition 5.2. If ρ(P_RC) < 1, then ρ(P_DG) < 1.

Proof.
Consider the contrapositive and recall that, by Lemmas 3.2 and 3.4, ρ(P_DG) = γ̄². Assume that γ̄ = 1. It suffices to show that ρ(P_RC) = 1.

Let g ∈ L²(Π_X) and h ∈ L²(Π_Y) be such that ‖g‖_{Π_X} = ‖h‖_{Π_Y} = 1. Let f_g ∈ L²(Π) be such that f_g(x,y) = g(x), and f_h ∈ L²(Π) be such that f_h(x,y) = h(y). Recall that ρ(P_RC) = ‖P_RC‖_Π. By the Cauchy-Schwarz inequality,

  ρ(P_RC) ≥ ⟨P_RC f_h, f_g⟩_Π = r ∫_{X×Y} h(y) g(x) Π(dx, dy) + (1−r) ∫_{X×Y} ∫_Y h(y′) Π_{Y|X}(dy′|x) g(x) Π(dx, dy) = γ(g, h).

Taking the supremum with respect to g and h shows that ρ(P_RC) ≥ γ̄ = 1.

Incorporating Propositions 5.1 and 5.2 into Figure 3a yields Figure 3b. From here, it is straightforward to obtain Figure 1. Finally, we can integrate Corollary 3.3 into Figure 1, and this yields Figure 3c.

Appendices

A Proof of Lemma 3.2
We will prove that ‖P_DG^n‖_Π^{1/(n−1/2)} = ρ(P_DG) = ‖P_{X,DG}‖_{Π_X} = ‖P_{Y,DG}‖_{Π_Y}. The proof for the other equation is similar.

(i) ‖P_{X,DG}‖_{Π_X} = ‖P_{Y,DG}‖_{Π_Y}. This is given in Liu et al.'s (1994) Theorem 3.2.

(ii) ‖P_DG^n‖_Π^{1/(n−1/2)} = ‖P_{X,DG}‖_{Π_X}. Firstly, since P_{X,DG} is self-adjoint, for each positive integer n, ‖P_{X,DG}^n‖_{Π_X} = ‖P_{X,DG}‖^n_{Π_X}, and similarly for P_{Y,DG}.

We begin by showing that ‖P_DG^n‖_Π^{1/(n−1/2)} ≤ ‖P_{X,DG}‖_{Π_X}. Let f ∈ L²(Π) be such that ‖f‖_Π = 1, and let

  h_f(y) = ∫_X f(x,y) Π_{X|Y}(dx|y), y ∈ Y,
  g_f(x) = ∫_Y h_f(y) Π_{Y|X}(dy|x), x ∈ X.

Then h_f ∈ L²(Π_Y) and g_f ∈ L²(Π_X). It is easy to verify that, for each positive integer n and (x,y) ∈ X × Y, P_DG^n f(x,y) = P_{X,DG}^{n−1} g_f(x). Moreover, by the Cauchy-Schwarz inequality, ‖h_f‖_{Π_Y} ≤ 1. It follows that

  ‖g_f‖²_{Π_X} = ⟨g_f, g_f⟩_{Π_X} = ⟨h_f, P_{Y,DG} h_f⟩_{Π_Y} ≤ ‖P_{Y,DG}‖_{Π_Y} = ‖P_{X,DG}‖_{Π_X}.

Therefore,

  ‖P_DG^n f‖_Π = ‖P_{X,DG}^{n−1} g_f‖_{Π_X} ≤ ‖P_{X,DG}‖^{n−1}_{Π_X} ‖g_f‖_{Π_X} ≤ ‖P_{X,DG}‖^{n−1/2}_{Π_X}.

Taking the supremum with respect to f yields the desired inequality.

We now show that ‖P_{Y,DG}‖_{Π_Y} ≤ ‖P_DG^n‖_Π^{1/(n−1/2)}, and it will follow immediately that ‖P_{X,DG}‖_{Π_X} ≤ ‖P_DG^n‖_Π^{1/(n−1/2)}. Let h ∈ L²(Π_Y) be such that ‖h‖_{Π_Y} = 1. Let f_h ∈ L²(Π) be such that f_h(x,y) = h(y) for (x,y) ∈ X × Y. Then ‖f_h‖_Π = 1. Lastly, let Qh ∈ L²(Π_X) be such that (Qh)(x) = ∫_Y h(y) Π_{Y|X}(dy|x). Careful calculation shows that

  ⟨h, P_{Y,DG}^{2n−1} h⟩_{Π_Y} = ⟨P_{X,DG}^{n−1} Qh, P_{X,DG}^{n−1} Qh⟩_{Π_X} = ⟨P_DG^n f_h, P_DG^n f_h⟩_Π ≤ ‖P_DG^n‖²_Π.

Since P_{Y,DG}^{2n−1} is non-negative definite,

  ‖P_{Y,DG}^{2n−1}‖_{Π_Y} = sup{ ⟨P_{Y,DG}^{2n−1} h′, h′⟩_{Π_Y} : h′ ∈ L²(Π_Y), ‖h′‖_{Π_Y} = 1 }.

This shows that ‖P_{Y,DG}‖^{2n−1}_{Π_Y} ≤ ‖P_DG^n‖²_Π, that is, ‖P_{Y,DG}‖_{Π_Y} ≤ ‖P_DG^n‖_Π^{1/(n−1/2)}.

(iii) ρ(P_DG) = ‖P_{X,DG}‖_{Π_X}. By Lemma 2.1, ‖P_{X,DG}‖_{Π_X} = ρ(P_{X,DG}), the L² convergence rate of the X-marginal DG chain.

We now show that ρ(P_{X,DG}) ≤ ρ(P_DG). Let g ∈ L²(Π_X) be such that ‖g‖_{Π_X} = 1. Let f_g ∈ L²(Π) be such that f_g(x,y) = g(x). Then ‖f_g‖_Π = 1. For any μ ∈ L²_*(Π_X) and positive integer n,

  |μP_{X,DG}^n g − Π_X g| = |μ̃P_DG^n f_g − Π f_g| ≤ ‖μ̃P_DG^n − Π‖_Π,

where μ̃ is any measure in L²_*(Π) such that ∫_Y μ̃(·, dy) = μ(·). Taking the supremum with respect to g shows that ‖μP_{X,DG}^n − Π_X‖_{Π_X} ≤ ‖μ̃P_DG^n − Π‖_Π. This implies that ρ(P_{X,DG}) ≤ ρ(P_DG).

Finally, we show that ρ(P_DG) ≤ ρ(P_{X,DG}). Let μ̃ ∈ L²_*(Π), and define f ∈ L²(Π) and g_f ∈ L²(Π_X) as in (ii). Then, for a positive integer n,

  |μ̃P_DG^n f − Π f| = |μP_{X,DG}^{n−1} g_f − Π_X g_f| ≤ ‖μP_{X,DG}^{n−1} − Π_X‖_{Π_X},

where μ(·) = ∫_Y μ̃(·, dy). Taking the supremum with respect to f shows that ‖μ̃P_DG^n − Π‖_Π ≤ ‖μP_{X,DG}^{n−1} − Π_X‖_{Π_X}, which implies that ρ(P_DG) ≤ ρ(P_{X,DG}).

B Proof of Proposition 3.5
We will prove the result for P_DC and P_RC. The proofs for P_DG and P_RG are similar.

Consider P_RC. The claim follows immediately from its reversibility with respect to Π (Roberts and Tweedie, 2001, Theorem 2).

Now suppose that P_DC is L² geometrically ergodic. Then it is Π-a.e. geometrically ergodic (Roberts and Tweedie, 2001, Theorem 1). Conversely, suppose that P_DC is Π-a.e. geometrically ergodic. This implies that P_{X,DC} is Π_X-a.e. geometrically ergodic. It is also straightforward to check that P_{X,DC} is ϕ*-irreducible, with ϕ*(·) = ∫_Y ϕ(·, dy). Since P_{X,DC} is reversible with respect to Π_X, it is also L² geometrically ergodic. By Corollary 3.3, P_DC must be L² geometrically ergodic as well.

C Proof of Lemma 4.6
It suffices to show that H is a closed subspace of L²(Π) (see, e.g., Helmberg, 2014). To this end, consider a convergent sequence in H, {g_n ⊕ h_n}_{n=1}^∞, such that

  lim_{n→∞} (g_n ⊕ h_n) = f ∈ L²(Π).

The sequence {g_n ⊕ h_n} is Cauchy, that is,

  lim_{n→∞} sup_{m≥n} ‖g_n ⊕ h_n − (g_m ⊕ h_m)‖_Π = 0.

By (6),

  ‖g_n ⊕ h_n − (g_m ⊕ h_m)‖²_Π ≥ (1 − γ̄)(‖g_n − g_m‖²_{Π_X} + ‖h_n − h_m‖²_{Π_Y}).

Since γ̄ < 1, {g_n} and {h_n} are Cauchy as well. By the completeness of L²(Π_X) and L²(Π_Y), there exist g ∈ L²(Π_X) and h ∈ L²(Π_Y) such that

  lim_{n→∞} g_n = g,  lim_{n→∞} h_n = h.

Again by (6),

  ‖g_n ⊕ h_n − (g ⊕ h)‖²_Π ≤ (1 + γ̄)(‖g_n − g‖²_{Π_X} + ‖h_n − h‖²_{Π_Y}).

This implies that lim_{n→∞} (g_n ⊕ h_n) = g ⊕ h. Hence, f = g ⊕ h ∈ H, meaning that H is closed.

D A Lemma concerning Theorem 4.1
The following lemma is the result of several elementary facts in functional analysis. See, e.g.,Helmberg (2014), §
23, 24.
Lemma
D.1. Let H′ be a real or complex Hilbert space equipped with inner product ⟨·, ·⟩ and norm ‖·‖. Let P be a bounded non-negative definite operator on H′. Then ‖P‖ is an approximate eigenvalue of P, i.e., there exists a sequence {f_n}_{n=1}^∞ in H′ such that ‖f_n‖ = 1 for each n, and lim_{n→∞} ‖Pf_n − ‖P‖ f_n‖ = 0.

Proof.
Since P is non-negative definite,

  ‖P‖ = sup_{f ∈ H′, ‖f‖=1} ⟨Pf, f⟩.

It follows that there exists a sequence {f_n}_{n=1}^∞ in H′ such that ‖f_n‖ = 1 for each n, and lim_{n→∞} ⟨Pf_n, f_n⟩ = ‖P‖. Note that ⟨Pf_n, f_n⟩ ≤ ‖Pf_n‖ ≤ ‖P‖. This implies that ‖Pf_n‖ → ‖P‖ as n → ∞. It follows that

  lim_{n→∞} ‖Pf_n − ‖P‖ f_n‖² = lim_{n→∞} (‖Pf_n‖² + ‖P‖² − 2‖P‖⟨Pf_n, f_n⟩) = 0.

References

Amit, Y. (1991). On rates of convergence of stochastic relaxation for Gaussian and non-Gaussian distributions.
Journal of Multivariate Analysis, 38(1):82–99.

Amit, Y. (1996). Convergence properties of the Gibbs sampler for perturbations of Gaussians. Annals of Statistics, 24(1):122–140.

Amit, Y. and Grenander, U. (1991). Comparing sweep strategies for stochastic relaxation. Journal of Multivariate Analysis, 37(2):197–222.

Andrieu, C. (2016). On random- and systematic-scan samplers. Biometrika, 103(3):719–726.

Böttcher, A. and Spitkovsky, I. M. (2010). A gentle guide to the basics of two projections theory. Linear Algebra and its Applications, 432(6):1412–1459.

Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton.

Chan, K. S. and Geyer, C. J. (1994). Discussion: Markov chains for exploring posterior distributions. Annals of Statistics, 22(4):1747–1758.

Dai, N. and Jones, G. L. (2017). Multivariate initial sequence estimators in Markov chain Monte Carlo. Journal of Multivariate Analysis, 159:184–199.

Diaconis, P., Khare, K., and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials (with discussion). Statistical Science, 23:151–200.

Doss, C. R., Flegal, J. M., Jones, G. L., and Neath, R. C. (2014). Markov chain Monte Carlo estimation of quantiles. Electronic Journal of Statistics, 8:2448–2478.

Doss, H. and Hobert, J. P. (2010). Estimation of Bayes factors in a class of hierarchical random effects models using a geometrically ergodic MCMC algorithm. Journal of Computational and Graphical Statistics, 19:295–312.

Duncan, J. and Taylor, P. (1976). Norm inequalities for C*-algebras. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 75(2):119–129.

Ekvall, K. O. and Jones, G. L. (2019). Convergence analysis of a collapsed Gibbs sampler for Bayesian vector autoregressions. Preprint arXiv:1907.03170.

Flegal, J. M., Haran, M., and Jones, G. L. (2008). Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23:250–260.

Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38:1034–1070.

Fort, G., Moulines, E., Roberts, G., and Rosenthal, J. (2003). On the geometric ergodicity of hybrid samplers. Journal of Applied Probability, 40(1):123–146.

Geyer, C. J. (1992). Practical Markov chain Monte Carlo (with discussion). Statistical Science, 7:473–511.

Greenwood, P. E., McKeague, I. W., and Wefelmeyer, W. (1998). Information bounds for Gibbs samplers. Annals of Statistics, 26(6):2128–2156.

Helmberg, G. (2014). Introduction to Spectral Theory in Hilbert Space. Elsevier.

Herbei, R. and McKeague, I. W. (2009). Hybrid samplers for ill-posed inverse problems. Scandinavian Journal of Statistics, 36:839–853.

Hobert, J. P. (2011). The data augmentation algorithm: Theory and methodology. In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L., editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press.

Hobert, J. P. and Geyer, C. J. (1998). Geometric ergodicity of Gibbs and block Gibbs samplers for a hierarchical random effects model. Journal of Multivariate Analysis, 67:414–430.

Hobert, J. P., Jones, G. L., Presnell, B., and Rosenthal, J. S. (2002). On the applicability of regenerative simulation in Markov chain Monte Carlo. Biometrika, 89(4):731–743.

Johnson, A. A. and Jones, G. L. (2008). Comment: Gibbs sampling, exponential families, and orthogonal polynomials. Statistical Science, 23:183–186.

Johnson, A. A. and Jones, G. L. (2015). Geometric ergodicity of random scan Gibbs samplers for hierarchical one-way random effects models. Journal of Multivariate Analysis, 140:325–342.

Johnson, A. A., Jones, G. L., and Neath, R. C. (2013). Component-wise Markov chain Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statistical Science, 28(3):360–375.

Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.

Jones, G. L., Haran, M., Caffo, B. S., and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101:1537–1547.

Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16(4):312–334.

Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. The Annals of Statistics, 32:784–817.

Jones, G. L., Roberts, G. O., and Rosenthal, J. S. (2014). Convergence of conditional Metropolis-Hastings samplers. Advances in Applied Probability, 46(2):422–445.

Khare, K. and Hobert, J. P. (2013). Geometric ergodicity of the Bayesian lasso. Electronic Journal of Statistics, 7:2150–2163.

Levine, R. A. and Casella, G. (2008). Comment: On random scan Gibbs samplers. Statistical Science, 23(2):192–195.

Liu, J. S. (1996). Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119.

Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81:27–40.

Liu, J. S., Wong, W. H., and Kong, A. (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. Journal of the Royal Statistical Society, Series B, 57(1):157–169.

Marchev, D. and Hobert, J. P. (2004). Geometric ergodicity of van Dyk and Meng's algorithm for the multivariate Student's t model. Journal of the American Statistical Association, 99(465):228–238.

Mengersen, K. L. and Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Annals of Statistics, 24(1):101–121.

Robert, C. P. (1995). Convergence control methods for Markov chain Monte Carlo algorithms. Statistical Science, 10:231–253.

Roberts, G. O. and Polson, N. G. (1994). On the geometric convergence of the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 56(2):377–384.

Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2(2):13–25.

Roberts, G. O. and Rosenthal, J. S. (1998). Two convergence properties of hybrid samplers. The Annals of Applied Probability, 8:397–407.

Roberts, G. O. and Rosenthal, J. S. (2001). Markov chains and de-initializing processes. Scandinavian Journal of Statistics, 28(3):489–504.

Roberts, G. O. and Rosenthal, J. S. (2016). Surprising convergence properties of some simple Gibbs samplers under various scans. International Journal of Statistics and Probability, 5(1):51–60.

Roberts, G. O. and Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B, 59(2):291–317.

Roberts, G. O. and Tweedie, R. L. (1996). Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110.

Roberts, G. O. and Tweedie, R. L. (2001). Geometric L¹ and L² convergence are equivalent for reversible Markov chains. Journal of Applied Probability, 38(A):37–41.

Robertson, N., Flegal, J. M., Vats, D., and Jones, G. L. (2020). Assessing and visualizing simultaneous simulation error. Preprint arXiv:1904.11912.

Rosenthal, J. S. and Rosenthal, P. (2015). Spectral bounds for certain two-factor non-reversible MCMC algorithms. Electronic Communications in Probability, 20:1–10.

Roy, V. (2012). Convergence rates for MCMC algorithms for a robust Bayesian binary regression model. Electronic Journal of Statistics, 6:2463–2485.

Rudolf, D. and Ullrich, M. (2013). Positivity of hit-and-run and related algorithms. Electronic Communications in Probability, 18:1–8.

Smith, R. L. and Tierney, L. (1996). Exact transition probabilities for the independence Metropolis sampler. Preprint.

Tan, A. and Hobert, J. P. (2009). Block Gibbs sampling for Bayesian random effects models with improper priors: Convergence and regeneration. Journal of Computational and Graphical Statistics, 18(4):861–878.

Tan, A., Jones, G. L., and Hobert, J. P. (2013). On the geometric ergodicity of two-variable Gibbs samplers. In Jones, G. L. and Shen, X., editors, Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 25–42. Institute of Mathematical Statistics, Beachwood, Ohio.

Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82(398):528–540.

Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728.

Tierney, L. (1998). A note on Metropolis–Hastings kernels for general state spaces. The Annals of Applied Probability, 8:1–9.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics, 10(1):1–50.

Vats, D., Flegal, J. M., and Jones, G. L. (2018). Strong consistency of multivariate spectral variance estimators in Markov chain Monte Carlo. Bernoulli, 24:1860–1909.

Vats, D., Flegal, J. M., and Jones, G. L. (2019). Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106:321–337.

Vidav, I. (1977). The norm of the sum of two projections. Proceedings of the American Mathematical Society, 65(2):297–298.

Wang, X. and Roy, V. (2018a). Convergence analysis of the block Gibbs sampler for Bayesian probit linear mixed models with improper priors. Electronic Journal of Statistics, 12:4412–4439.

Wang, X. and Roy, V. (2018b). Geometric ergodicity of Pólya-Gamma Gibbs sampler for Bayesian logistic regression with a flat prior.