Convergence of Gibbs Sampling: Coordinate Hit-and-Run Mixes Fast
Aditi Laddha and Santosh S. Vempala∗

Abstract
The Gibbs Sampler is a general method for sampling high-dimensional distributions, dating back to 1971 [14]. In each step, we pick a random coordinate and re-sample that coordinate from the distribution induced by fixing all the other coordinates. While it has become widely used over the past half-century, guarantees of efficient convergence have been elusive. Here we show that for convex bodies in R^n with diameter D, the resulting Coordinate Hit-and-Run (CHAR) algorithm mixes in poly(n, D) steps. This is the first polynomial guarantee for this widely-used algorithm. We also give a lower bound on the mixing rate, showing that it is strictly worse than hit-and-run or the ball walk in the worst case.

1 Introduction

Sampling a high-dimensional distribution is a fundamental problem and a basic ingredient of algorithms for optimization, integration, statistical inference, and other applications. Progress on sampling algorithms has led to many useful tools, both theoretical and practical. In the most general setting, given access to a function f : R^n → R_+, the goal is to generate a point x whose density is proportional to f(x). Two special cases of particular interest are when f is uniform over a convex body and when f is a Gaussian restricted to a convex set.

The generic approach to sampling is by a Markov chain over the state space. The chain is designed so that it is ergodic, time-reversible, and has the desired density as its stationary distribution. The key question is then its rate of convergence. The ball walk and Hit-and-Run work in full generality, and have been shown to mix rapidly (i.e., the convergence rate is polynomial) for arbitrary logconcave densities. Over three decades of improvements, the complexity of this problem has been reduced to a small polynomial in the dimension for the total number of function evaluations, with a factor of n per function call for the total number of arithmetic operations. For a logconcave density with support of diameter D, the mixing time is O*(n^2 D^2) and the total computational complexity is O*(n^3 D^2).¹

A simple and widely-used algorithm that pre-dates these developments considerably is the Gibbs Sampler proposed by Turchin in 1971 [14]. It is inspired by statistical physics and is commonly used for sampling distributions [3, 4] and for Bayesian inference [5, 6, 7]. To sample a multivariate density, at each step, the sampler picks a coordinate (either at random or in order, cycling through the coordinates), fixes all other coordinates, and re-samples this coordinate from the induced distribution. This is very similar to Hit-and-Run, except that instead of picking the next direction uniformly at random from the unit sphere, it is picked only from one of the n basis vectors (see [1] for a historical account and more background). It was reported to be significantly faster than hit-and-run in state-of-the-art software for volume computation and integration [2]. Gibbs sampling, also called Coordinate Hit-and-Run, has a computational benefit: updating the current point takes O(n) time rather than O(n^2), even for polyhedra, since the update is along only one coordinate direction!

∗ Georgia Tech. Email: {aladdha6, vempala}@gatech.edu. Supported in part by NSF awards 1717349, 1839323 and 1909756.
¹ The diameter D can be effectively made O(√n) after an affine transformation, and so these complexities are O*(n^3) and O*(n^4). However, computing the transformation itself takes O*(n^{3.5}) oracle calls and O*(n^{4.5}) total arithmetic complexity.
Thus the overhead per step is reduced from O(n^2), as in all previous algorithms, to O(n). However, despite a half-century of intense study, the convergence rate of Gibbs sampling has remained an open problem. There is currently no polynomial bound known for its conductance and mixing rate.

In this paper, we show that the Gibbs sampler mixes rapidly for any convex body K. Before we formally state our main theorem, we define the Gibbs sampler.

Coordinate Hit-and-Run.
We describe the algorithm for sampling uniformly from a convex body K ⊆ R^n. Let {e_i : 1 ≤ i ≤ n} be the standard basis of R^n. The starting point is in the interior of K.

Algorithm 1:
Coordinate Hit-and-Run (CHAR)
Input: a point x^(0) ∈ K, integer T.
for i = 1, 2, …, T do
    Pick a uniformly random axis direction e_j.
    Set x^(i) to be a random point chosen uniformly from ℓ ∩ K, where ℓ = {x^(i−1) + t e_j : t ∈ R}.
end
Output: x^(T).

To sample from a general logconcave density f : R^n → R_+, the only change is in Step 2, where the next point y is chosen along ℓ with density proportional to f(y) restricted to ℓ. In both cases, the process is symmetric and ergodic, and so the stationary distribution of the Markov chain is the desired distribution. We can now state our main theorem.
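To make the step above concrete, here is a minimal sketch of CHAR for a polytope K = {x : Ax ≤ b}. The function and variable names (char_sample, A, b, x0) are ours and the implementation is only illustrative; it assumes K is bounded and x0 is an interior point. It also illustrates the per-step cost discussed above: by maintaining the slack vector b − Ax, each step touches only one column of A.

```python
import numpy as np

def char_sample(A, b, x0, T, rng=None):
    """Illustrative Coordinate Hit-and-Run in the polytope K = {x : Ax <= b}.

    Each step picks a random coordinate direction e_j and resamples x_j
    uniformly from the chord of K through x along e_j. Assumes K is bounded
    and x0 is strictly feasible.
    """
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    slack = b - A @ x              # b - Ax >= 0, maintained incrementally
    n = x.shape[0]
    for _ in range(T):
        j = rng.integers(n)
        a_j = A[:, j]
        # x + t*e_j stays in K while t * a_j <= slack (componentwise)
        pos, neg = a_j > 0, a_j < 0
        t_max = np.min(slack[pos] / a_j[pos]) if pos.any() else np.inf
        t_min = np.max(slack[neg] / a_j[neg]) if neg.any() else -np.inf
        t = rng.uniform(t_min, t_max)
        x[j] += t
        slack -= t * a_j           # O(m) update, no O(mn) recomputation
    return x
```

For a polytope with m = O(n) facets this is O(n) arithmetic per step, whereas a step along a general direction, as in hit-and-run, would touch all of A.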
Theorem 1.
Let K be a convex body in R^n containing a unit ball, with R^2 = E_K(‖x − x̄‖^2). Then the mixing time of Coordinate Hit-and-Run from a warm start in K is Õ(n^{11} R^2).

We note that the mixing time of the ball walk and hit-and-run from a warm start is Õ(n^2 R^2) [9, 12]. While our bound is likely not the best polynomial bound for CHAR, in Section 4 we show that it is necessarily higher than the bound for hit-and-run.

A key ingredient of our proof is a new "ℓ_0"-isoperimetric inequality. We will need the following definition.

Definition 2 (Axis-disjoint). Two measurable sets S_1, S_2 are called axis-disjoint if
    ∀x ∈ S_1, ∀y ∈ S_2 : |{i ∈ [n] : x_i = y_i}| ≤ n − 2.
In words, no point from S_1 is on the same axis-parallel line as any point in S_2. (See Fig. 1.)

Theorem 3 (Isoperimetry). Let K be a convex body in R^n containing a unit ball, with R^2 = E_K(‖x − x̄‖^2). Let S_1, S_2 ⊂ K be two measurable subsets of K such that S_1, S_2 are axis-disjoint. Then for any ε ≥ 0, the set S_3 = K \ S_1 \ S_2 satisfies
    vol(S_3) ≥ (cε / (n^{4.5} R log n)) (min{vol(S_1), vol(S_2)} − ε vol(K))
where c is a fixed constant.

Figure 1: Axis-disjoint sets S_1 and S_2.

At a high level, we follow the proof of rapid mixing based on the conductance of Markov chains [13] in the continuous setting [11]. We give a simple, new one-step coupling lemma which reduces the problem of lower bounding the conductance of the underlying Markov chain to an isoperimetric inequality about axis-disjoint sets in high dimension. Roughly speaking, the inequality says the following: if two subsets of a convex body are axis-disjoint, then the remaining mass of the body is proportional to the smaller of the two subsets. This inequality is our main technical contribution. In comparison, the classical isoperimetric inequality for Euclidean distance says that for any two subsets of a convex body, the remaining mass is proportional to their (minimum) Euclidean distance times the smaller of the two subset volumes.

Standard approaches to proving such inequalities, notably localization [8, 10], which reduce the desired high-dimensional inequality to a one-dimensional inequality, do not seem to be directly applicable to proving this "ℓ_0-type" inequality. So we develop a first-principles approach where we first prove the inequality for cubes, taking advantage of their product structure, and then for general bodies using a tiling of space with cubes. In the course of the latter part, we will use several known properties of convex bodies, including Euclidean isoperimetry.

2 Isoperimetry

The main idea of the proof is as follows. Assume vol(S_1) ≤ vol(S_2). We consider all cubes of a grid partition that intersect S_1. For cubes C where the intersection S_1 ∩ C is less than half the volume of C, we prove a new isoperimetric inequality for a cube. For the set C_2 of remaining cubes, call a cube a border cube if it has at least one facet adjacent to some other cube that intersects K and is not in C_2. For a border cube C, we note that at least half of the volume of any cube that neighbors C is along an axis-parallel line through some point in S_1 ∩ C. We will combine this with a lower bound on the volume of border cubes.

Lemma 4 (Cube isoperimetry). For an axis-aligned cube C ⊆ R^n, and any two axis-disjoint subsets S_1, S_2 ⊂ C, with S_3 = C \ S_1 \ S_2, the following holds:
    vol(S_3) ≥ (1/(10 n^2 log n)) · min{vol(S_1), vol(S_2)}.
Moreover, if vol(S_1) ≤ vol(C)/2, then vol(S_3) ≥ (1/(10 n^2 log n)) · vol(S_1).

Remark. We believe that the bound above is not optimal, and even an absolute constant factor might be possible.
In the appendix, we give a different proof achieving a weaker bound.
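Before the proof, here is a small illustration of Definition 2, which is used throughout: two sets are axis-disjoint when every cross pair of points differs in at least two coordinates. The following check on finite point sets (the function name and NumPy formulation are ours, purely illustrative) makes that concrete.

```python
import numpy as np

def axis_disjoint(S1, S2):
    """Definition 2 on finite point sets (rows are points in R^n):
    axis-disjoint means every x in S1, y in S2 agree in at most n - 2
    coordinates, i.e. no pair shares an axis-parallel line."""
    S1, S2 = np.asarray(S1), np.asarray(S2)
    n = S1.shape[1]
    for x in S1:
        agree = (S2 == x).sum(axis=1)   # agreeing coordinates per y
        if (agree >= n - 1).any():
            return False
    return True

# Diagonal cells of a 2x2 grid are axis-disjoint in R^2 ...
assert axis_disjoint([[0.0, 0.0]], [[1.0, 1.0]])
# ... but cells sharing a row or column are not.
assert not axis_disjoint([[0.0, 0.0]], [[0.0, 1.0]])
```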
Proof.
Assume C is a unit cube and vol(S_1) ≤ vol(S_2). Consider the partition 𝒮 of S_1 into axis-connected sets. A set X is called axis-connected if for all x, y ∈ X, it is possible to move from x to y by moving along axis-parallel lines within X. Note that if X, Y ∈ 𝒮 and X ≠ Y, then X and Y are axis-disjoint.

For a set X ⊂ C, define ext(X) = {y ∈ C \ X : ∃x ∈ X, |{i : x(i) = y(i)}| = n − 1}; ext(X) is the subset of C \ X that is reachable from X in one step of CHAR. If for all S ∈ 𝒮, vol(ext(S)) ≥ (c_0/n) vol(S), then
    vol(S_3) ≥ vol(∪_{S∈𝒮} ext(S)) ≥ (1/n) Σ_{S∈𝒮} vol(ext(S)) ≥ (c_0/n^2) vol(S_1)
and we are done. The second inequality is true because any point x ∈ ∪_{S∈𝒮} ext(S) can belong to the extensions of at most n subsets in 𝒮, as these subsets are axis-disjoint and hence x can only be reachable from different subsets along different axes.

If not, then there must exist at least one axis-connected subset of S_1, say S_0, such that vol(ext(S_0)) ≤ (c_0/n) vol(S_0). We can continue the argument by considering S_0 instead of S_1. First, if we start with a uniform point in S_0 and perform CHAR in S_0, i.e., pick a random axis-parallel line through the current point, then go to a uniform point in S_0 along the line, the current point will remain uniform in S_0 because S_0 is axis-connected. Call this process P_0. We will compare it with the process P_1, which in each step picks a uniform point along the line in the cube (rather than only in S_0). Starting from any point in C, the process P_1 will produce a uniform point in C after n log n steps. If the process P_1 moves to a point in S_0 at every step, then the distributions induced by P_0 and P_1 are the same at every step. Hence the probability of coupling these processes within n log n steps directly corresponds to the ratio of the volumes of S_0 and C. Next, we lower bound this probability.

The probability of picking a point outside S_0 from a uniform point in S_0 is at most c_0/n. To see this, w.l.o.g. let e_1 be the direction selected by a step of P_1 and let C_1 ⊆ [0,1]^n be the extension of S_0 along the first axis. Note that vol(C_1) ≤ (1 + c_0/n) vol(S_0). Let Ĉ_1 denote the projection of C_1 along the last n − 1 coordinates and, for each y ∈ Ĉ_1, let q(y) = |{y + t e_1 : t ∈ R} ∩ S_0|. Then vol_n(C_1) = vol_{n−1}(Ĉ_1) and
    E_y[q(y)] = (1/vol_{n−1}(Ĉ_1)) ∫_{y∈Ĉ_1} q(y) dy = vol(S_0)/vol_n(C_1) = Pr_{x∈C_1}[x ∈ S_0].
For each x ∈ S_0, let x̂ denote the projection of x onto the last n − 1 coordinates. Then,
    Pr_{x∼S_0}[P_1(x) ∉ S_0] = Pr_{x∼C_1}[P_1(x) ∉ S_0 | x ∈ S_0]
      = Pr_{x∼C_1}[P_1(x) ∉ S_0 and x ∈ S_0] / Pr_{x∈C_1}[x ∈ S_0]
      = ∫_{x∈C_1} 1_{S_0}(x)(1 − q(x̂)) dx / (vol(C_1) Pr_{x∈C_1}[x ∈ S_0])
      = ∫_{y∈Ĉ_1} q(y)(1 − q(y)) dy / (vol_{n−1}(Ĉ_1) Pr_{x∈C_1}[x ∈ S_0])
      = E_y[q(y)(1 − q(y))] / Pr_{x∈C_1}[x ∈ S_0]
      ≤ E_y[q(y)] E_y[1 − q(y)] / Pr_{x∈C_1}[x ∈ S_0]
      = 1 − E_y[q(y)] = 1 − vol(S_0)/vol(C_1) ≤ c_0/n,
where the inequality in the middle is the Chebyshev correlation inequality (q and 1 − q are oppositely ordered).

In each step, we can couple these processes so that with probability at least 1 − c_0/n they are at the same point, and after n log n steps they are at the same point with probability at least (1 − c_0/n)^{n log n} ≥ e^{−2c_0 log n}. By choosing c_0 = 1/(10 log n), we have that with probability e^{−2c_0 log n} > 1/2, all points encountered along the way by P_1 will be in S_0. On the other hand, P_1 produces a uniform point in the cube, so vol(S_0) > vol(C)/2, contradicting vol(S_1) ≤ vol(C)/2.
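The claim that P_1 yields a uniform point after about n log n steps is a coupon-collector fact: the point is exactly uniform in the cube once every coordinate has been resampled at least once. A quick numerical check (illustrative only; the function name is ours):

```python
import numpy as np

def steps_until_uniform(n, rng):
    """Simulate process P1 in the unit cube: resample one random
    coordinate per step. The point is exactly uniform once every
    coordinate has been resampled at least once (coupon collector)."""
    touched = np.zeros(n, dtype=bool)
    steps = 0
    while not touched.all():
        touched[rng.integers(n)] = True
        steps += 1
    return steps

rng = np.random.default_rng(0)
n = 100
trials = [steps_until_uniform(n, rng) for _ in range(200)]
print(np.mean(trials) / (n * np.log(n)))   # ratio close to 1
```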
The next lemma is an isoperimetric inequality from [8].

Lemma 5 (Euclidean isoperimetry [8]). Let K ⊂ R^n be a convex body containing a unit ball and R^2 = E_K(‖x − x̄‖^2). For any subset S ⊂ K of volume at most vol(K)/2, we have
    vol(∂S) ≥ (ln 2 / R) vol(S).

We can now prove the new isoperimetric inequality, restated below for convenience.

Theorem 3 (Isoperimetry). Let K be a convex body in R^n containing a unit ball, with R^2 = E_K(‖x − x̄‖^2). Let S_1, S_2 ⊂ K be two measurable subsets of K such that S_1, S_2 are axis-disjoint. Then for any ε ≥ 0, the set S_3 = K \ S_1 \ S_2 satisfies
    vol(S_3) ≥ (cε / (n^{4.5} R log n)) (min{vol(S_1), vol(S_2)} − ε vol(K))
where c is a fixed constant.

Proof of Theorem 3. Let S_1, S_2 ⊂ K be axis-disjoint subsets. Let K' = (1 − α)K for a parameter α > 0 to be chosen shortly, and S_i' = S_i ∩ K'. Assume vol(S_1') ≤ vol(S_2'). Then by the Euclidean isoperimetric inequality, we have that vol(∂S_1') ≥ (c/R) vol(S_1'), where ∂S_1' only refers to the internal boundary of S_1' inside K'.

Next consider a standard lattice of width δ, with each lattice point inducing a cube of side length δ. We choose δ = α/√n to ensure that cubes that intersect K' are fully contained in K. Let 𝒞 be the set of cubes that intersect S_1. We divide them into two types: 𝒞_1 are the cubes where S_1 takes up less than a (1 − ε_0) fraction of the volume of the cube in K, and 𝒞_2 are the rest, where S_1 takes up at least a (1 − ε_0) fraction of each cube.

If vol(𝒞_1 ∩ S_1) ≥ vol(S_1)/p(n), i.e., at least a 1/p(n) fraction of vol(S_1) resides in 𝒞_1, then consider 𝒞_1' = {c ∈ 𝒞_1 : c ∩ K' ≠ ∅}. By choosing an appropriate value of α, we have 𝒞_1' ⊆ K and
    vol(𝒞_1' ∩ S_1) ≥ vol(𝒞_1 ∩ S_1 ∩ K') ≥ vol(𝒞_1 ∩ S_1) − (1 − (1 − α)^n) vol(K).
Applying Lemma 4 to each cube in 𝒞_1', we get
    vol(S_3) ≥ (1/(10 n^2 log n)) Σ_{c∈𝒞_1'} vol(c ∩ S_1) = (1/(10 n^2 log n)) vol(𝒞_1' ∩ S_1)
      ≥ (1/(10 n^2 log n)) (vol(𝒞_1 ∩ S_1) − (1 − (1 − α)^n) vol(K))
      ≥ (1/(10 n^2 log n)) ((1/p(n)) vol(S_1) − (1 − (1 − α)^n) vol(K)),
and by setting α ≤ ε/(n p(n)) we get
    vol(S_3) ≥ (1/(10 n^2 log n · p(n))) (vol(S_1) − ε · vol(K)).

So assume not. Then vol(𝒞_2 ∩ S_1) ≥ (1 − 1/p(n)) · vol(S_1). Consider the internal boundary of 𝒞_2, ∂𝒞_2, in K (see Figure 2.1). It consists of facets of n-dimensional cubes. For a facet f on this boundary with normal axis e_f, let f_1 be the cube adjacent to f in 𝒞_2 and let f_2 be the cube adjacent to f that is not in 𝒞_2. Now, f_2 is not in 𝒞_2 (it can contain 𝒞_1 ∩ S_1 mass, but we account for that later by subtracting its mass from that of S_3), and since f_1 has at least a (1 − ε_0) fraction of its mass in S_1, the support of the marginal of S_1 ∩ f_1 along any axis direction is at least a (1 − ε_0) fraction of the support of the marginal of f_1 along that axis. Therefore, at least a (1 − ε_0) fraction of the mass of f_1, and by extension of f_2, is reachable from a point in S_1 along e_f and therefore cannot be in S_2. Since every such neighboring cube can be counted at most 2n times by this argument, we find at least (1/2n)(1 − ε_0) vol(∂𝒞_2) δ mass that is not in S_2. But f_2 might not be (fully) contained in K. So we need to move to K': by choosing an appropriate value of α, we can ensure that the neighboring cubes are fully contained in K. This argument is valid for every facet in ∂(𝒞_2 ∩ K') because ∂(𝒞_2 ∩ K') ⊆ ∂𝒞_2, as ∂(𝒞_2 ∩ K') only consists of the boundary of 𝒞_2 ∩ K' internal to K'.

Figure 2.1: Illustration of the isoperimetry proof.
We also know that
    vol(𝒞_2 ∩ K' ∩ S_2) ≤ vol(𝒞_2 ∩ K') ≤ (1/(1 − ε_0)) vol(S_1 ∩ K') ≤ (1/(2(1 − ε_0))) vol(K'),
and by Lemma 5,
    vol(∂(𝒞_2 ∩ K')) ≥ (c/R) · min{vol(𝒞_2 ∩ K'), vol(K' \ (𝒞_2 ∩ K'))}
      ≥ (c/R) · min{vol(𝒞_2 ∩ K'), ((1 − 2ε_0)/(2(1 − ε_0))) vol(K')}
      ≥ (c/R) · min{vol(𝒞_2 ∩ K'), (1 − 2ε_0) vol(𝒞_2 ∩ K')}
      ≥ (c/R) (1 − 2ε_0) vol(𝒞_2 ∩ K').
This gives
    vol(S_3) ≥ (1/2n)(1 − ε_0) vol(∂(𝒞_2 ∩ K')) δ − (1/p(n)) vol(S_1)
      ≥ (δ/2n)(1 − ε_0)(1 − 2ε_0)(c/R) vol(𝒞_2 ∩ K') − (1/p(n)) vol(S_1)
      ≥ (δ/2n)(1 − ε_0)(1 − 2ε_0)(c/R) (vol(𝒞_2 ∩ S_1) − (1 − (1 − α)^n) vol(K)) − (1/p(n)) vol(S_1)
      ≥ (δ/2n)(1 − ε_0)(1 − 2ε_0)(c/R) ((1 − 1/p(n)) vol(S_1) − (1 − (1 − α)^n) vol(K)) − (1/p(n)) vol(S_1).
Setting 1 − 2ε_0 = 1/2, p(n) = 10nR/(δc) = 10n^2 √n R/(cε), and α = ε/n, we have
    vol(S_3) ≥ (cδ/(10nR)) (vol(S_1) − ε vol(K)) = (cε/(10 n^2 √n R)) (vol(S_1) − ε vol(K)).
This proves the theorem, with isoperimetric coefficient min{cε/(10 n^2 √n R), 1/(10 n^2 log n · p(n))} ≥ cε/(n^{4.5} R log n) after absorbing constants into c.

3 Conductance and Mixing

For any measurable subset S ⊆ K and x ∈ K, let P_x(S) be the probability that one step of Coordinate Hit-and-Run from x goes to S. Also, the one-step transitions are symmetric: P_x({y}) = P_y({x}) for all x, y ∈ K.

The conductance of a subset S of a state space K with stationary distribution Q is
    φ(S) = ∫_S P_x(K \ S) dQ(x) / min{Q(S), Q(K \ S)}.
For any s ∈ [0, 1/2], the s-conductance of the Markov chain is
    φ_s = inf_{S : s < Q(S) ≤ 1/2} φ(S).
The following theorem shows that the s-conductance of a Markov chain bounds its rate of convergence from a warm start.

Theorem 6 ([11]). Suppose that a lazy, time-reversible Markov chain with stationary distribution Q has s-conductance at least φ_s. Then with initial distribution Q_0 and H_s = sup{|Q(A) − Q_0(A)| : A ⊂ K, Q(A) ≤ s}, the distribution Q_t after t steps satisfies
    d_TV(Q_t, Q) ≤ H_s + (H_s/s) (1 − φ_s^2/2)^t.
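Theorem 6 converts a conductance lower bound into an explicit step count: solving H_s + (H_s/s)(1 − φ_s^2/2)^t ≤ ε for t gives t = O(φ_s^{-2} log(H_s/((ε − H_s)s))). A small helper (ours, illustrative; it assumes H_s < ε, which one arranges by taking s small enough for the given warm start):

```python
import numpy as np

def steps_for_tv(phi_s, H_s, s, eps):
    """Smallest t with H_s + (H_s / s) * (1 - phi_s**2 / 2)**t <= eps,
    i.e. the mixing bound of Theorem 6. Assumes 0 < H_s < eps."""
    if not 0 < H_s < eps:
        raise ValueError("need 0 < H_s < eps (take s smaller)")
    target = (eps - H_s) * s / H_s        # need (1 - phi_s^2/2)^t <= target
    return max(0, int(np.ceil(np.log(target) / np.log(1.0 - phi_s**2 / 2))))

# e.g. steps_for_tv(1e-3, 0.01, 0.05, 0.1) is on the order of phi_s**-2
```

Plugging in the s-conductance bound of Theorem 8 below, φ_s ≥ cs/(n^{5.5} R log n), recovers, up to constants and logarithmic factors, the Õ(n^{11} R^2) mixing bound of Theorem 1.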
We will now bound the s-conductance of CHAR. The following simple lemma lets us reduce to axis-disjoint subsets.

Lemma 7.
Let K = S_1 ∪ S_2 be a partition of K into measurable sets. Let S_1' = {x ∈ S_1 : P_x(S_2) < 1/(2n)} and S_2' = {x ∈ S_2 : P_x(S_1) < 1/(2n)}. Then S_1' and S_2' are axis-disjoint.

Proof. Assume not; then let ℓ be an axis-parallel line passing through both S_1' and S_2'. Let x ∈ S_1' ∩ ℓ and y ∈ S_2' ∩ ℓ. Then
    P_x(S_2) ≥ (1/n) · len(ℓ ∩ S_2)/len(ℓ ∩ K)  ⟹  len(ℓ ∩ S_2) < len(ℓ ∩ K)/2
and
    P_y(S_1) ≥ (1/n) · len(ℓ ∩ S_1)/len(ℓ ∩ K)  ⟹  len(ℓ ∩ S_1) < len(ℓ ∩ K)/2.
This is a contradiction, as len(ℓ ∩ K) = len(ℓ ∩ S_1) + len(ℓ ∩ S_2).

Theorem 8.
Let K be a convex body in R^n containing a unit ball, with R^2 = E_K(‖x − x̄‖^2). Then the s-conductance of Coordinate Hit-and-Run in K is at least cs/(n^{5.5} R log n).

Proof. Let K = S_1 ∪ S_2 be a partition of K into measurable sets and let P_x(y) denote the probability of going from x to y in one step of Coordinate Hit-and-Run. Then,
    φ(S_1) = ∫_{x∈S_1} P_x(S_2) dx / min{π_K(S_1), π_K(S_2)}.
Let S_1' = {x ∈ S_1 : P_x(S_2) < 1/(2n)}, S_2' = {x ∈ S_2 : P_x(S_1) < 1/(2n)}, and S_3' = K \ S_1' \ S_2'. From Lemma 7, we know that S_1' and S_2' are axis-disjoint. Thus, from Theorem 3 applied with ε = s/2 and writing ψ = c/(n^{4.5} R log n),
    vol(S_3') ≥ ψ · (s/2) (min{vol(S_1'), vol(S_2')} − (s/2) vol(K)).
If vol(S_1') < vol(S_1)/2, then
    ∫_{x∈S_1} P_x(S_2) dx = ∫_{x∈S_1'} P_x(S_2) dx + ∫_{x∈S_1\S_1'} P_x(S_2) dx ≥ (1/2n) vol(S_1 \ S_1') ≥ (1/4n) vol(S_1).
So, assume vol(S_1') ≥ vol(S_1)/2 and vol(S_2') ≥ vol(S_2)/2. Then,
    ∫_{x∈S_1} P_x(S_2) dx ≥ ∫_{x∈S_1\S_1'} P_x(S_2) dx ≥ (1/2n) vol(S_1 \ S_1').   (3.1)
Also,
    ∫_{x∈S_1} P_x(S_2) dx ≥ ∫_{x∈S_1} P_x(S_2 \ S_2') dx = ∫_{y∈S_2\S_2'} P_y(S_1) dy ≥ (1/2n) vol(S_2 \ S_2').   (3.2)
Thus, from equations (3.1) and (3.2),
    ∫_{x∈S_1} P_x(S_2) dx ≥ (1/2) · (1/2n) (vol(S_1 \ S_1') + vol(S_2 \ S_2')) = (1/4n) vol(S_3')
      ≥ (ψ s/(8n)) (min{vol(S_1'), vol(S_2')} − (s/2) vol(K))
      ≥ (c' s/(n^{5.5} R log n)) (min{vol(S_1), vol(S_2)} − (s/2) vol(K))
for some constant c'.

4 Lower Bound

In this section, we show an upper bound of O(1/(n^2 D)) on the conductance of CHAR, and hence a lower bound on its mixing rate. Fix a simplex C_0 in R^{n−1} with barycenter at zero. We construct a convex body K in R^n so that K(x_1), the slice of K with first coordinate equal to x_1, is C_0 + (x_1, 0, …, 0) for x_1 ∈ [0, D] and empty outside this range of x_1. We choose D ≥ n. Let S ⊂ K be the set of all points in K with x_1 ≤ D/2. We now observe that the axis-aligned extension of S has volume bounded by O(1/(nD)) times the volume of S. This shows that the isoperimetric ratio is O(1/(nD)). Next, we note that the extension of S goes beyond S only along e_1, and the probability that CHAR chooses e_1 at any step is only 1/n. This gives a conductance bound of O(1/(n^2 D)).

Figure 4.1: The lower bound construction.

This translates to a lower bound of Ω̃(n^3 D^2) on the mixing rate, even from a warm start. We sketch the argument. Consider two subsets of K at opposite ends: K ∩ {x : x_1 ≤ D/4} and K ∩ {x : x_1 ≥ 3D/4}. Suppose we start with a uniformly random point in the first set. Then, in order to mix, the current point must reach the latter set. The probability of selecting e_1 is 1/n. However, even when e_1 is selected, each step is of size only Õ(1/n) with high probability. So the process is roughly like a random walk that takes a step of size ±1/n on an interval of length D/2, i.e., a step of size ±1 in an interval of length nD/2. This takes Ω(n^2 D^2) steps along e_1, which means a total of Ω(n^3 D^2) steps of CHAR. Even though this is worse than the Õ(n^2 D^2) mixing rate of hit-and-run, it is an interesting open problem to determine the precise mixing rate of CHAR.
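The heuristic in the sketch, a ±1/n step on an interval of length D/2, can be checked numerically. The simulation below (ours, illustrative) estimates the number of e_1-moves needed to cross, which scales like (nD)^2, before accounting for the extra factor of n from how rarely e_1 is chosen.

```python
import numpy as np

def crossings(n, D, trials=20, rng=None):
    """Mean number of steps for a +-(1/n) random walk started at 0
    to reach D/2 (reflecting at 0). The hitting time scales like
    (n*D/2)**2, matching the Omega(n^2 D^2) count of e_1-moves."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(trials):
        x, steps = 0.0, 0
        while x < D / 2:
            x = max(0.0, x + rng.choice((-1.0, 1.0)) / n)
            steps += 1
        out.append(steps)
    return np.mean(out)

# doubling n roughly quadruples the crossing time:
print(crossings(4, 8, rng=0), crossings(8, 8, rng=0))
```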
References

[1] Hans C. Andersen and Persi Diaconis. Hit and run as a unifying device. Journal de la société française de statistique, 148(4):5–28, 2007.
[2] B. Cousins and S. S. Vempala. A practical volume algorithm. Mathematical Programming Computation, 2016. To appear.
[3] Persi Diaconis, Kshitij Khare, and Laurent Saloff-Coste. Gibbs sampling, conjugate priors and coupling. Sankhya A, 72(1):136–169, 2010.
[4] Persi Diaconis, Gilles Lebeau, and Laurent Michel. Gibbs/Metropolis algorithms on a convex polytope. Mathematische Zeitschrift, 272(1–2):109–129, 2012.
[5] Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 363–370, 2005.
[6] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
[7] Edward I. George and Robert E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
[8] R. Kannan, L. Lovász, and M. Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13:541–559, 1995.
[9] R. Kannan, L. Lovász, and M. Simonovits. Random walks and an O*(n^5) volume algorithm for convex bodies. Random Structures and Algorithms, 11:1–50, 1997.
[10] Yin Tat Lee and Santosh S. Vempala. Eldan's stochastic localization and the KLS hyperplane conjecture: An improved lower bound for expansion. In Proc. of IEEE FOCS, 2017.
[11] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms, 4:359–412, 1993.
[12] L. Lovász and S. Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35:985–1005, 2006.
[13] A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82:93–133, 1989.
[14] V. Turchin. On the computation of multidimensional integrals by the Monte-Carlo method. Theory of Probability & Its Applications, 16(4):720–724, 1971.
Appendix A Alternate Cube Isoperimetry
Lemma 9 (Alternate cube isoperimetry). For an axis-aligned cube C ⊆ R^n, and any two axis-disjoint subsets S_1, S_2 ⊂ C, with S_3 = C \ S_1 \ S_2, the following holds:
    vol(S_3) ≥ (1/(100 n^2 log^2 n)) · min{vol(S_1), vol(S_2)}.
Moreover, if vol(S_1) ≤ vol(C)/2, then vol(S_3) ≥ (1/(100 n^2 log^2 n)) · vol(S_1).

Proof. Assume C is a unit cube and vol(S_1) ≤ vol(S_2). For an axis e_i and a line ℓ parallel to e_i intersecting S_1, we call the line bad with respect to e_i if |ℓ ∩ S_1| < 1 − c_2/n. Let B_i be the set of points in S_1 lying on a bad line with respect to e_i. If vol(B_i) > (c_1/n) vol(S_1) for any i ∈ {1, …, n}, then we are done, as the extension of S_1 along e_i would pick up volume at least (c_1 c_2/n^2) vol(S_1). Therefore we can assume that vol(B_i) ≤ (c_1/n) vol(S_1) for all i, and hence vol(B) ≤ c_1 vol(S_1), where B = ∪_{i∈[n]} B_i. So, a (1 − c_1) fraction of the points in S_1 have no bad lines through them along any axis direction. Consider the partition of S_1 into axis-connected sets; a set X is called axis-connected if for all x, y ∈ X it is possible to move from x to y by moving along axis-parallel lines within X. Since vol(B) ≤ c_1 vol(S_1), there must exist at least one axis-connected subset of S_1, say S_0, such that vol(B ∩ S_0) ≤ c_1 vol(S_0). We can continue the argument by considering S_0 instead of S_1.

Suppose B is empty, and we start with a point in S_0 with no bad lines and change one coordinate at a time, setting it to a uniformly random number in [0, 1], once for each coordinate. The final result will be a random point in the cube. On the other hand, from the bounds derived, with probability at least (1 − c_2/n)^n ≥ e^{−2c_2} > 1/2, the sequence of points and the last point are all in S_0. So, we would have vol(S_0) > vol(C)/2, a contradiction.

Going to the general case when B is not empty, we argue as follows. First, if we start with a uniform point in S_0 and perform CHAR in S_0, i.e., pick a random axis-parallel line through the current point, then go to a uniform point in S_0 along the line, the current point will remain uniform in S_0 because S_0 is axis-connected. Call this process P_0. We will compare it with the process P_1, which in each step picks a uniform point along the line in the cube (rather than only in S_0). Starting from any point in C, the process P_1 will produce a uniform point in C after n log n steps. Let b_i = vol(B_i)/vol(S_0). Then Σ_{i=1}^n b_i ≤ c_1 and b_i ≤ c_1/n for all i ∈ {1, …, n}. Since each point of P_0 is uniform in S_0, the expected number of bad points encountered by P_0 is at most (c_1/n) · n log n = c_1 log n < 1/2.

So, with probability at least 1 − c_1 log n, the number of bad points encountered is less than one, i.e., zero. In each step, we can couple these processes so that with probability at least 1 − c_2/n they are at the same point, and after n log n steps they are at the same point with probability at least (1 − c_2/n)^{n log n} ≥ e^{−2c_2 log n}. By choosing c_1 = c_2 = 1/(10 log n), both events hold simultaneously with probability at least e^{−1/5} − 1/10 > 1/2, so a uniform point of C lies in S_0 with probability more than 1/2, giving vol(S_0) > vol(C)/2, again a contradiction. The lemma follows with the bound (c_1 c_2/n^2) vol(S_1) = vol(S_1)/(100 n^2 log^2 n) from the first case.