A simple Markov chain for independent Bernoulli variables conditioned on their sum
Jeremy Heng$^{1}$, Pierre E. Jacob$^{2,\ast}$, and Nianqiao Ju$^{2}$

$^1$ESSEC Business School, Singapore. $^2$Department of Statistics, Harvard University, USA.

$^\ast$Corresponding author: [email protected]
Abstract
We consider a vector of $N$ independent binary variables, each with a different probability of success. The distribution of the vector conditional on its sum is known as the conditional Bernoulli distribution. Assuming that $N$ goes to infinity and that the sum is proportional to $N$, exact sampling costs order $N^2$, while a simple Markov chain Monte Carlo algorithm using "swaps" has constant cost per iteration. We provide conditions under which this Markov chain converges in order $N \log N$ iterations. Our proof relies on couplings and an auxiliary Markov chain defined on a partition of the space into favorable and unfavorable pairs.

Let $x = (x_1, \ldots, x_N)$ be an $N$-vector in $\{0,1\}^N$ with sum $\sum_{n=1}^N x_n = I$. Let $(p_1, \ldots, p_N) \in (0,1)^N$ and denote the associated "odds" by $w_n = p_n / (1 - p_n)$. Define the set $S_z = \{n \in [N] : x_n = z\}$ for $z \in \{0,1\}$, where $[N] = \{1, \ldots, N\}$; that is, $S_z$ contains the indices $n \in [N]$ at which $x_n = z$. We consider the task of sampling $x \in \{0,1\}^N$ from the distribution obtained by specifying an independent Bernoulli distribution with probability $p_n$ on each component $x_n$, and conditioning on $\sum_{n=1}^N x_n = I$ for some value $0 \le I \le N$. This is known as the conditional Bernoulli distribution and will be denoted by $\mathrm{CB}(p, I)$. The support of $\mathrm{CB}(p, I)$ is denoted by $\mathcal{X} = \{x \in \{0,1\}^N : \sum_{n=1}^N x_n = I\}$. We assume that $1 \le I \le N/2$, since we can always swap the labels "0" and "1".
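For reference, the probability mass function of $\mathrm{CB}(p, I)$ can be written in terms of the odds; this display is standard and implicit in what follows, and it explains the acceptance probability of the swap algorithm below:
\[
\mathrm{CB}(p, I)(x) \;=\; \frac{\prod_{n=1}^N p_n^{x_n} (1-p_n)^{1-x_n}}{\mathbb{P}\left(\sum_{n=1}^N x_n = I\right)} \;\propto\; \prod_{n \in S_1} w_n, \qquad x \in \mathcal{X},
\]
so that the ratio of target probabilities between a state and its proposed "swap" reduces to a ratio of two odds.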
We consider the asymptotic regime where $N$ and $I$ go to infinity at the same rate. One can sample exactly from $\mathrm{CB}(p, I)$ [Chen et al., 1994, Chen and Liu, 1997] for a cost of order $IN$, thus order $N^2$ in the context of interest here; see Appendix A. As an alternative, we consider a simple Markov chain Monte Carlo (MCMC) algorithm that leaves $\mathrm{CB}(p, I)$ invariant [Chen et al., 1994, Liu et al., 1995]. Starting from an arbitrary state $x \in \mathcal{X}$, this MCMC performs the following steps at each iteration.

1. Sample $i_0 \in S_0$ and $i_1 \in S_1$ uniformly, independently from one another.
2. Propose to set $x_{i_0} = 1$ and $x_{i_1} = 0$, and accept with probability $\min(1, w_{i_0}/w_{i_1})$.

By keeping track of the sets $S_0$ and $S_1$, the algorithm can be implemented at a constant cost per iteration.
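As an illustration, here is a minimal Python sketch of one iteration of this swap algorithm. The array-based representation of $S_0$ and $S_1$ and the function name are our own, not part of the original description; the in-place update of the two index lists is what gives the constant cost per iteration.

```python
import numpy as np

def swap_kernel_step(x, S0, S1, w, rng):
    """One iteration of the swap MCMC kernel leaving CB(p, I) invariant.

    x: 0/1 vector of length N with sum I; S0, S1: lists of the indices
    where x equals 0 and 1; w: odds, w[n] = p[n] / (1 - p[n]).
    The proposal is symmetric, so the Metropolis-Hastings acceptance
    probability reduces to min(1, w[i0] / w[i1])."""
    k0 = rng.integers(len(S0))            # position of i0, uniform in S0
    k1 = rng.integers(len(S1))            # position of i1, uniform in S1
    i0, i1 = S0[k0], S1[k1]
    if rng.uniform() < min(1.0, w[i0] / w[i1]):
        x[i0], x[i1] = 1, 0               # accept the swap
        S0[k0], S1[k1] = i1, i0           # exchange the indices between the sets
    return x, S0, S1

# Example usage, with (p_n) independent Uniform(0,1), I = N/2, and the
# deterministic initialization x_i = 1 for i = 1, ..., I discussed below:
rng = np.random.default_rng(1)
N, I = 1000, 500
p = rng.uniform(size=N)
w = p / (1 - p)
x = np.zeros(N, dtype=int); x[:I] = 1
S0, S1 = list(np.flatnonzero(x == 0)), list(np.flatnonzero(x == 1))
for _ in range(10000):
    x, S0, S1 = swap_kernel_step(x, S0, S1, w, rng)
```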
The purpose of this article is to show that this Markov chain converges to its target distribution $\mathrm{CB}(p, I)$ in the order of $N \log N$ iterations, under mild conditions on $p$ and $I$. As the cost per iteration is constant, this provides an overall competitive scheme to sample from $\mathrm{CB}(p, I)$.

We denote the transition kernel of the above Metropolis–Hastings algorithm by $P(x, \cdot)$, and a Markov chain generated using the algorithm by $(x^{(t)})_{t \ge 0}$, starting from $x^{(0)} \sim \pi_0$. The initial distribution $\pi_0$ could correspond to setting $I$ components of $x^{(0)}$ to $1$, chosen uniformly without replacement, or to setting $x_i = 1$ for $i = 1, \ldots, I$ and the other components to $0$.

If all probabilities in $p$ are identical, the chain is equivalent to the Bernoulli–Laplace diffusion model, which is well-studied [Diaconis and Shahshahani, 1987, Donnelly et al., 1994, Eskenazis and Nestoridi, 2020]. In particular, Diaconis and Shahshahani [1987] showed that mixing of the chain occurs in the order of $N \log N$ iterations when $I$ is proportional to $N$, via a Fourier analysis of the group structure of the chain. A mixing time of the same order can be obtained with a simple coupling argument [Guruswami, 2000]. Here we consider the case where $p$ is a vector of realizations of random variables in $(0,1)$, and we seek conditions under which the mixing time remains of order $N \log N$. As we will see in Section 2, the coupling argument alone falls apart in the case of unequal probabilities $p$, but it can be successfully combined with a partition of the pairs of states into favorable and unfavorable pairs, to be defined in Section 3.1. Bounds on the transitions from parts of the space are used to define a simple Markov chain on the partition labels, which allows us to obtain our bounds in Section 3.2.

The problem of sampling $\mathrm{CB}(p, I)$ has various applications, such as survey sampling [Chen et al., 1994], hypothesis testing in logistic regression [Chen and Liu, 1997, Broström and Nilsson, 2000], testing the hypothesis of proportional hazards [Broström and Nilsson, 2000], and sampling from a determinantal point process [Hough et al., 2006, Kulesza and Taskar, 2012].

Consider two chains $(x^{(t)})$ and $(\tilde{x}^{(t)})$, each marginally evolving according to $P$, with initializations $x^{(0)} \sim \pi_0$ and $\tilde{x}^{(0)} \sim \mathrm{CB}(p, I)$. Define the sets $\tilde{S}_z = \{n \in [N] : \tilde{x}_n = z\}$ for $z \in \{0,1\}$. The Hamming distance between two states $x$ and $\tilde{x}$ is $d(x, \tilde{x}) = \sum_{n=1}^N \mathbb{1}(x_n \neq \tilde{x}_n)$. Since $x, \tilde{x} \in \mathcal{X}$ both sum to $I$, the distance $d(x, \tilde{x})$ must be an even number. If $d(x, \tilde{x}) = D$, then $|S_0 \cap \tilde{S}_0| = N - I - D/2$, $|S_1 \cap \tilde{S}_1| = I - D/2$, and $|\tilde{S}_0 \cap S_1| = |S_0 \cap \tilde{S}_1| = D/2$, where $|\cdot|$ denotes the cardinality of a set.
Let $d^{(t)}$ denote the distance between $x^{(t)}$ and $\tilde{x}^{(t)}$ at iteration $t$. Following e.g. Guruswami [2000, Section 6.2], in the case of identical probabilities $p = (p_1, \ldots, p_N)$, a path coupling strategy [Bubley and Dyer, 1997] gives an accurate upper bound on the mixing time of the chain. The strategy is to study the distance $d^{(t)}$ as the iterations progress. Denote the total variation distance between the law of $x^{(t)}$ and its limiting distribution $\mathrm{CB}(p, I)$ by $\|x^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}}$. By the coupling inequality and Markov's inequality,
\[
\|x^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}} \le \mathbb{P}\left(x^{(t)} \neq \tilde{x}^{(t)}\right) = \mathbb{P}\left(d^{(t)} > 0\right) \le \mathbb{E}\left[d^{(t)}\right].
\]
If the contraction $\mathbb{E}[d^{(t)} \mid x^{(t-1)} = x, \tilde{x}^{(t-1)} = \tilde{x}] \le (1 - c)\, d(x, \tilde{x})$ holds for all $x, \tilde{x} \in \mathcal{X}$ with $c \in (0,1)$, then $\|x^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}}$ is less than $(1-c)^t\, \mathbb{E}[d^{(0)}]$. Noting that $\mathbb{E}[d^{(0)}] \le N$ and writing $\kappa = -(\log(1-c))^{-1} > 0$, an upper bound on the $\epsilon$-mixing time, defined as the first time $t$ at which $\|x^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}} \le \epsilon$, is given by $\kappa \log(N/\epsilon)$: indeed $(1-c)^t N \le \epsilon$ as soon as $t \ge \kappa \log(N/\epsilon)$. Thus we seek a contraction result, with $c$ as large as possible.
Instead of considering all pairs $(x, \tilde{x}) \in \mathcal{X}^2$, the path coupling argument allows us to restrict our attention to contraction from pairs of adjacent states. We write the set of adjacent states as $\bar{\mathcal{X}}_{\mathrm{adj}} = \{(x, \tilde{x}) \in \mathcal{X}^2 : d(x, \tilde{x}) = 2\}$; see Appendix B.1 for more details on path coupling.

We now introduce a coupling $\bar{P}$ of $P(x, \cdot)$ and $P(\tilde{x}, \cdot)$ for any pair $(x, \tilde{x}) \in \mathcal{X}^2$, although we will primarily be interested in the case $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$. First, sample $i_0, \tilde{i}_0$ from the following maximal coupling of the uniform distributions on $S_0$ and $\tilde{S}_0$:

1. with probability $|S_0 \cap \tilde{S}_0| / (N - I)$, sample $i_0$ uniformly in $S_0 \cap \tilde{S}_0$ and set $\tilde{i}_0 = i_0$;
2. otherwise sample $i_0$ uniformly in $S_0 \setminus \tilde{S}_0$ and $\tilde{i}_0$ uniformly in $\tilde{S}_0 \setminus S_0$, independently.

We then sample $i_1, \tilde{i}_1$ with a similar coupling, independently of the pair $(i_0, \tilde{i}_0)$. Using these proposed indices, swaps are accepted or rejected using a common uniform random number. These steps define a coupled transition kernel $\bar{P}((x, \tilde{x}), \cdot)$.
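A minimal Python sketch of this coupled transition follows, continuing the earlier sketch; `coupled_uniform` and `coupled_swap_step` are our own illustrative names. For clarity the index sets are recomputed at each call, an $O(N)$ operation; the constant-cost implementation mentioned in Appendix C tracks the four sets $S_{ij}$ instead.

```python
def coupled_uniform(S, S_tilde, rng):
    """Maximal coupling of the uniform distributions on two equally-sized
    index sets: draw a common index from the intersection with the largest
    possible probability, otherwise draw independently from the differences."""
    common = sorted(set(S) & set(S_tilde))
    if rng.uniform() < len(common) / len(S):
        i = common[rng.integers(len(common))]
        return i, i
    only_S = sorted(set(S) - set(S_tilde))
    only_St = sorted(set(S_tilde) - set(S))
    return only_S[rng.integers(len(only_S))], only_St[rng.integers(len(only_St))]

def coupled_swap_step(x, xt, w, rng):
    """One transition of the coupled kernel P-bar: maximally coupled index
    proposals, then a common uniform for the two accept/reject decisions."""
    i0, i0t = coupled_uniform(np.flatnonzero(x == 0), np.flatnonzero(xt == 0), rng)
    i1, i1t = coupled_uniform(np.flatnonzero(x == 1), np.flatnonzero(xt == 1), rng)
    u = rng.uniform()                     # common uniform random number
    if u < min(1.0, w[i0] / w[i1]):
        x[i0], x[i1] = 1, 0
    if u < min(1.0, w[i0t] / w[i1t]):
        xt[i0t], xt[i1t] = 1, 0
    return x, xt
```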
Under $\bar{P}$, the distance between the chains can only decrease, so for $(x', \tilde{x}') \sim \bar{P}((x, \tilde{x}), \cdot)$ with $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$, the distance $d(x', \tilde{x}')$ is either zero or two. We denote the expected contraction from $(x, \tilde{x})$ by $c(x, \tilde{x})$, i.e. $\mathbb{E}[d(x', \tilde{x}') \mid x, \tilde{x}] = (1 - c(x, \tilde{x}))\, d(x, \tilde{x})$. In the case $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$, we denote by $a$ the single index at which $x_a = 0$, $\tilde{x}_a = 1$, and by $b$ the single index at which $x_b = 1$, $\tilde{x}_b = 0$. An illustration of such states is in Table 1. Up to a re-labelling of $x$ and $\tilde{x}$, we can assume $w_a \le w_b$. By considering all possibilities when propagating $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$ through $\bar{P}$, we find that
\[
c(x, \tilde{x}) = \frac{1}{(N - I) \times I} \left\{ \left| 1 - \frac{w_a}{w_b} \right| + \sum_{i \in S_1 \cap \tilde{S}_1} \min\left(1, \frac{w_a}{w_i}\right) + \sum_{i \in S_0 \cap \tilde{S}_0} \min\left(1, \frac{w_i}{w_b}\right) \right\}. \tag{1}
\]
The derivation of (1) is in Appendix B.2. The next question is whether this contraction rate $c(x, \tilde{x})$ can be lower bounded by a quantity of order $N^{-1}$; if this is the case, a mixing time of order $N \log N$ would follow.

Table 1: States $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$. They differ at indices $a$ and $b$ only, with $x_a = \tilde{x}_b = 0$ and $x_b = \tilde{x}_a = 1$. The other components of $x$ and $\tilde{x}$ are identical, and equal to 0 or 1.

Figure 1: Meeting times (left) and estimated upper bounds on the 1%-mixing time of the chain $x^{(t)}$ targeting $\mathrm{CB}(p, I)$ (right), both divided by $N \log N$, against $N$. Here the probabilities $p$ are independent Uniform(0,1) and $I = N/2$.

If the probabilities $p$ are identical, $c(x, \tilde{x})$ simplifies to $(N-2)/\{(N-I)I\}$ for all $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$. Assuming that $I \propto N$, this is of order $N^{-1}$ and leads to a mixing time in $N \log N$ [Guruswami, 2000]. It follows from Diaconis and Shahshahani [1987] that this contraction rate is sharp in its dependency on $N$. The same conclusion holds in the case where $(p_n)$ are not identical but are bounded away from 0 and 1, i.e. $w_n \in [w_{\mathrm{lb}}, w_{\mathrm{ub}}]$ with $0 < w_{\mathrm{lb}} < w_{\mathrm{ub}} < \infty$ independent of $N$. In that case, we obtain the rate $c(x, \tilde{x}) \ge (N-2)/\{(N-I)I\} \times w_{\mathrm{lb}}/w_{\mathrm{ub}}$, which worsens as the ratio $w_{\mathrm{lb}}/w_{\mathrm{ub}}$ gets smaller.

The main difficulty addressed in this article arises when $\min_n p_n$ and $\max_n p_n$ get arbitrarily close to 0 and 1 as $N$ increases. This scenario is common: for example, if $(p_n)$ are independent Uniform(0,1), then $\min_n w_n \sim N^{-1}$ and $\max_n w_n \sim N$. Thus for $w_a = \min_n w_n$ and $w_b = \max_n w_n$, the contraction in (1) can be of order $N^{-2}$ when $I \propto N$, which leads to an upper bound on the mixing time of order $N^2 \log N$. To set our expectations appropriately, we follow the approach of Biswas et al. [2019] to obtain empirical upper bounds on the mixing time as $N$ increases. Details of the approach, which is itself based on couplings, are given in Appendix C. Figure 1 shows the estimated upper bound on the mixing time, divided by $N \log N$, as a function of $N$, when $(p_n)$ are generated (once for each value of $N$) from independent Uniform(0,1) and $I$ is set to $N/2$.
The figure suggests that the mixing time might indeed scale as $N \log N$.

Our contribution is to refine the coupling argument in order to establish an upper bound on the mixing time of order $N \log N$, under conditions which allow for example $(p_n)$ to be independent Uniform(0,1). A practical consequence of our result, stated in Section 3.2, is that the simple MCMC algorithm is competitive compared to exact sampling strategies for $\mathrm{CB}(p, I)$.

In the worst case scenario, $w_a$ might be of order $N^{-1}$ and $w_b$ of order $N$, resulting in a rate $c(x, \tilde{x})$ of order $N^{-2}$. However, this is not necessarily typical of a pair of states $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$. This prompts us to partition $\bar{\mathcal{X}}_{\mathrm{adj}}$ into "unfavorable" states, from which the probability of contracting can be smaller than order $N^{-1}$, and "favorable" states, from which meeting occurs with probability of order $N^{-1}$. The precise definition of this partition will be made in relation to the odds $(w_n)$. Since $c(x, \tilde{x})$ in (1) depends on $(w_n)$ and $I$, we will care about statements holding with high probability under the distribution of $(w_n)$ and $I$, as described in Assumptions 3.1 and 3.2. Fortunately, we will see in Proposition 3.1 that favorable states can be reached from unfavorable ones with probability at least of order $N^{-1}$, while unfavorable states are visited from favorable ones with probability at most of order $N^{-1}$. This will prove enough for us to establish a mixing time of order $N \log N$ in Theorem 1.

Assumption 3.1. (Condition on the odds.) The odds $(w_n)$ are such that there exist $\zeta > 0$, $0 < l < r < \infty$ and $\eta > 0$ such that, for all $N$ large enough,
\[
\mathbb{P}\left( |\{n \in [N] : w_n \notin (l, r)\}| \le \zeta N \right) \ge 1 - \exp(-\eta N).
\]

This assumption states that, with exponentially high probability, at most a proportion $\zeta$ of the odds fall outside an interval $(l, r)$ that can be defined independently of $N$. The condition can be verified using for example Hoeffding's inequality if the odds $(w_n)$ are independently and identically distributed on $(0, \infty)$, but also under weaker conditions. The statement "for all $N$ large enough" means for all $N \ge N_0$ where $N_0 \in \mathbb{N}$.

Assumption 3.2. (Conditions on $I$.) There exist $0 < \xi \le 1/2$ and $\eta' > 0$ such that, for all $N$ large enough,
\[
\mathbb{P}\left( \xi N \le I \right) \ge 1 - \exp(-\eta' N).
\]

This assumption formalizes what we mean by $I \propto N$, and is probabilistic rather than setting $I = \lfloor \xi N \rfloor$ for some $\xi \in (0, 1/2]$. On the event $\{\xi N \le I\}$, since $I \le N/2$, we have $\xi N^2/2 \le (N - I) I \le (1 - \xi) N^2/2$.

Proposition 3.1.
Suppose Assumptions 3.1 and 3.2 hold with $\zeta < \xi$. Then we can define $\xi_{F \to D}, \xi_{U \to F}, \xi_{F \to U}, \nu > 0$ and $0 < w_{\mathrm{lo}} < w_{\mathrm{hi}} < \infty$ such that, for all $N$ large enough, with probability at least $1 - \exp(-\nu N)$, the sets of unfavorable and favorable states defined as
\[
\bar{\mathcal{X}}_U = \{(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}} : w_a < w_{\mathrm{lo}} \text{ and } w_b > w_{\mathrm{hi}}\}, \tag{2}
\]
\[
\bar{\mathcal{X}}_F = \{(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}} : w_a \ge w_{\mathrm{lo}} \text{ or } w_b \le w_{\mathrm{hi}}\}, \tag{3}
\]
and the "diagonal" set $\bar{\mathcal{X}}_D = \{(x, \tilde{x}) \in \mathcal{X}^2 : x = \tilde{x}\}$, satisfy the following statements under the coupling $\bar{P}$ described in Section 2.2:
\[
\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_D) \ge \xi_{F \to D}/N, \quad \forall (x, \tilde{x}) \in \bar{\mathcal{X}}_F, \tag{4}
\]
\[
\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_F) \ge \xi_{U \to F}/N, \quad \forall (x, \tilde{x}) \in \bar{\mathcal{X}}_U, \tag{5}
\]
\[
\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_U) \le \xi_{F \to U}/N, \quad \forall (x, \tilde{x}) \in \bar{\mathcal{X}}_F. \tag{6}
\]

The proof in Appendix D.1 relies on a careful inspection of the various cases arising in the propagation of the coupled chains. The proposition provides bounds on the transition probabilities between the subsets $\bar{\mathcal{X}}_U$, $\bar{\mathcal{X}}_F$ and $\bar{\mathcal{X}}_D$.

We relate the coupled chain $(x^{(t)}, \tilde{x}^{(t)})$ to an auxiliary Markov chain denoted by $(Z^{(t)})$, defined on a space with three states $\{1, 2, 3\}$, associated with the subsets $\bar{\mathcal{X}}_U$, $\bar{\mathcal{X}}_F$ and $\bar{\mathcal{X}}_D$, respectively. We introduce the Markov transition matrix
\[
Q = \begin{pmatrix} 1 - \xi_{U \to F}/N & \xi_{U \to F}/N & 0 \\ \xi_{F \to U}/N & 1 - \xi_{F \to U}/N - \xi_{F \to D}/N & \xi_{F \to D}/N \\ 0 & 0 & 1 \end{pmatrix}, \tag{7}
\]
where the constants $\xi_{F \to D}, \xi_{U \to F}, \xi_{F \to U} > 0$ are those of Proposition 3.1, and $N$ is large enough for each entry, including $1 - \xi_{U \to F}/N$ and $1 - \xi_{F \to U}/N - \xi_{F \to D}/N$, to be positive. We then observe that a Markov chain $(Z^{(t)})$ with transition matrix $Q$ is such that there exists $r \in (0, 1)$ independent of $N$ satisfying $\mathbb{P}(Z^{(N)} = 3 \mid Z^{(0)} = 1) \ge 1 - r$; details can be found in Appendix D.2.
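The bound $\mathbb{P}(Z^{(N)} = 3 \mid Z^{(0)} = 1) \ge 1 - r$ can be illustrated numerically; the following sketch uses arbitrary illustrative values for the $\xi$ constants, which in the paper come from Proposition 3.1 and are not explicit:

```python
import numpy as np

# Illustrative values; the actual constants are those of Proposition 3.1.
xi_UF, xi_FU, xi_FD = 1.0, 0.5, 1.0

for N in [10**2, 10**3, 10**4, 10**5]:
    Q = np.array([
        [1 - xi_UF / N, xi_UF / N,                 0.0      ],
        [xi_FU / N,     1 - xi_FU / N - xi_FD / N, xi_FD / N],
        [0.0,           0.0,                       1.0      ],
    ])
    QN = np.linalg.matrix_power(Q, N)      # N-step transition probabilities
    print(N, QN[0, 2])                     # P(Z^(N) = 3 | Z^(0) = 1)
```

The printed absorption probabilities stabilize to a constant in $(0, 1)$ as $N$ grows, consistent with a lower bound $1 - r$ that does not depend on $N$.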
We now relate the auxiliary chain to $(x^{(t)}, \tilde{x}^{(t)})$ using a strategy inspired by Jacob and Ryder [2014]. Consider the variable $B^{(t)} \in \{1, 2, 3\}$ defined as 1 if $(x^{(t)}, \tilde{x}^{(t)}) \in \bar{\mathcal{X}}_U$, 2 if $(x^{(t)}, \tilde{x}^{(t)}) \in \bar{\mathcal{X}}_F$, and 3 if $x^{(t)} = \tilde{x}^{(t)}$. The key idea is to construct the auxiliary chain $(Z^{(t)})$ on $\{1, 2, 3\}$ in such a way that it is (marginally) a Markov chain with transition matrix $Q$ in (7), and also such that $Z^{(t)} \le B^{(t)}$ for all $t$ almost surely; this is possible thanks to Proposition 3.1. Thus the event $\{Z^{(t)} = 3\}$ will imply $\{B^{(t)} = 3\} = \{x^{(t)} = \tilde{x}^{(t)}\}$, and we can translate the hitting time of $(Z^{(t)})$ to its absorbing state into a statement about the meeting time of $(x^{(t)}, \tilde{x}^{(t)})$. An explicit construction of $(Z^{(t)})$ is described in Appendix D.3; Figure 2 represents the dependency structure, where $Z^{(t+1)}$ is constructed given $Z^{(t)}$, but also conditionally upon $(x^{(t)}, \tilde{x}^{(t)})$ and $(x^{(t+1)}, \tilde{x}^{(t+1)})$, to ensure that the inequality $Z^{(t+1)} \le B^{(t+1)}$ holds almost surely.

Figure 2: Conditional dependencies of the processes $(x^{(t)}, \tilde{x}^{(t)})$ and $(Z^{(t)})$.

The convergence of $(Z^{(t)})$ to its absorbing state translates into an upper bound on the mixing time of $(x^{(t)})$ of the order of $N \log N$ iterations, which is our main result.

Theorem 1.
Under Assumptions 3.1 and 3.2 with $\zeta < \xi$, there exist $\kappa > 0$, $\nu > 0$ and $N_0 \in \mathbb{N}$ such that, for any $\epsilon \in (0, 1)$ and for all $N \ge N_0$, with probability at least $1 - \exp(-\nu N)$, we have
\[
\|x^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}} \le \epsilon \quad \text{for all } t \ge \kappa N \log(N/\epsilon).
\]
The proof of Theorem 1 is given in Appendix D.4.

Using the strategy of Biswas et al. [2019], we assess the convergence rate of the chain in the regime where $I$ is sub-linear in $N$. Figure 3 shows the estimated upper bounds on the mixing time obtained in the case where $I$ is fixed to 10 while $N$ grows, and where $(p_n)$ are independent Uniform(0,1) (generated once for each value of $N$).

Figure 3: Meeting times (left) and estimated upper bounds on the 1%-mixing time, divided by $N$ (right), against $N$. The probabilities $p$ are independent Uniform(0,1) and $I = 10$ for all $N$.

The figure might suggest that the mixing time grows at a slower rate than $N$ in this setting, and thus that MCMC is competitive relative to exact sampling. Understanding the small-$I$ regime remains an open problem.

Our approach relies on a partition of the state space and an auxiliary Markov chain defined on the subsets given by the partition. This technique bears a resemblance to partitioning the state space with more common drift and contraction conditions [Durmus and Moulines, 2015, Qin and Hobert, 2019], but appears to be distinct.

The present setting is similar to the question of sampling permutations via random swaps. For that problem, direct applications of the coupling argument result in upper bounds on the mixing time of the order of at least $N^2$. Bormashenko [2011] devises an original variant of the path coupling strategy to obtain an upper bound in $N \log N$, which is the correct dependency on $N$; see Berestycki and Şengül [2019] for recent developments leading to sharp constants.

The proposed analysis captures the impact of the dimension $N$ faithfully. It fails to provide accurate constants and exact characterizations of how the mixing time depends on the distribution of the probabilities $(p_n)$ and the sum $I$. Yet our analysis already supports the use of MCMC over exact sampling strategies for conditional Bernoulli sampling, especially as part of encompassing MCMC algorithms such as that of Yang et al. [2016] for Bayesian variable selection.

Acknowledgments
This work was funded by CY Initiative of Excellence (grant "Investissements d'Avenir" ANR-16-IDEX-0008). Pierre E. Jacob gratefully acknowledges support from the National Science Foundation through grants DMS-1712872 and DMS-1844695.
References
Nathanaël Berestycki and Batı Şengül. Cutoff for conjugacy-invariant random walks on the permutation group. Probability Theory and Related Fields, 173(3-4):1197-1241, 2019.

Niloy Biswas, Pierre E. Jacob, and Paul Vanetti. Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7391-7401, 2019.

Olena Bormashenko. A coupling argument for the random transposition walk. arXiv preprint arXiv:1109.3915, 2011.

Göran Broström and Leif Nilsson. Acceptance-rejection sampling from the conditional distribution of independent discrete random variables, given their sum. Statistics: A Journal of Theoretical and Applied Statistics, 34(3):247-257, 2000.

Russ Bubley and Martin Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages 223-231. IEEE, 1997.

Sean X. Chen and Jun S. Liu. Statistical applications of the Poisson-Binomial and conditional Bernoulli distributions. Statistica Sinica, pages 875-892, 1997.

Xiang-Hui Chen, Arthur P. Dempster, and Jun S. Liu. Weighted finite population sampling to maximize entropy. Biometrika, 81(3):457-469, 1994.

Persi Diaconis and Mehrdad Shahshahani. Time to reach stationarity in the Bernoulli-Laplace diffusion model. SIAM Journal on Mathematical Analysis, 18(1):208-218, 1987.

Peter Donnelly, Peter Lloyd, and Aidan Sudbury. Approach to stationarity of the Bernoulli-Laplace diffusion model. Advances in Applied Probability, 26(3):715-727, 1994.

Alain Durmus and Éric Moulines. Quantitative bounds of convergence for geometrically ergodic Markov chain in the Wasserstein distance with application to the Metropolis adjusted Langevin algorithm. Statistics and Computing, 25(1):5-19, 2015.

Alexandros Eskenazis and Evita Nestoridi. Cutoff for the Bernoulli-Laplace urn model with o(n) swaps. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 56(4):2621-2639, 2020. doi: 10.1214/20-AIHP1052.

Venkatesan Guruswami. Rapidly mixing Markov chains: A comparison of techniques. Available at cs.washington.edu/homes/venkat/pubs/papers.html, 2000.

J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal processes and independence. Probability Surveys, 3:206-229, 2006.

Pierre E. Jacob and Robin J. Ryder. The Wang-Landau algorithm reaches the flat histogram criterion in finite time. The Annals of Applied Probability, 24(1):34-53, 2014.

Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012.

Jun S. Liu, Andrew F. Neuwald, and Charles E. Lawrence. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association, 90(432):1156-1170, 1995.

Qian Qin and James P. Hobert. Geometric convergence bounds for Markov chains in Wasserstein distance based on generalized drift and contraction conditions. arXiv preprint arXiv:1902.02964, 2019.

Yun Yang, Martin J. Wainwright, and Michael I. Jordan. On the computational complexity of high-dimensional Bayesian variable selection. The Annals of Statistics, 44(6):2497-2532, 2016.
A Exact sampling of conditional Bernoulli
We describe a procedure to sample exactly from $\mathrm{CB}(p, I)$ for a cost of order $IN$. First, compute an $(I+1) \times N$ matrix of entries $q(i, n)$, for $i \in \{0, \ldots, I\}$, $n \in [N]$, where $q(i, n) = \mathbb{P}(\sum_{m=n}^N x_m = i)$ with each $x_n$ independent Bernoulli($p_n$). To compute these entries, proceed as follows. The initial conditions are given by
\[
q(0, n) = \mathbb{P}\left( \sum_{m=n}^N x_m = 0 \right) = \prod_{m=n}^N \mathbb{P}(x_m = 0) = \prod_{m=n}^N (1 - p_m), \quad n \in [N], \tag{8}
\]
in the case of no success; $q(1, N) = \mathbb{P}(x_N = 1) = p_N$, where the sum reduces to a single Bernoulli variable; and $q(i, n) = 0$ for $i > N - n + 1$, because a sum of $N - n + 1$ Bernoulli variables cannot be larger than $N - n + 1$; in particular $q(i, N) = 0$ for all $i \ge 2$.
The other entries $q(i, n)$ can be obtained recursively, via
\[
q(i, n) = p_n\, q(i-1, n+1) + (1 - p_n)\, q(i, n+1), \quad i \in [I], \; n \in [N-1].
\]
Indeed, for $i \in [I]$ and $n \in [N-1]$, by conditioning on $x_n \in \{0, 1\}$, the law of total probability gives
\[
q(i, n) = \mathbb{P}(x_n = 1)\, \mathbb{P}\left( \sum_{m=n}^N x_m = i \,\middle|\, x_n = 1 \right) + \mathbb{P}(x_n = 0)\, \mathbb{P}\left( \sum_{m=n}^N x_m = i \,\middle|\, x_n = 0 \right) = p_n\, \mathbb{P}\left( \sum_{m=n+1}^N x_m = i - 1 \right) + (1 - p_n)\, \mathbb{P}\left( \sum_{m=n+1}^N x_m = i \right). \tag{9}
\]
Having obtained the $(I+1) \times N$ entries $q(i, n)$, we now derive a sequential decomposition of the conditional Bernoulli distribution that enables sampling in the order of $N$ operations. To sample $x_1$, we compute $\mathbb{P}(x_1 = 1 \mid \sum_{n=1}^N x_n = i)$ as
\[
\mathbb{P}\left( x_1 = 1 \,\middle|\, \sum_{n=1}^N x_n = i \right) = \frac{\mathbb{P}(x_1 = 1)\, \mathbb{P}\left( \sum_{n=1}^N x_n = i \,\middle|\, x_1 = 1 \right)}{\mathbb{P}\left( \sum_{n=1}^N x_n = i \right)}. \tag{10}
\]
Note that the denominator is $q(i, 1)$ and the numerator is $p_1\, q(i-1, 2)$.
Similarly, for $n \in \{2, \ldots, N-1\}$,
\[
\mathbb{P}\left( x_n = 1 \,\middle|\, x_1, \ldots, x_{n-1}, \sum_{n=1}^N x_n = i \right) = \frac{\mathbb{P}(x_n = 1)\, \mathbb{P}\left( \sum_{m=n}^N x_m = i - i_{n-1} \,\middle|\, x_n = 1 \right)}{\mathbb{P}\left( \sum_{m=n}^N x_m = i - i_{n-1} \right)}, \tag{11}
\]
with $i_n = \sum_{m=1}^n x_m$. The numerator can be recognized as $p_n\, q(i - i_{n-1} - 1, n+1)$ and the denominator as $q(i - i_{n-1}, n)$. Lastly, given $x_1, \ldots, x_{N-1}$ and $\sum_{n=1}^N x_n = i$, we can set $x_N$ to zero or one deterministically, namely $x_N = i - i_{N-1}$.
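A minimal Python sketch of this procedure follows; the function name is ours. For convenience the table is extended with an "empty sum" column, so that the base case $q(1, N) = p_N$ falls out of the same recursion; for large $N$ a careful implementation would work on the log scale to avoid underflow.

```python
import numpy as np

def sample_conditional_bernoulli(p, I, rng):
    """Exact draw from CB(p, I), following the two passes described above:
    a backward recursion for q[i, n] = P(sum_{m >= n} x_m = i), at cost O(I N),
    then a forward sequential draw of x_1, ..., x_N, at cost O(N)."""
    N = len(p)
    q = np.zeros((I + 1, N + 1))          # last column corresponds to the empty sum
    q[0, N] = 1.0
    for n in range(N - 1, -1, -1):        # backward recursion, equation (9)
        q[0, n] = (1 - p[n]) * q[0, n + 1]
        for i in range(1, I + 1):
            q[i, n] = p[n] * q[i - 1, n + 1] + (1 - p[n]) * q[i, n + 1]
    x = np.zeros(N, dtype=int)
    remaining = I                         # i minus the running sum i_{n-1}
    for n in range(N):                    # sequential draws, equations (10)-(11)
        if remaining == 0:
            break                         # the remaining components are zero
        if rng.uniform() < p[n] * q[remaining - 1, n + 1] / q[remaining, n]:
            x[n] = 1
            remaining -= 1
    return x

# Example usage:
rng = np.random.default_rng(0)
p = rng.uniform(size=20)
x = sample_conditional_bernoulli(p, 8, rng)   # a vector with exactly 8 ones
```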
B Contractive coupling

B.1 Path coupling
Given current states $(x, \tilde{x}) \in \mathcal{X}^2$, let $(x', \tilde{x}') \sim \bar{P}((x, \tilde{x}), \cdot)$ denote new states sampled from $\bar{P}((x, \tilde{x}), \cdot)$, a coupling of $P(x, \cdot)$ and $P(\tilde{x}, \cdot)$. We want to establish the contraction
\[
\mathbb{E}\left[ d(x', \tilde{x}') \mid x, \tilde{x} \right] \le (1 - c)\, d(x, \tilde{x}), \quad \forall (x, \tilde{x}) \in \mathcal{X}^2, \tag{12}
\]
with a contraction rate $c \in (0, 1)$ as large as possible. The path coupling argument [Bubley and Dyer, 1997] shows that it suffices to establish the contraction between adjacent states,
\[
\mathbb{E}\left[ d(x', \tilde{x}') \mid x, \tilde{x} \right] \le (1 - c)\, d(x, \tilde{x}), \quad \forall (x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}. \tag{13}
\]
It operates as follows. Suppose that (13) holds under the coupling $\bar{P}$. For two arbitrary states $(x, \tilde{x}) \notin \bar{\mathcal{X}}_{\mathrm{adj}}$ with $d(x, \tilde{x}) = D > 2$, we consider a "path" $x = z_0, z_1, \ldots, z_L = \tilde{x}$ of $L = D/2$ steps such that $d(z_\ell, z_{\ell+1}) = 2$ for $\ell = 0, \ldots, L-1$, so that $\sum_{\ell=1}^L d(z_{\ell-1}, z_\ell)$ equals $D$. As there could be multiple such paths, to remove any ambiguity, we define a deterministic path by going through $x$ and $\tilde{x}$ from left to right, introducing a new element in the path for each encountered discrepancy. We then generate the new states $(x', \tilde{x}')$ using the following procedure:

1. sample $(z'_0, z'_1) \sim \bar{P}((z_0, z_1), \cdot)$;
2. for $\ell = 2, \ldots, L$, sample $z'_\ell$ from the conditional distribution of $\bar{P}((z_{\ell-1}, z_\ell), \cdot)$ given $z'_{\ell-1}$;
3. set $x' = z'_0$ and $\tilde{x}' = z'_L$.

By construction we have $\tilde{x}' \mid \tilde{x} \sim P(\tilde{x}, \cdot)$, thus this scheme defines a coupling of $P(x, \cdot)$ and $P(\tilde{x}, \cdot)$. Under the above coupling, we have
\[
\mathbb{E}\left[ d(x', \tilde{x}') \mid x, \tilde{x} \right] = \mathbb{E}\left[ d(x', \tilde{x}') \mid x = z_0, \ldots, z_L = \tilde{x} \right] \le \mathbb{E}\left[ \sum_{\ell=1}^L d(z'_{\ell-1}, z'_\ell) \,\middle|\, x = z_0, \ldots, z_L = \tilde{x} \right] = \sum_{\ell=1}^L \mathbb{E}\left[ d(z'_{\ell-1}, z'_\ell) \mid z_{\ell-1}, z_\ell \right] \le \sum_{\ell=1}^L (1 - c)\, d(z_{\ell-1}, z_\ell),
\]
for any $(x, \tilde{x}) \notin \bar{\mathcal{X}}_{\mathrm{adj}}$. The first equality holds because $(z_\ell)$ is obtained deterministically given $x, \tilde{x}$. The rest follow from triangle inequalities, linearity of expectation, conditional independencies between the variables introduced in the coupling construction, and the assumption of contraction from adjacent states in (13). The last expression is equal to $(1 - c)\, d(x, \tilde{x})$ by construction of the path. In summary, the path coupling argument allows us to extend contraction between adjacent states (13) to contraction for any pair of states (12) with the same rate.

B.2 Contraction rate of $\bar{P}$ for adjacent states

We compute the contraction rate $c(x, \tilde{x})$ in (1) under $\bar{P}$, the coupling described in Section 2.2. The coupling of $(i_0, \tilde{i}_0)$ is such that $\mathbb{P}(i_0 = a, \tilde{i}_0 = b) = (N - I)^{-1}$. Similarly, the maximal coupling of $(i_1, \tilde{i}_1)$ leads to $\mathbb{P}(i_1 = b, \tilde{i}_1 = a) = I^{-1}$. Under that coupling, if none of the indices $i_0, \tilde{i}_0, i_1, \tilde{i}_1$ are in $\{a, b\}$, then the proposed swaps will be either accepted or rejected jointly and the distance $d(x, \tilde{x})$ will be unchanged. We consider proposed swaps that could affect the discrepancy.

1. The index $i_0$ can be equal to $a$ (with probability $(N - I)^{-1}$). In that case, $\tilde{i}_0$ must be equal to $b$, which is the only index in $\tilde{S}_0$ that is not in $S_0$; this comes from the maximal coupling strategy for sampling $(i_0, \tilde{i}_0)$. The index $i_1$ can be equal to $b$ (with probability $I^{-1}$), in which case $\tilde{i}_1 = a$, again due to the maximal coupling strategy. The discrepancy is then reduced if exactly one of the two proposed swaps is accepted, which happens with probability
\[
\left| \min\left(1, \frac{w_{i_0}}{w_{i_1}}\right) - \min\left(1, \frac{w_{\tilde{i}_0}}{w_{\tilde{i}_1}}\right) \right| = \left| 1 - \frac{w_a}{w_b} \right|.
\]
2. The index $i_0$ can be equal to $a$ (again with probability $(N - I)^{-1}$) and $i_1$ not equal to $b$ (with probability $(I - 1) I^{-1}$). Then $\tilde{i}_1 = i_1$ (maximal coupling), and given $\tilde{i}_0 = b$, the discrepancy is reduced if both proposed swaps are accepted. The probability of reducing the discrepancy given $\{i_0 = a, i_1 \neq b\}$ is
\[
\frac{1}{I - 1} \sum_{i \in S_1 \cap \tilde{S}_1} \min\left(1, \frac{w_a}{w_i}\right).
\]
3. The index $i_0$ can be different from $a$, in which case $\tilde{i}_0 = i_0$, and the discrepancy might be reduced if $i_1 = b$, $\tilde{i}_1 = a$ and both swaps are accepted. Given $\{i_0 \neq a, i_1 = b\}$, this occurs with probability
\[
\frac{1}{N - I - 1} \sum_{i \in S_0 \cap \tilde{S}_0} \min\left(1, \frac{w_i}{w_b}\right).
\]

The contraction rate in (1) follows from summing up the above three possibilities; in each case, the probability of selecting the relevant indices multiplies the conditional probability, which yields the common factor $\{(N - I) I\}^{-1}$ in (1).
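The expression (1) can be checked by simulation; the following sketch, using our own variable names, compares (1) with a Monte Carlo estimate of the one-step meeting probability under the `coupled_swap_step` sketch of Section 2.2 (recall that, from an adjacent pair, $c(x, \tilde{x})$ equals the probability that the chains meet in one step).

```python
rng = np.random.default_rng(0)
N, I = 20, 8
p = rng.uniform(size=N)
w = p / (1 - p)
x = np.zeros(N, dtype=int); x[:I] = 1
xt = x.copy()
a, b = I, 0                              # x_a = 0, x_b = 1; flip both in x-tilde
xt[a], xt[b] = 1, 0
if w[a] > w[b]:                          # relabel so that w_a <= w_b
    x, xt, a, b = xt, x, b, a
S1c = [i for i in range(N) if x[i] == 1 and xt[i] == 1]   # S_1 intersected with S_1-tilde
S0c = [i for i in range(N) if x[i] == 0 and xt[i] == 0]   # S_0 intersected with S_0-tilde
c = (abs(1 - w[a] / w[b])
     + sum(min(1, w[a] / w[i]) for i in S1c)
     + sum(min(1, w[i] / w[b]) for i in S0c)) / ((N - I) * I)
trials = 10**5
meets = sum(np.array_equal(*coupled_swap_step(x.copy(), xt.copy(), w, rng))
            for _ in range(trials)) / trials
print(c, meets)                          # the two values should agree closely
```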
C Estimation of upper bounds on the mixing time

We briefly describe the choices made in applying the $L$-lag coupling approach of Biswas et al. [2019]. For the choice of coupling, we implemented the kernel $\bar{P}$ presented in Section 2.2. A careful implementation of the kernel, keeping track of the four sets $S_{ij}$ of indices $n$ such that $x_n = i$, $\tilde{x}_n = j$ for $i, j \in \{0, 1\}$, results in a constant cost per iteration of the coupled chain. The chains are initialized by sampling $I$ indices without replacement in binary vectors of length $N$, and setting these components to one and the others to zero.

We use a lag of $L = 1$, and run 500 independent runs of coupled lagged chains, to obtain as many realizations of the meeting time $\tau$. We employ the key identity in Biswas et al. [2019],
\[
\|x^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}} \le \mathbb{E}\left[ \max\left(0, \left\lceil \frac{\tau - L - t}{L} \right\rceil\right) \right].
\]
The expectation on the right-hand side is estimated by an average of independent copies of the meeting time, for any desired iteration $t$. As the estimate is itself decreasing in $t$, we can find the smallest $t$ such that the estimate is less than $\epsilon = 0.01$, and this provides an estimated upper bound on the $\epsilon$-mixing time.
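A minimal sketch of this estimator follows, reusing the `coupled_swap_step` sketch from Section 2.2; with lag $L = 1$, one chain is advanced a single step before the coupled iterations begin. This is a simplified illustration, not the constant-cost implementation described above.

```python
def meeting_time(p, I, rng, lag=1, max_iter=10**7):
    """Sample one realization of the meeting time tau of two lagged chains."""
    N = len(p)
    w = p / (1 - p)
    x = np.zeros(N, dtype=int);  x[rng.choice(N, size=I, replace=False)] = 1
    xt = np.zeros(N, dtype=int); xt[rng.choice(N, size=I, replace=False)] = 1
    for _ in range(lag):                  # advance the first chain by the lag
        S0, S1 = np.flatnonzero(x == 0), np.flatnonzero(x == 1)
        i0, i1 = rng.choice(S0), rng.choice(S1)
        if rng.uniform() < min(1.0, w[i0] / w[i1]):
            x[i0], x[i1] = 1, 0
    for t in range(lag + 1, max_iter):    # coupled iterations until meeting
        x, xt = coupled_swap_step(x, xt, w, rng)
        if np.array_equal(x, xt):
            return t
    return max_iter

def tv_upper_bound(taus, t, lag=1):
    """Estimate of E[max(0, ceil((tau - L - t) / L))] from meeting times."""
    taus = np.asarray(taus, dtype=float)
    return np.mean(np.maximum(0.0, np.ceil((taus - lag - t) / lag)))

# taus = [meeting_time(p, I, rng) for _ in range(500)]
# The estimated 1%-mixing time is the smallest t with tv_upper_bound(taus, t) < 0.01.
```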
D Proofs
D.1 Proof of Proposition 3.1
Under the assumptions, we define $0 < w_{\mathrm{lo}} < w_{\mathrm{hi}} < \infty$ and $\delta > 0$ such that $w_{\mathrm{lo}} + \delta = l$ and $w_{\mathrm{hi}} - \delta = r$, with $(l, r)$ the interval in Assumption 3.1. The assumption thus guarantees that, with high probability, the number of odds in $(w_n)$ that are outside of $(w_{\mathrm{lo}} + \delta, w_{\mathrm{hi}} - \delta)$ is less than $\zeta N$. In particular, the number of odds above $w_{\mathrm{hi}}$ and the number of odds below $w_{\mathrm{lo}}$ are both less than $\zeta N$. The $\delta$ term can be arbitrarily small and is helpful in a calculation below.

Proof of (4). We start with the transition from $\bar{\mathcal{X}}_F$ to $\bar{\mathcal{X}}_D$. Assume $(x, \tilde{x}) \in \bar{\mathcal{X}}_F$. For such states, $w_a \ge w_{\mathrm{lo}}$ or $w_b \le w_{\mathrm{hi}}$, therefore the contraction rate $c(x, \tilde{x})$ in (1) is at least
\[
\frac{1}{I(N - I)} \sum_{i \in S_1 \cap \tilde{S}_1} \min(1, w_{\mathrm{lo}}/w_i) \quad \text{if } w_a \ge w_{\mathrm{lo}}, \qquad \frac{1}{I(N - I)} \sum_{i \in S_0 \cap \tilde{S}_0} \min(1, w_i/w_{\mathrm{hi}}) \quad \text{if } w_b \le w_{\mathrm{hi}}.
\]

- First case ($w_a \ge w_{\mathrm{lo}}$): Note that there are $I - 1$ terms in the sum over $S_1 \cap \tilde{S}_1$. Using Assumption 3.2, $I$ is at least $\xi N$ with high probability. Using Assumption 3.1, the number of odds in $(w_n)$ above $w_{\mathrm{hi}}$ is less than $\zeta N$. On the intersection of these events, which is not empty if $N$ is large enough, among the $I - 1$ terms in the sum over $S_1 \cap \tilde{S}_1$, there are at least $(\xi - \zeta) N - 1$ indices $i$ with $w_i \le w_{\mathrm{hi}}$. The sum $\sum_{i \in S_1 \cap \tilde{S}_1} \min(1, w_{\mathrm{lo}}/w_i)$ is thus larger than $((\xi - \zeta) N - 1)\, w_{\mathrm{lo}}/w_{\mathrm{hi}}$. We obtain the lower bound
\[
\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_D) \ge 2 (1 - \xi)^{-1} N^{-2} \cdot ((\xi - \zeta) N - 1)\, w_{\mathrm{lo}}/w_{\mathrm{hi}}.
\]

- Second case ($w_b \le w_{\mathrm{hi}}$): In that case, among the $N - I - 1$ terms in the sum over $S_0 \cap \tilde{S}_0$, under the assumptions there are at least $(1/2 - \zeta) N - 1$ odds $(w_n)$ that are larger than $w_{\mathrm{lo}}$. Thus the sum $\sum_{i \in S_0 \cap \tilde{S}_0} \min(1, w_i/w_{\mathrm{hi}})$ is larger than $((1/2 - \zeta) N - 1)\, w_{\mathrm{lo}}/w_{\mathrm{hi}}$, and we obtain the lower bound
\[
\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_D) \ge 2 (1 - \xi)^{-1} N^{-2} \cdot ((1/2 - \zeta) N - 1)\, w_{\mathrm{lo}}/w_{\mathrm{hi}}.
\]

From the two cases, we obtain for $N$ large enough a lower bound of the form $\xi_{F \to D}/N$ for some $\xi_{F \to D} > 0$.

Proof of (5). We next consider the probability of transitioning from $\bar{\mathcal{X}}_U$ to $\bar{\mathcal{X}}_F$. For any $(x, \tilde{x}) \in \bar{\mathcal{X}}_U$, such transitions occur in two distinct cases.

- We propose the swaps $(i_0, i_1) = (a, i)$ and $(\tilde{i}_0, \tilde{i}_1) = (b, i)$ for some $i \in S_1 \cap \tilde{S}_1$ such that $w_i < w_{\mathrm{hi}}$, and exactly one of the two swaps is accepted. In that case $b$ becomes $i$.
- We propose the swaps $(i_0, i_1) = (i, b)$ and $(\tilde{i}_0, \tilde{i}_1) = (i, a)$ for some $i \in S_0 \cap \tilde{S}_0$ such that $w_i > w_{\mathrm{lo}}$, and exactly one of the two swaps is accepted. In that case $a$ becomes $i$.

Note that the case of swapping components $a$ and $b$ results in an unfavorable state, upon relabeling of the states $x$ and $\tilde{x}$ to maintain $w_a \le w_b$. As we only seek a lower bound of $\xi_{U \to F}/N$ in (5), it is sufficient to consider only the first case. Since $w_a \le w_b$, if only one swap is accepted, it must be the one on the $\tilde{x}$ chain; then $b$ becomes $i$ and $a$ remains unchanged.

Selection of $i_0 = a$ and $i_1 \neq b$ occurs with probability $(N - I)^{-1} \times (I - 1) I^{-1}$. Acceptance of exactly one swap occurs with probability $|\min(1, w_a/w_i) - \min(1, w_b/w_i)|$. Thus the probability of moving to $\bar{\mathcal{X}}_F$ via the acceptance of one swap is at least
\[
\frac{1}{(N - I) I} \sum_{i \in S_1 \setminus \{b\}} \left| \min(1, w_a/w_i) - \min(1, w_b/w_i) \right| \mathbb{1}(w_i < w_{\mathrm{hi}})
\]
\[
= \frac{1}{(N - I) I} \sum_{i \in S_1 \setminus \{b\}} \left| 1 - \min(1, w_a/w_i) \right| \mathbb{1}(w_i < w_{\mathrm{hi}}) \quad \text{because } w_b > w_{\mathrm{hi}}
\]
\[
\ge \frac{1}{(N - I) I} \sum_{i \in S_1 \setminus \{b\}} \left| 1 - \min(1, w_a/w_i) \right| \mathbb{1}(w_{\mathrm{lo}} + \delta < w_i < w_{\mathrm{hi}}) \quad \text{because fewer terms}
\]
\[
\ge \frac{1}{(N - I) I} \sum_{i \in S_1 \setminus \{b\}} \left| 1 - w_a/(w_{\mathrm{lo}} + \delta) \right| \mathbb{1}(w_{\mathrm{lo}} + \delta < w_i < w_{\mathrm{hi}})
\]
\[
\ge \frac{1}{(N - I) I} \left( (\xi - \zeta) N - 1 \right) \left| 1 - w_a/(w_{\mathrm{lo}} + \delta) \right|.
\]
The last inequality holds when $I \ge \xi N$ and when at most $\zeta N$ entries of $(w_n)$ are outside of $(w_{\mathrm{lo}} + \delta, w_{\mathrm{hi}})$; again we work on the intersection of the high-probability events specified by the assumptions. We conclude by noting that, for $w_a < w_{\mathrm{lo}}$, we have $|1 - w_a/(w_{\mathrm{lo}} + \delta)| \ge \delta/(w_{\mathrm{lo}} + \delta)$, which is a constant independent of $N$; this is where the $\delta$ term comes in handy. Thus for $(x, \tilde{x}) \in \bar{\mathcal{X}}_U$, we can move to $\bar{\mathcal{X}}_F$ with probability
\[
\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_F) \ge 2 (1 - \xi)^{-1} N^{-2} \cdot ((\xi - \zeta) N - 1)\, \delta/(w_{\mathrm{lo}} + \delta),
\]
which is at least $\xi_{U \to F}/N$ for some $\xi_{U \to F} > 0$ as $N$ gets large.

Proof of (6). We finally consider the probability of moving from $\bar{\mathcal{X}}_F$ to $\bar{\mathcal{X}}_U$. As we want an upper bound on this quantity, we have to consider all possible routes from $(x, \tilde{x}) \in \bar{\mathcal{X}}_F$ to $\bar{\mathcal{X}}_U$. If the state is such that $w_a \ge w_{\mathrm{lo}}$ and $w_b \le w_{\mathrm{hi}}$, then the pair cannot transition to an unfavorable state; in other words, $\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_U)$ can be equal to zero. A transition to an unfavorable state occurs in two distinct cases:

- if $w_a < w_{\mathrm{lo}}$ and $w_b \le w_{\mathrm{hi}}$, and the swap changes $b$ to some $i$ with $w_i > w_{\mathrm{hi}}$;
- if $w_b > w_{\mathrm{hi}}$ and $w_a \ge w_{\mathrm{lo}}$, and the swap changes $a$ to some $i$ with $w_i < w_{\mathrm{lo}}$.

The first case happens if the drawn indices are $(i_0, \tilde{i}_0) = (a, b)$ and $i_1 = \tilde{i}_1 \in S_1 \setminus \{b\}$ with $w_{i_1} > w_{\mathrm{hi}}$, and if we accept the swap for the chain $\tilde{x}$ but not for $x$, which occurs with probability $(w_b - w_a)/w_{i_1}$. Thus the probability associated with this transition is
\[
\frac{1}{(N - I) I} \sum_{i \in S_1 : w_i > w_{\mathrm{hi}}} \frac{w_b - w_a}{w_i} \le \frac{1}{(N - I) I} \sum_{i \in S_1} \mathbb{1}(w_i > w_{\mathrm{hi}}) \le 2 \xi^{-1} N^{-2} \cdot \zeta N,
\]
in the event that the number of odds above $w_{\mathrm{hi}}$ is less than $\zeta N$.

The second case occurs if the drawn indices are $i_0 = \tilde{i}_0 \in S_0 \setminus \{a\}$ with $w_{i_0} < w_{\mathrm{lo}}$ and $(i_1, \tilde{i}_1) = (b, a)$, and if we accept the swap for the chain $\tilde{x}$ but not for $x$, which occurs with probability $w_{i_0} (w_a^{-1} - w_b^{-1}) \le 1$. Thus the probability associated with this transition is at most
\[
\frac{1}{(N - I) I} \sum_{i \in S_0} \mathbb{1}(w_i < w_{\mathrm{lo}}) \le 2 \xi^{-1} N^{-2} \cdot \zeta N,
\]
in the event that the number of odds below $w_{\mathrm{lo}}$ is less than $\zeta N$. Combining the two cases yields $\bar{P}((x, \tilde{x}), \bar{\mathcal{X}}_U) \le \xi_{F \to U}/N$ for some $\xi_{F \to U} > 0$.
D.2 Absorption of the auxiliary chain

The transitions of $(Z^{(t)})$ from state 1 to state 2, and from state 2 to state 3, each occur with probability of order $N^{-1}$ per iteration, so over $N$ iterations the chain reaches its absorbing state with probability bounded away from zero uniformly in $N$. Using the fact that $(1 - \alpha/N)^N$ is less than $\exp(-\alpha)$ for all $\alpha > 0$, we can lower bound the probability $\mathbb{P}(Z^{(N)} = 3 \mid Z^{(0)} = 1)$ by a constant independent of $N$.

D.3 Construction of the auxiliary chain
We now detail the construction of the auxiliary chain $(Z^{(t)})$. The construction is done conditionally on $(x^{(t)}, \tilde{x}^{(t)})$. We refer readers to Figure 2 and recall that $B^{(t)}$ is a deterministic function of $(x^{(t)}, \tilde{x}^{(t)})$. Note that the dependencies shown in Figure 2 are a consequence of the following construction.

First, if $Z^{(t)} = 1$, we construct $Z^{(t+1)}$ as follows.

- If $B^{(t)} = 1$ or $B^{(t)} = 2$:
  - if $B^{(t+1)} = 1$, set $Z^{(t+1)} = 1$;
  - otherwise, set $Z^{(t+1)} = 2$ with probability $(\xi_{U \to F}/N) / \mathbb{P}(B^{(t+1)} \in \{2, 3\} \mid x^{(t)}, \tilde{x}^{(t)})$, and set $Z^{(t+1)} = 1$ otherwise.
- If $B^{(t)} = 3$, sample $Z^{(t+1)}$ from $\{1, 2, 3\}$ using the probabilities in the first row of (7).

Let us check that the above transition probabilities are well-defined and lie in $[0, 1]$. If $B^{(t)} = 1$, $(\xi_{U \to F}/N) / \mathbb{P}(B^{(t+1)} \in \{2, 3\} \mid x^{(t)}, \tilde{x}^{(t)})$ is less than $(\xi_{U \to F}/N) / \mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)})$, which is less than one by Proposition 3.1, Equation (5). If $B^{(t)} = 2$, $(\xi_{U \to F}/N) / \mathbb{P}(B^{(t+1)} \in \{2, 3\} \mid x^{(t)}, \tilde{x}^{(t)})$ is less than one if $N$ is large enough, using Proposition 3.1 again. Indeed,
\[
\mathbb{P}(B^{(t+1)} \in \{2, 3\} \mid x^{(t)}, \tilde{x}^{(t)}) = 1 - \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)}) \ge 1 - \xi_{F \to U}/N,
\]
by Equation (6), and this is larger than $\xi_{U \to F}/N$ if $N$ is large enough, for any $\xi_{U \to F}, \xi_{F \to U}$.

The goal of this construction is that $Z^{(t+1)} = 2$ only if $B^{(t+1)} \in \{2, 3\}$, so that $Z^{(t+1)} \le B^{(t+1)}$ holds almost surely. We can compute
\[
\mathbb{P}(Z^{(t+1)} = 2 \mid Z^{(t)} = 1, x^{(t)}, \tilde{x}^{(t)}) = \mathbb{P}(Z^{(t+1)} = 2, B^{(t+1)} = 1 \mid Z^{(t)} = 1, x^{(t)}, \tilde{x}^{(t)}) + \mathbb{P}(Z^{(t+1)} = 2, B^{(t+1)} \in \{2, 3\} \mid Z^{(t)} = 1, x^{(t)}, \tilde{x}^{(t)}),
\]
which, using the conditional dependencies implied by the construction, is equal to
\[
0 + \mathbb{P}(Z^{(t+1)} = 2 \mid B^{(t+1)} \in \{2, 3\}, Z^{(t)} = 1, x^{(t)}, \tilde{x}^{(t)}) \times \mathbb{P}(B^{(t+1)} \in \{2, 3\} \mid x^{(t)}, \tilde{x}^{(t)}) = \xi_{U \to F}/N,
\]
for all values of $B^{(t)}$ in $\{1, 2, 3\}$. Since the probability $\mathbb{P}(Z^{(t+1)} = 2 \mid Z^{(t)} = 1, x^{(t)}, \tilde{x}^{(t)})$ is the same for all $(x^{(t)}, \tilde{x}^{(t)})$, we deduce that $\mathbb{P}(Z^{(t+1)} = 2 \mid Z^{(t)} = 1) = \xi_{U \to F}/N$.

We proceed similarly for the second row of (7), assuming $Z^{(t)} = 2$. In that case we must have $B^{(t)} \ge 2$.
Consider the following construction.

- If $B^{(t)} = 2$:
  - if $B^{(t+1)} = 1$, set $Z^{(t+1)} = 1$;
  - if $B^{(t+1)} = 2$, sample $Z^{(t+1)}$ from $\{1, 2, 3\}$ with probabilities
\[
\left( \frac{\xi_{F \to U}/N - \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)})}{\mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)})},\; 1 - \frac{\xi_{F \to U}/N - \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)})}{\mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)})},\; 0 \right);
\]
  - if $B^{(t+1)} = 3$, sample $Z^{(t+1)}$ from $\{1, 2, 3\}$ with probabilities
\[
\left( 0,\; 1 - \frac{\xi_{F \to D}/N}{\mathbb{P}(B^{(t+1)} = 3 \mid x^{(t)}, \tilde{x}^{(t)})},\; \frac{\xi_{F \to D}/N}{\mathbb{P}(B^{(t+1)} = 3 \mid x^{(t)}, \tilde{x}^{(t)})} \right).
\]
- If $B^{(t)} = 3$, sample $Z^{(t+1)}$ from $\{1, 2, 3\}$ using the probabilities in the second row of (7).

We can again verify that the probabilities are well-defined and lie in $[0, 1]$, provided $N$ is large enough so that $\mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)}) + \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)}) \ge \xi_{F \to U}/N$. Then we can compute
\[
\mathbb{P}(Z^{(t+1)} = 1 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) = \mathbb{P}(Z^{(t+1)} = 1, B^{(t+1)} = 1 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) + \mathbb{P}(Z^{(t+1)} = 1, B^{(t+1)} = 2 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) + 0.
\]
If $B^{(t)} = 2$, this becomes
\[
1 \times \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)}) + \frac{\xi_{F \to U}/N - \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)})}{\mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)})} \times \mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)}) + 0 \times \mathbb{P}(B^{(t+1)} = 3 \mid x^{(t)}, \tilde{x}^{(t)}) = \xi_{F \to U}/N.
\]
If $B^{(t)} = 3$, we also have $\mathbb{P}(Z^{(t+1)} = 1 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) = \xi_{F \to U}/N$.

We next compute the transition from state 2 to state 3:
\[
\mathbb{P}(Z^{(t+1)} = 3 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) = \mathbb{P}(Z^{(t+1)} = 3, B^{(t+1)} = 1 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) + \mathbb{P}(Z^{(t+1)} = 3, B^{(t+1)} = 2 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) + \mathbb{P}(Z^{(t+1)} = 3, B^{(t+1)} = 3 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}).
\]
If $B^{(t)} = 2$, this becomes
\[
0 \times \mathbb{P}(B^{(t+1)} = 1 \mid x^{(t)}, \tilde{x}^{(t)}) + 0 \times \mathbb{P}(B^{(t+1)} = 2 \mid x^{(t)}, \tilde{x}^{(t)}) + \frac{\xi_{F \to D}/N}{\mathbb{P}(B^{(t+1)} = 3 \mid x^{(t)}, \tilde{x}^{(t)})} \times \mathbb{P}(B^{(t+1)} = 3 \mid x^{(t)}, \tilde{x}^{(t)}) = \xi_{F \to D}/N.
\]
If $B^{(t)} = 3$, we also find that $\mathbb{P}(Z^{(t+1)} = 3 \mid Z^{(t)} = 2, x^{(t)}, \tilde{x}^{(t)}) = \xi_{F \to D}/N$. Again these probabilities do not depend on $B^{(t)}$ or $(x^{(t)}, \tilde{x}^{(t)})$, thus the evolution of the chain $(Z^{(t)})$ given $Z^{(t)} = 2$ is described by the second row of (7).

D.4 Upper bound on mixing time

Using the auxiliary chain $(Z^{(t)})$ we can state the following result about the $N$-th iteration of $\bar{P}$ from adjacent states.

Proposition D.1.
Under Assumptions 3.1 and 3.2 with $\zeta < \xi$, we can define $r \in (0, 1)$, $\nu > 0$ and $N_0 \in \mathbb{N}$ such that, for all $N \ge N_0$, with probability at least $1 - \exp(-\nu N)$, under the coupled kernel $\bar{P}$,
\[
\mathbb{E}\left[ d(x^{(N)}, \tilde{x}^{(N)}) \mid x^{(0)} = x, \tilde{x}^{(0)} = \tilde{x} \right] \le r\, d(x, \tilde{x}) = 2r,
\]
for any two states $(x, \tilde{x}) \in \bar{\mathcal{X}}_{\mathrm{adj}}$.

Proof of Proposition D.1.
Under the assumptions we can apply Proposition 3.1. We place ourselves on the large-probability event from that proposition. This gives us the constants needed to define the transition matrix $Q$ in (7), and the Markov chain $(Z^{(t)})$ with transition matrix $Q$ such that $Z^{(t)} \le B^{(t)}$ for all $t \ge 0$. From Appendix D.2, there exists $r \in (0, 1)$ such that $\mathbb{P}(Z^{(N)} = 3 \mid Z^{(0)} = 1) \ge 1 - r$. Based on the construction, the event $\{Z^{(N)} = 3\}$ implies $\{x^{(N)} = \tilde{x}^{(N)}\}$, thus $\mathbb{P}(x^{(N)} = \tilde{x}^{(N)} \mid x^{(0)} = x, \tilde{x}^{(0)} = \tilde{x}) \ge 1 - r$. Noting that $\mathbb{E}[d(x^{(N)}, \tilde{x}^{(N)}) \mid x^{(0)} = x, \tilde{x}^{(0)} = \tilde{x}] = 2\, \mathbb{P}(x^{(N)} \neq \tilde{x}^{(N)} \mid x^{(0)} = x, \tilde{x}^{(0)} = \tilde{x}) \le 2r$ concludes the proof.

We can finally return to the path coupling argument, and apply it to a chain that follows $P^N$, the $N$-th iterate of the transition kernel of the original chain. From Proposition D.1, we have a contraction rate of $r \in (0, 1)$ independent of $N$ for a coupling of $P^N$ from adjacent states, and we obtain the main theorem as follows.

Proof of Theorem 1.
The path coupling argument shows that for a chain $(\check{x}^{(t)})$ evolving according to $P^N$, there exist $\kappa, \nu > 0$ such that, for any $\epsilon \in (0, 1)$, with probability at least $1 - \exp(-\nu N)$, we have
\[
\|\check{x}^{(t)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}} \le \epsilon \quad \text{for all } t \ge \kappa \log(N/\epsilon).
\]
The variable $\check{x}^{(t)}$ has the same law as $x^{(tN)}$, thus with a change of time variable, $s = tN$, we obtain $\|x^{(s)} - \mathrm{CB}(p, I)\|_{\mathrm{TV}} \le \epsilon$ for all $s \ge \kappa N \log(N/\epsilon)$.