An MCMC Method to Sample from Lattice Distributions
AAn MCMC Method to Sample from LatticeDistributions
Anand Jerry George
Department of Electrical EngineeringIndian Institute of Science, BengaluruEmail: [email protected]
Navin Kashyap
Department of Electrical Communication EngineeringIndian Institute of Science, BengaluruEmail: [email protected]
Abstract —We introduce a Markov Chain Monte Carlo(MCMC) algorithm to generate samples from probability distri-butions supported on a d -dimensional lattice Λ = B Z d , where B is a full-rank matrix. Specifically, we consider lattice distributions P Λ in which the probability at a lattice point is proportional to agiven probability density function, f , evaluated at that point. Togenerate samples from P Λ , it suffices to draw samples from a pull-back measure P Z d defined on the integer lattice. The probabilityof an integer lattice point under P Z d is proportional to the densityfunction π = | det( B ) | f ◦ B . The algorithm we present in thispaper for sampling from P Z d is based on the Metropolis-Hastingsframework. In particular, we use π as the proposal distributionand calculate the Metropolis-Hastings acceptance ratio for a well-chosen target distribution. We can use any method, denoted byALG, that ideally draws samples from the probability density π , to generate a proposed state. The target distribution is apiecewise sigmoidal distribution, chosen such that the coordinate-wise rounding of a sample drawn from the target distributiongives a sample from P Z d . When ALG is ideal, we show that ouralgorithm is uniformly ergodic if − log( π ) satisfies a gradientLipschitz condition. I. I
NTRODUCTION
Drawing samples from probability distributions is a ubiq-uitous problem in statistics and machine learning. In thispaper, we look into the problem of sampling from probabilitydistributions supported on a lattice (referred to as a latticedistribution ). Specifically, we consider lattice distributions inwhich the probability at a lattice point is proportional toa given probability density function evaluated at that point.Such distributions have found applications in cryptography,lattice coding, and secure communication [1], [2], [3]. Anexample of a lattice distribution that has received much interestfrom researchers is the lattice Gaussian. A lattice Gaussiandistribution is a probability distribution defined on a lattice Λ ,such that for each x ∈ Λ , the probability of x is proportionalto a Gaussian density function evaluated at x . Lattice Gaussiansampling, also known as discrete Gaussian sampling (DGS),is closely related to shortest vector problem (SVP) and closestvector problem (CVP), which are computationally hard latticeproblems [4]. Lattice Gaussian sampling is also employedfor decoding and signal detection in Multiple Input MultipleOutput (MIMO) systems [5]. There are some known methods The work of N. Kashyap was supported in part by a SwarnajayantiFellowship awarded by the Dept. of Science and Technology (DST), Govt. ofIndia. to generate samples from lattice Gaussian distributions [1],[6], [7], but few methods are available for drawing samplesfrom arbitrary lattice distributions [8]. We try to address thisproblem. The motivation to go beyond lattice Gaussian is dueto [3], where samples from a particular fat-tailed lattice dis-tribution are used to achieve information-theoretically perfectsecurity.Since sampling methods for generic lattice distributions arenot known prior to this work, we compare our algorithmwith other available lattice Gaussian samplers. Algorithmscurrently available for sampling from d -dimensional latticeGaussians require a sub-routine to generate samples froma 1-dimensional lattice Gaussian with arbitrary variance pa-rameter. These schemes update each coordinate sequentiallyand therefore, cannot take advantage of the currently avail-able optimized linear algebra libraries. Also, techniques usedto sample from 1-dimensional lattice Gaussians are eithercomputationally inefficient or require a pre-computed table[9]. This paper proposes a simple and provable algorithmthat samples directly from d -dimensional lattice distributions,including lattice Gaussians.Now we give a brief introduction to our algorithm. Let Λ be a d -dimensional lattice generated by a full-rank matrix B . As mentioned earlier, we consider lattice distributions P Λ in which the probability at a lattice point is proportional tothe value of a given probability density function f at thatpoint. By a simple reduction, it becomes evident that, togenerate samples from P Λ , it suffices to draw samples from aprobability distribution P Z d defined on the integer lattice. Theprobability of an integer lattice point under P Z d is proportionalto the probability density function π ( x ) = | det( B ) | f ( Bx ) for all x ∈ R d . (1)This paper explores the possibility of using Markov chainMonte Carlo (MCMC) methods to draw samples from P Z d .MCMC is a well-known paradigm in statistics to samplefrom the desired probability distribution by establishing aMarkov chain whose stationary distribution is the same asthe desired distribution. In particular, we use the well-knownMetropolis-Hastings algorithm to establish a Markov chainwith the desired stationary distribution (referred to as the targetdistribution ). In a Metropolis-Hastings algorithm, a candidatefor the next state of the Markov chain is drawn from a a r X i v : . [ s t a t . C O ] J a n articular proposal distribution and is then accepted as thenext state with a certain acceptance probability . In contrastto the other instances in literature where Metropolis-Hastingsalgorithms are used for lattice Gaussian sampling [6], [7], weuse a proposal distribution and target distribution that haveprobability density functions. In particular, we use π givenin (1) as the proposal distribution. We can use any off-the-shelf method, denoted by ALG , that ideally draws samplesfrom the probability density π , to generate a proposed state.In fact, this enables us to sample from lattice distributionsbeyond lattice Gaussians, i.e., when π is not a Gaussiandensity. We use a piecewise sigmoidal probability density asthe target distribution, a sample from which, after coordinate-wise rounding, gives a sample from P Z d . With this choiceof the target distribution, we show that our algorithm hasexponentially fast convergence ( uniform ergodicity ) if thelattice distribution is such that − log( π ) satisfies a gradientLipschitz condition. We also derive a bound on the rate atwhich our algorithm converges to the target distribution when ALG deviates from its ideality.
A. Organization of the Paper
The remainder of the paper is organized as follows. InSection II, we formalize the problem statement and recallsome basics of MCMC and its convergence. In Section III, wedescribe our algorithm and its convergence analysis. Proposi-tion 1 in Section III shows that our algorithm is uniformlyergodic if − log( π ) satisfies a gradient Lipschitz condition. InProposition 2, we derive a bound on the rate at which ouralgorithm converges to the target distribution when ALG isnon-ideal. In Section IV, we present our algorithm’s simu-lation results for three different target distributions: isotropicGaussian distribution on Z d , isotropic Gaussian distributionon the Leech lattice, and “Perfect Security distribution” on Z d . Section V discusses the performance of our algorithmand compares it with Klein’s algorithm [10], [1] for latticeGaussian sampling. II. P RELIMINARIES
A. Notations
We denote the state space of a Markov chain by X . Theoperator ∧ operates on two numbers to output the minimumof them. Nearest integer point to a vector x ∈ R d obtained bycoordinate-wise rounding is denoted by [ x ] . We use U [0 , to denote the uniform probability distribution on the interval [0 , . We denote the Borel-sigma algebra on R d by B ( R d ) .For two probability measures µ and ν defined on the sameprobability space, we use the notation µ (cid:28) ν to indicate that µ is absolutely continuous with respect to ν . B. Lattice Distributions
We now formally define a lattice and probability distribu-tions defined on a lattice. Let B ∈ R d × d be a full-rank matrix.The d -dimensional lattice Λ generated by B is defined as Λ := { Bz : z ∈ Z d } . Any probability distribution defined with Λ as the support isknown as a lattice distribution. In this paper, we look at aspecific class of lattice distributions in which the probabilitydistribution is induced by a density function on R d . That is, theprobability of a lattice point is equal to the density functionevaluated at that point with appropriate normalization. Thispaper mainly considers density functions having the followingform f ( x ) = e − ψ ( x ) Z ψ for all x ∈ R d , where ψ ( x ) is known as a potential function , and Z ψ = (cid:82) R d e − ψ ( x ) d x is a normalization constant. However, in prac-tice, the algorithm that we develop works for any latticedistribution induced by a density function. Let P Λ ( x ) for x ∈ Λ be a lattice distribution induced by the above f ( x ) ,i.e., P Λ ( x ) = e − ψ ( x ) Z for all x ∈ Λ , where Z = (cid:88) x ∈ Λ e − ψ ( x ) . Let B denote a generator matrix of the lattice Λ . Then, forgenerating a sample from the probability distribution P Λ , itsuffices to sample from P Z d ( z ) = P Λ ( Bz ) , where z ∈ Z d ,and then obtain x as x = Bz . So, our problem reduces to oneof sampling from the following probability distribution over Z d : P Z d ( z ) = e − ψ ( Bz ) Z for all z ∈ Z d . Let ϕ ( x ) := ψ ( Bx ) for all x ∈ R d , so that P Z d ( z ) = e − ϕ ( z ) Z for all z ∈ Z d . (2)Also, let us define the probability density π as: π ( x ) := e − ϕ ( x ) K for all x ∈ R d , (3)where K = (cid:90) R d e − ϕ ( x ) d x . Note that π is related to f as given in (1). We have now re-duced the problem of sampling from a probability distributiondefined on an arbitrary lattice Λ to sampling from a probabilitydistribution defined on Z d . In the rest of this paper, we try todevelop an algorithm for generating samples from the latticedistribution P Z d induced by the probability density π .
1) Lattice Gaussian Distribution:
A lattice distributionthat has received significant attention is the lattice Gaussiandistribution. A lattice Gaussian distribution defined on a lattice Λ is given by D Λ ,σ, c ( x ) = e − (cid:107) x − c (cid:107) σ (cid:80) y ∈ Λ e − (cid:107) y − c (cid:107) σ for all x ∈ Λ , (4)where c ∈ R d is the mean vector and σ is the variance param-eter. Lattice Gaussian distributions have important practicalapplications, particularly in cryptography [1]. . The Metropolis-Hastings Algorithm As stated earlier, we take the MCMC route to generatesamples from the desired lattice distribution. MCMC is aclass of sampling algorithms in which a Markov chain is setup whose stationary distribution is the same as the desiredprobability distribution (also called the target distribution ).The idea is to simulate this chain for a certain number of stepsto draw samples approximately from the desired probabilitydistribution.Given a Markov chain, it is straightforward to find itsstationary distribution. However, it is not apparent how to finda Markov chain with the desired stationary distribution. TheMetropolis-Hastings algorithm provides a recipe for establish-ing a Markov chain with the desired stationary distribution.Let ¯ π denote the probability distribution from which we wantto draw samples. Then, the Metropolis-Hastings algorithmconsists of two steps in generating the next state of the Markovchain: • Let x be the current state. Generate a proposed state y from some probability distribution q ( x , · ) (referred to asthe proposal distribution ). • Accept the proposed state as the next state of Markovchain with probability α ( x , y ) given by α ( x , y ) = 1 ∧ ¯ π ( y ) q ( y , x )¯ π ( x ) q ( x , y ) . If the proposed state is independent of the current state, wecall this the
Independent Metropolis-Hastings algorithm. Theacceptance ratio is then given by α ( x , y ) = 1 ∧ ¯ π ( y ) q ( x )¯ π ( x ) q ( y ) . From now on, we refer to the Markov chain associated withan Independent Metropolis-Hastings algorithm by MH Markovchain.
D. Distance between probability distributions
For assessing the goodness of any sampling algorithm, it isessential to have a metric defined on the space of probabilitydistributions. The metric we use in our analysis is the TotalVariation Distance (TVD). For two distributions µ and ν defined on ( R d , B ( R d )) , we use (cid:107) µ − ν (cid:107) T V to denote theirTVD given by (cid:107) µ − ν (cid:107) T V = sup A ∈B ( R d ) | µ ( A ) − ν ( A ) | . If λ is a probability measure such that µ (cid:28) λ and ν (cid:28) λ ,then an alternate expression for TVD is given by (cid:107) µ − ν (cid:107) T V = 12 (cid:90) R d (cid:12)(cid:12)(cid:12)(cid:12) dµdλ ( x ) − dνdλ ( x ) (cid:12)(cid:12)(cid:12)(cid:12) λ ( d x ) , (5)where dµdλ and dνdλ are the Radon-Nikodym derivatives of µ and ν with respect to λ (see Lemma 2.1 in [11]). E. Convergence to stationarity
In this section, we give some definitions useful in evaluatingthe convergence of a Markov chain to its stationary distribu-tion. We refer the reader to [12], [13] for a comprehensivereview of these topics.
Definition 1.
A Markov chain with transition kernel P andstationary distribution ¯ π is uniformly ergodic if there exists < δ < and M < ∞ such that for all x ∈ X , (cid:107) P t ( x , · ) − ¯ π ( · ) (cid:107) T V ≤ M (1 − δ ) t . Theorem 1. (Theorem 8 in [13]). Let P be the transitionkernel of a Markov chain and ¯ π be its stationary distribution.Suppose there exists a δ > and a probability measure ν suchthat, for all measurable B ⊆ X , P ( x , B ) ≥ δν ( B ) for all x ∈ X . Then the Markov chain with transition kernel P is uniformlyergodic and satisfies the following inequality: (cid:107) P t ( x , · ) − ¯ π ( · ) (cid:107) T V ≤ (1 − δ ) t for all x ∈ X . Definition 2.
A Markov chain with transition kernel P andstationary distribution ¯ π is geometrically ergodic if there exists < δ < such that, for all x ∈ X , (cid:107) P t ( x , · ) − ¯ π ( · ) (cid:107) T V ≤ M ( x )(1 − δ ) t , with M ( x ) < ∞ . Definition 3.
For a small (cid:15) > , and initial state x , a mixingtime t mix ( (cid:15) ; x ) is defined as t mix ( (cid:15) ; x ) = inf { t : (cid:107) P t ( x , · ) − ¯ π ( · ) (cid:107) T V < (cid:15) } . III. I
NDEPENDENT M ETROPOLIS -H ASTINGS WITH R OUNDING (IMHR)In this section, we introduce an Independent Metropolis-Hastings algorithm for sampling from lattice distribution P Z d defined in (2). In this algorithm, we suppose that it ispossible to generate samples from the probability density π defined in (3). Any state-of-the-art MCMC algorithm suchas Hamiltonian Monte Carlo (HMC) [14], or Metropolisadjusted Langevin algorithm (MALA) [15], can be used forthis purpose. The idea is to use π as the proposal distributionin the Independent Metropolis-Hastings algorithm. For such amethod to be effective in sampling from P Z d , we need a targetdistribution ¯ π with the following properties: • Using a random variable with probability distribution ¯ π ,we should be able to efficiently derive a random variablewith distribution P Z d . • The probability distribution ¯ π should be statistically closeto π . This will reduce the possibility of rejecting a pro-posal in the Independent Metropolis-Hastings algorithm,thereby improving its convergence speed to the stationarydistribution.A naive approach would be to choose ¯ π as a piece-wise constant density. That is, ¯ π ( x ) is equal to π ([ x ]) withppropriate normalization for all x ∈ R d . It is easy to see thatthe rounding operation on a sample generated from ¯ π givesa sample from P Z d . The drawback of such an approach isthat the Markov chain thus generated need not be uniformlyergodic, even for lattice Gaussians (see Appendix A). Thismotivates us to find a ¯ π that is a better approximation to π . Fig. 1. Different approximations for Gaussian density
We define a new probability distribution ¯ π which will becalled the target distribution henceforth, as follows: ¯ π ( x ) = 1 Z e − ϕ (¯ x ) e x − ¯ x ) T ∇ ϕ (¯ x ) for all x ∈ R d , (6)where Z and ϕ are the same entities which appear in (2),and ¯ x is the nearest integer point to x which is obtainedby coordinate-wise rounding. We should visualize ¯ π as aprobability density function obtained by approximating π using a sigmoid function within each unit hypercube in R d andthen normalizing. The sigmoid function is chosen such that,at the center of any unit hypercube, its value and gradient areproportional to the value and gradient of π . This is illustratedin Figure 1 where π is a Gaussian density function. Note thatthe functions plotted in Figure 1 are unnormalized. We canobtain a sample from P Z d by rounding the sample generatedfrom ¯ π to its nearest integer point. To see this, let X be arandom variable with probability density ¯ π . Let Z = [ X ] andlet S denote the unit hypercube centered at the origin, i.e., S = [ − , ] d . Then, P ( Z = z ) = P ( X ∈ z + S )= (cid:90) S ¯ π ( z + u ) d u ( a ) = 12 (cid:90) S (¯ π ( z + u ) + ¯ π ( z − u )) d u = 1 Z e − ϕ ( z ) (cid:90) S (cid:18)
11 + e u T ∇ ϕ ( z ) + 11 + e − u T ∇ ϕ ( z ) (cid:19) d u = 1 Z e − ϕ ( z ) (cid:90) S (cid:32)
11 + e u T ∇ ϕ ( z ) + e u T ∇ ϕ ( z ) e u T ∇ ϕ ( z ) (cid:33) d u = 1 Z e − ϕ ( z ) , where ( a ) is due to the symmetry of S . Therefore, to generatesamples from P Z d , it suffices to draw samples from ¯ π and thendo coordinate-wise rounding.Summarizing, IMHR is an Independent Metropolis-Hastings algorithm with π defined in (3) as the proposaldistribution and ¯ π defined in (6) as the target distribution.The steps of IMHR are as described in Algorithm 1. Algorithm 1:
IMHR Algorithm
Input: X , π, ϕ, ∇ ϕ Output:
Sample from a distribution statistically closeto P Z d for t = 1 , , . . . do Let x be the state of X t − ;Generate y from the probability distribution π ;Round x to its nearest point in Z d to get ¯ x ;Round y to its nearest point in Z d to get ¯ y ; ¯ π ( x ) = − ϕ (¯ x ))1+exp(2( x − ¯ x ) T ∇ ϕ (¯ x )) ; ¯ π ( y ) = − ϕ (¯ y ))1+exp(2( y − ¯ y ) T ∇ ϕ (¯ y )) ;Calculate acceptance ratio α ( x , y ) = 1 ∧ ¯ π ( y ) π ( x )¯ π ( x ) π ( y ) ;Generate a sample u from U [0 , ; if u ≤ α ( x , y ) then let X t = y ; else X t = x ; endif t > t mix ( (cid:15) ; X ) then Round X t to its nearest point in Z d to get ¯ X t ;Output ¯ X t ; endend A. Convergence analysis of Algorithm 1
In this section, we analyze the convergence speed of Al-gorithm 1 to its stationary distribution. Algorithm 1 requiresa sub-routine, denoted by
ALG henceforth, which is ideallycapable of drawing samples from π . The following analysisassumes that we have such a sub-routine available. Error due tonon-availability of such an ideal sub-routine will be analyzedin the next section.First, we state a well known theorem which is true in generalfor an Independent Metropolis-Hastings algorithm. Theorem 2. (Theorem 2.1 in [16]) An IndependentMetropolis-Hastings algorithm is uniformly ergodic if thereexist δ > such that π ( x )¯ π ( x ) ≥ δ for all x ∈ X , (7) where π is the density from which proposed state is generated,and ¯ π is the target density. That is, the transition kernel ¯ P ofthe MH Markov chain satisfies the following: (cid:107) ¯ P k ( x , · ) − ¯ π ( · ) (cid:107) T V ≤ (1 − δ ) k for all x ∈ X . ext, we define a widely used smoothness property offunctions called L -smoothness. Definition 4.
A function f : R d → R is called L -smooth ifthe gradient of f is Lipschitz continuous with parameter L .That is, f should satisfy the following property (cid:107)∇ f ( x ) − ∇ f ( y ) (cid:107) ≤ L (cid:107) x − y (cid:107) , for all x , y ∈ R d . The following is a well known fact about L -smooth func-tions (see Lemma 5 in [17]). If f is L -smooth, then for all x , y ∈ R d , f ( y ) ≤ f ( x ) + ( y − x ) T ∇ f ( x ) + L (cid:107) x − y (cid:107) . (8)Now we will give conditions on the probability density π thatguarantees uniform ergodicity for Algorithm 1. Proposition 1.
Let π and ¯ π be as defined in (3) and (6)respectively. Let ¯ x denote the nearest integer point to x . If ϕ is an L -smooth function, then for all x ∈ R d we have, π ( x )¯ π ( x ) = ZK e − ϕ ( x ) (1 + e x − ¯ x ) T ∇ ϕ (¯ x ) )2 e − ϕ (¯ x ) ≥ ZK e − dL > . (9) Therefore, by Theorem 2, Algorithm 1 is uniformly ergodic forsuch a π .Proof: Let y = x − ¯ x . Then, π ( x )¯ π ( x ) = ZK e ϕ (¯ x ) − ϕ (¯ x + y ) (1 + e y T ∇ ϕ (¯ x ) )2 ( a ) ≥ ZK e − y T ∇ ϕ (¯ x ) − L (cid:107) y (cid:107) (1 + e y T ∇ ϕ (¯ x ) )2= ZK e − L (cid:107) y (cid:107) ( e y T ∇ ϕ (¯ x ) + e − y T ∇ ϕ (¯ x ) )2 ( b ) ≥ ZK e − L (cid:107) y (cid:107) ( c ) ≥ ZK e − dL > , where ( a ) follows from (8) due to L -smoothness of ϕ , ( b ) isdue to AM-GM inequality and ( c ) is because y ∈ [ − , ] d . Corollary 1. If π is a Gaussian density, then Algorithm 1produces a uniformly ergodic Markov chain.B. Effect of non-ideality of ALG
As defined in the previous section,
ALG denotes the methodused to draw samples from π in Algorithm 1. For the analysisin this section, we suppose that ALG is an MCMC method.In practice,
ALG could be methods like HMC or MALA.By non-ideality of
ALG , we mean that the TVD between theprobability distribution from which
ALG generate samples and π is nonzero. The non-ideality mentioned above can occur dueto the finite time given for convergence in ALG . On accountof this non-ideality, the proposed state in Algorithm 1 willhave a probability distribution different from the one used in the calculation of acceptance ratio (which is π ). This altersthe stationary distribution of Markov chain associated withAlgorithm 1.Suppose the Markov chain associated with ALG is geomet-rically ergodic for the stationary distribution π . In that case,we show in the following proposition that the error due tonon-ideality of ALG can be bounded. For a discussion on theconditions under which methods like HMC and MALA aregeometrically ergodic, we refer the reader to [18], [15].
Proposition 2.
Let π defined in (3) be such that ϕ is L-smooth.Let t ∈ R d be a fixed initial state of ALG . Suppose
ALG satisfies the following geometric ergodicity condition. (cid:107) P n ( t , · ) − π ( · ) (cid:107) T V ≤ V ρ n , (10) where P is the transition kernel corresponding to ALG . ( V inthe above expression may depend on the fixed initial state t .)Also, let P be such that π (cid:28) P ( t , · ) and P ( t , { t } ) > . Thenfor any x ∈ R d , Algorithm 1 generates a Markov chain withtransition kernel ¯ P that satisfies the following inequality: (cid:107) ¯ P k ( x , · ) − ¯ π ( · ) (cid:107) T V ≤ (1 − Cδ ) k + (cid:18) Cδ (cid:19) V ρ n δ , (11) where ¯ π is the target distribution defined in (6), δ is obtainedfrom (7), n is the number of iterations of ALG , k is the numberof iterations of Algorithm 1, and C is a constant that satisfiesthe following inequality: − V ρ n δ ≤ C ≤ V ρ n δ . (12) Proof:
Let us denote P n ( t , · ) by q ( · ) . By geometricergodicity of ALG we have, (cid:107) q − π (cid:107) T V ≤ V ρ n . (13)Although ALG generates the proposed state from the distribu-tion q , for calculating the acceptance ratio, we use the distri-bution π . Hence, the Markov chain generated by Algorithm 1has the following transition kernel ¯ P ( x , d y ) = q ( d y ) (cid:18) ∧ ¯ π ( y ) π ( x )¯ π ( x ) π ( y ) (cid:19) + δ x ( d y ) (cid:18) − (cid:90) R d q ( d z ) (cid:18) ∧ ¯ π ( z ) π ( x )¯ π ( x ) π ( z ) (cid:19)(cid:19) , where δ x ( · ) is the delta measure at x . It is straightforward toverify using the detailed balance equation that the stationarydistribution of the above Markov chain with transition kernel ¯ P is given by ν ( d x ) = ¯ π ( x ) q ( d x ) Cπ ( x ) , (14)where C = (cid:90) R d ¯ π ( x ) q ( d x ) π ( x ) . (15)Also, since ϕ is L -smooth, from Proposition 1, we have π ( x )¯ π ( x ) ≥ δ > . (16)he first step in this proof is to show that the above Markovchain with transition kernel ¯ P is uniformly ergodic. Then weshow that its stationary distribution ν and probability density ¯ π are statistically close. These two results are combined toobtain the result stated in Proposition 2. Uniform Ergodicity of ¯ P :From the expression for ¯ P , for all x ∈ R d and allmeasurable A ⊆ R d , we have ¯ P ( x , A ) ≥ (cid:90) A q ( d y ) (cid:18) ∧ ¯ π ( y ) π ( x )¯ π ( x ) π ( y ) (cid:19) ( a ) = (cid:90) A Cν ( d y ) π ( y )¯ π ( y ) (cid:18) ∧ ¯ π ( y ) π ( x )¯ π ( x ) π ( y ) (cid:19) = (cid:90) A Cν ( d y ) (cid:18) π ( y )¯ π ( y ) ∧ π ( x )¯ π ( x ) (cid:19) ( b ) ≥ Cδ (cid:90) A ν ( d y ) = Cδν ( A ) , where ( a ) and ( b ) are due to (14) and (16) respectively. Thusby Theorem 1, ¯ P is the transition kernel of a uniformly ergodicMarkov chain. Therefore, (cid:107) ¯ P k ( t , · ) − ν ( · ) (cid:107) T V ≤ (1 − Cδ ) k . (17)Next, we find a bound on the value of C . Note that from theassumption P ( t , { t } ) > , it follows that for any measurableset A ⊆ R d and integer n , P n ( t , A ) > whenever P ( t , A ) > . Therefore, P ( t , · ) (cid:28) q . This, together with the assumption π (cid:28) P ( t , · ) , allows us to conclude that π (cid:28) q . Therefore, wehave the following: (cid:90) R d ¯ π ( d x ) ( a ) = (cid:90) R d ¯ π ( x ) π ( x ) π ( d x ) ( b ) = (cid:90) R d ¯ π ( x ) π ( x ) dπdq ( x ) q ( d x ) , (18)where ( a ) follows from the fact that π and ¯ π are densityfunctions which are positive everywhere and ( b ) is due to theabsolute continuity of π with respect to q . Then, | C − | ( a ) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) R d ¯ π ( x ) π ( x ) q ( d x ) − (cid:90) R d ¯ π ( x ) π ( x ) dπdq ( x ) q ( d x ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) R d ¯ π ( x ) π ( x ) (cid:18) − dπdq ( x ) (cid:19) q ( d x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:90) R d ¯ π ( x ) π ( x ) (cid:12)(cid:12)(cid:12)(cid:12) − dπdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) q ( d x ) ≤ δ (cid:90) R d (cid:12)(cid:12)(cid:12)(cid:12) − dπdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) q ( d x ) ( b ) = 2 δ (cid:107) q − π (cid:107) T V ≤ V ρ n δ , where ( a ) follows from (15) and (18), and ( b ) is due to thealternate definition of TVD given in (5). Therefore, we have δ − V ρ n ≤ Cδ ≤ δ + 2 V ρ n = ⇒ − V ρ n δ ≤ C ≤ V ρ n δ . (19) TVD between ν and ¯ π :Now we will show that ν and ¯ π are statistically closeprobability distributions. (cid:107) ν − ¯ π (cid:107) T V ( a ) = 12 (cid:90) R d (cid:12)(cid:12)(cid:12)(cid:12) dνdq ( x ) − d ¯ πdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) q ( d x )= 12 (cid:90) R d (cid:12)(cid:12)(cid:12)(cid:12) ¯ π ( x ) Cπ ( x ) − ¯ π ( x ) π ( x ) dπdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) q ( d x )= 12 C (cid:90) R d ¯ π ( x ) π ( x ) (cid:12)(cid:12)(cid:12)(cid:12) − C dπdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) q ( d x ) ≤ Cδ (cid:90) R d (cid:12)(cid:12)(cid:12)(cid:12) − C dπdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) q ( d x ) ≤ Cδ (cid:90) R d (cid:18) C (cid:12)(cid:12)(cid:12)(cid:12) − dπdq ( x ) (cid:12)(cid:12)(cid:12)(cid:12) + | C − | (cid:19) q ( d x )= 1 δ (cid:107) q − π (cid:107) T V + | C − | Cδ ( b ) ≤ V ρ n δ + V ρ n Cδ , (20)where ( a ) is due to the alternate definition of TVD given in(5), and ( b ) is due to (13) and (19).Finally using triangle inequality, we have (cid:107) ¯ P k ( t , · ) − ¯ π ( · ) (cid:107) T V ≤ (cid:107) ¯ P k ( t , · ) − ν ( · ) (cid:107) T V + (cid:107) ν − ¯ π (cid:107) T V ≤ (1 − Cδ ) k + (1 + 1 Cδ ) V ρ n δ . (21)IV. S IMULATION R ESULTS
This section illustrates the speed of convergence of Algo-rithm 1 to P Z d . For this, ideally we would like to show plotsof TVD as a function of the number of iterations. However,evaluating distance between high dimensional probability dis-tributions is computationally hard. So, in our simulations, wecompute an entity TVD m instead of TVD. We compute TVD m as follows: Initialize Algorithm 1 with a fixed point in thestate space. Then we run t iterations of Algorithm 1. Repeatthis 100,000 times for each value of t . For each t , use thesesamples to form d h i ( z ) for ≤ i ≤ d and z ∈ Z . We denote the i th marginaldistribution of P Z d by P i Z . If P i Z is not available in closed form,we estimate it using an MCMC method (see Appendix B-Afor the exact algorithm used), with sufficient time given foronvergence. Calculate the Total Variation Distance between h i and P i Z using the following formula:TVD ( i ) = 12 (cid:88) z ∈ Z | h i ( z ) − P i Z ( z ) | for ≤ i ≤ d. Finally, TVD m is the maximum of TVD’s calculated formarginal distributions.TVD m = max { TVD ( i ) : 1 ≤ i ≤ d } . We plot TVD m for different values of t . The TVD m vs. t plotsdepicts the number of iterations required for Algorithm 1 toconverge to its stationary distribution.In another simulation, we plot the autocorrelation functionof the time series x , x , · · · , x N obtained using Algorithm 1.In many instances, autocorrelation plots have been used toassess the number of iterations of the Markov chain required toproduce two almost independent samples [19]. The definitionof the autocorrelation function that we use is as follows:ACF ( τ ) = (cid:80) N − τt =1 x Tt x t + τ (cid:80) Nt =1 x Tt x t , where x t is the state of the Markov chain at iteration t and N is the total number of samples in the time series. Now wepresent results of these simulations for different probabilitydensities π .
1) Isotropic Gaussian distribution:
We first consider thecase when π is an isotropic Gaussian density. The potentialfunction ϕ of an isotropic Gaussian density is σ -smooth,where σ is the variance of the Gaussian. Therefore, fromProposition 1, we see that the factor that governs the rate ofconvergence of Algorithm 1 is r = dσ . In this simulation, wefix σ = 1 and vary dimension d to get different values of r .We use state as the initial state of the algorithm. TVD m vs. t for different values of r is shown in Figure 2. Autocorrelationvs. τ is shown in Figure 3. The number of samples ( N ) usedto calculate the autocorrelation function is 10,000. Fig. 2. TVD m vs. Number of Iterations ( t ) for isotropic Gaussian Fig. 3. ACF vs τ for isotropic Gaussian
2) Gaussian distribution on the Leech lattice:
Next, weconsider a lattice Gaussian distribution supported on the Leechlattice with dimension equal to 24 (see (23) for the generatormatrix of the Leech lattice). This simulation illustrates theperformance of Algorithm 1 for non-isotropic Gaussian. TheLeech lattice induces a highly skewed lattice Gaussian distri-bution on Z d . The density π now takes the following form: π ( x ) = M e − (cid:107) Bx (cid:107) σ for all x ∈ R d , where B is the generator matrix of the Leech lattice and M isthe normalization constant. We plot TVD m vs. t for differentvalues of σ . This is shown in the Figure 4. State was usedas the initial state of the algorithm. Fig. 4. TVD m vs. Number of Iterations ( t ) for a Gaussian distribution on theLeech lattice
3) Perfect Security distribution:
Finally, we present thesimulation results when π is the following probability densitywhich was used to achieve perfect security in [3]. The prob-ability density function of a “Perfect Security distribution” isgiven by: π ( x ) = M Ω d ( (cid:107) x (cid:107) ρ ) j d − − (cid:107) x (cid:107) ρ for all x ∈ R d , here Ω d ( u ) = (cid:18) u (cid:19) d − J d − ( u ) ,J k ( u ) is the Bessel function of k th order and j k is the firstzero of k th order Bessel function and M is the normalizationconstant. Let X = [ X , X , · · · , X d ] be a random vector withperfect security probability distribution. Then, the variance ofeach component X i is given by the following equation [20]:Var ( X i ) = 4 ρ j d − d . We use HMC (see Appendix B-B for parameters used) tosample from this continuous density π . We plot TVD m vs. t for different values of d . We fix the value of ρ such thatthe variance of the distribution is for each value of d . TheTVD m vs. t plot is shown in Figure 5. State was used asthe initial state of the algorithm. Fig. 5. TVD m vs. Number of Iterations ( t ) for Perfect Security distribution V. D
ISCUSSION
We presented a simple MCMC algorithm to draw samplesfrom lattice distributions. As demonstrated through the PerfectSecurity distribution sampling, Algorithm 1 can sample fromdistributions beyond lattice Gaussians. To the best of ourknowledge, prior to this work, there were no efficient algo-rithms known to generate samples from lattice distributionsother than lattice Gaussians. The main feature of Algorithm 1,which makes it competitive even among the lattice Gaussiansampling algorithms, is its computational efficiency. The com-putations in Algorithm 1 are vector operations, which is highlyoptimized when current linear algebra libraries (for instance,OpenBLAS or Intel MKL) are used for implementation. Mostof the algorithms currently available for sampling from alattice Gaussian do coordinate-wise sequential sampling us-ing 1-dimensional lattice Gaussian samplers. This method isinefficient when the lattice dimension under consideration islarge.A popular algorithm for sampling from lattice Gaussians isKlein’s algorithm [10], [1]. In Figure 6, we compare the run-time per iteration of Klein’s algorithm with Algorithm 1 for different values of dimension when the desired distribution isa lattice Gaussian on Z d with variance parameter σ = 1 . Thisexperiment was run in a python environment on a machinewith Intel i7-6700 @ 3.40GHz CPU and 8GB RAM. It is clearfrom Figure 6 that the scaling of the run-time per iterationwith dimension is much better for Algorithm 1. However,multiple iterations of Algorithm 1 are required to generate asample approximately from the stationary distribution. FromFigure 2, we can obtain the minimum number of iterationsof Algorithm 1 required to bring the TVD m below a smallnumber. Multiplying the run-time per iteration of Algorithm 1by the minimum number of iterations, we see that the run-time required to generate a sample from lattice Gaussian iscomparable for Klein’s algorithm and Algorithm 1. For exam-ple, Algorithm 1 takes 13 iterations to bring the TVD m below0.005 when dimension equals 50. Run-time per iteration forAlgorithm 1 is 46 µs when dimension equals 50. This impliesa total run-time of 598 µs to generate a sample approximatelyfrom the stationary distribution. Klein’s algorithm requires justone iteration to generate a sample from a lattice Gaussiandistribution. However, it takes 798 µs per iteration. Fig. 6. Run-time vs Dimension for isotropic lattice Gaussian
From Proposition 1, we see that when π is an isotropicGaussian density with variance equal to σ , the TVD be-tween the probability distribution after k th iteration of Al-gorithm 1 and the stationary distribution is upper bounded by (1 − ZK e − d σ ) k . This indicates that our algorithm may notbe well suited for distributions with very low variance andhigh dimension. Figures 2 and 4 validate this by illustratingthat convergence is slow for low variance and high dimensioncases. In simulations, we observe that at high dimensions,the average acceptance, which is the fraction of iterationsin which the proposed state is accepted, becomes very lowfor Algorithm 1. Figure 7 shows the degradation of averageacceptance with dimension. A low acceptance ratio makes theIndependent Metropolis-Hastings algorithm inefficient due tofrequent rejection of the proposed state. Therefore, at very highdimensions, we suggest using the Metropolis-within-Gibbsstrategy [21]. In the Metropolis-within-Gibbs algorithm, the ig. 7. Average Acceptance vs Dimension for isotropic lattice Gaussian number of variables updated at a time, determines the averageacceptance. R EFERENCES[1] C. Gentry, C. Peikert, and V. Vaikuntanathan, “Trapdoors for hardlattices and new cryptographic constructions,” in
Proc. 40th Annu. ACMSymp. Theory Comput. , 2008, p. 197–206.[2] C. Ling and J. Belfiore, “Achieving AWGN channel capacity with latticeGaussian coding,”
IEEE Trans. Inf. Theory , vol. 60, no. 10, pp. 5918–5929, 2014.[3] S. Vatedka, N. Kashyap, and A. Thangaraj, “Secure compute-and-forward in a bidirectional relay,”
IEEE Trans. Inf. Theory , vol. 61, no. 5,pp. 2531–2556, 2015.[4] D. Aggarwal, D. Dadush, O. Regev, and N. Stephens-Davidowitz,“Solving the Shortest Vector Problem in n time using discrete Gaussiansampling: Extended abstract,” in Proc. STOC , 2015, p. 733–742.[5] S. Liu, C. Ling, and D. Stehle, “Decoding by sampling: A randomizedlattice algorithm for bounded distance decoding,”
IEEE Trans. Inf.Theory , vol. 57, no. 9, pp. 5933–5945, 2011.[6] Z. Wang and C. Ling, “On the geometric ergodicity of Metropolis-Hastings algorithms for lattice Gaussian sampling,”
IEEE Trans. Inf.Theory , vol. 64, no. 2, pp. 738–751, 2018.[7] Z. Wang, S. Lyu, and L. Liu, “Learnable Markov Chain Monte Carlosampling methods for lattice Gaussian distribution,”
IEEE Access , vol. 7,pp. 87 494–87 503, 2019.[8] S. Anaswara, “Sampling from multidimensional distributions supportedon a lattice,” Master’s thesis, Indian Institute of Science, Bengaluru,2020.[9] J. Folláth, “Gaussian sampling in lattice based cryptography,”
Tatra Mt.Math. Publ. , vol. 60, pp. 1–23, 2014.[10] P. Klein, “Finding the closest lattice vector when it’s unusually close,”in
Proc. ACM-SIAM Symp. Discrete Algorithms , 2000, p. 937–941.[11] A. B. Tsybakov,
Introduction to Nonparametric Estimation . Springer,2008.[12] S. Meyn and R. L. Tweedie,
Markov Chains and Stochastic Stability ,2nd ed. Cambridge University Press, 2009.[13] G. O. Roberts and J. S. Rosenthal, “General state space Markov chainsand MCMC algorithms,”
Probab. Surveys , vol. 1, pp. 20–71, 2004.[14] R. M. Neal, “MCMC using Hamiltonian dynamics,” in
Handbook ofMarkov chain Monte Carlo , S. Brooks, A. Gelman, G. Jones, and X.-L.Meng, Eds. Chapman and Hall/CRC, 2011, p. 113–162.[15] G. O. Roberts and R. L. Tweedie, “Exponential convergence of Langevindistributions and their discrete approximations,”
Bernoulli , vol. 2, no. 4,pp. 341–363, 1996.[16] K. L. Mengersen and R. L. Tweedie, “Rates of convergence of theHastings and Metropolis algorithms,”
Ann. Statist. , vol. 24, no. 1, pp.101–121, 1996. [17] R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu, “Log-concavesampling: Metropolis-Hastings algorithms are fast,”
J. Mach. Learn. Res ,vol. 20, no. 183, pp. 1–42, 2019.[18] S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami, “On thegeometric ergodicity of Hamiltonian Monte Carlo,”
Bernoulli , vol. 25,no. 4A, pp. 3109–3138, 2019.[19] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Eds.,
Handbook ofMarkov Chain Monte Carlo . Chapman and Hall/CRC, 2011.[20] W. Ehm, T. Gneiting, and D. Richards, “Convolution roots of radialpositive definite functions with compact support,”
Trans. Am. Math. Soc ,vol. 356, no. 11, pp. 4655–4685, 2004.[21] L. Tierney, “Markov chains for exploring posterior distributions,”
Ann.Statist. , vol. 22, no. 4, pp. 1701–1728, 1994.[22] S. F. Jarner and E. Hansen, “Geometric ergodicity of Metropolisalgorithms,”
Stoch. Process. Their Appl , vol. 85, no. 2, pp. 341–361,2000.[23] G. O. Roberts, A. Gelman, and W. R. Gilks, “Weak convergence andoptimal scaling of random walk Metropolis algorithms,”
Ann. Appl.Probab. , vol. 7, no. 1, pp. 110–120, 1997.
PPENDIX AP IECE - WISE C ONSTANT APPROXIMATION FOR G AUSSIANDENSITY
In this section, we substantiate the claim in Section III thatthe choice of ¯ π ( x ) = π ([ x ]) can give rise to a Markov chainwhich is not uniformly ergodic. We show this for a simple casewhere π is a 1-dimensional Gaussian density. From Theorem2.1 in [16], it follows that, if ess inf π ( x )¯ π ( x ) = 0 with respect to ¯ π measure, then Independent Metropolis Hastings algorithmis not even geometrically ergodic. Let π be a 1-dimensionalGaussian density and ¯ π ( x ) = π ([ x ]) . Let ¯ x denote [ x ] and let y = x − ¯ x . Then, π ( x )¯ π ( x ) = M e − (¯ x + y ) e − ¯ x = M e − xy e − y , (22)where M is a constant. By definition, essential infimum of π ( x )¯ π ( x ) with respect to ¯ π measure is the greatest number a suchthat the set, A = { x ∈ R : π ( x )¯ π ( x ) < a } has zero ¯ π -measure. It is clear from (22) that, by choosing alarge value for x , π ( x )¯ π ( x ) can be made arbitrary close to 0 withina set of nonzero ¯ π measure. = ⇒ ess inf π ( x )¯ π ( x ) = 0 . This shows that for ¯ π ( x ) = π ([ x ]) , Independent MetropolisHastings algorithm need not even be geometrically ergodic.A PPENDIX BO THER
MCMC
METHODS USED
A. MCMC method used in the estimation of marginal distri-butions
This section elaborates on the MCMC method used forestimating marginal distributions P i Z required to calculateTVD m in Section IV. We use the Random Walk Metropolis(RWM) algorithm to estimate the marginal distributions. Thisis a Metropolis-Hastings algorithm in which proposal density q ( x , y ) is a function of (cid:107) y − x (cid:107) . We refer an interested readerto [22], [23] for more on random walk Metropolis algorithms.Target distribution ¯ π used in this algorithm is the followingpiece-wise constant density derived from π . ¯ π ( x ) = π ([ x ]) for all x ∈ R d . Due to the symmetric nature of the proposal density, theacceptance ratio takes the following simple form: α ( x , y ) = 1 ∧ ¯ π ( y )¯ π ( x ) . In particular, we use the random walk Metropolis algorithmwith proposal density q ( x , · ) being a Gaussian density withmean x and covariance matrix Σ . We choose Σ to be propor-tional to the covariance matrix of π . Algorithm 2 describes the steps involved in the random walkMetropolis. We do 500 iterations of this algorithm to giveit enough time to converge to the stationary distribution andthereby generate one sample. We use 200,000 such samples toform the histogram for each co-ordinate, which gives us theestimate of marginal distributions. Algorithm 2:
Random Walk Metropolis Algorithm
Input: π, Σ , X Output:
Sample from a distribution statistically closeto P Z d for t = 1 , , . . . do Let x denote the state of X t − ;Generate w from N (0 , Σ) ; y ← x + w ;Round y to its nearest point in Z d to get ¯ y ;Round x to its nearest point in Z d to get ¯ x ;Calculate acceptance ratio α ( x , y ) = 1 ∧ π (¯ y ) π (¯ x ) ;Generate a sample u from U [0 , ; if u ≤ α ( x , y ) then let X t = y ; else X t = x ; endif t > t mix ( (cid:15) ; X ) then Round X t to its nearest point in Z d to get ¯ X t ; endend B. Hamiltonian Monte Carlo (HMC)
HMC is used to generate samples from perfect securitydistribution in section IV-3. We refer the reader to [14] foran exposition on HMC. The input parameters to HMC are thenumber of Leapfrog steps ( L ) and the Leapfrog step-size ( (cid:15) ).The values of L and (cid:15) used in our simulation are as givenbelow: L = (cid:36) (cid:18) d (cid:19) (cid:37) ,(cid:15) = 1 . (cid:18) d (cid:19) . Inside HMC, we resample momentum variables from anisotropic Gaussian density with variance equal to 9. We usefive iterations of HMC to approximately generate a samplefrom the perfect security distribution. = 1 √ −
30 4 0 0 0 0 0 2 0 0 0 2 0 2 0 0 0 0 0 2 2 0 0 10 0 4 0 0 0 0 2 0 0 0 2 0 0 2 0 0 2 0 0 2 0 0 10 0 0 4 0 0 0 2 0 0 0 2 0 0 0 2 0 0 2 0 2 0 0 10 0 0 0 4 0 0 2 0 0 0 0 0 2 2 2 0 2 2 2 2 0 0 10 0 0 0 0 4 0 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 10 0 0 0 0 0 4 2 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 10 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 2 0 0 0 0 0 10 0 0 0 0 0 0 0 4 0 0 2 0 2 2 2 0 2 2 2 2 2 2 10 0 0 0 0 0 0 0 0 4 0 2 0 2 0 0 0 2 0 0 0 2 0 10 0 0 0 0 0 0 0 0 0 4 2 0 0 2 0 0 0 2 0 0 0 2 10 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 4 2 2 2 0 0 0 0 2 2 2 10 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 2 2 2 2 2 2 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1