Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison
OOops I Took A Gradient: Scalable Sampling for Discrete Distributions
Will Grathwohl
Kevin Swersky Milad Hashemi David Duvenaud
Chris J. Maddison Abstract
We propose a general and scalable approximatesampling strategy for probabilistic models withdiscrete variables. Our approach uses gradientsof the likelihood function with respect to its dis-crete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that thisapproach outperforms generic samplers in a num-ber of difficult settings including Ising models,Potts models, restricted Boltzmann machines, andfactorial hidden Markov models. We also demon-strate the use of our improved sampler for trainingdeep energy-based models on high dimensionaldiscrete data. This approach outperforms varia-tional auto-encoders and existing energy-basedmodels. Finally, we give bounds showing that ourapproach is near-optimal in the class of samplerswhich propose local updates.
1. Introduction
Discrete structure is everywhere in the real world, fromtext to genome sequences. The scientific community isbuilding increasingly complex models for this discrete data,increasing the need for methods to sample from discretedistributions. Sampling from a discrete distribution mayseem like a simpler task than sampling from a continu-ous one: even a one-dimensional continuous distributioncan have an uncountable infinity of outcomes, whereas adiscrete distribution is at most countable. However, mostcommon continuous distributions have some kind of simpli-fying structure, such as differentiable densities, which canbe exploited to speed up sampling and inference.Of course, many discrete distributions have structure aswell. Notably, discrete distributions over combinatorialspaces often have some kind of block independence struc-ture among their variables. Although this can be used tospeed up sampling and inference, it may be difficult to de-tect such structure automatically. Typically, users need to University of Toronto and Vector Institute Google Research,Brain Team. Correspondence to: Will Grathwohl < [email protected] > . Arxiv Preprint , Copyright 2021 by the author(s). know this structure a priori and must hard-code it into analgorithm to speed up sampling.
Figure 1.
Our approach visualized. Often discrete distributionsare defined by continuous functions whose input is restricted to adiscrete set; here R restricted to Z . We use a Taylor series com-puted on the underlying continuous function to estimate likelihoodratios of of making discrete moves; here ± in either direction.These estimated likelihood ratios are used to inform a proposaldistribution over moves in the original discrete space. In search for a unifying structure, we notice that many dis-crete distributions are written (and implemented) as dif-ferentiable functions of real-valued inputs. The discretestructure is created by restricting the continuous inputs toa discrete subset of their domain. In this paper, we use thegradients of these underlying continuous functions to informproposal distributions for MCMC sampling in large discreteprobabilistic models. This new family of gradient-informedproposals for MCMC may be seen as a class of adaptiveGibbs samplers or a fast approximation to locally-informedproposals (Umrigar, 1993; Liu, 1996; Zanella, 2020).As we show, this gradient information is cheaply availablefor many discrete distributions and can lead to orders ofmagnitude improvements in sampling efficiency. In somecases, it even outperforms samplers that exploit hard-codedindependence structure.We apply these sampling techniques to improve parame-ter inference in discrete energy-based models such as Isingand Potts models. We also find that our method is partic- a r X i v : . [ c s . L G ] F e b ibbs with Gradients ularly well-suited to sample from unnormalized discretedistributions parameterized by deep neural networks. Theimproved efficiency of our sampler allows us to apply manytechniques from large-scale continuous EBMs to success-fully train deep EBMs for discrete data. These modelsare simple to define, flexible, and outperform baseline varia-tional auto-encoders (Kingma & Welling, 2013) and existingenergy-based models.
2. Background
In this work we are concerned with sampling from unnor-malized distributions over discrete data log p ( x ) = f ( x ) − log Z where f ( x ) is the unnormalized log-probability of x and Z = (cid:80) x e f ( x ) is the normalizing constant which we assumein unknown. We restrict our study to D -dimensional binary x ∈ { , } D and categorical x ∈ { , , . . . , K } D data as allfinite-dimensional discrete distributions can be embeddedin this way. Gibbs sampling is perhaps the simplest and most genericmethod for sampling from discrete distributions. At eachstep, the sampler partitions the dimensions of an input x in to two groups x u and x − u where u is a subset of the D dimensions of x and − u is its complement. The nextsample in the chain is created by updating x u to be a samplefrom p ( x u | x − u ) , the conditional distribution of the chosendimensions, given all others. In some distributions, block-independence structure exists, making certain partitions easyto sample from and update in parallel.In the worst case, if such structure does not exist, we can let x u = x i , simply the i -th dimension of of x . In this setting p ( x i | x − i ) is simply a one-dimensional categorical distribu-tion over K possible outcomes, which requires K evalua-tions of the unnormalized log-density function. We typicallyfix an ordering of the dimensions and iterate through thisordering to ensure that each dimension is updated.While simple, updating one dimension at a time can be prob-lematic for high-dimensional data. Consider for example abinary model of the MNIST dataset. Almost all dimensionswill represent the background and if chosen for a Gibbsupdate, they will almost certainly not change. Similarly, thepixels on the interior of a digit will also not change. Thisamounts to wasted computation every time we propose adimension, which does not change. Ideally, if we couldbias our partition choice to dimensions which are likely tochange, we could build a more efficient Gibbs sampler.Consider a similar sampler for the binary case;Metropolis-Hastings with the proposal distribution q ( x (cid:48) | x ) = (cid:80) i q ( x (cid:48) | x, i ) q ( i ) where q ( i ) is a distributionover indices i ∈ { , . . . , D } and q ( x (cid:48) | x, i ) = ( x (cid:48) = x − i ) .With this sampler we will first sample an index i ∼ q ( i ) ,then flip the i -th bit of x to obtain x (cid:48) and accept thisproposed update with probability min (cid:26) exp( f ( x (cid:48) ) − f ( x )) q ( x | x (cid:48) ) q ( x (cid:48) | x ) , (cid:27) . (1)This approach can lead to considerable performance im-provements if q ( i ) is biased towards dimensions whichare more likely to flip. To this end, considerable workhas been done to use prior sampling data to adaptively up-date the q ( i ) while maintaining the validity of MCMC sam-pling (Łatuszy´nski et al., 2013; Richardson et al., 2010).Revisiting our MNIST example, we can see the pixels mostlikely to change are those at the edge of a digit. Where thisedge is varies considerably and depends on the entire input x . Thus, we can imagine an input dependent proposal q ( i | x ) should be able to outperform an unconditional proposal q ( i ) .Our proposed approach does exactly that, using gradientinformation to inform a proposal over dimensions whichleads to more efficient sampling. 
A good Metropolis-Hastings proposal needs to balance theincrease in the likelihood of the proposed point x (cid:48) with thedecrease in the probability of the reverse proposal q ( x | x (cid:48) ) .A natural strategy for increasing the likelihood is to uselocally-informed proposals: q τ ( x (cid:48) | x ) ∝ e τ ( f ( x (cid:48) ) − f ( x )) ( x (cid:48) ∈ H ( x )) . (2)where H ( x ) is the Hamming ball of some size around x and τ > is a temperature parameter. In this case, the pro-posal is simply a tempered Softmax over f ( x (cid:48) ) − f ( x ) forall x (cid:48) ∈ H ( x ) . If the temperature τ is too low, this proposalwill aggressively optimize the local likelihood increase from x → x (cid:48) , but possibly collapse the reverse proposal probabil-ity q τ ( x | x (cid:48) ) . If the temperature τ is too high, this proposalmay increase the reverse proposal probability q τ ( x | x (cid:48) ) , butignore the local likelihood increase.The temperature that balances both of these terms is τ = 2 (Umrigar, 1993; Zanella, 2020). We include a derivation ofthis fact in Appendix A. In fact, Zanella (2020) showed that q ( x (cid:48) | x ) is in the optimal subclass of locally-informed pro-posals. Zanella (2020) also demonstrated that this can leadto large improvements in empirical performance per sam-pling step compared to other generic samplers like Gibbsand the Hamming-Ball sampler (Titsias & Yau, 2017).Unfortunately, while powerful, these locally-informed pro-posals requires us to to compute f ( x (cid:48) ) − f ( x ) for every x (cid:48) ∈ H ( x ) . For D -dimensional data and a Hamming win-dow size of 1, this requires O ( D ) evaluations of f which ibbs with Gradients can become prohibitive as D grows. Our proposed approachreduces this to O (1) evaluations while incurring a minimaldecrease in the efficiency of each sampling step.
3. Searching for Structure
For some distributions, structure exists which enables thelocal differences f ( x (cid:48) ) − f ( x ) to be computed efficiently.This is not true in general, but even in settings where it is,bespoke derivations and implementations are required. Ide-ally, we could identify a structure that is ubiquitous acrossmany interesting discrete distributions, can be exploitedin a generic and generalizable way, and can allow us toaccurately estimate the local differences.To find such structure, we examine the functional form ofthe unnormalized log-probability for some common, diversefamilies of discrete distributions in Table 1.Distribution log p ( x ) + log Z Categorical x T θ Poisson x log λ − log Γ( x + 1) HMM (cid:80) Tt =1 x Tt +1 Ax t − ( w T x − y ) σ RBM (cid:80) i softplus ( W x + b ) i + c T x Ising x T W x + b T x Potts (cid:80) Li =1 h Ti x i + (cid:80) Li,j =1 x Ti J ij x j Deep EBM f θ ( x ) Table 1.
Unnormalized log-likelihoods of common discrete distri-butions. All are differentiable with respect to x . The formulas in Table 1 are not only the standard way thesedistributions are written down, but they are also the standardway these distributions are implemented in probabilisiticprogramming frameworks.The key insight here is that these are all continuous, differ-entiable functions accepting real-valued inputs, even thoughthey are evaluated only on a discrete subset of their domain.We propose to exploit this structure and that gradients, inthe form of Taylor-series approximations, can be used toefficiently estimate likelihood ratios between a given input x and other discrete states x (cid:48) .When we are dealing with D -dimensional binary data, wecan estimate the likelihood ratios of flipping each bit with ˜ d ( x ) = − (2 x − (cid:12) ∇ x f ( x ) (3)where ˜ d ( x ) i ≈ f ( x − i ) − f ( x ) and x − i is x with the i -th bitflipped. If we are dealing with D -dimensional categoricaldata we can estimate a similar quantity ˜ d ( x ) ij = ∇ x f ( x ) ij − x Ti ∇ x f ( x ) i (4) where ˜ d ( x ) ij approximates the log-likelihood ratio of flip-ping the i -th dimension of x from its current value to thevalue j . Similar first-order approximations can easily bederived for larger window sizes as well with linear operatorsapplied to the gradient of the log-probability function.
4. Gibbs With Gradients
We now present our main algorithm. We use a Taylor-series(Equations 3, 4) to approximate the likelihood ratios within alocal window of a point x . We use these estimated likelihoodratios to produce an approximation q ∇ ( x (cid:48) | x ) ∝ e ˜ d ( x )2 ( x (cid:48) ∈ H ( x )) (5)to q ( x (cid:48) | x ) of Equation 2, which we use in the standardMetropolis-Hastings algorithm.Our experiments focus on a simple and fast instance ofthis approach which only considers local moves inside of aHamming window of size 1. For binary data, these window-1 proposals have an even simpler form since all x (cid:48) ∈ H ( x ) differ by only dimension. Proposing a move from x to x (cid:48) isequivalent to choosing which dimension i to change. We cansample this from a categorical distribution over D choices: q ( i | x ) = Categorical (cid:32)
Softmax (cid:32) ˜ d ( x )2 (cid:33)(cid:33) (6)Thus when x binary, to sample from q ∇ ( x (cid:48) | x ) , we sim-ply sample which dimension to change i ∼ q ( i | x ) , andthen deterministically set x (cid:48) = flipdim ( x, i ) . In thiscase, when x and x (cid:48) differ only in dimension i , we have q ∇ ( x (cid:48) | x ) = q ( i | x ) and q ∇ ( x | x (cid:48) ) = q ( i | x (cid:48) ) .Because of the relationship to Adaptive Gibbs, we call oursampler Gibbs-With-Gradients. Pseudo-code describing oursampler can be found in Algorithm 1. In the categorical data Algorithm 1
Gibbs With Gradients
Input: unnormalized log-prob f ( · ) , current sample x Compute ˜ d ( x ) { Eq. 3 if binary, Eq. 4 if categorical. } Compute q ( i | x ) = Categorical (cid:16)
Softmax (cid:16) ˜ d ( x )2 (cid:17)(cid:17) Sample i ∼ q ( i | x ) x (cid:48) = flipdim ( x, i ) Compute q ( i | x (cid:48) ) = Categorical (cid:16)
Softmax (cid:16) ˜ d ( x (cid:48) )2 (cid:17)(cid:17) Accept with probability: min (cid:18) exp( f ( x (cid:48) ) − f ( x )) q ( i | x (cid:48) ) q ( i | x ) , (cid:19) setting, the proposal must choose not only which dimensionto change, but also to what value. Thus, q ( i | x ) in this settingis a DK -way Softmax. ibbs with Gradients We describe some simple extensions in Appendix D.
Zanella (2020) proved that “locally-balanced” proposals,like q ( x (cid:48) | x ) in Equation 2, are the optimal locally-informedproposals for Metropolis-Hastings. In this section we showthat, under smoothness assumptions on f , our methods arewithin a constant factor of q ( x (cid:48) | x ) in terms of asymptoticefficiency.To understand the asymptotic efficiency of MCMC tran-sition kernels, we can study the asymptotic variance andspectral gap of the kernel. The asymptotic variance is de-fined as var p ( h, Q ) = lim T →∞ T var (cid:32) T (cid:88) t =1 h ( x t ) (cid:33) (7)where h : X → R is a scalar-valued function, Q is a p -stationary Markov transition kernel, and X ∼ p ( x ) . Thespectral gap is defined asGap ( Q ) = 1 − λ (8)where λ is the second largest eigenvalue of the transitionprobability matrix of Q . Both of these quantities measurethe asymptotic efficiency of Q . The asymptotic variancesmeasures the additional variance incurred when using se-quential samples from Q to estimate E p [ h ( x )] . The spectralgap is related to the mixing time, with larger values corre-sponding to faster mixing chains (Levin & Peres, 2017).Since our method approximates q ( x (cid:48) | x ) , we should expectsome decrease in efficiency. We characterize this decrease interms of the asymptotic variance and spectral gap, under theassumption of Lipschitz continuity of ∇ x f ( x ) . In particular,we show that the decrease is a constant factor that dependson the Lipschitz constant of ∇ x f ( x ) and the window sizeof our proposal. Theorem 1
Let Q ( x (cid:48) , x ) and Q ∇ ( x (cid:48) , x ) be the Markovtransition kernels given by the Metropolis-Hastings algo-rithm using the locally balanced proposal q ( x (cid:48) | x ) andour approximation q ∇ ( x (cid:48) | x ) . Let f be an L -smooth log-probability function and p ( x ) = exp( f ( x )) Z . Then it holds(a) var p ( h, Q ∇ ) ≤ var p ( h,Q ) c + − cc · var p ( h ) (b) Gap ( Q ∇ ) ≥ c · Gap ( Q ) where c = e − LD H and D H = sup x (cid:48) ∈ H ( x ) || x − x (cid:48) || . A proof can be found in Appendix B. This roughly states that Q ∇ ( x (cid:48) , x ) is no less than c -times as efficient than Q ( x (cid:48) , x ) per step for estimating expectations. As expected, our ap-proach matches the efficiency of the target proposal whenthe Taylor-series approximation is accurate. An example:
Consider an Ising model on a cyclic2D lattice. This model has log-probability function f ( x ) = θ · x T Jx − log Z where J is the binary adjacencymatrix of the lattice, θ is the connectivity strength and Z isthe unknown normalizing constant. We can see the gradientis ∇ x f ( x ) = 2 θ · Jx and can bound L ≤ σ ( J ) θ = 8 θ .For θ = . and a Hamming window of size 1, this gives c = . , regardless of D . Since a single evaluation of f has an O ( D ) cost, it costs O ( D ) to compute q ( x (cid:48) | x ) .Compared to the exact local-differences proposal, the Gibbs-With-Gradients proposal q ∇ ( x (cid:48) | x ) incurs at most a constantloss in sampling efficiency per-iteration but gives a O ( D ) increase in speed.
5. Relationship to Continuous Relaxations
Why hasn’t this relatively simple approach been proposedbefore? A number of prior works have used gradient infor-mation for discrete sampling. Instead of using gradients toinform discrete updates directly, these methods transportthe problem into a continuous relaxation, perform updatesthere, and transform back after sampling. This approachincurs many of the pitfalls of continuous sampling withoutproviding the scalability. We find these methods are notcompetitive with Gibbs-With-Gradients in high dimensions.In more detail, these methods use the discrete target dis-tribution to create a related continuous distribution (relax-ation) whose samples can be transformed to samples fromthe target distribution. They then apply gradient-basedsampling methods such as Stein Variational Gradient De-scent (Liu & Wang, 2016) or Hamiltonian Monte-Carlo(HMC) (Neal et al., 2011) to the new distribution. Exam-ples of such methods are the recently proposed Discrete-SVGD (D-SVGD) (Han et al., 2020) and DiscontinuousHMC (Nishimura et al., 2017).A key challenge of these approaches is that the relaxed dis-tribution can be arbitrarily difficult to sample from, highlymulti-modal and require small step-sizes. Further, metricsof MCMC performance and mixing in the relaxed space maynot indicate performance and mixing in the discrete space.These methods also require the tuning of many additionalhyper-parameters such as step-size, momentum, and the tem-perature of the relaxation. In contrast, our approach operatesdirectly in discrete space, and has no hyper-parameters.Figure 2 compares these approaches on the task of sam-pling from restricted Boltzmann machines (RBMs) of upto 1000 dimensions. We compare to D-SVGD and tworelaxation-based baselines derived from the Metropolis- The local difference function of an Ising model can be com-puted more efficiently. However, this requires a bespoke derivationand implementation, and is not possible for general pmfs, such asthose parameterized by neural networks. ibbs with Gradients
Adjusted Langevin Algorithm (Besag, 1994) and HMC.We compare the log-MMD between generated samples and“ground-truth” samples generated with Block-Gibbs. Wealso display samples from an MNIST-trained model. In con-trast to all three baselines, our approach does not degradewith dimension. Additional results, details, and discussioncan be found in Appendix C. These relaxation-based ap-proaches do not scale beyond 200 dimensions, so we do notcompare to them in our main experimental results section
Figure 2.
Comparison to gradient-based samplers with continuousrelaxations. GWG, D-SVGD, R-HMC, and R-MALA refer togibbs-with-gradients, Discrete SVGD, Relaxed HMC and RelaxedMALA, respectively. Left: Log-MMD (lower is better) betweentrue samples and generated samples for RBMs of increasing di-mension (over 3 runs). “Target” is log-MMD between two sets ofBlock-Gibbs samples. Right: Visualized samples from an RBMtrained on MNIST.
6. Sampling From EBMs
To demonstrate the benefits and generality of our proposedapproach to sampling, we present results sampling from3 distinct and challenging distributions; Restricted Boltz-mann Machines, Lattice Ising models, and Factorial HiddenMarkov Models. Each is evaluated differently based onthe properties of the distribution. We compare our sampler,Gibbs-With-Gradients, against standard Gibbs samplingand the Hamming Ball Sampler (Titsias & Yau, 2017) –two generic approaches for discrete sampling. When avail-able, we also compare with samplers which exploit knownstructure in the distribution of interest.In the following, Gibbs- X refers to Gibbs sampling with ablock-size of X , and HB- X - Y refers to the Hamming Ballsampler with a block size of X and a hamming ball size of Y , and GWG refers to Gibbs-With-Gradients. Gibbs-1 isthe fastest sampler tested. In our current implementation,we find Gibbs-2, HB-10-1, and GWG have approximately1.6, 6.6, 2.1 times the cost of Gibbs-1 per step, respectively.Thus the run-time of GWG is most comparable to Gibbs-2. Restricted Boltzmann Machines are unnormalizedlatent-variable models defined as: log p ( x ) = log(1 + exp( W x + c )) + b T x − log Z (9)where { W, b, c } define its parameters and x ∈ { , } D . Wetrain an RBM with 500 hidden units on the MNIST datasetusing contrastive divergence (Hinton, 2002). We generatesamples with various MCMC samplers and compare themin two ways. First, using the Maximum Mean Discrepancy(MMD) (Gretton et al., 2012) between a set of samples fromeach sampler and a set of “ground-truth” samples gener-ated using the structured Block-Gibbs sampler available toRBMs (see Appendix E for details). Next, we report theEffective Sample Size (ESS) of a summary statistic oversampling trajectories. Results can be seen in Figure 3. Figure 3.
RBM sampling results. Left: Log-MMD of samples oversteps (lower is better). “Target” is Log-MMD between two setsof Block-Gibbs samples. Right: Log-ESS of various samplersafter 10,000 steps. Gibbs-With-Gradients matches Block-Gibbs inMMD and outperforms unstructured baselines in ESS.
We see on the left that GWG matches the structured Block-Gibbs sampler in MMD (“Target” in the Figure), while theother samplers do not. On the right we see that the effectivesample size of GWG is notably above the baselines andis approximately halfway between the baselines and theBlock-Gibbs sampler (in log-space). We note, Block-Gibbscan update all 784 dimensions in each iterations. GWG andGibbs-1 can update 1 and Gibbs-2 and Hamming Ball canupdate 2 dimensions per iteration.
Lattice Ising Models are models for binary data definedby log p ( x ) = θ · x T Jx − log Z (10)where θ is the connectivity strength and J is the binary adja-cency matrix, here restricted to be a 2D cyclic lattice. Thismodel was originally proposed to model the spin magneticparticles (Ising, 1924). We sample from models with in-creasing dimension and connectivity strength. We evaluate ibbs with Gradients using Effective Sample Size (ESS) of a summary statistic(full details in Appendix F). Results can be seen in Figure 4. Figure 4.
Ising model sampling results. The y-axis shows log-ESSover 100,000 samples steps. Left: 10x10 lattice, right: 40x40lattice. We see GWG outperforms in most settings.
We see GWG provides a notable increase in ESS for Isingmodels with higher connectivity. These models are harderto sample from as their dimensions are more correlated.
Factorial Hidden Markov Models (FHMM) are latent-variable time-series models, similar to HMMs but their hid-den state consists of distinct, independent factors. The con-tinuous data y ∈ R L of length L is generated by the binaryhidden state x ∈ { , } L × K with K factors as p ( x, y ) = p ( y | x ) p ( x ) p ( y | x ) = L (cid:89) t =1 N ( y t ; W x t + b, σ ) p ( x ) = p ( x ) L (cid:89) t =2 p ( x t | x t − ) (11)We create a random FHMM with 1000 time-steps and a10-dimensional hidden state and then draw samples y . Wegenerate posterior samples p ( x | y ) and evaluate our sam-plers using reconstruction error and joint likelihood. Fullmodel description and experimental details can be found inAppendix G and results can be seen in Figure 5.In this setting, the Hamming-Ball sampler exploits knownstructure in the problem. Each block chosen by the samplerconsists of the 10-dimensional latent state x t , as opposed to10 random dimensions. Thus, the Hamming-Ball samplerin this setting is a stronger baseline. Despite this, we findGWG notably outperforms the baseline samplers.
7. Training EBMs
Training EBMs is a challenging task. Computing the likeli-hood for Maximum Likelihood inference requires compu- Computed using Tensorflow Probability
Figure 5.
Factorial Hidden Markov Model results. “Block-HB”refers to the block-structured hamming ball sampler. Left, log-jointdensity and right, mean log-reconstruction error. GWG performsbest in both evaluations, outperforming the Hamming Ball samplerwhich exploits model structure. tation of the normalizing constant Z = (cid:80) x e f θ ( x ) which istypically intractable. Thankfully, the gradient of likelihoodcan be more easily expressed as: ∇ θ log p ( x ) = ∇ θ f θ ( x ) − E p θ ( x ) [ ∇ θ f θ ( x )] (12)therefore, if samples can be drawn from p θ ( x ) , then an unbi-ased gradient estimator can be derived. We can approximatethis estimator using MCMC. When a slow-mixing MCMCsampler is used to draw these samples, we obtain biasedgradient estimates and this leads to sub-par learning. Im-provements in MCMC can then lead to improvements inparameter inference for unnormalized models. We explorehow GWG can be applied to parameter inference for someclassical discrete EBMs. We generate Ising models with different sparse graph struc-tures; a 2D cyclic lattice and a random Erdos-Renyi graph.We generate training data with a long-run Gibbs chain andtrain models using Persistent Contrastive Divergence (Tiele-man, 2008) with an (cid:96) penalty to encourage sparsity. Weevaluate our models using the RMSE between the inferredconnectivity matrix ˆ J and the true matrix J .Full experimental details and additional results can be foundin Appendix F.2, F.3 and results can be seen in Figure 6.In all settings, GWG greatly outperforms Gibbs sampling.This allows for much faster training than standard Gibbswhile recovering higher-quality solutions. Proteins are defined by a discrete sequence of 20 aminoacids x ∈ { , . . . , } D where D is the length of the pro-tein. The Potts model has long been a popular approachfor modelling the evolutionary distribution of protein se- ibbs with Gradients Figure 6.
Training Ising models with increasing MCMC steps.Left: Lattice Ising (dim = 625, θ = . ). Right: Erdos-RenyiIsing. Values are log (RMSE) between the learned and true J .GWG leads to better solutions with lower computational cost. quences (Lapedes et al., 1999). The model takes the form log p ( x ) = D (cid:88) i =1 h Ti x i + D (cid:88) i,j =1 x Ti J ij x j − log Z (13)where x i is a one-hot encoding of the i -th amino acid in x , J ∈ R { D × D × × } and h ∈ R { D × } are the model’sparameters and Z is the model’s normalizing constant. ThePotts model’s likelihood is the sum of pairwise interactions.Marks et al. (2011) demonstrated that the strength of theseinteractions can correspond to whether or not two aminoacids touch when the protein folds. These inferred contactscan then be used to infer the 3D structure of the protein.Since the Potts model is unnormalized, maximum likelihoodlearning is difficult, and (cid:96) -regularized Pseudo-likelihoodMaximization (PLM) (Besag, 1975) is used to train themodel. Recently Ingraham & Marks (2017) found thatimproved contact prediction could be achieved with MCMC-based maximum likelihood learning. Unfortunately, due tothe limitations of discrete MCMC samplers, their study wasrestricted to small proteins (less than 50 amino acids).GWG allows these performance improvements to scale tolarge proteins as well. We train Potts models on 2 largeproteins: OPSD BOVIN, and CADH1 HUMAN. We trainusing PCD where samples are generated with GWG andGibbs. We run PLM as a baseline. These proteins are muchlarger than those studied in Ingraham & Marks (2017) withOPSD BOVIN, and CADH1 HUMAN having 276, and 526amino acids, respectively .We predict couplings using the J parameter of our models.We compute a “coupling-strength” for each pair of amino-acids as || J ij || which gives a measure of how much indices i and j interact to determine the fitness of a protein. Wesort index pairs by their coupling strength and compare thehighest scoring pairs with known contacts in the proteins. After standard data pre-processing as in Ingraham & Marks(2017)
Full experimental details and additional results can be foundin Appendix H and results can be seen in Figure 7. For thesmaller protein, Gibbs sampling outperforms PLM but forthe larger protein, the slow-mixing of the sampler causesthe performance to drop below that of PLM. Despite theincreased size, GWG performs the best.
Figure 7.
Recall Curves for contact prediction with Potts models.Gibbs-With-Gradients leads to higher recall.
8. Deep EBMs for Discrete Data
Deep Energy-Based Models have rapidly gained popularityfor generative modeling. These models take the form log p ( x ) = f θ ( x ) − log Z (14)where f θ : R D → R is a deep neural network. The re-cent success of these models can be attributed to a fewadvancements including; the use of tempered Langevin sam-plers (Nijkamp et al., 2020) and large persistent chains (Du& Mordatch, 2019). This has enabled EBMs to become acompetitive approach for image-generation (Song & Ermon,2019), adversarial robustness (Grathwohl et al., 2019; Hillet al., 2020), semi-supervised learning (Song & Ou, 2018;Grathwohl et al., 2020a) and many other problems.These advances rely on gradient-based sampling which re-quires continuous data. Thus, these scalable methods cannotbe applied towards training deep EBMs on discrete data. Weexplore how Gibbs-With-Gradients can enable the trainingof deep EBMs on high-dimensional binary and categori-cal data. To our knowledge, models of this form have notbe successfully trained on such data in the past. We traindeep EBMs parameterized by Residual Networks (He et al.,2016) on small binary and continuous image datasets us-ing PCD (Tieleman, 2008) with a replay buffer as in Du &Mordatch (2019); Grathwohl et al. (2019). The continuousimages were treated as 1-of-256 categorical data.PCD training is very sensitive to the choice of MCMC sam-pler. As an initial experiment, we attempted to train thesemodels using standard Gibbs but found that the sampler wastoo slow to enable stable training within a reasonable com-pute budget. On the binary data we needed to train with ibbs with Gradients Data Type Dataset VAE VAE EBM EBM RBM DBN(MLP) (Conv) (GWG) (Gibbs)Binary Static MNIST -86.05 -82.41 -80.01 -117.17 -86.39 -85.67Dynamic MNIST -82.42 -80.40 -80.51 -121.19 — —(log-likelihood ↑ ) Omniglot -103.52 -97.65 -94.72 -142.06 -100.47 -100.78Caltech Silhouettes -112.08 -106.35 -96.20 -163.50 — —Categorical Frey Faces 4.61 ↓ ) Histopathology 5.82 5.59 — — — Table 2.
Test-set log-likelihoods for models trained on discrete image datasets. RBM and DBN results are taken from Burda et al. (2015),VAE results taken from Tomczak & Welling (2018).
Gibbs sampling steps per training iteration. All models wetrained with fewer steps quickly diverged. GWG requiredonly 40. This made training with Gibbs 9.52x slower thanGWG. For a fair comparison, the Gibbs results in Table 2were trained for an equal amount of wall-clock time as theGWG models.For the categorical data, we could not train models withGibbs sampling. Each step of Gibbs requires us to evaluatethe energy function 256 (for each possible pixel value) times.GWG requires 2 function evaluations. Thus the amount ofcompute per iteration for Gibbs is 128x greater than GWG.Further, to make Gibbs train stably, we would need to usemany more steps, as with the binary data. This would giveroughly a increase in run-time. Therefore, training amodel of this form with Gibbs is simply not feasible.Full experimental details can be found in Appendix I. Wepresent long-run samples from our trained models in Figure8 and test-set likelihoods in Table 2. Likelihoods are esti-mated using Annealed Importance Sampling (Neal, 2001).We compare the performance of our models to VariationalAutoencoders (Kingma & Welling, 2013) and two otherEBMs; an RBM and a Deep Belief Network (DBN) (Hin-ton, 2009). On most datasets, our Resnet EBM outperformsthe other two EBMs and the VAEs. Our improved samplerenables deep EBMs to become a competitive approach togenerative modeling on high-dimensional discrete data.We include some preliminary results using Gibbs-With-Gradients to train EBMs for text data in Appendix J.
9. Future Directions and Conclusion
In this work we have presented Gibbs-With-Gradients, anew approach to MCMC sampling for discrete distributions.Our approach exploits a powerful structure, gradient infor-mation, which is available to a very large class of importantdiscrete distributions. We then use this gradient informationto construct proposal distributions for Metropolis-Hastings.We have demonstrated on a diverse set of distributions thatthis approach to sampling considerably outperforms base-line samplers which do not exploit known structure in the
Figure 8.
Left: data. Right: Samples from ResNet EBM. Sam-ples generated with annealed Markov chain using 300,000 Gibbs-With-Gradients steps. Top to bottom: MNIST, omniglot, CaltechSilhouettes, Frey Faces, Histopathology. target distribution as well as many that do. Further, we findour approach outperforms prior discrete samplers which usegradient information with continuous relaxations.We find Gibbs-With-Gradients performs very well at sam-pling from deep energy-based models and allows, for thefirst time, unconstrained deep EBMs to be trained on dis-crete data and outperform other deep generative models.We believe there is considerable room for future work build-ing on top of our method. We only explored samplers whichmodify 1 variable per proposed update. We believe con-siderable improvements could be made if the window sizeof the sampler was expanded but this would require moreefficient algorithms to sample from the larger proposal.Next, we have shown that gradient-based approximationsto the local difference function can be accurate and use-ful for complex discrete distributions. Local differencefunctions have been used in the past to generalize ScoreMatching (Lyu, 2012), and Stein Discrepancies (Han & Liu,2018). We believe there is great potential to explore howgradient-based approximations could enable the generaliza-tion of recent deep EBM training methods based on ScoreMatching and Stein Discrepancies (Song & Ermon, 2019;Grathwohl et al., 2020b) to models of discrete data. ibbs with Gradients
10. Acknowledgements
We would like to thank Eli Weinstein for helping us properlypresent our protein results and we would like to thank KellyBrock for help and feedback for working with the proteindata.We thank Jesse Bettencourt, James Lucas, Matt Hoffman,Rif Saurous, David Madras, and Jacob Kelly for helpfulfeedback on our draft.
References
Besag, J. Statistical analysis of non-lattice data.
Journal ofthe Royal Statistical Society: Series D (The Statistician) ,24(3):179–195, 1975.Besag, J. Comments on “representations of knowledge incomplex systems” by u. grenander and mi miller.
J. Roy.Statist. Soc. Ser. B , 56:591–592, 1994.Burda, Y., Grosse, R., and Salakhutdinov, R. Accurateand conservative estimates of mrf log-likelihood usingreverse annealing. In
Artificial Intelligence and Statistics ,pp. 102–110, 2015.Deng, Y., Bakhtin, A., Ott, M., Szlam, A., and Ranzato, M.Residual energy-based models for text generation. arXivpreprint arXiv:2004.11714 , 2020.Du, Y. and Mordatch, I. Implicit generation and gen-eralization in energy-based models. arXiv preprintarXiv:1903.08689 , 2019.Gers, F. A., Schmidhuber, J., and Cummins, F. Learning toforget: Continual prediction with lstm. 1999.Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D.,Norouzi, M., and Swersky, K. Your classifier is secretlyan energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263 , 2019.Grathwohl, W., Kelly, J., Hashemi, M., Norouzi, M., Swer-sky, K., and Duvenaud, D. No mcmc for me: Amortizedsampling for fast and stable training of energy-based mod-els. arXiv preprint arXiv:2010.04230 , 2020a.Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D.,and Zemel, R. Learning the stein discrepancy for trainingand evaluating energy-based models without sampling.In
International Conference on Machine Learning , pp.3732–3747. PMLR, 2020b.Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch¨olkopf, B.,and Smola, A. A kernel two-sample test.
The Journal ofMachine Learning Research , 13(1):723–773, 2012.Gutmann, M. and Hyv¨arinen, A. Noise-contrastive estima-tion: A new estimation principle for unnormalized statisti-cal models. In
Proceedings of the Thirteenth InternationalConference on Artificial Intelligence and Statistics , pp.297–304. JMLR Workshop and Conference Proceedings,2010.Han, J. and Liu, Q. Stein variational gradient descent with-out gradient. arXiv preprint arXiv:1806.02775 , 2018.Han, J., Ding, F., Liu, X., Torresani, L., Peng, J., and Liu,Q. Stein variational inference for discrete distributions. arXiv preprint arXiv:2003.00605 , 2020. ibbs with Gradients
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In
Proceedings of the IEEEconference on computer vision and pattern recognition ,pp. 770–778, 2016.He, T., McCann, B., Xiong, C., and Hosseini-Asl, E. Jointenergy-based model training for better calibrated nat-ural language understanding models. arXiv preprintarXiv:2101.06829 , 2021.Hill, M., Mitchell, J., and Zhu, S.-C. Stochastic security:Adversarial defense using long-run dynamics of energy-based models. arXiv preprint arXiv:2005.13525 , 2020.Hinton, G. E. Training products of experts by minimizingcontrastive divergence.
Neural computation , 14(8):1771–1800, 2002.Hinton, G. E. Deep belief networks.
Scholarpedia , 4(5):5947, 2009.Ingraham, J. and Marks, D. Variational inference for sparseand undirected models. In
International Conference onMachine Learning , pp. 1607–1616. PMLR, 2017.Ising, E.
Beitrag zur Theorie des Ferround Param-agnetismus . PhD thesis, PhD thesis (Mathematisch-Naturwissenschaftliche Fakult¨at der Hamburgischen . . . ,1924.Kingma, D. P. and Ba, J. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 , 2014.Kingma, D. P. and Welling, M. Auto-encoding variationalbayes. arXiv preprint arXiv:1312.6114 , 2013.Lapedes, A. S., Giraud, B. G., Liu, L., and Stormo,G. D. Correlated mutations in models of protein se-quences: phylogenetic and structural effects.
LectureNotes-Monograph Series , pp. 236–256, 1999.Łatuszy´nski, K., Roberts, G. O., Rosenthal, J. S., et al.Adaptive gibbs samplers and related mcmc methods.
TheAnnals of Applied Probability , 23(1):66–98, 2013.Levin, D. A. and Peres, Y.
Markov chains and mixing times ,volume 107. American Mathematical Soc., 2017.Liu, J. S. Peskun’s theorem and a modified discrete-stategibbs sampler.
Biometrika , 83(3), 1996.Liu, Q. and Wang, D. Stein variational gradient descent: Ageneral purpose bayesian inference algorithm.
Advancesin neural information processing systems , 29:2378–2386,2016.Lyu, S. Interpretation and generalization of score matching. arXiv preprint arXiv:1205.2629 , 2012. Maddison, C. J., Mnih, A., and Teh, Y. W. The concretedistribution: A continuous relaxation of discrete randomvariables. arXiv preprint arXiv:1611.00712 , 2016.Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A.,Pagnani, A., Zecchina, R., and Sander, C. Protein 3dstructure computed from evolutionary sequence variation.
PloS one , 6(12):e28766, 2011.Neal, R. M. Annealed importance sampling.
Statistics andcomputing , 11(2):125–139, 2001.Neal, R. M. et al. Mcmc using hamiltonian dynamics.
Hand-book of markov chain monte carlo , 2(11):2, 2011.Nijkamp, E., Hill, M., Han, T., Zhu, S.-C., and Wu, Y. N.On the anatomy of mcmc-based maximum likelihoodlearning of energy-based models. In
Proceedings of theAAAI Conference on Artificial Intelligence , volume 34,pp. 5272–5280, 2020.Nishimura, A., Dunson, D., and Lu, J. Discontinuoushamiltonian monte carlo for sampling discrete param-eters. arXiv preprint arXiv:1705.08510 , 2, 2017.Ramachandran, P., Zoph, B., and Le, Q. V. Searching foractivation functions. arXiv preprint arXiv:1710.05941 ,2017.Richardson, S., Bottolo, L., and Rosenthal, J. S. Bayesianmodels for sparse regression analysis of high dimensionaldata.
Bayesian Statistics , 9:539–569, 2010.Song, Y. and Ermon, S. Generative modeling by estimatinggradients of the data distribution. In
Advances in NeuralInformation Processing Systems , pp. 11918–11930, 2019.Song, Y. and Ou, Z. Learning neural random fieldswith inclusive auxiliary generators. arXiv preprintarXiv:1806.00271 , 2018.Taylor, A., Marcus, M., and Santorini, B. The penn treebank:an overview.
Treebanks , pp. 5–22, 2003.Tieleman, T. Training restricted boltzmann machines usingapproximations to the likelihood gradient. In
Proceedingsof the 25th international conference on Machine learning ,pp. 1064–1071, 2008.Tieleman, T. and Hinton, G. Using fast weights to improvepersistent contrastive divergence. In
Proceedings of the26th Annual International Conference on Machine Learn-ing , pp. 1033–1040, 2009.Titsias, M. K. and Yau, C. The hamming ball sampler.
Journal of the American Statistical Association , 112(520):1598–1611, 2017. ibbs with Gradients
Tomczak, J. and Welling, M. Vae with a vampprior. In
International Conference on Artificial Intelligence andStatistics , pp. 1214–1223. PMLR, 2018.Umrigar, C. J. Accelerated metropolis method.
Phys. Rev.Lett. , 71:408–411, Jul 1993. doi: 10.1103/PhysRevLett.71.408. URL https://link.aps.org/doi/10.1103/PhysRevLett.71.408 .Zanella, G. Informed proposals for local mcmc in discretespaces.
Journal of the American Statistical Association ,115(530):852–865, 2020. ibbs with Gradients
A. Balancing Locally-Informed Proposals
As discussed in Section 2.2, locally informed proposals needto balance the likelihood increase with the reverse proposalprobability. In particular, consider proposals of the form: q τ ( x (cid:48) | x ) ∝ e τ ( f ( x (cid:48) ) − f ( x )) ( x (cid:48) ∈ H ( x )) . (15)where H ( x ) is the Hamming ball of some size around x and τ > is a temperature parameter. Here we provide aderivation of the fact that τ = 2 balances these two terms.When we examine the acceptance rate (Equation 1) of thisproposal we find exp( f ( x (cid:48) ) − f ( x )) q τ ( x | x (cid:48) ) q τ ( x (cid:48) | x )= exp( f ( x (cid:48) ) − f ( x )) exp( τ ( f ( x ) − f ( x (cid:48) )) Z ( x )exp( τ ( f ( x (cid:48) ) − f ( x )) Z ( x (cid:48) )= exp (cid:18)(cid:18) − τ (cid:19) ( f ( x (cid:48) ) − f ( x )) (cid:19) Z ( x ) Z ( x (cid:48) ) (16)where Z ( x ) = (cid:80) x (cid:48) ∈ H ( x ) exp( τ ( f ( x (cid:48) ) − f ( x )) is the nor-malizing constant of the proposal. By setting τ = 2 , themost variable terms cancel and we are left with Z ( x ) Z ( x (cid:48) ) . Thus,the acceptance rate is equal to the ratio of the normalizingconstants of the proposal distributions. If the Hamming ballis small and the function f is well behaved (i.e Lipschitz)then, since x (cid:48) is near x , Z ( x (cid:48) ) will be near Z ( x ) and theacceptance rate will be high. B. Proof of Theorem 1
Our proof follows from Theorem 2 of Zanella (2020) whichstates that for two p -reversible Markov transition kernels Q ( x (cid:48) , x ) and Q ( x (cid:48) , x ) , if there exists c > for all x (cid:48) (cid:54) = x such that Q ( x (cid:48) , x ) > c · Q ( x (cid:48) , x ) then(a) var p ( h, Q ) ≤ var p ( h,Q ) c + − cc · var p ( h ) (b) Gap ( Q ) ≥ c · Gap ( Q ) where var p ( h, Q ) is the asymptotic variance defined in Equa-tion 7, Gap ( Q ) is the spectral gap defined in Equation 8, andvar p ( h ) is the standard variance E p [ h ( x ) ] − E p [ h ( x )] .Our proof proceeds by showing we can bound Q ∇ ( x (cid:48) , x ) ≥ c · Q ( x (cid:48) , x ) , and the results of the theo-rem then follow directly from Theorem 2 of Zanella(2020). B.1. Definitions
We begin by writing down the proposal distribution of inter-est and their corresponding Markov transition kernels. For ease of notion we define some values ∆( x (cid:48) , x ) := f ( x (cid:48) ) − f ( x ) ∇ ( x (cid:48) , x ) := ∇ x f ( x ) T ( x (cid:48) − x ) D H := sup x (cid:48) ∈ H ( x ) || x (cid:48) − x || We restate the target proposal for x (cid:48) ∈ H ( x ) q ( x (cid:48) | x ) = exp (cid:16) ∆( x (cid:48) ,x )2 (cid:17) Z ( x ) where we have defined Z ( x ) = (cid:88) x (cid:48) ∈ H ( x ) exp (cid:18) ∆( x (cid:48) , x )2 (cid:19) . This proposal leads to the Markov transition kernel Q ( x (cid:48) , x ) = q ( x (cid:48) | x ) min (cid:26) , Z ( x ) Z ( x (cid:48) ) (cid:27) = min exp (cid:16) ∆( x (cid:48) ,x )2 (cid:17) Z ( x ) , exp (cid:16) ∆( x (cid:48) ,x )2 (cid:17) Z ( x (cid:48) ) . We now restate our approximate proposal for x (cid:48) ∈ H ( x ) q ∇ ( x (cid:48) | x ) = exp (cid:16) ∇ ( x (cid:48) ,x )2 (cid:17) ˜ Z ( x ) where we have defined ˜ Z ( x ) = (cid:88) x (cid:48) ∈ H ( x ) exp (cid:18) ∇ ( x (cid:48) , x )2 (cid:19) which leads to the Markov transition kernel Q ∇ ( x (cid:48) , x ) = q ∇ ( x (cid:48) | x ) min , exp (∆( x (cid:48) , x ))exp (cid:16) ∇ ( x (cid:48) ,x ) −∇ ( x,x (cid:48) )2 (cid:17) ˜ Z ( x )˜ Z ( x (cid:48) ) = min exp (cid:16) ∇ ( x (cid:48) ,x )2 (cid:17) ˜ Z ( x ) , exp (cid:16) ∆( x (cid:48) , x ) + ∇ ( x,x (cid:48) )2 (cid:17) ˜ Z ( x (cid:48) ) . B.2. Preliminaries
It can be seen that ∇ ( x (cid:48) , x ) is a first order Taylor-seriesapproximation to ∆( x (cid:48) , x ) and it follows directly from theLipschitz continuity of ∇ x f ( x ) that |∇ ( x (cid:48) , x ) − ∆( x (cid:48) , x ) | ≤ L || x (cid:48) − x (cid:48) || (17)and since we restrict x (cid:48) ∈ H ( x ) we have − L D H ≤ ∇ ( x (cid:48) , x ) − ∆( x (cid:48) , x ) ≤ L D H (18) ibbs with Gradients B.3. Normalizing Constant Bounds
We derive upper- and lower-bounds on ˜ Z ( x ) in terms of Z ( x ) . ˜ Z ( x ) = (cid:88) x (cid:48) ∈ H ( x ) exp (cid:18) ∇ ( x (cid:48) , x )2 (cid:19) = (cid:88) x (cid:48) ∈ H ( x ) exp (cid:18) ∆( x (cid:48) , x )2 (cid:19) exp (cid:18) ∇ ( x (cid:48) , x ) − ∆( x (cid:48) , x )2 (cid:19) ≤ (cid:88) x (cid:48) ∈ H ( x ) exp (cid:18) ∆( x (cid:48) , x )2 (cid:19) exp (cid:18) LD H (cid:19) ≤ exp (cid:18) LD H (cid:19) (cid:88) x (cid:48) ∈ H ( x ) exp (cid:18) ∆( x (cid:48) , x )2 (cid:19) = exp (cid:18) LD H (cid:19) Z ( x ) (19)Following the same argument we can show ˜ Z ( x ) ≥ exp (cid:18) − LD H (cid:19) Z ( x ) . (20) B.4. Inequalities of Minimums
We show Q ∇ ( x (cid:48) , x ) ≥ c · Q ( x (cid:48) , x ) for c = exp (cid:16) − LD H (cid:17) .Since both Q ( x (cid:48) , x ) = min { a, b } and Q ∇ ( x (cid:48) , x ) = min { a ∇ , b ∇ } it is sufficient to show a ∇ ≥ c · a and b ∇ ≥ c · b to provethe desired result.We begin with the a terms a ∇ a = exp (cid:16) ∇ ( x (cid:48) ,x )2 (cid:17) ˜ Z ( x ) Z ( x )exp (cid:16) ∆( x (cid:48) ,x )2 (cid:17) = Z ( x )˜ Z ( x ) exp (cid:18) ∇ ( x (cid:48) , x )2 − ∆( x (cid:48) , x )2 (cid:19) ≥ exp (cid:18) − LD H (cid:19) exp (cid:18) ∇ ( x (cid:48) , x )2 − ∆( x (cid:48) , x )2 (cid:19) ≥ exp (cid:18) − LD H (cid:19) exp (cid:18) − LD H (cid:19) = exp (cid:18) − LD H (cid:19) (21) Now the b terms b ∇ b = exp (cid:16) ∆( x (cid:48) , x ) + ∇ ( x,x (cid:48) )2 (cid:17) ˜ Z ( x (cid:48) ) Z ( x (cid:48) )exp (cid:16) ∆( x (cid:48) ,x )2 (cid:17) = Z ( x (cid:48) )˜ Z ( x (cid:48) ) exp (cid:16) ∆( x (cid:48) , x ) + ∇ ( x,x (cid:48) )2 (cid:17) exp (cid:16) ∆( x (cid:48) ,x )2 (cid:17) = Z ( x (cid:48) )˜ Z ( x (cid:48) ) exp (cid:18) ∆( x (cid:48) , x )2 + ∇ ( x, x (cid:48) )2 (cid:19) ≥ exp (cid:18) − LD H (cid:19) exp (cid:18) ∆( x (cid:48) , x )2 + ∇ ( x, x (cid:48) )2 (cid:19) = exp (cid:18) − LD H (cid:19) exp (cid:18) ∇ ( x, x (cid:48) )2 − ∆( x, x (cid:48) )2 (cid:19) ≥ exp (cid:18) − LD H (cid:19) exp (cid:18) − LD H (cid:19) = exp (cid:18) − LD H (cid:19) (22) B.5. Conclusions
We have shown that a ∇ ≥ exp (cid:16) − LD H (cid:17) a and b ∇ ≥ exp (cid:16) − LD H (cid:17) b and therefore it holds that Q ∇ ( x (cid:48) , x ) ≥ exp (cid:18) − LD H (cid:19) Q ( x (cid:48) , x ) (23)From this, the main result follows directly from Theorem 2of Zanella (2020). C. Relationship to Relaxations
Han et al. (2020) show that sampling from any discretedistribution can be transformed into sampling from a con-tinuous distribution with a piece-wise density function. Forsimplicity we focus on a distribution p ( x ) over binary data x ∈ { , } D . To do this we will create a D -dimensional con-tinuous distribution p c ( z ) where z ∈ R D . We must specifya base distribution p ( z ) which we choose to be N (0 , I ) .We must then specify a function Γ( z ) : R D → { , } D which maps regions of equal mass under the base distri-bution to values in { , } D . A natural choice is Γ( z ) = sign ( z ) We then define p c ( z ) as p c ( z ) = N ( z ; 0 , I ) p (Γ( z )) and it can be easily verified that generating z ∼ p c ( z ) , x = Γ( z ) will produce a sample from p ( x ) . ibbs with Gradients Thus, we have transformed a discrete sampling task intoa task of sampling from a continuous distribution with apiece-wise density. Han et al. (2020) further relax this bydefining p λc ( x ) = N ( z ; 0 , I ) p (Γ λ ( z )) where Γ λ ( x ) is a continuous approximation to Γ( x ) . Anatural choice for sign ( x ) is a tempered sigmoid function Γ λ ( x ) = 11 + e − x/λ with temperature λ which controls the smoothness of therelaxation. This is similar to the Concrete relaxation (Mad-dison et al., 2016) for binary variables.D-SVGD proposes to use the gradients of log p λc ( x ) to pro-duce updates for their continuous samples which are ad-justed using an importance-weighted correction as proposedin Han & Liu (2018).We can apply this same approach to other sampling methodssuch as MALA and HMC. C.1. Relaxed MCMC
Gradient-based MCMC samplers such as MALA or HMCconsist of a proposal distribution, q ( x (cid:48) | x ) , and a metropolisaccept/reject step. As a baseline of comparison, we presenttwo straight-forward applications of the above relaxation tosampling from discrete distributions. In both settings wewill use the continuous, differentiable surrogate p λc ( x ) togenerate a proposal and we will perform our Metropolisstep using the piece-wise target p c ( x ) . Relaxed MALA (R-MALA):
Given a sample x , we sam-ple a proposed update x (cid:48) from: q ( x (cid:48) | x ) = N (cid:16) x (cid:48) ; x + (cid:15) ∇ x log p λc ( x ) , (cid:15) (cid:17) (24)and we accept this proposal with probability min (cid:26) p c ( x (cid:48) ) q ( x | x (cid:48) ) p c ( x ) q ( x (cid:48) | x ) , (cid:27) . (25)R-MALA has two parameters; the stepsize (cid:15) and the temper-ature of the relaxation λ . We search over (cid:15) ∈ { . , . , . } and λ ∈ { ., ., . } Relaxed HMC (R-HMC) works similarly to R-MALA.Given a sample x we sample an initial momentum vector v ∼ N (0 , M ) where M is the mass matrix. We perform k steps of leapfrog integration on the relaxed Hamiltonian H λ ( x, v ) = − log p λc ( x ) + 12 v T M v with step-size (cid:15) .The proposal x (cid:48) , v (cid:48) is accepted with probability min (cid:26) H ( x, v ) H ( x (cid:48) , v (cid:48) ) , (cid:27) (26)where H is the target Hamiltonian H ( x, v ) = − log p c ( x ) + 12 v T M v.
We fix the mass matrix M = I and the number of steps k =5 . This leaves two parameters, the ste-psize (cid:15) and tempera-ture λ . As with R-MALA we search over (cid:15) ∈ { . , . , . } and λ ∈ { . , , . } . C.2. Experimental Details
We compare D-SVGD, R-MALA, and R-HMC to Gibbs-With-Gradients at the task of sampling from RBMs. Wepresent results in two settings; random RBMs with increas-ing dimension, and an RBM trained on MNIST using Con-trastive Divergence.The random RBMs have visible dimension [25 , , , , , and all have hiddenunits. The MNIST RBM has 784 visible units and 500hidden units and is trained as in Appendix E.1. FollowingHan et al. (2020) the random RBMs are initialized as W ∼ N (0 , . I ) , b, c ∼ N (0 , I ) . All samples are initialized to a random uniform Bernoullidistribution and all samplers are run for 2000 steps. Weevaluate by computing the Kernelized MMD between eachsampler’s set of samples and a set of approximate “ground-truth” samples generated with the RBMs efficient block-Gibbs sampler. We generate 500 ground truth samples and100 samples for each sampler tested. In Figure 2 we plotthe final log-MMD with standard-error over 5 runs withdifferent random seeds. Samples on the right of the figureare generated in the same way from the MNIST RBM.For D-SVGD we search over relaxation temperatures λ ∈ { . , ., . } . We optimize the samples with the Adamoptimizer (Kingma & Ba, 2014). We search over learn-ing rates in { . , . , . } . We use an RBF kernel k ( x, x (cid:48) ) = exp (cid:16) −|| x − x (cid:48) || h (cid:17) and h = med / (2 log( n + 1)) where med is the median pairwise distance between thecurrent set of n samples.All reported results for D-SVGD, R-MALA, and R-HMCare the best results obtained over all tested hyper-parameters.We found all of these methods to be very sensitive to theirhyper-parameters – in particular, the relaxation temperature λ . We believe it may be possible to improve the performanceof these methods through further tuning of these parametersbut we found doing so beyond the scope of this comparison. ibbs with Gradients D. Gibbs-With-Gradients Extensions
D.1. Extensions To Larger Windows
We can easily extend our approach to proposals with largerwindow sizes. This would amount to a a Taylor-series ap-proximation to likelihood ratios where more than 1 dimen-sion of the data has been perturbed. These would come inthe form of linear functions of f ( x ) and ∇ x f ( x ) . It is likely,of course, that as the window-size is increased, the accuracyof our approximation will decrease as will the quality of thesampler.In all of our experiments, we found a window-size of 1 togive a considerable improvement over various baselines sowe did not explore further. We believe this is an excitingavenue for future work. D.2. Multi-Sample Variant
As mentioned, all experiments presented in the main paperuse a window size of 1 meaning only 1 dimension can bechanged at a time. In the binary case, we sample a dimension i ∼ q ( i | x ) which tells us which dimension to flip to generateour proposed update. A simple, and effective extension tothis is to simply re-sample multiple indices from this samedistribution i , . . . , i N ∼ q ( i | x ) where N is the number of draws. We then generate x (cid:48) byflipping the bit at each sampled index i n . This changes theacceptance probability to min (cid:40) exp( f ( x (cid:48) ) − f ( x )) (cid:81) Nn =1 q ( i n | x (cid:48) ) (cid:81) Nn =1 q ( i n | x ) , (cid:41) . (27)This proposal makes a larger number of approximations andassumptions but we find that in some settings it can providefaster convergence and can have reasonable acceptance rates.We demonstrate this in our RBM experiments in Figure9. We replicate Figure 3 but add the multi-sample variantdescribed above with N = 3 and N = 5 samples. We findin this case the multi-sample variant has faster convergenceand greater ESS. E. Restricted Boltzmann Machines
Restricted Boltzmann Machines define a distribution overbinary data x and latent variables h . The model is definedas: log p ( x, h ) = h T W x + b T x + c T h − log Z (28)where Z is the normalizing constant and { W, b, c } arethe model’s parameters. In this model we can efficiently Figure 9.
RBM Sampling with Gibbs-With-Gradients extensions.GWG-3 and GWG-5 are the multi-sample variant of GWG de-scribed above with n = 3 and n = 5 , respectively. marginalize out the latent variable h to obtain: log p ( x ) = log (cid:88) h p ( x, h )= log (cid:88) h exp( h T W x + b T x + c T h )= log(1 + exp( W x + c )) + b T x − log Z (29)While the joint and marginal are both unnormalized, we cansee the conditional distirbutions can be easily normalizedand take the form: p ( x | h ) = Bernoulli ( W x + c ) p ( h | x ) = Bernoulli ( W T h + b ) . We can exploit this structure to more efficiently samplefrom RBMs. We can perform Block-Gibbs updates by start-ing at some initial x , and repeatedly sample h ∼ p ( h | x ) , x ∼ p ( x | h ) . Exploiting this structure leads to much moreefficient sampling than standard Gibbs and other samplers(see Figure 3). E.1. Experimental Details
E.1. Experimental Details

We explore the performance of various approaches to sampling from an RBM trained on the MNIST dataset. The RBM has 500 hidden units (and 784 visible units). We train the model using contrastive divergence (Hinton, 2002) for 1 epoch through the dataset with a batch size of 100. We use 10 steps of MCMC sampling with the block-Gibbs sampler defined above to generate samples for each training iteration, and optimize with the Adam optimizer (Kingma & Ba, 2014).

Our first result compares samples generated by various approaches with samples generated by the block-Gibbs sampler described above. We generate a set of 500 samples using the block-Gibbs sampler run for 10,000 steps. At this length, the samples are very close to true samples from the model. Next we generate a set of 100 samples from a number of other samplers: Gibbs, Hamming Ball, and Gibbs-With-Gradients. After every MCMC transition we compute the kernelized Maximum Mean Discrepancy (Gretton et al., 2012) between the current set of samples and our "ground-truth" long-run block-Gibbs samples. We use an exponentiated average-Hamming kernel

K(x, x') = exp(−(1/D) Σ_{i=1}^{D} 1[x_i ≠ x'_i])

to compute the MMD.

The next result is the effective sample size of a test statistic for each sampler. Following Zanella (2020), our test statistic is the Hamming distance between the current sample and a random input configuration. We present a box-plot showing the median, standard deviation, and outliers over 32 chains.
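A minimal sketch of this kernel and a (biased) MMD estimate between two sample sets (PyTorch; the function names are ours, and Gretton et al. (2012) also give an unbiased variant):

```python
import torch

def hamming_kernel(X, Y):
    """K(x, x') = exp(-(1/D) * number of disagreeing dimensions)."""
    D = X.shape[1]
    disagree = (X[:, None, :] != Y[None, :, :]).float().sum(-1)  # (n, m)
    return torch.exp(-disagree / D)

def mmd2(X, Y):
    """Biased (V-statistic) estimate of the squared MMD between sets X and Y."""
    return (hamming_kernel(X, X).mean()
            + hamming_kernel(Y, Y).mean()
            - 2.0 * hamming_kernel(X, Y).mean())
```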
E.2. Additional Results

We visualize the samples after 10,000 steps of each tested sampler in Figure 10. The Gibbs-With-Gradients samples match the block-Gibbs samples much more closely. This result is reflected in the improved MMD scores seen in Figure 3 (left).
Figure 10. Sets of samples drawn from a fixed RBM with various MCMC approaches after 10,000 steps.
F. Ising Models
Ising models are unnormalized models for binary data defined as

log p(x) = x^T J x + b^T x − log Z,    (30)

where J and b are the model's parameters and Z is the normalizing constant. J determines which other variables each x_i is correlated with. If J = 0 the model becomes a factorized Bernoulli distribution. If all of the non-zero entries of J share the same value, we can pull this value out and rewrite the model as

log p(x) = θ x^T J x + b^T x − log Z,    (31)

where θ now controls how strongly each x_i is correlated with its connected variables and J controls which variables each x_i is connected to. Our lattice Ising models take this form, where J is the adjacency matrix of a cyclic 2D lattice and θ controls the strength of the connectivity.
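A minimal sketch of this lattice model, written as a differentiable function of a float-valued input so that gradients are available for GWG (PyTorch; function names are ours):

```python
import torch

def cyclic_lattice_adjacency(side):
    """Adjacency matrix J of a side x side 2D lattice with wrap-around."""
    D = side * side
    J = torch.zeros(D, D)
    for r in range(side):
        for c in range(side):
            i = r * side + c
            for dr, dc in [(0, 1), (1, 0)]:   # right and down neighbours
                j = ((r + dr) % side) * side + ((c + dc) % side)
                J[i, j] = J[j, i] = 1.0
    return J

def lattice_ising_logp(x, J, b, theta):
    """Unnormalized log-probability theta * x^T J x + b^T x (Equation 31)."""
    return theta * x @ (J @ x) + b @ x
```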
F.1. Experimental Details: Sampling

We experiment with our sampler's ability to sample from Ising models on the 2D cyclic lattice as various factors change, including the connectivity strength and the size of the lattice. We run each sampler for 100,000 steps and evaluate using the ESS of a test statistic. Following Zanella (2020), our test statistic is the Hamming distance between the current sample and a random input configuration. We report the ESS (in log-scale), averaged with standard errors over 32 random seeds. For both 10x10 and 40x40 lattice sizes, our sampler outperforms Gibbs and the Hamming Ball sampler.
F.2. Experimental Details: Training
We create Ising models with two kinds of graph structure: a cyclic 2D lattice and a random Erdos-Renyi (ER) graph. For the lattice we create models with 10x10, 25x25, and 40x40 lattices, leading to 100, 625, and 1,600 dimensional distributions. We train models over a range of connectivity strengths θ spanning negative and positive values.

For the ER graph, we create a model with 200 nodes. The ER edge probability is chosen so that each node has an average of 4 neighbors. The strength of each edge is sampled IID from a Gaussian distribution.

A dataset of 2,000 examples is generated from each model using 1,000,000 steps of Gibbs sampling. We train models using persistent contrastive divergence (Tieleman, 2008) with a buffer size of 256 examples. Models are trained with the Adam optimizer (Kingma & Ba, 2014) using a learning rate of .001 and a batch size of 256. We update the persistent samples using either Gibbs or Gibbs-With-Gradients, train models with an increasing number of MCMC steps per training iteration, and compare the results. All models are trained with an ℓ1 penalty of strength .01 to encourage sparsity; a sketch of one training update is given below.

We compare results using the root-mean-squared error between the true connectivity matrix J and the inferred connectivity matrix Ĵ.
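A minimal sketch of one such persistent-contrastive-divergence update (PyTorch; the function names are ours, and `sampler` stands in for either Gibbs or Gibbs-With-Gradients updates of the persistent buffer):

```python
import torch

def pcd_step(J_hat, b_hat, data_batch, buffer, sampler, n_steps, opt, l1=0.01):
    """One PCD update for an Ising model.

    J_hat, b_hat: learnable parameters; data_batch: (B, D) binary data;
    buffer: (M, D) persistent samples; sampler advances the buffer under the
    current (detached) parameters.
    """
    buffer = sampler(J_hat.detach(), b_hat.detach(), buffer, n_steps)

    def f(x):  # unnormalized log-prob of Equation 30, evaluated row-wise
        return ((x @ J_hat) * x).sum(-1) + x @ b_hat

    # Stochastic maximum-likelihood gradient: data term minus model term,
    # plus an l1 penalty on the couplings to encourage sparsity.
    loss = -(f(data_batch).mean() - f(buffer).mean()) + l1 * J_hat.abs().sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return buffer
```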
F.3. Additional Results: Training Ising Models

In Figure 11 we present an expanded version of Figure 6 with additional results. In these additional experiments we find that Gibbs-With-Gradients considerably outperforms training with Gibbs sampling.
Figure 11. Training Ising models. Top Left: Lattice Ising with increasing θ (dim = 50, steps = 50). Top Right: Lattice Ising with increasing dimension (fixed θ, steps = 25). Bottom Left: Lattice Ising with increasing steps (dim = 25, fixed θ). Bottom Right: Erdos-Renyi Ising with increasing steps. Values are log(RMSE) between the learned J and the true J. Gibbs-With-Gradients leads to better solutions with lower computational cost.

G. Factorial Hidden Markov Models
Factorial Hidden Markov Models (FHMMs) are a generalization of Hidden Markov Models for real-valued time-series data. The observed data y ∈ R^L is generated conditioned on a discrete latent variable x ∈ {0, 1}^{L×K}. This latent variable is drawn from the product of K independent Markov processes, and the data y_t is generated from the K-dimensional state vector x_t with additive Gaussian noise:

p(x, y) = p(y | x) p(x)
p(y | x) = ∏_{t=1}^{L} N(y_t; W x_t + b, σ²)
p(x) = p(x_1) ∏_{t=2}^{L} p(x_t | x_{t−1})
p(x_1) = ∏_{k=1}^{K} Bernoulli(x_{1k}; α_k)
p(x_{t+1} | x_t) = ∏_{k=1}^{K} Bernoulli(x_{(t+1)k}; β_k^{x_{tk}} (1 − β_k)^{1 − x_{tk}})    (32)

The posterior p(x | y) has no closed form, so we must rely on MCMC techniques to sample from it.
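A minimal sketch of this log-joint (PyTorch; the function name and parameter shapes are our assumptions):

```python
import torch
from torch.distributions import Bernoulli, Normal

def fhmm_log_joint(x, y, W, b, sigma, alpha, beta):
    """log p(x, y) for the FHMM in Equation 32.

    x: (L, K) float tensor of 0/1 latent states, y: (L,) observations,
    W: (K,) emission weights, b: scalar bias, sigma: emission noise scale,
    alpha: (K,) initial-state probabilities, beta: (K,) transition parameters.
    """
    log_py = Normal(x @ W + b, sigma).log_prob(y).sum()
    log_px = Bernoulli(alpha).log_prob(x[0]).sum()
    # p(x_{t+1,k} = 1 | x_{t,k}) = beta_k^{x_{t,k}} * (1 - beta_k)^{1 - x_{t,k}}
    p_next = beta ** x[:-1] * (1.0 - beta) ** (1.0 - x[:-1])
    log_px = log_px + Bernoulli(p_next).log_prob(x[1:]).sum()
    return log_py + log_px
```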
G.1. Experimental Details

We sample the parameters of an FHMM randomly as

W, b ∼ N(0, I),    (33)

and set σ, α_k, and β_k to fixed constants for all k. We then sample x ∼ p(x) and y ∼ p(y | x), and run all samplers for 10,000 steps to generate samples from p(x | y).

The Hamming Ball sampler (Titsias & Yau, 2017) is special for this model as it exploits the known block structure of the posterior. We use a block size of 10, and the blocks are chosen to be all 10 dimensions of the latent state at a randomly chosen time x_t. Thus, this sampler is aware of more hard-coded structure in the model than the Gibbs baseline and Gibbs-With-Gradients.

H. Potts Models for Proteins
We train the MCMC models using PCD (Tieleman, 2008) with a buffer size of 2,560. At each training iteration we sample a mini-batch of 256 examples and 256 random samples from the persistent sample buffer. These are updated using 50 steps of either Gibbs or Gibbs-With-Gradients, and the gradient estimator of Equation 12 is used to update the model parameters. We train for 10,000 iterations using the Adam optimizer (Kingma & Ba, 2014). Following Marks et al. (2011) we use block-ℓ2 regularization. This regularizer takes the form

L(J) = Σ_{ij} ||J_ij||,    (34)

where J_ij is the block of couplings between sequence positions i and j, and is added to the maximum-likelihood gradient estimator; a sketch is given below. We tested several regularization strengths and found the same value to work best for PLM, Gibbs, and Gibbs-With-Gradients.

Ground-truth couplings were extracted from an experimentally validated distance map. As is standard, we consider any pair of amino acids to be a contact if they are within 5 angstroms of each other.
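A minimal sketch of this block regularizer (PyTorch; the function name and the (L, L, q, q) coupling layout are our assumptions):

```python
import torch

def block_l2_penalty(J):
    """Block-l2 regularizer sum_ij ||J_ij|| from Equation 34.

    J: (L, L, q, q) tensor of couplings, where J[i, j] is the q x q block of
    couplings between sequence positions i and j.
    (In practice the sum is typically restricted to i < j, and a small epsilon
    can be added inside the sqrt for stable gradients.)
    """
    return J.pow(2).sum(dim=(-2, -1)).sqrt().sum()
```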
H.1. Recall on PF10018

We do not present results on PF10018 in the main body as it was used to tune hyper-parameters. For completeness, we present them here in Figure 12. As with the other proteins, MCMC-based training outperforms PLM, but by a smaller margin, and GWG and Gibbs perform comparably here. This further supports the claim that the benefits of MCMC training over PLM, and of GWG over Gibbs, set in on larger datasets.
Figure 12. Recall curves for contact prediction with Potts models.
H.2. Visualized Contacts
We visualize the inferred couplings for CADH1 HUMAN in Figure 13. We see that GWG most accurately matches the known structure, with Gibbs inferring spurious couplings and PLM missing many couplings near the diagonal.
I. Deep EBMs
I.1. Architectures
We train on two types of data: binary and categorical. For the binary datasets, the data is represented as {0, 1}^D, where D is the dimension of the data.

For the categorical datasets, each categorical variable is represented as a "one-hot" vector. Thus, for image data, each pixel is represented with a 256-dimensional vector. To deal with the very high dimensionality of this input parameterization, we first map each one-hot vector to a learned, low-dimensional embedding with a linear transformation. We map to D_p = 4 dimensions for all models tested, and then feed this (D × D_p)-dimensional input to our network. There are certainly more efficient ways to represent this data, but our intention was not to achieve state-of-the-art results on these datasets; rather, it was to demonstrate that our sampler can enable the training of energy-based models on high-dimensional discrete data, so we use the most straightforward parameterization.

Figure 13. Inferred couplings for CADH1 HUMAN. "Ground Truth" is the matrix of known distances between amino acids. All other matrices are the norms of the Potts model J_ij parameters.

The ResNet used for our EBM is identical for all datasets. The network has 8 residual blocks with 64 feature maps. Each residual block has 2 convolutional layers, and the first two residual blocks have a stride of 2. The output features are mean-pooled across the spatial dimensions and a single linear layer on top provides the energy. The Swish nonlinearity (x · σ(x)) (Ramachandran et al., 2017) is used throughout.
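A minimal sketch of this input parameterization for categorical image data (PyTorch; the module name and tensor layout are our choices):

```python
import torch
import torch.nn as nn

class CategoricalInput(nn.Module):
    """Maps one-hot categorical pixels to low-dimensional embeddings.

    Input:  (B, H, W, 256) one-hot image tensor.
    Output: (B, 4, H, W) feature map fed to the convolutional ResNet.
    """
    def __init__(self, num_values=256, embed_dim=4):
        super().__init__()
        self.proj = nn.Linear(num_values, embed_dim, bias=False)

    def forward(self, x_one_hot):
        emb = self.proj(x_one_hot)        # (B, H, W, embed_dim)
        return emb.permute(0, 3, 1, 2)    # channels-first for the conv stack
```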
I.2. Experimental Details

We trained all models with Adam (Kingma & Ba, 2014), linearly warming up the learning rate for the first 10,000 iterations. We found this warm-up necessary to help burn in the replay buffer of samples. For the larger datasets (static/dynamic MNIST, Omniglot) we use a replay buffer with 10,000 samples; for the smaller datasets (Caltech, Freyfaces, Histopathology) the buffer size is 1,000. Unlike recent work on continuous EBMs (Du & Mordatch, 2019; Grathwohl et al., 2019), we do not reinitialize the buffer samples to noise; we found that doing so resulted in unstable training and lower likelihoods.

We train all models for 50,000 iterations and use the same training/validation/testing splits as Tomczak & Welling (2018). We evaluate models every 5,000 iterations using 10,000 steps of AIS and select the model which performs best on the validation data under this procedure. Final results in Table 2 are generated from the selected models by running AIS with 300,000 intermediate distributions. We evaluate using a model whose weights are an exponential moving average of the training model's weights. This is analogous to training with "fast weights" as in Tieleman & Hinton (2009), and we find it greatly improves likelihood performance and sample quality. We use a fixed exponential-moving-average weight and did not experiment with other values.

We believe better results could be obtained with larger models or alternative architectures, but we leave this for future work.

I.2.1. Partition Function Estimation with AIS
We estimate likelihoods by estimating the partition function with Annealed Importance Sampling (AIS) (Neal, 2001). AIS underestimates the log-partition function, leading to over-estimates of the likelihood. The estimation error can be reduced by using a larger number of intermediate distributions or a more efficient MCMC sampler. Results presented in Table 2 were generated with 300,000 intermediate distributions; we chose this number as it appears to be sufficiently high for our partition-function estimate to converge. Despite this, these estimates are upper bounds and therefore should not be considered definitive proof that one model achieves higher likelihoods than another.

We anneal between our model's unnormalized log-probability f(x) and a multivariate Bernoulli or Categorical distribution, log p_n(x), fit to the training data (for binary and categorical data, respectively):

f_t(x) = β_t f(x) + (1 − β_t) log p_n(x),    (35)

where β_t is linearly annealed from 0 to 1. Alternative strategies such as sigmoid annealing could be used, but we leave this for future work.

In Figure 14 we plot the estimated likelihoods for our Caltech Silhouettes models as the number of intermediate distributions increases. Between 30,000 and 300,000 intermediate distributions the values appear to have converged, so we feel our reported numbers faithfully represent our models' performance.
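A minimal sketch of this annealing scheme and the accumulation of AIS importance weights (PyTorch; the function names and the MCMC-transition interface are ours, and f and log_p0 are assumed to map batches of states to per-example log-probabilities):

```python
import torch

def annealed_logp(f, log_p0, x, beta):
    """Intermediate log-density f_t(x) = beta * f(x) + (1 - beta) * log p_n(x)."""
    return beta * f(x) + (1.0 - beta) * log_p0(x)

def ais_log_weights(f, log_p0, sample_p0, transition, n_chains, n_dists):
    """Accumulate AIS log-importance-weights over a linear beta schedule.

    sample_p0 draws initial states from the base distribution p_n, and
    transition(x, logp_fn) applies one MCMC step (e.g. GWG) targeting logp_fn.
    log Z is then estimated as logsumexp(log_w) - log(n_chains), plus the
    (known) log-partition-function of p_n.
    """
    betas = torch.linspace(0.0, 1.0, n_dists)
    x = sample_p0(n_chains)
    log_w = torch.zeros(n_chains)
    for t in range(1, n_dists):
        log_w = log_w + (annealed_logp(f, log_p0, x, betas[t])
                         - annealed_logp(f, log_p0, x, betas[t - 1]))
        x = transition(x, lambda v, b=betas[t]: annealed_logp(f, log_p0, v, b))
    return log_w
```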
I.3. Additional Results

We present additional long-run samples from our convolutional EBM. These samples were generated using an annealed Markov chain (as described above) with Gibbs-With-Gradients as the base MCMC sampler.
Figure 14. AIS likelihood estimates as the number of intermediate distributions increases for our Caltech Silhouettes ResNet EBM. Values appear converged after roughly 30,000 intermediate distributions.

Figure 15. Static MNIST Samples
Figure 16. Dynamic MNIST Samples
Figure 17. Omniglot Samples
Figure 18. Caltech Silhouette Samples
Figure 19. Freyfaces Samples
Figure 20. Histopathology Samples
J. Preliminary Text EBM Results
There has recently been interest in learning energy-based models of text data. An EBM for text could enable non-autoregressive generation and more flexible conditioning than autoregressive models. For example, an EBM trained on language pairs p(x, y) could be used to translate in either direction without retraining and could be structured as log p(x, y) = f_θ(x)^T g_φ(y) − log Z, so that each language component could be trained separately. We also have much more architectural freedom when specifying EBMs, meaning f_θ is free to be a CNN, RNN, transformer, or MLP.

A few works have had success training and applying text EBMs. Deng et al. (2020) find that EBMs can improve the generative modeling performance of large-scale transformer language models, and He et al. (2021) find that Joint Energy-Based Models (Grathwohl et al., 2019) can improve the calibration of text classifiers. Because of the discrete structure of text data, both of these works train using Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010) with a pretrained autoregressive language model as the noise distribution. NCE requires a noise distribution which can be sampled from and which allows exact likelihood computation, so these approaches rely on, and are limited by, the quality of these autoregressive models.

Ideally, we could train a text EBM on its own, without an auxiliary model. One way to do this is to use the gradient estimator of Equation 12, but the MCMC sampling task is very challenging. Text models typically have a vocabulary above 10,000 words, so the size of the discrete sample space is tremendous. Further, as noted in Section 8, applying Gibbs sampling to a model like this would require evaluating the energy function over 10,000 times to perform a single step of sampling!

We believe Gibbs-With-Gradients can provide an avenue to train and sample from these kinds of models. As a preliminary experiment we train non-autoregressive language models on a shortened version of the Penn Tree Bank dataset (Taylor et al., 2003), a dataset of short sentences with a 10,000-word vocabulary. We remove all sentences with more than 20 words and pad all shorter sentences with an "end of sequence" token. We feel this simplified setting is sufficient for a proof of concept, as the configuration space is very large and Gibbs sampling is not feasible.

Our model consists of a bidirectional LSTM (Gers et al., 1999) with 512 hidden units. We project the 10,000 words to an embedding of size 256 with a learned mapping. To compute the energy, we take the last hidden state from each direction and concatenate them into a 1024-dimensional vector. We project this to 512 dimensions with a linear layer, apply a Swish nonlinearity, and then map to 1 dimension with another linear layer; a sketch is given below.
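A minimal sketch of an energy function of this form (PyTorch; the class name, tensor layout, and use of nn.SiLU for the Swish nonlinearity are our choices):

```python
import torch
import torch.nn as nn

class TextEnergy(nn.Module):
    """Bidirectional-LSTM energy function over one-hot token sequences.

    The input stays one-hot (rather than integer ids) so that gradients with
    respect to the token choices are available for GWG proposals.
    """
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Linear(vocab_size, embed_dim, bias=False)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Sequential(
            nn.Linear(2 * hidden_dim, 512), nn.SiLU(), nn.Linear(512, 1))

    def forward(self, x_one_hot):                       # (B, T, vocab_size)
        _, (h_n, _) = self.lstm(self.embed(x_one_hot))  # h_n: (2, B, hidden_dim)
        summary = torch.cat([h_n[0], h_n[1]], dim=-1)   # final state of each direction
        return self.out(summary).squeeze(-1)            # (B,) unnormalized log-prob
```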
We train using PCD with a buffer size of 1,000 and 20 steps of MCMC with Gibbs-With-Gradients to update the samples at every training iteration. Otherwise, training is identical to our image EBMs in Section 8. We compare with a simple autoregressive language model based on an LSTM with 512 hidden units and a learned word embedding of size 256.

We find the autoregressive model slightly outperforms the EBM in test-set log-likelihood. As reference points, we also evaluated a uniform distribution over tokens and a Categorical distribution fit to the training data.

While we are aware these are far from state-of-the-art language modelling results, we believe they demonstrate that Gibbs-With-Gradients can enable MCMC-trained EBMs to successfully model text data with large vocabulary sizes. At every step, the sampler has 10,000 × 20 = 200,000