GumBolt: Extending Gumbel trick to Boltzmann priors
Amir H. Khoshaman∗
D-Wave Systems Inc.
[email protected]

Mohammad H. Amin
D-Wave Systems Inc.
Simon Fraser University
[email protected]
Abstract
Boltzmann machines (BMs) are appealing candidates for powerful priors in variational autoencoders (VAEs), as they are capable of capturing nontrivial and multi-modal distributions over discrete variables. However, the non-differentiability of the discrete units prohibits using the reparameterization trick, which is essential for low-noise backpropagation. The Gumbel trick resolves this problem in a consistent way by relaxing the variables and distributions, but it is incompatible with BM priors. Here, we propose the GumBolt, a model that extends the Gumbel trick to BM priors in VAEs. GumBolt is significantly simpler than the recently proposed methods with BM priors and outperforms them by a considerable margin. It achieves state-of-the-art performance on the permutation-invariant MNIST and OMNIGLOT datasets in the scope of models with only discrete latent variables. Moreover, the performance can be further improved by allowing multi-sampled (importance-weighted) estimation of the log-likelihood during training, which was not possible with previous models.
1 Introduction

Variational autoencoders (VAEs) are generative models with the useful feature of learning representations of input data in their latent space. A VAE comprises a prior (the probability distribution of the latent space), a decoder, and an encoder (also referred to as the approximating posterior or the inference network). There have been efforts devoted to making each of these components more powerful. The decoder can be made richer by using autoregressive methods such as pixelCNNs and pixelRNNs (Oord et al., 2016) and MADEs (Germain et al., 2015). However, VAEs tend to ignore the latent code (in the sense described by Yeung et al. (2017)) in the presence of powerful decoders (Chen et al., 2016; Gulrajani et al., 2016; Goyal et al., 2017). There are also a myriad of works strengthening the encoder distribution (Kingma et al., 2016; Rezende and Mohamed, 2015; Salimans et al., 2015). Improving the priors is manifestly appealing, since it directly translates into a more powerful generative model. Moreover, a rich structure in the latent space is one of the main purposes of VAEs. Chen et al. (2016) observed that a more powerful autoregressive prior with a simple encoder is commensurate with a powerful inverse-autoregressive approximating posterior with a simple prior.

Boltzmann machines (BMs) are known to represent intractable and multi-modal distributions (Le Roux and Bengio, 2008), which makes them ideal priors for VAEs, since they can lead to a more expressive generative model. However, BMs contain discrete variables, which are incompatible with the reparameterization trick required for efficient propagation of gradients through stochastic units. Discrete latent variables are desirable in many applications, such as semi-supervised learning (Kingma et al., 2014), semi-supervised generation (Maaløe et al., 2017), and hard attention models (Serrà et al., 2018; Gregor et al., 2015), to name a few. Many operations, such as choosing between models or variables, are naturally expressed using discrete variables (Yeung et al., 2017).

Rolfe (2016) proposed the first model to use a BM in the prior of a VAE, i.e., a discrete VAE (dVAE). The main idea is to introduce auxiliary continuous variables (Fig. 1(a)) for each discrete variable through a "smoothing distribution". The discrete variables are marginalized out of the autoencoding term by imposing certain constraints on the form of the relaxing distribution. However, the discrete variables cannot be marginalized out of the remaining term in the objective (the KL term). Their proposed approach relies on properties of the smoothing distribution to evaluate these terms. In Appendix B, we show that this approach is equivalent to REINFORCE when dealing with some parts of the KL term (i.e., the cross-entropy term). Vahdat et al. (2018) proposed an improved version, dVAE++, that uses a modified distribution for the smoothing variables but has the same form for the autoencoding part (see Sec. 2.1). The qVAE (Khoshaman et al., 2018) expanded the dVAE to operate with a quantum Boltzmann machine (QBM) prior (Amin et al., 2016). A major shortcoming of these methods is that they are unable to use multi-sampled (importance-weighted) estimates of the objective function during training, which can improve performance.

To use the reparameterization trick directly with discrete variables (without marginalization), a continuous and differentiable proxy is required. The Gumbel (reparameterization) trick, independently developed by Jang et al. (2016) and Maddison et al. (2016), achieves this by relaxing discrete distributions. However, BMs, and in general discrete Markov random fields (MRFs), are incompatible with this method.

∗ Currently at Borealis AI.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Relaxation of the discrete variables (rather than the distributions) for the case of a factorial categorical prior (Gumbel-Softmax) was also investigated in both works. It is not obvious whether such relaxation of discrete variables would work with BM priors.

The contributions of this work are as follows: we propose the GumBolt, which extends the Gumbel trick to BM and MRF priors and is significantly simpler than previous models that marginalize discrete variables. We show that BMs are compatible with relaxation of discrete variables (rather than distributions) in the Gumbel trick. We propose an objective using such relaxation and show that the main limitations of previous models with BM priors can be circumvented; we do not need marginalization of the discrete variables, and we can have an importance-weighted objective. GumBolt considerably outperforms previous works in a wide series of experiments on the permutation-invariant MNIST and OMNIGLOT datasets, even without the importance-weighted objective (Sec. 5). Increasing the number of importance weights can further improve the performance. We obtain state-of-the-art results on these datasets among models with only discrete latent variables.
Consider a generative model involving observable variables x and latent variables z. The joint probability distribution can be decomposed as p_θ(x, z) = p_θ(z) p_θ(x|z). The first and second terms on the right-hand side are the prior and decoder distributions, respectively, which are parametrized by θ. Calculating the marginal p_θ(x) involves performing intractable, high-dimensional sums or integrals. Assume an element x of the dataset X, comprising N independent samples from an unknown underlying distribution, is given. VAEs operate by introducing a family of approximating posteriors q_φ(z|x) and maximizing a lower bound (also known as the ELBO), L(x; θ, φ), on the log-likelihood log p_θ(x) (Kingma and Welling, 2013):

log p_θ(x) ≥ L(x; θ, φ) = E_{q_φ(z|x)}[ log ( p_θ(x, z) / q_φ(z|x) ) ] = E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p_θ(z) ),    (1)

where the first term on the right-hand side is the autoencoding term and D_KL is the Kullback-Leibler divergence (Bishop, 2011). In VAEs, the parameters of the distributions (such as the means in the case of Bernoulli variables) are calculated using neural nets. To backpropagate through the latent variables z, the reparameterization trick is used; z is reparametrized as a deterministic function f(φ, x, ρ), where the stochasticity of z is relegated to another random variable, ρ, drawn from a distribution that does not depend on φ. Note that it is impossible to backpropagate if z is discrete, since f is not differentiable.

2.2 Gumbel trick

The non-differentiability of f can be resolved by finding a relaxed proxy for the discrete variables. Assume a binary unit, z, with mean q̄ and logit l; i.e., p(z = 1) = q̄ = σ(l), where σ(l) ≡ 1/(1 + e^{−l}) is the sigmoid function.
Since σ(l) is a monotonic function, we can reparametrize z as z = H(ρ − (1 − q̄)) = H(l + σ^{−1}(ρ)) (Maddison et al., 2016), where H is the Heaviside function, ρ ∼ U, with U being a uniform distribution in the range [0, 1], and σ^{−1}(ρ) = log(ρ) − log(1 − ρ) is the inverse sigmoid, or logit, function. This transformation results in a non-differentiable reparameterization, but it can be smoothed when the Heaviside function is replaced by a sigmoid function with a temperature τ, i.e., H(·) → σ(·/τ). Thus, we introduce the continuous proxy (Maddison et al., 2016):

f(φ, x, ρ) = ζ = σ( ( l(φ, x) + σ^{−1}(ρ) ) / τ ).    (2)

The continuous ζ is differentiable and is equal to the discrete z in the limit τ → 0.

Our goal is to use a BM as the prior. A BM is a probabilistic energy model described by

p_θ(z) = e^{−E_θ(z)} / Z_θ,    (3)

where E_θ(z) is the energy function and Z_θ = Σ_{z} e^{−E_θ(z)} is the partition function; z is a vector of binary variables. Since finding p_θ(z) is typically intractable, it is common to use sampling techniques to estimate the gradients. To facilitate MCMC sampling using the Gibbs-block technique, the connectivity of the latent variables is assumed to be bipartite; i.e., z is decomposed as [z_1, z_2], giving

−E_θ(z) = a^T z_1 + b^T z_2 + z_1^T W z_2,    (4)

where a, b, and W are the biases (on z_1 and z_2, respectively) and the weights. This bipartite structure is known as the restricted Boltzmann machine (RBM).

(a) d(q)VAE(++)   (b) Concrete, Gumbel-Softmax   (c) GumBolt

Figure 1: Schematic of the discussed models with discrete variables in their latent space. The dashed red and solid blue arrows represent the inference network and the generative model, respectively. (a) dVAE, qVAE (Khoshaman et al., 2018) and dVAE++ have the same structure.
They involve auxiliary continuous variables, ζ, for each discrete variable, z, provided by the same conditional probability distribution, r(ζ|z), in both the generative and approximating posterior networks. (b) Concrete and Gumbel-Softmax apply the Gumbel trick to the discrete variables to obtain the ζs that appear in both the inference and generative models. (c) GumBolt only involves discrete variables in the generative model, and the relaxed ζs are used in the inference model during training. Note that during evaluation, the temperature is set to zero, leading to ζ = z.

The importance-weighted, or multi-sampled, objective of a VAE with a BM prior can be written as

log p_θ(x) ≥ L_k(x; θ, φ) = E_{∏_i q_φ(z_i|x)}[ log (1/k) Σ_{i=1}^{k} p_θ(z_i) p_θ(x|z_i) / q_φ(z_i|x) ] = E_{∏_i q_φ(z_i|x)}[ log (1/k) Σ_{i=1}^{k} e^{−E_θ(z_i)} p_θ(x|z_i) / q_φ(z_i|x) ] − log Z_θ,    (5)

where k is the number of samples, or importance weights, over which the Monte Carlo objective is calculated (Mnih and Rezende, 2016), and the z_i are independent vectors sampled from q_φ(z_i|x). Note that we have taken Z_θ out of the argument of the expectation value, since it is independent of z. The partition function is intractable, but its derivative can be estimated using sampling:

∇_θ log Z_θ = ∇_θ log Σ_{z} e^{−E_θ(z)} = − ( Σ_{z} ∇_θ E_θ(z) e^{−E_θ(z)} ) / ( Σ_{z} e^{−E_θ(z)} ) = − E_{p_θ(z)}[ ∇_θ E_θ(z) ].    (6)

Here, Σ_{z} involves summing over all possible configurations of the binary vector z. The objective L_k(x; θ, φ) cannot be used for training, since it involves non-differentiable discrete variables.
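Equation (6) is easy to sanity-check on a BM small enough to enumerate exhaustively. The following numpy sketch (illustrative only; the sizes and random parameters are arbitrary, and a real BM would be sampled rather than enumerated) verifies that the gradient of log Z_θ with respect to the biases a equals the model expectation E_{p_θ(z)}[z_1]:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n1, n2 = 2, 3  # tiny bipartite BM: 2**(n1 + n2) = 32 states, enumerable
a, b = rng.normal(size=n1), rng.normal(size=n2)
W = rng.normal(size=(n1, n2))

states = [(np.array(s1, float), np.array(s2, float))
          for s1 in product([0, 1], repeat=n1)
          for s2 in product([0, 1], repeat=n2)]

def log_Z(a_vec):
    # Z = sum over all binary configurations of exp(-E), with
    # -E(z) = a.z1 + b.z2 + z1.W.z2, as in Eq. (4)
    return np.log(sum(np.exp(a_vec @ z1 + b @ z2 + z1 @ W @ z2)
                      for z1, z2 in states))

# Right-hand side of Eq. (6) for theta = a: since dE/da = -z1,
# grad_a log Z = -E_p[dE/da] = E_p[z1].
weights = np.array([np.exp(a @ z1 + b @ z2 + z1 @ W @ z2) for z1, z2 in states])
p = weights / weights.sum()
model_expectation = sum(pi * z1 for pi, (z1, _) in zip(p, states))

# Left-hand side via central finite differences on log Z.
eps = 1e-6
for i in range(n1):
    da = np.zeros(n1)
    da[i] = eps
    finite_diff = (log_Z(a + da) - log_Z(a - da)) / (2 * eps)
    assert abs(finite_diff - model_expectation[i]) < 1e-6
```

In practice the expectation on the right-hand side is estimated with MCMC samples from the BM rather than by enumeration, which is exactly what makes Eq. (6) usable for large models.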
This non-differentiability can be resolved by relaxing the distributions:

log p_θ(x) ≥ L̃_k(x; θ, φ) = E_{∏_i q_φ(ζ_i|x)}[ log (1/k) Σ_{i=1}^{k} p_θ(ζ_i) p_θ(x|ζ_i) / q_φ(ζ_i|x) ] = E_{∏_i q_φ(ζ_i|x)}[ log (1/k) Σ_{i=1}^{k} e^{−E_θ(ζ_i)} p_θ(x|ζ_i) / q_φ(ζ_i|x) ] − log Z̃_θ.

Here, ζ_i is a continuous variable sampled from Eq. 2, which is consistent with the Gumbel probability q_φ(ζ_i|x) defined in (Maddison et al., 2016), and p_θ(ζ) ≡ e^{−E_θ(ζ)} / Z̃_θ, where Z̃_θ ≡ ∫ dζ e^{−E_θ(ζ)}. The expectation distribution is the joint distribution over the independent ζ_i samples. Notice that log Z̃_θ is different from log Z_θ; therefore, its derivatives cannot be estimated using discrete samples from a BM, making this method inapplicable to BM priors. The derivatives could be estimated using samples from a continuous distribution, which is very different from the BM distribution. Analytical calculation of the expectations, suggested for the Bernoulli prior by Maddison et al. (2016), is also infeasible for BMs, since it requires exhaustively summing over all possible configurations of the binary units. To replace log Z̃_θ with log Z_θ, we introduce a proxy probability distribution:

˘p_θ(ζ) ≡ e^{−E_θ(ζ)} / Z_θ.    (7)

Note that ˘p_θ(ζ) is not a true (normalized) probability density function, but ˘p_θ(ζ) → p_θ(z) as τ → 0. Now consider the following theorems (see Appendix A for proofs):

Theorem 1. For any polynomial function E_θ(z) of n_z binary variables z ∈ {0, 1}^{n_z}, the extrema of the relaxed function E_θ(ζ) with ζ ∈ [0, 1]^{n_z} reside on the vertices of the hypercube, i.e., ζ_extr ∈ {0, 1}^{n_z}.

Theorem 2. For any polynomial function E_θ(z) of n_z binary variables z ∈ {0, 1}^{n_z}, the proxy probability ˘p_θ(ζ) ≡ e^{−E_θ(ζ)} / Z_θ, with ζ ∈ [0, 1]^{n_z}, is a lower bound to the true probability p_θ(ζ) ≡ e^{−E_θ(ζ)} / Z̃_θ, i.e., ˘p_θ(ζ) ≤ p_θ(ζ), where Z_θ ≡ Σ_{z} e^{−E_θ(z)} and Z̃_θ ≡ ∫ dζ e^{−E_θ(ζ)}.

Therefore, according to Theorem 2, replacing p_θ(ζ) with ˘p_θ(ζ), we obtain a lower bound on L̃_k(x; θ, φ):

˘L_k(x; θ, φ) = E_{∏_i q_φ(ζ_i|x)}[ log (1/k) Σ_{i=1}^{k} e^{−E_θ(ζ_i)} p_θ(x|ζ_i) / q_φ(ζ_i|x) ] − log Z_θ ≤ L̃_k(x; θ, φ).    (8)

This allows the reparameterization trick, while making it possible to use sampling to estimate the gradients. The structure of our model with a BM prior is portrayed in Figure 1(c), where both continuous and discrete variables are used. Notice that in the limit τ → 0, ˘p_θ(ζ_i) becomes a probability mass function (pmf), p_θ(z_i), while q_φ(ζ_i|x) remains a probability density function (pdf). To resolve this inconsistency, we replace q_φ(ζ_i|x) with the Bernoulli pmf ˘q_φ(ζ_i|x), where log ˘q_φ(ζ_i|x) = ζ_i log q̄_i + (1 − ζ_i) log(1 − q̄_i), which approaches q_φ(z_i|x) when τ → 0. The training objective

F_k(x; θ, φ) = E_{∏_i q_φ(ζ_i|x)}[ log (1/k) Σ_{i=1}^{k} e^{−E_θ(ζ_i)} p_θ(x|ζ_i) / ˘q_φ(ζ_i|x) ] − log Z_θ    (9)

becomes L_k(x; θ, φ) at τ = 0, as desired (see Fig. 2(a) for the relationship among the different objectives). This is the analog of the Gumbel-Softmax trick (Jang et al., 2016) applied to BMs. During training, τ should be kept small for the continuous variables to stay close to the discrete variables, a common practice with Gumbel relaxation (Tucker et al., 2017). For evaluation, τ is set to zero, leading to an unbiased evaluation of the objective function.
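Two properties invoked above can be checked numerically in a few lines (an illustrative numpy sketch with arbitrary random parameters, not the experimental code): first, the relaxed ζ of Eq. (2) collapses onto the discrete z as τ → 0; second, per Theorem 1, the relaxed negative energy of Eq. (4), being affine in each coordinate of ζ, attains its extrema on the binary vertices of the hypercube:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def relaxed_unit(logit, rho, tau):
    # Eq. (2): zeta = sigma((logit + sigma^{-1}(rho)) / tau), rho ~ U[0, 1];
    # the sigmoid is written via tanh for numerical stability at small tau.
    u = (logit + np.log(rho) - np.log1p(-rho)) / tau
    return 0.5 * (1.0 + np.tanh(0.5 * u))

# (i) zeta -> z = H(logit + sigma^{-1}(rho)) in the limit tau -> 0.
logit, rho = 0.8, rng.uniform(size=100_000)
z = (logit + np.log(rho) - np.log1p(-rho) > 0).astype(float)
zeta = relaxed_unit(logit, rho, tau=1e-3)
assert np.mean(np.abs(zeta - z)) < 5e-3

# (ii) Theorem 1 for the RBM energy: -E(zeta) = a.z1 + b.z2 + z1.W.z2 is
# multilinear, so no interior point of [0, 1]^n beats the binary vertices.
n1 = n2 = 2
a, b = rng.normal(size=n1), rng.normal(size=n2)
W = rng.normal(size=(n1, n2))
neg_E = lambda v: a @ v[:n1] + b @ v[n1:] + v[:n1] @ W @ v[n1:]

grid_vals = [neg_E(np.array(p))
             for p in product(np.linspace(0, 1, 11), repeat=n1 + n2)]
vert_vals = [neg_E(np.array(v, float))
             for v in product([0.0, 1.0], repeat=n1 + n2)]
assert max(vert_vals) >= max(grid_vals) - 1e-9
assert min(vert_vals) <= min(grid_vals) + 1e-9
```

The second check is what licenses evaluating the relaxed energy in ˘p_θ(ζ): pushing ζ toward configurations of high ˘p_θ also pushes it toward the vertices, where the relaxed and discrete objectives agree.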
For generation, discrete samples from the BM are directly fed into the decoder to obtain the probabilities of each input feature.

The term in the discrete objective that involves the prior distribution is E_{q_φ(z|x)}[log p_θ(z)]. When z is replaced with ζ, there is no guarantee that the parameters that optimize E_{q_φ(ζ|x)}[log p_θ(ζ)] would also optimize the discrete version. This happens naturally for the Bernoulli distribution used in (Jang et al., 2016), since the extrema of the prior term in the objective, log p(ζ) = ζ log p̄ + (1 − ζ) log(1 − p̄), occur at the boundaries (i.e., when ζ = 1 or ζ = 0). This means that throughout the training, the values of ζ are pushed towards the boundary points, consistent with the discrete objective. In the case of a BM prior, according to Theorem 1 (proved in Appendix A), the extrema of log p_θ(ζ) ∝ −E_θ(ζ) also occur on the boundaries; this shows that having a BM rather than a factorial Bernoulli distribution does not exacerbate the training of GumBolt.

Several approaches have been devised to calculate the derivative of the expectation of a function with respect to the parameters of a Bernoulli distribution, I ≡ ∇_φ E_{q_φ(z)}[f(z)]:

1. Analytical method: for simple functions, e.g., f(z) = z, one can analytically calculate the expectation and obtain I = ∇_φ E_{q_φ(z)}[z] = ∇_φ q̄, where q̄ is the mean of the Bernoulli distribution. This is an unbiased estimator with zero variance, but it can only be applied to very simple functions. This approach is frequently used in semi-supervised learning (Kingma et al., 2014) by summing over the different categories.
2. Straight-through method: continuous proxies are used in backpropagation to evaluate derivatives, but discrete units are used in forward propagation (Bengio et al., 2013; Raiko et al., 2014).
3. REINFORCE trick: I = E_{q_φ(z)}[f(z) ∇_φ log q_φ(z)]; it has high variance, which can be reduced by variance-reduction techniques (Williams, 1992).
4. Reparameterization trick: this method, as delineated in Secs. 2.1-2.2, is biased except in the limit where the proxies approach the discrete variables.
5. Marginalization: if possible, one can marginalize the discrete variables out of some parts of the loss function (Rolfe, 2016).

NVIL (Mnih and Gregor, 2014) and its importance-weighted successor, VIMCO (Mnih and Rezende, 2016), use (3) with input-dependent signals obtained from neural networks, subtracted as baselines to reduce the variance of the estimator. REBAR (Tucker et al., 2017) and its generalization, RELAX (Grathwohl et al., 2017), use (3) and employ (4) in their control variates, obtained using the Gumbel trick. DARN (Gregor et al., 2013) and MuProp (Gu et al., 2015) apply a Taylor expansion of the function f(z) to synthesize baselines. dVAE and dVAE++ (Fig. 1(a)), which are the only works with BM priors, operate primarily based on (5) in their autoencoding term and use a combination of (1-4) for their KL term. In Appendix B, we show that dVAE has elements of REINFORCE in calculating the derivative of the KL term. Our approach, GumBolt, exploits (4), and does not require marginalizing out the discrete units.

5 Experiments

In order to explore the effectiveness of the GumBolt, we present the results of a wide set of experiments conducted on standard feed-forward structures that have been used to study models with discrete latent variables (Maddison et al., 2016; Tucker et al., 2017; Vahdat et al., 2018). First, we evaluate GumBolt against the dVAE and dVAE++ baselines, all in the same framework and structure. We also demonstrate empirically that the GumBolt objective, Eq. 9, faithfully follows the non-differentiable discrete objective throughout the training. We then comment on the relation between our model and other models that involve discrete variables. We also gauge the performance advantage GumBolt obtains from the BM by removing the couplings of the BM and re-evaluating the model.

Table 1: Test-set log-likelihood of the GumBolt compared against dVAE and dVAE++. k represents the number of samples used to calculate the objective during training. Note that dVAE and dVAE++ are only consistent with k = 1. See the main text for more details.

                      dVAE      dVAE++    GumBolt
                      k = 1     k = 1     k = 1     k = 5     k = 20
MNIST       −
            ∼
            − −
            ∼ ∼
OMNIGLOT    −
            ∼
            − −
            ∼ ∼
We compare the models on the statically binarized MNIST (Salakhutdinov and Murray, 2008) and OMNIGLOT (Lake et al., 2015) datasets, with the usual compartmentalization into training, validation, and test sets. The multi-sample estimates of the log-likelihood (Burda et al., 2015) of the models are reported in Table 1. The structures used are the same as those of (Vahdat et al., 2018), which were in turn adopted from (Tucker et al., 2017) and (Maddison et al., 2016). We performed experiments with dVAE, dVAE++, and GumBolt on the same structure, and set the temperature to zero during evaluation (the results reported in (Vahdat et al., 2018) are calculated using non-zero temperatures). The inference network is chosen to be either factorial or to have two hierarchies (Fig. 1(c)). In the case of two hierarchies, we have q_φ(z|x) = q_φ(z_1, z_2|x) = q_φ(z_1|x) q_φ(z_2|z_1, x), where z = [z_1, z_2].

The meanings of the symbols in Table 1 are as follows: − and ∼ represent linear and nonlinear layers in the encoder and decoder neural networks, respectively. The number of stochastic layers (hierarchies) in the encoder is equal to the number of symbols, and the dimensionality of the latent space is a fixed multiple of the number of symbols; e.g., ∼ ∼ means two stochastic layers (just as in Fig. 1(c)), with nonlinear hidden layers of deterministic units in the encoder. Each stochastic layer in the encoder network has the same dimensionality; the generative network is an RBM, larger for the ∼ ∼ and − − structures than for the − and ∼ structures. Note that in the case of − −, only one layer of deterministic units is used in each of the two hierarchies. The decoder network receives the samples from the RBM and probabilistically maps them into the input space using one or two layers of deterministic units. Since the RBM has a bipartite structure, our model has two stochastic layers in the generative model.
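The two-hierarchy factorization q_φ(z|x) = q_φ(z_1|x) q_φ(z_2|z_1, x) can be sketched as follows (a minimal numpy illustration with made-up layer sizes and untrained random weights, not the experimental architecture; during training, relaxed units ζ per Eq. (2) stand in for z):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_z = 784, 100  # made-up sizes, for illustration only

def affine(n_in, n_out):
    # untrained random affine layer standing in for an encoder network
    return rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out)

W1, c1 = affine(dim_x, dim_z)          # produces logits of q(z1 | x)
W2, c2 = affine(dim_x + dim_z, dim_z)  # produces logits of q(z2 | z1, x)

def relaxed(logits, tau=0.1):
    # Eq. (2): relaxed Bernoulli samples obtained from the logits
    rho = rng.uniform(size=logits.shape)
    return 1.0 / (1.0 + np.exp(-(logits + np.log(rho) - np.log1p(-rho)) / tau))

x = rng.binomial(1, 0.5, size=dim_x).astype(float)     # a dummy binary input
zeta1 = relaxed(x @ W1 + c1)                           # first stochastic layer
zeta2 = relaxed(np.concatenate([x, zeta1]) @ W2 + c2)  # second, conditioned on zeta1
latent = np.concatenate([zeta1, zeta2])  # [z1, z2], fed to the RBM energy / decoder

assert latent.shape == (2 * dim_z,)
assert np.all((latent >= 0.0) & (latent <= 1.0))
```

At evaluation time (τ = 0), the `relaxed` step would be replaced by hard thresholding, so the latent vector becomes exactly binary.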
The chosen hyper-parameters are as follows. Parameter updates were carried out using the ADAM algorithm (Kingma and Ba, 2014) with the default settings and a fixed batch size. The initial learning rate was subsequently reduced stepwise over the course of training. KL annealing (Sønderby et al., 2016) was applied via a linear schedule during the early part of training. The value of the temperature, τ, was fixed across all the experiments involving GumBolt, and was cross-validated separately for dVAE and for dVAE++ on the MNIST and OMNIGLOT datasets. The GumBolt shows the same average performance over a range of temperatures. The reported results are averages over repeated runs of each experiment; the standard deviations are small in all cases, and we avoid presenting them individually to keep the table less cluttered. We used the batch-normalization algorithm (Ioffe and Szegedy, 2015) along with tanh nonlinearities. Sampling the RBM was done by performing Gibbs updates for every mini-batch, in accordance with our baselines, using persistent contrastive divergence (PCD) (Tieleman, 2008); we have observed that reducing the number of PCD steps deteriorates the average performance of our best model on the MNIST dataset. In order to estimate the log-partition function, log Z_θ, a GPU implementation of the parallel-tempering algorithm with bridge sampling was used (Desjardins et al., 2010; Bennett, 1976; Shirts and Chodera, 2008), with parameters chosen to ensure that the variance in log Z_θ is less than 0.01: burn-in steps were followed by sweeps over several runs, with a pilot run to determine the inverse temperatures (such that the replica-exchange rates are balanced).

We underscore several important points regarding Table 1. First, when one sample is used in the training objective (k = 1), GumBolt outperforms dVAE and dVAE++ in all cases.
This can be due to the efficient use of the reparameterization trick and the absence of REINFORCE elements in the structure of GumBolt, as opposed to dVAE (Appendix B). Second, the previous models do not apply when k > 1, whereas GumBolt allows importance-weighted objectives according to Eq. 9; we see that in all cases, adding more samples to the training objective enhances the performance of the model.

Fig. 2(b) depicts the k = 20 estimates of the GumBolt and discrete objectives on the training and validation sets during training. It can be seen that over-fitting does not occur, since all the objectives improve throughout the training. Also, note that the differentiable GumBolt proxy closely follows the non-differentiable discrete objective. The kinks in the learning curves are caused by our stepwise change of the learning rate and are not an artifact of the model.

Figure 2: (a) Relationship between the different objectives; the functional dependence on θ and φ has been suppressed for brevity. (b) The values of the discrete (L) and GumBolt (F) objectives (with k = 20 for all objectives) throughout the training on the training and validation sets (of the MNIST dataset) for a GumBolt with the ∼ ∼ structure. The subscripts "val" and "tr" correspond to the validation and training sets, respectively. The abrupt changes are caused by the stepwise annealing of the learning rate. This figure signifies that the differentiable GumBolt objective faithfully follows the non-differentiable discrete objective, with no overfitting caused by following a wrong objective.
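The variance penalty of REINFORCE-style terms mentioned above can be seen on the simplest possible integrand (a hypothetical numpy sketch, unrelated to the experimental code): for f(z) = z and a single Bernoulli unit with logit l, the analytical gradient, method (1) of Sec. 4, is q̄(1 − q̄), and REINFORCE, method (3), matches it only on average, with sizable per-sample noise:

```python
import numpy as np

rng = np.random.default_rng(0)
l = 0.3
q_bar = 1.0 / (1.0 + np.exp(-l))  # Bernoulli mean, sigma(l)

# Method 1 (analytical): d/dl E_q[z] = d sigma(l) / dl = q_bar * (1 - q_bar).
analytic = q_bar * (1.0 - q_bar)

# Method 3 (REINFORCE): E_q[f(z) * d log q(z)/dl]; for a Bernoulli(sigma(l))
# unit, d log q(z)/dl = z - q_bar.
z = (rng.uniform(size=200_000) < q_bar).astype(float)
per_sample = z * (z - q_bar)      # single-sample REINFORCE estimates
reinforce = per_sample.mean()

assert abs(reinforce - analytic) < 5e-3        # unbiased: agrees on average
assert per_sample.std() > 0.5 * analytic       # but individual samples are noisy
```

The per-sample standard deviation is comparable to the gradient itself, which is why REINFORCE-based estimators rely on control variates and baselines, and why avoiding such terms can help GumBolt.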
We did not use early stopping in our experiments.

5.2 Comparison with other discrete models and the importance of powerful priors

If the BM prior is replaced with a factorial Bernoulli distribution, GumBolt transforms into CONCRETE (Maddison et al., 2016) (when continuous variables are used inside discrete pdfs) and Gumbel-Softmax (Jang et al., 2016). This can be achieved by setting the couplings (W) of the BM to zero and keeping the biases. Since the performance of CONCRETE and Gumbel-Softmax has been extensively compared against other models (Maddison et al., 2016; Jang et al., 2016; Tucker et al., 2017; Grathwohl et al., 2017), we do not repeat these experiments here; we note, however, that CONCRETE performs favorably relative to other discrete latent-variable models in most cases.

To gauge the contribution of the BM prior, we also train GumBolt with the coupling weights set to zero throughout the training (denoted by GumBolt-nW); the results are reported in Table 2. The GumBolt with couplings significantly outperforms GumBolt-nW. It was shown in (Vahdat et al., 2018) that dVAE and dVAE++ outperform other models with discrete latent variables (REBAR, RELAX, VIMCO, CONCRETE, and Gumbel-Softmax) on the same structure. By outperforming the previous models with BM priors, our model achieves state-of-the-art performance in the scope of models with discrete latent variables.

Table 2: Performance of GumBolt (test-set log-likelihood) in the presence and absence of coupling weights. -nW in the second column signifies that the elements of the coupling matrix W are set to zero throughout the training, rather than just during evaluation. Removing the weights significantly degrades the performance of GumBolt.

                      GumBolt (k = 20)    GumBolt-nW (k = 20)
MNIST       −
            ∼
            ∼ ∼
OMNIGLOT    −
            ∼
            ∼ ∼

Another important question is whether some of the improved performance in the presence of BMs can be salvaged in GumBolt-nW by having more powerful neural nets in the decoder. We observed that making the decoder's neural nets wider and deeper does not improve the performance of GumBolt-nW.
This predictably suggests that the increased probabilistic capability of the prior cannot be obtained by simply having a more deterministically powerful decoder.

6 Conclusion

In this work, we have proposed the GumBolt, which extends the Gumbel trick to Markov random fields and BMs. We have shown that this approach is effective and that it outperforms the other models that use BMs in their priors across a wide range of structures. GumBolt is much simpler than previous models, which require marginalization of the discrete variables, and it achieves state-of-the-art performance on the MNIST and OMNIGLOT datasets in the context of models with only discrete latent variables.
References
Amin, M. H., Andriyash, E., Rolfe, J., Kulchytskyy, B., and Melko, R. (2016). Quantum Boltzmann machine.

Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Bennett, C. H. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2):245–268.

Bishop, C. M. (2011). Pattern Recognition and Machine Learning. Springer, New York, 1st ed. 2006, corr. 2nd printing 2011 edition.

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.

Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. (2016). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.

Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 145–152.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). MADE: masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 881–889.

Goyal, A. G. A. P., Sordoni, A., Côté, M.-A., Ke, N., and Bengio, Y. (2017). Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pages 6716–6726.

Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. (2017). Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123.

Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2013). Deep autoregressive networks. arXiv preprint arXiv:1310.8499.

Gu, S., Levine, S., Sutskever, I., and Mnih, A. (2015). MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176.

Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. (2016). PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.

Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.

Khoshaman, A., Vinci, W., Denis, B., Andriyash, E., and Amin, M. H. (2018). Quantum variational autoencoder. Quantum Science and Technology, 4(1):014001.

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649.

Maaløe, L., Fraccaro, M., and Winther, O. (2017). Semi-supervised generation with cluster-aware generative models. arXiv preprint arXiv:1704.00637.

Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.

Mnih, A. and Rezende, D. (2016). Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.

Raiko, T., Berglund, M., Alain, G., and Dinh, L. (2014). Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989.

Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Rolfe, J. T. (2016). Discrete variational autoencoders. arXiv preprint arXiv:1609.02200.

Ross, S. M. (2013). Applied Probability Models with Optimization Applications. Courier Corporation.

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM.

Salimans, T., Kingma, D., and Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226.

Serrà, J., Surís, D., Miron, M., and Karatzoglou, A. (2018). Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423.

Shirts, M. R. and Chodera, J. D. (2008). Statistically optimal analysis of samples from multiple equilibrium states. The Journal of Chemical Physics, 129(12):124105.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746.

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pages 1064–1071. ACM.

Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633.

Vahdat, A., Macready, W. G., Bian, Z., and Khoshaman, A. (2018). DVAE++: Discrete variational autoencoders with overlapping transformations. arXiv preprint arXiv:1802.04920.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
Reinforcement Learning , pages 5–32. Springer.Yeung, S., Kannan, A., Dauphin, Y., and Fei-Fei, L. (2017). Tackling over-pruning in variationalautoencoders. arXiv preprint arXiv:1706.03643 .10 ppendixesA Theorems regarding GumBolt
Theorem 1. For any polynomial function $E_\theta(z)$ of $n_z$ binary variables $z \in \{0,1\}^{n_z}$, the extrema of the relaxed function $E_\theta(\zeta)$ with $\zeta \in [0,1]^{n_z}$ reside on the vertices of the hypercube, i.e., $\zeta_{\mathrm{extr}} \in \{0,1\}^{n_z}$.

Proof. For a binary variable $z_i$ and a positive integer $n$, we have $z_i^n = \underbrace{z_i \cdots z_i}_{n\ \text{times}} = z_i$. Therefore, the polynomial function $E_\theta(z)$ can only depend linearly on each $z_i$ and can be written as
$$E_\theta(z) = \sum_i z_i\, g_{i,\theta}(z_{-i}), \tag{10}$$
where $g_{i,\theta}(z_{-i})$ is a polynomial function of the variables $z_{j \neq i}$ with $j < i$, to exclude double-counting. The energy function of a BM is a special case of this equation. The relaxed function has partial derivatives
$$\frac{\partial E_\theta(\zeta)}{\partial \zeta_i} = g_{i,\theta}(\zeta_{-i}). \tag{11}$$
Due to the linearity of the equation, whenever $g_{i,\theta}(\zeta_{-i})$ is nonzero there is an ascent or descent direction along $\zeta_i$; therefore, the extrema lie on the vertices of the hypercube.

Theorem 2.
For any polynomial function $E_\theta(z)$ of binary variables $z \in \{0,1\}^{n_z}$, the proxy probability $\breve p_\theta(\zeta) \equiv e^{-E_\theta(\zeta)}/Z_\theta$, with $\zeta \in [0,1]^{n_z}$, is a lower bound on the true probability $p_\theta(\zeta) \equiv e^{-E_\theta(\zeta)}/\tilde Z_\theta$, i.e., $\breve p_\theta(\zeta) \le p_\theta(\zeta)$, where $Z_\theta \equiv \sum_{\{z\}} e^{-E_\theta(z)}$ and $\tilde Z_\theta \equiv \int_{\{\zeta\}} d\zeta\, e^{-E_\theta(\zeta)}$.

Proof. Let $E_{\min}$ be the minimum of $E_\theta(z)$. According to the previous theorem, $E_{\min}$ is also the minimum of $E_\theta(\zeta)$. Therefore,
$$\tilde Z_\theta = \int_{\{\zeta\}} d\zeta\, e^{-E_\theta(\zeta)} \le \int_{\{\zeta\}} d\zeta\, e^{-E_{\min}} = e^{-E_{\min}} \le \sum_{\{z\}} e^{-E_\theta(z)} = Z_\theta, \tag{12}$$
and hence
$$\breve p_\theta(\zeta) = \frac{e^{-E_\theta(\zeta)}}{Z_\theta} \le \frac{e^{-E_\theta(\zeta)}}{\tilde Z_\theta} = p_\theta(\zeta). \tag{13}$$

B The equivalence of dVAE and REINFORCE in dealing with the cross-entropy term
In this Appendix, we show that the previous work with a BM prior, dVAE (Rolfe, 2016), is equivalent to REINFORCE when calculating the derivatives of the cross-entropy term in the loss function. Note that a discrete variable reparametrized as $z = H(\rho - (1 - \bar q))$, with $H$ the Heaviside function, is non-differentiable due to the discontinuity of $H$. Consider calculating $\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[E_\theta(z)]$, which appears in the gradients of the objective function. The gradients of the coupling terms can be written as
$$\nabla_\phi \mathbb{E}_{q_\phi(z|x)} \sum_{i,j}^{n_z} z_i W_{ij} z_j = \mathbb{E}_{\rho \sim \mathcal{U}}\, \nabla_\phi \sum_{i,j}^{n_z} z_i(\rho)\, W_{ij}\, z_j(\rho). \tag{14}$$
dVAE uses a spike(at 0)-and-exponential relaxation $r(\zeta_i \mid z_i)$, i.e.,
$$r(\zeta_i \mid z_i) = \begin{cases} \delta(\zeta_i), & \text{if } z_i = 0 \\ \exp(-\zeta_i/\tau)/Z, & \text{if } z_i = 1, \end{cases} \tag{15}$$
where $Z$ is the normalization constant. It is proved in (Rolfe, 2016) that the derivatives can then be calculated as
$$\mathbb{E}_{\rho \sim \mathcal{U}}\, \nabla_\phi \sum_{i,j}^{n_z} z_i(\rho)\, W_{ij}\, z_j(\rho) = \mathbb{E}_{\rho \sim \mathcal{U}} \sum_{i,j}^{n_z} \frac{-z_i(\rho)}{1 - \bar q_i(\rho)}\, W_{ij}\, z_j(\rho)\, \nabla_\phi \bar q_i(\rho). \tag{16}$$
In order to show that this is equivalent to REINFORCE, first consider the spike(at 1)-and-exponential distribution
$$r(\zeta_i \mid z_i) = \begin{cases} \delta(\zeta_i - 1), & \text{if } z_i = 1 \\ \exp(-\zeta_i/\tau)/Z, & \text{if } z_i = 0, \end{cases} \tag{17}$$
which is an equally valid counterpart of the spike(at 0)-and-exponential relaxation (since there is nothing special about $z = 0$). Using this distribution and the same line of reasoning as in (Rolfe, 2016), the derivatives of the coupling term become
$$\mathbb{E}_{\rho \sim \mathcal{U}}\, \nabla_\phi \sum_{i,j}^{n_z} z_i(\rho)\, W_{ij}\, z_j(\rho) = \mathbb{E}_{\rho \sim \mathcal{U}} \sum_{i,j}^{n_z} \frac{z_i(\rho)}{\bar q_i(\rho)}\, W_{ij}\, z_j(\rho)\, \nabla_\phi \bar q_i(\rho). \tag{18}$$
Now consider the REINFORCE trick applied to the coupling term:
$$\nabla_\phi \mathbb{E}_{q_\phi(z|x)} \sum_{i,j}^{n_z} z_i W_{ij} z_j = \mathbb{E}_{q_\phi(z|x)} \sum_{i,j}^{n_z} z_i W_{ij} z_j\, \nabla_\phi \log q_\phi(z_i, z_j \mid x). \tag{19}$$
Assuming the general autoregressive encoder, where every $z_i$ depends on all the preceding variables, z