Calibrating Energy-based Generative Adversarial Networks


Published as a conference paper at ICLR 2017

Zihang Dai, Amjad Almahairi∗, Philip Bachman, Eduard Hovy & Aaron Courville
Language Technologies Institute, Carnegie Mellon University. MILA, Université de Montréal. Maluuba Research.

ABSTRACT

In this paper we propose equipping Generative Adversarial Networks with the ability to produce direct energy estimates for samples. Specifically, we develop a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimum. We derive the analytic form of the induced solution, and analyze its properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experiment results closely match our theoretical analysis, verifying that the discriminator is able to recover the energy of the data distribution.

1 INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) represent an important milestone on the path towards more effective generative models. GANs cast generative model training as a minimax game between a generative network (generator), which maps a random vector into the data space, and a discriminative network (discriminator), whose objective is to distinguish generated samples from real samples. Multiple researchers (Radford et al., 2015; Salimans et al., 2016; Zhao et al., 2016) have shown that the adversarial interaction with the discriminator can result in a generator that produces compelling samples. The empirical successes of the GAN framework were also supported by the theoretical analysis of Goodfellow et al., who showed that, under certain conditions, the distribution produced by the generator converges to the true data distribution, while the discriminator converges to a degenerate uniform solution.

While GANs have excelled as compelling sample generators, their use as general purpose probabilistic generative models has been limited by the difficulty in using them to provide density estimates or even unnormalized energy values for sample evaluation.

It is tempting to consider the GAN discriminator as a candidate for providing this sort of scoring function. Conceptually, it is a trainable sample evaluation mechanism that, owing to the GAN training paradigm, could be closely calibrated to the distribution modeled by the generator. If the discriminator could retain fine-grained information about the relative quality of samples, measured for instance by probability density or unnormalized energy, it could be used as an evaluation metric. Such data-driven evaluators would be highly desirable for problems where it is difficult to define evaluation criteria that correlate well with human judgment. Indeed, the real-valued discriminator of the recently introduced energy-based GANs (Zhao et al., 2016) might seem like an ideal candidate energy function. Unfortunately, as we will show, the degenerate fate of the GAN discriminator at the optimum equally afflicts the energy-based GAN of Zhao et al.

In this paper we consider the questions: (i) does there exist an adversarial framework that induces a non-degenerate discriminator, and (ii) if so, what form will the resulting discriminator take? We introduce a novel adversarial learning formulation, which leads to a non-degenerate discriminator while ensuring the generator distribution matches the data distribution at the global optimum. We derive a general analytic form of the optimal discriminator, and discuss its properties and their relationship to the specific form of the training objective. We also discuss the connection between the proposed formulation and existing alternatives such as the approach of Kim & Bengio (2016). Finally, for a specific instantiation of the general formulation, we investigate two approximation techniques to optimize the training objective, and verify our results empirically.

∗ Part of this work was completed while the author was at Maluuba Research.

2 RELATED WORK

Following a similar motivation, the field of Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000) has been exploring ways to recover the “intrinsic” reward function (analogous to the discriminator) from observed expert trajectories (real samples). Taking this idea one step further, apprenticeship learning or imitation learning (Abbeel & Ng, 2004; Ziebart et al., 2008) aims at learning a policy (analogous to the generator) using the reward signals recovered by IRL. Notably, Ho & Ermon (2016) draw a connection between imitation learning and GANs by showing that the GAN formulation can be derived by imposing a specific regularization on the reward function. Also, under a special case of their formulation, Ho & Ermon provide a duality-based interpretation of the problem, which inspires our theoretical analysis. However, as the focus of Ho & Ermon (2016) is only on the policy, the authors explicitly propose to bypass the intermediate IRL step, and thus provide no analysis of the learned reward function.

The GAN models most closely related to our proposed framework are the energy-based GAN models of Zhao et al. (2016) and Kim & Bengio (2016). In the next section, we show how one can derive both of these approaches from different assumptions regarding regularization of the generative model.

3 ALTERNATIVE FORMULATION OF ADVERSARIAL TRAINING

3.1 BACKGROUND

Before presenting the proposed formulation, we first state some basic assumptions required by the analysis, and introduce notations used throughout the paper.

Following the original work on GANs (Goodfellow et al., 2014), our analysis focuses on the non-parametric case, where all models are assumed to have infinite capacities. While many of the non-parametric intuitions can directly transfer to the parametric case, we will point out cases where this transfer fails. We assume a finite data space throughout the analysis, to avoid technical machinery out of the scope of this paper. Our results, however, can be extended to continuous data spaces, and our experiments are indeed performed on continuous data.

Let $\mathcal{X}$ be the data space under consideration, and $\mathcal{P} = \{p \mid p(x) \ge 0, \forall x \in \mathcal{X}, \sum_{x \in \mathcal{X}} p(x) = 1\}$ be the set of all proper distributions defined on $\mathcal{X}$. Then, $p_{\text{data}} \in \mathcal{P}: \mathcal{X} \mapsto \mathbb{R}$ and $p_{\text{gen}} \in \mathcal{P}: \mathcal{X} \mapsto \mathbb{R}$ will denote the true data distribution and the generator distribution. $\mathbb{E}_{x \sim p} f(x)$ denotes the expectation of the quantity $f(x)$ w.r.t. $x$ drawn from $p$. Finally, the term “discriminator” will refer to any structure that provides training signals to the generator based on some measure of difference between the generator distribution and the real data distribution, which includes but is not limited to $f$-divergence.

3.2 PROPOSED FORMULATION

In order to understand the motivation of the proposed approach, it is helpful to analyze the optimization dynamics near convergence in GANs first.

When the generator distribution matches the data distribution, the training signal (gradient) w.r.t. the discriminator vanishes. At this point, assume the discriminator still retains density information, and views some samples as more real and others as less. This discriminator will produce a training signal (gradient) w.r.t. the generator, pushing the generator to generate samples that appear more real to the discriminator. Critically, this training signal is the sole driver of the generator’s training. Hence, the generator distribution will diverge from the data distribution. In other words, as long as the discriminator retains relative density information, the generator distribution cannot stably match the data distribution. Thus, in order to keep the generator stationary at the data distribution, the discriminator must assign flat (exactly the same) density to all samples at the optimum.

From the analysis above, the fundamental difficulty is that the generator only receives a single training signal (gradient) from the discriminator, which it has to follow. To keep the generator stationary, this single training signal (gradient) must vanish, which requires a degenerate discriminator. In this work, we propose to tackle this single training signal constraint directly.
Specifically, we introduce a novel adversarial learning formulation which incorporates an additional training signal to the generator, such that this additional signal can
• balance (cancel out) the discriminator signal at the optimum, so that the generator can stay stationary even if the discriminator assigns non-flat density to samples
• cooperate with the discriminator signal to make sure the generator converges to the data distribution, and the discriminator retains the correct relative density information

The proposed formulation can be written as the following minimax training objective,

$$\max_{c} \min_{p_{\text{gen}} \in \mathcal{P}} \; \mathbb{E}_{x \sim p_{\text{gen}}}\big[c(x)\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[c(x)\big] + K(p_{\text{gen}}), \tag{1}$$

where $c(x): \mathcal{X} \mapsto \mathbb{R}$ is the discriminator that assigns each data point an unbounded scalar cost, and $K(p_{\text{gen}}): \mathcal{P} \mapsto \mathbb{R}$ is some (functionally) differentiable, convex function of $p_{\text{gen}}$. Compared to the original GAN, despite the similar minimax surface form, the proposed formulation has two crucial distinctions.

Firstly, while the GAN discriminator tries to distinguish “fake” samples from real ones using binary classification, the proposed discriminator achieves that by assigning lower cost to real samples and higher cost to “fake” ones. This distinction can be seen from the first two terms of Eqn. (1), where the discriminator $c(x)$ is trained to widen the expected cost gap between “fake” and real samples, while the generator is adversarially trained to minimize it. In addition to the different adversarial mechanism, a calibrating term $K(p_{\text{gen}})$ is introduced to provide a countervailing source of training signal for $p_{\text{gen}}$ as we motivated above. For now, the form of $K(p_{\text{gen}})$ has not been specified.
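To make objective (1) concrete, the following is a minimal sketch on a small finite data space. It assumes, purely for illustration, that $K$ is the negative entropy of the generator distribution (one possible convex choice), in which case the inner minimization over $p_{\text{gen}}$ has the closed form $p_{\text{gen}}(x) \propto \exp(-c(x))$. The function names, the toy distribution, the step size, and the number of steps are illustrative assumptions and not part of the proposed method.

```python
import numpy as np

def softmin_distribution(c):
    """Inner minimiser of E_p[c] - H(p) over the simplex: p(x) proportional to exp(-c(x))."""
    logits = -c - np.max(-c)          # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def objective(p_gen, p_data, c):
    """Objective (1) on a finite space, with K(p_gen) = -H(p_gen) (assumed choice of K)."""
    neg_entropy = np.sum(p_gen * np.log(p_gen))
    return p_gen @ c - p_data @ c + neg_entropy

def train_discriminator(p_data, steps=2000, lr=1.0):
    """Gradient ascent on c; for the best-responding generator the gradient is p_gen - p_data."""
    c = np.zeros_like(p_data)
    for _ in range(steps):
        p_gen = softmin_distribution(c)   # generator best response to the current costs
        c += lr * (p_gen - p_data)        # raise cost where the generator over-samples
    return c, softmin_distribution(c)

# Hypothetical toy data distribution on four points.
p_data = np.array([0.5, 0.3, 0.15, 0.05])
c_star, p_gen_star = train_discriminator(p_data)
energy = -np.log(p_data)

print("generator:", np.round(p_gen_star, 4))
print("centred cost:  ", np.round(c_star - c_star.mean(), 4))
print("centred energy:", np.round(energy - energy.mean(), 4))
```

In this toy run the generator distribution ends up matching the data distribution, while the learned costs differ from $-\log p_{\text{data}}$ only by a constant shift, so the relative density information is retained rather than flattened.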
As we will see later, the choice of $K(p_{\text{gen}})$ directly determines the form of the optimal discriminator $c^*(x)$.

With the specific optimization objective, we next provide a theoretical characterization of both the generator and the discriminator at the global optimum.

Define $L(p_{\text{gen}}, c) = \mathbb{E}_{x \sim p_{\text{gen}}}\big[c(x)\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[c(x)\big] + K(p_{\text{gen}})$; then $L(p_{\text{gen}}, c)$ is the Lagrange dual function of the following optimization problem

$$\min_{p_{\text{gen}} \in \mathcal{P}} K(p_{\text{gen}}) \quad \text{s.t.} \quad p_{\text{gen}}(x) - p_{\text{data}}(x) = 0, \; \forall x \in \mathcal{X} \tag{2}$$

where $c(x), \forall x$ appears in $L(p_{\text{gen}}, c)$ as the dual variables introduced for the equality constraints. This duality relationship has been observed previously in (Ho & Ermon, 2016, equation (7)) under the adversarial imitation learning setting. However, in their case, the focus was fully on the generator side (induced policy), and no analysis was provided for the discriminator (reward function).

In order to characterize $c^*$, we first expand the set constraint on $p_{\text{gen}}$ into explicit equality and inequality constraints:

$$\begin{aligned} \min_{p_{\text{gen}}} \quad & K(p_{\text{gen}}) \\ \text{s.t.} \quad & p_{\text{gen}}(x) - p_{\text{data}}(x) = 0, \; \forall x \\ & -p_{\text{gen}}(x) \le 0, \; \forall x \\ & \sum_{x \in \mathcal{X}} p_{\text{gen}}(x) - 1 = 0. \end{aligned} \tag{3}$$

Notice that $K(p_{\text{gen}})$ is a convex function of $p_{\text{gen}}(x)$ by definition, and both the equality and inequality constraints are affine functions of $p_{\text{gen}}(x)$. Thus, problem (2) is a convex optimization problem. Moreover, since (i) $\operatorname{dom} K$ is open, and (ii) there exists a feasible solution $p_{\text{gen}} = p_{\text{data}}$ to (3), by the refined Slater’s condition (Boyd & Vandenberghe, 2004, page 226), we can further verify that strong duality holds for (3). With strong duality, a typical approach to characterizing the optimal solution is to apply the Karush-Kuhn-Tucker (KKT) conditions, which gives rise to the following proposition:

Proposition 3.1.

By the KKT conditions of the convex problem (3), at the global optimum, the optimal generator distribution $p^*_{\text{gen}}$ matches the true data distribution $p_{\text{data}}$, and the optimal discriminator $c^*(x)$ has the following form:

$$c^*(x) = -\left.\frac{\partial K(p_{\text{gen}})}{\partial p_{\text{gen}}(x)}\right|_{p_{\text{gen}} = p_{\text{data}}} - \lambda^* + \mu^*(x), \quad \forall x \in \mathcal{X}, \tag{4}$$

where

$$\mu^*(x) = \begin{cases} 0, & p_{\text{data}}(x) > 0 \\ u_x, & p_{\text{data}}(x) = 0, \end{cases}$$

$\lambda^* \in \mathbb{R}$ is an under-determined real number independent of $x$, and $u_x \in \mathbb{R}_+$ is an under-determined non-negative real number.

The detailed proof of Proposition 3.1 is provided in appendix A.1. From (4), we can see the exact form of the optimal discriminator depends on the term $K(p_{\text{gen}})$, or more specifically its gradient. But, before we instantiate $K(p_{\text{gen}})$ with specific choices and show the corresponding forms of $c^*(x)$, we first discuss some general properties of $c^*(x)$ that do not depend on the choice of $K$.

Weak Support Discriminator.

As part of the optimal discriminator function, the term $\mu^*(x)$ plays the role of a support discriminator. That is, it tries to distinguish the support of the data distribution, i.e. $\text{SUPP}(p_{\text{data}}) = \{x \in \mathcal{X} \mid p_{\text{data}}(x) > 0\}$, from its complement set with zero probability, i.e. $\text{SUPP}(p_{\text{data}})^{\complement} = \{x \in \mathcal{X} \mid p_{\text{data}}(x) = 0\}$. Specifically, for any $x \in \text{SUPP}(p_{\text{data}})$ and $x' \in \text{SUPP}(p_{\text{data}})^{\complement}$, it is guaranteed that $\mu^*(x) \le \mu^*(x')$. However, because $\mu^*(\cdot)$ is under-determined, there is nothing preventing the inequality from degenerating into an equality. Therefore, we name it the weak support discriminator. But, in all cases, $\mu^*(\cdot)$ assigns zero cost to all data points within the support. As a result, it does not possess any fine-grained density information inside of the data support. It is worth pointing out that, in the parametric case, because of the smoothness and the generalization properties of the parametric model, the learned discriminator may generalize beyond the data support.

Global Bias.

In (4), the term $\lambda^*$ is a scalar value shared for all $x$. As a result, it does not affect the relative cost among data points, and only serves as a global bias for the discriminator function.

Having discussed general properties, we now consider some specific cases of the convex function $K$, and analyze the resulting optimal discriminator $c^*(x)$ in detail.

1. First, let us consider the case where $K$ is the negative entropy of the generator distribution, i.e. $K(p_{\text{gen}}) = -H(p_{\text{gen}})$. Taking the derivative of the negative entropy w.r.t. $p_{\text{gen}}(x)$, which equals $\log p_{\text{gen}}(x) + 1$, we have

$$c^*_{\text{ent}}(x) = -\log p_{\text{data}}(x) - 1 - \lambda^* + \mu^*(x), \quad \forall x \in \mathcal{X}, \tag{5}$$

where $\mu^*(x)$ and $\lambda^*$ have the same definitions as in (4).

Up to a constant, this form of $c^*_{\text{ent}}(x)$ is exactly the energy function of the data distribution $p_{\text{data}}(x)$. This elegant result has deep connections to several existing formulations, which include max-entropy imitation learning (Ziebart et al., 2008) and the directed-generator-trained energy-based model (Kim & Bengio, 2016). The core difference is that these previous formulations are originally derived from maximum-likelihood estimation, and thus the minimax optimization is only implicit. In contrast, with an explicit minimax formulation we can develop a better understanding of the induced solution. For example, the global bias $\lambda^*$ suggests that there exists more than one stable equilibrium the optimal discriminator can actually reach. Further, $\mu^*(x)$ can be understood as a support discriminator that poses extra cost on generator samples which fall in zero-probability regions of data space.

2. When $K(p_{\text{gen}}) = \frac{1}{2} \sum_{x \in \mathcal{X}} p_{\text{gen}}(x)^2 = \frac{1}{2} \|p_{\text{gen}}\|_2^2$, which can be understood as posing $\ell_2$ regularization on $p_{\text{gen}}$, we have $\left.\frac{\partial K(p_{\text{gen}})}{\partial p_{\text{gen}}(x)}\right|_{p_{\text{gen}} = p_{\text{data}}} = p_{\text{data}}(x)$, and it follows

$$c^*_{\ell_2}(x) = -p_{\text{data}}(x) - \lambda^* + \mu^*(x), \quad \forall x \in \mathcal{X}, \tag{6}$$

with $\mu^*(x)$, $\lambda^*$ similarly defined as in (4).

Surprisingly, the result suggests that the optimal discriminator $c^*_{\ell_2}(x)$ directly recovers the negative probability $-p_{\text{data}}(x)$, shifted by a constant. Thus, similar to the entropy solution (5), it fully retains the relative density information of data points within the support.

However, because of the under-determined term $\mu^*(x)$, we cannot recover the distribution density $p_{\text{data}}$ exactly from either $c^*_{\ell_2}$ or $c^*_{\text{ent}}$ if the data support is finite. Whether this ambiguity can be resolved is beyond the scope of this paper, but poses an interesting research problem.

3. Finally, let us consider a degenerate case, where $K(p_{\text{gen}})$ is a constant. That is, we do not provide any additional training signal for $p_{\text{gen}}$ at all. With $K(p_{\text{gen}}) = \text{const}$, the derivative term in (4) vanishes and we simply have

$$c^*_{\text{cst}}(x) = -\lambda^* + \mu^*(x), \quad \forall x \in \mathcal{X}, \tag{7}$$

whose discriminative power is fully controlled by the weak support discriminator $\mu^*(x)$. Thus, it follows that $c^*_{\text{cst}}(x)$ will not be able to discriminate data points within the support of $p_{\text{data}}$, and its power to distinguish data from $\text{SUPP}(p_{\text{data}})$ and $\text{SUPP}(p_{\text{data}})^{\complement}$ is weak. This closely matches the intuitive argument in the beginning of this section.

Note that when $K(p_{\text{gen}})$ is a constant, the objective function (1) simplifies to:

$$\max_{c} \min_{p_{\text{gen}} \in \mathcal{P}} \; \mathbb{E}_{x \sim p_{\text{gen}}}\big[c(x)\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[c(x)\big], \tag{8}$$

which is very similar to the EBGAN objective (Zhao et al., 2016, equations (2) and (4)).
As we show in appendix A.2, compared to the objective in (8), the EBGAN objective puts extra constraints on the allowed discriminator function. In spite of that, the EBGAN objective suffers from the single-training-signal problem and does not guarantee that the discriminator will recover the real energy function (see appendix A.2 for detailed analysis).

As we finish the theoretical analysis of the proposed formulation, we want to point out that simply adding the same term $K(p_{\text{gen}})$ to the original GAN formulation will not lead to both a generator that matches the data distribution, and a discriminator that retains the density information (see appendix A.3 for detailed analysis).

4 PARAMETRIC INSTANTIATION WITH ENTROPY APPROXIMATION

While the discussion in previous sections focused on the non-parametric case, in practice we are limited to a finite amount of data, and the actual problem involves high dimensional continuous spaces. Thus, we resort to parametric representations for both the generator and the discriminator. In order to train the generator using standard back-propagation, we do not parametrize the generator distribution directly. Instead, we parametrize a directed generator network that transforms random noise $z \sim p_z(z)$ to samples from a continuous data space $\mathbb{R}^n$. Consequently, we do not have analytical access to the generator distribution, which is defined implicitly by the generator network’s noise → data mapping. However, the regularization term $K(p_{\text{gen}})$ in the training objective (1) requires the generator distribution. Faced with this problem, we focus on the max-entropy formulation, and exploit two different approximations of the regularization term $K(p_{\text{gen}}) = -H(p_{\text{gen}})$.

4.1 NEAREST-NEIGHBOR ENTROPY GRADIENT APPROXIMATION

The first proposed solution is built upon an intuitive interpretation of the entropy gradient. Firstly, since we construct $p_{\text{gen}}$ by applying a deterministic, differentiable transform $g_\theta$ to samples $z$ from a fixed distribution $p_z$, we can write the gradient of $H(p_{\text{gen}})$ with respect to the generator parameters $\theta$ as follows:

$$-\nabla_\theta H(p_{\text{gen}}) = \mathbb{E}_{z \sim p_z}\big[\nabla_\theta \log p_{\text{gen}}(g_\theta(z))\big] = \mathbb{E}_{z \sim p_z}\left[\frac{\partial g_\theta(z)}{\partial \theta} \frac{\partial \log p_{\text{gen}}(g_\theta(z))}{\partial g_\theta(z)}\right], \tag{9}$$

where the first equality relies on the “reparametrization trick”. Equation (9) implies that, if we can compute the gradient of the generator log-density $\log p_{\text{gen}}(x)$ w.r.t. any $x = g_\theta(z)$, then we can directly construct the Monte-Carlo estimation of the entropy gradient $\nabla_\theta H(p_{\text{gen}})$ using samples from the generator.

Intuitively, for any generated data $x = g_\theta(z)$, the term $\frac{\partial \log p_{\text{gen}}(x)}{\partial x}$ essentially describes the direction of local change in the sample space that will increase the log-density. Motivated by this intuition, we propose to form a local Gaussian approximation $p^i_{\text{gen}}$ of $p_{\text{gen}}$ around each point $x_i$ in a batch of samples $\{x_1, \ldots, x_n\}$ from the generator, and then compute the gradient $\frac{\partial \log p_{\text{gen}}(x_i)}{\partial x_i}$ based on the Gaussian approximation. Specifically, each local Gaussian approximation $p^i_{\text{gen}}$ is formed by finding the $k$ nearest neighbors of $x_i$ in the batch $\{x_1, \ldots, x_n\}$, and then placing an isotropic Gaussian distribution at their mean (i.e. maximum likelihood). Based on the isotropic Gaussian approximation, the resulting gradient has the following form

$$\frac{\partial \log p_{\text{gen}}(x_i)}{\partial x_i} \approx \mu_i - x_i, \quad \text{where } \mu_i = \frac{1}{k} \sum_{x' \in \text{KNN}(x_i)} x' \text{ is the mean of the Gaussian.} \tag{10}$$

Finally, note that the scale of this gradient approximation may not be reliable.
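The nearest-neighbor estimate in Eq. (10) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `knn_score_direction` is hypothetical, neighbors are searched within the batch using Euclidean distance, and each point is excluded from its own neighbor set, which is an assumption the text does not settle.

```python
import numpy as np

def knn_score_direction(x, k):
    """Approximate d log p_gen(x_i) / d x_i by mu_i - x_i as in Eq. (10).

    x: array of shape (n, d) holding a batch of generated samples.
    k: number of nearest neighbours (assumed here to exclude x_i itself).
    """
    # Pairwise squared Euclidean distances, shape (n, n).
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)            # never pick a point as its own neighbour
    nn_idx = np.argsort(sq_dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    mu = x[nn_idx].mean(axis=1)                   # mean of the local Gaussian, shape (n, d)
    return mu - x

# Toy batch: three clustered points and one isolated point.
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
directions = knn_score_direction(x, k=2)
print(directions)
```

For the toy batch above, the isolated point at (5, 5) receives a long vector pointing back toward the cluster, while the clustered points receive short ones, which illustrates how strongly the raw magnitude of this estimate varies across samples.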
To fix this scale problem, we normalize the approximated gradient to unit norm, and use a single hyper-parameter to model the scale for all $x$, leading to the following entropy gradient approximation

$$-\nabla_\theta H(p_{\text{gen}}) \approx \frac{\alpha}{k} \sum_{x_i = g_\theta(z_i)} \frac{\mu_i - x_i}{\|\mu_i - x_i\|_2} \tag{11}$$

where $\alpha$ is the hyper-parameter and $\mu_i$ is defined as in equation (10).

An obvious weakness of this approximation is that it relies on Euclidean distance to find the $k$ nearest neighbors. However, Euclidean distance is usually not the proper metric to use when the effective dimension is very high. As the problem is highly challenging, we leave it for future work.

4.2 VARIATIONAL LOWER BOUND ON THE ENTROPY

Another approach we consider relies on defining and maximizing a variational lower bound on the entropy $H(p_{\text{gen}}(x))$ of the generator distribution. We can define the joint distribution over observed data and the noise variables as $p_{\text{gen}}(x, z) = p_{\text{gen}}(x \mid z)\, p_{\text{gen}}(z)$, where simply $p_{\text{gen}}(z) = p_z(z)$ is a fixed prior. Using the joint, we can also define the marginal $p_{\text{gen}}(x)$ and the posterior $p_{\text{gen}}(z \mid x)$. We can also write the mutual information between the observed data and noise variables as:

$$\begin{aligned} I(p_{\text{gen}}(x); p_{\text{gen}}(z)) &= H(p_{\text{gen}}(x)) - H(p_{\text{gen}}(x \mid z)) \\ &= H(p_{\text{gen}}(z)) - H(p_{\text{gen}}(z \mid x)), \end{aligned} \tag{12}$$

where $H(p_{\text{gen}}(\cdot \mid \cdot))$ denotes the conditional entropy. By reorganizing terms in this definition, we can write the entropy $H(p_{\text{gen}}(x))$ as:

$$H(p_{\text{gen}}(x)) = H(p_{\text{gen}}(z)) - H(p_{\text{gen}}(z \mid x)) + H(p_{\text{gen}}(x \mid z)) \tag{13}$$

We can think of $p_{\text{gen}}(x \mid z)$ as a peaked Gaussian with a fixed, diagonal covariance, and hence its conditional entropy is constant and can be dropped. Furthermore, $H(p_{\text{gen}}(z))$ is also assumed to be fixed a priori. Hence, we can maximize $H(p_{\text{gen}}(x))$ by minimizing the conditional entropy:

$$H(p_{\text{gen}}(z \mid x)) = \mathbb{E}_{x \sim p_{\text{gen}}(x)}\Big[\mathbb{E}_{z \sim p_{\text{gen}}(z \mid x)}\big[-\log p_{\text{gen}}(z \mid x)\big]\Big] \tag{14}$$

Optimizing this term is still problematic, because (i) we do not have access to the posterior $p_{\text{gen}}(z \mid x)$, and (ii) we cannot sample from it. Therefore, we instead minimize a variational upper bound defined by an approximate posterior $q_{\text{gen}}(z \mid x)$:

$$\begin{aligned} H(p_{\text{gen}}(z \mid x)) &= \mathbb{E}_{x \sim p_{\text{gen}}(x)}\Big[\mathbb{E}_{z \sim p_{\text{gen}}(z \mid x)}\big[-\log q_{\text{gen}}(z \mid x)\big] - \mathrm{KL}\big(p_{\text{gen}}(z \mid x) \,\|\, q_{\text{gen}}(z \mid x)\big)\Big] \\ &\le \mathbb{E}_{x \sim p_{\text{gen}}(x)}\Big[\mathbb{E}_{z \sim p_{\text{gen}}(z \mid x)}\big[-\log q_{\text{gen}}(z \mid x)\big]\Big] = U(q_{\text{gen}}). \end{aligned} \tag{15}$$

We can also rewrite the variational upper bound as:

$$U(q_{\text{gen}}) = \mathbb{E}_{x, z \sim p_{\text{gen}}(x, z)}\big[-\log q_{\text{gen}}(z \mid x)\big] = \mathbb{E}_{z \sim p_{\text{gen}}(z)}\Big[\mathbb{E}_{x \sim p_{\text{gen}}(x \mid z)}\big[-\log q_{\text{gen}}(z \mid x)\big]\Big], \tag{16}$$

which can be optimized efficiently with standard back-propagation and Monte Carlo integration of the relevant expectations based on independent samples drawn from the joint $p_{\text{gen}}(x, z)$. By minimizing this upper bound on the conditional entropy $H(p_{\text{gen}}(z \mid x))$, we are effectively maximizing a variational lower bound on the entropy $H(p_{\text{gen}}(x))$.

Figure 1: True energy functions and samples from synthetic distributions. Green dots in the sample plots indicate the mean of each Gaussian component.

5 EXPERIMENTS

In this section, we verify our theoretical results empirically on several synthetic and real datasets. In particular, we evaluate whether the discriminator obtained from the entropy-regularized adversarial training can capture the density information (in the form of energy), while making sure the generator distribution matches the data distribution. For convenience, we refer to the obtained models as EGAN-Ent. Our experimental setting follows closely recommendations from Radford et al. (2015), except in Sec. 5.1 where we use fully-connected models (see appendix B.1 for details).

5.1 SYNTHETIC LOW-DIMENSIONAL DATA

First, we consider three synthetic datasets in 2-dimensional space, which are drawn from the following distributions: (i) Mixture of 4 Gaussians with equal mixture weights, (ii) Mixture of 200 Gaussians arranged as two spirals (100 components each spiral), and (iii) Mixture of 2 Gaussians with highly biased mixture weights, P(c1) = 0.[…], P(c2) = 0.[…]. We visualize the ground-truth energy of these distributions along with 100K training samples in Figure 1. Since the data lies in 2-dimensional space, we can easily visualize both the learned generator (by drawing samples) and the discriminator for direct comparison and evaluation. We evaluate here our EGAN-Ent model using both approximations: the nearest-neighbor based approximation (EGAN-Ent-NN) and the variational-inference based approximation (EGAN-Ent-VI), and compare them with two baselines: the original GAN and the energy based GAN with no regularization (EGAN-Const).

Experiment results are summarized in Figure 2 for baseline models, and Figure 3 for the proposed models. As we can see, all four models can generate perfect samples. However, for the discriminator, both GAN and EGAN-Const lead to degenerate solutions, assigning flat energy inside the empirical data support. In comparison, EGAN-Ent-VI and EGAN-Ent-NN clearly capture the density information, though to different degrees. Specifically, on the equally weighted Gaussian mixture and the two-spiral mixture datasets, EGAN-Ent-NN tends to give more accurate and fine-grained solutions compared to EGAN-Ent-VI. However, on the biased weighted Gaussian mixture dataset, EGAN-Ent-VI actually fails to capture the correct mixture weights of the two modes, incorrectly assigning lower energy to the mode with lower probability (smaller weight). In contrast, EGAN-Ent-NN perfectly captures the bias in mixture weight, and obtains a contour very close to the ground truth.

To better quantify these differences, we present a detailed comparison based on KL divergence in appendix B.2.
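The ground-truth energy that the learned discriminators are compared against is the negative log-density of the corresponding Gaussian mixture. A minimal sketch of that computation follows, assuming isotropic components with a shared standard deviation; the function name `mixture_energy`, the component means, the weights, and the standard deviation are hypothetical placeholders and not the values used in the experiments.

```python
import numpy as np

def mixture_energy(x, means, weights, std):
    """Energy -log p(x) of an isotropic Gaussian mixture.

    x: (n, d) query points, means: (m, d) component means,
    weights: (m,) mixture weights summing to one, std: shared standard deviation.
    """
    d = x.shape[1]
    sq = np.sum((x[:, None, :] - means[None, :, :]) ** 2, axis=-1)        # (n, m)
    log_comp = -0.5 * sq / std**2 - 0.5 * d * np.log(2 * np.pi * std**2)  # log N(x; mean, std^2 I)
    log_joint = np.log(weights)[None, :] + log_comp
    m = log_joint.max(axis=1, keepdims=True)                              # log-sum-exp for stability
    log_p = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
    return -log_p

# Hypothetical two-component mixture with biased weights.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
weights = np.array([0.8, 0.2])
energies_at_modes = mixture_energy(means, means, weights, std=0.5)
print(energies_at_modes)
```

With well-separated components, the energy gap between the two modes is close to the log-ratio of their weights, so the heavier mode has the lower energy. This is the ordering a calibrated discriminator is expected to reproduce on the biased mixture.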
Moreover, the performance difference between EGAN-Ent-VI and EGAN-Ent-NN on the biased Gaussian mixture reveals the limitations of the variational inference based approximation, i.e. providing inaccurate gradients. Due to space considerations, we refer interested readers to appendix B.3 for a detailed discussion.

5.2 RANKING NIST DIGITS

In this experiment, we verify that the results on synthetic datasets can translate to data with higher dimensions. While visualizing the learned energy function is not feasible in high-dimensional space, we can verify whether the learned energy function captures relative densities by inspecting the ranking of samples according to their assigned energies. We train on […] × […] images of a single handwritten digit from the NIST dataset. We compare the ability of EGAN-Ent-NN with both EGAN-Const and GAN on ranking a set of 1,000 images, half of which are generated samples and the rest are real test images. Figures 4 and 5 show the top-100 and bottom-100 ranked images respectively for each model, after training them on digit 1. We also show in Figure 7 the mean of all training samples, so we can get a sense of the most common style (highest density) of digit 1 in NIST. We can notice that all of the top-ranked images by EGAN-Ent-NN look similar to the mean sample. In addition, the lowest-ranked images are clearly different from the mean image, with either high (clockwise or counter-clockwise) rotation degrees from the mean, or an extreme thickness level. We do not see such a clear distinction in other models. We provide in appendix B.4 the ranking of the full set of images.

For more details, please refer to https://github.com/zihangdai/cegan_iclr2017.

Figure 2: Learned energies and samples from baseline models whose discriminator cannot retain density information at the optimum. In the sample plots, blue dots indicate generated samples, and red dots indicate real ones. (a) Standard GAN. (b) Energy GAN without regularization (EGAN-Const).

Figure 3: Learned energies and samples from proposed models whose discriminator can retain density information at the optimum. Blue dots are generated samples, and red dots are real ones. (a) Entropy regularized Energy GAN with variational inference approximation (EGAN-Ent-VI). (b) Entropy regularized Energy GAN with nearest neighbor approximation (EGAN-Ent-NN).

5.3 SAMPLE QUALITY ON NATURAL IMAGE DATASETS

In this last set of experiments, we evaluate the visual quality of samples generated by our model on two datasets of natural images, namely CIFAR-10 and CelebA. We employ here the variational-based approximation for entropy regularization, which can scale well to high-dimensional data. Figure 6 shows samples generated by EGAN-Ent-VI. We can see that despite the noisy gradients provided by the variational approximation, our model is able to generate high-quality samples.

We further validate the quality of our model’s samples on CIFAR-10 using the Inception score proposed by Salimans et al. (2016). Table 1 shows the scores of our EGAN-Ent-VI, the best GAN model from Salimans et al. (2016) which uses only unlabeled data, and an EGAN-Const model which has the same architecture as our model. We notice that even without employing the techniques suggested in Salimans et al. (2016), energy-based models perform quite similarly to the GAN model. Furthermore, the fact that our model scores higher than EGAN-Const highlights the importance of entropy regularization in obtaining good quality samples.

NIST dataset: […], which is an extended version of MNIST with an average of over […]K examples per digit.

Figure 4: 100 highest-ranked images out of 1000 generated and real (bounding box) samples. (a) EGAN-Ent-NN. (b) EGAN-Const. (c) GAN.

Figure 5: 100 lowest-ranked images out of 1000 generated and real (bounding box) samples. (a) EGAN-Ent-NN. (b) EGAN-Const. (c) GAN.

6 CONCLUSION

In this paper we have addressed a fundamental limitation in adversarial learning approaches, which is their inability to provide sensible energy estimates for samples. We proposed a novel adversarial learning formulation which results in a discriminator function that recovers the true data energy. We provided a rigorous characterization of the learned discriminator in the non-parametric setting, and proposed two methods for instantiating it in the typical parametric setting. Our experimental results verify our theoretical analysis about the discriminator properties, and show that we can also obtain samples of state-of-the-art quality.

ACKNOWLEDGEMENTS

We would like to thank the developers of Theano (Theano Development Team, 2016) for developing such a powerful tool for scientific computing. Amjad Almahairi was supported by funding from Maluuba Research.

Using the evaluation script released in https://github.com/openai/improved-gan/

(a) CIFAR-10 (b) CelebA

Figure 6: Samples generated from our model.

Table 1: Inception scores on CIFAR-10.

| Model | Our model | Improved GAN † | EGAN-Const |
|---|---|---|---|
| Score ± std. | 7.07 ± .10 | 6.86 ± .06 | 6.7447 ± |

† As reported in Salimans et al. (2016) without using labeled data.

Figure 7: Mean digit.

REFERENCES

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. ACM, 2004.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476, 2016.

Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

A. Ng and S. Russell. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.

Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433–1438, 2008.

A SUPPLEMENTARY MATERIALS FOR SECTION 3

A.1 OPTIMAL DISCRIMINATOR FORM UNDER THE PROPOSED FORMULATION

Proof of Proposition 3.1. Refining the Lagrangian $L(p_{\text{gen}}, c)$ by introducing additional dual variables for the probability constraints (the second and third), the new Lagrangian has the form

$$
L(p_{\text{gen}}, c, \mu, \lambda) = K(p_{\text{gen}}) + \sum_{x \in \mathcal{X}} c(x)\big(p_{\text{gen}}(x) - p_{\text{data}}(x)\big) - \sum_{x \in \mathcal{X}} \mu(x)\, p_{\text{gen}}(x) + \lambda\Big(\sum_{x \in \mathcal{X}} p_{\text{gen}}(x) - 1\Big) \tag{17}
$$

where $c(x) \in \mathbb{R}, \forall x$, $\mu(x) \in \mathbb{R}_+, \forall x$, and $\lambda \in \mathbb{R}$ are the dual variables. The KKT conditions for the optimal primal and dual variables are as follows

$$
\begin{aligned}
\frac{\partial K(p_{\text{gen}})}{\partial p_{\text{gen}}(x)}\bigg|_{p_{\text{gen}} = p_{\text{data}}} + c^*(x) - \mu^*(x) + \lambda^* &= 0, \ \forall x && \text{(stationarity)} \\
\mu^*(x)\, p^*_{\text{gen}}(x) &= 0, \ \forall x && \text{(complementary slackness)} \\
\mu^*(x) &\geq 0, \ \forall x && \text{(dual feasibility)} \\
p^*_{\text{gen}}(x) \geq 0, \quad p^*_{\text{gen}}(x) &= p_{\text{data}}(x), \ \forall x && \text{(primal feasibility)} \\
\sum_{x \in \mathcal{X}} p_{\text{gen}}(x) &= 1 && \text{(primal feasibility)}
\end{aligned} \tag{18}
$$

Rearranging the conditions above, we get $p^*_{\text{gen}}(x) = p_{\text{data}}(x), \forall x \in \mathcal{X}$ as well as equation (4), which concludes the proof.

A.2 OPTIMAL CONDITIONS OF EBGAN

In Zhao et al. (2016), the training objectives of the generator and the discriminator cannot be written as a single minimax optimization problem, since the margin structure is only applied to the objective of the discriminator. In addition, the discriminator is designed to produce the mean squared reconstruction error of an auto-encoder structure. This restricts the range of the discriminator output to be non-negative, which is equivalent to posing a set constraint on the discriminator under the non-parametric setting.

Thus, to characterize the optimal generator and discriminator, we adopt the same analysis logic used in the proof sketch of the original GAN (Goodfellow et al., 2014). Specifically, given a specific generator distribution $p_{\text{gen}}$, the optimal discriminator function given the generator distribution, $c^*(x; p_{\text{gen}})$, can be derived by examining the objective of the discriminator. Then, the conditional optimal discriminator function is substituted into the training objective of $p_{\text{gen}}$, reducing the "adversarial" training to a minimization problem only w.r.t. $p_{\text{gen}}$, which can be well analyzed.

Firstly, given any generator distribution $p_{\text{gen}}$, the EBGAN training objective for the discriminator can be written in the following form

$$
\begin{aligned}
c^*(x; p_{\text{gen}}) &= \arg\max_{c \in \mathcal{C}} \; -\mathbb{E}_{p_{\text{gen}}} \max\big(0, m - c(x)\big) - \mathbb{E}_{p_{\text{data}}} c(x) \\
&= \arg\max_{c \in \mathcal{C}} \; \mathbb{E}_{p_{\text{gen}}} \min\big(0, c(x) - m\big) - \mathbb{E}_{p_{\text{data}}} c(x)
\end{aligned} \tag{19}
$$

where $\mathcal{C} = \{c : c(x) \geq 0, \forall x \in \mathcal{X}\}$ is the set of allowed non-negative discriminator functions. Note that this set constraint comes from the fact that the discriminator output is a mean squared reconstruction error, as discussed above. Since the problem (19) is independent w.r.t.
each $x$, the optimal solution can be easily derived as

$$
c^*(x; p_{\text{gen}}) =
\begin{cases}
0, & p_{\text{gen}}(x) < p_{\text{data}}(x) \\
m, & p_{\text{gen}}(x) > p_{\text{data}}(x) \\
\alpha_x, & p_{\text{gen}}(x) = p_{\text{data}}(x) > 0 \\
\beta_x, & p_{\text{gen}}(x) = p_{\text{data}}(x) = 0
\end{cases} \tag{20}
$$

where $\alpha_x \in [0, m]$ is an under-determined number, $\beta_x \in [0, \infty)$ is another under-determined non-negative real number, and the subscripts in $\alpha_x, \beta_x$ reflect the fact that these under-determined values can be distinct for different $x$.

This way, the overall training objective can be cast into a minimization problem w.r.t. $p_{\text{gen}}$,

$$
\begin{aligned}
p^*_{\text{gen}} &= \arg\min_{p_{\text{gen}} \in \mathcal{P}} \; \mathbb{E}_{x \sim p_{\text{gen}}} c^*(x; p_{\text{gen}}) - \mathbb{E}_{x \sim p_{\text{data}}} c^*(x; p_{\text{gen}}) \\
&= \arg\min_{p_{\text{gen}} \in \mathcal{P}} \sum_{x \in \mathcal{X}} \big[p_{\text{gen}}(x) - p_{\text{data}}(x)\big]\, c^*(x; p_{\text{gen}})
\end{aligned} \tag{21}
$$

where the second term of the first line is implicitly defined, as the problem is an adversarial game between $p_{\text{gen}}$ and $c$.

Proposition A.1. The global optimum of the EBGAN training objective is achieved if and only if $p_{\text{gen}} = p_{\text{data}}$. At that point, $c^*(x)$ is fully under-determined.

Proof. The proof is established by contradiction.

Firstly, assume the optimal $p^*_{\text{gen}} \neq p_{\text{data}}$. Thus, there must exist a non-equal set $\mathcal{X}_{\neq} = \{x \mid p_{\text{data}}(x) \neq p^*_{\text{gen}}(x)\}$, which can be further split into two subsets, the greater-than set $\mathcal{X}_{>} = \{x \mid p^*_{\text{gen}}(x) > p_{\text{data}}(x)\}$ and the less-than set $\mathcal{X}_{<} = \{x \mid p^*_{\text{gen}}(x) < p_{\text{data}}(x)\}$. Similarly, we define the equal set $\mathcal{X}_{=} = \{x : p^*_{\text{gen}}(x) = p_{\text{data}}(x)\}$. Obviously, $\mathcal{X}_{>} \cup \mathcal{X}_{<} \cup \mathcal{X}_{=} = \mathcal{X}$.

Let $L(p_{\text{gen}}) = \sum_{x \in \mathcal{X}} \big[p_{\text{gen}}(x) - p_{\text{data}}(x)\big]\, c^*(x; p_{\text{gen}})$. Substituting the results from equation (20) into (21), $L(p^*_{\text{gen}})$ can be written as

$$
\begin{aligned}
L(p^*_{\text{gen}}) &= \sum_{x \in \mathcal{X}_{>} \cup \mathcal{X}_{<} \cup \mathcal{X}_{=}} \big[p^*_{\text{gen}}(x) - p_{\text{data}}(x)\big]\, c^*(x; p^*_{\text{gen}}) \\
&= \sum_{x \in \mathcal{X}_{<}} \big[p^*_{\text{gen}}(x) - p_{\text{data}}(x)\big]\, c^*(x; p^*_{\text{gen}}) + \sum_{x \in \mathcal{X}_{>}} \big[p^*_{\text{gen}}(x) - p_{\text{data}}(x)\big]\, c^*(x; p^*_{\text{gen}}) \\
&= m \sum_{x \in \mathcal{X}_{>}} \big(p^*_{\text{gen}}(x) - p_{\text{data}}(x)\big) > 0
\end{aligned} \tag{22}
$$

Here the terms over $\mathcal{X}_{=}$ vanish because $p^*_{\text{gen}}(x) - p_{\text{data}}(x) = 0$ on that set, and the terms over $\mathcal{X}_{<}$ vanish because $c^* = 0$ there by equation (20). However, when $p'_{\text{gen}} = p_{\text{data}}$, we have

$$
L(p'_{\text{gen}}) = 0 < L(p^*_{\text{gen}}) \tag{23}
$$

which contradicts the optimality (minimum) assumption of $p^*_{\text{gen}}$. Hence, at the global optimum, $p^*_{\text{gen}} = p_{\text{data}}$. By equation (20), it directly follows that $c^*(x; p^*_{\text{gen}}) = \alpha_x$, which completes the proof.

A.3 ANALYSIS OF ADDING ADDITIONAL TRAINING SIGNAL TO GAN FORMULATION

To show that simply adding the same training signal to GAN will not lead to the same result, it is more convenient to work directly with the formulation of the $f$-GAN family (Nowozin et al., 2016, equation (6)), which includes the original GAN formulation as a special case.

Specifically, the general $f$-GAN formulation takes the following form

$$
\max_{c} \min_{p_{\text{gen}} \in \mathcal{P}} \; \mathbb{E}_{x \sim p_{\text{gen}}}\big[f^{\star}(c(x))\big] - \mathbb{E}_{x \sim p_{\text{data}}}\big[c(x)\big], \tag{24}
$$

where $f^{\star}(\cdot)$ denotes the convex conjugate (Boyd & Vandenberghe, 2004) of the $f$-divergence function. The optimal condition of the discriminator can be found by taking the variation w.r.t. $c$, which gives the optimal discriminator

$$
c^*(x) = f'\!\left(\frac{p_{\text{data}}(x)}{p_{\text{gen}}(x)}\right) \tag{25}
$$

where $f'(\cdot)$ is the first-order derivative of $f(\cdot)$. Note that, even when we add an extra term $K(p_{\text{gen}})$ to equation (24), since this term is a constant w.r.t. the discriminator, it does not change the result given by equation (25) about the optimal discriminator. Indeed, if $p_{\text{gen}} = p_{\text{data}}$, equation (25) gives the constant $c^*(x) = f'(1)$ for every $x$, which carries no density information. As a consequence, for the optimal discriminator to retain the density information, we effectively need $p_{\text{gen}} \neq p_{\text{data}}$. Hence, there is a contradiction if $c^*(x)$ is required to retain the density information while the generator matches the data distribution.

Intuitively, this problem is rooted in the fact that the $f$-divergence is quite "rigid", in the sense that given $p_{\text{gen}}(x)$ it only allows one fixed point for the discriminator. In comparison, the divergence used in our proposed formulation, which is the expected cost gap, is much more flexible. With the expected cost gap alone, i.e. without the $K(p_{\text{gen}})$ term, the optimal discriminator is actually under-determined.

B SUPPLEMENTARY MATERIALS FOR SECTION 5

B.1 EXPERIMENT SETTING

Here, we specify the neural architectures used for the experiments presented in Section 5.

Firstly, for the EGAN-Ent-VI model, we parameterize the approximate posterior distribution $q_{\text{gen}}(z \mid x)$ with a diagonal Gaussian distribution, whose mean and covariance matrix are the output of a trainable inference network, i.e.

$$
q_{\text{gen}}(z \mid x) = \mathcal{N}(\mu, I\sigma), \qquad \mu, \log \sigma = f_{\text{infer}}(x) \tag{26}
$$

where $f_{\text{infer}}$ denotes the inference network, and $I$ is the identity matrix. Note that the Inference Network only appears in the EGAN-Ent-VI model.

For experiments with the synthetic datasets, the following fully-connected feed-forward neural networks are employed

• Generator: FC(4,128)-BN-ReLU-FC(128,128)-BN-ReLU-FC(128,2)
• Discriminator: FC(2,128)-ReLU-FC(128,128)-ReLU-FC(128,1)
• Inference Net: FC(2,128)-ReLU-FC(128,128)-ReLU-FC(128,4*2)

where FC and BN denote a fully-connected layer and a batch normalization layer respectively. Note that since the input noise to the generator has dimension 4, the Inference Net output has dimension 8, where the first 4 elements correspond to the inferred mean, and the last 4 elements correspond to the inferred diagonal covariance matrix in log scale.

For the handwritten digit experiment, we closely follow the DCGAN (Radford et al., 2015) architecture with the following configuration

• Generator: FC(10,512*7*7)-BN-ReLU-DC(512,256;4c2s)-BN-ReLU-DC(256,128;4c2s)-BN-ReLU-DC(128,1;3c1s)-Sigmoid
• Discriminator: CV(1,64;3c1s)-BN-LRec-CV(64,128;4c2s)-BN-LRec-CV(128,256;4c2s)-BN-LRec-FC(256*7*7,1)
• Inference Net: CV(1,64;3c1s)-BN-LRec-CV(64,128;4c2s)-BN-LRec-CV(128,256;4c2s)-BN-LRec-FC(256*7*7,10*2)
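As a rough illustration of how the `4*2` or `10*2` output of an Inference Net can be read under equation (26), the sketch below splits a raw output vector into a mean part and a log-scale part, draws a sample from the resulting diagonal Gaussian, and evaluates its log-density. The function names, the reparameterized sampling step, the example input vector, and the reading of $\log \sigma$ as a per-dimension log standard deviation are assumptions made for illustration; they are not taken from the authors' code.

```python
import math
import random


def split_inference_output(out, latent_dim):
    """Split a raw output of length 2*latent_dim into (mu, log_sigma)."""
    if len(out) != 2 * latent_dim:
        raise ValueError("expected an output of length 2 * latent_dim")
    return out[:latent_dim], out[latent_dim:]


def sample_z(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (assumed sampling scheme)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def log_q(z, mu, log_sigma):
    """Log-density of z under a diagonal Gaussian with std exp(log_sigma)."""
    total = 0.0
    for zi, m, s in zip(z, mu, log_sigma):
        standardized = (zi - m) / math.exp(s)
        total += -0.5 * math.log(2.0 * math.pi) - s - 0.5 * standardized ** 2
    return total


# Hypothetical raw output for a 4-dimensional latent (synthetic-data setting):
# the first 4 entries are the mean, the last 4 the log-scale part.
raw = [0.5, -1.0, 0.0, 2.0, 0.0, 0.0, -0.7, 0.3]
mu, log_sigma = split_inference_output(raw, latent_dim=4)
z = sample_z(mu, log_sigma, random.Random(0))
```

Working with the log-scale output keeps the standard deviation positive without constraining the network's final layer.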

Here, LRec is the leaky rectified non-linearity recommended by Radford et al. (2015). In addition, CV(128,256;4c2s) denotes a convolutional layer with 128 input channels, 256 output channels, and kernel size 4 with stride 2. Similarly, DC(256,128;4c2s) denotes a corresponding transposed convolutional operation. Compared to the original DCGAN architecture, the discriminator under our formulation does not have the last sigmoid layer, which squashes a scalar value into a probability in [0, 1].

For the CelebA experiment with color images, we use the following architecture

• Generator: FC(10,512*4*4)-BN-ReLU-DC(512,256;4c2s)-BN-ReLU-DC(256,128;4c2s)-BN-ReLU-DC(256,128;4c2s)-BN-ReLU-DC(128,3;4c2s)-Tanh
• Discriminator: CV(3,64;4c2s)-BN-LRec-CV(64,128;4c2s)-BN-LRec-CV(128,256;4c2s)-BN-LRec-CV(256,256;4c2s)-BN-LRec-FC(256*4*4,1)
• Inference Net: CV(3,64;4c2s)-BN-LRec-CV(64,128;4c2s)-BN-LRec-CV(128,256;4c2s)-BN-LRec-CV(256,256;4c2s)-BN-LRec-FC(256*4*4,10*2)

For the CIFAR-10 experiment, a similar architecture is used

• Generator: FC(10,512*4*4)-BN-ReLU-DC(512,256;4c2s)-BN-ReLU-DC(256,128;3c1s)-BN-ReLU-DC(256,128;4c2s)-BN-ReLU-DC(128,3;4c2s)-Tanh
• Discriminator: CV(3,64;3c1s)-BN-LRec-CV(64,128;4c2s)-BN-LRec-CV(128,256;4c2s)-BN-LRec-CV(256,256;4c2s)-BN-LRec-FC(256*4*4,1)
• Inference Net: CV(3,64;3c1s)-BN-LRec-CV(64,128;4c2s)-BN-LRec-CV(128,256;4c2s)-BN-LRec-CV(256,256;4c2s)-BN-LRec-FC(256*4*4,10*2)
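The layer strings above can be sanity-checked by tracing spatial sizes through each stack. The sketch below does this in plain Python, reading `4c2s` as kernel size 4 with stride 2 and `3c1s` as kernel size 3 with stride 1. The padding of one pixel per side, the helper names, and the 64-pixel CelebA input size are assumptions, since none of them is stated in the text; under these assumptions each discriminator ends at the spatial size expected by its final FC layer (7 for the digit model, 4 for the two natural-image models).

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride, pad):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * pad + kernel


def trace(size, layers, op, pad=1):
    """Apply op to each (kernel, stride) pair in turn; pad=1 is an assumption."""
    for kernel, stride in layers:
        size = op(size, kernel, stride, pad)
    return size


# (kernel, stride) pairs read off the discriminator specifications.
digit_disc = [(3, 1), (4, 2), (4, 2)]
celeba_disc = [(4, 2), (4, 2), (4, 2), (4, 2)]
cifar_disc = [(3, 1), (4, 2), (4, 2), (4, 2)]

# (kernel, stride) pairs read off the generator specifications.
digit_gen = [(4, 2), (4, 2), (3, 1)]
celeba_gen = [(4, 2), (4, 2), (4, 2), (4, 2)]
cifar_gen = [(4, 2), (3, 1), (4, 2), (4, 2)]

digit_feat = trace(28, digit_disc, conv_out)     # input size 28 (MNIST-style digits)
celeba_feat = trace(64, celeba_disc, conv_out)   # input size 64 is assumed
cifar_feat = trace(32, cifar_disc, conv_out)     # CIFAR-10 images are 32 x 32

digit_img = trace(7, digit_gen, deconv_out)      # generator starts from 7 x 7
celeba_img = trace(4, celeba_gen, deconv_out)    # generator starts from 4 x 4
cifar_img = trace(4, cifar_gen, deconv_out)      # generator starts from 4 x 4
```

Only spatial sizes are traced here; channel counts are taken directly from the layer strings and are not checked.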

Given the chosen architectures, we follow Radford et al. (2015) and use Adam as the optimization algorithm. For more detailed hyper-parameters, please refer to the code.

B.2 QUANTITATIVE COMPARISON OF DIFFERENT MODELS

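Gaussian Mixture: $\mathrm{KL}(p_{\text{data}} \Vert p_{\text{emp}})$ = 0. , $\mathrm{KL}(p_{\text{emp}} \Vert p_{\text{data}})$ = 0.

| KL Divergence | $p_{\text{gen}} \Vert p_{\text{emp}}$ | $p_{\text{emp}} \Vert p_{\text{gen}}$ | $p_{\text{gen}} \Vert p_{\text{data}}$ | $p_{\text{data}} \Vert p_{\text{gen}}$ | $p_{\text{disc}} \Vert p_{\text{emp}}$ | $p_{\text{emp}} \Vert p_{\text{disc}}$ | $p_{\text{disc}} \Vert p_{\text{data}}$ | $p_{\text{data}} \Vert p_{\text{disc}}$ | $p_{\text{gen}} \Vert p_{\text{disc}}$ | $p_{\text{disc}} \Vert p_{\text{gen}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| GAN | 0.3034 | 0.5024 | 0.2498 | 0.4807 | 6.7587 | 2.0648 | 6.2020 | 2.0553 | 2.4596 | 7.0895 |
| EGAN-Const | 0.2711 | 0.4888 | 0.2239 | 0.4735 | 6.7916 | 2.1243 | 6.2159 | 2.1149 | 2.5062 | 7.0553 |
| EGAN-Ent-VI | 0.1422 | 0.1367 | 0.0896 | 0.1214 | 0.8866 | 0.6532 | 0.7215 | 0.6442 | 0.7711 | 1.0638 |
| EGAN-Ent-NN | | | | | | | | | | |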

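Biased Gaussian Mixture: $\mathrm{KL}(p_{\text{data}} \Vert p_{\text{emp}})$ = 0. , $\mathrm{KL}(p_{\text{emp}} \Vert p_{\text{data}})$ = 0.

| KL Divergence | $p_{\text{gen}} \Vert p_{\text{emp}}$ | $p_{\text{emp}} \Vert p_{\text{gen}}$ | $p_{\text{gen}} \Vert p_{\text{data}}$ | $p_{\text{data}} \Vert p_{\text{gen}}$ | $p_{\text{disc}} \Vert p_{\text{emp}}$ | $p_{\text{emp}} \Vert p_{\text{disc}}$ | $p_{\text{disc}} \Vert p_{\text{data}}$ | $p_{\text{data}} \Vert p_{\text{disc}}$ | $p_{\text{gen}} \Vert p_{\text{disc}}$ | $p_{\text{disc}} \Vert p_{\text{gen}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| GAN | 0.0788 | 0.0705 | 0.0413 | 0.0547 | 7.1539 | 2.5230 | 6.4927 | 2.5018 | 2.5205 | 7.1140 |
| EGAN-Const | 0.1545 | 0.1649 | 0.1211 | 0.1519 | 7.1568 | 2.5269 | 6.4969 | 2.5057 | 2.5860 | 7.1995 |
| EGAN-Ent-VI | | | | | | | | | | |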

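Two-spiral Gaussian Mixture: $\mathrm{KL}(p_{\text{data}} \Vert p_{\text{emp}})$ = 0. , $\mathrm{KL}(p_{\text{emp}} \Vert p_{\text{data}})$ = 1.

| KL Divergence | $p_{\text{gen}} \Vert p_{\text{emp}}$ | $p_{\text{emp}} \Vert p_{\text{gen}}$ | $p_{\text{gen}} \Vert p_{\text{data}}$ | $p_{\text{data}} \Vert p_{\text{gen}}$ | $p_{\text{disc}} \Vert p_{\text{emp}}$ | $p_{\text{emp}} \Vert p_{\text{disc}}$ | $p_{\text{disc}} \Vert p_{\text{data}}$ | $p_{\text{data}} \Vert p_{\text{disc}}$ | $p_{\text{gen}} \Vert p_{\text{disc}}$ | $p_{\text{disc}} \Vert p_{\text{gen}}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| GAN | 0.5297 | 0.2701 | 0.3758 | 0.7240 | 6.3507 | 1.7180 | 4.3818 | 1.0866 | 1.6519 | 5.7694 |
| EGAN-Const | 0.7473 | 1.0325 | 0.7152 | 1.6703 | 5.9930 | 1.5732 | 3.9749 | 0.9703 | 1.8380 | 6.0471 |

EGAN-Ent-VI 0.2014 0.1260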

Table 2: Pairwise KL divergence between distributions. Boldface indicates the lowest divergence within each group.

In order to quantify the quality of the recovered distributions, we compute the pairwise KL divergence of the following four distributions:

• The real data distribution with analytic form, denoted as $p_{\text{data}}$
• The empirical data distribution approximated from the 100K training data, denoted as $p_{\text{emp}}$
• The generator distribution approximated from 100K generated data, denoted as $p_{\text{gen}}$
• The discriminator distribution re-normalized from the learned energy, denoted as $p_{\text{disc}}$

Since the synthetic datasets are two-dimensional, we approximate both the empirical data distribution and the generator distribution using simple histogram estimation. Specifically, we divide the canvas into a 100-by-100 grid, and assign each sample to its nearest grid cell based on Euclidean distance. Then, we normalize the number of samples in each cell into a proper distribution. When recovering the discriminator distribution from the learned energy, we assume that $\mu^*(x) = 0$ (i.e. infinite data support), and discretize the distribution into the same grid cells

$$
p_{\text{disc}}(x) = \frac{\exp(-c^*(x))}{\sum_{x' \in \text{Grid}} \exp(-c^*(x'))}, \quad \forall x \in \text{Grid}
$$

Based on these approximations, Table 2 summarizes the results. For all measures related to the discriminator distribution, EGAN-Ent-VI and EGAN-Ent-NN significantly outperform the other two baseline models, which matches our visual assessment in Figures 2 and 3. Meanwhile, the generator distributions learned from our proposed framework also achieve relatively lower divergence to both the empirical data distribution and the true data distribution.

B.3 COMPARISON OF THE ENTROPY (GRADIENT) APPROXIMATION METHODS

In order to understand the performance difference between EGAN-Ent-VI and EGAN-Ent-NN, we analyze the quality of the entropy gradient approximation during training. To do that, we visualize some detailed training information in Figure 8.

(a) Training details under variational inference entropy approximation
(b) Training details under nearest neighbor entropy approximation

Figure 8: For convenience, we will use Fig. (i,j) to refer to the subplot in row i, column j. Fig. (1,1): current energy plot. Fig. (1,2): frequency map of generated samples in the current batch. Fig. (1,3): frequency map of real samples in the current batch. Fig. (1,4): frequency difference between real and generated samples. Fig. (2,1): comparison between more samples generated from the current model and real samples. Fig. (2,2): the discriminator gradient w.r.t. each training sample. Fig. (2,3): the entropy gradient w.r.t. each training sample. Fig. (2,4): the total gradient (discriminator + entropy) w.r.t. each training sample.

As we can see in Figure 8a, the variational entropy gradient approximation w.r.t. samples is not accurate:

• It is inaccurate in terms of gradient direction. Ideally, the direction of the entropy gradient should point from the center of its closest mode towards the surroundings, with the direction orthogonal to the implicit contour in Fig. (1,2). However, the direction of the gradients in Fig. (2,3) does not match this.
• It is inaccurate in magnitude. As we can see, the entropy approximation gradient (Fig. (2,3)) has a much larger norm than the discriminator gradient (Fig. (2,2)). As a result, the total gradient (Fig. (2,4)) is fully dominated by the entropy approximation gradient. Thus, it usually takes much longer for the generator to learn to generate rare samples, and the training also proceeds much more slowly compared to the nearest neighbor based approximation.

In comparison, the nearest neighbor based gradient approximation is much more accurate, as shown in Figure 8b. As a result, it leads to a more accurate energy contour, as well as faster training. What is more, from Figure 8b, Fig. (2,4), we can see that the entropy gradient does have the cancel-out effect on the discriminator gradient, which again matches our theory.

B.4 RANKING NIST DIGITS

Figure 9 shows the ranking of all 1000 generated and real images (from the test set) for three models: EGAN-Ent-NN, EGAN-Const, and GAN. We can clearly see that in EGAN-Ent-NN the top-ranked digits look very similar to the mean digit. From the upper-left corner to the lower-right corner, the trend is that the rotation degree increases, and the digits become increasingly thick or thin compared to the mean. In addition, samples in the last few rows do diverge from the mean image: they are either strongly slanted to the right or left, or have a different shape (very thin, very thick, or typewriter script). The other models are not able to achieve a similarly clear distinction between high and low probability images. Finally, we consistently observe the same trend in modeling other digits, which are not shown in this paper due to space constraints.

B.5 CLASSIFIER PERFORMANCE AS A PROXY MEASURE

As mentioned in Section 5, evaluating the proposed formulation quantitatively on high-dimensional data is extremely challenging. Here, in order to provide more quantitative intuition about the learned discriminator at convergence, we adopt a proxy measure. Specifically, we take the last-layer activation of the converged discriminator network as a fixed pretrained feature, and build a linear classifier upon it. Hypothetically, if the discriminator does not degenerate, the extracted last-layer feature should maintain more information about the data points, especially compared to features from degenerated discriminators. Following this idea, we first train EGAN-Ent-NN, EGAN-Const, and GAN on MNIST until convergence, and then extract the last-layer activation from their discriminator networks as fixed feature input. Based on the fixed features, a randomly initialized linear classifier is trained to do classification on MNIST. Based on 10 runs (with different initialization) of each of the three models, the test classification performance is summarized in Table 3. For comparison purposes, we also include a baseline where the input features are extracted from a discriminator network with random weights.

| Test error (%) | EGAN-Ent-NN | EGAN-Const | GAN | Random |
|---|---|---|---|---|
| Min | | | | |

(a) EGAN-Ent-NN (b) EGAN-Const (c) GAN
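The proxy measure described in this section amounts to a linear probe on frozen features. The following is a minimal sketch of that idea in plain Python, using a softmax linear classifier trained by full-batch gradient descent. The toy two-dimensional features, the learning rate, the step count, and all function names are hypothetical stand-ins; in the setting described here the features would be the last-layer activations of a converged discriminator and the error would be measured on held-out test data.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scores_for(W, b, x):
    """Linear scores W x + b for one feature vector."""
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]


def train_linear_probe(features, labels, n_classes, lr=0.2, steps=2000, seed=0):
    """Fit a softmax linear classifier on fixed features (features are never updated)."""
    rng = random.Random(seed)
    dim, n = len(features[0]), len(features)
    # Randomly initialized weights (small scale) and zero biases.
    W = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(features, labels):
            probs = softmax(scores_for(W, b, x))
            for k in range(n_classes):
                err = (probs[k] - (1.0 if k == y else 0.0)) / n
                gb[k] += err
                for j in range(dim):
                    gW[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * gb[k]
            for j in range(dim):
                W[k][j] -= lr * gW[k][j]
    return W, b


def error_rate(W, b, features, labels):
    """Classification error in percent."""
    wrong = 0
    for x, y in zip(features, labels):
        s = scores_for(W, b, x)
        wrong += max(range(len(s)), key=s.__getitem__) != y
    return 100.0 * wrong / len(labels)


# Toy "frozen features": three well-separated clusters (hypothetical data).
feats = [(2.0, 0.0), (2.5, 0.5), (1.5, -0.5),
         (-2.0, 0.0), (-2.5, 0.5), (-1.5, -0.5),
         (0.0, 3.0), (0.5, 2.5), (-0.5, 3.5)]
labs = [0, 0, 0, 1, 1, 1, 2, 2, 2]
W, b = train_linear_probe(feats, labs, n_classes=3)
train_error = error_rate(W, b, feats, labs)
```

Because only the linear layer is trained, any difference in error between feature extractors reflects how much class information the frozen features retain.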