Decomposed Adversarial Learned Inference
Alexander Hanbo Li*, Yaqing Wang*, Changyou Chen and Jing Gao
Amazon Alexa AI; SUNY Buffalo
[email protected], {yaqingwa, changyou, jing}@buffalo.edu

Abstract
Effective inference for a generative adversarial model remains an important and challenging problem. We propose a novel approach, Decomposed Adversarial Learned Inference (DALI), which explicitly matches prior and conditional distributions in both data and code spaces, and puts a direct constraint on the dependency structure of the generative model. We derive an equivalent form of the prior and conditional matching objective that can be optimized efficiently without any parametric assumption on the data. We validate the effectiveness of DALI on the MNIST, CIFAR-10, and CelebA datasets by conducting quantitative and qualitative evaluations. Results demonstrate that DALI significantly improves both reconstruction and generation as compared to other adversarial inference models.
1 Introduction

Deep directed generative models like the variational autoencoder (VAE) [Kingma and Welling, 2013; Rezende et al., 2014] and the generative adversarial network (GAN) [Goodfellow et al., 2014] have proved to be powerful for modeling complex high-dimensional distributions. While both VAE and GAN can learn to generate realistic images, their underlying mechanisms are fundamentally different. VAE maps the data into low-dimensional codes using an encoder, and then reconstructs the original data by a decoder. This allows it to perform both generation and inference. GAN, on the other hand, trains a generator and a discriminator adversarially. The generator learns to fool the discriminator by mapping low-dimensional noise vectors to the data space; at the same time, the discriminator evolves to distinguish the generated fake samples from the true ones. These two methods have complementary strengths and weaknesses. VAE can learn a bidirectional mapping between data and code spaces, but relies on over-simplified parametric assumptions on the complex data distribution, thereby causing it to generate blurry images [Donahue et al., 2016; Goodfellow et al., 2014; Larsen et al., 2015]. GAN generates more realistic samples than VAE [Radford et al., 2015; Larsen et al., 2015] because the adversarial regime allows it to learn more complex distributions. However, GAN only learns a unidirectional mapping for data generation, and does not allow inferring the latent codes from given samples. This is limiting because the ability to perform inference is crucial for several downstream applications, such as classification, clustering, similarity search, and interpretation. Furthermore, GAN also suffers from the mode collapse problem [Che et al., 2016; Salimans et al., 2016]: many modes of the data distribution are not represented in the generated samples.

One may therefore wonder whether we can develop a generative model that enjoys the strengths of both GAN and VAE without their inherent weaknesses. Such a model should be able to generate samples of the same quality as GAN, have an inference mechanism as effective as VAE's, and also avoid the mode collapse issue. Many recent efforts have been devoted to combining VAE with adversarial discriminator(s) [Brock et al., 2016; Che et al., 2016; Larsen et al., 2015; Makhzani et al., 2015; Mescheder et al., 2017]. However, VAE-GAN hybrids tend to manifest a compromise of the strengths and weaknesses of both approaches. The main reason is that all of them retain the VAE structure, which requires an explicit metric to measure the data reconstruction and assumes over-simplified parametric data distributions. To overcome such limitations, adversarially learned inference (ALI) [Donahue et al., 2016; Dumoulin et al., 2016] was recently proposed, wherein the discriminator is trained on the joint distribution of data and latent codes. In this way, under a perfect discriminator, one can match the joint distributions of the decoder and encoder, and thereby perform inference by sampling from the encoder's conditional, which matches the decoder's posterior. In practice, however, the equilibrium of the joint adversarial game is hard to attain because the dependency structure between data and codes is not explicitly specified. The reconstructions of ALI are thus not always faithful [Dumoulin et al., 2016; Li et al., 2017], implying that its inference is not always effective.
To overcome the aforementioned issues, in this paper we propose a novel approach, decomposed adversarial learned inference (DALI), that integrates efficient inference into GAN and overcomes the limitations of prior approaches. The approach keeps the structure simple, involving only one generator, one encoder, and one discriminator. Furthermore, DALI's objective is directly derived from our goal of matching both the prior and the conditional distributions of the generator and encoder, instead of a heuristic combination with ℓ_k-norm regularization. Compared to regular GANs, DALI has the ability to conduct inference, and also does not suffer from the mode collapse problem. Moreover, DALI abandons the unrealistic parametric assumption on the conditional data distribution, and does not require any reconstruction in the data space. This is fundamentally different from VAE or VAE-GAN hybrids, in which the ℓ_k norm is used to measure the data reconstruction; the use of such simple data-fitting metrics on the complex data distribution leads to worse generation performance. Different from ALI, DALI decomposes the hard problem of matching the joint distributions into two sub-tasks: explicitly matching the priors on the latent codes and the conditionals on the data. As a consequence of this more restrictive constraint, it achieves better generation and more faithful reconstruction than ALI. Note that GAN variants with an inference mechanism usually worsen generation performance as compared to regular GANs [Rosca et al., 2018]. To the best of our knowledge, as demonstrated in the experiments, DALI is the first framework that further improves generation performance compared with GANs of the same architecture, while providing consistent inference on even complicated distributions.

We consider the generative model p_θ*(z) p_θ*(x|z), where a latent variable z^(i) is first generated from the prior distribution p_θ*(z), and then the data x^(i) is sampled from the conditional distribution p_θ*(x|z). The parameter θ* stands for the ground-truth parameter of the underlying distribution. The prior p_θ*(z) is always assumed to be a simple parametric distribution (e.g., N(0, I)), but the generative conditional p_θ*(x|z) is much more complicated and not known to us. Moreover, the posterior distribution p_θ*(z|x) is intractable but stands for an important inference procedure: given a data point x^(i), it allows us to infer its latent variable z^(i).

In our method, we model the generating process by a neural network called the generator, and the inference process by another neural network called the encoder. Consider the following two distributions of the generator and encoder, and their corresponding sampling procedures:
• the generator distribution: p_θ(z) p_θ(x|z); z ∼ p_θ(z), x ∼ p_θ(x|z).
• the encoder distribution: q_φ(x) q_φ(z|x); x ∼ q_φ(x), z ∼ q_φ(z|x).
The generator's conditional p_θ(x|z) approximates the generating distribution p_θ*(x|z). The encoder's conditional q_φ(z|x) approximates the posterior distribution p_θ*(z|x), which is what we need for inference. The marginal distribution q_φ(x) stands for the empirical data distribution, and the other marginal p_θ(z) is taken to be the prior p_θ*(z), which is always a known distribution like the standard Gaussian.
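To make the two sampling directions concrete, the following is a minimal PyTorch sketch (our own illustration; the module structure and all layer sizes are hypothetical and not the architectures used in our experiments):

import torch
import torch.nn as nn

d_latent, d_data = 64, 784  # hypothetical latent and data dimensions

# generator: models p_theta(x|z); deterministic, as in our experiments
G = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_data))

class Encoder(nn.Module):
    # models q_phi(z|x) = N(mu(x), diag(sigma^2(x)))
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_data, 256), nn.ReLU())
        self.mu = nn.Linear(256, d_latent)
        self.log_var = nn.Linear(256, d_latent)
    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

E = Encoder()

# generator direction: z ~ p_theta(z), then x ~ p_theta(x|z)
z = torch.randn(8, d_latent)
x = G(z)
# encoder direction: x ~ q_phi(x) (reusing x for illustration), then z ~ q_phi(z|x)
mu, log_var = E(x)
z_post = mu + (0.5 * log_var).exp() * torch.randn_like(mu)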
The ultimate goal is to match the joint distributions p_θ(x, z) and q_φ(x, z). If this is achieved, we are guaranteed that all marginals match and all conditionals match as well. In particular, the conditional q_φ(z|x) matches the posterior p_θ(z|x). We propose to decompose this goal into two sub-tasks: matching the priors p_θ(z) and q_φ(z), and matching the conditionals p_θ(x|z) and q_φ(x|z). There are two advantages. First, we explicitly define the dependency structure z → x. Second, the explicit constraints on both priors and conditionals are stronger than a single constraint on the joint distributions.

More formally, we decompose the problem of minimizing KL(p_θ(x, z) || q_φ(x, z)) into matching both the prior and the conditional distributions, that is, minimizing

E_{p_θ(z)} KL(p_θ(x|z) || q_φ(x|z)) + KL(p_θ(z) || q_φ(z)).   (1)

Note that (1) is not identical to ALI's objective, but their minima are attained at the same point. By the properties of the KL-divergence, when the minimum of (1) is attained, we have p_θ(z) = q_φ(z) and p_θ(x|z) = q_φ(x|z) for all x and z, and hence p_θ(x, z) = q_φ(x, z).

The objective (1) cannot be directly optimized because both q_φ(z) and q_φ(x|z) are impossible to sample from, as the flow in the encoder is from x to z. However, we prove that the intractable (1) can be rephrased as the combination of a KL-divergence term and a reconstruction term, both containing only distributions that can either be sampled from or directly evaluated.

First, by the definition of the KL-divergence, for any fixed z,

KL(p_θ(x|z) || q_φ(x|z)) = E_{p_θ(x|z)}[log p_θ(x|z) − log q_φ(x|z)].   (2)

Then, by Bayes' theorem, log q_φ(x|z) = log q_φ(x) + log q_φ(z|x) − log q_φ(z). Plugging this identity into (2) and doing some algebra, we get

KL(p_θ(x|z) || q_φ(x)) − E_{p_θ(x|z)}[log q_φ(z|x)] + log q_φ(z).   (3)

Next, for the second term of (1), we also write out the definition KL(p_θ(z) || q_φ(z)) = E_{p_θ(z)}[log p_θ(z) − log q_φ(z)]. Then we have

E_{p_θ(z)} KL(p_θ(x|z) || q_φ(x|z)) + KL(p_θ(z) || q_φ(z))
= E_{p_θ(z)}[(3)] + E_{p_θ(z)}[log p_θ(z) − log q_φ(z)]
= E_{p_θ(z)}[ KL(p_θ(x|z) || q_φ(x)) − E_{p_θ(x|z)} log q_φ(z|x) ] + C,   (4)

where C = E_{p_θ(z)} log p_θ(z) is a constant because the prior p_θ(z) is a fixed parametric distribution. For example, when z ∼ N(0, I_d), we have E_{p_θ(z)} log p_θ(z) = −d(1 + log(2π))/2. Therefore, minimizing the objective (1) is transformed into minimizing the new objective

E_{p_θ(z)} { KL(p_θ(x|z) || q_φ(x)) + E_{p_θ(x|z)}[−log q_φ(z|x)] }.   (5)

Intuitively, term (I) = E_{p_θ(z)} KL(p_θ(x|z) || q_φ(x)) measures the difference between the generated and the real samples, and term (II) = E_{p_θ(z)} E_{p_θ(x|z)}[−log q_φ(z|x)] measures the reconstruction of the latent codes. We summarize the above procedure as a proposition.

Proposition 1. The final objective function (5),

E_{p_θ(z)} { KL(p_θ(x|z) || q_φ(x)) + E_{p_θ(x|z)}[−log q_φ(z|x)] },

is minimized when p_θ(z) = q_φ(z) and p_θ(x|z) = q_φ(x|z) for all z and x. Hence, at the minimum, the joint distributions match: p_θ(x, z) = q_φ(x, z).
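As a quick sanity check of the constant C (our own verification, not part of the original derivation): for z ∼ N(0, I_d), the density gives log p_θ(z) = −(d/2) log(2π) − ||z||²/2, so

E_{p_θ(z)}[log p_θ(z)] = −(d/2) log(2π) − (1/2) E||z||² = −(d/2) log(2π) − d/2 = −d(1 + log(2π))/2,

using E||z||² = d for a standard Gaussian, which matches the value stated after (4).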
The VAE [Kingma and Welling, 2013] method, using the notation of this paper, actually depends on the following identity:

KL(q_φ(z|x) || p_θ(z|x)) = KL(q_φ(z|x) || p_θ(z)) − E_{q_φ(z|x)} log p_θ(x|z) + log p_θ(x).   (6)

Then, because of the non-negativity of the KL-divergence, we have

log p_θ(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p_θ(z)) =: ELBO,

and hence maximizing the log-likelihood of the observations can be transferred to maximizing the evidence lower bound (ELBO). Taking a closer look at (6) and comparing it to (3), we notice that (6) is a decomposition of the KL-divergence between two conditionals, q_φ(z|x) and p_θ(z|x). Therefore, we can follow the same approach as after (3) and obtain the following identity:

E_{q_φ(x)} KL(q_φ(z|x) || p_θ(z|x)) + KL(q_φ(x) || p_θ(x)) − E_{q_φ(x)} log q_φ(x)
= E_{q_φ(x)} { KL(q_φ(z|x) || p_θ(z)) + E_{q_φ(z|x)}[−log p_θ(x|z)] }.   (7)

(Indeed, taking E_{q_φ(x)} of (6) and adding KL(q_φ(x) || p_θ(x)) = E_{q_φ(x)}[log q_φ(x) − log p_θ(x)] cancels the log p_θ(x) term and yields (7).) We denote I_vae = E_{q_φ(x)}[KL(q_φ(z|x) || p_θ(z))] and II_vae = E_{q_φ(x)} E_{q_φ(z|x)}[−log p_θ(x|z)]. Since the marginal q_φ(x) stands for the empirical data distribution, the right-hand side of (7) is the empirical expectation of the negative ELBO, which is what VAE tries to minimize. We then conclude from (7) that VAE performs marginal distribution matching in the data space and conditional distribution matching in the latent space. This distribution-matching view of VAE is also observed by Rosca et al. [2018].

However, the marginal distributions in the data space are very complex, and the direction x → z of the conditional distributions in the latent space is opposite to the generating process z → x. Hence, in order to match these distributions, VAE's objective has a reconstruction term II_vae on x and a regularization term I_vae on the latent z. But to evaluate both terms, we need to make parametric assumptions on both conditionals q_φ(z|x) and p_θ(x|z). The assumption in I_vae can be relaxed using GANs [Makhzani et al., 2015], but the assumption in II_vae is critical and limits the performance of VAE-GAN hybrids.

Our model, DALI, instead performs marginal distribution matching in the latent space and conditional distribution matching in the data space. In (5), since term (I) will be replaced with an adversarial game (see Section 3.4), the only assumption we need to make is on term (II), that is, on the conditional q_φ(z|x), and our model remains very flexible in its dependence on z. This assumption is much weaker than one on p_θ(x|z) and does not lead to the problems of VAE or VAE-GANs (e.g., blurriness).

The KL-divergence term (I) can be replaced by an adversarial game using f-divergence theory [Nowozin et al., 2016]. The reconstruction term (II) is a log-likelihood and can simply be evaluated if we assume a parametric q_φ(z|x). Therefore, our framework requires exactly one generator G, one discriminator D, and one encoder E. We now discuss how to play the adversarial game and measure the reconstruction in detail.
Adversarial game. Because we do not want to make any parametric assumption on the distribution p_θ(x|z), an adversarial game is played to distinguish p_θ(x|z) from q_φ(x). Following the theory of f-GAN [Nowozin et al., 2016], we construct an adversarial game with the value function

V(G, D) = E_{x∼q_φ(x)}[D(x)] + E_{z∼p_θ(z)}[−exp(D(G(z)) − 1)].   (8)

Under the perfect discriminator, finding the optimal generator of (8) is then equivalent to minimizing the KL-divergence. The activation function for the discriminator in (8) is simply the identity mapping, instead of the sigmoid function used in the original GAN. But just like in the original GAN, the generator of (8) suffers from the vanishing-gradient problem [Goodfellow, 2016]. Therefore, in our experiments, we maximize (8) for the discriminator D, but minimize E_{z∼p_θ(z)}[−D(G(z))] for the generator G. We call the algorithm using this value function DALI-f.

As shown in Fedus et al. [2017] and Lucic et al. [2018], the equilibrium of the adversarial game is hard to attain in practice, and we do not use the theoretical value function to train G because of the vanishing-gradient problem. Therefore, we also try WGAN and GAN for the adversarial game in our experiments, and find that GAN provides consistently better and more stable results.
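As an illustration, a minimal PyTorch sketch of these two losses under the stated choices (our own rendering; D is an identity-activation discriminator and x_fake = G(z)):

import torch

def f_discriminator_loss(D, x_real, x_fake):
    # negated value function (8): D maximizes E_q[D(x)] - E_p[exp(D(G(z)) - 1)]
    return -(D(x_real).mean() - torch.exp(D(x_fake.detach()) - 1).mean())

def f_generator_loss(D, x_fake):
    # non-saturating surrogate used by DALI-f: G minimizes E_p[-D(G(z))]
    return -D(x_fake).mean()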
Reconstruction. Because of the simplicity of the distribution of z, we make a reasonable parametric assumption on q_φ(z|x) so that the log-likelihood can be explicitly calculated. In this paper we assume z|x ∼ N(µ(x), diag(σ²(x))), and define

L(z, µ(x), σ(x)) := −log q_φ(z|x) = (1/2) Σ_{j=1}^{d} [ (z_j − µ_j(x))² / σ_j²(x) + log σ_j²(x) + log(2π) ],   (9)

where d is the dimension of the latent variable z. In this case, the encoder network only needs to output two vectors, µ(x) and σ(x); that is, E(x) = (µ(x), σ(x)). We can then compute the approximate negative posterior log-likelihood by plugging E(x) into (9).
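A direct PyTorch translation of (9) (a sketch of our own, assuming the encoder outputs µ(x) and log σ²(x)):

import math
import torch

def latent_nll(z, mu, log_var):
    # L(z, mu(x), sigma(x)) in (9): per-sample negative log-likelihood of z
    # under N(mu, diag(sigma^2)), parameterized with log_var = log sigma^2
    return 0.5 * ((z - mu) ** 2 / log_var.exp() + log_var + math.log(2 * math.pi)).sum(dim=1)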
Final Framework. To summarize, our final optimization problem is

min_{G,E} max_D { V(G, D) + λ E_{p_θ(z)}[L(z, E(G(z)))] }.   (10)

Here, λ is a hyper-parameter that needs to be set so that the two parts of (10) are on the same scale. We discuss the selection of λ in detail in the experiment section.

3.5 Training and Inference Procedures

The training procedure is summarized in Algorithm 1. Given random z^(i) ∼ p_θ(z), we first generate samples x̃^(i) ∼ p_θ(x|z^(i)) using the generator. The discriminator is then updated to distinguish between generated and real samples. The encoder outputs the parameters of the distribution q_φ(z|x), from which we calculate the log-likelihood in term (II). The generator and encoder are then updated together to minimize the reconstruction error (i.e., to maximize the expected log-likelihood), while the generator has the extra goal of fooling the discriminator.

For any data point x^(i), its inferred latent code is set to be the conditional mean µ(x^(i)) = E_{q_φ(z|x^(i))}[z], and the reconstruction of x^(i) is G(µ(x^(i))). Besides the reconstruction, we can also generate more samples that are close to x^(i) in the sense that they have similar latent codes, by first sampling z's from the posterior q_φ(z|x^(i)) and then mapping them to the data space using the generator.
Algorithm 1 The DALI training procedure.

θ_g, θ_d, θ_e ← initialize network parameters
repeat
    x^(1), ..., x^(n) ∼ q_φ(x)  ▷ Draw n real samples from the data distribution
    z^(1), ..., z^(n) ∼ p_θ(z)  ▷ Draw n samples from the prior
    x̃^(j) ← G(z^(j)), j = 1, ..., n  ▷ Generate samples using the generator network
    (µ(x̃^(j)), σ(x̃^(j))) ← E(x̃^(j))  ▷ Calculate the mean and variance of q_φ(z | x̃^(j))
    ρ_q^(i) ← D(x^(i)), i = 1, ..., n  ▷ Compute discriminator predictions on real samples
    ρ_p^(j) ← D(x̃^(j)), j = 1, ..., n  ▷ Compute discriminator predictions on generated samples
    L_d ← −(1/n) Σ_{i=1}^{n} log ρ_q^(i) − (1/n) Σ_{j=1}^{n} log(1 − ρ_p^(j))  ▷ Compute discriminator loss
    L_g ← −(1/n) Σ_{j=1}^{n} log ρ_p^(j)  ▷ Compute generator loss
    L_e ← (λ/n) Σ_{j=1}^{n} L(z^(j), µ(x̃^(j)), σ(x̃^(j)))  ▷ Compute encoder loss
    L_rec ← L_g + L_e  ▷ Compute reconstruction loss
    θ_d ← θ_d − ∇_{θ_d} L_d  ▷ Gradient update on the discriminator network
    (θ_g, θ_e) ← (θ_g, θ_e) − ∇_{(θ_g, θ_e)} L_rec  ▷ Gradient update on the generator and encoder networks
until convergence
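For concreteness, the following PyTorch sketch renders one iteration of Algorithm 1 (our own rendering under stated assumptions: D outputs probabilities in (0, 1), E returns (µ, log σ²), latent_nll is the function from the sketch after (9), and opt_d / opt_ge are optimizers over the discriminator and the generator-plus-encoder parameters):

import torch

def dali_step(G, E, D, x_real, opt_d, opt_ge, lam, d_latent):
    n = x_real.size(0)
    z = torch.randn(n, d_latent, device=x_real.device)   # z ~ p(z)
    x_fake = G(z)                                        # x~ = G(z)

    # discriminator update: L_d = -mean log D(x) - mean log(1 - D(x~))
    loss_d = -(D(x_real).log().mean() + (1 - D(x_fake.detach())).log().mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # joint generator/encoder update: L_rec = L_g + L_e
    loss_g = -D(x_fake).log().mean()                     # fool the discriminator
    mu, log_var = E(x_fake)                              # parameters of q_phi(z | x~)
    loss_e = lam * latent_nll(z, mu, log_var).mean()     # latent reconstruction, Eq. (9)
    loss_rec = loss_g + loss_e
    opt_ge.zero_grad(); loss_rec.backward(); opt_ge.step()
    return loss_d.item(), loss_rec.item()

Inference for a test point x then follows the text above: the latent code is the µ returned by E(x), and the reconstruction is G(µ).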
4 Experiments

We evaluate our proposed method, DALI, on both reconstruction and generation tasks, using the MNIST [LeCun et al., 1998], CIFAR-10 [Krizhevsky et al., 2009] and CelebA [Liu et al., 2015] data sets. To show the effectiveness of DALI on mode collapse reduction, we also conduct the same 2D Gaussian mixture experiment as in Dumoulin et al. [2016]. The architectures of our discriminator and generator are based on DCGAN [Radford et al., 2015], though slightly simpler, and can easily be replaced by more advanced state-of-the-art GANs; we use a deterministic generator throughout the experiments. Our encoder network consists of convolutional layers followed by two separate fully connected networks, which predict the mean and the variance of the posterior q_φ(z|x), respectively. The Adam optimizer [Kingma and Ba, 2014] is used, together with the learning rate decay strategy suggested by Kingma and Ba [2014]. Since there are d summands in (9), we simply set λ to 1/d in our experiments, which amounts to calculating the average distance along each dimension of z. We also observe that the discriminator shares a similar task with the encoder: both need to extract higher-level features from raw images. Therefore, in order to reduce the number of parameters and to stabilize the training procedure, our encoder takes the intermediate hidden representation learned by the discriminator as its own input; a code sketch of this sharing is given below. It is worth noting that the encoder does not update the common feature-extracting layers. We use PyTorch 1.1 to implement our model.
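The feature sharing just described could look as follows (a sketch with hypothetical names; the key points are that the encoder consumes the discriminator's intermediate features and that the encoder loss never updates those shared layers):

import torch
import torch.nn as nn

class SharedFeatureEncoder(nn.Module):
    def __init__(self, disc_trunk, feat_dim, d_latent):
        super().__init__()
        self.disc_trunk = disc_trunk            # discriminator's conv feature trunk (shared)
        self.mu = nn.Linear(feat_dim, d_latent)
        self.log_var = nn.Linear(feat_dim, d_latent)

    def forward(self, x):
        # gradients can still flow through h back to the generator,
        # but the trunk weights are only registered with the discriminator's optimizer
        h = self.disc_trunk(x).flatten(1)
        return self.mu(h), self.log_var(h)

    def head_parameters(self):
        # register only these with the generator/encoder optimizer, so the
        # encoder loss does not update the common feature-extracting layers
        return list(self.mu.parameters()) + list(self.log_var.parameters())

For example, the joint optimizer would be built as torch.optim.Adam(list(G.parameters()) + enc.head_parameters()), leaving the trunk to the discriminator's optimizer alone.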
In this section, we use quantitative measures (MSE, Inception Score (IS), and Fréchet Inception Distance (FID)) to compare the inference and generation performance of DALI, GAN, ALI and ALICE. For a fair comparison, GAN is implemented with the identical generator and discriminator as DALI. We also include a reduced version of DALI, named DALI-l, in which the conditional distribution q_φ(z|x) of the encoder is assumed to be a Gaussian with identity covariance matrix. To evaluate inference performance, we reconstruct test images and calculate the mean squared error (MSE), as adopted in Li et al. [2017]. For generation, we calculate the inception score [Salimans et al., 2016] on randomly generated images. The inception scores on MNIST are evaluated by the pre-trained classifier from Li et al. [2017], and the inception scores on CIFAR-10 are based on the ImageNet-pretrained classifier. The quantitative results are summarized in Table 1.

Inference. From Table 1, DALI achieves the best reconstruction results on both data sets. On MNIST, DALI significantly decreases the MSE, by 68% and 95% compared with ALICE and ALI, respectively. On the more complicated CIFAR-10 data set, DALI decreases the MSE by 95% and 97%. In order to alleviate the non-identifiability issue of ALI, ALICE adds a conditional entropy constraint by explicitly regularizing the ℓ_k norms between the reconstructed and real images. However, as the data distribution becomes more complicated, as in CIFAR-10, the ℓ_k norms become inadequate for measuring the reconstruction. Consequently, ALICE's reconstruction error on CIFAR-10 increases significantly compared with that on MNIST. In contrast, the reconstruction performance of DALI is consistent on both data sets. The reason is that our model explicitly specifies the dependency structure of the generative model, and matches both prior and conditional distributions without using simple data-fitting ℓ_k metrics in the data space. This is further supported by the performance of DALI-l, which follows the same structure. Compared with DALI-l, DALI further decreases the MSE by a relative 49% on CIFAR-10, which shows that the inferred conditional variance is crucial for achieving faithful reconstructions on complicated data sets.
Table 1: MSE (lower is better) and Inception scores (higher is better) on MNIST and CIFAR-10. ALI and ALICE results are from the experiments in Li et al. [2017]. [The table reports, for each method, MSE and Inception Score on MNIST, and MSE, Inception Score, and FID on CIFAR-10; the numeric entries are not recoverable from this copy.]

Figure 1: Reconstruction comparison between our proposed model DALI (first row) and ALI (BiGAN) (second row) on the MNIST, CIFAR-10 and CelebA datasets. In each subfigure, the odd columns show original samples from the test set and the even columns show their reconstructions.

Generation. DALI outperforms all the baseline models, including GAN, on inception score. This suggests that DALI brings a further improvement in generation performance instead of deteriorating it like the other baselines. The reason that both
ALI and ALICE perform worse than GAN on generation is that the task of matching two complicated joint distributions, p_θ(z, x) and q_φ(z, x), is more difficult than the task of the regular GAN, which is to match only the marginals, p_θ(x) and q_φ(x). The proposed model DALI explicitly defines the dependency structure between z and x, which is more effective than one-step joint distribution matching. The comparison between DALI and DALI-l shows that the learned variance is also critical for better generation performance. We also want to highlight that DALI's generation performance can be further improved by replacing the adversarial network with more advanced state-of-the-art GANs.

In Figure 1, we compare the reconstructions of DALI with the results reported for ALI [Dumoulin et al., 2016] (BiGAN [Donahue et al., 2016]). From the first column of Figure 1, we observe that ALI provides a certain level of reconstruction; however, it fails to capture the precise style of the original digits. In contrast, DALI achieves very sharp and faithful reconstructions. On CIFAR-10, ALI's reconstructions are less faithful and often make mistakes in capturing exact object placement, color, style, and object identity. Our model produces better reconstructions in all these aspects. For the reconstructions on CelebA, DALI reproduces similar style, color and face placement, and even achieves a high level of face identity. Dumoulin et al. [2016] state that they believe ALI's unfaithful reconstructions are caused by underfitting. This leads us to believe that our adversarial regime (marginal and conditional distribution matching) is more efficient for inference than joint distribution matching regimes.
To show the effectiveness of our model on mode collapse reduction, we perform the same synthetic experiment as in Dumoulin et al. [2016]. The data is a 2D Gaussian mixture of 25 components laid out on a grid. To quantify the degree of mode collapse, we use the two metrics from Srivastava et al. [2017]: the number of modes captured and the percentage of high quality samples. A generated sample is counted as high quality if it is within three standard deviations of the nearest mode, and the number of modes captured is the number of mixture components whose mean is nearest to at least one high quality sample.

Table 2: Degree of mode collapse, measured by modes captured (higher is better) and % high quality samples (higher is better) on 2D grid data. The baseline results of GAN, ALI and Unrolled GAN are reported in Srivastava et al. [2017].

                          GAN   ALI    Unrolled GAN  VAEGAN  VEEGAN  SN-GAN  DALI  DALI-f
Modes (Max 25)            3.3   15.84  23.6          21.4    24.6    25      25    25
% High Quality Samples    0.5   1.6    16            34.1    40      67.8    n/a   n/a

[The % high quality samples entries for DALI and DALI-f are not recoverable from this copy; the text below reports more than 80% for DALI.]
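As an illustration, the two metrics just defined can be computed as follows (a sketch of our own, assuming the 25 mixture means and their common standard deviation sigma are known):

import numpy as np

def mode_collapse_metrics(samples, means, sigma):
    # samples: (n, 2) generated points; means: (25, 2) grid of mixture means
    dist = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)                 # nearest mode for each sample
    hq = dist.min(axis=1) <= 3 * sigma            # high quality: within 3 std of nearest mode
    modes_captured = np.unique(nearest[hq]).size  # modes with at least one high-quality sample
    return modes_captured, 100.0 * hq.mean()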
We compare the proposed methods DALI and DALI-f to ALI, Unrolled GAN [Metz et al., 2016], VAEGAN [Larsen et al., 2015], VEEGAN [Srivastava et al., 2017] and SN-GAN [Miyato et al., 2018]. As shown in Table 2, the proposed model DALI provides the best performance on both measures consistently. More specifically, DALI captures all 25 modes every time and generates more than 80% high-quality samples. This suggests that DALI significantly alleviates the mode collapse issue of the GAN framework and hence further improves generation performance.

5 Related Work

The most straightforward way to learn an inference mechanism is to learn the inverse mapping of GAN's generator post hoc [Zhu et al., 2016]. However, since its training process is the same as GAN's, it still suffers from the mode collapse problem. InfoGAN [Chen et al., 2016] maximizes the mutual information between a subset c of the latent code and the generated samples, and hence can only perform partial inference on c. AGE [Ulyanov et al., 2017] encourages the encoder and generator to be reciprocal by simultaneously minimizing reconstruction errors in the data space and in the code space. This is closely related to the cycle-consistency criterion [Zhu et al., 2017; Kim et al., 2017; Yi et al., 2017; Li et al., 2017]. Although the pairwise reconstruction errors help reduce mode collapse, the data reconstruction is still measured by an ℓ1 or ℓ2 norm, which brings the same problem as in VAE and VAE-GAN hybrids. It is worth noting that the main difference between our method and VAE is not about which divergence we use, but rather about the space on which we calculate the divergence: VAE calculates the divergence in the z-space, whereas DALI calculates it on x. Putting the (reverse) KL-divergence on x allows us to play the adversarial game on the more complicated distribution of x, while leaving the parametric reconstruction to the simpler z.

Different from heuristic combinations of VAE and GANs, Mescheder et al. [2017] theoretically derived an adversarial game to replace the KL-divergence term in the variational lower bound (the ELBO), which gives the resulting method, adversarial variational Bayes (AVB), much more flexibility in its dependence on the latent z. However, the reconstruction term on x still exists, and so does the parametric assumption on the conditional data distribution, leading to blurriness in the reconstructed and generated samples.

ALI [Dumoulin et al., 2016; Donahue et al., 2016] is an elegant approach that brings an inference mechanism into adversarial learning without assuming a parametric distribution on the data. Different from our work, it directly plays an adversarial game to match the joint distributions of the decoder and encoder. In practice, however, ALI's reconstructions are not necessarily faithful because the dependency structures within the two joint distributions are not specified [Li et al., 2017]. ALICE [Li et al., 2017] tries to solve this problem by regularizing ALI with an extra conditional entropy constraint on the data. The conditional entropy is either explicitly measured by an ℓ_k norm, or implicitly learned by adversarial training.
However, when the data distribution becomes complicated (e.g., CIFAR-10), the ℓ_k metric may lead to blurry reconstructions, and the adversarial training is hard to achieve [Li et al., 2017]. Compared with ALI and ALICE, our method is proven to minimize the KL-divergence between both the priors and the conditionals of the generator and encoder, and can provide consistently effective inference even on complicated distributions (see Section 4.1).

Srivastava et al. [2017] proposed VEEGAN to tackle the mode collapse issue of GANs by adding implicit variational learning on the latent z. To the best of our knowledge, this is by far the only other approach that also reconstructs z. Different from VAEs, VEEGAN autoencodes the latent variable, or noise, z. By doing so, it forces the generator not to collapse the mappings of z onto a single mode, because otherwise the encoder would not be able to recover all the noise z. Their model can be summarized as ALI regularized by an extra reconstruction of the latent z. Therefore, VEEGAN is similar to ALICE in the sense that both are adversarial games on the joint distribution with an extra regularization on either the data or the latent reconstruction. Our model DALI instead plays the adversarial game only on the marginal data distribution, and reconstructs the latent z by maximizing its log-likelihood under the latent posterior distribution.

6 Conclusion

We proposed a novel framework, DALI, which matches both prior and conditional distributions between the generator and the encoder. Adversarial inference is incorporated into this framework, and there is no parametric assumption on the conditional data distribution. We show in the experiments that the proposed method not only allows efficient inference but also improves image generation. The assumption on q_φ(z|x) can be further relaxed using an autoregressive p_θ(z). However, the same technique cannot easily be applied to q_φ(x) or p_θ(x|z). Therefore, we believe the reconstruction direction z → x → z is more expressive than the opposite x → z → x.

References

[Brock et al., 2016] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
[Che et al., 2016] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[Donahue et al., 2016] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[Dumoulin et al., 2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[Fedus et al., 2017] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv preprint arXiv:1710.08446, 2017.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[Goodfellow, 2016] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[Kim et al., 2017] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Kingma and Welling, 2013] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[Krizhevsky et al., 2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[Larsen et al., 2015] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Li et al., 2017] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pages 5501–5509, 2017.
[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
[Lucic et al., 2018] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pages 700–709, 2018.
[Makhzani et al., 2015] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[Mescheder et al., 2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
[Metz et al., 2016] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[Miyato et al., 2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[Nowozin et al., 2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[Radford et al., 2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[Rezende et al., 2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[Rosca et al., 2018] Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.
[Salimans et al., 2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[Srivastava et al., 2017] Akash Srivastava, Lazar Valkoz, Chris Russell, Michael U. Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3308–3318, 2017.
[Ulyanov et al., 2017] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. It takes (only) two: Adversarial generator-encoder networks. arXiv preprint arXiv:1704.02304, 2017.
[Yi et al., 2017] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint, 2017.
[Zhu et al., 2016] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[Zhu et al., 2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.