"Best-of-Many-Samples" Distribution Matching
Under review as a conference paper at ICLR 2020
Apratim Bhattacharyya, Mario Fritz, Bernt Schiele
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
{abhattac, mfritz, schiele}@mpi-inf.mpg.de

Abstract
Generative Adversarial Networks (GANs) can achieve state-of-the-art sample quality in generative modelling tasks but suffer from the mode collapse problem. Variational Autoencoders (VAEs), on the other hand, explicitly maximize a reconstruction-based data log-likelihood, forcing them to cover all modes, but suffer from poorer sample quality. Recent works have proposed hybrid VAE-GAN frameworks which integrate a GAN-based synthetic likelihood into the VAE objective to address both the mode collapse and sample quality issues, with limited success. This is because the VAE objective forces a trade-off between the data log-likelihood and the divergence to the latent prior. The synthetic likelihood ratio term also shows instability during training. We propose a novel objective with a “Best-of-Many-Samples” reconstruction cost and a stable direct estimate of the synthetic likelihood. This enables our hybrid VAE-GAN framework to achieve high data log-likelihood and low divergence to the latent prior at the same time, and shows significant improvement over both hybrid VAE-GANs and plain GANs in mode coverage and sample quality.
1 Introduction
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved state-of-the-art sample quality in generative modeling tasks. However, GANs do not explicitly estimate the data likelihood. Instead, they aim to “fool” an adversary, so that the adversary is unable to distinguish between samples from the true distribution and the generated samples. This leads to the generation of high quality samples (Adler & Lunz, 2018; Brock et al., 2019). However, there is no incentive to cover the whole data distribution. Entire modes of the true data distribution can be missed – commonly referred to as the mode collapse problem.

In contrast, Variational Auto-Encoders (VAEs) (Kingma & Welling, 2014) explicitly maximize the data likelihood and can be forced to cover all modes (Bozkurt et al., 2018; Shu et al., 2018). VAEs enable sampling by constraining the latent space to a unit Gaussian and sampling through the latent space. However, VAEs maximize a data likelihood estimate based on an L1/L2 reconstruction cost, which leads to lower overall sample quality – blurriness in the case of image distributions. Therefore, there has been a spur of recent work (Donahue et al., 2017; Larsen et al., 2016; Rosca et al., 2019) which aims to integrate GANs into a VAE framework to improve VAE generation quality while covering all the modes. Notably, in Rosca et al. (2019), GANs are integrated into a VAE framework by augmenting the L1/L2 data likelihood term in the VAE objective with a GAN-discriminator-based synthetic likelihood ratio term.

However, Rosca et al. (2019) report that in the case of hybrid VAE-GANs, the latent space does not usually match the Gaussian prior. This is because the reconstruction log-likelihood in the VAE objective is at odds with the divergence to the latent prior (Tabor et al., 2018) (also in the case of the alternatives proposed by Makhzani et al. (2016); Arjovsky et al. (2017)).
This problem is further exacerbated with the addition of the synthetic likelihood term in the hybrid VAE-GAN objective – it is necessary for sample quality, but it introduces additional constraints on the encoder/decoder. This leads to degradation in the quality and diversity of samples. Moreover, the synthetic likelihood ratio term is unstable during training – as it is the ratio of outputs of a classifier, any instability in the output of the classifier is magnified. We directly estimate the ratio using a network with a controlled Lipschitz constant, which leads to significantly improved stability. Our contributions in detail are: 1. We propose a novel objective for training hybrid VAE-GAN frameworks, which relaxes the constraints on the encoder by giving the encoder multiple chances to draw samples with high likelihood, enabling it to generate realistic images while covering all modes of the data distribution; 2. Our novel objective directly estimates the synthetic likelihood term with a controlled Lipschitz constant for stability; 3. Finally, we demonstrate significant improvement over prior hybrid VAE-GANs and plain GANs on highly multi-modal synthetic data, CIFAR-10 and CelebA.

2 Related Work
Generative Autoencoders.
VAEs (Kingma & Welling, 2014) allow for generation by maintaining a Gaussian latent space. In Kingma & Welling (2014), the Gaussian constraint is applied point-wise and the latent representation of each point is forced towards zero. Adversarial Auto-encoders (AAE) (Makhzani et al., 2016) and Wasserstein Auto-encoders (WAE) (Tolstikhin et al., 2018) tackle this problem with an approximate estimate of the divergence which only requires the latent space to be Gaussian as a whole. But the Gaussian constraint in (Arjovsky et al., 2017; Kingma & Welling, 2014; Makhzani et al., 2016; Mahajan et al., 2019) is still at odds with the data log-likelihood. In this work, we enable the encoder to maintain both the latent representation constraint and high data log-likelihood using a novel objective. Furthermore, we integrate a GAN-based synthetic likelihood term into the objective to enhance the sharpness of generated images.
Mode Collapse in Classical GANs.
The classic GAN formulation (Goodfellow et al., 2014; Radford et al., 2016) has several shortcomings – importantly, mode collapse. Denoising Feature Matching (Warde-Farley & Bengio, 2017) deals with mode collapse by regularizing the discriminator using an auto-encoder. MDGAN (Che et al., 2017) uses two separate discriminators and regularizes using an auto-encoder. In EBGAN (Zhao et al., 2017a), the discriminator is interpreted as an energy functional and is also cast in an auto-encoder framework, leading to improvements in semi-supervised learning tasks. BEGAN (Berthelot et al., 2017) proposes a Wasserstein distance based objective to train such GANs with auto-encoder based discriminators; the proposed approach leads to smoother convergence. InfoGAN (Chen et al., 2016) maximizes the mutual information between a small subset of latent variables and observations in an information-theoretic framework. This leads to disentangled and more interpretable latent representations. PacGAN (Lin et al., 2018) proposes to deal with the mode collapse problem by using the discriminator to distinguish between product distributions. D2GAN (Nguyen et al., 2017) proposes to use two discriminators – one for the forward KL divergence between the true and generated distributions and one for the reverse. BourGAN (Xiao et al., 2018) proposes to learn the distribution of the latent space (instead of assuming it Gaussian) such that it reflects the distribution of the data. In (Srivastava et al., 2017), an inverse mapping from latent to data space is learned and the generator is penalized based on the inverted distribution to cover all modes. Ravuri et al. (2018) propose a moment matching paradigm different from VAEs or GANs. However, as the presented moment matching network involves an order of magnitude more parameters compared to VAEs or GANs, we do not consider it here. As we propose a hybrid VAE-GAN framework, these techniques can be applied on top to potentially improve results. However, in hybrid VAE-GANs the reconstruction loss already incentivizes the coverage of all modes.
Wasserstein Loss based Formulations.
Arjovsky et al. (2017); Gulrajani et al. (2017) propose GANs which minimize the Wasserstein distance between the true and generated distributions. Miyato et al. (2018) demonstrate improved results by applying Spectral Normalization on the weights. In Tran et al. (2018), distance constraints are applied on top. In Adler & Lunz (2018), WGANs were extended to Banach spaces to emphasize edges or large scale behavior. Orthogonally, Karras et al. (2018) focus on progressively learning to use more complex model architectures to improve performance. We use the regularization techniques developed for WGANs to improve the stability of our hybrid VAE-GAN framework. Brock et al. (2019) show very high quality generations at high resolutions, but these are class conditional. However, diverse class conditional generation is considerably easier, as intra-class variability is generally much lower than inter-class variability. Here, we focus on the more complex unconditional image generation task.
Hybrid VAE-GANs.
In Larsen et al. (2016), a VAE-GAN hybrid is proposed with discriminator feature matching – the VAE decoder is trained to match discriminator features instead of an L1/L2 reconstruction loss. ALI (Dumoulin et al., 2016) proposes to instead match the encoder and decoder joint distributions – with limited success on diverse datasets. BiGAN (Donahue et al., 2017) builds upon ALI to learn inverse mappings from the data to the latent space and demonstrates effectiveness on various discriminative tasks. Rosca et al. (2019) extend standard VAEs by replacing the log-likelihood term with a hybrid version based on synthetic likelihoods. The KL-divergence constraint to the prior is also recast in a synthetic likelihood form, which can be enforced by a discriminator (as in Makhzani et al. (2016); Tolstikhin et al. (2018)). The second improvement is crucial in generating realistic images on par with classic/Wasserstein GANs. We further improve upon Rosca et al. (2019) by allowing the encoder multiple chances to draw desired samples and by enforcing stability – enabling it to maintain low divergence to the prior while generating realistic images.

3 Novel Objective for Hybrid VAE-GANs

We begin with a brief overview of hybrid VAE-GANs, followed by details of our novel objective.
Overview.
Hybrid VAE-GANs (Figure 1) are generative models for data distributions x ∼ p(x) that transform a latent distribution z ∼ p(z) to a learned distribution x̂ ∼ p_θ(x) approximating p(x). The GAN (G_θ, D_I) alone can generate realistic samples, but has trouble covering all modes. The VAE (R_φ, G_θ, D_L) can cover all modes of the distribution, but generates lower quality samples overall. VAE-GANs leverage the strengths of both VAEs and GANs to generate high quality samples while capturing all modes. We begin with a discussion of the prior hybrid VAE-GAN objective (Rosca et al., 2019) and its shortcomings, followed by our novel “Best-of-Many-Samples” objective with a novel reconstruction term and a regularized, stable direct estimate of the synthetic likelihood.

3.1 Shortcomings of Hybrid VAE-GAN Objectives
Hybrid VAE-GANs (Dumoulin et al., 2016; Makhzani et al., 2016; Rosca et al., 2019; Zhao et al., 2017b) maximize the log-likelihood of the data (x ∼ p(x)) akin to VAEs. The log-likelihood, assuming the latent space to be distributed according to p(z), is

    log p_θ(x) = log ( ∫ p_θ(x|z) p(z) dz ).   (1)

Here, p(z) is usually Gaussian. This requires the generator G_θ to generate samples that assign high likelihood to every example x in the data distribution for a likely z ∼ p(z). Thus, the decoder G_θ can be forced to cover all modes of the data distribution x ∼ p(x). In contrast, GANs never directly maximize the data likelihood and there is no direct incentive to cover all modes.

However, the integral in (1) is intractable. VAEs and hybrid VAE-GANs use amortized variational inference with a recognition network q_φ(z|x) (R_φ). The final hybrid VAE-GAN objective of the state-of-the-art α-GAN (Rosca et al., 2019), which integrates a synthetic likelihood ratio term, is

    L_α-GAN = λ E_{q_φ(z|x)} [ log p_θ(x|z) ] + E_{q_φ(z|x)} [ log ( D_I(x|z) / (1 − D_I(x|z)) ) ] − KL( q_φ(z|x) ‖ p(z) ).   (2)

This objective has two important shortcomings. Firstly, as pointed out in (Bhattacharyya et al., 2018; Tolstikhin et al., 2018), this objective severely constrains the recognition network, as the average likelihood of the samples generated from the posterior q_φ(z|x) is maximized. This forces all samples from q_φ(z|x) to explain x equally well, penalizing any variance in q_φ(z|x) and thus forcing it away from the Gaussian prior p(z). This makes it difficult to match the prior in the latent space, and the encoder is forced to trade off between a good estimate of the data log-likelihood and the divergence to the latent prior.

Secondly, as the synthetic likelihood ratio term is the ratio of the outputs of D_I, any instability (non-smoothness) in the output of the classifier is magnified.
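This trade-off can be seen numerically in a minimal 1-D sketch (the Gaussian decoder g(z) = 2z, the posterior parameters, and all constants below are illustrative assumptions, not the paper's networks): shrinking the posterior variance raises the per-sample average reconstruction log-likelihood that (2) maximizes, but increases the KL-divergence to the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-ins: decoder p_theta(x|z) = N(x; 2z, 1),
# posterior q_phi(z|x) = N(mu, s^2), prior p(z) = N(0, 1).
def avg_recon_ll(x, mu, s, T=100_000):
    # E_q[log p_theta(x|z)], the per-sample average maximized by Eq. (2),
    # estimated by Monte Carlo sampling from the posterior.
    z = mu + s * rng.standard_normal(T)
    return np.mean(-0.5 * (x - 2.0 * z) ** 2 - 0.5 * np.log(2.0 * np.pi))

def kl_to_prior(mu, s):
    # Closed-form KL( N(mu, s^2) || N(0, 1) ).
    return 0.5 * (mu ** 2 + s ** 2 - 1.0) - np.log(s)

x, mu = 1.0, 0.5
ll_narrow, ll_wide = avg_recon_ll(x, mu, 0.1), avg_recon_ll(x, mu, 1.0)
kl_narrow, kl_wide = kl_to_prior(mu, 0.1), kl_to_prior(mu, 1.0)
# The narrow posterior explains x better on average but sits much further
# from the prior: improving one term of the objective worsens the other.
```

With these toy numbers the narrow posterior wins on reconstruction and loses on the prior term, which is exactly the tension the "Best-of-Many-Samples" objective is designed to relax.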
Moreover, there is no incentive for D_I to be smooth (stable). For two similar images {x₁, x₂} with |x₁ − x₂| ≤ ε, the change in output |D_I(x₁|z) − D_I(x₂|z)| can be arbitrarily large. This means that a small change in the generator output (e.g. after a gradient descent step) can lead to a large change in the discriminator output.

Next, we describe how we can effectively leverage multiple samples from q_φ(z|x) to deal with the first issue. Finally, we derive a stable synthetic likelihood term (Rosca et al., 2019; Wood, 2010) to deal with the second issue.

Figure 1: Overview of our BMS-VAE-GAN framework. The terms of our novel objective (7) are highlighted at the right. We consider only the best sample from the generator G_θ while computing the reconstruction loss.

3.2 Leveraging Multiple Samples
Building upon Bhattacharyya et al. (2018), we derive an alternative variational approximation of (1), which uses multiple samples to relax the constraints on the recognition network (full derivation in Appendix A),

    L_MS = log ( ∫ p_θ(x|z) q_φ(z|x) dz ) − KL( q_φ(z|x) ‖ p(z) ).   (3)

In comparison to the α-GAN objective (2), where the expected likelihood assigned by each sample to the data point x was considered, we see that in (3) the likelihood is computed considering all generated samples. The recognition network gets multiple chances to draw samples which assign high likelihood to x. This allows q_φ(z|x) to have higher variance, helping it better match the prior and significantly reducing the trade-off with the data log-likelihood. Next, we describe how we can integrate a synthetic likelihood term in (3) to help us generate sharper images.

3.3 Integrating a Stable Synthetic Likelihood with the “Best-of-Many” Samples
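The relaxation follows from Jensen's inequality: log E_q[p_θ(x|z)] ≥ E_q[log p_θ(x|z)], so the multi-sample likelihood in (3) is never smaller than the per-sample average maximized by (2). A quick numerical check with a hypothetical 1-D Gaussian decoder and posterior (illustrative stand-ins, not the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: decoder p_theta(x|z) = N(x; 2z, 1),
# posterior samples z_i ~ q_phi(z|x) = N(0.5, 1).
x = 1.0
z = 0.5 + rng.standard_normal(10_000)
log_p = -0.5 * (x - 2.0 * z) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Per-sample average E_q[log p]: every sample must explain x (Eq. (2) style).
per_sample_avg = log_p.mean()
# Multi-sample log E_q[p]: dominated by the samples that explain x well
# (Eq. (3) style), so high-variance posteriors are penalized far less.
multi_sample = np.log(np.mean(np.exp(log_p)))
```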
Considering only L1/L2 reconstruction based likelihoods p_θ(x|z) (as in Bhattacharyya et al. (2018); Kingma & Welling (2014); Tolstikhin et al. (2018)) might not be sufficient in the case of complex high dimensional distributions; e.g. in the case of image data this leads to blurry samples. Synthetic estimates of the likelihood (Wood, 2010) leverage a neural network (usually a classifier) which is jointly trained to distinguish between real and generated samples. The network is trained to assign low likelihood to generated samples and higher likelihood to real data samples. Starting from (3), we integrate a synthetic likelihood term with weight α to encourage our generator to generate realistic samples. The L1/L2 reconstruction likelihood (with weight β) forces the coverage of all modes. However, unlike prior work (Bhattacharyya et al., 2019; Rosca et al., 2019), our synthetic likelihood estimator D_I is not a classifier. We first convert the likelihood term to a likelihood ratio form which allows for synthetic estimates,

    L_MS = α log ( E_{q_φ(z|x)} [ p_θ(x|z) ] ) + β log ( E_{q_φ(z|x)} [ p_θ(x|z) ] ) − KL( q_φ(z|x) ‖ p(z) )
         ∝ α log ( E_{q_φ(z|x)} [ p_θ(x|z) / p(x) ] ) + β log ( E_{q_φ(z|x)} [ p_θ(x|z) ] ) − KL( q_φ(z|x) ‖ p(z) ).   (4)

To enable the estimation of the likelihood ratio p_θ(x|z)/p(x) using a neural network, we introduce the auxiliary variable y, where y = 1 denotes that the sample was generated and y = 0 denotes that the sample is from the true distribution. We can now express (4) (using Bayes' theorem, see Appendix A) as

         = α log ( E_{q_φ(z|x)} [ p_θ(x|z, y = 1) / p(x|y = 0) ] ) + β log ( E_{q_φ(z|x)} [ p_θ(x|z) ] ) − KL( q_φ(z|x) ‖ p(z) ).
         = α log ( E_{q_φ(z|x)} [ p_θ(y = 1|z, x) / (1 − p(y = 1|x)) ] ) + β log ( E_{q_φ(z|x)} [ p_θ(x|z) ] ) − KL( q_φ(z|x) ‖ p(z) ).   (5)

The ratio p_θ(y = 1|z, x) / (1 − p(y = 1|x)) should be high for generated samples which are indistinguishable from real samples and low otherwise. In the case of image distributions, we find that direct estimation of the numerator/denominator (as in Rosca et al. (2019)) exacerbates instabilities (non-smoothness) of the estimate. Therefore, we estimate this ratio directly using the neural network D_I(x) – trained to produce high values for images indistinguishable from real images and low otherwise,

    L_MS-S ∝ α log ( E_{q_φ(z|x)} [ D_I(x|z) ] ) + β log ( E_{q_φ(z|x)} [ p_θ(x|z) ] ) − KL( q_φ(z|x) ‖ p(z) ).   (6)

To further ensure smoothness, we directly control the Lipschitz constant K of D_I. This ensures that ∀ x₁, x₂, |D_I(x₁|z) − D_I(x₂|z)| ≤ K |x₁ − x₂| – the function is strictly smooth everywhere. Small changes in generator output cannot arbitrarily change the synthetic likelihood estimate, hence allowing the generator to smoothly improve sample quality. We constrain the Lipschitz constant K to 1 using Spectral Normalization (Miyato et al., 2018). Note that the likelihood p_θ(x|z) takes the form e^{−λ‖x − x̂‖_n} in (6) – a log-sum-exp which is numerically unstable. As we perform stochastic gradient descent, we can deal with this after stochastic (MC) sampling of the data points.
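Spectral Normalization bounds the Lipschitz constant of each linear layer by dividing its weights by the layer's largest singular value, estimated with power iteration. A minimal numpy sketch of that estimate for a single random matrix (in training, every layer of D_I is normalized this way; the matrix size and iteration count here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 32))  # a stand-in weight matrix

# Power iteration: u and v converge to the top singular vectors of W,
# and u^T W v to the spectral norm (largest singular value) of W.
u = rng.standard_normal(64)
for _ in range(200):
    v = W.T @ u
    v /= np.linalg.norm(v)
    u = W @ v
    u /= np.linalg.norm(u)
sigma = u @ W @ v

# Dividing by sigma makes the layer (approximately) 1-Lipschitz.
W_sn = W / sigma
```

In practice (e.g. in Miyato et al. (2018)) a single power-iteration step per training update suffices, since the weights change slowly between updates.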
We can estimate the log-sum-exp well using the max – the “Best-of-Many-Samples” (Nielsen & Sun, 2016),

    log ( (1/T) Σ_{i=1}^{T} p_θ(x|ẑ_i) ) ≥ max_i log p_θ(x|ẑ_i) − log(T).

In practice, we observe that we can improve the sharpness of generated images by penalizing the generator G_θ using the least realistic of the T samples,

    log ( (1/T) Σ_{i=1}^{T} D_I(x|ẑ_i) ) ≥ min_i log D_I(x|ẑ_i).

Our final “Best-of-Many”-VAE-GAN objective takes the form (ignoring the constant log(T) term),

    L_BMS-S = α min_i log D_I(x|ẑ_i) + β max_i log p_θ(x|ẑ_i) − KL( q_φ(z|x) ‖ p(z) ).   (7)

We use the same optimization scheme as in Rosca et al. (2019). We provide the algorithm in detail in Appendix B.

Approximation Errors.
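Both bounds are elementary: the mean of T non-negative terms is at least the largest term divided by T, and at least the smallest term. A numerical check with arbitrary toy values for the T per-sample likelihoods and discriminator outputs:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 10
log_p = rng.normal(-5.0, 2.0, T)   # toy log p_theta(x | z_i) for T latent samples
d_out = rng.uniform(0.1, 0.9, T)   # toy discriminator outputs D_I(x | z_i)

# Log of the sample mean vs. the "Best-of-Many-Samples" lower bound:
# log(mean_i p_i) >= max_i log p_i - log(T).
log_mean_p = np.log(np.mean(np.exp(log_p)))
bms_bound = log_p.max() - np.log(T)

# Log of the mean discriminator output vs. the worst-sample lower bound:
# log(mean_i D_i) >= min_i log D_i.
log_mean_d = np.log(np.mean(d_out))
worst_bound = np.log(d_out).min()
```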
The “Best-of-Many-Samples” scheme introduces the log(T) error term. However, this error term is dominated by the low data likelihood term at the beginning of optimization (Bhattacharyya et al., 2018). Later, as generated samples become more diverse, the log-likelihood term is dominated by the best of the T samples – and the “Best-of-Many-Samples” estimate becomes equivalent.

Classifier based estimate of the prior term.
Recent work (Makhzani et al., 2016; Arjovsky et al., 2017; Rosca et al., 2019) has shown that point-wise minimization of the KL-divergence using its analytical form leads to degradation in image quality. Instead, the KL-divergence term is recast in a synthetic likelihood ratio form and minimized “globally” using a classifier instead of point-wise. Therefore, unlike Bhattacharyya et al. (2018), here we employ a classifier based estimate of the KL-divergence to the prior. However, as pointed out by prior work on hybrid VAE-GANs (Rosca et al., 2019), a classifier based estimate still leads to a mismatch to the prior, as the trade-off with the data log-likelihood persists without the use of the “Best-of-Many-Samples”. Therefore, as we shall demonstrate next, the benefits of using the “Best-of-Many-Samples” extend to the case when a classifier based estimate of the KL-divergence is employed.
4 Experiments
Next, we evaluate on multi-modal synthetic data as well as CIFAR-10 and CelebA. We perform all experiments on a single Nvidia V100 GPU with 16GB memory. We use as many samples during training as would fit in GPU memory, so that we make the same number of forward/backward passes as other approaches and minimize the computational overhead of sampling multiple samples.

Table 1: Evaluation on multi-modal synthetic data.
                                    2D Grid (25 modes)    2D Ring (8 modes)
Method                              Modes    HQ%          Modes    HQ%
VEEGAN (Srivastava et al., 2017)    24.6     40           8        52.9
GDPP-GAN (Elfeki et al., 2019)      24.8     68.5         8        71.7
SN-GAN (Miyato et al., 2018)        23.8 ±   ±            ±        ±
α-GAN (Rosca et al., 2019)          25       70.5 ±       ±        ±
BMS-VAE-GAN (Ours) T = 10           ±        ±            ±        ±

Table 2: Visualization of samples.
[Table 2 shows, for WAE, α-GAN and BMS-VAE-GAN: latent space samples z with q_φ(z) ≪ p(z), and the corresponding data space samples p_θ(x|z).]

Table 3: Effect of our novel objective in the latent space.
Top Row: The standard WAE and α-GAN objectives lead to a mismatch to the prior in the latent space. We show samples z (in red) which are highly likely under the standard Gaussian prior (blue) but have low probability under the learnt marginal posterior q_φ(z). Bottom Row: We show that such points z lead to low quality data samples (in red), which do not correspond to any of the modes.

4.1 Evaluation on Multi-modal Synthetic Data

We evaluate in Tables 1 and 2 on the standard 2D Grid and Ring datasets, which are highly challenging due to their multi-modality. The metrics considered are the number of modes captured and the percentage of high quality samples (within 3 standard deviations of a mode). The generator/discriminator architecture is the same as in Srivastava et al. (2017). We see that our BMS-VAE-GAN (using the best of T = 10 samples) outperforms state of the art GANs, e.g. (Eghbal-zadeh et al., 2019), and the WAE and α-GAN baselines. The explicit maximization of the data log-likelihood enables our BMS-VAE-GAN and the WAE and α-GAN baselines to capture all modes in both the grid and ring datasets. The significantly increased proportion of high quality samples with respect to the WAE and α-GAN baselines is due to our novel “Best-of-Many-Samples” objective. We illustrate this in Table 3. Following Rosca et al. (2019), we analyze the learnt latent spaces in detail; in particular, we check for points (in red) which are likely under the Gaussian prior p(z) (blue) but have low probability under the marginal posterior q_φ(z) = ∫ q_φ(z|x) p(x) dx. We use tSNE to project points from our 32-dimensional latent space to 2D. In Table 3 (Top Row) we clearly see that there are many such points in the case of the WAE and α-GAN baselines (note that this low probability threshold is common across all methods). In Table 3 (Bottom Row) we see that these points lead to the generation of low quality samples (in red) in the data space. Therefore, we see that our “Best-of-Many-Samples” objective helps us match the prior in the latent space, which leads to the generation of high quality samples, outperforming both state of the art GANs and hybrid VAE-GAN baselines.

Table 4: IvOM on CIFAR-10.

Method                           IvOM ↓
DCGAN (Radford et al., 2016)     0.0084 ±
α-GAN + SN (Ours) T = 1          ±
BMS-VAE-GAN (Ours) T = 30        ±

Table 5: Closest generated images found using IvOM.
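The two synthetic-data metrics reported in Table 1 can be computed directly from generated samples: assign each sample to its nearest mode, call it high quality (HQ) if it lies within 3 standard deviations of that mode, and count a mode as captured if at least one HQ sample is assigned to it. A sketch on a 2D Grid setup (the 5×5 layout and mode std of 0.05 are illustrative assumptions; for the check, "generated" samples are drawn from the true mixture):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed benchmark layout: 25 Gaussian modes on a 5x5 grid, std 0.05.
std = 0.05
modes = np.array([[x, y] for x in range(-2, 3) for y in range(-2, 3)], float)

# Stand-in "generated" samples: here drawn from the true mixture itself.
idx = rng.integers(len(modes), size=2000)
samples = modes[idx] + std * rng.standard_normal((2000, 2))

# Assign each sample to its nearest mode.
d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=-1)
nearest = d.argmin(axis=1)
dist = d.min(axis=1)

hq = dist <= 3 * std                       # high quality: within 3 std of a mode
modes_captured = len(np.unique(nearest[hq]))
hq_pct = 100.0 * hq.mean()
```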
[Table 5 shows, for each test sample, the closest generated image found using IvOM for SN-GAN, α-GAN + SN and BMS-VAE-GAN.]

4.2 Evaluation on CIFAR-10

Next, we evaluate on the CIFAR-10 dataset. Auto-encoding based approaches (Kingma & Welling, 2014; Makhzani et al., 2016) do not perform well on this dataset, as a simple Gaussian reconstruction based likelihood is insufficient for such highly multi-modal image data. This makes CIFAR-10 a very challenging dataset for hybrid VAE-GANs.
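The IvOM metric (inference via optimization; Srivastava et al. (2017)) used in Tables 4 and 5 measures how well the trained generator can reconstruct held-out data: for each test image x, a latent code z is optimized to minimize ‖G_θ(z) − x‖² and the final residual is reported. A sketch with a hypothetical linear generator standing in for the trained G_θ:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical linear "generator" G(z) = A z standing in for G_theta.
A = rng.standard_normal((16, 4))
x = A @ rng.standard_normal(4)     # a "test image" the generator can reach

# IvOM: gradient descent on ||G(z) - x||^2 over the latent code z.
z = np.zeros(4)
for _ in range(2000):
    z -= 0.01 * (2.0 * A.T @ (A @ z - x))

ivom = np.mean((A @ z - x) ** 2)   # low residual => x is well covered
```

A generator that has dropped the mode containing x cannot drive this residual down, which is why IvOM complements sample-quality metrics like FID.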
Architecture Details.
We use two different types of architectures for the generator/discriminator pair G_θ, D_I: DCGAN based (Radford et al., 2016), as used in Rosca et al. (2019), and the Standard CNN used in Miyato et al. (2018); Tran et al. (2018).

Experimental Details and Baselines.
We use the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of × 10⁻ , β₁ = 0. and β₂ = 0. for all components. We use the same architecture for the latent space discriminator D_L as in α-GAN (Rosca et al., 2019) (a 3-layer MLP with 750 neurons each). Values of log(D_I) ∈ [0, ] work well.

We consider the following baselines for comparison against our BMS-VAE-GAN with a DCGAN generator/discriminator: 1. A standard DCGAN (Radford et al., 2016); 2. The α-GAN model of (Rosca et al., 2019). Furthermore, we compare our BMS-GAN with the Standard CNN generator/discriminator to: 1. SN-GAN (Miyato et al., 2018); 2. BW-GAN (Adler & Lunz, 2018); 3. Dist-GAN (Tran et al., 2018); 4. Our α-GAN + SN, an improved version of the α-GAN which includes Spectral Normalization for stable estimation of synthetic likelihoods. Again, the α-GAN and α-GAN + SN baselines are identical to the corresponding BMS-VAE-GAN except for the “Best-of-Many-Samples” reconstruction likelihood.

Table 6: FID on CIFAR-10.

Method                                   FID ↓
DCGAN Architecture
WAE (Tolstikhin et al., 2018)            89.3
BMS-VAE (Ours) T = 10                    –
DCGAN (Radford et al., 2016)             30.7
α-GAN (Rosca et al., 2019)               29.4
BMS-VAE-GAN (Ours) T = 10                28.8
Standard CNN Architecture
SN-GAN (Miyato et al., 2018)             25.5
BW-GAN (Adler & Lunz, 2018)              25.1
α-GAN + SN (Ours) T = 1                  24.6
BMS-VAE-GAN (Ours) T = 10                –
BMS-VAE-GAN (Ours) T = 30                23.4
Dist-GAN (Tran et al., 2018)             22.9
BMS-VAE-GAN (Ours) T = 10, 300k iter.    21.8
Discussion of Results.
We report results in Table 6. Please note that the higher latent space dimensionality (100) makes the latent spaces much harder to reliably analyze. Therefore, we rely on the FID and IvOM metrics. We follow the evaluation protocol of Miyato et al. (2018); Tran et al. (2018) and use 10k/5k real/generated samples to compute the FID score. The α-GAN (Rosca et al., 2019) model (with the DCGAN architecture) demonstrates a better fit to the true data distribution (29.3 vs 30.7 FID) compared to a plain DCGAN. This again shows the ability of hybrid VAE-GANs to improve the performance of plain GANs. We observe that our novel “Best-of-Many-Samples” optimization scheme outperforms both the plain DCGAN and the hybrid α-GAN (28.8 vs 29.4 FID), confirming the advantage of using “Best-of-Many-Samples”. Furthermore, we see that our BMS-VAE outperforms the state-of-the-art plain auto-encoding WAE (Tolstikhin et al., 2018).

We further compare our BMS-VAE-GAN to state-of-the-art GANs using the Standard CNN architecture in Table 6 with 100k generator iterations. Our α-GAN + SN ablation significantly outperforms the state-of-the-art plain GANs (Adler & Lunz, 2018; Miyato et al., 2018), showing the effectiveness of hybrid VAE-GANs with a stable direct estimate of the synthetic likelihood on this highly diverse dataset. Furthermore, our BMS-VAE-GAN model trained using the best of T = 30 samples significantly improves over the α-GAN + SN baseline (23.4 vs 24.6 FID), showing the effectiveness of our “Best-of-Many-Samples”. We also compare to Tran et al. (2018) using 300k generator iterations, again outperforming by a significant margin (21.8 vs 22.9 FID).
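FID itself is the Fréchet distance between two Gaussians fitted to (Inception) features of real and generated samples: FID = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A minimal numpy sketch of this distance (the feature extraction step is omitted; the matrix-square-root trick assumes symmetric PSD covariances):

```python
import numpy as np

def sqrtm_psd(M):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def fid(mu1, S1, mu2, S2):
    # Frechet distance between N(mu1, S1) and N(mu2, S2). The trace term
    # uses Tr((S1 S2)^{1/2}) = Tr((S1^{1/2} S2 S1^{1/2})^{1/2}), which
    # keeps the intermediate matrix symmetric PSD.
    r1 = sqrtm_psd(S1)
    covmean = sqrtm_psd(r1 @ S2 @ r1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(S1 + S2 - 2.0 * covmean))
```

For identical Gaussians the distance is zero, and for equal covariances it reduces to the squared distance between the means.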
The IvOM metric of Srivastava et al. (2017) (Tables 4 and 5) illustrates that we are also able to better reconstruct the image distribution. The improvement in both sample quality, as measured by the FID metric, and data reconstruction, as measured by the IvOM metric, shows that our novel “Best-of-Many-Samples” objective helps us both match the prior in the latent space and achieve high data log-likelihood at the same time.

4.3 Evaluation on CelebA

Next, we evaluate on CelebA at resolutions 64 × 64 and 128 × 128.

Training and Architecture Details.
As the focus is to evaluate objectives for hybrid VAE-GANs, we use simple DCGAN based generators and discriminators for generation at both 64 × 64 and 128 × 128.

Figure 2: CelebA random samples from our α-GAN + SN (T = 1, 128 × 128) and our BMS-VAE-GAN (T = 10, 128 × 128). Our “Best of Many” reconstruction cost leads to sharper results.
Baselines and Experimental Details.
We consider the following baselines for comparison with our BMS-GAN with T = {10, 30} samples: 1. WAE (Tolstikhin et al., 2018), the state-of-the-art plain auto-encoding generative model; 2. α-GAN (Rosca et al., 2019), the state-of-the-art hybrid VAE-GAN; 3. Our α-GAN + SN, an improved version of the α-GAN which includes Spectral Normalization for stable estimation of synthetic likelihoods. Note, the α-GAN baseline is identical to our BMS-GAN except for the “Best-of-Many” reconstruction likelihood. Moreover, we include several plain GAN baselines: 1. Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017); 2. Spectral Normalization GAN (SN-GAN) (Miyato et al., 2018); 3. Dist-GAN (Tran et al., 2018).

To train our BMS-VAE-GAN and α-GAN models we use the two time-scale update rule (Heusel et al., 2017) with a learning rate of × 10⁻ for the generator and × 10⁻ for the discriminator. We use the Adam optimizer with β₁ = 0. and β₂ = 0. . We use a three layer MLP with 750 neurons as the latent space discriminator D_L (as in Rosca et al. (2019)) and a DCGAN based recognition network R_φ. We use the hinge loss to train D_I to produce high values for real images and low values for generated images; log(D_I) ∈ [−. , . ] works well.

Table 7: FID on CelebA.

Method                              FID ↓
Resolution: 64 × 64
WAE (Tolstikhin et al., 2018)       41.2
BMS-VAE (Ours) T = 10               39.8
DCGAN (Radford et al., 2016)        31.1
WGAN-GP (Gulrajani et al., 2017)    –
BEGAN (Berthelot et al., 2017)      –
SN-GAN (Miyato et al., 2018)        –
Dist-GAN (Tran et al., 2018)        –
α-GAN (Rosca et al., 2019)          19.2
α-GAN + SN (Ours) T = 1             15.1
BMS-VAE-GAN (Ours) T = 10           –
BMS-VAE-GAN (Ours) T = 30           –
Resolution: 128 × 128
α-GAN + SN (Ours) T = 1             –
BMS-VAE-GAN (Ours) T = 10           –
Discussion of Results.
We train all models for 200k iterations and report the FID scores (Heusel et al., 2017) for all models using 10k/10k real/generated samples in Table 7. The pure auto-encoding based WAE (Tolstikhin et al., 2018) has the weakest performance due to blurriness. Our pure auto-encoding BMS-VAE (without synthetic likelihoods) improves upon the WAE (39.8 vs 41.2 FID), already demonstrating the effectiveness of using “Best-of-Many-Samples”. We see that the base DCGAN has the weakest performance among the GANs. BEGAN suffers from partial mode collapse. The SN-GAN improves upon WGAN-GP, showing the effectiveness of Spectral Normalization; however, there are considerable artifacts in its generations. The α-GAN of Rosca et al. (2019), which integrates the base DCGAN in its framework, performs significantly better (19.2 vs 31.1 FID). This shows the effectiveness of VAE-GAN frameworks in increasing the quality and diversity of generations. Our enhanced α-GAN + SN, regularized with Spectral Normalization, performs significantly better still (15.1 vs 19.2 FID). This shows the effectiveness of a regularized direct estimate of the synthetic likelihood. Using the gradient penalty regularizer of Gulrajani et al. (2017) instead leads to a drop of 0.4 FID. Our BMS-VAE-GAN improves significantly over the α-GAN + SN baseline using the “Best-of-Many-Samples” (13.6 vs 15.1 FID). The results at 128 × 128 resolution mirror the results at 64 × 64. We additionally evaluate using the IvOM metric in Appendix C. We see that by using the “Best-of-Many-Samples” we obtain sharper results (Figure 4d) that cover more of the data distribution, as shown by both the FID and IvOM.
5 Conclusion
We propose a new objective for training hybrid VAE-GAN frameworks which overcomes key limitations of current hybrid VAE-GANs. We integrate: 1. A “Best-of-Many-Samples” reconstruction likelihood which helps in covering all the modes of the data distribution while maintaining a latent space as close to Gaussian as possible; 2. A stable estimate of the synthetic likelihood ratio. Our hybrid VAE-GAN framework outperforms state-of-the-art hybrid VAE-GANs and plain GANs in generative modelling on CelebA and CIFAR-10, demonstrating the effectiveness of our approach.

References
Jonas Adler and Sebastian Lunz. Banach Wasserstein GAN. NeurIPS, 2018.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ICML, 2017.
David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and diverse sampling of sequences based on a best of many sample objective. CVPR, 2018.
Apratim Bhattacharyya, Mario Fritz, and Bernt Schiele. Bayesian prediction of future street scenes using synthetic likelihoods. ICLR, 2019.
Alican Bozkurt, Babak Esmaeili, Dana H Brooks, Jennifer G Dy, and Jan-Willem van de Meent. Can VAEs generate novel examples? NeurIPS Workshop, 2018.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. ICLR, 2019.
Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. ICLR, 2017.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS, 2016.
Michael Comenetz. Calculus: The Elements. World Scientific Publishing Company, 2002.
Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. ICLR, 2017.
Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
Hamid Eghbal-zadeh, Werner Zellinger, and Gerhard Widmer. Mixture density generative adversarial networks. CVPR, 2019.
Mohamed Elfeki, Camille Couprie, Morgane Riviere, and Mohamed Elhoseiny. GDPP: Learning diverse generations using determinantal point process. ICML, 2019.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS, 2014.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. NIPS, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NIPS, 2017.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ICLR, 2018.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. ICML, 2016.
Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples in generative adversarial networks. NeurIPS, 2018.
Shweta Mahajan, Teresa Botschen, Iryna Gurevych, and Stefan Roth. Joint Wasserstein autoencoders for aligning multimodal embeddings. ICCV Workshop, 2019.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. ICLR, 2016.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. Dual discriminator generative adversarial nets. NIPS, 2017.
Frank Nielsen and Ke Sun. Guaranteed bounds on the Kullback–Leibler divergence of univariate mixtures. IEEE Signal Processing Letters, 2016.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016.
Suman Ravuri, Shakir Mohamed, Mihaela Rosca, and Oriol Vinyals. Learning implicit generative models with the method of learned moments. ICML, 2018.
Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2019.
Rui Shu, Hung H Bui, Shengjia Zhao, Mykel J Kochenderfer, and Stefano Ermon. Amortized inference regularization. NeurIPS, 2018.
Akash Srivastava, Lazar Valkoz, Chris Russell, Michael U Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. NIPS, 2017.
Jacek Tabor, Szymon Knop, Przemyslaw Spurek, Igor Podolak, Marcin Mazur, and Stanislaw Jastrzebski. Cramer–Wold autoencoder. arXiv preprint arXiv:1805.09235, 2018.
Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. ICLR, 2018.
Ngoc-Trung Tran, Tuan-Anh Bui, and Ngai-Man Cheung. Dist-GAN: An improved GAN using distance constraints. ECCV, 2018.
David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. ICLR, 2017.
Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 2010.
Chang Xiao, Peilin Zhong, and Changxi Zheng. BourGAN: Generative networks with metric embeddings. NeurIPS, 2018.
Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. ICLR, 2017a.
Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017b.
APPENDIX A. ADDITIONAL DERIVATIONS
We begin with a derivation of the multi-sample objective (3). We maximize the log-likelihood of the data ($x \sim p(x)$). Assuming the latent space is distributed according to $p(z)$, the log-likelihood is
$$\log(p_\theta(x)) = \log\Big( \int p_\theta(x|z)\, p(z)\, dz \Big). \quad (8)$$
Here, $p(z)$ is usually Gaussian. However, the integral in (8) is intractable. VAEs and hybrid VAE-GANs use amortized variational inference with a (jointly learned) approximate variational distribution $q_\phi(z|x)$,
$$\log(p_\theta(x)) = \log\Big( \int p_\theta(x|z)\, \frac{p(z)}{q_\phi(z|x)}\, q_\phi(z|x)\, dz \Big).$$
To arrive at a tractable objective, the standard VAE objective applies the Jensen inequality at this stage, but this forces the final objective to consider the average data likelihood. Following Bhattacharyya et al. (2018), we instead apply the Mean Value Theorem of Integration (Comenetz, 2002) to leverage multiple samples,
$$\log(p_\theta(x)) \geq \log\Big( \int_a^b p_\theta(x|z)\, q_\phi(z|x)\, dz \Big) + \log\Big( \frac{p(z')}{q_\phi(z'|x)} \Big), \quad z' \in [a, b]. \quad (9)$$
We can lower bound (9) with the minimum value over $z'$,
$$\log(p_\theta(x)) \geq \log\Big( \int_a^b p_\theta(x|z)\, q_\phi(z|x)\, dz \Big) + \min_{z' \in [a, b]} \log\Big( \frac{p(z')}{q_\phi(z'|x)} \Big). \quad (S2)$$
As the term on the right is difficult to estimate, we approximate it using the KL divergence (as in Bhattacharyya et al. (2018)). Intuitively, as the KL divergence heavily penalizes $q_\phi(z|x)$ wherever it is high while $p(z)$ is low, this ensures that the ratio $p(z')/q_\phi(z'|x)$ is maximized. Similar to Bhattacharyya et al. (2018), this leads to the "many-sample" objective (4) of the main paper,
$$\mathcal{L}_{\text{MS}} = \log\Big( \mathbb{E}_{q_\phi(z|x)}\, p_\theta(x|z) \Big) - \mathrm{KL}\big( q_\phi(z|x)\, \|\, p(z) \big). \quad (4)$$
Next, we provide a detailed derivation of (5).
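The "many-sample" objective (4) can be sketched as a Monte-Carlo estimate: draw $T$ posterior samples, average the decoder likelihood inside the log, and subtract the KL term. The following is a minimal numpy sketch under a toy Gaussian decoder (the decoder, latent dimensions, and all numbers below are stand-ins, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_mean_exp(a):
    # numerically stable log((1/T) * sum_t exp(a_t))
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def kl_to_unit_gauss(mu, sigma):
    # closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    return 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)

def gauss_decoder_logp(x, z):
    # toy decoder: p(x | z) = N(x; z, I), a stand-in for the real generator
    d = x - z
    return -0.5 * np.dot(d, d) - 0.5 * len(x) * np.log(2 * np.pi)

def many_sample_bound(x, mu, sigma, T):
    # estimate of L_MS: log-mean of p(x|z) over T posterior samples, minus KL
    z = mu + sigma * rng.normal(size=(T, len(mu)))
    logp = np.array([gauss_decoder_logp(x, zt) for zt in z])
    return log_mean_exp(logp) - kl_to_unit_gauss(mu, sigma)

x = np.array([0.5, -0.3])
good = many_sample_bound(x, mu=x.copy(), sigma=np.full(2, 0.1), T=20)
bad = many_sample_bound(x, mu=x + 5.0, sigma=np.full(2, 0.1), T=20)
print(good > bad)  # True: a posterior centred on x yields a higher bound
```

Taking the log of the sample *mean* inside `many_sample_bound` (rather than the mean of the logs, as in the standard ELBO) is what lets the bound reward the best posterior samples; the paper's "Best-of-Many-Samples" reconstruction cost pushes this further toward the maximum over samples.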
Again, to enable the estimation of the likelihood ratio $p_\theta(x|z) / p(x)$ using a neural network, we introduce the auxiliary variable $y$, where $y = 1$ denotes that the sample was generated and $y = 0$ denotes that the sample is from the true distribution. We can now express (5) as (using Bayes' theorem),
$$\alpha \log\Big( \mathbb{E}_{q_\phi(z|x)}\, \frac{p_\theta(x|z, y{=}1)}{p(x|y{=}0)} \Big) + \beta \log\Big( \mathbb{E}_{q_\phi(z|x)}\, p_\theta(x|z) \Big) - \mathrm{KL}\big( q_\phi(z|x)\, \|\, p(z) \big)$$
$$= \alpha \log\Big( \mathbb{E}_{q_\phi(z|x)}\, \frac{p_\theta(y{=}1|z,x)}{1 - p(y{=}1|x)} \Big) + \beta \log\Big( \mathbb{E}_{q_\phi(z|x)}\, p_\theta(x|z) \Big) - \mathrm{KL}\big( q_\phi(z|x)\, \|\, p(z) \big).$$
This is because (assuming independence, $p(z, x) = p(z)\, p(x)$),
$$p_\theta(x|z, y{=}1) = \frac{p(y{=}1|z,x)\, p(x)}{p(y{=}1)} \quad \text{and} \quad p(x|y{=}0) = \frac{p(y{=}0|x)\, p(x)}{p(y{=}0)}.$$
Assuming $p(y{=}0) = p(y{=}1)$ (samples are equally likely to be true or generated),
$$\frac{p_\theta(x|z, y{=}1)}{p(x|y{=}0)} = \frac{p_\theta(y{=}1|z,x)}{p(y{=}0|x)}.$$

Algorithm 1: BMS-VAE-GAN Training.
  Initialize parameters of R_φ, G_θ, D_I, D_L;
  for i ← 1 to max_iters do
    Update R_φ, G_θ (jointly) using our L_BMS-S objective;
    Update D_I using the hinge loss to produce high values (≥ a) for real images and low values (≤ −b) otherwise:
      E_{p(x)} max{0, a − log(D_I(x))} + E_{p(z)} max{0, b + log(D_I(G_θ(z)))};
    Update D_L using the standard cross-entropy loss:
      E_{p(z)} log(D_L(z)) + E_{p(x)} log(1 − D_L(R_φ(x)));
  end
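The two discriminator updates in Algorithm 1 can be sketched as plain loss functions. This is a minimal numpy sketch, not the paper's implementation: the margin values `a = b = 1` are an assumption (Algorithm 1 leaves them symbolic), and the discriminator outputs are stand-in arrays.

```python
import numpy as np

def d_image_hinge_loss(log_d_real, log_d_fake, a=1.0, b=1.0):
    # Hinge loss for the image discriminator D_I in Algorithm 1: the loss is
    # zero once log D_I(x) >= a on real images and log D_I(G(z)) <= -b on
    # generated images (margins a, b are assumed values here).
    real_term = np.maximum(0.0, a - log_d_real).mean()
    fake_term = np.maximum(0.0, b + log_d_fake).mean()
    return real_term + fake_term

def d_latent_ce_loss(d_on_prior, d_on_encodings, eps=1e-8):
    # Cross-entropy loss for the latent discriminator D_L: D_L should output
    # ~1 on prior samples z ~ p(z) and ~0 on encodings R(x).
    return -(np.log(d_on_prior + eps).mean()
             + np.log(1.0 - d_on_encodings + eps).mean())

# A discriminator that already separates real from fake incurs zero hinge loss.
print(d_image_hinge_loss(np.array([2.0, 1.5]), np.array([-2.0, -1.5])))  # 0.0
# A confused discriminator is penalized.
print(d_image_hinge_loss(np.array([0.0]), np.array([0.0])) > 0.0)        # True
```

The saturation of the hinge loss is what the paper credits for stability: once a sample is classified beyond the margin, it contributes no gradient, unlike the unbounded cross-entropy loss used for D_L.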
APPENDIX B. TRAINING ALGORITHM
We detail in Algorithm 1 how the components R_φ, G_θ, D_I, D_L of our BMS-VAE-GAN (see Figure 1) are trained. We follow Rosca et al. (2019) in designing Algorithm 1. However, unlike Rosca et al. (2019), we train R_φ and G_θ jointly, as we found it to be computationally cheaper without any loss of performance. Also unlike Rosca et al. (2019), we use the hinge loss to update D_I as it leads to improved stability (as discussed in the main paper).
APPENDIX C. ADDITIONAL RESULTS USING THE IOVM METRIC
We additionally evaluate using the IoVM on CelebA in Table 8, using the base DCGAN architecture at 64×64 resolution. We observe that our BMS-VAE-GAN performs better. The improvement is smaller than on CIFAR-10 because CelebA is less multi-modal than CIFAR-10. However, we still observe better overall sample quality from our BMS-VAE-GAN. This means that although the difference in data reconstruction is smaller, our BMS-VAE-GAN enables a better match to the prior in the latent space. Finally, we provide additional examples of closest matches found using IoVM in Figure 3, illustrating regions of the data distribution captured by BMS-VAE-GAN but not by SN-GAN or α-GAN + SN.

Method                          IoVM ↓
SN-GAN (Miyato et al., 2018)    0.0221 ±
α-GAN + SN (Ours), T = 1        ±
BMS-VAE-GAN (Ours), T = 10      ±

Table 8: Evaluation on CelebA using the IoVM metric.
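The closest matches reported above are found by searching the latent space for the code whose generation best reconstructs a given test sample. The following is a sketch of that latent-space search under a toy linear generator; the real evaluation optimizes through the trained generator network, and the generator, step count, and learning rate below are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))  # toy linear "generator": G(z) = W @ z

def closest_match(x, steps=5000, lr=0.01):
    # Gradient descent on the reconstruction error min_z ||G(z) - x||^2,
    # the latent-space search underlying the closest-match evaluation.
    z = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = W @ z - x
        z -= lr * 2.0 * (W.T @ residual)  # gradient of the squared error
    return W @ z, z

x = W @ rng.normal(size=4)     # a sample the toy generator can actually reach
x_hat, z_star = closest_match(x)
print(np.allclose(x_hat, x, atol=1e-3))  # True: near-perfect reconstruction
```

For a sample the generator cannot reach (a missed mode), the residual error stays large no matter how long the search runs, which is exactly what the closest-match figures visualize.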
APPENDIX D. ADDITIONAL QUALITATIVE EXAMPLES ON CELEBA AND CIFAR-10
In Figure 4, we compare our BMS-VAE-GAN qualitatively against other state-of-the-art GANs. We use the same settings as in the main paper and the same DCGAN architecture across methods (as the aim is to evaluate training objectives). Again note that approaches like Karras et al. (2018) use larger generator/discriminator architectures and can be applied on top. We see that BEGAN (Berthelot et al., 2017) produces sharp images (with only a few very minuscule artifacts) but lacks diversity, as also reflected by its FID score in Table 2 of the main paper. In comparison, both SN-GAN (Miyato et al., 2018) and Dist-GAN (Tran et al., 2018) produce sharp and diverse images (again reflected by their FID scores in Table 2 of the main paper) but also introduce artifacts; Dist-GAN introduces relatively fewer artifacts than SN-GAN. Our BMS-VAE-GAN strikes the best balance, generating sharp and diverse images with few if any artifacts (again reflected by the FID scores in the main paper). We also provide additional qualitative examples on CIFAR-10 in Figure 5, highlighting sharper images compared to α-GAN + SN.

Figure 3: Closest images found by optimising in the latent space; columns show the test sample and its closest matches from SN-GAN, α-GAN + SN, and BMS-VAE-GAN. Left: CIFAR-10, Right: CelebA.
APPENDIX E. ADDITIONAL DIVERSITY EVALUATION USING LPIPS
In Table 9 we report diversity using the LPIPS metric. To compute the LPIPS diversity score, 5k samples were randomly generated and the pairwise similarity within the batch was computed. We see that our BMS-VAE-GAN generates the most diverse examples on both datasets, further highlighting the effectiveness of our "Best-of-Many-Samples" objective.
Method                          CelebA ↓    CIFAR-10 ↓
SN-GAN                          0.160       0.148
α-GAN + SN (Ours), T = 1
BMS-VAE-GAN (Ours), T = 10

Table 9: Evaluation using the LPIPS metric.

Figure 4: CelebA random samples of state-of-the-art GANs versus our BMS-VAE-GAN (using the DCGAN architecture): (a) BEGAN (Berthelot et al., 2017) (64×64), (b) SN-GAN (Miyato et al., 2018) (64×64), (c) (64×64), (d) our BMS-VAE-GAN (T = 30, 64×64).

Figure 5: (a) α-GAN + SN (T = 1), (b) BMS-VAE-GAN (T = 30).
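The batch-diversity computation described in Appendix E can be sketched generically: generate a batch, then average a pairwise distance over all pairs. LPIPS itself requires a pretrained perceptual network, so plain Euclidean distance stands in for it in this sketch; with that stand-in, a higher score means a more diverse batch.

```python
import numpy as np

def batch_diversity(samples, dist=None):
    # Average pairwise distance within a batch of generated samples.
    # The paper uses LPIPS as the distance; Euclidean distance is a
    # self-contained stand-in here.
    if dist is None:
        dist = lambda u, v: float(np.linalg.norm(u - v))
    n = len(samples)
    pair_dists = [dist(samples[i], samples[j])
                  for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(pair_dists))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(50, 16))                     # spread-out samples
collapsed = np.tile(rng.normal(size=(1, 16)), (50, 1))  # mode-collapsed batch
collapsed += 0.01 * rng.normal(size=(50, 16))
print(batch_diversity(diverse) > batch_diversity(collapsed))  # True
```

A mode-collapsed generator emits near-duplicates, so its average pairwise distance is close to zero regardless of how sharp the individual samples are, which is why a diversity score complements FID.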