GANs with Variational Entropy Regularizers: Applications in Mitigating the Mode-Collapse Issue
Pirazh Khorramshahi, Hossein Souri, Rama Chellappa, Soheil Feizi
University of Maryland, College Park
{pirazhkh, hsouri, rama}@umiacs.umd.edu, [email protected]

Abstract—Building on the success of deep learning, Generative Adversarial Networks (GANs) provide a modern approach to learning a probability distribution from observed samples. GANs are often formulated as a zero-sum game between two sets of functions: the generator and the discriminator. Although GANs have shown great potential in learning complex distributions such as images, they often suffer from the mode collapse issue, where the generator fails to capture all existing modes of the input distribution. As a consequence, the diversity of generated samples is lower than that of the observed ones. To tackle this issue, we take an information-theoretic approach and maximize a variational lower bound on the entropy of the generated samples to increase their diversity. We call this approach GANs with Variational Entropy Regularizers (GAN+VER). Existing remedies for the mode collapse issue in GANs can easily be coupled with our proposed variational entropy regularization. Through extensive experimentation on standard benchmark datasets, we show that all the existing evaluation metrics highlighting the difference between real and generated samples are significantly improved with GAN+VER.
I. INTRODUCTION
Generative Adversarial Networks (GANs) [1] have shown great promise in generating realistic samples. They have a variety of applications in computer vision [2]–[8], natural language processing [9]–[12], semantic segmentation [13], [14], and cybersecurity [15]–[17]. GANs are able to generate high-quality synthetic text, audio, images, and video that are hardly distinguishable from real data. More specifically, in computer vision, GANs have been widely used for image generation [7], [18], [19], super-resolution [20], restoration [8], [21], and translation [5].
In spite of GANs' success in generating high-quality realistic data, there are issues that can limit their application. Not only is it nontrivial to train a generator and discriminator pair, but it is also hard to evaluate their performance [22]. Moreover,
mode collapse is a common issue during training, where the diversity of generated samples becomes smaller than that of the real data. As an example, with the MNIST dataset with 10 modes, i.e., the digits {0, . . . , 9}, the generated samples may capture only a few of these modes, as demonstrated in Fig. 1a. Mode collapse in GANs can be the result of improper formulations of the training objective [23], [24], low-capacity generators [25], [26], or weak discriminator functions [27], [28]. The main approaches to mitigate the mode collapse issue include (i) discriminator augmentation, such as PacGAN [28], where the discriminator is modified to make decisions based on multiple samples of either the real or generated distribution; (ii) encoder-based regularization, where an encoder function, mapping from the input space to the latent space, is used to increase the diversity of generated samples (e.g., [29]–[31]); and (iii) mixture generators, where generated samples probabilistically come from multiple generator functions [25], [26]. In [25], the latter approach has been used and optimized over mixture proportions to mitigate the collapse of rare modes. Although the proposed method in [25] decreases the mode collapse issue, especially on rare modes, it requires knowledge of the number of modes in the data distribution, which may not be available. Also, training a GAN with mixture generators is more difficult than training the standard GAN used in practice. Another way to reduce mode collapse is to maximize the entropy of the generator [32]. However, maximizing the differential entropy of a high-dimensional random variable is not easy, as it requires integration in a high-dimensional space.
In this paper, we propose to take an information-theoretic approach and use the Mutual Information (MI) between the input and output of the generator as an alternative to differential entropy. Our proposed approach is complementary to existing methods and can be effortlessly coupled with them to further mitigate mode collapse in generative models.
In summary, our contributions in this paper can be summarized as follows:
• We prove that the MI between the input and output of the generator is directly related to the entropy of generated samples, and that maximizing MI directly increases the diversity of generated samples.
• Through extensive experimentation, we show that our proposed approach reduces the degree of mode collapse across standard benchmarks.

*The first two authors contributed equally to this work.

Fig. 1: Mode collapse issue in generated samples. (a) Random samples of the generated MNIST dataset (Vanilla GAN): the generator mostly generates a few of the modes. (b) Random samples of the generated MNIST dataset (Vanilla GAN+VER): adding variational entropy regularization increases the diversity of generated samples.
• We establish benchmark results on the best GAN evaluation metrics.

II. RELATED WORK
Several approaches for mitigating mode collapse have been proposed [28], [31]–[35]. Tolstikhin et al. [33] proposed to train a collection of generators instead of a single generator, inspired by boosting techniques. Srivastava et al. [31] proposed VEEGAN to address this problem by employing a reconstructor network that reverses the generator's role and maps the generated data back to noise. VEEGAN trains the generator and reconstructor network jointly; consequently, the generator is encouraged to generate the entirety of the true data distribution. Lin et al. [28] proposed to modify the discriminator to make decisions based on multiple samples from the same class; the discriminator can then perform binary hypothesis testing, which penalizes a generator prone to mode collapse. Recently, Prescribed Generative Adversarial Networks (PresGAN) were proposed by Dieng et al. [32]. PresGAN mitigates mode collapse by adding the negative entropy of the generator to the loss function, so that minimizing the loss maximizes the entropy. Since the density of the generator is intractable, PresGAN adds noise to the output of the density network and uses unbiased estimates to compute gradients of the entropy regularization term. In our model, we avoid computing the entropy of the generator directly. Instead, our model benefits from using the MI between the latent variable Z and the generated sample X̂. In Section III we prove that the MI in this case is identical to the entropy.

III. METHOD
GANs are based on minimizing a distance measure between two distributions. That is:

min_{θ_g} d(P_X, P_X̂),  (1)

where P_X and P_X̂ are the distributions of the real data X and the generated data X̂ = G(Z; θ_g), respectively, with Z being the input latent variable to the generator. Moreover, d(·,·) is a distance measure between two distributions and θ_g is the set of parameters of the generator. In the last couple of years, several distance measures have been used in optimization (1), such as optimal transport (OT) measures [23], [36], divergence measures [3], [37], and moment-based measures [38], [39]. In particular, an example of an OT distance is the Wasserstein distance, defined as:

d(P_X, P_X̂) := min_{P_{X,X̂}} E[ ‖X − X̂‖^p ],  (2)

where p ≥ 1 is the order of the Wasserstein distance. Note that, in general, d(·,·) may not be a metric.
The celebrated Shannon entropy [40] is a fundamental measure of diversity in distributions. Therefore, forcing the generator to increase the entropy of the generative distribution should directly impact the mode collapse issue. This can be done through entropy regularization in the GAN objective:

min_{θ_g} d(P_X, P_X̂) − λ h(P_X̂),  (3)

where h(·) is the differential entropy defined as h(µ) = E_{x∼µ}[−log µ(x)] and λ is the regularization parameter. Moreover, the negative entropy function is strongly convex, so adding an entropy regularization term to the formulation of a generative model will improve its optimization landscape as well.
This phenomenon has been observed in a related entropy regularization for optimal transport optimization in [41]. Note that the entropy function in [41] is applied to the coupling distribution of an optimal transport problem, while our proposal applies the entropy function to the distribution of the generative model to increase its diversity and resolve the mode collapse issue.
Here we consider the mutual information I(X̂; Z) = h(X̂) − h(X̂ | Z). Since the conditional entropy term h(X̂ | Z) is zero for a deterministic generator function, we have I(X̂; Z) = h(X̂), and the optimization problem (3) can be written as

min_{θ_g} d(P_X, P_X̂) − λ I(X̂; Z).  (4)

This optimization is similar to that of InfoGAN [42], with the difference that InfoGAN introduces extra code variables and maximizes the mutual information between the code and X̂ to provide disentangled representations of the data. In our case, however, we maximize the mutual information between the latent variable Z and the generated sample X̂ to mitigate the mode collapse issue. To solve optimization (4), we use the neural information measure introduced in [43] as a variational lower bound on mutual information:

I_Θ(X̂; Z) = sup_{θ_m} E_{P_{X̂Z}}[ T_{θ_m} ] − log( E_{P_X̂ ⊗ P_Z}[ e^{T_{θ_m}} ] ),  (5)

where P_{X̂Z} and P_X̂ ⊗ P_Z are the joint and the product of marginal distributions of X̂ and Z, respectively. Additionally, T_{θ_m} : 𝒳̂ × 𝒵 → R is a family of functions characterized by a deep neural network [43] that can be optimized using gradient descent methods. Hence, we solve the following optimization problem:

min_{θ_g} min_{θ_m} d(P_X, P_X̂) − λ I_Θ(X̂; Z).  (6)

Fig. 2 illustrates the block diagram of our proposed model architecture. In the next section, we present our experimental setup, the implementation details of GAN+VER, and the datasets used in our experiments.

IV. EXPERIMENTS
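To make the variational bound in (5) concrete, the following NumPy sketch evaluates the Donsker-Varadhan expression for a fixed, hand-picked statistics function T. In GAN+VER, T_{θ_m} is a trained neural network; the closed-form T(x, z) = x·z/4 below is purely a hypothetical stand-in for illustration.

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan value: E_joint[T] - log E_marginal[exp(T)].

    For any fixed T this lower-bounds I(X_hat; Z); the supremum over T
    in (5) is approached by optimizing a neural network in practice."""
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())

rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)
x_hat = 2.0 * z                          # toy deterministic generator G(z) = 2z

T = lambda x, zz: 0.25 * x * zz          # hypothetical statistics function
t_joint = T(x_hat, z)                    # pairs drawn from the joint P_{X_hat Z}
t_marginal = T(x_hat, rng.permutation(z))  # shuffling z emulates P_X_hat x P_Z

bound = dv_lower_bound(t_joint, t_marginal)
print(bound)  # positive, since X_hat and Z are strongly dependent
```

Because X̂ is a deterministic function of Z, the true MI here is unbounded; any fixed T yields only a finite lower bound, which is why [43] optimizes T by gradient ascent.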
In this section we present the datasets used in prior works for density estimation and the evaluation metrics used to assess the quality of generated samples and the severity of mode collapse; afterwards, we describe the implementation details of GAN+VER.

Fig. 2: Overview of the proposed model. Noise vector Z is passed to the generator G to generate sample X̂. Then the real data X and X̂ go through the discriminator D to compute the adversarial loss. To compute I_Θ(X̂; Z), Z and X̂ are passed through the mutual information estimator network M [43].

A. Datasets
Standard benchmarks, namely synthetic Gaussian Mixture Models (GMMs) and MNIST, have been frequently used to test the effectiveness of various GAN models, and we follow the same convention. In the case of GMM, we consider random training samples drawn from mixture components placed on a ring of fixed radius and from components placed on a square grid; a small, fixed standard deviation σ is used in creating the GMM datasets. During test time, we synthesize random samples from the real distribution and generate the same number of samples from the trained generator; we repeat this five times and average the results to compute the evaluation scores.
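As a concrete illustration of the ring construction, the sketch below draws samples from a 2D GMM whose means are evenly spaced on a circle. The specific values (8 components, unit radius, σ = 0.05) are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def gmm_ring(n_samples, n_components=8, radius=1.0, sigma=0.05, seed=0):
    """Sample a 2D Gaussian mixture whose means sit evenly on a ring."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * np.arange(n_components) / n_components
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    labels = rng.integers(0, n_components, size=n_samples)  # uniform mixture
    noise = sigma * rng.standard_normal((n_samples, 2))
    return means[labels] + noise, means

samples, means = gmm_ring(10_000)
```

The grid variant is analogous, with the means placed on a regular 2D lattice instead of a circle.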
Due to the absence of the true target and source distributions, we measure sample-based evaluation metrics designed to assess the quality of generated samples. Therefore, following [44], for the MNIST dataset we use:
• Inception Score (IS): the most commonly used metric to evaluate the quality of generated samples, using an ImageNet [45] pre-trained Inception model M [46]. IS is computed as follows:

IS = exp( E_{x∼P_G}[ KL( p_M(y|x) ‖ p_M(y) ) ] ),

where KL is the Kullback–Leibler divergence, and p_M(y|x) and p_M(y) are the conditional and marginal distributions of the Inception model's predicted labels. Larger values of IS correspond to higher quality of generated samples. However, the IS score is only meaningful for colored image generation, as it is based on ImageNet; therefore, we train the model M on the training set of MNIST and use the trained model in the IS calculation. Note that IS is not suited for measuring mode collapse, as no comparison between real and generated data occurs in its formulation.
• Fréchet Inception Distance (FID): the second most commonly used metric to evaluate generated sample quality:

FID = ‖µ_X − µ_X̂‖² + tr( C_X + C_X̂ − 2 (C_X C_X̂)^{1/2} ).

In FID, the distributions of the Inception model M's outputs for real and generated samples are modeled as multivariate Gaussian distributions, i.e., P_X ∼ N(µ_X, C_X) and P_g ∼ N(µ_X̂, C_X̂), respectively. Intuitively, lower values are desired.
• Maximum Mean Discrepancy (MMD): measures the dissimilarity of the real P_X and generated P_X̂ sample distributions with a Gaussian kernel k:

MMD² = E_{x,x′∼P_X, x̂,x̂′∼P_X̂}[ k(x, x′) − 2 k(x, x̂) + k(x̂, x̂′) ].

Lower values of MMD are preferred.
• Wasserstein Distance (WD), defined as:

WD = inf_{P_{X,X̂}} E[ ‖X − X̂‖ ].

A generative distribution that is more similar to the real distribution results in a lower WD score.
• 1-Nearest Neighbor Leave-One-Out (1-NN LOO) accuracy: used to measure the performance of a two-class real-vs-generated classifier, i.e., Total Accuracy (TA), per-class accuracies (Real Accuracy (RA), Generated Accuracy (GA)), Precision (PR), and Recall (RE). If the generated samples successfully replicate the properties of the real data distribution, all the accuracies, precision, and recall should be around 50%. According to [44], this metric best captures the quality of samples in terms of mode collapse.
In the case of the GMM datasets, in addition to MMD, WD, and the 1-NN LOO accuracies, we adopt the metrics suggested in [31]:
• Modes: the average number of detected mixture components. A component is detected if there is a generated sample that lies within three standard deviations of it. The issue with this metric is that a component is deemed detected even if only a single generated sample happens, perhaps randomly, to land close to it.
• High Quality (HQ) Samples: the ratio of generated samples that lie within three standard deviations of a mixture component. This metric is not indicative of sample diversity: a generator that collapses onto a single component can still attain a high HQ ratio.
• KL Divergence between the synthetic and generated data. In our implementation, we segment the 2D plane into square regions with sides equal to the standard deviation of the mixture components to create bins and measure densities.
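The Modes and HQ metrics above reduce to simple distance tests. A minimal NumPy sketch (assuming 2D samples, known component means, and the three-standard-deviation threshold described above) also shows why HQ alone is uninformative under collapse:

```python
import numpy as np

def modes_and_hq(samples, means, sigma):
    """Count detected components and the high-quality sample ratio.

    A component is 'detected' if at least one sample lies within
    3*sigma of its mean; a sample is 'high quality' if it lies within
    3*sigma of some component mean."""
    # Pairwise distances, shape (n_samples, n_components)
    d = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=-1)
    close = d < 3 * sigma
    n_modes = int(close.any(axis=0).sum())      # components hit by any sample
    hq_ratio = float(close.any(axis=1).mean())  # samples near any component
    return n_modes, hq_ratio

# A collapsed "generator" that only ever outputs the first component's mean
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
collapsed = np.tile(means[0], (1000, 1))
n_modes, hq = modes_and_hq(collapsed, means, sigma=0.1)
print(n_modes, hq)  # 1 1.0 -- one mode detected, yet the HQ ratio is perfect
```

The collapsed generator detects a single mode but still scores HQ = 1.0, matching the caveat in the HQ bullet above.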
C. Experimental Setup & Implementation Details
In our experiments we consider four scenarios:
• Vanilla GAN (VGAN): The training objective consists only of the original GAN loss proposed in [1]:

min_G max_D E_{X∼P_X}[ log D(X) ] + E_{Z∼P_Z}[ log(1 − D(G(Z))) ].  (7)

• Wasserstein GAN (WGAN): The training objective is to minimize the Earth Mover's Distance (EMD), coupled with the discriminator's gradient penalty (GP) term [47]:

min_G max_D E_{X∼P_X}[ D(X) ] − E_{Z∼P_Z}[ D(G(Z)) ] + GP.  (8)

• VGAN+VER: In addition to the adversarial loss in (7), the training objective incorporates the neural information measure I_Θ(X̂; Z) to encourage diversity in generated samples.
• WGAN+VER: The EMD training objective is augmented with the neural information measure I_Θ(X̂; Z).
For the generator G and discriminator D, we follow the architectures employed in [28]. To estimate the neural information measure, we use a two-layer perceptron M with a fixed hidden-unit size and Leaky ReLU non-linearity. All networks are initialized with the Kaiming method [48]. In addition, for GAN+VER we use the adaptive gradient clipping suggested in [43] to bound the gradients flowing back from M; this is required since there is no upper bound on MI. Moreover, in our experiments we use mini-batch training and the Adam optimizer [49] with a fixed learning rate. All the networks (G, D, and, for GAN+VER, M) were trained to convergence.

Fig. 3: Synthetic GMM datasets. The first row corresponds to generators trained on the ring data: (a) the GMM data, (b) VGAN, (c) WGAN, (d) VGAN+VER, (e) WGAN+VER. The second row shows generated samples for the synthetic grid dataset: (f) the GMM data, (g) VGAN, (h) WGAN, (i) VGAN+VER, (j) WGAN+VER.

V. RESULTS
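Adaptive gradient clipping can take several forms; the sketch below implements one generic variant, rescaling any gradient whose norm exceeds a running average of recent norms. This is an illustrative scheme under that assumption, not necessarily the exact rule used in [43].

```python
import numpy as np

class AdaptiveGradClipper:
    """Clip a gradient's norm to a multiple of an exponential moving
    average of recent gradient norms (a generic sketch of adaptive
    clipping; the exact scheme in MINE may differ)."""

    def __init__(self, beta=0.99, ratio=1.0):
        self.beta, self.ratio = beta, ratio
        self.avg_norm = None  # running estimate of the typical norm

    def clip(self, grad):
        norm = np.linalg.norm(grad)
        if self.avg_norm is None:
            self.avg_norm = norm  # first call seeds the running average
        limit = self.ratio * self.avg_norm
        if norm > limit > 0:
            grad = grad * (limit / norm)  # rescale to the adaptive limit
        # Update the running average with the *unclipped* norm
        self.avg_norm = self.beta * self.avg_norm + (1 - self.beta) * norm
        return grad

clipper = AdaptiveGradClipper()
clipper.clip(np.ones(4))                 # establishes avg_norm = 2.0
spiky = clipper.clip(100 * np.ones(4))   # a large spike gets rescaled
```

Clipping against a running statistic, rather than a fixed constant, keeps the threshold meaningful as the unbounded MI term grows during training.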
In this section we present the best evaluation scores of the four models discussed in Subsection IV-C, obtained via cross-validation. We also highlight how VER alleviates mode collapse.
A. GMM

1) Ring:
Table I presents the quantitative results of VGAN, WGAN, VGAN+VER, and WGAN+VER on the GMM ring dataset. It is evident that, except for the HQ samples metric (which, as noted above, is not informative), models with variational entropy regularization perform better, especially on the 1-NN LOO metrics, which get closer to 50%. This indicates that real and generated samples have become more indistinguishable through variational entropy regularization. Also, note that the WGAN-based models have the lowest WD values, as their objective is to minimize the Wasserstein distance.
2) Grid:
Evaluation results for models trained on the GMM grid dataset are reported in Table II. Similar to the GMM ring dataset, entropy regularization improves the evaluation metrics for both VGAN and WGAN. Also note that, due to the better mathematical properties of the Wasserstein distance, WGAN's generated samples are superior to VGAN's in terms of the evaluation metrics, as expected; however, augmenting the training objective of VGAN with VER boosts its performance close to, or even beyond, WGAN+VER on some 1-NN LOO metrics. Figure 4 shows the progression of several metrics for the four models over the course of training. Figure 3 visualizes the GMM ring and grid datasets and how successful the four models are in learning the respective densities.

B. MNIST
For the MNIST dataset, in addition to the metrics adopted for the GMM datasets (except HQ), we compute the IS and FID scores. Note that here the Modes metric is the number of MNIST classes that the model M recognizes in a set of randomly generated images. Table III reports the evaluation results of the four models. It can be observed that VGAN has inferior performance with respect to the other models; in particular, the average number of modes detected is 7.1, which illustrates the mode collapse issue associated with VGAN. However, when its training objective is augmented with variational entropy regularization, all the metrics are significantly boosted. Figure 1 qualitatively shows how VER reduces mode collapse in VGAN. Also, thanks to the superior properties of the Wasserstein loss, entropy regularization does not seem to improve WGAN's metrics considerably on MNIST. This can be a result of the increased complexity of the MNIST data manifold compared to the GMM synthetic datasets.

TABLE I: Performance of generative models on the GMM ring dataset (Modes, HQ(%), KL, WD, MMD, and 1-NN LOO TA/RA/GA/PR/RE for VGAN, WGAN, VGAN+VER, and WGAN+VER).

TABLE II: Performance of generative models on the GMM grid dataset (same metrics as Table I).

TABLE III: Performance of generative models on MNIST (IS, FID, Modes, KL, WD, MMD, and 1-NN LOO TA/RA/GA/PR/RE).

Fig. 4: Progression of the evaluation metrics in Table II, (a) HQ, (b) KL, (c) WD, (d) TA, for VGAN (blue), WGAN (green), VGAN+VER (orange), and WGAN+VER (red) when trained on the GMM grid dataset (x-axis represents epochs).

VI. CONCLUSION
In this paper, an information-theoretic approach is presented to encourage diversity in the generated samples of a GAN model and thereby reduce the mode collapse issue. We present Variational Entropy Regularization (VER), which maximizes a variational lower bound on the mutual information between the input and output of the generator. Theoretically, we prove that this mutual information is identical to the entropy of the generated samples; therefore, maximizing it alleviates mode collapse, as supported by our experiments. Through extensive experimentation, we show that VER improves metrics corresponding to generated sample quality and indicative of the mode collapse issue across standard datasets; we also establish benchmark results on the most informative GAN metrics for the MNIST and GMM datasets. Note that VER is a simple regularization term that can easily be added to any type of GAN model.
In the future, we plan to extend variational entropy regularization to other datasets such as CIFAR-10 and CIFAR-100. Further, we will study the impact of VER at the feature level of the generator and discriminator to reduce the gap between the neural information measure and the true mutual information, and thus enjoy a tighter variational bound.
REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[3] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[4] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," in Advances in Neural Information Processing Systems 29, 2016, pp. 613–621. [Online]. Available: http://papers.nips.cc/paper/6194-generating-videos-with-scene-dynamics.pdf
[5] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[6] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
[7] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[8] C. P. Lau, H. Souri, and R. Chellappa, "ATFaceGAN: Single face image restoration and recognition from atmospheric turbulence," arXiv preprint arXiv:1910.03119, 2019.
[9] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[10] W. Fedus, I. Goodfellow, and A. M. Dai, "MaskGAN: Better text generation via filling in the _," arXiv preprint arXiv:1801.07736, 2018.
[11] Z. Yang, W. Chen, F. Wang, and B. Xu, "Improving neural machine translation with conditional sequence generative adversarial nets," 2017.
[12] K. Wang and X. Wan, "SentiGAN: Generating sentimental texts via mixture adversarial networks," in IJCAI, 2018, pp. 4446–4452.
[13] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic segmentation using adversarial networks," arXiv preprint arXiv:1611.08408, 2016.
[14] H. Dong, S. Yu, C. Wu, and Y. Guo, "Semantic image synthesis via adversarial learning," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5706–5714.
[15] B. Hitaj, P. Gasti, G. Ateniese, and F. Perez-Cruz, "PassGAN: A deep learning approach for password guessing," in International Conference on Applied Cryptography and Network Security. Springer, 2019, pp. 217–237.
[16] W. Hu and Y. Tan, "Generating adversarial malware examples for black-box attacks based on GAN," 2017.
[17] H. Shi, J. Dong, W. Wang, Y. Qian, and X. Zhang, "SSGAN: Secure steganography based on generative adversarial networks," in Pacific Rim Conference on Multimedia. Springer, 2017, pp. 534–544.
[18] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," arXiv preprint arXiv:1809.11096, 2018.
[19] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798–8807.
[20] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[21] B. Lu, J.-C. Chen, and R. Chellappa, "Unsupervised domain-specific deblurring via disentangled representations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10225–10234.
[22] Z. Wang, Q. She, and T. E. Ward, "Generative adversarial networks: A survey and taxonomy," arXiv preprint arXiv:1906.01529, 2019.
[23] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[24] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, "Generalization and equilibrium in generative adversarial nets (GANs)," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 224–232.
[25] Y. Balaji, R. Chellappa, and S. Feizi, "Normalized Wasserstein distance for mixture distributions with applications in adversarial learning and domain adaptation," 2019.
[26] Q. Hoang, T. D. Nguyen, T. Le, and D. Phung, "MGAN: Training generative adversarial nets with multiple generators," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rkmu5b0a-
[27] J. Li, A. Madry, J. Peebles, and L. Schmidt, "Towards understanding the dynamics of generative adversarial networks," ArXiv, vol. abs/1706.09884, 2017.
[28] Z. Lin, A. Khetan, G. Fanti, and S. Oh, "PacGAN: The power of two samples in generative adversarial networks," 2017.
[29] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," 2016.
[30] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, "Adversarially learned inference," 2016.
[31] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton, "VEEGAN: Reducing mode collapse in GANs using implicit variational learning," 2017.
[32] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, and M. K. Titsias, "Prescribed generative adversarial networks," 2019.
[33] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, "AdaGAN: Boosting generative models," 2017.
[34] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, "Unrolled generative adversarial networks," 2016.
[35] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, "Mode regularized generative adversarial networks," 2016.
[36] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[37] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," arXiv preprint arXiv:1606.00709, 2016.
[38] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani, "Training generative neural networks via maximum mean discrepancy optimization," arXiv preprint arXiv:1505.03906, 2015.
[39] Y. Li, K. Swersky, and R. Zemel, "Generative moment matching networks," in International Conference on Machine Learning, 2015, pp. 1718–1727.
[40] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[41] M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transportation distances," 2013.
[42] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," 2016.
[43] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, "MINE: Mutual information neural estimation," arXiv preprint arXiv:1801.04062, 2018.
[44] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger, "An empirical study on evaluation metrics of generative adversarial networks," arXiv preprint arXiv:1806.07755, 2018.
[45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[46] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[47] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[48] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.