HGAN: Hybrid Generative Adversarial Network
Seyed Mehdi Iranmanesh, West Virginia University, [email protected]
Nasser M. Nasrabadi, West Virginia University, [email protected]
Abstract
In this paper, we present a simple approach to train Generative Adversarial Networks (GANs) that avoids the mode collapse issue. Implicit models such as GANs tend to generate better samples than explicit models that are trained on a tractable data likelihood. However, GANs overlook the explicit data density characteristics, which leads to undesirable quantitative evaluations and mode collapse. To bridge this gap, we propose a hybrid generative adversarial network (HGAN) in which we enforce data density estimation via an autoregressive model and support both the adversarial and likelihood frameworks in a joint training manner, which diversifies the estimated density in order to cover different modes. We propose to use an adversarial network to transfer knowledge from an autoregressive model (teacher) to the generator (student) of a GAN model. A novel deep architecture within the GAN formulation is developed to adversarially distill the autoregressive model information in addition to the simple GAN training approach. We conduct extensive experiments on real-world datasets (i.e., MNIST, CIFAR-10, STL-10) to demonstrate the effectiveness of the proposed HGAN under qualitative and quantitative evaluations. The experimental results show the superiority and competitiveness of our method compared to the baselines.
1. Introduction
Generative models have grown extensively in recent years. The main goal of a generative model is to approximate the true data distribution, which is not known. Generative models are based on finding the model parameters that maximize the likelihood of the training data. This is equivalent to minimizing the Kullback-Leibler (KL) divergence $D_{KL}(p_{data} \| p_{model})$ between the data distribution $p_{data}$ and the model distribution $p_{model}$. Although this objective spans multiple modes of the data, it leads to generating vague and undesirable samples [45]. Other approaches minimize $D_{KL}(p_{model} \| p_{data})$, usually referred to as the reverse KL divergence [34]; this is the main idea behind generative adversarial networks. Although these models generate sharp images, minimizing the reverse KL divergence causes the model distribution to focus on a single mode of the data and ignore the other modes. This is known as mode collapse in generative adversarial models [33]. It happens because the reverse KL divergence measures the dissimilarity between the two distributions only where the model places mass, and there is no penalty on the fraction of the data distribution that the model fails to cover [2]. To address this problem, the authors in [1] suggested the Wasserstein distance, which has the weakest convergence among existing GAN objectives. However, they used weight clipping to approximate the Wasserstein distance, which causes pathological behavior [33].

In general, the choice for modeling the density function is challenging. There are two ways to estimate the density function, namely implicit methods and explicit methods. Implicit approaches tend to calculate the model parameters without the need for the analytical form of $p_{model}$, while explicit models have the advantage of explicitly calculating the probability densities. There are two well-known implicit approaches, namely the GAN and the Variational AutoEncoder (VAE), which try to model the data distribution implicitly. VAEs maximize a lower bound on the data likelihood, while a GAN performs a minimax game between two players during its optimization in which, for an optimal discriminator, the algorithm tries to find a generator that minimizes the Jensen-Shannon divergence (JSD). JSD minimization has been shown empirically to behave more like the reverse KL divergence than the KL divergence [21, 33]. This behavior leads to the aforementioned problem of mode collapse in GAN models, which causes the generator to create similar-looking images with poor sample diversity.

In contrast to VAE models, which implicitly compute the likelihood of the data space, autoregressive models have the advantage of a tractable likelihood and can generate diverse samples. The basic idea of these models is to use autoregressive connections to model an image pixel by pixel. In fact, autoregressive approaches model the joint distribution of the pixels in the image as the product of conditional distributions [35]. However, these models suffer from slow synthesis when compared to GANs.
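To make the asymmetry between the forward and reverse KL divergences concrete, the following NumPy sketch (our illustration, not part of the paper) discretizes a bimodal "data" density and a unimodal "model" density that covers only one mode. The forward KL blows up because a data mode is missed, while the reverse KL stays small, which is why reverse-KL-like objectives tolerate mode dropping:

```python
import numpy as np

# Discretized toy distributions on a 1-D grid: a bimodal "data" density
# and a unimodal "model" density sitting on only one of the two modes.
x = np.linspace(-6, 6, 200)

def normalize(p):
    return p / p.sum()

gauss = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2)
p_data = normalize(gauss(-2, 0.5) + gauss(2, 0.5))   # two modes
p_model = normalize(gauss(2, 0.5))                   # covers one mode only

def kl(p, q, eps=1e-12):
    # D_KL(p || q) on a discrete grid; eps avoids log(0)
    return np.sum(p * np.log((p + eps) / (q + eps)))

print("forward KL(p_data || p_model):", kl(p_data, p_model))  # large: a mode is missed
print("reverse KL(p_model || p_data):", kl(p_model, p_data))  # small: mode-seeking is cheap
```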
Figure 1. Proposed HGAN framework, with an autoregressive model, a generator, and a discriminator trained by using two types of real data.

The lack of an explicit density function in GANs is problematic for two main reasons. First, many applications of deep generative models are based on density estimation; for instance, count-based exploration methods [36] that rely on density estimation have achieved state-of-the-art performance in reinforcement learning environments [15]. Second, the quantitative evaluation of the generalization performance of such models is challenging: since GANs are typically able to generate sharp samples by memorizing the training data, evaluation criteria based on ad-hoc sample quality metrics [42] do not capture the mode collapse issue.

Recently, several approaches have tried to solve the mode collapse issue. MGAN [20] trains many generators by using a classifier and a discriminator. Using many generators and a classifier in addition to the classical GAN makes this model computationally complex and prone to over-fitting the training dataset. Other approaches attempt to use autoencoders as regularizers or additional losses to penalize the missing modes [48, 49]. In [51], the authors used an LSTM-based autoregressive model in their discriminator function and considered the reconstruction loss as the penalty for fake data. However, in their GAN model they trained the discriminator only on the true data, as it becomes unbounded for the fake data synthesized by the generator. SNGAN [31] utilizes spectral normalization to stabilize the training of the discriminator; it controls the Lipschitz constant of the discriminator to mitigate the exploding gradient problem and the mode collapse issue. SAGAN [52] uses a self-attention layer to capture fine details from distant parts of an image, and combines this with spectral normalization on both the generator and the discriminator. BigGAN [4] is designed for class-conditional image generation; its focus is to increase the number of model parameters and the batch size, and then to configure the model and training process accordingly, utilizing the techniques introduced by SNGAN [31] and SAGAN [52]. StyleGAN [22] trains generator models capable of synthesizing very large, high-quality images via the incremental expansion of both the discriminator and generator models from small to large images during the training process; in addition, it changes the architecture of the generator significantly.

We propose a simple and effective GAN architecture and training strategy with the goal of adversarially distilling the explicit information about the data distribution provided by the autoregressive model, in addition to mimicking the real data. This leads to generating samples with a distribution very close to the actual data distribution and helps to avoid possible mode collapse. To resolve the issue of sharp, good-looking samples but poor likelihood estimation in the case of adversarial learning (and vice versa in the case of maximum likelihood estimation), our proposed hybrid model bridges implicit and explicit learning models by augmenting adversarial learning with an additional autoregressive model. Our approach combines implicit and explicit density function estimation into a unified objective function.
In our model, the HGAN generator is guided by exploiting the explicit data probability density from the knowledge provided by the autoregressive model, while it is also responsible for learning the data distribution via adversarial learning. The HGAN model exploits the complementary statistical properties of the data obtained from an autoregressive model by utilizing a GAN to effectively diversify the estimated density function, capturing different modes of the data distribution and avoiding possible mode collapse.

In short, our main contributions are: (i) a novel adversarial model to train a generator in a GAN framework in order to stabilize the training process; (ii) a model that is able to estimate the data density by mimicking an autoregressive model while simultaneously combining it with the adversarial learning process; and (iii) a comprehensive performance evaluation of our proposed method on real-world large-scale datasets of diverse natural scenes, as well as on mitigating adversarial examples in a defense scenario.
2. Background
GAN is a min-max game between a generator $G$ and a discriminator $D$, both parameterized via neural networks [13]. Training a GAN can be formulated as the following objective function:

$$\min_G \max_D \; \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))], \qquad (1)$$

where $x$ is from the real data distribution $P_{data}$ and $z$ is a sample from a prior distribution $P_z$. The generator is a mapping function from $z$ which approximates $P_{model}$. GAN alternately optimizes $D$ and $G$ in a minimax game using a stochastic gradient-based algorithm. The generator is prone to mapping every $z$ to the single $x$ that is most likely to be recognized as true data, which leads to mode collapse. Another issue with GANs is that, at the optimal point of $D$, optimizing the generator is equal to minimizing the JSD between the true data distribution and the model distribution, which has been empirically shown to cause mode collapse by generating a few modes and ignoring the others [21].

Autoregressive models can be designed using recurrent networks (PixelRNNs) or CNNs (PixelCNNs) [47]. These models learn the joint distribution of the pixels of an image $x$ as a product of conditional distributions $p(x_i \mid x_1, \ldots, x_{i-1})$, where $x_i$ is a single pixel:

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}). \qquad (2)$$

The ordering of the pixel dependencies is row by row and, within each row, pixel by pixel. Therefore, every pixel $x_i$ depends on all the pixels above and to the left of it ($x_1, \ldots, x_{i-1}$).

Knowledge distillation is mostly used in image classification problems, where the output of a neural network is a probability distribution over categories. The probability is calculated by applying a softmax function over the logits, which are the output of the last fully connected layer. Hinton et al. [19] used logits to transfer the information embedded in a teacher network to a student network. In order to train a student network $F$ to generate student logits $F(x_i)$, a parameter called the temperature $T$ is introduced. The generalized softmax layer $M^T$ then converts a logits vector $t_i = (t_i^1, \ldots, t_i^C)$ to a probability distribution $q_i = M^T(t_i)$, where

$$q_i^j = \frac{\exp(t_i^j / T)}{\sum_k \exp(t_i^k / T)}, \qquad (3)$$

and a higher temperature $T$ produces softer probabilities over the categories. Hinton et al. [19] proposed to minimize the KL divergence between the teacher and student outputs as follows:

$$L_{KD}(F, T) = \frac{1}{N} \sum_{i=1}^{N} KL\big(M^T(t_i) \,\|\, M^T(F(x_i))\big). \qquad (4)$$

In [50], instead of forcing the student to exactly mimic the teacher by minimizing the KL divergence in Equation (4), the knowledge is transferred from the teacher to the student via a discriminator in a GAN-based approach.
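For reference, the temperature-scaled distillation loss of Equations (3) and (4) takes only a few lines of PyTorch. This is a minimal sketch; the function name and the temperature value are our choices:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Eqs. (3)-(4): KL between temperature-softened teacher and student
    distributions, averaged over the batch. Larger T gives softer
    probabilities over the categories."""
    q_teacher = F.softmax(teacher_logits / T, dim=1)          # M^T(t_i)
    log_q_student = F.log_softmax(student_logits / T, dim=1)  # log M^T(F(x_i))
    # KL(teacher || student); "batchmean" matches the 1/N average in Eq. (4)
    return F.kl_div(log_q_student, q_teacher, reduction="batchmean")

# Example with random logits for a batch of 8 samples and 10 classes:
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10))
```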
3. Proposed Hybrid GAN
We now present our novel hybrid approach to tackle the problem of mode collapse in GANs. In general, GANs can generate good-looking samples but have intractable likelihoods. On the other hand, autoregressive models are likelihood-based generative models which can return explicit probability densities. The idea is to utilize a mixture of these two models rather than a single model as in a typical GAN.

In our proposed hybrid model, the generator's first task is to learn the data distribution without any explicit model, just like a regular GAN, such that $G(z) \simeq x \sim P_{data}$. Its second task is to perform sampling, where it samples a random vector $z \sim P_z$ and maps it to an autoregressive model $P_\xi$ such that $G(z) \simeq x_\xi \sim P_\xi$. This forces our hybrid model to learn the probability density of the autoregressive model using the adversarial training method. These two tasks together provide a hybrid model which gives more attention to the likelihood of the data when estimating $P_{model}$ in the data space.

A natural question to ask is why one should use adversarial learning when the autoregressive model can return a tractable likelihood. The reason is that synthesis from these autoregressive models is difficult to parallelize and usually inefficient on parallel hardware [24]. Moreover, it is not practical to perform accurate data manipulation, since the hidden layers of autoregressive models have unknown marginal distributions [15]. GAN models, in contrast, are fast at synthesis and can also have a useful latent space for downstream tasks, especially variants that include an encoder, such as AGE or ALI [10, 46]. Fig. 1 illustrates the architecture of our proposed Hybrid GAN (HGAN) model.

In a naive GAN, the odds that the two distributions $p_g$ and $p_{data}$ share support in a high-dimensional space, especially early in training, are very small. If $p_g$ and $p_{data}$ have non-overlapping support, the Jensen-Shannon divergence saturates, as it is locally constant in $\theta$. Also, there might be a large set of near-optimal discriminators whose logistic regression loss is very close to the optimum, but each of these possibly provides very different gradients to the generator. Therefore, training the discriminator might find a different near-optimal solution each time depending on the initialization, even for a fixed $g_\theta$ and $p_{data}$. We instead employ the autoregressive model to augment the gradient information obtained by ordinary back-propagation. In fact, we are interested in manipulating the feature space of the discriminator, using the autoregressive model as a tool to tell us how to perform that manipulation.

Figure 2. Samples generated by the proposed HGAN compared with the samples generated from DCGAN and AutoGAN on CIFAR-10.

Figure 3. Training G(z) to mimic the autoregressive model's output with an adversarial learning process. In this network, which we denote as AutoGAN, the real data is obtained from the autoregressive model's output and the fake data is the generated output from G(z).

Algorithm 1: HGAN training procedure using stochastic gradient descent

    Input: minibatch of images x, number of training batch steps S
    θ_ξ, θ_G, φ_D ← initialize network parameters
    for n = 1 to S do
        x_ξ ← p_ξ(x)                 {forward through the autoregressive model}
        z ∼ N(0, 1)^Z                {draw a sample of random noise}
        x̂ ← G(z)                     {forward through the generator}
        s_r1 ← D(x_ξ)                {first real input: autoregressive model output}
        s_r2 ← D(x)                  {second real input: real data}
        s_f1 ← D(x̂)                  {fake input mimicking x_ξ}
        s_f2 ← D(x̂)                  {fake input mimicking x}
        L_D ← log(s_r1) + log(s_r2) + log(1 − s_f1) + log(1 − s_f2)
        φ_D ← φ_D − ∂L_D/∂φ_D        {update the discriminator}
        L_G ← log(s_f1) + log(s_f2)
        θ_G ← θ_G − ∂L_G/∂θ_G        {update the generator}
        L_{p_ξ} ← |x − p_ξ(x)|
        θ_ξ ← θ_ξ − ∂L_{p_ξ}/∂θ_ξ    {update the autoregressive model}
    end

In our Hybrid GAN, the discriminator observes two types of real inputs: the real data $x$ and the output of the autoregressive model $x_\xi$. The fake input, $G(z)$, mimics the output of the autoregressive model $x_\xi \sim P_\xi$ in addition to the real data $x \sim P_{data}$. We consider two terms for the discriminator $D$, namely $D_1$ and $D_2$. $D_1$ is related to the first task, where $G(z)$ is fake and $x_\xi$ is real. $D_2$ is related to the second task, where $G(z)$ is fake and $x$ is real. However, all the parameters are shared between the discriminators $D_1$ and $D_2$; in fact, there is only one discriminator $D$.

Algorithm 1 summarizes the training procedure. After getting the input image, the output of the autoregressive model, and the noise, the proposed model generates a fake image (line 5). $s_{r1}$ and $s_{r2}$ indicate the scores for the first ($x_\xi$) and second ($x$) real inputs. $s_{f1}$ and $s_{f2}$ measure the scores of the fake inputs, in which the generator's output $G(z)$ tries to mimic the first and second real inputs, respectively. Note that we use $\partial L_D / \partial \phi_D$ to indicate the gradient of $D$'s objective function with respect to its parameters, and likewise for $G$ and $p_\xi$.

In the first fake input of the discriminator, the generator attempts to generate data that is as close as possible to the autoregressive model's output.
Therefore, the generator's task is to make $G(z) \simeq x_\xi \sim P_\xi$. However, for the second fake input, the generator tries to fool the discriminator so that its generated data is as close as possible to the real data. Thus, it is responsible for making $G(z) \simeq x \sim P_{data}$. While $G$ acts like a typical generator in a regular GAN, our hybrid method tries to maximize the likelihood of a mixture model by adversarially distilling the properties of the autoregressive model.

Table 1. Experiment on the MNIST dataset containing 10 different modes.

    GAN Variants     Chi-square     KL Div
    WGAN             1.32           0.614
    MIX+WGAN         1.25           0.567
    DFM              1.46           0.623
    Improved-GAN     1.13           0.436
    ALI              2.34           0.875
    BEGAN            1.06           0.944
    MAD-GAN          0.24           0.145
    GMAN             1.86           1.345
    DCGAN            0.90           0.322
    MGAN             0.32           0.211
    SAGAN            0.29           0.148
    SNGAN            0.25           0.146
    HGAN             0.23           0.141
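To make the training procedure concrete, the following is a minimal PyTorch sketch of one iteration of Algorithm 1. The module and optimizer names are ours, the discriminator is assumed to output probabilities in (0, 1), and the updates are written as minimized losses; since D1 and D2 share all parameters, the two fake scores coincide and the corresponding terms simply appear twice:

```python
import torch

def hgan_step(G, D, auto, opt_G, opt_D, opt_A, x, z_dim=100):
    """One HGAN iteration (cf. Algorithm 1); names are our choices."""
    x_xi = auto(x)                                    # forward through the autoregressive model
    z = torch.randn(x.size(0), z_dim, device=x.device)
    x_hat = G(z)                                      # forward through the generator

    # Discriminator: push both real streams (x and x_xi) up, the fake stream down.
    s_r1, s_r2 = D(x_xi.detach()), D(x)
    s_f = D(x_hat.detach())
    loss_D = -(torch.log(s_r1) + torch.log(s_r2)
               + 2 * torch.log(1 - s_f)).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: non-saturating form of L_G = log(s_f1) + log(s_f2).
    loss_G = -2 * torch.log(D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # Autoregressive model: reconstruction penalty |x - p_xi(x)|, as in Algorithm 1.
    loss_A = (x - auto(x)).abs().mean()
    opt_A.zero_grad(); loss_A.backward(); opt_A.step()
    return loss_D.item(), loss_G.item(), loss_A.item()
```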
4. Experiments
We show the effectiveness of our proposed approach in different experiments on real-world datasets. For a fair evaluation, we use experimental settings identical to prior works [33, 12, 20]. Therefore, we use the results from the latest state-of-the-art GAN-based models to compare with ours.

We used PyTorch [38] to implement our framework. The generator and discriminator architectures are adopted from DCGAN [39]. In addition, the PixelCNN++ [43] architecture is chosen for the autoregressive model. For training, we used the Adam optimizer [23] with a first-order momentum of 0.5, a learning rate of 0.0002, and a batch size of 64. The ReLU activation [32] is used for the generator, and the Leaky ReLU activation with a slope of 0.2 for the discriminator. Weights are initialized from a zero-mean isotropic Gaussian, and biases are set to zero.

To show the effectiveness of the proposed framework, we perform two types of experiments on the MNIST dataset and compare our method to other well-known GANs, namely WGAN [1], MIX+WGAN [42], DFM [49], Improved-GAN [42], ALI [10], BEGAN [3], MAD-GAN [12], GMAN [11], DCGAN [39], MGAN [20], SNGAN [31], and SAGAN [52]. It should be noted that our method cannot be compared directly with BigGAN [4] and StyleGAN [22], since these models rely on larger architectures and different settings (i.e., BigGAN uses a class-conditional setting, and the purpose of StyleGAN is to provide more control over the latent space for high-resolution image generation).
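A minimal sketch of this training setup in PyTorch is shown below. The tiny Sequential modules are stand-ins for the full DCGAN architectures, and the 0.02 initialization standard deviation is an assumption (the exact value is garbled in the source), following the common DCGAN convention:

```python
import torch
from torch import nn

# Stand-in modules; the paper adopts the full DCGAN architectures [39].
G = nn.Sequential(nn.ConvTranspose2d(100, 64, 4), nn.ReLU(),
                  nn.ConvTranspose2d(64, 3, 4), nn.Tanh())
D = nn.Sequential(nn.Conv2d(3, 64, 4), nn.LeakyReLU(0.2),   # slope 0.2, as in the text
                  nn.Conv2d(64, 1, 4), nn.Sigmoid())

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        # Zero-mean isotropic Gaussian; std=0.02 is our assumption.
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)                           # zero biases

G.apply(init_weights); D.apply(init_weights)

# Adam with first-order momentum 0.5 and learning rate 0.0002 (batch size 64).
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
```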
Table 2. Results for the Inception scores on the CIFAR-10 dataset.

    Objective    Inception Score
    DCGAN        6.40
    AutoGAN      6.17
    HGAN         7.46
Table 3. Results for the test MODE scores on the MNIST dataset.
    Objective    MODE Score
    DCGAN        9.28
    AutoGAN      9.32
    HGAN         9.51

Following [12], we reuse the KL-divergence [27] and the number of captured modes [6] as the comparison criteria to illustrate the advantage of our method over the others. Moreover, we perform quantitative experiments on more complicated real-world datasets, namely the CIFAR-10 [25] and STL-10 [8] datasets.
The data distribution of the MNIST dataset can be approximated with ten dominant modes. Here, following [6], we define the term 'mode' as a connected component of the data manifold. For evaluation, we train a four-layer CNN classifier on the MNIST digits and then apply it to compute the mode scores of the samples generated by the proposed method. We repeat the procedure and apply the trained classifier to compute the mode scores of the different baseline GAN methods. We also obtain the ground truth by measuring the performance of the classifier on the MNIST test set. The number of generated samples from each method is equal to the size of the test set, which is 10,000. Afterwards, we use the Chi-square distance and the KL-divergence to compute the distance between the two histograms (ground truth vs. each GAN model). Table 1 shows the performance of our proposed HGAN compared to the other methods. From Table 1, it is evident that our proposed method outperforms the other methods in capturing all the modes of the MNIST dataset.
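This histogram comparison takes only a few lines of NumPy. The sketch below uses random stand-in labels for the classifier predictions; since the exact chi-square variant and KL direction are not stated in the text, one standard choice is shown:

```python
import numpy as np

# Stand-in digit predictions for 10,000 generated and 10,000 test images,
# as would be produced by the pre-trained four-layer CNN classifier.
preds_gen = np.random.randint(0, 10, size=10_000)
preds_test = np.random.randint(0, 10, size=10_000)

def histogram(preds, n_modes=10):
    # Fraction of samples assigned to each of the 10 modes (digits).
    return np.bincount(preds, minlength=n_modes) / len(preds)

def chi_square(p, q, eps=1e-12):
    return np.sum((q - p) ** 2 / (p + eps))

def kl_div(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

h_true, h_gen = histogram(preds_test), histogram(preds_gen)
print(chi_square(h_true, h_gen), kl_div(h_true, h_gen))
```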
In this experiment, the goal is to explore the performance of our proposed HGAN in a more challenging scenario. To illustrate and compare HGAN with the other baselines, we utilized a setup similar to [12]. The authors in [30] created the Stacked MNIST with 25,600 samples, where each sample has three channels stacked together, with a random digit from MNIST in each of them. Therefore, the Stacked MNIST contains 1,000 distinct modes in the data distribution. In [7], a similar process is applied to the MNIST dataset: the authors created the Compositional MNIST by taking three random MNIST digits and placing them at three quadrants of a larger image, which also results in a data distribution with 1,000 modes.

Figure 4. Images generated by our proposed HGAN trained on natural image datasets.

Table 4. Stacked-MNIST experiment. There are 1,000 modes in the dataset.
    GAN Variants     KL Div     #Modes
    SAGAN            0.97       886
    SNGAN            0.91       889
    HGAN             0.88       –
The distribution of the generated samples was estimated with a pre-trained MNIST classifier, which classifies the digits in each channel or quadrant and consequently decides which of the 1,000 modes is generated by the particular GAN method's generator.

Tables 4 and 5 show the performance of the proposed method as well as the other GAN methods in terms of the KL divergence and the number of modes recovered for the Stacked and Compositional MNIST datasets. As shown in Table 4, our method outperformed all the other GAN methods in terms of the KL divergence; MGAN surpasses ours only in the number of captured modes. It is evident from Table 5 that our proposed HGAN outperforms all the other baselines in terms of the KL divergence and is the closest to the true data distribution. Also, in terms of the number of captured modes, our method as well as MGAN, MAD-GAN, WGAN, SNGAN, and MIX+WGAN capture all the 1,000 modes in the Compositional MNIST experiment.

Table 5. Compositional-MNIST experiment. There are 1,000 modes in the dataset.
    GAN Variants     KL Div     #Modes
    HGAN             0.078      1000
Table 6. Inception scores on the CIFAR-10 and STL-10 datasets.
    Model         CIFAR-10     STL-10
    Real data     11.24        –
    MIX+WGAN      4.04         –
    DFM           7.72         –
    ALI           5.34         –
    BEGAN         5.62         –
    MAD-GAN       7.34         –
    GMAN          6.00         –
    DCGAN         6.40         –
    SAGAN         7.51         –
    HGAN          –            –

In this section, the proposed HGAN framework is applied to more complicated real-world datasets to evaluate its effectiveness on more challenging large-scale image data.

Table 7. FIDs on the CIFAR-10 and STL-10 datasets (lower is better).
    Model       DCGAN   DCGAN+TTUR [18]   WGAN-GP [16]   GAN-GP   MGAN   SAGAN   SNGAN   HGAN
    CIFAR-10    37.7    36.9              40.2           37.7     26.7   –       –       26.3
We use two widely-adopted datasets, namely CIFAR-10 [25] and STL-10 [9]. The CIFAR-10 dataset contains 50,000 training images at a resolution of 32 × 32 for 10 different classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The STL-10 dataset is subsampled from ImageNet [41] and is a more diverse database than CIFAR-10; it is composed of 100,000 images at a resolution of 96 × 96. For a fair comparison with the baselines in [49], we follow the same procedure as in [26] to resize the STL-10 images down to 48 × 48.

For quantitative evaluation, we consider the Inception score, which was introduced in [42]. This metric computes $\exp\big(\mathbb{E}_x[D_{KL}(p(y|x) \,\|\, p(y))]\big)$, where $p(y|x)$ is the conditional label distribution for image $x$ estimated by the reference Inception model. The metric rewards good and varied samples and is found to be well-correlated with human judgment. The code provided in [42] is used to compute the Inception score for 10 partitions of 50,000 generated samples. For qualitative evaluation of the quality of the images generated by our proposed HGAN framework, we show samples generated by HGAN that are drawn randomly rather than cherry-picked.

Table 6 shows the Inception scores obtained by our proposed HGAN method as well as the baselines. For a fair comparison, only models trained completely in an unsupervised manner, without label information, are included in Table 6. Also, the reported results on STL-10 for DCGAN and D2GAN are based on models trained at 32 × 32 resolution. Table 6 shows the superiority of our proposed HGAN compared to the other methods in the literature for both the STL-10 and CIFAR-10 datasets.

For the qualitative assessment, we present samples randomly selected from the images generated by the proposed HGAN. It can be seen from Fig. 4 that the images generated by HGAN are visually recognizable images of cars, ships, trucks, birds, airplanes, dogs, and horses in the CIFAR-10 database. Moreover, in the case of the STL-10 dataset, HGAN is able to produce images of cars, trucks, ships, airplanes, and different kinds of animals, including horses, cats, monkeys, deer, and dogs, with a wider range of backgrounds such as sky, cloudy sky, sea, and forest. These visually appealing images confirm the diversity of the samples generated by HGAN.
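For reference, the split-and-average Inception score computation described above can be sketched as follows; the Dirichlet draws are a toy stand-in for the $p(y|x)$ rows that the reference Inception network would produce:

```python
import numpy as np

def inception_score(probs, n_splits=10, eps=1e-12):
    """exp(E_x[KL(p(y|x) || p(y))]), computed per split and averaged [42]."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal p(y) within the split
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))

# Toy stand-in for p(y|x): 5,000 samples over 100 classes.
probs = np.random.dirichlet(np.ones(100), size=5000)
print(inception_score(probs))
```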
The main disadvantage of the Inception score is that it does not compare the statistics of the synthetic samples with those of real-world ones. Therefore, we also evaluate HGAN using the Fréchet Inception Distance (FID) proposed in [18]. Table 7 compares the FIDs obtained by HGAN with the baselines collected in [20, 31]. It should be noted that some methods in the literature use the ResNet [17] architecture; here, for a fair comparison, we show the results of the different methods when using the DCGAN architecture.
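FID fits a Gaussian to the Inception features of the real and generated samples and measures the Fréchet distance between the two fits [18]. A compact sketch with random stand-in features:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2}) between
    Gaussian fits to the two feature sets [18]."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_f).real  # matrix sqrt; drop tiny imaginary parts
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(c_r + c_f - 2 * covmean))

# Random stand-in features; real code would use Inception pool3 activations.
print(fid(np.random.randn(2048, 64), np.random.randn(2048, 64)))
```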
In the previous sections, we examined the mode coverage of the proposed framework compared to the other baselines in three separate experiments. To further demonstrate the effectiveness of our HGAN framework, we perform another experiment on two different datasets, namely the MNIST and CIFAR-10 datasets. In this setup, we consider $G(z)$ under two completely separate training approaches. In the first approach, $G(z)$ is trained as a regular GAN such as DCGAN, and in the second approach $G(z)$ is trained to mimic the autoregressive model's output with adversarial training. We denote the first approach as DCGAN and the second approach as AutoGAN; Fig. 3 depicts the framework of AutoGAN. We compare the performance of these two networks with the proposed HGAN in terms of sample quality.

Tables 3 and 2 show the highest Inception/MODE scores [42] of DCGAN, AutoGAN, and HGAN monitored during the training phase. The samples generated by each of these methods on the CIFAR-10 dataset are also shown in Fig. 2. As illustrated in Tables 3 and 2, HGAN outperforms both DCGAN and AutoGAN in terms of sample quality. One possible reason is that in HGAN, adding the adversarial distillation of the data information from the autoregressive model (PixelCNN++) to the $G(z)$ objective function can stabilize its optimization, thus avoiding the mode collapse issue. The hybrid nature of the proposed method thus leads to better performance on both datasets.

Despite a very rich body of research leading to very interesting GAN algorithms, it is still challenging to assess which algorithm performs better than the others. In this experiment, we evaluate the effectiveness of HGAN compared to WGAN in a defense scenario; we believe this can serve as another way of assessing GAN frameworks. Adversarial examples [40] are neural network inputs designed to force misclassification. These inputs often appear normal to humans while causing the neural network to make inaccurate predictions. Various defenses have been proposed to mitigate the effect of adversarial attacks [44, 37, 29]. In this experiment, we use our proposed HGAN as a defense mechanism against three different white-box attacks: the Fast Gradient-Sign Method (FGSM) [14], the Carlini-Wagner (CW) attack (with the ℓ2 norm) [5], and Projected Gradient Descent (PGD) [28]. For a fair comparison, we adopt the same set of experiments as Defense-GAN [44]. Instead of using WGAN, we use our proposed HGAN in the Defense-GAN framework, which we denote as Defense-HGAN. We also compare Defense-HGAN with Defense-GAN in the case of no attack. Tables 8 and 9 show the classification performance of our method compared to Defense-GAN on the MNIST and CIFAR-10 datasets, respectively.

Figure 5. Classification accuracy of Defense-GAN and Defense-HGAN on the MNIST and CIFAR-10 datasets in the case of no attack and under an FGSM white-box attack. (a) MNIST classification accuracy varying L (with R = 10). (b) CIFAR-10 classification accuracy varying L (with R = 10). (c) MNIST classification accuracy varying R (with L = 100). (d) CIFAR-10 classification accuracy varying R (with L = 100).

Table 8. Classification accuracies of using the Defense-GAN and Defense-HGAN strategies on the MNIST dataset with L = 200 and R = 10.

    Attack          No Attack (Defense-GAN)   Defense-GAN   No Attack (Defense-HGAN)   Defense-HGAN
    FGSM            0.989                     0.961         0.991                      0.974
    PGD             0.989                     0.956         0.991                      0.969
    CW (ℓ2 norm)    0.989                     0.945         0.991                      0.965

Table 9. Classification accuracies of using the Defense-GAN and Defense-HGAN strategies on the CIFAR-10 dataset with L = 200 and R = 10.

    Attack          No Attack (Defense-GAN)   Defense-GAN   No Attack (Defense-HGAN)   Defense-HGAN
    FGSM            0.763                     0.684         0.794                      0.741
    PGD             0.763                     0.671         0.794                      0.738
    CW (ℓ2 norm)    0.763                     0.646         0.794                      0.731
It should be noted that the classification accuracies on the MNIST and CIFAR-10 datasets are 0.994 and 0.886, respectively. Defense-HGAN outperforms Defense-GAN, which shows the superiority of our HGAN compared to WGAN within the Defense-GAN framework. We also compare the effect of different numbers of iterations L and random restarts R for Defense-GAN and Defense-HGAN on the MNIST and CIFAR-10 datasets. Both methods need to search for an appropriate point in the latent space that generates an image close to the input image. As shown in Fig. 5, the classification performance of Defense-HGAN is better than that of Defense-GAN, which means that HGAN does a better job of capturing the data distribution than WGAN on the MNIST and CIFAR-10 datasets.
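The latent-space search that both defenses rely on can be sketched as follows. L and R follow the text, while the optimizer and step size are our choices (Defense-GAN itself uses gradient descent on z [44]):

```python
import torch

def defense_projection(G, x, L=200, R=10, z_dim=100, lr=0.05):
    """Find G(z*) close to the (possibly attacked) input x by running R
    random restarts of L gradient steps on z, keeping the best restart.
    The reconstruction is then classified instead of x."""
    best, best_err = None, float("inf")
    for _ in range(R):
        z = torch.randn(x.size(0), z_dim, device=x.device, requires_grad=True)
        opt = torch.optim.SGD([z], lr=lr)
        for _ in range(L):
            opt.zero_grad()
            err = ((G(z) - x) ** 2).flatten(1).sum(dim=1).mean()
            err.backward()
            opt.step()
        with torch.no_grad():  # final reconstruction error of this restart
            err = ((G(z) - x) ** 2).flatten(1).sum(dim=1).mean()
        if err.item() < best_err:
            best, best_err = G(z).detach(), err.item()
    return best
```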
5. Conclusion
We have proposed a novel approach to address the mode collapse issue in GANs. Our idea is to design a hybrid model that tries to learn the distribution of the data via a mixture of density-estimating models, utilizing an autoregressive model and adversarial learning. For this purpose, we introduce a minimax game between a generator, an autoregressive model, and a discriminator to optimize the problem of minimizing the JSD between $P_{data}$ and $P_{model}$. In our proposed HGAN, the generator is responsible for learning the autoregressive model's output in addition to modeling the real data, just like a regular GAN. Distillation of the autoregressive model is beneficial for the HGAN since it also models the distribution of the same data, but in an explicit way. It makes the generator give more attention to the likelihood of the data and stabilizes its optimization. This helps the proposed model capture more data modes, which leads to generating a more diversified set of images. Comprehensive studies on MNIST as well as on more challenging real-world datasets show the effectiveness of our HGAN in covering data modes and avoiding mode collapse, as well as in generating diverse and visually appealing images.

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] Duhyeon Bang and Hyunjung Shim. MGGAN: Solving mode collapse using manifold guided training. arXiv preprint arXiv:1804.04391, 2018.
[3] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[6] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
[7] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
[8] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[9] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
[10] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[11] Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
[12] Arnab Ghosh, Viveka Kulharia, Vinay Namboodiri, Philip H. S. Torr, and Puneet K. Dokania. Multi-agent diverse generative adversarial networks. CoRR, abs/1704.02906, 2017.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[14] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[15] Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. arXiv preprint arXiv:1705.08868, 2017.
[16] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[20] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training generative adversarial nets with multiple generators. 2018.
[21] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
[22] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[27] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[28] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[29] Dongyu Meng and Hao Chen. MagNet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 135–147. ACM, 2017.
[30] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[31] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[32] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[33] Tu Nguyen, Trung Le, Hung Vu, and Dinh Phung. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2670–2680, 2017.
[34] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[35] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[36] Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
[37] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[39] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[40] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[43] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[44] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
[45] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
[46] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Adversarial generator-encoder networks. CoRR, abs/1704.02304, 2017.
[47] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[48] Ruohan Wang, Antoine Cully, Hyung Jin Chang, and Yiannis Demiris. MAGAN: Margin adaptation for generative adversarial networks. arXiv preprint arXiv:1704.03817, 2017.
[49] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. 2016.
[50] Zheng Xu, Yen-Chang Hsu, and Jiawei Huang. Learning loss for knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513, 2017.
[51] Yasin Yazici, Kim-Hui Yap, and Stefan Winkler. Autoregressive generative adversarial networks. 2018.
[52] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363. PMLR, 2019.