Ensembles of Generative Adversarial Networks
Yaxing Wang, Lichao Zhang, Joost van de Weijer
Computer Vision Center, Barcelona, Spain
{yaxing,lichao,joost}@cvc.uab.es
Abstract
Ensembles are a popular way to improve results of discriminative CNNs. The combination of several networks trained starting from different initializations improves results significantly. In this paper we investigate the usage of ensembles of GANs. The specific nature of GANs opens up several new ways to construct ensembles. The first one is based on the fact that in the minimax game which is played to optimize the GAN objective the generator network keeps on changing even after the network can be considered optimal. As such, ensembles of GANs can be constructed based on the same network initialization but just taking models which have a different number of training iterations. These so-called self-ensembles are much faster to train than traditional ensembles. The second method, called cascade GANs, redirects part of the training data which is badly modeled by the first GAN to another GAN. In experiments on the CIFAR10 dataset we show that ensembles of GANs obtain model probability distributions which better model the data distribution. In addition, we show that these improved results can be obtained at little additional computational cost.
Introduction

Unsupervised learning extracts features from unlabeled data to describe hidden structure, which is arguably more attractive, compelling and challenging than supervised learning. One unsupervised application which has gained momentum in recent years is the task of generating images. The most common image generation models fall into two main approaches. The first one is based on probabilistic generative models, which includes autoencoders [10] and powerful variants [13, 1, 14]. The second class, which is the focus of this paper, is called Generative Adversarial Networks (GANs) [5]. These networks combine a generative network and a discriminative network. The advantage of these networks is that they can be trained with back propagation. In addition, since the discriminator network is a convolutional network, these networks are optimizing an objective which reflects human perception of images (something which is not true when minimizing a Euclidean reconstruction error).

Since their introduction GANs have been applied to a wide range of applications and several improvements have been proposed. Radford et al. [9] propose and evaluate several constraints on the network architecture, thereby significantly improving stability during training. They call this class of GANs Deep Convolutional GANs (DCGAN), and we will use these GANs in our experiments. Denton et al. [3] propose a Laplacian pyramid framework based on a cascade of convolutional networks to synthesize images at multiple resolutions. Further improvements on stability and synthesized image quality have been proposed in [2, 4, 7, 11].

Several works have shown that, for discriminatively trained CNNs, applying an ensemble of networks is a straightforward way to improve results [8, 15, 16]. The ensemble is formed by training several instances of a network from different initializations on the same dataset, and combining them, e.g. by simple probability averaging. Krizhevsky et al. [8] applied seven networks to improve results for image classification on ImageNet.
Figure 1: Two hundred images generated from the same random noise with DCGAN on the CIFAR10 dataset after 72 (left) and 73 (right) epochs of training from the same network initialization. In the minimax game the generator and discriminator keep on changing. The resulting distributions p_g are clearly different (for example, very few saturated greens in the images on the right).

Similarly, Wang and Gupta [15] showed a significant increase in performance using an ensemble of three CNNs for object detection. These works show that ensembles are a relatively easy (be it computationally expensive) way to improve results. To the best of our knowledge the usage of ensembles has not yet been evaluated for GANs. Here we investigate if ensembles of GANs generate model distributions which more closely resemble the data distribution.

We investigate several strategies to train an ensemble of GANs. Similar to [8] we train several GANs from scratch on the data and combine them to generate the data (we will refer to these as standard ensembles). When training GANs the minimax game prevents the networks from converging; instead the networks G and D constantly keep changing. Comparing images generated in successive epochs shows that even after many epochs these networks generate significantly different images. Based on this observation we propose self-ensembles, which are generated by several models G which only differ in the number of training iterations but originate from the same network initialization. This has the advantage that they can be trained much faster than standard ensembles.

In a recent study on the difficulty of evaluating generative models, Theis et al. [12] pointed out the danger that GANs could be quite accurately modeling part of the data distribution while completely failing to model other parts of the data. This problem would not easily show up by an inspection of the visual quality of the generated examples. The fact that the score of the discriminative network D for these not-modelled regions is expected to be high (images in these regions would be easy to recognize as coming from the true data because there are no similar images generated by the generative network) is the basis of the third ensemble method we evaluate. This method we call a cascade ensemble of GANs. We redirect part of the data which is badly modelled by the generative network G to a second GAN which can then concentrate on generating images according to this distribution. We evaluate the results of ensembles of GANs on the CIFAR10 dataset, and show that when evaluated for image retrieval, ensembles of GANs have a lower average distance to query images from the test set, indicating that they better model the data distribution.

Generative Adversarial Networks

A GAN is a framework consisting of a deep generative model G and a discriminative model D, both of which play a minimax game. The aim of the generator is to generate a distribution p_g that is similar to the real data distribution p_data such that the discriminative network cannot distinguish between the images from the real distribution and the ones which are generated (the model distribution). Let x be a real image drawn from the real data distribution p_data and z be random noise. The noise variable z is transformed into a sample G(z) by a generator network G which synthesizes samples from the distribution p_g.
The discriminative model D(x) computes the probability that input data x is from p_data rather than from the generated model distribution p_g. Ideally D(x) = 0 if x ∼ p_g and D(x) = 1 if x ∼ p_data. More formally, the generative model and the discriminative model are trained by solving:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim noise}[\log (1 - D(G(z)))] \qquad (1)$$

Figure 2: The proposed cGANs framework consists of multiple GANs. We start with all train data (left side) and train the first GAN until no further improvements are obtained. We then use the gate-function to select part of the train data to be modeled by the second GAN, etc.

In our implementation of GAN we will use the DCGAN [9], which improved the quality of GANs by the usage of strided convolutions and fractional-strided convolutions instead of pooling layers in both generator and discriminator, as well as the ReLU and LeakyReLU activation functions.
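To make the alternating optimization of Eq. 1 concrete, the snippet below is a minimal PyTorch-style sketch of one training step: the discriminator first ascends on log D(x) + log(1 - D(G(z))), then the generator updates on its own term. The generator/discriminator modules, optimizers and noise dimension are placeholder assumptions, not the exact DCGAN configuration used in the paper.

import torch

def gan_step(generator, discriminator, g_opt, d_opt, real_images, z_dim=100):
    """One alternating update of the minimax game in Eq. 1 (sketch)."""
    batch = real_images.size(0)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(batch, z_dim)
    fake_images = generator(z).detach()      # do not backprop into G here
    d_real = discriminator(real_images)      # D(x), assumed to output probabilities in (0, 1)
    d_fake = discriminator(fake_images)      # D(G(z))
    d_loss = -(torch.log(d_real + 1e-8) + torch.log(1 - d_fake + 1e-8)).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: in practice maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))
    z = torch.randn(batch, z_dim)
    g_loss = -torch.log(discriminator(generator(z)) + 1e-8).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    return d_loss.item(), g_loss.item()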
We shortly describe two observations which are particular to GANs and which are the motivation for the ensemble models we discuss in the next section.

Observation 1: In Fig. 1 we show the images which are generated by a DCGAN in two successive epochs. It can be observed that the generated images change significantly in overall appearance from one epoch to the other (from visual inspection the quality of the images does not increase after epoch 30, but the overall appearance still varies considerably). The change is caused by the fact that in the minimax game the generator and the discriminator constantly vary and do not converge in the sense that discriminatively trained networks do. Rather than a single generator and discriminator, one could consider the GAN training process to generate a set of generative-discriminative network pairs. Given this observation it seems sub-optimal to choose a single generator network from this set to generate images from, and ways to combine them should be explored.
Observation 2: A drawback of GANs, as pointed out by Theis et al. [12], is that they potentially do not describe the whole data distribution p_data. The reasoning is based on the observation that the objective function of GANs has some resemblance to the Jensen-Shannon divergence (JSD). They show that for a very simple bi-modal distribution, minimizing the JSD yields a good fit to the principal mode but ignores other parts of the data. For the application of generating images with GANs this would mean that for part of the data distribution the model does not generate any resembling images.

Ensembles of GANs

As explained in the introduction we investigate the usage of ensembles of GANs and evaluate their performance gain. Based on the observations above we propose three different approaches to construct an ensemble of GANs. The aim of the proposed schemes is to obtain a better estimation of the real data distribution p_data.
Standard Ensemble of GANs (eGANs): We first consider a straightforward extension of the usage of ensembles to GANs. This is similar to ensembles used for discriminative CNNs, which have been shown to result in significant performance gains [8, 15, 16]. Instead of training a single GAN model on the data, one trains a set of GAN models from scratch from random initializations of the parameters. When generating data one randomly chooses one of the GAN models and then generates the data according to that model.
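A minimal sketch of this sampling scheme, assuming each ensemble member has already been trained independently from its own random initialization (the generators list and noise dimension are illustrative placeholders):

import random
import torch

def sample_egan(generators, num_images, z_dim=100):
    """Generate images from a standard ensemble of GANs (eGANs)."""
    images = []
    for _ in range(num_images):
        g = random.choice(generators)   # pick one trained generator uniformly at random
        z = torch.randn(1, z_dim)       # draw the noise vector for this sample
        images.append(g(z))
    return torch.cat(images, dim=0)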
Self-ensemble of GANs (seGANs): Unlike discriminative networks, which minimize an objective function, in a GAN the min/max game results in a continual shifting of the generative and discriminative networks (see also Observation 1 above). An seGAN exploits this fact by combining models which are based on the same initialization of the parameters but only differ in the number of training iterations. This has the advantage over eGANs that it is not necessary to train each GAN in the ensemble from scratch. As a consequence it is much faster to train seGANs than eGANs.

Figure 3: Retrieval example. The leftmost column, annotated by a red rectangle, includes five query images from the test set. To the right the five nearest neighbors in the training set are given.
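One way to realize such a self-ensemble is to keep training a single GAN and store generator snapshots at chosen epochs, which then act as the ensemble members. A sketch under that assumption (the run_epoch callable and the snapshot epochs are placeholders; in the experiments below models are picked between 30 and 40 epochs):

import copy

def collect_self_ensemble(generator, run_epoch, snapshot_epochs):
    """Collect generator snapshots of a single GAN run as seGAN members.

    run_epoch() is assumed to perform one epoch of the usual adversarial
    updates on `generator` (and its discriminator) in place.
    """
    members = []
    for epoch in range(1, max(snapshot_epochs) + 1):
        run_epoch()
        if epoch in snapshot_epochs:
            # a deep copy freezes the generator weights at this point in training
            members.append(copy.deepcopy(generator))
    return members

The collected members can then be sampled exactly like the standard ensemble above, e.g. with sample_egan(members, num_images).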
Cascade of GANs (cGANs): The cGANs is designed to address the problem described in Observation 2: part of the data distribution might be ignored by the GAN. The cGANs framework, as illustrated in Figure 2, is designed to train GANs to effectively push the generator to capture the whole distribution of the data instead of focusing on the main mode of the density distribution. It consists of multiple GANs and gates. Each GAN trains a generator to capture the current input data distribution, which was badly modeled by the previous GANs. To select the data which is re-directed to the next GAN we use the fact that for badly modeled data x, the discriminator value D(x) is expected to be high. When D(x) is high this means that the discriminator is confident this is real data, which most probably is caused by the fact that there are few generated examples G(z) nearby. We use the gate-function Q to re-direct the data to the next GAN according to:

$$Q(x) = \begin{cases} 1 & \text{if } D(x) > t_r \\ 0 & \text{else} \end{cases} \qquad (2)$$

where Q(x) = 1 means that x will be used to train the next adversarial network. In practice we will train a GAN until satisfactory results are obtained, then evaluate Eq. 2 and train the following GAN with the selected data, etc. We set a ratio r of images which are re-directed to the next GAN, and select the threshold t_r accordingly. In the experiments we show results for several ratio settings.
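The gate of Eq. 2 can be implemented by scoring the current training set with the trained discriminator and forwarding the fraction r with the highest D(x) to the next GAN, which also fixes the threshold t_r. A sketch under these assumptions (function and variable names are illustrative):

import torch

def cascade_gate(discriminator, images, ratio):
    """Select the fraction `ratio` of images with the highest D(x) (Eq. 2).

    A high D(x) suggests there are few generated samples near x, i.e. x is
    badly modeled by the current GAN, so it is redirected to the next one.
    """
    with torch.no_grad():
        scores = discriminator(images).flatten()              # D(x) for every image
    k = max(1, int(ratio * len(images)))                      # number of images to redirect
    threshold = scores.sort(descending=True).values[k - 1]    # the induced threshold t_r
    redirected = images[scores >= threshold]
    return redirected, threshold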
Evaluation

The evaluation of generative methods is known to be problematic [12]. Since we are evaluating GANs which are based on a similar network architecture (we use the standard settings of DCGAN [9]), the quality of the generated images is similar and therefore uninformative as an evaluation measure. Instead, we are especially interested to measure if the ensembles of GANs better model the data distribution.

To measure this we propose an image retrieval experiment. We represent all images, both from a held-out test set as well as images generated by the GANs, with an image descriptor based on discriminatively trained CNNs. For all images in the test dataset we look at their nearest neighbor in the generated image dataset. Comparing ensemble methods based on these nearest neighbor distances allows us to assess the quality of these methods. We are especially interested if some images in the dataset are badly modeled by the network, which would lead to high nearest neighbor distances. At the end of this section we discuss several evaluation criteria based on these nearest neighbor distances.

For the representation of the images we finetune an AlexNet model (pre-trained on ImageNet) on the CIFAR10 dataset. It has been shown that the layers of AlexNet describe images at varying levels of semantic abstraction [16]; the lower layers of the neural network mainly capture low-level information, such as colors, edges and corners, whereas the upper layers contain more semantic features like heads, wheels, etc. Therefore, we combine the output of the first convolutional layer, the first fully connected layer and the final result after the softmax layer into one image representation. The conv1 layer is grouped into a 3x3 spatial grid, resulting in a 3 × 3 × 96 = 864 dimensional vector.
          p_data  GAN  cGANs  cGANs  cGANs  cGANs  cGANs
0.           0     1     1      1      1      1      1
1.          -1     0    -1     -1     -1     -1     -1
2.          -1     1     0     -1     -1     -1     -1
3.          -1     1     1      0     -1     -1     -1
4.          -1     1     1      1      0      0      1
5.          -1     1     1      1      0      0      1
6.          -1     1     1      1     -1     -1      0

Table 1: Wilcoxon signed-rank test for the cGANs approach; rows 0-6 correspond to the columns in the same order, with the cGANs columns differing in the ratio r which is varied. Best results are obtained with r equal to 0.7 and 0.8.
For the nearest neighbor search we use the Euclidean distance (the average distance between images in the dataset is normalized to be one for each of the three parts conv1, fc7 and prob). Example retrieval results with this system are provided in Fig. 3, where we show the five nearest neighbors for several images. It shows that the image representation captures both color and texture of the image, as well as semantic content. In the experiments we will use the retrieval system to compare the various ensembles of GANs.
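A sketch of this retrieval pipeline: pool conv1 onto a 3x3 grid, concatenate it with the fully connected activations and the softmax output, and compute, for every test image, the Euclidean distance to its closest generated image. The feature shapes and the per-part normalization are assumptions taken from the description above; extracting the activations from the finetuned AlexNet is left out.

import numpy as np

def describe(conv1, fc7, prob):
    """Concatenate the three AlexNet parts into one image descriptor.

    conv1: (96, H, W) feature map (H and W assumed divisible by 3),
           average-pooled onto a 3x3 grid -> 3 * 3 * 96 = 864 dims.
    fc7:   activations of the first fully connected layer.
    prob:  softmax output over the 10 CIFAR10 classes.
    Each part is assumed to be pre-scaled so that the average pairwise
    distance within the dataset equals one.
    """
    c, h, w = conv1.shape
    grid = conv1.reshape(c, 3, h // 3, 3, w // 3).mean(axis=(2, 4))
    return np.concatenate([grid.reshape(-1), fc7, prob])

def nearest_neighbor_distances(test_feats, gen_feats):
    """Euclidean distance from every test image to its closest generated image."""
    # squared-distance expansion avoids an (n_test, n_gen, dim) intermediate array
    sq = (test_feats ** 2).sum(1)[:, None] + (gen_feats ** 2).sum(1)[None, :] \
         - 2.0 * test_feats @ gen_feats.T
    return np.sqrt(np.maximum(sq, 0.0)).min(axis=1)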
Evaluation criteria: To evaluate the quality of the retrieval results, we consider two measures. As mentioned, they are based on evaluating the nearest neighbor distances of the generated images to images in the CIFAR10 test set. Consider $d^k_{i,j}$ to be the distance of the $j$-th nearest image generated by method $k$ to test (query) image $i$, and $d^k_j = \{d^k_{1,j}, \ldots, d^k_{n,j}\}$ the set of $j$-th nearest distances to all $n$ test images. Then the Wilcoxon signed-rank test (which we will only apply for the nearest neighbor, $j = 1$) is used to test the hypothesis that the median of the difference between the nearest-distance distributions of two generators is zero, in which case they are equally good (i.e., the median of the distribution $d^k_1 - d^m_1$ when considering generators $k$ and $m$). If they are not equal the test can be used to assess which method is statistically better. This test is for example popular to compare illuminant estimation methods [6].

For the second evaluation criterion, consider $d^t_j$ to be the distribution of the $j$-th nearest distances of the train images to the test dataset. Since the train and test set are drawn from the same dataset, the distribution $d^t_j$ can be considered the optimal distribution which a generator could attain (considering it generates an equal amount of images as present in the train set). To model the difference with this ideal distribution we consider the relative increase in mean nearest neighbor distance given by:

$$\hat{d}^k_j = \frac{\bar{d}^k_j - \bar{d}^t_j}{\bar{d}^t_j} \qquad (3)$$

where

$$\bar{d}^k_j = \frac{1}{N}\sum_{i=1}^{N} d^k_{i,j}, \qquad \bar{d}^t_j = \frac{1}{N}\sum_{i=1}^{N} d^t_{i,j} \qquad (4)$$

and where $N$ is the size of the test dataset. E.g., $\hat{d}^{GAN} = 0.1$ means that for method GAN the average distance to the nearest neighbor of a query image is 10% higher than for data drawn from the ideal distribution.
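Both criteria reduce to simple operations on the nearest-neighbor distance vectors produced by the retrieval step above; the sketch below uses scipy's signed-rank test, with a conventional 0.05 significance level as an assumption (the paper does not state one).

import numpy as np
from scipy.stats import wilcoxon

def signed_rank_compare(d_k, d_m, alpha=0.05):
    """Wilcoxon signed-rank test on first-nearest-neighbor distances (j = 1).

    Returns 1 if method k is significantly better (smaller distances) than
    method m, -1 if significantly worse, and 0 if no significant difference.
    """
    _, p_value = wilcoxon(d_k - d_m)
    if p_value >= alpha:
        return 0
    return 1 if np.median(d_k - d_m) < 0 else -1

def relative_increase(d_k, d_t):
    """Eq. 3: relative increase of the mean NN distance over the train-set baseline."""
    return (d_k.mean() - d_t.mean()) / d_t.mean()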
          cGANs    eGANs    seGANs
cGANs     0/10/0   1/0/9    0/1/9
eGANs     9/0/1    0/10/0   4/1/5
seGANs    9/1/0    5/1/4    0/10/0
(a)
            GAN      seGANs(2)  seGANs(4)  seGANs(8)
GAN         0/10/0   0/0/10     0/0/10     0/0/10
seGANs(2)   10/0/0   0/10/0     0/0/10     0/0/10
seGANs(4)   10/0/0   10/0/0     0/10/0     0/2/8
seGANs(8)   10/0/0   10/0/0     8/2/0      0/10/0
(b)
Table 2: Wilcoxon signed-rank test evaluation. Results are shown as A/B/C, where A, B and C are the number of times 1, 0 and -1 appear, respectively, over 10 experiments. The more often 1 appears, the better the method. Between brackets we show the number of GAN networks in the ensemble. (a) shows that eGANs and seGANs outperform cGANs; (b) shows that the more models are used in the ensemble, the better the seGANs.

Experiments

To evaluate the different configurations for ensembles of GANs we perform several experiments on the CIFAR10 dataset. This dataset has 10 different classes, 50,000 train images and 10,000 test images of size 32 × 32. In our experiments we compare various generative models. With each of them we generate 10,000 images and perform the evaluations discussed in the previous section.

A cGANs has one parameter, namely the ratio r of images which are diverted to the second GAN, which we evaluate in the first experiment. The results of the signed-rank test for several different settings of r are provided in Table 1. In this table, a zero refers to no statistical difference between the distributions. A one (or minus one) refers to a non-zero median of the difference of the distributions, indicating that the method is better (or worse) than the method to which it is compared. In this comparison we have also included the training dataset and a single GAN. For a fair comparison we only consider 10,000 randomly selected images from the training dataset (similar to the number of images which are generated by the generative models). As expected, the distribution of minimal distances to the test images of the training dataset is superior to that of any of the generative models. We see this as an indication that the retrieval system is a valid method to evaluate generative models. Next, in Table 1, we can see that, independent of r, cGANs always obtains superior results to using a single standard GAN. Finally, the results show that the best results are obtained when diverting images to the second GAN with a ratio of 0.7 or 0.8. In the rest of the experiments we fix r to one of these values.

In the next experiment we compare the different approaches to ensembles of GANs. We start by combining only two GANs into each of the ensembles. We have repeated the experiments 10 times and show the results in Table 2a. We found that the results of GAN did not further improve after 30 epochs of training; we therefore use 30 epochs for all our trained models. For seGANs we randomly pick models between 30 and 40 epochs of training. The cGANs obtains significantly worse results than eGANs and seGANs. Interestingly, the seGANs obtains similar results as eGANs. Whereas an eGANs is obtained by re-training GAN models from scratch, the seGANs is formed by models starting from the same network initialization and is therefore much faster to compute than eGANs.

The results for the ensembles of GANs are also evaluated with the average increase in nearest neighbor distance in Fig. 4 (left). In this plot we consider not only the closest nearest neighbor distance, but also the k-nearest neighbors (horizontal axis). All ensemble methods improve over using just a single GAN. This evaluation measure again shows that seGANs obtains similar results as eGANs.

Finally, we have run seGANs combining more than two GANs; we have considered ensembles of 2, 4, 6, and 8 GANs (see Table 2b). These experiments are also repeated 10 times. The results in Fig. 4 (right) show that the average increase in k-nearest distance decreases when increasing the number of networks in the ensemble, but levels off from 4 to 8. We stress that when combining 2, 4 or 8 GANs, we keep the number of generated images at 10,000 as in all experiments, so in the case of seGANs(8) 1250 images are generated from each GAN.
The average increase in nearest distance drops from 0.11 for a single GAN to 0.06 for an seGANs combining 8 GANs, which is a drop of 40%. In Fig. 5 we show several examples where a single GAN does not generate any similar image whereas the ensemble of GANs does.
Figure 4: (left) $\hat{d}^k_j$ for k ∈ {cGANs, eGANs, seGANs} and increasing j along the horizontal axis. (right) The same, but for seGANs with a varying number of models.

Figure 5: Visualization of samples from a single GAN and seGANs(4). The leftmost column is the query image from the test dataset; the red box shows the 5 nearest examples from the single GAN, and the green box shows the 5 nearest examples from seGANs(4).

Conclusions

We have evaluated several approaches to construct ensembles of GANs. The most promising results were obtained by seGANs, which were shown to obtain similar results as eGANs, but have the advantage that they can be trained with little additional computational cost. Experiments on an image retrieval task show that the average distance to the nearest image in the dataset can drop 40% by using seGANs, which is a significant improvement (arguably more important than the usage of ensembles for discriminative networks). These results should be verified on multiple datasets. For future work we are especially interested in combining the ideas behind the cGANs with seGANs to further improve results. We are also interested in further improving the image retrieval system, which can be used as a tool to evaluate the quality of image generation methods. It would be beneficial for research in GANs if a toolbox of experiments existed which allows us to quantitatively compare generative methods.
Acknowledgments
This work is funded by the Projects TIN2013-41751-P of the Spanish Ministry of Science and the CHISTERA M2CR project PCIN-2015-226 and the CERCA Programme / Generalitat de Catalunya. We also thank Nvidia for the generous GPU donation.

References

[1] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.
[2] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016.
[3] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486-1494, 2015.
[4] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[6] S. D. Hordley and G. D. Finlayson. Reevaluation of color constancy algorithm performance. JOSA A, 23(5):1008-1020, 2006.
[7] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[9] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[10] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
[11] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.
[12] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
[13] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103. ACM, 2008.
[14] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371-3408, 2010.
[15] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794-2802, 2015.
[16] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818-833. Springer, 2014.