Unbiased Auxiliary Classifier GANs with MINE
Ligong Han, Anastasis Stathopoulos, Tao Xue, Dimitris Metaxas
Rutgers University
Abstract
Auxiliary Classifier GANs (AC-GANs) [15] are widely used conditional generative models and are capable of generating high-quality images. Previous work [18] has pointed out that AC-GAN learns a biased distribution. To remedy this, Twin Auxiliary Classifier GAN (TAC-GAN) [5] introduces a twin classifier to the min-max game. However, it has been reported that using a twin auxiliary classifier may cause instability in training. To this end, we propose an Unbiased Auxiliary Classifier GAN (UAC-GAN) that utilizes the Mutual Information Neural Estimator (MINE) [2] to estimate the mutual information between the generated data distribution and labels. To further improve performance, we also propose a novel projection-based statistics network architecture for MINE*. Experimental results on three datasets, Mixture of Gaussian (MoG), MNIST [12], and CIFAR10 [11], show that our UAC-GAN performs better than AC-GAN and TAC-GAN. Code can be found on the project website†.

* This is an extended version of a CVPRW'20 workshop paper with the same title. In the current version, the projection form of MINE is detailed.
† https://github.com/phymhan/ACGAN-PyTorch
1. Introduction
Generative Adversarial Networks (GANs) [6] are generative models that can be used to sample from high-dimensional non-parametric distributions, such as natural images or videos. Conditional GANs [13] are an extension of GANs that utilize label information to enable sampling from the class-conditional data distribution. Class-conditional sampling can be achieved by either (1) conditioning the discriminator directly on labels [13, 9, 14], or (2) incorporating an additional classification loss in the training objective [15]. The latter approach originates in Auxiliary Classifier GAN (AC-GAN) [15].

Despite its simplicity and popularity, AC-GAN is reported to produce less diverse data samples [18, 14]. This phenomenon is formally discussed in Twin Auxiliary Classifier GAN (TAC-GAN) [5]. The authors of TAC-GAN reveal that, due to a missing negative conditional entropy term in the objective of AC-GAN, it does not exactly minimize the divergence between the real and fake conditional distributions. TAC-GAN proposes to estimate this missing term by introducing an additional classifier into the min-max game. However, it has also been reported that using such twin auxiliary classifiers might result in unstable training [10].

In this paper, we propose to incorporate the negative conditional entropy into the min-max game by directly estimating the mutual information between generated data and labels. The resulting method enjoys the same theoretical guarantees as TAC-GAN and avoids the instability caused by using a twin auxiliary classifier. We term the proposed method UAC-GAN because (1) it learns an Unbiased distribution, and (2) MINE [2] relates to Unnormalized bounds [16]. Finally, our method demonstrates superior performance compared to AC-GAN and TAC-GAN on 1-D mixture of Gaussian synthetic data, MNIST [12], and the CIFAR10 [11] dataset.
2. Related Work
Learning unbiased AC-GANs.
In CausalGAN [10], the authors incorporate a binary Anti-Labeler in AC-GAN and theoretically show its necessity for the generator to learn the true class-conditional data distributions. The Anti-Labeler is similar to the twin auxiliary classifier in TAC-GAN, but it is used only for binary classification. Shu et al. [18] formulate the AC-GAN objective as a Lagrangian to a constrained optimization problem and show that AC-GAN tends to push the data points away from the decision boundary of the auxiliary classifier. TAC-GAN [5] builds on the insights of [18] and shows that the bias in AC-GAN is caused by a missing negative conditional entropy term. In addition, [5] proposes to make AC-GAN unbiased by introducing a twin auxiliary classifier that competes in an adversarial game with the generator. TAC-GAN can be considered as a generalization of CausalGAN's Anti-Labeler to the multi-class setting.

Mutual information estimation.
Learning a twin auxiliary classifier is essentially estimating the mutual information between generated data and labels. We refer readers to [16] for a comprehensive review of variational mutual information estimators. In this paper, we employ the Mutual Information Neural Estimator (MINE) [2].
3. Background
First, we review AC-GAN [15] and the analysis in [5, 18] that shows why AC-GAN learns a biased distribution. AC-GAN introduces an auxiliary classifier C and optimizes the following objective

$$\min_{G,C}\max_{D}\ \mathcal{L}_{AC}(G,C,D) = \underbrace{\mathbb{E}_{x\sim P_X}\log D(x) + \mathbb{E}_{z\sim P_Z,\,y\sim P_Y}\log\bigl(1 - D(G(z,y))\bigr)}_{\text{(a)}} \underbrace{-\,\mathbb{E}_{x,y\sim P_{XY}}\log C(x,y)}_{\text{(b)}} \underbrace{-\,\mathbb{E}_{z\sim P_Z,\,y\sim P_Y}\log C(G(z,y),y)}_{\text{(c)}}, \quad (1)$$

where (a) is the value function of a vanilla GAN, and (b), (c) correspond to the cross-entropy classification errors on real and fake data samples, respectively. Let $Q^c_{Y|X}$ denote the conditional distribution induced by C. As pointed out in [5], adding the data-dependent negative conditional entropy $-H_P(Y|X)$ to (b) yields the Kullback-Leibler (KL) divergence between $P_{Y|X}$ and $Q^c_{Y|X}$,

$$-H_P(Y|X) + \text{(b)} = \mathbb{E}_{x\sim P_X} D_{KL}\bigl(P_{Y|X}\,\|\,Q^c_{Y|X}\bigr). \quad (2)$$

Similarly, adding a term $-H_Q(Y|X)$ to (c) yields the KL divergence between $Q_{Y|X}$ and $Q^c_{Y|X}$,

$$-H_Q(Y|X) + \text{(c)} = \mathbb{E}_{x\sim Q_X} D_{KL}\bigl(Q_{Y|X}\,\|\,Q^c_{Y|X}\bigr). \quad (3)$$

As illustrated above, if we were to optimize (2) and (3), the generated data posterior $Q_{Y|X}$ and the real data posterior $P_{Y|X}$ would effectively be chained together by the two KL-divergence terms. However, $H_Q(Y|X)$ cannot be treated as a constant when updating G. Thus, to make the original AC-GAN unbiased, the term $-H_Q(Y|X)$ has to be added to the objective. Without this term, the generator tends to generate data points that lie far from the decision boundary of C, and thus learns a biased (degenerate) distribution. Intuitively, minimizing $-H_Q(Y|X)$ over G forces the generator to produce diverse samples with high (conditional) entropy.

Twin Auxiliary Classifier GAN (TAC-GAN) [5] estimates $H_Q(Y|X)$ by introducing another auxiliary classifier $C^{mi}$. First, notice that the mutual information can be decomposed in two symmetric forms,

$$I_Q(X;Y) = H(Y) - H_Q(Y|X) = H_Q(X) - H_Q(X|Y).$$

Herein, the subscript Q denotes the corresponding distribution induced by G. Since H(Y) is constant, optimizing $-H_Q(Y|X)$ is equivalent to optimizing $I_Q(X;Y)$. TAC-GAN shows that when Y is uniform, the latter form of $I_Q$ can be written as the Jensen-Shannon divergence (JSD) between the conditionals $\{Q_{X|Y=1}, \ldots, Q_{X|Y=K}\}$. Finally, TAC-GAN introduces the following min-max game

$$\min_{G}\max_{C^{mi}}\ V_{TAC}(G, C^{mi}) = \mathbb{E}_{z\sim P_Z,\,y\sim P_Y}\log C^{mi}(G(z,y), y), \quad (4)$$

to minimize the JSD between multiple distributions. The overall objective is

$$\min_{G,C}\max_{D,C^{mi}}\ \mathcal{L}_{TAC}(G, D, C, C^{mi}) = \mathcal{L}_{AC} + \underbrace{V_{TAC}}_{\text{(d)}}. \quad (5)$$
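To make the roles of the individual terms concrete, here is a minimal PyTorch sketch of how terms (a)-(d) of Eqs. (1), (4), and (5) can be assembled on one batch. The network interfaces (D returning probabilities, C and $C^{mi}$ returning logits) and all names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def acgan_tacgan_terms(G, D, C, C_mi, x_real, y_real, z, y_gen):
    """Assemble terms (a)-(d) of Eqs. (1) and (5) on one batch (illustrative sketch)."""
    x_fake = G(z, y_gen)

    # (a): vanilla GAN value function; D is assumed to output probabilities in (0, 1)
    term_a = torch.log(D(x_real) + 1e-8).mean() + torch.log(1 - D(x_fake) + 1e-8).mean()

    # (b), (c): cross-entropy of the auxiliary classifier C (assumed to output logits)
    # on real and fake samples; these are the terms carrying the minus signs in Eq. (1)
    term_b = F.cross_entropy(C(x_real), y_real)
    term_c = F.cross_entropy(C(x_fake), y_gen)

    # V_TAC of Eq. (4) is the negative cross-entropy of the twin classifier on fake data:
    # C_mi maximizes it (classifies fakes correctly), G minimizes it (term (d) of Eq. (5))
    term_d = -F.cross_entropy(C_mi(x_fake), y_gen)

    return term_a, term_b, term_c, term_d
```

In a training loop, D and C would ascend/descend on (a)-(c) as in AC-GAN, while $C^{mi}$ and G play the additional adversarial game on term (d).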
TAC-GAN from a variational perspective.
Training the twin auxiliary classifier minimizes the label reconstruction error on fake data, as in InfoGAN [3]. Thus, when optimizing over G, TAC-GAN minimizes a lower bound of the mutual information. To see this,

$$\begin{aligned}
V_{TAC} &= \mathbb{E}_{x,y\sim Q_{XY}}\log C^{mi}(x,y)\\
&= \mathbb{E}_{x\sim Q_X}\mathbb{E}_{y\sim Q_{Y|X}}\log\frac{Q(y|x)\,Q^{mi}(y|x)}{Q(y|x)}\\
&= \mathbb{E}_{x\sim Q_X}\mathbb{E}_{y\sim Q_{Y|X}}\log Q(y|x) - \mathbb{E}_{x\sim Q_X} D_{KL}\bigl(Q_{Y|X}\,\|\,Q^{mi}_{Y|X}\bigr)\\
&\leq -H_Q(Y|X). \quad (6)
\end{aligned}$$

The above shows that (d) is a lower bound of $-H_Q(Y|X)$. The bound is tight when the classifier $C^{mi}$ learns the true posterior $Q_{Y|X}$ on fake data. However, minimizing a lower bound might be problematic in practice. Indeed, previous literature [10] has reported unstable training behavior when using an adversarial twin auxiliary classifier in AC-GAN.

TAC-GAN as a generalized CausalGAN.
A binary version of the twin auxiliary classifier was introduced as the Anti-Labeler in CausalGAN [10] to tackle the issue of label-conditioned mode collapse. As pointed out in [10], the use of the Anti-Labeler brings practical challenges for gradient-based training. Specifically, (1) in the early stage, the Anti-Labeler quickly minimizes its loss if the generator exhibits label-conditioned mode collapse, and (2) in the later stage, as the generator produces more and more realistic images, the Anti-Labeler behaves more like the Labeler (the other auxiliary classifier). Therefore, maximizing the Anti-Labeler loss and minimizing the Labeler loss become contradicting tasks, which results in unstable training. To account for this, CausalGAN adds an exponentially decaying weight to the Anti-Labeler loss term (or (d) in Eq. (5) when optimizing G). In fact, the following theorem shows that TAC-GAN can still induce a degenerate distribution.
Theorem 1. Given fixed C and $C^{mi}$, the optimal $G^*$ that minimizes (c) + (d) induces a degenerate conditional

$$Q^*_{Y|X} = \mathrm{onehot}\Bigl(\arg\min_k \frac{Q^{mi}(Y=k\,|\,x)}{Q^c(Y=k\,|\,x)}\Bigr),$$

where $Q^{mi}_{Y|X}$ is the distribution specified by $C^{mi}$.

Proof. If G learns the true conditional, and C and $C^{mi}$ are both optimally trained so that $Q^c_{Y|X} = Q^{mi}_{Y|X} = P_{Y|X}$, then (c) + (d) = 0 and the game reaches equilibrium. If $Q^c_{Y|X}$ and $Q^{mi}_{Y|X}$ are not equal (and $Q^c_{Y|X}$ has non-zero entries),

$$\begin{aligned}
\text{(c)} + \text{(d)} &= -\mathbb{E}_{x\sim Q_X}\sum_k Q_{Y|X}(Y=k|x)\log Q^c(Y=k|x)\\
&\quad + \mathbb{E}_{x\sim Q_X}\sum_k Q_{Y|X}(Y=k|x)\log Q^{mi}(Y=k|x)\\
&= \mathbb{E}_{x\sim Q_X}\sum_k Q_{Y|X}(Y=k|x)\log\frac{Q^{mi}(Y=k|x)}{Q^c(Y=k|x)}.
\end{aligned}$$

Then minimizing (c) + (d) is equivalent to minimizing the objective point-wise for each x,

$$\min_{Q_{Y|X=x}} \sum_k Q_{Y|X}(Y=k|x)\, r_x(k),$$

where $r_x$ is the log density ratio between $Q^{mi}$ and $Q^c$. The optimal $Q^*_{Y|X}$ is obtained by noticing that

$$\sum_k Q_{Y|X}(Y=k|x)\, r_x(k) \;\geq\; \sum_k Q_{Y|X}(Y=k|x)\, r_x(k_m) \;=\; r_x(k_m) \;=\; \sum_k Q^*_{Y|X}(Y=k|x)\, r_x(k),$$

with $k_m = \arg\min_k r_x(k)$ and $Q^*_{Y|X} = \mathrm{onehot}(k_m)$.
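As a quick numerical sanity check of the point-wise argument above, the snippet below evaluates $\sum_k Q_{Y|X}(Y=k|x)\, r_x(k)$ for a few candidate conditionals under an arbitrary, made-up log density ratio vector; the one-hot distribution at $\arg\min_k r_x(k)$ attains the smallest value.

```python
import numpy as np

# made-up log density ratio r_x(k) = log(Q_mi(Y=k|x) / Q_c(Y=k|x)) for K = 4 classes
r_x = np.array([0.7, -1.2, 0.3, -0.4])
k_m = np.argmin(r_x)                      # index achieving the minimum ratio

one_hot = np.eye(len(r_x))[k_m]           # degenerate optimum Q*_{Y|X}
uniform = np.full(len(r_x), 1 / len(r_x)) # a non-degenerate alternative
random_q = np.random.default_rng(0).dirichlet(np.ones(len(r_x)))

for name, q in [("one-hot", one_hot), ("uniform", uniform), ("random", random_q)]:
    print(name, float(q @ r_x))           # the one-hot conditional attains r_x(k_m), the minimum
```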
4. Method
To develop a better unbiased AC-GAN while avoiding the potential drawbacks of introducing another auxiliary classifier, we resort to directly estimating the mutual information $I_Q(X;Y)$. In this paper, we employ the Mutual Information Neural Estimator (MINE) [2].

The mutual information $I_Q(X;Y)$ is equal to the KL divergence between the joint $Q_{XY}$ and the product of the marginals $Q_X \otimes Q_Y$ (here we denote $Q_Y = P_Y$ for a consistent and general notation),

$$I_Q(X;Y) = D_{KL}\bigl(Q_{XY}\,\|\,Q_X \otimes Q_Y\bigr). \quad (7)$$

MINE is built on top of the bound of Donsker and Varadhan [4] (for the KL divergence between distributions P and Q),

$$D_{KL}(P\,\|\,Q) = \sup_{T:\Omega\to\mathbb{R}} \mathbb{E}_P[T] - \log\mathbb{E}_Q\bigl[e^T\bigr], \quad (8)$$

where T is a scalar-valued function that takes samples from P or Q as input. Then, by replacing P with $Q_{XY}$ and Q with $Q_X \otimes Q_Y$, we get

$$I^{\mathrm{mine}}_Q = \max_T V_{\mathrm{MINE}}(G, T), \quad \text{where} \quad (9)$$
$$V_{\mathrm{MINE}}(G, T) = \mathbb{E}_{z\sim P_Z,\,y\sim P_Y}\, T(G(z,y), y) - \log\mathbb{E}_{z\sim P_Z,\,y\sim P_Y,\,\bar{y}\sim P_Y}\, e^{T(G(z,y),\,\bar{y})}.$$

The function $T: \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is often parameterized by a deep neural network. The overall objective of the proposed unbiased AC-GAN is

$$\min_{G,C}\max_{D,T}\ \mathcal{L}_{UAC}(G, D, C, T) = \mathcal{L}_{AC} + V_{\mathrm{MINE}}. \quad (10)$$

Note that when the inner T is optimal and the bound is tight, $V_{\mathrm{MINE}}(G, T^*)$ recovers the true mutual information $I_Q(X;Y) = H(Y) - H_Q(Y|X)$. Given that H(Y) is constant, minimizing over the outer G maximizes the true conditional entropy $H_Q(Y|X)$.
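For concreteness, the following is a minimal sketch of how the Donsker-Varadhan objective of Eq. (9) can be estimated on a mini-batch of generated samples, assuming a generic statistics network T(x, y) that returns one scalar per pair; shuffling the labels within the batch serves as sampling $\bar{y}\sim P_Y$. Function and variable names are illustrative assumptions.

```python
import math
import torch

def v_mine(T, x_fake, y):
    """Batch estimate of V_MINE(G, T) from Eq. (9) (illustrative sketch).

    T      : statistics network mapping (images, labels) to one scalar per pair, shape (B,)
    x_fake : generated images G(z, y), shape (B, C, H, W)
    y      : integer labels used to generate x_fake, shape (B,)
    """
    # joint term: each label is paired with the image it actually conditioned
    t_joint = T(x_fake, y).mean()

    # marginal term: break the pairing by shuffling labels within the batch (y_bar ~ P_Y);
    # log of the batch mean of exp(T), computed with logsumexp for numerical stability
    y_bar = y[torch.randperm(y.size(0), device=y.device)]
    t_marginal = torch.logsumexp(T(x_fake, y_bar), dim=0) - math.log(y.size(0))

    return t_joint - t_marginal  # lower bound on I_Q(X; Y): T maximizes it, G minimizes it
```

The original MINE additionally applies a moving-average correction to the gradient of the log term; that detail is omitted here for brevity.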
In the original MINE [2], the statistics network T is implemented as a neural network without any restrictions on the architecture. Specifically, T is a network that takes an image x and a label y as input and outputs a scalar, and a naive way to fuse them is by concatenation (input concat). However, we find that input concat yields poor mutual information estimates and does not work well in practice. To solve this, we propose a projection-based architecture for the statistics network.

The optimal solution of the statistics network is

$$T^*(x,y) = \log Q(y|x) - \log Q(y) + \log Z(y), \quad (11)$$

where $Z(y) = \mathbb{E}_{Q_X} e^{T(x,y)}$ is a partition function that only depends on y. For completeness, we include a brief derivation here [16]:

$$I_Q(X;Y) = \mathbb{E}_{Q_{XY}}\log\frac{\tilde{Q}(x|y)}{Q(x)} + \mathbb{E}_{Q_Y} D_{KL}\bigl(Q(x|y)\,\|\,\tilde{Q}(x|y)\bigr) \;\geq\; \mathbb{E}_{Q_{XY}}\bigl[\log\tilde{Q}(x|y) - \log Q(x)\bigr], \quad (12)$$

where $\tilde{Q}(x|y)$ is a variational approximation of $Q(x|y)$. This is also known as the Barber & Agakov bound [1]. We then choose an energy-based variational family and define

$$\tilde{Q}(x|y) := \frac{Q(x)}{Z(y)}\, e^{T(x,y)}. \quad (13)$$

The optimal T is obtained by setting $\tilde{Q}(x|y) = Q(x|y)$. Given the form of Equation 11 and inspired by the projection discriminator [14], we model the $Q(y|x)$ term as a log-linear model:

$$\log Q(y|x) := v_y^{\top}\phi(x) - \log Z(\phi(x)), \quad (14)$$

where $Z(\phi(x)) := \sum_k \exp(v_k^{\top}\phi(x))$ is another partition function. Thus, if we denote $-\log Z$ as $\psi$, one can rewrite the above equation as $\log Q(y|x) := v_y^{\top}\phi(x) + \psi(\phi(x))$. As mentioned before, $Q(y) = P(y)$ and is pre-defined by the dataset. If P(y) is uniform, then $\log P(y)$ is a constant which can be absorbed into $\psi$. If this condition is not satisfied, one can always merge the last two terms in Equation 11, define $c(y) := -\log Q(y) + \log Z(y)$, and obtain the final form of T,

$$T(x,y) := v_y^{\top}\phi(x) + \psi(\phi(x)) + c_y. \quad (15)$$

Intuitively, isolating $\log Q(y)$ from $c_y$ would help the network focus on estimating the partition function. Moreover, in situations where Q(y) might be changing, it is beneficial to model it during training. To explicitly model the term $\log Q(y)$, we can introduce another discriminator to differentiate samples $y\sim Q_Y$ from samples $y\sim \mathrm{Unif}(1, K)$. It is known that an optimal discriminator estimates the log density ratio between two data distributions. Let $D_Y$ solve the following task

$$\max_{D_Y}\ \mathbb{E}_{y\sim Q_Y}\log D_Y(y) + \mathbb{E}_{y\sim \mathrm{Unif}}\log\bigl(1 - D_Y(y)\bigr), \quad (16)$$

and let $\tilde{D}_Y$ be the logit of $D_Y$; then the optimal $\tilde{D}^*_Y = \log Q(y) + \log K$. Plugging it into Equation 11, we get another form

$$T(x,y) := v_y^{\top}\phi(x) + \psi(\phi(x)) - \tilde{D}_Y(y) + c_y + \log K. \quad (17)$$

Implementation-wise, a projection-based statistics network T only adds at most an embedding layer (the same as a fully connected layer) and a single-output fully connected layer (if replacing the LogSumExp function with a learnable scalar function). Thus, UAC-GAN adds only a negligible computational cost to AC-GAN.
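As a concrete illustration, below is a minimal PyTorch sketch of the projection form in Eq. (15): a label embedding plays the role of $v_y$, a placeholder CNN plays $\phi$, a learnable linear head stands in for $\psi$ (replacing the LogSumExp), and a per-class bias implements $c_y$. The layer sizes and the backbone are assumptions for illustration, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class ProjectionStatisticsNetwork(nn.Module):
    """T(x, y) = v_y^T phi(x) + psi(phi(x)) + c_y  (Eq. (15)), as an illustrative sketch."""

    def __init__(self, num_classes, feature_dim=128):
        super().__init__()
        # phi: image feature extractor (placeholder CNN; 3-channel input assumed,
        # adjust for grayscale data or swap in any backbone)
        self.phi = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, feature_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.v = nn.Embedding(num_classes, feature_dim)   # class embeddings v_y
        self.psi = nn.Linear(feature_dim, 1)              # learnable scalar head for psi
        self.c = nn.Parameter(torch.zeros(num_classes))   # per-class bias c_y

    def forward(self, x, y):
        feat = self.phi(x)                                 # phi(x), shape (B, feature_dim)
        projection = (self.v(y) * feat).sum(dim=1)         # v_y^T phi(x)
        return projection + self.psi(feat).squeeze(1) + self.c[y]
```

Such a network can be plugged directly into the `v_mine` estimate sketched above and optimized jointly with the discriminator.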
5. Experiments
We borrow the evaluation protocol of [5] to compare the distribution matching ability of AC-GAN, TAC-GAN, and our UAC-GAN on 1-D mixture of Gaussian synthetic data. Then, we evaluate the image generation performance of UAC-GAN on the MNIST [12] and CIFAR10 [11] datasets.
              AC-GAN     TAC-GAN    UAC-GAN
Class 0       0.234      -          -
Class 1       -          -          -
Class 2       527.801    -          -
Marginal      52.348     -          -

Table 1: MMD distance on the 1-D mixture of Gaussian experiment, lower is better. UAC-GAN matches the distributions better than TAC-GAN except for Class 0.

                    MNIST               CIFAR10
Method              IS ↑     FID ↓      IS ↑     FID ↓
AC-GAN              2.52     4.17       4.71     47.75
TAC-GAN             2.60     3.70       4.17     54.91
UAC-GAN (ours)      -        -          -        -

Table 2: Inception Scores (IS) and Fréchet Inception Distances (FID) on the MNIST and CIFAR10 datasets.
The MoG data is sampled from three Gaussian components with means 0, 3, and 6, labeled as Class 0, Class 1, and Class 2, respectively. The estimated density is obtained by applying kernel density estimation as in [5], and the maximum mean discrepancy (MMD) [7] distances are reported in Table 1. As shown, in most cases (except for Class 0), UAC-GAN outperforms TAC-GAN and is generally more stable across different runs.
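For reference, the snippet below sketches a simple (biased, V-statistic) RBF-kernel estimate of MMD^2 between two 1-D sample sets; the Gaussian kernel, the bandwidth, and the toy samples are illustrative assumptions and may differ from the kernel settings used in [5, 7].

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased MMD^2 estimate between 1-D samples x and y with a Gaussian kernel (illustrative)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    k = lambda a, b: np.exp(-(a - b.T) ** 2 / (2 * sigma ** 2))  # pairwise RBF kernel matrix
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# toy usage with made-up samples
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=500)
fake = rng.normal(0.5, 1.0, size=500)
print(mmd2_rbf(real, fake, sigma=1.0))
```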
Table 2 reports the Inception Scores (IS) [17] and Fréchet Inception Distances (FID) [8] on the MNIST and CIFAR10 datasets. To visually inspect whether the model exhibits label-conditioned mode collapse, we condition the generator on a single class. Samples are shown in Figure 1. The image samples show that the proposed UAC-GAN generates more diverse images; moreover, as demonstrated in the quantitative evaluations, UAC-GAN outperforms AC-GAN and TAC-GAN.
6. Conclusion
In this paper, we reviewed the low intra-class diversity problem of the AC-GAN model. We analyzed the TAC-GAN model and showed in theory why introducing a twin auxiliary classifier may cause unstable training. To address this, we proposed to directly estimate the mutual information using MINE. The effectiveness of the proposed method is demonstrated by a distribution matching experiment and by image generation experiments on MNIST and CIFAR10.
Figure 1: Results on the MNIST (a-c) and CIFAR10 (d-f) datasets: (a, d) AC-GAN, (b, e) TAC-GAN, (c, f) UAC-GAN. Samples are drawn from a single class, "2" (a-c) and "horse" (d-f), to illustrate label-conditioned diversity.
References

[1] David Barber and Felix V. Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.
[2] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R. Devon Hjelm. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[3] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172-2180, 2016.
[4] Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183-212, 1983.
[5] Mingming Gong, Yanwu Xu, Chunyuan Li, Kun Zhang, and Kayhan Batmanghelich. Twin auxilary classifiers GAN. In Advances in Neural Information Processing Systems, pages 1328-1337, 2019.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[7] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626-6637, 2017.
[9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125-1134, 2017.
[10] Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023, 2017.
[11] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[12] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[13] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[14] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
[15] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, pages 2642-2651, 2017.
[16] Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.
[17] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234-2242, 2016.
[18] Rui Shu, Hung Bui, and Stefano Ermon. AC-GAN learns a biased distribution. In