Disentangling the Spatial Structure and Style in Conditional VAE
Ziye Zhang, Li Sun∗, Zhilin Zheng and Qingli Li
Shanghai Key Laboratory of Multidimensional Information Processing, Key Laboratory of Advanced Theory and Application in Statistics and Data Science, East China Normal University, 200241 Shanghai, China
State Key Laboratory of Mathematical Engineering and Advanced Computing, 214125 Wuxi, China
ABSTRACT
This paper proposes a structure in the conditional variational autoencoder (cVAE) to disentangle the latent vector into a spatial structure code and a style code, complementary to each other, with one (z_s) being label relevant and the other (z_u) label irrelevant. Different from the traditional cVAE, our network maps the condition label into its relevant code z_s through a separate module. Depending on whether the label directly relates to the image spatial structure or not, the z_s output from the condition mapping module is used either as the style code with two spatial dimensions of 1 × 1, or as the spatial structure code with a single channel. Based on the input image and its corresponding z_s, the encoder provides a posterior distribution close to a common prior regardless of the label, so z_u sampled from it becomes label irrelevant. The decoder employs z_s and z_u through two typical adaptive normalization modules to reconstruct the input image. Results on two datasets with different types of labels show the effectiveness of our method.

Index Terms — cVAE, GAN, disentanglement
1. INTRODUCTION
VAE [1] and GAN [2] are two powerful tools for image synthesis. In GAN, the generator G(z) aims to mimic the data distribution p_data(x) with an approximation p_G(z) by mapping the random noise z to image-like data. Meanwhile, GAN learns a discriminator D to distinguish samples drawn either from p_data(x) or from p_G(z). G and D are trained jointly in an adversarial manner. VAE consists of a pair of connected encoder and decoder, with parameters φ and θ, respectively. The encoder maps the image x into a posterior distribution q_φ(z|x) from which the code z is sampled, and the decoder p_θ(x|z) transforms z back into the image domain to reconstruct x. VAE requires q_φ(z|x) to be simple, e.g., close to the Gaussian prior N(z; 0, I) as measured by the KL divergence.

Compared to GAN, VAE tends to generate blurry images, since q_φ(z|x) is too simple to capture the true posterior, a problem known as "posterior collapse". But it is easier to train, while GAN's optimization is unstable, hence many works try to stabilize its training [3–5]. Moreover, VAE explicitly models each dimension of z as an independent Gaussian, so it can disentangle the factors in an unsupervised way [6, 7]. To fully exploit the advantages of both, VAE and GAN can be combined into VAE-GAN [8], in which the encoder and decoder of the VAE form the generator, and a discriminator is employed to identify the real from the fake images.

Both GAN and VAE can be utilized for conditional generation. The generator in conditional GAN (cGAN) [9, 10] is usually given the concatenation of a random code z and a conditional label c. Its output G(z, c) is required to fulfill the condition. Here c has various forms.

∗ Corresponding author (Email: [email protected]). This work is supported by the Science and Technology Commission of Shanghai Municipality (No. 19511120800), and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing.
It can be a one-hot vector indicating the categories, or a conditional image with its spatial structure. D in cGAN not only evaluates the reality of G(z, c), but also checks its conformity to c. Similarly, cVAE [11] gives the label c to both the encoder and the decoder. The posterior specified by the encoder becomes q_φ(z|x, c). Note that x with different c are mapped to the same prior N(z; 0, I), so the regularization term actually prevents z from being relevant to c to some extent. Then the decoder p_θ(x|z, c) reconstructs x by concatenating c with z. Like VAE-GAN [8], cVAE can also be extended to cVAE-GAN as introduced in [12].

In cVAE, the label relevant and irrelevant factors are not explicitly disentangled in the posterior q_φ(z|x, c), hence manipulating z ∼ q_φ(z|x, c) often leads to unnecessary changes of the given condition. Moreover, the synthesized images in cVAE often have low conformity to the condition. To improve the conditional synthesis results, some works try to disentangle label relevant factors from all the others. CSVAE [13] proposes to learn the conditional label relevant subspace by using distinct priors under the different labels. Zheng et al. [14] employ two encoders. One learns the label dependent priors and specifies the posterior z_s as a δ function. The other uses a N(0, I) Gaussian prior, and maps the data into a label irrelevant posterior q_φ(z_u|x). Both works adopt adversarial training in the latent space to prevent z_u from carrying the label information.

Different from previous works, we use a single prior, regardless of the label of the input, and model the label irrelevant posterior based on the encoder. A deterministic code is also employed to model label relevant factors. Moreover, we do not apply adversarial training to minimize the mutual information between the label and its irrelevant factors, which makes the optimization easy.
The key idea is to disentangle the label irrelevant code z_u, sampled from the posterior of the VAE encoder, and the label relevant code z_s, given by the condition mapping network, based on the proposed cVAE structure. We design two types of structures. They are chosen depending on whether the label condition is spatially related (e.g., the pose degree) or not (e.g., the face identity). Posteriors in both cases are constrained in the same way, close to the prior N(z; 0, I), thus the code z_u ∼ N(z; 0, I) is label irrelevant. The dimensions of z_s and z_u are intentionally designed so that they become complementary. Particularly, one of them reflects the spatial structure, so it is a single channel feature map and is applied to the VAE through spatially-adaptive normalization (SPADE) [15]. The other is a 1 × 1 × C style vector and is incorporated into the VAE by adaptive instance normalization (AdaIN) [16, 17]. Note that Fig. 1 shows the case in which z_s is a style code and z_u from the posterior is a one channel feature map. To improve the image quality, we add a discriminator like cVAE-GAN [12], and extend the fake data to include the condition exchanged image. Therefore, the reconstructed, prior-sampled and condition exchanged images are all regarded as fake ones for the discriminator.

Our contribution lies in the following aspects. First, we propose a simple, flexible way to disentangle the spatial structure and style code for synthesis. It requires one of them to be label dependent, and it is given to the VAE as the condition. The other becomes the label irrelevant posterior after training. Second, by applying the adaptive normalization based on both the style and the spatial structure code, our model improves the disentangling performance. We carry out experiments on two types of datasets to prove the effectiveness of the method.
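The KL constraint that pulls every posterior toward the common prior N(z; 0, I), which is what makes z_u label irrelevant in our setting, has a closed form for diagonal Gaussian posteriors. A minimal numpy sketch (the function and variable names are our own, not from the paper):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over latent dimensions and averaged over the batch."""
    kl_per_dim = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl_per_dim.sum(axis=-1).mean()
```

A posterior that exactly matches the prior (mu = 0, log_var = 0) gives a KL of zero, regardless of which class the input image belongs to.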
2. PROPOSED METHOD
Fig. 1 shows the overall architecture. The generator G is of VAE structure, consisting of the Enc, Dec and the condition mapping network f. Enc specifies the spatial structure preserving posterior, which is assumed to be label irrelevant and is constrained to the prior N(0, I) by the KL divergence. Dec exploits the spatial structure code z_u sampled from the posterior with SPADE, and the style code z_s given by f with AdaIN. The "const input" of Dec does not affect the appearance of the generated image, hence it can be a constant vector. Note that for brevity, we do not distinguish the random variables and their realizations for z_u and z_s.

Fig. 1. The proposed architecture on FaceScrub, in which ID labels have few spatial cues, and they are mapped into style codes. Dec exploits the spatial structure code z_u ∈ R^{H/k × W/k × 1} with SPADE, and the style code z_s ∈ R^{1×1×C} with AdaIN.

Supposing the image x ∈ R^{H×W×3} with its label c, the goal of cVAE is to maximize the ELBO defined in (1), so that the data log-likelihood log p(x) can be maximized. Here H and W indicate the height and width of the input image, and φ, θ and ψ correspond to the model parameters of Enc, Dec and f, respectively. The key idea is to split the latent z into separate codes, the label relevant z_s and the irrelevant z_u. D_KL indicates the KL divergence between two distributions. Note that there are three terms in (1). The first one is the negative reconstruction error. The second and third terms are the regularization which pushes q_{φ,ψ}(z_u|x, c) and q_ψ(z_s|c) to their priors p(z_u) and p(z_s), respectively. In practice, we assume that z_s is deterministic, which means p(z_s) and q_ψ(z_s|c) are both Dirac δ functions. Hence the third term is strictly required to be 0, and thus can be ignored.

log p(x) ≥ E_{q_{φ,ψ}(z_u|x,c), q_ψ(z_s|c)} [log p_θ(x | z_u, z_s)]
           − D_KL( q_{φ,ψ}(z_u|x, c) || p(z_u) )
           − D_KL( q_ψ(z_s|c) || p(z_s) )                         (1)

Condition mapping network f. As illustrated in Fig. 1, the input of f is a conditional label c, which indicates the category of x and is usually expressed as a one-hot vector. Like [17], we use several fully-connected layers to map c into an embedding code z_s, which is later employed by the Enc and
Dec based on the adaptive normalization module. Here C is the number of channels and k is the ratio of the height (or width) of the image to that of the feature map. z_s = f(c) is the label relevant code, and it is reshaped either as a spatial structure preserving feature map z_s ∈ R^{H/k × W/k × 1}, or as a style code z_s ∈ R^{1×1×C}. We make the choice based on whether c directly relates to the spatial structure. Actually, we try both cases on two different datasets, 3D chair [18] and FaceScrub [19]. The details are given in the experiments section.

Encoder and decoder. cVAE has a pair of connected Enc and Dec. Together with f, Enc takes the image and label pair {x, c} and maps it into a posterior probability, which is assumed to be the Gaussian q_{φ,ψ}(z_u|x, c) = N(z_u | µ, σ). Here µ and σ are the mean and standard deviation, depending on {x, c} and output from Enc. A code z_u ∼ q_{φ,ψ}(z_u|x, c) is given to Dec to reconstruct x. Note that in VAE, there is a prior on z_u, which is p(z_u) = N(z_u; 0, I). During the optimization, the KL divergence between the posterior and the prior, D_KL(q_{φ,ψ}(z_u|x, c) || p(z_u)), is considered. In other words, Enc maps x from various classes into posteriors close to the same prior. Therefore z_u becomes label irrelevant.

We have two choices for the shape and dimension settings of z_u and z_s. Similar to z_s, z_u is either a one channel feature map z_u ∈ R^{H/k × W/k × 1} when z_s ∈ R^{1×1×C}, or a style code z_u ∈ R^{1×1×C} when z_s ∈ R^{H/k × W/k × 1}. In the former, z_u keeps the spatial structure from x, and in the latter, it is formulated to capture the feature channel style by global average pooling. Section 3 demonstrates that the dimension settings for z_s and z_u in the above two cases achieve the best performance on FaceScrub and 3D chair, respectively.

Dec takes z_s and z_u to reconstruct x. Traditionally in cVAE, all inputs are directly concatenated along the channel dimension at the beginning. However, this simple strategy does not emphasize the difference between z_s and z_u, which degrades the disentangling quality. Inspired by two adaptive normalization structures, AdaIN [17] and SPADE [15], we use them to help disentangle z_s and z_u. As shown in (2), h^(l) ∈ R^{N × H^(l) × W^(l) × C^(l)} is an intermediate output of the l-th layer of Dec, and ĥ^(l) is the tensor after normalization. Here, H^(l), W^(l) and C^(l) indicate the height, width and channel number of h^(l), and N is the batch size. γ(z) and β(z) are two outputs from the conditional branch, depending on z_s or z_u.
µ_h and σ_h are the mean and standard deviation statistics of h^(l). SPADE and AdaIN use different strategies to compute µ_h and σ_h, and to apply β and γ. In SPADE, µ_h and σ_h are computed over the dimensions H^(l), W^(l) and N, while β and γ take distinct values over the whole 3D tensor. In AdaIN, µ_h and σ_h are computed over the spatial dimensions, but β and γ are only provided along the channel dimension. Our method processes h^(l) by both SPADE and AdaIN. Then we concatenate the results and reduce the channels by a 1 × 1 conv.

ĥ^(l) = γ(z) (h^(l) − µ_h) / σ_h + β(z),   z ∈ {z_s, z_u}        (2)

Loss functions. Traditional cVAE has only the Enc and Dec, so they are optimized only by the reconstruction loss with the KL divergence as a regularization, as in (1). To improve the synthesis quality, we add adversarial training by employing a discriminator D to inspect the quality of fake data. Besides, we can have more types of fake data. In [8, 12], the reconstructed image and the prior sampled image from z_u ∼ N(z; 0, I) are given to D. Our work extends this by synthesizing the condition exchanged image from z_s and z_u taken from images of different conditions, with labels c′ and c. The D proposed in [10] is used in our work; its score ranges from −1 to 1. The adversarial training loss is given in (3).

L_D^adv = E_{x ∼ p_data} [max(0, 1 − D(x, c))]
        + E_{z_u ∼ N(z; µ, σ)} [max(0, 1 + D(Dec(z_u, f(c)), c))]
        + E_{z_u ∼ N(z; µ, σ)} [max(0, 1 + D(Dec(z_u, f(c′)), c′))]
        + E_{z_u ∼ N(z; 0, I)} [max(0, 1 + D(Dec(z_u, f(c)), c))]     (3)

Here the first term is the score for the real image, and the other three terms are the scores for the reconstructed, exchanged and prior sampled images.
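As a concrete illustration of (2), the two normalization schemes differ only in which axes the statistics µ_h, σ_h are computed over and in the shape of γ, β. A minimal numpy sketch under our own shape conventions (channel-last layout; names are ours, not the paper's code):

```python
import numpy as np

def adain(h, gamma, beta, eps=1e-5):
    """Eq. (2) with AdaIN statistics: mu_h, sigma_h are computed per sample
    and per channel over the spatial dims; gamma/beta come from the style
    code and modulate each channel uniformly."""
    # h: (N, H, W, C); gamma, beta: (C,)
    mu = h.mean(axis=(1, 2), keepdims=True)
    sigma = h.std(axis=(1, 2), keepdims=True)
    return gamma * (h - mu) / (sigma + eps) + beta

def spade(h, gamma, beta, eps=1e-5):
    """Eq. (2) with SPADE statistics: mu_h, sigma_h are computed per channel
    over N, H, W; gamma/beta come from the spatial structure code and vary
    over the full spatial tensor."""
    # h: (N, H, W, C); gamma, beta: (H, W, C)
    mu = h.mean(axis=(0, 1, 2), keepdims=True)
    sigma = h.std(axis=(0, 1, 2), keepdims=True)
    return gamma * (h - mu) / (sigma + eps) + beta
```

In the proposed Dec, h^(l) is passed through both branches, the two results are concatenated along the channel dimension, and a 1 × 1 conv reduces the channels back.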
3. EXPERIMENTS

3.1. Experimental setup

Datasets.
We conduct experiments on two datasets, the 3D chair [18] and FaceScrub [19]. 3D chair depicts a wide variety of chairs under 62 different chosen azimuth angles. Images are resized to a fixed size. FaceScrub contains around 100k facial images from 530 different IDs. The faces are cropped by the detector of [20], and they are aligned based on the facial landmarks from [21]. The detected, cropped images share a fixed resolution.

Network structure and dimension settings for z_s and z_u. Table 1 lists the compared structures in our experiments on the two datasets. Typically, z_s and z_u are used by Dec in three ways: "Dec inputs", "AdaIN" and "SPADE". "Dec inputs" indicates that the code is taken as Dec's input.

On the 3D chair dataset, the model rotates the chair to the specified azimuth while preserving the original chair style. Here, the label relevant code z_s has the spatial structure, and the label irrelevant z_u is the style code. Different from Fig. 1, Dec of the proposed structure adopts z_u with AdaIN and z_s with SPADE, and it is compared with the other structures S1 to S4. On FaceScrub, the model changes the facial image to the specified ID but preserves label irrelevant factors, like pose or expression, from the input image. It is obvious that the ID relevant z_s becomes the style code and the ID irrelevant z_u mainly specifies spatially related information. The proposed structure, shown in Fig. 1, is compared with S1 and S3, two typical structures using AdaIN. To compare different models in a fair way, we fix the total dimension of z_s and z_u to 512, with 256 each. If it is a spatial structure code, its size is 16 × 16 × 1, while for a style code, it is 1 × 1 × 256.
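The dimension settings above amount to reshaping a 256-d label embedding either into a single channel 16 × 16 map or into a 1 × 1 × 256 style vector. A toy numpy sketch of the condition mapping f (the FC stack and its sizes are our own illustration, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_mapping(c_onehot, weights, spatial=False, h_k=16, w_k=16, n_ch=256):
    """Map a one-hot label c through fully-connected layers into the label
    relevant code z_s = f(c), then reshape it as either a spatial structure
    code or a style code."""
    h = c_onehot
    for w in weights:
        h = np.maximum(h @ w, 0.0)          # FC + ReLU (hypothetical stack)
    if spatial:
        return h.reshape(h_k, w_k, 1)       # spatial structure code, one channel
    return h.reshape(1, 1, n_ch)            # style code

# Hypothetical sizes: 62 azimuth classes mapped to a 256-d embedding.
weights = [rng.standard_normal((62, 256)) * 0.1,
           rng.standard_normal((256, 256)) * 0.1]
c = np.zeros(62); c[24] = 1.0               # one-hot condition label
z_s_spatial = condition_mapping(c, weights, spatial=True)   # (16, 16, 1)
z_s_style   = condition_mapping(c, weights, spatial=False)  # (1, 1, 256)
```

The same 256 numbers feed both shapes; only the reshape, and hence the normalization module that consumes them, changes.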
We adopt three metrics for quantitative analysis. (1) Classification Accuracy (Acc) reflects the condition conformity of the generated images. We use a ResNet-50 [22] trained on the two datasets for evaluation. (2) Fréchet Inception Distance (FID) [23] measures the distance between the distributions of the synthesized and the real images; the lower, the better. (3) Mutual Information (MI) between the label irrelevant code z_u and the original label c:

MI = I(z_u; c) = E_{q(z_u|c) p(c)} log [ q(z_u|c) / q(z_u) ]
              = (1/N_C) Σ_c E_{q(z_u|c)} log [ q(z_u|c) / q(z_u) ]        (4)

MI is computed as in (4), and the smaller, the better. It indicates whether the label relevant and irrelevant variables are disentangled well. Here N_C is the number of categories, and q(z_u|c) and q(z_u) are approximated from the posterior distribution by Monte Carlo simulation.

Table 1. The compared structures on the two datasets. Note that we do not compare all possible ways, only the typical ones.

              3D chair                                               Facescrub
      cVAE-GAN   S1         S2         S3     S4     Proposed    S1         S3     Proposed
z_s   Dec input  AdaIN      SPADE      AdaIN  AdaIN  SPADE       AdaIN      AdaIN  AdaIN
z_u   Dec input  Dec input  Dec input  AdaIN  SPADE  AdaIN       Dec input  AdaIN  SPADE

Qualitative results. We choose one specific label to synthesize the exchanged images under that condition. For 3D chair, the condition label is "24" (which corresponds to a specific azimuth angle), and for FaceScrub it is "Anne Hathaway". The results from the different models on 3D chair and FaceScrub are presented in Fig. 2 and Fig. 3, respectively. In both figures, the images generated by our proposed method achieve the best performance. Particularly, the proposed model keeps the style of the input chair well when rotating it, and it also maintains the facial expression when changing the identity on FaceScrub.
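The Monte Carlo approximation of the MI metric in (4) can be sketched as follows, assuming the per-class posteriors q(z_u|c) are diagonal Gaussians and p(c) is uniform (all names and details here are our own illustration):

```python
import numpy as np

def log_gauss(z, mu, sigma):
    """Diagonal Gaussian log-density, summed over latent dimensions."""
    return (-0.5 * np.log(2 * np.pi * sigma ** 2)
            - 0.5 * ((z - mu) / sigma) ** 2).sum(axis=-1)

def estimate_mi(mus, sigmas, n_samples=1000, seed=0):
    """Monte Carlo estimate of (4): I(z_u; c) with uniform p(c), where
    q(z_u) = (1/N_C) * sum_c' q(z_u|c') is the class-marginal mixture."""
    rng = np.random.default_rng(seed)
    n_c = len(mus)
    total = 0.0
    for c in range(n_c):
        z = mus[c] + sigmas[c] * rng.standard_normal((n_samples, mus[c].shape[-1]))
        log_q_zc = log_gauss(z, mus[c], sigmas[c])
        # log q(z) via logsumexp over classes, for numerical stability
        all_logs = np.stack([log_gauss(z, mus[k], sigmas[k]) for k in range(n_c)])
        m = all_logs.max(axis=0)
        log_q_z = m + np.log(np.exp(all_logs - m).mean(axis=0))
        total += (log_q_zc - log_q_z).mean()
    return total / n_c
```

When all class posteriors coincide, the estimate is 0 (perfect disentanglement in this metric); when they are well separated, it approaches log N_C.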
GAN
S1 S2
Input
OursS3 S4 Real
Image
Fig. 2 . Rotated (exchanged) images comparison on 3D chair.
Quantitative results.
To validate the proposed model, we compute the MI and Acc on 3D chair, and the Acc and FID on FaceScrub. Results are shown in Table 2 and Table 3. For MI and Acc, we generate 100 exchanged images for each specified label. For FID, due to the many IDs in FaceScrub, we choose to evaluate it on 5 IDs. Each ID has 5000 images, and we calculate the average FID over the chosen 5 categories. As shown in Table 2 and Table 3, our proposed method, which uses SPADE and AdaIN to process the spatial structure and style code respectively, achieves the best scores on both datasets.
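FID compares the Gaussian statistics of two feature sets: ||µ_a − µ_b||² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The paper evaluates it on Inception features; as a minimal, purely illustrative numpy sketch over arbitrary feature arrays (the eigenvalue route to the trace of the matrix square root is our own choice):

```python
import numpy as np

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.
    Tr((S_a S_b)^{1/2}) is obtained from the eigenvalues of S_a @ S_b,
    which are real and non-negative for PSD covariances."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    eig = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of 0, and shifting one set by a constant vector changes only the squared-mean term.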
Fig. 3. Reconstructed and exchanged images on FaceScrub.
Table 2. MI and Acc on 3D chair from different models.

                  MI      Acc
cVAE-GAN          3.777   0.124
S1                3.773   0.573
S2                3.765   0.608
S3                3.775   0.511
S4                3.774   0.404
Proposed method

Table 3. Acc and FID on Facescrub from different models.

                  Acc     FID
cVAE-GAN          0.072   83.05
S1                0.444   80.37
S3                0.498   52.46
Proposed method
4. CONCLUSION

We propose a latent space disentangling algorithm for conditional image synthesis in cVAE. Our method divides the latent code into label relevant and irrelevant parts. One of them preserves the spatial structure, and the other is the style code. These two types of codes are applied in cVAE by different adaptive normalization schemes. Together with a discriminator in the pixel domain, our model can generate high quality images and achieves good disentangling performance.

5. REFERENCES

[1] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," arXiv preprint arXiv:1401.4082, 2014.
[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein generative adversarial networks," in International Conference on Machine Learning, 2017, pp. 214–223.
[4] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville, "Improved training of wasserstein gans," in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[5] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[6] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," ICLR, vol. 2, no. 5, p. 6, 2017.
[7] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud, "Isolating sources of disentanglement in variational autoencoders," in Advances in Neural Information Processing Systems, 2018, pp. 2610–2620.
[8] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther, "Autoencoding beyond pixels using a learned similarity metric," arXiv preprint arXiv:1512.09300, 2015.
[9] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[10] Takeru Miyato and Masanori Koyama, "cGANs with projection discriminator," arXiv preprint arXiv:1802.05637, 2018.
[11] Kihyuk Sohn, Honglak Lee, and Xinchen Yan, "Learning structured output representation using deep conditional generative models," in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[12] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua, "CVAE-GAN: fine-grained image generation through asymmetric training," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2745–2754.
[13] Jack Klys, Jake Snell, and Richard Zemel, "Learning latent subspaces in variational autoencoders," in Advances in Neural Information Processing Systems, 2018, pp. 6444–6454.
[14] Zhilin Zheng and Li Sun, "Disentangling latent space for vae by label relevant/irrelevant dimensions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12192–12201.
[15] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu, "Semantic image synthesis with spatially-adaptive normalization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2337–2346.
[16] Xun Huang and Serge Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501–1510.
[17] Tero Karras, Samuli Laine, and Timo Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[18] Mathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan C Russell, and Josef Sivic, "Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3762–3769.
[19] Hong-Wei Ng and Stefan Winkler, "A data-driven approach to cleaning large face datasets," in IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 343–347.
[20] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun, "Joint cascade face detection and alignment," in European Conference on Computer Vision. Springer, 2014, pp. 109–122.
[21] Xuehan Xiong and Fernando De la Torre, "Supervised descent method and its applications to face alignment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, "Gans trained by a two time-scale update rule converge to a local nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.