Improving Text to Image Generation using Mode-seeking Function
Naitik Bhise, Concordia University, Montreal, Canada. [email protected]
Zhenfei Zhang, Concordia University, Montreal, Canada. [email protected]
Tien D. Bui, Concordia University, Montreal, Canada. [email protected]
Abstract
Generative Adversarial Networks (GANs) have long been used to understand the semantic relationship between text and image. However, image generation suffers from mode collapse, in which the generator favours a few preferred output modes. Our aim is to improve the training of the network by using a specialized mode-seeking loss function to avoid this issue. In text-to-image synthesis, our loss function drives two points in latent space to generate distinct images. We validate our model on the Caltech Birds (CUB) dataset and the Microsoft COCO dataset by changing the intensity of the loss function during training. Experimental results demonstrate that our model compares favourably with several state-of-the-art approaches.
1. Introduction
Text-to-image synthesis is a burgeoning field and has gained a lot of attention in the last few years. With the help of Generative Adversarial Networks [3], the focus has shifted to generating better quality images. GANs have produced the best results by learning the relationship between the text vectors and the image. Several techniques improve the generated images, differing in the type of architecture used. Single-stage methods [23] try to learn a relation between the text and the image features. Multi-stage methods [31][28][30][34] take the image produced by a single stage and improve its quality using the original image and the available word descriptions.

Though multi-stage architectures improve the quality of the images, they have some associated problems. The output images are highly dependent on the quality of the input images generated by the single stage. Also, each word in a sentence has a different level of importance [28] for the image features. Architectures such as Dynamic Memory GAN [34] and AttnGAN [28] address these problems and improve the semantic knowledge of the image.

Conditional Generative Adversarial Networks (cGANs) [20] perform image generation with the help of the input descriptions and a latent vector containing the information of the input context. They take conditional information as input, augmented with a random vector, to generate the output. For image synthesis, cGANs can be used in many tasks with different conditional contexts. They can be applied to categorical image generation with the help of class labels, and by using text sentences as contexts, they can be used in text-to-image synthesis. cGANs are multimodal as they map one input to many output modes. Different output images are generated by varying the noise vector, which can be checked by applying the cGAN to a context vector augmented with different noise vectors. One good example of such variation is the dog-to-cat case, where the noise vectors are responsible for the change from an image of one animal to another. However, cGANs suffer from the mode collapse issue: some output modes are chosen over others [16], and weak modes are suppressed in favour of the prominent strong modes. The noise vectors are ignored or have only minor impact, since cGANs pay more attention to learning from the high-dimensional and structured conditional contexts. There are two approaches for solving the mode collapse problem. The first focuses on discriminators by introducing different divergence metrics and optimization processes, while the second uses auxiliary networks such as multiple generators and additional encoders. The mode-seeking loss function [16] maximizes the difference between generated images, enhancing the modes and separating them in a simpler way than conventional methods.

We introduce a method that augments the Dynamic Memory GAN [34] architecture with the mode-seeking loss function. From the latent distribution, two points with different random vectors are chosen and fed to the generator network. We compute the ratio of the distance between the generated images to the distance between the two noise vectors applied to the initial stage of the generator network. The discriminator is trained with the gradients from these modes so that they will not get ignored.
The training maximizes the distance between the two randomly generated modes, thus exploring the target distribution and encouraging other modes. We validate our method on two public datasets, the Caltech Birds (CUB) dataset [27] and the Microsoft COCO dataset [14], using the Fréchet Inception Distance as the evaluation metric. From this study, an effective way to train GANs is proposed. The contributions of our work include:

• We analyse the limitations of current techniques and, to address them, use a new loss function to improve the training of the network.
• Our method is effective with low parameter overhead.
• Our work achieves better performance on both the Caltech Birds (CUB) dataset and the Microsoft COCO dataset than previous models.

The next section reviews work related to text-to-image synthesis. The Procedure section gives detailed information describing the method and its application. The Results section presents the observations and metric values with explanations, and the Conclusion draws inferences from the research.
2. Related Work
There are many methods of image generation adopted from VAEs [10] and GANs [3]. However, the VAE has some disadvantages, as it uses the mean square error between the generated image and the initial image; compared with a GAN, which learns adversarially, the image from a VAE is much more blurry. DC-GAN [23] was the first work to show that cGANs can be used to generate visually acceptable images from descriptions. After DC-GAN, Generative Adversarial Networks have been widely used for text-to-image tasks. Reed [23] introduced single-stage text-to-image synthesis, learning a joint representation between natural language and real-world images and passing it to a generator for image generation. Since previous generative models failed to add location information, Reed proposed GAWWN [24] to encode localization constraints. To diversify the generated images, the discriminator of TAC-GAN [2] not only distinguishes real images from synthetic images, but also classifies synthetic images into classes. Similar to TAC-GAN, PPGN [21] includes a conditional network to synthesize images conditioned on a caption.

The limitation of the above techniques is that the generation can only capture the overall meaning of the text and misses necessary details. Moreover, it is very hard to obtain a high-resolution and realistic image. Single-stage methods have spurred a series of new research techniques in this field, with multi-stage methods working towards the improvement of image quality and resolution. Reed et al. 2016 [23] generated images with a resolution of 64 × 64, while the multi-stage methods took the image resolution to higher levels, up to 256 × 256.
3. Procedure
Generative Adversarial Networks involve solving a min-max problem. The generator produces an image from an input distribution, while the discriminator tries to distinguish the generated image from real images. Thus, adversarially, the discriminator trains the generator through its gradients, and the generator tries to fool the discriminator by synthesizing realistic images. A conditional GAN generates its output distribution from an input distribution directed by a prior condition or context. Text-to-image synthesis uses more than a single GAN network, with successive stages refining the images. We make use of the initial network of DM-GAN.

DM-GAN is a network that learns the inner semantic relationship between the text and the image with the help of a memory architecture. The objective of the dynamic memory is to fuse the image and text information through non-trivial transformations between the key and value memory. It tries to improve the image vectors with the help of the text vectors. The memory is an ensemble of the following modules: memory writing, key-value memories and the response gate.

• Memory writing: this module encodes prior information from the image vectors and the text vectors through a convolution operation, that is, it computes attention between the feature map and the word vectors.
• Key addressing: key memories are used to retrieve related memories using a similarity function.
• Value reading: the key addressing weights are applied to the value memories to prepare the output memory representation.
• Response gate: this module controls the concatenation of the image features and the formed memory vectors.

The encoder of Reed et al. 2016 [24] is used for the generation of the word embeddings and the sentence embeddings. The text encoder is a pretrained bidirectional LSTM network. The deep convolutional generator predicts an initial image x_0 with a rough shape and few details from the sentence feature and a random noise vector. At the dynamic memory based image refinement stage, more fine-grained visual contents are added to the fuzzy initial images to generate a photo-realistic image x_i:

x_i = G_i(R_{i-1}, W),

where R_{i-1} is the image feature from the previous stage and W are the word features. The refinement stage is applied several times to retrieve more pertinent information and generate a high-resolution image with more fine-grained details. Figure 1 shows the diagram of the proposed method.

Figure 1. Architecture of the proposed method, which addresses mode collapse effectively by using the mode-seeking loss function.

The objective function of the generator network is defined as:

L = \sum_i L_{G_i} + \lambda_1 L_{CA} + \lambda_2 L_{DAMSM} + \lambda L_{ms},  (1)

in which \lambda_1 and \lambda_2 are the weights of the conditioning augmentation loss and the DAMSM loss, and we tune the parameter \lambda controlling the intensity of the mode-seeking term. G_0 denotes the generator of the initial generation stage, and G_i denotes the generator of the i-th iteration of the image refinement stage.

Adversarial loss: the adversarial loss for G_i is defined as

L_{G_i} = -\frac{1}{2}\left[\mathbb{E}_{x \sim p_{G_i}} \log D_i(x) + \mathbb{E}_{x \sim p_{G_i}} \log D_i(x, s)\right],  (2)

where the first term is the unconditional loss, which makes the generated image as realistic as possible, and the second term is the conditional loss, which makes the image match the input sentence. Correspondingly, the adversarial loss for each discriminator D_i is defined as

L_{D_i} = -\frac{1}{2}\Big[\underbrace{\mathbb{E}_{x \sim p_{data}} \log D_i(x) + \mathbb{E}_{x \sim p_{G_i}} \log(1 - D_i(x))}_{\text{unconditional loss}} + \underbrace{\mathbb{E}_{x \sim p_{data}} \log D_i(x, s) + \mathbb{E}_{x \sim p_{G_i}} \log(1 - D_i(x, s))}_{\text{conditional loss}}\Big].  (3)

Here, the unconditional loss is designed to distinguish the generated image from real images, while the conditional loss measures the match between the image vectors and the word embeddings.

Conditioning augmentation loss: the conditioning augmentation (CA) technique [30] augments the training data and avoids overfitting by resampling the input sentence vector from an independent Gaussian distribution. The CA loss is defined as the Kullback-Leibler divergence between the standard Gaussian distribution and the Gaussian distribution of the training data:

L_{CA} = D_{KL}\big(\mathcal{N}(\mu(s), \Sigma(s)) \,\|\, \mathcal{N}(0, I)\big),  (4)

where \mu(s) and \Sigma(s) are the mean and diagonal covariance matrix of the sentence feature, computed by fully connected layers.

DAMSM loss: the DAMSM loss introduced by [28] measures the degree of matching between images and text descriptions, making the generated images well conditioned on the content of the text descriptions.
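To make these loss terms concrete, the following is a minimal PyTorch sketch of the adversarial losses (Eqs. 2 and 3) and the CA term (Eq. 4). The discriminator interface D_i(image) / D_i(image, sent_emb) and the use of a log-variance for the diagonal covariance are illustrative assumptions, not the exact DM-GAN implementation.

```python
import torch


def generator_adv_loss(D_i, fake_img, sent_emb, eps=1e-8):
    """Adversarial loss for generator G_i (Eq. 2): unconditional + conditional terms.
    Assumes D_i outputs probabilities in (0, 1)."""
    uncond = D_i(fake_img)             # D_i(x)
    cond = D_i(fake_img, sent_emb)     # D_i(x, s)
    return -0.5 * (torch.log(uncond + eps).mean() + torch.log(cond + eps).mean())


def discriminator_adv_loss(D_i, real_img, fake_img, sent_emb, eps=1e-8):
    """Adversarial loss for discriminator D_i (Eq. 3)."""
    uncond = (torch.log(D_i(real_img) + eps).mean()
              + torch.log(1.0 - D_i(fake_img) + eps).mean())
    cond = (torch.log(D_i(real_img, sent_emb) + eps).mean()
            + torch.log(1.0 - D_i(fake_img, sent_emb) + eps).mean())
    return -0.5 * (uncond + cond)


def ca_loss(mu, logvar):
    """Conditioning augmentation KL term (Eq. 4):
    KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch."""
    return (-0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=1)).mean()
```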
Mode-seeking loss function: the mode collapse problem is addressed with the help of a loss function proposed by Mao et al. 2019 [16]. When sampling from a latent prior, many output modes arise from the different initial vectors, and each mode generates a different image. Some of the modes are preferred over the others. The mode-seeking loss function tries to strengthen the minor modes of the distribution that lie between the major modes. Suppose the latent distribution is used to generate two different instances based on the same input: the first point z_1 leads to the image I_1 = G(c, z_1) and the second point z_2 leads to the image I_2 = G(c, z_2), where G is the generator and c the context vector. The mode-seeking loss function is given by

L_{ms} = \max_{G}\left[\frac{D(G(c, z_1), G(c, z_2))}{D(z_1, z_2)}\right],  (5)

where D(·, ·) is a distance function between two tensors or vectors. The numerator is the distance between the image vectors, while the denominator is the distance between the noise vectors. The purpose of this term is to maximize the distance between the image vectors and thus keep the modes from overlapping.

The DM-GAN network is trained according to the implementation details given in the original paper [34]. For text embedding, the pre-trained bidirectional LSTM text encoder of Xu et al. [28] is employed, and its parameters are fixed during training. Each word feature corresponds to the hidden states of the two directions, and the sentence feature is generated by concatenating the last hidden states of the two directions. The initial image generation stage synthesizes images with 64 × 64 resolution; the dynamic memory based image refinement stage then refines the images to 128 × 128 and 256 × 256 resolution in the second and third stages. Following the defaults of [34], we set N_w = 256, N_r = 64 and N_m = 128 as the dimensions of the text, image and memory feature vectors respectively. We set the hyperparameters λ_1 = 1 and λ_2 = 5 for the CUB dataset, and λ_1 = 1 and λ_2 = 50 for the COCO dataset. All networks are trained using the ADAM optimizer [9] with batch size 8 for the COCO dataset and 5 for CUB, β_1 = 0.5 and β_2 = 0.999. The learning rate is set to 0.0002. We train the DM-GAN model for 600 epochs on the CUB dataset and 120 epochs on the COCO dataset.
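As an illustration of Eq. 5 and of how the extra term enters the overall objective of Eq. 1, the following PyTorch sketch samples two noise vectors per batch and penalizes the ratio of image distance to latent distance. The generator signature G(sent_emb, z), the L1 distance, and the reciprocal formulation (used in the public MSGAN code so that minimizing the loss maximizes the ratio) are assumptions for illustration, not the authors' exact code.

```python
import torch


def mode_seeking_ratio(G, sent_emb, z_dim=100, eps=1e-5):
    """Ratio of Eq. 5: distance between two generated images divided by the
    distance between the two latent codes that produced them."""
    batch = sent_emb.size(0)
    z1 = torch.randn(batch, z_dim, device=sent_emb.device)
    z2 = torch.randn(batch, z_dim, device=sent_emb.device)
    img1, img2 = G(sent_emb, z1), G(sent_emb, z2)
    d_img = torch.abs(img1 - img2).mean()   # distance between generated images
    d_z = torch.abs(z1 - z2).mean()         # distance between noise vectors
    return d_img / (d_z + eps)


def total_generator_loss(adv_losses, l_ca, l_damsm, ms_ratio,
                         lambda1=1.0, lambda2=5.0, lambda_ms=1.0, eps=1e-5):
    """Overall generator objective in the spirit of Eq. 1.
    Minimizing the reciprocal of the mode-seeking ratio pushes the ratio up,
    i.e. it maximizes the distance between images generated from different
    latent codes (subtracting the ratio directly is an alternative)."""
    return (sum(adv_losses) + lambda1 * l_ca + lambda2 * l_damsm
            + lambda_ms / (ms_ratio + eps))
```

With λ_1 and λ_2 fixed to the DM-GAN defaults above, the only new hyperparameter in this sketch is the mode-seeking weight, whose effect is studied in the next section.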
4. Results
In this section, we evaluate the DM-GAN model quantitatively and qualitatively. The DM-GAN model was implemented using the open-source Python library PyTorch [22].
Datasets: to demonstrate the capability of our proposed method for text-to-image synthesis, we conducted experiments on the CUB [27] and COCO [14] datasets. The CUB dataset contains 200 bird categories with 11,788 images, of which 150 categories with 8,855 images are used for training and the remaining 50 categories with 2,933 images for testing. There are ten captions for each image in the CUB dataset. The COCO dataset includes a training set with 80k images and a test set with 40k images; each image has five text descriptions.
Evaluation metric: we quantify the performance of the DM-GAN in terms of the Fréchet Inception Distance (FID). Each model generated 30,000 images conditioned on text descriptions from the unseen test set for evaluation. The FID [5] computes the Fréchet distance between synthetic and real-world images based on features extracted from a pre-trained Inception v3 network; it measures the distance between the synthetic data distribution and the real data distribution. A lower FID implies a closer distance between the generated image distribution and the real-world image distribution.

The DM-GAN network was trained with two values of λ for each of the datasets. The trained models were used to generate 30,000 images for the validation split, separate from the training data, and these generated images were used to compute the FID score.

We can see that the images from the new DM-GAN have better quality, in terms of finer features and better characteristics of the birds, than the original DM-GAN images on the CUB dataset. It can also be observed that the bird is well separated from the background and there are fewer instances of a double head. In many images, the background is clearly rendered and can be distinguished very well from the bird. In a few images the colours are well distinguishable and brighter, correlating well with the description. For the COCO dataset, more distinguishing features can be seen in the same way as for the birds, with better clarity. With increasing values of the parameter, we see improvement in image quality and better colours. Thus, our training method produces more distinctive features in the images by improving contrast.
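For reference, below is a minimal sketch of the FID computation from pre-extracted Inception-v3 features (N × 2048 activation arrays); feature extraction and image generation are omitted, and the function name is illustrative.

```python
import numpy as np
from scipy import linalg


def frechet_inception_distance(real_feats, fake_feats, eps=1e-6):
    """FID between two feature sets:
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2})."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r.dot(cov_f), disp=False)
    if not np.isfinite(covmean).all():
        # Fall back to a slightly regularized product if sqrtm is unstable.
        offset = np.eye(cov_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((cov_r + offset).dot(cov_f + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff.dot(diff) + np.trace(cov_r) + np.trace(cov_f)
                 - 2.0 * np.trace(covmean))
```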
Our objective is to understand the importance of the mode-seeking term in the loss function by tweaking its amplitude. On the CUB dataset, the FID decreases from 16.09 for the original DM-GAN network to 14.27 for λ = 1, and further to 13.93 for the larger value of λ. The COCO dataset saw a bigger change: the FID score was 24.3 for λ = 1, a 25.5% decrease from the original DM-GAN FID and 1.6% from the OP-GAN FID, but the larger λ gave a higher score of 26.01, still beating the basic DM-GAN model. This is a bigger improvement over the previous AttnGAN model, with a 42% decrease in FID on the CUB dataset and a 31.5% decrease in FID on the COCO dataset. In summary, we have trained the DM-GAN network to achieve better experimental results than the original loss scheme. Table 1 compiles the FID scores for the different parameter values together with the FID values of the previous architectures.

Architecture        CUB dataset    COCO dataset
StackGAN++ [31]     35.11          33.88
AttnGAN [28]        23.98          35.49
DM-GAN [34]         16.09          32.64
OP-GAN [6]          -              24.7
Ours, λ = 1         14.27          24.3
Ours, larger λ      13.93          26.01

Table 1. Evaluation results (FID score) on CUB-200-2011 and COCO with two different values of λ. Our method is better than DM-GAN, and the larger λ gives better results on CUB.

We observe the variation of the FID with the increase in the λ factor. The λ factor is responsible for the strength of the mode-seeking loss function in the overall loss expression and thus influences the training of the network. Table 1 reports the FID scores for the different parameter values. We see that the FID score increases as we increase the amplitude of the mode-seeking term for the COCO dataset but decreases for the CUB dataset. We have found modes that resulted in new images having different shades and better quality. Therefore, we have achieved a substantial amount of diversity in our network and can generate better images for the descriptions.

Qualitatively, we can see that increasing λ improves the quality of the image, with an increase in clarity. But it also increases the strength of the mode-seeking term, which reduces the relative importance of the DAMSM loss and the conditional loss in the original loss function. Some of these effects can be seen in the first example of Figure 2: the blackness of the bird is emphasized more at the higher λ, while other features are rendered with less significance. We can also see, in the fourth example of Figure 3, that the white bus becomes jumbled up as we increase λ.
5. Conclusion
In this work, we present a simple but effective mode-seeking regularization term on the generator to address the mode collapse issue in cGANs and improve the quality of the images generated by the Dynamic Memory GAN. By maximizing the distance between generated images with respect to the distance between the corresponding latent codes, the regularization term forces the generator to explore more minor modes. We conclude that changing the amplitude of the regularization term can lead to better results and images. We measured the efficiency of our method with the FID metric as well as human eye evaluation, and applied the new technique to two image datasets. Our model compares well with past state-of-the-art models in terms of the FID score. Both qualitative and quantitative results show that the proposed regularization term helps the baseline frameworks, improving diversity without sacrificing the visual quality of the generated images.

Figure 2. Results of our models on the CUB dataset with the values of the λ coefficient and their respective images.
Figure 3. Results of our models on the Microsoft COCO dataset with the values of the λ coefficient and their respective images.

References

[1] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
[2] Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, and Muhammad Zeshan Afzal. TAC-GAN: Text conditioned auxiliary classifier generative adversarial network. arXiv preprint arXiv:1703.06412, 2017.
[3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
[4] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural Turing machine with continuous and discrete addressing schemes. Neural Computation, 30(4):857–884, 2018.
[5] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[6] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Semantic object accuracy for generative text-to-image synthesis. arXiv preprint arXiv:1910.13321, 2019.
[7] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[11] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[12] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 35–51, 2018.
[13] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[15] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[16] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1429–1437, 2019.
[17] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
[18] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017.
[19] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[20] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[21] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4467–4477, 2017.
[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[23] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[24] Scott E. Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.
[25] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[26] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3308–3318, 2017.
[27] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[28] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
[29] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024, 2019.
[30] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[31] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, 2018.
[32] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[33] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
[34] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.