Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search
Federico A. Galatolo∗, Mario G.C.A. Cimino and Gigliola Vaglini
Department of Information Engineering, University of Pisa, 56122 Pisa, Italy
∗[email protected]

Keywords: CLIP, Generative Adversarial Networks, GPT2, Genetic Algorithms

Abstract: In this research work we present CLIP-GLaSS, a novel zero-shot framework to generate an image (or a caption) corresponding to a given caption (or image). CLIP-GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, provides similar embeddings. Differently, CLIP-GLaSS takes a caption (or an image) as an input, and generates the image (or the caption) whose CLIP embedding is the most similar to the input one. This optimal image (or caption) is produced via a generative network, after an exploration by a genetic algorithm. Promising results are shown, based on the experimentation of the image Generators BigGAN and StyleGAN2, and of the text Generator GPT2.
1 INTRODUCTION

In the last years, Generative Neural Networks showed promising results in a wide range of fields. The state of the art in the field of image generation is represented by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs are capable of generating realistic synthetic images by leveraging two competing neural networks: a Generator and a Discriminator. The objective of the Generator is to generate images capable of fooling the Discriminator. The objective of the Discriminator is to distinguish between images generated by the Generator and the original ones.

In the field of Natural Language Processing (NLP), unprecedented results have been achieved by transformer architectures (Hu, 2019). One of the most known and studied transformers is the Generative Pre-trained Transformer 2 (GPT2) (Radford et al., 2019). GPT2 has 1.5 billion parameters and was trained on a language modelling task on the texts of 8 million web pages.

Generating images from a descriptive caption has always been a challenging problem. In the last years, some architectures like StackGAN++ (Zhang et al., 2018) and AlignDRAW (Mansimov et al., 2015) have addressed this task. In this work, the image Generators BigGAN and StyleGAN2, and the text Generator GPT2, are experimented for the text-to-image and image-to-text tasks, respectively.

The paper is structured as follows. Section 2 focuses on the design of the CLIP-GLaSS framework. Experimental studies are covered by Section 3. Conclusions are drawn in Section 4. The source code of CLIP-GLaSS has been publicly released.

2 DESIGN OF THE CLIP-GLaSS FRAMEWORK

The main components of the CLIP-GLaSS framework are: (i) the CLIP network, for producing image and text embeddings; (ii) an Image or Text Generator, for generating a text/image output with a similar embedding; (iii) a Genetic Algorithm, to explore the latent space of the Generator for finding the most similar embedding between image and text. In the case of the text-to-image task, the Image Generator is based on domain-specific or mixed-domains Generative Adversarial Networks, such as Nvidia's StyleGAN2 and DeepMind's BigGAN, respectively. In the case of the image-to-text task, the Generative Pre-trained Transformer 2 (GPT2) has been used. Finally, as a Genetic Algorithm, the NSGA-II (Deb et al., 2002) has been employed, in order to solve a multi-objective optimization problem; a classical Genetic Algorithm can be employed when solving a single-objective optimization. The advantage of the Genetic Algorithm is that it is independent from the type of generative architecture, and it is characterized by a high degree of exploration. Since the genetic algorithm is considered well-known, the next subsections detail the first two components.
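To make the role of the Genetic Algorithm concrete, the following is a minimal sketch of how the latent space search could be wired as a two-objective NSGA-II problem. It assumes the pymoo library and uses placeholder objective functions (clip_similarity and discriminator_loss are illustrative stand-ins, not the actual CLIP-GLaSS implementation).

```python
import numpy as np
from pymoo.core.problem import Problem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize


def clip_similarity(z):
    # Placeholder: would return Sim(CLIP_IE(G(z)), CLIP_TE(T)) for the target text T.
    return float(np.tanh(z.sum()))


def discriminator_loss(z):
    # Placeholder: would return Loss(D(G(z)), R) for the pre-trained Discriminator.
    return float(np.abs(z).mean())


class LatentSearch(Problem):
    """Two-objective search over a 512-dimensional continuous latent space."""

    def __init__(self):
        super().__init__(n_var=512, n_obj=2, xl=-3.0, xu=3.0)

    def _evaluate(self, x, out, *args, **kwargs):
        # x has shape (pop_size, n_var); NSGA-II minimizes, so the similarity is negated.
        f1 = np.array([-clip_similarity(z) for z in x])
        f2 = np.array([discriminator_loss(z) for z in x])
        out["F"] = np.column_stack([f1, f2])


result = minimize(LatentSearch(), NSGA2(pop_size=50), ("n_gen", 500), verbose=False)
best_z = result.X  # Pareto-optimal latent vectors to feed the Generator
```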
CLIP is composed of two Neural Networks: an Image Encoder (IE) and a Text Encoder (TE). The two encoders produce similar embeddings if the image and the text contain similar visual and textual concepts. CLIP was trained using images and related snippets of text scraped from the internet, to perform a contrastive learning task. Contrastive learning consists in training a model to predict the correct similarity between data samples. CLIP was inspired by some prior work: in 2013, Richard Socher et al. trained a model to map images into feature vectors close to the semantic word vectors corresponding to their class. The model showed some zero-shot capabilities (Socher et al., 2013). Later, in 2017, Ang Li et al. trained a model, using images scraped from the internet and the related user comments, to predict the word n-grams of the comments corresponding to a given image. The resulting model was able to achieve a zero-shot 11.5% accuracy on ImageNet (Li et al., 2017). Because of the wide range of visual concepts learned from natural language via contrastive learning, CLIP shows great zero-shot capabilities. CLIP's zero-shot performance was tested on over 30 different tasks, spanning from OCR to geo-localization. The model showed competitive results against fully supervised baselines. CLIP was able to match the accuracy of the original ResNet-50 classifier on ImageNet in zero-shot, without seeing any of the 1.28 million training images.
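As an illustration of how the two encoders are used in the rest of the framework, the snippet below computes the CLIP embeddings of an image and of a caption, and their cosine similarity. It assumes the openai/CLIP Python package; the file name and the caption are only examples.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # Image Encoder + Text Encoder

# Example inputs: any image file and any descriptive caption.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize(["the face of a blonde girl with glasses"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)  # CLIP_IE output
    text_embedding = model.encode_text(tokens)   # CLIP_TE output

# Cosine similarity between the two embeddings: high when image and text match.
similarity = torch.nn.functional.cosine_similarity(image_embedding, text_embedding)
print(f"similarity = {similarity.item():.3f}")
```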
Figure 1 shows the architectural design of the CLIP-GLaSS framework for the text-to-image task. Here, the CLIP text/image embeddings are represented in light/dark blue, respectively. The similarity s of their output is sent to the Genetic Algorithm (the green box on the top) for being maximized. The Genetic Algorithm controls the input z of the image Generator, whose components are represented in orange. The image Generator is based on a Generative Adversarial Network (GAN). To clearly understand the overall behavior of the framework, the GAN is first introduced in the following.

Figure 1: Architectural design of the CLIP-GLaSS framework for the text-to-image task.
The Generative Adversarial Neural Network is a framework proposed in 2014 (Goodfellow et al., 2014), and consists of two competing neural networks: a Generator and a Discriminator. The Generator produces candidates while the Discriminator evaluates them. The Generator learns to map from a seeding space, called latent space (z), to the data space of interest, while the Discriminator distinguishes candidates produced by the Generator (d_img) from the real data (R). In Figure 1, the output of the Discriminator is evaluated in terms of a loss (l) to be minimized. During the training, the Generator objective is to increase the error rate of the Discriminator, by producing novel candidates that the Discriminator classifies as real. For this purpose, it is seeded with randomized input that is sampled from a predefined latent space (noise distribution p_z). Independent backpropagation procedures are applied to both networks, to allow the Generator to produce better images, while the Discriminator becomes more skilled at distinguishing synthetic images. For this purpose, Generator and Discriminator are simultaneously trained to perform their respective task in a zero-sum game. The Generator and the Discriminator are typically a transposed convolutional and a convolutional neural network, respectively. In the proposed architecture, pre-trained GANs have been used. Specifically, the Discriminator takes an image and provides a single scalar representing the probability that its input comes from the original data rather than from the Generator. More formally, let us assume that the Discriminator D is trained to output 1 if its input image is original, and 0 if it is synthetically generated. Then, using the Cross Entropy loss, the joint training objective function can be expressed as:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \qquad (1)$$

where $\mathbb{E}_{x \sim p}$ denotes the expectation over the probability distribution $p$.

At the end of the training process, the Generator is able to generate data indistinguishable (by the Discriminator) from the data originally provided. It has been shown that the resulting Generator is also able to give semantically significant interpolations in the domain of the training data (Wang et al., 2019). Figure 1 focuses on the search carried out by the Genetic Algorithm, over pre-trained networks. Overall, the Generator is seeded by z (according to a noise distribution p_z) and provides the related image to both the Discriminator and the CLIP image encoder (CLIP_IE). The latter is compared with the CLIP text encoding (CLIP_TE) of the target text, via a similarity function Sim. As a similarity function, the cosine similarity has been used, according to (Radford et al., 2021). One objective of the Genetic Algorithm is to provide the best z to maximize this similarity. More formally, given the target text T, the optimization problem is:

$$\max_z \; \mathrm{Sim}(\mathrm{CLIP}_{IE}(G(z)),\, \mathrm{CLIP}_{TE}(T)) \qquad (2)$$

The second objective of the Genetic Algorithm is the classification loss l of the Discriminator, which is calculated from the encoding d_img provided by the image of the Generator, and the value associated to the output of a real image (R). More formally, the optimization problem is:

$$\min_z \; \mathrm{Loss}(D(G(z)),\, R) \qquad (3)$$

After solving the optimization problem, the resulting optimal image provided by the Generator is img_opt. It is worth noticing that, when using a generative architecture without a Discriminator, the second objective is missing. As a consequence, the optimization problem is single-objective, since it considers only the CLIP embeddings similarity.
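The following sketch shows how the two objectives of Eq. (2) and Eq. (3) could be evaluated for a candidate latent vector z, assuming pre-trained generator, discriminator and clip_model objects with a PyTorch-style interface. These names, and the sigmoid applied to the Discriminator output, are assumptions made for illustration; the image resizing/normalization required by CLIP is omitted.

```python
import torch
import torch.nn.functional as F

def evaluate_candidate(z, generator, discriminator, clip_model, text_tokens):
    """Return the two objective values for a latent vector z:
    the negated CLIP similarity (Eq. 2, negated so that both are minimized)
    and the Discriminator loss against the 'real' label (Eq. 3)."""
    with torch.no_grad():
        image = generator(z)                                   # G(z)
        image_embedding = clip_model.encode_image(image)       # CLIP_IE(G(z))
        text_embedding = clip_model.encode_text(text_tokens)   # CLIP_TE(T)
        similarity = F.cosine_similarity(image_embedding, text_embedding).mean()

        d_output = torch.sigmoid(discriminator(image))         # probability of being real
        real_target = torch.ones_like(d_output)                # R: the "real image" label
        d_loss = F.binary_cross_entropy(d_output, real_target)

    return -similarity.item(), d_loss.item()
```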
Figure 2 shows the architectural design of the CLIP-GLaSS framework for the image-to-text task. Similarly to the text-to-image task, the CLIP image/text embeddings are represented in dark/light blue, respectively. The similarity s of their output is sent to the Genetic Algorithm for being maximized. The Genetic Algorithm controls the seed z of the text Generator, represented in orange. The state of the art for text generators is based on transformers, such as XLNet (Yang et al., 2019), the Conditional Transformer Language Model (CTRL) (Keskar et al., 2019) and the Generative Pre-trained Transformer 2 (GPT2) (Radford et al., 2019). As an example, in this paper GPT2 has been used.

Figure 2: Architectural design of the CLIP-GLaSS framework for the image-to-text task.

More specifically, GPT2 is a transformer model developed by OpenAI as an improved version of the Generative Pre-trained Transformer (GPT). The transformer architecture was introduced in 2017, and it is used mainly for solving NLP tasks. Although transformers are built to work with temporal data (like Recurrent Neural Networks or Long Short-Term Memory Neural Networks), they do not require that temporal data be processed sequentially. Hence, transformers allow fast large-scale parallel training. Transformers are composed of a stack of encoders and decoders, and heavily rely on a mechanism called attention, which is able to highlight important information while discarding the non-important one. The training data for GPT2 was scraped from the internet using a custom web scraper that emphasizes document quality using some meta-heuristics. The resulting dataset contains text from over 45 million links and consists of almost 40 GB of text. GPT2 was trained using a Model-Agnostic Meta-Learning (MAML) method (Finn et al., 2017), and tested with a zero-shot approach on a variety of tasks: text generation, translation, summarization, question answering, etc. It was able to match and even surpass the state of the art of fully supervised models in zero-shot.

It is important to notice that GPT2 does not use a human-like alphabet, but an alphabet of tokens generated using the Byte Pair Encoding (BPE) (Sennrich et al., 2015). GPT2 can be used as a generative architecture by setting context tokens, and using them to predict the next token, until the stop-sentence token is predicted or a target text length is reached. We will refer to the input context tokens as its latent space.

In Figure 2, the Generator is seeded by z, to provide an output text accordingly. The output text is used to feed the CLIP text encoder (CLIP_TE). Finally, the similarity between the resulting text embedding (CLIP_TE) and the CLIP image embedding (CLIP_IE) is computed. The optimization problem of the Genetic Algorithm is to maximize this similarity. After solving the optimization problem, the resulting optimal text generated by GPT2 is text_opt.
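To make the text Generator concrete, the sketch below decodes one point of this latent space (a short sequence of BPE token ids, followed by a fixed textual context) into a caption. It assumes the Hugging Face transformers package; the token ids and the fixed context string are purely illustrative.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# One candidate of the latent space: integer BPE token ids chosen by the Genetic
# Algorithm (illustrative values), followed by the tokens of a fixed context string.
context_ids = [464, 3797, 318]                    # hypothetical GA-controlled tokens
prefix_ids = tokenizer.encode("the picture of")   # fixed context tokens
input_ids = torch.tensor([context_ids + prefix_ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + 20,  # stop after a target text length
        do_sample=False,                      # greedy decoding of the continuation
        pad_token_id=tokenizer.eos_token_id,
    )

# The caption is the fixed context plus the generated continuation.
caption = tokenizer.decode(output[0][len(context_ids):], skip_special_tokens=True)
print(caption)
```

The caption obtained in this way is then fed to the CLIP text encoder, and its similarity to the CLIP image embedding is used as the fitness of the candidate.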
3 EXPERIMENTAL STUDIES

The CLIP-GLaSS framework has been implemented and publicly released as an open source GitHub repository (Galatolo, 2021), along with an in-browser demonstration. Experiments have been carried out on an Intel Core i9-9900K CPU and a GeForce RTX 2080 GPU. Given the input image/caption, 500 generations are executed by the Genetic Algorithm to find the optimal caption/image. The order of magnitude of the processing time of an input is 5-10 minutes, depending on the generative architecture used. In this section, some pilot experiments of CLIP-GLaSS are considered, to show its generation capabilities. Each output is assessed in terms of quality, i.e., the absence of artifacts, and relevance, i.e., the absence of unexpected elements, evaluated with the naked eye as low, medium, or high.
Different state-of-the-art pre-trained networks have been used as Generator/Discriminator. In particular, two GANs have been used: DeepMind's BigGAN (Brock et al., 2018) and Nvidia's StyleGAN2 (Karras et al., 2020). The original BigGAN, trained on the ImageNet dataset (Russakovsky et al., 2015), has been considered. Three versions of StyleGAN2 have been used: (i) StyleGAN2-face, which is trained on Flickr-Faces-HQ (Karras et al., 2019); (ii) StyleGAN2-church, trained on a subset of LSUN (Yu et al., 2015) made of church buildings; (iii) StyleGAN2-car, trained on a subset of LSUN made of cars. Only the BigGAN Generator has been publicly released: for this case, a single-objective genetic algorithm is used when optimizing its latent space. Differently, Nvidia released both the StyleGAN2 Generator and Discriminator.

The BigGAN latent space z is composed of 1000 booleans, representing the 1000 ImageNet classes, and of 128 real numbers meant to be sampled from a truncated normal distribution in the range [-2, 2]. When optimizing its latent space, mixed genetic operators are employed to correctly perform the initialization, mutation and crossover operations. The StyleGAN2 latent space z is made of 512 real numbers meant to be sampled from a normal distribution. A minimal sketch of how such latent individuals can be initialized is given below.
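The sketch illustrates one way the two latent spaces described above could be initialized before being refined by the genetic operators; the exact encoding used by CLIP-GLaSS (in particular how the class flags and the continuous part are packed together) is an assumption.

```python
import numpy as np
from scipy.stats import truncnorm

def init_biggan_individual():
    """One BigGAN individual: 1000 boolean class flags plus 128 reals
    drawn from a truncated normal in [-2, 2], packed into one flat vector."""
    class_flags = np.zeros(1000, dtype=bool)
    class_flags[np.random.randint(0, 1000)] = True   # start from a random ImageNet class
    continuous_part = truncnorm.rvs(-2, 2, size=128)
    return np.concatenate([class_flags.astype(float), continuous_part])

def init_stylegan2_individual():
    """One StyleGAN2 individual: 512 reals from a standard normal distribution."""
    return np.random.standard_normal(512)
```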
Figure 3 shows three representative examples of input caption and related output image generated via StyleGAN2-face. Since this GAN is specialized on faces, the overall result is very good: the quality and the relevance of all images are high, except for image (b), whose relevance is medium due to the blonde hairs on the bottom. Figure 4 shows three representative examples of input caption and related output image generated via StyleGAN2-car. Although this GAN is specialized on cars, the quality of all images is medium, due to the presence of artifacts. The relevance is high for (a) and (b), but medium for (c), because the "intersection" is not visible. Figure 5 shows three representative examples of input caption and related output image generated via StyleGAN2-church. This GAN is specialized on images of church buildings. Indeed, the image relevance is high, and the quality is almost high, due to the presence of minor artifacts. The examples clearly show that CLIP-GLaSS is able to combine the right image elements for matching the target caption, when using input texts that belong to the same domain of the generative network.

Differently from the previous cases, Figure 6 shows the output images generated via StyleGAN2 on three domains (face, car, and church). To assess the role of the Discriminator, the optimization is also performed without it. In Figure 6, the images produced without Discriminator have a final asterisk in the caption. Specifically, by using the input text "the face of a blonde girl with glasses", StyleGAN2-face achieves high quality and relevance for (a) and (d), as expected. On the other side, the low performance of StyleGAN2-car and StyleGAN2-church is apparent. However, it is worth noting that in Figure 6 (f) the picture resembles a face with glasses, generated via two windows on a building. Figure 7 and Figure 8 show the output images generated via StyleGAN2, with and without Discriminator, for the input captions "a blue car in the snow" and "a gothic church in the city", respectively. Not surprisingly, the GAN that is specialized in the category of interest (face, car, church) provides the best relevance, but a medium-low quality, due to the presence of artifacts.

Overall, the resulting images resemble the target caption, but the medium-low quality of the images suggests that, to correctly perform this task, a bigger Generator (i.e., with more parameters) trained on a wider variety of images is needed. In other words, the Generator cannot create images that do not belong to its training domain. In general, it is apparent that the Discriminator guarantees a better quality of the output images. Interestingly, when the Discriminator is not used, even if the target text is outside of the Generator domain, the Generator is able to produce images that somehow resemble the target text. For example, Figure 7 (d) is generated via StyleGAN2-face with the input caption "a blue car in the snow": it represents the face of a man in the snow with a blue sweater. Another example is Figure 7 (f), generated by StyleGAN2-church without Discriminator: it represents a church with a blue blob whose shape resembles a car. Figure 8 shows examples generated by the three domain-specific StyleGAN2 networks via the caption "a gothic church in the city". Specifically, StyleGAN2-car without Discriminator generates an image with a background building that resembles the gothic style. Finally, the CLIP-GLaSS framework has been experimented using BigGAN, a large scale GAN trained on ImageNet, i.e., with images of multiple domains. Figure 9 shows three captions and the related images generated by BigGAN. Although the overall quality is medium, for the presence of artifacts, the relevance is high.

A second set of experiments focuses on some representative examples of the caption generation capabilities of CLIP-GLaSS. Specifically, 20 context tokens have been used as the GPT2 latent space z, to which three fixed tokens have been concatenated, representing the static context "the picture of". The latent space of the context tokens is an integer space of numbers ranging from 0 to 50257, which is the BPE vocabulary size. Indeed, GPT2 adopts a subword-level vocabulary: it does not predict the next word but the next subword token. When optimizing the GPT2 latent space, integer genetic operators have been used to perform the initialization, mutation and crossover operations. Figure 10 shows nine input images randomly extracted from ImageNet, with the related captions generated via GPT2. The results clearly show the potential of CLIP-GLaSS for caption generation. Most of the captions have both high quality and high relevance. Some captions, e.g., (a), (c), and (h), have high relevance and medium quality, because of textual artifacts. For example, caption (a), "the picture of a dog that has a high fatality rate", can be related to the fact that the dog is lying on the ground; caption (h), "the picture of the world's first 'safe' phone", can be related to the fact that the phone in the picture resembles a safe.

4 CONCLUSIONS

In this research work we have introduced CLIP-GLaSS, a zero-shot framework which takes an input caption and generates a corresponding image, and vice versa. CLIP-GLaSS is based on the CLIP neural network, for generating close embeddings of semantically close texts or images, a Generator network, for controlling the respective generation of images or texts, and a Genetic Algorithm, to explore the Generator latent space to find the best image or text.
The design choices are first detailed, and then a set of pilot experiments is discussed, using the generative networks BigGAN, StyleGAN2 and GPT2. Results show the high potential of the proposed framework, in terms of quality and relevance of the output image or text, encouraging further comparative research. The source code has been publicly released, along with an in-browser demonstration.
ACKNOWLEDGEMENT
Work partially supported by the Italian Ministry of Education and Research (MIUR) in the framework of the CrossLab project (Departments of Excellence).
REFERENCES
Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR.

Galatolo, F. A. (2021). CLIP-GLaSS repository on GitHub, https://github.com/galatolofederico/clip-glass.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.

Hu, D. (2019). An introductory survey on attention mechanisms in NLP problems. In Proceedings of SAI Intelligent Systems Conference, pages 432–448. Springer.

Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119.

Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Li, A., Jabri, A., Joulin, A., and van der Maaten, L. (2017). Learning visual n-grams from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 4183–4192.

Mansimov, E., Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2015). Generating images from captions with attention. arXiv preprint arXiv:1511.02793.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision.

Radford, A., Wu, J., Amodei, D., Amodei, D., Clark, J., Brundage, M., and Sutskever, I. (2019). Better language models and their implications. OpenAI Blog, https://openai.com/blog/better-language-models.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Sennrich, R., Haddow, B., and Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through cross-modal transfer. arXiv preprint arXiv:1301.3666.

Wang, Z., She, Q., and Ward, T. E. (2019). Generative adversarial networks in computer vision: A survey and taxonomy. arXiv preprint arXiv:1906.01529.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. (2015). LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. N. (2018). StackGAN++: Realistic image synthesis with stacked generative adversarial networks.