StyleGAN2 Distillation for Feed-forward Image Manipulation
Yuri Viazovetskyi*, Vladimir Ivashkin*, and Evgeny Kashin*
Yandex
Moscow Institute of Physics and Technology
{iviazovetskyi,vlivashkin,evgenykashin}@yandex-team.ru

Fig. 1: Image manipulation examples generated by our method from (a) a source image sampled from CelebA-HQ: (b) gender swap at 1024x1024 and (c) style mixing at 512x512. Samples are generated feed-forward; the StyleGAN2 which we distilled was trained on FFHQ.
Abstract.
StyleGAN2 is a state-of-the-art network for generating realistic images. Moreover, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors. Editing existing images requires embedding a given image into the latent space of StyleGAN2. Latent code optimization via backpropagation is commonly used for qualitative embedding of real world images, although it is prohibitively slow for many applications. We propose a way to distill a particular image manipulation of StyleGAN2 into an image-to-image network trained in a paired way. The resulting pipeline is an alternative to existing GANs trained on unpaired data. We provide results for transformations of human faces: gender swap, aging/rejuvenation, style transfer, and image morphing. We show that the quality of generation using our method is comparable to StyleGAN2 backpropagation and current state-of-the-art methods in these particular tasks.
Keywords:
Computer Vision, StyleGAN2, distillation, synthetic data

* equal contribution
Introduction

Generative adversarial networks (GANs) [18] have created wide opportunities in image manipulation. The general public is familiar with them from the many applications which offer to change one's face in some way: make it older or younger, add glasses, a beard, etc.

There are two types of network architecture which can perform such translations feed-forward: neural networks trained on either paired or unpaired datasets. In practice, only unpaired datasets are used. The methods used there are based on cycle consistency [61]. The follow-up studies [24,11,12] have a maximum resolution of 256x256. At the same time, existing paired methods (e.g. pix2pixHD [55] or SPADE [42]) support resolution up to 2048x1024. But it is very difficult or even impossible to collect a paired dataset for such tasks as age manipulation. For each person, such a dataset would have to contain photos made at different ages, with the same head position and facial expression. Close examples of such datasets exist, e.g. CACD [8] and AgeDB [40], although with different expressions and face orientations. To the best of our knowledge, they have never been used to train neural networks in a paired mode.

These obstacles can be overcome by making a synthetic paired dataset, if we solve two known issues concerning dataset generation: the appearance gap [22] and the content gap [28]. Here, unconditional generation methods, like StyleGAN [30], can be of use. StyleGAN generates images of quality close to real world images and with a distribution close to the real one, according to its low FID results. Thus the output of this generative model can be a good substitute for real world images. The properties of its latent space allow creating sets of images differing in particular parameters. The addition of path length regularization (introduced as a measure of quality in [30]) in the second version of StyleGAN [31] makes the latent space even more suitable for manipulations.

Basic operations in the latent space correspond to particular image manipulation operations. Adding a vector, linear interpolation, and crossover in latent space lead to expression transfer, morphing, and style transfer, respectively. The distinctive feature of both versions of the StyleGAN architecture is that the latent code is applied several times at different layers of the network. Changing the vector for some layers will lead to changes at different scales of the generated image. The authors group spatial resolutions in the process of generation into coarse, middle, and fine ones. It is possible to combine two people by using one person's code at one scale and the other person's at another.

The operations mentioned above are easily performed for images with known embeddings. For many entertainment purposes it is vital to manipulate some existing real world image on the fly, e.g. to edit a photo which has just been taken. Unfortunately, in all the cases of successful search in latent space described in the literature, the backpropagation method was used [1,2,16,31,47]. Feed-forward is only reported to work as an initial state for latent code optimization [5]. Slow inference makes application of image manipulation with StyleGAN2 in production very limited: it costs a lot in a data center and is almost impossible to run on a device. However, there are examples of backpropagation run in production, e.g. [48].

In this paper we consider opportunities to distill [21,4] a particular image manipulation of the StyleGAN2 generator, trained on the FFHQ dataset.
The distillation allows extracting the information about faces' appearance and the ways they can change (e.g. aging, gender swap) from StyleGAN into an image-to-image network. We propose a way to generate a paired dataset and then train a "student" network on the gathered data. This method is very flexible and is not limited to a particular image-to-image model.

Although the resulting image-to-image network is trained only on generated samples, we show that it performs on real world images on par with StyleGAN backpropagation and current state-of-the-art algorithms trained on unpaired data.

Our contributions are summarized as follows:
- We create synthetic datasets of paired images to solve several tasks of image manipulation on human faces: gender swap, aging/rejuvenation, style transfer, and face morphing;
- We show that it is possible to train an image-to-image network on synthetic data and then apply it to real world images;
- We study the qualitative and quantitative performance of image-to-image networks trained on the synthetic datasets;
- We show that our approach outperforms existing approaches in the gender swap task.

We publish all collected paired datasets for reproducibility and future research: https://github.com/EvgenyKashin/stylegan2-distillation

Related work

Unconditional image generation
Following the success of ProgressiveGAN [29] and BigGAN [6], StyleGAN [30] became the state-of-the-art image generation model. This was achieved by rethinking the generator architecture and borrowing approaches from style transfer networks: a mapping network and AdaIN [23], constant input, noise addition, and mixing regularization. The next version of StyleGAN, StyleGAN2 [31], gets rid of artifacts of the first version by revising AdaIN, and improves disentanglement by using perceptual path length as a regularizer.

The mapping network is a key component of StyleGAN which allows transforming the latent space Z into a less entangled intermediate latent space W. Instead of the actual latent z ∈ Z sampled from a normal distribution, w ∈ W resulting from the mapping network f: Z → W is fed to AdaIN. It is also possible to sample vectors from the extended space W+, which consists of multiple independent samples of W, one for each layer of the generator. Varying w at different layers will change details of the generated picture at different scales.
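To make the spaces concrete, here is a minimal NumPy sketch of the relationship between Z, W, and W+. The mapping network is stood in for by an untrained single-layer MLP (a placeholder, not the actual StyleGAN2 weights); 18 layers and 512 dimensions match the 1024x1024 StyleGAN2 configuration.

```python
import numpy as np

N_LAYERS = 18  # style inputs of a 1024x1024 StyleGAN2 generator
DIM = 512      # dimensionality of both Z and W

def mlp(z):
    """Stand-in for the 8-layer mapping network f: Z -> W (random weights)."""
    rng = np.random.RandomState(0)
    w_mat = rng.randn(DIM, DIM) / np.sqrt(DIM)
    return np.tanh(z @ w_mat)

z = np.random.randn(DIM)            # z in Z, sampled from N(0, I)
w = mlp(z)                          # w in W, the less entangled code
w_plus = np.tile(w, (N_LAYERS, 1))  # a W+ code: one w per generator layer
```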
Latent codes manipulation
It was recently shown [17,27] that linear operations in the latent space of a generator allow successful image manipulations in a variety of domains and with various GAN architectures. In GANalyze [17], the attention is directed to searching for interpretable directions in the latent space of BigGAN [6] using MemNet [32] as an "assessor" network. Jahanian et al. [27] show that a walk in latent space leads to interpretable changes in different model architectures: BigGAN, StyleGAN, and DCGAN [43].

To manipulate real images in the latent space of StyleGAN, one needs to find their embeddings in it. The method of searching for the embedding in the intermediate latent space via backprop optimization is described in [1,2,16,47]. The authors use non-trivial loss functions to find an image that is both close and perceptually good, and show that the embedding fits better in the extended space W+. Gabbay et al. [16] show that the StyleGAN generator can be used as a general purpose image prior. Shen et al. [47] show the opportunity to manipulate the appearance of a generated person, including age, gender, eyeglasses, and pose, for both PGGAN [29] and StyleGAN. The authors of StyleGAN2 [31] propose to search for embeddings in W instead of W+ to check whether a picture was generated by StyleGAN2.
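For reference, the three linear operations used throughout this paper, written as a sketch over W+ codes shaped like the one above; the crossover split point of 8 is an illustrative choice, not a value from the paper.

```python
import numpy as np

def add_direction(w_plus, delta, alpha=1.0):
    """Attribute edit: shift the code along a direction found in W."""
    return w_plus + alpha * delta          # delta broadcasts over layers

def interpolate(w_plus_a, w_plus_b, t=0.5):
    """Morphing: linear interpolation between two codes."""
    return (1.0 - t) * w_plus_a + t * w_plus_b

def crossover(w_plus_a, w_plus_b, split=8):
    """Style mixing: take the first layers from A, the rest from B."""
    return np.concatenate([w_plus_a[:split], w_plus_b[split:]], axis=0)
```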
Paired image-to-image translation

Pix2pix [26] is one of the first conditional generative models applied to image-to-image translation. It learns a mapping from input to output images. Chen and Koltun [9] propose the first model which can synthesize 2048x1024 images. It is followed by pix2pixHD [55] and SPADE [42]. In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations, so its usage is limited to translation from segmentation maps. There are numerous follow-up works based on the pix2pixHD architecture, including those working with video [7,53,54].
Unpaired image-to-image translation
The idea of applying cycle consistency to train on unpaired data was first introduced in CycleGAN [61]. The methods of unpaired image-to-image translation can be either single mode GANs [61,59,36,11] or multimodal GANs [62,24,33,34,37,12]. FUNIT [37] supports multi-domain image translation using a few reference images from a target domain. StarGAN v2 [12] provides both latent-guided and reference-guided synthesis. All of the above-mentioned methods operate at a resolution of at most 256x256 when applied to human faces.

Gender swap is one of the well-known tasks of unsupervised image-to-image translation [11,12,38]. Face aging/rejuvenation is a special task which gets a lot of attention [60,50,19]. The formulation of the problem can vary. The simplest version of this task is making faces look older or younger [11]. A more difficult task is to produce faces matching particular age intervals [35,56,58,38]. S2GAN [19] proposes continuous changing of age using weight interpolation between the transforms which correspond to the two closest age groups.
Training on synthetic data
Synthetic datasets are widely used to extend datasets for some analysis tasks (e.g. classification). In many cases, a simple graphical engine can be used to generate synthetic data. To perform well on real world images, this data needs to overcome both the appearance gap [22,15,51,52,49] and the content gap [28,46].

Ravuri et al. [44] study the quality of a classifier trained on synthetic data generated by BigGAN and show [45] that BigGAN does not capture the ImageNet [14] data distributions and is only partly successful for data augmentation. Shrivastava et al. [49] reduce the quality drop of this approach by revising the training setup. Chen et al. [10] make a paired dataset with image editing applications to train an image-to-image network.

Synthetic data is what underlies knowledge distillation, a technique that allows training a "student" network using data generated by a "teacher" network [21,4]. This additional source of data can be used to improve measures [57] or to reduce the size of the target model [39]. Aguinaldo et al. [3] show that knowledge distillation is successfully applicable to generative models.
Method

Fig. 2: Method of finding correspondence between latent codes and facial attributes

All of the images used in our datasets are generated using the official implementation of StyleGAN2 (https://github.com/NVlabs/stylegan2). In addition to that, we only use the config-f version checkpoint pretrained by the authors of StyleGAN2 on the FFHQ dataset. All the manipulations are performed with the disentangled image codes w.
We use the most straightforward way of generating datasets for style mixing and face morphing. Style mixing is described in [30] as a regularization technique and requires using two intermediate latent codes w_1 and w_2 at different scales. Face morphing corresponds to linear interpolation of intermediate latent codes w. We generate 50,000 samples for each task. Each sample consists of two source images and a target image. Each source image is obtained by randomly sampling z from a normal distribution, mapping it to an intermediate latent code w, and generating the image g(w) with StyleGAN2. We produce the target image by performing the corresponding operation on the latent codes and feeding the result to StyleGAN2.

Face attributes, such as gender or age, are not explicitly encoded in the StyleGAN2 latent space or intermediate space. To overcome this limitation we use a separate pretrained face classification network. Its outputs include the confidence of face detection, age bin, and gender. The network is proprietary, therefore we release the final version of our gender and age datasets in order to maintain full reproducibility of this work (https://github.com/EvgenyKashin/stylegan2-distillation).

We create the gender and age datasets in four major steps. First, we generate an intermediate dataset, mapping latent vectors to target attributes as illustrated in Fig. 2. Second, we find the direction in latent space associated with the attribute. Third, we generate a raw dataset, using the above-mentioned vector, as briefly described in Fig. 3. Finally, we filter the images to get the final dataset. The method is described below in more detail (a code sketch of the core steps follows the list).

1. Generate random latent vectors z_1 ... z_n, map them to intermediate latent codes w_1 ... w_n, and generate the corresponding image samples g(w_i) with StyleGAN2.
2. Get attribute predictions from the pretrained neural network f: c(w_i) = f(g(w_i)).
3. Filter out images where faces were detected with low confidence, then select only images with high classification certainty. (This helps to reduce generation artifacts in the dataset, while maintaining high variability, as opposed to lowering the truncation-psi parameter.)
4. Find the center of every class, C_k = (1/n_k) Σ_{c(w_i)=k} w_i, where n_k is the number of samples of class k, and the transition vectors from one class to another, Δ_{c_i,c_j} = C_j − C_i.
5. Generate random samples z_i and pass them through the mapping network. For the gender swap task, create a set of five images: g(w − Δ), g(w − Δ/2), g(w), g(w + Δ/2), g(w + Δ). For aging/rejuvenation, first predict the faces' attributes c(w_i), then use the corresponding vectors Δ_{c(w_i)} to generate faces that should be two bins older/younger.
6. Get predictions for every image in the raw dataset. Filter out by confidence.
7. From every set of images, select a pair based on classification results. Each image must belong to the corresponding class with high certainty.

As soon as we have aligned data, a paired image-to-image translation network can be trained.
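A compressed sketch of steps 1-5 above. The mapping(), generate(), and classify() helpers, as well as the 0.95 confidence threshold, are hypothetical stand-ins for the StyleGAN2 networks and the proprietary classifier, used only to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

# Hypothetical stand-ins for the real pipeline components.
def mapping(z):   return np.tanh(z)                       # z -> w
def generate(w):  return w                                # w -> "image"
def classify(im): return rng.random(), int(im.sum() > 0)  # -> (conf, class)

# Steps 1-2: sample latents and predict attributes.
ws = np.stack([mapping(z) for z in rng.standard_normal((10_000, DIM))])
preds = [classify(generate(w)) for w in ws]

# Steps 3-4: keep confident samples, compute class centers C_k and the
# transition vector delta = C_1 - C_0 (e.g. female -> male).
keep = [i for i, (conf, _) in enumerate(preds) if conf > 0.95]
centers = {k: ws[[i for i in keep if preds[i][1] == k]].mean(axis=0)
           for k in {preds[i][1] for i in keep}}
delta = centers[1] - centers[0]

# Step 5: for a fresh sample, render five images along the direction.
w = mapping(rng.standard_normal(DIM))
five = [generate(w + a * delta) for a in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```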
Fig. 3: Dataset generation. We first sample random vectors z from a normal distribution. Then for each z we generate a set of images along the vector Δ corresponding to a facial attribute. Then for each set of images we select the best pair based on classification results.

Training image-to-image network

In this work, we focus on illustrating the general approach rather than solving every task as well as possible. As a result, we choose to train pix2pixHD [55] (https://github.com/NVIDIA/pix2pixHD) as a unified framework for image-to-image translation instead of selecting a custom model for every type of task.

It is known that pix2pixHD has blob artifacts (https://github.com/NVIDIA/pix2pixHD/issues/46) and also tends to repeat patterns [42]. The problem with repeated patterns is solved in [30,42]. Light blobs are a problem which is solved in StyleGAN2; we suppose that a similar treatment could also be used for pix2pixHD. Fortunately, even vanilla pix2pixHD trained on our datasets produces sufficiently good results with little or no artifacts. Thus, we leave improving or replacing pix2pixHD for future work. We make most of our experiments and comparisons at 512x512 resolution, but also try 1024x1024 for gender swap.

Style mixing and face averaging tasks require two input images to be fed to the network at the same time. This is done by setting the number of input channels to 6 and concatenating the inputs along the channel axis, as in the sketch below.
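A minimal PyTorch sketch of this two-input setup. The single convolution here is only a placeholder for a pix2pixHD-style generator built with six input channels (pix2pixHD exposes this via its input_nc option); shapes are the point of the example.

```python
import torch
import torch.nn as nn

# Placeholder for a pix2pixHD generator constructed with input_nc=6;
# the real model is a coarse-to-fine generator, not a single conv.
netG = nn.Conv2d(in_channels=6, out_channels=3, kernel_size=3, padding=1)

source_a = torch.randn(1, 3, 512, 512)   # first input face
source_b = torch.randn(1, 3, 512, 512)   # second input face

# Two-input tasks (style mixing, morphing): concatenate along channels.
x = torch.cat([source_a, source_b], dim=1)  # shape (1, 6, 512, 512)
y = netG(x)                                 # target image, (1, 3, 512, 512)
```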
Experiments

Although StyleGAN2 can be trained on data of a different nature, we concentrate our efforts only on face data. We show the application of our method to several tasks: gender swap, aging/rejuvenation, style mixing, and face morphing. In all our experiments we collect data from StyleGAN2 trained on the FFHQ dataset [30].

Only the task of gender transformation (in both directions) is used for evaluation. We use the Fréchet inception distance (FID) [20] for quantitative comparison of methods, as well as human evaluation.

For each feed-forward baseline we calculate FID between 50,000 real images from the FFHQ dataset and 20,000 generated images, using 20,000 images from FFHQ as source images. For each source image we apply the transformation to the other gender, assuming the source gender is determined by our classification model. Before calculating FID, all images are resized to 256x256 for fair comparison (a code sketch of this protocol follows the survey descriptions below).

Human evaluation is also used for a more accurate comparison with the optimization-based methods. Our study consists of two surveys:
1. Quality. Task for the female-to-male translation (the male-to-female one is similar): "For the same image on the left, there are two different options on the right. Choose the best face, which is: turned into a male (most important), similar to the original person, with the position of the face and emotions preserved, and with the original items in the photo preserved."
2. Realism. In this task, the sources are different and not shown: "Choose the image which is: more realistic (most important), better in quality, with fewer artifacts."

All images were resized to 512x512 in this comparison. The first task should show which method is best at performing the transformation; the second, which method looks the most real regardless of the source image. We use side-by-side experiments for both tasks, where one side is our method and the other side is one of the optimization-based baselines. Answer choices are shuffled. For each comparison of our method with a baseline, we generate 1000 questions, and each question is answered by 10 different people. For answer aggregation we use the Dawid-Skene method [13] and filter out the examples with a confidence level less than 95% (approximately 4% of all questions).
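A sketch of the FID protocol above, assuming the transformed and real images have already been dumped to folders. It relies on the third-party pytorch-fid package; the function name follows its public API, but treat the exact call as an assumption.

```python
from pathlib import Path

import torch
from PIL import Image
from pytorch_fid.fid_score import calculate_fid_given_paths

def resize_folder(src: str, dst: str, size: int = 256) -> None:
    """Resize all images to the common 256x256 used for fair comparison."""
    Path(dst).mkdir(parents=True, exist_ok=True)
    for p in Path(src).glob("*.png"):
        Image.open(p).resize((size, size), Image.LANCZOS).save(Path(dst) / p.name)

resize_folder("ffhq_real_50k", "real_256")        # hypothetical folder names
resize_folder("gender_swapped_20k", "fake_256")

device = "cuda" if torch.cuda.is_available() else "cpu"
fid = calculate_fid_given_paths(["real_256", "fake_256"],
                                batch_size=50, device=device, dims=2048)
print(f"FID: {fid:.1f}")
```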
Gender swap

We generate a paired dataset for male and female faces according to the method described above and then train a separate pix2pixHD model for each direction of the gender translation.

We compare with both unpaired image-to-image methods and different StyleGAN embedders with latent code optimization. For the comparison with unpaired methods we choose StarGAN [11] (https://github.com/yunjey/stargan), MUNIT [25] (https://github.com/NVlabs/MUNIT), and StarGAN v2* [12] (https://github.com/taki0112/StarGAN_v2-Tensorflow, an unofficial implementation, so its results may differ from the official one). We train all these methods on FFHQ classified into males and females.
Fig. 4: Gender transformation, (a) female to male and (b) male to female: comparison with image-to-image translation approaches. MUNIT and StarGAN v2* are multimodal, so we show one random realization for each.

Fig. 4 shows a qualitative comparison between our approach and the unpaired image-to-image ones. It demonstrates that the distilled transformation has significantly better visual quality and more stable results. The quantitative comparison in Table 1a confirms our observations. We also checked that our model performs well on other datasets without retraining: Table 1b shows a comparison of gender swap on CelebA-HQ images against models trained on CelebA. Our model wins despite having seen no CelebA samples during training. The results indicate that the method can potentially be applied to real world images without retraining.

Table 1: Quantitative comparison with unpaired methods. The unpaired methods are trained on the same datasets on which we evaluate them, while ours is trained on FFHQ in both cases. Table 1b shows that our method is robust with respect to the dataset.

(a) Evaluated on FFHQ
Method            FID
StarGAN [11]      29.7
MUNIT [24]        40.2
StarGANv2* [12]   25.6
Ours
Real images       3.3

(b) Evaluated on CelebA-HQ
Method                                       FID
StarGANv2 [12] (official model and weights)
Ours

StyleGAN2 provides an official projection method. This method operates in W, which only allows finding faces generated by the model itself, not real world images. So we also build a similar method for W+ for comparison. It optimizes a separate w for each layer of the generator, which helps to better reconstruct a given image. After finding w, we can add the transformation vector described above and generate a transformed image (a sketch of this procedure is given after Fig. 5 below).

We also add the projection methods for finding the latent code made by Dmitry Nikitko (Puzer) [41] and Peter Baylies (pbaylies) [5] to the comparison, even though they are based on the first version of StyleGAN. These encoders are the best-known implementations; they use custom perceptual losses for better perceptual quality. The StyleGAN encoder by Peter Baylies is the more advanced one: in addition to more precisely selected loss functions, it uses background masking and a feed-forward approximation of the optimization starting point.

Since the unpaired methods show significantly worse quality, we put more effort into comparisons between the different methods of searching for an embedding through optimization. We avoid using FID-based evaluation here because all of these methods are based on the same StyleGAN model. Also, FID cannot measure the "quality of transformation" because it does not check whether the personality is preserved. So we decided to make the user study our main measure for all StyleGAN-based methods. Fig. 5 shows a qualitative comparison of all the methods. It is visible that our method performs better in terms of transformation quality. Only the StyleGAN Encoder [5] outperforms our method in realism; however, this method generates the background unconditionally.

Table 2: User study of StyleGAN-based approaches. Win rate "method vs ours". We run a user study for all StyleGAN-based approaches because we consider human evaluation a more reliable measure of perception.
Method                        Quality   Realism
StyleGAN Encoder (Nikitko)    18%       14%
StyleGAN Encoder (Baylies)    30%       68%
StyleGAN2 projection (W)      22%       22%
StyleGAN2 projection (W+)     11%       14%
Real images                   -         85%

We find that pix2pixHD keeps more details on transformed images than all the encoders. We suppose that this is achieved due to the ability of pix2pixHD to pass part of the unchanged content through the network. Pix2pixHD solves an easier task compared to the encoders, which are forced to encode all the information about the image in one vector.

Fig. 5: Gender transformation (female to male and male to female): comparison with StyleGAN2 latent code optimization methods (StyleGAN Encoder (Puzer), StyleGAN Encoder (pbaylies), StyleGAN2 projection in W and in W+, and ours). Input samples are real images from FFHQ. Notice that unusual objects are lost with optimization but kept with image-to-image translation.
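For reference, a minimal sketch of the W+ projection baseline described above, assuming a differentiable generator g that maps a (1, n_layers, 512) code to an image, an LPIPS-like perceptual loss, and a precomputed average latent w_avg; all are placeholders, and the official projector additionally uses noise regularization, which is omitted here.

```python
import torch

def project_w_plus(g, perceptual, target, w_avg, n_layers=18, steps=1000):
    """Optimize a separate w for every generator layer (W+) to fit `target`."""
    # Start from the average latent, one independent copy per layer.
    w_plus = w_avg.repeat(1, n_layers, 1).clone().requires_grad_(True)
    opt = torch.optim.Adam([w_plus], lr=0.01)  # lr and loss weights are illustrative
    for _ in range(steps):
        img = g(w_plus)
        loss = perceptual(img, target) + 0.1 * (img - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_plus.detach()

# Once projected, the edit is the same linear shift as on synthetic faces:
# edited = g(project_w_plus(g, perceptual, photo, w_avg) + delta)
```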
Figs. 4 and 5 also show drawbacks of our approach. The "gender" vector is not perfectly disentangled, due to some bias in the attribute distribution of FFHQ and, consequently, latent space correlation of StyleGAN [47]. For example, it can be seen that translation into female faces can also add a smile. We also encounter problems of the pix2pixHD architecture: repeated patterns, light blobs, and difficulties with fine-tuning at 1024x1024 resolution. We show an uncurated list of generated images in the supplementary materials.
Aging/rejuvenation
To show that our approach can be applied to another image-to-image transform task, we also carry out a similar experiment with face age manipulation. First, we estimate the age for all generated images, then group them into several bins. After that, for each bin we find the vectors of "+2 bins" and "−2 bins". Using these vectors, we generate a united paired dataset. Each pair contains younger and older versions of the same face. Finally, we train two pix2pixHD networks, one for each of the two directions. Examples of the application of this approach are presented in Fig. 6 (a code sketch of the per-bin vectors follows the figure).
Fig. 6: Aging/rejuvenation. Source images are sampled from FFHQ.
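A sketch of the per-bin construction described above. The number of bins and the random stand-ins for the latent codes and predicted ages are illustrative assumptions; in the real pipeline ws and bins come from the dataset generation step.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 7  # illustrative number of age bins, not a value from the paper

# Stand-ins for the dataset-generation outputs: latent codes and the
# age bin predicted for each generated face.
ws = rng.standard_normal((10_000, 512))
bins = rng.integers(0, N_BINS, size=10_000)

centers = {b: ws[bins == b].mean(axis=0) for b in range(N_BINS)}

def age_delta(b, shift=2):
    """Vector moving a bin-b face `shift` bins older (clamped at the ends)."""
    dst = min(max(b + shift, 0), N_BINS - 1)
    return centers[dst] - centers[b]

older_w = ws[0] + age_delta(bins[0])        # "+2 bins" older version
younger_w = ws[0] + age_delta(bins[0], -2)  # "-2 bins" younger version
```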
Style mixing and face morphing

There are 18 AdaIN inputs in the StyleGAN2 architecture. These AdaINs work at different spatial resolutions, and changing a different input will change details at a different scale. The authors divide them into three groups: coarse styles (for 4²–8² spatial resolutions), middle styles (16²–32²), and fine styles (64²–1024²). The opportunity to change coarse, middle or fine details is a unique feature of the StyleGAN architectures.

We collect datasets of triplets (two source images and their mixture) and train our models for each transformation. We concatenate the two images into 6 channels to feed our pix2pixHD model. Fig. 7(a,b,c) shows the results of style mixing.

Another simple linear operation is averaging two latent codes. It corresponds to the morphing operation on images. We collect another dataset with triplets of latent codes: two random codes and their average. Examples of face morphing are shown in Fig. 7(d).

Fig. 7: Style mixing with pix2pixHD. (a), (b), (c) show results of distilled crossover of two latent codes in W+ (coarse, middle, and fine styles taken from source B, respectively); (d) shows the result of the average latent code transformation. Source images are sampled from FFHQ.

Conclusion

In this paper, we unite unconditional image generation and paired image-to-image GANs to distill a particular image manipulation in the latent code of StyleGAN2 into a single image-to-image translation network. The resulting technique shows both fast inference and impressive quality. It outperforms existing unpaired image-to-image models in FID score, and it outperforms StyleGAN Encoder approaches both in the user study and in inference time on the gender swap task. We show that the approach is also applicable to other image manipulations, such as aging/rejuvenation and style transfer.

Our framework has several limitations. The StyleGAN2 latent space is not perfectly disentangled, so the transformations made by our network are not perfectly pure. Still, although the latent space is not disentangled enough to make pure transformations, the impurities are not severe.

We use only the pix2pixHD network, although different architectures fit better for different tasks. Besides, we distill every transformation into a separate model, although some universal model could be trained. This opportunity should be investigated in future studies.
References
1. Abdal, R., Qin, Y., Wonka, P.: Image2StyleGAN++: How to edit the embedded images? arXiv preprint arXiv:1911.11544 (2019)
2. Abdal, R., Qin, Y., Wonka, P.: Image2StyleGAN: How to embed images into the StyleGAN latent space? In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4432–4441 (2019)
3. Aguinaldo, A., Chiang, P.Y., Gain, A., Patil, A., Pearson, K., Feizi, S.: Compressing GANs using knowledge distillation. arXiv preprint arXiv:1902.00159 (2019)
4. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. pp. 2654–2662 (2014)
5. Baylies, P.: StyleGAN encoder - converts real images to latent space. https://github.com/pbaylies/stylegan-encoder (2019)
6. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
7. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5933–5942 (2019)
8. Chen, B.C., Chen, C.S., Hsu, W.H.: Cross-age reference coding for age-invariant face recognition and retrieval. In: European Conference on Computer Vision. pp. 768–783. Springer (2014)
9. Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1511–1520 (2017)
10. Chen, Y.C., Shen, X., Jia, J.: Makeup-go: Blind reversion of portrait edit. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4501–4509 (2017)
11. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
12. Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: StarGAN v2: Diverse image synthesis for multiple domains. arXiv preprint arXiv:1912.01865 (2019)
13. Dawid, A.P., Skene, A.M.: Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28(1), 20–28 (1979)
14. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
15. French, G., Mackiewicz, M., Fisher, M.: Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208 (2017)
16. Gabbay, A., Hoshen, Y.: Style generator inversion for image enhancement and animation. arXiv preprint arXiv:1906.11880 (2019)
17. Goetschalckx, L., Andonian, A., Oliva, A., Isola, P.: GANalyze: Toward visual definitions of cognitive image properties. In: The IEEE International Conference on Computer Vision (ICCV) (2019)
18. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672–2680 (2014)
19. He, Z., Kan, M., Shan, S., Chen, X.: S2GAN: Share aging factors across ages and share aging trends among individuals. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9440–9449 (2019)
20. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems. pp. 6626–6637 (2017)
21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
22. Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A.A., Darrell, T.: CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213 (2017)
23. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1501–1510 (2017)
24. Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 172–189 (2018)
25. Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: ECCV (2018)
26. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1125–1134 (2017)
27. Jahanian, A., Chai, L., Isola, P.: On the "steerability" of generative adversarial networks. arXiv preprint arXiv:1907.07171 (2019)
28. Kar, A., Prakash, A., Liu, M.Y., Cameracci, E., Yuan, J., Rusiniak, M., Acuna, D., Torralba, A., Fidler, S.: Meta-Sim: Learning to generate synthetic datasets. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4551–4560 (2019)
29. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
30. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)
31. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958 (2019)
32. Khosla, A., Raju, A.S., Torralba, A., Oliva, A.: Understanding and predicting image memorability at a large scale. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2390–2398 (2015)
33. Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M.K., Yang, M.H.: Diverse image-to-image translation via disentangled representations. In: European Conference on Computer Vision (2018)
34. Lee, H.Y., Tseng, H.Y., Mao, Q., Huang, J.B., Lu, Y.D., Singh, M.K., Yang, M.H.: DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270 (2019)
35. Li, P., Hu, Y., Li, Q., He, R., Sun, Z.: Global and local consistent age generative adversarial networks. In: 2018 24th International Conference on Pattern Recognition (ICPR). pp. 1073–1078. IEEE (2018)
36. Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems. pp. 700–708 (2017)
37. Liu, M.Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., Kautz, J.: Few-shot unsupervised image-to-image translation. arXiv preprint arXiv:1905.01723 (2019)
38. Liu, Y., Li, Q., Sun, Z.: Attribute-aware face aging with wavelet-based generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11877–11886 (2019)
39. Mirzadeh, S.I., Farajtabar, M., Li, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393 (2019)
40. Moschoglou, S., Papaioannou, A., Sagonas, C., Deng, J., Kotsia, I., Zafeiriou, S.: AgeDB: the first manually collected, in-the-wild age database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 51–59 (2017)
41. Nikitko, D.: StyleGAN – encoder for official TensorFlow implementation. https://github.com/Puzer/stylegan-encoder