XOGAN: One-to-Many Unsupervised Image-to-Image Translation
Yongqi Zhang
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
{yzhangee}@cse.ust.hk

Abstract—Unsupervised image-to-image translation aims at learning the relationship between samples from two image domains without supervised pair information. The relationship between two domain images can be one-to-one, one-to-many or many-to-many. In this paper, we study the one-to-many unsupervised image translation problem, in which an input sample from one domain can correspond to multiple samples in the other domain. To learn the complex relationship between the two domains, we introduce an additional variable to control the variations in our one-to-many mapping. A generative model with an XO-structure, called the XOGAN, is proposed to learn the cross-domain relationship among the two domains and the additional variable. Not only can we learn to translate between the two image domains, we can also handle the translated images with additional variations. Experiments are performed on unpaired image generation tasks, including edges-to-objects translation and facial image translation. We show that the proposed XOGAN model can generate plausible images and control variations, such as color and texture, of the generated images. Moreover, while state-of-the-art unpaired image generation algorithms tend to generate images with monotonous colors, XOGAN can generate more diverse results.

Index Terms—Unsupervised image translation, Image generation, Generative adversarial networks, One-to-many mapping
I. INTRODUCTION
With the development of deep generative models, image generation has become popular in various applications. Among them, image-to-image translation, which learns a mapping from one domain to another, is a recent hot topic. Many computer vision tasks, including cross-domain image generation [1], [2], [3], [4], [5], super-resolution [6], colorization [7], [8], image inpainting [9], and image style transfer [10], [11], [12], can be considered as image translation. Based on the two domains of images (either paired or unpaired), a generative model is trained to learn their relationship.

Image translation can be defined as follows. Given an image X_S, we map it to a target domain image X_T that shares some similarity or has a close relationship with X_S. The task is to learn the mapping f: X_S → X_T that transforms the source domain distribution p(X_S) to the target domain distribution p(X_T). For example, X_T can be a scaled-up version of X_S in super-resolution, a colored version of a grayscale image X_S, or a photo of its sketch X_S (Figure 1). In these tasks, we assume that there also exists an inverse mapping g: X_T → X_S. With the help of the inverse mapping, we can understand the two domains better by learning their joint relationship.

Fig. 1. Example one-to-many translation. Here, the translation is between a set of edge images and a set of real shoes. The proposed model can learn the one-to-many generation of multiple colored shoes from one edge input.
Given a paired data set, early approaches learn the mapping by using the input-output pairs in a supervised manner [1], [6]. The main challenge is to learn and generalize the relationship between the given pairs. However, paired data sets are usually hard to collect, and the corresponding target domain image may not even exist in practice. For example, it is hard to collect paired images of photos and artistic works. Another example is that if the two domains are male faces and female faces, then there is no paired data of the same person. In these cases, supervised models fail because of the lack of a ground-truth mapping for training. Learning unpaired image translation is more practical and has received more attention recently [2], [3], [4], [5], since data collection is much easier in the unsupervised (or unpaired) setting.

In this paper, we consider the task of unsupervised one-to-many image generation. Given two image domains A and B, we learn to transfer images in domain A to domain B, and vice versa. Different from previous works on unpaired image translation [2], [3], [4], we assume that there can be many possible target images in domain B given the same image in domain A. For example, in the edge-to-shoe translation task in Figure 1, there can be different colors and textures when generating shoes. To model this variation, we propose to use an additional variable Z to complement images in domain A. Moreover, this variable Z can be easily sampled from a prior distribution, such as the normal distribution.

To learn the relationship among A, B and Z, we propose a novel generative model under the constraints of a domain adversarial loss and a cycle consistency loss, which was first defined in [4]. The proposed model, which will be called XOGAN, is assembled in an "XO"-structure, and is trained under the generative adversarial network (GAN) framework [13], [14], [15], [16], [17].

In our experiments, we show results on generating shoes and handbags with diverse colors and textures given the edge images. Besides, when the additional variable is kept the same for different edge inputs, we can generate objects with the same colors. Moreover, we can alter the colors of different objects by substituting the variable Z. Hence, not only can our model generate plausible images as in other generative models, it can also replace the color of a certain image with that of another.

Section II first reviews related work. The proposed "XO"-structure and the training procedure are introduced in Section III. Section IV presents experimental results on the edges2shoes, edges2handbags and CelebA data sets. Finally, Section V gives some concluding remarks and future directions.
Notations. Samples from the two image domains are denoted A ∈ R^{h×w×c_A} and B ∈ R^{h×w×c_B}, where the subscripts A and B are the domain indicators, and h, w, c are the height, width and number of channels, respectively. The empirical distributions of the two image domains are denoted P_A and P_B, respectively. The additional variable Z ∈ R^{d_Z} is sampled from a standard normal distribution P_Z = N(0, I), where d_Z is the dimensionality of Z. The generator for image domain A is denoted G_A: R^{h×w×c_B} → R^{h×w×c_A}, in which the subscript denotes the image domain. The generator for image domain B is denoted G_B: (R^{h×w×c_A}, R^{d_Z}) → R^{h×w×c_B}; its input is a concatenation of an image from domain A and a sample from the distribution P_Z. The generator (also called an encoder) for Z is denoted G_Z: R^{h×w×c_B} → R^{d_Z}.

The fake samples generated in the first pass are denoted Ā and B̄ for domains A and B, and the encoded variable is denoted Z̄. When the fake samples Ā, B̄, Z̄ are forwarded once again through the generators, the output variables in each domain are denoted Â, B̂, Ẑ (Figure 3).

The discriminator networks are denoted D_A: R^{h×w×c_A} → [0, 1], D_B: R^{h×w×c_B} → [0, 1], and D_Z: R^{d_Z} → [0, 1], respectively.

II. RELATED WORK
A. Generative Adversarial Networks
The generative adversarial network (GAN) [13] is a powerful generative model that can generate plausible images. The GAN contains two modules: a generator G that generates samples and a discriminator D that tries to distinguish whether a sample comes from the real or the generated distribution. The generator aims to confuse the discriminator by generating samples that are difficult to differentiate from the real ones. Training of GANs often suffers from issues such as vanishing gradients and mode collapse [16], in which the generator tends to collapse to points in a single mode. Very recently, a number of techniques have been introduced to stabilize the training procedure [14], [16]. In cross-domain image generation [1], [2], [4], [18], the GAN is a powerful tool to match the generated images to the real image distribution, especially when paired images are not available.

B. Supervised Image-to-Image Translation
Isola et al. [1] showed that cross-domain image translation can be learned and generalized using a paired data set. By using the conditional GAN [15], their model can generate plausible photographs from sketches or semantic layouts. Zhu et al. [19] used bicycle consistency between the latent code and output images to generate multi-modal target domain images. However, paired data sets are not always available, and unpaired data sets are more common in practice.
C. Unsupervised Image-to-Image Translation
Taigman et al. [3] introduced the domain transfer network (DTN) to generate emoji-style images from facial images in an unsupervised manner. In the DTN, image translation is a one-way mapping. If we train another model to map the emoji images back to real faces, the face identity may be changed. More recently, bidirectional mapping has become more appealing, and has been studied in the DiscoGAN [2], CycleGAN [4] and DualGAN [5]. These models use one generator and one discriminator for each mapping, and the symmetric structure helps to learn the bidirectional mapping.

Fig. 2. The CycleGAN [4]. (a) Forward path. (b) Backward path.

Figure 2 shows the CycleGAN [4], which uses a generator G for the mapping X → Y and another generator F for Y → X. Two associated adversarial discriminators, D_X and D_Y, are used to measure the quality of generated samples in the corresponding domains. Figure 2(a) contains the forward cycle-consistency path: x → G(x) → F(G(x)) ≈ x, and Figure 2(b) is the backward cycle-consistency path: y → F(y) → G(F(y)) ≈ y. The cycle consistency loss captures the intuition that if we translate from one domain to the other and back again, we should be able to reconstruct the original input. However, the image generated by CycleGAN is deterministic. As will be shown in Section IV, these models cannot model additional variations even when random noise is added to the inputs.

Another recent model is the UNIT [18], which performs image translation by using a shared latent space to encode the two domains. Although it can generate multiple images via the use of a stochastic variable, the variations generated are still limited.
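To make the cycle-consistency idea concrete before introducing our model, the sketch below shows how CycleGAN's two reconstruction terms could be computed in a PyTorch-style implementation. The callables g and f and the choice of the ℓ1 norm are illustrative assumptions here, not the CycleGAN authors' released code.

```python
import torch
import torch.nn.functional as nnf

def cyclegan_cycle_loss(g, f, x, y):
    """Forward cycle x -> g(x) -> f(g(x)) ~ x and backward cycle y -> f(y) -> g(f(y)) ~ y."""
    return nnf.l1_loss(f(g(x)), x) + nnf.l1_loss(g(f(y)), y)
```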
III. THE PROPOSED MODEL

Let A and B be two image domains. In supervised image-to-image translation, a sample pair (A, B) is drawn from the joint distribution P_{A,B}. In this paper, we focus on unsupervised image-to-image translation, in which samples are drawn from the marginal distributions P_A and P_B.

A. Generators and Cycle Consistency Loss
The proposed XOGAN model contains three generators G_A, G_B and G_Z (with parameters θ_{G_A}, θ_{G_B} and θ_{G_Z}, respectively). In this paper, we propose to use an additional variable Z to model the variation when translating from domain A to domain B. Given a sample A drawn from P_A and Z from the prior distribution P_Z, a fake sample B̄ in domain B is generated by G_B as B̄ = G_B(A, Z). Given B̄, generator G_A generates a reconstruction Â of A in domain A, and generator G_Z encodes a reconstruction Ẑ of Z:

Â = G_A(B̄),  Ẑ = G_Z(B̄).

Together, this forms the X-path in Figure 3(a). To ensure cycle consistency, the generated sample B̄ should contain sufficient information to reconstruct A (for the path A → B̄ → Â), and similarly Ẑ should be similar to Z (for the path Z → B̄ → Ẑ).

On the other hand, given a sample B in domain B, generator G_A can use it to generate a fake sample Ā = G_A(B) in domain A, and generator G_Z can use it to encode a fake Z̄ = G_Z(B). Using both Ā and Z̄, generator G_B can recover a reconstruction of B as B̂ = G_B(Ā, Z̄). This forms the O-path in Figure 3(b). Again, for cycle consistency, B̂ should be close to B.

Combining the above, the cycle consistency loss can thus be written as

L_cyc(θ_{G_A}, θ_{G_B}, θ_{G_Z}) = ||Â − A|| + ||B̂ − B|| + ||Ẑ − Z||.  (1)

Here, we use the ℓ norm, though other norms can also be used.

Fig. 3. The three generators in the proposed XOGAN model. (a) X path. (b) O path. The path in blue is for domain A, green is for B, and yellow is for Z.
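For concreteness, a minimal PyTorch-style sketch of the two reconstruction paths and the loss in (1) is given below. The module names g_A, g_B, g_Z (with the call signatures from the Notations paragraph) and the use of the ℓ1 norm are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(g_A, g_B, g_Z, A, B, Z):
    """Cycle consistency loss of Eq. (1); the l1 norm here is an assumption."""
    # X-path: (A, Z) -> fake B -> reconstructions of A and Z.
    B_fake = g_B(A, Z)
    A_rec, Z_rec = g_A(B_fake), g_Z(B_fake)
    # O-path: B -> fake A and fake Z -> reconstruction of B.
    A_fake, Z_fake = g_A(B), g_Z(B)
    B_rec = g_B(A_fake, Z_fake)
    return F.l1_loss(A_rec, A) + F.l1_loss(B_rec, B) + F.l1_loss(Z_rec, Z)
```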
B. Domain Adversarial Loss

Minimizing (1) alone cannot guarantee that the generated fake samples Ā, B̄ and the encoded variable Z̄ follow the distributions P_A, P_B and P_Z. The GAN [13], which is known to be able to learn good generative models, can also be regarded as performing distribution matching. In the following, with the use of the adversarial loss, we try to match the generated distributions P_{G_A}, P_{G_B} and P_{G_Z} with the corresponding P_A, P_B and P_Z.

Figure 4 shows the three discriminators D_A, D_B and D_Z (with parameters θ_{D_A}, θ_{D_B} and θ_{D_Z}, respectively) that are used to discriminate the generated Ā, B̄, Z̄ from the true A, B, Z. As in DiscoGAN [2] and CycleGAN [4], the discriminators are binary classifiers, and the discriminator losses are:

L_dis(θ_{D_A}) = −E_{A∼P_A}[log D_A(A)] − E_{Ā∼P_{G_A}}[log(1 − D_A(Ā))],  (2)
L_dis(θ_{D_B}) = −E_{B∼P_B}[log D_B(B)] − E_{B̄∼P_{G_B}}[log(1 − D_B(B̄))],  (3)
L_dis(θ_{D_Z}) = −E_{Z∼P_Z}[log D_Z(Z)] − E_{Z̄∼P_{G_Z}}[log(1 − D_Z(Z̄))].  (4)

Fig. 4. The XOGAN discriminator. Label "1" denotes true samples of A, B, Z, while label "0" denotes the generated samples Ā, B̄, Z̄.

In GAN, the generators, besides trying to minimize the cycle consistency loss, also need to confuse their corresponding discriminators. The adversarial losses for the generators are

L_adv(θ_{G_A}) = −E_{Ā∼P_{G_A}}[log D_A(Ā)],
L_adv(θ_{G_B}) = −E_{B̄∼P_{G_B}}[log D_B(B̄)],
L_adv(θ_{G_Z}) = −E_{Z̄∼P_{G_Z}}[log D_Z(Z̄)].

To ensure both cycle consistency and distribution matching, the total loss for the generators is a combination of the cycle consistency loss in (1) and the adversarial losses:

L_gen(θ_{G_A}, θ_{G_B}, θ_{G_Z}) = L_adv(θ_{G_A}) + L_adv(θ_{G_B}) + L_adv(θ_{G_Z}) + λ L_cyc(θ_{G_A}, θ_{G_B}, θ_{G_Z}),  (5)

where λ controls the balance between the two types of losses. In the experiments, λ is set to 10.
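As a rough illustration, the binary cross-entropy forms of (2)-(4) and of the generator adversarial losses could be written as follows. The helper names and the detaching of the fake sample in the discriminator update are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d, real, fake):
    """Binary cross-entropy discriminator loss as in Eqs. (2)-(4)."""
    pred_real = d(real)
    pred_fake = d(fake.detach())  # do not backpropagate into the generator here
    return (F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) +
            F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake)))

def generator_adversarial_loss(d, fake):
    """The generator tries to make the discriminator label its fakes as real."""
    pred_fake = d(fake)
    return F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))
```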
C. Training Procedure

In each iteration, we sample a mini-batch of images A's and B's from P_A and P_B, and a variable Z from the prior distribution P_Z (the standard normal distribution N(0, I)). They are fed through the X- and O-paths in Figure 3 to obtain the generated samples Ā, B̄, Z̄ and the reconstructed samples Â, B̂, Ẑ. The real and generated samples are then input to the three discriminators.

As in GAN, we minimize the generators' objective (5) for k steps and the discriminators' objectives ((2), (3) and (4)) for one step. We use the ADAM optimizer [20] with learning rate 0.0002.
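A minimal sketch of one training iteration under this setup is given below. It reuses the loss helpers sketched in the previous subsections; the value k = 2, the latent dimension and the grouping of parameters into two optimizers are assumptions based on the settings reported in this paper.

```python
import torch

def train_step(batch_A, batch_B, gens, discs, opt_gen, opt_dis, lam=10.0, k=2, d_z=8):
    g_A, g_B, g_Z = gens
    d_A, d_B, d_Z = discs
    # k generator updates per discriminator update.
    for _ in range(k):
        Z = torch.randn(batch_A.size(0), d_z)
        B_fake = g_B(batch_A, Z)
        A_fake, Z_fake = g_A(batch_B), g_Z(batch_B)
        loss_g = (generator_adversarial_loss(d_A, A_fake)
                  + generator_adversarial_loss(d_B, B_fake)
                  + generator_adversarial_loss(d_Z, Z_fake)
                  + lam * cycle_consistency_loss(g_A, g_B, g_Z, batch_A, batch_B, Z))
        opt_gen.zero_grad()
        loss_g.backward()
        opt_gen.step()
    # One discriminator update on freshly generated samples.
    Z = torch.randn(batch_A.size(0), d_z)
    loss_d = (discriminator_loss(d_A, batch_A, g_A(batch_B))
              + discriminator_loss(d_B, batch_B, g_B(batch_A, Z))
              + discriminator_loss(d_Z, Z, g_Z(batch_B)))
    opt_dis.zero_grad()
    loss_d.backward()
    opt_dis.step()
```

Here opt_gen and opt_dis would be ADAM optimizers (learning rate 0.0002) over the generator and discriminator parameters, respectively; grouping them into two optimizers is a simplification for this sketch.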
IV. EXPERIMENTS

In this section, experiments are performed on a number of commonly used data sets for image translation.

1) edges2shoes and edges2handbags: These two data sets have been used in [1] (https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets/). The edges2shoes data set contains about 50k paired images, and the edges2handbags data set contains about 140k paired images. In both data sets, domain A contains edge images and domain B contains real objects (shoes and bags). Note that one real object can be mapped to only one edge map, but an edge map can correspond to multiple objects. Although the two data sets contain paired images, we break the pairing by sampling domain A images from the first half of the pairs, and domain B images from the other half. Hence, there is no paired data in the training set, and the task is unsupervised image-to-image translation.

2) CelebA: This is a large-scale face attributes data set with more than 200K celebrity images, each with 40 attribute annotations [21] (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). In the experiment, we use the hair color attribute. Domain A contains faces with black hair, and domain B contains faces with other hair colors. As the hair in many male CelebA faces is not apparent, we only use the female faces.

All input images are rescaled to the same resolution.

In the proposed XOGAN, we use the U-Net [22], which adds skip connections between mirrored layers of the 7 downsampling layers and the 7 upsampling layers, for the image generators G_A and G_B. For Z, the generator (or encoder) G_Z is a 7-layer strided convolutional network with residual blocks [23]. We set the dimensionality of Z to 8 for the edges2shoes and edges2handbags data sets, and to 4 for the CelebA data set. For the image discriminators D_A and D_B, we use the patch-discriminator in [24], which only classifies input images at the scale of overlapping patches. The discriminator D_Z is a simple two-layer multi-layer perceptron. As in [4], the hyperparameter λ in (5) is set to 10. In each training iteration, we update the generators twice and then update the discriminators once.

Existing unsupervised image-to-image translation models such as DiscoGAN [2] and CycleGAN [4] can only generate one target image. Instead, we compare with the following baselines:

1) Noisy DiscoGAN: This is a variant of DiscoGAN. It uses the same generators as G_A and G_B in XOGAN, but the generator from domain A to domain B is augmented with random Gaussian noise. This allows the generation of different images in domain B given the same image from domain A. We do not compare with CycleGAN [4] and DualGAN [5], as they are very similar to DiscoGAN.

2) UNIT [18]: The UNIT model uses two variational autoencoders [25] with a shared latent space as cross-domain image translators. It also uses cycle consistency for unpaired image-to-image translation. For each input image, we sample multiple latent codes z's, and use them to generate different outputs.

A. Translating A to B with Random Z

To show the consistency of the learned additional variable, we sample different random variables Z_j's to generate different fake images B̄_ij = G_B(A_i, Z_j) for each input image A_i. We keep the random variable set {Z_j} the same for different A_i's, so that when Z_j (i.e., j) is fixed, the generated samples B̄_ij should share some fixed attributes, such as color, as i changes (Figures 5 and 8).

In this experiment, we randomly sample 4 input images {A_i, i = 1, ..., 4} in domain A from the test set.
For each A_i, we generate 4 images {B̄_ij = G_B(A_i, Z_j), j = 1, ..., 4} using different Z_j's sampled from the standard normal distribution.
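This sampling scheme can be summarized by the short sketch below (a hypothetical helper following the assumptions in the earlier sketches): every column of the generated grid shares one code Z_j, so attributes controlled by Z stay fixed down a column.

```python
import torch

def sample_generation_grid(g_B, images_A, num_codes=4, d_z=8):
    """Generate B_ij = G_B(A_i, Z_j) with one shared set of codes {Z_j}."""
    Z = torch.randn(num_codes, d_z)           # fixed across all inputs A_i
    rows = []
    for A_i in images_A:                      # images_A: list of (C, H, W) tensors
        row = [g_B(A_i.unsqueeze(0), Z[j:j + 1]) for j in range(num_codes)]
        rows.append(torch.cat(row, dim=0))
    return torch.stack(rows)                  # shape: (len(images_A), num_codes, C, H, W)
```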
edges2shoes and edges2handbags Data Sets: Figure 5 shows that the proposed XOGAN can generate realistic photos of shoes and handbags. With the help of the cycle consistency loss on variable Z, the generated objects have the same color for the same Z_j. This makes it possible to generate plausible and colorful objects by just drawing edges like those in domain A. Besides, we can also control the color of the generated images through Z. Note that our task is different from drawing software in two ways: (i) we show that an end-to-end deep neural network is able to generate plausible and diverse objects given edges; (ii) the generated objects do not only fill in the colors, but also have smooth textures that make them look real.

Fig. 5. Edges to shoes and handbags experiment. (a) Edges2Shoes. (b) Edges2Handbags. The right 4 images in each row are translated from input image A_i to B̄_ij with random variable Z_j, j = 1, ..., 4. As can be seen, the colors are consistent in the right 4 columns.

Figure 6 shows the results obtained by the noisy DiscoGAN. Although multiple images can be generated, they are inferior to those obtained with the proposed model in the following ways: (i) the images {G_B(A_i, Z_j)} generated by the noisy DiscoGAN are not as good as those of XOGAN; (ii) the images generated by the noisy DiscoGAN are not as diverse as those of the proposed XOGAN, i.e., G_B(A_i, Z_j) has a similar color for different j's in each row. This may be due to the mode collapse problem of GAN [13], in which the generator tends to produce images from a single mode (in this case, color).

Fig. 6. Edges to shoes and handbags experiment of noisy DiscoGAN. (a) Edges2Shoes. (b) Edges2Handbags. The right 4 images in each row are translated from input image A_i to B̄_ij with random variable Z_j, j = 1, ..., 4.

Figure 7 shows the results obtained by UNIT. Similar to the noisy DiscoGAN, almost all the generated images tend to have the same color. From our observations, the generated samples of UNIT also suffer from mode collapse during training: the sampled latent variables do not produce diverse outputs in each row.

Fig. 7. Edges to shoes and handbags experiment of the UNIT model. (a) Edges2Shoes. (b) Edges2Handbags. The right 4 images in each row are translated from input image A_i to B̄_ij with random variable Z_j, j = 1, ..., 4.

The above results show that the proposed XO-path is essential for generating both plausible and consistent target images. On the edges2shoes and edges2handbags data sets, XOGAN is able to generate plausible images in the target domain B. Different from the baseline models of noisy DiscoGAN and UNIT, we can sample multiple and consistent output images with different Z_j's. Besides, the results of XOGAN are more diverse with the help of Z.
CelebA Data Set:
Figure 8 shows the generated faces with different hair colors. Variable Z is used to model the different hair colors in domain B. Note that the other parts of the images are almost unchanged. Hence, the color change is focused on specific parts (here, the hair).

Fig. 8. CelebA hair color conversion experiment. We transfer the black-hair faces to faces with other hair colors. The hair colors are consistent for different G_B(A_i, Z_j) when j is fixed.

B. Translating B to A to B with Substituted Z

In this section, we study whether Z can encode relevant information in domain B by substituting Z̄_j among different images in domain B.

In this experiment, we randomly sample 4 input images {B_i, i = 1, ..., 4} in domain B from the test set. We generate the corresponding images {Ā_i = G_A(B_i), i = 1, ..., 4} in domain A and encode their additional variations in {Z̄_i = G_Z(B_i)}. As in the previous edges2shoes experiment, B_i represents a colored shoe, Ā_i is its corresponding edge image, and Z̄_i should encode the content inside the edges. We then concatenate Ā_i with different Z̄_j's to generate various images {B̂_ij = G_B(Ā_i, Z̄_j), i = 1, ..., 4, j = 1, ..., 4} in domain B.
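A sketch of this substitution procedure is given below (hypothetical helper names, following the earlier sketches): entry (i, j) of the returned grid keeps the shape of B_i but takes its code, and hence its color, from B_j.

```python
import torch

def substitute_codes(g_A, g_B, g_Z, images_B):
    """Recombine every edge image A_bar_i with every code Z_bar_j."""
    A_bar = [g_A(B_i.unsqueeze(0)) for B_i in images_B]   # edge images
    Z_bar = [g_Z(B_i.unsqueeze(0)) for B_i in images_B]   # encoded variations
    grid = [torch.cat([g_B(a, z) for z in Z_bar], dim=0) for a in A_bar]
    return torch.stack(grid)   # entry (i, j): shape of B_i, color of B_j
```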
edges2shoes and edges2handbags Data Sets: To see what the additional variable Z encodes, we substitute different Z̄_j's given differently colored shoes and handbags. As shown in Figure 9, we can modify images in domain B in this way. In the two pictures, we show the input shoes and handbags B_i in the first column. We first generate the corresponding edge image Ā_i of each B_i using generator G_A; it is shown in the second column of Figure 9. The generated edges describe the contours of the given inputs well. Since the edge images ignore the content of the given shoes or handbags, we encode the content or color into the additional variables {Z̄_i = G_Z(B_i), i = 1, ..., 4}. As we assumed, Z̄_i should contain the corresponding color information of input image B_i. To show the relevance between Z̄_i and the color of B_i, we concatenate each edge image Ā_i with different Z̄_j's to generate new images in domain B as B̂_ij = G_B(Ā_i, Z̄_j). In the first row of Figure 9, we show B_1 and Ā_1 in the first two columns, and the last four columns show B̂_1j, j = 1, ..., 4. Since B̂_ij is generated from the concatenation of the edge image Ā_i and a different code Z̄_j, B̂_ij should have the same shape as B_i and the same color as B_j.

Fig. 9. Edges to shoes and handbags experiment with variable substitution. (a) Edges2Shoes. (b) Edges2Handbags. The edge image Ā_i is shown in the second column. The rightmost 4 images are translated from the concatenation of Ā_i with the variation Z̄_j of each input image B_j. The color is replaced for different inputs.

In the figure, we can see that the shape is consistent in each row and the colors are consistent in each of the four rightmost columns. More importantly, when j is fixed, all images B̂_ij in that column have colors similar to the input image B_j. For example, in the edges2shoes results in Figure 9, the shoes in the last column have colors similar to the last input shoe B_4. This verifies our assumption that Z̄ encodes relevant information of its corresponding input B. Interestingly, this means we can see what an object will look like after replacing its color with that of another object. In real-world applications such as fitting clothes in a shop, users do not need to try on the same clothes over and over again if they want to see them in different colors.
CelebA Hair Color Conversion:
We perform the B-to-A-to-B path on the CelebA data set again. Input faces are sampled from domain B, where the hair colors are not black. If the user wants to change the hair color to black, we can easily transfer the photo from B_i to Ā_i. A more interesting result is to replace the hair color with that of another person. Figure 10 shows that we can change blond hair to gray or brown by substituting the code Z̄_i with that of a gray- or brown-haired person.

Fig. 10. Hair color translation experiment. The second column transfers non-black hair to black. The hair color of each person B_i is encoded in Z̄_i. By concatenating Ā_i with different Z̄_j's, we can modify the hair color.

In this section, we showed that the proposed model can not only generate plausible and diverse images from domain A to domain B and vice versa, but can also modify the color of specific features in domain B by substituting the additional variable Z. The proposed cycle consistency constraints guarantee a good joint relationship between A and B. They also encode consistent features in variable Z, for both the random generation in Section IV-A and the color substitution in Section IV-B.

V. CONCLUSION
In this paper, we presented a generative model called XOGAN for unsupervised image-to-image translation with additional variations in the one-to-many translation setting. We showed that we can generate plausible images in both domains, and that the generated samples are more diverse than those of the baseline models. Not only does the learned additional variable Z lead to more diverse results, it also controls the colors of certain parts of the generated images. Experiments on the edges2shoes, edges2handbags and CelebA data sets showed that the learned variable Z is meaningful when generating images in domain B.

The proposed method can be extended in several ways. First, the prior distribution of Z is a standard normal distribution, which may be too simple to model more complex variations. This can be improved with the more complicated prior distributions introduced for VAEs [26], [27]. Second, the variations in our model are mostly related to color. We hope that our model can be improved to change other attributes such as hair styles and ornaments. Besides, we can also consider a many-to-many mapping based on the one-to-many framework. Similar to the summer-to-winter task in [4], there exist many winter images corresponding to a single summer image and vice versa. Further, we can extend the proposed model to other domains such as text or speech; the additional variation could be different voices when translating text to speech.

REFERENCES

[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," Preprint arXiv:1611.07004, 2016.
[2] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1857–1865.
[3] Y. Taigman, A. Polyak, and L. Wolf, "Unsupervised cross-domain image generation," Preprint arXiv:1611.02200, 2016.
[4] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[5] Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.
[6] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[7] Q. Yao and J. T. Kwok, "Colorization by patch-based local low-rank matrix completion," in Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015, pp. 1959–1965.
[8] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in European Conference on Computer Vision, 2016, pp. 649–666.
[9] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do, "Semantic image inpainting with perceptual and contextual losses," Preprint arXiv:1607.07539, 2016.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
[11] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, "Universal style transfer via feature transforms," in Advances in Neural Information Processing Systems, 2017, pp. 385–395.
[12] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision, 2016, pp. 694–711.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017, pp. 5769–5779.
[15] M. Mirza and S. Osindero, "Conditional generative adversarial nets," Preprint arXiv:1411.1784, 2014.
[16] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[17] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 214–223.
[18] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Advances in Neural Information Processing Systems, 2017, pp. 700–708.
[19] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, "Toward multimodal image-to-image translation," in Advances in Neural Information Processing Systems, 2017, pp. 465–476.
[20] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," Preprint arXiv:1412.6980, 2014.
[21] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
[22] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[24] C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in European Conference on Computer Vision, 2016, pp. 702–716.
[25] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," Preprint arXiv:1312.6114, 2013.
[26] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[27] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, "Auxiliary deep generative models," in Proceedings of the 33rd International Conference on Machine Learning, 2016.