Fader Networks: Manipulating Images by Sliding Attributes
Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, Marc'Aurelio Ranzato
{gl,neilz,usunier,abordes,ranzato}@fb.com
[email protected]

Abstract
This paper introduces a new encoder-decoder architecture that is trained to reconstruct images by disentangling the salient information of the image and the values of attributes directly in the latent space. As a result, after training, our model can generate different realistic versions of an input image by varying the attribute values. By using continuous attribute values, we can choose how much a specific attribute is perceivable in the generated image. This property could allow for applications where users can modify an image using sliding knobs, like faders on a mixing console, to change the facial expression of a portrait, or to update the color of some objects. Compared to the state-of-the-art, which mostly relies on training adversarial networks in pixel space by altering attribute values at train time, our approach results in much simpler training schemes and nicely scales to multiple attributes. We present evidence that our model can significantly change the perceived value of the attributes while preserving the naturalness of images.
1 Introduction

We are interested in the problem of manipulating natural images by controlling some attributes of interest. For example, given a photograph of the face of a person described by their gender, age, and expression, we want to generate a realistic version of this same person looking older or happier, or an image of a hypothetical twin of the opposite gender. This task and the related problem of unsupervised domain transfer recently received a lot of interest [17, 24, 9, 26, 21, 23], as a case study for conditional generative models but also for applications like automatic image editing. The key challenge is that the transformations are ill-defined and training is unsupervised: the training set contains images annotated with the attributes of interest, but there is no example of the transformation itself. In many cases, such as the "gender swapping" example above, there are no pairs of images representing the same person as a male or as a female. In other cases, collecting examples requires a costly annotation process, like taking pictures of the same person with and without glasses.

Our approach relies on an encoder-decoder architecture where, given an input image x with its attributes y, the encoder maps x to a latent representation z, and the decoder is trained to reconstruct x given (z, y). At inference time, a test image is encoded in the latent space, and the user chooses the attribute values y that are fed to the decoder. Even with binary attribute values at train time, each attribute can be considered as a continuous variable during inference to control how much it is perceived in the final image. We call our architecture Fader Networks, in analogy to the sliders of an audio mixing console, since the user can choose how much of each attribute they want to incorporate.

Affiliations: Facebook AI Research; Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6; LSCP, ENS, EHESS, CNRS, PSL Research University, INRIA. Code available at https://github.com/facebookresearch/FaderNetworks

Figure 1: Interpolation between different attributes (zoom in for better resolution). Each line shows reconstructions of the same face with different attribute values, where each attribute is controlled as a continuous variable. It is then possible to make an old person look older or younger, a man look more manly, or to imagine his female version. Left images are the originals.

The fundamental feature of our approach is to constrain the latent space to be invariant to the attributes of interest. Concretely, it means that the distribution over images of the latent representations should be identical for all possible attribute values. This invariance is obtained by using a procedure similar to domain-adversarial training (see e.g., [20, 6, 14]). In this process, a classifier learns to predict the attributes y given the latent representation z during training, while the encoder-decoder is trained on two objectives at the same time. The first objective is the reconstruction error of the decoder, i.e., the latent representation z must contain enough information to allow for the reconstruction of the input. The second objective consists in fooling the attribute classifier, i.e., the latent representation must prevent it from predicting the correct attribute values.
In this model, achieving invariance is a means to filter out, or hide, the properties of the image that are related to the attributes of interest. A single latent representation thus corresponds to different images that share a common structure but have different attribute values. The reconstruction objective then forces the decoder to use the attribute values to choose, from the latent representation, the intended image.

Our motivation is to learn a disentangled latent space in which we have explicit control over some attributes of interest, without supervision of the intended result of modifying attribute values. With a similar motivation, several approaches have been tested on the same tasks [17, 24], on related image-to-image translation problems [9, 26], or for more specific applications like the creation of parametrized avatars [23]. In addition to a reconstruction loss, the vast majority of these works rely on adversarial training in pixel space, which compares, during training, images generated with an intentional change of attributes against genuine images for the target attribute values. Our approach is different both because we use adversarial training on the latent space instead of the output, and because this adversarial training aims at learning invariance to attributes. The assumption underlying our work is that high fidelity to the input image conflicts less with the invariance criterion than with a criterion that forces the hallucinated image to match images from the training set.

As a consequence of this principle, our approach results in much simpler training pipelines than those based on adversarial training in pixel space, and is readily amenable to controlling multiple attributes, by adding new output variables to the discriminator of the latent space. As shown in Figure 1 on test images from the CelebA dataset [13], our model can make subtle changes to portraits that are nonetheless sufficient to alter the perceived value of attributes while preserving the natural aspect of the image and the identity of the person. Our experiments show that our model outperforms previous methods based on adversarial training on the decoder's output, like [17], in terms of both reconstruction loss and generation quality as measured by human subjects. We believe this disentanglement approach is a serious competitor to the widespread adversarial losses on the decoder output for such tasks.

In the remainder of the paper, we discuss related work in more detail in Section 2. We then present the training procedure in Section 3 before describing the network architecture and the implementation in Section 4. Experimental results are shown in Section 5.

2 Related work
There is substantial literature on attribute-based and/or conditional image generation, which can be split by the required supervision into three levels. At one extreme are fully supervised approaches developed to model known transformations, where examples take the form (input, transformation, result of the transformation). In that case, the model needs to learn the desired transformation. This setting was previously explored to learn affine transformations [8], 3D rotations [25], lighting variations [11] and 2D video game animations [19]. The methods developed in these works however rely on the supervised setting, and thus cannot be applied in our setup.

At the other extreme of the supervision spectrum lie fully unsupervised methods that aim at learning deep neural networks that disentangle the factors of variation in the data, without specification of the attributes. Example methods are InfoGAN [4], or the predictability minimization framework proposed in [20]. The neural photo editor [3] disentangles factors of variation in natural images for image editing. This setting is considerably harder than the one we consider, and it may be difficult with these methods to automatically discover high-level concepts such as gender or age.

Our work lies in between the two previous settings: attribute information is available during training, as in [15], but no example of the transformation is given. Methods developed for unsupervised domain transfer [9, 26, 21, 23] can also be applied in our case: given two different domains of images such as "drawings" and "photographs", one wants to map an image from one domain to the other without supervision; in our case, a domain would correspond to an attribute value. The mappings are trained using adversarial training in pixel space as mentioned in the introduction, using separate encoders and/or decoders per domain, and thus do not scale well to multiple attributes. In this line of work, but more specifically considering the problem of modifying attributes, the Invertible conditional GAN [17] first trains a GAN conditioned on the attribute values, and in a second step learns to map input images to the latent space of the GAN, hence the name of invertible GANs. It is used as a baseline in our experiments. Antipov et al. [1] use a pre-trained face recognition system instead of a conditional GAN to learn the latent space, and only focus on the age attribute. The attribute-to-image approach [24] is a variational auto-encoder that disentangles foreground and background to generate images using attribute values only. Conditional generation is performed by inferring the latent state given the correct attributes and then changing the attributes.

Additionally, our work is related to work on learning invariant latent spaces using adversarial training in domain adaptation [6], fair classification [5] and robust inference [14]. The training criterion we use for enforcing invariance is similar to the one used in those works; the difference is that the end goal of these works is only to filter out nuisance variables or sensitive information. In our case, we learn generative models, and invariance is used as a means to force the decoder to use attribute information in its reconstruction.

Finally, for the application of automatically modifying faces using attributes, the feature interpolation approach of [22] presents a means to generate alterations of images based on attributes using a network pre-trained on ImageNet.
While their approach is interesting from an application perspective, their inference is costly and, since it relies on pre-trained models, it cannot naturally incorporate factors or attributes that were not foreseen during pre-training.
3 Fader Networks

Let X be an image domain and Y the set of possible attributes associated with images in X, where in the case of people's faces typical attributes are glasses/no glasses, man/woman, young/old. For simplicity, we consider here the case where attributes are binary, but our approach could be extended to categorical attributes. In that setting, Y = {0, 1}^n, where n is the number of attributes. We have a training set D = {(x_1, y_1), ..., (x_m, y_m)} of m (image, attribute) pairs, with x_i ∈ X and y_i ∈ Y. The end goal is to learn from D a model that will generate, for any attribute vector y′, a version of an input image x whose attribute values correspond to y′.

Encoder-decoder architecture
Our model, described in Figure 2, is based on an encoder-decoder architecture with domain-adversarial training on the latent space. The encoder E_θenc : X → R^N is a convolutional neural network with parameters θ_enc that maps an input image to its N-dimensional latent representation E_θenc(x). The decoder D_θdec : (R^N, Y) → X is a deconvolutional network with parameters θ_dec that produces a new version of the input image given its latent representation E_θenc(x) and any attribute vector y′. When the context is clear, we simply use D and E to denote D_θdec and E_θenc. The precise architectures of the neural networks are described in Section 4. The auto-encoding loss associated to this architecture is a classical mean squared error (MSE) that measures the quality of the reconstruction of a training input x given its true attribute vector y:

$$\mathcal{L}_{AE}(\theta_{enc}, \theta_{dec}) = \frac{1}{m} \sum_{(x,y) \in \mathcal{D}} \big\| D_{\theta_{dec}}\big(E_{\theta_{enc}}(x), y\big) - x \big\|_2^2$$

The exact choice of the reconstruction loss is not fundamental in our approach, and adversarial losses such as PatchGAN [12] could be used in addition to the MSE at this stage to obtain better textures or sharper images, as in [9]. Using a mean absolute or mean squared error is still necessary to ensure that the reconstruction matches the original image.

Ideally, modifying y in D(E(x), y) would generate images with different perceived attributes, but similar to x in every other aspect. However, without additional constraints, the decoder learns to ignore the attributes, and modifying y at test time has no effect.

Learning attribute-invariant latent representations
To avoid this behavior, our approach is to learn latent representations that are invariant with respect to the attributes. By invariance, we mean that given two versions of a same object x and x′ that are identical up to their attribute values, for instance two images of the same person with and without glasses, the two latent representations E(x) and E(x′) should be the same. When such an invariance is satisfied, the decoder must use the attributes to reconstruct the original image. Since the training set does not contain different versions of the same image, this constraint cannot be trivially added to the loss.

We hence propose to incorporate this constraint by doing adversarial training on the latent space. This idea is inspired by the work on predictability minimization [20] and adversarial training for domain adaptation [6, 14], where the objective is also to learn an invariant latent representation using an adversarial formulation of the learning objective. To that end, an additional neural network called the discriminator is trained to identify the true attributes y of a training pair (x, y) given E(x). The invariance is obtained by learning the encoder E such that the discriminator is unable to identify the right attributes. As in GANs [7], this corresponds to a two-player game where the discriminator aims at maximizing its ability to identify attributes, and E aims at preventing it from being a good discriminator. The exact structure of our discriminator is described in Section 4.

Discriminator objective
The discriminator outputs probabilities of an attribute vector, P_θdis(y | E(x)), where θ_dis are the discriminator's parameters. Using the subscript k to refer to the k-th attribute, we have

$$\log P_{\theta_{dis}}(y \mid E(x)) = \sum_{k=1}^{n} \log P_{\theta_{dis},k}(y_k \mid E(x)) \,.$$

Since the objective of the discriminator is to predict the attributes of the input image given its latent representation, its loss depends on the current state of the encoder and is written as:

$$\mathcal{L}_{dis}(\theta_{dis} \mid \theta_{enc}) = -\frac{1}{m} \sum_{(x,y) \in \mathcal{D}} \log P_{\theta_{dis}}\big(y \mid E_{\theta_{enc}}(x)\big) \qquad (1)$$
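As a concrete illustration, Eq. (1) factorizes into one binary cross-entropy term per attribute. Below is a minimal PyTorch sketch, not the authors' released code; `discriminator` is assumed to be any module that maps a latent code to one logit per attribute:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, z, y):
    """Eq. (1): negative log-likelihood of the true attributes y given the
    latent code z = E(x). Decomposes into one independent binary
    cross-entropy term per attribute.

    z: (batch, latent_dim) latent codes, detached so that this loss only
       updates the discriminator, not the encoder.
    y: (batch, n) binary attribute values in {0, 1}.
    """
    logits = discriminator(z.detach())  # (batch, n), one logit per attribute
    return F.binary_cross_entropy_with_logits(logits, y.float())
```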
Adversarial objective

The objective of the encoder is now to compute a latent representation that optimizes two objectives. First, the decoder should be able to reconstruct x given E(x) and y; at the same time, the discriminator should not be able to predict y given E(x). We consider that a mistake is made when the discriminator predicts 1 − y_k for attribute k. Given the discriminator's parameters, the complete loss of the encoder-decoder architecture is then:

$$\mathcal{L}(\theta_{enc}, \theta_{dec} \mid \theta_{dis}) = \frac{1}{m} \sum_{(x,y) \in \mathcal{D}} \Big[ \big\| D_{\theta_{dec}}\big(E_{\theta_{enc}}(x), y\big) - x \big\|_2^2 - \lambda_E \log P_{\theta_{dis}}\big(1 - y \mid E_{\theta_{enc}}(x)\big) \Big] \,, \qquad (2)$$

where λ_E > 0 controls the trade-off between the quality of the reconstruction and the invariance of the latent representations. Large values of λ_E will restrain the amount of information about x contained in E(x) and result in blurry images, while low values limit the decoder's dependency on the latent code y and will result in poor effects when altering attributes.

Figure 2: Main architecture. An (image, attribute) pair (x, y) is given as input. The encoder maps x to the latent representation z; the discriminator is trained to predict y given z, whereas the encoder is trained to make it impossible for the discriminator to predict y given z only. The decoder should reconstruct x given (z, y). At test time, the discriminator is discarded and the model can generate different versions of x when fed with different attribute values.

Learning algorithm
Overall, given the current state of the encoder, the optimal discriminator parameters satisfy

$$\theta^*_{dis}(\theta_{enc}) \in \mathop{\mathrm{argmin}}_{\theta_{dis}} \mathcal{L}_{dis}(\theta_{dis} \mid \theta_{enc}) \,.$$

If we ignore problems related to multiple (and local) minima, the overall objective function is

$$\theta^*_{enc}, \theta^*_{dec} = \mathop{\mathrm{argmin}}_{\theta_{enc}, \theta_{dec}} \mathcal{L}\big(\theta_{enc}, \theta_{dec} \mid \theta^*_{dis}(\theta_{enc})\big) \,.$$

In practice, it is unreasonable to solve for θ*_dis(θ_enc) at each update of θ_enc. Following the practice of adversarial training for deep networks, we use stochastic gradient updates for all parameters, considering the current value of θ_dis as an approximation for θ*_dis(θ_enc). Given a training example (x, y), let us denote L_dis(θ_dis | θ_enc, x, y) the discriminator loss restricted to (x, y), and L(θ_enc, θ_dec | θ_dis, x, y) the corresponding auto-encoder loss. The update at time t, given the current parameters θ^(t)_dis, θ^(t)_enc and θ^(t)_dec and the training example (x^(t), y^(t)), is:

$$\theta^{(t+1)}_{dis} = \theta^{(t)}_{dis} - \eta \nabla_{\theta_{dis}} \mathcal{L}_{dis}\big(\theta^{(t)}_{dis} \mid \theta^{(t)}_{enc}, x^{(t)}, y^{(t)}\big)$$
$$[\theta^{(t+1)}_{enc}, \theta^{(t+1)}_{dec}] = [\theta^{(t)}_{enc}, \theta^{(t)}_{dec}] - \eta \nabla_{\theta_{enc}, \theta_{dec}} \mathcal{L}\big(\theta^{(t)}_{enc}, \theta^{(t)}_{dec} \mid \theta^{(t+1)}_{dis}, x^{(t)}, y^{(t)}\big) \,.$$

The details of training and models are given in the next section.
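These alternating updates map directly onto a standard two-optimizer loop. The following is a minimal sketch, not the authors' implementation; it assumes `encoder`, `decoder` and `discriminator` modules (e.g., as sketched in Section 4), a data `loader` yielding image batches with float binary attributes, and a scalar weight `lambda_e`:

```python
import torch
import torch.nn.functional as F

opt_dis = torch.optim.Adam(discriminator.parameters(), lr=2e-3, betas=(0.5, 0.999))
opt_ae = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()),
    lr=2e-3, betas=(0.5, 0.999))

for x, y in loader:  # one stochastic update of each player per batch
    z = encoder(x)

    # Discriminator step (Eq. 1): predict the true attributes y from z.
    # z is detached so this step does not update the encoder.
    dis_loss = F.binary_cross_entropy_with_logits(discriminator(z.detach()), y)
    opt_dis.zero_grad()
    dis_loss.backward()
    opt_dis.step()

    # Encoder-decoder step (Eq. 2): reconstruct x from (z, y) while pushing
    # the (now fixed) discriminator toward predicting the flipped labels 1 - y.
    recon = decoder(z, y)
    adv = F.binary_cross_entropy_with_logits(discriminator(z), 1 - y)
    ae_loss = F.mse_loss(recon, x) + lambda_e * adv
    opt_ae.zero_grad()
    ae_loss.backward()
    opt_ae.step()
```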
4 Implementation details

We adapt the architecture of our network from [9]. Let C_k denote a Convolution-BatchNorm-ReLU layer with k filters. Convolutions use 4 × 4 kernels, with a stride of 2 and a padding of 1, so that each layer of the encoder divides the size of its input by 2. We use leaky-ReLUs with a slope of 0.2 in the encoder, and simple ReLUs in the decoder. The encoder consists of the following 7 layers:

C_16 − C_32 − C_64 − C_128 − C_256 − C_512 − C_512

Input images have a size of 256 × 256. As a result, the latent representation of an image consists of 512 feature maps of size 2 × 2. In our experiments, using 6 layers gave us similar results, while 8 layers significantly decreased the performance, even when using more feature maps in the latent state. To provide the decoder with image attributes, we append the latent code to each layer given as input to the decoder, where the latent code of an image is the concatenation of the one-hot vectors representing the values of its attributes (binary attributes are represented as [1, 0] and [0, 1]). We append the latent code as additional constant input channels for all the convolutions of the decoder. Denoting by n the number of attributes (hence a code of size 2n), the decoder is symmetric to the encoder, but uses transposed convolutions for the up-sampling:

C_512+2n − C_512+2n − C_256+2n − C_128+2n − C_64+2n − C_32+2n − C_16+2n

The discriminator is a C_512 layer followed by a fully-connected neural network of two layers of size 512 and n respectively.
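For concreteness, here is a minimal PyTorch sketch of this architecture. It is a simplified reading of the description above, not the released code; in particular, the handling of the final decoder layer and the exact per-layer widths are assumptions:

```python
import torch
import torch.nn as nn

def C(c_in, c_out, encoder=True):
    """Convolution-BatchNorm-ReLU block with 4x4 kernels, stride 2, padding 1.
    Leaky-ReLU (slope 0.2) in the encoder; transposed convolutions and plain
    ReLUs in the decoder."""
    conv = (nn.Conv2d if encoder else nn.ConvTranspose2d)(c_in, c_out, 4, 2, 1)
    return nn.Sequential(conv, nn.BatchNorm2d(c_out),
                         nn.LeakyReLU(0.2) if encoder else nn.ReLU())

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        w = [3, 16, 32, 64, 128, 256, 512, 512]
        self.net = nn.Sequential(*[C(a, b) for a, b in zip(w, w[1:])])

    def forward(self, x):  # (B, 3, 256, 256) -> (B, 512, 2, 2)
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, n_attr):
        super().__init__()
        self.code_dim = 2 * n_attr  # one 2-dim one-hot per binary attribute
        w = [512, 512, 256, 128, 64, 32, 16]
        layers = [C(a + self.code_dim, b, encoder=False) for a, b in zip(w, w[1:])]
        # Final up-sampling layer maps to 3 image channels (an assumption:
        # no BatchNorm/ReLU before the output non-linearity).
        layers.append(nn.ConvTranspose2d(w[-1] + self.code_dim, 3, 4, 2, 1))
        self.layers = nn.ModuleList(layers)

    def forward(self, z, y):
        # y: (B, n_attr) in {0, 1}; attributes become [1, 0] / [0, 1] pairs.
        y = y.float()
        code = torch.stack([1 - y, y], dim=2).view(y.size(0), -1)
        h = z
        for layer in self.layers:
            # Append the attribute code as constant input channels.
            c = code.view(code.size(0), -1, 1, 1).expand(-1, -1, h.size(2), h.size(3))
            h = layer(torch.cat([h, c], dim=1))
        return torch.tanh(h)  # images are normalized to [-1, 1]

class Discriminator(nn.Module):
    def __init__(self, n_attr):
        super().__init__()
        # C_512 block, then fully-connected layers of size 512 and n.
        # Dropout (discussed below) is omitted here for brevity.
        self.net = nn.Sequential(C(512, 512), nn.Flatten(),
                                 nn.Linear(512, 512), nn.ReLU(),
                                 nn.Linear(512, n_attr))

    def forward(self, z):  # z: (B, 512, 2, 2) -> (B, n_attr) logits
        return self.net(z)
```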
Dropout

We found it extremely beneficial to add dropout in our discriminator. We set the dropout rate to 0.3 in all our experiments. Following [9], we also tried to add dropout in the first layers of the decoder, but in our experiments this turned out to significantly decrease the performance.

Discriminator cost scheduling
Similarly to [2], we use a variable weight for the discriminator loss coefficient λ_E. We initially set λ_E to 0, so that the model starts training as a normal auto-encoder. Then, λ_E is linearly increased to 0.0001 over the first 500,000 iterations, to slowly encourage the model to produce invariant representations. This scheduling turned out to be critical in our experiments. Without it, we observed that the encoder was too affected by the loss coming from the discriminator, even for low values of λ_E.
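This warm-up is a one-liner; a sketch with the values stated above, which could supply the `lambda_e` weight used in the training-loop sketch of Section 3:

```python
def lambda_e(step, target=1e-4, warmup_steps=500_000):
    """Linearly anneal the discriminator weight from 0 to its target value
    over the first warm-up iterations, then keep it constant."""
    return target * min(1.0, step / warmup_steps)
```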
Model selection

Model selection was first performed automatically using two criteria. First, we used the reconstruction error on original images, as measured by the MSE. Second, we also want the model to properly swap the attributes of an image. For this second criterion, we train a classifier to predict image attributes. At the end of each epoch, we swap the attributes of each image in the validation set and measure how well the classifier performs on the decoded images. These two metrics were used to preselect potentially good models. The final model was then selected based on human evaluation of train set images reconstructed with swapped attributes.
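A sketch of this second criterion follows (hypothetical `encoder`, `decoder` and `classifier` modules; the classifier is assumed to output one logit per attribute):

```python
import torch

@torch.no_grad()
def swap_accuracy(encoder, decoder, classifier, val_loader):
    """Swap the binary attributes of every validation image and measure how
    often a pre-trained attribute classifier recognizes the swapped values
    in the decoded image."""
    correct, total = 0, 0
    for x, y in val_loader:
        y_swap = 1 - y                        # flip all binary attributes
        out = decoder(encoder(x), y_swap.float())
        pred = (classifier(out) > 0).long()   # one logit per attribute
        correct += (pred == y_swap.long()).sum().item()
        total += y_swap.numel()
    return correct / total
```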
5 Experiments

We first present experiments on the CelebA dataset [13], which contains 202,599 celebrity images of shape 178 × 218, annotated with 40 attributes. We used the standard training, validation and test split. All pictures presented in the paper or used for evaluation have been taken from the test set. For pre-processing, we cropped images to 178 × 178 and resized them to 256 × 256, which is the resolution used in all figures of the paper. Image values were normalized to [−1, 1]. All models were trained with Adam [10], using a learning rate of 0.002, β_1 = 0.5, and a batch size of 32. We performed data augmentation by flipping images horizontally with a probability of 0.5 at each iteration. As a baseline, we used IcGAN [17], with the model provided by the authors (https://github.com/Guim3/IcGAN) and trained on the same dataset.
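The pre-processing and augmentation above are straightforward with torchvision; a sketch (assuming a center crop, which the paper does not specify explicitly):

```python
from torchvision import transforms

# Crop CelebA images (178 x 218) to 178 x 178, resize to 256 x 256,
# flip horizontally with probability 0.5, and normalize values to [-1, 1].
preprocess = transforms.Compose([
    transforms.CenterCrop(178),
    transforms.Resize(256),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),                                # maps to [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # maps to [-1, 1]
])
```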
Qualitative evaluation

Figure 3 shows examples of images generated when swapping different attributes: the generated images have high visual quality and clearly reflect the attribute value changes, for example by adding realistic glasses to the different faces. These generated images confirm that the latent representation learned by Fader Networks is both invariant to the attribute values and captures the information needed to generate any version of a face, for any attribute value. Indeed, when looking at the shape of the generated glasses, different glasses shapes and colors have been integrated into each original face: our model is not simply adding "generic" glasses to all faces, but generates plausible glasses depending on the input.
Quantitative evaluation protocol

We performed a quantitative evaluation of Fader Networks on Mechanical Turk, using IcGAN as a baseline. We chose the three attributes Mouth (Open/Close), Smile (With/Without) and Glasses (With/Without), as they were attributes in common between IcGAN and our model. We evaluated two different aspects of the generated images: the naturalness, which measures the quality of generated images, and the accuracy, which measures how well swapping an attribute value is reflected in the generation. Both measures are necessary to assess that we generate natural images and that the swap is effective. We compare: Real Image, which provides original images without transformation; FadNet AE and IcGAN AE, which reconstruct original images without attribute alteration; and FadNet Swap and IcGAN Swap, which generate images with one swapped attribute, e.g., With Glasses → Without Glasses. Before being submitted to Mechanical Turk, all images were cropped and resized following the same processing as IcGAN. As a result, output images were displayed at the same low resolution, preventing Workers from basing their judgment solely on the sharpness of the presented images.

Technically, we should also assess that the identity of a person is preserved when swapping attributes. This seemed to be a problem for GAN-based methods, but the reconstruction quality of our model is very good (its RMSE on the test set is much lower than that of IcGAN), and we did not observe this issue. Therefore, we did not evaluate this aspect.

For naturalness, images from the test set, balanced across attribute values, were shown to Mechanical Turk Workers, for each of the different models presented above. For each image, we asked whether the image seems natural or generated. The description given to the Workers to understand their task showed examples of real images, and examples of fake images (FadNet AE, FadNet Swap, IcGAN AE, IcGAN Swap).

The accuracy of each model on each attribute was evaluated in a different classification task, resulting in a total of 15 experiments. For example, the FadNet/Glasses experiment consisted in asking Workers whether people with glasses added by FadNet Swap effectively possess glasses, and vice-versa. This allows us to evaluate how perceptible the swaps are to the human eye. In each experiment, images were shown balanced across the two classes, in the order they appear in the test set. In both quantitative evaluations, each experiment was performed by multiple Workers. The results on both tasks are shown in Table 1.

Model        | Naturalness             | Accuracy
             | Mouth   Smile   Glasses | Mouth   Smile   Glasses
Real Image   |  92.6    87.0    88.6   |  89.0    88.3    97.6
IcGAN AE     |  22.7    21.7    14.8   |  88.1    91.7    86.2
IcGAN Swap   |  11.4    22.9     9.6   |  10.1     9.9    47.5
FadNet AE    |  88.4    75.2    78.8   |  91.8    90.1    94.5
FadNet Swap  |  79.0    31.4    45.3   |  66.2    97.1    76.6

Table 1: Perceptual evaluation of naturalness and swap accuracy for each model. The naturalness score is the percentage of images that were labeled as "real" by human evaluators to the question "Is this image a real photograph or a fake generated by a graphics engine?". The accuracy score is the classification accuracy by human evaluators on the values of each attribute.

Figure 4: (Zoom in for better resolution.) Examples of multi-attribute swaps (gender / opened eyes / eyeglasses) performed by the same model. Left images are the originals.

Quantitative results
In the naturalness experiments, only around 90% of real images were classified as "real" by the Workers, indicating how demanding the evaluation of generated images is. Our model obtained high naturalness scores when reconstructing images without swapping attributes: 88.4, 75.2 and 78.8, compared to IcGAN reconstructions whose naturalness does not exceed 22.7, whether for reconstructed or swapped images. For the swap, FadNet Swap still consistently outperforms IcGAN Swap by a large margin. However, the naturalness score varies a lot depending on the swapped attribute: from 79.0 for the opening of the mouth, down to 31.4 for the smile.

Classification experiments show that reconstructions with FadNet AE and IcGAN AE have very high classification scores, and are even on par with real images on both Mouth and Smile. FadNet Swap obtains an accuracy of 66.2 for the mouth, 76.6 for the glasses and 97.1 for the smile, indicating that our model can swap these attributes with a very high efficiency. On the other hand, with accuracies of 10.1, 47.5 and 9.9 on these same attributes, IcGAN Swap does not seem able to generate convincing swaps.
Multi-attribute swapping

Figure 4 presents qualitative results for the ability of our model to swap multiple attributes at once, by jointly modifying the gender, opened-eyes and glasses attributes. Even in this more difficult setting, our model can generate convincing images with multiple swaps.
We performed additional experiments on the Oxford-102 dataset, which contains about 8,000 images of flowers classified into 102 categories [16]. Since the dataset does not contain labels other than the flower categories, we built a list of color attributes from the flower captions provided by [18]. Each flower comes with 10 different captions. For a given color, we gave a flower the associated color attribute if that color appears in at least 5 out of the 10 captions (a sketch of this labeling heuristic follows Figure 5's caption below). Although naive, this approach was enough to create accurate labels. We resized images to 64 × 64. Figure 5 shows reconstructed flowers with different values of the "pink" attribute. We can observe that the color of the flower changes in the desired direction, while the background remains cleanly unchanged.

Figure 5: Examples of reconstructed flowers with different values of the pink attribute. First-row images are the originals. Increasing the value of that attribute turns flower colors into pink, while decreasing it in images with originally pink flowers makes them turn yellow or orange.
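The labeling heuristic is a few lines; a sketch (the color list here is illustrative, as the paper does not enumerate the exact set):

```python
def color_attributes(captions, colors=("pink", "red", "yellow", "white",
                                       "purple", "orange", "blue")):
    """Give a flower a binary color attribute when the color word appears
    in at least 5 of its 10 captions."""
    return {color: int(sum(color in cap.lower() for cap in captions) >= 5)
            for color in colors}
```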
6 Conclusion

We presented a new approach to generate variations of images by changing attribute values. The approach is based on enforcing the invariance of the latent space with respect to the attributes. A key advantage of our method compared to many recent models [26, 9] is that it generates realistic images of high resolution without needing to apply a GAN to the decoder output. As a result, it could easily be extended to other domains like speech or text, where backpropagation through the decoder can be very challenging, for instance because of a non-differentiable text generation process. Nevertheless, methods commonly used in vision to improve the visual quality of generated images, like PatchGAN, could readily be applied on top of our model.
Acknowledgments
The authors would like to thank Yedid Hoshen for initial discussions about the core ideas of the paper, and Christian Pursch and Alexander Miller for their help in setting up the experiments and Mechanical Turk evaluations. The authors are also grateful to David Lopez-Paz and Mouhamadou Moustapha Cisse for useful feedback and support on this project.