Constrained Generative Adversarial Networks for Interactive Image Generation
Eric Heim
Air Force Research Laboratory, Information Directorate
Rome, NY USA
[email protected]
Abstract
Generative Adversarial Networks (GANs) have received a great deal of attention due in part to recent success in generating original, high-quality samples from visual domains. However, most current methods only allow for users to guide this image generation process through limited interactions. In this work we develop a novel GAN framework that allows humans to be "in-the-loop" of the image generation process. Our technique iteratively accepts relative constraints of the form "Generate an image more like image A than image B". After each constraint is given, the user is presented with new outputs from the GAN, informing the next round of feedback. This feedback is used to constrain the output of the GAN with respect to an underlying semantic space that can be designed to model a variety of different notions of similarity (e.g. classes, attributes, object relationships, color, etc.). In our experiments, we show that our GAN framework is able to generate images that are of comparable quality to equivalent unsupervised GANs while satisfying a large number of the constraints provided by users, effectively changing a GAN into one that allows users interactive control over image generation without sacrificing image quality.
1. Introduction
Learning a generative model from data is a task that has gotten recent attention due to a number of breakthroughs in complex data domains [15, 30, 13]. Some of the most striking successes have been in creating novel imagery using Generative Adversarial Networks (GANs) [6]. While GANs show promise in having machines effectively "draw" realistic pictures, the mechanisms for allowing humans to guide the image generation process have been largely limited to conditioning on class labels [21] (e.g. "Draw a zero.") or domain-specific attributes [35] (e.g. "Draw a coat with stripes."). Such feedback, though powerful, limits the user to expressing feedback through a pre-defined set of labels. If the user is unable to accurately express the characteristics
that they desire using this label set, then they cannot guide the model to produce acceptable images.

Figure 1: Interaction with the CONGAN generator: A user provides a relative constraint in the form of two images meaning "Generate an image more like image A than image B." The constraint is combined with previously given constraints to form a set, which is input to the generator to produce an image. This image is shown to the user to drive further iterations of feedback. The goal for the generator is to "satisfy" the constraints with respect to a mapping to an underlying semantic space. The generator satisfies a constraint (A, B) by producing an image that is mapped to a coordinate closer to where A is mapped than to where B is mapped.

In this work, we seek a more natural and powerful way for humans to interact with a generative model. To this end, we propose a novel GAN technique we call CONstrained GAN (CONGAN). Our model is designed to accept human feedback iteratively, effectively putting users "in-the-loop" of the generation process. Figure 1 illustrates how a user interacts with the CONGAN generator. The generator accepts relative constraints of the form "More like image A than image B." These constraints are used to define a feasible region within a given semantic space that models an underlying notion of similarity between images. The goal of the generator is to accept relative constraints as input, and output an image that is within the corresponding feasible region.

Modeling interaction in this way has two primary benefits. First, such relative pair-wise assessments have been shown to be an easy medium for humans to articulate similarity [14, 26]. As such, CONGAN allows users to refine its output in a natural way. Relative constraints can also be used to allow for different interactions besides providing pair-wise comparisons. If the output from the generator is then input as the B ("less similar") image in the next iteration, the user need only provide the A ("more similar") image. In this way, the user can provide single examples, meaning "More like image A than what was previously generated", to refine the output. Second, within the CONGAN framework, the semantic space defines the characteristics by which users guide the image generation process. This allows for the option of a variety of notions of similarity, such as class and attribute information, but also more continuous or complex notions such as color spaces, or the size of objects within the image.

To achieve this form of interaction, our model must have multiple interrelated components. The generator must be able to accept a variable number of constraints as a set, i.e. the output should be invariant to the order of the input constraints. For this, we leverage recent work in Memory Networks [7, 34, 27, 31] within the CONGAN generator to learn a fixed length vector representation of the constraint set as a whole. In addition, the generator must not only be able to generate realistic looking images, but images that are within the feasible region of a given semantic space. During training, the CONGAN generator is trained against a constraint critic that enforces the output to satisfy given constraints. The result is a generator that is able to produce imagery guided by iterative relative feedback.

The remainder of the paper will proceed as follows. First, we discuss prior related work.
Then, we describe our method, beginning with a formal definition of the constrained generation problem, continuing to an outline of the CONGAN training algorithm, and ending with a description of the CONGAN generator. Next, we perform an evaluation where we compare our method to an unsupervised GAN, showing qualitative and quantitative results. Finally, we conclude.
2. Related Work
Our proposed CONGAN method follows from a long line of work in neural network image generation. Specifically, autoencoders [15], autoregressive models [30], and generative adversarial networks [6] (GANs) have all shown recent success. We chose to learn a model using the GAN framework, as GANs are arguably the best performing generative models in terms of qualitative image quality. Much of the fundamental work in GANs has focused on unsupervised learning settings [6, 39, 1]. The output of these models can be controlled by manipulating the latent space used as input [23, 22]. However, such manipulation is limited in that the latent space often has no obvious human understandable interpretation. Thus, finding ways to manipulate it requires either trial and error or interpolating between two points in the latent space. Other works learn conditional
GAN models [21], where generation is guided by side information, such as class labels [21], visual attributes [35], text [24], and images [28]. In this work, we aim to develop a method that allows more intuitive manipulation of a GAN's output that generalizes to many different forms of similarity. The GAN method most similar to ours is the one introduced in [40]. This method first maps an image to a manifold of natural images using a GAN. Then, they provide a series of image editing operations that users can use to move the image along that manifold. We see our work as related but orthogonal to this work, as both the means for manipulation and the goals of the methods differ.

Another line of research that motivates this work is interactive learning over imagery. Much of the work in this field has focused on classification problems [3, 32, 17, 33], but also others such as learning localized attributes [4]. Most notably, in [16] the authors propose an interactive image search method that allows users to provide iterative refinements to their query, based on visual attributes. This is similar in principle to our method in that their method searches images through interactive comparisons to other images in the domain of interest. However, our method does not necessarily require predefined attributes and generates novel imagery instead of retrieving relevant images from a database.
3. A Model for Constrained Image Generation
The goal of this work is to learn an image generation model in the form of a mapping from a set of pair-wise relative constraints to a realistic looking image. Let X be a domain of images. We wish to learn the mapping:

$$g_\Theta : \left\{ (\mathcal{X} \times \mathcal{X})^i \mid i \geq 0 \right\} \times \mathcal{Z} \mapsto \mathcal{X}$$

This generator maps a set of constraints C = {C_1, C_2, ...} and a random noise vector z ∈ Z to an image, where a constraint C = (X^+, X^-) ∈ X × X is a pair of images meaning "Generate an image more like X^+ than X^-." Intuitively, z represents the variation of imagery allowed within the constraints, and different z will produce different images. Practically, z provides the noise component necessary for our generator to be trained within the GAN framework.

For training our generator, we require a mechanism that determines whether the output of g_Θ satisfies input constraints. To this end, we assume the existence of a mapping φ : X ↦ S that maps images to a semantic space. The only requirements are that φ be differentiable, and that there exists a distance metric d_S over elements of S. For instance, if one wanted to have users manipulate generated images by their attributes (i.e. the dimensions of S correspond to attributes), φ could be a learned attribute classifier (for binary attributes) or regressor (for continuous attributes). We say a generated image X̂ satisfies a given constraint C = (X^+, X^-) with respect to S if the following holds:

$$d_S\left(\phi(\hat{X}), \phi(X^+)\right) < d_S\left(\phi(\hat{X}), \phi(X^-)\right) \quad (1)$$

Given a set of constraints C, the goal of g_Θ is to produce an X̂ that satisfies all constraints in the set. In doing so, the generator produces images that are closer in the semantic space to "positive" images X^+ than to "negative" images X^-. Put another way, C defines a feasible region in S in which X̂ must lie. How we use this idea of relative constraints to train g_Θ is discussed in the following section.

Algorithm 1: CONGAN Training Procedure
Input: gradient penalty coefficient λ, constraint penalty coefficient γ, discriminator iterations per generator iteration n_disc, batch size m, Adam optimizer parameters α, β_1, β_2
repeat
    for t = 1, ..., n_disc do
        for i = 1, ..., m do
            Sample X ∼ P_D, C ∼ P_C, z ∼ Z, ε ∼ U(0, 1)
            X̂ ← g_Θ(C, z)
            X̃ ← εX + (1 − ε)X̂
            L_i ← d_W(X̂) − d_W(X) + λ(‖∇_X̃ d_W(X̃)‖_2 − 1)^2
        end for
        W ← Adam(∇_W (1/m) Σ_{i=1}^m L_i, α, β_1, β_2)
    end for
    Sample batches {z^i}_{i=1}^m ∼ Z, {C^i}_{i=1}^m ∼ P_C
    {X̂^i}_{i=1}^m ← {g_Θ(C^i, z^i)}_{i=1}^m
    L ← (1/m) Σ_{i=1}^m [−d_W(X̂^i) + γ l_{φ,S}(X̂^i, C^i)]
    Θ ← Adam(∇_Θ L, α, β_1, β_2)
until Θ converged

To train the generator g_Θ, we utilize the GAN framework that pits a generator g_Θ against a discriminator d_W, where both g and d are neural networks parameterized by Θ and W, respectively. The discriminator is trained to distinguish outputs of the generator from real image samples. The generator is trained to produce images that the discriminator cannot differentiate from real samples. The two are trained against one another; at convergence, the generator is often able to produce instances that are similar to real samples.

While d_W ensures output images look realistic, we use another model to enforce constraint satisfaction. For this, we introduce the idea of a constraint critic that informs the training procedure in a similar manner as d_W. We define the constraint critic loss as the average loss over each constraint after mapping images into the semantic space:

$$l_{\phi,S}(\hat{X}, \mathcal{C}) = -\frac{1}{|\mathcal{C}|} \sum_{(X^+, X^-) \in \mathcal{C}} p_S\left(\phi(\hat{X}), \phi(X^+), \phi(X^-)\right)$$

The loss over each constraint, p_S, is inspired by the loss used in t-Distributed Stochastic Triplet Embedding (t-STE) [29]:

$$p_S(a, b, c) = \frac{\left(1 + \frac{d_S(a,b)^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1 + \frac{d_S(a,b)^2}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1 + \frac{d_S(a,c)^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}$$

This loss compares pairs of objects according to a Student-t kernel and is motivated by successes in dimensionality reduction techniques that use heavy-tailed similarity kernels [19]. By minimizing the negation of p_S for each constraint, X̂ is "pulled" closer to images X^+ and "pushed" farther from images X^- in S. As a result, using this loss during training will produce images more likely to satisfy constraints.

We leverage the constraint critic loss in tandem with the discriminator to train the CONGAN generator. More specifically, our training algorithm is an extension of the Wasserstein GAN [1, 8].
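To make the constraint critic concrete, the following is a minimal PyTorch sketch of l_{φ,S} and p_S. It assumes φ is a differentiable module and that d_S is Euclidean distance; the function and variable names are ours for illustration, not from any released implementation.

```python
# Sketch of the constraint critic loss, assuming phi: image -> point in S
# is a differentiable torch module and d_S is Euclidean distance.
import torch

def p_S(a, b, c, alpha=1.0):
    """t-Student kernel probability that a is closer to b than to c."""
    k_ab = (1.0 + (a - b).pow(2).sum(dim=-1) / alpha).pow(-(alpha + 1) / 2)
    k_ac = (1.0 + (a - c).pow(2).sum(dim=-1) / alpha).pow(-(alpha + 1) / 2)
    return k_ab / (k_ab + k_ac)

def constraint_critic_loss(phi, x_hat, constraints, alpha=1.0):
    """l_{phi,S}: negative mean t-STE probability over a constraint set."""
    s_hat = phi(x_hat)
    losses = [p_S(s_hat, phi(x_pos), phi(x_neg), alpha)
              for (x_pos, x_neg) in constraints]
    return -torch.stack(losses).mean()
```

Minimizing this loss pushes φ(X̂) toward the positive images and away from the negative ones, exactly the "pull/push" behavior described above.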
We aim to optimize the following:

$$\min_\Theta \max_W \; \mathbb{E}_{X \sim P_D}\left[d_W(X)\right] - \mathbb{E}_{\hat{X} \sim P_g}\left[d_W(\hat{X}) - \gamma\, l_{\phi,S}(\hat{X}, \mathcal{C})\right]$$

Here, P_D is a data distribution (i.e. X is a sample from a training set), and P_g is the generator distribution (i.e. X̂ = g_Θ(C, z) for a given z ∼ Z and a given C drawn from a training set of constraint sets). Finally, d_W is constrained to be 1-Lipschitz. This objective is optimized by alternating between updating discriminator parameters W and generator parameters Θ using stochastic gradient descent, sampling from the training set and generator where necessary. Intuitively, the discriminator's output can be interpreted as a score of how likely the input is to come from the data distribution. When the discriminator updates, it attempts to increase its score for real samples and decrease its score for generated samples. Conversely, when the generator updates, it attempts to increase the discriminator's score for generated images. In addition, generator updates decrease the constraint loss by a factor of the hyperparameter γ. As a result, generator updates encourage g_Θ to produce images similar to those in the image training set, while also satisfying samples from a constraint training set. To enforce the 1-Lipschitz constraint on d_W we use the gradient penalty term proposed in [8].

The CONGAN training procedure is outlined in Alg. 1. This algorithm is very similar to the WGAN training algorithm (Algorithm 1 in [8]) with a few key additions. First, when updating both the discriminator and generator, batches of constraint sets are selected from a training set. In practice, we use φ to construct ground truth constraint sets of variable length from images in the image train set, ensuring that our generator is trained on constraint sets that are feasible in S. Second, the generator update has an additional term: the constraint critic term that encourages constraint satisfaction.
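The two updates of Alg. 1 can be sketched in PyTorch as follows, reusing the constraint_critic_loss sketch above. This is an illustrative, simplified rendering that treats g, d, and phi as black-box modules; it is not the authors' implementation.

```python
# Sketch of one discriminator step (with the gradient penalty of [8]) and
# one generator step (with the gamma-weighted constraint critic term).
import torch

def discriminator_step(d, g, x_real, constraints, z, opt_d, lam=10.0):
    x_fake = g(constraints, z).detach()
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_mix = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(d(x_mix).sum(), x_mix, create_graph=True)[0]
    penalty = ((grad.view(grad.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
    loss = d(x_fake).mean() - d(x_real).mean() + lam * penalty
    opt_d.zero_grad(); loss.backward(); opt_d.step()

def generator_step(d, g, phi, constraints, z, opt_g, gamma=10.0):
    x_hat = g(constraints, z)
    loss = -d(x_hat).mean() + gamma * constraint_critic_loss(phi, x_hat, constraints)
    opt_g.zero_grad(); loss.backward(); opt_g.step()
```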
Figure 2: The CONGAN generator. Purple is the read network, orange is the process network, and green is the write network.

While Alg. 1 outlines how to train g_Θ, we have yet to formally define g_Θ. In order for g_Θ to accept C as a set it must 1) accept a variable number of constraints, and 2) output the same image regardless of the order in which constraints are given. For this we leverage the work of [31], which introduces a neural network framework capable of considering order-invariant inputs, such as sets. An illustration of the CONGAN generator is depicted in Fig. 2. Our generator has three components: 1) a read network used to learn a representation of each constraint, 2) a process network that combines all constraints in a set into a single set representation, and 3) a write network that maps the set representation to an image. Below we describe each of these components.

The read network puts images within a constraint set through a Convolutional Neural Network (CNN) to extract visual features. Feature vectors of images from a common constraint pair are concatenated and input to a fully connected layer. The result is a single vector c_i for each constraint; these vectors are collectively input to the process network.

The process network consists of a "processing unit" that is repeated p times. Let {c_1, ..., c_n} be the output of the read network for a size n set of constraints. For each of the t repetitions of the processing unit, an iteration through an LSTM cell with "content-based" attention is performed:

$$q_t = \mathrm{LSTM}\left(z, q^*_{t-1}\right) \quad (2)$$
$$e_{i,t} = c_i \cdot q_t \quad (3)$$
$$a_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j=1}^{n} \exp(e_{j,t})} \quad (4)$$
$$r_t = \sum_{i=1}^{n} a_{i,t} c_i \quad (5)$$
$$q^*_t = [q_t, r_t] \quad (6)$$

First, z (as "input") and the hidden state from the previous repetition are put through the LSTM unit. The resultant hidden state output of the LSTM, q_t, is then combined with each c_i via dot product to create a scalar value e_{i,t} for each constraint. These are used in a softmax function to obtain scalars a_{i,t}, which in turn are used in a weighted sum. This sum is the key operation that combines the constraints. Because addition is commutative, the result of (5), and thus the output of the processing network, is invariant to the order in which the constraints were given. The result r_t is concatenated with q_t and is used as the input in the next processing iteration. After p steps, q*_p is put through a fully connected layer to produce s, which is input to the write network.

One way of interpreting this network is that each processing unit iteration refines the representation of the constraint set produced by the previous iteration. The output of the processing unit has two parts. First, r_t is a learned weighted average of the constraints, ideally emphasizing constraints with stronger signal. Second, q_t is the output of the LSTM, which combines the noise vector and the output from the previous iteration, using various gates to retain certain features while removing others. These two components are sent back through the processing unit for further rounds of refinement.

Similar to the generator in the unconditional GAN framework, the write network maps a noise vector to image space. Motivated by this, we use the transpose convolutions [5, 25] utilized in Deep Convolutional GANs (DCGANs) [23]. Transpose convolutions effectively learn an upsampling transformation. By building a network from transpose convolutional layers, our write network is able to learn how to map from a lower dimensional representation of the constraint set to a higher dimensional image.
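Equations (2)–(6) can be sketched in PyTorch as below. One caveat: feeding r_{t-1} alongside z as the cell input is one plausible way to realize the recurrence on q*_{t-1} = [q_{t-1}, r_{t-1}] with a standard LSTM cell; it is an illustrative choice, not necessarily the authors' exact implementation.

```python
# Sketch of the order-invariant process network (p repetitions of an
# LSTM cell with content-based attention over constraint vectors).
import torch
import torch.nn as nn

class ProcessNetwork(nn.Module):
    def __init__(self, dim, z_dim, s_dim, p_steps=5):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=z_dim + dim, hidden_size=dim)
        self.to_s = nn.Linear(2 * dim, s_dim)  # FC producing s from q*_p
        self.p_steps = p_steps

    def forward(self, c, z):
        # c: (n, dim) read-network outputs; z: (z_dim,) noise vector
        q = c.new_zeros(1, self.cell.hidden_size)
        mem = torch.zeros_like(q)
        r = c.new_zeros(1, c.size(1))
        for _ in range(self.p_steps):
            q, mem = self.cell(torch.cat([z.unsqueeze(0), r], dim=1), (q, mem))
            e = c @ q.squeeze(0)                 # dot-product scores e_{i,t}
            a = torch.softmax(e, dim=0)          # attention weights a_{i,t}
            r = (a.unsqueeze(1) * c).sum(dim=0, keepdim=True)  # sum r_t
        return self.to_s(torch.cat([q, r], dim=1))  # s from q*_p
```

Because the weighted sum in r_t is commutative over the constraints, permuting the rows of c leaves the output unchanged, which is the order-invariance property claimed above.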
4. Empirical Evaluation
In order to evaluate CONGAN we aim to show its ability to satisfy constraints while achieving the image quality of similar WGAN models. Further, we wish to highlight some examples of how a user can interact with a CONGAN generator. To this end, we perform experiments with three data sets: MNIST [18], CelebA [36], and Zappos50k [37, 38]. In all experiments, we use the hyperparameters suggested in [8] (λ = 10, n_disc = 5, α = 0.0001, β_1 = 0, β_2 = 0.9), follow Algorithm 1 from the same work to train WGAN, and set the batch size m = 32. We seed WGANs with noise vectors z drawn from a standard normal (Z = N(0, I)), and CONGANs with a uniform distribution (Z = U(−1, 1)). We opt to use the uniform distribution as it allows both inputs into the processing network to be in the same range. The noise vectors are of size 64 for the MNIST experiments and of size 128 for the CelebA and Zappos50k experiments. We set p (the number of "processing" steps) to 5 in both experiments, but have observed that CONGAN is robust to this setting.

Figure 3: Example illustrating the order invariance property of CONGAN. On the left are relative constraints (top is the positive image, bottom the negative) in the order they are input to the CONGAN generator. On the right are the images produced for two different z vectors. The output remains the same even when the constraints are given in different orders.

In [1] the authors observe that the Wasserstein Distance can be used to determine convergence. In our experiments, the Wasserstein Distance stopped improving by 100,000 generator update iterations for all models, so we use that as the iteration limit. We chose values for γ that were able to reduce the t-STE train error significantly while maintaining a Wasserstein Distance close to what was achieved by the WGAN. To strike a good balance we set γ = 10 on MNIST, γ = 250 on CelebA, and γ = 100 on Zappos50k.

WGAN models were trained on the designated train sets for MNIST and CelebA. For Zappos50k, we randomly chose 90% of the images as the train set, leaving the rest as test. Constraint sets C in the CONGAN training set are created by first randomly choosing an image of the train set to be a reference image. Then, anywhere between 1 and 10 pairs of images are randomly chosen to be constraints. Next, φ is applied to the reference image and each pair. The resultant representations in S are used to determine which elements of the pairs are considered X^+ (positive examples) and X^- (negative examples) according to (1). Test sets are constructed similarly.

Figure 4: Examples from WGAN (left) and CONGAN (right) generators trained on the MNIST data set.

The CONGAN network architectures used in these experiments are as follows (a more rigorous description can be found in the appendix). For MNIST: the discriminator and read networks are five-layer CNNs, and the write network is a five-layer transpose convolutional network. For CelebA and Zappos50k: the discriminator and read networks are residual networks [10] with four residual CNN blocks, and the write network has four transpose convolutional residual blocks. To maintain some regularity between models in the interest of fair comparison, we use the same discriminator architectures for both WGAN and CONGAN and use the WGAN generator architecture as the CONGAN write network architecture. Other than a few special cases, we use rectified linear units as activation functions and perform layer normalization [2].

MNIST is a well known data set containing 28x28 images of hand-written digits. For preprocessing we zero pad the images to 32x32 and scale them to [-1,1]. For φ we train a "mirrored" autoencoder on the MNIST train set using squared Euclidean loss. The encoder portion consists of four convolutional layers and a fully connected layer with no activation to a two-dimensional encoding. We use the encoder as φ. The decoder has a similar structure but uses transpose convolutions to reverse the mapping.
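As a concrete illustration of the ground-truth constraint set construction described above, the sketch below orders randomly drawn pairs around a reference image using φ and Eq. (1). The helper names and the Euclidean choice of d_S are our assumptions.

```python
# Sketch: build one training constraint set around a random reference image.
import random
import numpy as np

def make_constraint_set(images, phi, max_pairs=10):
    ref = random.choice(images)
    s_ref = phi(ref)
    constraints = []
    for _ in range(random.randint(1, max_pairs)):
        a, b = random.sample(list(images), 2)
        # Order the pair by distance to the reference in the semantic
        # space S, so the reference satisfies the constraint per Eq. (1).
        if np.linalg.norm(phi(a) - s_ref) <= np.linalg.norm(phi(b) - s_ref):
            constraints.append((a, b))   # (X+, X-)
        else:
            constraints.append((b, a))
    return ref, constraints
```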
Simply autoencoding MNIST digits reveals a loose class structure in the embedding space (S in this experiment). As such, this experiment shows how class relationships can be retrieved even if φ does not precisely map to classes.

We seek to evaluate CONGAN's ability to satisfy given constraints. To this end, we constructed ten different test sets, each containing constraint sets of a fixed size. For example, each constraint set in the "2" test set has two constraints. We call an evaluation over a different test set an "experiment". In each experiment, we performed ten different trials where the generator was given different noise vectors per constraint set. With these experiments we can observe the effect constraint set size has on the generator.

Figure 5: Example of CONGAN generator outputs when trained on the CelebA data set. The bottom two rows of images are constraints, where the positive and negative images only differ by a single attribute. The first three constraints differ by only the "Male" attribute, the second three by only the "Beard" attribute, and the third three by only the "Eyeglasses" attribute. The top three rows are images produced from three different seeds when the constraints are provided to the CONGAN generator from left to right. For example, the third image in the first row is generated when z_1 and the first three constraints are given.

Figure 6: Another example of CONGAN generator outputs when trained on the CelebA data set. This is the same experiment as in Fig. 5, but with the attributes "Pale Skin", "Brown Hair", and "Female" from left to right.
Results:
Table 1 shows the mean constraint satisfaction error (i.e. one minus the prevalence of (1)) of the CONGAN generator for each MNIST experiment. Overall, it was able to satisfy over 90% of given constraints. Note that the generator performs slightly better when more constraints are given. This is somewhat counter-intuitive. We believe that in this case the generator is using constraints to determine what class of digit to produce. If given few constraints, it is more difficult for the generator to determine the class of the output. Figures 3 and 4 show example outputs of CONGAN when trained on MNIST: one showing the order invariance property of CONGAN and the other showing CONGAN generated images next to ones produced by a similar WGAN.

Table 1: Mean constraint satisfaction errors of CONGAN on MNIST per input constraint set size (10 trials).
Input constraints:  1       2       3       4       5       6       7       8       9       10
CONGAN:             0.0931  0.0895  0.0860  0.0831  0.0808  0.0784  0.0775  0.0756  0.0743  0.0733
The CelebA data set contains 202,599 color images of celebrity faces. For our experiments, we resize each image to 64x64 and scale to [-1,1]. Associated with each image are 40 binary attributes ranging from "Blond Hair" to "Smiling". We chose twelve of these attributes to be S. More specifically, an image's representation in S is a binary vector of attributes, which differs from the MNIST experiment, where S was both lower dimensional and continuous. As such, this experiment will evaluate CONGAN's ability to adapt to different semantic spaces.

For φ we construct a simple multi-task CNN (MCNN) [9] that consists of one base network and multiple specialized networks, trained end-to-end (details and an evaluation of the MCNN can be found in the appendix). The base network accepts the image as input and extracts features for detecting all attributes. The specialized networks split from the base network and learn to detect their predetermined subset of attributes. Our φ base network consists of two convolutional layers. The specialized networks (one for each of the twelve attributes) consist of three convolutional layers followed by a fully connected layer that maps to a scalar attribute identifier.

For this experiment we sought to more objectively compare the WGAN generated images with those produced by CONGAN. To this end we first train a WGAN on the CelebA train set. Then, we initialize the CONGAN write network and discriminator to the trained WGAN generator and discriminator, respectively, before training the CONGAN generator. By doing this, we can observe how image quality is affected by adding the CONGAN components to a WGAN.
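For a concrete picture of a shared-base, per-attribute-head φ in the spirit of the MCNN just described, here is an illustrative PyTorch sketch. Layer sizes are loosely based on Appendix B.3 but are not the exact architecture.

```python
# Sketch of a multi-task attribute CNN: one base network, one small head
# per attribute; phi(x) is the vector of per-attribute scores.
import torch
import torch.nn as nn

class AttributeMCNN(nn.Module):
    def __init__(self, num_attributes=12):
        super().__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU())
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, 1), nn.Sigmoid())
            for _ in range(num_attributes)])

    def forward(self, x):
        h = self.base(x)                 # shared features for all attributes
        return torch.cat([head(h) for head in self.heads], dim=1)
```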
Results:
Rows one and two of Table 2 (top table) show the mean negative discriminator scores for both the WGAN and CONGAN generators against the WGAN and CONGAN discriminators at convergence over ten trials. We can see that for both discriminators, WGAN generated images are scored very similarly to those generated by CONGAN. This is especially true when considering the standard deviations for the WGAN generator against the WGAN and CONGAN discriminators are 8.75 and 16.22, respectively, and slightly higher on both for the CONGAN generator. We believe this result shows evidence that adding the CONGAN framework to the WGAN training did not drastically alter image quality.

The last row of Table 2 shows the mean constraint satisfaction error on the test set for each experiment. Here, the CONGAN generator is able to satisfy around 87% or more of the constraints. Figures 5 and 6 show images generated by CONGAN. As constraints are provided, the images produced from different seeds take on the attributes indicated by the constraints. In Fig. 5, the first three constraints indicate the "Male" attribute, the next three indicate "Beard", and the last "Eyeglasses". In Fig. 6, "Pale Skin", "Brown Hair", and "Female" are indicated. These examples show that a user can iteratively refine the images to have desired characteristics, and still be given a variety of realistic, novel images.
Figure 7: Three sets of two examples from the CONGAN generator trained on the Zappos data set. The generator was first provided the initial constraint on the left, generating the first (left-most) image in the generated images column. To generate each of the next three images, the generator was fed a constraint where the positive image was the target image, and the negative image was the previously generated image.

The Zappos50K data set contains 50,025 color images of shoes. We resize each image to 64x64 and scale to [-1,1]. For this experiment, we chose S to be a color space. To accomplish this, we computed a 64 bin color histogram over each image and trained a nine-layer CNN to embed the images in two dimensions using a triplet network [11], and used this as φ (a visualization of this embedding can be found in the appendix). We opted to use the t-STE loss in the objective as it produced a clear separation of colors.

There is inherent bias in the Zappos50K data set when it comes to color, as most shoes tend to be black, brown, or white. This poses a challenge in that if constraint sets used for training are formed by uniformly sampling over the train set, the model will tend to favor few colors, making it difficult to guide generation to other colors. To combat this, we constructed constraints to include a more uniform sampling over colors. When constructing the train set of constraint sets, with probability 0.5 we uniformly sampled over training images as in the other experiments. When not sampling uniformly, we focused on a single color bin by first selecting a bin and choosing all positive images in the constraint set to be images where the highest histogram value corresponded to that bin (e.g. all positive examples would be "light blue"). Negative examples would be chosen uniformly from the other bins. We found this allowed the CONGAN generator to more easily learn to produce a variety of colors.
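The color-balanced sampling just described can be sketched as follows. Here `images_by_bin` (mapping a histogram bin to the images whose largest histogram value falls in it) and the helper name are hypothetical.

```python
# Sketch of color-balanced constraint sampling for Zappos50K.
import random

def sample_zappos_constraints(images_by_bin, num_pairs, uniform_pool):
    """With prob. 0.5 sample pairs uniformly; otherwise focus one color bin."""
    if random.random() < 0.5:
        # Uniform pairs; each is later ordered into (X+, X-) via phi and Eq. (1).
        return [tuple(random.sample(uniform_pool, 2)) for _ in range(num_pairs)]
    focus = random.choice(list(images_by_bin))          # pick a color bin
    other = [im for b, ims in images_by_bin.items() if b != focus for im in ims]
    return [(random.choice(images_by_bin[focus]),       # X+ from the focus bin
             random.choice(other))                      # X- from the other bins
            for _ in range(num_pairs)]
```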
Results:
Table 2 (bottom table) shows the discriminator scores and mean constraint satisfaction errors for each Zappos50K experiment. Here, the CONGAN generator produced lower scores than the WGAN for both discriminators, though within one standard deviation. We believe this is due to training the generator to produce a wider variety of colors. If the training data contains many brown, black, and white shoes, then training the generator to produce blue, red, and yellow shoes will force it to produce images that differ from those provided to the discriminator. Nevertheless, we believe that image quality was only slightly degraded as a result.

Figure 7 shows examples of the images produced by the CONGAN generator. Here, we wanted to test the use case of providing single images, instead of pair-wise constraints, to guide the generator to a result. An initial constraint is provided to produce a starting image. After that, a single target image is used repeatedly as the positive example to generate shoes more similarly colored to the target.
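This single-image refinement use case amounts to a short loop. The sketch below assumes a hypothetical trained `generator` callable mapping a constraint set and noise vector to an image; it mirrors the procedure in Figure 7.

```python
# Sketch of single-image refinement: after an initial constraint, the
# previously generated image becomes the negative example and a fixed
# target image the positive example of each new constraint.
def refine_toward(generator, initial_constraint, target, z, rounds=3):
    constraints = [initial_constraint]        # (X+, X-) pair
    generated = generator(constraints, z)     # starting image
    outputs = [generated]
    for _ in range(rounds):
        # "More like the target than what was previously generated"
        constraints.append((target, generated))
        generated = generator(constraints, z)
        outputs.append(generated)
    return outputs
```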
5. Conclusion and Future Work
In this work, we introduce a Generative Adversarial Network framework that is able to generate imagery guided by iterative human feedback. Our model relies on two novel components. First, we develop a generator, based on recent work in memory networks, that maps variable-sized sets of constraints to image space using order-invariant operations. Second, this generator is informed during training by a critic that determines whether generated imagery satisfies given constraints. The result is a generator that can be guided interactively by humans through relative constraints. Empirically, our model is able to generate images that are of comparable quality to those produced by similar GAN models, while satisfying up to 90% of given constraints.

There are multiple avenues of future work that we believe are worthy of further study. First, it may not be feasible for users of CONGAN to search through large image databases to find the exact constraints they desire. We will apply pair-wise active ranking techniques [12] to suggest constraint queries in order to quickly constrain the semantic space without requiring users to search through images themselves. Second, we will investigate the output of the process network more closely, seeing if constraint representations have properties that match intuition about how sets of constraints are classically reasoned about, similar to word embeddings [20].
Acknowledgments:
This work was supported by the AFOSR Science of Information, Computation, Learning, and Fusion program led by Dr. Doug Riecken. Eric would like to thank Davis Gilton (UW-M) and Timothy Van Slyke (NEU) for early exploratory experiments that made this work possible. Eric would also like to thank Ritwik Gupta (SEI/CMU) for reviewing a draft of the paper. Finally, Eric would like to thank his colleagues at AFRL/RI: Dr. Lee Seversky, Dr. Walter Bennette, Dr. Matthew Klawonn, and Dylan Elliot for insightful feedback as this work progressed.

References
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. NIPS Deep Learning Symposium, 2016.
[3] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In ECCV, 2010.
[4] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In CVPR, 2012.
[5] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[7] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
[9] E. M. Hand and R. Chellappa. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In AAAI, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, 2015.
[12] K. G. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In NIPS, 2011.
[13] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[14] M. G. Kendall and J. D. Gibbons. Rank correlation methods. 1990.
[15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
[16] A. Kovashka, D. Parikh, and K. Grauman. Whittlesearch: Image search with relative attribute feedback. In CVPR, 2012.
[17] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, and J. V. Soares. Leafsnap: A computer vision system for automatic plant species identification. In ECCV, 2012.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[19] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[21] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[22] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
[23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[24] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[25] W. Shi, J. Caballero, L. Theis, F. Huszar, A. Aitken, C. Ledig, and Z. Wang. Is the deconvolution layer the same as a convolutional layer? arXiv preprint arXiv:1609.07009, 2016.
[26] N. Stewart, G. D. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881, 2005.
[27] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, 2015.
[28] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In NIPS, 2016.
[29] L. Van Der Maaten and K. Weinberger. Stochastic triplet embedding. In MLSP, 2012.
[30] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[31] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. In ICLR, 2016.
[32] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In ICCV, 2011.
[33] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In CVPR, 2014.
[34] J. Weston, S. Chopra, and A. Bordes. Memory networks. In ICLR, 2015.
[35] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
[36] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In ICCV, 2015.
[37] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
[38] A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In ICCV, 2017.
[39] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
[40] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.

A. Formal Definition of LSTM Component
Equation (2) is a standard LSTM cell:
$$\mathrm{LSTM}\left(z, q^*_{t-1}\right) = o_t * \tanh(h_t), \text{ where}$$
$$o_t = \sigma\left(w_o \cdot [q^*_{t-1}, z] + b_o\right)$$
$$h_t = f_t * h_{t-1} + i_t * \tilde{h}_t$$
$$f_t = \sigma\left(w_f \cdot [q^*_{t-1}, z] + b_f\right)$$
$$i_t = \sigma\left(w_i \cdot [q^*_{t-1}, z] + b_i\right)$$
$$\tilde{h}_t = \tanh\left(w_{\tilde{h}} \cdot [q^*_{t-1}, z] + b_{\tilde{h}}\right)$$

Here, z is used as what is commonly referred to as the "input" to the LSTM, q*_{t-1} is commonly called the "hidden state" of the previous iteration, and LSTM returns the hidden state of the current iteration.
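For clarity, the equations above transcribe directly into code. The NumPy sketch below assumes the weight matrices and biases are given as dictionaries; it is illustrative and not used anywhere in training.

```python
# Direct transcription of the LSTM equations above into NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(z, q_star_prev, h_prev, w, b):
    """One LSTM iteration; w and b are dicts of weight matrices and biases."""
    x = np.concatenate([q_star_prev, z])        # [q*_{t-1}, z]
    f = sigmoid(w["f"] @ x + b["f"])            # forget gate f_t
    i = sigmoid(w["i"] @ x + b["i"])            # input gate i_t
    o = sigmoid(w["o"] @ x + b["o"])            # output gate o_t
    h_tilde = np.tanh(w["h"] @ x + b["h"])      # candidate cell state
    h = f * h_prev + i * h_tilde                # new cell state h_t
    q = o * np.tanh(h)                          # hidden state output q_t
    return q, h
```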
B. Neural Network Architectures used in Experiments
In this section, we outline the neural network architectures used in all experiments in the main paper, layer by layer. Rows of each network description in descending order (top to bottom) indicate layers from input to output. The following naming conventions are used throughout. "Conv" indicates a convolutional layer, "FC" indicates a fully connected layer, and "TConv" indicates a transpose convolutional layer. The column labeled "Ker" indicates the kernel size, "Str" indicates stride, and "Act" indicates the activation function used. Columns labeled "In" and "Out" indicate the shape of the input to the layer and the shape of the output of the layer.
B.1. MNIST Experiments
Below you will find architecture descriptions for the networks used in the MNIST experiments. Note that after each two convolutional or transpose convolutional layers in all networks, layer normalization is used.

φ Network (Encoder)
Layer  In        Ker  Str  Act   Out
Conv   32x32x1   3x3  1    ReLU  32x32x4
Conv   32x32x4   3x3  2    ReLU  16x16x8
Conv   16x16x8   3x3  2    ReLU  8x8x16
Conv   8x8x16    3x3  2    ReLU  4x4x32
Conv   4x4x32    3x3  2    ReLU  2x2x64
FC     2x2x64              None  2

φ Network (Decoder)
Layer  In        Ker  Str  Act   Out
FC     2                   None  2x2x64
TConv  2x2x64    3x3  2    ReLU  4x4x32
TConv  4x4x32    3x3  2    ReLU  8x8x16
TConv  8x8x16    3x3  2    ReLU  16x16x8
TConv  16x16x8   3x3  2    ReLU  32x32x4
Conv   32x32x4   3x3  1    tanh  32x32x1

Discriminator Network (WGAN and CONGAN)
Layer  In         Ker  Str  Act   Out
Conv   32x32x1    3x3  1    ReLU  32x32x64
Conv   32x32x64   3x3  2    ReLU  16x16x128
Conv   16x16x128  3x3  2    ReLU  8x8x256
Conv   8x8x256    3x3  2    ReLU  4x4x512
FC     4x4x512              None  1

Read CNN
Layer  In        Ker  Str  Act   Out
Conv   32x32x1   5x5  1    ReLU  32x32x2
Conv   32x32x2   5x5  2    ReLU  16x16x4
Conv   16x16x4   5x5  2    ReLU  8x8x8
Conv   8x8x8     5x5  2    ReLU  4x4x16
Conv   4x4x16    5x5  2    ReLU  2x2x32
FC     2x2x32              tanh  64

CONGAN Write Network/WGAN Generator
Layer  In         Ker  Str  Act   Out
FC     64                   None  4x4x512
TConv  4x4x512    3x3  2    ReLU  8x8x256
Conv   8x8x256    3x3  1    ReLU  8x8x256
TConv  8x8x256    3x3  2    ReLU  16x16x128
Conv   16x16x128  3x3  1    ReLU  16x16x128
TConv  16x16x128  3x3  2    ReLU  32x32x64
Conv   32x32x64   3x3  1    tanh  32x32x1
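The write network table above transcribes into PyTorch roughly as follows. The padding and output_padding values are our assumptions, chosen to reproduce the listed shapes, and layer normalization is omitted for brevity.

```python
# Illustrative PyTorch transcription of the MNIST write network table.
import torch.nn as nn

mnist_write_network = nn.Sequential(
    nn.Linear(64, 4 * 4 * 512), nn.Unflatten(1, (512, 4, 4)),
    nn.ConvTranspose2d(512, 256, 3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),                                   # 4x4x512 -> 8x8x256
    nn.Conv2d(256, 256, 3, stride=1, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),                                   # 8x8x256 -> 16x16x128
    nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),                                   # 16x16x128 -> 32x32x64
    nn.Conv2d(64, 1, 3, stride=1, padding=1), nn.Tanh(),  # -> 32x32x1
)
```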
B.2. CelebA and Zappos50K Experiments
In this section, we first describe all network architectures used in both the CelebA and Zappos50K experiments. Then we outline the φ networks used for each. Here, "Norm" indicates layer normalization and "ReLU" indicates the application of a rectified linear unit. The "ID" column is used to identify which layers are used in subsequent operations in the residual block. For the residual blocks, the "In" column is either used to indicate the size of the input or the IDs of the layers used as input. The "Add" layers are simply the addition of the two layers identified in the "In" column, with the first ID multiplied by 0.3 before the addition. The "RB↑" layer is a residual block up and "RB↓" is a residual block down.

Discriminator Network
Layer  In       Ker  Str  Act   Out
Conv   64x64x3  3x3  1    ReLU  64x64x64
RB↓
RB↓
RB↓
RB↓

Read CNN
Layer  In       Ker  Str  Act   Out
Conv   64x64x3  3x3  1    ReLU  64x64x8
RB↓
RB↓
RB↓

CONGAN Write Network/WGAN Generator
Layer  In   Ker  Str  Act   Out
FC     128            ReLU  4x4x512
RB↑
RB↑
RB↑
RB↑

Residual Block (Down): maps an a x b x c input to an a x b x d output.

Residual Block (Up): maps an a x b x c input to a (2*a) x (2*b) x d output.

B.3. CelebA φ MCNN
The MCNN we developed for the φ network in our CelebA experiments takes an image and puts it through a "base" network. The output of the base network is then input to twelve "specialized" networks to predict the presence or absence of each of the twelve attributes we used in our experiment. Each of these architectures is outlined below.

φ MCNN Network (Base)
Layer  In        Ker  Str  Act   Out
Conv   64x64x3   7x7  2    ReLU  32x32x64
Conv   32x32x64  5x5  2    ReLU  16x16x128
Norm

φ MCNN Network (Specialized)
Layer  In         Ker  Str  Act   Out
Conv   16x16x128  3x3  2    ReLU  8x8x256
Conv   8x8x256    3x3  2    ReLU  4x4x512
Norm
Conv   4x4x512    3x3  2    ReLU  2x2x1024
FC     2x2x1024             sigm  1
B.4. Zappos50K φ Triplet Network
A triplet network takes three images and puts them through the same network, resulting in an n-dimensional embedding to which standard triplet losses can be applied. Below we describe the network used in our Zappos50K experiments. Note that after each two convolutional layers, layer normalization is applied.

Figure 8: The read network to map a constraint to a vector.
Figure 9: Illustration of the t-th iteration of the process network, beginning with the LSTM unit and ending with q*_t.

φ Triplet Network
Layer  In  Ker  Str  Act  Out
Conv   64x64x3   5x5  1    ReLU  64x64x8
Conv   64x64x8   5x5  2    ReLU  32x32x8
Conv   32x32x8   5x5  1    ReLU  32x32x16
Conv   32x32x16  5x5  2    ReLU  16x16x16
Conv   16x16x16  5x5  1    ReLU  16x16x32
Conv   16x16x32  5x5  2    ReLU  8x8x32
Conv   8x8x32    5x5  1    ReLU  8x8x64
Conv   8x8x64    5x5  2    ReLU  4x4x64
FC     4x4x64              None  2
C. CelebA φ MCNN Training Details and Performance
For training the φ MCNN used in the CelebA data experiments, we chose twelve attributes for the network to predict. We used the Adam optimization method with default parameters, a batch size of 32, and trained the model for 100,000 iterations. The test accuracy of the network for the twelve attributes is shown in the table below. We note that these results are slightly worse than those reported in the original paper, but sufficient for the CONGAN generator to learn how to manipulate images. Performance can be increased by employing the "aux" method described in the original MCNN paper, and by designing the architecture to take advantage of groups of common attributes.
Attribute Accuracy
Bald          0.9836
Black Hair    0.8870
Blond Hair    0.9414
Brown Hair    0.8242
Eyeglasses    0.9901
Goatee        0.9531
Gray Hair     0.9709
Male          0.9760
Mustache      0.9557
No Beard      0.9360
Pale Skin     0.9601
Wearing Hat   0.9832
D. Zappos50K φ Triplet Network Training Details and Performance
We formed the training set for the triplet network by first taking each image in the Zappos50K train set and placing it into one of the 64 color histogram bins according to its highest histogram value. To form each triplet (A, B, C) ("A is more similar to B than C"), we iterated over each bin j, selecting images A and B randomly from j, and image C randomly from another bin. We iterated over each bin 5000 times, creating 320,000 triplets for training. We did a similar process for the test set, but with 1000 "passes" over each bin, making a test set of 64,000 triplets.

We trained the network using default Adam optimization parameters and a batch size of 128. We found that the loss leveled out around 25,000 steps and stopped optimization at that point. Upon convergence, the network was able to satisfy 94.504% of the test triplets. Figure 10 shows samples of the Zappos50K data set embedded in two dimensions using the φ triplet network.

Figure 10: Samples from the Zappos50K data set embedded using the φ triplet network.
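The triplet construction procedure above can be sketched as follows, where `bins` (mapping each color histogram bin to the images whose largest histogram value falls in it) is a hypothetical helper structure.

```python
# Sketch of triplet formation for training the phi triplet network.
import random

def build_triplets(bins, passes=5000):
    """Form (A, B, C) triplets: A and B share a color bin, C comes from another."""
    triplets = []
    bin_ids = list(bins)
    for j in bin_ids:
        for _ in range(passes):
            a, b = random.sample(bins[j], 2)
            other = random.choice([k for k in bin_ids if k != j])
            c = random.choice(bins[other])
            triplets.append((a, b, c))   # "A is more similar to B than to C"
    return triplets
```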