Interactive Subspace Exploration on Generative Image Modelling
TOBY CHONG LONG HIN, The University of Tokyo
I-CHAO SHEN, National Taiwan University
ISSEI SATO, The University of Tokyo
TAKEO IGARASHI, The University of Tokyo
Fig. 1. Our system provides additional control to a pretrained and frozen generative image modeling network without modifying the network itself. The user starts with an input signal (a). They explore the latent space with sliders and pick the best result within the slider space (b). They can optionally provide a guidance image (c). The system provides image candidates that align with the user input (d).
Generative image modeling techniques such as GANs demonstrate highly convincing image generation results. However, user interaction is often necessary to obtain the desired results. Existing attempts add interactivity but require either tailored architectures or extra data. We present a human-in-the-loop optimization method that allows users to directly explore and search the latent vector space of generative image modeling. Our system provides multiple candidates by sampling the latent vector space, and the user selects the best blending weights within the subspace using multiple sliders. In addition, the user can express their intention through image editing tools. The system samples latent vectors based on the user's inputs and presents new candidates to the user iteratively. An advantage of our formulation is that one can apply our method to an arbitrary pre-trained model without developing a specialized architecture or collecting extra data. We demonstrate our method with various generative image modeling applications, and show superior performance in a comparative user study with the prior art, iGAN [Zhu et al. 2016].

Additional Key Words and Phrases: Human-in-the-loop Optimization, Bayesian Optimization, Generative Image Modeling
ACM Reference Format:
Toby Chong Long Hin, I-Chao Shen, Issei Sato, and Takeo Igarashi. 2019. Interactive Subspace Exploration on Generative Image Modelling. ACM Trans. Graph. 1, 1 (June 2019), 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
INTRODUCTION

With the growing maturity of generative image modeling techniques such as generative adversarial networks (GANs) [Goodfellow et al. 2014] and auto-encoder architectures [Kingma and Welling 2013], many applications have been proposed, including high quality face generation [Karras et al. 2018], image inpainting [Iizuka et al. 2017],
Fig. 2. Artists create different art forms with the help of generative image modeling: (a) printed art based on GAN, and (b) fashion design using GAN.

image generation from text [Tao Xu 2018], and image style transfer and translation [Huang et al. 2018; Isola et al. 2017; Zhu et al. 2017]. These applications produce images from a random latent vector and additional inputs (e.g., a sketch or a sentence) that exhibit characteristics observed in the training data.

A common problem with these approaches is the lack of controllability of the generated results. When the user detects a defect in the generated result (e.g., artifacts or unwanted image attributes), it is often difficult to fix such problems. This is particularly relevant to people not familiar with machine learning techniques. For example, some artists create art using GANs, such as face portraits, abstract art, and garment designs (Figure 2; see http://metodica.es/lab.html and https://lbd-ai.com/). With more publicly released machine learning resources, such as source code and pretrained network models, practitioners can simply download them and generate images without understanding all the theory behind the algorithm. However, interfacing with generative image modeling to obtain images with high user satisfaction has remained a challenge due to the lack of a unified user interface.

Some existing approaches allow interactive control of the image generation process, e.g., using color point cues for image colorization [Zhang et al. 2017], or sketches as structural cues for image translation [Isola et al. 2017] and face image editing [Portenier et al. 2018]. However, these approaches have two drawbacks. First, they require new network architectures, usually tailored to allow only specific controls. Second, the tailored network must be trained using additional data such as labels.

iGAN [Zhu et al. 2016] addresses these challenges by building a natural image manifold and allowing the user to explore the manifold by drawing a desired image. It directly optimizes for images that match the user's strokes. Although this approach is straightforward, we observed that users' drawings are usually too poor to properly guide the generation process (Figure 9).

In this paper, we address these challenges by allowing the user to directly explore the latent vector space using human-in-the-loop optimization. The user provides feedback on candidates presented by the system by manipulating sliders and drawing images. Slider control allows the user to explore the subspace around the desired point in the latent vector space. Image editing tools allow the user to directly indicate desired changes to the system, such as to the colors used, and to identify regions that should be preserved and problematic regions that should be removed. The system takes the feedback and provides new candidates that match the user input. This process repeats until the user obtains a satisfying image.

Our method treats pretrained networks as black-box functions and adds interactivity regardless of network architecture. Hence, it is applicable to a broad range of existing works as well as future architectures, which is critical as generative image models are evolving at an immense pace.

Our work builds on human-in-the-loop parameter tweaking for image editing [Koyama et al. 2017].
A notable difference is that our target space is much larger (512-d) than theirs (10-d). Exploring a high-dimensional space with Bayesian optimization can be impractical due to the large number of samples required. Therefore, we propose two novel user interface elements, together with appropriate algorithms, to assist the optimization. First, we use multiple sliders rather than a single slider, allowing the user to explore a larger subspace than the 1-D subspace of the original method. This allows the system to attain an optimum more efficiently. Second, we introduce a content-aware sampling strategy that favors results aligned with user edits. We achieve this by formulating different image operations as a content-aware bias term and adding it to the acquisition function in Bayesian optimization, which provides the next "best" candidates when optimized.

We demonstrate the effectiveness and versatility of our framework on three image generation applications: image generation, image translation, and text-to-image generation, together with user studies and ablation studies.

To summarize, the key contributions of this paper are:

• Introducing a human-in-the-loop optimization framework for guiding generative image modelling, where the user directly explores and searches the latent space assisted by the system.

• Imparting controllability to generative image modeling without requiring tailoring of the network architecture or additional training.

• Two extensions to human-in-the-loop Bayesian optimization [Koyama et al. 2017]: (i) a sequential subspace search that effectively explores the high-dimensional latent vector space, and (ii) a content-aware sampling strategy that favors images aligning with user edits.
RELATED WORK

One promising approach to incorporating different user inputs is to use conditional GANs, which enable inputs such as labels or aerial images, to solve image-to-image translation problems [Isola et al. 2017] and to support sketch-based terrain authoring [Guérin et al. 2017]. Zhang et al. [2017] created additional local and global hint networks to incorporate local and global user inputs. Portenier et al. developed FaceShop [2018], a novel network architecture combining both image completion and translation in a single framework; the user draws strokes, using both geometry and color constraints, to guide face editing. However, these methods require tailored network architectures and training data to endow specific applications with controllability. For many practitioners, such as artists, providing meaningful label data is either financially impractical or impossible due to the abstract nature of their work. We address this problem by developing a generic framework that introduces additional controllability to pretrained models.

iGAN [Zhu et al. 2016] is closely related to our method. It provides a blank canvas on which the user can draw line sketches, paint colors, and warp the image content. As the user provides more input, the system continuously optimizes randomly initialized latent vectors that generate images matching the user's edits. A problem with iGAN is that it relies solely on user-drawn inputs, but casual users often fail to express their intentions accurately by drawing (Figure 9). Our system therefore allows the user to directly explore the latent vector space by using sliders. The user only "evaluates" images presented by the system and "selects" the best ones, rather than "drawing" images accurately. Another difference is that iGAN initializes candidates by random sampling and tends to converge to similar ones as the user adds more input. In contrast, the Bayesian optimization in our work provides candidates that search the space efficiently.

GAN Dissection [Bau et al. 2018] provides a framework for visualizing and understanding the structures learned by generative networks, and provides users with an intuitive painting interface so that they can manipulate generated images by painting objects directly. Unlike our method, their method requires a separate network and segmentation masks to identify the function of specific neurons inside the generative network. Neural Collage [Suzuki et al. 2018] enables users to change the attributes of an image (e.g., change the breed of a dog or the color of petals). It requires both a specific architecture and additional image labels, together with a manually created
mask, to combine the processed result with an original image. GANBreeder (https://ganbreeder.app/) is a community project in which users can simply keep selecting the most interesting image to discover totally new images. Yet, an ideal system "understands" the user even with minimal user interaction.

Bayesian optimization.
Bayesian optimization (BO) is a framework designed to optimize expensive-to-evaluate black-box functions with a minimal number of evaluations:

$\max_{x} f(x),$ (1)

where $f(x)$ is a black-box function with unknown derivatives and convexity properties. During each trial, BO provides new sample candidates based on prior observations; more specifically, it optimizes an acquisition function using a predefined prior. In this work, we focus on BO with a Gaussian process (GP) prior, which is commonly used for such tasks [Koyama et al. 2017]. The acquisition function seeks the next sample candidate that maximizes a criterion such as expected improvement (EI) [Brochu et al. 2010], knowledge gradient (KG) [Scott et al. 2011], or variations thereof; please refer to [Shahriari et al. 2016] for details. As BO effectively approximates arbitrary functions, it is applicable to the exploration of high-dimensional parameter spaces, such as hyperparameter tuning for machine learning algorithms [Bergstra et al. 2015; Murugan 2017], photo enhancement [Koyama et al. 2017], and material BRDF design [Brochu et al. 2007; Koyama et al. 2017]. Bayesian optimization with inequality constraints has also been proposed [Gardner et al. 2014; Gelbart et al. 2014] for scenarios where feasibility cannot be determined in advance. In this problem setting, inequality constraints are incorporated into a black-box optimization:

$\max_{x} f(x) \quad \text{s.t.} \quad c(x) \le \epsilon,$ (2)

where $f(x)$ and $c(x)$ are expensive-to-evaluate black-box functions. Instead, we formulate the constraints by incorporating a feasibility indicator function into the acquisition function itself.

OVERVIEW

Figure 3 shows the workflow of our framework. The system first shows an image $I_0$ (Figure 3(a)) generated with a random vector $z_0$ using a frozen pretrained generator network $G$, i.e., $I_0 = G(z_0)$; $I_j^i$ refers to the $i$-th candidate in the $j$-th iteration. The user may not be satisfied with $I_0$ because of defects or personal preference, such as hair blending in with the background or an unshaven beard. Assume that the user prefers a face similar to the current image $I_0$, but without the defect and the beard. The user first adjusts the multi-way slider to obtain an image $I' = G(z')$ (Figure 3(b)) without defects by blending the generated candidate images $\{I_i = G(z_i)\}_{i=1}^{c}$, where $c$ is the number of candidates (the system presents three new candidates per iteration, as in Figure 3).
Fig. 3. Illustration of our workflow to remove a beard.
At each iteration, the user starts with initial images (a) (supplied by the user or randomly generated). They manipulate the sliders (b) and edit the blended image (c). Our user-guided Bayesian optimization (d) provides the candidates for the next iteration. After the optimization, the system generates three new candidates (e) for the user to explore and to compose a new blended image. Note that (b) shows the same image as the large image in (e).

We provide several image-editing tools, as in iGAN, allowing the user to give additional guidance (Section 4). The editing tools allow the user to directly edit $I'$ by painting over it, or by sourcing an external guidance image $I^*$ (Figure 3(c)). The user presses the next button and the system provides the next set of candidate images. We use $z'$ as $z_0$ and the new latent vectors $\{z_i\}_{i=1}^{c}$ to generate the candidate images $\{I_i\}_{i=1}^{c}$ for the next user input iteration. This iterative process continues until the user obtains an image that matches his/her preferences.

USER INTERFACE

Our user interface consists of a main viewing window and multiple candidate images, as shown in Figure 4. For each candidate image, we provide an associated slider. Adjusting the sliders enables the user to explore and compose a new image $\tilde{I}$ that is shown in the principal window. Moreover, we provide several local image editing tools to enable additional user guidance. At each iteration, the user specifies preferences by manipulating the sliders and optionally performing local edits. The user can drag the sliders to "blend" images, and paint on the blended image locally. Finally, the user hits the "next" button to request that the system update its internal model and present new candidates.

At iteration $t$, the user manipulates the sliders to compose a new image $I'_t$ that is the closest to his/her preferences within the slider space. The slider values correspond to blending weights of the candidate images; the user explores the convex subspace of the latent vector space bounded by the candidate images. If the user is not satisfied with the blended image $I'_t$, he/she can use the image editing tools to create a guidance image $I^*_t$ (Section 4.2). Otherwise, he/she can directly use the blended image as the guidance image, i.e., $I^*_t = I'_t$.

The user creates a guidance image by applying the image editing tools to the blended image $I'_t$, and the system then provides new candidates that match the guidance image.

Color painting assigns colors to any region of the blended image.
Eraser removes content locally. The user uses the eraser when they are not satisfied with a certain region and would simply like to see change in that region.
Fig. 4. Screenshot of our user interface. It allows users to control the generation process through sliders and image-editing operations.
Keep allows the user to specify a region to remain unchanged.
Copy & Paste
The user finds external images from any source, e.g., from a web search or image databases, that match his/her preferences for a specific region. The user then copies and pastes that region of the external image onto $I'_t$. Note that our method allows the user to use any external image editor to provide image-based guidance.

METHOD

Our method focuses on adjusting the input latent vector $z$ to control the image generation process $G(z)$ of generative image modeling. We consider this latent vector adjustment to be a numerical optimization, and model the user's preference as our objective function. Since evaluating such a user preference function is expensive and hard, we design our method based on Bayesian optimization, which aims to optimize it with a minimal number of observations. We include pseudo-code in the supplemental material to assist understanding.

Given an initial latent vector $z$ that generates an image $I = G(z)$, we seek to maximise $g(I)$, a user preference function describing how much a user prefers an image $I$. Koyama et al. [2017] extend BO and propose a slider interface that allows users to consistently express their preferences over a large number of options. In this work, we propose a multi-slider user interface to accommodate the large search space of generative image methods.
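As a complement to the pseudo-code in the supplemental material, the overall loop can be summarized as the following minimal, self-contained Python sketch. This is not the authors' implementation: `fake_generator` and `fake_user_weights` stand in for the frozen generator $G$ and for real slider interaction, and the candidate-proposal step is replaced by simple perturbation (the actual system uses the Bayesian optimization described below).

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_generator(z):
    # Stand-in for a frozen pretrained generator G; returns a dummy "image".
    return np.tanh(z[:16].reshape(4, 4))

def fake_user_weights(c):
    # Stand-in for slider interaction: random normalized blending weights.
    s = rng.random(c)
    return s / s.sum()

def human_in_the_loop(G, latent_dim=512, c=3, iterations=5):
    # One slider per candidate; the blended latent is carried to the next round.
    candidates = [rng.standard_normal(latent_dim) for _ in range(c)]
    z_blend = None
    for _ in range(iterations):
        weights = fake_user_weights(len(candidates))
        z_blend = sum(w * z for w, z in zip(weights, candidates))
        # In the real system, Bayesian optimization proposes the next candidates
        # from all past interactions; here we simply perturb the current blend.
        candidates = [z_blend] + [
            z_blend + 0.3 * rng.standard_normal(latent_dim) for _ in range(c - 1)
        ]
    return G(z_blend)

final_image = human_in_the_loop(fake_generator)
```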
User interaction and its mathematical formulation.
Our user interface consists of $c$ sliders, each representing a latent vector $z_1, \ldots, z_c$ and a corresponding image $I_1, \ldots, I_c$. The user explores the entire subspace by steering through it with the provided sliders. Given slider values $\{s_i\}_{i=1}^{c}$, the blending weight $a_i$ for each latent vector $z_i$ is given by $a_i = s_i / \sum_{j=1}^{c} s_j$. Note that we use normalized slider values, i.e., slider values $(1, 1, 1, 1)$ are equivalent to $(0.25, 0.25, 0.25, 0.25)$. The blended image $I^b = G(z^b)$, generated by the blended latent vector $z^b = \sum_{j=1}^{c} a_j z_j$, is updated and displayed on the left side of the user interface as the user manipulates the sliders (Figure 4). After the user finishes manipulating the sliders, we assume that they have arrived at the global maximum of the user preference function within the subspace, i.e., that they have chosen the best image within the subspace. Below, we refer to the underlying perceptual response $g(I)$ of a user observing an image $I = G(z)$ generated by a fixed generator $G$ as the "user preference at latent space". To avoid complicated notation, we use $g(z)$ as shorthand for $g(G(z))$, since we assume a fixed $G$. We refer to $g(z)$ as the goodness value of $z$ and to $g$ as the function itself. The latent space refers to the search space defined during the training of $G$.
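For concreteness, here is a small sketch of the normalization and blending just defined; the function name and the 512-d latent size (as used for PGGAN) are illustrative.

```python
import numpy as np

def blend_latents(slider_values, latents):
    """Normalize raw slider values into blending weights a_i = s_i / sum_j s_j
    and return the blended latent z_b = sum_i a_i * z_i."""
    s = np.asarray(slider_values, dtype=float)
    a = s / s.sum()                  # e.g. (1, 1, 1, 1) -> (0.25, 0.25, 0.25, 0.25)
    return np.tensordot(a, np.stack(latents), axes=1)

z_b = blend_latents([1.0, 3.0, 2.0, 2.0],
                    [np.random.randn(512) for _ in range(4)])
# The displayed image is then I_b = G(z_b) for the frozen generator G.
```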
Modelling user preference.
We formulate the estimation of the goodness values $g(z)$ from slider interaction as a maximum a posteriori (MAP) estimation. Following sequential line search [Koyama et al. 2017], slider manipulation is modeled with the Bradley-Terry-Luce (BTL) model [Tsukida and Gupta 2011], and user preference at latent space with a Gaussian process prior. This allows targeting general generative model inference with no domain-specific knowledge assumption. Concretely, the objective that we maximize in the MAP estimation is

$(g^{\mathrm{MAP}}, \theta^{\mathrm{MAP}}) = \operatorname*{argmax}_{g, \theta} p(g, \theta \mid \mathcal{P}, z^b) = \operatorname*{argmax}_{g, \theta} p(\mathcal{P}, z^b \mid g, \theta)\, p(g \mid \theta)\, p(\theta),$ (3)

where $\theta$ denotes the parameters of the Gaussian process, $\mathcal{P}$ is the set of latent vectors sampled through user interaction, and $g^{\mathrm{MAP}}$ and $\theta^{\mathrm{MAP}}$ are the MAP estimates of $g$ and $\theta$, respectively. This optimization serves two purposes: to extract a numerical value of user preference from slider manipulation (slider manipulation modelling), and to estimate the user preference for any unobserved latent variable $z$ (user preference modelling).

Given a set of $c$ multidimensional latent variables $\mathcal{P}_t = \{z_t^i\}_{i=1}^{c}$, and the variable corresponding to the blend $z_t^b$ chosen by the user, we describe this situation as

$z_t^b \succ \mathcal{P}_t$ (4)

and its likelihood under the BTL model as

$p(z_t^b \succ \mathcal{P}_t \mid \{g(z_t^i)\}_{i=1}^{c}) = \frac{\exp(g(z_t^b)/s)}{\sum_{i=1}^{c} \exp(g(z_t^i)/s)},$ (5)

where $s$ is a hyperparameter that adjusts the sensitivity of the model and is fixed across all experiments. We also model the underlying user preference over the latent space as a Gaussian process (GP). We assume the GP to follow a multivariate Gaussian distribution, parameterized by $\theta$, which describes the kernel used in the function. In all of our experiments, we follow Koyama et al. [2017] and use the RBF kernel. As the observations and $\theta$ are conditionally independent given $g$, at iteration $t$ we have

$p(\mathcal{P}, z^b \mid g, \theta) = p(\mathcal{P}, z^b \mid g) = \prod_{j=1}^{t} p(z_j^b \succ \{z_j^i\}_{i=1}^{c} \mid g).$ (6)
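A minimal sketch of the per-iteration BTL likelihood (Eq. 5) follows; the MAP objective (Eq. 3) sums these log-likelihoods over iterations (Eq. 6) and adds the GP log-prior. The numerically stabilized softmax is our own choice, not something prescribed by the paper.

```python
import numpy as np

def btl_log_likelihood(g_values, chosen_index, s=1.0):
    """Log-likelihood of the user preferring one candidate under the
    Bradley-Terry-Luce model (Eq. 5); s is the sensitivity hyperparameter."""
    scaled = np.asarray(g_values, dtype=float) / s
    scaled -= scaled.max()           # stabilize the softmax numerically
    log_probs = scaled - np.log(np.exp(scaled).sum())
    return log_probs[chosen_index]

# Goodness values for four candidates; the user chose the blend (index 0).
print(btl_log_likelihood([1.2, 0.3, -0.5, 0.9], chosen_index=0))
```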
Fig. 5. An illustration of the sequential subspace search process using a two-dimensional test function. The iteration proceeds from left to right. At each iteration, the user first locates the local maximum (white dot) within the search region; the system then locates the next 3 most probable candidates (colored dots), which form the subspace for the next iteration.
We also define the following priors:

$p(g \mid \theta) = \mathcal{N}(g; \mathbf{0}, K),$ (7)

$p(\theta_i) = \mathcal{LN}(\mu_0, \sigma_0^2),$ (8)

$p(\theta) = \prod_i p(\theta_i),$ (9)

where $p(g \mid \theta)$ is the GP prior, $K$ and $\theta$ are the covariance matrix and the parameters of the kernel (the RBF kernel in our experiments) over the latent variables $\{\mathcal{P}, z^b\}$, and $\mathcal{LN}$ denotes a log-normal hyperprior. We also construct $g'$, a Gaussian process regressor that approximates $g$ using $K$ and $\theta$. For further explanation and implementation details of the Gaussian process with sequential line search, we refer the reader to the sequential line search paper [Koyama et al. 2017]. Figure 5 illustrates an example optimization sequence of our subspace search with multi-way sliders.

For iteration $t > 1$, we use the latent variable chosen in the last iteration as the starting latent variable, such that $z_t = z_{t-1}^{\mathrm{chosen}}$. Therefore, at iteration $t > 1$ we have $m = t \cdot c$ previously observed latent vectors and the corresponding observations

$\mathcal{O}_m = \{\mathcal{P}_m, g'(\mathcal{P}_m)\}.$ (10)

The next observations $\{z_{m+i}\}_{i=1}^{c}$ should be "the ones most worth observing" based on all previously observed data $\mathcal{O} = \{\mathcal{O}_n\}_{n=1}^{m}$. We define an acquisition function $a_{\mathcal{O}}(z)$ to measure the "worthiness" of the next sampling candidate $z_{m+1}$. At each iteration, the system maximizes the acquisition function to determine the next sampling point:

$z_{m+1} = \operatorname*{argmax}_{z \in \mathcal{Z}} a_{\mathcal{O}}(z).$ (11)

To choose the next sampling point most worth sampling, the expected improvement (EI) criterion is often used. Let $g'^{+}$ be the maximum value among the observations $\mathcal{O}$; the acquisition function is then defined as

$a_{\mathcal{O}}^{EI}(z) = \mathbb{E}[\max\{g'(z) - g'^{+}, 0\}],$ (12)

where $\mathbb{E}[X]$ denotes the expected value of $X$.
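Eq. 12 is the expectation form of EI; under a Gaussian posterior with mean $\mu(z)$ and standard deviation $\sigma(z)$, it has the standard closed form sketched below. The sketch assumes such a posterior is available from the fitted GP regressor $g'$; the optional exploration offset `xi` is a common extra, not part of the paper's formulation.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, g_best, xi=0.0):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2); this is the
    standard evaluation of the expectation in Eq. 12, with g_best = g'^+."""
    sigma = np.maximum(sigma, 1e-12)     # avoid division by zero
    u = (mu - g_best - xi) / sigma
    return (mu - g_best - xi) * norm.cdf(u) + sigma * norm.pdf(u)

# Example: posterior mean 0.8, posterior std 0.3, best observed goodness 0.7.
print(expected_improvement(0.8, 0.3, 0.7))
```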
Selection of multiple candidates.
We combine expected improvement with the constant liar strategy [Ginsbourger et al. 2010] to acquire multiple points at each iteration (a sketch is given below). We first obtain one candidate by maximizing the current acquisition function. Then, we assign the maximum observed score to this sample point and update the acquisition function (i.e., we assume the new candidate is as good as the best candidate seen so far). We then pick the second candidate that maximizes the updated acquisition function, and repeat this process to obtain the remaining $c - 1$ candidates.

To incorporate the user guidance given by image editing (Section 4.2), we extend the original acquisition function (Eq. 12) into the following form:

$a_{\mathcal{O}}^{cEI}(z) = \mathbb{E}[\max\{g'(z) - g'^{+}, 0\}] - \sigma_1 C(G(z)) - \sigma_2 R(z),$ (13)

where $C(I)$ is the content-aware bias term, $R(z)$ is a regularization term, and $\sigma_1$ and $\sigma_2$ are balance weights.
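The constant-liar batch selection described above can be sketched as follows, under stated assumptions: `refit_gp` and `acquisition` are hypothetical callables for fitting the GP regressor and evaluating the acquisition function, and the random-restart maximization is a placeholder for whatever optimizer the actual system uses.

```python
import numpy as np

def select_candidates(acquisition, refit_gp, observations, n_candidates,
                      latent_dim=512, n_restarts=20):
    """Constant-liar batch selection: after each pick, pretend the pick
    achieved the best score seen so far ("the lie"), refit, and maximize
    the acquisition function again."""
    lied = list(observations)            # (z, g) pairs observed so far
    best_g = max(g for _, g in lied)
    picks = []
    for _ in range(n_candidates):
        gp = refit_gp(lied)
        # Crude maximization by random restarts; purely illustrative.
        proposals = [np.random.randn(latent_dim) for _ in range(n_restarts)]
        z_next = max(proposals, key=lambda z: acquisition(gp, z, best_g))
        picks.append(z_next)
        lied.append((z_next, best_g))    # constant lie: assume it ties the best
    return picks
```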
Regional guidance term.
We handle all the guidance from the image editing operations with the following term:

$C(I) = \sum_{x} \sum_{y} (I^*_{x,y} - I_{x,y})^2 \, M_{x,y},$ (14)

where $M$ is a mask initialized to all zeros and $x, y$ are the 2D pixel coordinates of the image. For color painting and copy & paste, we set $M = 1$ in the edited region. Keep preserves the region inside the ROI by setting $M = 1$ there. Eraser sets $M$ to a negative weight in the erased region, rewarding change in that region.
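Below is a sketch of the content-aware bias (Eq. 14) and of the extended acquisition (Eq. 13), with the regularizer $R(z)$ of the next paragraph specialized to a standard normal prior, for which $-\log p(z) = 0.5\,\|z\|^2$ up to an additive constant. The $-1$ eraser weight and the balance weights `sigma1`, `sigma2` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def content_bias(image, guidance, mask):
    """Content-aware bias C(I) (Eq. 14): masked squared difference between
    the generated image and the guidance image. mask is 0 where no edit was
    made, 1 for paint/copy-paste/keep regions, and (assumed) -1 for erased
    regions so that change there is rewarded."""
    return np.sum((guidance - image) ** 2 * mask)

def content_aware_acquisition(ei_value, image, guidance, mask, z,
                              sigma1=1.0, sigma2=0.01):
    """Extended acquisition a_cEI (Eq. 13), with R(z) = 0.5 * ||z||^2
    (standard normal prior, up to a constant)."""
    reg = 0.5 * np.dot(z, z)
    return ei_value - sigma1 * content_bias(image, guidance, mask) - sigma2 * reg

# Toy example: 4x4 "images" with an edit in the top-left 2x2 region.
img = np.zeros((4, 4)); guide = np.zeros((4, 4)); m = np.zeros((4, 4))
guide[:2, :2] = 1.0; m[:2, :2] = 1.0
print(content_aware_acquisition(0.3, img, guide, m, np.random.randn(512)))
```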
Regularization term.
We introduce a regularization term to incorporate our prior knowledge of $z$, i.e., that the latent variables are sampled from a known distribution $p$ (e.g., a normal distribution) during training. We therefore design a regularization term $R$ that prevents the estimate from deviating too far from $p$: $R(z) = -\log(p(z))$.

APPLICATIONS

We applied our framework to a face generation application using PGGAN [Karras et al. 2018] to control the appearance of human faces. We used the pretrained generator model provided by the authors (https://github.com/tkarras/progressive_growing_of_gans). Our method can be used to control different parts of the face: to change hair colors and styles (change hair color and style in Figure 1(a), and remove hair in Figure 7(c)); to add or remove beards (Figure 7(a)); to remove accessories such as earrings and glasses (Figure 7(b)); and to add or remove smiles (Figure 7(d)). In addition, we applied our framework to image-to-image translation and text-to-image synthesis. Representative results are shown in Figure 1. We used either pretrained generator networks or networks trained as described in the original papers. During user interactions, the generator network parameters were frozen.
Image Translation.
We applied our method to MUNIT [Huang et al. 2018], which performs image-to-image translation (e.g., handbag sketches into photographs of handbags of different colors). We used the pretrained models provided by the authors (https://github.com/NVlabs/MUNIT). MUNIT separates content and style into separate latent vectors, and we applied our method to the 8-d style vector (the blue latent vector part of Figure 6(b)). Our method can be used to design new heels (Figure 1(b)), boots (Figure 10(a)), and handbags (Figure 10(b)) given sketch inputs, by assigning different colors to different parts.

Fig. 6. Overview of the generator architectures we used to demonstrate our method: (a) face synthesis (PGGAN), (b) image translation (MUNIT), (c) text-to-image (AttnGAN). The blue part is the latent space in which our framework operates.
Image Synthesis from Text.
Finally, we applied our method to AttnGAN [Tao Xu 2018], which performs text-to-image synthesis. We chose AttnGAN as a state-of-the-art text-to-image synthesis method to show that our framework can handle various network architectures. AttnGAN encodes the input text into a sentence vector, accompanied by a random latent vector drawn from a normal distribution $\mathcal{N}(0, 1)$, and synthesizes an image. We applied our method to the 100-d random noise vector (the blue latent vector part in Figure 6(c)). We show that our method can be used to improve the synthesized bird images (Figure 1(c) and Figure 11(b)) and to repair artifacts (Figure 11(a)).

USER STUDY

We conducted a user study to clarify how our method compares with iGAN [Zhu et al. 2016] with respect to steering the image generation process. We measure the perceptual quality of the images and the time required to generate them.
Procedure.
We used face image editing with a reference image as the target task. We randomly sampled four images from a subset of the CelebA dataset [Liu et al. 2015], excluding the images used to train PGGAN [Karras et al. 2018]. Each participant was shown a reference image and asked to generate an image as similar as possible to the reference using either our method or iGAN. Participants were judged to have finished the task when (1) they were satisfied with the result, (2) they found it hard to improve the result, or (3) the 10-minute time limit was reached. We recruited 8 people for the user study, 3 female and 5 male, and divided them randomly into two groups of 4 people each. Each session proceeded as follows: tutorial A → A1 → A2 → tutorial B → B1 → B2, where A1 denotes using method A on image 1. The order of the methods and images was fully balanced. Before the participants started to edit the test images, we provided a 10-minute tutorial session on both methods. From both methods, we removed the import functions ("copy and paste" in our method and "import" in iGAN) to prevent participants from directly using the reference image as input. This better simulates the actual use case, in which users have no access to any picture of the desired goal, only a vague idea in their minds.
Crowdsourced Evaluation.
To evaluate the perceptual quality of the images generated in the user study, we carried out a crowd-sourced comparison study. For each query, we showed crowdworkers three images: a reference image and the edited results from both iGAN and our method. We then asked them to select the result that better matched the reference image. We composed a survey of 16 queries and conducted it through the Amazon Mechanical Turk interface; each participant was shown all 16 queries, with each query shown twice and the order of the candidates switched. For each crowdworker, we discarded inconsistent answers, where a duplicated query was answered differently, and discarded all answers from participants who answered over 25% of the queries inconsistently. In total, we report the results obtained from 50 crowdworkers who passed our consistency checks.
Study Result.
We compared the user interaction time for both methods; the results are shown in Figure 8(a). The interaction time was significantly shorter using our method than using iGAN. Figure 8(b) shows the results of the crowd-sourced comparison study: the results obtained using our method were voted higher than those obtained using iGAN across all four reference images. In total, 77% of the crowdworkers preferred our results to those obtained using iGAN. The edited images obtained from the user study are shown in Figure 9. We attribute these differences to the fact that novice users often fail to render their mental images accurately using drawing tools. In addition, preprocessing is often applied to images during training (e.g., cropping and scaling), and users might not draw in accordance with those rules. This makes direct optimization, as in iGAN, perform poorly in practical tasks. In contrast, the same users find it easier to compare and evaluate images.
Additional study.
Besides the above user study, we performed additional studies to analyse the theoretical and practical advantages of multiple sliders over a single slider: the slider-time complexity study shows that the user explores the latent space more efficiently with 4 sliders than with 1, and the ablation study shows the theoretical advantage of multiple sliders. Details of both studies are located in the supplemental material.
Fig. 7. We use our method to control the synthesis results of PGGAN [Karras et al. 2018]. The user is able to (a) add a beard, (b) remove glasses, (c) make a man bald and old, and (d) make a woman smile.
Fig. 8. Results of the user study. (a) Average interaction time (sec) for all participants, i.e., the time the user interacted with the system, excluding wait time for computation. (b) The percentages of crowdworkers preferring our method and iGAN on the different reference images (shown in Figure 9).
CONCLUSION

We introduced a human-in-the-loop optimization method that allows the user to explore the latent vector space of generative image modeling with sliders and image editing. Our method only requires a pretrained generative model, so it provides a plug-and-play option for various generative image modeling methods. Sliders allow sampling a large latent hyper-volume efficiently. In addition, we model the user preference using indirect observation (slider interaction) rather than direct input (user strokes). Our method is easier for users, and encourages them to explore the latent space instead of drawing, at which they tend to fail. The focus on exploration also provides artistic freedom to the user by not requiring a set goal in mind. It supports a wide range of applications beyond those in Section 6. We encourage readers to apply our method to generate artwork with their own architectures and datasets.
REFERENCES
David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B Tenenbaum, William T Freeman, and Antonio Torralba. 2018. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. arXiv preprint arXiv:1811.10597 (2018).
James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins, and David D Cox. 2015. Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery 8, 1 (2015), 014008.
Fig. 9. Here we show several edited results from the user study. Grey lines in the iGAN edits represent contour sketching; the remaining strokes represent colour painting. (Please find more in the supplemental material.)
Fig. 10. The control sequences of our framework running on MUNIT [Huang et al. 2018]. The users are able to control the sliders and use the color paint tool to design (a) a new boot, and (b) a new handbag.
Fig. 11. We use our method to control the synthesis results of a text-to-image application (AttnGAN [Tao Xu 2018]). The user is able to (a) add a larger beak, and (b) improve the synthesized bird in only a few iterations.
Eric Brochu, Tyson Brochu, and Nando de Freitas. 2010. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 103–112.
Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. 2007. Active Preference Learning with Discrete Choice Data. In Advances in Neural Information Processing Systems.
Jacob R. Gardner, Matt J. Kusner, Zhixiang Xu, Kilian Q. Weinberger, and John P. Cunningham. 2014. Bayesian Optimization with Inequality Constraints. In Proceedings of the 31st International Conference on Machine Learning. 937–945.
Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. 2014. Bayesian Optimization with Unknown Constraints. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence.
David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. 2010. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems. Springer, 131–162.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
Éric Guérin, Julie Digne, Éric Galin, Adrien Peytavie, Christian Wolf, Bedrich Benes, and Benoît Martinez. 2017. Interactive Example-based Terrain Authoring with Conditional Generative Adversarial Networks. ACM Trans. Graph. 36, 6, Article 228 (Nov. 2017), 13 pages. https://doi.org/10.1145/3130800.3130804
Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal Unsupervised Image-to-image Translation. arXiv preprint arXiv:1804.04732 (2018).
Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and Locally Consistent Image Completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017) 36, 4, Article 107 (2017), 14 pages.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of GANs for improved quality, stability, and variation. In Proc. International Conference on Learning Representations (ICLR).
Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
Yuki Koyama, Issei Sato, Daisuke Sakamoto, and Takeo Igarashi. 2017. Sequential Line Search for Efficient Visual Design Optimization by Crowds. ACM Trans. Graph. 36, 4 (2017).
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).
Pushparaja Murugan. 2017. Hyperparameters Optimization in Deep Convolutional Neural Network / Bayesian Approach with Gaussian Process Prior. CoRR abs/1712.07233 (2017). arXiv:1712.07233 http://arxiv.org/abs/1712.07233
Tiziano Portenier, Qiyang Hu, Attila Szabó, Siavash Arjomand Bigdeli, Paolo Favaro, and Matthias Zwicker. 2018. Faceshop: Deep Sketch-based Face Image Editing. ACM Trans. Graph. 37, 4, Article 99 (July 2018), 13 pages. https://doi.org/10.1145/3197517.3201393
Warren Scott, Peter Frazier, and Warren Powell. 2011. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization 21, 3 (2011), 996–1026.
Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2016), 148–175.
Ryohei Suzuki, Masanori Koyama, Takeru Miyato, Taizan Yonetsuji, and Huachun Zhu. 2018. Spatially Controllable Image Synthesis with Internal Representation Collaging. CoRR abs/1811.10153 (2018). arXiv:1811.10153 http://arxiv.org/abs/1811.10153
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Kristi Tsukida and Maya R. Gupta. 2011. How to Analyze Paired Comparison Data. Technical Report. University of Washington, Department of Electrical Engineering.
Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. 2017. Real-Time User-Guided Image Colorization with Learned Deep Priors. ACM Transactions on Graphics (TOG) 36, 4 (2017).
Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. 2016. Generative Visual Manipulation on the Natural Image Manifold. In Proceedings of European Conference on Computer Vision (ECCV).
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV).