A Study of Cross-domain Generative Models applied to Cartoon Series
Eman T. Hassan        David J. Crandall
School of Informatics, Computing, and Engineering
Indiana University, Bloomington, IN USA
Email: {emhassan, djcran}@indiana.edu

Abstract
We investigate Generative Adversarial Networks (GANs) to model one particular kind of image: frames from TV cartoons. Cartoons are particularly interesting because their visual appearance emphasizes the important semantic information about a scene while abstracting out the less important details, but each cartoon series has a distinctive artistic style that performs this abstraction in different ways. We consider a dataset consisting of images from two popular television cartoon series, Family Guy and The Simpsons. We examine the ability of GANs to generate images from each of these two domains, when trained independently as well as on both domains jointly. We find that generative models may be capable of finding semantic-level correspondences between these two image domains despite the unsupervised setting, even when the training data does not give labeled alignments between them.
1. Introduction
Filmmakers and authors may not wish to admit it, but almost every story – and almost every work of literature and art in general – borrows heavily from those that came before it. Sometimes this is explicit: the 2004 movie Phantom of the Opera is a film remake of the famous Andrew Lloyd Webber musical, which was based on an earlier 1976 musical, which in turn was inspired by the 1925 silent film, all of which are based on the 1910 novel by Gaston Leroux. Sometimes the borrowing is for humor – the TV sitcom Modern Family's episode Fulgencio was a clear parody of The Godfather, for example – or for political expression, such as The Onion's satirized versions of news stories. Even highly original stories still share common themes and ingredients, like archetypes for characters [10] ("the tragic hero," "the femme fatale," etc.) and plot lines [2] ("rags to riches," "the quest," etc.).

Figure 1: We apply Generative Adversarial Networks (GANs) to model the styles of two different cartoon series, Family Guy (top) and The Simpsons (bottom), exploring their ability to generate novel frames and find semantic relationships between the two domains.

The fact that stories inspire one another means that the canon of film is full of similarities and latent connections between different works. These connections are sometimes obvious and sometimes subtle and highly debatable. To what extent can computer vision find these connections automatically, based on visual features alone?

As a starting point, here we explore the ability of Generative Adversarial Networks (GANs) to model the style of long-running television series. GANs have shown impressive results in a wide range of image generation problems, including infilling [28], automatic colorization [3], image super-resolution [19], and video frame prediction [23, 37]. Despite this work, many questions remain about when GAN models work well. One problem is that it is difficult to evaluate the results of these techniques objectively, or to understand the mechanisms and failure modes under which they operate. Various threads of work are underway to explore this, including network visualization approaches (e.g., [41]) and attempts to connect deep networks with well-understood formalisms like probabilistic graphical models (e.g., [16]). Another approach is to simply apply GANs to various novel case studies that may give new insight into when they work and when they fail, as we do here. While predicting frames from individual videos has been studied [37], joint models of entire series may offer new insights. TV series usually include a set of key characters that are prominently featured in almost every episode, and a supporting set of characters that may appear only occasionally. Most series feature canonical recurring backgrounds or scenes (e.g., the coffee shop in Friends), along with others that occur rarely or even just once.

In this paper, we consider the specific case of television cartoon series. We use TV cartoons because they are more structured and constrained than natural images: they abstract out photo-realistic details in order to focus on high-level semantics, which may make it easier to understand the model learned by a network. Nevertheless, different cartoon series differ significantly in the appearance of characters, color schemes, background sets, artistic designs, etc. In particular, we consider two specific, well-known TV cartoon series: The Simpsons and Family Guy. As our dataset, we sampled about 32,000 frames from four TV seasons (about 80 videos) of each series (Figure 1).

We use these two different cartoon series to explore several questions. To what extent can a network generate novel frames in the style of a particular series? Can a network automatically discover high-level semantic mappings between our two cartoon series, finding similar styles, scenes, and themes? If we have a frame in one series, can we find similar high-level scenes (e.g., "people shaking hands") in the other? Can training from two different series be combined to generate better frames from each individual series? We find, for example, that training both domains together can generate better high-resolution images than either independently. Our work follows others that have considered mapping problems like image style transfer [12, 45] and text-to-image mappings [8, 32].
Our three contributions are: (1) proposing cartoon series as a fun, useful test domain for GANs; (2) building a structured but highly nontrivial mapping problem that reveals interesting insights about the latent space produced by the GAN; and (3) presenting extensive experimentation in which we vary the training dataset composition and generation techniques in order to study what is captured by the underlying latent space.
2. Related Work
Generative networks learn an unsupervised model of a domain such as images or videos, so that the model can generate samples from the latent representation that it learns. For example, Dosovitskiy et al. [9] proposed a deep architecture to generate images of chairs based on a latent representation that encodes chair appearance as a function of attributes and viewpoint. Generative Adversarial Networks (GANs) [13] are a particularly prominent example. GANs model general classification problems as finding an equilibrium point between two players, a generator and a discriminator, where the generator attempts to produce "confusing" examples and the discriminator attempts to correctly classify them. Radford et al. [30] combined CNNs with GANs in their Deep Convolutional GANs (DCGANs). GANs have been applied to domains including images [6, 25], videos [37], and even emojis [36], and to applications including face aging [1], robotic perception [39], colorization [3], color correction [20], editing [44], and in-painting [28].
Conditional Generative Adversarial Networks (CGANs) [25] introduce a class label in both the generator and discriminator, allowing them to generate images with specified properties, such as object and part attributes [31]. GAN-CLS [32] uses recurrent techniques to generate text encodings and DCGANs to generate higher-resolution images given input text descriptions, while Reed et al. [33] use text, segmentation masks, and part keypoints. Dong et al. [8] use an image captioning module for textual data augmentation to enhance the performance of GAN-CLS. Perarnau et al. [29] modify the GAN architecture to generate images conditioned on specific attributes by training an attribute predictor, while InfoGAN [5] learns a more interpretable latent representation.
Stacked GAN architecture.
To generate higher-quality images, coarse-to-fine approaches called Stacked GANs have been proposed. Denton et al. [6] describe a Laplacian pyramid GAN that generates images hierarchically: the first level generates a low-resolution image, which is then fed into the subsequent stage, and so on. Zhang et al. [42] propose a two-stage GAN in which the first stage generates a low-resolution image given a text input encoding, while the second improves its quality. Huang et al. [14] propose a multiple-stage GAN in which each stage is responsible for inverting a discriminative bottom-up mapping that is pre-trained for classification. Wang et al. [38] describe two sequential GANs, one that generates surface normals and a second that transforms the surface normals into an indoor image. Yang et al. [40] describe a recursive GAN that generates image backgrounds and foregrounds separately, and then combines them together.
GAN Framework modifications.
The original GANs proposed by Goodfellow et al. [13] use a binary cross-entropy loss to train the discriminator network. Zhao et al. [43] employ energy-based objectives such that the discriminator tries to assign high energy to generated samples, while the generator attempts to generate samples with minimal energy; their approach uses an auto-encoder to enhance the stability of the GAN. Metz et al. [24] define the generator objective based on unrolled optimization of the discriminator by optimizing a surrogate loss function. Mao et al. [22] address the problem of vanishing gradients in the discriminator by proposing a least-squares loss for the discriminator instead of a sigmoid cross-entropy. Salimans et al. [34] propose several training strategies to enhance the convergence of GANs, such as feature matching and mini-batch discrimination. Nowozin et al. [27] consider f-divergence measures for training generative models by regarding the GAN as a general variational divergence estimator. Che et al. [4] address instability of the network and the mode-collapse problem by introducing two types of regularizers: geometric metrics (e.g., the pixelwise distance between the discriminator features and VGG features) and mode regularizers that penalize missing modes.
Image-to-Image GANs.
Many researchers have proposed GAN-based image-to-image translation techniques, most of which require training data with correspondences between images. Isola et al. [15] propose conditional GANs that map a random vector $z$ and an input image from a source domain to an image in the target domain. Sangkloy et al. [35] synthesize images from rough sketches. Karacan et al. [17] generate realistic outdoor scenes based on an input image layout and scene attributes. Other work has tried to train without image-to-image correspondences in the training dataset. For example, Taigman et al. [36] use a Domain Transfer Network (DTN) with a compound loss function for mapping between face emojis and face images. Kim et al. [18] propose DiscoGAN, which employs a generative model to learn a bijection between two domains based on loss functions that reduce the distance between the input image and the inverse mapping of the generated image. Zhu et al. [45] employ a cycle-consistency loss to train adversarial networks for image-to-image translation.
3. Approach
Our goal is to study GANs for image generation across two different image domains – in particular, two TV cartoon series. While the last section gave an overview of GANs, we now focus on the techniques we use to address this particular task.
Goodfellow et al. [13] proposed a deep generative model as a min-max game between two agents: (1) a generative network $g(z; \theta_g)$ that models a probability distribution $p_g$ of generated image samples, with input $z$ drawn from a distribution $z \sim p_z$, and (2) a discriminator network $f(x; \theta_f)$ that estimates the probability that an input sample is drawn from the real data distribution $x \sim p_x$ rather than from the generated distribution $p_g$. Ideally, $f(x; \theta_f) = 1$ if $x \sim p_x$ and $f(g(z; \theta_g); \theta_f) = 0$ if $z \sim p_z$. The networks are trained by optimizing

$$\min_{\theta_g} \max_{\theta_f} L(\theta_f, \theta_g), \qquad L(\theta_f, \theta_g) = \mathbb{E}_{x \sim p_x}[\log(f(x; \theta_f))] + \mathbb{E}_{z \sim p_z}[\log(1 - f(g(z; \theta_g); \theta_f))].$$

The optimization problem is solved by alternating between gradient ascent steps for the discriminator,

$$\theta_f^{t+1} = \theta_f^{t} + \lambda^{t} \nabla_{\theta_f} L(\theta_f^{t}, \theta_g^{t}),$$

and descent steps on the generator,

$$\theta_g^{t+1} = \theta_g^{t} - \lambda^{t} \nabla_{\theta_g} L(\theta_f^{t+1}, \theta_g^{t}).$$

Deep Convolutional GANs (DCGANs) [30] include architectural constraints for deep unsupervised network training, many of which we follow here. Any deep architecture we employ consists of a set of modules (represented by rectangular blocks in the network architecture figures). Each module in the generator consists of fractional-strided convolutions, batch normalization, and ReLU activations for all layers except the output, which uses a hyperbolic tangent. In the discriminator, each block consists of strided convolutions, batch normalization, and LeakyReLU activations.
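For concreteness, these alternating updates can be sketched in a few lines of PyTorch-style Python. This is only a sketch under simplifying assumptions, not the implementation used in our experiments: G and D stand in for the generator and discriminator modules, D is assumed to end in a sigmoid so that its output can be read as a probability, and the optimizers opt_g and opt_d are assumed to hold only the corresponding network's parameters.

import torch
import torch.nn as nn

def gan_step(G, D, real_batch, opt_g, opt_d, z_dim=1024):
    # One alternating update: a gradient ascent step on the discriminator,
    # then a gradient descent step on the generator.
    bce = nn.BCELoss()
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: maximize log f(x) + log(1 - f(g(z))).
    z = torch.randn(b, z_dim)
    fake = G(z).detach()                      # do not backpropagate into the generator here
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: descend on log(1 - f(g(z))), the literal minimax objective.
    z = torch.randn(b, z_dim)
    loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

In practice, many implementations instead maximize $\log f(g(z; \theta_g); \theta_f)$ for the generator (the "non-saturating" variant), which gives stronger gradients early in training; the sketch above keeps the literal minimax form for clarity.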
Coupled GANs (CoGANs) [21] allow a network to model multiple image domains. A CoGAN consists of two (or more) GANs, where $(g_s, f_s)$ and $(g_l, f_l)$ denote the generative and discriminative networks for the first and second domains, respectively (Figure 2(a)). Since the first layers of a discriminator encode low-level features while the later layers encode high-level features, we force the domains to share semantic information by tying the weights of their final discriminator layers together. Since the flow of information in generative networks is reversed (initial layers represent high-level concepts), we tie together the early layers of the generators. In learning, CoGANs solve a constrained minimax game similar to that of GANs,

$$\max_{\theta_{g_s}, \theta_{g_l}} \min_{\theta_{f_s}, \theta_{f_l}} L(\theta_{f_s}, \theta_{g_s}, \theta_{f_l}, \theta_{g_l}).$$

We will use CoGANs to explore semantic connections between Family Guy and The Simpsons, and compare the results with the other methods described in the following section.

Figure 2: Various network architectures that we consider: (a) coupled GANs, (b) GANs with domain adaptation, (c) generator for high-resolution images.

In domain adaptation, we have an input space $X$ (images) and an output space $Y$ (class labels). The objective is to train a model using a source domain distribution $S(x, y)$ so as to maximize performance on a target domain distribution $\tau(x, y)$. Both distributions are defined on $X \times Y$, where $S$ is "shifted" from $\tau$ by some domain offset. The adversarial domain adaptation model [11] decomposes the input image feature representation into three parts: $G_{f_t}$, which extracts a feature representation $f_t = G_{f_t}(x; \theta_{f_t})$; $G_y(f_t; \theta_y)$, which maps a feature vector $f_t$ to a label $y$; and $G_d(f_t; \theta_d)$, which maps a feature vector $f_t$ to either 0 or 1 according to whether it belongs to the source or target distribution, respectively. The objective is to obtain a domain-invariant feature representation $f_t$ that maximizes the loss of the domain classifier, while the parameters $\theta_d$ minimize the loss of the domain classifier. This is accomplished by a gradient reversal unit (GRU), which acts as the identity during forward propagation and negates the gradient during backward propagation.

Consider two distributions, $X_s \sim \tau_s(x_s)$ and $X_l \sim \tau_l(x_l)$, corresponding to two different input domains, and a set of samples from each domain, $\{x_s^{(1)}, \ldots, x_s^{(N)}\}$ and $\{x_l^{(1)}, \ldots, x_l^{(N)}\}$. Each sample has a label $y \in \{0, 1\}$ indicating whether it is a fake or real image, respectively, and a label $d \in \{0, 1\}$ indicating whether it is from the first or second domain. The parameters of the model are $\theta_{G_s}, \theta_{G_l}, \theta_f, \theta_c, \theta_a$, and the network architecture is shown in Figure 2(b). Training involves optimizing an energy function

$$E(\theta_{G_s}, \theta_{G_l}, \theta_a, \theta_f, \theta_c) = L_1(\theta_{G_s}, \theta_{G_l}, \theta_a, \theta_f) + L_2(\theta_c, \theta_a),$$

where $L_1$ is the cross-entropy loss,

$$L_1(\theta_{G_s}, \theta_{G_l}, \theta_a, \theta_f) = \mathbb{E}_{x_s \sim p_{x_s}}[\log(f(x_s; \theta_{af}))] + \mathbb{E}_{x_l \sim p_{x_l}}[\log(f(x_l; \theta_{af}))] + \mathbb{E}_{z_s \sim p_{z_s}}[\log(1 - f(g_s(z_s; \theta_{g_s}); \theta_{af}))] + \mathbb{E}_{z_l \sim p_{z_l}}[\log(1 - f(g_l(z_l; \theta_{g_l}); \theta_{af}))],$$

where $\theta_{af} = \{\theta_a, \theta_f\}$, and $L_2$ is a binary log-max (softmax) loss for the domain classifier,

$$L_2(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} \mathbb{1}\{y^{(i)} = j\} \log\left(\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}\right),$$

with $\theta = \{\theta_c, \theta_a\}$. Algorithm 1 shows the steps involved in this optimization.
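The gradient reversal unit described above can be realized directly as a custom automatic-differentiation operation. The following is a minimal PyTorch-style sketch, not the code used in our experiments; the extractor and domain_classifier names in the usage comment are hypothetical.

import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward direction; negated (optionally scaled) gradient on the way back.
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # the second None corresponds to the lam argument

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Typical usage: features = extractor(images)
#                domain_logits = domain_classifier(grad_reverse(features))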
In the experimental results, we will examine the effect of applying this technique to find high-level semantic alignments between the two domains of The Simpsons and Family Guy.
4. Experimental Results
We now present results on generating images across our two domains of interest: frames from the cartoon series The Simpsons and Family Guy.
To build a large-scale dataset, we took about 25 hours of DVDs corresponding to every episode from four years of each program (The Simpsons seasons 7-10, Family Guy seasons 1-4). We performed screen captures at about Hz and a resolution of 1280 ×.

Figure 3: Sample frames generated by two independently-trained cartoon series models, at two different resolutions.
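For reference, a frame-sampling routine of the kind used to build such a dataset might look like the following OpenCV-based sketch. This is illustrative only, not our exact extraction pipeline, and the one-second sampling interval is a placeholder rather than the rate we actually used.

import cv2

def sample_frames(video_path, every_n_seconds=1.0):
    # Read a video and keep one frame every `every_n_seconds` seconds.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps * every_n_seconds)), 1)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)
        i += 1
    cap.release()
    return frames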
We began by training a separate, independent network for each of our two domains, basing the network architecture and training procedure on the publicly-available Torch implementation by Radford et al. [30]. We generated new 64 × 64 pixel frames by passing independently-sampled random vectors into the network's 1024-dimensional $z$ input. Figure 3(a) presents the results. Comparing these novel frames to those in the training set (Figure 1), we observe that the network seems to have captured the overall style and appearance of the original domains – the distinctive yellow color of the characters in The Simpsons, for example, versus the paler skin tones in Family Guy.

Nearest neighbors.
To what extent are the generated frames really novel, and to what extent do they simply "copy" the training images? To help visualize this, for each generated frame we found the nearest neighbors in the training set, according to the Euclidean distance between the activation values of the second-to-last layer of the discriminator. Figure 4 presents the results. Intuitively, the nearest neighbors give us some information about the "inspiration" that the network used in generating a new frame. We observe, for example, that the upper-left image in the Family Guy row looks like an image of the husband and wife talking in the bedroom, but the closest images retrieved from the training dataset show them talking in the kitchen; this is explained by the fact that the two rooms have a very similar appearance in the series. In the top-middle example of the Simpsons row, the network appears to have generated a novel scene with some people appearing on a TV set, whereas the closest frames in the training images show the TV with different varieties of text.

Figure 4: Nearest training neighbors for each of nine sample generated images (left image in each pane) for a GAN model trained on each independent cartoon series.
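This nearest-neighbor visualization can be computed as in the sketch below (illustrative Python, not our exact code); penultimate_features stands for a hypothetical function that returns the activations of the discriminator's second-to-last layer for a single frame.

import numpy as np

def nearest_training_frames(gen_frames, train_frames, penultimate_features, k=10):
    # Embed every frame with the discriminator's second-to-last-layer activations,
    # then rank training frames by Euclidean distance to each generated frame.
    gen_feats = np.stack([penultimate_features(f) for f in gen_frames])
    train_feats = np.stack([penultimate_features(f) for f in train_frames])
    dists = np.linalg.norm(gen_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]   # indices of the k closest training frames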
Generating higher-resolution images.
To improve the quality of the frames, we tried to generate images at a higher resolution of 128 × 128 (also shown in Figure 3), but these independently-trained higher-resolution models showed evidence of mode collapse.

Instead of training models for the two domains in isolation, we next consider various jointly-trained models. We hypothesize that such jointly-trained models could potentially overcome the problems with generating higher-resolution images (by doubling the number of training examples), and could find semantic correspondences between frames across the two series.
Combining the datasets.
We first tried retraining the same model as before, but with a dataset consisting of both series mixed together; results are shown in Figure 5 for two different resolutions. Again, the lower-resolution images seem reasonable in both appearance and diversity, and for most frames we can identify characteristics of one or both of the two series. However, the combined dataset seems not to have helped the higher-resolution frame generator, as we still see evidence of mode collapse. The second row of the figure again presents k-nearest-neighbors in the combined training set, showing that sometimes the generated frames are most similar to one dataset, and sometimes seem to synthesize a combination of the two.

Algorithm 1: Training with domain adaptation

Given: minibatch size $b$, learning rate $\lambda$, iteration count $n$, and randomly-initialized $\theta_{G_s}, \theta_{G_l}, \theta_f, \theta_c, \theta_a$.

For $i = 1$ to $n$:

Update the discriminator parameters $\{\theta_a, \theta_f\}$, where real images of both domains have label 1 and fake (generated) images have label 0:
$$\{\theta_a, \theta_f\} \leftarrow \{\theta_a, \theta_f\} + \lambda \, \frac{\partial L_1(\theta_{G_s}, \theta_{G_l}, \theta_a, \theta_f)}{\partial \{\theta_a, \theta_f\}} \quad (1)$$

Update the generative models for the two domains independently, by propagating the cross-entropy from the discriminative network, considering images produced by the generators as real:
$$\theta_{G_s} \leftarrow \theta_{G_s} - \lambda \, \frac{\partial L_1(\theta_{G_s}, \theta_{G_l}, \theta_a, \theta_f)}{\partial \theta_{G_s}} \quad (2)$$
$$\theta_{G_l} \leftarrow \theta_{G_l} - \lambda \, \frac{\partial L_1(\theta_{G_s}, \theta_{G_l}, \theta_a, \theta_f)}{\partial \theta_{G_l}} \quad (3)$$

Update the classifier parameters $\{\theta_a, \theta_c\}$ with real samples from both domains, where class labels indicate the domain, propagating the error through the classifier and shared discriminator parameters:
$$\{\theta_a, \theta_c\} \leftarrow \{\theta_a, \theta_c\} - \lambda \, \frac{\partial L_2(\theta_a, \theta_c)}{\partial \{\theta_a, \theta_c\}} \Big|_{\mathrm{real}} \quad (4)$$

Repeat with fake generated images:
$$\{\theta_a, \theta_c\} \leftarrow \{\theta_a, \theta_c\} - \lambda \, \frac{\partial L_2(\theta_a, \theta_c)}{\partial \{\theta_a, \theta_c\}} \Big|_{\mathrm{fake}} \quad (5)$$
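For concreteness, one iteration of Algorithm 1 could be organized as in the following PyTorch-style sketch. This is an illustrative skeleton rather than our actual training code: trunk denotes the shared discriminator layers ($\theta_a$), rf_head the real/fake head ($\theta_f$), dom_head the domain-classifier head ($\theta_c$), and the split of parameters across the optimizers opt_disc = $\{\theta_a, \theta_f\}$, opt_gs, opt_gl, and opt_cls = $\{\theta_a, \theta_c\}$ is our assumption about how the parameter groups would be wired.

import torch
import torch.nn as nn

bce, ce = nn.BCELoss(), nn.CrossEntropyLoss()

def algorithm1_step(G_s, G_l, trunk, rf_head, dom_head,
                    opt_disc, opt_gs, opt_gl, opt_cls,
                    real_s, real_l, z_dim=100):
    b = real_s.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)
    dom_s = torch.zeros(b, dtype=torch.long)    # domain label 0 = first series
    dom_l = torch.ones(b, dtype=torch.long)     # domain label 1 = second series

    rf = lambda x: rf_head(trunk(x))            # real/fake probability (sigmoid output assumed)
    dom = lambda x: dom_head(trunk(x))          # 2-way domain logits assumed

    # (1) Discriminator update: real frames from both domains -> 1, fakes -> 0.
    fake_s = G_s(torch.randn(b, z_dim)).detach()
    fake_l = G_l(torch.randn(b, z_dim)).detach()
    loss_d = (bce(rf(real_s), ones) + bce(rf(real_l), ones) +
              bce(rf(fake_s), zeros) + bce(rf(fake_l), zeros))
    opt_disc.zero_grad(); loss_d.backward(); opt_disc.step()

    # (2)-(3) Generator updates: treat generated frames as real.
    loss_gs = bce(rf(G_s(torch.randn(b, z_dim))), ones)
    opt_gs.zero_grad(); loss_gs.backward(); opt_gs.step()
    loss_gl = bce(rf(G_l(torch.randn(b, z_dim))), ones)
    opt_gl.zero_grad(); loss_gl.backward(); opt_gl.step()

    # (4) Domain-classifier update with real frames (labels indicate the domain).
    loss_c_real = ce(dom(real_s), dom_s) + ce(dom(real_l), dom_l)
    opt_cls.zero_grad(); loss_c_real.backward(); opt_cls.step()

    # (5) Repeat the domain-classifier update with freshly generated frames.
    fake_s = G_s(torch.randn(b, z_dim)).detach()
    fake_l = G_l(torch.randn(b, z_dim)).detach()
    loss_c_fake = ce(dom(fake_s), dom_s) + ce(dom(fake_l), dom_l)
    opt_cls.zero_grad(); loss_c_fake.backward(); opt_cls.step()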
CoGANs.

We next test CoGANs, which explicitly model the fact that there are multiple image domains; results are shown in Figure 6. We observe that the coupled training generated images of noticeably better quality at 128 × 128. We also generated paired images by passing the same random input vector $z$ into both of the two models, as shown in each pane of Figure 7. Intuitively, the images generated from the same $z$ should be topically similar if the CoGAN has identified meaningful semantic connections; we observe, however, that this does not appear to be the case, since images across domains in the figure are quite different. We also show the nearest neighbors of each generated frame in the corresponding training set.

Domain adaptation.
Sample results for domain adaptation are shown in Figure 8. We noticed during training that only some iterations of the model were able to generate good images for both domains; here we selected iterations that worked well. The figure also shows the nearest neighbors in each domain's training set for each sample generated frame. We use two different features for finding nearest neighbors: (1) the second-to-last layer of the discriminator, as before, and (2) the second-to-last layer of the classifier. The results suggest that the model managed to find correspondences in the main colors of the image in both domains.

Examining the results shown in Figure 8, we note that the model seemed to find some meaningful high-level semantic alignments. For discriminator similarity, for example, we see alignments in terms of people in theaters or stadiums (Family Guy images 1e and 1i with Simpsons images 1k and 1p), houses (images 2a through 2e with 2k and 2r), a person talking against a red background (row 3), cars in a parking lot (images 5b and 5f with 5k and 5m), and groups of people indoors (7d and 7e with 7k and 7m through 7t). With the similarity measure in the classifier's feature space, a general theme seems to be people conversing in different indoor settings, including images 1I and 1J with 1K, 1L, and 1P. Images 2D and 2I seem to relate to image 2M in that both have people talking against a green background, and in general the green background seems to be a theme of the whole row. The third row appears to roughly correspond with conversations between pairs of people, as in images 3A, 3C, and 3E with images 3M through 3P, whereas the fifth row features single characters in the scene (e.g., images 5A, 5D, and 5I with 5K through 5N). Other themes include square-framed scenes (images 1F with 1M and 1Q through 1T) and scenes with prominent buildings (images 6E and 6G with 6K through 6M).
Domain adaptation variants.
To better understand the importance of different parts of the domain adaptation technique in Algorithm 1, we tried several variants, and present results in Figure 9. First, our NoClassifierTraining variant skips the classifier updates entirely (Eqs. (4) and (5) of Algorithm 1), to test whether the shared discriminator alone (without the classifier) is enough to achieve good mappings between domains. Figure 10 shows nearest neighbors for some sample frames generated by this variant of the model. For the discriminator similarities, we see much less evidence of semantic correspondence between frames than with the full model, although there is some in terms of simple features like overall color. For example, row 1 has a yellow theme (e.g., images 1g and 1h with 1l, 1m, 1s, and 1t), row 2 has a violet theme (images 2a–2c and 2e–2f with 2k, 2q, and 2t), rows 4 and 8 feature dark gray backgrounds, and row 5 has a combination of yellow and blue colors. The classifier similarities are not useful here, of course, since the classifier has not been trained, and they resemble clusterings of random vectors.
NoFakeClassifierTraining performs the classifier update with real images (Eq. (4)) but skips the update with fake images (Eq. (5)), so that the classifier is trained only with real images; results are shown in Figure 11. These results suggest the technique has once again found some higher-level semantic alignments in both similarity spaces, including the clouds of dust or smoke in images 1h, 1l, and 1n, the large groups of people in 1a, 1f, 1i, and 1j with 1t, the paper documents in images 5a–5j with 5k, 5p, and 5s, the single characters against a blue background in row 6, and the sky-colored background with white foreground objects like clouds in rows 7 and 8. Under classifier similarities, we also see some semantic themes, including several rows that seem to be cuing on certain facial reactions (e.g., images 5A, 5G, 5I, and 5J with 5L, 5P, 5Q, and 5S).

Figure 6: Sample frames generated by CoGAN at different resolutions (rows) for each cartoon dataset (columns).

Figure 7: Sample results of paired image generation with CoGAN. In each pane, the top left and bottom right images were generated by the same random input vector passed to the Simpsons and Family Guy models, respectively. The remaining images show nearest neighbors in the corresponding dataset.

Figure 8: Sample results of the Full Domain Adaptation model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.
NoRealClassifierTraining does the opposite, skipping the real-image classifier update (Eq. (4)) but keeping the fake-image update (Eq. (5)), so that the classifier is trained only with synthetic images. We find that the model again generates images with good cross-domain correspondence under both similarity measures, which suggests that the mapping is many-to-many rather than bijective, as shown in Figure 12. Examples of semantic-level alignments seem to include frames with single humans against a blue background (images 2a–2h with 2m, 2n, 2p, and 2t), two characters talking (images 3a–3j with 3m, 3o, and 3t), large crowds of people in row 4, and similar color themes in the remaining rows. The classifier similarity results seem to find mostly alignments based on similar character configurations and activities. Finally,
LazyFakeClassifierTraining skips the fake-image classifier update (Eq. (5)) during the initial few iterations, so that the classifier only trains with fake images once they start becoming realistic. Examining the results shown in Figure 13, we again see relatively good high-level semantic alignments, including interactions between two main characters (images 1e and 1j with 1k–1t), blue-green color schemes (row 2), large groups of indoor people (row 3), two characters interacting (images 8b and 8c with 8k–8t), etc. Particularly interesting are rows 5 and 6, where many images correspond with scenes appearing on a TV screen (and thus framed by a rectangular "viewing window"). Correspondences in the classifier similarity space are also readily apparent.

We note that when training with both datasets jointly but keeping the domains distinct (as opposed to merging them into just one dataset), the model manages to generate samples that represent the different color styles of the two series separately, as shown in Figure 14. When trained on a single merged dataset, the model finds it hard to build a joint color space for both domains, so it alternates between them. The first row of Figure 14 shows samples generated after the first training epoch of the CoGAN model for Family Guy and The Simpsons, respectively, while the second row shows samples from the domain adaptation model. Both the first and second rows show that the network detects the difference between the domains from the first epoch. The third row of the figure shows the first and second epochs of image generation with the combined dataset. We notice that the model tries to find a common color space between the two domains, and the results change drastically between the two epochs.

Figure 9: Sample frames generated under different variants of the domain adaptation models (Full Domain Adaptation, NoClassifierTraining, NoFakeClassifierTraining, NoRealClassifierTraining, and LazyFakeClassifierTraining) for The Simpsons and Family Guy.

Figure 10: Sample results of the NoClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.

Figure 11: Sample results of the NoFakeClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.
A potential direct application of our model is in cross-domain image retrieval, in this case finding semantically-similar episodes across two different cartoon series. We test both the discriminator and classifier feature-based similarities for
FullDomainAdaptation. We view each episode as a bag of frames, and then, given an episode in one domain and an episode in the other, we calculate a distance measure: for every frame in the first episode, we find the closest frame in the second episode, and each frame is allowed to be paired with at most one other frame from the other episode. The distance between the two episodes is the minimum closest-frame distance.

Figure 15 shows four sample retrieval results for both the discriminator similarity measure (top) and the classifier similarity measure (bottom). Within each result, the top row shows 20 sample frames from a Family Guy query video, and the bottom shows the matched frames from the most similar Simpsons episode. While we do not have ground truth to evaluate quantitatively, we see that the retrieved results do share similarities in terms of overall scene composition. Examining the results for the discriminator similarity, for example, the episodes in the first example both feature "synthetic"-looking 3D models of characters, while the episodes in the second example feature objects against a blue sky and close-ups of people. The third example shows indoor rooms from a specific viewpoint, and the fourth seems to match black-and-white scenes in Family Guy with cowboy scenes in The Simpsons. Meanwhile, the classifier similarity measure seems to have found similar episodes in terms of the appearance of the main characters and similar background colors.

Figure 12: Sample results of the NoRealClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.
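This episode distance can be computed as in the sketch below (illustrative only, not our exact implementation). Frame features are assumed to be precomputed with one of the similarity measures above, and the greedy one-to-one pairing plus the final minimum-over-matched-pairs aggregation follow our reading of the description in the text.

import numpy as np

def episode_distance(frames_a, frames_b):
    # frames_a: (n_a, d) array of frame features for the query episode;
    # frames_b: (n_b, d) array for a candidate episode in the other domain.
    dists = np.linalg.norm(frames_a[:, None, :] - frames_b[None, :, :], axis=-1)
    used, matched = set(), []
    for i in range(dists.shape[0]):
        # Greedily pair frame i with its closest not-yet-used frame in the other episode.
        for j in np.argsort(dists[i]):
            if j not in used:
                used.add(j)
                matched.append(dists[i, j])
                break
        else:
            break          # every frame of the other episode is already paired
    return min(matched) if matched else float("inf")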
5. Conclusion and Future Work
We have studied finding high-level semantic mappings between cartoon frames using GANs, as a first step towards finding general semantic connections between videos. We show that this problem is many-to-many (not bijective: the same scene in one domain can map to different scenes in the other domain, and each mapping can carry its own high-level semantic meaning) and that some models can find reasonable high-level semantic alignments between the two domains. Our work also shows, however, that this is still an open research problem. Future work should consider the temporal dimension of video, as well as adding other modalities like subtitles and audio. Beyond the insight that our analysis gives about GANs, it also creates the opportunity for interesting applications in the specific domain of cartoons. For example, a common practice among fans is to create correspondences between live-action movies and TV cartoon series [7, 26], such as parodies. Our work raises the intriguing possibility that such mappings between domains could be created completely automatically by GAN models. By training on TV series as opposed to individual images, it may even be possible to sample entirely new story lines, generating new episodes that fit the stylistic mores of a given series, completely automatically!

Figure 13: Sample results of the LazyFakeClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.
Figure 14: Samples generated at various training iterations (rows: first CoGAN iteration, first domain adaptation iteration, and first/second iterations of the combined Simpsons + Family Guy training; columns: Simpsons, Family Guy).

Figure 15: Cross-domain similar episode retrieval examples using both discriminator (top) and classifier (bottom) similarity. In each panel, the top row shows a subset of frames from a query video, and the bottom shows the corresponding matched frames from the most similar video in the other domain.

References

[1] G. Antipov, M. Baccouche, and J.-L. Dugelay. Face aging with conditional generative adversarial networks. arXiv:1702.01983, 2017.
[2] C. Booker. The Seven Basic Plots. 2004.
[3] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diverse colorization via generative adversarial networks. arXiv:1702.06674, 2017.
[4] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. arXiv:1612.02136, 2016.
[5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, pages 2172–2180, 2016.
[6] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
[8] H. Dong, J. Zhang, D. McIlwraith, and Y. Guo. I2T2I: Learning text to image synthesis with textual data augmentation. arXiv:1703.06676, 2017.
[9] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, pages 1538–1546, 2015.
[10] N. Frye. The Archetypes of Literature. Norton, 2001.
[11] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv:1409.7495, 2014.
[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[14] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. arXiv:1612.04357, 2016.
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv:1611.07004, 2016.
[16] M. Johnson, D. Duvenaud, A. Wiltschko, R. Adams, and S. Datta. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.
[17] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv:1612.00215, 2016.
[18] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv:1703.05192, 2017.
[19] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv:1609.04802, 2016.
[20] J. Li, K. A. Skinner, R. M. Eustice, and M. Johnson-Roberson. WaterGAN: Unsupervised generative network to enable real-time color correction of monocular underwater images. arXiv:1702.07392, 2017.
[21] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, pages 469–477, 2016.
[22] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. arXiv:1611.04076, 2017.
[23] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440, 2015.
[24] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv:1611.02163, 2016.
[25] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
[26] J. Nawara. This 'Simpsons' Homage Supercut Celebrates The 75th Anniversary Of 'Citizen Kane'. http://uproxx.com/movies/citizen-kane-simpsons-homages/, 2016.
[27] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, pages 271–279, 2016.
[28] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
[29] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. arXiv:1611.06355, 2016.
[30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[31] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
[32] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[33] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. Technical report, 2016.
[34] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, pages 2226–2234, 2016.
[35] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. arXiv:1612.00835, 2016.
[36] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv:1611.02200, 2016.
[37] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[38] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
[39] M. Wulfmeier, A. Bewley, and I. Posner. Addressing appearance change in outdoor robotics with adversarial domain adaptation. arXiv:1703.01461, 2016.
[40] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. arXiv:1703.01560, 2017.
[41] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv:1506.06579, 2015.
[42] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv:1612.03242, 2016.
[43] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv:1609.03126, 2016.
[44] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
[45] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593, 2017.