CharacterGAN: Few-Shot Keypoint Character Animation and Reposing
Tobias Hinz, Matthew Fisher, Oliver Wang, Eli Shechtman, Stefan Wermter
Tobias Hinz, Stefan Wermter (University of Hamburg); Matthew Fisher, Oliver Wang, Eli Shechtman (Adobe Research)
Figure 1: We train a generative model in a low-data setting (8 to 15 training samples) to repose and animate characters based on keypoint positions. The first row shows video results of our method (animated results are indicated throughout the paper), driven by linearly interpolating input keypoints. Please view with Adobe Acrobat to see the animations. The second and third rows show interpolated frames generated by our method between a single start and end frame (left and right columns).
Abstract
We introduce CharacterGAN, a generative model that can be trained on only a few samples (8–15) of a given character. Our model generates novel poses based on keypoint locations, which can be modified in real time while providing interactive feedback, allowing for intuitive reposing and animation. Since we only have very limited training samples, one of the key challenges lies in how to address (dis)occlusions, e.g. when a hand moves behind or in front of a body. To address this, we introduce a novel layering approach which explicitly splits the input keypoints into different layers which are processed independently. These layers represent different parts of the character and provide a strong implicit bias that helps to obtain realistic results even with strong (dis)occlusions. To combine the features of individual layers we use an adaptive scaling approach conditioned on all keypoints. Finally, we introduce a mask connectivity constraint to reduce distortion artifacts that occur with extreme out-of-distribution poses at test time. We show that our approach outperforms recent baselines and creates realistic animations for diverse characters. We also show that our model can handle discrete state changes, for example a profile facing left or right, that the different layers do indeed learn features specific to the respective keypoints in those layers, and that our model scales to larger datasets when more data is available.
1. Introduction
Modifying and animating artistic characters is a task that often requires experts to manually create many instances of the same character in different poses, which is a time-consuming and expensive process. Using video frame interpolation methods can reduce the overhead, but these methods do not leverage character-specific priors and so can only be used for small motions. Similarly, warping input frames directly with 2D handles cannot account for disocclusions or appearance changes between poses. In this work we have two main goals: (1) to generate high quality frames of an animated character based on a small number of examples, and (2) to generate these images based on a sparse set of keypoints that can be easily modified in real time.

To address both issues, we propose to train a conditional Generative Adversarial Network (GAN) [11] architecture that allows us to create new images of a character based on a set of given keypoint locations. We show that, with the right form of implicit biases, such a model can be trained in a few-shot setting (i.e., tens of training images). In character animation, each training image has to be created manually, making it expensive to acquire the number of images required for training most conditional GAN models [16], and unlike recent work on single image generative models [33], we desire precise control over the generated image.

Our method consists of a GAN that is trained on 8–15 images of a given character and its associated keypoints in different poses. One of the key challenges is that it is difficult for the model to learn which parts of a character should be occluded by other parts. For example, when the hand of a humanoid character moves in front of the torso, either the hand should be visible at all times (if it is in the foreground) or only the torso should be visible (if the hand moves "behind" the torso). To learn this ordering in a purely data-driven way we would need many images, which we do not have in our setting. Instead, we propose to use user-specified layers for our keypoints, i.e. each keypoint lies in a given layer and we can introduce an ordering over those layers. For example, a humanoid character could be described by three layers, consisting of 1) the arm and leg in the "back" (i.e. occluded by the torso and the other arm and leg), 2) the torso and head, which are occluded by one arm and one leg and occlude the other arm and leg, and 3) the remaining arm and leg, which occlude every other part of the character. Our model processes each of these layers independently, i.e. it generates features for each layer without knowing the location of the keypoints in the other layers. We then use an adaptive scaling approach, conditioned on all keypoints, to spatially scale the features of each layer, before concatenating them and using them to generate the final image.

Additionally, we can train our generator to predict the mask for a generated character, which we found to be a robust way to identify and automatically fix unrealistic keypoint layouts at test time. Our model naturally learns to associate the keypoints with roughly semantically meaningful body parts for each layer, and can handle discrete states that arise as a function of keypoint locations (e.g., switching between a profile facing left or right).
Finally, to improve image quality, we use a patch-based refinement step [4] on the generated images based on [17]. We show via a number of qualitative and quantitative experiments that our resulting model allows for real-time reposing and animation of diverse characters and that our layering approach outperforms the more traditional conditioning approach of using all keypoints at once. Since we assume no prior knowledge about the modeled character, our model can be applied to any shape and does not require additional data such as a 3D model or a character mesh. This allows our model to be applied in domains for which only limited data is available (e.g. artistic drawings or sprite sheets) without the need for additional manual input besides the keypoint labels. In summary, we introduce the following main contributions:
• We show that it is possible to train a GAN on only a few images (8–15) of a given character to allow for few-shot character reposing and animation. By only conditioning the training on keypoints (instead of e.g. semantic maps) the trained model allows for character reposing in real time without expert knowledge.
• By using a layered approach that explicitly encodes the ordering of different keypoints, our model is able to model occlusions with only very limited training data.
• We introduce a mask connectivity constraint, where a jointly predicted mask can be used at test time to automatically fix keypoint layouts for which the model produces unrealistic outputs.
2. Related Work
Conditional GANs take as input some form of label, which makes it possible to control the output of the generator to varying degrees and has also been shown to help with the training process. The label input can come in several forms, such as class conditioning [25], semantic maps [16], keypoints [31], or bounding boxes [14, 15]. However, most conditional GANs are trained with large datasets and are applied to broader domains, whereas we are interested in animating a specific character. In the following, we first focus on approaches for leveraging small amounts of data to train GANs and conclude with a section on how the chosen conditioning method directly affects how the model can be used at test time.
Few-Shot Learning with GANs
One promising approach to few-shot learning with GANs is to fine-tune GANs that are pre-trained on large datasets. This can be achieved by fine-tuning a given model [42], by only training parts of the pre-trained model [27, 32, 26, 22], or by transferring knowledge between different but related domains [41, 45]. However, these methods still rely on a model that is pre-trained on a large dataset in a very similar domain. In contrast, we do not fine-tune a pre-trained model on a small dataset but instead train our model from scratch on the limited available data.

Recent approaches show that applying data augmentation techniques directly during GAN training is useful, especially when the dataset is small [37, 20, 46, 47]. One of the main insights of these approaches is that it is not sufficient to only augment the real images, since this leads the discriminator to learn that the augmented images are part of the real data. We also make heavy use of data augmentation in our training process but use much less data (8–15 images) than the approaches mentioned here (usually 100+ images).

It is also possible to learn useful features from only a single image [49, 2], and tasks such as texture synthesis [18, 21, 6, 48], image retargeting [34], inpainting and segmentation [38, 10], and unconditional image synthesis [33, 13] are possible with only a single image. Other approaches train GANs for image-to-image translation with only one pair of matching images [23, 5, 28, 39]. However, a model trained on just a single image without any other information can only generate limited variations of the training input. Therefore, we propose to increase the available information by slightly increasing the training set's size, which increases the model's capability to generate more variations of a learned object.

Figure 2: Our model processes keypoints which are split into individual layers. The resulting features for each keypoint layer are then scaled, concatenated, and used to generate the final image.
Editability
An important characteristic of models for character reposing and animation is how the character is controlled. Existing approaches use driving videos [35], learn a distribution over poses for inbetweening [30], or map puppets to skeletons [8]. However, it is difficult to achieve a desired pose exactly with these approaches [35, 30], or they require expert knowledge to obtain the required training data [8]. Another approach is to condition the model on a semantic label map [7, 39] which can be changed at test time to reflect the new layout. However, modifying semantic maps or edge maps is not something that can be done on the fly, but instead takes time and skill. Finally, one could start with the character in a given pose and warp or stretch it until it reaches the desired pose [24], which might lead to unrealistic results if the end pose is too different from the starting pose. Given these limitations we only condition our model on high-level keypoints which are provided for a given character, making it easy and fast to generate new poses at test time by moving keypoints.
3. Methodology
In this section we describe our approach and how the model can be used at test time.
In this work, we focus on character animation and reposing, where the images show the same character in different poses from the same viewpoint. Since our goal is to work in low-data scenarios, we focus on cases with a small number of images of the given character. Furthermore, the individual images can depict discrete appearance variations, e.g. different facial expressions. We do not make any other prior assumptions about the structure of the character. Our method requires an input image set, keypoints per image, keypoint connectivity information (essentially a skeleton), and a layer ordering for the keypoints. This information can be obtained easily as we only need labels for very few images. While the keypoints have to be labeled manually for each image, all other information is image independent and, thus, only needs to be defined once.

We experimented with various ways to input keypoints to the network, such as individual channels, but found that representing keypoints as RGB Gaussian blobs performed best. This also means that the input condition always has the same dimensionality, independent of the number of keypoints, which is helpful as our model uses the same parameters to process the individual layers. Each keypoint is defined by a position {x, y}, three randomly chosen color values for the three RGB channels, and the values σ and r, where σ represents the falloff of the Gaussian distribution we use for blurring the keypoint and r represents the radius of the final keypoint, which we define later. Discrete keypoints are modelled with individual RGB values, and in cases where different keypoints overlap we sum the respective colors (a rendering sketch is given below).

3.2. Model

Our model consists of a generator and a discriminator (see Figure 2). The generator generates the image based on the keypoint locations while the discriminator is trained to distinguish between real and fake image–keypoint pairs. Finally, we apply a patch-match-based [4] refinement step to improve the final quality of the generated images.
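As a minimal sketch of the keypoint representation described above, the following function renders a set of keypoints as RGB Gaussian blobs with falloff σ and radius r. Function and parameter names as well as the default values are ours for illustration and are not taken from the released implementation.

```python
# Hypothetical sketch: render keypoints as colored Gaussian blobs on an RGB canvas.
import numpy as np

def render_keypoints(keypoints, colors, height, width, sigma=3.0, radius=8):
    """keypoints: list of (x, y); colors: list of (r, g, b) values in [0, 1]."""
    canvas = np.zeros((height, width, 3), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (x, y), color in zip(keypoints, colors):
        dist_sq = (xs - x) ** 2 + (ys - y) ** 2
        blob = np.exp(-dist_sq / (2.0 * sigma ** 2))  # Gaussian falloff (sigma)
        blob[dist_sq > radius ** 2] = 0.0             # cut off at radius r
        canvas += blob[..., None] * np.asarray(color, dtype=np.float32)
    # Overlapping keypoints are summed, then clipped to a valid RGB range.
    return np.clip(canvas, 0.0, 1.0)

# Example: two keypoints with random colors on a 256x256 conditioning image.
rng = np.random.default_rng(0)
condition = render_keypoints([(64, 100), (180, 40)], rng.random((2, 3)), 256, 256)
```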
Generator and Discriminator
We base our architecture on pix2pixHD [40]. However, while the discriminator is similar to the original pix2pixHD discriminator, we add several implicit biases to the generator reflecting our prior knowledge about the problem. Concretely, we know that the modeled characters are inherently three-dimensional, i.e. if some body parts are occluded by others they still exist even though they may not be visible. To address this, we split our characters into different layers, e.g. representing the "left" side of the character (e.g. left arm and leg), the "middle" part of the character (e.g. head and torso), and the "right" side of the character (e.g. right arm and leg). These layers can be modeled individually and can then be composed to form a final image. To model this, our generator processes each keypoint layer individually and learns a representation of each keypoint layer (see Figure 2).

Intuitively, we could set some features to zero if they are occluded by other features. For example, if the left hand is "behind" the torso for a given image, zeroing out the features for the left hand might make it easier for the generator to generate a realistic image. Conversely, if the right hand is "in front of" the torso, zeroing out the torso features at the location of the right hand might improve the performance. To address this, we incorporate an adaptive scaling technique [29] in which we scale the features of each layer before concatenating them. For this, we first learn an embedding of the keypoints and their layers ("Global" in the Adaptive Scaling block of Figure 2). Based on this embedding we then learn scaling parameters for each keypoint layer and use them to scale the features of each layer. These scaled layer features are then concatenated and used to generate the final image.

Our discriminator takes the keypoint conditioning k concatenated with an RGB image as input and classifies it as either real or fake. We use two patch-discriminators [16], one of which operates on the full-resolution image, while the second operates on the image down-scaled by a factor of two. We use a feature matching and an adversarial loss during training as defined by the pix2pixHD model [40]. The adversarial loss is the standard GAN loss $\mathcal{L}_{\text{adv}}$:

$\min_G \max_D \; \mathcal{L}_{\text{adv}} = \mathbb{E}_{(k,x)}\left[\log D(k, x)\right] + \mathbb{E}_{k}\left[\log\left(1 - D(k, G(k))\right)\right] \quad (1)$

where k is the keypoint condition and x is the corresponding real image. The feature matching loss $\mathcal{L}_{\text{fm}}$ stabilizes training by forcing the generator to produce realistic features at multiple scales and is defined as:

$\min_G \; \mathcal{L}_{\text{fm}} = \mathbb{E}_{(k,x)} \sum_{i=1}^{T} \frac{1}{N_i} \left[ \left\| D_i(k, x) - D_i(k, G(k)) \right\| \right] \quad (2)$

where T is the number of layers in the discriminator and $N_i$ is the number of elements in each layer. We add a perceptual loss [19, 44] to further improve the image quality. We use a VGG net to extract features from real and generated images and compute the perceptual loss $\mathcal{L}_{\text{perc}}$ as defined by [19]. Our final loss is the combination of these losses:

$\min_G \max_D \; \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{fm}} + \mathcal{L}_{\text{perc}}. \quad (3)$
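The following PyTorch sketch shows one way the layered conditioning with adaptive scaling could be wired up. It is our reading of Figure 2 under stated assumptions (a shared per-layer encoder, a global branch over the summed keypoint map, sigmoid scale maps, and arbitrary module sizes), not the authors' implementation.

```python
# Minimal sketch of layered keypoint conditioning with adaptive scaling (assumed architecture).
import torch
import torch.nn as nn

class LayeredGenerator(nn.Module):
    def __init__(self, num_layers=3, feat=64):
        super().__init__()
        # Shared weights: the same encoder is applied to every keypoint layer.
        self.layer_encoder = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # "Global" branch sees all keypoints at once (their summed RGB maps).
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU())
        # Predicts one spatial scale map per keypoint layer.
        self.to_scales = nn.Conv2d(feat, num_layers, 3, padding=1)
        # Decoder maps the concatenated, scaled layer features to an RGB image.
        self.decoder = nn.Sequential(
            nn.Conv2d(num_layers * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 3, 3, padding=1), nn.Tanh())

    def forward(self, keypoint_layers):
        # keypoint_layers: list of (B, 3, H, W) RGB keypoint maps, back to front.
        feats = [self.layer_encoder(k) for k in keypoint_layers]
        all_keypoints = torch.stack(keypoint_layers).sum(dim=0)
        scales = torch.sigmoid(self.to_scales(self.global_encoder(all_keypoints)))
        scaled = [f * scales[:, i:i + 1] for i, f in enumerate(feats)]
        return self.decoder(torch.cat(scaled, dim=1))

generator = LayeredGenerator()
layers = [torch.rand(1, 3, 128, 128) for _ in range(3)]
image = generator(layers)  # (1, 3, 128, 128)
```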
Patch-based Refinement

To further improve the final result we apply a patch-based refinement algorithm that replaces generated patches with their closest real patch. In our case, given a real and a generated image, for each patch in the generated image we find the closest patch in the dataset of all real images using the PatchMatch approximate nearest-neighbor algorithm [4, 12], and replace the generated patches with their real equivalents [36]. We found that this approach often improves the sharpness and general image quality over the output of the generator (Figure 5).
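To illustrate the principle only, the toy sketch below replaces each generated patch with its most similar training patch. Note that the paper uses the PatchMatch approximate nearest-neighbor search; this exhaustive version over non-overlapping patches is a slow stand-in, and all names and the patch size are our own choices.

```python
# Toy stand-in for patch-based refinement: brute-force nearest-patch replacement.
import numpy as np

def refine(generated, real_images, patch=8):
    h, w, _ = generated.shape
    # Collect all candidate patches from the real training images.
    candidates = []
    for img in real_images:
        for y in range(0, img.shape[0] - patch + 1, patch):
            for x in range(0, img.shape[1] - patch + 1, patch):
                candidates.append(img[y:y + patch, x:x + patch])
    candidates = np.stack(candidates)                        # (N, p, p, 3)
    out = generated.copy()
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            query = generated[y:y + patch, x:x + patch]
            dists = ((candidates - query) ** 2).sum(axis=(1, 2, 3))
            out[y:y + patch, x:x + patch] = candidates[dists.argmin()]
    return out
```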
Data Augmentation
We employ both affine transformations and thin-plate-spline augmentation [39]. We use a mixture of horizontal and vertical translations and horizontal flipping, and randomly sample a subset of these augmentation approaches at each training iteration. Thin-plate-spline (TPS) augmentation was introduced by Vinker et al. [39]. For this approach, the image is modeled as a grid and each grid point is shifted by a random distance sampled from a uniform distribution. After this, a TPS is used to smooth the transformed grid into a more realistic warp. Using TPS augmentation results in warped images where parts of the image are stretched and elongated, adding further variation to the training data. All augmentations are applied to the given image and the associated keypoints and skeleton.
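A minimal sketch of the affine part of this augmentation is shown below: a random subset of translation and horizontal flipping is applied identically to the image and its keypoints (the TPS warp from Vinker et al. [39] is omitted here). The function and parameter names are ours, not from the paper's code.

```python
# Illustrative sketch: apply the same random affine augmentations to image and keypoints.
import numpy as np
import cv2

def augment(image, keypoints, max_shift=20, rng=np.random.default_rng()):
    h, w = image.shape[:2]
    kps = keypoints.astype(np.float32).copy()       # (N, 2) array of (x, y)
    if rng.random() < 0.5:                          # random translation
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        M = np.float32([[1, 0, dx], [0, 1, dy]])
        image = cv2.warpAffine(image, M, (w, h))
        kps += np.array([dx, dy], dtype=np.float32)
    if rng.random() < 0.5:                          # horizontal flip
        image = cv2.flip(image, 1)
        kps[:, 0] = (w - 1) - kps[:, 0]
    return image, kps
```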
After our model is trained it offers a straightforward way to modify the pose of the character. Given an image of the character, the user can drag keypoints to novel positions and the generator will generate the character in the new pose. We can also easily switch between different discrete states, and we provide two ways to do this. First, it can be handled completely automatically, where the discrete state is determined solely by keypoint positions (e.g., for facing left vs. right). Second, we provide the ability to have specific keypoints for individual states, such as smile vs. frown, so a user can choose the desired expression at inference time. We also allow the user to optionally enable a mask-based connectivity correction. In this case, if the user positions keypoints too far from their input distribution, such that they could lead to unrealistic or undesired results, we can automatically modify the keypoint locations to achieve more realistic results.

Table 1: Results of cross validation for the different models (PSNR ↑ and LPIPS ↓ for SinGAN [33], ConSinGAN [13], DeepSIM [39], and CharacterGAN on the Watercolor Man, Watercolor Lady, Sprite Man, Dog, Ostrich, and Cow characters).

Table 2: Ablation study: results of cross validation for different parts of our model (CharacterGAN without layering or adaptive scaling vs. with layering but no adaptive scaling; PSNR ↑, LPIPS ↓).
Ensuring Mask Connectivity
If the image background is of uniform color (e.g. white), or we have a segmentation network, we can automatically extract a foreground–background mask. We can use this mask as additional conditioning information during training, i.e. in this case the generator does not only generate the RGB image but also the mask, while the discriminator gets as input an image and its associated keypoints and mask.

At test time, if a keypoint is moved in a way that results in a layout that is too different from the ones seen at training time, the generator may generate either "disconnected" body parts (e.g. the hand is not connected to the body) or introduce unwanted artifacts. Since the generator predicts the mask for the generated character we can use connected component analysis [9, 43] to check whether the generated mask is connected. We found two cases that often lead to a disconnected mask: if the disconnected part contains a keypoint, the resulting character will be ripped apart, or, if the disconnected part does not contain a keypoint, the generated image will often contain unwanted artifacts. In either case, we fix the result by automatically moving nearby keypoints in an iterative procedure until the predicted mask is fully connected again.

Given the last moved keypoint that caused a disconnected region to appear, we identify the closest keypoint and move it in the same direction as the keypoint moved by the user by a fraction δ of the absolute distance. If δ = 1 the keypoint is moved in exactly the same direction and by the same amount as the original keypoint, and if δ = 0 no keypoints are moved automatically at all. We found that a small default value of δ achieved good results in a single iteration in most cases; however, this process can be repeated iteratively until the mask is connected (Figure 10).
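A hedged sketch of this test-time correction loop is given below: if the predicted foreground mask splits into several connected components, the keypoint closest to the one the user just moved is nudged in the same direction by a fraction δ of the move, and the check is repeated. We use SciPy's connected-component labelling as a stand-in for the cited algorithms, and all names and the δ default are illustrative assumptions.

```python
# Sketch of the mask connectivity correction (assumed interfaces and defaults).
import numpy as np
from scipy.ndimage import label

def fix_keypoints(predict_mask, keypoints, moved_idx, move_vec, delta=0.5, max_iters=10):
    """predict_mask(keypoints) -> binary HxW foreground mask; keypoints: (N, 2) array."""
    kps = keypoints.astype(np.float32).copy()
    for _ in range(max_iters):
        _, num_components = label(predict_mask(kps))
        if num_components <= 1:                      # mask is fully connected, stop
            break
        dists = np.linalg.norm(kps - kps[moved_idx], axis=1)
        dists[moved_idx] = np.inf                    # ignore the user-moved keypoint
        nearest = int(dists.argmin())
        # delta is a fraction of the user's move; the paper's default value is not reproduced here.
        kps[nearest] += delta * np.asarray(move_vec, dtype=np.float32)
    return kps
```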
4. Experiments
Baselines
To compare our approach we adapt three generative baseline models from the single-image domain to our setting. SinGAN [33] and ConSinGAN [13] are trained on only a single image for tasks such as image generation and harmonization. Both models are trained in a multi-stage manner where the image resolution increases with each stage. While SinGAN trains each stage in isolation and freezes all previous stages, ConSinGAN trains the whole model end-to-end. We adapt both models by additionally conditioning the training at each stage on the image keypoints, i.e. each stage gets as input the keypoint condition at the respective image resolution. DeepSIM [39] trains a model specifically for image manipulation, where the conditioning input consists either of edge maps, semantic labels, or a combination of both. The model is trained on only a single instance of an image and its conditioning information. We adapt the DeepSIM model to our setting by replacing the edge-map conditioning with keypoints, i.e., instead of the edge map the input to the model is a map of keypoint locations.
Figure 3: Examples from our model, which was trained on only 8–12 images for each of the characters. Odd columns show the original image and our intended modifications, even columns show the output of our model.
Figure 4: Qualitative examples of reconstructing held-out test images based on their keypoint locations (columns: SinGAN, ConSinGAN, DeepSIM, CharacterGAN w/o and w/ layering, ground truth).
Figure 5: Effects of patch-based refinement; columns alternate between generated and refined images (please view on screen and zoomed in).
Experiments
We perform experiments on several different characters, including humans and animals. Our data comes from different sources such as drawings by artists [8] and characters taken from sprite sheets. For our characters we have 8–15 images which we manually label with keypoint locations, but we also show that our model is able to handle larger datasets. Our model itself can generate images in real time and takes about 0.01 seconds for one image on an NVIDIA RTX 2080Ti (0.4 seconds on CPU). One iteration of evaluating the mask connectivity takes about 0.00002 seconds on CPU and the patch-based refinement takes about 0.6 seconds (1.8 seconds on CPU).
To the best of our knowledge there are no established quantitative evaluation metrics for few-shot character animation. [33] introduced the Single-Image FID (SIFID) score to evaluate single-image generative models. However, [13] report large variance in the SIFID scores and [32] report that models in the few-shot setting overfit to the FID metric. Other approaches use LPIPS [44] to compare a generated image with its ground-truth counterpart. We use the Peak Signal-to-Noise Ratio (PSNR) and LPIPS as metrics to evaluate our model.

To evaluate our model we design an N-fold cross-validation for a given character with N images. Given a character we train our model N times on N − 1 images, where each image in the dataset is left out of training exactly once. At test time we generate the left-out image based on its keypoint layout and calculate the PSNR and LPIPS between the generated and ground-truth image. For each character we run the full N-fold cross-validation three times and report the average and standard deviation across the three runs. We perform all our quantitative evaluations without the patch refinement step to evaluate the models directly. Some examples are shown in Figure 4.

Table 1 shows the results of our model compared to the baselines. Our model achieves the best LPIPS and PSNR for all characters. We observe that the PSNR is not always predictive of the (perceptual) quality of the generated image. In particular, SinGAN and ConSinGAN often generate images where the character exhibits disconnected body parts (e.g. the feet are not connected to the main body), but this is not represented in the PSNR, as feet and legs cover a relatively small area of the image.

Table 2 shows ablation studies with our model, where we train the same model without any layering, and with layering but no adaptive scaling. We can see that the layering approach improves the performance in all cases except for the "Watercolor Lady" character. This character is the only one of these that does not include occlusions, i.e. none of the body parts overlap each other, which supports our hypothesis that the layering approach is mainly helpful for modeling occlusions in the low-data setting. Finally, adding the adaptive scaling further improves the performance, albeit not as much as the keypoint layering.

Figure 3 shows visualizations of our model's capabilities on several different characters. All examples in this figure are trained on sprite sheets which contain 8–12 examples of the given character in different poses. Our model learns to generate realistic samples of four-legged animals, two-legged animals, and humanoid shapes. Furthermore, we see that our model can handle the movement of keypoints that relate to relatively small character parts (e.g. individual feet) as well as keypoints that represent large body parts (e.g. head and torso). Even when we move multiple keypoints, the resulting image is realistic, adheres to the novel keypoint layout, and leaves areas of the character that were not modified unchanged. Figure 5 shows the effect of the patch-based refinement algorithm, which often improves small details of the generated images.

Figure 1 and Figure 6 show how our model can be used for character animation. We use poses from the training set as starting points and linearly interpolate the keypoints to generate the intermediate frames. Our model produces smooth and realistic results. We note that these intermediate keypoint locations could also be derived from other data, e.g. by extracting keypoints from driving videos [1]. For comparison, we also show the results of a recent general-purpose frame-interpolation model, DAIN [3].
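The animation procedure just described reduces to a very small loop; the sketch below shows linear keypoint interpolation between a start and an end pose, with a `generator` callable assumed to be the trained model. Names and the frame count are illustrative.

```python
# Sketch: animate a character by linearly interpolating keypoints between two poses.
import numpy as np

def animate(generator, kps_start, kps_end, num_frames=30):
    """kps_start, kps_end: (N, 2) keypoint arrays for the start and end pose."""
    frames = []
    for t in np.linspace(0.0, 1.0, num_frames):
        kps_t = (1.0 - t) * kps_start + t * kps_end  # linear keypoint interpolation
        frames.append(generator(kps_t))              # trained generator, assumed given
    return frames
```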
Figure 6: Comparison of our approach to frame interpolation with DAIN [3] and to DeepSIM [39] using the displayed input frames.
Figure 7: Comparison of layered vs. non-layered conditioning at occlusions/overlaps.
Mask-based Keypoint Refinement
When moving certain keypoints to a new location we sometimes observe that this leads to "rips" in the generated character, since the keypoint is too far away from the main body and such a configuration does not occur in the training data. This can usually be fixed by moving the connected body part in the same direction as the original keypoint. Figure 10 shows examples of rips in the generated character (first two rows) and introduced artifacts (third row). The first column shows the original image and user-specified modifications. The second and third columns show the predicted mask and generated image. The last column shows our model's final output after the respective connected keypoint was moved automatically. We can see that by enforcing the mask connectivity constraint we get more realistic results in all cases.

Figure 8: Visualization of what different layers learn.
Layered vs Non-layered Architecture
Figure 7 shows qualitative comparisons between images generated by our architecture while using either our layered approach or a traditional, non-layered approach. As discussed previously, our layered approach is beneficial when several keypoints overlap in 2D space, e.g. if a hand passes in front of a body or if two body parts are very close to each other. Figure 8 visualizes the features that our model learns for each keypoint layer. As we can see, the model learns to only model the relevant keypoints and their associated body parts for each layer.

Figure 9: Discrete appearance changes based on keypoint location.
Figure 10: Enforcing mask connectivity at test time results in more realistic images (columns: original, predicted mask, before fix, fixed).
Automatic Appearance Switching
Our model does not only learn to associate discrete keypoints with given features, but also learns to associate different features with a keypoint based on its location relative to other keypoints. Figure 9 shows several examples in which we can see that individual parts get "flipped" as a function of the location of that keypoint with respect to the others. As in our previous examples, the features of unrelated keypoints are unaffected by this.
Appearance-specific Keypoints
Figure 11 shows how our model is able to switch between discrete appearance states for given keypoints. Each of the characters shows different visual features during training which we encode as different keypoint conditions. At test time we can switch between these different states to combine novel poses with any of the discrete visual expressions. Note that, again, our model learns to associate good features with the given keypoints, allowing us to model the poses independently of the discrete keypoint states.
Scaling to Larger Datasets
While we show that our model performs well with only 8–15 training images, we also evaluate our model on characters for which we have more training images. Figure 12 shows how our model scales with larger datasets. We see that more data is especially helpful when there are overlaps and occlusions: while the models trained on only 5 or 15 images have difficulty modeling these (e.g. when a hand moves in front of the body), the models trained on more images perform much better.
Figure 11: Discrete appearance change based on keypoint selection. In this example, the user not only moves keypoints to generate a new pose, but also switches the IDs of keypoints to change expression, such as the color and rotation of the characters' faces.
Figure 12: Here we show how performance improves with more training images (columns labeled by the number of training images: 5, 15, 89 and 5, 15, 85).
Limitations
Our model addresses the main challenge of correctly modeling (dis)occlusions based on limited training data. However, our model still has no explicit understanding of any underlying 3D representation of the character, and Figure 12 shows how modeling occlusions gets worse with fewer training images. This could be addressed by future work, e.g. by adding known character priors into the model or by incorporating some form of 3D understanding.
5. Conclusion
In this work we show how to train GANs on a few examples (8–15 images) of a given character for few-shot character reposing and animation. The model is easy to use, requires no expert knowledge, and our layering approach produces realistic results for novel poses and occlusions. We can perform character reposing in real time by moving keypoints around and can animate characters by interpolating keypoints. In the future, this can be combined with other approaches, e.g. by extracting keypoints from driving videos for character animation. Through the use of a predicted foreground mask we can also automatically fix keypoint layouts that lead to unrealistic character poses. Finally, we show that our model learns discrete state changes based on keypoint locations, associates keypoints and their layers with semantic body parts, and scales to larger datasets.

Acknowledgements
The authors gratefully acknowledge partial support from the German Research Foundation (DFG) under project CML (TRR 169). We also thank Zuzana Studená, who produced some of the artwork used in this work.
References

[1] Kfir Aberman, Jing Liao, Mingyi Shi, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. Neural best-buddies: Sparse cross-domain correspondence. ACM Transactions on Graphics (TOG), 37(4):69, 2018.
[2] Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. A critical analysis of self-supervision, or what we can learn from a single image. In International Conference on Learning Representations, 2020.
[3] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3703–3712, 2019.
[4] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 28(3):24, 2009.
[5] Sagie Benaim, Ron Mokady, Amit Bermano, Daniel Cohen-Or, and Lior Wolf. Structural-analogy from a single image pair. arXiv preprint arXiv:2004.02222, 2020.
[6] Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. Learning texture manifolds with the periodic spatial GAN. In International Conference on Machine Learning, pages 469–477, 2017.
[7] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pages 5933–5942, 2019.
[8] Marek Dvorožňák, Wilmot Li, Vladimir G. Kim, and Daniel Sýkora. ToonSynth: Example-based synthesis of hand-colored cartoon animations. ACM Transactions on Graphics, 37(4), 2018.
[9] Christophe Fiorio and Jens Gustedt. Two linear time union-find strategies for image processing. Theoretical Computer Science, 154(2):165–181, 1996.
[10] Yossi Gandelsman, Assaf Shocher, and Michal Irani. "Double-DIP": Unsupervised image decomposition via coupled deep-image-priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[12] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 327–340, 2001.
[13] Tobias Hinz, Matthew Fisher, Oliver Wang, and Stefan Wermter. Improved techniques for training single-image GANs. In IEEE Winter Conference on Applications of Computer Vision, pages 1300–1309, 2021.
[14] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Generating multiple objects at spatially distinct locations. In International Conference on Learning Representations, 2019.
[15] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Semantic object accuracy for generative text-to-image synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[17] Ondrej Jamriska. Ebsynth: Fast example-based image synthesis and style transfer. https://github.com/jamriska/ebsynth, 2018.
[18] Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. Texture synthesis with spatial generative adversarial networks. In Advances in Neural Information Processing Systems Workshop, 2016.
[19] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[20] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems, 2020.
[21] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716, 2016.
[22] Yijun Li, Richard Zhang, Jingwan Cynthia Lu, and Eli Shechtman. Few-shot image generation with elastic weight consolidation. Advances in Neural Information Processing Systems, 33, 2020.
[23] Jianxin Lin, Yingxue Pang, Yingce Xia, Zhibo Chen, and Jiebo Luo. TuiGAN: Learning versatile image-to-image translation with two unpaired images. In European Conference on Computer Vision, 2020.
[24] Songrun Liu, Alec Jacobson, and Yotam Gingold. Skinning cubic Bézier splines and Catmull-Clark subdivision surfaces. ACM Transactions on Graphics (TOG), 33(6):1–9, 2014.
[25] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[26] Sangwoo Mo, Minsu Cho, and Jinwoo Shin. Freeze discriminator: A simple baseline for fine-tuning GANs. arXiv preprint arXiv:2002.10964, 2020.
[27] Atsuhiro Noguchi and Tatsuya Harada. Image generation from small datasets via batch statistics adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2750–2758, 2019.
[28] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, 2020.
[29] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[30] Omid Poursaeed, Vladimir Kim, Eli Shechtman, Jun Saito, and Serge Belongie. Neural puppet: Generative layered cartoon characters. In IEEE Winter Conference on Applications of Computer Vision, pages 3346–3356, 2020.
[31] Scott E. Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225, 2016.
[32] Esther Robb, Wen-Sheng Chu, Abhishek Kumar, and Jia-Bin Huang. Few-shot adaptation of generative adversarial networks. arXiv preprint arXiv:2010.11943, 2020.
[33] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. SinGAN: Learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pages 4570–4580, 2019.
[34] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. InGAN: Capturing and retargeting the DNA of a natural image. In Proceedings of the IEEE International Conference on Computer Vision, pages 4492–4501, 2019.
[35] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Advances in Neural Information Processing Systems, 2019.
[36] Ondřej Texler, David Futschik, Jakub Fišer, Michal Lukáč, Jingwan Lu, Eli Shechtman, and Daniel Sýkora. Arbitrary style transfer using neurally-guided patch-based synthesis. Computers & Graphics, 87:62–71, 2020.
[37] Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Trung-Kien Nguyen, and Ngai-Man Cheung. Towards good practices for data augmentation in GAN training. arXiv preprint arXiv:2006.05338, 2020.
[38] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
[39] Yael Vinker, Eliahu Horwitz, Nir Zabari, and Yedid Hoshen. Deep single image manipulation. arXiv preprint arXiv:2007.01289, 2020.
[40] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
[41] Yaxing Wang, Abel Gonzalez-Garcia, David Berga, Luis Herranz, Fahad Shahbaz Khan, and Joost van de Weijer. MineGAN: Effective knowledge transfer from GANs to target domains with few images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9332–9341, 2020.
[42] Yaxing Wang, Chenshen Wu, Luis Herranz, Joost van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. Transferring GANs: Generating images from limited data. In European Conference on Computer Vision, pages 218–234, 2018.
[43] Kesheng Wu, Ekow Otoo, and Arie Shoshani. Optimizing connected component labeling algorithms. In Medical Imaging 2005: Image Processing, volume 5747, pages 1965–1976, 2005.
[44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[45] Miaoyun Zhao, Yulai Cong, and Lawrence Carin. On leveraging pretrained GANs for generation with limited data. In International Conference on Machine Learning, 2020.
[46] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. In Advances in Neural Information Processing Systems, 2020.
[47] Zhengli Zhao, Zizhao Zhang, Ting Chen, Sameer Singh, and Han Zhang. Image augmentations for GAN training. arXiv preprint arXiv:2006.02595, 2020.
[48] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Non-stationary texture synthesis by adversarial expansion. ACM Transactions on Graphics (TOG), 37(4):1–13, 2018.
[49] Maria Zontak and Michal Irani. Internal statistics of a single natural image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 977–984, 2011.