Intuitive Facial Animation Editing Based On A Generative RNN Framework

ACM SIGGRAPH / Eurographics Symposium on Computer Animation 2020, J. Bender and T. Popa (Guest Editors). Computer Graphics Forum, Volume 39 (2020), Number 8.

Eloïse Berson, Catherine Soladié and Nicolas Stoiber
Dynamixyz, France; CentraleSupélec, CNRS, IETR, UMR 6164, F-35000, France
Abstract
For the last decades, the problem of producing convincing facial animation has garnered great interest, and it has only intensified with the recent explosion of 3D content in both entertainment and professional activities. Motion capture and retargeting have arguably become the dominant solution to address this demand. Yet, despite a high level of quality and automation, performance-based animation pipelines still require manual cleaning and editing to refine raw results, which is a time- and skill-demanding process. In this paper, we look to leverage machine learning to make facial animation editing faster and more accessible to non-experts. Inspired by recent image inpainting methods, we design a generative recurrent neural network that generates realistic motion into designated segments of an existing facial animation, optionally following user-provided guiding constraints. Our system handles different supervised or unsupervised editing scenarios such as motion filling during occlusions, expression corrections, semantic content modifications, and noise filtering. We demonstrate the usability of our system on several animation editing use cases.
CCS Concepts: • Computing methodologies → Motion processing; Neural networks
1. Introduction
Creating realistic facial animation has been a long-standing challenge in the industry, historically relying on the craftsmanship of a few highly trained professional animators. In the last three decades, the research community has produced methods and algorithms aiming to make quality facial animation generation accessible and widespread. To this day, this remains a challenge due to the complexity of facial dynamics, which trigger a plethora of spatiotemporal motion patterns ranging from subtle local deformations to large emotional expressions. The emergence and increasing availability of motion capture (mocap) technologies have opened a new era, where realistic animation generation is more deterministic and repeatable.

The theoretical promise of mocap is the ability to completely and flawlessly capture and retarget a human performance, from emotion down to the most subtle motion of the facial skin. In reality, even professional motion capture setups often fall short of a perfect animation result: it is for instance usual that some part of a performance cannot be captured due to occlusions or unexpected poses. Popular video-based facial mocap technologies have known flaws as well: the camera resolution limits the capture precision, while signal noise, jitter, and inconsistent lighting can impair its robustness. In addition to technical considerations, performance-based animation also lacks flexibility when the nature of the captured motion does not match the desired animation result, when the animation intent suddenly differs from what was captured, or when the performer cannot or has not performed the requested motions. Animation editing - or cleaning, as it is often called - is therefore unavoidable, and often the bottleneck of modern performance-based animation pipelines.

The editing task usually consists of selecting an unsatisfactory or corrupted time segment in the animation, and either correcting or replacing the animation curves in that segment using computational or manual methods. Several automatic motion completion systems have been developed based on simple interpolation between user-specified keyframes [Par72], usually with linear or cubic polynomials, because of their simplicity and execution speed. While interpolation has proven efficient for short segments with dense sets of keyframes, the smooth and monotonous motion patterns it produces are far from realistic facial dynamics when used on longer segments. In most cases today, animation cleaning thus relies on keyframing: having artists replace faulty animation with numerous carefully-crafted keyframes to interpolate a new animation. Not only is keyframing a time- and skill-demanding process, it requires acting on several of the character's low-level animation parameters, which is not intuitive for non-experts.

At the origin of this work is the parallel we draw between editing an animation and performing image inpainting.
Image inpainting aims at replacing unwanted or missing parts of an image with automatically generated pixel patterns, so that the edited image looks realistic. In animation editing, we pursue the same objective, substituting 1D temporal motion signals for 2D spatial pixel patterns. Inspired by recent advances in image inpainting frameworks, we present a machine-learning-based approach that makes facial animation editing faster and more accessible to non-experts. Given missing, damaged, or unsatisfactory animation segments, our GAN-based system regenerates the animation segment, optionally following sparse or semantic user guidance such as keyframes, a noisy signal, or a sequence of visemes.

Previous works have proposed to use a reduced set of low-dimensional parameters to simplify animation editing, either based on temporal Poisson reconstruction from keyframes [ASK∗12] or based on regression from semantic-level temporal parameters [BSBS19]. While achieving smooth results, they require dense temporal specifications to edit long sequences. In addition, in many cases where the capture process has failed to produce an animation (occlusions, camera malfunctions), no input animation is available to guide the result; hence the animator has to create the whole missing sequence from scratch. Our system handles all those cases by leveraging a GAN framework [GPAM∗14] to generate animation curves either from guidance inputs or unsupervised. GANs have demonstrated state-of-the-art results from little to no input in many tasks, such as image translation [IZZE17], image inpainting [YLY∗19, LLYY17], and text-to-image synthesis [RAY∗16]. Our main contributions are:

• A multifunctional framework that handles various high-level and semantic constraints to guide the editing process. It can be applied to many editing use cases, such as long occlusions, expression adding/changing, or viseme modifications.

• A generative and flexible system enabling fast unsupervised or supervised facial animation editing. Inspired by recent inpainting schemes, it leverages machine-learning-based signal reconstruction and transposes it to the facial animation domain. This framework allows editing motion segments of any length at any point in the animation timeline.
2. Related Work
In this paper, we propose a generative system for facial animation editing, synthesizing new facial motions to fill missing or unwanted animation segments. In this section, we point to relevant techniques for animation generation (Section 2.1) and motion editing (Section 2.2). Finally, as our system can perform guided editing using semantic inputs, such as keyframes or visemes, we review works related to facial reenactment (Section 2.3).
2.1. Animation Generation

In this section, we discuss existing animation synthesis techniques that rely only on sparse or no explicit external constraints, encompassing methods that automatically generate motion transitions between keyframes (Section 2.1.1) and techniques predicting motion sequences based on past context (Section 2.1.2).
2.1.1. Keyframe-based Motion Generation

The most basic and widespread form of animation generation is keyframing. Artists specify the configuration of the character at certain key points in time and let an interpolation function generate the in-between motion. Early works on facial editing focus on improving the keyframing process, providing an automatic solving strategy to map high-level static user constraints to the key animation parameters. User constraints are formulated as either 2D points such as image features [ZLH03], motion markers [JTDP03], strokes on a screen [DBB∗18, CO18, COL15], or the 2D projection of 3D vertices [ZSCS04, CGZ17]; or 3D controllers such as vertex positions on the mesh [LA10, ATL12, TDlTM11]. Other works leverage reduced-dimension spaces to derive realistic animation parameters [LCXS09, CFP03]. The final animation is then reconstructed using linear interpolation or a blending weight function. The first works considering the temporal behavior of the face propose to propagate the edit by fitting a Catmull-Rom spline [LD08] or a B-spline curve [CLK01] to the edited animation parameters. Alternatively, more sophisticated interpolation methods were proposed, such as bilinear interpolation [AKA96], spline functions [KB84], or cosine interpolation [Par72]. While easy to control and fast at generating coarse animation, such simple interpolation algorithms cannot mimic the complex dynamics of facial motions for segments longer than a few frames. The resulting animation's quality is dictated by the number and relevance of user-created keyframes.

Seol and colleagues [SLS∗12] propagate edits using a movement matching equation. In the same spirit, Dinev and colleagues [DBB∗18] use a gradient-based algorithm to smoothly propagate sparse mouth shape corrections throughout an animation. While producing high-quality results, their solutions rest on well-edited keyframes. Ma et al. [MLD09] learn the editing style on a few frames through a constraint-based Gaussian Process and then utilize it to edit similar frames in the sequence. Their method is efficient at the time-consuming task of animation editing, but it does not ensure temporal consistency of the motion.

To accelerate keyframe specification, several works explore methods to generate hand-drawn in-betweens automatically [BW75]. Recently, considering that human motion dynamics can be learned from data, Zhang et al. [ZvdP18] learn in-between patterns with an auto-regressive two-layer recurrent network to automatically autocomplete a hopping lamp motion between two keyframes. Their system offers the flexibility of keyframing and an intelligent autocompletion learned from data, but does not address the case of long completion segments. Zhou et al. [ZLB∗20] address this with a learning-based method interpolating motion in long-term segments guided by sparse keyframes. They rely on a fully convolutional autoencoder architecture and demonstrate good results on full-body motions. As they point out, using convolutional models for temporal sequences has drawbacks, as it hard-codes the model's framerate, as well as the time window on which temporal dependencies in the signal are considered by the model (the receptive field of the network). Our experience indicates that recurrent networks obtain better results in the case of facial data. One reason might be that facial motions tend to exhibit less inertia and more discontinuities, which are better modeled by recurrent models' ability to learn to preserve or forget temporal behavior at different time scales.
2.1.2. Context-based Motion Prediction

Our work focuses on generating new motion through context-aware learning-based methods. Predicting context-aware motion is a recently popular topic of research. Since the seminal work of [FLFM15] on motion forecasting, an increasing amount of work has addressed learning-based motion generation [RGMN19, WCX19, BBKK17] using previous frames [MBR17]. Early learning-based works rely on deterministic recurrent networks to predict future frames [FLFM15, JZSS16, MBR17]. Overall, recent works turn toward generative frameworks, which have demonstrated state-of-the-art results in motion forecasting [WCX19, RGMN19, ZLB∗20].

2.2. Motion Editing

Multiple works leverage existing data to synthesize temporal motion matching user constraints. A first group of methods derives from data a subspace of realistic motion and performs trajectory optimization, ensuring natural motion generation. Stoiber et al. [SSB08] create a continuous subspace of realistic facial expressions using AAMs, synthesizing coherent temporal facial animation. Akhter et al. [ASK∗12] learn a bilinear spatiotemporal model ensuring a realistic edited animation. Another group of solutions uses motion graphs [KGP02, ZSCS04], which consider the temporality of an animation. Zhang et al. [ZSCS04] create a face graph to interpolate frames realistically. Motion graphs ensure a realistic facial animation, but they require a high memory cost to retain the whole graph.

The first to propose a fully learning-based human motion editing system is the seminal work of Holden et al. [HSK16]. They map high-level control parameters to a learned body motion manifold presented earlier by the same authors [HSKJ15]. Navigating this manifold of body motion allows one to easily alter and control body animations while preserving their plausibility. Recently, several works emphasize the realistic aspect of generated motion through generative and adversarial techniques [WCX19, HHS∗17]. Habibie et al. [HHS∗17] leverage a variational autoencoder to sample new motion from a latent space. Wang et al. [WCX19] stack a "refiner" neural network over the RNN-based generator, trained in an adversarial fashion. While an intuitive and high-level parametrization steering body motion has generated a consensus, there is no such standard abstraction to guide facial motion. Later, Berson and colleagues [BSBS19] use a learning-based method to perform temporal animation editing, driven by meaningful temporal vertex distances. However, this work needs explicit temporal constraints at each frame to edit, precluding precise keyframe-level control. In this work, we propose a new point of view: a generative method from no, discrete, or semantic inputs.
2.3. Facial Reenactment

Our work is also related to the problem of video facial reenactment. Facial reenactment consists of substituting the facial performance in an existing video with one from another source and recomposing a new realistic animation. Video facial reenactment has been an attractive area of research [KEZ∗19, KTC∗18, FTZ∗19, TZS∗16, GVR∗14] for the last decades. One instance of facial reenactment is visual dubbing, which consists of modifying the target video to be consistent with a given audio track [SSKS17, BCS97, GVS∗15, CE05]. Fried et al. [FTZ∗19] propose a new workflow to edit a video by modifying the associated transcript. Their system automatically regenerates the corresponding altered viseme sequence using a two-stage method: a coarse sequence is generated by searching for similar visemes in the video and stitching them together; then a high-quality photorealistic video is synthesized using a recurrent neural network. This work follows the general trend and exploits recurrent GAN architectures [KEZ∗19, SWQ∗20] to produce realistic facial animation matching semantic constraints. However, our work does not aim to improve the photorealism of synthesized facial performances but instead focuses on supplying a versatile and global facial animation editing framework. Indeed, facial reenactment is devoted to a particular facial animation editing scenario, in which either a semantic or a source animation is available, preventing flexible and creative editing applications.
3. Method
Our goal herein is to train from data a generative neural network capable of generating plausible facial motions given different kinds of input constraints, such as sparse keyframes, discrete semantic inputs, or a coarse animation. In this section, we describe the parametrization of our system with the different constraints, enabling supervised motion editing (Section 3.1). We then detail our system, based on the well-established GAN minmax game, as well as the training specifications (Section 3.2). An overview of our system is depicted in Figure 1.
3.1. Animation Parametrization and Editing Constraints

Our system is meant to be used in any animation generation pipeline. Therefore, we parametrize facial animations with the highly popular blendshape representation, common throughout academia and the industry [LAR∗14]. Our training strategy is analogous to recent image inpainting works [YLY∗19, JP19]: we feed a generator, G, with an incomplete animation, a noise vector, a mask, and optionally a discrete, noisy, or semantic input guiding the editing process. At training time, the incomplete animation X_i ∈ R^{L×N} corresponds to the original ground-truth animation X_gt ∈ R^{L×N} with randomly erased segments signaled by the mask. Both the original and the incomplete input animations consist of the concatenation of L = 200 frames of N blendshape coefficients. The mask M ∈ R^{L×N} encodes the locations of the erased segments (all blendshape coefficients for a random number of consecutive frames): it contains zeros everywhere and ones where blendshape coefficients are removed. The input animation can then be expressed as X_i = (1 − M) ⊙ X_gt, where 1 is an all-ones matrix of size L × N and ⊙ denotes the element-wise product. The number and the length of the masked segments in the input animation are chosen randomly, so that at test time our network can edit both short and very long sequences. At test time, masked segments are placed by the user to target the portions of the input sequence to edit. We note that our network can also generate an animation from scratch by using a mask covering the full sequence. The noise vector z is composed of independent components drawn from a Gaussian distribution with zero mean and a standard deviation of 1.
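To make the masking concrete, the following is a minimal sketch of how such a masked training input could be assembled. It is our own illustration, not the authors' released code: the helper name, the number and length of erased segments, and the noise dimension are assumptions (the 34 blendshape coefficients used in the example come from the model described in Section 4.1).

```python
# Illustrative sketch: building X_i = (1 - M) * X_gt with randomly erased segments.
import torch

def make_masked_input(x_gt, max_segments=3, max_len=100):
    """x_gt: ground-truth animation of shape (L, N) blendshape coefficients."""
    L, N = x_gt.shape
    mask = torch.zeros(L, N)                       # M: ones where coefficients are erased
    n_segments = torch.randint(1, max_segments + 1, (1,)).item()
    for _ in range(n_segments):
        seg_len = torch.randint(1, max_len + 1, (1,)).item()
        start = torch.randint(0, max(L - seg_len, 1), (1,)).item()
        mask[start:start + seg_len, :] = 1.0       # erase all coefficients in the segment
    x_in = (1.0 - mask) * x_gt                     # incomplete animation X_i
    z = torch.randn(L, 1)                          # per-frame Gaussian noise (dimension assumed)
    return x_in, mask, z

# Example: a 200-frame clip with 34 blendshape coefficients.
x_gt = torch.rand(200, 34)
x_in, mask, z = make_masked_input(x_gt)
```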
We use the same framework for the different editing scenarios and train a different network for each editing input type. Our framework can also perform unguided motion completion in missing segments, which is useful in the case of long occlusions, for instance. In many cases though, the animator or user wants to guide the edit, so we focus on employing our framework for supervised motion editing. To achieve this, we leverage the conditional GAN (CGAN) [MO14] mechanism to add semantic guidance to our system. We concatenate a constraint matrix to the input, C_i = M̃_i ⊙ C_gt,i, with non-zero components where animation has been erased. C_gt,i ∈ R^{L×N_feat,i} encodes the i-th constraint vector of N_feat,i features over time. M̃_i ∈ R^{L×N_feat,i} is the constraint-specific mask matrix, with zeros everywhere and ones at the same frame indices as M. The constraint C_gt,i can be a sparse matrix of keyframes, a dense noisy animation, or one-hot vectors representing the pronounced viseme at each frame. Each constraint conditions the training of the corresponding specific system. We consider three high-level constraint types enabling animation editing for several use cases (a small construction sketch follows Table 1 below):

• Keyframes: One main cause of animation editing is expression modification, such as correcting the shape of the mouth or adding new expressions. Hence, we add sparse keyframes extracted from the ground-truth animation as constraints. The time between two keyframes is chosen randomly between 0 and 0.8 seconds.

• Noisy animation: Our system enables the user to change the content of the animation and guide it with a coarse animation, such as one obtained from motion capture on consumer devices (webcam, mobile phone, ...).

• Visemes: We also consider a more semantic editing use case, such as speech corrections from audio. We use an audio-to-phoneme tool to obtain a phoneme annotation of each sequence in the database; in this work, we use the Montreal Forced Aligner [MSM∗17]. Phonemes are then grouped into visemes, as listed in Table 1.

Table 1: Groups of phonemes.
sil; G + K + H: g, k, q, å; AO + OY: a, O; L + N + T + D: l, n, t, d; AA + AE + AY: æ, A; S + Z: s, z; EH + EY: e, E, eI; Sh + Ch + Zh: S, Ù, Z; IH + IY + EE + IX: i, I; TH + DH: T, ð; OH + OW: o; F + V: f, v; AH + ER: @, Ä, Ç; M + B + P: b, m, p; UW + AW + UH: u, U, aU; W: w, û; JH: j, Ã; R: ô
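The sketch below shows how the keyframe and viseme constraint matrices described above could be assembled. The helper names, shapes, and the exact masking convention are our assumptions, following the definition C_i = M̃_i ⊙ C_gt,i; they are not the authors' implementation.

```python
# Illustrative sketch: building constraint matrices for keyframe and viseme guidance.
import torch
import torch.nn.functional as F

def keyframe_constraint(x_gt, mask, keyframe_indices):
    """Keep only the selected keyframe expressions inside the erased segments."""
    c = torch.zeros_like(x_gt)
    for t in keyframe_indices:
        c[t] = x_gt[t]                              # sparse keyframe expressions
    keyframe_mask = torch.zeros_like(mask)
    keyframe_mask[keyframe_indices] = 1.0
    return c * mask * keyframe_mask                 # non-zero only where animation was erased

def viseme_constraint(viseme_ids, n_visemes, mask):
    """One-hot viseme labels (e.g., the groups of Table 1) over the edited frames."""
    one_hot = F.one_hot(viseme_ids, n_visemes).float()   # (L, n_visemes)
    frame_mask = mask[:, :1]                        # (L, 1): 1 on frames to edit
    return one_hot * frame_mask
```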
3.2. Generative Adversarial Architecture and Training

Figure 1: Framework overview. We build our editing tool upon a GAN scheme, using an approach similar to image inpainting. We feed the generator with a mask, a masked animation, and a noise vector; optionally, we add constraints such as sparse keyframes, a noisy animation, or a sequence of visemes. The generator outputs the completed animation. The discriminator has to distinguish between real animations and fake ones: it is supplied with the ground-truth animation and the generated one (the partial ground-truth sequence completed with the generated parts).

We consider a generative approach relying on the well-known GAN principle. Hence, as in any GAN framework, our system is composed of two neural networks: a generator, designed to fill the timeline with realistic animation, and a discriminator, intended to evaluate the quality of the generated animation.

Our generator, G, has to learn the temporal dynamics of facial motion. We use a recurrent architecture for the generator, as sharing parameters through time has demonstrated impressive results in modeling, correcting, and generating intricate temporal patterns. Our generator uses a Bidirectional Long Short-Term Memory (B-LSTM) architecture for its capability to adapt to quickly changing contexts while also modeling long-term dependencies. It consists of a sequence of N_layers = 2 B-LSTM layers with a stacked final dense output layer to match the output feature dimensions. The recurrent layers have 128 hidden units. The main goal of the generator is to create plausible animations, i.e., to fill a given timeline segment with realistic motion signals that smoothly connect to the motion at the edges of the segment.

Our discriminator, D, has to learn to distinguish between a generated animation and one produced by ground-truth motion capture. Because we want our generator to create an animation that blends well outside its segment, we supply our discriminator with the entire animation rather than only the generated segment, and choose a convolutional structure for D. Some elements have a higher impact on the perceived quality of a facial animation: for instance, inaccuracies in mouth and eye closures during speech or blinks are naturally picked up as disturbing and unrealistic. Thus, we enrich the discriminator's evaluation with relevant distance measurements over time that match those salient elements. Our discriminator's structure is inspired by recent advances in image inpainting [YLY∗19, JP19]. It is a sequence of 4 convolutional layers with spectral normalization [MKKY18], which stabilizes the training of GANs. On top of the convolutional layers, we stack a fully connected layer predicting the plausibility of the input animation. The convolutional layers have a kernel of size 3, scan their input with a stride of 2, and have respectively 64, 32, 16, and 8 channels. We use the LeakyReLU activation function [XWCL15] after every layer except the last one.
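The following PyTorch sketch mirrors the architecture as described (two B-LSTM layers with 128 hidden units and a dense output layer; four spectrally-normalized convolutions with 64/32/16/8 channels, kernel 3, stride 2, and LeakyReLU). The exact input feature layout, padding, and output head are assumptions we make for illustration, since the paper does not specify them.

```python
# Architecture sketch following Section 3.2 (details such as padding are assumed).
import math
import torch
import torch.nn as nn

class Generator(nn.Module):
    """2-layer bidirectional LSTM (128 hidden units) with a dense output layer."""
    def __init__(self, in_features, out_features, hidden=128, num_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(in_features, hidden, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_features)  # 2x: forward + backward states

    def forward(self, x):                  # x: (batch, L, in_features)
        h, _ = self.rnn(x)
        return self.out(h)                 # (batch, L, out_features) blendshape weights

class Discriminator(nn.Module):
    """4 spectrally-normalized 1D convolutions (64/32/16/8 channels, kernel 3,
    stride 2, LeakyReLU) topped by a fully connected realism score."""
    def __init__(self, in_features, seq_len=200):
        super().__init__()
        layers, c_in = [], in_features
        for c_out in (64, 32, 16, 8):
            layers += [nn.utils.spectral_norm(
                           nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=1)),
                       nn.LeakyReLU(0.2)]
            c_in = c_out
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Linear(8 * math.ceil(seq_len / 16), 1)  # time axis halved 4 times

    def forward(self, y):                  # y: (batch, L, animation + distance features)
        h = self.conv(y.transpose(1, 2))   # convolve over the time axis
        return self.fc(h.flatten(1))       # unbounded realism score
```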
Classically, to train the proposed system we consider the minmax game between the generative and the discriminative losses. The generative loss is inspired by [JP19, YLY∗19]. It includes a reconstruction term L_feat, an L1 distance to the ground truth weighted differently outside and inside the erased segments:

L_feat = α_gt (1 − M) ⊙ |G(X_i) − X_gt| + M ⊙ |G(X_i) − X_gt|.   (1)

The blendshape representation weights salient shapes, such as the ones controlling eyelid closure, and shapes with minor effect, such as the ones affecting the nose deformation, equally. As in Berson et al. [BSBS19], we therefore add a loss L_dis that focuses on preserving some key inter-vertex distances between the estimated and the ground-truth animations. L_dis encourages accurate mouth shape and eyelid closure, crucial ingredients for realistic facial animation. It focuses on six distances: the first three measure the extent between the upper and lower lips (at three different locations along the mouth), the fourth is the extent between the mouth corners, and the last two measure the opening of the right and left eyelids. Finally, the generator is trained to minimize the following loss:

L_G = E[−D(G(X_i))] + w_feat L_feat + w_dis L_dis.   (2)

At the same time, we train our discriminator to minimize the hinge loss. We force the discriminator to focus on the edited part by feeding it with a recomposed animation X_rec, which is the incomplete input animation completed with the generated animation, i.e., X_rec = (1 − M) ⊙ X_gt + M ⊙ G(X_i). We also influence the discriminator's attention by providing it with the key inter-vertex distances mentioned earlier. We add the WGAN-GP loss [GAA∗17], L_gp = E[(‖∇_U D(U) ⊙ M‖ − 1)^2], to make the GAN training more stable. In this formula, U is a vector uniformly sampled along the line between the discriminator inputs Y_gt and Y_rec, i.e., U = t Y_gt + (1 − t) Y_rec with 0 ≤ t ≤ 1. Hence, the loss of the discriminator is:

L_D = E[−D(Y_gt)] + E[1 + D(Y_rec)] + w_gp L_gp,   (3)

where Y refers to the concatenation of an animation and its corresponding inter-vertex distances. For all our experiments, we set w_feat = α_gt = w_gp = 10 and w_dis = 1. We set the initial learning rate of both the generator and the discriminator to 0.001. We use the Adam optimizer [KB14]. We add a dropout of 0.3 to regularize the generator. The system has been implemented using the PyTorch framework.
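A compact sketch of these losses, following Equations (1)-(3), is given below. The distance function dist_fn (mapping blendshape weights to the six mouth/eyelid distances), the way the mask is extended to the discriminator input (mask_y), and the use of the recomposed animation in the generator's adversarial term are simplifying assumptions; this is not the authors' released implementation.

```python
# Loss sketch following Equations (1)-(3) with w_feat = alpha_gt = w_gp = 10, w_dis = 1.
import torch

def recompose(x_gt, g_out, mask, dist_fn):
    # X_rec: input animation completed with the generated segments, plus distances.
    x_rec = (1 - mask) * x_gt + mask * g_out
    return torch.cat([x_rec, dist_fn(x_rec)], dim=-1)

def generator_loss(D, g_out, x_gt, mask, dist_fn,
                   alpha_gt=10.0, w_feat=10.0, w_dis=1.0):
    # Eq. (1): L1 reconstruction, weighted by alpha_gt outside the erased segments.
    l_feat = (alpha_gt * ((1 - mask) * (g_out - x_gt)).abs().mean()
              + (mask * (g_out - x_gt)).abs().mean())
    # Distance loss on the six mouth/eyelid openings.
    l_dis = (dist_fn(g_out) - dist_fn(x_gt)).abs().mean()
    # Eq. (2): adversarial term plus the two auxiliary terms.
    y_gen = torch.cat([g_out, dist_fn(g_out)], dim=-1)
    return -D(y_gen).mean() + w_feat * l_feat + w_dis * l_dis

def discriminator_loss(D, y_gt, y_rec, mask_y, w_gp=10.0):
    # Masked WGAN-GP gradient penalty on points between the real and recomposed inputs.
    t = torch.rand(y_gt.size(0), 1, 1, device=y_gt.device)
    u = (t * y_gt + (1 - t) * y_rec).detach().requires_grad_(True)
    grad = torch.autograd.grad(D(u).sum(), u, create_graph=True)[0]
    gp = (((grad * mask_y).flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    # Eq. (3): discriminator loss on real and recomposed animations.
    return -D(y_gt).mean() + (1 + D(y_rec)).mean() + w_gp * gp
```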
4. Results
In this section, we demonstrate our system's capability to render realistic animation with different types of editing constraints. First, we detail the data used for training and testing our framework (Section 4.1). Then, we describe the different scenarios in which our framework might be useful, from unsupervised motion completion (Section 4.2) to constraint-based motion editing (Section 4.3).
4.1. Data

We use two datasets for our experiments. We leverage the enhanced audiovisual dataset "3D Audio-Visual Corpus of Affective Communication" (B3D(AC)^2) [FGR∗10, BSBS19] to train our networks, especially the one requiring both facial animation and phoneme labels. Overall, the corpus amounts to 85 minutes of animation and will be released for reproducibility of our results. We add another dataset, which consists of performance-based animations manually created with a professional animation software. From the original videos, we also employ an automatic face tracking solution to generate coarse, noisy animations corresponding to those videos. Those trackers are noisy by nature, so we do not need to add artificial noise to the input. We use this last dataset alone to train our "noisy-signal-based" editing system. This training set contains 52 sequences (49 minutes of animation) recorded at different framerates between 30 and 120 frames per second (fps). For all our experiments, we resample every animation at 25 fps (the framerate of the B3D(AC)^2 dataset) and use the same blendshape model counting 34 blendshapes for every animation in each of our scenarios.

As with any learning-based method, it is essential to know how the proposed approach depends on the training data. To test our framework, we record new sequences with a different subject, reciting new sentences and performing different expressions, to check whether the model generalizes well. We derive both the original animations and the noisy ones using the same procedure as described above.
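A small data-preparation sketch of the resampling and windowing just described (25 fps, 200-frame training clips) is given below; the helper names and the window stride are our own assumptions.

```python
# Data-preparation sketch: resample each clip to 25 fps and slice 200-frame windows.
import numpy as np

def resample_to_25fps(anim, src_fps):
    """anim: (frames, n_blendshapes) array recorded at src_fps."""
    duration = (len(anim) - 1) / src_fps
    t_src = np.linspace(0.0, duration, len(anim))
    t_dst = np.arange(0.0, duration, 1.0 / 25.0)
    return np.stack([np.interp(t_dst, t_src, anim[:, j])
                     for j in range(anim.shape[1])], axis=1)

def training_windows(anim, length=200, stride=100):
    """Slice fixed-length training clips from a resampled animation."""
    return [anim[s:s + length] for s in range(0, len(anim) - length + 1, stride)]
```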
4.2. Unsupervised Motion Completion

First, we demonstrate the capability of our system to generate plausible animation without any supervision. We validate our system using animations of the test set, randomly removing some parts of them. We regenerate a complete sequence using our network, producing undirected motion filling. As can be seen in the accompanying video, the generated parts (lasting 2.6 s) blend realistically with the animation preceding and following the edit. In this sequence, our generator produces "talking-style" motions and hallucinates eyebrow movements, rendering the edited parts more plausible.

One potential application of our unsupervised animation generation system is its capability to generate more realistic sequences than simple interpolation methods in the case of long occlusions. We use a newly recorded sequence with an occlusion of around 3 seconds (about 75 frames). Such occlusions often alter the quality of the final animation and require manual cleaning. We compare our generative system with a sequence obtained by interpolating the missing animation with boundary and derivative constraints. As we can see in Figure 2, filling the gap with interpolation leads to long oversmoothed motions, far from realistic motion patterns. Our system creates a more realistic sequence: the subject first returns to the neutral pose and anticipates the wide mouth opening by smoothly reopening the mouth. One might also observe the eyebrow activations, consistent with the mouth openings.
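For illustration, this is how such an unsupervised completion could be invoked at test time with the generator sketched in Section 3.2; the concatenation order of the inputs and the absence of a constraint channel are assumptions on our part.

```python
# Test-time usage sketch: the user marks the occluded segment and the network fills it.
import torch

def fill_segment(G, animation, start, end):
    """animation: (L, N) blendshape weights; [start, end) is the segment to regenerate."""
    mask = torch.zeros_like(animation)
    mask[start:end] = 1.0
    x_in = (1 - mask) * animation
    z = torch.randn(animation.size(0), 1)
    with torch.no_grad():
        net_in = torch.cat([x_in, mask, z], dim=-1).unsqueeze(0)  # add batch dimension
        gen = G(net_in).squeeze(0)
    # Keep the original animation outside the mask, the generated motion inside.
    return (1 - mask) * animation + mask * gen
```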
Figure 2:
Occlusion motion completion. Compared to standard linear interpolation, our system generates realistic motion dynamics: in the case of long occlusions, our system ensures that the mouth returns to the neutral pose. Moreover, as we use a bidirectional architecture, our system anticipates the wide opening of the mouth and smoothly re-opens the mouth from the neutral pose.
4.3. Constraint-based Motion Editing

While unsupervised motion completion can be used to handle long occlusions, most relevant uses require users to steer the editing process. In the following, we present several use cases of guided facial animation editing. We test our system using both the test set, which is composed of sequences of unseen subjects, and new performance-based animations recorded outside the dataset.
It is common for performance-based animation to require additional or localized corrections due to either technical or artistic considerations. Ideally, one would simply use newly captured or hand-specified expressions to edit the animation and expect the editing tool to derive the right facial dynamics, reconstructing a realistic animation automatically. This use case has motivated the keyframe-based supervision of our editing system. We test our system's ability to handle this scenario by randomly removing parts of the input animation and feeding the network with sparse, closely- or widely-spaced keyframe expressions. We observe that the system outputs natural and well-coarticulated motions between the keyframe constraints and the input signal: as we can see in Figure 3, our system generates non-linear blending around the smile keyframe expressions, and naturally reopens the mouth at the end of the edited segment. We can see in the video that our system generates more natural and organic facial dynamics than classic interpolation.

Another use case is adding an expression not present in the existing animation. For instance, in one of our videos, the performer forgot the final wink at the end of the sequence (see Figure 4b). We simply add it to the sequence by constraining the end of the sequence with a wink keyframe, which was recorded later. We can observe in Figure 4b how naturally the mouth moves to combine the pre-existing smiling expression and the added wink request.

Finally, one recurrent shortcoming of performance-based animation is getting a mouth shape that does not match the audio speech. For instance, on a video outside the dataset, we observe that the face capture yields imprecise animation frames for the mouth. As we can see in Figure 4a, the mouth should be almost closed, yet it remains wide open during a few frames. We fed the desired expression as a keyframe input to the system and let the system generate the corrected mouth motion (Figure 4a). The visual signature of labial consonants is a mouth closure; in the same editing spirit, our system can revise an inaccurate labial viseme by imposing a mouth closure. We display an example of this correction in the accompanying video.

Figure 3: Validation of our keyframe-constrained system on our test set with new coarse animation. Our system ensures natural coarticulation between the keyframe constraints and the input signal.

Figure 4: Keyframe-based editing. Our system generates realistic motions with only a few keyframes as constraints. (a) Modification of the mouth shape: our system generates a more faithful shape of the mouth, given only one keyframe. (b) Addition of an expression such as a wink: our system naturally adds a key expression; as we can observe, the mouth motion consistently moves to re-match the smiling expression.
Figure 5:
Noisy-animation-based system. We mask half of the original sequence and provide the noisy animation as guidance for the masked part. As we can see on the left, our system removes jitter and unnatural temporal patterns, generating a smooth animation at the boundary. On the right, we can see how the unrealistic lip frowning movements are filtered by our system, while the natural dynamics of the eyelids are preserved.
Animation changes longer than a few seconds would require specifying many guiding keyframes. Instead, when long segments need to be substantially changed, one could guide animation editing with lower-quality facial tracking applications, using webcam or mobile phone feeds. In that case, the guiding animation is noisy and inaccurate, but it is a simple and intuitive way to convey the animation intent. We test this configuration by feeding our system with noisy animations generated by a blendshape-based face tracking software as a guide for the animation segment to edit. As we can see in Figure 5, our system removes jitter and unrealistic temporal patterns but preserves natural high-frequency components such as eyelid closures.
We also demonstrate the capability of our system to edit an animation semantically. We use an initial sentence found in the test set: "Oh, I've missed you. I've been going completely doolally up here." We generate a new animation by substituting "you" with other nouns or noun phrases pronounced by the same subject, in order to have consistent audio along with the animation. As we can see in Figure 6, our system generates new motions consistent with the input constraint "our little brother": it adjusts the movements of the jaw to create a realistic bilabial viseme. We observe the closure of the mouth when pronouncing "brother" in Figure 6. The system hallucinates consistent micro-motions, such as raising the eyebrows at the same time, favoring natural-looking facial animation. Other examples are shown in the supplementary video.

We also perform viseme-based editing on a new subject reciting new sentences. For instance, we turn the initial sentence "My favorite sport is skiing. I'm vacationing in Hawaii this winter." into "My favorite sport is surfing. I'm vacationing in Hawaii this winter." The generated motion follows the new viseme sequence for "surfing" in Figure 7. More precisely, we can see the bottom lip rising up to the bottom of the top teeth to generate the viseme "f".
5. Evaluation
In this section, we present quantitative evaluations of our framework. First, we demonstrate the capability of our approach to reduce the manual effort required to edit facial animation. We then compare our method with related ones dealing with controllable animation editing. Finally, we assess the quality of our results by gathering user evaluations on a batch of edited sequences.
Figure 6:
Our system modifies the jaw motion according to the input constraints, such as adjusting the jaw opening to fit bilabial consonant constraints. It hallucinates micro-motions, such as raising the eyebrows, to make the edited part more plausible. (Left) Generated frames given the input phoneme sequence "brother".
Figure 7:
Generated frames given the input phoneme sequence "surfing". We can notice the bottom lip rising up to the bottom of the top teeth to generate the viseme "f".
The principal objective of this work is to provide a system that accelerates the editing task. We timed two professional animators to measure the average time they need to create a sequence of 100 frames (see Table 2). From this experiment, we find that it takes between 20 and 50 minutes to create a 100-frame animation, depending on the complexity and framerate of the animation. This amounts to an average individual keyframe setup time between 12 and 30 seconds. We note that this estimation is consistent with the study conducted by Seol et al. [SSK∗11].

Table 2: Average time to create a 100-frame animation.

            Handmade (min)    With animation software (s)
Artist 1    ~20               20
Artist 2    ~50               60

We compare in Table 3 the time to edit a few animations with our system and with manual keyframing. From this experiment, we note that our system considerably reduces the time required to edit animation segments.
Recent controllable motion generation studies have an objective akin to animation editing, as they use regression neural networks to generate motion from high-level inputs. We compare our system to two previous works closely related to motion editing: the seminal work of Holden et al. [HSK16] on controlled body motion generation, and the recent work on facial animation editing of Berson et al. [BSBS19]. For a fair comparison, we use the same control parameters as [BSBS19], and regress the corresponding blendshape weights using either the fully convolutional regressor and decoder of [HSK16], or the two-network system proposed by [BSBS19]. We quantitatively compare the reconstruction error between these methods and our system on the test set. To do so, we mask out the complete input animation and feed our network with the control parameter signals. We measure the mean square error between the original animation and the output one. As we can see in Table 4, our system achieves better performance than a regression network trained with MSE only.
Table 4:
MSE between animations regressed from 8 high-level control parameters and the ground truth.

Method      MSE
[HSK16]     0.016
[BSBS19]    0.018
Ours        -
We also observe qualitative differences between the regressors [HSK16, BSBS19] and our approach. We do so by feeding our generator with dense control parameter curves, as used by the regressors (see Figure 8a). Even when stretching and deforming control curves to match sparse constraints, our system robustly continues to generate animation with realistic dynamics (Figure 8b). As mentioned by Holden et al. [HSK16], the main issue with regression frameworks is the ambiguity of high-level parameter inputs: the same set of high-level parameters can correspond to many different valid motion configurations. We test the behavior of our approach in such ambiguous cases by using very few (3) input control parameters: the mouth opening amplitude, the distance between the mouth corners, and one eyelid closure distance. We indeed observe that a more ambiguous input signal leads to a noisier output animation for the regression networks. With the same input, our system is able to hallucinate the missing motion cues, producing a more natural and smooth animation. We note that our system is even capable of creating plausible dynamics for the whole face in an unsupervised fashion (Section 4.2).

Figure 8: Comparison with controllable motion editing systems. (a) We manually deform the control parameter curves three times (top); the corresponding control parameters are displayed on the right-hand side. (b) Our system generates realistic expressions consistent with the input constraints, as do the regression-based systems developed by Holden et al. [HSK16] and Berson et al. [BSBS19]. Stretching control parameter curves to match sparse constraints may yield unrealistic control parameter trajectories; however, our generative approach always generates motion with realistic dynamics.
Table 3:
Time performance evaluation. We compare the time needed to edit a few animation segments with our system and with manual keyframing; our system considerably reduces the time of facial animation editing. The recorded manual editing times (around 12 and 31 minutes per segment) contrast with our system's generation times (around 0.12 to 0.14 s per segment, and 0.012 s for a viseme edit).
One widely recognized issue with animation generation methods is the reliable evaluation of animation quality. Indeed, there are no quantitative metrics that reflect the naturalness and the realism of facial motions. Hence, we gathered qualitative feedback on edited animations generated by our system in an informal study. A sample of 44 animation sequences - with different lengths, and with or without audio - was presented to 21 subjects. Half of the animations were edited with our system, using either viseme constraints, keyframe expressions, noisy signals, or an unsupervised fashion. Subjects were asked to assess whether the animation came from original mocap or was edited. In essence, participants were asked to play the role of the discriminator in distinguishing original from edited sequences. Most of the participants were not accustomed to close observation of 3D animation content. Among the 21 subjects, 54% of the original animations were classified as such (true positives), while 51% of the edited sequences were also classified as original ones (false positives). We also showed the sequences to 5 experienced subjects who knew the context of this work: even they detected only 58% of the edited sequences (true negatives) and half of the original ones (true positives).
6. Conclusion and Future Work
We have proposed a generative facial animation framework able to handle a wide range of animation editing scenarios. Our framework was inspired by recent image inpainting approaches; it enables unsupervised motion completion, semantic animation modifications, as well as animation editing from sparse keyframes or coarse, noisy animation signals. The lack of high-quality animation data remains the major limitation in facial animation synthesis and editing research. While our system obtains good results, we note that the quality of the produced animation can only be as good and accurate as what the quality and diversity of our animation database cover. We present various results testifying to the validity of the proposed framework, but the current state of our results calls for experimentation on more sophisticated blendshape models, more diverse facial motions, and possibly the addition of rigid head motion.

The presented method relies on a generative model and as such offers no guarantee to match input constraints exactly. Yet, ensuring an exact hit is a standard requirement for high-quality production. We note that a workaround in our case would be to post-process our system's animation to match sparse constraints exactly, following the interpolation of [SLS∗12] for instance. Beyond the proposed solution for offline facial animation editing, an interesting direction would be to enable facial animation modifications to occur in real time. We plan on evaluating the performance of a forward-only recurrent network to assess the feasibility of real-time use cases.

Our system aims to make facial animation editing more accessible to non-expert users, but also more time-efficient, to reduce the bottleneck of animation cleaning and editing. In terms of user interaction, our semantic editing framework requires isolating the animation segments to edit and providing editing cues. An interesting future work would be to integrate our system within a user-oriented application, combining our network with a user interface and a recording framework, forming a complete, interactive, and efficient animation editing tool. Another interesting extension of this work would be to consider audio signals as additional input controllers.
References

[AKA96] Arai K., Kurihara T., Anjyo K.-i.: Bilinear interpolation for facial expression and metamorphosis in real-time animation. The Visual Computer 12, 3 (1996), 105–116.
[ASK∗12] Akhter I., Simon T., Khan S., Matthews I., Sheikh Y.: Bilinear spatiotemporal basis models. ACM Transactions on Graphics 31, 2 (Apr. 2012), 1–12.
[ATL12] Anjyo K., Todo H., Lewis J.: A Practical Approach to Direct Manipulation Blendshapes. Journal of Graphics Tools 16, 3 (Aug. 2012), 160–176.
[BBKK17] Bütepage J., Black M., Kragic D., Kjellström H.: Deep representation learning for human motion prediction and classification. arXiv:1702.07486 (2017).
[BCS97] Bregler C., Covell M., Slaney M.: Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (1997), ACM Press/Addison-Wesley, pp. 353–360.
[BSBS19] Berson E., Soladié C., Barrielle V., Stoiber N.: A Robust Interactive Facial Animation Editing System. In Proceedings of the 12th Annual International Conference on Motion, Interaction, and Games (MIG '19) (2019), ACM, pp. 26:1–26:10.
[BW75] Burtnyk N., Wein M.: Computer animation of free form images. In Proceedings of the 2nd Annual Conference on Computer Graphics and Interactive Techniques (1975), pp. 78–80.
[CE05] Chang Y.-J., Ezzat T.: Transferable videorealistic speech animation. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2005), ACM, pp. 143–151.
[CFP03] Cao Y., Faloutsos P., Pighin F.: Unsupervised Learning for Speech Motion Editing. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2003), pp. 225–231.
[CGZ17] Chi J., Gao S., Zhang C.: Interactive facial expression editing based on spatio-temporal coherency. The Visual Computer 33, 6-8 (2017), 981–991.
[CLK01] Choe B., Lee H., Ko H.-S.: Performance-Driven Muscle-Based Facial Animation. The Journal of Visualization and Computer Animation 12 (2001), 67–79.
[CO18] Cetinaslan O., Orvalho V.: Direct manipulation of blendshapes using a sketch-based interface. In Proceedings of the 23rd International ACM Conference on 3D Web Technology (Web3D '18) (2018).
[COL15] Cetinaslan O., Orvalho V., Lewis J. P.: Sketch-Based Controllers for Blendshape Facial Animation. Eurographics (Short Papers) (2015), 25–28.
[DBB∗18] Dinev D., Beeler T., Bradley D., Bächer M., Xu H., Kavan L.: User-Guided Lip Correction for Facial Performance Capture. Computer Graphics Forum 37, 8 (Dec. 2018), 93–101.
[FGR∗10] Fanelli G., Gall J., Romsdorfer H., Weise T., Van Gool L.: A 3-D Audio-Visual Corpus of Affective Communication. IEEE Transactions on Multimedia 12, 6 (Oct. 2010), 591–598.
[FLFM15] Fragkiadaki K., Levine S., Felsen P., Malik J.: Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4346–4354.
[FTZ∗19] Fried O., Tewari A., Zollhöfer M., Finkelstein A., Shechtman E., Goldman D. B., Genova K., Jin Z., Theobalt C., Agrawala M.: Text-based editing of talking-head video. ACM Transactions on Graphics 38, 4 (July 2019), 1–14.
[GAA∗17] Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A.: Improved Training of Wasserstein GANs. arXiv:1704.00028 (2017).
[GPAM∗14] Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y.: Generative Adversarial Networks. arXiv:1406.2661 (2014).
[GVR∗14] Garrido P., Valgaerts L., Rehmsen O., Thormaehlen T., Perez P., Theobalt C.: Automatic Face Reenactment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (June 2014), pp. 4217–4224.
[GVS∗15] Garrido P., Valgaerts L., Sarmadi H., Steiner I., Varanasi K., Pérez P., Theobalt C.: VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track. Computer Graphics Forum 34, 2 (May 2015), 193–204.
[HHS∗17] Habibie I., Holden D., Schwarz J., Yearsley J., Komura T., Saito J., Kusajima I., Zhao X., Choi M.-G., Hu R.: A Recurrent Variational Autoencoder for Human Motion Synthesis. IEEE Computer Graphics and Applications 37 (2017).
[HSK16] Holden D., Saito J., Komura T.: A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics 35, 4 (July 2016), 1–11.
[HSKJ15] Holden D., Saito J., Komura T., Joyce T.: Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs (2015), ACM, pp. 1–4.
[IZZE17] Isola P., Zhu J.-Y., Zhou T., Efros A. A.: Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (July 2017), pp. 5967–5976.
[JP19] Jo Y., Park J.: SC-FEGAN: Face Editing Generative Adversarial Network with User's Sketch and Color. arXiv:1902.06838 (2019).
[JTDP03] Joshi P., Tien W. C., Desbrun M., Pighin F.: Learning Controls for Blend Shape Based Realistic Facial Animation. In SIGGRAPH/Eurographics Symposium on Computer Animation (2003), pp. 187–192.
[JZSS16] Jain A., Zamir A. R., Savarese S., Saxena A.: Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 5308–5317.
[KB84] Kochanek D. H., Bartels R. H.: Interpolating splines with local tension, continuity, and bias control. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (1984), pp. 33–41.
[KB14] Kingma D. P., Ba J.: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014).
[KEZ∗19] Kim H., Elgharib M., Zollhöfer M., Seidel H.-P., Beeler T., Richardt C., Theobalt C.: Neural Style-Preserving Visual Dubbing. ACM Transactions on Graphics 38, 6 (Nov. 2019), 1–13.
[KGP02] Kovar L., Gleicher M., Pighin F.: Motion graphs. In Proceedings of ACM SIGGRAPH (2002), ACM, p. 482.
[KTC∗18] Kim H., Theobalt C., Carrido P., Tewari A., Xu W., Thies J., Niessner M., Pérez P., Richardt C., Zollhöfer M.: Deep video portraits. ACM Transactions on Graphics 37, 4 (July 2018), 1–14.
[LA10] Lewis J. P., Anjyo K.-i.: Direct Manipulation Blendshapes. IEEE Computer Graphics and Applications 30, 4 (July 2010), 42–50.
[LAR∗14] Lewis J. P., Anjyo K., Rhee T., Zhang M., Pighin F., Deng Z.: Practice and Theory of Blendshape Facial Models. Eurographics (State of the Art Reports) 1, 8 (2014), 2.
[LCXS09] Lau M., Chai J., Xu Y.-Q., Shum H.-Y.: Face poser: Interactive modeling of 3D facial expressions using facial priors. ACM Transactions on Graphics 29, 1 (Dec. 2009), 1–17.
[LD08] Li Q., Deng Z.: Orthogonal-Blendshape-Based Editing System for Facial Motion Capture Data. IEEE Computer Graphics and Applications 28, 6 (Nov. 2008), 76–82.
[LLYY17] Li Y., Liu S., Yang J., Yang M.-H.: Generative Face Completion. arXiv:1704.05838 (2017).
[MBR17] Martinez J., Black M. J., Romero J.: On human motion prediction using recurrent neural networks. arXiv:1705.02445 (2017).
[MKKY18] Miyato T., Kataoka T., Koyama M., Yoshida Y.: Spectral Normalization for Generative Adversarial Networks. arXiv:1802.05957 (2018).
[MLD09] Ma X., Le B. H., Deng Z.: Style learning and transferring for facial animation editing. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2009), ACM, pp. 123–132.
[MO14] Mirza M., Osindero S.: Conditional Generative Adversarial Nets. arXiv:1411.1784 (2014).
[MSM∗17] McAuliffe M., Socolof M., Mihuc S., Wagner M., Sonderegger M.: Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In Interspeech 2017 (Aug. 2017), ISCA, pp. 498–502.
[Par72] Parke F. I.: Computer generated animation of faces. In Proceedings of the ACM Annual Conference - Volume 1 (Boston, Massachusetts, USA, Aug. 1972), ACM '72, pp. 451–457.
[RAY∗16] Reed S., Akata Z., Yan X., Logeswaran L., Schiele B., Lee H.: Generative Adversarial Text to Image Synthesis. (2016), 1060–1069.
[RGMN19] Ruiz A. H., Gall J., Moreno-Noguer F.: Human Motion Prediction via Spatio-Temporal Inpainting. arXiv:1812.05478 (2019).
[SLS∗12] Seol Y., Lewis J. P., Seo J., Choi B., Anjyo K., Noh J.: Spacetime expression cloning for blendshapes. ACM Transactions on Graphics 31, 2 (2012), 14.
[SSB08] Stoiber N., Seguier R., Breton G.: Automatic design of a control interface for a synthetic face. In Proceedings of the 13th International Conference on Intelligent User Interfaces (IUI '09) (Sanibel Island, Florida, USA, 2008), ACM Press, p. 207.
[SSK∗11] Seol Y., Seo J., Kim P. H., Lewis J. P., Noh J.: Artist friendly facial animation retargeting. ACM Transactions on Graphics 30, 6 (Dec. 2011), 162.
[SSKS17] Suwajanakorn S., Seitz S. M., Kemelmacher-Shlizerman I.: Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics 36, 4 (July 2017), 1–13.
[SWQ∗20] Song L., Wu W., Qian C., He R., Loy C. C.: Everybody's Talkin': Let Me Talk as You Want. arXiv:2001.05201 (2020).
[TDlTM11] Tena J. R., De la Torre F., Matthews I.: Interactive Region-based Linear 3D Face Models. In ACM SIGGRAPH 2011 Papers (SIGGRAPH '11) (2011), ACM, pp. 76:1–76:10.
[TZS∗16] Thies J., Zollhöfer M., Stamminger M., Theobalt C., Niessner M.: Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2387–2395.
[WCX19] Wang Z., Chai J., Xia S.: Combining recurrent neural networks and adversarial training for human motion synthesis and control. IEEE Transactions on Visualization and Computer Graphics (2019).
[XWCL15] Xu B., Wang N., Chen T., Li M.: Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv:1505.00853 (2015).
[YLY∗19] Yu J., Lin Z., Yang J., Shen X., Lu X., Huang T.: Free-Form Image Inpainting with Gated Convolution. arXiv:1806.03589 (2019).
[ZLB∗20] Zhou Y., Lu J., Barnes C., Yang J., Xiang S., Li H.: Generative Tweening: Long-term Inbetweening of 3D Human Motions. arXiv:2005.08891 (2020).
[ZLH03] Zhang Q., Liu Z., Shum H.-Y.: Geometry-driven photorealistic facial expression synthesis. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2003), pp. 177–186.
[ZSCS04] Zhang L., Snavely N., Curless B., Seitz S. M.: Spacetime Faces: High Resolution Capture for Modeling and Animation. ACM Transactions on Graphics 23 (2004), 548–558.
[ZvdP18] Zhang X., van de Panne M.: Data-driven autocompletion for keyframe animation. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games (2018), ACM, pp. 1–11.