Robust Motion In-betweening
FÉLIX G. HARVEY,
Polytechnique Montreal, Canada, Mila, Canada, and Ubisoft Montreal, Canada
MIKE YURICK,
Ubisoft Montreal, Canada
DEREK NOWROUZEZAHRAI,
McGill University, Canada and Mila, Canada
CHRISTOPHER PAL,
CIFAR AI Chair, Canada, Polytechnique Montreal, Canada, Mila, Canada, and Element AI, Canada
Fig. 1. Transitions automatically generated by our system between target keyframes (in blue). For clarity, only one in four generated frames is shown. Our tool allows for generating transitions of variable lengths and for sampling different variations of motion given fixed keyframes.
In this work we present a novel, robust transition generation technique that can serve as a new tool for 3D animators, based on adversarial recurrent neural networks. The system synthesizes high-quality motions that use temporally-sparse keyframes as animation constraints. This is reminiscent of the job of in-betweening in traditional animation pipelines, in which an animator draws motion frames between provided keyframes. We first show that a state-of-the-art motion prediction model cannot be easily converted into a robust transition generator when only adding conditioning information about future keyframes. To solve this problem, we then propose two novel additive embedding modifiers that are applied at each timestep to latent representations encoded inside the network's architecture. One modifier is a time-to-arrival embedding that allows variations of the transition length with a single model. The other is a scheduled target noise vector that allows the system to be robust to target distortions and to sample
different transitions given fixed keyframes. To qualitatively evaluate our method, we present a custom MotionBuilder plugin that uses our trained model to perform in-betweening in production scenarios. To quantitatively evaluate performance on transitions and generalization to longer time horizons, we present well-defined in-betweening benchmarks on a subset of the widely used Human3.6M dataset and on LaFAN1, a novel high quality motion capture dataset that is more appropriate for transition generation. We are releasing this new dataset along with this work, with accompanying code for reproducing our baseline results.
CCS Concepts: • Computing methodologies → Motion capture; Neural networks.

Additional Key Words and Phrases: animation, locomotion, transition generation, in-betweening, deep learning, LSTM
ACM Reference Format:
Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. 2020. Robust Motion In-betweening. ACM Trans. Graph. 39, 4, Article 60 (July 2020), 12 pages. https://doi.org/10.1145/3386569.3392480
1 INTRODUCTION

Human motion is inherently complex and stochastic over long-term horizons. This is why Motion Capture (MOCAP) technologies still often surpass generative modeling or traditional animation techniques for 3D characters with many degrees of freedom. However, in modern video games, the number of motion clips needed to properly animate a complex character with rich behaviors is often very large, and manually authoring animation sequences with keyframes or using a MOCAP pipeline are highly time-consuming processes. Some methods to improve curve fitting between keyframes [Ciccone et al. 2019] or to accelerate the MOCAP workflow [Holden 2018] have been proposed to improve these processes. On another
front, many auto-regressive deep learning methods that leverage high quality MOCAP for motion prediction have recently been proposed [Barsoum et al. 2018; Chiu et al. 2019; Fragkiadaki et al. 2015; Gopalakrishnan et al. 2019; Jain et al. 2016; Martinez et al. 2017; Pavllo et al. 2019]. Inspired by these achievements, we build in this work a transition generation tool that leverages the power of Recurrent Neural Networks (RNN) as powerful motion predictors to go beyond keyframe interpolation techniques, which have limited expressiveness and applicability.

We start by building a state-of-the-art motion predictor based on several recent advances in modeling human motion with RNNs [Chiu et al. 2019; Fragkiadaki et al. 2015; Pavllo et al. 2019]. Using a recently proposed target-conditioning strategy [Harvey and Pal 2018], we convert this unconstrained predictor into a transition generator, and expose the limitations of such a conditioning strategy. These limitations include poor handling of transitions of different lengths with a single model, and the inherent determinism of the architectures. The goal of this work is to tackle such problems in order to present a new architecture that is usable in a production environment.

To do so, we propose two different additive modifiers applied to some of the latent representations encoded by the network. The first one is a time-to-arrival embedding applied on the hidden representation of all inputs. This temporal embedding is similar to the positional encoding used in transformer networks [Vaswani et al. 2017] in natural language modeling, but serves here a different role. In our case, these embeddings evolve backwards in time from the target frame in order to allow the recurrent layer to have a continuous, dense representation of the number of timesteps remaining before the target keyframe must be reached. This proves to be essential to remove artifacts such as gaps or stalling at the end of transitions. The second embedding modifier is an additive scheduled target noise vector that forces the recurrent layer to receive distorted target embeddings at the beginning of long transitions. The scheduled scaling reduces the norm of the noise during the synthesis in order to reach the correct keyframe. This forces the generator to be robust to noisy target embeddings. We show that it can also be used to enforce stochasticity in the generated transitions more efficiently than another noise-based method. We then further increase the quality of the generated transitions by operating in the Generative Adversarial Network (GAN) framework with two simple discriminators applied on different timescales.

This results in a temporally-aware, stochastic, adversarial architecture able to generate missing motions of variable length between sparse keyframes of animation. The network takes 10 frames of past context and a single target keyframe as inputs and produces a smooth motion that leads to the target, on time. It allows for cyclic and acyclic motions alike and can therefore help generate high-quality animations from sparser keyframes than what is usually allowed by curve-fitting techniques.
Our model can fill gaps of an arbitrary number of frames under a soft upper-bound, and we show that the particular form of temporal awareness we use is key to achieving this without needing any smoothing post-process. The resulting system allows us to perform robust, automatic in-betweening, or can be used to stitch different pieces of existing motions when blending is impossible or yields poor quality motion. Our system is tested in production scenarios by integrating a trained network in a custom plugin for Autodesk's MotionBuilder, a popular animation software, where it is used to greatly accelerate prototyping and authoring of new animations. In order to also quantitatively assess the performance of different methods on the transition generation task, we present the LaFAN1 dataset, a novel collection of high quality MOCAP sequences that is well-suited for transition generation. We define in-betweening benchmarks on this new dataset as well as on a subset of Human3.6M, commonly used in the motion prediction literature. Our procedure stays close to the common evaluation scheme used in many prediction papers and defined by Jain et al. [2016], but differs on some important aspects. First, we provide error metrics that take into consideration the global root transformation of the skeleton, which provides a better assessment of the absolute motion of the character in the world. This is mandatory in order to produce and evaluate valid transitions. Second, we train and evaluate the models in an action-agnostic fashion and report average errors on a large evaluation set, as opposed to the commonly used 8 sequences per action. We further report generalization results for transitions that are longer than those seen during training. Finally, we also report the Normalized Power Spectrum Similarity (NPSS) measure for all evaluations, as suggested by Gopalakrishnan et al. [2019], which reportedly correlates better with human perception of quality.

Our main contributions can thus be summarized as follows:
• Latent additive modifiers to convert state-of-the-art motion predictors into robust transition generators:
  – A time-to-arrival embedding allowing robustness to varying transition lengths,
  – A scheduled target-noise vector allowing variations in generated transitions,
• New in-betweening benchmarks that take into account global displacements and generalization to longer sequences,
• LaFAN1, a novel high quality motion dataset well-suited for motion prediction that we make publicly available with accompanying code for reproducing our baseline results (https://github.com/ubisoftinc/Ubisoft-LaForge-Animation-Dataset).

2 RELATED WORK

2.1 Motion Control

We refer to motion control here as scenarios in which temporally-dense external signals, usually user-defined, are used to drive the generation of an animation. Even if the main application of the present work is not focused on online control, many works on motion control stay relevant to this research. Motion graphs [Arikan and Forsyth 2002; Beaudoin et al. 2008; Kovar et al. 2008; Lee et al. 2002] allow one to produce motions by traversing nodes and edges that map to character states or motion segments from a dataset. Safonova and Hodgins [2007] combine an interpolated motion graph with an anytime A* search algorithm in order to produce transitions that respect some constraints. Motion matching [Büttner and Clavet 2015] is another search-driven motion control technique, where the current character pose and trajectory are matched to segments of animation in a large dataset. Chai & Hodgins, and Tautges et al.
[2005; 2011] rely on learning local PCA models on pose candidates from a motion dataset, given low-dimensional control signals and previously synthesized poses, in order to generate the next motion frame. All these techniques require a motion database to be loaded in memory, or in the latter cases to perform searches and learning at run-time, limiting their scalability compared to generative models.

Many machine learning techniques can mitigate these requirements. Important work has used the Maximum A Posteriori (MAP) framework, where a motion prior is used to regularize constraint-related objectives to generate motion. Chai and Hodgins [2007] use a statistical dynamics model as a motion prior with user constraints, such as keyframes, to generate motion. Min et al. [2009] use deformable motion models and optimize the deformable parameters at run-time given the MAP framework. Other statistical models, such as Gaussian Processes [Min and Chai 2012] and Gaussian Process Latent Variable Models [Grochow et al. 2004; Levine et al. 2012; Wang et al. 2008; Ye and Liu 2010], have been applied to the constrained motion control task, but are often limited by heavy run-time computations and memory requirements that still scale with the size of the motion database. As a result, these are often applied to separate types of motions and combined together with some post-process, limiting the expressiveness of the systems.

Deep neural networks can circumvent these limitations by allowing huge, heterogeneous datasets to be used for training, while having a fixed computation budget at run-time. Holden et al. [2016; 2015] use feed-forward convolutional neural networks to build a constrained animation synthesis framework that uses root trajectory or end-effectors' positions as control signals. Online control from a gamepad has also been tackled with phase-aware [Holden et al. 2017], mode-aware [Zhang et al. 2018] and action-aware [Starke et al. 2019] neural networks that can automatically choose a mixture of network weights at run-time to disambiguate possible motions. Recurrent Neural Networks (RNNs), on the other hand, keep an internal memory state at each timestep that naturally allows them to perform such disambiguation, and are very well suited for modeling time series. Lee et al. [2018] train an RNN for interactive control using multiple control signals. These approaches [Holden et al. 2017, 2016; Lee et al. 2018; Zhang et al. 2018] rely on spatially or temporally dense signals to constrain the motion and thus reduce ambiguity. In our system, a character might have to precisely reach a temporally distant keyframe without any dense spatial or temporal information provided by the user during the transition. The spatial ambiguity is mostly alleviated by the RNN's memory and the target-conditioning, while the timing ambiguity is resolved in our case by time-to-arrival embeddings added to the RNN inputs. Remaining ambiguity can be alleviated with generative adversarial training [Goodfellow et al. 2014], in which the motion generator learns to fool an additional discriminator network that tries to differentiate generated sequences from real sequences. Barsoum et al. [2018] and Gui et al. [2018] both design new loss functions for human motion prediction, while also using adversarial losses with different types of discriminators.
These losses help reduce artifacts that may be produced by generators that average different modes of the plausible motions' distribution.

Motion control has also been addressed with Reinforcement Learning (RL) approaches, in which the problem is framed as a Markov Decision Process where actions can correspond to actual motion clips [Lee and Lee 2006; Treuille et al. 2007] or character states [Lee et al. 2010], but again requiring the motion dataset to be loaded in memory at run-time. Physically-based control gets rid of this limitation by having the output of the system operate on a physically-driven character. Coros et al. [2009] employ fitted value iteration with actions corresponding to optimized Proportional-Derivative (PD) controllers proposed by Yin et al. [2007]. These RL methods operate on value functions that have discrete domains, which do not represent the continuous nature of motion and impose run-time estimations through interpolation.

Deep RL methods, which use neural networks as powerful continuous function approximators, have recently started being used to address these limitations. Peng et al. [2017] apply a hierarchical actor-critic algorithm that outputs desired joint angles for PD-controllers. Their approach is applied on a simplified skeleton and does not express human-like quality of movement despite their style constraints. Imitation-learning based RL approaches [Baram et al. 2016; Ho and Ermon 2016] try to address this with adversarial learning, while others tackle the problem by penalizing the distance of a generated state from a reference state [Bergamin et al. 2019; Peng et al. 2018]. Actions as animation clips, or control fragments [Liu and Hodgins 2017], can also be used in a deep-RL framework with Q-learning to drive physically-based characters. These methods show impressive results for characters having physical interactions with the world, while still being limited to specific skills or short cyclic motions. We operate in our case in the kinematics domain and train on significantly more heterogeneous motions.
2.2 Motion Prediction

We limit here the definition of motion prediction to generating unconstrained motion continuation given single or multiple frames of animation as context. This task implies learning a powerful motion dynamics model, which is useful for transition generation. Neural networks have shown over the years to excel at such representation learning. Early work from Taylor et al. [2007] using Conditional Restricted Boltzmann Machines showed promising results on motion generation by sampling at each timestep the next frame of motion conditioned on the current hidden state and n previous frames. More recently, many RNN-based approaches have been proposed for motion prediction from a past-context of several frames, motivated by the representational power of RNNs for temporal dynamics. Fragkiadaki et al. [2015] propose to separate spatial encoding and decoding from the temporal dependencies modeling with the Encoder-Recurrent-Decoder (ERD) networks, while Jain et al. [2016] apply structural RNNs to model human motion sequences represented as spatio-temporal graphs. Other recent approaches [Chiu et al. 2019; Gopalakrishnan et al. 2019; Liu et al. 2019; Martinez et al. 2017; Pavllo et al. 2019; Tang et al. 2018] investigate new architectures and loss functions to further improve short-term and long-term prediction of human motion. Others [Ghosh et al. 2017; Li et al. 2017] investigate ways to prevent divergence or collapsing to the average pose for long-term predictions with RNNs. In this work, we start by building a powerful motion predictor based on the state-of-the-art recurrent architecture for long-term prediction proposed by Chiu et al. [2019]. We combine this architecture with the feed-forward encoders of Harvey et al. [2018] applied to different parts of the input, to allow our embedding modifiers to be applied on distinct parts of the inputs. In our case, we operate on joint-local quaternions for all bones, except for the root, for which we use quaternions and translations local to the last seed frame.

2.3 Transition Generation

We define transition generation as a type of control with temporally sparse spatial constraints, i.e. where large gaps of motion must be filled without explicit conditioning during the missing frames, such as trajectory or contact information. This is related to keyframe or motion interpolation (e.g. [Ciccone et al. 2019]), but our work extends interpolation in that the system allows for generating whole cycles of motion, which cannot be done by most key-based interpolation techniques, such as spline fitting. Pioneering approaches [Cohen et al. 1996; Witkin and Kass 1988] on transition generation and interpolation used spacetime constraints and inverse kinematics to produce physically-plausible motion between keyframes. Probabilistic models of human motion have also been used for filling gaps of animation. These include the MAP optimizers of Chai et al. [2007] and Min et al. [2009], the Gaussian process dynamical models from Wang et al. [2008] and Markov models with dynamic auto-regressive forests from Lehrmann et al. [2014]. All of these present specific models for given actions and actors. This can make combinations of actions look scripted and sequential. The scalability and expressiveness of deep neural networks has been applied to keyframe animation by Zhang et al. [2018], who use an RNN conditioned on key-frames to produce jumping motions for a simple 2D model. Harvey et al.
[2018] present Recurrent Transition Networks (RTN) that operate on a more complex human character, but work on fixed-length transitions with positional data only, and are deterministic. We use the core architecture of the RTN, as we make use of the separately encoded inputs to apply our latent modifiers. Hernandez et al. [2019] recently applied convolutional adversarial networks to pose the problem of prediction or transition generation as an in-painting one, given the success of convolutional generative adversarial networks on such tasks. They also propose frequency-based losses to assess motion quality, but do not provide a detailed evaluation for the task of in-betweening.
3 METHODOLOGY

3.1 Data Formatting

We use a humanoid skeleton that has j = 28 joints when using the Human3.6M dataset and j = 22 in the case of the LaFAN1 dataset. We use a local quaternion vector q_t of j × 4 dimensions as our main data representation, along with a 3-dimensional global root velocity vector ṙ_t at each timestep t. We also extract from the data, based on toe and foot velocities, contact information as a binary vector c_t of 4 dimensions that we use when working with the LaFAN1 dataset. The offset vectors o_t^r and o_t^q contain respectively the global root position's offset and the local quaternions' offsets from the target keyframe at time t. Even though the quaternion offset could be expressed as a valid, normalized quaternion, we found that using simpler element-wise linear differences simplifies learning and yields better performance. When computing our positional loss, we reformat the predicted state into a global positions vector p_{t+1} using q_{t+1}, r_{t+1} and the stored, constant local bone translations b by performing Forward Kinematics (FK). The resulting vector p_{t+1} has j × 3 dimensions. We also retrieve through FK the global quaternions vector g_{t+1}, which we use for quantitatively evaluating transitions.

The discriminators use as input sequences of 3-dimensional vectors of global root velocities ṙ, concatenated with x and ẋ, the root-relative positions and velocities of all other bones respectively. The vectors x and ẋ both have (j − 1) × 3 dimensions.

To simplify the learning process, we rotate each input sequence seen by the network around the Y axis (up) so that the root of the skeleton points towards the X+ axis on the last frame of past context. Each transition thus starts with the same global horizontal facing. We refer to rotations and positions relative to this frame as global in the rest of this work. When using the network inside a content creation software, we store the applied rotation in order to rotate back the generated motion to fit the context. Note however that this has no effect on the public dataset Human3.6M, since root transformations are set to the identity on the first frame of any sequence, regardless of the actual world orientation. We also augment the data by mirroring the sequences over the X+ axis with a probability of 0.5 during training.
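Since the positional loss, the evaluation metrics and the discriminator inputs all depend on this FK step, a minimal NumPy sketch of it is given below. The [w, x, y, z] quaternion layout, the function names and the parents array are our assumptions, not the released code.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a[..., 0], a[..., 1], a[..., 2], a[..., 3]
    bw, bx, by, bz = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
    return np.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ], axis=-1)

def quat_rotate(q, v):
    """Rotate 3-vectors v by unit quaternions q."""
    qvec = q[..., 1:]
    uv = 2.0 * np.cross(qvec, v)
    return v + q[..., :1] * uv + np.cross(qvec, uv)

def forward_kinematics(local_q, root_pos, bone_offsets, parents):
    """local_q: (j, 4) unit quaternions; root_pos: (3,);
    bone_offsets: (j, 3) constant local translations b; parents: (j,) ints
    with parents[0] being the root. Returns global positions p (j, 3)
    and global quaternions g (j, 4)."""
    j = local_q.shape[0]
    g = np.empty_like(local_q)
    p = np.empty_like(bone_offsets)
    g[0], p[0] = local_q[0], root_pos
    for i in range(1, j):
        par = parents[i]
        g[i] = quat_mul(g[par], local_q[i])          # accumulate rotations
        p[i] = p[par] + quat_rotate(g[par], bone_offsets[i])
    return p, g
```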
3.2 Transition Generator

Figure 2 presents a visual depiction of our recurrent generator for a single timestep. It uses the same input separation used by the RTN network [Harvey and Pal 2018], but operates on angular data and uses FK in order to retrieve global positions [Pavllo et al. 2019]. It is also augmented with our latent space modifiers z_tta and z_target. Finally, it also uses different losses, such as an adversarial loss for improved realism of the generated motions.

As seen in Figure 2, the generator has three different encoders that take the different data vectors described above as inputs: the character state encoder, the offset encoder, and the target encoder. The encoders are all fully-connected Feed-Forward Networks (FFN) with a hidden layer of 512 units and an output layer of 256 units. All layers use the Piecewise Linear Activation function (PLU) [Nicolae 2018], which performed slightly better than Rectified Linear Units (ReLU) in our experiments. The time-to-arrival embedding z_tta has 256 dimensions and is added to the latent input representations. Offset and target embeddings h_t^offset and h_t^target are then concatenated and added to the 512-dimensional target-noise vector z_target. Next, the three augmented embeddings are concatenated and fed as input to a recurrent Long Short-Term Memory (LSTM) layer. The embedding from the recurrent layer, h_t^LSTM, is then fed to the decoder, another FFN with two PLU hidden layers of 512 and 256 units respectively and a linear output layer. The resulting output is separated into local-quaternion and root velocities q̂̇_{t+1} and r̂̇_{t+1} to retrieve the next character state. When working with the LaFAN1 dataset, the decoder has four extra output dimensions that go through a sigmoid non-linearity φ to retrieve contact predictions ĉ_{t+1}. The estimated quaternions q̂_{t+1} are normalized as valid unit quaternions and used along with the new root position r̂_{t+1} and the constant bone offsets b to perform FK and retrieve the new global positions p̂_{t+1}.
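The following PyTorch sketch summarizes one generator timestep as we read Figure 2 and the description above. The 512/256 encoder and decoder widths, the 256-d z_tta and the 512-d z_target come from the text; the exact input dimensionalities, the LSTM hidden size and the substitution of ReLU for PLU are assumptions made for brevity.

```python
import torch
import torch.nn as nn

def ffn(in_dim, hidden, out_dim):
    # The paper uses PLU activations; ReLU is substituted here for brevity.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim), nn.ReLU())

class TransitionGeneratorStep(nn.Module):
    """One generator timestep: three encoders, additive latent modifiers,
    an LSTM, and a decoder. j = 22 matches the LaFAN1 skeleton."""
    def __init__(self, j=22):
        super().__init__()
        state_dim = j * 4 + 3    # local quaternions + root velocity (assumed)
        offset_dim = j * 4 + 3   # quaternion and root offsets to the target
        target_dim = j * 4       # target keyframe quaternions
        self.state_enc = ffn(state_dim, 512, 256)
        self.offset_enc = ffn(offset_dim, 512, 256)
        self.target_enc = ffn(target_dim, 512, 256)
        self.lstm = nn.LSTM(768, 1024, batch_first=True)  # hidden size assumed
        self.decoder = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, state_dim + 4))  # +4 contact logits on LaFAN1

    def forward(self, state, offset, target, z_tta, z_target, hc=None):
        # z_tta (256-d) is added to all three latent input representations.
        h_state = self.state_enc(state) + z_tta
        h_offset = self.offset_enc(offset) + z_tta
        h_target = self.target_enc(target) + z_tta
        # The 512-d scheduled noise distorts the offset/target code.
        h_ot = torch.cat([h_offset, h_target], dim=-1) + z_target
        x = torch.cat([h_state, h_ot], dim=-1).unsqueeze(1)  # (B, 1, 768)
        out, hc = self.lstm(x, hc)
        y = self.decoder(out.squeeze(1))
        dq, dr = y[:, :-7], y[:, -7:-4]          # quaternion / root velocities
        contacts = torch.sigmoid(y[:, -4:])      # contact predictions
        return dq, dr, contacts, hc
```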
Fig. 2. Overview of the TG_complete architecture for in-betweening.
Computations for a single timestep are shown. Visual concatenation of input boxes or arrows represents vector concatenation. Green boxes are the jointly trained neural networks. Dashed boxes represent our two proposed embedding modifiers. The "quat norm" and "FK" red boxes represent the quaternion normalization and Forward Kinematics operations respectively. The ⊕ sign represents element-wise addition and φ is the sigmoid non-linearity. Outputs are linked to associated losses with dashed lines.

3.3 Time-to-Arrival Embeddings

We present here our method to allow robustness to variable lengths of in-betweening. In order to achieve this, simply adding conditioning information about the target keyframe is insufficient, since the recurrent layer must be aware of the number of frames left until the target must be reached. This is essential to produce a smooth transition without teleportation or stalling. Transformer networks [Vaswani et al. 2017] are attention-based models that are increasingly used in natural language processing due to their state-of-the-art modeling capacity. They are sequence-to-sequence models that do not use recurrent layers but require positional encodings that modify a word embedding to represent its location in a sentence. Our problem is also a sequence-to-sequence task where we translate a sequence of seed frames to a transition sequence, with additional conditioning on the target keyframe. Although our generator does use a recurrent layer, it needs time-to-arrival awareness in order to gracefully handle transitions of variable lengths. To this end, we use the mathematical formulation of positional encodings, which we base in our case on the time-to-arrival tta to the target:

$$\mathbf{z}_{tta,2i} = \sin\!\left(\frac{tta}{\mathrm{basis}^{2i/d}}\right), \qquad (1)$$

$$\mathbf{z}_{tta,2i+1} = \cos\!\left(\frac{tta}{\mathrm{basis}^{2i/d}}\right), \qquad (2)$$

where tta is the number of timesteps until arrival, and the second subscript of the vector z_tta represents the dimension index. The value d is the dimensionality of the input embeddings, and i ∈ [0, ..., d/2]. The basis component influences the rate of change in frequencies along the embedding dimensions. It is set to 10,000, as in most transformer implementations.

Time-to-arrival embeddings thus provide continuous codes that shift input representations in the latent space smoothly and uniquely for each transition step, due to the phase and frequency shifts of the sinusoidal waves on each dimension. Such an embedding is thus bounded, smooth and dense, three characteristics beneficial for learning. Its additive nature makes it harder for a neural network to ignore, as can be the case with concatenation methods. This follows the successful trend in computer vision [Dumoulin et al. 2017; Perez et al. 2018] of conditioning through transformations of the latent space instead of conditioning with input concatenation. In those cases, the conditioning signals are significantly more complex and the affine transformations need to be learned, whereas Vaswani et al. [2017] report similar performance when using this sine-based formulation as when using learned embeddings.

It is said that positional encodings can generalize to longer sequences in the natural language domain.
However, since z_tta evolves backwards in time to retrieve a time-to-arrival representation, generalizing to longer sequences becomes a more difficult challenge. Indeed, in the case of Transformers (without temporal reversal), the first embeddings of the sequence are always the same and smoothly evolve towards new ones when generalizing to longer sequences. In our case, longer sequences drastically change the initial embedding seen, and may thus generate unstable hidden states inside the recurrent layer before the transition begins. This can hurt performance on the first frames of transitions when extending the time-horizon after training. To alleviate this problem, we define a maximum duration in which we allow z_tta to vary, and fix it past this maximum duration. Precisely, the maximum duration T_max(z_tta) is set to T_max(trans) + T_past − 5, where T_max(trans) is the maximum transition length seen during training and T_past is the number of seed frames given before the transition. This means that when dealing with transitions of length T_max(trans), the model sees a constant z_tta for 5 frames before it starts to vary. This allows the network to handle a constant z_tta and to keep the benefits of this augmentation even when generalizing to longer transitions. Visual representations of z_tta and the effects of T_max(z_tta) are shown in Appendix A.2.

We explored simpler approaches to induce temporal awareness, such as concatenating a time-to-arrival dimension either to the inputs of the state encoder, or to the LSTM layer's inputs. This tta dimension is a single scalar increasing from 0 to 1 during the transition. Its period of increase is set to T_max(z_tta). Results comparing these methods with a temporally unaware network, and with our use of z_tta, can be visualized in Figure 3.
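A minimal sketch of this embedding, including the clamp at T_max(z_tta), could look as follows (function and argument names are ours):

```python
import numpy as np

def tta_embedding(tta, d=256, basis=10000.0, tta_max=None):
    """Sinusoidal time-to-arrival embedding (Eqs. 1-2). When tta_max is
    given, the embedding is clamped at T_max(z_tta) so that it stays
    constant beyond the horizon seen during training."""
    if tta_max is not None:
        tta = min(tta, tta_max)
    i = np.arange(d // 2)
    angles = tta / basis ** (2 * i / d)
    z = np.empty(d)
    z[0::2] = np.sin(angles)   # even dimensions (Eq. 1)
    z[1::2] = np.cos(angles)   # odd dimensions (Eq. 2)
    return z

# With T_max(trans) = 50 and T_past = 10 seed frames, the clamp sits at
# T_max(z_tta) = 50 + 10 - 5 = 55 frames:
z = tta_embedding(tta=30, tta_max=55)
```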
Fig. 3. Reducing the L2Q loss with z_tta. We compare simple interpolation with our temporally unaware model (TG-Q) on the walking subset of Human3.6M. We further test two strategies based on adding a single tta dimension, either to the character state (TG-Q + tta input scalar) or to the LSTM inputs (TG-Q + tta
LSTM scalar). Finally, our use of time-to-arrival embeddings (TG-Q + z_tta) yields the best results, mostly noticeable at the end of transitions, where the generated motion is smoother than interpolation.

3.4 Scheduled Target Noise

Another contribution of this work is to improve robustness to keyframe modifications and to enforce diversity in the generated transitions given a fixed context. To do so, we propose a scheduled target-distortion strategy. We first concatenate the encoded embeddings h_t^offset and h_t^target of the current offset vector and the target keyframe respectively. We then add to the resulting vector the target noise vector z_target, sampled once per sequence from a spherical, zero-centered Gaussian distribution N(0, I · σ_target). The standard deviation σ_target is a hyper-parameter controlling the level of accepted distortion. In order to produce smooth transitions to the target, we then define a target noise multiplier λ_target, responsible for scaling down z_target as the number of remaining timesteps goes down. We define a period of noise-free generation (5 frames) where λ_target = 0, and a period of linear decrease of the target-noise (25 frames) to produce our noise-scale schedule. Beyond 30 frames before the target, the target-noise is therefore constant and λ_target = 1:

$$\lambda_{target} = \begin{cases} 1 & \text{if } tta \geq 30 \\ \frac{tta - 5}{25} & \text{if } 5 \leq tta < 30 \\ 0 & \text{if } tta < 5 \end{cases} \qquad (3)$$

Since this modifier is additive, it also corrupts time-to-arrival information, effectively distorting the timing information. This allows the pace of the generated motion to be modified. Our target noise schedule is intuitively similar to an agent receiving a distorted view of its long-term target, with this goal becoming clearer as the agent advances towards it. In our experiments, this additive embedding modifier outperformed another common approach in terms of diversity of the transitions while keeping the motion plausible. Indeed, a widespread approach for conditional GANs is to use a noise vector z_concat as additional input to the conditioned generator in order to enable stochasticity and potentially disambiguate the possible outcomes from the condition (e.g. to avoid mode collapse). However, in highly constrained cases like ours, the condition is often informative enough to obtain good performance, especially at the beginning of training. This leads to the generator learning to ignore the additional noise, as observed in our tests (see Figure 4). We thus force the transition generator to be stochastic by using z_target to distort its view of the target and current offsets.
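The schedule of Equation 3, together with the once-per-sequence sampling of z_target, can be sketched as follows (names are ours; σ_target = 0.5 as in our training details):

```python
import numpy as np

def lambda_target(tta):
    """Scheduled noise multiplier of Eq. 3."""
    if tta >= 30:
        return 1.0               # constant, fully distorted view of the target
    if tta >= 5:
        return (tta - 5) / 25.0  # linear decay over 25 frames
    return 0.0                   # noise-free during the last 5 frames

def scheduled_target_noise(tta, z_target):
    """z_target is sampled once per sequence from N(0, I * sigma_target)
    and rescaled at every frame."""
    return lambda_target(tta) * z_target

z_target = np.random.normal(0.0, 0.5, size=512)  # sampled once per sequence
distortion = scheduled_target_noise(tta=40, z_target=z_target)
```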
Fig. 4. Increasing variability with z_target. We compare z_concat (left) against z_target (right) midway through a 100-frame transition re-sampled 10 times. The generator successfully learns to ignore z_concat, while z_target is imposed and leads to noticeable variations with controllable scale.

3.5 Motion Discriminators

A common problem with reconstruction-based losses and RNNs is the blurriness of the results, which translates into collapse to the average motion and foot slides when predicting motion. The target keyframe conditioning can slightly alleviate these problems, but additional improvement comes from our use of adversarial losses, given by two discriminator networks. We use two variants of a relatively simple feed-forward architecture for our discriminators, or critics, C1 and C2. Each discriminator has 3 fully-connected layers, with the last one being a 1D linear output layer. C1 is a long-term critic that looks at sliding windows of 10 consecutive frames of motion, and C2 is the short-term critic that looks at windows of instant motion over 2 frames. Both critics have 512 and 256 units in their first and second hidden layers respectively. The hidden layers use ReLU activations. We average the discriminator scores over time in order to produce a single scalar loss. A visual summary of our sliding critics is presented in Appendix A.1.

3.6 Losses and Training

In order to make the training stable and to obtain the most realistic results, we use multiple loss functions as complementary soft constraints that the neural network learns to respect.
Reconstruction losses. All of our reconstruction losses for a predicted sequence X̂ given its ground truth X are computed with the L1 norm:

$$L_{quat} = \frac{1}{T}\sum_{t=0}^{T-1} \lVert \hat{\mathbf{q}}_t - \mathbf{q}_t \rVert_1, \qquad (4)$$

$$L_{root} = \frac{1}{T}\sum_{t=0}^{T-1} \lVert \hat{\mathbf{r}}_t - \mathbf{r}_t \rVert_1, \qquad (5)$$

$$L_{pos} = \frac{1}{T}\sum_{t=0}^{T-1} \lVert \hat{\mathbf{p}}_t - \mathbf{p}_t \rVert_1, \qquad (6)$$

$$L_{contacts} = \frac{1}{T}\sum_{t=0}^{T-1} \lVert \hat{\mathbf{c}}_t - \mathbf{c}_t \rVert_1, \qquad (7)$$

where T is the sequence length. The two main losses that we use are the local-quaternion loss L_quat and the root-position loss L_root. The former is computed over all joints' local rotations, including the root node, which in this case also determines global orientation. The latter is responsible for the learning of the global root displacement. As an additional reconstruction loss, we use a positional loss L_pos that is computed on the global position of each joint, retrieved through FK. In theory, the use of L_pos isn't necessary to achieve a perfect reconstruction of the character state when using L_quat and L_root, but as noted by Pavllo et al. [2019], using global positions helps to implicitly weight the orientations in the bone hierarchy for better results. As we will show in Section 4.2, adding this loss indeed improves results on both quaternion and translation reconstructions. Finally, in order to allow for runtime Inverse Kinematics (IK) correction of the legs inside an animation software, we also use a contact prediction loss L_contacts between predicted contacts ĉ_t and true contacts c_t. We use the contact predictions at runtime to indicate when to perform IK on each leg. This loss is used only for models trained on the LaFAN1 dataset that are deployed in our MotionBuilder plugin.

Adversarial losses. We use the Least Squares GAN (LSGAN) formulation [Mao et al. 2017]. As our discriminators operate on sliding windows of motion, we average their losses over time. Our LSGAN losses are defined as follows:

$$L_{gen} = \mathbb{E}_{\mathbf{X}_p, \mathbf{X}_f \sim p_{Data}}\left[\left(D(\mathbf{X}_p, G(\mathbf{X}_p, \mathbf{X}_f), \mathbf{X}_f) - 1\right)^2\right], \qquad (8)$$

$$L_{disc} = \mathbb{E}_{\mathbf{X}_p, \mathbf{X}_{trans}, \mathbf{X}_f \sim p_{Data}}\left[\left(D(\mathbf{X}_p, \mathbf{X}_{trans}, \mathbf{X}_f) - 1\right)^2\right] + \mathbb{E}_{\mathbf{X}_p, \mathbf{X}_f \sim p_{Data}}\left[\left(D(\mathbf{X}_p, G(\mathbf{X}_p, \mathbf{X}_f), \mathbf{X}_f)\right)^2\right], \qquad (9)$$

where X_p, X_f, and X_trans represent the past context, target state, and transition respectively, in the discriminator input format described in Section 3.1. G is the transition generator network. Both discriminators use the same loss, with different input sequence lengths.

Progressive transition lengths. In order to accelerate training, we adopt a curriculum learning strategy with respect to the transition lengths. Each training run starts at the first epoch with P_min = P̃_max, where P_min and P̃_max are the minimal and current maximal transition lengths. During training, we increase P̃_max until it reaches the true maximum transition length P_max. The increase rate is set by the number of epochs n_ep-max by which we wish to have reached P̃_max = P_max. For each minibatch, we sample the current transition length uniformly between P_min and P̃_max, making the network train with variable-length transitions while beginning the training with simple tasks only. In our experiments, this leads to similar results as using any teacher forcing strategy, while accelerating the beginning of training due to the shorter batches. Empirically, it also outperformed gradient clipping. At evaluation time, the transition length is fixed to the desired length.
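A sketch of this curriculum is shown below; the linear ramp of P̃_max over n_ep-max epochs and the starting length of 5 frames are assumptions, as the exact values did not survive in this copy:

```python
import numpy as np

def sample_transition_length(epoch, p_min=5, p_max=50, n_ep_max=5,
                             rng=np.random):
    """Curriculum over transition lengths: the current maximum ramps
    from p_min up to p_max over n_ep_max epochs, and each minibatch
    draws its transition length uniformly from the current range."""
    ramp = min(epoch / n_ep_max, 1.0)
    cur_max = int(round(p_min + ramp * (p_max - p_min)))
    return rng.randint(p_min, cur_max + 1)  # numpy's high bound is exclusive
```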
In practice, our discriminators are implemented as 1D temporal convolutions with strides of 1, without padding, and with receptive fields of 1 in the last 2 layers, yielding parallel feed-forward networks for each motion window in the sequence.
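A minimal PyTorch sketch of such a critic, together with the LSGAN objectives of Equations 8 and 9, is given below; the channel count assumes the j = 22 LaFAN1 skeleton, and all names are ours:

```python
import torch
import torch.nn as nn

def make_critic(window, in_channels=129):
    """Sliding-window critic as 1D temporal convolutions: the first layer
    spans `window` frames (10 for the long-term critic C1, 2 for the
    short-term C2); the last two layers have receptive fields of 1, so
    each window of the sequence yields one score. in_channels assumes
    j = 22: 3 root-velocity channels plus (j - 1) * 3 positions and
    (j - 1) * 3 velocities."""
    return nn.Sequential(
        nn.Conv1d(in_channels, 512, kernel_size=window), nn.ReLU(),
        nn.Conv1d(512, 256, kernel_size=1), nn.ReLU(),
        nn.Conv1d(256, 1, kernel_size=1))

def lsgan_losses(critic, real_seq, fake_seq):
    """LSGAN objectives (Eqs. 8-9), with scores averaged over all sliding
    windows. Sequences are (batch, channels, frames); detach fake_seq
    for the discriminator update in practice."""
    loss_disc = ((critic(real_seq) - 1) ** 2).mean() \
                + (critic(fake_seq) ** 2).mean()
    loss_gen = ((critic(fake_seq) - 1) ** 2).mean()
    return loss_disc, loss_gen
```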
Training details. In all of our experiments, we use mini-batches of 32 sequences of variable lengths, as explained above. We use the AMSGrad optimizer [Reddi et al. 2018] with a learning rate of 0.001 and adjusted parameters (β1 = 0.5, β2 = 0.9) for increased stability. We scale all of our losses to be approximately equal on the LaFAN1 dataset for an untrained network before tuning them with custom weights. In all of our experiments, these relative weights (when applicable) are 1.0 for L_quat and L_root, 0.5 for L_pos, and 0.1 for L_gen and L_contacts. The target noise's standard deviation σ_target is 0.5. In experiments on Human3.6M, we set n_ep-max to 5, while it is set to 3 on the larger LaFAN1 dataset.

4 RESULTS

4.1 Motion Prediction

Based on recent advances in motion prediction, we first build a motion prediction network that yields state-of-the-art results. We evaluate our model on the popular motion prediction benchmark that uses the Human 3.6M dataset. We follow the evaluation protocol defined by Jain et al. [2016], which we base on the code from Martinez et al. [2017]. We train the networks for a fixed number of iterations before evaluation. We use the core architecture of Harvey et al. [2018], since their separate encoders allow us to apply our embedding modifiers. In the case of unconstrained prediction, however, this is more similar to the Encoder-Recurrent-Decoder (ERD) networks from Fragkiadaki et al. [2015]. We also apply the velocity-based input representation of Chiu et al. [2019], which seems to be a key component for improving performance. This is shown empirically in our experiments for motion prediction, but as we will see in Section 4.2, it doesn't hold for transition generation, where the character state as an input is more informative than velocities for producing correct transitions, evaluated on global angles and positions. Another difference lies in our data representation, which is based on quaternions instead of exponential maps. We call our architecture for motion prediction the ERD-Quaternion Velocity network (ERD-QV). This model is therefore similar to the one depicted in Figure 2, with quaternion velocities q̇_t as the only inputs of the state encoder instead of q_t and ṙ_t, and without the two other encoders and their inputs. No embedding modifier and no FK are used in this case, and the only loss used is the L1 norm on joint-local quaternions. In this evaluation, the root transform is ignored, to be consistent with previous works.

In Table 1, we compare this model with the TP-RNN, which obtains to our knowledge state-of-the-art results for Euler angle differences. We also compare with two variants of the VGRU architecture proposed by Gopalakrishnan et al. [2019], who propose a novel Normalized Power Spectrum Similarity (NPSS) metric for motion prediction that is more correlated with human assessment of motion quality.
Table 1. Unconstrained motion prediction results on Human 3.6M. The VGRU-d/rl models are from [Gopalakrishnan et al. 2019]. The TP-RNN is from [Chiu et al. 2019] and has, to our knowledge, the best published results on motion prediction for this benchmark. Our model, ERD-QV, is competitive with the state-of-the-art on angular errors and improves performance with respect to the recently proposed NPSS metric on all actions.
(MAE at 80 / 160 / 320 / 500 / 560 / 1000 ms; NPSS over 0-1000 ms; "–" marks values not available.)

Walking:
Zero-Vel       0.39  0.68  0.99  1.15  1.35  1.32   NPSS 0.1418
VGRU-rl        0.34  0.47  0.64  0.72  –     –      NPSS –
VGRU-d         –     –     –     –     –     –      NPSS 0.1170
TP-RNN         0.25  0.41  0.58  –     0.74  –      NPSS –
ERD-QV (ours)  –     –     –     –     –     –      NPSS –

Eating:
Zero-Vel       0.27  0.48  0.73  0.86  1.04  1.38   NPSS 0.0839
VGRU-rl        0.27  0.40  0.64  0.79  –     –      NPSS –
VGRU-d         –     –     –     –     –     –      NPSS 0.1210
TP-RNN         0.20  –     –     –     0.84  –      NPSS –
ERD-QV (ours)  –     –     –     –     –     –      NPSS –

Smoking:
Zero-Vel       0.26  0.48  0.97  0.95  1.02  1.69   NPSS 0.0572
VGRU-rl        0.36  0.61  –     –     –     –      NPSS –
VGRU-d         –     –     –     –     –     –      NPSS 0.0840
TP-RNN         0.26  0.48  0.88  –     –     –      NPSS –
ERD-QV (ours)  –     –     –     –     –     –      NPSS –

Discussion:
Zero-Vel       0.31  0.67  0.94  1.04  1.41  1.96   NPSS 0.1221
VGRU-rl        0.46  0.82  0.95  1.21  –     –      NPSS –
VGRU-d         –     –     –     –     –     –      NPSS 0.1940
TP-RNN         –     –     –     –     –     –      NPSS –
ERD-QV (ours)  –     –     –     –     –     –      NPSS –

Note that in most cases, we improve upon the TP-RNN for angular errors and perform better than the VGRU-d proposed by Gopalakrishnan et al. [2019] on their proposed metric. This allows us to confirm the performance of our chosen architecture as the basis of our transition generation model.
4.2 Transition Generation on Human 3.6M

Given our highly performing prediction architecture, we now build upon it to produce a transition generator (TG). We start by first adding conditioning information about the future target and the current offset to the target, and then sequentially add our proposed contributions to show their quantitative benefits on a novel transition benchmark. Even though the Human 3.6M dataset is one of the most used in motion prediction research, most of the actions it contains are ill-suited for long-term prediction or transitions (e.g. smoking, discussion, phoning, ...), as they consist of sporadic, random short movements that are impossible to predict beyond short time horizons. We thus choose to use a subset of the Human 3.6M dataset consisting only of the three walk-related actions (walking, walkingdog, walkingtogether), as they are more interesting for testing transitions over 0.5 seconds long. Like previous studies, we work with a 25Hz sampling rate and thus subsample the original 50Hz data. We keep Subject 5 as the test subject. The test set is composed of windows of motion regularly sampled every 10 frames in the sequences of Subject 5, with lengths that depend on the evaluation length. In order to evaluate robustness to variable lengths of transitions, we train the models on transitions of random lengths ranging from 0.2 to 2 seconds (5 to 50 frames) and evaluate on lengths going up to 4 seconds (100 frames) to also test generalization to longer time horizons. More precisely, we train with transitions of maximum lengths of 50 frames, plus 10 seed frames and 10 frames of future context to visually assess the motion continuation, from which the first frame is the target keyframe; this yields windows of 70 frames in total. We report average L2 distances of global quaternions (L2Q) and global positions (L2P):

$$L2Q = \frac{1}{|\mathcal{D}|\, T}\sum_{s \in \mathcal{D}}\sum_{t=0}^{T-1} \lVert \hat{\mathbf{g}}_t^s - \mathbf{g}_t^s \rVert_2, \qquad (10)$$

$$L2P = \frac{1}{|\mathcal{D}|\, T}\sum_{s \in \mathcal{D}}\sum_{t=0}^{T-1} \lVert \hat{\mathbf{p}}_t^s - \mathbf{p}_t^s \rVert_2, \qquad (11)$$

where s is a transition sequence of the test set D, and T is the transition length. Note that we compute L2P on normalized global positions, using statistics from the training set. Precisely, we extract the global positions' statistics on windows of 70 frames offset by 10 frames, in which the motion has been centered around the origin on the horizontal plane. We center the motion by subtracting the mean of the root's XZ positions from all joints' XZ positions. We report the L2P metric as it is arguably a better metric than any angular loss for assessing the visual quality of transitions with global displacements. It is however incomplete, in that bone orientations might be wrong even with the right positions. We also report NPSS scores, which are based on angular frequency comparisons with the ground truth. Our results are shown in Table 2.
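Assuming per-frame features stacked into flat vectors, both metrics reduce to a few NumPy lines (array layouts and names are our assumptions):

```python
import numpy as np

def l2q(g_pred, g_true):
    """L2Q (Eq. 10): average L2 distance between stacked global
    quaternions; arrays are (num_windows, T, j * 4)."""
    return np.linalg.norm(g_pred - g_true, axis=-1).mean()

def l2p(p_pred, p_true, mean, std):
    """L2P (Eq. 11): the same distance on global positions, after
    normalization with training-set statistics; (num_windows, T, j * 3)."""
    return np.linalg.norm((p_pred - mean) / std - (p_true - mean) / std,
                          axis=-1).mean()
```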
Table 2. Transition generation benchmark on Human 3.6M. Models were trained with transition lengths of a maximum of 50 frames, but are evaluated beyond this horizon, up to 100 frames (4 seconds).
L2Q:
Length (frames)        5     10    25    50    75    100   AVG
Interpolation          0.32  0.69  1.62  2.70  4.34  6.18  2.64
TG-QV                  0.48  0.74  1.22  2.44  4.71  7.11  2.78
TG-Q                   0.50  0.75  1.29  2.27  4.07  6.28  2.53
+ L_pos                –     –     –     –     –     –     –
+ z_tta                –     –     –     –     –     –     –
+ z_target             –     –     –     –     –     –     –
+ L_gen (TG_complete)  –     –     –     –     –     –     –

Our first baseline consists of a naive interpolation strategy, in which we linearly interpolate the root position and spherically interpolate the quaternions between the keyframes defining the transition. In the very short term this is an efficient strategy, as motion becomes almost linear on sufficiently small timescales. We then compare transition generators that receive quaternion velocities as input (TG-QV) with one receiving normal quaternions (TG-Q), as depicted in Figure 2. Both approaches have similar performance on short transitions, while TG-QV shows worse results on longer transitions. This can be expected with such a model, which isn't given a clear representation of the character state at each frame. We thus choose TG-Q as our main baseline, onto which we sequentially add our proposed modifications. We first add the global positional loss (+L_pos) as an additional training signal, which improves performance on most metrics and lengths. We then add the unconstrained time-to-arrival embedding modifier (+z_tta) and observe our most significant improvement. These effects on 50-frame transitions are summarized in Figure 3. Next, we evaluate the effects of our scheduled target embedding modifier z_target. Note that it is turned off for quantitative evaluation. The effects are minor for transitions of 5 and 10 frames, but z_target is shown to generally improve performance for longer transitions. We argue that these improvements come from the fact that this target noise helps generalization to new sequences, as it improves the model's robustness to new or noisy conditioning information. Finally, we obtain our complete model (TG_complete) by adding our adversarial loss L_gen, which interestingly not only improves the visual quality of the generated motions, but also most of the quantitative scores.

Qualitatively, enabling the target noise allows the model to produce variations of the same transitions, and it is trivial to control the level of variation by adjusting σ_target. We compare our approach to a simpler variant that also aims at inducing stochasticity in the generated transition. In this variant, we aim at potentially disambiguating the missing target information, such as velocities, by concatenating a random noise vector z_concat to the target keyframe input q_T. This is similar to a strategy used in conditional GANs to avoid mode collapse given the condition. Figure 4 and the accompanying video show typical results obtained with our technique against this more classical technique.
4.3 Transition Generation on LaFAN1

Given our model selection based on the Human 3.6M walking benchmark discussed above, we further test our complete model on a novel, high quality motion dataset containing a wide range of actions, often with significant global displacements, which are interesting for in-betweening compared to the Human3.6M dataset. This dataset contains 496,672 motion frames sampled at 30Hz and captured in a production-grade MOCAP studio. It contains actions performed by 5 subjects, with Subject 5 used as the test set. Similarly to the procedure used for the Human3.6M walking subset, our test set is made of regularly-sampled motion windows. Given the larger size of this dataset, we sample our test windows from Subject 5 every 40 frames, and thus retrieve 2232 windows for evaluation. The training statistics for normalization are computed on windows of 50 frames offset by 20 frames. Once again, our starting baseline is a normal interpolation. We make this new dataset public, along with accompanying code that allows one to extract the same training set and statistics as in this work, to extract the same test set, and to evaluate naive baselines (zero-velocity and interpolation) on this test set for our in-betweening benchmark. We hope this will facilitate future research and comparisons on the task of transition generation. We train our models on this dataset on Subjects 1 to 4. We then compare a reconstruction-based, future-conditioned Transition Generator (TG_rec) using L_quat, L_root, L_pos and L_contacts with our augmented adversarial Transition Generator (TG_complete), which adds our proposed embedding modifiers z_tta and z_target and our adversarial loss L_gen. Results are presented in Table 3. Our contributions improve performance on all quantitative measurements.
Table 3. Improving in-betweening on the LaFAN1 dataset. Models were trained with transition lengths of a maximum of 30 frames (1 second), and are evaluated on 5, 15, 30, and 45 frames.
L2Q:
Length (frames)   5       15      30      45
Interpolation     0.22    0.62    0.98    1.25
TG_rec            –       –       –       –
TG_complete       –       –       –       –

L2P:
Interpolation     0.37    1.25    2.32    3.45
TG_rec            –       –       –       –
TG_complete       –       –       –       –

NPSS:
Interpolation     0.0023  0.0391  0.2013  0.4493
TG_rec            –       –       –       –
TG_complete       –       –       –       –

On this larger dataset with more complex movements, our proposed in-betweeners surpass interpolation even on very short transitions, as opposed to what was observed on the Human3.6M walking subset. This motivates the use of our system even on short time horizons.

4.4 In-betweening in Production Software

In order to also qualitatively test our models, we deploy networks trained on LaFAN1 in a custom plugin inside Autodesk's MotionBuilder, a widely used animation authoring and editing software. This enables the use of our model on user-defined keyframes, or the generation of transitions between existing clips of animation. Figure 5 shows an example scene with an incomplete sequence alongside our user interface for the plugin. The
Source Character is the one from which keyframes are extracted, while the generated frames are applied onto the
Target Character's skeleton. In this setup it is trivial to re-sample different transitions while controlling the level of target noise through the
Variation parameter. A variation of 0 makes the model deterministic. Changing the temporal or spatial location of the target keyframes and producing new animations is also trivial. Such examples of variations can be seen in Figure 6. The user can decide to apply IK guided by the network's contact predictions through the
Enable IK checkbox. An example of the workflow and rendered results can be seen in the accompanying video.

The plugin with the loaded neural network takes 170MB of memory. Table 4 shows a summary of the average speed of different in-betweening cases. This shows that an animator can use our tool to generate transition candidates almost for free, compared to manually authoring such transitions or finding similar motions in a motion database.
Fig. 5.
Generating animations inside MotionBuilder.
On the left is a scene where the last seed frame and target keyframe are visible. On the right is our user interface for the plugin, which allows, among other things, to specify the level of scheduled target noise for the next generation through the variation parameter, and to use the network's contact predictions to apply IK. On the bottom is the timeline where the gap of missing motion is visible.
Table 4. Speed performance summary of our MotionBuilder plugin. The model inference also includes the IK post-process. The last column indicates the time taken to produce a string of 10 transitions of 30 frames. Everything is run on an Intel Xeon CPU E5-1650 @ 3.20GHz.

Transition time (s)        0.50   1.00   2.00   10 x 1.00
Keyframe extraction (s)    0.01   0.01   0.01   0.01
Model inference (s)        0.30   0.31   0.31   0.40
Applying keyframes (s)     0.72   1.05   1.65   6.79
Total (s)                  1.03   1.37   1.97   7.20
5 DISCUSSION

We found our time-to-arrival and scheduled target noise additive modifiers to be very effective for robustness to time variations and for enabling sampling capabilities. We explored relatively simpler concatenation-based methods, which showed worse performance. We hypothesize that concatenating time-to-arrival or noise dimensions is often less efficient because the neural network can learn to ignore those extra dimensions, which are not crucial at the beginning of training. Additive embedding modifiers, however, impose a shift in latent space and are thus harder to bypass.
In order to gain some insights on the importance of the training set content, we trained two additional models with ablated versions of the LaFAN1 dataset. For the first one, we removed all dance training sequences (approximately 10% of the data). For the second one, we kept only those sequences, yielding a much smaller dataset (21.5 minutes). Keeping only the dance sequences yielded results similar to the larger ablated dataset, but the full dataset is necessary to generate transitions that stay in a dancing style. This indicates that large amounts of generic data can be as useful as much fewer specialized sequences for a task, but that combining both is key. An example is shown in the accompanying video.
Fig. 6. Three types of variations of a crouch-to-run transition.
A single frame per generated transition is shown, taken at the same timestep. Semi-transparent poses are the start and end keyframes. A: Temporal variations are obtained by changing the temporal location of the second keyframe. This relies on our time-to-arrival embeddings (z_tta). B: Spatial variations can be obtained by simply moving the target keyframe in space. C: Motion variations are obtained by re-sampling the same transition with our scheduled target noise (z_target) enabled. These results can also be seen in the accompanying video.

When building our motion predictor ERD-QV, we based our input representation on velocities, as suggested by the TP-RNN architecture of Chiu et al. [2019]. However, we did not witness any gains when using their proposed Triangular-Prism RNN (TP-RNN) architecture. Although it is unclear why, it might be due to the added depth of our network from the feed-forward encoder, making the triangular prism architecture unnecessary.

Although we use FK for our positional loss L_pos, as suggested by Pavllo et al. [2019], this loss isn't sufficient to produce fully defined character configurations. Indeed, using only this loss may lead to correct joint positions but offers no guarantee for the bone orientations, and led in our experiments to noticeable artifacts, especially at the ends of the kinematic chains.
Some recent works on motion prediction prefer Gated Recurrent Units (GRU) over LSTMs for their lower parameter count, but our empirical performance comparisons favored LSTMs over GRUs.
A more informative way of representing the current offset to the target o_t would be to include positional offsets in the representation. For this to be informative, however, it would need to rely on character-local or global positions, which require FK. Although it is possible to perform FK inside the network at every step of generation, the backward pass during training becomes prohibitively slow, justifying our use of root offsets and rotational offsets only.

As with many data-driven approaches, our method struggles to generate transitions for which the conditions are unrealistic, or outside the range covered by the training set.

Our scheduled target noise allows us to modify to some extent the manner in which a character reaches its target, reminiscent of changing the style of the motion, but doesn't yet allow control over those variations. Style control given a fixed context would be very interesting, but is out of scope for this work.

6 CONCLUSION

In this work we first showed that state-of-the-art motion predictors cannot be converted into robust transition generators by simply adding conditioning information about the target keyframe. We proposed a time-to-arrival embedding modifier to allow robustness to transition lengths, and a scheduled target noise modifier to allow robustness to target keyframe variations and to enable sampling capabilities in the system. We showed how such a system allows animators to quickly generate quality motion between sparse keyframes inside an animation software. We also presented LaFAN1, a new high quality dataset well suited for transition generation benchmarking.
ACKNOWLEDGMENTS
We thank Ubisoft Montreal, the Natural Sciences and Engineering Research Council of Canada and Mitacs for their support. We also thank Daniel Holden, Julien Roy, Paul Barde, Marc-André Carbonneau and Olivier Pomarez for their support and valuable feedback.
REFERENCES
Okan Arikan and David A Forsyth. 2002. Interactive motion generation from examples. In ACM Transactions on Graphics (TOG), Vol. 21. ACM, 483–490.
Nir Baram, Oron Anschel, and Shie Mannor. 2016. Model-based Adversarial Imitation Learning. arXiv preprint arXiv:1612.02179 (2016).
Emad Barsoum, John Kender, and Zicheng Liu. 2018. HP-GAN: Probabilistic 3D human motion prediction via GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1418–1427.
Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. 2008. Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 117–126.
Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: data-driven responsive control of physics-based characters. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–11.
Michael Büttner and Simon Clavet. 2015. Motion Matching - The Road to Next Gen Animation. In Proc. of Nucl.ai 2015.
Jinxiang Chai and Jessica K Hodgins. 2005. Performance animation from low-dimensional control signals. In ACM Transactions on Graphics (ToG), Vol. 24. ACM, 686–696.
Jinxiang Chai and Jessica K Hodgins. 2007. Constraint-based motion optimization using a statistical dynamic model. ACM Transactions on Graphics (TOG) 26, 3 (2007), 8.
Hsu-kuang Chiu, Ehsan Adeli, Borui Wang, De-An Huang, and Juan Carlos Niebles. 2019. Action-agnostic human pose forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1423–1432.
Loïc Ciccone, Cengiz Öztireli, and Robert W. Sumner. 2019. Tangent-space Optimization for Interactive Animation Control. ACM Trans. Graph. 38, 4, Article 101 (July 2019), 10 pages. https://doi.org/10.1145/3306346.3322938
Michael Cohen, Brian Guenter, Bobby Bodenheimer, and Charles Rose. 1996. Efficient Generation of Motion Transitions Using Spacetime Constraints. In SIGGRAPH 96.
Stelian Coros, Philippe Beaudoin, and Michiel van de Panne. 2009. Robust task-based control policies for physics-based characters. In ACM Transactions on Graphics (TOG), Vol. 28. ACM, 170.
Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. 2017. A Learned Representation For Artistic Style. ICLR (2017). https://arxiv.org/abs/1610.07629
Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. 2017. Learning Human Motion Models for Long-term Predictions. arXiv preprint arXiv:1704.02827 (2017).
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G Ororbia. 2019. A neural temporal model for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12116–12125.
Keith Grochow, Steven L Martin, Aaron Hertzmann, and Zoran Popović. 2004. Style-based inverse kinematics. In ACM Transactions on Graphics (TOG), Vol. 23. ACM, 522–531.
Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José MF Moura. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV). 786–803.
Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. ACM, 4.
Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human Motion Prediction via Spatio-Temporal Inpainting. In Proceedings of the IEEE International Conference on Computer Vision. 7134–7143.
Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems. 4565–4573.
Daniel Holden. 2018. Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics (TOG) 37, 4 (2018), 165.
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 42.
Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 138.
Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs. ACM, 18.
Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
Lucas Kovar, Michael Gleicher, and Frédéric Pighin. 2008. Motion graphs. In ACM SIGGRAPH 2008 Classes. ACM, 51.
Jehee Lee, Jinxiang Chai, Paul SA Reitsma, Jessica K Hodgins, and Nancy S Pollard. 2002. Interactive control of avatars animated with human motion data. In ACM Transactions on Graphics (ToG), Vol. 21. ACM, 491–500.
Jehee Lee and Kang Hoon Lee. 2006. Precomputing avatar behavior from human motion data. Graphical Models 68, 2 (2006), 158–174.
Kyungho Lee, Seyoung Lee, and Jehee Lee. 2018. Interactive character animation by learning multi-objective control. In SIGGRAPH Asia 2018 Technical Papers. ACM, 180.
Yongjoon Lee, Kevin Wampler, Gilbert Bernstein, Jovan Popović, and Zoran Popović. 2010. Motion fields for interactive character locomotion. In ACM Transactions on Graphics (TOG), Vol. 29. ACM, 138.
Andreas M Lehrmann, Peter V Gehler, and Sebastian Nowozin. 2014. Efficient nonlinear Markov models for human motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1314–1321.
Sergey Levine, Jack M Wang, Alexis Haraux, Zoran Popović, and Vladlen Koltun. 2012. Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics (TOG) 31, 4 (2012), 28.
Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, and Hao Li. 2017. Auto-Conditioned LSTM Network for Extended Complex Human Motion Synthesis. arXiv preprint arXiv:1707.05363 (2017).
Libin Liu and Jessica Hodgins. 2017. Learning to schedule control fragments for physics-based characters using deep Q-learning. ACM Transactions on Graphics (TOG) 36, 3 (2017), 29.
Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards Natural and Accurate Future Motion Prediction of Humans and Animals. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2794–2802.
Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
Jianyuan Min and Jinxiang Chai. 2012. Motion graphs++: a compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics (TOG) 31, 6 (2012), 153.
Jianyuan Min, Yen-Lin Chen, and Jinxiang Chai. 2009. Interactive generation of human animation with deformable motion models. ACM Transactions on Graphics (TOG) 29, 1 (2009), 9.
Andrei Nicolae. 2018. PLU: The Piecewise Linear Unit Activation Function. arXiv preprint arXiv:1809.09534 (2018).
Dario Pavllo, Christoph Feichtenhofer, Michael Auli, and David Grangier. 2019. Modeling Human Motion with Quaternion-Based Neural Networks. International Journal of Computer Vision (2019), 1–18.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills. ACM Transactions on Graphics (Proc. SIGGRAPH 2018) 37, 4 (2018).
Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. 2017. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36, 4 (2017), 41.
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations. https://openreview.net/forum?id=ryQu7f-RZ
Alla Safonova and Jessica K Hodgins. 2007. Construction and optimal search of interpolated motion graphs. In ACM Transactions on Graphics (TOG), Vol. 26. ACM, 106.
Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. 2018. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513 (2018).
Jochen Tautges, Arno Zinke, Björn Krüger, Jan Baumann, Andreas Weber, Thomas Helten, Meinard Müller, Hans-Peter Seidel, and Bernd Eberhardt. 2011. Motion reconstruction using sparse accelerometer data. ACM Transactions on Graphics (ToG) 30, 3 (2011), 18.
Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. 2007. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems. 1345–1352.
Adrien Treuille, Yongjoon Lee, and Zoran Popović. 2007. Near-optimal character animation with continuous control. ACM Transactions on Graphics (TOG) 26, 3 (2007), 7.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
Jack M Wang, David J Fleet, and Aaron Hertzmann. 2008. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (2008), 283–298.
Andrew Witkin and Michael Kass. 1988. Spacetime Constraints. In Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '88). ACM, New York, NY, USA, 159–168. https://doi.org/10.1145/54852.378507
Yuting Ye and C Karen Liu. 2010. Synthesis of responsive motion using a dynamic model. In Computer Graphics Forum, Vol. 29. Wiley Online Library, 555–562.
KangKang Yin, Kevin Loken, and Michiel van de Panne. 2007. SIMBICON: Simple biped locomotion control. In ACM Transactions on Graphics (TOG), Vol. 26. ACM, 105.
He Zhang, Sebastian Starke, Taku Komura, and Jun Saito. 2018. Mode-Adaptive Neural Networks for Quadruped Motion Control. ACM Transactions on Graphics (TOG) 37, 4 (2018).
Xinyi Zhang and Michiel van de Panne. 2018. Data-driven autocompletion for keyframe animation. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games. ACM, 10.
A APPENDIX

A.1 Sliding critics
Fig. 7. Visual summary of the two-timescale critics. Blue frames are the given contexts and green frames correspond to the transition. The first and last critic positions are shown without transparency. At the beginning and end of transitions, the critics are conditional in that they include ground-truth context in their input sequences. Scalar scores at each timestep are averaged to get the final score.
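As a minimal sketch, this sliding evaluation could be implemented as follows; the window lengths (2 and 10 frames) and the small MLP critics are illustrative assumptions, not the exact architecture:

    import torch
    import torch.nn as nn

    def sliding_score(critic, seq, window):
        # seq: (time, features) over context + transition frames. Each
        # window is flattened, scored by the critic, and the per-timestep
        # scalar scores are averaged into a single sequence score.
        scores = [critic(seq[t:t + window].reshape(-1))
                  for t in range(seq.shape[0] - window + 1)]
        return torch.stack(scores).mean()

    feat_dim, seq_len = 64, 40  # hypothetical dimensions
    short_critic = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    long_critic = nn.Sequential(nn.Linear(10 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    seq = torch.randn(seq_len, feat_dim)
    score = 0.5 * (sliding_score(short_critic, seq, 2) + sliding_score(long_critic, seq, 10))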
A.2 Time-to-arrival embedding visualization
Fig. 8. Visual depiction of time-to-arrival embeddings. Sub-figure (b) shows the effect of using T_max(z_tta), which in practice improves performance when generalizing to longer transitions, as it prevents initializing the LSTM hidden state with novel embeddings.
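For reference, a minimal sketch of such a sinusoidal time-to-arrival embedding is given below; the dimensionality, the basis, and the value of T_max are illustrative assumptions:

    import torch

    def tta_embedding(tta, dim=256, basis=10000.0, tta_max=30):
        # Clamp the time-to-arrival at T_max so that transitions longer
        # than those seen in training reuse familiar embeddings instead of
        # initializing the LSTM hidden state with novel ones.
        tta = min(tta, tta_max)
        i = torch.arange(0, dim, 2, dtype=torch.float32)
        freqs = basis ** (-i / dim)
        z = torch.zeros(dim)
        z[0::2] = torch.sin(tta * freqs)  # even dimensions
        z[1::2] = torch.cos(tta * freqs)  # odd dimensions
        return z  # added to the hidden input representations at each timestep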