Convolutional Humanoid Animation via Deformation
JOHN KANJI,
University of Toronto
DAVID I. W. LEVIN,
University of Toronto
Fig. 1. Our Convolutional algorithm for Humanoid Animation via Deformation (CHAD) parameterizes object pose via a learned configuration manifold. CHAD generates new animations (green dots) by following interpolating curves between keyframes (red dots) on this manifold. CHAD uses no prior on motion or subject type, enabling the synthesis of face motion, whole body motion or even multiple character motion.
In this paper we present a new deep learning-driven approach to image-based synthesis of animations involving humanoid characters. Unlike previous deep approaches to image-based animation, our method makes no assumptions on the type of motion to be animated, nor does it require dense temporal input to produce motion. Instead we generate new animations by interpolating between user-chosen keyframes, arranged sparsely in time. Utilizing a novel configuration manifold learning approach we interpolate suitable motions between these keyframes. In contrast to previous methods, ours requires less data (animations can be generated from a single YouTube video) and is broadly applicable to a wide range of motions including facial motion, whole body motion and even scenes with multiple characters. These improvements serve to significantly reduce the difficulty in producing image-based animations of humanoid characters, allowing even broader audiences to express their creativity.

CCS Concepts: •
Computing methodologies → Animation;
Additional Key Words and Phrases: Deep Learning, Animation, Image-Based
ACM Reference format:
John Kanji and David I. W. Levin. 2016. Convolutional Humanoid Animation via Deformation. 1, 1, Article 1 (January 2016), 11 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
Character animation is hard. In computer graphics, animating a character begins with choosing a suitable motion parameterization
called a rig. Rigs come in many forms, from skeletal rigs used to represent human motion to blendshapes, with much in between. Even when availing oneself of state-of-the-art approaches to help automate rig construction, the process can be tedious and requires copious amounts of skill and precision. A poorly built rig could be hard to control, exclude important character motions or both. For a blockbuster movie, rigs are works of art, the result of the collaboration of many expert modelers and rigging artists.

With rig in hand, the real work begins: synthesizing character motion by crafting a time-varying trajectory through the rig-space. This process can be artist-guided, driven by motion capture data or even video. However, manually constructing appealing character motions requires patience and an artistic eye for the subtleties of human motion, motion capture requires expensive additional hardware and software, and methods for video often rely on strong priors, meaning that no single method is broadly applicable to all types of character animation.

One approach to ease the burden of character animation is to use image-based approaches. This is quickly becoming a de facto standard approach for facial animation and has been explored for character animation as well. The advantage of these approaches is that, by leveraging machine learning techniques, compelling subspaces for pose can be created and then driven using video, permitting easy synthesis of animations.

Unfortunately, these approaches typically require strong priors on the poses they can create. Ironically these methods rely on the blendshapes and rigs that are so burdensome in traditional computer animation approaches. A consequence of this is that these algorithms do not apply to general humanoid character motion synthesis – instead they individually specialize towards face, hand or body animation and will fail for novel, unexpected motions. For instance, systems that rely on face models and facial landmarks can only reproduce poses of the human face.

Our goal is to produce a general, image-based algorithm for character motion synthesis. In contrast to previous methods, our approach for generating convolutional humanoid animation via deformation (CHAD) does so without requiring any explicit prior on the motion type. Instead, we learn an "implicit rig" by constructing a configuration space for a particular animation from short videos (we use mostly YouTube videos).

CHAD's goal is to provide a general purpose tool that will allow inexperienced users to craft new image-based animations via keyframing.
CHAD requires comparatively little data, and paths in the CHAD configuration space encode natural human movements (replete with hand wringing, facial tics and blinks), meaning that, with relatively little input (a few keyframes), a novice can synthesize a compelling animation. While CHAD does not match the highest quality animations produced by professionals, its ease-of-use, expressiveness and ability to generate a wide range of varying motions make for a significant step towards the democratization of quality image-based animation.

The goal of synthesizing humanoid motion drives a large portion of the computer graphics research community. An exhaustive characterization of all related work is beyond the scope of this paper. Below, we attempt to highlight important developments and position our work, CHAD, appropriately relative to this ever growing corpus.

Of all the types of humanoid animation to be studied, that of the face has seen, perhaps, the most attention. Everything from highly detailed facial capture (Beeler et al. 2010, 2011) to sensorimotor modeling (Lee and Terzopoulos 2006) has been employed to generate convincing facial animations.

Facial motion capture lies at the heart of a large number of face animation algorithms and has become increasingly popular as both a research and industrial tool. Williams (1990) introduced the notion of marker-based 3D face capture, while the seminal work of Bradley et al. (2010) debuted a markerless approach which relies on multiview stereo to fit geometry from images, combined with optical flow to track deformation and texture details across frames. Beeler et al. (2011) improve this method by introducing anchor frames to track facial motion while avoiding integrated error. An increasingly large number of facial animation papers rely on face capture to provide input data for data-driven approaches.

A classical approach to data-driven facial animation is the so-called 3D Morphable Model (3DMM) (Blanz and Vetter 1999). This method models textured 3D faces from data using principal component analysis (PCA) to project captured data into a low-dimensional parameter space. The model can then be controlled by fitting the parameters to a photograph. Creating an animation thus requires a dense sequence of control frames. By the nature of the PCA projection, fine-scale pose details are lost in the reconstructed model. However, this approach has formed the basis of several followup works such as Kim et al. (2018) or Olszewski et al. (2017), who use 3DMM to model the poses of a source video, and then transfer these motions to a target portrait, relying on the source video to provide fine details. These methods can be considered image-based approaches that rely on a strong facial prior. They produce compelling results but are limited to faces only and can have difficulty with features such as long hair, which the 3DMM prior does not model.

Blendshapes provide an alternative reduced space representation for facial motion synthesis. Many approaches use dense temporal input such as monocular images (Cao et al. 2014, 2016) or even strain gauge data (Li et al. 2015) to drive blendshape models and produce 3D facial animations. While blendshapes can represent complicated facial expressions more compactly than PCA-based 3DMMs, they still can exclude detailed motions and are often augmented at runtime to make up for this (Cao et al. 2015). Accurate interpolation between tracked expressions can also be difficult. Meng et al.
(2018) tackle this by learning an embedding of facial expression from images, which forms the input to a recurrent model giving 3D deformations between face poses.

The related problem of facial motion transfer has also seen much interest. Xu et al. (2014) seek to transfer the 3D motions from a source model to a target. Large-scale motions are transferred using a blendshape model and deformation transfer (Sumner and Popovic 2004), with fine details transferred using the coating transfer method (Sorkine-Hornung et al. 2004). This method requires temporally dense 3D mesh input which can be cumbersome to acquire, process and store. Garrido et al. (2015) apply motion transfer to dialog dubbing, transferring the mouth movements of a dubber onto an actor's performance. They also employ a blendshape model derived from monocular facial capture as a prior on facial expression. Finally, Vlasic et al. (2006) employ a multilinear model to factorize the variance of a set of 3D poses into identity, expression, and viseme. They can then transfer appearance by traversing the identity axis of the reduced space.

Finally, 3D facial motion synthesis cannot be discussed without at least some discussion of the animation of speech. Edwards et al. (2016) introduce an artist-friendly Jaw-Lip space to apply and control lip-synchronized animation synthesis, based on audio and textual input. Suwajanakorn et al. (2017) take a deep learning approach, using a recurrent neural network to synthesize an appropriate mouth texture from audio input, allowing the generation of speech. This method uses facial landmarks as a prior to guide the lip-synch process.

While the previously mentioned works treat facial motion in the 3D domain, others have used image-based approaches, analyzing and manipulating motion in 2D pixel space. Garrido et al. (2014) use identity-preserving image warps to perform facial reenactment and use a face-matching metric to choose frames to be transferred from a source clip to a target performance. Other approaches blend between different recorded takes of the same performance (Malleson et al. 2015), generate cinemagraphs (Aberman et al. 2018) or animate portraits (Averbuch-Elor et al. 2017; Geng et al. 2018) using deformation fields.

Beyond faces, full body human motion has also been explored extensively and again can be split into the broad categories of 3D and image-based approaches. Skeletal rigs are a popular choice for parameterizing human motion and there has been extensive work exploring performance capture (Xu et al. 2018), motion synthesis (Holden et al. 2017, 2016, 2015) and control (Peng et al. 2018, 2017; Yu et al. 2018) of such representations. Image-based methods have also been explored for whole body animation. Chan et al. (2018) perform motion transfer of dance motion, using a learned 2D pose estimator, while other works re-time video to synchronize with the beat of a user-supplied song (Davis and Agrawala 2018). Other methods synthesize images of novel poses of humans from video input (Balakrishnan et al. 2018) or learn to predict future frames in a video sequence (Aberman et al. 2018; Xue et al. 2016).

Almost all previous approaches for image-based animation rely on strong priors to generate results (such as 3DMMs or blendshapes for facial animation, or a known skeleton for full humanoid motion). General, image-based approaches for approximating shape variation (Cootes et al. 1995) and tracking motion (Grover et al.
2015; Lucas and Kanade 1981; Tao et al. 2012; Weinzaepfel et al. 2013) have been developed but, when applied to the problem of animation, they either sacrifice detailed motion (PCA-based approaches) or break down for long running trajectories (optical flow based approaches). In contrast to all these methods, CHAD eschews any prior information regarding the structure of the character to be animated. Rather, CHAD learns an "implicit rig" for a face, character or multiple characters entirely from a small amount of data (we rely on single YouTube videos for most of the examples in this paper). Rather than relying on dense temporal input, CHAD can interpolate natural looking motion between sparsely placed keyframes, making animator control intuitive. In the next sections we detail the asymmetric neural net that lies at the heart of CHAD.
CHAD (Figure 2) aims to place as few restrictions on input as possible, preferring to ingest unlabeled, unstructured videos; indeed, many of the examples shown here have been scraped from YouTube, and all are unprocessed aside from uniform cropping. We chose keyframe-based animation as our control method because it provides an intuitive interface for the user to specify target poses while also being familiar to experienced animators. It also provides flexibility for the user to choose the granularity or sparsity of their keyframes. The output of CHAD is a video sequence that interpolates between user specified keyframes with motion that mimics that of the input video sequence.
In this section we detail both the experiments and insights that led to the development of CHAD, as well as the workings of CHAD itself.
Our initial attempts at learning a model for humanoid animation were inspired by previous work on video frame prediction. These algorithms often learn deformation fields between adjacent frames in a video sequence. A deformation field, $u_i \in \mathbb{R}^{n \times n}$, is a two-dimensional vector field over an $n \times n$ image. This field encodes the deformation of the $i$-th image in a video sequence ($f_i$) into the $(i+1)$-th frame ($f_{i+1}$). In practice, one can reconstruct $f_{i+1}$ via a deformation operation $D$ such that $f_{i+1} = D(f_i, u_i)$. Typically $D$ is a bilinear image warp.

Creating a user controlled animation requires, at the very least, the ability to specify a starting state for the animation and then to be able to evolve that state over time. We can accomplish this by beginning with a suitable initial frame $f_0$ and progressively warping it with displacement fields that characterize the motion desired for the animation. An obvious approach to parameterizing this space of displacement fields is to learn a reduced mapping from exemplar data, via Deep Learning.
Fig. 3. Deformation field learning setup. The ground truth is reconstructed by deforming the reference frame.
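To make the warp operation concrete, the following is a minimal sketch (not the authors' implementation) of a bilinear image warp of the kind $D$ denotes, written against PyTorch's grid_sample; the function name and the pixel-displacement convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp(frame, u):
    """Bilinear warp D(frame, u): resample an image under a per-pixel displacement field.

    frame: (B, C, H, W) image batch; u: (B, 2, H, W) displacement field in pixels (x, y).
    """
    b, _, h, w = frame.shape
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=frame.device),
        torch.linspace(-1.0, 1.0, w, device=frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # (B, H, W, 2)
    # Convert pixel displacements to normalized offsets and add them to the base grid.
    offset = torch.stack((u[:, 0] * 2.0 / (w - 1), u[:, 1] * 2.0 / (h - 1)), dim=-1)
    return F.grid_sample(frame, base + offset, mode="bilinear", align_corners=True)
```

Because grid_sample is differentiable with respect to both the image and the sampling grid, such a warp can sit directly inside a learned loss, which is what the deformation losses below rely on.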
The input to our deformation learning algorithm is an input video sequence, $F$ (e.g. of a human face speaking). We construct a convolutional auto-encoder by composing an encoder, $\phi : F \rightarrow Z$, and a decoder, $\tau : Z \rightarrow U$. Here $U$ is the space of all displacement fields and $Z$ is a reduced space with $|Z| \ll |U|$. This encoder-decoder pair is parameterized by a pair of convolutional neural networks (CNNs) (Figure 3).

Our loss function is crafted such that $\tau$ learns to produce deformation fields that warp frames $f_i$ to frames $f_{i+1}$. The encoder-decoder pair can be trained to do this by minimizing the following loss,

$$L_{def} = \left\| f_{i+1} - D\left(f_i, \tau(\phi(f_i, f_{i+1}))\right) \right\|, \qquad (1)$$

for each $f_i, f_{i+1} \in F$. Full details of the training procedure are given in subsection 5.1. Once trained, we can synthesize the $k$-th frame of a new animation by evaluating a sequence of $k$ deformations using our learned displacement fields. However, this approach fails in practice due to the accumulation of error caused by successive warping.

To combat error accumulation, we train using batches of sequential frames, attempting to reconstruct the sequence from the first frame in the batch. To reconstruct the $i$-th frame in the sequence we employ two methods, summed deformations and composed deformations. For summed deformations we compute $f_{i+1}$ by

$$f_{i+1} = D\left(f_0, \sum_{j=0}^{i} \tau\left(\phi\left(f_j, f_{j+1}\right)\right)\right). \qquad (2)$$

Composed deformations are given by

$$f_{i+1} = D_L^i \circ \cdots \circ D_L^0 \left(f_0\right), \qquad (3)$$

where we take $D_L^i$ to be our learned deformation function, given by $D\left(f_i, \tau(\phi(f_i, f_{i+1}))\right)$, while $\circ$ is the standard function composition operator. These two methods accumulate error differently (as can be seen in Figure 4). To balance out these errors, we use a deformation loss that is the sum of the distances, computed as in Equation 1, obtained by reconstructing the batch using each method.

Fig. 2. CHAD-Net. Our asymmetric network setup for learning cool animations. Our method takes, as input, a set of video frames ($f$) and passes them through an asymmetric autoencoder where the encoder $U$ is a single layer composed of the PCA basis and $X$ is a deep, convolutional decoder. Using this architecture we learn a configuration manifold ($X$) that encodes the video motion. Concurrently we train a GAN ($\gamma$) that produces low resolution images which we further improve using detail transfer from the initial set of input frames.
Fig. 4. Accumulated error incurred over 5 seconds by our two deformation methods.
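As an illustration of how the summed (Equation 2) and composed (Equation 3) reconstructions can be combined into a single training loss, here is a hedged sketch; the `encoder`/`decoder` call signatures, the use of an L1 pixel distance, and the `warp` helper from above are assumptions rather than the exact published procedure.

```python
import torch

def sequence_loss(frames, encoder, decoder):
    """Reconstruct a batch of sequential frames from the first frame f_0 using both
    summed (Eq. 2) and composed (Eq. 3) deformations, and sum the two reconstruction
    errors. frames: (T, C, H, W); encoder/decoder stand in for phi and tau.
    """
    f0 = frames[0:1]
    composed = f0          # running composed reconstruction D_L^i o ... o D_L^0 (f_0)
    summed_u = None        # running sum of displacement fields
    loss = frames.new_zeros(())
    for i in range(frames.shape[0] - 1):
        fi, fnext = frames[i:i + 1], frames[i + 1:i + 2]
        u = decoder(encoder(fi, fnext))          # learned field between frames i and i+1
        summed_u = u if summed_u is None else summed_u + u
        composed = warp(composed, u)             # compose warps frame by frame
        summed = warp(f0, summed_u)              # single warp with the summed field
        loss = loss + (fnext - composed).abs().mean() + (fnext - summed).abs().mean()
    return loss
```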
In practice we observed two problems with this model. The first was that, while it could readily fit our training data and produce a plausible reconstruction, synthesizing animations via repeated applications of $D_L$ produced poor results; when learning the deformation between frames, the input space of the encoder is large (it is $F \times F$), so the autoencoder needs to see a lot of data to fully characterize this mapping. Second, while the deformation learning approach allows us to grow an animation out from an initial frame, it does not readily allow us to interpolate between two separate keyframes. This makes animations produced in this manner extremely difficult to control for a user. These two observations led us to incorporate more structure into CHAD via the use of configuration manifolds, which we detail below.

Deformation fields are a very general motion representation. Our goal is not to represent any motion in a video but to generate new videos that contain humanoid motion. To address the issues discussed above we instead turn to a representation common in the field of mechanics (Lanczos 1986). In mechanics, the set of all poses of an object is modeled by a low-dimensional manifold (the configuration manifold) embedded in high-dimensional space.

Each point, $x(z)$, on a configuration manifold, $X$, represents a unique pose of an object. Here $x$ is not a point in 3D space but an n-dimensional point that describes the deformed state of the object (e.g. for a triangle mesh $x$ stores all vertex positions of the mesh), while $z \in Z$ is a low-dimensional coordinate (the rotation matrix and translation vector of a rigid body, for instance). In our case, we have no explicit knowledge of the object's form. However, we can attempt to learn a proxy to $X$ from input video.

Any smooth, continuous motion of an object can be described by a corresponding smooth continuous path on $X$. Given a time varying motion $x(t)$, we note that the object's instantaneous velocity is given by $\frac{dx}{dt} = \frac{\partial x}{\partial z}\frac{\partial z}{\partial t}$ and that, over a sufficiently small time interval $\Delta t$, we can represent the displacement of every point in an object as

$$u = \frac{dx}{dt}\Delta t = \frac{\partial x}{\partial z}\frac{\partial z}{\partial t}\Delta t, \qquad (4)$$

which is the standard relationship between an object's total velocity and its velocity in the reduced space $Z$.

In CHAD, we take $x$ to be an $n \times n$ pixel image and we take $\Delta t$ to be the frame time of our input video (typically a fraction of a second). We can now use Equation 4 to establish the following relationship between $F$ and $X$:

$$u_{i+\frac{1}{2}} = x_{i+1} - x_i, \qquad f_{i+1} = D\left(f_i, u_{i+\frac{1}{2}}\right), \qquad (5)$$

via finite differences. In order to simplify the equation, we take the distance over which the finite difference is calculated to also be $\Delta t$ (which is the smallest unit of time we can observe from our input videos). The $\frac{1}{2}$ frame increment denotes that $u_{i+\frac{1}{2}}$ is being estimated at the midpoint of the line between $x_i$ and $x_{i+1}$.

Fig. 5. A hypothetical configuration manifold for image-based animation. Our algorithm maps from a low dimensional space to the high-dimensional pose manifold. Animation sequences (green dots) are curves on the surface of this manifold that interpolate between user chosen keyframes (red dots).

Figure 5 shows a hypothetical configuration manifold for image-based animation. The configuration manifold representation addresses the two issues with deformation field learning. First, the manifold is more constrained than the mapping learned in subsection 4.1.
Displacements emanating from identical frames are compactly encoded in the tangent space of the manifold, whereas deformation learning maps each of these to a unique point in the reduced space (Figure 6). Second, interpolating between two animation frames becomes as easy as following a curve between them in $Z$ and retrieving the image frames via the mapping $x(z)$.

We will modify our Deep Learning approach from subsection 4.1 to learn the configuration manifold. This requires replacing $\tau$ with $\chi : Z \rightarrow X$ (Equation 1). Composition with $\phi$ gives $\chi \circ \phi : F \rightarrow X$, which projects a frame onto the configuration manifold. Enforcing the structure imposed by Equation 5 is done, not by modifying Equation 1, but by modifying the structure of the autoencoder itself (Figure 6).

Fig. 6. Deformation learning (top) requires two frames as input, giving a very large space to learn. Using the configuration manifold we can project each frame into a common space (bottom) and compute curves between them.
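The interpolation step itself can be sketched as follows, assuming hypothetical `phi` (encoder) and `chi` (decoder) callables; linear interpolation in Z stands in here for the smooth curve described above.

```python
import torch

def interpolate_keyframes(f_a, f_b, phi, chi, steps=30):
    """Project two keyframes onto the configuration manifold and walk a curve between
    them in Z, decoding a configuration point x(z) at each step."""
    with torch.no_grad():
        z_a, z_b = phi(f_a), phi(f_b)
        path = []
        for t in torch.linspace(0.0, 1.0, steps):
            z_t = (1.0 - t) * z_a + t * z_b   # point on the interpolating curve in Z
            path.append(chi(z_t))             # corresponding pose x(z_t) on the manifold
    return path
```

The decoded configuration points are then turned back into frames either with the learned deformation $D_L$ or with the image generator described in subsection 4.3.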
We learn a CNN encoder-decoder pair, $\phi : F \rightarrow Z$ and $\chi : Z \rightarrow X$, which projects a frame into a low dimensional pose space $Z$, and then onto the configuration manifold. We train the network by considering a batch of sequential frames in $F$. By mapping the sequence onto the configuration manifold we obtain a piece-wise linear curve on the manifold. For each segment of this curve between frames $f_i$ and $f_{i+1}$, we compute $u_{i+\frac{1}{2}}$ using Equation 5, where $x_i$ (resp. $x_{i+1}$) is the current estimate for the configuration point of $f_i$ (resp. $f_{i+1}$). Once we have learned the configuration manifold we can embed any two keyframes into $Z$ and perform interpolation by walking a smooth curve between them. We can hallucinate new frames either using $D_L$ or more advanced methods (subsection 4.3). Interpolation performs well for frames that were temporally close in the input video, but the results degrade quickly as the keyframes get further apart, both in terms of visual fidelity and quality of motion (Figure 7).

Let us consider the effect of a small perturbation of a single frame of an animation on its configuration point:

$$x(f + \Delta f) \approx x(f) + \underbrace{\frac{\partial x}{\partial z}}_{\text{Equation 5}} \frac{\partial z}{\partial f} \Delta f. \qquad (6)$$

From this simple, local expansion it is easy to see that the inclusion of the deep encoder $z(f)$ potentially allows an image perturbation to cause a finite, but unbounded perturbation in the corresponding configuration coordinate. This is because, while $\frac{\partial x}{\partial z}$ is regularized by Equation 5 (which we attempt to enforce via the deformation loss), $\frac{\partial z}{\partial f}$ has no such regularizer.
Fig. 7. Top: Ground truth video sequence. Second: Reconstruction using Deep Encoder-Decoder. Third: New sequence synthesized by interpolating between end points using Deep Encoder-Decoder. Bottom: Interpolation using PCA Encoder-Deep Decoder. Notice how the PCA encoder produces a more detailed result.
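A minimal sketch of one training step that enforces Equation 5 is given below; it assumes the decoder output x has the same shape as a displacement field, reuses the hypothetical `warp` helper from earlier, and uses an L1 reconstruction penalty, all of which are illustrative assumptions rather than the exact published loss.

```python
import torch

def manifold_training_step(frames, phi, chi, optimizer):
    """One step of configuration-manifold learning on a batch of sequential frames.

    Each frame is mapped to a configuration point x_i = chi(phi(f_i)); the finite
    difference x_{i+1} - x_i is treated as the displacement u_{i+1/2} (Eq. 5) and
    used to warp f_i toward f_{i+1}.
    """
    optimizer.zero_grad()
    xs = [chi(phi(frames[i:i + 1])) for i in range(frames.shape[0])]
    loss = frames.new_zeros(())
    for i in range(len(xs) - 1):
        u = xs[i + 1] - xs[i]                      # u_{i+1/2} via finite differences
        recon = warp(frames[i:i + 1], u)           # D(f_i, u_{i+1/2})
        loss = loss + (frames[i + 1:i + 2] - recon).abs().mean()
    loss.backward()
    optimizer.step()
    return float(loss)
```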
Ideally, the matrix 2-norm of $\frac{\partial z}{\partial f}$ should be bounded with a maximum value of 1, so that as much of the pose change induced by an image perturbation is included in $x(z)$ as possible. A simple method for constructing such a space $Z$ is to apply PCA to our input frame data. The resulting linear basis, $Z$, can be represented as an orthogonal matrix. Using $Z^T$ as our encoder gives us the property we want. Figure 7 shows the improvement that can be gained by using a PCA encoder and Figure 2 shows the final, asymmetric autoencoder setup used to generate all subsequent results.

Until this point we have assumed new images are generated via deformation. In some cases this yields acceptable animations but it suffers from the limitation that we cannot synthesize any visual phenomena not present in the original keyframe being warped. If the objects in the video change their topology (for example, by opening their mouth), this action cannot be reconstructed accurately by deformation alone.

Fortunately, we can leverage our configuration manifold representation to sidestep this issue. Because a point on the manifold encodes pose, we can learn an image generator $\gamma : X \rightarrow F$, which reconstructs a frame based on its pose without relying on the keyframe. This approach has the double benefit of empirically enforcing our requirement that it is possible to uniquely reconstruct a frame from its configuration variable.

We implement $\gamma$ as a generative adversarial network (GAN) (Goodfellow et al. 2014) which takes as input a point on the configuration manifold, and attempts to invert the mapping $\chi \circ \phi$, reconstructing the input frame. The GAN's encoder and decoder networks share the architecture of the $\phi$ and $\chi$ networks (Figure 11 and Figure 12). The discriminator network uses the encoder architecture with an output vector size of 1, and a sigmoid function for the final activation layer.

While our image generator gives very good results when reconstructing frames from the video, synthesizing frames via interpolation can introduce noise where the manifold is less well defined. To compensate we perform additional data-driven denoising. In our experiments we were unable to train our image generator to produce crisp results in all cases. In order to alleviate this problem we turn to classical methods for reconstructing image sequences by transferring detail from existing video frames onto frames synthesized by our image generator.

Given a noisy synthesized target frame $f_t$, we must first select a source frame from the video whose pose matches the target as closely as possible. We initially choose a set of candidate frames, $f_s \in F$, by projecting $f_t$ into $Z$ and choosing the frames corresponding to the $k$ nearest neighbors (Spotify 2018) by Euclidean distance in $Z$. We then compute and apply an image warp to each candidate (Weinzaepfel et al. 2013) to more closely match the large scale pose in $f_t$, giving $\tilde{f}_s$. We extract an as-smooth-as-possible set of frames using a minimum cost path approach. We construct a directed graph, $E$, in which the source and sink nodes are the keyframes to interpolate between. For each sampled point on the configuration manifold, we add its $k$ nearest neighbors to $E$ (Figure 8).

Fig. 8. Left: the graph traversal used for locating frame sequences for detail transfer. Right: an output sequence.

We set the weights of each edge according to the following cost:
$$E_{match} = \alpha E_{data} + \beta E_{smooth}, \qquad E_{data} = L_2\left(f_t^i, \tilde{f}_s^i\right), \qquad E_{smooth} = \begin{cases} 0, & \text{if } i = 0 \\ L_2\left(\tilde{f}_s^i, f_{s^*}^{i-1}\right), & \text{otherwise,} \end{cases} \qquad (7)$$

where $\alpha$ and $\beta$ are user-specified parameters, $s^*$ denotes all $k$ nearest neighbors and $i$ is the index (in time) for each frame. The minimum cost path through $E$ gives us a sequence of video frames which we use for detail transfer. We blend details between frames in the standard manner, by solving the screened Poisson equation (Darabi et al. 2012)

$$(L + \lambda I) f^i = L \tilde{f}_s^i + \lambda f_t^i, \qquad (8)$$

where $L$ is the discrete Laplacian for our image, discretized on a regular grid, and $f_t^i$ and $\tilde{f}_s^i$ are the target image and warped source image (from the shortest path) in vector form for the $i$-th frame. Intuitively, $\lambda$ controls how much of the source frame is included in the final image (Figure 9). Figure 10 shows a comparison of images created using the above procedure to frames directly output by $\gamma$.

Fig. 9. Increasing the blending parameter $\lambda$ increases the amount of color information taken from the output GAN frame.
Fig. 10. Examples of our image denoising procedure showing the noisy GAN frame (left), warped nearest neighbor (middle), and final composited frame (right).
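The two stages of the denoising procedure can be sketched as follows: a dynamic-programming minimum-cost path that picks one warped candidate per frame under the match cost of Equation 7, and a screened Poisson blend (Equation 8) that transfers detail from the selected source frame into the GAN output. The function names, the boundary handling of the Laplacian and the candidate-cost layout are assumptions; in the method itself, candidates come from approximate nearest-neighbour search in Z and a DeepFlow-style warp.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplacian_2d(h, w):
    """Discrete Laplacian on a regular h x w grid (5-point stencil)."""
    def lap_1d(n):
        return sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    return sp.kron(sp.eye(h), lap_1d(w)) + sp.kron(lap_1d(h), sp.eye(w))

def screened_poisson_blend(f_target, f_source, lam=0.5):
    """Solve (L + lam*I) f = L f_source + lam * f_target (Eq. 8), per channel.

    f_target: noisy GAN frame; f_source: warped nearest-neighbour source frame.
    Both are (h, w, c) float arrays; lam trades source detail against target colour.
    """
    h, w, c = f_target.shape
    L = laplacian_2d(h, w).tocsr()
    A = (L + lam * sp.eye(h * w)).tocsc()
    out = np.empty_like(f_target)
    for ch in range(c):
        b = L @ f_source[..., ch].ravel() + lam * f_target[..., ch].ravel()
        out[..., ch] = spla.spsolve(A, b).reshape(h, w)
    return out

def select_source_path(e_data, e_smooth, alpha=1.0, beta=1.0):
    """Minimum-cost path over per-frame source candidates (Eq. 7) via dynamic programming.

    e_data[i, j]: distance between target frame i and warped candidate j.
    e_smooth[i, j, k]: distance between candidate j at frame i and candidate k at frame i-1.
    Returns one candidate index per frame; alpha and beta are the user-specified weights.
    """
    n, k = e_data.shape
    best = alpha * e_data[0].copy()                  # E_smooth is zero for the first frame
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        trans = best[None, :] + beta * e_smooth[i]   # rows: current candidate, cols: previous
        back[i] = trans.argmin(axis=1)
        best = alpha * e_data[i] + trans.min(axis=1)
    path = [int(best.argmin())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```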
We implement our neural models using the PyTorch deep-learning framework (Paszke et al. 2017). We use the Adam optimization algorithm (Kingma and Ba 2014) for all models, except for the GAN discriminator, for which we use stochastic gradient descent in order to stabilize training. In all cases the learning rate is set to 1e−. Real images are assigned labels drawn from [ , . ] and synthetic images labels drawn from [ . , ]. Labels are randomly flipped with probability 0.
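The optimizer setup described above might look like the following sketch; the learning rate, label-smoothing ranges and flip probability shown here are illustrative placeholders, not the values used in the paper.

```python
import torch

def make_optimizers(phi, chi, generator, discriminator, lr=1e-4):
    """Adam for the autoencoder and GAN generator, plain SGD for the discriminator
    (to stabilize adversarial training). The default learning rate is an assumption."""
    return {
        "autoencoder": torch.optim.Adam(list(phi.parameters()) + list(chi.parameters()), lr=lr),
        "generator": torch.optim.Adam(generator.parameters(), lr=lr),
        "discriminator": torch.optim.SGD(discriminator.parameters(), lr=lr),
    }

def noisy_labels(batch_size, real, flip_prob=0.05):
    """Smoothed GAN labels with random flipping; the ranges and probability are assumptions."""
    lo, hi = (0.8, 1.0) if real else (0.0, 0.2)
    labels = torch.empty(batch_size).uniform_(lo, hi)
    flip = torch.rand(batch_size) < flip_prob
    return torch.where(flip, 1.0 - labels, labels)
```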
Fig. 11. Encoder architecture. The encoder is built from convolution blocks (7×7 stride-2, 3×3 and 1×1 convolutions with Batch Norm and SeLU activations) interleaved with Max Pool layers, followed by two Linear layers and a Tanh output of size |Z|. Output sizes are given for an input size of 256 × 256.
Fig. 12. Decoder architecture. The decoder applies Linear layers followed by ConvTransBlocks (ConvTranspose 4×4, stride 2, with SeLU) and 1×1, 3×3, 5×5 and 7×7 convolutions, upsampling from |Z| to a 256 × 256 × 3 output.
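As a rough guide to the building blocks suggested by Figures 11 and 12, the sketch below implements a convolutional encoder block and a transposed-convolution decoder block; channel widths, layer counts and exact ordering are assumptions and do not reproduce the published tables.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> SeLU, in the spirit of the encoder blocks in Fig. 11."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
            nn.BatchNorm2d(c_out),
            nn.SELU(),
        )

    def forward(self, x):
        return self.block(x)

class ConvTransBlock(nn.Module):
    """ConvTranspose 4x4 (stride 2) -> SeLU upsampling block, as in Fig. 12."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
            nn.SELU(),
        )

    def forward(self, x):
        return self.block(x)
```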
Clip name        F      |Z|    Resolution    Hardware
John Oliver      5000   200    256 × 256     TITAN RTX
Zebra            5000   400    128 × 128     GTX 1080Ti
Cookie Monster   1769   250    256 × 256     TITAN V
Michelle Obama   5000   200    256 × 256     TITAN RTX

Table 1. Summary of the datasets.
Our dataset consists of publicly available clips downloaded from YouTube. The clips contain various types of characters and motion, including "talking head" speech, dancing, and multi-character interactions. The clips are pre-processed only by cropping. For the John Oliver clip we compute the minimal square bounding box which contains the face in all frames. This box is expanded by 15 pixels and used to crop all frames in the sequence. For all other clips we simply crop the edges to make the frames square. The cropped frames are then bilinearly downsampled to either 128 × 128 or 256 × 256.

Here we show some results of using CHAD to perform interpolation between two randomly chosen keyframes from the source video (Figure 13). CHAD is trained separately for each example, using one of the datasets from Table 1.
We begin with some examples of interpolating between facial poses (Figure 14). Because CHAD uses no motion priors and is therefore not tuned for facial animation, we do not expect it to match the quality of more specialized methods (Kim et al. 2018). CHAD's advantage is the ability to synthesize natural frames from only two keyframes. In these examples you will see that CHAD can adequately generate suitable facial motion like blinking and can interpolate hand motion (even when the hand enters and exits the frame). To the authors' knowledge, CHAD is the first algorithm able to perform synthesis at this level of detail by relying purely on input data.
CHAD can also interpolate between whole body keyframes without any adjustments to the network architecture (Figure 15). Here we show the results of performing animation synthesis on a bipedal zebra dataset taken from YouTube. Notice that CHAD is capable of interpolating between poses with different facing directions and large limb motion.
Finally, in order to stress the flexibility of CHAD, we demonstrate interpolation between random frames of a video containing two characters (Figure 16). Again, CHAD is able to synthesize natural motion for both characters that is consistent with the input video. Of particular interest is the "googliness" of the blue monster's eyes.
In this paper we have presented CHAD, our machine learning approach to creating image-based animation of humanoid characters driven by sparse temporal input. CHAD's strength is its generality. By eschewing any prior models about character animation, CHAD can produce a wider range of animated motions than previous approaches. This flexibility is enabled by our novel configuration manifold learning approach and our new asymmetric architecture, which we have justified both theoretically and experimentally. CHAD significantly lowers the barrier of entry for creating image-based animations, requiring only a single short-to-medium-length YouTube video to learn a model of humanoid motion. We also believe that CHAD opens up a number of avenues for exciting future work.

First, our current image denoising approach is less than perfect. It generates acceptably sharp images at the cost of a reduction in temporal smoothness. It can also introduce some ghosting when the source and target frames for detail transfer don't align perfectly. This is a shame because our GAN images, despite occasionally lacking detail, capture interpolated motion extremely well. Ideally, we would be able to retain the temporal coherence evidenced in our output GAN frames; however, this seems out of reach without resorting to extremely long runtimes and large amounts of data (Karras et al. 2018). Using our image synthesis procedure to perform on-the-fly data augmentation during GAN training may help us overcome some of these challenges.

Second, we would also like to extend CHAD from the 2D domain to the 3D domain by attempting to reconstruct configuration manifolds for 3D objects using depth scan data from commodity hardware such as the iPhone X. Such an algorithm could lower the barrier of entry for 3D animation in the same way we feel that CHAD has done for 2D animation.

Finally, we are curious about using our learned configuration spaces for autonomous character animation. Keyframes could be used as the state for an animation controller, with actions being transitions between frames. CHAD provides a data-driven means to generate poses between these discrete states and could serve to bring image-based autonomous actors to life. In order to enable this and other explorations we intend to release the CHAD source code and pre-trained networks for all examples shown in this submission.
REFERENCES
Kfir Aberman, Jing Liao, Mingyi Shi, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. 2018. Neural Best-buddies: Sparse Cross-domain Correspondence. ACM Trans. Graph. 37, 4 (July 2018), 69:1–69:14. https://doi.org/10.1145/3197517.3201332
Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017. Bringing portraits to life. ACM Transactions on Graphics 36, 6 (Nov. 2017), 1–13. https://doi.org/10.1145/3130800.3130818
Guha Balakrishnan, Amy Zhao, Adrian V. Dalca, Frédo Durand, and John V. Guttag. 2018. Synthesizing Images of Humans in Unseen Poses. In CVPR.
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-quality single-shot capture of facial geometry. In ACM SIGGRAPH 2010 Papers - SIGGRAPH '10. ACM Press, New York, New York, USA, 1. https://doi.org/10.1145/1833349.1778777
Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. 2011. High-quality passive facial performance capture using anchor frames. In ACM SIGGRAPH 2011 Papers - SIGGRAPH '11. ACM Press, New York, New York, USA, 1. https://doi.org/10.1145/1964921.1964970
Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques - SIGGRAPH '99. ACM Press, New York, New York, USA, 187–194. https://doi.org/10.1145/311535.311556
Fig. 13. Interpolating between two random video frames using CHAD. We show the result of our image generator (top) and denoising algorithm (bottom).

Fig. 14. Examples of interpolating between face poses using CHAD. Left: Initial keyframe. Right: Final keyframe. Middle: Frames synthesized by our method.
Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. 2010. High resolution passive facial performance capture. ACM Transactions on Graphics 29, 4 (July 2010), 1. https://doi.org/10.1145/1833351.1778778
Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time high-fidelity facial performance capture. ACM Transactions on Graphics 34, 4 (July 2015), 46:1–46:9. https://doi.org/10.1145/2766943
Chen Cao, Qiming Hou, and Kun Zhou. 2014. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics (2014). https://doi.org/10.1145/2601097.2601204
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics 35, 4 (2016). https://doi.org/10.1145/2897824.2925873
Fig. 15. Examples of interpolating whole body poses using CHAD. Left: Initial keyframe. Right: Final keyframe. Middle: Frames synthesized by our method.
Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. 2018. Everybody Dance Now. CoRR abs/1808.07371 (2018).
T.F. Cootes, C.J. Taylor, D.H. Cooper, and J. Graham. 1995. Active Shape Models - Their Training and Application. Computer Vision and Image Understanding 61, 1 (Jan. 1995), 38–59. https://doi.org/10.1006/cviu.1995.1004
Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B. Goldman, and Pradeep Sen. 2012. Image Melding: Combining Inconsistent Images Using Patch-based Synthesis. ACM Trans. Graph. 31, 4, Article 82 (July 2012), 10 pages. https://doi.org/10.1145/2185520.2185578
Abe Davis and Maneesh Agrawala. 2018. Visual Rhythm and Beat. ACM Trans. Graph. 37, 4, Article 122 (July 2018), 11 pages. https://doi.org/10.1145/3197517.3201371
Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: an animator-centric viseme model for expressive lip synchronization.
ACM Trans. Graph. (2016).
Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormählen, Patrick Pérez, and Christian Theobalt. 2014. Automatic Face Reenactment. In CVPR, 4217–4224.
Pablo Garrido, Levi Valgaerts, H. Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2015. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track. Comput. Graph. Forum (2015).
Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. 2018. Warp-guided GANs for single-photo facial animation. In SIGGRAPH Asia 2018 Technical Papers (SIGGRAPH Asia '18). ACM, New York, NY, USA, 231:1–231:12. https://doi.org/10.1145/3272127.3275043
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS.
Naman Grover, Nitin Agarwal, and Kotaro Kataoka. 2015. liteFlow: Lightweight and distributed flow monitoring platform for SDN. Proceedings of the 2015 1st IEEE Conference on Network Softwarization (NetSoft) (2015), 1–9.
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control.
ACM Trans. Graph. 36 (2017), 42:1–42:13.
Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. 35 (2016), 138:1–138:11.
Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia Technical Briefs.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations. https://openreview.net/forum?id=Hk99zCeAb
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018. Deep Video Portraits. ACM Trans. Graph. 37, 4 (July 2018), 163:1–163:14. https://doi.org/10.1145/3197517.3201283
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
C. Lanczos. 1986. The Variational Principles of Mechanics. Dover Publications. https://books.google.ca/books?id=ZWoYYr8wk2IC
Sung-Hee Lee and Demetri Terzopoulos. 2006. Heads Up!: Biomechanical Modeling and Neuromuscular Control of the Neck. ACM Trans. Graph. 25, 3 (July 2006), 1188–1198. https://doi.org/10.1145/1141911.1142013
Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls, and Chongyang Ma. 2015. Facial Performance Sensing Head-Mounted Display. ACM Transactions on Graphics (Proceedings SIGGRAPH 2015).
Bruce D. Lucas and Takeo Kanade. 1981. An Iterative Image Registration Technique with an Application to Stereo Vision. In IJCAI.
Charles Malleson, Jean-Charles Bazin, Oliver Wang, Derek Bradley, Thabo Beeler, Adrian Hilton, and Alexander Sorkine-Hornung. 2015. FaceDirector: Continuous Control of Facial Performance in Video. In ICCV. IEEE, 3979–3987. https://doi.org/10.1109/ICCV.2015.453
Hsien-Yu Meng, Tzu-heng Lin, Xiubao Jiang, Yao Lu, and Jiangtao Wen. 2018. LSTM-Based Facial Performance Capture Using Embedding Between Expressions. (May 2018). arXiv:1805.03874.
Fig. 16. Examples of interpolating two-character poses using CHAD. Left: Initial keyframe. Right: Final keyframe. Middle: Frames synthesized by our method.

Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke Saito, Pushmeet Kohli, and Hao Li. 2017. Realistic Dynamic Facial Textures from a Single Image Using GANs. In ICCV. IEEE, 5439–5448. https://doi.org/10.1109/ICCV.2017.580
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. 37 (2018), 143:1–143:14.
Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. 2017. DeepLoco: dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36 (2017), 41:1–41:13.
Olga Sorkine-Hornung, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and Hans-Peter Seidel. 2004. Laplacian Surface Editing. In Symposium on Geometry Processing.
Spotify. 2018. Annoy: Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk. https://github.com/spotify/annoy
Robert W. Sumner and Jovan Popovic. 2004. Deformation transfer for triangle meshes. ACM Trans. Graph. 23 (2004), 399–405.
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 36, 4 (July 2017), 95:1–95:13. https://doi.org/10.1145/3072959.3073640
Michael W. Tao, Jiamin Bai, Pushmeet Kohli, and Sylvain Paris. 2012. SimpleFlow: A Non-iterative, Sublinear Optical Flow Algorithm. Comput. Graph. Forum 31 (2012), 345–353.
Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. 2006. Face transfer with multilinear models. In ACM SIGGRAPH 2006 Courses - SIGGRAPH '06. ACM Press, New York, New York, USA, 24. https://doi.org/10.1145/1185657.1185864
Philippe Weinzaepfel, Jérôme Revaud, Zaïd Harchaoui, and Cordelia Schmid. 2013. DeepFlow: Large Displacement Optical Flow with Deep Matching. (2013), 1385–1392.
Lance Williams. 1990. Performance-driven Facial Animation. SIGGRAPH Comput. Graph. 24, 4 (Sept. 1990), 235–242. https://doi.org/10.1145/97880.97906
Feng Xu, Jinxiang Chai, Yilong Liu, and Xin Tong. 2014. Controllable high-fidelity facial performance transfer. ACM Transactions on Graphics 33, 4 (July 2014), 1–11. https://doi.org/10.1145/2601097.2601210
Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. 2018. MonoPerfCap: Human Performance Capture From Monocular Video. ACM Trans. Graph. 37 (2018), 27:1–27:15.
Tianfan Xue, Jiajun Wu, Katherine L. Bouman, and William T. Freeman. 2016. Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks. In NIPS.
Wenhao Yu, Greg Turk, and C. Karen Liu. 2018. Learning symmetric and low-energy locomotion.