Deep Video Portraits
HYEONGWOO KIM,
Max Planck Institute for Informatics, Germany
PABLO GARRIDO,
Technicolor, France
AYUSH TEWARI and WEIPENG XU,
Max Planck Institute for Informatics, Germany
JUSTUS THIES and MATTHIAS NIESSNER,
Technical University of Munich, Germany
PATRICK PÉREZ,
Technicolor, France
CHRISTIAN RICHARDT,
University of Bath, United Kingdom
MICHAEL ZOLLHÖFER,
Stanford University, United States of America
CHRISTIAN THEOBALT,
Max Planck Institute for Informatics, Germany
Fig. 1. Unlike current face reenactment approaches that only modify the expression of a target actor in a video, our novel deep video portrait approach enables full control over the target by transferring the rigid head pose, facial expression and eye motion with a high level of photorealism.
We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network, thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.

Authors' addresses: Hyeongwoo Kim, Max Planck Institute for Informatics, Campus E1.4, Saarbrücken, 66123, Germany, [email protected]; Pablo Garrido, Technicolor, 975 Avenue des Champs Blancs, Cesson-Sévigné, 35576, France, [email protected]; Ayush Tewari, [email protected]; Weipeng Xu, [email protected], Max Planck Institute for Informatics, Campus E1.4, Saarbrücken, 66123, Germany; Justus Thies, [email protected]; Matthias Nießner, [email protected], Technical University of Munich, Boltzmannstraße 3, Garching, 85748, Germany; Patrick Pérez, Technicolor, 975 Avenue des Champs Blancs, Cesson-Sévigné, 35576, France, [email protected]; Christian Richardt, University of Bath, Claverton Down, Bath, BA2 7AY, United Kingdom, [email protected]; Michael Zollhöfer, Stanford University, 353 Serra Mall, Stanford, CA, 94305, United States of America, [email protected]; Christian Theobalt, Max Planck Institute for Informatics, Campus E1.4, Saarbrücken, 66123, Germany, [email protected].

© 2018 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3197517.3201283.

CCS Concepts: • Computing methodologies → Computer graphics; Neural networks; Appearance and texture representations; Animation; Rendering.

Additional Key Words and Phrases: Facial Reenactment, Video Portraits, Dubbing, Deep Learning, Conditional GAN, Rendering-to-Video Translation
ACM Reference Format:
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. 2018. Deep Video Portraits. ACM Trans. Graph. 37, 4, Article 163 (August 2018), 14 pages. https://doi.org/10.1145/3197517.3201283
1 INTRODUCTION

Synthesizing and editing video portraits, i.e., videos framed to show a person's head and upper body, is an important problem in computer graphics, with applications in video editing and movie post-production, visual effects, visual dubbing, virtual reality, and telepresence, among others. In this paper, we address the problem of synthesizing a photo-realistic video portrait of a target actor that mimics the actions of a source actor, where source and target can be different subjects. More specifically, our approach enables a source actor to take full control of the rigid head pose, face expressions and
eye motion of the target actor; even face identity can be modified to some extent. All of these dimensions can be manipulated together or independently. Full target frames, including the entire head and hair, but also a realistic upper body and scene background complying with the modified head, are automatically synthesized.

Recently, many methods have been proposed for face-interior reenactment [Liu et al. 2001; Olszewski et al. 2017; Suwajanakorn et al. 2017; Thies et al. 2015, 2016; Vlasic et al. 2005]. Here, only the face expression can be modified realistically, but not the full 3D head pose, including a consistent upper body and a consistently changing background. Many of these methods fit a parametric 3D face model to RGB(-D) video [Thies et al. 2015, 2016; Vlasic et al. 2005], and re-render the modified model as a blended overlay over the target video for reenactment, even in real time [Thies et al. 2015, 2016]. Synthesizing a complete portrait video under full 3D head control is much more challenging. Averbuch-Elor et al. [2017] enable mild head pose changes driven by a source actor based on image warping. They generate reactive dynamic profile pictures from a static target portrait photo, but not fully reenacted videos. Also, large changes in head pose cause artifacts (see Section 7.3), the target gaze cannot be controlled, and the identity of the target person is not fully preserved (mouth appearance is copied from the source actor).

Performance-driven 3D head animation methods [Cao et al. 2015, 2014a, 2016; Hu et al. 2017; Ichim et al. 2015; Li et al. 2015; Olszewski et al. 2016; Weise et al. 2011] are related to our work, but have orthogonal methodology and application goals. They typically drive the full head pose of stylized 3D CG avatars based on visual source actor input, e.g., for games or stylized VR environments. Recently, Cao et al. [2016] proposed image-based 3D avatars with dynamic textures based on a real-time face tracker. However, their goal is full 3D animated head control and rendering, often intentionally in a stylized rather than a photo-realistic fashion.

We take a different approach that directly generates entire photo-realistic video portraits in front of general static backgrounds under full control of a target's head pose, facial expression, and eye motion. We formulate video portrait synthesis and reenactment as a rendering-to-video translation task. Input to our algorithm are synthetic renderings of only the coarse and fully-controllable 3D face interior model of a target actor and separately rendered eye gaze images, which can be robustly and efficiently obtained via a state-of-the-art model-based reconstruction technique. The input is automatically translated into full-frame photo-realistic video output showing the entire upper body and background. Since we only track the face, we cannot actively control the motion of the torso or hair, or control the background, but our rendering-to-video translation network is able to implicitly synthesize a plausible body and background (including some shadows and reflections) for a given head pose. This translation problem is tackled using a novel space-time encoder–decoder deep neural network, which is trained in an adversarial manner.

At the core of our approach is a conditional generative adversarial network (cGAN) [Isola et al. 2017], which is specifically tailored to video portrait synthesis.
For temporal stability, we use a novel space-time network architecture that takes as input short sequences of conditioning input frames of head and eye gaze in a sliding window manner to synthesize each target video frame. Our target- and scene-specific networks only require a few minutes of portrait video footage of a person for training. To the best of our knowledge, our approach is the first to synthesize full photo-realistic video portraits of a target person's upper body, including realistic clothing and hair, and consistent scene background, under full 3D control of the target's head. To summarize, we make the following technical contributions:
• A rendering-to-video translation network that transforms coarse face model renderings into full photo-realistic portrait video output.
• A novel space-time encoding as conditional input for temporally coherent video synthesis that represents face geometry, reflectance, and motion as well as eye gaze and eye blinks.
• A comprehensive evaluation on several applications to demonstrate the flexibility and effectiveness of our approach.
We demonstrate the potential and high quality of our method in many intriguing applications, ranging from face reenactment and visual dubbing for foreign language movies to user-guided interactive editing of portrait videos for movie postproduction. A comprehensive comparison to state-of-the-art methods and a user study confirm the high fidelity of our results.
2 RELATED WORK

We discuss related optimization and learning-based methods that aim at reconstructing, animating and re-writing faces in images and videos, and review relevant image-to-image translation work. For a comprehensive overview of current methods we refer to a recent state-of-the-art report on monocular 3D face reconstruction, tracking and applications [Zollhöfer et al. 2018].
Monocular Face Reconstruction.
Face reconstruction methods aim to reconstruct 3D face models of shape and appearance from visual data. Optimization-based methods fit a 3D template model, mainly the inner face region, to single images [Blanz et al. 2004; Blanz and Vetter 1999], unstructured image collections [Kemelmacher-Shlizerman 2013; Kemelmacher-Shlizerman et al. 2011; Roth et al. 2017] or video [Cao et al. 2014b; Fyffe et al. 2014; Garrido et al. 2016; Ichim et al. 2015; Shi et al. 2014; Suwajanakorn et al. 2014; Thies et al. 2016; Wu et al. 2016]. Recently, Booth et al. [2018] proposed a large-scale parametric face model constructed from almost ten thousand 3D scans. Learning-based approaches leverage a large corpus of images or image patches to learn a regressor for predicting either 3D face shape and appearance [Richardson et al. 2016; Tewari et al. 2017; Tran et al. 2017], fine-scale skin details [Cao et al. 2015], or both [Richardson et al. 2017; Sela et al. 2017]. Deep neural networks have been shown to be quite robust for inferring the coarse 3D facial shape and appearance of the inner face region, even when trained on synthetic data [Richardson et al. 2016]. Tewari et al. [2017] showed that encoder–decoder architectures can be trained fully unsupervised on in-the-wild images by integrating physical image formation into the network. Richardson et al. [2017] trained an end-to-end regressor to recover facial geometry at a coarse and fine-scale level. Sela et al. [2017] use an encoder–decoder network to infer a detailed depth image and a dense correspondence map, which serve as a basis for non-rigidly deforming a template mesh.
Fig. 2. Deep video portraits enable a source actor to fully control a target video portrait. First, a low-dimensional parametric representation (left) of both videos is obtained using monocular face reconstruction. The head pose, expression and eye gaze can now be transferred in parameter space (middle). We do not focus on the modification of the identity and scene illumination (hatched background), since we are interested in reenactment. Finally, we render conditioning input images that are converted to a photo-realistic video portrait of the target actor (right).
Obama video courtesy of the White House (public domain).
Still, none of these methods creates a fully generative model for the entire head, hair, mouth interior, and eye gaze, as we do.
Video-based Facial Reenactment.
Facial reenactment methods rewrite the face content of a target actor in a video or image by transferring facial expressions from a source actor. Facial expressions are commonly transferred via dense motion fields [Averbuch-Elor et al. 2017; Liu et al. 2001; Suwajanakorn et al. 2015], parameters [Thies et al. 2016, 2018; Vlasic et al. 2005], or by warping candidate frames that are selected based on the facial motion [Dale et al. 2011], appearance metrics [Kemelmacher-Shlizerman et al. 2010] or both [Garrido et al. 2014; Li et al. 2014]. The methods described above first reconstruct and track the source and target faces, which are represented as a set of sparse 2D landmarks or dense 3D models. Most approaches only modify the inner region of the face and thus are mainly intended for altering facial expressions, but they do not take full control of a video portrait in terms of rigid head pose, facial expression, and eye gaze. Recently, Wood et al. [2018] proposed an approach for eye gaze redirection based on a fitted parametric eye model. Their approach only provides control over the eye region. One notable exception to pure facial reenactment is Averbuch-Elor et al.'s approach [2017], which enables the reenactment of a portrait image and allows for slight changes in head pose via image warping [Fried et al. 2016]. Since this approach is based on a single target image, it copies the mouth interior from the source to the target, thus preserving the target's identity only partially. We take advantage of learning from a target video to allow for larger changes in head pose, facial reenactment, and joint control of the eye gaze.
Visual Dubbing.
Visual dubbing is a particular instance of face reenactment that aims to alter the mouth motion of the target actor to match a new audio track, commonly spoken in a foreign language by a dubbing actor. Here, we can find speech-driven [Bregler et al. 1997; Chang and Ezzat 2005; Ezzat et al. 2002; Liu and Ostermann 2011; Suwajanakorn et al. 2017] or performance-driven [Garrido et al. 2015; Thies et al. 2016] techniques. Speech-driven dubbing techniques learn a person-specific phoneme-to-viseme mapping from a training sequence of the actor. These methods produce accurate lip sync with visually imperceptible artifacts, as recently demonstrated by Suwajanakorn et al. [2017]. However, they cannot directly control the target's facial expressions. Performance-driven techniques overcome this limitation by transferring semantically-meaningful motion parameters and re-rendering the target model with photo-realistic reflectance [Thies et al. 2016], and fine-scale details [Garrido et al. 2015, 2016]. These approaches generalize better, but do not edit the head pose and still struggle to synthesize photo-realistic mouth deformations. In contrast, our approach learns to synthesize photo-realistic facial motion and actions from coarse renderings, thus enabling the synthesis of expressions and joint modification of the head pose, with consistent body and background.
Image-to-image Translation.
Approaches using conditional GANs [Mirza and Osindero 2014], such as Isola et al.'s "pix2pix" [2017], have shown impressive results on image-to-image translation tasks which convert between images of two different domains, such as maps and satellite photos. These combine encoder–decoder architectures [Hinton and Salakhutdinov 2006], often with skip connections [Ronneberger et al. 2015], with adversarial loss functions [Goodfellow et al. 2014; Radford et al. 2016]. Chen and Koltun [2017] were the first to demonstrate high-resolution results with 2 megapixel resolution, using cascaded refinement networks without adversarial training. The latest trends show that it is even possible to train high-resolution GANs [Karras et al. 2018] and conditional GANs [Wang et al. 2018] at similar resolutions. However, the main challenge is the requirement for paired training data, as corresponding image pairs are often not available. This problem is tackled by CycleGAN [Zhu et al. 2017], DualGAN [Yi et al. 2017], and UNIT [Liu et al. 2017], multiple concurrent unsupervised image-to-image translation techniques that only require two sets of unpaired training samples. These techniques have captured the imagination of many people by translating between photographs and paintings, horses and zebras, face photos and depth as well as correspondence maps [Sela et al. 2017], and translation from face photos to cartoon drawings [Taigman et al. 2017]. Ganin et al. [2016] learn photo-realistic gaze manipulation in images. Olszewski et al. [2017] synthesize a realistic inner face texture, but cannot generate a fully controllable
output video, including person-specific hair. Lassner et al. [2017] propose a generative model to synthesize people in clothing, and Ma et al. [2017] generate new images of persons in arbitrary poses using image-to-image translation. In contrast, our approach enables the synthesis of temporally-coherent video portraits that follow the animation of a source actor in terms of head pose, facial expression and eye gaze.
3 OVERVIEW

Our deep video portraits approach provides full control of the head of a target actor by transferring the rigid head pose, facial expression, and eye motion of a source actor, while preserving the target's identity and appearance. Full target video frames are synthesized, including consistent upper body posture, hair and background. First, we track the source and target actor using a state-of-the-art monocular face reconstruction approach that uses a parametric face and illumination model (see Section 4). The resulting sequence of low-dimensional parameter vectors represents the actor's identity, head pose, expression, eye gaze, and the scene lighting for every video frame (Figure 2, left). This allows us to transfer the head pose, expression, and/or eye gaze parameters from the source to the target, as desired. In the next step (Figure 2, middle), we generate new synthetic renderings of the target actor based on the modified parameters (see Section 5). In addition to a normal color rendering, we also render correspondence maps and eye gaze images. These renderings serve as conditioning input to our novel rendering-to-video translation network (see Section 6), which is trained to convert the synthetic input into photo-realistic output (see Figure 2, right). For temporally coherent results, our network works on space-time volumes of conditioning inputs. To process a complete video, we input the conditioning space-time volumes in a sliding window fashion, and assemble the final video from the output frames. We evaluate our approach (see Section 7) and show its potential on several video rewrite applications, such as full-head reenactment, gaze redirection, video dubbing, and interactive parameter-based video control.
4 MONOCULAR FACE RECONSTRUCTION

We employ a state-of-the-art dense face reconstruction approach that fits a parametric model of face and illumination to each video frame. It obtains a meaningful parametric face representation for the source V^s = {I_f^s | f = 1, ..., N^s} and target V^t = {I_f^t | f = 1, ..., N^t} video sequence, where N^s and N^t denote the total number of source and target frames, respectively. Let P^• = {P_f^• | f = 1, ..., N^•} be the corresponding parameter sequence that fully describes the source or target facial performance. The set of reconstructed parameters encodes the rigid head pose (rotation R^• ∈ SO(3) and translation t^• ∈ ℝ³), facial identity coefficients α^• ∈ ℝ^{N_α} (geometry, N_α = 80) and β^• ∈ ℝ^{N_β} (reflectance, N_β = 80), facial expression coefficients δ^• ∈ ℝ^{N_δ} (N_δ = 64), eye gaze e^• ∈ ℝ⁴, and spherical harmonics illumination coefficients γ^• ∈ ℝ²⁷. Overall, our monocular face tracker reconstructs N_p = 261 parameters per video frame.
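As a sanity check on these dimensions, the per-frame parameter count is consistent with the individual terms:

N_p = (3 + 3) + 80 + 80 + 64 + 4 + 27 = 261,

accounting for rotation and translation, geometry, reflectance, expression, eye gaze, and illumination, respectively.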
In the following, we provide more details on the face tracking algorithm as well as the parametric face representation.

Parametric Face Representation.
We represent the space of facial identity based on a parametric head model [Blanz and Vetter 1999], and the space of facial expressions via an affine model. Mathematically, we model geometry variation through an affine model v ∈ ℝ^{3N} that stacks per-vertex deformations of the underlying template mesh with N vertices, as follows:

v(α, δ) = a_geo + Σ_{k=1}^{N_α} α_k b_k^geo + Σ_{k=1}^{N_δ} δ_k b_k^exp.  (1)

Diffuse skin reflectance is modeled similarly by a second affine model r ∈ ℝ^{3N} that stacks the diffuse per-vertex albedo:

r(β) = a_ref + Σ_{k=1}^{N_β} β_k b_k^ref.  (2)

The vectors a_geo ∈ ℝ^{3N} and a_ref ∈ ℝ^{3N} store the average facial geometry and corresponding skin reflectance, respectively. The geometry basis {b_k^geo}_{k=1}^{N_α} has been computed by applying principal component analysis (PCA) to 200 high-quality face scans [Blanz and Vetter 1999]. The reflectance basis {b_k^ref}_{k=1}^{N_β} has been obtained in the same manner. For dimensionality reduction, the expression basis {b_k^exp}_{k=1}^{N_δ} has been computed using PCA, starting from the blendshapes of Alexander et al. [2010] and Cao et al. [2014b]. Their blendshapes have been transferred to the topology of Blanz and Vetter [1999] using deformation transfer [Sumner and Popović 2004].
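To make the affine model concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2). The array names and the tiny random "model" below are illustrative placeholders, not the actual basis data of Blanz and Vetter [1999] or the transferred blendshapes.

```python
import numpy as np

def face_geometry(a_geo, b_geo, b_exp, alpha, delta):
    """v(alpha, delta) = a_geo + sum_k alpha_k b_geo_k + sum_k delta_k b_exp_k.

    a_geo: (3N,)      average geometry (stacked x, y, z per vertex)
    b_geo: (3N, 80)   PCA geometry basis        alpha: (80,)
    b_exp: (3N, 64)   PCA expression basis      delta: (64,)
    """
    return a_geo + b_geo @ alpha + b_exp @ delta

def face_reflectance(a_ref, b_ref, beta):
    """r(beta) = a_ref + sum_k beta_k b_ref_k (per-vertex diffuse albedo)."""
    return a_ref + b_ref @ beta

# Zero coefficients reproduce the average face:
N = 5                                  # a toy 5-vertex "mesh"
rng = np.random.default_rng(0)
a_geo = rng.normal(size=3 * N)
b_geo = rng.normal(size=(3 * N, 80))
b_exp = rng.normal(size=(3 * N, 64))
v = face_geometry(a_geo, b_geo, b_exp, np.zeros(80), np.zeros(64))
assert np.allclose(v, a_geo)
```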
Image Formation Model. To render synthetic head images, we assume a full perspective camera that maps model-space 3D points v via camera-space points v̂ ∈ ℝ³ to 2D points p = Π(v̂) ∈ ℝ² on the image plane. The perspective mapping Π contains the multiplication with the camera intrinsics and the perspective division. We assume a fixed and identical camera for all scenes, i.e., world and camera space are the same, and the face model accounts for all the scene motion. Based on a distant illumination assumption, we use the spherical harmonics (SH) basis functions Y_b : ℝ³ → ℝ to approximate the incoming radiance B from the environment:

B(r_i, n_i, γ) = r_i · Σ_{b=1}^{B²} γ_b Y_b(n_i).  (3)

Here, B is the number of spherical harmonics bands, γ_b ∈ ℝ³ are the SH coefficients, and r_i and n_i are the reflectance and unit normal vector of the i-th vertex, respectively. For diffuse materials, an average approximation error below 1 percent is achieved with only B = 3 bands.
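For illustration, here is a small sketch of Eq. (3) for B = 3 bands (9 basis functions). The normalization constants are the standard real-valued SH constants; the per-channel coefficient layout (9 × 3, matching γ ∈ ℝ²⁷) is our assumption about the storage order.

```python
import numpy as np

def sh_basis_b3(n):
    """First B = 3 bands (9 functions) of the real spherical harmonics,
    evaluated at a unit normal n = (x, y, z)."""
    x, y, z = n
    return np.array([
        0.282095,                     # Y_0^0
        0.488603 * y,                 # Y_1^-1
        0.488603 * z,                 # Y_1^0
        0.488603 * x,                 # Y_1^1
        1.092548 * x * y,             # Y_2^-2
        1.092548 * y * z,             # Y_2^-1
        0.315392 * (3 * z**2 - 1),    # Y_2^0
        1.092548 * x * z,             # Y_2^1
        0.546274 * (x**2 - y**2),     # Y_2^2
    ])

def radiance(r_i, n_i, gamma):
    """B(r_i, n_i, gamma) = r_i * sum_b gamma_b Y_b(n_i), Eq. (3).
    r_i: (3,) RGB albedo of vertex i; gamma: (9, 3) per-channel coefficients."""
    return r_i * (sh_basis_b3(n_i) @ gamma)

gamma = np.zeros((9, 3)); gamma[0] = 1.0   # constant (ambient) light only
print(radiance(np.array([0.8, 0.6, 0.5]), np.array([0.0, 0.0, 1.0]), gamma))
```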
Dense Face Reconstruction. We employ a dense data-parallel face reconstruction approach to efficiently compute the parameters P^• for both source and target videos. Face reconstruction is based on an analysis-by-synthesis approach that maximizes photo-consistency between a synthetic rendering of the model and the input. The reconstruction energy combines terms for dense photo-consistency, landmark alignment and statistical regularization:

E(X) = w_photo E_photo(X) + w_land E_land(X) + w_reg E_reg(X),  (4)

with X = {R^•, t^•, α^•, β^•, δ^•, γ^•}. This enables the robust reconstruction of identity (geometry and skin reflectance), facial expression, and scene illumination. We use 66 automatically detected facial landmarks of the True Vision Solutions tracker (http://truevisionsolutions.net), which is a commercial implementation of Saragih et al. [2011], to define the sparse alignment term E_land. Similar to Thies et al. [2016], we use a robust ℓ_{2,1}-norm for dense photometric alignment E_photo. The regularizer E_reg enforces statistically plausible parameter values based on the assumption of normally distributed data. The eye gaze estimate e^• is directly obtained from the landmark tracker. The identity is only estimated in the first frame and is kept constant afterwards. All other parameters are estimated every frame. For more details on the energy formulation, we refer to Garrido et al. [2016] and Thies et al. [2016]. We use a data-parallel implementation of iteratively re-weighted least squares (IRLS), similar to Thies et al. [2016], to find the optimal set of parameters. One difference to their work is that we compute and explicitly store the Jacobian J and the residual vector F to global memory based on a data-parallel strategy that launches one thread per matrix/vector element. Afterwards, a data-parallel matrix–matrix/matrix–vector multiplication computes the right- and left-hand side of the normal equations that have to be solved in each IRLS step. The resulting small linear system (97 × 97) is then solved to obtain the parameter update.

5 SYNTHETIC CONDITIONING INPUT

Using the method from Section 4, we reconstruct the face in each frame of the source and unmodified target video. Next, we obtain the modified parameter vector for every frame of the target sequence, e.g., for full-head reenactment, we modify the rigid head pose, expression and eye gaze of the target actor. All parameters are copied in a relative manner from the source to the target, i.e., with respect to a neutral reference frame. Then we render synthetic conditioning images of the target actor's face model under the modified parameters using hardware rasterization. For higher temporal coherence, our rendering-to-video translation network takes a space-time volume of conditioning images {C_{f−o} | o = 0, ..., N_w − 1} as input, with f being the index of the current frame. We use a temporal window of size N_w = 11, with the current frame being at its end. This provides the network a history of the earlier motions.
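The relative source-to-target parameter transfer can be sketched as follows. The dictionary layout, key names and the exact composition of the relative rotation are our assumptions for illustration, not the authors' implementation; 'R' is a SciPy Rotation, 't' a 3-vector, 'delta' the 64 expression coefficients, and 'gaze' the 4 gaze parameters.

```python
from scipy.spatial.transform import Rotation

def transfer_parameters(src, src_ref, tgt_ref,
                        transfer=("pose", "expression", "gaze")):
    """Copy the selected dimensions from the source frame 'src' to the
    target, relative to the neutral reference frames 'src_ref'/'tgt_ref'.
    Identity, reflectance and illumination stay those of the target."""
    out = dict(tgt_ref)
    if "pose" in transfer:
        rel_R = src["R"] * src_ref["R"].inv()   # source motion w.r.t. its reference
        out["R"] = rel_R * tgt_ref["R"]         # applied on top of the target reference
        out["t"] = tgt_ref["t"] + (src["t"] - src_ref["t"])
    if "expression" in transfer:
        out["delta"] = tgt_ref["delta"] + (src["delta"] - src_ref["delta"])
    if "gaze" in transfer:
        out["gaze"] = tgt_ref["gaze"] + (src["gaze"] - src_ref["gaze"])
    return out
```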
11, with the current frame being at its end. This providesthe network a history of the earlier motions.For each frame C f − o of the window, we generate three diferentconditioning inputs: a color rendering, a correspondence image, andan eye gaze image (see Figure 3). The color rendering shows themodiied target actor model under the estimated target illumination,while keeping the target identity (geometry and skin relectance)ixed. This image provides a good starting point for the followingrendering-to-video translation, since in the face region only the http://truevisionsolutions.net Diffuse Rendering Correspondence Eye and Gaze Map
For each frame C_{f−o} of the window, we generate three different conditioning inputs: a color rendering, a correspondence image, and an eye gaze image (see Figure 3). The color rendering shows the modified target actor model under the estimated target illumination, while keeping the target identity (geometry and skin reflectance) fixed. This image provides a good starting point for the following rendering-to-video translation, since in the face region only the delta to a real image has to be learned. In addition to this color input, we also provide a correspondence image encoding the index of the parametric face model's vertex that projects into each pixel. To this end, we texture the head model with a constant unique gradient texture map, and render it. Finally, we also provide an eye gaze image that solely contains the white region of both eyes and the locations of the pupils as blue circles. This image provides information about the eye gaze direction and blinking to the network.

Fig. 3. The synthetic input used for conditioning our rendering-to-video translation network: (1) colored face rendering under target illumination, (2) correspondence image, and (3) the eye gaze image.

We stack all N_w conditioning inputs of a time window in a 3D tensor X of size W × H × 9N_w (3 images, with 3 channels each), to obtain the input to our rendering-to-video translation network. To process the complete video, we feed the conditioning space-time volumes in a sliding window fashion. The final generated photo-realistic video output is assembled directly from the output frames.
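As an illustration, a sliding-window sketch of assembling these space-time tensors; the channel ordering within each time step is our assumption.

```python
import numpy as np

N_W = 11  # temporal window size, current frame at its end

def conditioning_tensor(color, corr, gaze, f):
    """Stack the N_w conditioning frames ending at frame f into a single
    H x W x 9*N_w tensor (3 images x 3 channels per time step).
    color/corr/gaze: lists of H x W x 3 images in normalized [-1, +1] space."""
    frames = []
    for o in range(N_W - 1, -1, -1):           # oldest first, frame f last
        frames += [color[f - o], corr[f - o], gaze[f - o]]
    return np.concatenate(frames, axis=-1)      # H x W x 99 for N_w = 11

# Sliding-window processing of a whole video of N_t frames:
# for f in range(N_W - 1, N_t):
#     X = conditioning_tensor(color, corr, gaze, f)
#     output_frame = network(X)
```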
6 RENDERING-TO-VIDEO TRANSLATION

The generated conditioning space-time video tensors are the input to our rendering-to-video translation network. The network learns to convert the synthetic input into full frames of a photo-realistic target video, in which the target actor now mimics the head motion, facial expression and eye gaze of the synthetic input. The network learns to synthesize the entire actor in the foreground, i.e., the face for which conditioning input exists, but also all other parts of the actor, such as hair and body, so that they comply with the target head pose. It also synthesizes the appropriately modified and filled-in background, including even some consistent lighting effects between foreground and background. The network is trained for a specific target actor and a specific static, but otherwise general scene background. Our rendering-to-video translation network follows an encoder–decoder architecture and is trained in an adversarial manner based on a discriminator that is jointly trained. In the following, we explain the network architectures, the used loss functions and the training procedure in detail.

Network Architecture.
We show the architecture of our rendering-to-video translation network in Figure 4. Our conditional generative adversarial network consists of a space-time transformation network T and a discriminator D. The transformation network T takes the W × H × 9N_w space-time tensor X as input and outputs a photo-real image T(X) of the target actor. The temporal input enables the network to take the history of motions into account by inspecting previous conditioning images. The temporal axis of the input tensor is aligned along the network channels, i.e., the convolutions in the first layer have 9N_w channels. Note, we store all image data in normalized [−1, +1]-space, i.e., black is mapped to [−1, −1, −1]^⊤ and white is mapped to [+1, +1, +1]^⊤.

Fig. 4. Architecture of our rendering-to-video translation network for an input resolution of 256 × 256: The encoder has 8 downsampling modules with (64, 128, 256, 512, 512, 512, 512, 512) output channels. The decoder has 8 upsampling modules with (512, 512, 512, 512, 256, 128, 64, 3) output channels. The upsampling modules use the following dropout probabilities: (0.5, 0.5, 0.5, 0, 0, 0, 0, 0). The first downsampling and the last upsampling module do not employ batch normalization (BN). The final non-linearity (TanH) brings the output to the employed normalized [−1, +1]-space.

Our network consists of two main parts, an encoder for computing a low-dimensional latent representation, and a decoder for synthesizing the output image. We employ skip connections [Ronneberger et al. 2015] to enable the network to transfer fine-scale structure. To generate video frames with sufficient resolution, our network also employs a cascaded refinement strategy [Chen and Koltun 2017]. In each downsampling step, we use a convolution (4 × 4, stride 2) followed by batch normalization and a leaky ReLU non-linearity. The upsampling module is specifically designed to produce high-quality output, and has the following structure: first, the resolution is increased by a factor of two based on deconvolution (4 × 4, upsampling factor of 2), batch normalization, dropout and ReLU. Afterwards, two refinement steps based on convolution (3 × 3, stride 1, staying at the same resolution) and ReLU are applied. The final hyperbolic tangent non-linearity (TanH) brings the output tensor to the normalized [−1, +1]-space used for storing the image data. For more details, please refer to Figure 4.

The input to our discriminator D is the conditioning input tensor X (size W × H × 9N_w), and either the predicted output image T(X) or the ground-truth image, both of size W × H × 3. The employed discriminator is inspired by the PatchGAN classifier proposed by Isola et al. [2017]. We extended it to take volumes of conditioning images as input.
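A minimal PyTorch sketch of the generator building blocks described above. The padding of 1 (so that each 4 × 4 stride-2 convolution exactly halves the resolution) and the LeakyReLU slope of 0.2 are our assumptions; channel counts, skip connections and the full cascade of Figure 4 are omitted for brevity.

```python
import torch
import torch.nn as nn

def down_block(c_in, c_out, batch_norm=True):
    """Downsampling module: 4x4 stride-2 convolution, BN, leaky ReLU."""
    layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    if batch_norm:
        layers.append(nn.BatchNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

def up_block(c_in, c_out, dropout=0.0):
    """Upsampling module: 4x4 stride-2 deconvolution, BN, dropout, ReLU,
    followed by two 3x3 refinement convolutions at the same resolution."""
    layers = [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
              nn.BatchNorm2d(c_out)]
    if dropout > 0:
        layers.append(nn.Dropout(dropout))
    layers.append(nn.ReLU(inplace=True))
    for _ in range(2):
        layers += [nn.Conv2d(c_out, c_out, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

x = torch.randn(1, 9 * 11, 256, 256)              # one space-time tensor, N_w = 11
h = down_block(9 * 11, 64, batch_norm=False)(x)   # -> 1 x 64 x 128 x 128
y = up_block(64, 32)(h)                           # -> 1 x 32 x 256 x 256
```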
Objective Function.
We train in an adversarial manner to find the best rendering-to-video translation network:

T* = argmin_T max_D E_cGAN(T, D) + λ E_ℓ1(T).  (5)

This objective function comprises an adversarial loss E_cGAN(T, D) and an ℓ1-norm reproduction loss E_ℓ1(T). The constant weight λ = 100 balances the contribution of these two terms. The adversarial loss has the following form:

E_cGAN(T, D) = E_{X,Y}[log D(X, Y)] + E_X[log(1 − D(X, T(X)))].  (6)

We do not inject a noise vector while training our network, to produce deterministic outputs. During adversarial training, the discriminator D tries to get better at classifying given images as real or synthetic, while the transformation network T tries to improve in fooling the discriminator. The ℓ1-norm loss penalizes the distance between the synthesized image T(X) and the ground-truth image Y, which encourages the sharpness of the synthesized output:

E_ℓ1(T) = E_{X,Y}[∥Y − T(X)∥₁].  (7)
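A sketch of one alternating optimization step for Eqs. (5)-(7). T, D and the optimizers are placeholders; we assume D ends in a sigmoid and returns per-patch probabilities, and we use the common non-saturating generator loss (maximizing log D(X, T(X))) in place of directly minimizing log(1 − D(X, T(X))).

```python
import torch
import torch.nn.functional as F

LAMBDA = 100.0  # weight of the L1 reproduction loss, Eq. (5)

def train_step(T, D, opt_T, opt_D, X, Y):
    # --- discriminator: maximize log D(X, Y) + log(1 - D(X, T(X))) ---
    with torch.no_grad():
        fake = T(X)
    d_real, d_fake = D(X, Y), D(X, fake)
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- generator: fool D and stay close to the ground truth in L1 ---
    fake = T(X)
    d_fake = D(X, fake)
    loss_T = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) + \
             LAMBDA * F.l1_loss(fake, Y)
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()
    return loss_T.item(), loss_D.item()
```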
Training. We construct the training corpus T = {(X_i, Y_i)}_i based on the tracked video frames of the target video sequence. Typically, two thousand video frames, i.e., about one minute of video footage, are sufficient to train our network (see Section 7). Our training corpus consists of N_t − (N_w − 1) rendered conditioning space-time volumes X_i and the corresponding ground-truth images Y_i (using a window size of N_w = 11). All network weights are initialized by sampling from a Normal distribution N(0, 0.02).

7 RESULTS

Our approach enables full-frame target video portrait synthesis under full 3D head pose control. We measured the runtime for training and testing on an Intel Xeon E5-2637 with 3.5 GHz (16 GB RAM) and an NVIDIA GeForce GTX Titan Xp (12 GB RAM). Training our network takes 10 hours for a target video resolution of 256 × 256 pixels, and 42 hours for 512 × 512 pixels. Tracking the source actor takes 250 ms per frame (without identity), and the rendering-to-video conversion (inference) takes 65 ms per frame for 256 × 256 pixels, or 196 ms for 512 × 512 pixels.

In the following, we evaluate the design choices of our deep video portrait algorithm, compare to current state-of-the-art reenactment approaches, and show the results of a large-scale web-based user study. We further demonstrate the potential of our approach on several video rewrite applications, such as reenactment under full head and facial expression control, facial expression reenactment only, video dubbing, and live video portrait editing under user control. In total, we applied our approach to 14 different target sequences of 13 different subjects and used 5 different source sequences; see Appendix A for details. A comparison to a simple nearest-neighbor retrieval approach can be found in Figure 6 and in the supplemental video. Our approach requires only a few minutes of target video footage for training.
7.1 Applications of Deep Video Portraits

Our approach enables us to take full control of the rigid head pose, facial expression, and eye motion of a target actor in a video portrait, thus opening up a wide range of video rewrite applications. All parameter dimensions can be estimated and transferred from a source video sequence, or edited manually through an interactive user interface.
Fig. 5. Qualitative results of full-head reenactment: our approach enables full-frame target video portrait synthesis under full 3D head pose control. The output video portraits are photo-realistic and hard to distinguish from real videos. Note that even the shadow in the background of the second row moves consistently with the modified foreground head motion. In the sequence at the top, we only transfer the translation in the camera plane, while we transfer the full 3D translation for the sequence at the bottom. For full sequences, please refer to our video.
Obama video courtesy of the White House (public domain).
Fig. 6. Comparison to a nearest-neighbor approach in parameter space (pose and expression). Our results have higher quality and are temporally more coherent (see supplemental video). For the nearest-neighbor approach, it is difficult to find the right trade-off between pose and expression. This leads to many results with one of the two dimensions not being well-matched. The results are also temporally unstable, since the nearest neighbor abruptly changes, especially for small training sets.
Reenactment under full head control.
Our approach is the first that can photo-realistically transfer the full 3D head pose (spatial position and rotation), facial expression, as well as eye gaze and eye blinking of a captured source actor to a target actor video. Figure 5 shows some examples of full-head reenactment between different source and target actors. Here, we use the full target video for training and the source video as the driving sequence. As can be seen, the output of our approach achieves a high level of realism and faithfully mimics the driving sequence, while still retaining the mannerisms of the original target actor. Note that the shadow in the background moves consistently with the position of the actor in the scene, as shown in Figure 5 (second row). We also demonstrate the high quality of our results and evaluate our approach quantitatively in a self-reenactment scenario, see Figure 7. For the quantitative analysis, we use two thirds of the target video for training and one third for testing. We capture the face in the training and driving video with our model-based tracker, and then render the conditioning images, which serve as input to our network for synthesizing the output. For further details, please refer to Section 7.2. Note that the synthesized results are nearly indistinguishable from the ground truth.
Facial Reenactment and Video Dubbing.
Besides full-head reenactment, our approach also enables facial reenactment. In this experiment, we replace the expression coefficients of the target actor with those of the source actor before synthesizing the conditioning input to our rendering-to-video translation network. Here, the head pose and position, and eye gaze remain unchanged. Figure 8 shows facial reenactment results. Observe that the face expression in the synthesized target video nicely matches the expression of the source actor in the driving sequence. Please refer to the supplemental video for the complete video sequences.

Our approach can also be applied to visual dubbing. In many countries, foreign-language movies are dubbed, i.e., the original voice of an actor is replaced with that of a dubbing actor speaking in another language. Dubbing often causes visual discomfort due to the discrepancy between the actor's mouth motion and the new audio track. Even professional dubbing studios achieve only approximate audio alignment at best. Visual dubbing aims at altering the mouth motion of the target actor to match the new foreign-language audio track spoken by the dubber. Figure 9 shows results where we modify the facial motion of actors speaking originally in German to adhere to an English translation spoken by a professional dubbing actor, who was filmed in a dubbing studio [Garrido et al. 2015]. More precisely, we transfer the captured facial expressions
Fig. 7. Quantitative evaluation of the photometric re-rendering error. We evaluate our approach quantitatively in a self-reenactment setting, where the ground-truth video portrait is known. We train our rendering-to-video translation network on two thirds of the video sequence, and test on the remaining third. The error maps show per-pixel Euclidean distance in RGB (color channels in [0, 255]); the mean photometric error of the test set is shown in the top-right. The error is consistently low in regions with conditioning input, with higher errors in regions without conditioning, such as the upper body. Obama video courtesy of the White House (public domain).
Putin video courtesy of the Kremlin (CC BY).
May video courtesy of the UK government (Open Government Licence).

Fig. 8. Facial reenactment results of our approach. We transfer the expressions from the source to the target actor, while retaining the head pose (rotation and translation) as well as the eye gaze of the target actor. For the full sequences, please refer to the supplemental video.
Obama video courtesy of the White House(public domain).
Putin video courtesy of the Kremlin (CC BY).
Reagan video courtesy of the National Archives and Records Administration (public domain).

Fig. 9. Dubbing comparison on two sequences of Garrido et al. [2015]. For visual dubbing, we transfer the facial expressions of the dubbing actor ('input') to the target actor. We compare our results to Garrido et al.'s. Our approach obtains higher quality results in terms of the synthesized mouth shape and mouth interior. Note that our approach also enables full-head reenactment in addition to expression transfer. For the full comparison, we refer to the supplemental video.

of the dubbing actor to the target actor, while leaving the original target gaze and eye blinks intact, i.e., we use the original eye gaze images of the tracked target sequence as conditioning. As can be seen, our approach achieves dubbing results of high quality. In fact, we produce images with more realistic mouth interior and more emotional content in the mouth region. Please see the supplemental video for full video results.
Interactive Editing of Video Portraits.
We built an interactive editor that enables users to reanimate video portraits with live feedback by modifying the parameters of the coarse face model rendered into the conditioning images (see our live demo in the supplemental video). Figure 10 shows a few static snapshots that were taken while users were playing with our editor. Our approach enables changes of all parameter dimensions, either independently or all together, as shown in Figure 10. More specifically, we show independent changes of the expression, head rotation, head translation, and eye gaze (including eye blinks). Please note the realistic and consistent generation of the torso, head and background. Even shadows or reflections appear very consistently in the background. In addition, we show user edits that modify all parameters simultaneously. Our interactive editor runs at approximately 9 fps. While not the focus of this paper, our approach also enables modifications of the geometric facial identity, see Figure 11. These combined modifications show, as a proof of concept, that our network generalizes beyond the training corpus.
7.2 Quantitative Evaluation

We performed a quantitative evaluation of the re-rendering quality. First, we evaluate our approach in a self-reenactment setting, where the ground-truth video portrait is known. We train our rendering-to-video translation network on the first two thirds of a video sequence and test it on the remaining last third of the video, see Figure 7. The photometric error maps show the per-pixel Euclidean distance in RGB color space, with each channel being in [0, 255]. We performed this test for three different videos and the mean photometric errors are 2.88 (Vladimir Putin), 4.76 (Theresa May), and 4.46 (Barack Obama).
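This error metric can be computed as in the following sketch; the function name and the framing of predicted/ground-truth frames are illustrative.

```python
import numpy as np

def photometric_error(pred, gt):
    """Per-pixel Euclidean distance in RGB (channels in [0, 255]) and its mean.
    pred, gt: H x W x 3 uint8 or float arrays."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    error_map = np.sqrt((diff ** 2).sum(axis=-1))   # H x W
    return error_map, error_map.mean()

# Mean error over a held-out test set:
# means = [photometric_error(T(X_f), Y_f)[1] for f in test_frames]
# print(np.mean(means))
```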
Fig. 10. Interactive editing. Our approach provides full parametric control over video portraits (by controlling head model parameters in conditioning images). This enables modifications of the rigid head pose (rotation and translation), facial expression and eye motion. All of these dimensions can be manipulated together or independently. We also show these modifications live in the supplemental video.
Obama video courtesy of the White House (public domain).
Fig. 11. Identity modification. While not the main focus of our approach, it also enables modification of the facial shape via the geometry shape parameters. This shows that our network picks up the correspondence between the model and the video portrait. Note that the produced outputs are also consistent in regions that are not constrained by the conditioning input, such as the hair and background.

Fig. 12. Comparison to the image reenactment approach of Averbuch-Elor et al. [2017] in the full-head reenactment scenario. Since their method is based on a single target image, they copy the mouth interior from the source to the target, thus not preserving the target's identity. Our learning-based approach enables larger modifications of the rigid head pose without apparent artifacts, while their warping-based approach distorts the head and background. In addition, ours enables joint control of the eye gaze and eye blinks. The differences are most evident in the supplemental video.
Obama video courtesy of the White House (public domain).

Fig. 13. Comparison to Suwajanakorn et al. [2017]. Their approach produces accurate lip sync with visually imperceptible artifacts, but provides no direct control over facial expressions. Thus, the expressions in the output do not always perfectly match the input (box, mouth), especially for expression changes without audio cue. Our visual dubbing approach accurately transfers the expressions from the source to the target. In addition, our approach provides more control over the target video by also transferring the eye gaze and eye blinks (box, eyes), and the rigid head pose (arrows). Since the source sequence shows more head-pose variation than the target sequence, we scaled the transferred rotation and translation by 0.5 in this experiment. For the full video sequence, we refer to the supplemental video.
Obama video courtesy of the White House (public domain).

Our approach obtains consistently low error in regions with conditioning input (face), and higher errors are found in regions that are unexplained by the conditioning input. Please note that while the synthesized video portraits slightly differ from the ground truth outside the face region, the synthesized hair and upper body are still plausible, consistent with the face region, and free of visual artifacts. For a complete analysis of these sequences, we refer to the supplemental video.

We evaluate our space-time conditioning strategy in Figure 16. Without space-time conditioning, the photometric error is significantly higher. The average errors over the complete sequence are 4.9 without vs. 4.5 with temporal conditioning (Barack Obama), and 5.3 without vs. 4.8 with temporal conditioning (Theresa May).
Fig. 14. Comparison to the state-of-the-art facial reenactment approach of Thies et al. [2016]. Our approach achieves expression transfer of similar quality, while also enabling full-head reenactment, i.e., it also transfers the rigid head pose, gaze direction, and eye blinks. For the video result, we refer to the supplemental video.
Obama video courtesy of the White House (public domain).

In addition to a lower photometric error, space-time conditioning also leads to significantly more temporally stable video outputs. This can be seen best in the supplemental video.

We also evaluate the importance of the training set size. In this experiment, we train our rendering-to-video translation network with 500, 1000, 2000 and 4000 frames of the target sequence, see Figure 15. As can be expected, larger training sets produce better results, and the best results are obtained with the full training set.

We also evaluate different image resolutions by training our rendering-to-video translation network for resolutions of 256 × 256, 512 × 512, and 1024 × 1024 pixels (see Figure 17). We use a resolution of 256 × 256 pixels for most results.
7.3 Comparisons to the State of the Art

We compare our deep video portrait approach to current state-of-the-art video and image reenactment techniques.
Comparison to Thies et al. [2016].
We compare our approach to the state-of-the-art Face2Face facial reenactment method of Thies et al. [2016]. In comparison to Face2Face, our approach achieves expression transfer of similar quality. What distinguishes our approach is the capability for full-head reenactment, i.e., the ability to also transfer the rigid head pose, gaze direction, and eye blinks in addition to the facial expressions, as shown in Figure 14. As can be seen, in our result, the head pose and eye motion nicely match the source sequence, while the output generated by Face2Face follows the head and eye motion of the original target sequence. Please see the supplemental video for the video result.
Comparison to Suwajanakorn et al. [2017].
We also compare to the audio-based dubbing approach of Suwajanakorn et al. [2017], see Figure 13. Their AudioToObama approach produces accurate lip sync with visually imperceptible artifacts, but provides no direct control over facial expressions. Thus, the expressions in the output do not always perfectly match the input (box, mouth), especially for expression changes without an audio cue. Our visual dubbing approach accurately transfers the expressions from the source to the target. In addition, our approach provides more control over the target video by also transferring the eye gaze and eye blinks (box, eyes) and the general rigid head pose (arrows). While their approach is trained on a huge amount of training data (17 hours), our approach only uses a small training dataset (1.3 minutes). The differences are best visible in the supplemental video.
Comparison to Averbuch-Elor et al. [2017].
We compare our approach in the full-head reenactment scenario to the image reenactment approach of Averbuch-Elor et al. [2017], see Figure 12. Their approach does not preserve the identity of the target actor, since they copy the teeth and mouth interior from the source to the target sequence. Our learning-based approach enables larger modifications of the head pose without apparent artifacts, while their warping-based approach significantly distorts the head and background. In addition, we enable the joint modification of the gaze direction and eye blinks; see the supplemental video.
7.4 User Studies

We conducted two extensive web-based user studies to quantitatively evaluate the realism of our results. We prepared short 5-second video clips that we extracted from both real and synthesized videos (see Figure 18), to evaluate three applications of our approach: self-reenactment, same-person reenactment and visual dubbing. We opted for self-reenactment, same-person reenactment (two speeches of Barack Obama) and visual dubbing to guarantee that the motion types in the evaluated real and synthesized video pairs are matching. This eliminates the motion type as a confounding factor from the statistical analysis, e.g., having unrealistic motions for a public speech in the synthesized videos would negatively bias the outcome of the study. Our evaluation is focused on the visual quality of the synthesized results. Most video clips have a resolution of 256 × 256 pixels, but some are 512 × 512 pixels. In our user study, we presented one video clip at a time, and asked participants to respond to the statement "This video clip looks real to me" on a 5-point Likert scale (1 = strongly disagree, 2 = disagree, 3 = don't know, 4 = agree, 5 = strongly agree). Video clips were shown in a random order, and each video clip was shown exactly once to assess participants' first impression. We recruited 135 and 69 anonymous participants for our two studies, largely from North America and Europe.
Fig. 15. Quantitative evaluation of the training set size. We train our rendering-to-video translation network with training corpora of different sizes. The error maps show the per-pixel distance in RGB color space with each channel being in [0, 255]; the mean photometric error is shown in the top-right. Smaller training sets have larger photometric errors, especially for regions outside of the face. For the full comparison, we refer to the supplemental video. Obama video courtesy of the White House (public domain).
May video courtesy of the UK government (Open Government Licence).

Fig. 16. Quantitative evaluation of the influence of the proposed space-time conditioning input. The error maps show the per-pixel distance in RGB color space with each channel being in [0, 255]; the mean photometric error is shown in the top-right. Without space-time conditioning, the photometric error is higher. Temporal conditioning adds significant temporal stability. This is best seen in the supplemental video. Obama video courtesy of the White House (public domain).
May video courtesy of the UK government (Open Government Licence).

Fig. 17. Quantitative comparison of different resolutions. We train three rendering-to-video translation networks for resolutions of 256 × 256, 512 × 512 and 1024 × 1024 pixels. The error maps show the per-pixel distance in RGB color space with each channel being in [0, 255]; the mean photometric error is shown in the top-right. For the full comparison, see our video. May video courtesy of the UK government (Open Government Licence).

Fig. 18. We performed a user study to evaluate the quality of our results and see if users can distinguish between real (top) and synthesized video clips (bottom). The video clips include self-reenactment, same-person reenactment, and video dubbing.
Putin video courtesy of the Kremlin (CC BY).
Obama video courtesy of the White House (public domain).
Elizabeth II video courtesy of the Governor General of Canada (public domain).
The results in Table 1 show that only 80% of participants rated real 256 × 256 videos as real, i.e., (strongly) agreeing with the video looking real; it seems that in anticipation of synthetic video clips, participants became overly critical. At the same time, 50% of participants consider our 256 × 256 results to be real, which increases slightly to 52% for 512 × 512 videos. Our best result is the Putin sequence at 256 × 256 resolution, which 65% of participants consider to be real, compared to 78% for the real video. We also evaluated partial and full reenactment by transferring a speech by Barack Obama to another video clip of himself. Table 2 indicates that we achieve better realism ratings with full reenactment comprising facial expressions and pose (50%) compared to transferring only facial expressions (38%). This might be because full-head reenactment keeps expressions and head motion synchronized. Suwajanakorn et al.'s speech-driven reenactment approach [2017] achieves a realism rating of 64%, compared to the real source and target video clips, which achieve 70–86%. Our full-head reenactment results are considered to be at least as real as Suwajanakorn et al.'s by 60% of participants. We finally compared our dubbing results to VDub [Garrido et al. 2015] in Table 3. Overall, 57% of participants gave our results a higher realism rating (and 32% gave the same rating). Our results are again considered to be real by 51% of participants, compared to only 21% for VDub.
ACM Trans. Graph., Vol. 37, No. 4, Article 163. Publication date: August 2018.
Table 1. User study results for self-reenacted videos (n = 135). Columns 1–5 show the percentage of ratings given about the statement "This video clip looks real to me", from 1 (strongly disagree) to 5 (strongly agree); 'real' is the sum of columns 4 and 5.

                       Real videos               Our results
              res    1   2   3   4   5  'real'    1   2   3   4   5  'real'
Obama         256    2   8  10  62  19   81%     13  33  11  37   6   43%
Putin         256    2  11  10  58  20   78%      3  17  15  54  11   65%
Elizabeth II  256    2   6  12  59  21   80%      6  32  20  33   9   42%
Obama         512    0   7   3  49  42   91%      9  35  13  36   8   44%
Putin         512    4  13  10  47  25   72%      2  20  15  44  19   63%
Elizabeth II  512    1   7   4  55  34   89%      7  33  10  38  13   51%
Mean          256    2   8  10  60  20   80%      7  27  15  41   9   50%
Mean          512    2   9   6  50  34   84%      6  29  12  39  13   52%
Table 2. User study results for expression and full head transfer between two videos of Barack Obama, compared to the input videos and Suwajanakorn et al.'s approach (n = …, mean of 4 clips).
Table 3. User study results for the dubbing comparison between VDub [Garrido et al. 2015] and our results (n = …).
On average, across all scenarios and both studies, our results are considered to be real by 47% of the participants (1,767 ratings), compared to only 80% for real video clips (1,362 ratings). This suggests that our results already fool about 60% of the participants, a good result given the critical participant pool. However, there is some variation across our results: lower realism ratings were given for well-known personalities like Barack Obama, while higher ratings were given, for instance, to the unknown dubbing actors.
8 LIMITATIONS

While we have demonstrated highly realistic reenactment results in a large variety of applications and scenarios, our approach is also subject to a few limitations. Similar to all other learning-based approaches, ours works very well inside the span of the training corpus. Extreme target head poses, such as large rotations, or expressions far outside this span can lead to a degradation of the visual quality of the generated video portrait, see Figure 19 and the supplemental video. Since we only track the face with a parametric model, we cannot actively control the motion of the torso or hair, or control the background. The network learns to extrapolate and finds a plausible and consistent upper body and background (including some shadows and reflections) for a given head pose. This limitation
Fig. 19. Our approach works well within the span of the training corpus. Extreme changes in head pose far outside the training set, or strong changes to the facial expression, might lead to artifacts in the synthesized video. This is a common limitation of all learning-based approaches. In these cases, artifacts are most prominent outside the face region, as these regions have no conditioning input.
May video courtesy of the UK government (Open Government Licence). Malou video courtesy of Louisa Malou (CC BY).
This limitation could be overcome by also tracking the body and using the underlying body model to generate an extended set of conditioning images. Currently, we are only able to produce medium-resolution output due to memory and training-time limitations. The limited output resolution makes it especially difficult to reproduce fine-scale detail, such as individual teeth, in a temporally coherent manner. Yet, recent progress on high-resolution discriminative adversarial networks [Karras et al. 2018; Wang et al. 2017] is promising and could be leveraged to further increase the resolution of the generated output. On a broader scale, and not being a limitation, the democratization of advanced high-quality video editing, offered by our and other methods, calls for additional care in ensuring verifiable video authenticity, e.g., through invisible watermarking.
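As one concrete, if classical and easily defeated, illustration of such a watermark, the sketch below embeds a provenance tag in the least-significant bits of a frame. This is our own example, not a scheme proposed in this paper; naive LSB marks do not survive video re-encoding, so robust video watermarking remains an open problem.

    import numpy as np

    def embed_lsb(frame, bits):
        """Hide a bit string in the LSBs of the first len(bits) pixel values."""
        flat = frame.reshape(-1).copy()
        for i, b in enumerate(bits):
            flat[i] = (flat[i] & 0xFE) | int(b)  # clear the LSB, then set it to b
        return flat.reshape(frame.shape)

    def extract_lsb(frame, n):
        """Read the first n embedded bits back out."""
        return "".join(str(int(v) & 1) for v in frame.reshape(-1)[:n])

    frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # stand-in frame
    tag = "1100101011110001"  # hypothetical provenance tag
    assert extract_lsb(embed_lsb(frame, tag), len(tag)) == tag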
CONCLUSION
We presented a new approach to synthesize entire photo-real video portraits of a target actor in front of a general static background. It is the first method to transfer head pose and orientation, facial expression, and eye gaze from a source actor to a target actor. The proposed method is based on a novel rendering-to-video translation network that converts a sequence of simple computer graphics renderings into photo-realistic and temporally coherent video. This mapping is learned based on a novel space-time conditioning volume formulation. We have shown through experiments and a user study that our method outperforms prior work in quality and expands on its capabilities. It thus opens up a new level of capability in many applications, such as video reenactment for virtual reality and telepresence, interactive video editing, and visual dubbing. We see our approach as a step towards highly realistic synthesis of full-frame video content under the control of meaningful parameters. We hope that it will inspire future research in this very challenging field.
ACKNOWLEDGMENTS
We are grateful to all our actors. We thank TrueVisionSolutions Pty Ltd for kindly providing the 2D face tracker, and Adobe for a Premiere Pro CC license. We also thank Supasorn Suwajanakorn and Hadar Averbuch-Elor for the comparisons. This work was supported by the ERC Starting Grant CapReal (335545), a TUM-IAS Rudolf Mößbauer Fellowship, a Google Faculty Award, RCUK grant CAMERA (EP/M023281/1), an NVIDIA Corporation GPU Grant, and the Max Planck Center for Visual Computing and Communications (MPC-VCC).
A APPENDIX
This appendix describes all datasets used; see Table 4 (target actors) and Table 5 (source actors).
Table 4. Target videos: Name and length of sequences (in frames).
Malou video courtesy of Louisa Malou (CC BY).
May video courtesy of the UK government (Open Government Licence).
Obama video courtesy of the White House (public domain).
Putin video courtesy of the Kremlin (CC BY).
Reagan video courtesy of the National Archives and Records Administration (public domain).
Elizabeth II video courtesy of the Governor General of Canada (public domain).
Wolf video courtesy of Tom Wolf (CC BY).
Ingmar    Malou    May      Obama1   Obama2
3,000     15,000   5,000    2,000    3,613
Putin     Elizabeth II      Reagan   Thomas   Wolf
4,000     1,500             6,984    2,239    15,000
DB1       DB2      DB3      DB4
8,000     18,138   6,500    30,024
Table 5. Source videos: Name and length of sequences (in frames).
Obama video courtesy of the White House (public domain).
Obama3    David1   David2   DB5      DB6
1,945     4,611    3,323    3,824    2,380
REFERENCES
Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul Debevec. 2010. The Digital Emily Project: Achieving a Photorealistic Digital Actor. IEEE Computer Graphics and Applications 30, 4 (July/August 2010), 20–31. https://doi.org/10.1109/MCG.2010.65
Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017. Bringing Portraits to Life. ACM Transactions on Graphics (SIGGRAPH Asia) 36, 6 (November 2017), 196:1–13. https://doi.org/10.1145/3130800.3130818
Volker Blanz, Kristina Scherbaum, Thomas Vetter, and Hans-Peter Seidel. 2004. Exchanging Faces in Images. Computer Graphics Forum (Eurographics) 23, 3 (September 2004), 669–676. https://doi.org/10.1111/j.1467-8659.2004.00799.x
Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 187–194. https://doi.org/10.1145/311535.311556
James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. 2018. Large Scale 3D Morphable Models. International Journal of Computer Vision 126, 2–4 (2018), 233–254.
Christoph Bregler, Michelle Covell, and Malcolm Slaney. 1997. Video Rewrite: Driving Visual Speech with Audio. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 353–360. https://doi.org/10.1145/258734.258880
Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time High-fidelity Facial Performance Capture. ACM Transactions on Graphics (SIGGRAPH) 34, 4 (July 2015), 46:1–9. https://doi.org/10.1145/2766943
Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced Dynamic Expression Regression for Real-time Facial Tracking and Animation. ACM Transactions on Graphics (SIGGRAPH) 33, 4 (July 2014), 43:1–10. https://doi.org/10.1145/2601097.2601204
Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014b. FaceWarehouse: A 3D Facial Expression Database for Visual Computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (March 2014), 413–425. https://doi.org/10.1109/TVCG.2013.249
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-time Facial Animation with Image-based Dynamic Avatars. ACM Transactions on Graphics (SIGGRAPH) 35, 4 (July 2016), 126:1–12. https://doi.org/10.1145/2897824.2925873
Yao-Jen Chang and Tony Ezzat. 2005. Transferable Videorealistic Speech Animation. In Symposium on Computer Animation (SCA). 143–151. https://doi.org/10.1145/1073368.1073388
Qifeng Chen and Vladlen Koltun. 2017. Photographic Image Synthesis with Cascaded Refinement Networks. In International Conference on Computer Vision (ICCV). 1520–1529. https://doi.org/10.1109/ICCV.2017.168
Kevin Dale, Kalyan Sunkavalli, Micah K. Johnson, Daniel Vlasic, Wojciech Matusik, and Hanspeter Pfister. 2011. Video Face Replacement. ACM Transactions on Graphics (SIGGRAPH Asia) 30, 6 (December 2011), 130:1–10. https://doi.org/10.1145/2070781.2024164
Tony Ezzat, Gadi Geiger, and Tomaso Poggio. 2002. Trainable Videorealistic Speech Animation. ACM Transactions on Graphics (SIGGRAPH) 21, 3 (July 2002), 388–398. https://doi.org/10.1145/566654.566594
Ohad Fried, Eli Shechtman, Dan B. Goldman, and Adam Finkelstein. 2016. Perspective-aware Manipulation of Portrait Photos. ACM Transactions on Graphics (SIGGRAPH) 35, 4 (July 2016), 128:1–10. https://doi.org/10.1145/2897824.2925933
Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM Transactions on Graphics 34, 1 (December 2014), 8:1–14. https://doi.org/10.1145/2638549
Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor Lempitsky. 2016. DeepWarp: Photorealistic Image Resynthesis for Gaze Manipulation. In European Conference on Computer Vision (ECCV). 311–326. https://doi.org/10.1007/978-3-319-46475-6_20
Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormaehlen, Patrick Pérez, and Christian Theobalt. 2014. Automatic Face Reenactment. In Conference on Computer Vision and Pattern Recognition (CVPR). 4217–4224. https://doi.org/10.1109/CVPR.2014.537
Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2015. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track. Computer Graphics Forum (Eurographics) 34, 2 (May 2015), 193–204. https://doi.org/10.1111/cgf.12552
Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Transactions on Graphics 35, 3 (June 2016), 28:1–15. https://doi.org/10.1145/2890493
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems.
Geoffrey E. Hinton and Ruslan Salakhutdinov. 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313, 5786 (July 2006), 504–507. https://doi.org/10.1126/science.1127647
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Transactions on Graphics (SIGGRAPH Asia) 36, 6 (November 2017), 195:1–14. https://doi.org/10.1145/3130800.3131088
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D Avatar Creation from Hand-held Video Input. ACM Transactions on Graphics (SIGGRAPH) 34, 4 (July 2015), 45:1–14. https://doi.org/10.1145/2766974
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976. https://doi.org/10.1109/CVPR.2017.632
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations (ICLR).
Ira Kemelmacher-Shlizerman. 2013. Internet-Based Morphable Model. In International Conference on Computer Vision (ICCV). 3256–3263. https://doi.org/10.1109/ICCV.2013.404
Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steven M. Seitz. 2010. Being John Malkovich. In European Conference on Computer Vision (ECCV). 341–353. https://doi.org/10.1007/978-3-642-15549-9_25
Ira Kemelmacher-Shlizerman, Eli Shechtman, Rahul Garg, and Steven M. Seitz. 2011. Exploring Photobios. ACM Transactions on Graphics (SIGGRAPH) 30, 4 (August 2011), 61:1–10. https://doi.org/10.1145/2010324.1964956
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
Christoph Lassner, Gerard Pons-Moll, and Peter V. Gehler. 2017. A Generative Model of People in Clothing. In International Conference on Computer Vision (ICCV). 853–862. https://doi.org/10.1109/ICCV.2017.98
Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls, and Chongyang Ma. 2015. Facial Performance Sensing Head-mounted Display. ACM Transactions on Graphics (SIGGRAPH) 34, 4 (July 2015), 47:1–9. https://doi.org/10.1145/2766939
Kai Li, Qionghai Dai, Ruiping Wang, Yebin Liu, Feng Xu, and Jue Wang. 2014. A Data-Driven Approach for Facial Expression Retargeting in Video. IEEE Transactions on Multimedia 16, 2 (February 2014), 299–310. https://doi.org/10.1109/TMM.2013.2293064
Kang Liu and Joern Ostermann. 2011. Realistic Facial Expression Synthesis for an Image-based Talking Head. In International Conference on Multimedia and Expo (ICME). https://doi.org/10.1109/ICME.2011.6011835
Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised Image-to-Image Translation Networks. In Advances in Neural Information Processing Systems.
Zicheng Liu, Ying Shan, and Zhengyou Zhang. 2001. Expressive Expression Mapping with Ratio Images. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 271–276. https://doi.org/10.1145/383259.383289
Liqian Ma, Qianru Sun, Xu Jia, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose Guided Person Image Generation. In Advances in Neural Information Processing Systems.
Mehdi Mirza and Simon Osindero. 2014. Conditional Generative Adversarial Nets. arXiv:1411.1784 (2014). https://arxiv.org/abs/1411.1784
Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke Saito, Pushmeet Kohli, and Hao Li. 2017. Realistic Dynamic Facial Textures from a Single Image using GANs. In International Conference on Computer Vision (ICCV). 5439–5448. https://doi.org/10.1109/ICCV.2017.580
Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. 2016. High-fidelity Facial and Speech Animation for VR HMDs. ACM Transactions on Graphics (SIGGRAPH Asia) 35, 6 (November 2016), 221:1–14. https://doi.org/10.1145/2980179.2980252
Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations (ICLR).
Ravi Ramamoorthi and Pat Hanrahan. 2001. An Efficient Representation for Irradiance Environment Maps. In Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). 497–500. https://doi.org/10.1145/383259.383317
Elad Richardson, Matan Sela, and Ron Kimmel. 2016. 3D Face Reconstruction by Learning from Synthetic Data. In International Conference on 3D Vision (3DV). 460–469. https://doi.org/10.1109/3DV.2016.56
Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning Detailed Face Reconstruction from a Single Image. In Conference on Computer Vision and Pattern Recognition (CVPR). 5553–5562. https://doi.org/10.1109/CVPR.2017.589
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
Joseph Roth, Yiying Tong, and Xiaoming Liu. 2017. Adaptive 3D Face Reconstruction from Unconstrained Photo Collections. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 11 (November 2017), 2127–2141. https://doi.org/10.1109/TPAMI.2016.2636829
Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. 2011. Real-time Avatar Animation from a Single Image. In International Conference on Automatic Face and Gesture Recognition (FG). 117–124. https://doi.org/10.1109/FG.2011.5771383
Matan Sela, Elad Richardson, and Ron Kimmel. 2017. Unrestricted Facial Geometry Reconstruction Using Image-to-Image Translation. In International Conference on Computer Vision (ICCV). 1585–1594. https://doi.org/10.1109/ICCV.2017.175
Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. 2014. Automatic Acquisition of High-fidelity Facial Performances Using Monocular Videos. ACM Transactions on Graphics (SIGGRAPH Asia) 33, 6 (November 2014), 222:1–13. https://doi.org/10.1145/2661229.2661290
Robert W. Sumner and Jovan Popović. 2004. Deformation Transfer for Triangle Meshes. ACM Transactions on Graphics (SIGGRAPH) 23, 3 (August 2004), 399–405. https://doi.org/10.1145/1015706.1015736
Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M. Seitz. 2014. Total Moving Face Reconstruction. In European Conference on Computer Vision (ECCV) (Lecture Notes in Computer Science), Vol. 8692. 796–812. https://doi.org/10.1007/978-3-319-10593-2_52
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2015. What Makes Tom Hanks Look Like Tom Hanks. In International Conference on Computer Vision (ICCV). 3952–3960. https://doi.org/10.1109/ICCV.2015.450
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning Lip Sync from Audio. ACM Transactions on Graphics (SIGGRAPH) 36, 4 (July 2017), 95:1–13. https://doi.org/10.1145/3072959.3073640
Yaniv Taigman, Adam Polyak, and Lior Wolf. 2017. Unsupervised Cross-Domain Image Generation. In International Conference on Learning Representations (ICLR).
Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. 2017. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In International Conference on Computer Vision (ICCV). 3735–3744. https://doi.org/10.1109/ICCV.2017.401
Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. 2015. Real-time Expression Transfer for Facial Reenactment. ACM Transactions on Graphics (SIGGRAPH Asia) 34, 6 (November 2015), 183:1–14. https://doi.org/10.1145/2816795.2818056
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In Conference on Computer Vision and Pattern Recognition (CVPR). 2387–2395. https://doi.org/10.1109/CVPR.2016.262
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2018. FaceVR: Real-Time Facial Reenactment and Eye Gaze Control in Virtual Reality. ACM Transactions on Graphics (2018).
Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. 2017. Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network. In Conference on Computer Vision and Pattern Recognition (CVPR). 1493–1502. https://doi.org/10.1109/CVPR.2017.163
Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popović. 2005. Face Transfer with Multilinear Models. ACM Transactions on Graphics (SIGGRAPH) 24, 3 (July 2005), 426–433. https://doi.org/10.1145/1073204.1073209
Chao Wang, Haiyong Zheng, Zhibin Yu, Ziqiang Zheng, Zhaorui Gu, and Bing Zheng. 2017. Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation. arXiv:1711.09554 (2017). https://arxiv.org/abs/1711.09554
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Conference on Computer Vision and Pattern Recognition (CVPR).
Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. 2011. Realtime Performance-based Facial Animation. ACM Transactions on Graphics (SIGGRAPH) 30, 4 (July 2011), 77:1–10. https://doi.org/10.1145/2010324.1964972
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. 2018. GazeDirector: Fully Articulated Eye Gaze Redirection in Video. Computer Graphics Forum (Eurographics) 37, 2 (2018). https://doi.org/10.1111/cgf.13355
Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. 2016. An Anatomically-Constrained Local Deformation Model for Monocular Face Capture. ACM Transactions on Graphics (SIGGRAPH) 35, 4 (July 2016), 115:1–12. https://doi.org/10.1145/2897824.2925882
Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In International Conference on Computer Vision (ICCV). 2868–2876. https://doi.org/10.1109/ICCV.2017.310
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In International Conference on Computer Vision (ICCV). 2242–2251. https://doi.org/10.1109/ICCV.2017.244
Michael Zollhöfer, Justus Thies, Pablo Garrido, Derek Bradley, Thabo Beeler, Patrick Pérez, Marc Stamminger, Matthias Nießner, and Christian Theobalt. 2018. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Computer Graphics Forum 37, 2 (2018). https://doi.org/10.1111/cgf.13382