HeadOn: Real-time Reenactment of Human Portrait Videos
JUSTUS THIES,
Technical University of Munich
MICHAEL ZOLLHÖFER,
Stanford University
CHRISTIAN THEOBALT,
Max-Planck-Institute for Informatics
MARC STAMMINGER,
University of Erlangen-Nuremberg
MATTHIAS NIESSNER,
Technical University of Munich
Fig. 1. Our novel HeadOn approach enables real-time reenactment of upper body motion, head pose, face expression, and eye gaze in human portrait videos. For the synthesis of new photo-realistic video content, we employ a novel video-based rendering approach that builds on top of a fully controllable 3D actor model. The person-specific model is constructed from a short RGB-D calibration sequence and is driven by a real-time torso and face tracker.
We propose HeadOn, the first real-time source-to-target reenactment approach for complete human portrait videos that enables transfer of torso and head motion, face expression, and eye gaze. Given a short RGB-D video of the target actor, we automatically construct a personalized geometry proxy that embeds a parametric head, eye, and kinematic torso model. A novel real-time reenactment algorithm employs this proxy to photo-realistically map the captured motion from the source actor to the target actor. On top of the coarse geometric proxy, we propose a video-based rendering technique that composites the modified target portrait video via view- and pose-dependent texturing, and creates photo-realistic imagery of the target actor under novel torso and head poses, facial expressions, and gaze directions. To this end, we propose a robust tracking of the face and torso of the source actor. We extensively evaluate our approach and show significant improvements in enabling much greater flexibility in creating realistic reenacted output videos.

CCS Concepts: • Computing methodologies → Computer vision; Computer graphics.

Additional Key Words and Phrases: Reenactment, Face tracking, Video-based Rendering, Real-time
ACM Reference Format:
Justus Thies, Michael Zollhöfer, Christian Theobalt, Marc Stamminger, and Matthias Nießner. 2018. HeadOn: Real-time Reenactment of Human Portrait Videos. ACM Trans. Graph. 37, 4, Article 164 (August 2018), 13 pages. https://doi.org/10.1145/3197517.3201350
Authors' addresses: Justus Thies, Technical University of Munich, [email protected]; Michael Zollhöfer, Stanford University, [email protected]; Christian Theobalt, Max-Planck-Institute for Informatics, [email protected]; Marc Stamminger, University of Erlangen-Nuremberg, [email protected]; Matthias Nießner, Technical University of Munich, [email protected].

© 2018 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3197517.3201350.
Reenactment approaches aim to transfer the motion of a source actor to an image or video of a target actor. Very recently, facial reenactment methods have been successfully employed to achieve highly-realistic manipulations of facial expressions based on commodity video data [Averbuch-Elor et al. 2017; Suwajanakorn et al. 2017; Thies et al. 2015, 2016, 2018; Vlasic et al. 2005]. Rather than animating a virtual, stylized avatar (e.g., as used in video games), these algorithms replace the face region of a person with a synthetic re-rendering, or modify the target image/video under guidance of a 3D face model. This enables changing the expression of a target person and creating a manipulated output video that suggests different content; e.g., a person who is sitting still could appear as if he/she is talking. Modern reenactment approaches achieve highly believable results, even in real-time, and have enjoyed wide media coverage due to the interest in general movie and video editing [Vlasic et al. 2005], teleconferencing [Thies et al. 2018], reactive profile pictures [Averbuch-Elor et al. 2017], or visual dubbing of foreign language movies [Garrido et al. 2015].

Even though current facial reenactment results are impressive, they are still fundamentally limited in the type of manipulations they enable. For instance, these approaches are only able to modify facial expressions, whereas the rigid pose of the head, including its orientation, remains unchanged and does not follow the input video. Thus, only subtle changes, such as opening the mouth or adding wrinkles on the forehead, are realized, which severely limits the applicability to video editing, where control of the pose of the target person is also required. Furthermore, without joint modification of the head pose, the modified facial expressions often seem out-of-place, since they do not align well with visual pauses in the body and head motion; as noted by Suwajanakorn et al. [2017], this significantly restricts the applicability in teleconferencing scenarios.
In this work, we thus go one step further by introducing HeadOn, a reenactment system for portrait videos recorded with a commodity RGB-D camera.
We overcome the limitations of current facial reenactment methods by not only controlling changes in facial expression, but also reenacting the rigid position of the head, of the upper body, and the eye gaze – i.e., the entire person-related content in a portrait video.

At the core of our approach is the combination of robust and accurate tracking of a deformation proxy with view-dependent texturing for video-based re-rendering. To achieve this, we propose a new method to swiftly and automatically construct a personalized head and torso geometry proxy of a human from a brief RGB-D initialization sequence. The shape proxy features a personalized parametric 3D model of the complete head that is rigged with blendshapes for expression control and is integrated with a personalized upper torso model. A new real-time reenactment algorithm employs this proxy to photo-realistically map face expression and eye gaze, as well as head and torso motion, of a captured source actor to a target actor. To this end, we contribute a new photo-realistic video-based rendering approach that composites the reenacted target portrait video via view- and pose-dependent texturing and video compositing.

In summary, we contribute the following:
• rapid automatic construction of a personalized geometry proxy that embeds a parametric human face, eye, full head, and upper body model,
• a photo-realistic, view-, and pose-dependent texturing and compositing approach,
• a robust tracking approach of the source actor,
• and real-time source-to-target reenactment of complete human portrait videos.

Face reconstruction and reenactment have a long history in computer graphics and vision. We focus on recent approaches based on lightweight commodity sensors. For an overview of high-quality techniques that use controlled acquisition setups, we refer to Klehm et al. [2015]. Recently, a state-of-the-art report on monocular 3D face reconstruction, tracking, and applications has been published that gives a comprehensive overview of current methods [Zollhöfer et al. 2018]. In the following, we concentrate on the most related techniques.
Parametric Face Representations.
Current state-of-the-art monocular face tracking and reconstruction approaches heavily rely on 3D parametric identity [Blanz et al. 2003; Blanz and Vetter 1999] and expression models [Tena et al. 2011] that generalize active appearance models [Cootes et al. 2001] from 2D to 3D space. Even combinations of the two have been proposed [Xiao et al. 2004]. Recently, large-scale models in terms of geometry [Booth et al. 2016] and texture [Zafeiriou et al. 2017] have been constructed based on an immense amount of training data (10,000 scans). For modeling facial expressions, the de facto standard in the industry are blendshapes [Lewis et al. 2014; Pighin et al. 1998]. Physics-based models [Ichim et al. 2017; Sifakis et al. 2005] have been proposed in research, but fitting such complex models to commodity video at real-time rates is still challenging. Some approaches [Shi et al. 2014a; Vlasic et al. 2005] jointly represent face identity and expression in a single multi-linear model. Joint shape and motion models [Li et al. 2017] have also been learned from a large collection of 4D scan data. Other approaches [Garrido et al. 2016] reconstruct personalized face rigs, including reflectance and fine-scale detail, from monocular video. Liang et al. [2014] reconstruct the identity of a face from monocular Kinect data using a part-based matching algorithm. They select face parts (eyes, nose, mouth, cheeks) from a database of faces that best match the input data. To get an improved and personalized output, they fuse these parts with the Kinect depth data. Ichim et al. [2015] propose to reconstruct 3D avatars from multi-view images recorded by a mobile phone and personalize the expression space using a calibration sequence.
Commodity Face Reconstruction and Tracking.
The first commodity face reconstruction approaches that employed lightweight capture setups, i.e., stereo [Valgaerts et al. 2012], RGB [Fyffe et al. 2014; Garrido et al. 2013; Shi et al. 2014a; Suwajanakorn et al. 2014, 2015], or RGB-D [Chen et al. 2013] cameras, had slow off-line frame rates and required up to several minutes to process a single input frame. These methods either deform a personalized template mesh [Suwajanakorn et al. 2014, 2015; Valgaerts et al. 2012], use a 3D template and expression blendshapes [Fyffe et al. 2014; Garrido et al. 2013], a template and an underlying generic deformation graph [Chen et al. 2013], or additionally solve for the parameters of a multi-linear face model [Shi et al. 2014a]. Suwajanakorn et al. [2014; 2015] build a modifiable mesh model from internet photo collections. Shi et al. [2014b] use key-frame based bundle adjustment to fit the multi-linear model. Recently, first methods have appeared that reconstruct facial performances in real-time from a single commodity RGB-D camera [Bouaziz et al. 2013; Hsieh et al. 2015; Li et al. 2013; Thies et al. 2015; Weise et al. 2011; Zollhöfer et al. 2014]. Dense real-time face reconstruction has also been demonstrated based on monocular RGB data using trained regressors [Cao et al. 2014a, 2013] or analysis-by-synthesis [Thies et al. 2016]. Even fine-scale detail can be recovered at real-time frame rates [Cao et al. 2015].
Performance Driven Facial Animation.
Face tracking has been applied to control virtual avatars in many contexts. First approaches were based on sparse detected feature points [Chai et al. 2003; Chuang and Bregler 2002]. Current methods for character animation [Cao et al. 2015, 2014a, 2013; Weise et al. 2009], teleconferences [Weise et al. 2011], games [Ichim et al. 2015], and virtual reality [Li et al. 2015; Olszewski et al. 2016] are based on dense alignment energies. Olszewski et al. [2016] proposed an approach to control a digital avatar in real-time based on an HMD-mounted RGB camera. Recently, Hu et al. [2017] reconstructed a stylized 3D avatar, including hair, from a single image that can be animated and displayed in virtual environments. General image-based modeling and rendering techniques [Gortler et al. 1996; Isaksen et al. 2000; Kang et al. 2006; Kopf et al. 2013; Wood et al. 2000] enable the creation of photo-realistic imagery for many real-world effects that are hard to render and reconstruct at a sufficiently high quality using current approaches. In the context of portrait videos, especially fine details, e.g., single strands of hair or high-quality apparel, are hard to reconstruct. Cao et al. [2016] drive dynamic image-based 3D avatars based on a real-time face tracker. We go one step further and combine a controllable geometric actor rig with video-based rendering techniques to enable the real-time animation and synthesis of a photo-realistic portrait video of a target actor.
Face Reenactment and Replacement.
Face reconstruction and tracking enabling the manipulation of faces in videos has already found its way into consumer applications, e.g., Snapchat, Face Changer, and FaceSwap. Face replacement approaches [Dale et al. 2011; Garrido et al. 2014] swap out the facial region of a target actor and replace it with the face of a source actor. Face replacement is also possible in portrait photos crawled from the web [Kemelmacher-Shlizerman 2016]. In contrast, facial reenactment approaches preserve the identity of the target actor and modify only the facial expressions. The first approaches worked offline [Vlasic et al. 2005] and required controlled recording setups. Thies et al. [2015] proposed the first real-time expression mapping approach based on an RGB-D camera. Follow-up works enabled real-time reenactment of monocular videos [Thies et al. 2016] and stereo video content [Thies et al. 2017, 2018]. Visual video dubbing approaches try to match the mouth motion to a dubbed audio track [Garrido et al. 2015]. For mouth interior synthesis, image-based [Kawai et al. 2014; Thies et al. 2016] and template-based [Thies et al. 2015] approaches have been proposed. Recently, Suwajanakorn et al. [2017] presented an impressive system mapping audio input to plausible lip motion using a learning-based approach. Even though all of these approaches obtain impressive results, they are fundamentally limited in the types of enabled manipulations. For instance, the rigid pose of the upper body and head cannot be modified. One exception is the offline approach of Averbuch-Elor et al. [2017] that enables the creation of reactive profile videos while allowing mapping of small head motions based on image warping. Our approach goes one step further by enabling complete reenactment of portrait videos, i.e., it enables larger changes of the head pose, control over the torso, facial reenactment, and eye gaze redirection, all at real-time frame rates, which is of paramount importance for live teleconferencing scenarios.

Recently, Ma et al. [2017] proposed a generative framework that allows the synthesis of images of people in novel body poses. They employ a U-Net-like generator that is able to synthesize images at a resolution of 256 × 256 pixels. While showing nice results, they only work on single images and not videos, and they are not able to modify facial expressions.
Our approach is a synergy of many tailored components. In this section, we give an overview of our approach before explaining all components in detail in the following sections. Fig. 2 depicts the pipeline of the proposed technique. We distinguish between the source actor and the target actor that is to be reenacted using the expressions and motions of the source actor. The source actor is tracked in real time using a dense face tracker and a model-to-frame Iterative Closest Point (ICP) method that tracks the torso of the person (details are given in Sec. 6.1). To be able to transfer the expressions and the rigid motion of the head as well as the torso to the target actor, we construct a video-based actor rig (see Sec. 4). This actor rig is based on the combination of the SMPL body model [Loper et al. 2015] and a parametric face model that is also used to track the facial expressions of the source actor. Our novel video-based rendering technique (Sec. 5) allows us to render the target actor rig in a photo-realistic fashion. Since the face model used to rig the target actor is the same as the model used to track the source actor, we can directly copy the expression parameters from the source model to the target rig. To transfer the body pose, we compute the relative pose between the tracked face and the torso. Using inverse kinematics, we map this pose to the three involved joints of the SMPL skeleton (head, neck, and torso joint, each having three degrees of freedom). In Sec. 7 and in the supplemental video, we demonstrate the effectiveness of our technique and compare our results against state-of-the-art approaches.

Fig. 2. Overview of our proposed HeadOn technique. Based on the tracking of the torso and the face of the source actor, we deform the target actor mesh. Using this deformed proxy of the target actor's body, we use our novel view-dependent texturing to generate a photo-realistic output.
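To make the mapping of the relative head-vs-torso pose onto the skeleton joints concrete, the following minimal sketch splits a relative rotation over a short joint chain. The equal split over the joints and the function name are our own simplification for illustration, not the actual inverse-kinematics solver of the system.

```python
from scipy.spatial.transform import Rotation as R

def distribute_relative_pose(torso_rot, head_rot, n_joints=3):
    """Split the relative head-vs-torso rotation over a short joint chain
    (torso, neck, head), each joint having 3 DoF. The equal split is only an
    illustrative stand-in for a proper inverse-kinematics mapping."""
    relative = torso_rot.inv() * head_rot          # head pose expressed in the torso frame
    rotvec = relative.as_rotvec()                  # axis-angle vector of the relative rotation
    return [rotvec / n_joints for _ in range(n_joints)]

# Example: the source actor's head is turned 30 degrees relative to the torso.
torso, head = R.identity(), R.from_euler("y", 30, degrees=True)
per_joint = distribute_relative_pose(torso, head)
```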
The first key component of our approach is the fully automatic generation of a video-based person-specific rig of the target actor from commodity RGB-D input. The actor rig combines a unified parametric representation of the target's upper body (chest, shoulders, and neck, no arms) and head geometry with a video-based rendering technique that enables the synthesis of photo-realistic portrait video footage. In this section, we describe the reconstruction of a fully rigged geometric model of the target actor. This model is then used as a proxy for video-based re-rendering of the target actor, as described in Section 5. Fig. 3 shows an overview of the actor rig generation pipeline.
As input, we record two short video sequences of the target actor. The first sequence is a short stream S = {C_t, D_t}_t of color images C_t and depth images D_t of the target actor under different viewing angles.
Fig. 3. Automatic generation of a fully controllable person-specific target actor rig. We reconstruct a coarse geometric proxy of the torso and head based on a commodity RGB-D stream. To gain full parametric control of the target actor, we automatically rig the model with facial expression blendshapes and a kinematic skeleton.
We assume that the target actor is sitting on a swivel chair and is initially facing the camera. The target actor then first rotates the chair to a left profile view (−90°), followed by a right profile view (+90°), while keeping the body and head pose as rigid as possible. Our starting pose is camera-facing to enable robust facial landmark detection in the first frame, which is required for later registration steps. Based on this sequence, we generate most parts of our actor rig, except eye gaze control, for which we need an additional recording of the eye motion. In this sequence, the target actor faces the camera and looks at a moving dot on a screen directly in front of him. The actor follows the dot with his eyes without moving the head. This sequence is used for an eye gaze transfer strategy similar to Thies et al. [2017; 2018]. The complete recording of these two datasets takes less than 30 seconds, with approximately 10 seconds for the body and 20 seconds for the eye data acquisition step. Note, we only capture images of the person in a single static pose. In particular, we do not capture neck motions.

We start with the reconstruction of the geometry of the torso and head of the target actor, based on the recorded depth images D_t of the body sequence. First, we estimate the rigid pose of the actor in each frame, relative to the canonical pose in the first frame, using projective data association and an iterative closest point (ICP) [Besl and McKay 1992; Chen and Medioni 1992] strategy based on a point-to-plane distance metric [Low 2004]. We then fuse all depth observations D_t in a canonical truncated signed distance (TSDF) representation [Curless and Levoy 1996; Newcombe et al. 2011]. We use the open source VoxelHashing [Nießner et al. 2013] implementation that stores the TSDF in a memory-efficient manner to reconstruct the actor in its canonical pose. In all our experiments, we use a voxel size of 4 mm. Finally, we extract a mesh using Marching Cubes [Lorensen and Cline 1987].

For every tracked frame, we also store the rigid transformation of the body with respect to the canonical pose, which we need for view-dependent texturing in a later step. For the eye calibration sequence, we also estimate the rigid pose for each frame by fitting the previously obtained model using a projective point-to-plane ICP. We need these poses later to enable the re-projection of the eyes in the synthesis stage.

To gain full parametric control of the person-specific actor model, we automatically rig the reconstructed mesh. To this end, we first fit a statistical morphable face model to establish correspondence and then transfer facial blendshapes to the actor model. We use the multi-linear face model of [Thies et al. 2016] that is based on the statistical face model of Blanz and Vetter [Blanz and Vetter 1999] and the blendshapes of [Alexander et al. 2009; Cao et al. 2014b].
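As a sketch of the linearized point-to-plane alignment used for the rigid pose estimation described above (in the spirit of [Low 2004]), one Gauss-Newton-style step could look as follows. Correspondences are assumed to be given by projective data association, and the small-angle linearization is our illustrative choice, not the system's exact solver.

```python
import numpy as np

def point_to_plane_step(src_pts, dst_pts, dst_normals):
    """One linearized point-to-plane alignment step (small-angle approximation).
    src_pts, dst_pts: (N,3) corresponding points; dst_normals: (N,3) unit normals.
    Returns a 4x4 rigid transform that moves src_pts towards dst_pts."""
    # Each correspondence contributes one linear equation in (omega, t):
    #   (p x n) . omega + n . t = n . (q - p)
    A = np.hstack([np.cross(src_pts, dst_normals), dst_normals])        # (N,6)
    b = np.einsum("ij,ij->i", dst_normals, dst_pts - src_pts)           # (N,)
    wx, wy, wz, tx, ty, tz = np.linalg.lstsq(A, b, rcond=None)[0]

    T = np.eye(4)
    T[:3, :3] = np.array([[1.0, -wz,  wy],     # small-angle rotation I + [omega]_x
                          [ wz, 1.0, -wx],
                          [-wy,  wx, 1.0]])
    T[:3, 3] = (tx, ty, tz)
    return T
```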
Sparse Feature Alignment.
The used model-based non-rigid registration approach is based on a set of sparse detected facial feature points and a dense geometric alignment term. The sparse discriminative feature points are detected in the frontal view of the body calibration sequence using the True Vision Solution (TVS) feature tracker (http://truevisionsolutions.net/). This landmark tracker is a commercial implementation of Saragih et al. [2011]. We lift the detected feature points to 3D by projecting them onto the target proxy mesh using the recovered rigid pose and the known camera intrinsics. The corresponding 3D feature points on the template face are selected once in a preprocessing step and stay constant for all experiments. The sparse feature alignment term is defined as:

E_{\mathrm{sparse}}(\alpha, \delta, R, t) = \sum_{(i,j) \in C_{\mathrm{sparse}}} \big\| [R\, v_i(\alpha, \delta) + t] - p_j \big\|_2^2 .

Here, α is the vector containing the N_α = 80 shape coefficients of the face model and δ are the N_δ = 76 blendshape expression weights. We include blendshapes during optimization to compensate for non-neutral face expressions of the actor. R is the rotation and t the translation of the face model. The p_j are the points on the proxy mesh and the v_i(α, δ) are the corresponding sparse points on the template mesh that are computed by a linear combination of the shape and expression basis vectors of the underlying face model. The tuples (i, j) ∈ C_sparse define the set of feature correspondences.

Dense Point-to-Point Alignment.

In addition to this sparse feature alignment term, we employ a dense point-to-point alignment energy based on closest point correspondences:

E_{\mathrm{dense}}(\alpha, \delta, R, t) = \sum_{(i,j) \in C_{\mathrm{dense}}} \big\| [R\, v_i(\alpha, \delta) + t] - p_j \big\|_2^2 .
In addition to this sparse featurealignment term, we employ a dense point-to-point alignment energybased on closest point correspondences: E dense ( α , δ , R , t ) = (cid:213) ( i , j )∈ C dense (cid:12)(cid:12)(cid:12)(cid:12)(cid:2) Rv i ( α , δ ) + t (cid:3) − p j (cid:12)(cid:12)(cid:12)(cid:12) . http://truevisionsolutions.net/ACM Trans. Graph., Vol. 37, No. 4, Article 164. Publication date: August 2018. eadOn : Real-time Reenactment of Human Portrait Videos • 164:5 The closest point correspondences C dense are computed using theapproximate nearest neighbor (ANN) library . We prune correspon-dences based on a distance threshold ( thres dist =
10 cm) and on theorientation of the normals.
Statistical Regularization.
For more robustness, we use a statistically motivated regularization term that penalizes shape and expression coefficients that deviate too much from the average:

E_{\mathrm{regularizer}}(\alpha, \delta) = \sum_i \left( \frac{\alpha_i}{\sigma_{i,\mathrm{shape}}} \right)^2 + \sum_i \left( \frac{\delta_i}{\sigma_{i,\mathrm{exp}}} \right)^2 .

Here, σ_{i,shape} and σ_{i,exp} are the standard deviations of the corresponding shape and blendshape dimensions, respectively. The weighted sum of these three terms is minimized using the Levenberg-Marquardt optimization method.
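To make the structure of the combined fitting energy concrete, the following sketch stacks the residuals of E_sparse, E_dense, and E_regularizer for a generic linear face model so they can be handed to a Levenberg-Marquardt solver. The data layout, the rotation parameterization, and the uniform regularization weight w_reg are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def make_residuals(mean, shape_basis, exp_basis, sigma_shape, sigma_exp,
                   sparse_idx, sparse_targets, dense_idx, dense_targets, w_reg=1.0):
    """Stacked residuals of E_sparse + E_dense + E_regularizer for a generic
    linear face model. mean: (V,3); shape_basis: (Ka,V,3); exp_basis: (Kd,V,3);
    *_idx / *_targets are the precomputed correspondence sets (feature
    detection / ANN closest points)."""
    Ka, Kd = shape_basis.shape[0], exp_basis.shape[0]

    def residuals(x):
        alpha, delta = x[:Ka], x[Ka:Ka + Kd]
        rot, t = R.from_rotvec(x[Ka + Kd:Ka + Kd + 3]), x[Ka + Kd + 3:]
        verts = mean + np.tensordot(alpha, shape_basis, 1) + np.tensordot(delta, exp_basis, 1)
        verts = rot.apply(verts) + t
        r_sparse = (verts[sparse_idx] - sparse_targets).ravel()
        r_dense = (verts[dense_idx] - dense_targets).ravel()
        r_reg = w_reg * np.concatenate([alpha / sigma_shape, delta / sigma_exp])
        return np.concatenate([r_sparse, r_dense, r_reg])

    return residuals

# Usage (sketch): x0 = np.zeros(Ka + Kd + 6)
# fit = scipy.optimize.least_squares(make_residuals(...), x0, method="lm")
```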
Automatic Blendshape Transfer.

The established set of dense point-to-point correspondences allows us to build an expression basis for the person-specific actor rig by transferring the per-vertex blendshape displacements of the face model. The basis is only transferred inside a predefined face mask region, and only if the correspondence lies within a threshold distance thres_transfer.

In contrast to facial expressions, which are mostly linear, body motion is non-linear. To accommodate this, we use a kinematic skeleton. We automatically rig the person-specific actor model by transferring the skinning weights and skeleton of a parametric body model. In our system, we use the SMPL [Loper et al. 2015] model. We perform a non-rigid model-based registration to the reconstructed 3D actor model, in a similar fashion as for the face. First, the required 6 sparse feature points are manually selected. These markers are used to initialize the shoulder position and the head position. We then solve for the 10 shape parameters and the joint angles of the kinematic chain of SMPL. After fitting, we establish a set of dense correspondences between the two models. Finally, we transfer the skinning weights as well as the skeleton. We also use the correspondences to transfer body, neck, and head region masks with corresponding feathering weights. Note, to ensure consistent skinning weights of neighboring vertices, we apply Gaussian smoothing (5 iterations of 1-ring filtering).
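The 1-ring filtering of the transferred skinning weights could look roughly like the following sketch; the 50/50 blend with the neighborhood average and the final renormalization are our reading of "Gaussian smoothing" here, not the paper's exact filter.

```python
import numpy as np

def smooth_skinning_weights(weights, faces, iterations=5):
    """1-ring filtering of per-vertex skinning weights followed by renormalization.
    weights: (V,J) per-vertex skinning weights; faces: (F,3) triangle indices."""
    V = weights.shape[0]
    neighbors = [set() for _ in range(V)]               # 1-ring adjacency from triangles
    for a, b, c in faces:
        neighbors[a].update((b, c)); neighbors[b].update((a, c)); neighbors[c].update((a, b))

    w = weights.astype(np.float64).copy()
    for _ in range(iterations):
        w_new = w.copy()
        for v in range(V):
            if neighbors[v]:
                w_new[v] = 0.5 * w[v] + 0.5 * w[list(neighbors[v])].mean(axis=0)
        w = w_new / np.maximum(w_new.sum(axis=1, keepdims=True), 1e-8)   # weights sum to 1
    return w
```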
To improve our results, we refine the per-frame tracking information of the depth sequence based on our final parametric actor rig. To this end, we use the segmentation of the scan (head and body) and re-track the calibration sequence independently for both areas. This step compensates for misalignments in the initial tracking due to slight non-rigid motions of the target during capture. The refined tracking information leads to an improved quality of the following video-based rendering step.

To synthesize novel portrait videos of the target actor, we apply video-based rendering with image data from the input video sequences and the tracked actor model as geometric proxy. With video-based rendering it is possible to generate photo-realistic novel views; in particular, we can correctly synthesize regions for which it is difficult to reconstruct geometry at a sufficiently high quality, i.e., hair. To achieve good results, we need good correspondence between the parametric 3D target actor rig and the video data captured in the calibration sequence, as obtained in our refined tracking stage (see Sec. 4.5). Based on these correspondences, we cross-project images from the input sequences to the projection of the deformed target actor model. We warp separately according to the torso and head motion, facial expression, and eye motion, and we take special care for the proper segmentation of fore- and background. An overview of our view-dependent image synthesis pipeline is shown in Fig. 4, and the single steps are described in the following sections.
First, we generate a foreground/background segmentation (Fig. 5) using a novel space-time graph cut approach (Fig. 6). We initialize the segmentation of the given image domain I by re-projecting the reconstructed and tracked proxy mesh to the calibration images to obtain an initial mask M. Afterwards, we compute segmentation masks F, B, U_f, and U_b. F and B are confident foreground and background regions. Between them is an uncertainty region, with U_f being the probable foreground region and U_b the probable background region. The confident foreground region F = M ⊖ S is computed by eroding the initial mask M with a structuring element S. The confident background B = I \ (M ⊕ S) is the complement of the dilation of M. In the remaining region of uncertainty, we perform background subtraction in HSV color space using a previously captured background image. If the pixel color differs from the background image by more than a threshold, the pixel is assumed to most likely be a foreground pixel and is assigned to U_f, otherwise to U_b. Finally, we remove outliers using a number of further erosion and dilation operations.

The resulting regions are used to initialize the GrabCut [Rother et al. 2004] segmentation algorithm (https://opencv.org). Performing the segmentation per frame can result in temporally incoherent segmentations. Thus, we apply GrabCut to the entire 3D space-time volume of the calibration sequence. We do so by executing the approach independently on all x-, y-, and t-slices. The resulting foreground masks are combined in a consolidation step to generate the final foreground alpha mask (see Fig. 6).

Fig. 5. Our temporal background subtraction: the top row shows the input color images and the middle row the extracted foreground layer using our space-time graph cut segmentation approach. The bottom row shows a background replacement example.

Fig. 6. Temporal GrabCut. On the left we show the output of the original GrabCut approach and on the right our temporally modified GrabCut. Our approach combines the segmentation results along the xt, yt, and xy planes. The results on the left show the foreground masks retrieved from the xy GrabCuts. Our extension of GrabCut to the temporal domain reduces flickering artifacts; thus, the foreground segmentations in the xt and yt planes are smoother.
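As an illustration of the trimap initialization and mask-based GrabCut described above, the following sketch uses OpenCV for a single frame. The threshold value, the structuring-element size, and the per-channel HSV difference test are placeholders, and the space-time extension (running GrabCut on xt- and yt-slices as well) and the outlier removal are omitted.

```python
import cv2
import numpy as np

def segment_frame(frame_bgr, proxy_mask, background_bgr, hsv_thresh=35, radius=10):
    """Trimap initialization and mask-based GrabCut for one frame.
    proxy_mask: uint8, 255 where the re-projected proxy covers the image."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * radius + 1, 2 * radius + 1))
    sure_fg = cv2.erode(proxy_mask, kernel)                    # F: eroded initial mask
    sure_bg = cv2.bitwise_not(cv2.dilate(proxy_mask, kernel))  # B: complement of the dilation

    mask = np.full(proxy_mask.shape, cv2.GC_PR_BGD, np.uint8)
    mask[sure_bg > 0] = cv2.GC_BGD
    mask[sure_fg > 0] = cv2.GC_FGD

    # Uncertainty band: HSV background subtraction against the empty background image.
    diff = cv2.absdiff(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV),
                       cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV))
    uncertain = mask == cv2.GC_PR_BGD
    mask[uncertain & (diff.max(axis=2) > hsv_thresh)] = cv2.GC_PR_FGD

    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
```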
Fig. 4. Overview of the view-dependent image synthesis. Starting with a depth image of our target actor (left), we search for the closest frames in the input sequence, independently for the current head, neck, and body positions. For each such frame, a warp field is computed, and the frames are warped to the correct position. The warped images are then combined after a background subtraction and composited with the background to achieve a photo-realistic re-rendering. The shown uv displacements are color coded in the red and green color channels.

Using the color data observed during the scanning process, we propose a view-dependent compositing strategy, see also Fig. 4. Based on the skinning weights, the body is clustered into body parts, which are textured independently. For each body part, we first retrieve the color frame of the calibration sequence that best matches its current modified orientation. We then initialize the per-view warp fields exploiting the morphed 3D geometry and cross-projection. To this end, we back-project the model into the retrieved frame using the tracking information. Then, we compute a warp field, i.e., a 2D displacement field in image space. The warp field maps from the re-projection in the retrieved image to the projection of the current model in screen space. Using a Laplacian image pyramid, we extend the warp field to the complete image domain. Finally, we use the extended warp field as described above and apply it to the retrieved image frames. Thus, we ensure that we also re-synthesize regions that are not directly covered by the proxy mesh, e.g., hair strands, and that we do not render parts of the mesh where actually the background is visible. The final per-region warps are blended based on a feathering operation using the body, neck, and head masks. Note, our image-based warping technique preserves the details from the calibration sequence since we select the texture based on the pose of the corresponding body part. This selection can be seen as a heuristic of finding the texture with minimal required warping to produce the output frame. Thus, detailed images with hair strands can be synthesized.

Our approach enables real-time reenactment of the head and torso in portrait videos. This requires real-time tracking of the source actor and an efficient technique to transfer the deformations from the source to the target. To this end, we apply our video-based rendering approach to re-render the modified target actor in a photo-realistic fashion. In the following, we detail our real-time upper body and
face tracking approach and describe the deformation transfer. In order to ensure real-time reenactment on a single consumer-level computer, all components are required to run in a relatively short time span.
We track the source actor using a monocular stream from a commodity RGB-D sensor (see Fig. 1). In our examples, we use either an Asus Xtion RGB-D sensor or a StructureIO sensor (https://structure.io/). Our default option is the StructureIO sensor, which we set up for real-time streaming over WiFi in a similar configuration as Dai et al. [2017]. The StructureIO sensor uses the RGB camera of the iPad, allowing us to record the RGB stream at a higher resolution (1296 × 968) compared to the 640 × 480 resolution of the Asus Xtion. However, the WiFi streaming also comes with a latency of a few frames, which is noticeable in the live sequences in the accompanying video, and the overall frame rate is typically 20 Hz due to the limited bandwidth.

The tracking of the source actor consists of two major parts, the face tracking and the upper body tracking, as can be seen in Fig. 7.

Fig. 7. Source actor tracking: Top: example input sequence of a source actor. Bottom: corresponding tracking results as overlay. The fitted face model is shown in red and the proxy mesh for tracking the upper body in green.
Facial performance capture is based on an analysis-by-synthesis approach [Thies et al. 2015] that fits the multi-linear face model that is also used for automatic rigging. We jointly optimize for the model parameters (shape, albedo, expression), the rigid head pose, and the illumination (first three bands of spherical harmonics) that best reproduce the input frame. The energy function is composed of a sparse landmark term that measures the distance of the model to detected 2D features (computed by the TVS marker tracker), a dense photometric appearance term that measures the color differences in RGB space, and a dense geometry term that considers point-to-point and point-to-plane distances from the model to the depth observations. For real-time performance, the resulting optimization problem is solved using a data-parallel Gauss-Newton solver. For more details on dense facial performance capture, we refer to Thies et al. [2015; 2016].
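Schematically, the combined tracking objective described above can be written as

E(\mathcal{P}) = w_{\mathrm{lan}} \sum_{k} \big\| \pi(v_k(\mathcal{P})) - f_k \big\|_2^2 + w_{\mathrm{col}} \sum_{p \in \mathcal{V}} \big\| C_{\mathrm{synth}}(p, \mathcal{P}) - C_{\mathrm{in}}(p) \big\|_2^2 + w_{\mathrm{geo}} E_{\mathrm{geom}}(\mathcal{P}) ,

where \mathcal{P} stacks the shape, albedo, expression, rigid pose, and illumination parameters, \pi projects a model vertex v_k to the image plane, f_k is the corresponding detected 2D feature, C_synth and C_in are the synthesized and input colors over the visible pixels \mathcal{V}, and E_geom combines the point-to-point and point-to-plane distances to the depth observations. The weights w_* and this particular decomposition are our shorthand for the formulation of Thies et al. [2015; 2016], not a verbatim reproduction of their energy.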
In order to track the upper body of the source actor within the limited computational time budget, we first compute a coarse mesh of the upper body. To obtain this mesh, we average a couple of depth frames that show the frontally facing source actor (about 20 frames). We use the tracking information of the face to determine the region of interest in this averaged depth map. That is, we segment the foreground from the background and use the region below the neck. We then extract the proxy mesh by applying a connected component analysis on the depth map. We track the rigid pose of the upper body with a model-to-frame ICP that uses dense projective correspondence association [Rusinkiewicz and Levoy 2001] and a point-to-plane distance measure.

To estimate the eye gaze of the source actor, we use the TVS landmark tracker that detects the pupils and eyelid closure events. The 2D locations of the pupils (P_l, P_r ∈ R², left and right pupil) are used to approximate the gaze of the person relative to the face model. We estimate the yaw angle of each eye by mapping the relative position of the pupil between the left eye corner C_l and the right eye corner C_r, i.e., the ratio ‖P − C_l‖ / (‖P − C_l‖ + ‖P − C_r‖), linearly to a yaw angle. The pitch angle is computed in a similar fashion. We ignore squinting and vergence, and average the yaw and pitch angles of the left and right eye for higher stability. Finally, we map the yaw and pitch angle to a discrete gaze class that is defined by the eye calibration pattern, which was used to train the eye synthesis for the target actor. If eye closing is detected, we overwrite the gaze class with the sampled closed-eye class. This eye class can then be used to retrieve the correctly matching eye texture of the target rig.
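The gaze estimation and class lookup described above could be sketched as follows for a single eye. The angular ranges, the grid layout of the calibration pattern, and the use of upper/lower eyelid landmarks for the pitch ratio are illustrative assumptions, and the averaging over both eyes is omitted.

```python
import numpy as np

def gaze_class(pupil, corners, yaw_range=(-30.0, 30.0), pitch_range=(-20.0, 20.0),
               grid=(5, 3), eyes_closed=False):
    """Map a pupil position to a discrete gaze class.
    pupil: (2,) pixel position; corners: dict with 'left', 'right', 'top', 'bottom'
    eye landmarks, each a (2,) pixel position."""
    def ratio(p, a, b):
        da, db = np.linalg.norm(p - a), np.linalg.norm(p - b)
        return da / max(da + db, 1e-6)

    def to_angle(r, rng):
        return rng[0] + r * (rng[1] - rng[0])             # linear mapping of the ratio

    if eyes_closed:
        return grid[0] * grid[1]                           # extra class: sampled closed-eye texture

    yaw = to_angle(ratio(pupil, corners["left"], corners["right"]), yaw_range)
    pitch = to_angle(ratio(pupil, corners["top"], corners["bottom"]), pitch_range)

    col = int(round((yaw - yaw_range[0]) / (yaw_range[1] - yaw_range[0]) * (grid[0] - 1)))
    row = int(round((pitch - pitch_range[0]) / (pitch_range[1] - pitch_range[0]) * (grid[1] - 1)))
    col = min(max(col, 0), grid[0] - 1)
    row = min(max(row, 0), grid[1] - 1)
    return row * grid[0] + col    # index of the matching eye texture class
```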
Since the face model of the source actor uses the same blendshape basis as the target rig, we can directly copy the expression parameters. In addition, we apply the relative body deformations of the head, neck, and torso to the corresponding joints of the kinematic skeleton of the target rig. These relative body deformations are computed via inverse kinematics using the tracked face and the tracked torso of the source actor. Since the rigid pose of the source and target actor is the same after applying the skeleton deformations, we can copy the mouth interior from the source to the target. In order to compensate for color and illumination differences, we use Poisson image editing [Pérez et al. 2003] with gradient mixing. We use predefined masks on the face template to determine the regions that must be copied and the areas where gradient mixing is applied (between the source image content and the synthesized target image). Using the eye class index estimated by our gaze tracker, we select the corresponding eye texture from the calibration sequence and insert the eye texture, again using Poisson image blending. To produce temporally smooth transitions between eye classes, we blend between the eye textures of the current and preceding frame. Fig. 8 shows the used textures and the extent of the eye and mouth blending masks that were applied to generate our reenactment results.
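For the mouth and eye compositing, OpenCV's seamlessClone with the MIXED_CLONE flag performs Poisson image editing with mixed (source/target) gradients, so a minimal version of the blending step could look like the sketch below. The mask handling and the centering on the mask bounding box are our simplifications, not the system's exact implementation.

```python
import cv2
import numpy as np

def composite_region(source_bgr, target_bgr, region_mask):
    """Blend a region (e.g. the cross-projected mouth interior or a retrieved eye
    texture) into the synthesized target frame via Poisson editing with gradient
    mixing. region_mask: uint8, 255 inside the predefined face-template region;
    both images are full frames of equal size."""
    ys, xs = np.nonzero(region_mask)
    if xs.size == 0:
        return target_bgr
    # seamlessClone places the bounding box of the mask centered at this point,
    # so the bounding-box center keeps the region at its original location.
    center = (int((xs.min() + xs.max()) // 2), int((ys.min() + ys.max()) // 2))
    return cv2.seamlessClone(source_bgr, target_bgr, region_mask, center, cv2.MIXED_CLONE)
```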
In this section, we test and evaluate our approach and compare to state-of-the-art image and video reenactment techniques. All following experiments have been performed on a single desktop computer with an Nvidia GTX 1080 Ti GPU.
Fig. 8. Final compositing of the eye and mouth region; from left to right: driving frame of the source actor (used for mouth transfer), target actor eye class sample that corresponds to the estimated gaze direction of the source actor, cross-projection of the mouth and the eyes to the deformed target actor mesh, and the final composite based on Poisson image blending.

Table 1. Breakdown of the timings of the steps of our reenactment pipeline: dense face tracking (DenseFT), dense body tracking (DenseBT), deformation transfer (DT), morphing of the target actor mesh and image-based video synthesis (Synth), and cross-projection and blending of the eyes and the mouth region (CB). Per-step averages and standard deviations are given in milliseconds, together with the resulting frame rate (FPS). The first row pair shows timings for 640 × 480 input (Asus Xtion) and the second for 1296 × 968 input (StructureIO).

Fig. 9 shows results from our live setup using the StructureIO sensor; please also see the accompanying video for live footage. As the results show, our approach generates high-quality reenactments of portrait videos, including the transfer of head pose, torso movement, facial expression, and eye gaze, for a large variety of source and target actors. The entire pipeline, from source actor tracking to video-based rendering, runs at real-time rates, and is thus applicable to interactive scenarios such as teleconferencing systems. A breakdown of the timings is shown in Tab. 1.

In the following, we further evaluate the quality of the synthesized video output and compare to recent state-of-the-art reenactment systems. Comparisons are also shown in the accompanying video.
Evaluation of Video-based Rendering.
To evaluate the quality improvement due to our video-based rendering approach, we compare it with the direct rendering of the colored mesh obtained from the 3D reconstruction; see Fig. 10. Both scenarios use the same coarse geometry proxy that has been reconstructed using VoxelHashing [Nießner et al. 2013]. As can be seen, the video-based rendering approach leads to drastically higher quality compared to simple voxel-based colors. Since the proxy geometry can be incomplete, holes become visible in the baseline approach, e.g., around the ears and in the hair region. In our video-based rendering approach, these regions are filled in by our view- and pose-dependent rendering strategy using the extended warp field, producing complete and highly-realistic video output. Since the actor was scanned with a closed mouth, opening of the mouth leads to severe artifacts in the baseline approach, while our mouth transfer strategy enables a plausible synthesis of the mouth region. Finally, note how the hair, including its silhouette, is well reproduced.
Evaluation of Eye Reenactment.
We compare our eye gaze reenactment strategy to the deep learning-based DeepWarp [Ganin et al. 2016] approach, which only allows for gaze editing. As Fig. 11 shows, we obtain results of similar quality if only the gaze is redirected. Note, in contrast to our method, DeepWarp is not person specific; i.e., to re-synthesize realistically looking eyes, we need a calibration sequence.
Photometric Error in Self Reenactment.
To evaluate the quality of our entire reenactment pipeline, we conducted a self-reenactment comparison. We first build a person-specific rig of a particular actor and then re-synthesize a sequence of the same actor. In this scenario, we can consider the source video as ground truth and compare it with our synthesized result. Three frames of the comparison are shown in Fig. 12. The first image shows the reference pose, so this frame contains no error due to motion. Thus, the error of the first frame, measured as the ℓ2 distance in RGB color space, shows the error of our re-rendering and can be seen as a baseline for the other frames. The average color difference of the following frames is reported in Fig. 12.
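For reference, the per-frame error measure used in this self-reenactment experiment is a plain mean per-pixel ℓ2 color distance; a straightforward implementation (not code from the paper) is:

```python
import numpy as np

def mean_rgb_l2(frame_a, frame_b):
    """Mean per-pixel l2 color distance in RGB space, with channels scaled to [0, 1]."""
    a = frame_a.astype(np.float64) / 255.0
    b = frame_b.astype(np.float64) / 255.0
    return float(np.mean(np.linalg.norm(a - b, axis=2)))
```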
Comparison to Face2Face. A comparison to Face2Face [Thies et al. 2016] is shown in Fig. 13. Face2Face only reenacts the facial expression and does not adapt head movement or eye gaze. Hence, the video flow of Face2Face often seems out-of-place, since the timing of the different motions does not align, as noted by Suwajanakorn et al. [2017]. The effect is particularly visible in live videos, and it severely restricts the applicability to teleconferencing settings. Our approach achieves comparable quality of single frames, and generates more believable reenactment results by jointly re-targeting the rigid head pose, torso motion, facial expression, and eye gaze direction. Note that our technique copies the mouth from the source actor to the final output. Thus, the identity of the target person is slightly changed. Since Face2Face uses a database of mouth interiors of the target actor, the identity is unchanged. While it would be straightforward to incorporate the mouth retrieval technique presented in Face2Face, we decided against it, because it drastically increases the length of the calibration phase and reduces usability (since only mouth interiors that have been seen in the calibration sequence can be reproduced; note that also expressions with different rigid poses of the head would have to be captured in such a calibration).
Comparison to Bringing Portraits to Life. We also compare our method with Bringing Portraits to Life, an off-line image reenactment approach [Averbuch-Elor et al. 2017], which creates convincing reactive profile videos by transferring expressions and slight head motions of a driving sequence to a target image. It only requires a single image of the target actor as input, but does not provide any control over the torso motion and gaze direction. Fig. 14 shows results of the comparison. We achieve similar quality in general, but Bringing Portraits to Life struggles with larger head pose changes. In comparison, our approach enables free head-pose changes, and provides control over the torso motion, facial expression, and gaze direction. Since our method runs at real-time rates, our approach can also be applied to live applications, such as teleconferencing.
Fig. 9. Real-time portrait video reenactment results of our system for a variety of source and target actors. The source actor drives the head motion, torso movement, facial expression, and the eye gaze of the target actor in real time.
Fig. 10. Evaluation of video-based rendering: we compare our video-based rendering (right) and a simple colored-mesh actor proxy (middle). Both scenarios use the same coarse geometric proxy. Our video-based rendering approach leads to drastically higher realism in all regions and produces photo-realistic video output, while the colored mesh lacks this fidelity.

Fig. 11. Gaze redirection comparison: we compare our eye reenactment strategy (left) to the DeepWarp [Ganin et al. 2016] gaze redirection approach (right). Note that DeepWarp merely modifies the gaze direction, but does not perform a full reenactment of portrait videos.
Comparison to Avatar Digitization. In Fig. 15, we also compare to the Avatar Digitization approach of Hu et al. [2017]. From a single image, this approach generates an avatar that can be animated and used, for instance, as a game character. However, the approach (as well as comparable avatar digitization approaches [Ichim et al. 2015]) generates stylized avatars that are appropriate as game-quality characters and can be used in gaming and social VR applications. In contrast, we aim to synthesize unseen video footage of the target actor at photo-realistic quality, as shown in Fig. 9.
Fig. 12. Self-reenactment evaluation: the first column of the images shows the reference pose of the source and target actor; all following deformations are applied relative to this pose. For this experiment, we rigidly align the reference target actor body to the reference frame of the source actor in order to be able to compare the outputs. We compare the result to the source image using a per-pixel color difference measure. The other two columns show representative results of the test sequence with expression and pose changes. In the bottom row, the color difference plot of the complete test sequence is depicted. The mean ℓ2 color difference over the whole test sequence is measured in RGB color space with values in [0, 1].

Fig. 13. Comparison to Face2Face [Thies et al. 2016]; from left to right: source actor, the reenactment result of Face2Face, and our result. In gray, we show the underlying geometry used to generate the output images.
We have demonstrated robust source-to-target reenactment of complete human portrait videos at real-time rates. Still, a few limitations remain, and we hope that these are tackled in follow-up work. One drawback of our approach is the requirement of a short scanning sequence based on an RGB-D camera.
Fig. 14. Comparison to Bringing Portraits to Life [Averbuch-Elor et al. 2017]: Our approach generalizes better to larger changes in head and body pose than the image-warping based approach of Averbuch-Elor et al. [2017]. In addition, our method enables the joint modification and control of the torso motion and gaze direction. Note that while their approach runs offline, ours allows controlling the entire portrait video at real-time frame rates, allowing application to live teleconferencing.

Fig. 15. Avatar Digitization reconstructs stylized game-quality characters from a single image. In this example, the avatar was generated from the first image of the second row in Fig. 14.

While RGB-D sensors are already widespread, the ultimate goal would be to build the video-based target rig from an unconstrained monocular video of the target actor, without a predefined calibration procedure. In addition, scene illumination is currently not estimated, and therefore illumination changes in reenacted videos cannot be simulated. We also do not track and transfer fine-scale details such as wrinkles, since they are not represented by the used multi-linear face model (see Fig. 16). While Cao et al. [2015] demonstrate tracking of fine-scale details, it has not been shown how to transfer these wrinkles to another person. This is an open question that can be tackled in the future. Under extreme pose changes, or difficult motion of hair (see Fig. 18), the reenacted results may exhibit artifacts, as neither the model nor the video-based texturing may be able to fully represent the new view-dependent appearance. In Fig. 17 we show failure cases that stem from extreme head rotations and occlusions in the input stream of the source actor. Note that the proposed technique has the same limitations as other state-of-the-art reenactment methods like Face2Face [Thies et al. 2016].
Fig. 16. Limitation: Fine-scale detail such as wrinkles is not transferred. The close-ups show the difference between the input and the output.

Fig. 17. Limitation: Strong head rotations or occlusions in the input stream of the source actor lead to distortions in the reenactment result.

In particular, the used analysis-by-synthesis approach to track the face uses the parameters of the previous frame as an initial guess; thus, fast head motions require high frame rates of the input camera, otherwise the tracking is disturbed by the motion (for more details on the limitations of the face tracking we refer to the publications [Thies et al. 2015, 2016]). Our approach is also limited to the upper body. We do not track the motions of the arms and hands, and are not able to re-synthesize such motions for the target actor. Ideally, one would want to control the whole body; here, we see our project as a stepping stone towards this direction, which we believe will lead to exciting follow-up work. We do believe that the combination of a coarse deformation proxy with view-dependent textures will generalize to larger parts of the body, if they can be robustly tracked.
We introduced HeadOn, an interactive reenactment system for human portrait videos. We capture facial expressions, eye gaze, rigid head pose, and motions of the upper body of a source actor, and transfer them to a target actor in real time. By transferring all relevant motions from a human portrait video, we achieve believable and plausible reenactments, which opens up the avenue for many important applications such as movie editing and video conferencing. In particular, we show examples where a person is able to control portraits of another person or to perform self-reenactment to easily switch clothing in a live video stream. However, more fundamentally, we believe that our method is a stepping stone towards a much broader avenue in movie editing. We believe that the idea of coarse geometric proxies can be applied to more sophisticated environments, such as complex movie settings, and ultimately transform current video processing pipelines. In this spirit, we are convinced and hopeful to see many more future research works in this exciting area.
Fig. 18. Limitation: Hair is statically attached to the skeleton structure of the delegate mesh.
ACKNOWLEDGMENTS
We thank Angela Dai for the video voice-over and all actors for participating in this project. Thanks to Averbuch-Elor et al. and Hu et al. for the comparisons. The facial landmark tracker was kindly provided by TrueVisionSolution. This work was supported by the ERC Starting Grant CapReal (335545), the Max Planck Center for Visual Computing and Communication (MPC-VCC), a TUM-IAS Rudolf Mößbauer Fellowship, and a Google Faculty Award.
REFERENCES
Oleg Alexander, Mike Rogers, William Lambeth, Matt Chiang, and Paul Debevec.2009. The Digital Emily Project: Photoreal Facial Modeling and Animation. In
ACM SIGGRAPH 2009 Courses . Article 12, 12:1–12:15 pages.Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017.Bringing Portraits to Life.
ACM Transactions on Graphics (Proceeding of SIGGRAPHAsia 2017)
36, 4 (2017), to appear.Paul J. Besl and Neil D. McKay. 1992. A Method for Registration of 3-D Shapes.
IEEETrans. Pattern Anal. Mach. Intell.
14, 2 (Feb. 1992), 239–256.Volker Blanz, Curzio Basso, Tomaso Poggio, and Thomas Vetter. 2003. Reanimatingfaces in images and video. In
Proc. EUROGRAPHICS , Vol. 22. 641–650.Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3DFaces. In
ACM TOG . 187–194.J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. 2016. A 3D MorphableModel learnt from 10,000 faces. In
IEEE Conference on Computer Vision and PatternRecognition (CVPR) .Sofien Bouaziz, Yangang Wang, and Mark Pauly. 2013. Online Modeling for RealtimeFacial Animation.
ACM TOG
32, 4, Article 40 (2013), 10 pages.Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. 2015. Real-time High-fidelityFacial Performance Capture.
ACM TOG
34, 4, Article 46 (2015), 9 pages.Chen Cao, Qiming Hou, and Kun Zhou. 2014a. Displaced Dynamic Expression Regres-sion for Real-time Facial Tracking and Animation. In
ACM TOG , Vol. 33. 43:1–43:10.Chen Cao, Yanlin Weng, Stephen Lin, and Kun Zhou. 2013. 3D shape regression forreal-time facial animation. In
ACM TOG , Vol. 32. 41:1–41:10.Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014b. FaceWarehouse:A 3D Facial Expression Database for Visual Computing.
IEEE TVCG
20, 3 (2014),413–425. Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-timeFacial Animation with Image-based Dynamic Avatars.
ACM Trans. Graph.
35, 4(July 2016).Jin-xiang Chai, Jing Xiao, and Jessica Hodgins. 2003. Vision-based control of 3D facialanimation. In
Proc. SCA. 193–206.
Yang Chen and Gérard G. Medioni. 1992. Object modelling by registration of multiple range images. Image and Vision Computing 10, 3 (1992), 145–155.
Yen-Lin Chen, Hsiang-Tao Wu, Fuhao Shi, Xin Tong, and Jinxiang Chai. 2013. Accurate and Robust 3D Facial Capture Using a Single RGBD Camera. Proc. ICCV (2013), 3615–3622.
E. Chuang and C. Bregler. 2002. Performance-driven Facial Animation using Blend Shape Interpolation. Technical Report CS-TR-2002-02. Stanford University.
Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. 2001. Active Appearance Models. IEEE TPAMI 23, 6 (2001), 681–685.
Brian Curless and Marc Levoy. 1996. A Volumetric Method for Building Complex Models from Range Images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’96). ACM, New York, NY, USA, 303–312. https://doi.org/10.1145/237170.237269
Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. 2017. BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Reintegration. ACM Transactions on Graphics (TOG) 36, 3 (2017), 24.
Kevin Dale, Kalyan Sunkavalli, Micah K. Johnson, Daniel Vlasic, Wojciech Matusik, and Hanspeter Pfister. 2011. Video face replacement. In ACM TOG, Vol. 30. 130:1–130:10.
Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. 2014. Driving High-Resolution Facial Scans with Video Performance Capture. ACM Trans. Graph. 34, 1, Article 8 (Dec. 2014), 14 pages. https://doi.org/10.1145/2638549
Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor S. Lempitsky. 2016. DeepWarp: Photorealistic Image Resynthesis for Gaze Manipulation. In ECCV.
Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormaehlen, Patrick Perez, and Christian Theobalt. 2014. Automatic Face Reenactment. In Proc. CVPR.
Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Perez, and Christian Theobalt. 2015. VDub - Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track. In CGF (Proc. EUROGRAPHICS).
Pablo Garrido, Levi Valgaerts, Chenglei Wu, and Christian Theobalt. 2013. Reconstructing Detailed Dynamic Face Geometry from Monocular Video. In ACM TOG, Vol. 32. 158:1–158:10.
Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Transactions on Graphics (TOG) 35, 3 (2016), 28.
Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’96). ACM, New York, NY, USA, 43–54. https://doi.org/10.1145/237170.237200
Pei-Lun Hsieh, Chongyang Ma, Jihun Yu, and Hao Li. 2015. Unconstrained realtime facial performance capture. In Proc. CVPR.
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-time Rendering. ACM Trans. Graph. 36, 6, Article 195 (Nov. 2017), 14 pages. https://doi.org/10.1145/3130800.31310887
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D Avatar Creation from Hand-held Video Input. ACM TOG 34, 4, Article 45 (2015), 14 pages.
Alexandru-Eugen Ichim, Petr Kadleček, Ladislav Kavan, and Mark Pauly. 2017. Phace: Physics-based Face Modeling and Animation. ACM Trans. Graph. 36, 4, Article 153 (July 2017), 14 pages.
Aaron Isaksen, Leonard McMillan, and Steven J. Gortler. 2000. Dynamically Reparameterized Light Fields. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’00). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 297–306. https://doi.org/10.1145/344779.344929
Sing Bing Kang, Yin Li, Xin Tong, and Heung-Yeung Shum. 2006. Image-based Rendering. Found. Trends. Comput. Graph. Vis. 2, 3 (Jan. 2006), 173–258. https://doi.org/10.1561/0600000012
Masahide Kawai, Tomoyori Iwao, Daisuke Mima, Akinobu Maejima, and Shigeo Morishima. 2014. Data-Driven Speech Animation Synthesis Focusing on Realistic Inside of the Mouth. Journal of Information Processing 22, 2 (2014), 401–409.
Ira Kemelmacher-Shlizerman. 2016. Transfiguring Portraits. ACM Trans. Graph. 35, 4, Article 94 (July 2016), 8 pages.
Oliver Klehm, Fabrice Rousselle, Marios Papas, Derek Bradley, Christophe Hery, Bernd Bickel, Wojciech Jarosz, and Thabo Beeler. 2015. Recent Advances in Facial Appearance Capture. CGF (EUROGRAPHICS STAR Reports) (2015). https://doi.org/10.1111/cgf.12594
Johannes Kopf, Fabian Langguth, Daniel Scharstein, Richard Szeliski, and Michael Goesele. 2013. Image-based Rendering in the Gradient Domain. ACM Trans. Graph. 32, 6, Article 199 (Nov. 2013), 9 pages. https://doi.org/10.1145/2508363.2508369
J. P. Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Fred Pighin, and Zhigang Deng. 2014. Practice and Theory of Blendshape Facial Models. In Eurographics STARs. 199–218.
Hao Li, Laura Trutoiu, Kyle Olszewski, Lingyu Wei, Tristan Trutna, Pei-Lun Hsieh, Aaron Nicholls, and Chongyang Ma. 2015. Facial Performance Sensing Head-Mounted Display. ACM Transactions on Graphics (Proceedings SIGGRAPH 2015) 34, 4 (July 2015).
Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime Facial Animation with On-the-fly Correctives. In ACM TOG, Vol. 32.
Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics 36, 6 (Nov. 2017), 194:1–194:17. Two first authors contributed equally.
Shu Liang, Ira Kemelmacher-Shlizerman, and Linda G. Shapiro. 2014. 3D Face Hallucination from a Single Depth Frame. In 3D Vision (3DV). IEEE Computer Society.
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34, 6 (Oct. 2015), 248:1–248:16.
William E. Lorensen and Harvey E. Cline. 1987. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’87). ACM, New York, NY, USA, 163–169.
Kok-Lim Low. 2004. Linear Least-Squares Optimization for Point-to-Plane ICP Surface Registration. (01 2004).
Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose Guided Person Image Generation. In NIPS.
Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. 2011. KinectFusion: Real-time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR ’11). IEEE Computer Society, Washington, DC, USA, 127–136. https://doi.org/10.1109/ISMAR.2011.6092378
M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. 2013. Real-time 3D Reconstruction at Scale Using Voxel Hashing. ACM Trans. Graph. 32, 6, Article 169, 11 pages.
Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. 2016. High-Fidelity Facial and Speech Animation for VR HMDs. ACM TOG 35, 6 (2016).
Patrick Pérez, Michel Gangnet, and Andrew Blake. 2003. Poisson Image Editing. In ACM SIGGRAPH 2003 Papers (SIGGRAPH ’03). ACM, 313–318.
F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. Salesin. 1998. Synthesizing realistic facial expressions from photographs. In ACM TOG. 75–84.
Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. 2004. "GrabCut": Interactive Foreground Extraction Using Iterated Graph Cuts. In
ACM SIGGRAPH 2004 Papers (SIGGRAPH ’04). ACM, New York, NY, USA, 309–314.
Szymon Rusinkiewicz and Marc Levoy. 2001. Efficient variants of the ICP algorithm. In Proc. 3DIM. IEEE, 145–152.
Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. 2011. Deformable Model Fitting by Regularized Landmark Mean-Shift.
IJCV 91, 2 (2011).
Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. 2014a. Automatic Acquisition of High-fidelity Facial Performances Using Monocular Videos. In ACM TOG, Vol. 33. Issue 6.
Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. 2014b. Automatic Acquisition of High-fidelity Facial Performances Using Monocular Videos. ACM Trans. Graph. 33, 6, Article 222 (Nov. 2014), 13 pages. https://doi.org/10.1145/2661229.2661290
Eftychios Sifakis, Igor Neverov, and Ronald Fedkiw. 2005. Automatic Determination of Facial Muscle Activations from Sparse Motion Capture Marker Data. ACM TOG 24, 3 (2005), 417–425.
Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M. Seitz. 2014. Total Moving Face Reconstruction. In Proc. ECCV. 796–812.
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2015. What Makes Tom Hanks Look Like Tom Hanks. In Proc. ICCV.
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 36, 4, Article 95 (July 2017), 13 pages. https://doi.org/10.1145/3072959.3073640
J. Rafael Tena, Fernando De la Torre, and Iain Matthews. 2011. Interactive Region-based Linear 3D Face Models.
ACM TOG 30, 4, Article 76 (2011), 10 pages.
Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. 2015. Real-time Expression Transfer for Facial Reenactment. ACM TOG 34, 6, Article 183 (2015), 14 pages.
Justus Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. 2016. Face2Face: Real-time Face Capture and Reenactment of RGB Videos. In Proc. CVPR.
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2017. Demo of FaceVR: Real-time Facial Reenactment and Eye Gaze Control in Virtual Reality. In ACM SIGGRAPH 2017 Emerging Technologies (SIGGRAPH ’17). ACM, Article 7, 2 pages. https://doi.org/10.1145/3084822.3084841
Justus Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. 2018. FaceVR: Real-Time Gaze-Aware Facial Reenactment in Virtual Reality. ACM Transactions on Graphics (TOG).
Levi Valgaerts, Chenglei Wu, Andrés Bruhn, Hans-Peter Seidel, and Christian Theobalt. 2012. Lightweight Binocular Facial Performance Capture under Uncontrolled Lighting. In ACM TOG, Vol. 31.
Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. 2005. Face transfer with multilinear models. In ACM TOG, Vol. 24.
Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. 2011. Realtime performance-based facial animation. In ACM TOG, Vol. 30. Issue 4.
Thibaut Weise, Hao Li, Luc J. Van Gool, and Mark Pauly. 2009. Face/Off: live facial puppetry. In Proc. SCA. 7–16.
Daniel N. Wood, Daniel I. Azuma, Ken Aldinger, Brian Curless, Tom Duchamp, David H. Salesin, and Werner Stuetzle. 2000. Surface Light Fields for 3D Photography. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’00). ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 287–296. https://doi.org/10.1145/344779.344925
Jing Xiao, Simon Baker, Iain Matthews, and Takeo Kanade. 2004. Real-Time Combined 2D+3D Active Appearance Models. In Proc. CVPR. 535–542.
S. Zafeiriou, A. Roussos, A. Ponniah, D. Dunaway, and J. Booth. 2017. Large Scale 3D Morphable Models. International Journal of Computer Vision (2017).
Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, and Marc Stamminger. 2014. Real-time Non-rigid Reconstruction using an RGB-D Camera. In ACM TOG, Vol. 33.
M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt. 2018. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications.