Dynamic Facial Asset and Rig Generation from a Single Scan
JIAMAN LI∗, University of Southern California and USC Institute for Creative Technologies
ZHENGFEI KUANG∗, University of Southern California and USC Institute for Creative Technologies
YAJIE ZHAO†, USC Institute for Creative Technologies
MINGMING HE, USC Institute for Creative Technologies
KARL BLADIN, USC Institute for Creative Technologies
HAO LI, University of Southern California, USC Institute for Creative Technologies, and Pinscreen
Fig. 1. Given a single neutral scan (a), we generate a complete set of dynamic face model assets, including personalized blendshapes and physically-based dynamic facial skin textures of the input subjects (b). The results carry high-fidelity details, which we render in Arnold [Maya 2019] (c). Our generated facial assets are animation-ready, as shown in (d).
The creation of high-fidelity computer-generated (CG) characters for films and games is tied to intensive manual labor, which involves the creation of comprehensive facial assets that are often captured using complex hardware. To simplify and accelerate this digitization process, we propose a framework for the automatic generation of high-quality dynamic facial models, including rigs which can be readily deployed for artists to polish. Our framework takes a single scan as input to generate a set of personalized blendshapes, dynamic textures, as well as secondary facial components (e.g., teeth and eyeballs). Based on a facial database with over 4,000 scans with pore-level details, varying expressions, and identities, we adopt a self-supervised neural network to learn personalized blendshapes from a set of template expressions. We also model the joint distribution between identities and expressions, enabling the inference of a full set of personalized blendshapes with dynamic appearances from a single neutral input scan. Our generated personalized face rig assets are seamlessly compatible with professional production pipelines for facial animation and rendering. We demonstrate a highly robust and effective framework on a wide range of subjects, and showcase high-fidelity facial animations with automatically generated personalized dynamic textures.

∗ indicates equal contribution. † indicates corresponding author.

Authors' addresses: Jiaman Li, University of Southern California, USC Institute for Creative Technologies; Zhengfei Kuang, University of Southern California, USC Institute for Creative Technologies; Yajie Zhao, USC Institute for Creative Technologies; Mingming He, USC Institute for Creative Technologies; Karl Bladin, USC Institute for Creative Technologies; Hao Li, University of Southern California, USC Institute for Creative Technologies, Pinscreen.

© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
https://doi.org/10.1145/3414685.3417817

CCS Concepts: • Computing methodologies → Face Animation.

Additional Key Words and Phrases: Face Rigging, Blendshapes, Animation, Physically-Based Face Rendering, Performance Capture, Deformation Transfer.
ACM Reference Format:
Jiaman Li, Zhengfei Kuang, Yajie Zhao, Mingming He, Karl Bladin, and Hao Li. 2020. Dynamic Facial Asset and Rig Generation from a Single Scan.
ACM Trans. Graph. 39, 6, Article 215 (December 2020), 17 pages. https://doi.org/10.1145/3414685.3417817
1 INTRODUCTION
High-quality and personalized digital humans are relevant to a wide range of applications, such as film and game production (e.g., Unreal Engine, Digital Doug) and virtual reality [Fyffe et al. 2014; Lombardi et al. 2018; Wei et al. 2019]. To produce high-fidelity digital doubles, complex capture equipment is often needed in conventional computer graphics pipelines, and the acquired data typically undergoes intensive manual post-processing by a production team. New approaches based on deep learning synthesis are promising, as they show how photorealistic faces can be generated directly from captured data [Lombardi et al. 2018; Wei et al. 2019], allowing one to overcome the notorious uncanny valley. However, in addition to their intensive GPU compute requirements and their need for large volumes of training data, these deep learning-based methods are still difficult to integrate seamlessly into virtual CG environments: they lack relighting capabilities and fine rendering controls, which prevents them from being adopted for games and film production. On the other hand, realistic digital doubles in conventional graphics pipelines require months of production and involve large teams of highly skilled digital artists as well as sophisticated scanning techniques [Ghosh et al. 2011]. Building the facial assets of a virtual character typically requires a number of facial expression models, often based on the Facial Action Coding System (FACS), as well as physically-based texture assets (e.g., albedo, specular maps, and displacement maps) to ensure realistic facial skin reflectance in a virtual environment.

Several recent works have shown how to automate and reduce the effort of generating personalized facial rigs. The works of Laine et al. [2017]; Li et al. [2010]; Ma et al. [2016]; Pawaskar et al. [2013] propose to automatically build personalized blendshapes using a varying number of personalized facial scans. While effective for production pipelines, these methods either require a large number of facial scans as input and considerable post-processing, or they only focus on generating personalized geometry for the expressions, without the textures. For consumer-accessible avatar creation, the works of Casas et al. [2016]; Hu et al. [2017]; Ichim et al. [2015]; Nagano et al. [2018]; Thies et al. [2016] demonstrate digitization from video sequences or even a single input image. However, due to the limited input data, the resulting models often lack details, or the generated assets do not contain physically-based properties for dynamic expressions. We propose an approach that takes a 3D scan as input, and our goal is to produce a fully rigged model with fixed topology and personalized blendshape expressions, along with corresponding dynamic, physically-based texture maps. We observe that a large amount of labeled data can enable the learning of personalized models and dynamic deformations, such that wrinkle formations are specific to the shape and appearance of the subject. In particular, we extend recent deep learning approaches for high-resolution physically-based skin assets [Li et al. 2020; Yamaguchi et al. 2018] to generate dynamic high-resolution facial texture attributes (albedo, specular, and displacement maps), in order to produce effects such as plausible personalized wrinkles during animation.
Existing methods transfer facial expression details from a generic database, which may lead to reasonable output for the geometry, but lacks dynamic texture variations. We present a framework to automate and simplify the generation of high-quality facial rig assets, consisting of personalized blendshapes, dynamic physically-based skin attributes (albedo, specular reflection, and displacement maps), and secondary facial components (e.g., eyes, teeth, gums, and tongue), from a single neutral geometric model and albedo map as input. Our generated assets can be directly fed into professional production pipelines. We use a high-fidelity facial scan database [Li et al. 2020] and address both the problem of generating personalized blendshapes and that of inferring dynamic physically-based skin properties. We propose an end-to-end self-supervised learning framework to overcome the lack of ground truth data for personalized blendshapes and dynamic textures. By modeling the correlation between identities and personalized expressions on the database of 178 identities, each having 19∼26 different captured expressions, we eliminate the requirement of subject-specific expression scans for personalized blendshape generation, trading off between semantic meaning and personality. Our approach uses an intermediate conversion of neutral geometry and 2D textures to a common parameterization in UV space, which enables training and inference of dynamic geometry and texture deformation in a compact form, inspired by Li et al. [2020].

Learning is performed using a high-fidelity facial scan dataset with over 4,000 scans with pore-level details and different expressions. Our approach can automatically produce personalized blendshapes that reflect the personalized expressions of a person from only one neutral scan. We demonstrate the effectiveness of our framework on a wide range of subjects and showcase a number of compelling facial animations. In summary, our major contributions are as follows:

• We propose an end-to-end framework to automate the generation of high-quality facial assets and rigs. Given a single neutral face scan with albedo as input, we produce plausible personalized blendshapes, secondary facial components (e.g., teeth, eyelashes), and, most importantly, physically-based textures that are both dynamic and personalized to the appearance of the input subject.
• We present a novel self-supervised deep learning approach to improve the personalized results using a generic facial expression template model. In particular, our approach can model the joint distribution between individual identities and their expressions in a large high-fidelity face database.
• We also introduce a novel physically-based texture synthesis framework conditioned on neutral geometry and textures. Using a new compress and stretch map approach, we are able to synthesize dynamic expression-specific textures, including albedo, specular, and fine-scale displacement maps.
• We will make our code, models, and database with all texture assets public to facilitate further research on automating high-quality avatar generation.
2 RELATED WORK
Facial Capture.
Due to increased demands for realistic digital avatars, facial capture and performance capture have been well studied. Based on a multi-view stereo system, fine-scale details of the captured face can be recovered in a controlled environment with multiple calibrated DSLR cameras, as in the work of Beeler et al. [2010]. A more intricate system by Ghosh et al. [2011] extends the view-dependent method [Ma et al. 2007] by adopting fixed linear polarized spherical gradient illumination in front of the cameras, and enables accurate acquisition of diffuse albedo, specular intensity, and pore-level normal maps. Fyffe et al. [2016] later propose a method that employs commodity hardware, recording comparable results with off-the-shelf components and near-instant capture. Meanwhile, works on passive facial performance capture [Beeler et al. 2011; Bradley et al. 2010; Fyffe et al. 2014; Valgaerts et al. 2012] have shown impressively detailed results for highly articulated motion. Recently, Gotardo et al. [2018] propose a method to acquire dynamic properties of facial skin appearance, including dynamic diffuse albedo, specular intensity, and normal maps. These methods provide decent training data and set a high baseline for lightweight facial capture and modeling approaches.
Facial Rigging.
Creating facial animation is a well-studied problem, with a plethora of methods proposed in the film and video game industries. Blanz and Vetter [1999] first introduce the Morphable Face Model to represent the face shapes and textures of different identities using principal component analysis (PCA) learned from 200 laser-scanned subjects. Later, improved parametric face models are built using almost 10,000 high-quality 3D face scans [Booth et al. 2017, 2016]. A linear model generated from web images has also been demonstrated [Kemelmacher-Shlizerman 2013]. Modeling variational face expressions using blendshapes is a popular approach in many applications [Thies et al. 2015, 2016]. The approach models facial expressions as activations of shape units represented by a linear basis of facial expression vectors [Lewis et al. 2014]. Amberg et al. [2008] combine a PCA model of a neutral face with a PCA space derived from the residual vectors of different expressions to the neutral pose. Blendshapes can either be handcrafted by animators [Alexander et al. 2009; Olszewski et al. 2016], or be generated via statistical analysis of large facial expression datasets [Cao et al. 2014; Li et al. 2017; Vlasic et al. 2005]. The multilinear model [Cao et al. 2014; Vlasic et al. 2005] offers a way of capturing a joint space of expression and identity. Li et al. [2017] propose the FLAME model, learned from thousands of scans, and significantly improve model expressiveness.

Personalized Blendshape Generation.
As an effort to advance and scale the production of facial animation, expression cloning [Noh and Neumann 2001] has been introduced to mimic the existing deformation of a source 3D face model on a target face. Sumner and Popović [2004] propose deformation transfer for generic 3D triangle meshes. Onizuka et al. [2019] propose a landmark-guided deformation transfer method to generate expressions for any target avatar that directly maps to a generic blendshape template. These methods can generate an expression for a novel subject, but might fail to capture personalized behavior due to the lack of personal information. To build robust face rigs, we need to reconstruct a dynamic expression model that faithfully captures the subject's specific facial movements. A full set of personalized blendshapes for a specific subject can be built from 3D scan data of the same subject [Carrigan et al. 2020; Huang et al. 2011; Li et al. 2010; Weise et al. 2009; Zhang et al. 2004]. These methods can reconstruct expressions that capture the target's personal expressions, but a large set of action units or sparse expressions is required as input. Some follow-up works [Bouaziz et al. 2013; Hsieh et al. 2015; Li et al. 2013] apply expression transfer on top of a generic face model and train model correctives for the expressions during tracking, with samples obtained from RGB-D video input. Ichim et al. [2015] and Cao et al. [2016] propose comprehensive pipelines to generate dynamic 3D avatars based on personalized blendshapes from a monocular video of a specific expression sequence. Casas et al. [2016] reconstruct blendshapes and each blendshape's textures with a Kinect. Garrido et al. [2016] introduce a video-based method, which makes blendshape generation suitable for legacy video footage.
Deep Face Models.
As deep learning-based methods for 3D shape analysis have attracted increasing attention in recent years, methods for non-linear 3D morphable model learning have been introduced [Bagautdinov et al. 2018; Li et al. 2020; Tewari et al. 2017; Tran et al. 2019; Tran and Liu 2018]. These models are formulated as decoders built with neural networks: some use fully connected layers or 2D convolutions in the image space [Li et al. 2020], while others build decoders in the mesh domain to exploit the local geometry of 3D structures [Abrevaya et al. 2019; Cheng et al. 2019; Litany et al. 2018; Ranjan et al. 2018; Zhou et al. 2019].
Image-to-Image Translation.
Isola et al. [2017] present Pix2Pix, a method to translate images from one domain to another. It consists of a generator and a discriminator, where the objective of the generator is to translate images from domain A to B, while the discriminator aims to distinguish real images from translated ones. Wang et al. [2018b] later extend this work to Pix2PixHD to synthesize high-resolution photo-realistic images from semantic label maps. Works on learning "translation" functions for videos [Lee et al. 2019; Wang et al. 2019, 2018a] also incorporate a spatio-temporal adversarial objective. Image-to-image translation has also been adopted to generate 3D faces or detailed face textures. Sela et al. [2017] propose a Pix2Vertex framework using image-to-image translation that jointly maps the input image to a depth image and a facial correspondence map. Huynh et al. [2018] apply this image-to-image translation framework to infer mesoscopic facial geometry with high-quality training data captured using the Light Stage. Yamaguchi et al. [2018] present a comprehensive method to infer facial reflectance maps from unconstrained image input. Nagano et al. [2018] introduce a framework to synthesize arbitrary expressions in image space and textures in UV space from a single input image. Chen et al. [2019] adopt a conditional GAN to synthesize geometric details (wrinkles) by estimating a displacement map over a proxy mesh. Similarly, Yang et al. [2020] infer a displacement map on a base mesh generated from a single image, based on a large high-quality face dataset.
3 OVERVIEW
Our system takes a single scanned neutral geometry with an albedo map as input and generates a set of face rig assets and texture attributes for physically-based, production-level rendering. As shown in Fig. 2, we developed a cascaded framework in which we first estimate a set of personalized blendshape geometries of the input subject using a Blendshape Generation network, followed by a Texture Generation network that infers a set of dynamic maps, including albedo maps, specular intensity maps, and displacement maps. In the final step, we combine the obtained secondary facial components (i.e., teeth, gums, and eye assets) from a set of template shapes to assemble the final face model.
Fig. 2. System overview. Given the model from a single scan in a neutral expression, the blendshape generation module first generates its personalized blendshapes. Then, using the personalized blendshapes along with the input neutral model and its albedo map, the texture generation module produces high-resolution dynamic texture maps, including albedo, specular intensity, and displacement maps. With these assets ready, we assemble the personalized blendshapes and the input neutral model into 3D models, combining other facial components (eyes, teeth, gums, and tongue) from the template models. The final output is a complete face model rendered using the blendshape models and textures.
4 BLENDSHAPE GENERATION
Our goal is to automatically generate a full set of personalized blendshapes from the neutral 3D face of a novel subject. This is a challenging problem, since generating subject-specific blendshapes usually requires different expressions of the same subject. Thanks to our large-scale dataset, which consists of various expressions as described in Sec. 7, we introduce a self-supervised pipeline that learns to generate personalized blendshapes based on expressions. Our first task is to imitate, with deep neural networks, the process followed by artists when isolating scanned expressions into unit blendshapes. Given a set of pre-defined generic template blendshapes as a semantic reference and multiple well-defined scanned expressions of the same subject, our first goal is to automatically generate the personalized blendshapes of the input subject.

The generic template blendshape model is defined as a generic model $S_0$ in neutral expression and a set of $N$ (in our case $N = 55$) additive vector displacements $\mathcal{S} = \{S_1, \ldots, S_N\}$. Expressions can be generated as $P_k = S_0 + \sum_{i=1}^{N} \alpha_{ik} S_i$, where $\alpha_{ik}$ are the blending weights for expression $k$. For a new subject $j$, given their neutral expression model $S_0^j$ and other expressions $P_k^j$, the personalized blendshapes $S_i^j$ can be optimized by minimizing the reconstruction loss between $P_k^{j\prime}$ and the ground truth expression $P_k^j$, if the blending weights $\alpha_{ik}^j, i = 1, \ldots, N$ for $P_k^j$ are known:

$$P_k^{j\prime} = S_0^j + \sum_{i=1}^{N} \alpha_{ik}^j S_i^j. \quad (1)$$

This is the foundation of our self-supervised learning scheme (a code sketch of this blending is given below).

Based on our template blendshape set, we also pre-defined a set of FACS expressions for building the dataset (excluding the neutral expression). The FACS expressions refer to a set of standardized facial poses that can be performed by a person and that generally correspond to a combination of blendshapes (with blending weights of either 0 or 1), with minimum motion overlap and maximum blendshape coverage. We assume that our captured FACS poses cover all the blendshapes and can be isolated into unit blendshapes losslessly (more details in Sec. 7). So far, for each training subject, we have a set of captured FACS expressions with corresponding combinations of 0 or 1 blending weights. However, these blending weights cannot simply be regarded as ground truth for real scans: one can easily perform unwanted motions when trying to express a predefined FACS expression (e.g., the FACS smile consists only of Left_Lip_Corner_Puller and Right_Lip_Corner_Puller, but the capture may end up with unexpected eye motion). To address this issue, we propose a two-stage learning framework, as shown in Fig. 3. The Estimation Stage fixes the initial blending weights to generate a set of blendshapes that optimally preserves identity and semantics, while its counterpart, the Tuning Stage, fine-tunes the initial blendshapes by jointly learning blending weights to better fit the captured FACS expressions.
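To make the blending in Eq. 1 concrete, here is a minimal NumPy sketch; the array shapes, names, and blendshape indices are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def blend_expression(s0, blendshapes, alphas):
    """Reconstruct an expression P_k = S_0 + sum_i alpha_ik * S_i (Eq. 1).

    s0:          (V, 3) neutral vertex positions S_0
    blendshapes: (N, V, 3) additive vector displacements S_1..S_N
    alphas:      (N,) blending weights for expression k
    """
    return s0 + np.tensordot(alphas, blendshapes, axes=1)

# A FACS pose activates a 0/1 subset of the units, e.g. a smile built
# from the two lip-corner pullers (the indices here are hypothetical):
alphas = np.zeros(55)
alphas[[20, 21]] = 1.0  # Left_/Right_Lip_Corner_Puller
```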
4.1 Estimation Stage
As shown in Fig. 3, the Estimation Stage takes a model with neutral expression $S_0^j$ and the pre-defined blending weights for a FACS expression $P_k^j$ as its input. It contains a Blendshape Generator, which learns to generate personalized blendshapes that are used to reconstruct the expression $P_k^j$ using Eq. 1. We define a reconstruction loss between the reconstructed expression and the input expression:

$$L_{rec} = \sum_{x \in P_k^j} \left\| P_k^{j\prime}(x) - P_k^j(x) \right\|. \quad (2)$$
Fig. 3. Two-stage self-supervised learning framework. Given a model in a neutral expression, the Estimation Stage first predicts the initial blendshapes, which serve as input for the Tuning Stage to generate the final personalized blendshapes. The inference pipeline is connected by solid lines; the training architecture also involves the parts in dashed lines for computing the reconstruction loss. In the Estimation Stage, the Blendshape Generator learns to generate the initial blendshapes from the input neutral expression, which combine with the known blending weights to reconstruct the non-neutral expressions. In the Tuning Stage, the Blending Weight Predictor is added to predict blending weights for the personalized blendshapes, which are used to reconstruct the input expression.
Inspired by the idea in Li et al. [2010], which emphasizes the importance of relative change between the template and the target models, we propose to learn blendshape offsets instead of the blendshapes themselves, because: (1) blendshape offsets are distributed in a nearly standard normal distribution, which is easy for the network to learn; and (2) blendshape offsets better demonstrate identity differences. In the example in Fig. 4, the same expression of two different subjects is presented, and their difference shows most clearly in the blendshape offsets. Thus, the outputs of the Blendshape Generator, $\{\Delta S_1^j, \ldots, \Delta S_N^j\}$, are the offsets from the template blendshapes to the target ones, from which the target personalized blendshapes are reconstructed by adding the template blendshapes:

$$S_i^j = \Delta S_i^j + S_i, \quad \forall i \geq 1. \quad (3)$$

To make the target blendshapes semantically consistent with the template blendshapes, we define a regularization term on the blendshape offsets to minimize their relative difference:

$$L_{reg} = \sum_{i=1}^{N} \sum_{x \in S_i} g_i\, m_i(x) \left\| \Delta S_i^j(x) \right\|, \quad (4)$$

where $g_i$ are global weights for the different blendshapes and $m_i(x)$ are local weights for each vertex $x$ in the blendshape $S_i$, defined in Eq. 5 and Eq. 6. The global weights are defined as

$$g_i = \frac{\lambda_g}{\sum_{x \in S_i} \| S_i(x) \|}, \quad \forall i \geq 1, \quad (5)$$

where $\lambda_g$ is a scale factor restricting the maximum $g_i$ to 1. Considering the scale differences between blendshapes, we introduce the global weights to balance the influence of each blendshape on the regularization loss. For example, the shape Jaw_Open involves more moving vertices than Left_Eye_Open. If the same weight were assigned to both, the regularization loss would be dominated by Jaw_Open, thus underestimating less pronounced shapes. We therefore assign a smaller regularization weight to blendshapes with a larger offset scale. A similar strategy is used in Chen et al. [2018], where adaptive weights for a multi-objective loss are applied to balance the gradients during training.

Fig. 4. Visualization of cosine distance maps between expressions, blendshapes, and blendshape offsets. (a) and (b) show the same expression of different subjects represented by absolute positions in expression geometry $P_i^j$ (Row 1), blendshape offsets from the neutral expression $S_i^j$ (Row 2), and offsets from the template blendshape $\Delta S_i^j$ (Row 3). Note that the distance map in Row 1 is almost filled with zeros, because the average difference of the same expression between different individuals is much smaller than the scale of the human head.

The local weights $m_i$ are defined by the normalized norms of the template blendshapes, with the vertex values normalized to $(0, 1]$:

$$m_i(x) = \frac{\lambda_l^i}{\| S_i(x) \|}, \quad \forall x \in S_i, \quad (6)$$

where $\lambda_l^i$ is a scale factor restricting the maximum $m_i$ to 1 (excluding fixed vertices); for fixed vertices in blendshape $S_i$ (where $S_i(x) = 0$), we manually assign a relatively large weight to constrain their movements. For each blendshape, the changes from the input neutral face are dominated by only a subset of vertices, while the remaining vertices stay unchanged. The local weights are used to penalize large movements of the unchanged vertices and ensure the overall isolation of the generated blendshapes.
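The weights in Eqs. 4-6 depend only on the template blendshapes, so they can be precomputed once. A NumPy sketch under the definitions above; the constant assigned to fixed vertices is an assumption, since the exact value is not preserved in this copy.

```python
import numpy as np

def regularization_weights(template, fixed_weight=10.0):
    """Global weights g_i (Eq. 5) and local weights m_i (Eq. 6) from the
    template blendshape offsets S_1..S_N of shape (N, V, 3)."""
    norms = np.linalg.norm(template, axis=-1)   # (N, V): ||S_i(x)||
    totals = norms.sum(axis=1)                  # (N,): sum_x ||S_i(x)||
    g = totals.min() / totals                   # lambda_g chosen so max(g_i) = 1

    m = np.full_like(norms, fixed_weight)       # fixed vertices get a large weight
    for i, n in enumerate(norms):
        moving = n > 0
        m[i, moving] = n[moving].min() / n[moving]  # lambda_l^i caps m_i at 1
    return g, m

def reg_loss(offsets, g, m):
    """L_reg = sum_i g_i sum_x m_i(x) ||dS_i^j(x)|| (Eq. 4)."""
    return (g[:, None] * m * np.linalg.norm(offsets, axis=-1)).sum()
```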
Finally, we combine the reconstruction loss $L_{rec}$ and the regularization term $L_{reg}$ to obtain the loss function for the Blendshape Generator:

$$L_G = L_{rec} + \omega_{reg} L_{reg}, \quad (7)$$

where $\omega_{reg}$ is the regularization weight used in training.

The Blendshape Generator is a 2D convolutional neural network (CNN), similar to the image translator in Liu et al. [2019], consisting of an identity encoder and a blendshape decoder. The encoder, the same as the content encoder in Liu et al. [2019], is made of a few 2D convolutional layers followed by several residual blocks. It takes a neutral expression $S_0^j$ as input and maps it to a content latent code that is a spatial feature map. The decoder consists of several instance-normalization residual blocks followed by a couple of up-scaling convolutional layers; it decodes the feature map into the blendshape offsets. To adapt 3D models to a compact representation that suits a 2D CNN, we represent every 3D model as a 2D geometry image by first registering all input 3D models to the same topology and aligning them in UV space (implementation details in Sec. 7), so that each pixel stores the $x$-$y$-$z$ coordinates of one vertex.

Instead of training the generator as one network, we adopt a two-branch architecture inspired by Bai and Ghanem [2017], who use a multi-branch network for face detection and tracking with different face sizes. We observe that the scale of different blendshapes varies greatly, so we use a two-branch training strategy: we separate our blendshapes into two categories, 14 extreme blendshapes with relatively large motion and the rest with small motion. As shown in Fig. 5, the two-branch network makes the generated blendshapes more personalized and closer to the reference FACS expressions.

Fig. 5. Comparison of two blendshape models generated by the Blendshape Generator with a single-branch network and a two-branch network in the Estimation Stage. GT expression denotes the reference FACS expression that is most semantically similar to the corresponding blendshape. Compared to the single-branch results, the two-branch results are more similar to the reference FACS expressions while keeping the semantic meaning of the generic blendshapes.

4.2 Tuning Stage
In the Estimation Stage, the blending weights are given and consistent for all subjects, but in practice it is hard to guarantee that different subjects realize exactly the same expressions. In this scenario, the fixed blending weights lead to inaccuracy when fitting such expressions for different subjects. Therefore, we relax the constraints on the blending weights and instead learn them with a neural network. As shown in Fig. 3, compared to the Estimation Stage, the initial blendshapes serve as additional input to the Blendshape Generator, and a Blending Weight Predictor is introduced to predict blending weights from the input expression in the Tuning Stage.
The Blending Weight Predictor shares a similar network architecture with the Blendshape Generator, consisting of an expression encoder and a blending weight decoder. Given an input expression $P_k^j$, the encoder maps it to an expression latent code, and the decoder decodes the latent code into a vector of $N$ blending weights whose values are constrained to $[0, 1]$. Combining the blending weights with the personalized blendshapes generated by the Blendshape Generator, we reconstruct the input expression using Eq. 1. The loss used to constrain the output of the Blending Weight Predictor is the reconstruction loss defined in Eq. 2.

In order to preserve the semantics and personality of the initial blendshapes generated by the Estimation Stage, we define the regularization term as

$$L_{reg_{FT}} = \sum_{i=1}^{N} \left\| \Delta S_{i,FT}^j - \Delta S_i^j \right\|, \quad (8)$$

where $\Delta S_{i,FT}^j$ are the target blendshape offsets and $\Delta S_i^j$ are the initial blendshape offsets generated in the Estimation Stage. Thus, the loss function used in the Tuning Stage is

$$L_{G_{FT}} = L_{rec} + \omega_{reg_{FT}} L_{reg_{FT}}, \quad (9)$$

where $\omega_{reg_{FT}}$ is the regularization weight. In our implementation, we add skip connections from the initial blendshapes to the generator output (shown as the red line in Fig. 3), such that the generator predicts $\Delta S_{i,FT}^j - \Delta S_i^j$. Examples with and without tuning are shown in Fig. 6; we observe that the Tuning Stage achieves better fitting results by fine-tuning the blendshapes and jointly optimizing the blending weights, while preserving the semantics and personality.
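Written as a training objective, the Tuning Stage combines Eqs. 8 and 9 with the skip connection described above. A PyTorch-style sketch; the regularization weight is a placeholder, as the published value is not preserved in this copy.

```python
import torch

def tuning_stage_loss(p_recon, p_gt, offsets_ft, offsets_init, w_reg_ft=0.1):
    """L_G_FT = L_rec + w_reg_FT * L_reg_FT (Eqs. 8 and 9).

    p_recon, p_gt:            reconstructed / captured expression, (V, 3)
    offsets_ft, offsets_init: tuned / Estimation-Stage offsets, (N, V, 3)
    w_reg_ft:                 placeholder value (assumption)
    """
    l_rec = (p_recon - p_gt).norm(dim=-1).sum()                # Eq. 2
    l_reg_ft = (offsets_ft - offsets_init).norm(dim=-1).sum()  # Eq. 8
    return l_rec + w_reg_ft * l_reg_ft                         # Eq. 9

# With the skip connection of Fig. 3, the generator only predicts the
# residual dS_FT - dS, so the tuned offsets are formed as:
#   offsets_ft = offsets_init + generator(neutral, offsets_init)
```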
5 DYNAMIC TEXTURE GENERATION
In this section, we first introduce our compact representation of dynamic texture assets, the compress and stretch maps, followed by a learning-based method to infer and extract them. Finally, we demonstrate the use of the compress and stretch maps for rendering at run-time.

Compress and Stretch Maps.
When static textures (obtained from a neutral expression) are used to render extensive expressions, the missing details (e.g., wrinkles) caused by facial motion significantly reduce the photo-realism of the rendering results. Especially for extreme or exaggerated expressions, high-fidelity muscle movement and micro-expressions make a big difference. A natural way to solve this problem is to customize a set of dynamic textures for the blendshapes. However, the number of blendshapes used in high-end productions may be on the order of hundreds or thousands; creating that many dynamic textures is costly and requires substantial computational power. More importantly, it is difficult to load such a vast collection of dynamic textures into a rendering engine at once, in particular with multiple layers (e.g., albedo, specular intensity, and displacement maps) at high resolution. A memory-efficient, compact, and easy-to-compute dynamic representation is needed. Moreover, it should be expressive enough to cover all the possible dynamic details of facial motion losslessly. We adopt compress and stretch maps, as shown in Fig. 9, along with a static neutral texture as our dynamic texture library, which is a commonly adopted format in the industry [Oat 2007]. Guided by influence maps, the compress and stretch maps gather the most prominent features caused by the local compression and stretching movements across all available expressions.
Fig. 6. Comparison of two reconstructed expressions by the Estimation Stage alone and with the addition of the Tuning Stage, along with error maps between the reconstructed expressions and the ground truth expressions. The output from the Tuning Stage results in better reconstruction with smaller fitting errors.
Fig. 7. Texture generation network. Given the albedo map and the geometry image of the input model in neutral expression, together with the geometry image of the target expression offset, the first network generates the albedo map of the expression using pix2pixHD [Wang et al. 2018b]. Then, combining the initial input and the predicted albedo map, the second network infers the specular intensity and the low-frequency and high-frequency displacement maps.
Fig. 8. Generated textures and ground truth textures of an expression.

Fig. 9. Illustration of compress and stretch maps. Rows, from top to bottom: neutral static maps, compress maps, stretch maps. Columns, from left to right: diffuse albedo maps, specular maps, and normal maps (in tangent space) computed from the displacement maps.
Fig. 10. Illustration of influence maps. (a) Influence values rendered on geometries with different expressions (Mouth Right, Smile, and Lip Funnel). (b) Selected influence maps from a set of blendshapes, and an example of a dynamic albedo with its corresponding influence map for the blendshape CheekSquint_L. Note that we store the compress and stretch influence maps in the R and G channels and set the B channel to zero.

Fig. 11. Illustration of compress map extraction. Left: expression textures generated by the networks. Right: compress maps extracted by blending expression textures based on the influence maps. Note that the final compress maps gather all the dynamic details caused by local skin compression (in the orange circles) from all the expressions.
Influence Maps , com-press and stretch maps gather the most prominent features causedby the local compression/stretching movement of all the availableexpressions.
Influence Maps.
Influence maps are computed from the geometry changes between the expressions and the neutral face. For each vertex $x$ on the neutral mesh $N$, we define the average edge length of its one-ring neighborhood as $E_N(x)$. Then, for an arbitrary expression mesh $P$ of the same subject, the influence value of each vertex on $P$ in the compress maps is computed as

$$I^P_{Compress}(x) = \begin{cases} \| E_N(x) - E_P(x) \|, & E_P(x) < E_N(x), \\ 0, & E_P(x) \geq E_N(x). \end{cases} \quad (10)$$

Similarly, the influence value of each vertex on $P$ in the stretch maps is

$$I^P_{Stretch}(x) = \begin{cases} \| E_P(x) - E_N(x) \|, & E_P(x) > E_N(x), \\ 0, & E_P(x) \leq E_N(x). \end{cases} \quad (11)$$

Based on the per-vertex influence values, we interpolate per-pixel compress and stretch influence maps, as shown in Fig. 10. Note that we store the compress and stretch influence maps separately in the R and G channels. The influence maps provide the weights to blend and extract dynamic textures.

In the standard industry pipeline, compress and stretch maps are handcrafted by skilled artists using numerous captured expressions as reference. To automate this procedure, especially when only a single scan is provided as in our scenario, we use a two-step solution: first, we predict the texture maps (i.e., albedo, specular intensity, and displacement) of the input subject's pre-defined expressions using a deep neural network; then, a blending step fuses them into compress and stretch maps.
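Eqs. 10 and 11 reduce to an edge-length comparison per vertex. A sketch, assuming a precomputed one-ring adjacency list; the names are illustrative.

```python
import numpy as np

def mean_edge_length(verts, one_ring):
    """Average one-ring edge length E(x) for every vertex.

    verts:    (V, 3) vertex positions
    one_ring: list of V index arrays of one-ring neighbors
    """
    return np.array([np.linalg.norm(verts[nbrs] - verts[v], axis=-1).mean()
                     for v, nbrs in enumerate(one_ring)])

def influence_values(neutral, expr, one_ring):
    """Per-vertex compress/stretch influence values (Eqs. 10 and 11)."""
    e_n = mean_edge_length(neutral, one_ring)
    e_p = mean_edge_length(expr, one_ring)
    compress = np.maximum(e_n - e_p, 0.0)  # shortened edges: skin is compressed
    stretch = np.maximum(e_p - e_n, 0.0)   # lengthened edges: skin is stretched
    return compress, stretch
```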
Expression Texture Generation Networks.
Given a single neutral scan with an albedo map, in order to predict high-fidelity albedo, specular intensity, and displacement maps for different expressions, we propose a cascade architecture, as shown in Fig. 7. We first take the neutral geometry with its albedo map, together with the target expression offset from the neutral geometry, as input to predict the albedo map offset of the target expression. The predicted offset is then added to the neutral albedo map to generate the expression albedo map as an intermediate result, which is combined with the input of the first network and fed into the second network. The second network then infers the specular intensity and displacement maps. Both networks are pix2pixHD [Wang et al. 2018b] models, which contain an encoder with several CNN layers, followed by a couple of ResNet blocks, and a decoder with a similar architecture. The reasons for using a cascade network with an expression albedo map as an intermediate result are: (1) the specular intensity and displacement maps generated using the albedo map as a prior have fewer artifacts and higher quality; and (2) this architecture allows us to handle incomplete training data (some subjects do not have specular intensity and displacement maps). In particular, we separate the displacement map into low-frequency and high-frequency components during training, following Huynh et al. [2018] and Yamaguchi et al. [2018], to make the problem more tractable, and merge them together before use. Both the input and output of the two networks have 1K × 1K resolution. Furthermore, we up-scale all the resulting 1K maps to 4K × 4K using a pre-trained super-resolution network [Ledig et al. 2017]. In Fig. 8, we show a complete set of expression textures generated by our networks.
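Structurally, the cascade amounts to two conditional image-to-image stages, the second conditioned on the first stage's albedo prediction. A sketch of the data flow only; pix2pixHD internals are omitted, and the module and variable names are assumptions.

```python
import torch

def infer_expression_textures(net_albedo, net_reflectance,
                              neutral_geo, neutral_albedo, expr_offset):
    """Two-stage texture inference (Fig. 7); all tensors are (B, C, H, W)
    UV-space maps at 1K. The 4K up-scaling is a separate network."""
    cond = torch.cat([neutral_geo, neutral_albedo, expr_offset], dim=1)
    albedo = neutral_albedo + net_albedo(cond)    # stage 1 predicts an offset
    cond2 = torch.cat([cond, albedo], dim=1)      # stage 2 uses albedo as prior
    spec, disp_lo, disp_hi = net_reflectance(cond2)
    return albedo, spec, disp_lo + disp_hi        # merge displacement bands
```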
Compress and Stretch Map Extraction.
We design an algorithm to extract compress and stretch maps from the predicted expression textures, guided by the influence maps, as shown in Fig. 11. Let $I^i$ be the influence map of the $i$-th expression, with influence value $I^i(x, y)$ at pixel $(x, y)$. We first normalize the influence maps of all expressions with a weighted-sum strategy to ensure spatial consistency among the expressions (taking the compress map as an example):

$$\hat{I}^i_{Compress}(x, y) = \frac{e^{I^i_{Compress}(x, y)}}{\sum_i e^{I^i_{Compress}(x, y)}}, \quad (12)$$

where $\hat{I}^i_{Compress}$ is the normalized influence map of $I^i_{Compress}$ ($i = 1 \ldots N$) and $N$ is the number of expressions. Once we have the normalized influence maps, the compress map is computed as

$$T_{Compress}(x, y) = \sum_i \hat{I}^i_{Compress}(x, y)\, T^i(x, y), \quad (13)$$

where $T^i$ is a texture of the $i$-th expression, which can be any of the albedo, specular, and displacement maps. The stretch maps are computed analogously. Finally, we obtain compress and stretch maps for the albedo, specular, and displacement maps, respectively.

Texture Sampling at Run-Time.
When using the dynamic assets for rendering in run-time applications such as tracking and animation, we first solve for the blending weights of each input expression using the personalized blendshapes. These blending weights, combined with a set of pre-defined influence maps of the blendshapes, are used to sample the current dynamic textures from the compress and stretch maps. The dynamic textures are generated as

$$T(x, y) = T_N(x, y) + \sum_{i=1}^{N} \Big( \alpha_i \hat{I}^i_{Compress}(x, y)\,\big(T_{Compress}(x, y) - T_N(x, y)\big) + \alpha_i \hat{I}^i_{Stretch}(x, y)\,\big(T_{Stretch}(x, y) - T_N(x, y)\big) \Big), \quad (14)$$

where $T_N$ is the static texture of the neutral expression, $T_{Compress}$ and $T_{Stretch}$ are the compress and stretch textures, $\hat{I}^i_{Compress}$ and $\hat{I}^i_{Stretch}$ are the influence maps of the $i$-th blendshape, and $\alpha_i$ is its blending weight.
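A sketch of the extraction and run-time sampling in Eqs. 12-14, vectorized over a stack of N expression (or blendshape) maps; names and shapes are illustrative assumptions.

```python
import numpy as np

def extract_map(textures, influences):
    """Fuse per-expression textures into one compress (or stretch) map
    (Eqs. 12 and 13). textures: (N, H, W, C); influences: (N, H, W)."""
    w = np.exp(influences)
    w /= w.sum(axis=0, keepdims=True)       # per-pixel softmax normalization
    return (w[..., None] * textures).sum(axis=0)

def sample_dynamic_texture(t_n, t_comp, t_str, i_comp, i_str, alphas):
    """Run-time dynamic texture (Eq. 14).

    t_n, t_comp, t_str: (H, W, C) neutral, compress, and stretch textures
    i_comp, i_str:      (N, H, W) normalized per-blendshape influence maps
    alphas:             (N,) blending weights of the tracked expression
    """
    w_c = np.tensordot(alphas, i_comp, axes=1)[..., None]
    w_s = np.tensordot(alphas, i_str, axes=1)[..., None]
    return t_n + w_c * (t_comp - t_n) + w_s * (t_str - t_n)
```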
Fig. 12. Selected FACS units from the Light Stage dataset. From left to right: Neutral, Eye_close_Lip_corner_Puller, Eyes_Up_Lip_Funneler, Inner_Brow_Raiser_Dimpler, Upper_Lip_Raiser_Lower_Lip_Depressor_Outer_Brow_Raiser, Brow_Lowerer_Inner_Brow_Raiser_Lip_Presser.

Fig. 13. Laplacian deformation results from the neutral mesh to the target expression model using (a) landmarks only and (b) dense optical flow correspondence. (c) Target expression.

6 SECONDARY COMPONENTS
In addition to the primary dynamic assets (face geometry and textures) generated by our networks, we also include secondary components (e.g., eyeballs, lacrimal fluid, eyelashes, teeth, and gums) in our avatar, as shown in Fig. 14. We handcrafted a set of generic blendshapes covering all the primary and secondary parts. We use this set of generic blendshapes to linearly fit each expression generated by our networks, based on corresponding vertices in the facial regions; the coefficients computed on the primary parts then drive the secondary components, so that, for example, the eyelashes travel with the eyelids (a fitting sketch is given below). The linearly fitted secondary elements are combined with the primary facial parts to form an integrated face model. Except for the eyeballs, the secondary parts share a set of generic textures across all subjects. For the eyeball textures, we adopt an eyeball asset database [Kollar 2019] with 90 different eye textures (pupil patterns) to match the input subjects.
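Fitting the generic rig and reusing the coefficients for a secondary part is a constrained least-squares solve. A sketch in which scipy's non-negative solver stands in for whatever solver the authors actually used; all names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def fit_and_drive(expr, neutral, face_basis, part_neutral, part_basis):
    """Solve blending weights on the facial vertices, then apply them to a
    secondary component so that, e.g., eyelashes follow the eyelids.

    expr, neutral: (Vf, 3) expression / neutral positions on face vertices
    face_basis:    (N, Vf, 3) generic blendshape offsets on face vertices
    part_*:        the same quantities for the secondary part
    """
    A = face_basis.reshape(len(face_basis), -1).T   # (3*Vf, N)
    b = (expr - neutral).ravel()
    alphas, _ = nnls(A, b)                          # keeps weights non-negative
    return part_neutral + np.tensordot(alphas, part_basis, axes=1)
```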
7 DATA PREPARATION
The facial scan dataset used for training comes from a combined source of aligned face models with 4K-resolution textures and geometries aligned to a known topology [Li et al. 2020]. The dataset consists of 178 scanned subjects divided into two sets, one of 78 subjects (Light Stage) and one of 100 subjects ([Triplegangers 2019]), performing 26 and 20 static FACS expressions respectively. The FACS expressions are fixed, which enables labeling of the corresponding weights in our set of template blendshapes. This is particularly useful when isolating orthogonal shapes that are combined within a scanning session; one such example is the combination of action unit 1 (Inner Brow Raiser) and action unit 14 (Dimpler) [Ekman and Friesen 1978]. This makes it possible to significantly reduce the number of scans needed (Fig. 12).

The assumptions that have to hold for learning corresponded face morphologies as described in Sec. 3 are: (1) a rigid transformation of each subject's skull can be found for every expression the subject performs; (2) sparse correspondence among subjects needs to be established for a common parameterization to be usable; and (3) dense correspondence among expressions needs to be established for each subject, to track minute changes in skin deformation using texture maps. Next, we describe how these problems are solved to generate the desired dataset.
Fig. 14. Our face model consists of multiple parts, including the face, eyes, eye blend mesh, lacrimal fluid, eye occlusion, eyelashes, teeth, gums, and tongue.
Neutral Scans Registration.
First, a linear 3D morphable face (PCA) model is used to fit the neutral face of a scanned subject, reconstructed using multi-view stereo [Hsieh et al. 2015]. Second, the fitted model is further deformed using a non-rigid iterative closest point method [Li et al. 2008] constrained by facial landmarks [Sagonas et al. 2016]. Additional Laplacian mesh surface warping is applied for surface detail reconstruction [Li et al. 2009].
Expression Scans Registration.
We first estimate the blendshape expressions from our template set using the same algorithm, but vary the blendshape weights together with the identity PCA weights, followed by a landmark refinement step. We further introduce a Laplacian deformation step with dense constraints based on multi-view 2D optical flow between the current expression and the neutral expression, to densely correspond expressions belonging to the same subject [Fyffe et al. 2017]; see Fig. 13.
By leveraging polarized spherical gradient illumination [Ghosh et al. 2011; Ma et al. 2007], we can compute skin micro-structure and material intrinsics such as diffuse albedo and specularity, the same quantities our pixel translation networks later infer. Specifically, these maps are computed on the fixed, aligned topology provided by the aforementioned morphable face model.
Our blendshape model is based on the naming convention of Apple's ARKit, with additional modifications enabling asymmetries for the eyebrow shapes. The shapes were computed by fitting a set of around 50 scanned neutral faces along with their performed FACS shapes. By computing averages over all subjects while keeping each expression fixed, we could find reasonable averages of each shape, which could then be artistically isolated to maintain linear independence and semantic meaning, and to avoid self-intersections.
8 RESULTS
We split our data into two subsets: a training set (137 subjects) and a testing set (41 subjects). Each subset covers a wide span of age, gender, and race. We train our blendshape generation networks using the RMSProp optimizer with a fixed learning rate of 0.0001 and a batch size of 4. The texture generation network is optimized with the Adam optimizer with a fixed learning rate of 0.0002 and a batch size of 1. We train the Estimation Stage and the Tuning Stage for about 50,000 and 60,000 iterations, respectively, on an NVIDIA GeForce RTX 2080 GPU, and train the texture generation model on an NVIDIA Tesla V100.
Table 1. Run time for each component in our framework.
Component                                      Time (ms)
Estimation Stage (Single Branch)               2.386
Tuning Stage                                   2.200
Texture Generation - Albedo map                130.9
Texture Generation - Displacement & Specular   398.1
Texture Generation - Up-scaling                3801

Run Time.
We record the run time of each component in an end-to-end system test (Table 1). Testing of our blendshape generation model was performed on an NVIDIA GeForce RTX 2080 GPU, while texture generation was performed on an NVIDIA Tesla V100. Compared to the standard high-resolution avatar generation pipeline, which requires weeks or months of intensive manual work along with many captured reference expressions, our proposed approach is fast, low-cost, and robust, with the high-resolution training data ensuring output avatar quality.
Results.
In Fig. 15, we show selected expressions of novel subjects rendered using all the assets automatically generated by our framework from different sources of input data. The results show that our generated dynamic textures capture mid-frequency details such as wrinkles and folds. In particular, the generated blendshapes of different individuals show that our approach captures user-specific motion properties (e.g., Mouth Right in rows two, four, and six) while preserving the semantics. Note that all the generated subjects are unseen by the networks. The input test data from 3DScanstore [2019] and the low-quality data captured by a mobile device come from a different domain and have never been observed by our networks; these results indicate the robustness of our framework.
Comparison and Evaluation.
In Fig. 16, by combining the same neutral with corresponding personalized blendshape units (Jaw Open and Mouth Right) belonging to different individuals, we show that our network successfully imposes user-specific motion features on the template blendshapes.

In Fig. 17, we show fitting results for an extreme expression using the template blendshapes and using our generated personalized blendshapes. The results indicate that our generated personalized blendshapes perform better for non-rigid deformation (e.g., the double chin when opening the mouth).

In Fig. 18, we demonstrate the influence of personalized blendshapes on reconstruction and tracking accuracy by swapping the blendshapes of two subjects during expression tracking. The results show that personalized blendshapes are more expressive for the input identity in terms of tracking accuracy, especially in facial parts with larger, more non-linear motion (e.g., the mouth). This also demonstrates the effectiveness of our network, since one of our network objectives is a better reconstruction of the scanned expressions.

In Fig. 19, we further compare our generated blendshapes with the template blendshapes and with the method of Li et al. [2010]. The results show that our approach is comparable to Li et al. [2010] at imposing personality on template blendshapes; note that Li et al. [2010] use 26 scanned reference expressions for optimization, while our results are obtained from a single neutral scan. Another observation is that our deep learning-based method produces more robust results with fewer artifacts (e.g., the left mouth corner on the blendshape Mouth Left).

In Fig. 20, we show dynamic displacement maps generated by our framework for novel subjects. The results show the effectiveness of our displacement network, which infers mid-frequency details (e.g., wrinkles) as well as high-frequency mesoscopic details.

In Fig. 23, we show results and comparisons of our generated dynamic textures on different subjects. Compared to the static albedo from the input neutral, our generated dynamic albedo predicts the wrinkles and folds caused by local self-occlusion of mid-frequency geometry changes during deformation. The results also show that our predicted dynamic specular and displacement maps add mesoscopic details on top of the diffuse albedo, greatly improving the visual realism of the rendering, which is important for high-end applications.
Fig. 15. Expressions reconstructed by face rig assets generated by our framework with inputs from multiple sources. From left to right: Column 1: input neutral, including geometry and albedo. Columns 2 to 4: selected reconstructed expressions. Columns 5 to 7: selected blendshape units. From top to bottom: Rows 1 and 2: input neutral from Triplegangers [Triplegangers 2019]. Rows 3 and 4: input neutral from online resources [3DScanstore 2019]. Rows 5 and 6: input neutral from the Light Stage testing set. Row 7: input neutral from iPhone X ARKit; this last example shows that our method can also be applied to data captured by a low-quality device, although the low-resolution input may reduce the resulting quality.
Fig. 16. Demonstration of the customized identity of individuals in our generated blendshape expressions. We combine blendshape units from different individuals with the same template neutral (shown in orange). Row 1: source individuals. Row 2: personalized Jaw_Open of the individuals in row one combined with the template neutral. Row 3: personalized Mouth_Right of the individuals in row one combined with the template neutral.

Fig. 17. Comparison of extreme expression fitting using template blendshapes and our generated personalized blendshapes. Left: fitting results using template blendshapes. Middle: fitting results using our generated personalized blendshapes. Right: ground truth expression.
Fig. 18. Numerical analysis of the expressiveness of personalized blendshapes in expression tracking, by swapping blendshapes. (a) Neutrals of two individuals. (b) Reconstruction error using the personalized blendshapes of the counterpart individual. (c) Reconstruction error using their own personalized blendshapes. (d) Target expressions.
Fig. 19. Comparison of selected generated blendshape units with the generic template and with Li et al. [2010]. Row 1: template blendshapes generated by expression transfer from a set of generic blendshapes using the method of Sumner and Popović [2004]. Row 2: blendshapes optimized with the method of Li et al. [2010]. Row 3: our method. Note that the results of Li et al. [2010] are generated from 26 scanned expressions, while ours are from a single neutral input.

Fig. 20. Dynamic displacement maps predicted by our framework. Left: base geometries. Middle: results of applying the generated displacement to the base geometries. Right: close-up comparison before and after applying the dynamic displacement maps.
Table 2. Reconstruction errors between the ground truth expressions and the expressions reconstructed using blendshapes from different methods, on the training and testing datasets.

Method                                  Training ↓   Testing ↓
Template blendshapes                    1.661        1.638
Optimization method [Li et al. 2010]    1.389        1.483
Ours
In Fig. 21, we compare our generated full set of face rig assets with the state-of-the-art paGAN [Nagano et al. 2018]. Note that the base geometry used by paGAN is reconstructed from a single frontal image, while ours is based on a high-quality scan. Compared to paGAN, our avatar shows better quality and much more detail, indicating that a good-quality neutral scan serves the task of high-end avatar generation better. The results also show that the physically-based skin assets greatly improve the avatar rendering quality; the displacement map in our assets captures mid-frequency and pore-level details.
Expression Reconstruction/ Face Tracking.
In Fig. 22, we compare our generated personalized blendshapes with other methods on fitting performance capture sequences. As shown in Fig. 18, smaller fitting errors indicate stronger personalization of the blendshapes. The results show that our generated personalized blendshapes outperform the baseline methods (the template blendshapes and the optimization-based method of Li et al. [2010]) in face tracking accuracy using the same solver. To provide further quantitative evidence, we evaluate face reconstruction on 2,548 expressions in the training dataset and 626 expressions in the testing dataset; the results are listed in Table 2. Blendshapes optimized by Li et al. [2010] and by our method show smaller reconstruction errors on both training and testing data.
Fig. 21. Comparison of face avatars generated by paGAN [Nagano et al. 2018] and by our method. (a) and (b) show two cases of avatars generated from the neutral model in the reference images. In each case, Row 1 shows the avatars generated by paGAN [Nagano et al. 2018], while Row 2 shows our results.
Animation.
In Fig. 24, we show that our generated face rig assets can be used directly for animation. Please refer to the accompanying video material for more results.

9 CONCLUSION
We have demonstrated an end-to-end framework for high-quality personalized face rig and asset generation from a single scan. Our face rig assets include a set of personalized blendshapes, physically-based dynamic textures, and secondary facial components (including teeth, eyeballs, and eyelashes). Compared to previous automatic avatar and facial rig generation approaches, which either require a considerable number of person-specific scans or can only produce a relatively low-fidelity avatar, our framework requires only a single neutral scan as input and can produce plausible identity attributes, including physically-based dynamic textures of facial skin. This characteristic is key to creating compelling animation-ready avatars at scale.

We achieve the above objective by modeling the correlation between identity and personalized blendshapes using an extensive dataset of high-resolution facial scans. In particular, our generated dynamic textures add details from mid-frequencies (wrinkles) to mesoscopic ones (pore level). Our automatically generated face rig assets are valuable for real-world production pipelines, as these high-fidelity initial models can be provided to artists for fine-tuning or simply used for secondary characters in crowds. Our proposed method is fast, robust, and lightweight, allowing production studios to simply scan the neutral face of a person and immediately obtain a high-quality facial rig. An interesting insight from our experiments is that the identity alone seems to be enough for a plausible inference of personalized facial appearance and dynamic expressions. In addition to our framework, we have also introduced a novel self-supervised deep neural network training approach to deal with the case when no ground truth data is available, which in our case is the personalized blendshapes.
Limitations and Future Work.
As a deep learning approach, the effectiveness of our algorithm relies on the variety and volume of the training data in our database. In particular, facial expressions that are specific to young subjects could be improved, due to the lack of young subjects in our current database. For the same reason, our framework also does not perform well on subjects with facial hair, as shown in Fig. 25. We plan to augment our database to cover more diversity and appearance variations.

Our template model consists of 55 blendshape vectors, which can recover most everyday expressions and is commonly used in lightweight applications. However, certain extreme expressions still cannot be represented by our model. Since our proposed network architecture can be adapted to arbitrary template blendshapes, we are interested in exploring more sophisticated blendshape rigs that consist of hundreds to thousands of expressions, such as the ones used in film production. We currently use generic eye and teeth models for all generated avatars; an interesting direction would be to explore how to automatically generate personalized eyes [Bérard et al. 2016, 2019] and teeth [Velinov et al. 2018; Wu et al. 2016] as well.
ACKNOWLEDGMENTS
We thank Liwen Hu from Pinscreen for the fruitful discussions and for his help with this paper. This research is funded in part by the ONR YIP grant N00014-17-S-FO14, the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005, Adobe, and Sony. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Fig. 22. Comparison on the task of face fitting using different methods. Columns, left to right: ground-truth expression, FaceWarehouse [Cao et al. 2014], FLAME [Li et al. 2017], Template, Li et al. [2010], and Ours.

Fig. 23. Results and comparison on dynamic textures. (a) Input static albedo and expression renders. (b) Our generated dynamic albedo for the specific expression, and its renders. (c) Our generated dynamic specular and displacement maps, and renders using the full set of generated assets (dynamic albedo, specular intensity, and displacement). (d) From top to bottom: close-ups of the skin details of (a), (b), and (c).

Fig. 24. Animation sequences using the full set of our generated assets. Row 1: source sequences. Rows 2-3: target sequences driven by the source sequences.

Fig. 25. A failure case of our texture generation model. In this case, we first extract an albedo map from an image (taken from the CelebA dataset [Liu et al. 2015]), then feed this map to our texture generation network. From left to right: input neutral static albedo map; dynamic albedo of one expression generated by our network; close-up details of the static (top) and dynamic (bottom) albedo. Note that our result has slight distortion and miscoloration in some areas, mainly due to the limited quality of the input image, baked-in lighting, and the individual's beard, which our network did not learn to handle during training.

REFERENCES
Oleg Alexander, Mike Rogers, William Lambeth, Matt Chiang, and Paul Debevec. 2009. The Digital Emily Project: Photoreal Facial Modeling and Animation. In ACM SIGGRAPH 2009 Courses (SIGGRAPH '09). Article 12.
Brian Amberg, Reinhard Knothe, and Thomas Vetter. 2008. Expression Invariant 3D Face Recognition with a Morphable Model. In International Conference on Automatic Face Gesture Recognition. 1-6.
Timur Bagautdinov, Chenglei Wu, Jason Saragih, Pascal Fua, and Yaser Sheikh. 2018. Modeling Facial Geometry Using Compositional VAEs. In CVPR.
Yancheng Bai and Bernard Ghanem. 2017. Multi-branch fully convolutional network for face detection. arXiv preprint arXiv:1707.06330 (2017).
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-Quality Single-Shot Capture of Facial Geometry. ACM Trans. Graph. 29, 4, Article 40 (2010).
Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. 2011. High-Quality Passive Facial Performance Capture Using Anchor Frames. ACM Trans. Graph. 30, 4, Article 75 (2011).
Pascal Bérard, Derek Bradley, Markus Gross, and Thabo Beeler. 2016. Lightweight Eye Capture Using a Parametric Model. ACM Trans. Graph. 35, 4, Article 117 (2016).
Pascal Bérard, Derek Bradley, Markus Gross, and Thabo Beeler. 2019. Practical Person-Specific Eye Rigging. Computer Graphics Forum (2019). https://doi.org/10.1111/cgf.13650
Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99). ACM Press/Addison-Wesley Publishing Co., USA, 187-194. https://doi.org/10.1145/311535.311556
James Booth, Epameinondas Antonakos, Stylianos Ploumpis, George Trigeorgis, Yannis Panagakis, and Stefanos Zafeiriou. 2017. 3D Face Morphable Models "In-the-Wild". In CVPR.
James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. 2016. A 3D morphable model learnt from 10,000 faces. In CVPR.
Sofien Bouaziz, Yangang Wang, and Mark Pauly. 2013. Online Modeling for Realtime Facial Animation. ACM Trans. Graph. 32, 4, Article 40 (2013).
Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. 2010. High Resolution Passive Facial Performance Capture. ACM Trans. Graph. 29, 4, Article 41 (2010), 10 pages.
Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2014. FaceWarehouse: A 3D Facial Expression Database for Visual Computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (2014).
Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. 2016. Real-Time Facial Animation with Image-Based Dynamic Avatars. ACM Trans. Graph. 35, 4, Article 126 (2016).
E. Carrigan, E. Zell, C. Guiard, and R. McDonnell. 2020. Expression Packing: As-Few-As-Possible Training Expressions for Blendshape Transfer. In Computer Graphics Forum, Vol. 39. Wiley Online Library, 219-233.
Dan Casas, Andrew Feng, Oleg Alexander, Graham Fyffe, Paul Debevec, Ryosuke Ichikari, Hao Li, Kyle Olszewski, Evan Suma, and Ari Shapiro. 2016. Rapid Photorealistic Blendshape Modeling from RGB-D Sensors. In CASA.
Anpei Chen, Zhang Chen, Guli Zhang, Kenny Mitchell, and Jingyi Yu. 2019. Photo-Realistic Facial Details Synthesis From Single Image. In ICCV.
Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. In ICML.
Shiyang Cheng, Michael M. Bronstein, Yuxiang Zhou, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 2019. MeshGAN: Non-linear 3D Morphable Models of Faces. CoRR (2019). http://arxiv.org/abs/1903.10384
Paul Ekman and Wallace V. Friesen. 1978. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press.
G. Fyffe, P. Graham, B. Tunwattanapong, A. Ghosh, and P. Debevec. 2016. Near-Instant Capture of High-Resolution Facial Geometry and Reflectance. In Proceedings of the 37th Annual Conference of the European Association for Computer Graphics (EG '16). Eurographics Association, Goslar, DEU, 353-363.
Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. 2014. Driving high-resolution facial scans with video performance capture. ACM Trans. Graph. 34, 1 (2014), 1-14.
Graham Fyffe, Koki Nagano, Loc Huynh, Shunsuke Saito, Jay Busch, Andrew Jones, Hao Li, and Paul Debevec. 2017. Multi-View Stereo on Consistent Face Topology. In Computer Graphics Forum, Vol. 36. Wiley Online Library, 295-309.
Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Pérez, and Christian Theobalt. 2016. Reconstruction of Personalized 3D Face Rigs from Monocular Video. ACM Trans. Graph. 35, 3, Article 28 (2016), 15 pages.
Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and Paul Debevec. 2011. Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph. 30, 6 (2011), 129.
Paulo Gotardo, Jérémy Riviere, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. 2018. Practical Dynamic Facial Appearance Modeling and Acquisition. ACM Trans. Graph. 37, 6, Article 232 (2018), 13 pages.
Pei-Lun Hsieh, Chongyang Ma, Jihun Yu, and Hao Li. 2015. Unconstrained realtime facial performance capture. In CVPR.
Liwen Hu, Shunsuke Saito, Lingyu Wei, Koki Nagano, Jaewoo Seo, Jens Fursund, Iman Sadeghi, Carrie Sun, Yen-Chun Chen, and Hao Li. 2017. Avatar Digitization from a Single Image for Real-Time Rendering. ACM Trans. Graph. 36, 6, Article 195 (2017).
Haoda Huang, Jinxiang Chai, Xin Tong, and Hsiang-Tao Wu. 2011. Leveraging Motion Capture and 3D Scanning for High-Fidelity Facial Performance Acquisition. ACM Trans. Graph. 30, 4, Article 74 (2011).
Loc Huynh, Weikai Chen, Shunsuke Saito, Jun Xing, Koki Nagano, Andrew Jones, Paul Debevec, and Hao Li. 2018. Mesoscopic Facial Geometry Inference Using Deep Neural Networks. In CVPR.
Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Trans. Graph. 34, 4 (2015), 1-14.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR.
Ira Kemelmacher-Shlizerman. 2013. Internet-based Morphable Model. In ICCV.
Andor Kollar. 2019. Realistic Human Eye. http://kollarandor.com/gallery/3d-human-eye/. Online; accessed 2019-7-30.
Samuli Laine, Tero Karras, Timo Aila, Antti Herva, Shunsuke Saito, Ronald Yu, Hao Li, and Jaakko Lehtinen. 2017. Production-level facial performance capture using deep convolutional neural networks. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 1-10.
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
Jessica Lee, Deva Ramanan, and Rohit Girdhar. 2019. MetaPix: Few-Shot Video Retargeting. arXiv:cs.CV/1910.04742
J. P. Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Fred Pighin, and Zhigang Deng. 2014. Practice and Theory of Blendshape Facial Models. In Eurographics 2014 - State of the Art Reports. The Eurographics Association.
Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. 2009. Robust single-view geometry and motion reconstruction. ACM Trans. Graph. 28, 5 (2009), 1-10.
Hao Li, Robert W. Sumner, and Mark Pauly. 2008. Global Correspondence Optimization for Non-rigid Registration of Depth Scans. In Proceedings of the Symposium on Geometry Processing (SGP '08). Eurographics Association, Aire-la-Ville, Switzerland, 1421-1430. http://dl.acm.org/citation.cfm?id=1731309.1731326
Hao Li, Thibaut Weise, and Mark Pauly. 2010. Example-Based Facial Rigging. ACM Trans. Graph. 29, 4, Article 32 (2010).
Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. 2013. Realtime Facial Animation with On-the-Fly Correctives. ACM Trans. Graph. 32, 4, Article 42 (2013).
Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, et al. 2020. Learning Formation of Physically-Based Face Attributes. In CVPR.
Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a Model of Facial Shape and Expression from 4D Scans. ACM Trans. Graph. 36, 6, Article 194 (2017).
Or Litany, Alex Bronstein, Michael Bronstein, and Ameesh Makadia. 2018. Deformable Shape Completion with Graph Convolutional Autoencoders. In CVPR.
Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo Aila, Jaakko Lehtinen, and Jan Kautz. 2019. Few-shot unsupervised image-to-image translation. In ICCV.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In ICCV.
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Trans. Graph. 37, 4 (2018), 68.
Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. 2007. Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination. In Proceedings of the 18th Eurographics Conference on Rendering Techniques (EGSR '07). Eurographics Association, Goslar, DEU, 183-194.
Wan-Chun Ma, Mathieu Lamarre, Etienne Danvoye, Chongyang Ma, Manny Ko, Javier von der Pahlen, and Cyrus A. Wilson. 2016. Semantically-aware blendshape rigs from facial performance measurements. In SIGGRAPH ASIA 2016 Technical Briefs. ACM, 3.
Arnold Maya. 2019. Maya Arnold renderer. https://arnoldrenderer.com/. Online; accessed 2019-11-22.
Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, and Hao Li. 2018. paGAN: Real-Time Avatars Using Dynamic Textures. ACM Trans. Graph. 37, 6, Article 258 (2018), 12 pages.
Jun-yong Noh and Ulrich Neumann. 2001. Expression Cloning. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '01).
Christopher Oat. 2007. Animated wrinkle maps. In ACM SIGGRAPH 2007 Courses. 33-37.
Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. 2016. High-Fidelity Facial and Speech Animation for VR HMDs. ACM Trans. Graph. 35, 6 (2016).
Hayato Onizuka, Diego Thomas, Hideaki Uchiyama, and Rin-ichiro Taniguchi. 2019. Landmark-Guided Deformation Transfer of Template Facial Expressions for Automatic Generation of Avatar Blendshapes. In ICCVW.
Chandan Pawaskar, Wan-Chun Ma, Kieran Carnegie, John P. Lewis, and Taehyun Rhee. 2013. Expression transfer: A system to build 3D blend shapes for facial animation. IEEE, 154-159.
Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. 2018. Generating 3D Faces Using Convolutional Mesh Autoencoders. In ECCV.
Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 2016. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing 47 (2016), 3-18.
Matan Sela, Elad Richardson, and Ron Kimmel. 2017. Unrestricted Facial Geometry Reconstruction Using Image-to-Image Translation. In ICCV.
Robert W. Sumner and Jovan Popović. 2004. Deformation Transfer for Triangle Meshes. ACM Trans. Graph. 23, 3 (2004), 399-405.
Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. 2017. MoFA: Model-Based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In ICCV.
Justus Thies, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. 2015. Real-Time Expression Transfer for Facial Reenactment. ACM Trans. Graph. 34, 6 (2015).
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. In CVPR.
Luan Tran, Feng Liu, and Xiaoming Liu. 2019. Towards High-fidelity Nonlinear 3D Face Morphable Model. In CVPR.
Luan Tran and Xiaoming Liu. 2018. Nonlinear 3D Face Morphable Model. In CVPR.
Triplegangers. 2019. Triplegangers Face Models. https://triplegangers.com/. Online; accessed 2019-12-21.
Levi Valgaerts, Chenglei Wu, Andrés Bruhn, Hans-Peter Seidel, and Christian Theobalt. 2012. Lightweight Binocular Facial Performance Capture under Uncontrolled Lighting. ACM Trans. Graph. 31, 6 (2012).
Zdravko Velinov, Marios Papas, Derek Bradley, Paulo Gotardo, Parsa Mirdehghan, Steve Marschner, Jan Novák, and Thabo Beeler. 2018. Appearance Capture and Modeling of Human Teeth. ACM Trans. Graph. 37, 6, Article 207 (Dec. 2018).
Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popović. 2005. Face Transfer with Multilinear Models. ACM Trans. Graph. 24, 3 (2005), 426-433.
Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. 2019. Few-shot Video-to-Video Synthesis. In NeurIPS.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. Video-to-Video Synthesis. In NeurIPS.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018b. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In CVPR.
Shih-En Wei, Jason Saragih, Tomas Simon, Adam W. Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh. 2019. VR facial animation via multiview image translation. ACM Trans. Graph. 38, 4 (2019), 1-16.
Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. 2009. Face/Off: live facial puppetry. In SCA '09.
Chenglei Wu, Derek Bradley, Pablo Garrido, Michael Zollhöfer, Christian Theobalt, Markus Gross, and Thabo Beeler. 2016. Model-Based Teeth Reconstruction. ACM Trans. Graph. 35, 6, Article 220 (2016).
Shugo Yamaguchi, Shunsuke Saito, Koki Nagano, Yajie Zhao, Weikai Chen, Kyle Olszewski, Shigeo Morishima, and Hao Li. 2018. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Trans. Graph. 37, 4 (2018), 1-14.
Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. 2020. FaceScape: a Large-scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction. In CVPR.
Li Zhang, Noah Snavely, Brian Curless, and Steven M. Seitz. 2004. Spacetime Faces: High Resolution Capture for Modeling and Animation. ACM Trans. Graph. 23, 3 (2004).
Yuxiang Zhou, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. 2019. Dense 3D Face Decoding Over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders. In CVPR.