DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation
Mohammad Rami Koujan, Anastasios Roussos, Stefanos Zafeiriou
College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK
Department of Computing, Imperial College London, UK
Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH-ICS), Greece
FaceSoft.io, London, UK
Abstract
Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB images is a highly challenging problem, with numerous applications ranging from facial expression recognition to facial reenactment. In this work, we propose DeepFaceFlow, a robust, fast, and highly-accurate framework for the dense estimation of 3D non-rigid facial flow between pairs of monocular images. Our DeepFaceFlow framework was trained and tested on two very large-scale facial video datasets, one of them of our own collection and annotation, with the aid of an occlusion-aware and 3D-based loss function. We conduct comprehensive experiments probing different aspects of our approach and demonstrating its improved performance against state-of-the-art flow and 3D reconstruction methods. Furthermore, we incorporate our framework into a state-of-the-art full-head facial video synthesis method and demonstrate the ability of our method to better represent and capture the facial dynamics, resulting in highly-realistic facial video synthesis. Given registered pairs of images, our framework generates 3D flow maps at ∼60 fps.
1. Introduction
Optical flow estimation is a challenging computer vision task that has been targeted substantially since the seminal work of Horn and Schunck [16]. The amount of effort dedicated to tackling this problem is largely justified by the potential applications in the field, e.g. 3D facial reconstruction [11, 23, 36], autonomous driving [19], action and expression recognition [30, 21], human motion and head pose estimation [1, 41], and video-to-video translation [37, 22]. While optical flow tracks pixels between consecutive images in the 2D image plane, scene flow, its 3D counterpart, aims at estimating the 3D motion field of scene points at different time steps in three-dimensional space. Therefore, scene flow combines two challenges: 1) 3D shape reconstruction, and 2) dense motion estimation.
Figure 1. We propose a framework for the high-fidelity 3D flow estimation between a pair of monocular facial images. Left-to-right: input pair of RGB images, estimated 3D facial shape of the first image rendered with 3D motion vectors from the first to the second image, warped 3D shape of (1) based on the estimated 3D flow in (3), color-coded 3D flow map of each pixel in (1). For the color coding, see the Supplementary Material.
Scene flow estimation, which can be traced back to the work of Vedula et al. [34], is a highly ill-posed problem due to the depth ambiguity and the aperture problem, as well as occlusions and variations of illumination and pose, which are very typical of in-the-wild images. To address all these challenges, the majority of methods in the literature use stereo or RGB-D images and enforce priors on either the smoothness of the reconstructed surfaces and estimated motion fields [2, 27, 39, 33] or the rigidity of the motion [35].
In this work, we seek to estimate the 3D motion field of human faces from in-the-wild pairs of monocular images, see Fig. 1. The output of our method is the same as in scene flow methods, but the fundamental difference is that we use simple RGB images instead of stereo pairs or RGB-D images as input. Furthermore, our method is tailored for human faces instead of arbitrary scenes. For the problem that we are solving, we use the term "3D face flow estimation". Our designed framework delivers accurate flow estimation in the 3D world rather than the 2D image space. We focus on the human face and the modelling of its dynamics due to its centrality in a myriad of applications, e.g. facial expression recognition, head motion and pose estimation, 3D dense facial reconstruction, full head reenactment, etc. Human facial motion emerges from two main sources: 1) rigid motion due to head pose variation, and 2) non-rigid motion caused by elicited facial expressions and mouth motions during speech. The reliance on only monocular and in-the-wild images to capture the 3D motion of general objects makes the problem considerably more challenging. Such obstacles can be alleviated by injecting our prior knowledge about the object, as well as by constructing and utilising a large-scale annotated dataset. Our contributions in this work can be summarised as follows:
• To the best of our knowledge, there does not exist any method that estimates 3D scene flow of deformable scenes using a pair of simple RGB images as input. The proposed approach is the first to solve this problem, and this is made possible by focusing on scenes with human faces.
• Collection and annotation of a large-scale dataset of human facial videos (more than 12,000), which we call
Face3DVid. With the help of our proposed model-based formulation, each video was annotated with the per-frame: 1) 68 facial landmarks, 2) dense 3D facial shape mesh, 3) camera parameters, and 4) dense 3D flow maps. This dataset will be made publicly available (project's page: https://github.com/mrkoujan/DeepFaceFlow).
• A robust, fast, deep learning-based and end-to-end framework for the dense, high-quality estimation of 3D face flow from only a pair of monocular in-the-wild RGB images.
• We demonstrate both quantitatively and qualitatively the usefulness of our estimated 3D flow in a full-head reenactment experiment, as well as in 4D face reconstruction (see supplementary materials).
The approach we follow starts from the collection and annotation of a large-scale dataset of facial videos, see Section 3 for details. We employ such a rich dynamic dataset in the training process of our entire framework and initialise the learning procedure with this dataset of in-the-wild videos. Different from other scene flow methods, our framework requires only a pair of monocular RGB images and can be decomposed into two main parts: 1) a shape-initialisation network (3DMeshReg) aiming at densely regressing the 3D geometry of the face in the first frame, and 2) a fully convolutional network, termed
DeepFaceFlowNet (DFFNet), that accepts a pair of RGB frames along with the projected 3D facial shape initialisation of the first (reference) frame, provided by 3DMeshReg, and produces a dense 3D flow map at the output.
2. Related Work
The most closely-related works in the literature solve the problems of optical flow and scene flow estimation. Traditionally, one of the most popular ways to tackle these problems had been through variational frameworks. The work of Horn and Schunck [16] pioneered the variational work on optical flow, formulating an energy equation with brightness constancy and spatial smoothness terms. Later, a large number of variational approaches with various improvements were put forward [7, 26, 38, 3, 31]. All of these methods involve a complex optimisation, rendering them computationally very intensive. One of the very first attempts at an end-to-end, CNN-based trainable framework capable of estimating the optical flow was made by Dosovitskiy et al. [10]. Even though their reported results still fall behind state-of-the-art classical methods, their work shows the bright promise of CNNs for this task and that further investigation is worthwhile. Another attempt with similar results to [10] was made by the authors of [28]. Their framework, called
SpyNet, combines a classical spatial-pyramid formulation with deep learning for the estimation of large motions in a coarse-to-fine approach. As a follow-up method, Ilg et al. [18] later used the two structures proposed in [10] in a stacked pipeline,
FlowNet2, for estimating coarse- and fine-scale details of the optical flow, with very competitive performance on the Sintel benchmark. Recently, the authors of [32] put forward a compact and fast CNN model, termed
PWC-Net, that capitalises on pyramidal processing, warping, and cost volumes. They reported the top results on more than one benchmark, namely MPI Sintel final pass and KITTI 2015. Most of the deep learning-based methods rely on synthetic datasets to train their networks in a supervised fashion, leaving a challenging gap when tested on real in-the-wild images.
Quite different from optical flow, scene flow methods aim at estimating the three-dimensional motion vectors of scene points from stereo or RGB-D images. The first attempt to extend optical flow to 3D was made by Vedula et al. [34]. Their work assumed that both the structure and the correspondences of the scene are known. Most of the early attempts at scene flow estimation relied on a sequence of stereo images to solve the problem. With the growing popularity of depth cameras, more pipelines utilised RGB-D data as an alternative to stereo images. All these methods follow the classical way of scene flow estimation, without using any deep learning techniques or big datasets. The authors of [24] led the first effort to use deep learning features to estimate optical flow, disparity, and scene flow from a big dataset. The method of Golyanik et al. [13] estimates the 3D flow from a sequence of monocular images with sufficient diversity in non-rigid deformation and 3D pose, as the method relies heavily on NRSfM. The lack of such diversity, which is common for the type of in-the-wild videos we deal with, could result in degenerate solutions. On the contrary, our method requires only a pair of monocular images as input. Using only monocular images, Brickwedde et al. [6] target dynamic street scenes but impose a strong rigidity assumption on scene objects, making their approach unsuitable for facial videos. As opposed to other state-of-the-art approaches, we rely on minimal information to solve the highly ill-posed 3D facial scene flow problem. Given only a pair of monocular RGB images, our novel framework is capable of accurately and robustly estimating the 3D flow between them at a rate of ∼60 fps.
3. Dataset Collection and Annotation
Given the highly ill-posed nature of non-rigid 3D facial motion estimation from a pair of monocular images, the size and variability of the training dataset are crucial [18, 10]. For this reason, we rely on a large-scale training dataset (
Face3DVid), which we construct by collecting tens of thousands of facial videos, performing dense 3D face reconstruction on them and then estimating effective pseudo-ground-truth 3D flow maps.
First of all, we model the 3D face geometry using 3DMMs and an additive combination of identity and expression variation. This is similar to several recent methods, e.g. [40, 9, 23, 12]. In more detail, let x = [x_1, y_1, z_1, ..., x_N, y_N, z_N]^T ∈ R^{3N} be the vectorised form of any 3D facial shape consisting of N
3D vertices. We consider that x can be represented as:

x(i, e) = x̄ + U_id i + U_exp e,    (1)

where x̄ is the overall mean shape vector, U_id ∈ R^{3N×n_i} is the identity basis with n_i = 157 principal components (n_i ≪ N), U_exp ∈ R^{3N×n_e} is the expression basis with n_e = 28 principal components (n_e ≪ N), and i ∈ R^{n_i}, e ∈ R^{n_e} are the identity and expression parameters, respectively. The identity part of the model originates from the Large Scale Face Model (LSFM) [4] and the expression part originates from the work of Zafeiriou et al. [40].
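As an illustration of Eq. (1), the following is a minimal NumPy sketch of synthesising a facial shape from identity and expression parameters; the array names are illustrative, and the bases are assumed to be loaded from the LSFM identity model and the expression model mentioned above.

```python
import numpy as np


def face_shape(mean_shape, U_id, U_exp, ident, expr):
    """Instance of the 3DMM in Eq. (1): x(i, e) = x_bar + U_id i + U_exp e.

    mean_shape : (3N,)    vectorised mean shape x_bar
    U_id       : (3N, n_i) identity basis (n_i = 157 components)
    U_exp      : (3N, n_e) expression basis (n_e = 28 components)
    ident      : (n_i,)   identity parameters i
    expr       : (n_e,)   expression parameters e
    """
    x = mean_shape + U_id @ ident + U_exp @ expr
    return x.reshape(-1, 3)   # N x 3 array of (x, y, z) vertex coordinates
```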
To create effective pseudo-ground truth on tens of thousands of videos, we need to perform 3D face reconstruction that is both efficient and accurate. For this reason, we choose to fit the adopted 3DMM model to the sequence of facial landmarks over each video. Since this process is carried out only during training, we are not constrained by the need for online performance. Therefore, similarly to [9], we adopt a batch approach that takes into account the information from all video frames simultaneously and exploits the rich dynamic information usually contained in facial videos. It is an energy minimisation that fits the combined identity and expression 3DMM model to the facial landmarks from all frames of the input video simultaneously. More details are given in the Supplementary Material.
To create our large-scale training dataset, we start from a collection of 12,000 RGB videos with 19 million frames in total and 2,500 unique identities. We apply the 3D face reconstruction method outlined in Sec. 3.1, together with some video pruning steps to omit cases where the automatic estimations had failed. Our final training set consists of the videos of our collection that survived the video pruning steps (81.25% of the initial dataset). For more details and exemplar visualisations, please refer to the Supplementary Material.
Given a pair of images I^1 and I^2 coming from a video in our dataset, together with their corresponding 3D shapes S^1, S^2 and pose parameters R^1, t^1_{3d}, R^2, t^2_{3d}, the 3D flow map of this pair is created as follows:

F(x, y)|_{(x,y)∈M} = f_c^2 (R^2 [S^2(t_j^1), S^2(t_j^2), S^2(t_j^3)] b + t^2_{3d}) − f_c^1 (R^1 [S^1(t_j^1), S^1(t_j^2), S^1(t_j^3)] b + t^1_{3d}),    (2)

where M is the set of foreground pixels in I^1, S^k ∈ R^{3×N} is the matrix storing column-wise the x-y-z coordinates of the N-vertex 3D shape of I^k, R^k ∈ R^{3×3} is the rotation matrix, t^k_{3d} ∈ R^3 is the 3D translation, f_c^1 and f_c^2 are the scales of the orthographic camera for the first and second image respectively, t_j = [t_j^1, t_j^2, t_j^3] (t_j^i ∈ {1, ..., N}) is the triangle of T visible from pixel (x, y) in image I^1, as detected by our hardware-based renderer, T is the set of all triangles composing the mesh of S^1, and b ∈ R^3 holds the barycentric coordinates of pixel (x, y) inside the projected triangle t_j on image I^1. All background pixels in Equation 2 are set to zero and ignored during training with the help of a masked loss. It is evident from Equation 2 that we do not care about the vertices visible only in the second frame, and only track in 3D those that were visible in image I^1 to produce the 3D flow vectors. Additionally, with this flow representation, the x-y coordinates alone of the 3D flow map F(x, y) directly designate the 2D optical flow components in the image space.
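To make the construction of Eq. (2) concrete, below is a minimal NumPy sketch of assembling the pseudo-ground-truth 3D flow, assuming a rasteriser has already produced, for every foreground pixel of the first frame, the index of the visible triangle and its barycentric coordinates (information our hardware-based renderer provides); the function and argument names are illustrative rather than the exact implementation.

```python
import numpy as np


def flow_map_from_meshes(S1, S2, R1, t1, R2, t2, fc1, fc2,
                         tri_idx, bary, triangles, mask):
    """Pseudo-ground-truth 3D flow of Eq. (2), one vector per foreground pixel.

    S1, S2    : (3, N) vertex coordinates of the two 3D shapes
    R1, R2    : (3, 3) rotation matrices;  t1, t2: (3,) translations
    fc1, fc2  : orthographic camera scales of the two frames
    tri_idx   : (H, W) index into `triangles` of the triangle visible from
                each pixel of frame 1 (-1 for background)
    bary      : (H, W, 3) barycentric coordinates of each pixel
    triangles : (T, 3) vertex indices of every mesh triangle
    mask      : (H, W) boolean foreground mask of frame 1
    """
    H, W = mask.shape
    F = np.zeros((H, W, 3), dtype=np.float32)
    ys, xs = np.nonzero(mask)
    verts = triangles[tri_idx[ys, xs]]           # (P, 3) vertex ids per pixel
    b = bary[ys, xs][..., None]                  # (P, 3, 1)

    # Barycentric interpolation of the 3D point seen by each pixel, on both meshes.
    p1 = (S1[:, verts].transpose(1, 2, 0) * b).sum(axis=1)   # (P, 3)
    p2 = (S2[:, verts].transpose(1, 2, 0) * b).sum(axis=1)   # (P, 3)

    # Pose and scale both points, then take the difference (second minus first frame).
    q1 = fc1 * (p1 @ R1.T + t1)
    q2 = fc2 * (p2 @ R2.T + t2)
    F[ys, xs] = q2 - q1
    return F
```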
4. Proposed Framework
Our overall designed framework is demonstrated in Figure 2. We expect as input two RGB images I^1, I^2 ∈ R^{W×H×3} and produce at the output an image F ∈ R^{W×H×3} encoding the per-pixel 3D flow from I^1 to I^2. The designed framework comprises two main stages: 1) 3DMeshReg: 3D shape initialisation and encoding of the reference frame I^1, and 2) DeepFaceFlowNet (DFFNet): 3D face flow prediction. The entire framework was trained in a supervised manner, utilising the collected and annotated dataset (see Section 3.2), and fine-tuned on the 4DFAB dataset [8], after registering the sequence of scans coming from each video in this dataset to our 3D template. Input frames were registered to a fixed-size 2D template with the help of the 68 facial landmarks extracted using [14] and then fed to our framework.
Figure 2. Proposed DeepFaceFlow pipeline for the 3D facial flow estimation. First stage (left): 3DMeshReg works as an initialisation for the 3D facial shape in the first frame. This estimation is rasterised in the next step and encoded in an RGB image, termed Projected Normalised Coordinates Code (PNCC), storing the x-y-z coordinates of each corresponding visible 3D point. Given the pair of images as well as the PNCC, the second stage (right) estimates the 3D flow using a deep fully-convolutional network (
DeepFaceFlowNet).
To robustly estimate the per-pixel 3D flow between a pair of images, we provide the
DFFNet network (Section 4.2) not only with I^1 and I^2, but also with another image that stores a Projected Normalised Coordinates Code (PNCC) of the estimated 3D shape of the reference frame I^1, i.e. PNCC ∈ R^{W×H×3}. The PNCC codes that we consider are essentially images encoding the normalised x, y and z coordinates of the facial vertices visible from each corresponding pixel in I^1, based on the camera's view angle. The inclusion of such images allows the CNN to better associate each RGB value in I^1 with the corresponding point in 3D space, providing the network with a better initialisation in the problem space and establishing a reference 3D mesh that facilitates the warping in 3D space during the course of training. Equation 3 shows how to compute the PNCC image of the reference frame I^1:

PNCC(x, y)|_{(x,y)∈M} = V(S^1, c^1) = P (R^1 [S^1(t_j^1), S^1(t_j^2), S^1(t_j^3)] b + t^1_{3d}),    (3)

where V(·, ·) is the function rendering the normalised version of S^1, c^1 denotes the camera parameters, i.e. rotation angles, translation and scale (R^1, t^1_{3d}, f_c^1), and P is a 3×3 diagonal matrix with main diagonal elements (f_c^1/W, f_c^1/H, f_c^1/D). The multiplication with P scales the posed 3D face with f_c^1 so that it is first in image space coordinates, and then normalises it with the width and height of the rendered image and the maximum z value D computed from our entire dataset of annotated 3D shapes. This results in an image whose RGB channels store the normalised ([0, 1]) x-y-z coordinates of the corresponding rendered 3D shape. The rest of the parameters in Equation 3 are detailed in Section 3.3 and utilised in Equation 2.
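Under the same assumptions as the previous sketch (per-pixel visible triangle and barycentric coordinates from the rasteriser), Eq. (3) reduces to posing the interpolated 3D point and normalising each axis by W, H and the dataset-wide maximum depth D; the following is an illustrative sketch, not the exact renderer implementation.

```python
import numpy as np


def pncc_image(S1, R1, t1, fc1, tri_idx, bary, triangles, mask, D):
    """Projected Normalised Coordinates Code of Eq. (3) for the reference frame.

    Returns an (H, W, 3) image whose channels hold the x-y-z coordinates of the
    3D point visible from each foreground pixel, normalised to [0, 1] by the
    image width/height and the maximum depth D.
    """
    H, W = mask.shape
    pncc = np.zeros((H, W, 3), dtype=np.float32)
    ys, xs = np.nonzero(mask)
    verts = triangles[tri_idx[ys, xs]]                        # (P, 3)
    b = bary[ys, xs][..., None]                               # (P, 3, 1)
    pts = (S1[:, verts].transpose(1, 2, 0) * b).sum(axis=1)   # (P, 3)

    posed = fc1 * (pts @ R1.T + t1)       # posed and scaled 3D points
    posed = posed / np.array([W, H, D])   # per-axis normalisation (diag matrix P)
    pncc[ys, xs] = posed
    return pncc
```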
The PNCC image generation discussed in Equation 3 still requires an estimate of the 3D facial shape S^1 of I^1. We deal with this by training a deep CNN, termed 3DMeshReg, that regresses a dense 3D mesh S^1 through per-vertex 3D coordinate estimates. We use our collected dataset (Face3DVid) and the 4DFAB [8] 3D scans to train this network in a supervised manner. We formulate a loss function composed of two terms:

L(Φ) = (1/N) Σ_{i=1}^{N} ||s_i^{GT} − s_i|| + (1/O) Σ_{j=1}^{O} ||e_j^{GT} − e_j||.    (4)

The first term in the above equation penalises the deviation of the 3D coordinates of each vertex from the corresponding ground-truth vertex (s_i = [x, y, z]^T), while the second term encourages similar edge lengths between vertices in the estimated and ground-truth meshes, where e_j is the ℓ2 distance between the two vertices defining edge j in the original ground-truth 3D template and O is the number of edges. Instead of estimating the camera parameters c^1 separately, which are needed at the very input of the renderer, we assume a Scaled Orthographic Projection (SOP) as the camera model and train the network to regress directly the scaled 3D mesh, by multiplying the x-y-z coordinates of each vertex of frame i with f_c^i.
Given the I^1, I^2 and PNCC images, the 3D flow estimation problem is a mapping F : {I^1, I^2, PNCC} → F ∈ R^{W×H×3}. Using both the annotated Face3DVid dataset detailed in Section 3 and the 4DFAB [8] dataset, we train a fully convolutional encoder-decoder CNN (F), called DeepFaceFlowNet (DFFNet), that takes three images, namely I^1, I^2 and PNCC, and produces the 3D flow estimate from each foreground pixel in I^1 to I^2 as a W×H×3 image. The designed network follows the generic U-Net architecture with skip connections [29] and was inspired particularly by FlowNetC [10], see Figure 3. Differently from FlowNetC, we extend the network to account for the PNCC image at the input and modify the structure to account for the 3D flow estimation task, rather than 2D optical flow. We propose the following two-term loss function:

L(Ψ) = Σ_{i=1}^{L} w_i ||F_i^{GT} − F_i(Ψ)||_F + α ||I^1 − W(F, PNCC; I^2)||_F.    (5)

Figure 3. Architecture of our designed DFFNet for the purpose of estimating the 3D flow between a pair of RGB images.
The first term in Eq. (5) is the endpoint error, which corresponds to a 3D extension of the standard error measure for optical flow methods. It computes the Frobenius norm (||·||_F) of the error between the estimated 3D flow F(Ψ) and the ground truth F^{GT}, with Ψ representing the learnable network weights. In practice, since each fractionally-strided convolution (a.k.a. deconvolution) in the decoder part of our DFFNet produces an estimate of the flow at a different resolution, we compare this multi-resolution 3D flow with downsampled versions of F^{GT}, up until the full resolution at stage L, and use the weighted sum of the Frobenius norm errors as the penalisation term. The second term in Eq. (5) is the photo-consistency error, which assumes that the colour of each point does not change from I^1 to I^2. The warping operation is carried out with the help of the warping function W(·, ·). This function warps the 3D shape of I^1 encoded inside the PNCC image using the estimated flow F and samples I^2 at the vertices of the resulting projected 3D shape. The warping function in Equation 5 was implemented as a differentiable layer that detects occlusions by virtue of our 3D flow and samples the second image (backward warping) in a differentiable manner at the output stage of our DFFNet. The scale α balances the two terms during training.
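A compact PyTorch sketch of the two-term objective in Eq. (5) is given below; the multi-scale predictions and the differentiable warping function are assumed to be provided by the network and the warping layer described above, and the names and signatures are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F


def frob(x):
    """Frobenius norm per batch element."""
    return torch.sqrt((x ** 2).sum(dim=(1, 2, 3)))


def dffnet_loss(flow_preds, flow_gt, img1, img2, pncc, warp_fn, alpha=10.0):
    """Two-term objective of Eq. (5): multi-scale endpoint error + photo-consistency.

    flow_preds : list of (B, 3, h_i, w_i) flow estimates, coarse to fine
    flow_gt    : (B, 3, H, W) ground-truth 3D flow
    img1, img2 : (B, 3, H, W) input RGB frames
    pncc       : (B, 3, H, W) PNCC image of the reference frame
    warp_fn    : differentiable function implementing W(F, PNCC; I2)
    """
    # Endpoint error: compare every decoder scale with a resized ground truth
    # (w_i = 1 for all scales, as in the training setup of the paper).
    epe = 0.0
    for pred in flow_preds:
        gt_i = F.interpolate(flow_gt, size=pred.shape[-2:], mode='bilinear',
                             align_corners=False)
        epe = epe + frob(pred - gt_i).mean()

    # Photo-consistency: the backward-warped second image should match the first.
    warped = warp_fn(flow_preds[-1], pncc, img2)
    photo = frob(img1 - warped).mean()

    return epe + alpha * photo
```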
5. Experiments
In this section, we compare our framework with state-of-the-art methods in optical flow and 3D face reconstruction. We ran all the experiments on an NVIDIA DGX1 machine.
Although the collected
Face3DVid dataset has a wide variety of facial dynamics and identities, captured under a plenitude of set-ups and viewpoints depicting in-the-wild video capture scenarios, it was annotated with pseudo ground-truth 3D shapes, not real 3D scans. Relying only on this dataset for training our framework could therefore result in mimicking the performance of the 3DMM-based estimation, which we ideally want to initialise with and then depart from. Thus, we fine-tune our framework on the 4DFAB dataset [8]. The 4DFAB dataset is a large-scale database of dynamic high-resolution 3D faces, with subjects displaying both spontaneous and posed facial expressions and with the corresponding per-frame 3D scans. We leave a temporal gap between consecutive frames sampled from each video whenever the average 3D flow per pixel between a pair is ≤ 1. In total, image pairs from around 1,600 subjects of Face3DVid and 175 subjects of 4DFAB were used for training and testing purposes. We split the
Face3DVid into training/validation vs test splits (80% vs 20%) in the first phase of the training. Likewise, the 4DFAB dataset was split into training/validation vs test (80% vs 20%) during the fine-tuning. Our pipeline consists of two networks (see Fig. 2):
a) 3DMeshReg: The aim of this network is to accept an input RGB image I^1 and regress the per-vertex (x, y, z) coordinates describing the subject's facial geometry. The ResNet50 [15] network architecture was selected and trained for this purpose, after replacing the output fully-connected (fc) layer with a convolutional layer followed by a linear fc layer that regresses the per-vertex coordinates. This network was trained initially and separately from the rest of the framework on the Face3DVid dataset, using the loss of Eq. (4) (see the sketch below), and then fine-tuned on the 4DFAB dataset [8]. The Adam optimizer [20] was used during training with a learning rate of 0.0001, β1 = 0.9, β2 = 0.999, and batch size 32.
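For reference, here is a minimal PyTorch sketch of the 3DMeshReg objective in Eq. (4), combining the per-vertex term with the edge-length term; the tensor shapes and the edges argument are assumptions made for the example rather than the exact implementation.

```python
import torch


def mesh_reg_loss(pred_verts, gt_verts, edges):
    """Vertex + edge-length loss of Eq. (4).

    pred_verts, gt_verts : (B, N, 3) predicted / ground-truth vertex coordinates
    edges                : (O, 2) vertex-index pairs defining the mesh edges
    """
    # First term: mean per-vertex coordinate error.
    vertex_term = torch.norm(gt_verts - pred_verts, dim=-1).mean()

    # Second term: keep the estimated edge lengths close to the ground-truth ones.
    v1, v2 = edges[:, 0], edges[:, 1]
    pred_len = torch.norm(pred_verts[:, v1] - pred_verts[:, v2], dim=-1)
    gt_len = torch.norm(gt_verts[:, v1] - gt_verts[:, v2], dim=-1)
    edge_term = (gt_len - pred_len).abs().mean()

    return vertex_term + edge_term
```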
Figure 4. Training schedule for the learning rate used while training our network and other state-of-the-art approaches for 3D flow estimation. The first 20 epochs for all methods were run on the Face3DVid dataset and the next 20 on 4DFAB. Each training epoch on Face3DVid and 4DFAB uses a batch size of 16.
b) DFFNet: Figure 3 shows the structure of this network. Inspired by FlowNetC [10], this network similarly has nine convolutional layers, with the first three layers using larger kernels than the remaining ones. Where downsampling occurs, it is carried out with strides of 2, and non-linearity is implemented with ReLU layers. We extend this architecture at the input stage with a branch dedicated to processing the PNCC image. The feature map generated at the end of the
PNCC branch is concatenated with the result of the correlation between the feature maps of I^1 and I^2. For the correlation layer, we follow the implementation suggested by [10] and keep the same parameters for this layer (neighborhood search size of 2×21+1 pixels). At the decoder section, the flow is estimated at multiple levels, up until the full resolution. While training, we use a batch size of 16 and the Adam optimisation algorithm [20] with the default parameters recommended in [20] (β1 = 0.9 and β2 = 0.999). Figure 4 demonstrates our scheduled learning rates over epochs for training and fine-tuning. We also set w_i = 1 and α = 10 in Equation 5 and normalise input images to the range [0, 1]. At test time, our entire framework takes only around 17 ms (6 ms for 3DMeshReg, 6 ms for rasterisation and PNCC generation, and 5 ms for DFFNet) to generate the dense 3D flow map, given a registered pair of images.
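At inference time the two stages are simply chained; the sketch below illustrates this flow under assumed module interfaces (the names and call signatures are placeholders, not the released implementation), with the approximate per-stage timings quoted above.

```python
import torch


@torch.no_grad()
def estimate_3d_flow(img1, img2, mesh_reg, rasterise_pncc, dffnet):
    """End-to-end inference: 3DMeshReg -> PNCC rendering -> DFFNet.

    img1, img2     : (1, 3, H, W) registered RGB frames
    mesh_reg       : network regressing the scaled 3D mesh of img1 (~6 ms)
    rasterise_pncc : renderer producing the PNCC image of that mesh (~6 ms)
    dffnet         : network predicting the (1, 3, H, W) 3D flow map (~5 ms)
    """
    verts = mesh_reg(img1)                 # (1, N, 3) per-vertex coordinates
    pncc = rasterise_pncc(verts)           # (1, 3, H, W) normalised x-y-z codes
    flow = dffnet(torch.cat([img1, img2, pncc], dim=1))
    return flow                            # dense 3D flow from frame 1 to frame 2
```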
In this section, we quantitatively evaluate the ability of our approach to estimate the 3D flow. As there exist no other methods for 3D scene flow from simple RGB images, we adapt existing methods that solve closely-related problems so that they produce 3D flow estimates. In more detail, we use two 3D reconstruction methods (ITWMM [5] and DNSfM-3DMM [23]), as well as four optical flow methods, after retraining them all specifically for the task of 3D flow estimation. The four optical flow methods include the best performing methods in Table 2 on our datasets (LiteNet and FlowNet2) as well as two additional baselines (FlowNetS and FlowNetC).
To estimate the 3D flow with ITWMM and DNSfM-3DMM, we first generate the per-frame dense 3D mesh of each test video by passing a single frame at a time to the ITWMM method and the entire video to DNSfM-3DMM (as it is a video-based approach). Then, following our annotation procedure discussed in Section 3.3, the 3D flow values for each pair of test images were obtained.
Since the deep learning-based methods we compare against in this section were proposed as 2D flow estimators, we modify the sizes of some filters in their original architectures so that their output flow is a 3-channel image storing the x-y-z coordinates of the flow, and we train them on our 3D-facial-flow datasets with the learning rate schedules reported in Figure 4. FlowNet2 is a very deep architecture (around 160M parameters) composed of stacked networks. As suggested in [18], we did not train this network in one go, but instead sequentially: we fused the individual networks (FlowNetS, FlowNetC, and FlowNetSD [18]), trained separately on our datasets, and fine-tuned the entire stacked architecture (see Figure 4 for the learning rate schedule). Please consult the supplementary material for more information on what exactly we modified in each flow network we compare against here.
Table 1 shows the facial AEPE results generated by each method on the Face3DVid and 4DFAB datasets. Our proposed architecture and its variant ('ours depth') report the lowest (best) AEPE numbers on both datasets. Figure 5 visualises some color-coded 3D flow results produced by the methods presented in Table 1. To color-code the 3D flow, we convert the estimated x-y-z flow coordinates from Cartesian to spherical coordinates and normalise them so that they represent the coordinates of an HSV coloring system; more details are available in the supplementary material. It is noteworthy that the 3D facial reconstruction methods we compare against fail to produce as accurate tracking of the 3D flow as our approach. Their result is not smooth and consistent in the model space, resulting in a higher-intensity motion in that space. This can be attributed to the fact that such methods pay attention to the fidelity of the reconstruction from the camera's view angle more than to the 3D temporal flow. On the other hand, the other deep architectures we train in this section are unable to capture the full facial motion with the same precision, with more fading flow around the cheeks and forehead.
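The AEPE reported throughout is the mean Euclidean distance between estimated and ground-truth flow vectors over the facial (foreground) pixels; a minimal NumPy version is shown below for reference.

```python
import numpy as np


def aepe(flow_pred, flow_gt, mask):
    """Average End Point Error over the facial (foreground) pixels.

    flow_pred, flow_gt : (H, W, 3) estimated / ground-truth 3D flow maps
    mask               : (H, W) boolean foreground mask
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel endpoint error
    return err[mask].mean()
```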
The aim of this experiment is to probe the performance of our framework in estimating the 2D optical facial flow between a pair of facial images, by keeping only the displacements produced at the output in the x and y directions while ignoring those in the z direction. We separate the comparisons in this section into two parts. Firstly, we evaluate our method against generic 2D flow approaches using the best performing trained models provided by the original authors of each. Secondly, we train the same architectures from scratch on the same datasets we train our framework on, namely the training splits of Face3DVid and 4DFAB, using a learning rate of 1e-4 that drops 5 times every 10 epochs. We keep the same network design as provided by each paper's authors and only train each network to minimise a masked loss composed of photo-consistency and data terms. The masked loss is computed with the help of a foreground (facial) mask for each reference image, provided with our utilised datasets. Table 2 presents the facial Average End Point Error (AEPE) values obtained by our proposed approach against other state-of-the-art optical flow prediction methods on the test splits of the Face3DVid and 4DFAB datasets. As can be noted from Table 2, our proposed method always achieves the smallest (best) AEPE values on both employed datasets. As expected, the AEPE values decrease when training the other methods on our datasets for the specific task of facial 2D flow estimation. However, our method still produces lower errors and outperforms the compared methods on this task. The 'ours depth' variant of our network comes as the second best performing method on both datasets. This variant was trained in a very similar manner to our original framework, but feeds the DFFNet with I^1, I^2 and only the z coordinates (last channel) of the PNCC image, ignoring the x and y coordinates (first two channels). Figure 6 demonstrates some qualitative results generated by the methods reported in Table 2, as well as ours. Please refer to the supplementary material for more information on the color coding followed for encoding the flow values.

Figure 5. Color-coded 3D flow estimations of random test pairs from the Face3DVid and 4DFAB datasets. Left-to-right: pair of input RGB images, Ground Truth, ours, ours depth, compared methods. For the color coding, see the Supp. Material.

Figure 6. Color-coded 2D flow estimations. Rows are random samples from the test splits of the Face3DVid and 4DFAB datasets and their 2D flow estimations. The first two columns of each row show the input pair of RGB images. For the color coding, see the Supp. Material.

Table 1. Comparison between our obtained 3D face flow results against state-of-the-art methods on the test splits of the 4DFAB and Face3DVid datasets. Comparison metric is the standard Average End Point Error (AEPE).

Method           | 4DFAB (↓) | Face3DVid (↓)
ITWMM [5]        | 3.43      | 4.1
DNSfM-3DMM [23]  | 2.8       | 3.9
FlowNetS [10]    | 2.25      | 3.7
FlowNetC [10]    | 1.95      | 2.425
FlowNet2 [18]    | 1.89      | 2.4
LiteNet [17]     | 1.5       | 2.2
ours depth       | 1.6       | 1.971
ours             |           |

Table 2. Comparison between our obtained 2D flow results against state-of-the-art methods on the test splits of the 4DFAB and Face3DVid datasets. Comparison metric is the standard Average End Point Error (AEPE). 'original models' refers to trained models provided by the authors of each, and 'trained from scratch' indicates that the same architectures were trained on the training sets of both Face3DVid and 4DFAB to estimate the 2D facial flow.

Method          | original models           | trained from scratch
                | 4DFAB (↓) | Face3DVid (↓) | 4DFAB (↓) | Face3DVid (↓)
FlowNetS [10]   | 1.832     | 5.1425        | 1.956     | 2.6
SpyNet [28]     | 1.31      | 3             | 1.042     | 1.5
FlowNetC [10]   | 1.212     | 2.6           | 1.061     | 1.498
UnFlow [25]     | 1.163     | 2.6553        | 1.055     | 1.45
LiteNet [17]    | 1.16      | 2.6           | 1.018     | 1.268
PWC-Net [32]    | 1.159     | 2.625         | 1.035     | 1.371
FlowNet2 [18]   | 1.15      | 2.6187        | 1.063     | 1.352
ours depth      | 0.99      | 1.176         | 0.99      | 1.176
ours            |           |               |           |
We further investigate the ability of our proposed framework to capture human facial 3D motion and to successfully employ it in a full-head reenactment application. Towards that aim, we use the recently proposed method of [37], which is in essence a general video-to-video synthesis approach mapping a source (conditioning) video to a photo-realistic output one. The authors of [37] train their framework in an adversarial manner and learn the temporal dynamics of a target video during training with the help of the 2D flow estimated by FlowNet2 [18]. In this experiment, we replace the FlowNet2 employed in [37] with our proposed approach and aid the generator and video discriminator in learning the temporal facial dynamics represented by our 3D facial flow. We firstly conduct a self-reenactment test as done in [37], where we divide each video into train/test splits (first two thirds vs last third) and report the average per-pixel RGB error between fake and real test frames. Table 3 reports the average pixel distance obtained for 4 different videos, with a separate model trained for each. The only difference between the second and third rows of Table 3 is the flow estimation method; everything else (structure, loss functions, conditioning, etc.) is the same. As can be noted from Table 3, our 3D flow better reveals the facial temporal dynamics of the training subject and assists the video synthesis generator in capturing these temporal characteristics, resulting in a lower error. In the second experiment, we perform a full-head reenactment test to fully transfer the head pose and expression from the source person to a target one. Figure 7 shows frames synthesised using our 3D flow and the 2D flow of FlowNet2.
Figure 7. Full-head reenactment using [37] combined with either FlowNet2 (second row) or our 3D flow approach (last row).
Looking closely at Figure 7, our 3D flow results in a more photo-realistic video synthesis, with highly accurate head pose, facial expression and temporal dynamics, while the manipulated frames generated with FlowNet2 fail to demonstrate the same fidelity. More details regarding this experiment are given in the supplementary material.
Table 3. Average RGB distance obtained under a self-reenactment setup on 4 videos (each with 1K test frames), using either FlowNet2 [18] or our facial 3D flow with the method of Wang et al. [37].

Video                      | 1    | 2    | 3    | 4
[37] + FlowNet2 (↓)        | 7.5  | 9.5  | 8.7  | 9.2
[37] + Ours (3D flow) (↓)  |      |      |      |
6. Conclusion and Future Work
In this work, we put forward a novel and fast framework for densely estimating the 3D flow of human faces from only a pair of monocular RGB images. The framework was trained on a very large-scale dataset of in-the-wild facial videos (Face3DVid) and fine-tuned on a 4D facial expression database (4DFAB [8]) with ground-truth 3D scans. We conducted extensive experimental evaluations showing that the proposed approach: a) yields highly-accurate estimates of 2D and 3D facial flow from a monocular pair of images and successfully captures complex non-rigid motions of the face, and b) outperforms many state-of-the-art approaches in estimating both the 2D and 3D facial flow, even when training the other approaches under the same setup and data. We additionally reveal the promising potential of our work in a full-head facial manipulation application that capitalises on our facial flow to produce highly faithful and photo-realistic fake facial dynamics, indistinguishable from real ones.

Acknowledgement
Stefanos Zafeiriou acknowledges support from EPSRC Fellowship DEFORM (EP/S010203/1).

References
[1] Thiemo Alldieck, Marc Kassubeck, Bastian Wandt, Bodo Rosenhahn, and Marcus Magnor. Optical flow-based 3d human motion estimation from monocular video. In German Conference on Pattern Recognition, pages 347–360. Springer, 2017.
[2] Tali Basha, Yael Moses, and Nahum Kiryati. Multi-view scene flow estimation: A view centered variational approach. International Journal of Computer Vision, 101(1):6–21, 2013.
[3] Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[4] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3d morphable models. International Journal of Computer Vision, 126(2-4):233–254, 2018.
[5] James Booth, Anastasios Roussos, Evangelos Ververas, Epameinondas Antonakos, Stylianos Ploumpis, Yannis Panagakis, and Stefanos Zafeiriou. 3d reconstruction of "in-the-wild" faces in images and videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2638–2652, 2018.
[6] Fabian Brickwedde, Steffen Abraham, and Rudolf Mester. Mono-sf: Multi-view geometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2780–2790, 2019.
[7] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision, pages 25–36. Springer, 2004.
[8] Shiyang Cheng, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 4dfab: A large scale 4d database for facial expression analysis and biometric applications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5117–5126, 2018.
[9] Jiankang Deng, Anastasios Roussos, Grigorios Chrysos, Evangelos Ververas, Irene Kotsia, Jie Shen, and Stefanos Zafeiriou. The menpo benchmark for multi-pose 2d and 3d facial landmark localisation and tracking. International Journal of Computer Vision, pages 1–26, 2018.
[10] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[11] Ravi Garg, Anastasios Roussos, and Lourdes Agapito. Dense variational reconstruction of non-rigid surfaces from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1272–1279, 2013.
[12] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. arXiv preprint arXiv:1902.05978, 2019.
[13] Vladislav Golyanik, Aman S Mathur, and Didier Stricker. Nrsfm-flow: Recovering non-rigid scene flow from monocular image sequences. In BMVC, 2016.
[14] Jia Guo, Jiankang Deng, Niannan Xue, and Stefanos Zafeiriou. Stacked dense u-nets with dual transformers for robust face alignment. In Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2019.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[17] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8981–8989, 2018.
[18] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.
[19] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519, 2017.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Mohammad Rami Koujan, Luma Alharbawee, Giorgos Giannakakis, Nicolas Pugeault, and Anastasios Roussos. Real-time facial expression recognition in the wild by disentangling 3d expression from identity. In . IEEE, 2020.
[22] Mohammad Rami Koujan, Michail Doukas, Anastasios Roussos, and Stefanos Zafeiriou. Head2head: Video-based neural head synthesis. In . IEEE, 2020.
[23] Mohammad Rami Koujan and Anastasios Roussos. Combining dense nonrigid structure from motion and 3d morphable models for monocular 4d face reconstruction. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, CVMP '18, pages 2:1–2:9, New York, NY, USA, 2018. ACM.
[24] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
[25] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] Etienne Mémin and Patrick Pérez. Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5):703–719, 1998.
[27] Jean-Philippe Pons, Renaud Keriven, and Olivier Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2):179–193, 2007.
[28] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4161–4170, 2017.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[30] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[31] Deqing Sun, Stefan Roth, and Michael J Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
[32] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
[33] Ravi Kumar Thakur and Snehasis Mukherjee. Sceneednet: A deep learning approach for scene flow estimation. In , pages 394–399. IEEE, 2018.
[34] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 722–729. IEEE, 1999.
[35] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a rigid motion prior. In , pages 1291–1298. IEEE, 2011.
[36] Shan Wang, Xukun Shen, and Jiaqing Liu. Dense optical flow variation based 3d face reconstruction from monocular video. In , pages 2665–2669. IEEE, 2018.
[37] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
[38] Andreas Wedel, Daniel Cremers, Thomas Pock, and Horst Bischof. Structure- and motion-adaptive regularization for high accuracy optic flow. In , pages 1663–1668. IEEE, 2009.
[39] Andreas Wedel, Clemens Rabe, Tobi Vaudrey, Thomas Brox, Uwe Franke, and Daniel Cremers. Efficient dense scene flow from sparse or dense stereo data. In European Conference on Computer Vision, pages 739–751. Springer, 2008.
[40] Stefanos Zafeiriou, Grigorios G Chrysos, Anastasios Roussos, Evangelos Ververas, Jiankang Deng, and George Trigeorgis. The 3d menpo facial landmark tracking challenge. In Proceedings of the IEEE International Conference on Computer Vision, pages 2503–2511, 2017.
[41] Youding Zhu and Kikuo Fujimura. 3d head pose estimation with optical flow and depth constraints. In