DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation
Mohammad Rami Koujan, Anastasios Roussos, Stefanos Zafeiriou
College of Engineering, Mathematics and Physical Sciences, University of Exeter, UK
Department of Computing, Imperial College London, UK
Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH-ICS), Greece
FaceSoft.io, London, UK
Abstract
Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB images is a highly challenging problem, with numerous applications ranging from facial expression recognition to facial reenactment. In this work, we propose DeepFaceFlow, a robust, fast, and highly-accurate framework for the dense estimation of 3D non-rigid facial flow between pairs of monocular images. Our DeepFaceFlow framework was trained and tested on two very large-scale facial video datasets, one of them of our own collection and annotation, with the aid of an occlusion-aware and 3D-based loss function. We conduct comprehensive experiments probing different aspects of our approach and demonstrating its improved performance against state-of-the-art flow and 3D reconstruction methods. Furthermore, we incorporate our framework into a state-of-the-art full-head facial video synthesis method and demonstrate the ability of our method to better represent and capture the facial dynamics, resulting in highly-realistic facial video synthesis. Given registered pairs of images, our framework generates 3D flow maps at ∼60 fps.
1. Introduction
Optical flow estimation is a challenging computer vision task that has been targeted substantially since the seminal work of Horn and Schunck [16]. The amount of effort dedicated to tackling this problem is largely justified by the potential applications in the field, e.g. 3D facial reconstruction [11, 23, 36], autonomous driving [19], action and expression recognition [30, 21], human motion and head pose estimation [1, 41], and video-to-video translation [37, 22]. While optical flow tracks pixels between consecutive images in the 2D image plane, scene flow, its 3D counterpart, aims at estimating the 3D motion field of scene points at different time steps in three-dimensional space. Therefore, scene flow combines two challenges: 1) 3D shape reconstruction, and 2) dense motion estimation.
Figure 1. We propose a framework for the high-fidelity 3D flow estimation between a pair of monocular facial images. Left-to-right: input pair of RGB images, estimated 3D facial shape of the first image rendered with 3D motion vectors from the first to the second image, warped 3D shape of (1) based on the estimated 3D flow in (3), color-coded 3D flow map of each pixel in (1). For the color coding, see the Supplementary Material.
Scene flow estimation, which can be traced back to the work of Vedula et al. [34], is a highly ill-posed problem due to the depth ambiguity and the aperture problem, as well as occlusions and variations of illumination and pose, which are very typical of in-the-wild images. To address all these challenges, the majority of methods in the literature use stereo or RGB-D images and enforce priors on either the smoothness of the reconstructed surfaces and estimated motion fields [2, 27, 39, 33] or the rigidity of the motion [35].
In this work, we seek to estimate the 3D motion field of human faces from in-the-wild pairs of monocular images, see Fig. 1. The output of our method is the same as in scene flow methods, but the fundamental difference is that we use simple RGB images instead of stereo pairs or RGB-D images as input. Furthermore, our method is tailored for human faces instead of arbitrary scenes. For the problem that we are solving, we use the term "3D face flow estimation". Our designed framework delivers accurate flow estimation in the 3D world rather than the 2D image space. We focus on the human face and the modelling of its dynamics due to its centrality in a myriad of applications, e.g. facial expression recognition, head motion and pose estimation, 3D dense facial reconstruction, full head reenactment, etc. Human facial motion emerges from two main sources: 1) rigid motion due to head pose variation, and 2) non-rigid motion caused by elicited facial expressions and mouth motions during speech. The reliance on only monocular and in-the-wild images to capture the 3D motion of general objects makes the problem considerably more challenging. Such obstacles can be alleviated by injecting our prior knowledge about the object, as well as by constructing and utilising a large-scale annotated dataset. Our contributions in this work can be summarised as follows:
• To the best of our knowledge, there does not exist any method that estimates 3D scene flow of deformable scenes using a pair of simple RGB images as input. The proposed approach is the first to solve this problem, and this is made possible by focusing on scenes with human faces.
• Collection and annotation of a large-scale dataset of human facial videos (more than 12,000), which we call
Face3DVid. With the help of our proposed model-based formulation, each video was annotated with the per-frame: 1) 68 facial landmarks, 2) dense 3D facial shape mesh, 3) camera parameters, and 4) dense 3D flow maps. This dataset will be made publicly available (project's page: https://github.com/mrkoujan/DeepFaceFlow).
• A robust, fast, deep learning-based and end-to-end framework for the dense, high-quality estimation of 3D face flow from only a pair of monocular in-the-wild RGB images.
• We demonstrate both quantitatively and qualitatively the usefulness of our estimated 3D flow in a full-head reenactment experiment, as well as in 4D face reconstruction (see supplementary materials).
The approach we follow starts from the collection and annotation of a large-scale dataset of facial videos, see Section 3 for details. We employ such a rich dynamic dataset in the training process of our entire framework and initialise the learning procedure with this dataset of in-the-wild videos. Different from other scene flow methods, our framework requires only a pair of monocular RGB images and can be decomposed into two main parts: 1) a shape-initialisation network (3DMeshReg) aiming at densely regressing the 3D geometry of the face in the first frame, and 2) a fully convolutional network, termed
DeepFaceFlowNet (DFFNet), that accepts a pair of RGB frames along with the projected 3D facial shape initialisation of the first (reference) frame, provided by 3DMeshReg, and produces a dense 3D flow map at the output.
2. Related Work
The most closely-related works in the literature solve the problems of optical flow and scene flow estimation. Traditionally, one of the most popular ways to tackle these problems had been through variational frameworks. The work of Horn and Schunck [16] pioneered the variational work on optical flow, formulating an energy equation with brightness constancy and spatial smoothness terms. Later, a large number of variational approaches with various improvements were put forward [7, 26, 38, 3, 31]. All of these methods involve a complex optimisation, rendering them computationally very intensive. One of the very first attempts at an end-to-end, CNN-based trainable framework capable of estimating the optical flow was made by Dosovitskiy et al. [10]. Even though their reported results still fall behind state-of-the-art classical methods, their work shows the bright promise of CNNs for this task and that further investigation is worthwhile. Another attempt with similar results to [10] was made by the authors of [28]. Their framework, called
SpyNet, combines a classical spatial-pyramid formulation with deep learning for the estimation of large motions in a coarse-to-fine approach. As a follow-up method, Ilg et al. [18] later used the two structures proposed in [10] in a stacked pipeline,
FlowNet2, for estimating coarse- and fine-scale details of the optical flow, with very competitive performance on the Sintel benchmark. Recently, the authors of [32] put forward a compact and fast CNN model, termed
PWC-Net, that capitalises on pyramidal processing, warping, and cost volumes. They reported the top results on more than one benchmark, namely MPI Sintel final pass and KITTI 2015. Most of the deep learning-based methods rely on synthetic datasets to train their networks in a supervised fashion, leaving a challenging gap when tested on real in-the-wild images.
Quite different from optical flow, scene flow methods aim at estimating the three-dimensional motion vectors of scene points from stereo or RGB-D images. The first attempt to extend optical flow to 3D was made by Vedula et al. [34]. Their work assumed that both the structure and the correspondences of the scene are known. Most of the early attempts at scene flow estimation relied on a sequence of stereo images to solve the problem. With the growing popularity of depth cameras, more pipelines utilised RGB-D data as an alternative to stereo images. All these methods follow the classical way of scene flow estimation, without using any deep learning techniques or big datasets. The authors of [24] led the first effort to use deep learning features to estimate optical flow, disparity, and scene flow from a big dataset. The method of Golyanik et al. [13] estimates the 3D flow from a sequence of monocular images with sufficient diversity in non-rigid deformation and 3D pose, as the method relies heavily on NRSfM. The lack of such diversity, which is common for the type of in-the-wild videos we deal with, could result in degenerate solutions. On the contrary, our method requires only a pair of monocular images as input. Using only monocular images, Brickwedde et al. [6] target dynamic street scenes but impose a strong rigidity assumption on scene objects, making their approach unsuitable for facial videos. As opposed to other state-of-the-art approaches, we rely on minimal information to solve the highly ill-posed 3D facial scene flow problem. Given only a pair of monocular RGB images, our novel framework is capable of accurately and robustly estimating the 3D flow between them at a rate of ∼60 fps.
3. Dataset Collection and Annotation
Given the highly ill-posed nature of non-rigid 3D facial motion estimation from a pair of monocular images, the size and variability of the training dataset are crucial [18, 10]. For this reason, we rely on a large-scale training dataset (
Face3DVid), which we construct by collecting tens of thousands of facial videos, performing dense 3D face reconstruction on them and then estimating effective pseudo-ground-truth 3D flow maps.
First of all, we model the 3D face geometry using 3DMMs and an additive combination of identity and expression variation. This is similar to several recent methods, e.g. [40, 9, 23, 12]. In more detail, let x = [x_1, y_1, z_1, ..., x_N, y_N, z_N]^T ∈ R^{3N} be the vectorised form of any 3D facial shape consisting of N
3D vertices. We consider that x can be represented as:

x(i, e) = x̄ + U_id i + U_exp e,    (1)

where x̄ is the overall mean shape vector, U_id ∈ R^{3N×n_i} is the identity basis with n_i = 157 principal components (n_i ≪ N), U_exp ∈ R^{3N×n_e} is the expression basis with n_e = 28 principal components (n_e ≪ N), and i ∈ R^{n_i}, e ∈ R^{n_e} are the identity and expression parameters, respectively. The identity part of the model originates from the Large Scale Face Model (LSFM) [4] and the expression part originates from the work of Zafeiriou et al. [40].
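As an illustration of Eq. (1), the following is a minimal NumPy sketch of synthesising a facial shape from identity and expression parameters; the array names are illustrative, and the bases are assumed to be loaded from the LSFM identity model and the expression model mentioned above.

```python
import numpy as np


def face_shape(mean_shape, U_id, U_exp, ident, expr):
    """Instance of the 3DMM in Eq. (1): x(i, e) = x_bar + U_id i + U_exp e.

    mean_shape : (3N,)    vectorised mean shape x_bar
    U_id       : (3N, n_i) identity basis (n_i = 157 components)
    U_exp      : (3N, n_e) expression basis (n_e = 28 components)
    ident      : (n_i,)   identity parameters i
    expr       : (n_e,)   expression parameters e
    """
    x = mean_shape + U_id @ ident + U_exp @ expr
    return x.reshape(-1, 3)   # N x 3 array of (x, y, z) vertex coordinates
```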
To create effective pseudo-ground truth on tens of thousands of videos, we need to perform 3D face reconstruction that is both efficient and accurate. For this reason, we choose to fit the adopted 3DMM model to the sequence of facial landmarks over each video. Since this process is carried out only during training, we are not constrained by the need for online performance. Therefore, similarly to [9], we adopt a batch approach that takes into account the information from all video frames simultaneously and exploits the rich dynamic information usually contained in facial videos. It is an energy minimisation that fits the combined identity and expression 3DMM model to the facial landmarks from all frames of the input video simultaneously. More details are given in the Supplementary Material.
To create our large-scale training dataset, we start from a collection of 12,000 RGB videos with 19 million frames in total and 2,500 unique identities. We apply the 3D face reconstruction method outlined in Sec. 3.1, together with some video pruning steps to omit cases where the automatic estimations had failed. Our final training set consists of the videos of our collection that survived the video pruning steps (81.25% of the initial dataset). For more details and exemplar visualisations, please refer to the Supplementary Material.
Given a pair of images I^1 and I^2 coming from a video in our dataset, together with their corresponding 3D shapes S^1, S^2 and pose parameters R^1, t^1_{3d}, R^2, t^2_{3d}, the 3D flow map of this pair is created as follows:

F(x, y)|_{(x,y)∈M} = f_c^2 (R^2 [S^2(t_j^1), S^2(t_j^2), S^2(t_j^3)] b + t^2_{3d}) − f_c^1 (R^1 [S^1(t_j^1), S^1(t_j^2), S^1(t_j^3)] b + t^1_{3d}),    (2)

where M is the set of foreground pixels in I^1, S^k ∈ R^{3×N} is the matrix storing column-wise the x-y-z coordinates of the N-vertex 3D shape of I^k, R^k ∈ R^{3×3} is the rotation matrix, t^k_{3d} ∈ R^3 is the 3D translation, f_c^1 and f_c^2 are the scales of the orthographic camera for the first and second image respectively, t_j = [t_j^1, t_j^2, t_j^3] (t_j^i ∈ {1, ..., N}) is the triangle of T visible from pixel (x, y) in image I^1, as detected by our hardware-based renderer, T is the set of all triangles composing the mesh of S^1, and b ∈ R^3 holds the barycentric coordinates of pixel (x, y) inside the projected triangle t_j on image I^1. All background pixels in Equation 2 are set to zero and ignored during training with the help of a masked loss. It is evident from Equation 2 that we do not care about the vertices visible only in the second frame, and only track in 3D those that were visible in image I^1 to produce the 3D flow vectors. Additionally, with this flow representation, the x-y coordinates alone of the 3D flow map F(x, y) directly designate the 2D optical flow components in the image space.
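To make the construction of Eq. (2) concrete, below is a minimal NumPy sketch of assembling the pseudo-ground-truth 3D flow, assuming a rasteriser has already produced, for every foreground pixel of the first frame, the index of the visible triangle and its barycentric coordinates (information our hardware-based renderer provides); the function and argument names are illustrative rather than the exact implementation.

```python
import numpy as np


def flow_map_from_meshes(S1, S2, R1, t1, R2, t2, fc1, fc2,
                         tri_idx, bary, triangles, mask):
    """Pseudo-ground-truth 3D flow of Eq. (2), one vector per foreground pixel.

    S1, S2    : (3, N) vertex coordinates of the two 3D shapes
    R1, R2    : (3, 3) rotation matrices;  t1, t2: (3,) translations
    fc1, fc2  : orthographic camera scales of the two frames
    tri_idx   : (H, W) index into `triangles` of the triangle visible from
                each pixel of frame 1 (-1 for background)
    bary      : (H, W, 3) barycentric coordinates of each pixel
    triangles : (T, 3) vertex indices of every mesh triangle
    mask      : (H, W) boolean foreground mask of frame 1
    """
    H, W = mask.shape
    F = np.zeros((H, W, 3), dtype=np.float32)
    ys, xs = np.nonzero(mask)
    verts = triangles[tri_idx[ys, xs]]           # (P, 3) vertex ids per pixel
    b = bary[ys, xs][..., None]                  # (P, 3, 1)

    # Barycentric interpolation of the 3D point seen by each pixel, on both meshes.
    p1 = (S1[:, verts].transpose(1, 2, 0) * b).sum(axis=1)   # (P, 3)
    p2 = (S2[:, verts].transpose(1, 2, 0) * b).sum(axis=1)   # (P, 3)

    # Pose and scale both points, then take the difference (second minus first frame).
    q1 = fc1 * (p1 @ R1.T + t1)
    q2 = fc2 * (p2 @ R2.T + t2)
    F[ys, xs] = q2 - q1
    return F
```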
4. Proposed Framework
Our overall designed framework is demonstrated in Figure 2. We expect as input two RGB images I^1, I^2 ∈ R^{W×H×3} and produce at the output an image F ∈ R^{W×H×3} encoding the per-pixel 3D flow from I^1 to I^2. The designed framework comprises two main stages: 1) 3DMeshReg: 3D shape initialisation and encoding of the reference frame I^1, and 2) DeepFaceFlowNet (DFFNet): 3D face flow prediction. The entire framework was trained in a supervised manner, utilising the collected and annotated dataset (see Section 3.2), and fine-tuned on the 4DFAB dataset [8], after registering the sequence of scans coming from each video in this dataset to our 3D template. Input frames were registered to a fixed-size 2D template with the help of the 68 facial landmarks extracted using [14] and then fed to our framework.
Figure 2. Proposed DeepFaceFlow pipeline for the 3D facial flow estimation. First stage (left): 3DMeshReg works as an initialisation for the 3D facial shape in the first frame. This estimation is rasterised in the next step and encoded in an RGB image, termed Projected Normalised Coordinates Code (PNCC), storing the x-y-z coordinates of each corresponding visible 3D point. Given the pair of images as well as the PNCC, the second stage (right) estimates the 3D flow using a deep fully-convolutional network (
DeepFaceFlowNet).
To robustly estimate the per-pixel 3D flow between a pair of images, we provide the
DFFNet network (Section 4.2) not only with I^1 and I^2, but also with another image that stores a Projected Normalised Coordinates Code (PNCC) of the estimated 3D shape of the reference frame I^1, i.e. PNCC ∈ R^{W×H×3}. The PNCC codes that we consider are essentially images encoding the normalised x, y and z coordinates of the facial vertices visible from each corresponding pixel in I^1, based on the camera's view angle. The inclusion of such images allows the CNN to better associate each RGB value in I^1 with the corresponding point in 3D space, providing the network with a better initialisation in the problem space and establishing a reference 3D mesh that facilitates the warping in 3D space during the course of training. Equation 3 shows how to compute the PNCC image of the reference frame I^1:

PNCC(x, y)|_{(x,y)∈M} = V(S^1, c^1) = P (R^1 [S^1(t_j^1), S^1(t_j^2), S^1(t_j^3)] b + t^1_{3d}),    (3)

where V(·, ·) is the function rendering the normalised version of S^1, c^1 denotes the camera parameters, i.e. rotation angles, translation and scale (R^1, t^1_{3d}, f_c^1), and P is a 3×3 diagonal matrix with main diagonal elements (f_c^1/W, f_c^1/H, f_c^1/D). The multiplication with P scales the posed 3D face with f_c^1 so that it is first in image space coordinates, and then normalises it with the width and height of the rendered image and the maximum z value D computed from our entire dataset of annotated 3D shapes. This results in an image whose RGB channels store the normalised ([0, 1]) x-y-z coordinates of the corresponding rendered 3D shape. The rest of the parameters in Equation 3 are detailed in Section 3.3 and utilised in Equation 2.
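Under the same assumptions as the previous sketch (per-pixel visible triangle and barycentric coordinates from the rasteriser), Eq. (3) reduces to posing the interpolated 3D point and normalising each axis by W, H and the dataset-wide maximum depth D; the following is an illustrative sketch, not the exact renderer implementation.

```python
import numpy as np


def pncc_image(S1, R1, t1, fc1, tri_idx, bary, triangles, mask, D):
    """Projected Normalised Coordinates Code of Eq. (3) for the reference frame.

    Returns an (H, W, 3) image whose channels hold the x-y-z coordinates of the
    3D point visible from each foreground pixel, normalised to [0, 1] by the
    image width/height and the maximum depth D.
    """
    H, W = mask.shape
    pncc = np.zeros((H, W, 3), dtype=np.float32)
    ys, xs = np.nonzero(mask)
    verts = triangles[tri_idx[ys, xs]]                        # (P, 3)
    b = bary[ys, xs][..., None]                               # (P, 3, 1)
    pts = (S1[:, verts].transpose(1, 2, 0) * b).sum(axis=1)   # (P, 3)

    posed = fc1 * (pts @ R1.T + t1)       # posed and scaled 3D points
    posed = posed / np.array([W, H, D])   # per-axis normalisation (diag matrix P)
    pncc[ys, xs] = posed
    return pncc
```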
The PNCC image generation discussed in Equation 3 still requires an estimate of the 3D facial shape S^1 of I^1. We deal with this by training a deep CNN, termed 3DMeshReg, that regresses a dense 3D mesh S^1 through per-vertex 3D coordinate estimates. We use our collected dataset (Face3DVid) and the 4DFAB [8] 3D scans to train this network in a supervised manner. We formulate a loss function composed of two terms:

L(Φ) = (1/N) Σ_{i=1}^{N} ||s_i^{GT} − s_i|| + (1/O) Σ_{j=1}^{O} ||e_j^{GT} − e_j||.    (4)

The first term in the above equation penalises the deviation of the 3D coordinates of each vertex from the corresponding ground-truth vertex (s_i = [x, y, z]^T), while the second term encourages similar edge lengths between vertices in the estimated and ground-truth meshes, where e_j is the ℓ2 distance between the two vertices defining edge j in the original ground-truth 3D template and O is the number of edges. Instead of estimating the camera parameters c^1 separately, which are needed at the very input of the renderer, we assume a Scaled Orthographic Projection (SOP) as the camera model and train the network to regress directly the scaled 3D mesh, by multiplying the x-y-z coordinates of each vertex of frame i with f_c^i.
Given the I^1, I^2 and PNCC images, the 3D flow estimation problem is a mapping F : {I^1, I^2, PNCC} → F ∈ R^{W×H×3}. Using both the annotated Face3DVid dataset detailed in Section 3 and the 4DFAB [8] dataset, we train a fully convolutional encoder-decoder CNN (F), called DeepFaceFlowNet (DFFNet), that takes three images, namely I^1, I^2 and PNCC, and produces the 3D flow estimate from each foreground pixel in I^1 to I^2 as a W×H×3 image. The designed network follows the generic U-Net architecture with skip connections [29] and was inspired particularly by FlowNetC [10], see Figure 3. Differently from FlowNetC, we extend the network to account for the PNCC image at the input and modify the structure to account for the 3D flow estimation task, rather than 2D optical flow. We propose the following two-term loss function:

L(Ψ) = Σ_{i=1}^{L} w_i ||F_i^{GT} − F_i(Ψ)||_F + α ||I^1 − W(F, PNCC; I^2)||_F.    (5)

Figure 3. Architecture of our designed DFFNet for the purpose of estimating the 3D flow between a pair of RGB images.
The first term in Eq. (5) is the endpoint error, which corresponds to a 3D extension of the standard error measure for optical flow methods. It computes the Frobenius norm (||·||_F) of the error between the estimated 3D flow F(Ψ) and the ground truth F^{GT}, with Ψ representing the learnable network weights. In practice, since each fractionally-strided convolution (a.k.a. deconvolution) in the decoder part of our DFFNet produces an estimate of the flow at a different resolution, we compare this multi-resolution 3D flow with downsampled versions of F^{GT}, up until the full resolution at stage L, and use the weighted sum of the Frobenius norm errors as the penalisation term. The second term in Eq. (5) is the photo-consistency error, which assumes that the colour of each point does not change from I^1 to I^2. The warping operation is carried out with the help of the warping function W(·, ·). This function warps the 3D shape of I^1 encoded inside the PNCC image using the estimated flow F and samples I^2 at the vertices of the resulting projected 3D shape. The warping function in Equation 5 was implemented as a differentiable layer that detects occlusions by virtue of our 3D flow and samples the second image (backward warping) in a differentiable manner at the output stage of our DFFNet. The scale α balances the two terms during training.
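A compact PyTorch sketch of the two-term objective in Eq. (5) is given below; the multi-scale predictions and the differentiable warping function are assumed to be provided by the network and the warping layer described above, and the names and signatures are illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F


def frob(x):
    """Frobenius norm per batch element."""
    return torch.sqrt((x ** 2).sum(dim=(1, 2, 3)))


def dffnet_loss(flow_preds, flow_gt, img1, img2, pncc, warp_fn, alpha=10.0):
    """Two-term objective of Eq. (5): multi-scale endpoint error + photo-consistency.

    flow_preds : list of (B, 3, h_i, w_i) flow estimates, coarse to fine
    flow_gt    : (B, 3, H, W) ground-truth 3D flow
    img1, img2 : (B, 3, H, W) input RGB frames
    pncc       : (B, 3, H, W) PNCC image of the reference frame
    warp_fn    : differentiable function implementing W(F, PNCC; I2)
    """
    # Endpoint error: compare every decoder scale with a resized ground truth
    # (w_i = 1 for all scales, as in the training setup of the paper).
    epe = 0.0
    for pred in flow_preds:
        gt_i = F.interpolate(flow_gt, size=pred.shape[-2:], mode='bilinear',
                             align_corners=False)
        epe = epe + frob(pred - gt_i).mean()

    # Photo-consistency: the backward-warped second image should match the first.
    warped = warp_fn(flow_preds[-1], pncc, img2)
    photo = frob(img1 - warped).mean()

    return epe + alpha * photo
```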
5. Experiments
In this section, we compare our framework with state-of-the-art methods in optical flow and 3D face reconstruction. We ran all the experiments on an NVIDIA DGX1 machine.
Although the collected
Face3DVid dataset has a wide variety of facial dynamics and identities, captured under a plenitude of set-ups and viewpoints depicting in-the-wild video capture scenarios, it was annotated with pseudo ground-truth 3D shapes, not real 3D scans. Relying only on this dataset for training our framework could therefore result in mimicking the performance of the 3DMM-based estimation, which we ideally want to initialise with and then depart from. Thus, we fine-tune our framework on the 4DFAB dataset [8]. The 4DFAB dataset is a large-scale database of dynamic high-resolution 3D faces, with subjects displaying both spontaneous and posed facial expressions and with the corresponding per-frame 3D scans. We leave a temporal gap between consecutive frames sampled from each video whenever the average 3D flow per pixel between a pair is ≤ 1. In total, image pairs from around 1,600 subjects of Face3DVid and 175 subjects of 4DFAB were used for training and testing purposes. We split the
Face3DVid into training/validation vs test splits (80% vs 20%) in the first phase of the training. Likewise, the 4DFAB dataset was split into training/validation vs test (80% vs 20%) during the fine-tuning. Our pipeline consists of two networks (see Fig. 2):
a) 3DMeshReg: The aim of this network is to accept an input RGB image I^1 and regress the per-vertex (x, y, z) coordinates describing the subject's facial geometry. The ResNet50 [15] network architecture was selected and trained for this purpose, after replacing the output fully-connected (fc) layer with a convolutional layer followed by a linear fc layer that regresses the per-vertex coordinates. This network was trained initially and separately from the rest of the framework on the Face3DVid dataset, using the loss of Eq. (4) (see the sketch below), and then fine-tuned on the 4DFAB dataset [8]. The Adam optimizer [20] was used during training with a learning rate of 0.0001, β1 = 0.9, β2 = 0.999, and batch size 32.
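For reference, here is a minimal PyTorch sketch of the 3DMeshReg objective in Eq. (4), combining the per-vertex term with the edge-length term; the tensor shapes and the edges argument are assumptions made for the example rather than the exact implementation.

```python
import torch


def mesh_reg_loss(pred_verts, gt_verts, edges):
    """Vertex + edge-length loss of Eq. (4).

    pred_verts, gt_verts : (B, N, 3) predicted / ground-truth vertex coordinates
    edges                : (O, 2) vertex-index pairs defining the mesh edges
    """
    # First term: mean per-vertex coordinate error.
    vertex_term = torch.norm(gt_verts - pred_verts, dim=-1).mean()

    # Second term: keep the estimated edge lengths close to the ground-truth ones.
    v1, v2 = edges[:, 0], edges[:, 1]
    pred_len = torch.norm(pred_verts[:, v1] - pred_verts[:, v2], dim=-1)
    gt_len = torch.norm(gt_verts[:, v1] - gt_verts[:, v2], dim=-1)
    edge_term = (gt_len - pred_len).abs().mean()

    return vertex_term + edge_term
```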
Figure 4. Training schedule for the learning rate used while training our network and other state-of-the-art approaches for 3D flow estimation. The first 20 epochs for all methods were run on the Face3DVid dataset and the next 20 on 4DFAB. Each training epoch on Face3DVid and 4DFAB uses a batch size of 16.
b) DFFNet: Figure 3 shows the structure of this network. Inspired by FlowNetC [10], this network similarly has nine convolutional layers, with the first three layers using larger kernels than the remaining ones. Where downsampling occurs, it is carried out with strides of 2, and non-linearity is implemented with ReLU layers. We extend this architecture at the input stage with a branch dedicated to processing the PNCC image. The feature map generated at the end of the
PNCC branch is concatenated with the result of the correlation between the feature maps of I^1 and I^2. For the correlation layer, we follow the implementation suggested by [10] and keep the same parameters for this layer (neighborhood search size of 2×21+1 pixels). At the decoder section, the flow is estimated at multiple levels, up until the full resolution. While training, we use a batch size of 16 and the Adam optimisation algorithm [20] with the default parameters recommended in [20] (β1 = 0.9 and β2 = 0.999). Figure 4 demonstrates our scheduled learning rates over epochs for training and fine-tuning. We also set w_i = 1 and α = 10 in Equation 5 and normalise input images to the range [0, 1]. At test time, our entire framework takes only around 17 ms (6 ms for 3DMeshReg, 6 ms for rasterisation and PNCC generation, and 5 ms for DFFNet) to generate the dense 3D flow map, given a registered pair of images.
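At inference time the two stages are simply chained; the sketch below illustrates this flow under assumed module interfaces (the names and call signatures are placeholders, not the released implementation), with the approximate per-stage timings quoted above.

```python
import torch


@torch.no_grad()
def estimate_3d_flow(img1, img2, mesh_reg, rasterise_pncc, dffnet):
    """End-to-end inference: 3DMeshReg -> PNCC rendering -> DFFNet.

    img1, img2     : (1, 3, H, W) registered RGB frames
    mesh_reg       : network regressing the scaled 3D mesh of img1 (~6 ms)
    rasterise_pncc : renderer producing the PNCC image of that mesh (~6 ms)
    dffnet         : network predicting the (1, 3, H, W) 3D flow map (~5 ms)
    """
    verts = mesh_reg(img1)                 # (1, N, 3) per-vertex coordinates
    pncc = rasterise_pncc(verts)           # (1, 3, H, W) normalised x-y-z codes
    flow = dffnet(torch.cat([img1, img2, pncc], dim=1))
    return flow                            # dense 3D flow from frame 1 to frame 2
```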
In this section, we quantitatively evaluate the ability of our approach to estimate the 3D flow. As there exist no other methods for 3D scene flow from simple RGB images, we adapt existing methods that solve closely-related problems so that they produce 3D flow estimates. In more detail, we use two 3D reconstruction methods (ITWMM [5] and DNSfM-3DMM [23]), as well as four optical flow methods, after retraining them all specifically for the task of 3D flow estimation. The four optical flow methods include the best performing methods in Table 2 on our datasets (LiteNet and FlowNet2) as well as two additional baselines (FlowNetS and FlowNetC).
To estimate the 3D flow with ITWMM and DNSfM-3DMM, we first generate the per-frame dense 3D mesh of each test video by passing a single frame at a time to the ITWMM method and the entire video to DNSfM-3DMM (as it is a video-based approach). Then, following our annotation procedure discussed in Section 3.3, the 3D flow values for each pair of test images were obtained.
Since the deep learning-based methods we compare against in this section were proposed as 2D flow estimators, we modify the sizes of some filters in their original architectures so that their output flow is a 3-channel image storing the x-y-z coordinates of the flow, and we train them on our 3D-facial-flow datasets with the learning rate schedules reported in Figure 4. FlowNet2 is a very deep architecture (around 160M parameters) composed of stacked networks. As suggested in [18], we did not train this network in one go, but instead sequentially: we fused the individual networks (FlowNetS, FlowNetC, and FlowNetSD [18]), trained separately on our datasets, and fine-tuned the entire stacked architecture (see Figure 4 for the learning rate schedule). Please consult the supplementary material for more information on what exactly we modified in each flow network we compare against here.
Table 1 shows the facial AEPE results generated by each method on the Face3DVid and 4DFAB datasets. Our proposed architecture and its variant ('ours depth') report the lowest (best) AEPE numbers on both datasets. Figure 5 visualises some color-coded 3D flow results produced by the methods presented in Table 1. To color-code the 3D flow, we convert the estimated x-y-z flow coordinates from Cartesian to spherical coordinates and normalise them so that they represent the coordinates of an HSV coloring system; more details are available in the supplementary material. It is noteworthy that the 3D facial reconstruction methods we compare against fail to produce as accurate tracking of the 3D flow as our approach. Their result is not smooth and consistent in the model space, resulting in a higher-intensity motion in that space. This can be attributed to the fact that such methods pay attention to the fidelity of the reconstruction from the camera's view angle more than to the 3D temporal flow. On the other hand, the other deep architectures we train in this section are unable to capture the full facial motion with the same precision, with more fading flow around the cheeks and forehead.
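The AEPE reported throughout is the mean Euclidean distance between estimated and ground-truth flow vectors over the facial (foreground) pixels; a minimal NumPy version is shown below for reference.

```python
import numpy as np


def aepe(flow_pred, flow_gt, mask):
    """Average End Point Error over the facial (foreground) pixels.

    flow_pred, flow_gt : (H, W, 3) estimated / ground-truth 3D flow maps
    mask               : (H, W) boolean foreground mask
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel endpoint error
    return err[mask].mean()
```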
The aim of this experiment is to probe the performance of our framework in estimating the 2D optical facial flow between a pair of facial images, by keeping only the displacements produced at the output in the x and y directions while ignoring those in the z direction. We separate the comparisons in this section into two parts. Firstly, we evaluate our method against generic 2D flow approaches using the best performing trained models provided by the original authors of each. Secondly, we train the same architectures from scratch on the same datasets we train our framework on, namely the training splits of Face3DVid and 4DFAB, using a learning rate of 1e-4 that drops 5 times every 10 epochs. We keep the same network design as provided by each paper's authors and only train each network to minimise a masked loss composed of photo-consistency and data terms. The masked loss is computed with the help of a foreground (facial) mask for each reference image, provided with our utilised datasets. Table 2 presents the facial Average End Point Error (AEPE) values obtained by our proposed approach against other state-of-the-art optical flow prediction methods on the test splits of the Face3DVid and 4DFAB datasets. As can be noted from Table 2, our proposed method always achieves the smallest (best) AEPE values on both employed datasets. As expected, the AEPE values decrease when training the other methods on our datasets for the specific task of facial 2D flow estimation. However, our method still produces lower errors and outperforms the compared methods on this task. The 'ours depth' variant of our network comes as the second best performing method on both datasets. This variant was trained in a very similar manner to our original framework, but feeds the DFFNet with I^1, I^2 and only the z coordinates (last channel) of the PNCC image, ignoring the x and y coordinates (first two channels). Figure 6 demonstrates some qualitative results generated by the methods reported in Table 2, as well as ours. Please refer to the supplementary material for more information on the color coding followed for encoding the flow values.

Figure 5. Color-coded 3D flow estimations of random test pairs from the Face3DVid and 4DFAB datasets. Left-to-right: pair of input RGB images, Ground Truth, ours, ours depth, compared methods. For the color coding, see the Supp. Material.

Figure 6. Color-coded 2D flow estimations. Rows are random samples from the test splits of the Face3DVid and 4DFAB datasets and their 2D flow estimations. The first two columns of each row show the input pair of RGB images. For the color coding, see the Supp. Material.

Table 1. Comparison between our obtained 3D face flow results against state-of-the-art methods on the test splits of the 4DFAB and Face3DVid datasets. Comparison metric is the standard Average End Point Error (AEPE).

Method           | 4DFAB (↓) | Face3DVid (↓)
ITWMM [5]        | 3.43      | 4.1
DNSfM-3DMM [23]  | 2.8       | 3.9
FlowNetS [10]    | 2.25      | 3.7
FlowNetC [10]    | 1.95      | 2.425
FlowNet2 [18]    | 1.89      | 2.4
LiteNet [17]     | 1.5       | 2.2
ours depth       | 1.6       | 1.971
ours             |           |

Table 2. Comparison between our obtained 2D flow results against state-of-the-art methods on the test splits of the 4DFAB and Face3DVid datasets. Comparison metric is the standard Average End Point Error (AEPE). 'original models' refers to trained models provided by the authors of each, and 'trained from scratch' indicates that the same architectures were trained on the training sets of both Face3DVid and 4DFAB to estimate the 2D facial flow.

Method          | original models           | trained from scratch
                | 4DFAB (↓) | Face3DVid (↓) | 4DFAB (↓) | Face3DVid (↓)
FlowNetS [10]   | 1.832     | 5.1425        | 1.956     | 2.6
SpyNet [28]     | 1.31      | 3             | 1.042     | 1.5
FlowNetC [10]   | 1.212     | 2.6           | 1.061     | 1.498
UnFlow [25]     | 1.163     | 2.6553        | 1.055     | 1.45
LiteNet [17]    | 1.16      | 2.6           | 1.018     | 1.268
PWC-Net [32]    | 1.159     | 2.625         | 1.035     | 1.371
FlowNet2 [18]   | 1.15      | 2.6187        | 1.063     | 1.352
ours depth      | 0.99      | 1.176         | 0.99      | 1.176
ours            |           |               |           |
We further investigate the ability of our proposed framework to capture human facial 3D motion and to successfully employ it in a full-head reenactment application. Towards that aim, we use the recently proposed method of [37], which is in essence a general video-to-video synthesis approach mapping a source (conditioning) video to a photo-realistic output one. The authors of [37] train their framework in an adversarial manner and learn the temporal dynamics of a target video during training with the help of the 2D flow estimated by FlowNet2 [18]. In this experiment, we replace the FlowNet2 employed in [37] with our proposed approach and aid the generator and video discriminator in learning the temporal facial dynamics represented by our 3D facial flow. We firstly conduct a self-reenactment test as done in [37], where we divide each video into train/test splits (first two thirds vs last third) and report the average per-pixel RGB error between fake and real test frames. Table 3 reports the average pixel distance obtained for 4 different videos, with a separate model trained for each. The only difference between the second and third rows of Table 3 is the flow estimation method; everything else (structure, loss functions, conditioning, etc.) is the same. As can be noted from Table 3, our 3D flow better reveals the facial temporal dynamics of the training subject and assists the video synthesis generator in capturing these temporal characteristics, resulting in a lower error. In the second experiment, we perform a full-head reenactment test to fully transfer the head pose and expression from the source person to a target one. Figure 7 shows frames synthesised using our 3D flow and the 2D flow of FlowNet2.
Figure 7. Full-head reenactment using [37] combined with either FlowNet2 (second row) or our 3D flow approach (last row).
Looking closely at Figure 7, our 3D flow results in a more photo-realistic video synthesis, with highly accurate head pose, facial expression and temporal dynamics, while the manipulated frames generated with FlowNet2 fail to demonstrate the same fidelity. More details regarding this experiment are given in the supplementary material.
Table 3. Average RGB distance obtained under a self-reenactment setup on 4 videos (each with 1K test frames), using either FlowNet2 [18] or our facial 3D flow with the method of Wang et al. [37].

Video                      | 1    | 2    | 3    | 4
[37] + FlowNet2 (↓)        | 7.5  | 9.5  | 8.7  | 9.2
[37] + Ours (3D flow) (↓)  |      |      |      |
6. Conclusion and Future Work
In this work, we put forward a novel and fast framework for densely estimating the 3D flow of human faces from only a pair of monocular RGB images. The framework was trained on a very large-scale dataset of in-the-wild facial videos (Face3DVid) and fine-tuned on a 4D facial expression database (4DFAB [8]) with ground-truth 3D scans. We conducted extensive experimental evaluations showing that the proposed approach: a) yields highly-accurate estimates of 2D and 3D facial flow from a monocular pair of images and successfully captures complex non-rigid motions of the face, and b) outperforms many state-of-the-art approaches in estimating both the 2D and 3D facial flow, even when training the other approaches under the same setup and data. We additionally reveal the promising potential of our work in a full-head facial manipulation application that capitalises on our facial flow to produce highly faithful and photo-realistic fake facial dynamics, indistinguishable from real ones.

Acknowledgement
Stefanos Zafeiriou acknowledges support from EPSRC Fellowship DEFORM (EP/S010203/1).

References
[1] Thiemo Alldieck, Marc Kassubeck, Bastian Wandt, Bodo Rosenhahn, and Marcus Magnor. Optical flow-based 3d human motion estimation from monocular video. In German Conference on Pattern Recognition, pages 347–360. Springer, 2017.
[2] Tali Basha, Yael Moses, and Nahum Kiryati. Multi-view scene flow estimation: A view centered variational approach. International Journal of Computer Vision, 101(1):6–21, 2013.
[3] Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
[4] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3d morphable models. International Journal of Computer Vision, 126(2-4):233–254, 2018.
[5] James Booth, Anastasios Roussos, Evangelos Ververas, Epameinondas Antonakos, Stylianos Ploumpis, Yannis Panagakis, and Stefanos Zafeiriou. 3d reconstruction of "in-the-wild" faces in images and videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2638–2652, 2018.
[6] Fabian Brickwedde, Steffen Abraham, and Rudolf Mester. Mono-sf: Multi-view geometry meets single-view depth for monocular scene flow estimation of dynamic traffic scenes. In Proceedings of the IEEE International Conference on Computer Vision, pages 2780–2790, 2019.
[7] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision, pages 25–36. Springer, 2004.
[8] Shiyang Cheng, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 4dfab: A large scale 4d database for facial expression analysis and biometric applications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5117–5126, 2018.
[9] Jiankang Deng, Anastasios Roussos, Grigorios Chrysos, Evangelos Ververas, Irene Kotsia, Jie Shen, and Stefanos Zafeiriou. The menpo benchmark for multi-pose 2d and 3d facial landmark localisation and tracking. International Journal of Computer Vision, pages 1–26, 2018.
[10] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[11] Ravi Garg, Anastasios Roussos, and Lourdes Agapito. Dense variational reconstruction of non-rigid surfaces from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1272–1279, 2013.
[12] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. arXiv preprint arXiv:1902.05978, 2019.
[13] Vladislav Golyanik, Aman S Mathur, and Didier Stricker. Nrsfm-flow: Recovering non-rigid scene flow from monocular image sequences. In BMVC, 2016.
[14] Jia Guo, Jiankang Deng, Niannan Xue, and Stefanos Zafeiriou. Stacked dense u-nets with dual transformers for robust face alignment. In Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2019.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[16] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[17] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8981–8989, 2018.
[18] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.
[19] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519, 2017.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Mohammad Rami Koujan, Luma Alharbawee, Giorgos Giannakakis, Nicolas Pugeault, and Anastasios Roussos. Real-time facial expression recognition in the wild by disentangling 3d expression from identity. In . IEEE, 2020.
[22] Mohammad Rami Koujan, Michail Doukas, Anastasios Roussos, and Stefanos Zafeiriou. Head2head: Video-based neural head synthesis. In . IEEE, 2020.
[23] Mohammad Rami Koujan and Anastasios Roussos. Combining dense nonrigid structure from motion and 3d morphable models for monocular 4d face reconstruction. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, CVMP '18, pages 2:1–2:9, New York, NY, USA, 2018. ACM.
[24] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
[25] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] Etienne Mémin and Patrick Pérez. Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5):703–719, 1998.
[27] Jean-Philippe Pons, Renaud Keriven, and Olivier Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. International Journal of Computer Vision, 72(2):179–193, 2007.
[28] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4161–4170, 2017.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[30] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[31] Deqing Sun, Stefan Roth, and Michael J Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
[32] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
[33] Ravi Kumar Thakur and Snehasis Mukherjee. Sceneednet: A deep learning approach for scene flow estimation. In , pages 394–399. IEEE, 2018.
[34] Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 722–729. IEEE, 1999.
[35] Christoph Vogel, Konrad Schindler, and Stefan Roth. 3d scene flow estimation with a rigid motion prior. In , pages 1291–1298. IEEE, 2011.
[36] Shan Wang, Xukun Shen, and Jiaqing Liu. Dense optical flow variation based 3d face reconstruction from monocular video. In , pages 2665–2669. IEEE, 2018.
[37] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
[38] Andreas Wedel, Daniel Cremers, Thomas Pock, and Horst Bischof. Structure- and motion-adaptive regularization for high accuracy optic flow. In , pages 1663–1668. IEEE, 2009.
[39] Andreas Wedel, Clemens Rabe, Tobi Vaudrey, Thomas Brox, Uwe Franke, and Daniel Cremers. Efficient dense scene flow from sparse or dense stereo data. In European Conference on Computer Vision, pages 739–751. Springer, 2008.
[40] Stefanos Zafeiriou, Grigorios G Chrysos, Anastasios Roussos, Evangelos Ververas, Jiankang Deng, and George Trigeorgis. The 3d menpo facial landmark tracking challenge. In Proceedings of the IEEE International Conference on Computer Vision, pages 2503–2511, 2017.
[41] Youding Zhu and Kikuo Fujimura. 3d head pose estimation with optical flow and depth constraints. In