Pixel-Face: A Large-Scale, High-Resolution Benchmark for 3D Face Reconstruction
Jiangjing Lyu¹  Xiaobo Li¹  Xiangyu Zhu²  Cheng Cheng²
¹Alibaba Inc.  ²Institute of Automation, Chinese Academy of Sciences
{jiangjing.ljj, xiaobo.lixb}@alibaba-inc.com  [email protected]  [email protected]

Abstract
3D face reconstruction is a fundamental task that can facilitate numerous applications such as robust facial analysis and augmented reality. It remains challenging due to the lack of high-quality datasets that can fuel current deep learning-based methods: existing datasets are limited in quantity, realism and diversity. To circumvent these hurdles, we introduce Pixel-Face, a large-scale, high-resolution and diverse 3D face dataset with massive annotations. Specifically, Pixel-Face contains 855 subjects aged from 18 to 80. Each subject has more than 20 samples with various expressions, each composed of high-resolution multi-view RGB images and 3D meshes. Moreover, we provide precise landmark annotations and 3D registration results for every sample. To demonstrate the advantages of Pixel-Face, we re-parameterize the 3D Morphable Model (3DMM) into Pixel-3DM using the collected data. We show that the obtained Pixel-3DM is better at modeling a wide range of face shapes and expressions. We also carefully benchmark existing 3D face reconstruction methods on our dataset. Moreover, Pixel-Face serves as an effective training source: the performance of current face reconstruction models improves significantly, both on existing benchmarks and on Pixel-Face, after fine-tuning on our newly collected data. Extensive experiments demonstrate the effectiveness of Pixel-3DM and the usefulness of Pixel-Face.
1. Introduction
Monocular 3D face reconstruction is one of the most fundamental tasks in computer vision [?, ?, ?]. However, research on 3D face analysis is obstructed by several inherent challenges. First, obtaining ground-truth 3D annotations for in-the-wild images is both expensive and laborious: sophisticated devices such as Kinect are needed to capture raw 3D point clouds, and obtaining multi-modality 3D data requires complicated processing that includes multi-view scanning, depth generation, landmark annotation, point cloud fusion and 3D surface mesh generation. Second, current 3D face analysis methods rely heavily on a valid 3D Morphable Model to perform precise 3D face reconstruction. A 3DMM, however, is sensitive to the quantity and quality of its training data and can be easily affected by many factors such as age, gender and expression [?]. Third, because they are trained on synthetic 3D face datasets such as 300W-LP [?], most state-of-the-art 3D face reconstruction methods have limited capacity for representing real face shapes and expressions.

In the past decade, although several authentic 3D face datasets have been released, they all have non-negligible shortcomings. Early datasets such as Bosphorus [?] only provide low-precision 3D meshes. BFM [?] uses synthetic images, which leads to poor generalization of models trained on it. Follow-up datasets overcome these shortcomings but provide only limited annotations: Texas-3D [?] offers only depth information, FWH [?] provides 3DMM fitting results instead of the original 3D meshes, and the 3D annotations collected by MICC [?] are not paired with the 2D images. More recent works overcome the aforementioned drawbacks but are limited in data diversity: 3DFAW [?] only collects data from 26 identities with fixed neutral expressions, BP4D [?] only provides single-view 3D meshes, and the age distribution of FaceScape [?] is concentrated between 18 and 25.

In view of these shortcomings, and to further push forward research on 3D face reconstruction, we introduce Pixel-Face, a large-scale 3D face dataset with diverse samples and comprehensive annotations, as shown in Fig. 1. Compared with existing datasets, our new benchmark has several appealing properties. Quantity: Pixel-Face contains a training set with 655 identities and an evaluation set with 200 identities; each subject has over 20 images with paired 3D annotations under different expressions and multiple views. Quality: we use a high-precision trinocular structured light system (Ainstec) and a surface mesh generation method [?] to obtain high-quality 3D meshes with a resolution of . ± . mm [?]; besides, we perform multi-view fusion and registration to get aligned meshes based on pre-defined templates, such as 3DMM [?]. Diversity: the age of the subjects ranges from 18 to 80, and each subject has more than 10 expressions with 3D meshes and synchronized 2D images under multiple views. Availability: Pixel-Face will be made publicly available to the research community.

Figure 1.
3D samples of Pixel-Face.
In (a), we show several typical 3D samples with different genders, ages and expressions. In (b), we visualize the facial details preserved by the high-resolution 3D meshes.

To demonstrate the usefulness of Pixel-Face, we construct a new 3DMM, named Pixel-3DM, and conduct extensive experiments to compare it with previous 3DMM models. Facilitated by the high-quality and diverse annotations provided by Pixel-Face, Pixel-3DM surpasses all previous methods in representing precise face shapes. The comprehensive evaluation set provided by Pixel-Face enables us to rigorously benchmark the performance of existing 3D face reconstruction methods, revealing the pitfalls of methods trained on existing datasets: they have limited capability of reconstructing real 3D faces with various shapes and challenging attributes such as exaggerated expressions or uncommon ages. We believe this is due to the domain gap between authentic data and the synthetic datasets these models were trained on. Given this fact, we further finetune representative methods with the training set of Pixel-Face. After finetuning, the performance on both existing benchmarks and ours improves by 7%-30%, which demonstrates the effectiveness of Pixel-Face as a pre-training source.

In summary, the contributions of this work are three-fold. First, we build a large-scale 3D face dataset with carefully-collected training and evaluation sets; the dataset is composed of multi-view, high-resolution and diverse 2D face images with paired high-quality 3D annotations. Second, we construct Pixel-3DM, a more expressive new 3DMM trained with massive, diversely-distributed 3D face data; comparison with previous 3DMMs illustrates the strengths of Pixel-3DM in modeling face shapes and expressions. Third, we perform a comprehensive evaluation of existing methods on our benchmark and reveal several valuable observations; we also finetune representative methods with the training set of Pixel-Face, and experimental results demonstrate that the performance of current state-of-the-art methods can be significantly improved after finetuning.
2. Related Work
3D face reconstruction can facilitate many tasks such as face animation [?, ?], robust face recognition [?, ?] and human motion capture [?, ?, ?, ?, ?]. Despite its importance, 3D ground truth is unavailable for most in-the-wild 2D images, and the lack of paired 2D and 3D data hinders both the training and evaluation of 3D face reconstruction methods. To alleviate this problem, 3DDFA [?] builds a training dataset composed of 2D images and pseudo-3D meshes obtained from 3DMM fitting [?] and manual adjustment; an evaluation dataset named AFLW2000-3D [?] is built using the same method. However, the ambiguous topology of synthetic 3D data limits its capability of representing intricate faces. Afterwards, some authentic 3D evaluation datasets were released, such as the Florence 2D/3D hybrid face dataset (MICC) [?], the "Not quite in-the-Wild" dataset (NoW) [?], the BP4D dataset [?] and the FaceScape dataset [?]. However, these datasets often suffer from a lack of diversity in identities and attributes. As a result, it is still challenging to develop 3D face reconstruction methods that generate realistic 3D face meshes. A large-scale, high-resolution and multi-modality 3D face dataset with affluent annotations is required to deal with the above problems.

Figure 2.
Overview of Pixel-Face.
In (a), we show the gender, age and expression distribution of Pixel-Face, which covers balanced genders, a wide age range and massive expressions. In (b), we show other statistical information, including the training/evaluation split (655/200 subjects) and the amount of data.

A 3DMM is composed of facial shape and expression models. Given a face image, the corresponding 3D mesh can be reconstructed by fitting the coefficients of the 3DMM. Paysan et al. [?] construct the Basel Face Model (BFM) from 200 registered face meshes with neutral expressions. Gerig et al. [?] update the BFM model by adopting 100 additional individuals from the Binghamton University 3D Facial Expression Database (BU-3DFE) [?]. FaceWareHouse [?] is an elaborate expression model constructed from 150 subjects with over 20 expressions each. FLAME [?] is constructed from the CAESAR dataset [?] with low-resolution 3D face meshes. In general, most current 3DMMs are constructed from small 3D datasets with fewer than 200 subjects; they tend to suffer from low precision and monotonous expressions, so their generalization capability in real applications cannot be guaranteed. Taking advantage of Pixel-Face, we construct a new 3DMM, Pixel-3DM, which is more accurate and reliable in face representation.
3. The Pixel-Face Dataset
We contribute Pixel-Face, a large-scale 3D face dataset with affluent annotations. Pixel-Face has several appealing properties. First, it is the largest high-fidelity 3D face dataset: it contains over 24,000 multi-modality samples collected from 855 subjects under different views, and each sample contains both RGB images and 3D meshes with corresponding face landmarks. The full 3D faces obtained from full-view fusion are also provided. The statistical information of Pixel-Face is shown in Fig. 2 (b). The high-resolution 3D meshes have advantages in preserving the details of authentic faces, as shown in Fig. 1 (b). Second, Pixel-Face offers manually annotated facial landmarks for each face mesh; these landmarks can aid tasks including multi-view fusion, 3D mesh registration and 3D face reconstruction. Third, the subjects cover balanced genders, a wide age range and various expression distributions. The distribution of different attributes is shown in Fig. 2 (a); more details of the expressions are provided in the supplementary material. The comparison between different datasets shown in Tab. 1 reveals that Pixel-Face surpasses existing datasets in terms of scale, quality of annotations and diversity of views.
Fig. 3 demonstrates the pipeline for collecting 3D face data. From 855 diverse subjects, we collect over 24,000 raw 3D point clouds with a high resolution of . ± . mm using a self-customized trinocular structured light system [?].

Multi-View Scanning.
To avoid the collected data being corrupted by self-occlusion, we set up three camera groups surrounding the subject's head to cover 270 degrees. Following settings similar to [?], each camera group contains one RGB camera, one MEMS projector and two infrared cameras. We accomplish the whole scanning process using N-step phase shifting [?], a state-of-the-art 3D scanning method that captures 3D point clouds with pixel-wise resolution and effectively alleviates the influence of varied surface reflectivity. After scanning, we simultaneously acquire 2D images and the corresponding 3D point clouds. The average processing time for each sample is less than 300 ms.
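For concreteness, a minimal sketch of the N-step phase-shifting decode is given below. It uses the textbook formula for sinusoidal fringes shifted by 2π/N each (assuming N ≥ 3), not the authors' implementation; the array names are illustrative.

```python
import numpy as np

def wrapped_phase(images):
    """Per-pixel wrapped phase from N >= 3 phase-shifted fringe images.

    Assumes the k-th image satisfies I_k = A + B * cos(phi + 2*pi*k/N).
    """
    n = len(images)
    deltas = 2.0 * np.pi * np.arange(n) / n
    num = sum(img * np.sin(d) for img, d in zip(images, deltas))
    den = sum(img * np.cos(d) for img, d in zip(images, deltas))
    return np.arctan2(-num, den)  # wrapped to (-pi, pi]; unwrapping precedes triangulation
```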
3D Face Landmarks Annotation.
To obtain 3D facial landmarks, directly annotating landmarks on the raw point clouds is too time-consuming. Therefore, we apply a retrieval-based method, detailed below.

Table 1. Comparing Pixel-Face with other authentic 3D face datasets. Pixel-Face has advantages in most aspects. Lms., Exp. and Vert. are abbreviations for the number of annotated facial landmarks, the categories of expressions and the number of vertices, respectively. (Cells lost in extraction are marked "–".)

Dataset         Sub. Num   Image Num   3D Mesh Num   Lms. Num   Exp. Num   View     Camera      Vert. Num
Bosphorus [?]   105        4666        4666          24         –          Single   Mega        35k
BFM [?]         200        synthetic   200           68         Neutral    Single   ABW-3D      50k
FWH [?]         150        3000        3DMM          74         20         Single   Kinect v1   20k
MICC [?]        53         53          203           51         –          –        –           –
3DFAW [?]       26         26          26            51         Neutral    Single   DI3D        20k
BP4D [?]        41         328         328           84         8          Single   3DMD        70k
FaceScape [?]   359        400,000     7120          106        20         Multi    DSLR        2m
Pixel-Face      855        24,525      24,525        106        22         Multi    Ainstec     100k
Figure 3.
Data collecting pipeline.
The pipeline is composed of multi-view scanning, facial landmark annotation, texture mapping, multi-view fusion, surface mesh generation and depth generation. High-resolution multi-modality data and comprehensive annotations are obtained by this elaborately designed pipeline.

Firstly, we manually annotate 106 2D facial landmarks for each 2D image; the facial landmarks are defined as in [?]. For each vertex in the point cloud, we apply a texture mapping method [?] to calculate the corresponding coordinates on the 2D image. To find the corresponding 3D landmark for each annotated 2D facial landmark, we calculate the distance between the 2D landmark and the projected 2D coordinates of each 3D vertex in the point cloud; the nearest 3D vertex is retrieved as the corresponding 3D landmark.
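As a concrete illustration, the retrieval step can be written in a few lines; here `project_to_image` is a hypothetical stand-in for the texture-mapping projection above, and the nearest-neighbor search uses a KD-tree.

```python
import numpy as np
from scipy.spatial import cKDTree

def retrieve_3d_landmarks(vertices, lms_2d, project_to_image):
    """Match each annotated 2D landmark to the vertex whose projection is nearest."""
    verts_2d = project_to_image(vertices)      # (N, 2) projected vertex coordinates
    _, idx = cKDTree(verts_2d).query(lms_2d)   # nearest projected vertex per landmark
    return vertices[idx], idx                  # (106, 3) 3D landmarks and their indices
```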
Multi-View Fusion.
To obtain full-view point clouds, we employ an improved coarse-to-fine Iterative Closest Point (ICP) method [?] to fuse the captured 3D point clouds from three views (left, middle and right). To get a coarse result, we use the corresponding 3D landmarks to calculate the transformation between point clouds in different views: we set the middle mesh as the pivot and coarsely align the left and right point clouds to it by calculating rigid transformation matrices (see the sketch below). The coarse fusion results have limitations in surface smoothness, seamless integration between edges and precision of details. Therefore, we further refine the fusion results by iteratively calculating a transformation matrix for each vertex [?]. Finally, we merge overlapping vertices and omit isolated vertices. After fusion, we obtain full-view 3D point clouds and complete 3D landmarks. The corresponding 3D meshes are obtained from the point clouds by a Centroidal Voronoi Tessellation (CVT) based method [?]; the resulting meshes serve as the full 3D faces.
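A minimal sketch of the coarse, landmark-based alignment is shown below, using the classic Kabsch/Procrustes SVD solution; it assumes the views share a common set of annotated 3D landmarks, and the function names are illustrative rather than the authors' code.

```python
import numpy as np

def rigid_align(src_lms, dst_lms):
    """Least-squares rigid transform (R, t) such that R @ src + t ~ dst."""
    src_c, dst_c = src_lms.mean(axis=0), dst_lms.mean(axis=0)
    H = (src_lms - src_c).T @ (dst_lms - dst_c)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Coarse fusion: map the left/right clouds into the middle (pivot) frame, e.g.
# R, t = rigid_align(left_landmarks, middle_landmarks)
# left_aligned = left_points @ R.T + t
```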
3D Point Cloud to Depth.
To enable more tasks such as monocular depth prediction, we provide a depth image for each sample. We use structured light and triangulation [?] to calculate a depth value for each vertex in the 3D point cloud. Each vertex has a corresponding coordinate in the 2D image, so the depth information can be projected onto a 2D depth image; in this way, a depth image for each 3D point cloud is obtained (a projection sketch is given below).

Besides the 3D landmarks mentioned above, Pixel-Face offers semantic annotations [?] including gender, age and expression. To obtain these annotations, we first collect information such as gender and age from each subject. Then we ask each subject to perform 22 pre-defined expressions adopted from FaceWareHouse [?]. The distributions of gender, age and expression are shown in Fig. 2. The Pixel-Face dataset has both a balanced gender distribution and a wide age range; besides, each subject contains rich expressions.
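Returning to the point-cloud-to-depth projection described above, a minimal sketch under a pinhole camera model follows; the intrinsics (fx, fy, cx, cy) and image size are illustrative assumptions, since the paper's calibration is not specified here.

```python
import numpy as np

def points_to_depth(points, fx, fy, cx, cy, h, w):
    """Z-buffered projection of an (N, 3) point cloud to an (h, w) depth image."""
    depth = np.full((h, w), np.inf)
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    front = Z > 0                                  # keep points in front of the camera
    u = np.round(fx * X[front] / Z[front] + cx).astype(int)
    v = np.round(fy * Y[front] / Z[front] + cy).astype(int)
    z = Z[front]
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        if zi < depth[vi, ui]:                     # nearest vertex wins on collisions
            depth[vi, ui] = zi
    depth[np.isinf(depth)] = 0.0                   # background pixels
    return depth
```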
4. Constructing Pixel-3DM
To demonstrate the usefulness of Pixel-Face and facilitate future research, we use the obtained face meshes to construct a new 3DMM, named Pixel-3DM, which contains both a facial shape model and a meticulous expression model. To obtain the new 3DMM, we first register the initial 3D meshes to a 3D template with an improved two-stage algorithm, and then use the registration results to calculate the PCA bases of the 3DMM. Fig. 4 (a) illustrates the registration process and shows an example registration result. Fig. 4 (b) visualizes how Pixel-3DM can be flexibly driven by the face shape and expression bases.
3D face registration aims to align arbitrary 3D meshes with a pre-defined mesh template, so that the registration results have a consistent topology. In this subsection, we first give a brief introduction to the 3DMM and then discuss the methods used for registration.
A 3D Morphable Model (3DMM) [?] represents any 3D face mesh M as Eq. 1:

$M = \bar{M} + \sum_i \alpha_i U_i + \sum_j \beta_j E_j$,   (1)

where $\bar{M}$ is the mean shape, and U and E refer to the orthonormal basis matrices whose columns are the shape and expression eigenvectors computed by PCA. The $\alpha_i$ are the shape coefficients and the $\beta_j$ are the expression coefficients; the combination of these terms determines a specific instance under the given 3DMM.
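A minimal sketch of Eq. 1 is given below, assuming the bases are stored as matrices whose columns are flattened (3N,) eigenvectors; the variable names are illustrative.

```python
import numpy as np

def instantiate_3dmm(mean_shape, shape_basis, exp_basis, alpha, beta):
    """Eq. 1: M = mean + U @ alpha + E @ beta, reshaped to an (N, 3) vertex array."""
    verts = mean_shape + shape_basis @ alpha + exp_basis @ beta
    return verts.reshape(-1, 3)

# Example: a random face drawn from hypothetical coefficient priors.
# alpha = np.random.randn(199) * shape_std   # 199 shape coefficients (Section 4)
# beta  = np.random.randn(99) * exp_std      # 99 expression coefficients
```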
Registration.
To register the obtained 3D face meshes to the 3DMM, we adopt a two-stage algorithm that combines the strengths of two popular registration methods, ICP [?, ?] and NICP [?]. The goal of 3D face registration is to fit a template mesh $S = (V_S, F_S)$ to a template-free mesh $T = (V_T, F_T)$, where V and F refer to the 3D vertices and faces, respectively. After registration R, T can be represented in the form of S as shown in Eq. 2:

$T \approx R(S) = (R(V_S), F_S)$,   (2)

where $R(V_S)$ refers to the relocated vertices and $F_S$ refers to the mesh faces defined by S. A good registration method should guarantee that R(S) is close enough to T: specifically, R(S) is required to represent the facial expression, head pose and identity of T precisely, and to be robust to challenging cases such as exaggerated expressions or missing data.
Figure 4.
Overview of Pixel-3DM.
At the top of (a), we show an example of an input mesh and its registration result, which preserves the facial shape and expression information while sharing a consistent topology with the template; details are shown in the colored box. The bottom left of (a) illustrates the face parsing and the spatially-varying $\lambda_p$ defined in Eq. 3. The registration iteration is shown at the bottom right of (a). In (b), we show several typical shape and expression bases of Pixel-3DM.

In the first stage of registration, we apply the same method used in Section 3.1 to fuse point clouds. To get the ICP result $R_{icp}(S)$, we first estimate the transformation matrix between the landmarks of T and S, and then use the obtained transformation matrices to transform T to S. Since the ICP-based method only generates coarse meshes and cannot handle subtle face details, we further deploy a spatially-varying NICP as the second stage to refine the detailed meshes.

The conventional NICP-based method [?] does not contain a valid stiffness constraint. As a result, the transformation matrices calculated by NICP are under-constrained and prone to dislocations of points: for example, points of the nose may move to the cheek, and different points may occupy the same position. To resolve this problem, we adopt a spatially-varying deformation method. We manually segment the face into several parts P according to both semantic information and spatial location; each part p has a corresponding surface $T_p$. Then we calculate a transformation matrix for each face vertex. The cost function is defined as Eq. 3:

$\sum_{p \in P} \sum_{i \in p} w_i^p \,\mathrm{dist}(T_p, X_i^p v_i^p) + \lambda_p \sum_{\{i,j\} \in E} \| X_i^p - X_j^p \|$,   (3)

where $v_i^p$ refers to a vertex in $R_{icp}(S)$ and $X_i^p$ is its corresponding transformation matrix. The first term governs registration accuracy; $w_i^p$ refers to the importance weight of each vertex (set to a fixed value in practice). We calculate the Euclidean distance of a vertex in $R_{icp}(S)$ to its closest counterpart in T; this distance is denoted dist(T, v). The second term is the stiffness regularization, where E refers to a small region (in practice, we set it to be a unit sphere), and $\lambda_p$ is the trade-off weight balancing the flexibility and stiffness of the deformation: a higher $\lambda_p$ corresponds to a stiffer restriction. Since different parts of the face have specific surface curvatures, $\lambda_p$ is set to a specific value for each part. For example, the surface of the cheek is smoother than the nose, so the transformation of points on the nose tends to be more intense and leads to more dislocations. The part division of faces and the corresponding values of $\lambda_p$ are shown in Fig. 4 (a). After minimizing Eq. 3 with a least squares algorithm, we obtain an optimized $X_i^p$ for each face vertex $v_i^p$; in the end, each vertex is transformed accordingly.

We follow the general process of constructing a 3D morphable model [?] to build Pixel-3DM. We concatenate over 600 registration results with neutral expressions into a facial shape matrix. $\bar{M}$ is set to be the mean of this facial shape matrix. The shape model U is composed of 199 PCA components covering more than 99% of the variance observed in the facial shape matrix. To obtain the expression model E, we use over 6000 registration results with various expressions.
For each sample, we compute its residual to the corresponding registration result with a neutral expression, and then concatenate these residuals to form an expression residual matrix. The expression model E is composed of 99 components explaining more than 99% of the variance observed in the expression residual matrix.
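A minimal sketch of this basis construction is given below, assuming the registered meshes are stacked as rows of flattened (3N,) vertex arrays; SVD of the centered data matrix yields the PCA eigenvectors, truncated at 99% explained variance as described above.

```python
import numpy as np

def pca_basis(data, variance=0.99):
    """Mean and leading principal components (as columns) of row-stacked data."""
    mean = data.mean(axis=0)
    _, s, Vt = np.linalg.svd(data - mean, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)        # explained-variance ratio
    k = int(np.searchsorted(ratio, variance)) + 1     # smallest k reaching the target
    return mean, Vt[:k].T

# Shape model: ~600 neutral registrations -> mean face and U (199 components).
# Expression model: residuals (expressive - neutral) -> E (99 components).
```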
5. Experiments
We build benchmarks out of Pixel-Face for evaluating 3D face reconstruction methods. The task of 3D face reconstruction is to predict a 3D mesh taking a 2D image as input. There is a wealth of previous work [?, ?, ?] on 3D face reconstruction; in this paper, we choose the three most representative kinds of methods and evaluate their performance on the newly-obtained Pixel-Face. Detailed evaluation results are reported in Section 5.3.

3DMM Fitting. With an optimization objective based on facial landmarks [?] and the 3DMM assumption in Eq. 1, the 3DMM fitting method [?] formulates 3D face modeling as an optimization problem that fits the 3DMM coefficients. Since these optimization-based methods do not require training, we can directly apply them to our Pixel-3DM (a fitting sketch is given below).
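A minimal sketch of such landmark-based fitting as nonlinear least squares is shown below; `project` and `lm_idx` (the camera projection and the mesh indices of the 106 landmarks) are hypothetical stand-ins, and the residual design is illustrative rather than the exact method of [?].

```python
import numpy as np
from scipy.optimize import least_squares

def fit_residuals(params, model, lm_idx, lms_2d, project):
    """Landmark reprojection residuals plus a Tikhonov prior on the coefficients."""
    alpha, beta = params[:199], params[199:]
    verts = (model.mean + model.U @ alpha + model.E @ beta).reshape(-1, 3)  # Eq. 1
    pred_2d = project(verts[lm_idx])   # hypothetical weak-perspective projection
    prior = 1e-3 * params              # regularizer keeping the fit near the mean face
    return np.concatenate([(pred_2d - lms_2d).ravel(), prior])

# fit = least_squares(fit_residuals, np.zeros(199 + 99),
#                     args=(model, lm_idx, lms_2d, project))
```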
Coefficient Regression Model. Different from 3DMM fitting, these methods [?, ?, ?] use deep convolutional neural networks (DCNNs) to directly regress the model coefficients; such models must be re-trained if the 3D bases change. In our experiment, we evaluate two such methods on Pixel-Face, namely 3DDFA [?] and RingNet [?].
Dense Map Model. These methods [?, ?] directly predict dense 3D reconstructions, such as a UV position map [?], from input 2D images; a DCNN often serves as the backbone. We evaluate PRNet [?] in our experiment.

Table 2. We quantitatively compare Pixel-3DM with BFM17 [?] and FWH [?] on the evaluation set of Pixel-Face, under a fixed crop radius r. Pixel-3DM surpasses BFM17 and FWH in both ARMSE and NME. (Numeric entries were lost in extraction and are marked "–".)

3DMM Basis                  NME   ARMSE
BFM17 [?] + FWH [?]         –     –
BFM17 [?] + Pixel-3DM Exp   –     –
Pixel-3DM Shape + FWH [?]   –     –
Pixel-3DM                   –     –

Data.
We mainly use the newly-built Pixel-Face to conduct the experiments. Pixel-Face is composed of 855 subjects with more than 24,000 multi-view samples; each sample consists of a high-resolution RGB image, a high-quality 3D mesh and 3D landmark annotations. We use 75% of Pixel-Face for training and the rest for evaluation (validation + test), as shown in Fig. 2 (b). We pre-train Pixel-3DM and fine-tune the other methods on the training set. Besides the evaluation set of Pixel-Face, a subset of the BP4D dataset [?] is also used in Section 5.3. BP4D is a 3D expression dataset containing 41 identities, each of which performs about 8 expression tasks, with a paired 2D/3D scanning sequence for each task. To remove redundant information from adjacent frames, we randomly sample one pair of 2D/3D data from each sequence. After sampling, we obtain a subset containing 328 2D/3D data pairs with different expressions from 41 identities.

Evaluation Metrics.
We use NME and ARMSE as the evaluation metrics in our experiments. The Normalized Mean Error (NME) is defined as the average of landmark errors normalized by the bounding box size [?]. The Average Root Mean Square Error (ARMSE) [?] is employed to evaluate the similarity between reconstructed 3D meshes and ground-truth meshes. Following the settings of [?], we first normalize the interocular distance of the ground truth to 1, and then align the reconstructed 3D meshes to the ground truth by facial landmarks. The origin is set to the nose tip; given a crop radius r, we discard vertices whose distance from the nose tip is greater than r. The ARMSE computes the closest point-to-mesh distance between the ground-truth and reconstructed 3D meshes and vice versa. In our experiments, r ranges from 0.6 to 1.
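For reference, a minimal sketch of both metrics follows. The ARMSE here approximates the point-to-mesh distance with a symmetric nearest-neighbor (point-to-point) distance via KD-trees, which is a common simplification; array shapes and names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def nme(pred_lms, gt_lms, bbox_size):
    """Mean landmark error normalized by the bounding-box size."""
    return np.linalg.norm(pred_lms - gt_lms, axis=1).mean() / bbox_size

def armse_approx(pred_verts, gt_verts, nose_tip, r):
    """Symmetric nearest-neighbor RMSE inside the crop radius r around the nose tip."""
    pr = pred_verts[np.linalg.norm(pred_verts - nose_tip, axis=1) <= r]
    gt = gt_verts[np.linalg.norm(gt_verts - nose_tip, axis=1) <= r]
    d_pr2gt, _ = cKDTree(gt).query(pr)   # prediction -> ground truth
    d_gt2pr, _ = cKDTree(pr).query(gt)   # ground truth -> prediction
    rmse = lambda d: np.sqrt(np.mean(d ** 2))
    return 0.5 * (rmse(d_pr2gt) + rmse(d_gt2pr))
```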
Figure 5. In (a), we plot the curves of overall ARMSE and NME scores for three methods, namely 3DDFA [?] (blue), PRNet [?] (orange) and RingNet [?] (grey); the visualization of faces under different crop radii r is shown at the bottom of (a). In (b), we plot the curves of ARMSE and NME over different age ranges. In (c), we plot the curves of ARMSE and NME over different expression categories. The results show that neutral expressions and common ages are handled more easily than the other sub-groups, revealing the limitations of previous 3D face datasets.

Table 3. We quantitatively evaluate the performance of PRNet [?] with and without finetuning on our data. The results show that our data improves the reconstruction capability of PRNet on both the subset of BP4D [?] and the validation set of Pixel-Face. (Numeric entries were lost in extraction and are marked "–".)

                 NME                            ARMSE
Dataset          w/o finetune   with finetune   w/o finetune   with finetune
BP4D [?]         –              –               –              –
Pixel-Face       –              –               –              –

This section provides qualitative and quantitative evaluations of different methods on our benchmarks. We first demonstrate the modeling capability of Pixel-3DM by comparing it with other representative 3DMMs on the evaluation set of Pixel-Face, using the same optimization method. Then we benchmark state-of-the-art reconstruction models on Pixel-Face. Finally, to demonstrate the generalization ability of Pixel-Face, we finetune PRNet [?] on the training set of Pixel-Face and evaluate it on both the evaluation set of Pixel-Face and a subset of BP4D [?].

The Superiority of Pixel-3DM.
To compare the modeling capability of Pixel-3DM with previous 3DMMs, we apply the same 3DMM fitting method [?] to the different 3DMMs and verify their effectiveness on Pixel-Face. Specifically, we compare Pixel-3DM with the face shape bases from BFM17 [?] and the expression bases from FWH [?]. The results are listed in Tab. 2: Pixel-3DM consistently outperforms BFM [?] and FWH [?] on both shape and expression modeling, which validates the modeling capability of Pixel-3DM.

Benchmarking Results on Pixel-Face.
We evaluate several state-of-the-art methods on Pixel-Face, including 3DDFA [?], PRNet [?] and RingNet [?].

Figure 6. We qualitatively compare 3DDFA [?], PRNet [?] and RingNet [?] on the evaluation set of Pixel-Face, showing both the reconstructed 3D meshes and error maps. The results demonstrate that previous models trained with synthetic data tend to predict mean shapes and neutral expressions, and fall short in modeling authentic face shapes and various expressions.
Figure 7. We qualitatively compare the performance of PRNet [?] with and without finetuning on our dataset. The results show that our data can effectively promote a model's capacity for modeling various face shapes and expressions.

We take NME and ARMSE as metrics and evaluate different subsets of Pixel-Face divided by expression and age. Limited by space, we roughly divide the 22 expressions into three categories (neutral, positive and negative) in terms of emotion, and split the ages into 4 non-overlapping subsets. Fig. 5 summarizes the performance of the different methods on the different expression and age subsets, and Fig. 6 shows some qualitative results. Several valuable observations are revealed by the experimental results: 1) although some synthetic 3D datasets [?] contain verisimilar 3D faces and diverse attributes, the models trained on them have limited capability of modeling real 3D faces; 2) neutral expressions and common ages are handled more easily than uncommon ones.

Pixel-Face as an Effective Training Source.
We are curious about how well our data can promote other methods' performance on both Pixel-Face and previous datasets. In addition to the evaluation set of Pixel-Face, we use the subset of the BP4D dataset [?] with 328 2D/3D data pairs covering different expressions. We finetune PRNet with the training set of Pixel-Face; the 3D ground truth for each 2D image is generated by the preprocessing method provided by PRNet. Tab. 3 lists the experimental results of the models with and without finetuning on Pixel-Face, and Fig. 7 shows qualitative results. The results show that PRNet generates better-aligned 3D meshes after being finetuned on our dataset, and that our data further improves the performance of PRNet in representing both face shape and expression. Specifically, ARMSE drops by 30% and NME drops by 28% on the evaluation set of Pixel-Face. Furthermore, the 7% drop in ARMSE and the 7% drop in NME on the subset of BP4D demonstrate the generalization ability of Pixel-Face.
6. Conclusion