Facial Pose Estimation by Deep Learning from Label Distributions
Zhaoxiang Liu, Zezhou Chen, Jinqiang Bai, Shaohua Li, Shiguo Lian
Zhaoxiang Liu, CloudMinds, robin.liu@cloudminds.com
Zezhou Chen, CloudMinds, zezhou.chen@cloudminds.com
Jinqiang Bai, Beihang University, baijinqiang@buaa.edu.cn
Shaohua Li, CloudMinds, shaohua.li@cloudminds.com
Shiguo Lian, CloudMinds, sg_lian@cloudminds.com
Abstract
Facial pose estimation has attracted considerable attention in many practical applications, such as human-robot interaction, gaze estimation and driver monitoring. Meanwhile, end-to-end deep learning-based facial pose estimation is becoming more and more popular. However, facial pose estimation suffers from a key challenge: the lack of sufficient training data for many poses, especially for large poses. Inspired by the observation that faces under close poses look similar, we reformulate facial pose estimation as a label distribution learning problem, considering each face image as an example associated with a Gaussian label distribution rather than a single label, and construct a convolutional neural network which is trained with a multi-loss function on the AFLW and 300W-LP datasets to predict facial poses directly from color images. Extensive experiments are conducted on several popular benchmarks, including AFLW2000, BIWI, AFLW and AFW, where our approach shows a significant advantage over other state-of-the-art methods.
1. Introduction
Facial pose estimation has received more and more attention in the past few years [17, 28, 55, 41, 48, 47, 27, 8, 51, 56, 2, 53, 34, 4, 22]. It plays an important role in many practical applications such as driver monitoring [17, 28], human-robot or human-computer interaction [55, 41, 48, 47, 50, 25, 7], gaze estimation [27, 8, 51, 56], human behavior analysis [2], face alignment [53, 6] and face recognition [5]. All of these unconstrained scenarios require a facial pose estimator that is resistant to environmental variations (e.g., occlusion, pose, illumination and resolution variations).

Though some good results have been achieved by using commercial depth cameras [12], one limitation that cannot be neglected is that depth cameras do not work well in uncontrolled environments where sunlight or ambient light is strong, and they often need more space and more power compared to a monocular RGB camera. These issues impede their feasibility in real-world applications [40, 34].

Traditionally, facial pose can be computed by estimating some facial key-points from the target face and solving the 2D-to-3D correspondence with a mean 3D head model. Though facial key-point estimation has recently been greatly improved by deep learning [4], such facial pose estimation is inherently a two-step process which is error-prone. The accuracy of the pose estimate depends on the quality of the key-points as well as the 3D head model. If the localized key-points are inaccurate or inadequate, the pose estimate becomes poor or pose estimation may even become infeasible. Additionally, generic 3D head models can also introduce errors for any given individual, and deforming the head model to adapt to each individual demands significant amounts of data and computation.

Recently, it has become more popular to estimate facial pose end to end using deep learning due to its robustness to environmental variations.
Deep learning-based methods have large advantages over traditional landmark-to-pose methods, as they always output a pose prediction and do not rely on landmark detection and a 3D head model. However, deep learning-based facial pose estimation has not been thoroughly investigated overall. In some cases, facial pose estimation is just one branch of a multi-task network for face analysis, used to improve the performance of other tasks (e.g., face detection, key-point localization and gender recognition), and the facial pose branch was not designed dedicatedly in terms of accuracy. Some other deep learning-based methods have dedicatedly addressed facial pose estimation as a pose regression from the image [34, 22, 37, 36] using convolutional neural networks (CNN), while the work in [40] has concluded that the combination of binned classification and
regression works better than regression alone. However, all these deep learning-based methods ignore an important fact: the distribution of training samples is quite imbalanced, and there are not sufficient training samples for large poses. To varying degrees, the most popular datasets for facial pose estimation, such as AFLW [21] and 300W-LP [59], exhibit this problem (as shown in Fig. 1), which can degrade the accuracy of the pose estimate, especially for large poses. We argue that it is unreasonable to use a soft-max cross-entropy loss for facial pose estimation when training samples are considerably imbalanced, and that the accuracy of facial pose estimation still has the potential to be improved further. Because such losses ignore the similarity between adjacent poses, not taking the relationship between adjacent poses into consideration, another appropriate constraint should be introduced into the loss function.

Figure 1. The sample distributions (number of pictures per degree of roll, pitch and yaw angle) of two popular datasets: 300W-LP [59] (first row) and AFLW [21] (second row). We can see that most faces lie in the area of small poses.

To this end, we reformulate facial pose estimation as a label distribution learning problem and introduce a more intuitive similarity constraint, a Gaussian label distribution loss, into the training for facial pose estimation to improve the accuracy. The main contributions of our work can be summarized as follows:

• We reveal the fact that a lack of sufficient training samples exists in the popular facial pose datasets, and we explain why it is not optimal to use a soft-max cross-entropy loss for facial pose estimation in this situation.

• We introduce a novel Gaussian label distribution loss into the training for facial pose estimation. The Gaussian label distribution loss constrains the similarities between neighbouring poses, can effectively mitigate the insufficiency of training samples, and dramatically boosts the accuracy of facial pose estimation.

• We demonstrate the effectiveness of our method in facial pose estimation through various comparative experiments. Trained on publicly available datasets, such as the AFLW [21] and 300W-LP [59] datasets, our method achieves state-of-the-art results on the AFLW2000 [59], BIWI [10], AFLW [21] and AFW [35] benchmarks.
2. Related Works
So far a variety of efforts have been dedicated to facial pose estimation. These methods can be divided depending on whether they use a 2D camera or a depth camera. Since our work is concerned with deep learning-based methods using RGB images from a monocular camera, methods using depth cameras will not be considered here. A more detailed description of depth camera-based methods can be found in a recent survey [30] and other previous works [27, 12, 3, 11, 54].

Some early classic studies [32, 43, 31] can be categorized as appearance template methods, which match a view of a person's face to a set of exemplars with corresponding pose labels in order to find the most similar view. For example, the method in [31] adopts a support vector machine (SVM) to model the appearance of human faces across multiple views and performs pose estimation using nearest-neighbor matching. However, appearance template methods suffer from some limitations: they can only estimate discrete poses without the use of some interpolation method, they suffer from accuracy concerns when the facial region is not localized accurately, and from efficiency concerns when the exemplar set is very large.

Face detector arrays [19, 57], whose idea is to train multiple face detectors for different facial poses, once became popular following the success of frontal face detection [49, 33, 39]. The method in [57] uses a sequence of five multi-view face detectors to estimate facial pose. Evidently, one face detector is required for each corresponding discrete facial pose, and it is difficult to implement a real-time facial pose estimator with a large detector array.

Facial pose estimation can also be formulated as a manifold embedding problem, in which the high-dimensional face image is embedded into a low-dimensional manifold where facial pose is estimated. Any dimensionality reduction technique can be considered part of the manifold embedding category.
The methods in [42, 52] project a face image into a PCA or KPCA subspace and compare the result to a set of embedded templates. The method in [38] uses Isometric Feature Mapping (Isomap) to embed a face image into a nonlinear manifold which represents the pose-varying faces. These approaches ignore the pose labels that are available during training and operate in an unsupervised fashion. As a result, the built manifolds describe not only the pose variations but also identity variations [1]. The method in [46] utilizes the feature correspondence of identity-invariant geometric features to learn a similarity kernel that reflects only the pose variation, ignoring other sources of variation. This method shows good reliability on benchmark datasets; however, further research is still needed to achieve state-of-the-art performance.

Facial pose estimation can be naturally formulated as a nonlinear regression problem which learns a nonlinear mapping from images to poses. The methods in [23, 26, 29] adopt a support vector regressor (SVR) to estimate the facial pose after a series of preprocessing steps, including face region cropping, Sobel filtering, PCA [23], a priori knowledge-based linear projection [26], or localized gradient orientation histograms [29]. The methods in [45, 44] utilize a multilayer perceptron (MLP) to regress the facial pose. These methods have the disadvantage that they are prone to errors from poor face localization. Recently, thanks to the great success of deep learning techniques, it has become popular to estimate facial pose using CNNs, which are robust to shift, scale and distortion. The method in [34] presents an in-depth study of CNNs trained on the AFLW dataset using an L2 regression loss and tested on the Prima, AFLW and AFW datasets. The method in [22] proposes a GoogLeNet-based architecture trained on the AFLW dataset which can predict the key-points and facial pose jointly, and reports pose results on the AFLW dataset and the AFW [35] dataset.
An L2 Euclidean loss function is adopted to train the pose predictor, which is used to improve key-point localization. The method in [53] also trains a pose estimator using the 300W dataset to assist face alignment. Both the method in [37] and the method in [36] build a multi-task learning framework for face analysis, including face detection, face alignment, face recognition, pose estimation, age prediction, gender recognition and smile detection. Both methods utilize the AFLW dataset to train pose regressors, and pose results are also reported on the AFLW and AFW datasets. The method in [40] makes an extensive study of the combination of classification loss and regression loss on benchmark datasets, including the 300W-LP, AFLW, BIWI [10] and AFW datasets, and concludes that the combination of binned classification and regression works better than regression alone. However, all these deep learning-based methods pay no attention to the lack of sufficient training data for many poses. Consequently, the performance of the facial pose estimator is limited. This motivates us to seek a better solution in this paper.

Label distribution learning (LDL) is a novel machine learning paradigm recently proposed for facial age estimation [16, 24]. LDL is based on the observation that age is ambiguous and that faces with adjacent ages are strongly correlated. The main idea of LDL is to utilize adjacent ages when learning a particular age: a label distribution covers a number of class labels, representing the degree to which each label describes the instance. Hence, LDL is able to deal with insufficient and incomplete training data. Some other problems which share the same characteristics as facial age estimation, such as facial attractiveness computation [9], crowd counting [58] and pre-release movie rating prediction [14], have achieved outstanding performance by using LDL.

Facial pose appears similar to facial age, i.e.,
the faces under close poses look similar (as shown in Fig. 2): the changing of facial pose can be regarded as a relatively slow and smooth process, and faces under adjacent poses are highly correlated. Thus, the LDL paradigm is an ideal match for the task of facial pose estimation. We notice that similar learning paradigms [15, 13] have been proposed to mitigate label ambiguity in head pose estimation. However, they only focused on 2D head pose estimation and were not extensively investigated on such precisely annotated benchmarks as AFLW2000 and BIWI.
3. Method
We argue that the lack of sufficient training samples can degrade the accuracy of the pose estimator. The reason is that the soft-max cross-entropy loss function used in training encodes the distance between all poses equally and does not take the relationship between adjacent poses into consideration, so it cannot effectively handle the insufficiency of training samples. Inspired by previous work on age estimation [16, 24] and facial attractiveness ranking [9], we reformulate facial pose estimation as a label distribution learning problem.

Figure 2. The faces of one subject under different poses: (a) yaw ≈ 24°, (b) yaw ≈ 29°, (c) yaw ≈ 34°, (d) yaw ≈ 39°.

It is apparent that faces under close poses look quite similar (as shown in Fig. 2). Consequently, additional knowledge about the faces with different poses can be introduced to reinforce the learning problem. It is straightforward to utilize faces under neighboring poses while learning a particular pose. To achieve this, we assign a label distribution to each face image rather than a single label of its real pose. This makes a face image contribute not only to the learning of its real pose, but also to the learning of its neighbouring poses. We employ three Gaussian label distributions to describe a face example in the yaw, pitch and roll domains respectively, to reinforce the whole learning process.

Here we take the yaw as an example to illustrate the Gaussian label distribution. Given a face image $x_i$ and a complete set of yaw labels $y = \{y_1, y_2, \ldots, y_M\}$, if its yaw label is $y_\alpha$, $\alpha = 1, 2, \ldots, M$, then the corresponding yaw label distribution is represented as a multi-dimensional vector $D_i^y = \{d_{x_i}^{y_1}, d_{x_i}^{y_2}, \ldots, d_{x_i}^{y_M}\}$, with the $l$-th dimension as follows:

$$d_{x_i}^{y_l} = \frac{\exp(\hat{d}_{x_i}^{y_l})}{\sum_{u=1}^{M} \exp(\hat{d}_{x_i}^{y_u})}, \quad \hat{d}_{x_i}^{y_l} = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\!\left(-\frac{(l-\alpha)^2}{2\sigma_y^2}\right), \quad l = 1, 2, \ldots, M \quad (1)$$

where $l$ denotes the $l$-th binned yaw, $\alpha$ is the binned ground-truth yaw, $\sigma_y$ is the label standard deviation, and $M$ is the dimension of the yaw label vector, which also implicitly represents the maximum yaw. Consequently, $d_{x_i}^{y_l}$ represents the degree to which the label $y_l$ describes the example $x_i$, under the constraint $\sum_{l=1}^{M} d_{x_i}^{y_l} = 1$, meaning that the label set $y$ fully describes the example. Fig. 3 demonstrates an example of a Gaussian label distribution for yaw.

Figure 3. Gaussian label distribution with $\sigma_y = 4$ for the ground-truth yaw (x-axis: binned yaw; y-axis: description degree).

Following the same definition, another two label distributions, $D_i^p = \{d_{x_i}^{p_1}, \ldots, d_{x_i}^{p_N}\}$ and $D_i^r = \{d_{x_i}^{r_1}, \ldots, d_{x_i}^{r_K}\}$, can be obtained for $x_i$ with a set of pitch labels $p = \{p_1, \ldots, p_N\}$ and a set of roll labels $r = \{r_1, \ldots, r_K\}$ respectively, as follows:

$$d_{x_i}^{p_j} = \frac{\exp(\hat{d}_{x_i}^{p_j})}{\sum_{v=1}^{N} \exp(\hat{d}_{x_i}^{p_v})}, \quad \hat{d}_{x_i}^{p_j} = \frac{1}{\sqrt{2\pi}\,\sigma_p}\exp\!\left(-\frac{(j-\beta)^2}{2\sigma_p^2}\right), \quad j = 1, 2, \ldots, N \quad (2)$$

$$d_{x_i}^{r_k} = \frac{\exp(\hat{d}_{x_i}^{r_k})}{\sum_{w=1}^{K} \exp(\hat{d}_{x_i}^{r_w})}, \quad \hat{d}_{x_i}^{r_k} = \frac{1}{\sqrt{2\pi}\,\sigma_r}\exp\!\left(-\frac{(k-\gamma)^2}{2\sigma_r^2}\right), \quad k = 1, 2, \ldots, K \quad (3)$$

where $\beta$ and $\gamma$ denote the binned ground-truth pitch and roll of the face respectively. Consequently, the training set can be represented as $\{(x_i, (D_i^y, D_i^p, D_i^r)),\ 1 \le i \le n\}$, and the goal of the learning becomes to train a set of network parameters $\theta$ to generate a triplet of probability distributions $(F^y(x_i;\theta), F^p(x_i;\theta), F^r(x_i;\theta))$ for the three label sets, which is similar to $(D_i^y, D_i^p, D_i^r)$, wherein

$$F^y(x_i;\theta) = \{f(y_1|x_i;\theta), \ldots, f(y_M|x_i;\theta)\}, \quad \sum_{l=1}^{M} f(y_l|x_i;\theta) = 1;$$
$$F^p(x_i;\theta) = \{f(p_1|x_i;\theta), \ldots, f(p_N|x_i;\theta)\}, \quad \sum_{j=1}^{N} f(p_j|x_i;\theta) = 1;$$
$$F^r(x_i;\theta) = \{f(r_1|x_i;\theta), \ldots, f(r_K|x_i;\theta)\}, \quad \sum_{k=1}^{K} f(r_k|x_i;\theta) = 1. \quad (4)$$

The Euclidean distance and the Kullback-Leibler (KL) divergence are adopted to construct the loss functions measuring the similarity between the ground-truth distributions $(D_i^y, D_i^p, D_i^r)$ and the predicted distributions $(F^y(x_i;\theta), F^p(x_i;\theta), F^r(x_i;\theta))$. The objective of the label distribution learning is to minimize either of the following overall loss functions:

$$L_{Eu} = \sum_{i=1}^{n} \left\|D_i^y - F^y(x_i;\theta)\right\|_2^2 + \sum_{i=1}^{n} \left\|D_i^p - F^p(x_i;\theta)\right\|_2^2 + \sum_{i=1}^{n} \left\|D_i^r - F^r(x_i;\theta)\right\|_2^2,$$
$$L_{KL} = \sum_{i=1}^{n} \sum_{l=1}^{M} d_{x_i}^{y_l} \ln\frac{d_{x_i}^{y_l}}{f(y_l|x_i;\theta)} + \sum_{i=1}^{n} \sum_{j=1}^{N} d_{x_i}^{p_j} \ln\frac{d_{x_i}^{p_j}}{f(p_j|x_i;\theta)} + \sum_{i=1}^{n} \sum_{k=1}^{K} d_{x_i}^{r_k} \ln\frac{d_{x_i}^{r_k}}{f(r_k|x_i;\theta)} \quad (5)$$

and we define $L_{GLD} = L_{Eu} + L_{KL}$ as our Gaussian label distribution loss.

We modify the framework presented in Hopenet [40] to construct our network architecture for facial pose estimation. The framework in Hopenet [40] originally consists of three separate losses for yaw, pitch and roll respectively, and achieved state-of-the-art results; each loss is a linear combination of a soft-max cross-entropy loss and a mean squared error (MSE) loss. To achieve better accuracy, we replace the soft-max cross-entropy loss with our Gaussian label distribution loss. Consequently, our learning architecture can be constructed as shown in Fig. 4.

Our framework consists of a ResNet50 [18]-based backbone network and three branches for yaw, pitch and roll respectively. Each branch is comprised of a fully-connected layer, with the number of neurons equal to the total number of corresponding labels, and a soft-max layer followed by the combined loss layer. The soft-max operation ensures that the aforementioned constraints of Eq. (4) are satisfied: $\sum_{l=1}^{M} f(y_l|x_i;\theta) = 1$, $\sum_{j=1}^{N} f(p_j|x_i;\theta) = 1$ and $\sum_{k=1}^{K} f(r_k|x_i;\theta) = 1$. The total loss is then defined as $L_{total} = L_{GLD} + \alpha \cdot L_{MSE}$, where $L_{MSE}$ is the mean squared error loss and $\alpha$ is a weight used to adjust the two loss components.
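As a concrete illustration, the label-distribution construction of Eq. (1) and the combined Euclidean-plus-KL loss of Eq. (5) for a single angle branch can be sketched in a few lines of numpy. This is a minimal sketch of ours, not the authors' released code: the function names are ours, and the exact normalization of the unnormalized Gaussian term is assumed to follow the standard density.

```python
import numpy as np

def gaussian_label_distribution(alpha, m, sigma):
    """Soft-max-normalized Gaussian label distribution centred on the
    ground-truth bin `alpha` over `m` pose bins (cf. Eq. 1)."""
    l = np.arange(m)
    # Unnormalized Gaussian weight for each bin (standard density assumed).
    d_hat = np.exp(-(l - alpha) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    # Soft-max normalization so the distribution sums to 1.
    d = np.exp(d_hat)
    return d / d.sum()

def gld_loss(d, f, eps=1e-12):
    """Gaussian label distribution loss for one angle branch (cf. Eq. 5):
    squared Euclidean distance plus KL divergence between the ground-truth
    distribution `d` and the predicted distribution `f`."""
    eu = np.sum((d - f) ** 2)
    kl = np.sum(d * np.log((d + eps) / (f + eps)))
    return eu + kl
```

For a face whose ground-truth yaw falls in bin 30 of 66, `gaussian_label_distribution(30, 66, 4.0)` yields a distribution peaked at bin 30 that still assigns mass to neighbouring bins; `gld_loss` is zero when the prediction matches the target exactly and grows as the predicted distribution drifts away.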
4. Experiments
We choose the 300W-LP [59] and AFLW [21] datasets to train our network respectively. These two datasets have enough examples, with enough different identities and different lighting conditions. The 300W-LP [59] dataset is a collection of popular in-the-wild 2D landmark datasets which have been grouped and re-annotated. The AFLW [21] dataset, which is commonly used to train and test landmark detection methods, also includes pose annotations.

We divide the facial pose into 66 bins within ±99° for yaw, pitch and roll respectively, i.e., M = N = K = 66, and we set σ_y = σ_p = σ_r = 4. All the data is normalized before training using the ImageNet mean and standard deviation for each color channel, and a ResNet50 [18] pretrained on ImageNet is adopted to initialize our network. The proposed multi-loss network is trained with α = 0, α = 0.01, α = 0.1, α = 1 and α = 2 on both the 300W-LP dataset and the AFLW dataset. All ten networks are trained using Adam optimization [51] with a learning rate of 10⁻⁵, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.

The AFLW2000 [59] dataset contains the first 2000 identities of the in-the-wild AFLW [21] dataset with accurate pose annotations; it is an ideal candidate to test our method. The BIWI [10] dataset is collected indoors by recording RGB-D video of different subjects across different facial poses using a Kinect v2 device. It is commonly used as a benchmark for depth-based pose estimation; here we only use the color frames, not the depth information.

Firstly, we compare our results to the state-of-the-art method Hopenet [40], which is trained using a combination of an L2 Euclidean loss and a soft-max cross-entropy loss. Then, we compare to the pose estimated from 3DDFA [60], whose primary task is to align facial landmarks, and to the pose estimated from landmarks using two different landmark detectors, FAN [4] and Dlib [20], as well as from ground-truth landmarks, on both datasets. Additionally, we also list the results of KEPLER [22] on the BIWI dataset as reported in [40].
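The binning and decoding implied by this setup can be sketched as follows. The 3° bin width, the ±99° range and the expectation-based decoding are our assumptions (66 uniform bins plus the expectation stage of Fig. 4 suggest them), not a verbatim reproduction of the authors' pipeline.

```python
import numpy as np

NUM_BINS = 66        # M = N = K = 66 bins per angle
BIN_WIDTH = 3.0      # assumed: 66 bins of 3 degrees covering roughly +/-99 degrees
ANGLE_MIN = -99.0    # assumed lower bound of the binned range

def angle_to_bin(angle_deg):
    """Map a continuous angle in degrees to its bin index, clamped to range."""
    idx = int((angle_deg - ANGLE_MIN) // BIN_WIDTH)
    return min(max(idx, 0), NUM_BINS - 1)

def expected_angle(prob):
    """Decode a predicted bin distribution back to a continuous angle via its
    expectation over bin centers (the 'Expectation' stage shown in Fig. 4)."""
    centers = ANGLE_MIN + (np.arange(NUM_BINS) + 0.5) * BIN_WIDTH
    return float(np.sum(prob * centers))
```

A one-hot distribution decodes back to the centre of its bin, so the quantization error of this scheme is at most half a bin width (1.5° under the assumed binning); the expectation makes the decoded angle continuous rather than snapped to bin edges.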
Table 1 shows the performance evaluations on the AFLW2000 and BIWI benchmarks. We can see that our best model (α = 0.01) outperforms all other baseline methods by a large margin on the AFLW2000 benchmark, reducing the yaw error of the best-performing baseline 3DDFA [60] by 43.9%, reducing the yaw error of Hopenet [40] by 53.2%, and reducing the pitch error, the roll error and the mean average error (MAE) of the best-performing baseline Hopenet [40] by 22.8%, 32.2% and 36.2% respectively.

On the BIWI benchmark, our method also performs better than all other baseline methods. Our best model (α = 0) trained on the 300W-LP dataset reduces the error of the corresponding best-performing baseline Hopenet [40] trained on 300W-LP by 14.3%, 15%, 3.7% and 12.3% for yaw, pitch, roll and MAE respectively. Our best model (α = 0.1) trained on the AFLW dataset also outperforms Hopenet [40] trained on AFLW, reducing the error by 20.8%, 19.4%, 0.9% and 13.8% for yaw, pitch, roll and MAE respectively.

In this section, we present the evaluation results on the AFLW [21] and AFW [35] benchmarks, using the model trained on the AFLW dataset. The AFW [35] benchmark, which is commonly used to test landmark detection methods, contains rough pose annotations. Here, we compare our results
Figure 4. Proposed network architecture for facial pose estimation: a ResNet50 backbone followed by FC layers and a soft-max plus an expectation for each of the yaw, pitch and roll branches; the total loss of each branch combines a Gaussian label distribution (GLD) loss and a mean squared error (MSE) loss.
Benchmark | Method | Yaw | Pitch | Roll | MAE
AFLW2000 | Hopenet [40]* | 6.470 | 6.559 | 5.436 | 6.155
 | FAN [4] | 6.358 | 12.277 | 8.714 | 9.116
 | 3DDFA [60] | 5.400 | 8.530 | 8.250 | 7.393
 | Dlib [20] | 23.153 | 13.633 | 10.545 | 15.777
 | Ground-truth landmarks | 5.924 | 11.756 | 8.271 | 8.651
 | Ours (α = 0)* | 3.1791 | 5.3372 | 3.7983 | 4.1049
 | Ours (α = 0.01)* | 3.0288 | 5.0634 | 3.6842 | 3.9255
 | Ours (α = 0.1)* | 3.1446 | 5.2047 | 3.6901 | 4.0131
 | Ours (α = 1)* | 3.1064 | 5.3446 | 3.6957 | 4.0489
 | Ours (α = 2)* | 3.3236 | 5.3570 | 3.8392 | 4.1733
BIWI | Hopenet [40]* | 4.810 | 6.606 | 3.269 | 4.895
 | Hopenet [40]+ | 5.785 | 11.726 | 8.194 | 8.568
 | FAN [4] | 8.532 | 7.483 | 7.631 | 7.882
 | 3DDFA [60] | 36.175 | 12.252 | 8.776 | 19.068
 | Dlib [20] | 16.756 | 13.802 | 6.190 | 12.249
 | KEPLER [22]+ | 8.084 | 17.277 | 16.196 | 13.852
 | Ours (α = 0)* | 4.1233 | 5.6142 | 3.1469 | 4.2948
 | Ours (α = 0.01)* | 4.2367 | 5.8446 | 3.4675 | 4.5163
 | Ours (α = 0.1)* | 4.0967 | 6.0498 | 3.2933 | 4.4799
 | Ours (α = 1)* | 3.9236 | 5.8832 | 3.4014 | 4.4027
 | Ours (α = 2)* | 4.6890 | 6.1271 | 3.3669 | 4.7276
 | Ours (α = 0)+ | 4.5674 | 10.0874 | 8.0633 | 7.5737
 | Ours (α = 0.01)+ | 4.5652 | 8.9595 | 8.7420 | 7.4223
 | Ours (α = 0.1)+ | 4.5839 | 9.4471 | 8.1225 | 7.3845
 | Ours (α = 1)+ | 4.3564 | 9.2310 | 8.8810 | 7.4895
 | Ours (α = 2)+ | 4.3587 | 9.9015 | 8.6058 | 7.6220
*: trained on 300W-LP dataset. +: trained on AFLW dataset.
Table 1. Evaluations on the AFLW2000 and BIWI benchmarks.

to some deep learning-based methods, including Hopenet [40], KEPLER [22], the method proposed by Patacchiola and Cangelosi [34], Hyperface [36] and All-In-One [37]. Table 2 and Fig. 5 respectively show the results on the AFLW and AFW benchmarks. We can see that our method outperforms all other baseline methods on the AFLW benchmark. Our best model (α = 0.01) reduces the error of the best-performing baseline Hopenet [40] by 4.2%, 9.85%, 0.53% and 5.71% for yaw, pitch, roll and MAE respectively. On the AFW benchmark, our method also performs better than all other baseline methods.

Method | Yaw | Pitch | Roll | MAE
Hopenet [40] | 6.26 | 5.89 | 3.82 | 5.324
KEPLER [22] | 6.45 | 5.85 | 8.75 | 7.017
Patacchiola, Cangelosi [34] | 11.04 | 7.15 | 4.4 | 7.530
Ours (α = 0) | 6.83 | 5.26 | 3.92 | 5.34
Ours (α = 0.01) | 6.00 | 5.31 | 3.75 | 5.02
Ours (α = 0.1) | 5.93 | 5.30 | 4.03 | 5.085
Ours (α = 1) | 5.90 | 5.51 | 3.87 | 5.094
Ours (α = 2) | 5.90 | 5.62 | 3.77 | 5.097
Table 2. Evaluation on the AFLW benchmark.
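The MAE column in Tables 1 and 2 is consistent with the plain arithmetic mean of the three per-angle errors, which any complete row confirms; for example, the Hopenet row of Table 1:

```python
# Hopenet row of Table 1 (AFLW2000): yaw, pitch and roll errors in degrees.
yaw, pitch, roll = 6.470, 6.559, 5.436
mae = (yaw + pitch + roll) / 3.0
assert abs(mae - 6.155) < 5e-4  # matches the MAE reported in Table 1
```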
Figure 5. Evaluation on the AFW benchmark: fraction of the number of testing faces (y-axis) whose pose estimation error in degrees (x-axis) falls below a threshold, for Ours (α = 0.01, 99.7%), Hopenet (96.15%), All-In-One ConvNet (99.1%), Hyperface (97.7%) and KEPLER discrete (96.67%).
Our best model (α = 0.01) achieves a saturated accuracy of over 99%.

It is noteworthy that, on the BIWI and AFLW benchmarks, the improvement in accuracy for roll is much less than for yaw and pitch. We argue that two reasons lead to this situation. One reason is that the distribution of the training sets in the roll domain is extremely imbalanced compared to those in the yaw and pitch domains (as shown in Fig. 1), and most of the training examples lie in the area of small roll, which limits the learning ability of our method in the roll domain, especially in the area of large roll. The other reason is that the test sets share this characteristic: within the same small angular range around the frontal pose, 67.65% of the BIWI examples and 65.57% of the AFLW examples lie in this range for roll, while only 33.54% (BIWI) and 26.23% (AFLW) do for yaw, and 22.97% (BIWI) and 47.13% (AFLW) for pitch. That is, the BIWI and AFLW benchmarks have relatively few examples with large roll. Both reasons restrict the improvement our method can make for roll.
5. Conclusion
This paper presents a novel computational model for facial pose estimation, which is reformulated as a label distribution learning problem rather than the conventional single-label supervised learning. This makes a face image contribute not only to the learning of its real pose, but also to the learning of its adjacent poses, mitigating the degradation of the pose predictor caused by the lack of sufficient training data. Experiments on several popular benchmarks show that our method achieves state-of-the-art performance.
References [1] V. N. Balasubramanian, J. Ye, and S. Panchanathan. Biasedmanifold embedding: A framework for person-independenthead pose estimation. In , pages 1–7. IEEE,2007. 3[2] R. H. Baxter, M. J. Leach, S. S. Mukherjee, and N. M.Robertson. An adaptive motion model for person trackingwith instantaneous head-pose features.
IEEE Signal Process-ing Letters , 22(5):578–582, 2015. 1[3] M. D. Breitenstein, D. Kuettel, T. Weise, L. Van Gool, andH. Pfister. Real-time face pose estimation from single rangeimages. In , pages 1–8. IEEE, 2008. 2[4] A. Bulat and G. Tzimiropoulos. How far are we from solv-ing the 2d & 3d face alignment problem?(and a dataset of230,000 3d facial landmarks). In
Proceedings of the IEEEInternational Conference on Computer Vision , pages 1021–1030, 2017. 1, 5, 6[5] K. Cao, Y. Rong, C. Li, X. Tang, and C. Change Loy. Pose-robust face recognition via deep residual equivariant map-ping. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pages 5187–5196, 2018. 1[6] F.-J. Chang, A. Tuan Tran, T. Hassner, I. Masi, R. Nevatia,and G. Medioni. Faceposenet: Making a case for landmark-free face alignment. In
Proceedings of the IEEE Interna-tional Conference on Computer Vision , pages 1599–1608,2017. 1[7] Z. Chen, Z. Liu, H. Hu, J. Bai, S. Lian, F. Shi, and K. Wang.A realistic face-to-face conversation system based on deepneural networks. arXiv preprint arXiv:1908.07750 , 2019. 1[8] E. Chong, K. Chanda, Z. Ye, A. Southerland, N. Ruiz, R. M.Jones, A. Rozga, and J. M. Rehg. Detecting gaze towards eyes in natural social interactions and its use in child assess-ment.
Proceedings of the ACM on Interactive, Mobile, Wear-able and Ubiquitous Technologies , 1(3):43, 2017. 1[9] Y.-Y. Fan, S. Liu, B. Li, Z. Guo, A. Samal, J. Wan, and S. Z.Li. Label distribution-based facial attractiveness computa-tion by deep residual learning.
IEEE Transactions on Multi-media , 20(8):2196–2208, 2018. 3, 4[10] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool.Random forests for real time 3d face analysis.
InternationalJournal of Computer Vision , 101(3):437–458, 2013. 2, 3[11] G. Fanelli, J. Gall, and L. Van Gool. Real time head poseestimation with random regression forests. In
CVPR 2011 ,pages 617–624. IEEE, 2011. 2[12] G. Fanelli, T. Weise, J. Gall, and L. Van Gool. Real timehead pose estimation from consumer depth cameras. In
JointPattern Recognition Symposium , pages 101–110. Springer,2011. 1, 2[13] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, and X. Geng. Deeplabel distribution learning with label ambiguity.
IEEE Trans-actions on Image Processing , 26(6):2825–2838, 2017. 3[14] X. Geng and P. Hou. Pre-release prediction of crowd opin-ion on movies by label distribution learning. In
Twenty-Fourth International Joint Conference on Artificial Intelli-gence , 2015. 3[15] X. Geng and Y. Xia. Head pose estimation based on multi-variate label distribution. In
Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition , pages1837–1842, 2014. 3[16] X. Geng, C. Yin, and Z.-H. Zhou. Facial age estimation bylearning from label distributions.
IEEE transactions on pat-tern analysis and machine intelligence , 35(10):2401–2412,2013. 3, 4[17] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf. Surveyof pedestrian detection for advanced driver assistance sys-tems.
IEEE Transactions on Pattern Analysis & MachineIntelligence , (7):1239–1258, 2009. 1[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-ing for image recognition. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages770–778, 2016. 5[19] J. Huang, X. Shao, and H. Wechsler. Face pose discrimina-tion using support vector machines (svm). In
Proceedings.Fourteenth International Conference on Pattern Recognition(Cat. No. 98EX170) , volume 1, pages 154–156. IEEE, 1998.3[20] V. Kazemi and J. Sullivan. One millisecond face alignmentwith an ensemble of regression trees. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 1867–1874, 2014. 5, 6[21] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. An-notated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In , pages 2144–2151. IEEE, 2011. 2,5[22] A. Kumar, A. Alavi, and R. Chellappa. Kepler: keypoint andpose estimation of unconstrained faces by learning efficient-cnn regressors. In ,pages 258–265. IEEE, 2017. 1, 3, 5, 6[23] Y. Li, S. Gong, J. Sherrah, and H. Liddell. Support vec-tor machine based multi-view face detection and recognition.
Image and Vision Computing , 22(5):413–427, 2004. 3[24] X. Liu, S. Li, M. Kan, J. Zhang, S. Wu, W. Liu, H. Han,S. Shan, and X. Chen. Agenet: Deeply learned regressor andclassifier for robust apparent age estimation. In
Proceedingsof the IEEE International Conference on Computer VisionWorkshops , pages 16–24, 2015. 3, 4[25] Z. Liu, H. Hu, Z. Wang, K. Wang, J. Bai, and S. Lian. Videosynthesis of human upper body with realistic face. arXivpreprint arXiv:1908.06607 , 2019. 1[26] H. Moon and M. L. Miller. Estimating facial pose from asparse representation, Apr. 28 2009. US Patent 7,526,123. 3[27] S. S. Mukherjee and N. M. Robertson. Deep head pose:Gaze-direction estimation in multimodal video.
IEEE Transactions on Multimedia, 17(11):2094–2107, 2015.
[28] E. Murphy-Chutorian, A. Doshi, and M. M. Trivedi. Head pose estimation for driver assistance systems: A robust algorithm and experimental evaluation. In 2007 IEEE Intelligent Vehicles Symposium, pages 709–714. IEEE, 2007.
[29] E. Murphy-Chutorian, A. Doshi, and M. M. Trivedi. Head pose estimation for driver assistance systems: A robust algorithm and experimental evaluation. In 2007 IEEE Intelligent Vehicles Symposium, pages 709–714. IEEE, 2007.
[30] E. Murphy-Chutorian and M. M. Trivedi. Head pose estimation in computer vision: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):607–626, 2009.
[31] J. Ng and S. Gong. Composite support vector machines for detection of faces across views and pose estimation. Image and Vision Computing, 20(5-6):359–368, 2002.
[32] S. Niyogi and W. T. Freeman. Example-based head tracking. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 374–378. IEEE, 1996.
[33] E. Osuna, R. Freund, F. Girosi, et al. Training support vector machines: An application to face detection. In CVPR, volume 97, page 99, 1997.
[34] M. Patacchiola and A. Cangelosi. Head pose estimation in the wild using convolutional neural networks and adaptive gradient methods. Pattern Recognition, 71:132–143, 2017.
[35] D. Ramanan and X. Zhu. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2879–2886. Citeseer, 2012.
[36] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):121–135, 2019.
[37] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa. An all-in-one convolutional neural network for face analysis. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 17–24. IEEE, 2017.
[38] B. Raytchev, I. Yoda, and K. Sakaue. Head pose estimation by nonlinear manifold learning. In
Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 4, pages 462–466. IEEE, 2004.
[39] H. A. Rowley. Neural network-based face detection. Technical report, Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, 1999.
[40] N. Ruiz, E. Chong, and J. M. Rehg. Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2074–2083, 2018.
[41] E. Seemann, K. Nickel, and R. Stiefelhagen. Head pose estimation using stereo vision for human-robot interaction. In Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings., pages 626–631. IEEE, 2004.
[42] J. Sherrah, S. Gong, and E.-J. Ong. Understanding pose discrimination in similarity space. In BMVC, pages 1–10, 1999.
[43] J. Sherrah, S. Gong, and E.-J. Ong. Face distributions in similarity space under varying head pose. Image and Vision Computing, 19(12):807–819, 2001.
[44] R. Stiefelhagen. Estimating head pose with neural networks: Results on the Pointing04 ICPR workshop evaluation data. In Proc. Pointing 2004 Workshop: Visual Observation of Deictic Gestures, volume 1, pages 21–24, 2004.
[45] R. Stiefelhagen, J. Yang, and A. Waibel. Modeling focus of attention for meeting indexing. In International Multimedia Conference: Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), volume 30, pages 3–10. Citeseer, 1999.
[46] K. Sundararajan and D. L. Woodard. Head pose estimation in the wild using approximate view manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 50–58, 2015.
[47] Y.-J. Tu, C.-C. Kao, and H.-Y. Lin. Human computer interaction using face and gesture recognition. In , pages 1–8. IEEE, 2013.
[48] Y.-J. Tu, C.-C. Kao, H.-Y. Lin, and C.-C. Chang. Face and gesture based human computer interaction.
International Journal of Signal Processing, Image Processing and Pattern Recognition, 8(9):219–228, 2015.
[49] P. Viola, M. Jones, et al. Rapid object detection using a boosted cascade of simple features. CVPR (1), 1:511–518, 2001.
[50] Z. Wang, Z. Liu, Z. Chen, H. Hu, and S. Lian. A neural virtual anchor synthesizer based on seq2seq and GAN models. arXiv preprint arXiv:1908.07262, 2019.
[51] U. Weidenbacher, G. Layher, P. Bayerl, and H. Neumann. Detection of head pose and gaze direction for human-computer interaction. In
International Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, pages 9–19. Springer, 2006.
[52] J. Wu and M. M. Trivedi. A two-stage head pose estimation framework and evaluation. Pattern Recognition, 41(3):1138–1158, 2008.
[53] H. Yang, W. Mou, Y. Zhang, I. Patras, H. Gunes, and P. Robinson. Face alignment assisted by head pose estimation. arXiv preprint arXiv:1507.03148, 2015.
[54] Y. Yu, K. A. F. Mora, and J.-M. Odobez. Robust and accurate 3d head pose estimation through 3DMM and online head model reconstruction. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 711–718. IEEE, 2017.
[55] D. Zanatto, M. Patacchiola, J. Goslin, and A. Cangelosi. Priming anthropomorphism: Can the credibility of human-like robots be transferred to non-humanlike robots? In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 543–544. IEEE, 2016.
[56] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Appearance-based gaze estimation in the wild. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4511–4520, 2015.
[57] Z. Zhang, Y. Hu, M. Liu, and T. Huang. Head pose estimation in seminar room using multi view face detectors. In International Evaluation Workshop on Classification of Events, Activities and Relationships, pages 299–304. Springer, 2006.
[58] Z. Zhang, M. Wang, and X. Geng. Crowd counting in public video surveillance by label distribution learning. Neurocomputing, 166:151–163, 2015.
[59] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 146–155, 2016.
[60] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In