AnchorFace: An Anchor-based Facial Landmark Detector Across Large Poses
Zixuan Xu*, Banghuai Li*, Miao Geng, Ye Yuan, and Gang Yu
Peking University [email protected]
MEGVII Technology {libanghuai,yuannye,yugang}@megvii.com
Beihang University geng [email protected]
Abstract.
Facial landmark localization aims to detect the predefined points of human faces, and the topic has been rapidly improved with the recent development of neural network based methods. However, it remains a challenging task when dealing with faces in unconstrained scenarios, especially with large pose variations. In this paper, we target the problem of facial landmark localization across large poses and address this task based on a split-and-aggregate strategy. To split the search space, we propose a set of anchor templates as references for regression, which well addresses the large variations of face poses. Based on the prediction of each anchor template, we propose to aggregate the results, which can reduce the landmark uncertainty due to the large poses. Overall, our proposed approach, named AnchorFace, obtains state-of-the-art results with extremely efficient inference speed on four challenging benchmarks, i.e. AFLW, 300W, Menpo, and WFLW. Code will be released for reproduction.
Facial landmark localization, or face alignment, refers to detecting a set of predefined landmarks on the human face. It is a fundamental step for many facial related applications, e.g. face verification/recognition, expression recognition, and facial attribute analysis.

With the recent development of convolutional neural network based methods [29], the performance of facial landmark localization in constrained scenarios has been greatly improved [12,57,42]. However, unconstrained scenarios, for example, faces with large poses, still limit the wide application of the existing landmark algorithms. In this paper, we target the problem of facial landmark localization across large poses.

There are two challenges for facial landmark detection across large poses. On one hand, faces with large poses significantly increase the difficulty of landmark localization due to the large variations among different poses. As shown in Fig. 1, directly regressing the point coordinates may not be able to localize every landmark point precisely. On the other hand, there usually
* Equal Contribution

exists a large probability of uncertainty due to self-occlusion and noisy annotations. For example, occlusion usually leads to invisible landmarks, which increases the uncertainty of the landmark prediction. Besides, faces with a large pose also cause difficulty during the data annotation process.

To address the above two challenges, we propose a novel pipeline for facial landmark localization based on an anchor-based design. The new pipeline includes two steps: split and aggregate. An overview of our pipeline can be found in Fig. 2. To deal with the first challenge of large pose variations, we adopt the divide-and-conquer way following an anchor-based design. We propose to use anchor templates to split the search space, and each anchor serves as a reference for regression. This can significantly reduce the pose variations for each anchor. To address the second issue of pose uncertainty, we propose to aggregate each anchor result weighted by the predicted confidence.
Fig. 1: A comparison between direct regression and anchor-based regression (AnchorFace). Our AnchorFace includes two steps. The first step is to introduce the anchor templates and regress the offsets based on each anchor template (second column). The second step is to aggregate the prediction results from multiple anchor templates (third column)

In summary, we propose AnchorFace to implement the split-and-aggregate strategy. There are three contributions in our paper.

– We propose a novel pipeline with a split-and-aggregate strategy which can well address the challenges of face alignment across large poses.
– To implement the split-and-aggregate strategy, we introduce the anchor design into the facial landmark problem, which can simplify the search space for each anchor template and meanwhile improve the robustness against landmark uncertainty.
– Our proposed AnchorFace achieves promising results on four challenging benchmarks with an impressive inference speed of 4050 FPS†.

Facial Landmark Localization.
In the literature of facial landmark localization, a number of achievements have been developed, including the classic ASMs [32], AAMs [17,20,30,39], CLMs [8,9], and Cascaded Regression Models [5,6,7,15,46,59,58]. Nowadays, more and more deep learning-based methods have been applied in this area. These deep learning based methods can be divided into two categories, i.e. coordinate regression methods and heatmap regression methods.

Coordinate regression methods directly map the discriminative features to the target landmark coordinates. The earliest work can be dated to [40]. Sun et al. [40] used a three-level cascaded CNN to perform facial landmark localization in a coarse-to-fine manner, and achieved promising localization accuracy. MDM [41] was the first to apply a recurrent convolutional network model for facial landmark localization in an end-to-end manner. Zhang et al. [55] utilized a multi-task learning framework to optimize facial landmark localization and correlated facial attribute analysis simultaneously. Recently, Wingloss [16] was proposed as a new loss function for landmark localization, which obtains robust performance compared with the widely used L2 loss.

Faces with Large Pose.
Large pose is a challenging setting for facial landmark localization, and different strategies have been proposed to address the difficulty. Multi-view frameworks and 3D models are two popular ways. A multi-view framework uses different landmark configurations for different views. For example, TSPM [61] and CDM [50] employ DPM-like [14] methods to align faces with different shape models, and choose the model with the highest possibility as the final result. However, multi-view methods have to cover each view, making them impractical in the wild. 3D face models have been widely used in recent years, which fit a 3D morphable model (3DMM) [4] by minimizing the difference between the face image and the model appearance. Lost face information can be recovered to localize the invisible facial landmarks [3,18,19,24,63]. However, 3D face models are limited by their own database and the iterative label generation method.

† The computational speed of 4050 FPS is calculated on an Nvidia 2080 Ti GPU with batchsize 256. If batchsize is set to 1, the FPS is 320.
Besides, researchers have applied multi-task learning to address the difficulties resulting from pose variations. Other facial analysis tasks, such as pose estimation or facial attribute analysis, can be jointly trained with facial landmark localization [35,47,54]. With joint training, multi-task learning can boost the performance of each subtask, so the facial landmark localization task can achieve robust performance. However, the multi-task framework is not specially designed for landmark localization; it contains much redundant information and leads to large models.

In this paper, we propose an anchor-based model for facial landmark localization. Different from [45], which utilized anchor points to predict the positions of a human 3D pose, our approach introduces a split-and-aggregate pipeline for facial landmark localization, where the anchor is utilized as a reference for regression. Overall, our model requires neither cascaded networks nor large backbones, leading to a great reduction in model parameters and computational complexity, while still achieving comparable or even better accuracy.
Fig. 2: The pipeline of our proposed AnchorFace landmark detector. AnchorFace is based on a split-and-aggregate strategy, which consists of the backbone and two functional branches: the offset regression branch and the confidence branch. In the split step, we predict the landmark position based on each anchor template. In the aggregate step, the predictions of multiple anchor templates are averaged with confidence weighting
In this paper, we propose a new split-and-aggregate strategy for facial landmark detection across large poses. An overview of our pipeline can be found in Fig. 2. To implement the split-and-aggregate strategy, we introduce the anchor-based design, and our approach is named AnchorFace. In the following sections, we discuss the split and aggregate steps separately, followed by the details of network training.
Due to the large variations among different poses, it is challenging to directly regress the facial landmarks while maintaining high localization precision. In this paper, we propose to utilize the divide-and-conquer way to address the issue of large pose variations. More specifically, we employ anchor templates as regression references to split the search space. Different from traditional methods, which regress the landmarks with a uniform facial landmark detector, we propose to regress the offsets based on a set of anchor templates.
Fig. 3: An illustration of our anchor configuration. The anchor area is a region centered at the image center with a spatial neighborhood. Based on the anchor area, we set up a grid of anchor points, where each anchor point contains a set of anchor templates to model various pose variations
Anchor Configuration.
As shown in Fig. 3, there are three hyper-parameters for designing the anchor configuration: anchor area, anchor grid, and anchor templates.

Anchor area denotes the region where the anchors are set. It is usually centered at the image center with a spatial neighborhood. The reason to define the anchor area is that the input image is cropped to put the face near the image center. Thus, we select a region near the image center, which is called the anchor area, to set up the anchors. Based on the anchor area, we sample a set of anchor points in a grid, e.g. a 7 × 7 grid. Each anchor point contains N_yaw base anchors (N_yaw = 3 in our paper, representing the anchors for left, frontal, and right faces). To generate the N_yaw base anchors, we utilize a heuristic approach to divide the training faces into three buckets and compute the average face landmarks for each bucket to obtain the anchor proposal. More specifically, we use the ratio of the two eyes' widths for bucket assignment. We define an indicator to estimate the yaw angle of each training face:

r = \frac{|p_{l1} - p_{l2}|}{|p_{r1} - p_{r2}|} - \frac{|p_{r1} - p_{r2}|}{|p_{l1} - p_{l2}|},   (1)

where p_{l1}, p_{l2}, p_{r1}, p_{r2} are the coordinates of the left eye inner corner, left eye outer corner, right eye inner corner, and right eye outer corner, respectively. With a threshold γ, we put a face into the left or right bucket when r > γ or r < −γ. The other faces are kept in the frontal bucket. We set γ = 6 in our experiments, as shown in Fig. 4.

Fig. 4: An illustration of the metric r for classifying the faces into three buckets along the yaw direction
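The bucket assignment of Eq. (1) can be sketched as follows. This is a minimal NumPy illustration: the function names are ours, and the mapping of the sign of r to the left/right bucket is our assumption.

```python
import numpy as np

def yaw_indicator(p_l1, p_l2, p_r1, p_r2):
    """Eq. (1): difference of the two eye-width ratios, used as a yaw proxy.

    p_l1/p_l2 are the inner/outer corners of the left eye, and p_r1/p_r2
    those of the right eye, each given as a 2D point.
    """
    left = np.linalg.norm(np.asarray(p_l1) - np.asarray(p_l2))
    right = np.linalg.norm(np.asarray(p_r1) - np.asarray(p_r2))
    return left / right - right / left

def assign_bucket(r, gamma=6.0):
    """Assign a face to the left, right, or frontal bucket via the threshold γ."""
    if r > gamma:
        return "left"    # assumed sign convention: a much wider left eye -> left bucket
    if r < -gamma:
        return "right"
    return "frontal"
```

For a near-frontal face the two eye widths are similar, so r stays close to 0 and the face lands in the frontal bucket.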
Fig. 5: A comparison of three base anchors generated by the hand-design approach and KMeans clustering, based on the AFLW [22] dataset

Based on the N_yaw base anchors, to cover the roll variations, we rotate each anchor along the roll dimension. For example, we can get twenty-four templates by rotating the three base anchors every 45° from 0° to 360°. Optionally, we can involve pitch variations by directly projecting (rotating) along the pitch dimension. However, based on our experimental results, the anchors designed along the pitch view cannot further improve the performance but compromise the computational speed. Thus, in our final design, only anchors along the yaw and roll dimensions are utilized, as shown in Fig. 3.

An alternative solution for the anchor design is based on the data distribution of the training faces. We first perform KMeans clustering to generate a set of base anchors. One example is shown in Fig. 5. We can see that the anchors clustered over all the training faces are similar along the yaw direction to the hand-designed anchors discussed above. Following similar steps as for the hand-designed anchors, we can rotate the generated prototypes along the roll and pitch directions to generate more anchors.

Regression and Confidence Branch.
Based on the anchor proposals, we design a new head structure which involves two branches: the regression branch and the confidence branch. The regression branch aims to regress the landmark coordinate offsets based on each anchor. The confidence branch assigns each anchor a confidence score. Among all the anchor templates, those anchors which are close to the pose of the ground-truth face should be given higher confidence.

As shown in Fig. 2, both the confidence branch and the regression branch are built upon the output feature map of the backbone network. While we set h · w anchors in the image, the outputs of the confidence and regression branch are C_con · h · w and C_reg · h · w respectively, where C_con and C_reg denote the output channel numbers of the confidence branch and the regression branch. Here C_con = K and C_reg = K · L, where K and L refer to the number of anchor templates on each anchor point and the number of facial landmarks, respectively.

Table 1: The definition of symbols

Symbol | Definition
A | A set of anchor points on the spatial anchor grid
a | One anchor point, a ∈ A
T | A set of anchor templates as in Fig. 3
T(a, t) | Anchor template t ∈ T centering at anchor point a
T_j(a, t) | Landmark j on anchor template T(a, t)
O(a, t) | Output from the regression branch based on T(a, t)
Ô(a, t) | Ground-truth (GT) offsets based on T(a, t)
C(a, t) | Output from the confidence branch based on T(a, t)
Ĉ(a, t) | Confidence GT label based on T(a, t)

Large-pose faces increase the uncertainty of the landmark prediction. To address this problem, we propose to aggregate the predictions from different anchor templates. More specifically, we first set a threshold C_th to pick reliable anchor predictions. The anchor predictions with low confidence scores are regarded as outliers and discarded. The remaining anchor predictions are averaged, weighted by the confidence of each prediction.
As a result, the position of landmark j can be obtained as the confidence-weighted average of the outputs of all anchor faces:

\tilde{S}_j = \frac{\sum_{a \in A, t \in T} \tilde{C}(a,t) \cdot (O_j(a,t) + T_j(a,t))}{\sum_{a \in A, t \in T} \tilde{C}(a,t)},   (2)

where

\tilde{C}(a,t) = \begin{cases} 0, & C(a,t) < C_{th} \\ C(a,t), & \text{otherwise} \end{cases}   (3)

The definition of the symbols can be found in Table 1, and the threshold C_th is set to 0.
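The aggregation of Eqs. (2)–(3) can be sketched as follows for a single landmark j. This is our own NumPy illustration; since the exact value of C_th is truncated in our copy of the text, the default below is an arbitrary placeholder.

```python
import numpy as np

def aggregate_landmark(offsets, templates, conf, c_th=0.3):
    """Confidence-weighted aggregation of one landmark (Eqs. 2-3).

    offsets:   (M, 2) predicted offsets O_j(a, t) over all M anchor/template pairs
    templates: (M, 2) template landmark positions T_j(a, t)
    conf:      (M,)   predicted confidence scores C(a, t)
    c_th:      confidence threshold (placeholder value; see the text)
    """
    w = np.where(conf < c_th, 0.0, conf)               # Eq. (3): discard outliers
    preds = offsets + templates                        # landmark proposal per pair
    return (w[:, None] * preds).sum(axis=0) / w.sum()  # Eq. (2)
```

In this sketch, a prediction below the threshold contributes zero weight, so a wildly wrong low-confidence proposal cannot pull the aggregated landmark off target.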
In this subsection, we discuss the ground-truth setting for the regression and confidence branches as well as the related losses. For the regression branch, the target is to regress the offsets against each of the predefined anchors. The regression loss L_reg is defined as:

L_{reg} = \sum_{a \in A, t \in T} \hat{C}(a,t) \sum_j |O_j(a,t) - \hat{O}_j(a,t)|,   (4)

where O_j(a,t) and Ô_j(a,t) refer to the predicted offsets and the ground-truth offsets for the j-th landmark, and Ĉ(a,t) denotes the confidence weight for the anchor template T(a,t). The detailed symbol definitions can be found in Table 1.

For the confidence branch, we set the targeted confidence output Ĉ(a,t) based on the L2 distance ‖v − v̂‖₂ between the anchor pose v and the ground-truth pose v̂, where v and v̂ refer to the flattened landmark coordinates. To normalize the pose difference, we perform a tanh operation:

\hat{C} = 1 - \tanh\left(\frac{\|v - \hat{v}\|_2}{\beta \cdot L}\right),   (5)

where β is a hyperparameter and L refers to the number of facial landmarks. The confidence loss is then defined as:

L_{con} = \sum_{a \in A, t \in T} \left(-\hat{C}(a,t) \cdot \log C(a,t) - (1 - \hat{C}(a,t)) \cdot \log(1 - C(a,t))\right).   (6)

The network is jointly supervised by the two loss functions above with end-to-end training. The final training loss is then defined as:

L_{total} = L_{reg} + \lambda \cdot L_{con},   (7)

where λ is a hyperparameter of our method, to which the localization accuracy is insensitive in our experiments.

The experiments are evaluated on four challenging datasets, i.e. AFLW, 300W, Menpo, and WFLW.

AFLW [22] dataset: AFLW contains 24386 in-the-wild faces with head poses up to 120° in the yaw direction. We follow the standard setting [59,58], which ignores the two ear landmarks and evaluates the remaining 19 landmarks. AFLW is split into two sets: (i) AFLW-Full: 20000 and 4386 images are used for training and testing, respectively; (ii) AFLW-Frontal: 1314 images are selected from the 4386 testing images for evaluation on frontal faces.
Evaluation metric.
We adopt the normalized mean error (NME) for evaluation. The normalized mean error is defined as the average Euclidean distance between the predicted facial landmark locations O_{i,j} and their corresponding ground-truth annotations Ô_{i,j}:

NME = \frac{1}{N} \sum_{i=1}^{N} \frac{\frac{1}{L} \sum_{j=1}^{L} \|O_{i,j} - \hat{O}_{i,j}\|_2}{d},   (8)

where N is the number of images in the testing set, L is the number of landmarks, and d is the normalization factor. On the AFLW dataset, we follow [59] and use the face size as the normalization factor. On the Menpo dataset, we use the distance between the left-top corner and the right-bottom corner as the normalization factor. On the 300W and WFLW datasets, we follow MDM [41] and [37] and use the "inter-ocular" normalization factor, i.e. the distance between the outer eye corners.

In addition, on the WFLW dataset, two further statistics, i.e. the area-under-the-curve (AUC) [48] and the failure rate (defined as the proportion of failed detected faces), are measured for further analysis. In particular, any normalized error above 0.1 is regarded as a failed detection.

Implementation details.
In our method, the original images are cropped and resized to a fixed resolution, i.e. 224 × 224. The default anchor area and anchor grid are 56 × 56 and 7 × 7, respectively. We train the network for 50 epochs in total, and the learning rate is divided by ten at the 20th, 30th, and 40th epochs. β is set to 0.05.
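The training targets and losses of Eqs. (4)–(7) can be sketched as follows. This is a NumPy sketch with our own function names and array shapes; the λ default is a placeholder, and the confidence target follows the convention stated in Section 3 that anchors close to the ground-truth pose receive higher confidence.

```python
import numpy as np

def confidence_target(v_anchor, v_gt, beta=0.05):
    """Eq. (5): ground-truth confidence from the pose distance.
    v_anchor, v_gt are flattened (x, y) landmark coordinates; anchors
    close to the ground-truth pose receive targets near 1."""
    L = v_gt.size // 2                       # number of landmarks
    dist = np.linalg.norm(v_anchor - v_gt)   # L2 pose distance
    return 1.0 - np.tanh(dist / (beta * L))

def regression_loss(offset_pred, offset_gt, conf_gt):
    """Eq. (4): confidence-weighted L1 loss.
    offset_pred, offset_gt: (M, L, 2); conf_gt: (M,)."""
    per_pair = np.abs(offset_pred - offset_gt).sum(axis=(1, 2))
    return float((conf_gt * per_pair).sum())

def confidence_loss(conf_pred, conf_gt, eps=1e-7):
    """Eq. (6): binary cross-entropy between predicted and GT confidences."""
    p = np.clip(conf_pred, eps, 1.0 - eps)
    return float((-conf_gt * np.log(p) - (1.0 - conf_gt) * np.log(1.0 - p)).sum())

def total_loss(offset_pred, offset_gt, conf_pred, conf_gt, lam=1.0):
    """Eq. (7): joint training loss; lam stands in for the weight λ."""
    return regression_loss(offset_pred, offset_gt, conf_gt) \
        + lam * confidence_loss(conf_pred, conf_gt)
```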
05 and λ = 0 . For a fair comparison, we only compare the methods following the standardsettings, as discussed in Section 4.1. Therefore, those methods which are trainedfrom external datasets or combined multiple datasets are not compared.
AFLW dataset: We first evaluate our algorithm on the AFLW dataset. The performance comparisons are given in Table 2. It can be observed that, on this large dataset, our network outperforms the other approaches. As mentioned in Section 4.1, AFLW contains many faces with large poses. Note that our method shows a larger improvement on the AFLW-Full set than on AFLW-Frontal, which means that we achieve more robust localization performance on faces in unconstrained scenarios, including large poses. This validates the superiority of our approach.

Table 2: Normalized mean error (%) on AFLW dataset
Methods | AFLW-Full | AFLW-Frontal
LBF [36] | 4.24 | 2.74
CFSS [59] | 3.92 | 2.69
CCL [60] | 2.72 | 2.17
TSR [27] | 2.17 | -
SAN [12] | 1.91 | 1.85
Wing [16] | 1.65 | -
SA [25] | 1.62 | -
ODN [57] | 1.63 | 1.38
AnchorFace | 1.56 | 1.38
Table 3: Normalized mean error (%) on 300W dataset

Methods | Common | Challenge | Full
Two-Stage [28] | 4.36 | 7.42 | 4.96
RDR [44] | 5.03 | 8.95 | 5.80
Pose-Invariant [19] | 5.43 | 9.88 | 6.30
SBR [13] | 3.28 | 7.58 | 4.10
PCD-CNN [21] | 3.67 | 7.62 | 4.44
LAB [42] | | |
SAN [12] | 3.34 | 6.60 | 3.98
ODN [57] | 3.56 | 6.67 | 4.17
AnchorFace | 3.12 | 6.19 | 3.72

300W dataset: We compare our approach against several state-of-the-art methods on the 300W Fullset. The results are shown in Table 3. Since there are fewer large pose variations across the whole dataset and the cropped faces normally center near the image center point, the 300W dataset is not very challenging compared with the other three benchmarks. However, our algorithm can still achieve promising localization performance at an efficient speed of 4050 fps with batch size 256 and 320 fps with batch size 1. Compared with LAB [42], which is slightly better than our method, our approach is much faster (320 vs 17 fps).
Menpo dataset: The Menpo dataset has two subsets: semi-frontal and profile. Following the standard settings on Menpo, we conduct the experiments on each subset and evaluate the corresponding testing set separately. The experimental results are reported in Table 4 with the normalized mean error. Our method achieves state-of-the-art performance. Especially on the profile subset, our method outperforms state-of-the-art methods by a large margin, which validates the effectiveness of our proposed approach across large poses.
Table 4: Normalized mean error (%) on Menpo dataset

Methods | CLNF [1] | CFAN [53] | CFSS [59] | TCDCN [56] | 3DDFA [63] | CE-CLM [51] | AnchorFace
Frontal | 2.66 | 2.87 | 2.32 | 3.32 | 4.51 | 2.23 |
Profile | 6.68 | 25.33 | 9.99 | 9.82 | 6.02 | 5.39 |
Table 5: Evaluation on WFLW dataset

NME (%):
Methods | Flops | Testset | Pose | Expression | Illumination | Make-up | Occlusion | Blur
CFSS [59] | - | 9.07 | 21.36 | 10.09 | 8.30 | 8.74 | 11.76 | 9.96
DVLN [43] | - | 6.08 | 11.54 | 6.78 | 5.73 | 5.98 | 7.33 | 6.88
LAB [42] | 10.6G | 5.27 | 10.24 | 5.51 | 5.23 | 5.15 | 6.79 | 6.32
SAN [12] | 11.3G | 5.22 | 10.39 | 5.71 | 5.19 | 5.49 | 6.83 | 5.80
Wing [16] | 3.8G | 5.11 | 8.75 | 5.36 | 4.93 | 5.41 | 6.37 | 5.81
AVS [34] | 1.8G | 5.25 | 9.10 | 5.83 | 4.93 | 5.47 | 6.26 | 5.86
AnchorFace | 227M | | | | | | |

Failure rate (%):
AnchorFace | - | 7.00 | 27.91 | 5.09 | 5.73 | 9.70 | 13.04 | 8.27

AUC:
CFSS [59] | - | 0.3659 | 0.0632 | 0.3157 | 0.3854 | 0.3691 | 0.2688 | 0.3037
DVLN [43] | - | 0.4551 | 0.1474 | 0.3889 | 0.4743 | 0.4494 | 0.3794 | 0.3973
LAB [42] | - | 0.5323 | 0.2345 | 0.4951 | 0.5433 | 0.5394 | 0.4490 | 0.4630
SAN [12] | - | 0.5355 | 0.2355 | 0.4620 | 0.5552 | 0.5222 | 0.4560 | 0.4932
Wing [16] | - | 0.5504 | 0.3100 | 0.4959 | 0.5408 | 0.5582 | 0.4885 | 0.4918
AVS [34] | - | 0.5034 | 0.2294 | 0.4534 | 0.5252 | 0.4849 | 0.4318 | 0.4532
AnchorFace | - | 0.5380 | 0.2555 | 0.4961 | 0.5451 | 0.5423 | 0.4540 | 0.4746
WFLW dataset: A comparison of our proposed approach with state-of-the-art methods on the WFLW dataset is shown in Table 5. As indicated in Table 5, benefiting from the split-and-aggregate pipeline, our proposed method achieves comparable localization accuracy with much lower computational complexity. The best method outperforms our model by only 0.1% in NME, while its computational cost is more than ten times larger than ours. It is clear that AnchorFace achieves the best trade-off between speed and accuracy.
Based on the efficient ShuffleNet-V2 backbone, the total FLOPs of our network is 227M. Due to the light design of the network, our approach can run as fast as 4050 fps with batchsize 256 and 320 fps with batchsize 1 on an NVIDIA GeForce RTX 2080 Ti GPU. Comparisons with state-of-the-art methods are shown in Table 5. AnchorFace not only achieves promising results on the challenging benchmarks but also provides extremely efficient inference speed.
Our proposed AnchorFace introduces a novel split-and-aggregate strategy based on the anchor design to address face alignment across large poses. In this section, we further analyze its mechanism.
Anchor design. Anchor templates serve as regression references to split the search space in our proposed approach. Compared with directly regressing the target landmark coordinates in the whole 2D space, regressing offsets based on anchor templates simplifies the search space and boosts the robustness of localization. We conduct several experiments on the AFLW dataset and compute statistics across the yaw dimension, which are shown in Fig. 6. It is quite clear that AnchorFace significantly outperforms the baseline with lower NME and smaller variance in each subinterval, especially for large poses, which verifies our assumptions.

Fig. 6: A comparison of the baseline (NME 1.67) and AnchorFace (NME 1.56) across the yaw dimension
Split-and-aggregate strategy. In our proposed algorithm, we follow the divide-and-conquer way to address the challenges of face alignment across large poses. To verify its effectiveness, we adopt the Pearson correlation coefficient to measure the correlation between the confidence scores C(a, t) and the prediction errors |O(a, t) − Ô(a, t)|:

P = \frac{1}{N} \sum_{i=1}^{N} \left[ r^i_{a,t}\left( |O(a,t) - \hat{O}(a,t)|, C(a,t) \right) \right],   (9)

where r represents the Pearson correlation coefficient function. We conduct experiments on the AFLW dataset and obtain P = −0.82, which indicates a strong negative correlation between them. In other words, anchor templates with larger confidence scores achieve more accurate predicted landmarks. This helps filter prediction outliers and aggregate the remaining predictions to mitigate the uncertainty of the localization result on a single anchor face. Since the confidence score is defined as a mathematical modeling of the distance between the anchor pose and the ground-truth pose, we can come to another conclusion: closer anchors tend to achieve more accurate localization, which also directly supports our anchor-based search space split strategy. Comparison details can be found in Section 4.5, and intuitive samples are also shown in Fig. 7.
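The statistic of Eq. (9) can be sketched as follows (our own NumPy illustration; r is the sample Pearson correlation computed per image):

```python
import numpy as np

def mean_pearson(errors, confidences):
    """Eq. (9): average over images of the Pearson correlation between
    per-template prediction errors and predicted confidence scores.

    errors, confidences: sequences with one (M,) array per image,
    where M is the number of anchor/template pairs.
    """
    corrs = [np.corrcoef(e, c)[0, 1] for e, c in zip(errors, confidences)]
    return float(np.mean(corrs))
```

A strongly negative value, such as the P = −0.82 reported above, indicates that higher-confidence templates tend to have lower prediction errors.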
In this section, we perform an ablation study of our proposed algorithm on the AFLW dataset, which is a challenging benchmark with large pose variations.
Fig. 7: Algorithm analysis based on different anchors. The first column shows the anchor grid settings; the green and blue anchors are two selected samples. The second and third columns show two anchor templates with prediction scores, randomly selected from the two anchors. The last column shows the final prediction results, with the ground-truth in red

More specifically, we divide the test set into four subsets according to the yaw dimension, i.e. Light (0°∼30°), Medium (30°∼60°), Large (60°∼90°), and Heavy (above 90°). The normalized mean error is utilized to evaluate the performance of our algorithm. Unless explicitly specified, we use KMeans-24 anchor templates (KMeans clustering to generate 24 anchor templates), a 56 × 56 anchor area, a 7 × 7 anchor grid, and the weighted-average aggregating strategy for the ablation.
Comparison with the regression baseline.
Table 6 compares the performance of our proposed approach with the baseline of direct regression on the AFLW dataset. "Baseline" directly maps the discriminative features to the target landmark coordinates with a ShuffleNet-V2 backbone; a fully connected layer of length 2L is used as the output of the baseline network. As shown in Table 6, our proposed anchor-based method significantly outperforms the baseline by a large margin across yaw variations. The improvements are attributed to two reasons. First, the anchor design can significantly reduce the search space and simplify the regression problem. Second, aggregating different anchors can further improve model robustness.

Comparison of various split configurations.
Due to the challenges from large-pose faces, we propose a set of anchor templates as references for regression to split the search space. The split strategy involves three hyper-parameters: anchor templates, anchor area, and anchor grid, as shown in Fig. 3.
Anchor templates play a voting role in our method as references for regression. As mentioned in Section 3.1, we obtain three basic template faces from the training dataset by hand design or KMeans clustering, and then apply transformations to get more templates, corresponding to the pose variations in the yaw, roll, and pitch dimensions. Comparing KMeans-24 against HandDesign-24 in Table 7, KMeans is better than the hand-design approach for the same anchor number (24). The potential reason is that KMeans utilizes more data features to generate the base anchors, which should be more general compared with hand-designed anchors. Besides, as shown in Table 7, 24 is a good choice for the number of anchor templates compared with 3 or 48 in our algorithm.

Table 6: A comparison of direct regression and anchor-based regression
Method | Full | Light | Medium | Large | Heavy
Baseline | 1.67 | 1.46 | 1.92 | 1.99 | 2.13
AnchorFace | 1.56 | 1.40 | 1.74 | 1.80 | 1.96
Table 7: Comparisons of different template settings

Template | Full | Light | Medium | Large | Heavy
KMeans-3 | 1.60 | 1.43 | 1.80 | 1.83 | 2.10
KMeans-24 | 1.56 | 1.40 | 1.74 | 1.80 | 1.96
KMeans-48 | 1.58 | 1.41 | 1.79 | 1.82 | 2.06
HandDesign-24 | 1.58 | 1.42 | 1.76 | 1.84 | 2.00
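The KMeans-based template generation compared in Table 7 can be sketched as follows. This is a self-contained NumPy sketch: a few Lloyd iterations with a simple deterministic initialization stand in for a full KMeans implementation, and the rotation step produces the 3 × 8 = 24 roll-augmented templates described in Section 3.1.

```python
import numpy as np

def kmeans_anchors(shapes, k=3, iters=20):
    """Cluster (N, L, 2) normalized training shapes into k base anchors."""
    flat = shapes.reshape(len(shapes), -1)
    # deterministic initialization for reproducibility (a real KMeans
    # implementation would use k-means++ style seeding)
    idx = np.linspace(0, len(flat) - 1, k).astype(int)
    centers = flat[idx].astype(float)
    for _ in range(iters):                     # Lloyd iterations
        d = np.linalg.norm(flat[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = flat[labels == j].mean(axis=0)
    return centers.reshape(k, -1, 2)

def roll_templates(base_anchors, step=45):
    """Rotate each base anchor along the roll dimension every `step` degrees."""
    out = []
    for anchor in base_anchors:
        for deg in range(0, 360, step):
            t = np.deg2rad(deg)
            rot = np.array([[np.cos(t), -np.sin(t)],
                            [np.sin(t),  np.cos(t)]])
            out.append(anchor @ rot.T)
    return out   # k * (360 / step) templates, e.g. 3 * 8 = 24
```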
Table 8: Comparisons of different anchor area settings

Anchor area | Full | Light | Medium | Large | Heavy
112 × 112 | 1.58 | 1.40 | 1.79 | 1.83 | 2.06
56 × 56 | 1.56 | 1.40 | 1.74 | 1.80 | 1.96
28 × 28 | 1.58 | 1.42 | 1.79 | 1.82 | 2.04
14 × 14 | 1.58 | 1.41 | 1.81 | 1.83 | 2.01
Table 9: Comparisons of different anchor grid settings

Anchor grid | Full | Light | Medium | Large | Heavy
3 × 3 | | | | |
7 × 7 | 1.56 | 1.40 | 1.74 | 1.80 | 1.96
13 × 13 | 1.56 | 1.40 | 1.75 | 1.81 | 2.07
Table 10: Comparisons of different aggregation strategies

Aggregate | Full | Light | Medium | Large | Heavy
Argmax | 1.58 | 1.42 | 1.76 | 1.83 | 2.00
Weighted | 1.56 | 1.40 | 1.74 | 1.80 | 1.96
Mean | 1.61 | 1.43 | 1.80 | 1.89 | 2.22
Anchor area is the region where we set anchors in the image in the spatial domain. As the input image is cropped and resized to 224 × 224 and the face is around the image center, we set anchor points in a center area of size 14 × 14, 28 × 28, 56 × 56, or 112 × 112. As shown in Table 8, 56 × 56 around the image center is a good choice for putting anchors in the spatial domain.
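The anchor area and anchor grid described above can be instantiated as follows (our own sketch; the exact placement rule of the released code may differ):

```python
import numpy as np

def anchor_grid(img_size=224, area=56, grid=7):
    """Place a grid x grid set of anchor points inside an area x area
    square centered on the img_size x img_size input image."""
    start = (img_size - area) / 2.0                         # top-left of anchor area
    coords = start + (np.arange(grid) + 0.5) * area / grid  # cell centers
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)       # (grid * grid, 2)
```

With the default 224 × 224 input, 56 × 56 area, and 7 × 7 grid, this yields 49 anchor points centered on the image.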
Anchor grid defines how many anchors we set within the anchor area. For example, a 7 × 7 grid uniformly samples 49 anchor points in the anchor area. Comparisons of different anchor grid settings are reported in Table 9.

Comparison of various aggregate strategies.
To mitigate the uncertainty of the localization result on a single anchor face, we aggregate the predictions from different anchor templates. We compare three aggregation strategies: Mean, Argmax, and confidence-weighted voting (Weighted). As shown in Table 10, aggregating the predictions with weighted confidences obtains superior results compared with the Argmax choice, which does not aggregate. Besides, the comparison between "Weighted" and "Mean" shows that the confidence generated by the confidence branch is important.
In this paper, a novel split-and-aggregate strategy is proposed for large-pose faces. By introducing an anchor-based design, our proposed approach simplifies the regression problem by splitting the search space. Moreover, aggregating the prediction results contributes to reducing uncertainty and improving the localization performance. As validated on four challenging benchmarks, our proposed AnchorFace obtains state-of-the-art results with extremely fast inference speed.
References
1. Baltrušaitis, T., Robinson, P., Morency, L.P.: Continuous conditional neural fields for structured regression. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 593–608. Springer International Publishing, Cham (2014)
2. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). vol. 35, pp. 2930–2940 (December 2013)
3. Bhagavatula, C., Zhu, C., Luu, K., Savvides, M.: Faster than real-time facial alignment: A 3d spatial transformer network approach in unconstrained poses. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
4. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence (9), 1063–1074 (Sep 2003). https://doi.org/10.1109/TPAMI.2003.1227983
5. Burgos-Artizzu, X.P., Perona, P., Dollar, P.: Robust face landmark estimation under occlusion. In: The IEEE International Conference on Computer Vision (ICCV) (December 2013)
6. Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression. International Journal of Computer Vision (2), 177–190 (Apr 2014). https://doi.org/10.1007/s11263-013-0667-3
7. Chen, D., Ren, S., Wei, Y., Cao, X., Sun, J.: Joint cascade face detection and alignment. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 109–122. Springer International Publishing, Cham (2014)
8. Cristinacce, D., Cootes, T.: Automatic feature localisation with constrained local models. Pattern Recognition (10), 3054–3067 (2008). https://doi.org/10.1016/j.patcog.2008.01.024
9. Cristinacce, D., Cootes, T.F.: Feature detection and tracking with constrained local models. In: BMVC (2006)
10. Deng, J., Roussos, A., Chrysos, G., Ververas, E., Kotsia, I., Shen, J., Zafeiriou, S.: The menpo benchmark for multi-pose 2d and 3d facial landmark localisation and tracking. International Journal of Computer Vision (6), 599–624 (Jun 2019). https://doi.org/10.1007/s11263-018-1134-y
11. Deng, J., Trigeorgis, G., Zhou, Y., Zafeiriou, S.: Joint multi-view face alignment in the wild. IEEE Transactions on Image Processing, 3636–3648 (2017)
12. Dong, X., Yan, Y., Ouyang, W., Yang, Y.: Style aggregated network for facial landmark detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 379–388 (June 2018). https://doi.org/10.1109/CVPR.2018.00047
13. Dong, X., Yu, S.I., Weng, X., Wei, S.E., Yang, Y., Sheikh, Y.: Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
14. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (9), 1627–1645 (Sep 2010). https://doi.org/10.1109/TPAMI.2009.167
15. Feng, Z., Kittler, J., Christmas, W., Huber, P., Wu, X.: Dynamic attention-controlled cascaded shape regression exploiting training data augmentation and fuzzy-set sample weighting. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3681–3690 (July 2017)
16. Feng, Z.H., Kittler, J., Awais, M., Huber, P., Wu, X.J.: Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks. arXiv e-prints arXiv:1711.06753 (Nov 2017)
17. Ikeuchi, K., Hebert, M., Delingette, H.: A spherical representation for recognition of free-form surfaces. IEEE Transactions on Pattern Analysis & Machine Intelligence (07), 681–690 (Jul 1995). https://doi.org/10.1109/34.391410
18. Jourabloo, A., Liu, X.: Large-pose face alignment via cnn-based dense 3d model fitting. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
19. Jourabloo, A., Ye, M., Liu, X., Ren, L.: Pose-invariant face alignment with a single cnn. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
20.
Kahraman, F., Gokmen, M., Darkner, S., Larsen, R.: An active illumina-tion and appearance (aia) model for face alignment. In: 2007 IEEE Con-ference on Computer Vision and Pattern Recognition. pp. 1–7 (June 2007).https://doi.org/10.1109/CVPR.2007.38339921. Kumar, A., Chellappa, R.: Disentangling 3d pose in a dendritic cnn for uncon-strained 2d face alignment. In: The IEEE Conference on Computer Vision andPattern Recognition (CVPR) (June 2018)22. Kstinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarksin the wild: A large-scale, real-world database for facial landmark localization. In:2011 IEEE International Conference on Computer Vision Workshops (ICCV Work-shops). pp. 2144–2151 (Nov 2011). https://doi.org/10.1109/ICCVW.2011.613051323. Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: Interactive facial featurelocalization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C.(eds.) Computer Vision – ECCV 2012. pp. 679–692. Springer Berlin Heidelberg,Berlin, Heidelberg (2012)24. Liu, Y., Jourabloo, A., Ren, W., Liu, X.: Dense face alignment. In: The IEEEInternational Conference on Computer Vision (ICCV) Workshops (Oct 2017)nchorFace 1725. Liu, Z., Zhu, X., Hu, G., Guo, H., Tang, M., Lei, Z., Robertson, N.M., Wang,J.: Semantic alignment: Finding semantically consistent ground-truth for faciallandmark detection. ArXiv abs/1903.10661 (2019)26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic seg-mentation. In: The IEEE Conference on Computer Vision and Pattern Recognition(CVPR) (June 2015)27. Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X.: A deep regression architecturewith two-stage re-initialization for high performance facial landmark detection. In:2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp.3691–3700 (July 2017). https://doi.org/10.1109/CVPR.2017.39328. 
Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X.: A deep regression architecturewith two-stage re-initialization for high performance facial landmark detection. In:The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July2017)29. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines forefficient cnn architecture design. In: The European Conference on Computer Vision(ECCV) (September 2018)30. Matthews, I., Baker, S.: Active appearance models revisited. Inter-national Journal of Computer Vision (2), 135–164 (Nov 2004).https://doi.org/10.1023/B:VISI.0000029666.37597.d3, https://doi.org/10.1023/B:VISI.0000029666.37597.d3
31. Messer, K., Matas, J., Kittler, J., Jonsson, K.: Xm2vtsdb: The extended m2vtsdatabase. In: In Second International Conference on Audio and Video-based Bio-metric Person Authentication. pp. 72–77 (1999)32. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shapemodel. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) Computer Vision – ECCV2008. pp. 504–513. Springer Berlin Heidelberg, Berlin, Heidelberg (2008)33. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose esti-mation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision –ECCV 2016. pp. 483–499. Springer International Publishing, Cham (2016)34. Qian, S., Sun, K., Wu, W., Qian, C., Jia, J.: Aggregation via Separation: BoostingFacial Landmark Detector With Semi-Supervised Style Translation p. 1135. Ranjan, R., Patel, V.M., Chellappa, R.: Hyperface: A deep multi-task learningframework for face detection, landmark localization, pose estimation, and gen-der recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (1), 121–135 (Jan 2019). https://doi.org/10.1109/TPAMI.2017.278123336. Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment via regressing local binaryfeatures. IEEE Transactions on Image Processing (3), 1233–1245 (March 2016).https://doi.org/10.1109/TIP.2016.251886737. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wildchallenge: The first facial landmark localization challenge. In: 2013 IEEE Inter-national Conference on Computer Vision Workshops. pp. 397–403 (Dec 2013).https://doi.org/10.1109/ICCVW.2013.5938. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wildchallenge: The first facial landmark localization challenge. 2013 IEEE InternationalConference on Computer Vision Workshops pp. 397–403 (2013)39. Saragih, J., Goecke, R.: A nonlinear discriminative approach to aam fitting. In:2007 IEEE 11th International Conference on Computer Vision. pp. 
1–8 (Oct 2007).https://doi.org/10.1109/ICCV.2007.44091068 Z. Xu, B. Li et al.40. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial pointdetection. In: The IEEE Conference on Computer Vision and Pattern Recognition(CVPR) (June 2013)41. Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, E., Zafeiriou, S.: Mnemonicdescent method: A recurrent process applied for end-to-end face alignment. In:2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp.4177–4187 (June 2016). https://doi.org/10.1109/CVPR.2016.45342. Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at boundary: Aboundary-aware face alignment algorithm. In: CVPR (2018)43. Wu, W., Yang, S.: Leveraging intra and inter-dataset variations for robust facealignment. In: The IEEE Conference on Computer Vision and Pattern Recognition(CVPR) Workshops (July 2017)44. Xiao, S., Feng, J., Liu, L., Nie, X., Wang, W., Yan, S., Kassim, A.: Recurrent 3d-2ddual learning for large-pose facial landmark detection. In: The IEEE InternationalConference on Computer Vision (ICCV) (Oct 2017)45. Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou Tianyi, J., Yuan, J.: A2j:Anchor-to-joint regression network for 3d articulated pose estimation from a singledepth image. In: Proceedings of the IEEE Conference on International Conferenceon Computer Vision (ICCV) (2019)46. Xiong, X., De la Torre, F.: Global supervised descent method. In: The IEEE Con-ference on Computer Vision and Pattern Recognition (CVPR) (June 2015)47. Xu, X., Kakadiaris, I.A.: Joint head pose estimation and face alignment frame-work using global and local cnn features. In: 2017 12th IEEE International Confer-ence on Automatic Face Gesture Recognition (FG 2017). pp. 642–649 (May 2017).https://doi.org/10.1109/FG.2017.8148. Yang, H., Jia, X., Loy, C.C., Robinson, P.: An empirical study of recent face align-ment methods. CoRR abs/1511.05049 (2015), http://arxiv.org/abs/1511.05049
49. Yang, J., Liu, Q., Zhang, K.: Stacked hourglass network for robust facial landmarklocalisation. In: The IEEE Conference on Computer Vision and Pattern Recogni-tion (CVPR) Workshops (July 2017)50. Yu, X., Huang, J., Zhang, S., Yan, W., Metaxas, D.N.: Pose-free facial landmarkfitting via optimized part mixtures and cascaded deformable shape model. In: 2013IEEE International Conference on Computer Vision. pp. 1944–1951 (Dec 2013).https://doi.org/10.1109/ICCV.2013.24451. Zadeh, A., Chong Lim, Y., Baltrusaitis, T., Morency, L.P.: Convolutional expertsconstrained local model for 3d facial landmark detection. In: The IEEE Interna-tional Conference on Computer Vision (ICCV) Workshops (Oct 2017)52. Zafeiriou, S., Trigeorgis, G., Chrysos, G., Deng, J., Shen, J.: The menpo facial land-mark localisation challenge: A step towards the solution. 2017 IEEE Conference onComputer Vision and Pattern Recognition Workshops (CVPRW) pp. 2116–2125(2017)53. Zhang, J., Shan, S., Kan, M., Chen, X.: Coarse-to-fine auto-encoder networks (cfan)for real-time face alignment. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T.(eds.) Computer Vision – ECCV 2014. pp. 1–16. Springer International Publishing,Cham (2014)54. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment usingmultitask cascaded convolutional networks. IEEE Signal Processing Letters23