Fast Facial Landmark Detection and Applications: A Survey
Kostiantyn S. Khabarlak*, Larysa S. Koriashkina* *Dnipro University of Technology, Dnipro, Ukraine Emails: [email protected] [email protected]
Abstract:
In this paper we survey and analyze modern neural-network-based facial landmark detection algorithms. We focus on approaches that have led to a significant increase in quality over the past few years on datasets with large pose and emotion variability and high levels of face occlusion, all of which are typical in real-world scenarios. We group the improvements into categories and provide a quality comparison on difficult, modern in-the-wild datasets: 300-W, AFLW, WFLW, COFW. Additionally, we compare algorithm speed on CPU, GPU and Mobile devices. For completeness, we also briefly touch on established methods with open implementations available. Besides, we cover applications and vulnerabilities of the landmark detection algorithms, and based on them raise problems that, we hope, will lead to further algorithm improvements in the future.
Keywords:
Computer vision, neural networks, facial landmarks, mobile applications, driver status monitoring, face reenactment, face recognition, survey.
Introduction
Neural networks show high quality in solving tasks at which we, humans, are good, such as image classification or natural language processing. Initial neural network research was focused on large-scale servers with many GPUs and a stable power supply. However, the development of the Internet of Things and mobile devices makes client-server applications sometimes impractical or even unacceptable, for instance, when internet connectivity is poor, when low data processing latency is required, when the application needs to provide a user data security guarantee, meaning that no data can leave the user's device, or when the amount of raw data generated is too large to be sent over to a server. In many of these cases the use of neural networks is desirable, but the processing should be done directly on the mobile device; thus on-device machine learning has become one of the most prominent machine learning research directions [1], [2]. In this paper we focus on one particular application of mobile machine learning, namely facial landmark detection, as it is a part of many algorithms that process face images in some way, where low processing latency and a guarantee that no data will leave the device are often required. In this review we: 1) describe and analyze the key ideas of modern facial landmark detection algorithms that have improved detection accuracy; 2) highlight the essential requirements on facial landmark detection algorithms with respect to their practical applications; 3) point out the weaknesses of the existing approaches and outline the prospects of their development.
1. Facial landmark detection problem statement
Let I be an input image of size W × H × C, where W is the width, H the height, and C the number of image color channels (usually 3). The facial landmark detection problem is then to find a function D: I → L that predicts from the input image I a landmark vector L, containing the x and y coordinates of each landmark. The number of landmarks can differ depending on the target task in which they will be used and on the training dataset. The quality of the constructed function D is usually assessed on test sets. Next, we provide a brief description of the facial landmark detection datasets. Each of them has a special protocol, which defines the train/test split, the metrics for algorithm comparison and other testing details. The protocol is described in the paper in which the dataset was first introduced. The main metrics include [3-7]:
1. Normalized Mean Error (NME, %):

$$\mathrm{NME} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{NME}_k, \qquad \mathrm{NME}_k = \frac{1}{N_L}\sum_{i=1}^{N_L}\frac{\lVert y_i - \hat{y}_i\rVert}{d}\times 100, \quad (1)$$

where y are the true landmark locations, ŷ the model (function D) predictions, d a normalization coefficient (different for each dataset), N_L the number of facial landmarks per face in the dataset, and K the number of images in the test set.
2. Failure Rate (%):

$$\mathrm{FR} = \frac{1}{K}\sum_{k=1}^{K}\left[\mathrm{NME}_k \ge 10\%\right]\times 100. \quad (2)$$

3. Cumulative Error Distribution – Area Under Curve (CED-AUC); the higher, the better. Here the number of images whose NME is lower than a particular threshold (Y axis) is plotted against the NME threshold value (X axis).
Table 1 lists the number of images in the train and test sets, as well as the number of facial landmarks each dataset has been labeled with. COFW-68 has only a test set (more on that later).
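Formulas (1) and (2) can be sketched in plain Python; the function and variable names are ours, and the normalization coefficient d and the failure threshold follow each dataset's protocol:

```python
import math

def nme(pred, gt, d):
    """Normalized Mean Error (%) for a single face, formula (1).

    pred, gt: lists of (x, y) landmark coordinates; d: normalization
    coefficient (e.g. inter-ocular or inter-pupil distance)."""
    errs = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errs) / (len(errs) * d) * 100

def dataset_nme_fr(preds, gts, ds, fail_thresh=10.0):
    """Mean NME and Failure Rate (%) over a test set, formulas (1)-(2)."""
    per_image = [nme(p, g, d) for p, g, d in zip(preds, gts, ds)]
    mean_nme = sum(per_image) / len(per_image)
    fr = sum(e >= fail_thresh for e in per_image) / len(per_image) * 100
    return mean_nme, fr
```

Per-image NME values collected this way can also be turned into the CED curve by counting how many images fall under each error threshold.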
Table 1. Information about facial landmark detection datasets
The 300W dataset [3] is a collection of several datasets, such as HELEN, LFPW, AFW and IBUG, that were labeled with 68 facial landmarks (fig. 1, a). The protocol described in [3] defines which images should be used for training and which for testing. The testing subset is split into common, challenge and full. NME scores on each of the splits are usually presented for comparison. The NME is normalized (d in formula (1)) by the inter-pupil or inter-ocular distance, so that faces of different sizes make an equal contribution to the resulting error. Note that images in the 300W dataset differ in shooting conditions (lighting, color gamut), emotions and face angles. Annotated Facial Landmarks in the Wild (AFLW) [4] contains a larger number of images (table 1), yet they are labeled with only 21 facial landmarks (fig. 1, b). Compared to 300W, this dataset has a wider range of face shooting angles: ±120° yaw and ±90° pitch. The authors propose splitting the dataset into AFLW-Frontal (face photos that are close to frontal) and AFLW-Full (all images). There is also a version relabeled with 68 facial landmarks, named AFLW-68 [5], yet in practice it is used less often.
The MERL-RAV dataset, presented in [6], is AFLW relabeled to 68 landmarks, where each landmark has an extra visibility label: 1) visible; 2) self-occluded (for instance, due to a large pose); 3) occluded by another object (hand, etc.). The NME metric is used for comparison.
Fig. 1. Images from the a – 300W, b – AFLW, c – COFW, d – WFLW datasets
Caltech Occluded Faces in the Wild (COFW) [7] is a more complicated dataset (fig. 1, c), which focuses on labeling face images that are partially occluded by real-world objects (microphone, etc.) or by the person themselves (hair, hand, etc.). The dataset uses not only the NME metric, but also the failure rate (FR, formula (2)), the percentage of images whose landmark detection error is higher than a certain threshold. The COFW test set has also been relabeled to 68 landmarks in COFW-68 [8], which can be used to assess landmark detection quality when the network has been trained on different datasets.
Wider Facial Landmarks in-the-wild (WFLW) [9] is one of the newest and most difficult datasets, as the task is to densely label facial landmarks under a wide range of emotions, poses, lighting conditions, makeup, occlusion and blurriness (fig. 1, d). Three metrics are used to report the results: NME, Failure Rate and CED-AUC.
2. Facial landmark detection algorithms
However, in real-world shooting conditions, required by many applications, their quality is insufficient. The next group of approaches includes methods based on Random Forests and Gradient Boosting, such as the ERT method [12], which we describe below. Such methods have better accuracy, yet still fail in certain applications. Currently, neural regression-based algorithms show the lowest error on the facial landmark detection task under wide shooting angles and high occlusion. They include: direct regression methods, where the model predicts the x, y coordinates directly for each landmark; heatmap regression methods, where a 2D heatmap is built for each landmark, whose values can be interpreted as probabilities of the landmark being located at a certain image position. Also, some algorithms are implemented as cascades, where the prediction is refined over several steps.
2.2. Brief description of established methods
Dlib [13] is an open-source machine learning library. Among others, it includes the ERT [12] facial landmark detection algorithm, a cascade based on gradient boosting. In ERT, a face template is refined over several iterations, starting from a mean template placed over the face bounding rectangle found by the Viola-Jones face detector. High detection speed is the main advantage of ERT (according to the authors, around 1 millisecond per face). The library contains an ERT implementation trained on the 300W dataset. The algorithm is still actively used in modern research thanks to its open implementation and speed. However, it has recently been shown that neural networks are preferable in terms of quality for faces with high pose variability [14].
Multi-task Cascaded Convolutional Networks (MTCNN) [15] trains the neural network jointly to detect faces and landmark locations (five of them, to be precise: eyes, tip of the nose, mouth corners), which improves quality on both tasks. The network is built as a three-network cascade: Proposal Network (P-Net), Refine Network (R-Net), Output Network (O-Net). Each of them predicts a face bounding rectangle, the probability that a particular rectangle contains a face, and five landmarks. P-Net is a fast fully convolutional network, which processes the original image at multiple resolutions (the so-called image pyramid). This network outputs a lot of coarse face rectangle predictions, which are then filtered by the Non-Maximum Suppression (NMS) algorithm. Subsequently, R-Net refines the predicted rectangles without reprocessing the whole image, which saves computation time. NMS is then applied again. Last, O-Net makes the final refinement (fig. 2). This is the slowest network in the cascade, but it processes a small number of face rectangles. According to the authors, to improve quality it is important to solve the following tasks at the same time: 1) classify a bounding rectangle as face or non-face; 2) perform regression over the bounding rectangle coordinates; 3) localize the facial landmarks. Each of these tasks is assigned a weight α: for P-Net and R-Net α_det = 1, α_box = α_landmark = 0.5; for O-Net α_det = 1, α_box = 0.5, α_landmark = 1. Another feature of the algorithm is online hard-example mining, when training is performed on complicated training examples while skipping those on which the network prediction is already quite accurate. In the paper the authors select around 70% of the hardest examples in each training batch.
Fig. 2. MTCNN network architecture. A set of images at multiple resolutions is fed through the P-Net, R-Net, O-Net neural network cascade [15]
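The NMS filtering applied between the cascade stages is standard greedy non-maximum suppression; a minimal sketch (not the authors' implementation, names are ours):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop all boxes that
    overlap it above iou_thresh, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

In MTCNN this is run first on the coarse P-Net proposals and again after R-Net refinement, so that O-Net only processes a handful of candidate rectangles.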
Note that while ERT and MTCNN were not initially designed for use on smartphones, open reimplementations of these algorithms are available for Android and iOS devices. A more comprehensive survey of early neural-network-based facial landmark detection algorithms can be found in [11] and [16].
2.3. A survey of key modern facial landmark detection developments
Dense Face Alignment (DeFA) [17] is the only algorithm described in this section in which the neural network predicts facial landmarks through a 3D deformable face mesh. The algorithm is interesting in that: 1) it allows building a dense 3D face mesh from only a single 2D image, for a wide range of poses and emotions (fig. 3); 2) DeFA can be trained jointly on datasets with different numbers of landmarks, as they are "hooked" onto the mesh as constraints.
Fig. 3. Upper row: DeFA facial landmark prediction. Bottom row: DeFA dense 3D model [17]
Style Aggregated Network (SAN) [18]. The authors noticed the style variability of photographs in the 300W and AFLW datasets, which can be dark or light, colored or black & white. Algorithms existing to date did not account for that information. Furthermore, the authors noticed that, depending on style, prior algorithms predicted facial landmark locations in slightly different places, with higher error on photographs with harsh lighting conditions. As a solution they proposed: first, to train the Generative Adversarial Network CycleGAN [19] to transform images of different styles into a neutral style; second, to train another neural network to predict landmarks from two inputs: the "neutral" and the original image (fig. 4). As the authors note, the "neutral" image produced by a GAN might lack fine details, which is why adding the original image helps localization in certain cases.
Fig. 4. Style aggregation in SAN. In each pair: left – source image, right – style-aggregated ("neutral") image [18]
Look at Boundary (LAB) [9]. The key advancement of this architecture is the introduction of a face boundary heatmap, built as an intermediate representation between the original image and the predicted landmarks (fig. 5). Such a trick improves facial landmark prediction quality and, furthermore, allows training the boundary estimation module on several datasets with different annotation schemes at once. After the boundary module (the Hourglass [20] architecture has been used), another network predicts the actual facial landmark locations. It should be noted that only the boundary submodule can be trained on datasets with different annotation schemes, while the landmark regression is trained for each dataset separately. As the authors have shown, pretraining the boundary module on 300W improves prediction quality on the AFLW and COFW datasets. The authors have also proposed a new, more complicated facial landmark dataset, named WFLW (fig. 1, d).
Fig. 5. a – image to be labeled, b – intermediate boundary representation, c – predicted facial landmarks [9]
Wing Loss [14]. The authors note that the field of loss functions for the facial landmark prediction problem is barely studied. Most researchers use
the L2 loss, $L_2(x) = x^2/2$, for direct-regression landmark prediction methods, which is known to be sensitive to outliers; that is why some prior works have used the smooth L1 loss instead. The authors compare L2 against other loss functions, such as $L_1(x) = |x|$ and smooth L1, defined as [21]:

$$\mathrm{smooth}\,L_1(x) = \begin{cases} x^2/2, & \text{if } |x| < 1,\\ |x| - 1/2, & \text{otherwise,} \end{cases} \quad (3)$$

and note that these give better results. The main contribution of the paper is a new loss, named Wing loss, which combines L1 for large landmark deviations and a logarithm for medium and small ones:

$$\mathrm{wing}(x) = \begin{cases} w \ln\left(1 + |x|/\epsilon\right), & \text{if } |x| < w,\\ |x| - C, & \text{otherwise,} \end{cases} \quad (4)$$

where $C = w - w\ln(1 + w/\epsilon)$, and $w$ and $\epsilon$ are hyperparameters ($w = 15$, $\epsilon = 3$ in the paper). A visual comparison of the loss functions is presented in fig. 6.
Fig. 6. Loss function comparison: L2, L1, smooth L1, Wing (with w = 15, ε = 3)
In addition, to train more on hard examples, the authors introduce the PDB (Pose-based Data Balancing) algorithm, which works as follows: 1) a face rotation angle histogram is built; 2) rare examples (determined via the histogram) are duplicated with augmentation. As can be seen from table 2, using the CNN-6/7 cascade with the wing loss in combination with PDB substantially lowers the NME.
Table 2. NME comparison of different loss functions on the AFLW dataset
Network         | L2   | L1   | smooth L1 | Wing
CNN-6/7         | 2.06 | 1.82 | 1.84      | 1.71
CNN-6/7 + PDB   | 1.94 | 1.73 | 1.76      | 1.65
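Formula (4) transcribes directly into Python; this is a scalar sketch with the paper's default hyperparameters (during training the loss is applied elementwise to landmark coordinate errors):

```python
import math

def wing_loss(x, w=15.0, eps=3.0):
    """Wing loss, formula (4): logarithmic near zero (amplifying small and
    medium errors), L1-like for large deviations. The constant c makes the
    two pieces join continuously at |x| = w."""
    c = w - w * math.log(1 + w / eps)
    if abs(x) < w:
        return w * math.log(1 + abs(x) / eps)
    return abs(x) - c
```

Note that the function is continuous at the switch point |x| = w, but its gradient is discontinuous at zero, which is exactly the property the AWing paper (described below) sets out to fix for heatmap regression.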
Practical Facial Landmark Detector (PFLD) [22] outperforms many algorithms on the NME metric on the 300W and AFLW datasets. At the same time, it is easy to implement and allows fast facial landmark detection directly on a mobile device. This is, apparently, the only modern neural-network-based algorithm whose authors have shown that it can work efficiently on a mobile device. MobileNetV2 [2] is used as the feature extractor in PFLD. Two heads are attached to it: 1) facial landmark regression, with a multi-scale fully convolutional layer at the end of the head; 2) a 3D face model rotation angle estimator (yaw, pitch and roll). The second head contains a set of convolutional layers and is only used during training (fig. 7). As the most common datasets have no information about 3D landmark locations, the authors propose to: 1) build a "mean" facial representation containing 11 facial landmarks, based on the data in the training set; 2) estimate, for each face, the rotation matrix between its landmarks and the "mean" ones; 3) compute the yaw, pitch, roll angles from the rotation matrix. According to the authors, such an approach is not very accurate for estimating the angles, yet it improves facial landmark prediction accuracy at inference time. Furthermore, during training the data is weighted based on image difficulty using a special loss function:
$$\mathcal{L} = \frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\sum_{c=1}^{C}\omega_n^c \sum_{k=1}^{K}\left(1 - \cos\theta_n^k\right)\right)\left\lVert d_n^m \right\rVert^2, \quad (5)$$

where N is the number of facial landmarks, M is the number of training examples, K = 3, and θ_n^1, θ_n^2, θ_n^3 are the yaw, pitch, roll rotation angles of the above-described 3D face model, respectively; d_n^m represents the difference vector between the n-th predicted and ground-truth facial landmark for the m-th image; C is the number of complexity classes for face images (such as profile or frontal face, face-up, face-down, emotions or occlusion); ω_n^c is set as the ratio of images in the corresponding complexity class to their total number M.
Fig. 7. PFLD architecture. The upper block predicts yaw, pitch, roll rotation and is used only during training. The lower block predicts facial landmark locations [22]
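An illustrative reading of formula (5) in plain Python; the indexing of ω and θ is simplified to per-image values with respect to the paper, and all names are ours:

```python
import math

def pfld_loss(d, theta, omega):
    """Difficulty-weighted landmark loss in the spirit of formula (5).

    d[m]     -- list of (dx, dy) differences between predicted and
                ground-truth landmarks on image m
    theta[m] -- (yaw, pitch, roll) of image m, in radians
    omega[m] -- complexity-class weights of image m (one per class)

    Large head rotations make (1 - cos theta) grow, so hard poses
    contribute more to the total loss."""
    M = len(d)
    total = 0.0
    for m in range(M):
        angle_term = sum(1 - math.cos(t) for t in theta[m])
        weight = sum(omega[m]) * angle_term
        for dx, dy in d[m]:
            total += weight * (dx * dx + dy * dy)
    return total / M
```

A frontal face (all angles zero) contributes nothing extra, while a profile face with yaw near π is penalized heavily, which matches the intent of the PFLD weighting scheme.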
AWing [23]. The algorithm uses heatmap regression, where for each landmark a heatmap of size 64 × 64 is built, on which the landmark location is estimated via a Gaussian distribution. The algorithm builds on the above-described Wing loss paper, approaches from [24], the LAB paper (previously described) and CoordConv [25]. The authors noticed that the L2 loss function does not produce sharp enough heatmaps on "difficult" faces, because it is insensitive to small errors, while the original Wing loss is inappropriate for the heatmap regression problem as its gradient is discontinuous at zero. In addition, in the discussed problem each heatmap has a class imbalance, as only a few pixels on the map relate to the foreground class (meaning that the landmark is likely to be at this point), while most of the image is labeled as the background class. This is also not considered in the original Wing loss formulation. To account for all of the described features, the Adaptive Wing loss is introduced in [23], which: 1) is differentiable around zero; 2) accentuates small errors around foreground pixels, but not around background ones (fig. 8, 9). We do not give the function itself here due to its complexity. To predict foreground pixels even more precisely, the authors introduce a special weighted loss map, which further enhances the sharpness of the facial landmark heatmap.
Fig. 8. a – AWing surface plot, b – its gradient: the function behaves as L2 for background pixels and as Wing for foreground, while preserving continuity [23]
Fig. 9. AWing heatmap prediction comparison: a – blurry source image with a large pose; b – ground-truth heatmap; heatmap when trained with c – L2 loss (NME: 6.27%), d – AWing loss (NME: 4.23%) [23]
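The heatmap-regression target described above can be sketched as follows; the 64 × 64 size follows the text, while the value of sigma and the function names are our illustrative choices:

```python
import math

def make_heatmap(cx, cy, size=64, sigma=1.5):
    """Ground-truth heatmap for one landmark: a 2D Gaussian centered at
    the annotated landmark location (cx, cy). Values near the landmark
    (foreground) approach 1; most of the map (background) is near 0."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)]

def decode_heatmap(hm):
    """Recover the landmark location as the heatmap argmax."""
    best = max((v, x, y) for y, row in enumerate(hm)
               for x, v in enumerate(row))
    return best[1], best[2]
```

The foreground/background imbalance the AWing authors address is visible directly here: for a 64 × 64 map with sigma = 1.5, only a few dozen pixels carry non-negligible Gaussian mass.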
Geometry Aggregated Network (GEAN) [26]. Based on the Hourglass architecture, the authors propose a network that uses the Adversarial Attack method during both training and testing (we take a deeper look at the concept of adversarial attacks in section 4 of this paper). The network forward pass is done in several steps: 1) using trained face recognition and adversarial attack algorithms (from [27]), K images that fool the facial detector are generated; 2) for each of the images, an Hourglass network predicts the facial landmark locations; 3) since in step 1 geometrical transformations were applied to the images, moving the facial landmark locations, this step reverts them; 4) the results on all K images are aggregated. According to the authors, with respect to the performance/quality ratio, it is most beneficial to generate K = 5 adversarial examples during training and testing. It is also possible to use a different number of adversarial images during training and testing. The authors have explored several modifications of the adversarial attack algorithm, and the best results are obtained when the attack scale is set individually for each semantic group of landmarks. The groups are assigned based on face region, such as nose, eyes, eyebrows, etc. Note that while during training such groups can be obtained directly from the training labels, during testing an additional forward pass through the network is required to estimate them.
Deep Adaptive Graph (DAG) [28]. The authors propose to use a Graph Convolutional Network cascade. This approach gives better results on many of the datasets, as the network can better "understand" the image structure. As can be seen from the comparison, previous architectures fail to comprehend that an image contains two overlapping faces, because of which the predicted facial landmarks are distributed between the two faces (fig. 10, a). In contrast, DAG assigns all of the landmarks to a single face (fig. 10, b). Here green dots show predicted landmark locations, while ground-truth labels are shown in red. Moreover, the learned graph representation makes sense: fig. 10, c shows the edges with the top-10 weights.
Fig. 10. a – previous best vs b – DAG prediction (predictions are shown in green, true labels in red); c – learned graph facial landmark representation [28]
Other approaches include:
MobileFAN [29], which considers the problems of reducing the number of model parameters and increasing inference speed for heatmap regression methods;
LUVLi [6], where it has been highlighted that facial landmark detection algorithms are used in many critically important applications. The authors propose a method that predicts, together with the landmark coordinates, the landmark visibility and the algorithm's confidence. They have also relabeled the AFLW dataset and shown that their algorithm properly reports low landmark prediction confidence in regions occluded by other objects, and high confidence in well-visible regions.
2.4. Summarizing approaches in modern facial landmark detection algorithms
All of the recent implementations of neural network facial landmark detection algorithms clearly show that, for accurate neural network training, the information explicitly presented in the dataset as a pair (input image – labeled facial landmarks) is insufficient. To solve this problem, several approaches have been proposed:
• use of an auxiliary representation, which contains structural information about the face, such as 3D facial landmark locations (DeFA), facial boundaries (LAB), yaw, pitch, roll rotation angles (PFLD), landmark visibility (LUVLi) or a graph face model (DAG);
• hard example mining during training; different variations on the theme have been presented in the MTCNN, Wing and PFLD papers;
• reducing the contribution of very large errors (outliers) and increasing the contribution of small to medium-sized errors (refining the prediction): Wing, AWing.
Tables 3-5 present metrics of selected facial landmark detection methods on the most common datasets. Table 3 for the 300W dataset has metrics split into common, challenge and full, as per the protocol. In each table the best results are highlighted in bold; pretrained models (meaning additional data was used) are shown in italics. The metrics in the tables include NME, Failure Rate (FR, %) and CED-AUC. The tables are filled based on the results presented in the corresponding papers.
If the results were published later, the source of the metrics is shown in square brackets. Although significant growth of algorithm quality over recent years is evident, the appearance of new datasets with more difficult real-world shooting conditions, such as COFW and WFLW (table 5), clearly shows that the problem of precise facial landmark estimation is still unsolved. Besides, very little attention has been paid to algorithm performance. Table 6 gives an estimation of algorithm performance on Desktop CPU, GPU and Mobile devices. As the authors used different hardware for their experiments, the timings presented there are rough. And although the demand for fast facial landmark detection on mobile and portable devices is growing, only one of the recent algorithms, namely PFLD, was adapted to a mobile device.

Table 3. Landmark detection algorithm comparison on the 300-W dataset (NME, %)
Model                  | Year | Common | Challenge | Full
Inter-pupil distance normalization:
ERT                    | 2014 | –      | –         | 6.40 [14]
LAB                    | 2018 | 3.42   | 6.98      | 4.12
CNN-6 + PDB (Wing)     | 2018 | 3.35   | 7.20      | 4.10
CNN-6/7 + PDB (Wing)   | 2018 | 3.27   | 7.18      | 4.04
ResNet-50 + PDB (Wing) | 2018 | –      | –         | –
PFLD 0.25X             | 2019 | 3.38   | 6.83      | 4.02
PFLD 1X                | 2019 | 3.32   | 6.56      | 3.95
PFLD 1X+               | 2019 | 3.17   | –         | –
AWing                  | 2019 | 3.77   | 6.52      | 4.31
DAG                    | 2020 | 3.64   | 6.88      | 4.27
Inter-ocular distance normalization:
DeFA                   | 2017 | 5.37   | 9.38      | 6.10
SAN                    | 2018 | 3.34   | 6.60      | 3.98
LAB                    | 2018 | 2.98   | 5.19      | 3.49
PFLD 0.25X             | 2019 | 3.03   | 5.15      | 3.45
PFLD 1X                | 2019 | 3.01   | 5.08      | 3.40
PFLD 1X+               | 2019 | 2.96   | 4.98      | 3.37
AWing-1HG              | 2019 | 2.81   | 4.72      | –
MobileFAN (0.5)        | 2020 | 4.22   | 6.87      | 4.74
MobileFAN              | 2020 | 2.98   | 5.34      | 3.45
GEAN                   | 2020 | 2.68   | 4.71      | 3.05
LUVLi                  | 2020 | 2.76   | 5.16      | 3.23
DAG                    | 2020 | –      | –         | –
Table 4. Facial landmark detection comparison on AFLW (NME, %)
Model                  | AFLW-Full | AFLW-Frontal
SAN                    | 1.91      | 1.85
LAB                    | 1.85      | 1.62
LAB (pretrained)       | –         | –
CNN-6 + PDB (Wing)     | 1.83      | –
CNN-6/7 + PDB (Wing)   | 1.65      | –
ResNet-50 + PDB (Wing) | 1.47      | –
PFLD 0.25X             | 2.07      | –
PFLD 1X                | 1.88      | –
AWing                  | 1.53      | 1.38
GEAN                   | 1.59      | 1.34
Table 5. COFW and WFLW algorithm error comparison
                 | COFW                  | WFLW
Model            | NME % (↓) | FR % (↓)  | NME % (↓) | FR % (↓)  | AUC (↑)
LAB              | 5.58      | 2.76      | 5.27      | 7.56      | 0.5323
LAB (pretrained) | –         | –         | –         | –         | –
Wing             | 5.44 [23] | 3.75 [23] | 5.11 [23] | 6.00 [23] | 0.5504 [23]
AWing            | 4.94      | 0.99      | 4.36      | –         | –
MobileFAN (0.5)  | 3.68      | 0.59      | 5.59      | 6.72      | 0.4682
MobileFAN        | 3.66      | 0.59      | 4.93      | 5.32      | 0.5296
LUVLi            | –         | –         | 4.37      | 3.12      | 0.577
DAG              | –         | –         | 4.21      | –         | –
Table 6. Algorithm inference speed comparison
Model            | CPU (ms)  | GPU (ms) | Mobile (ms)
ERT              | ~1        | –        | –
SAN              | –         | 343 [22] | –
LAB              | 2600 [22] | 60       | –
CNN-6 (Wing)     | 6.7       | 2.5      | –
CNN-6/7 (Wing)   | 50        | 5.9      | –
ResNet-50 (Wing) | 125       | 33.3     | –
PFLD 0.25X       | 1.2       | –        | –
PFLD 1X/1X+      | 6.1       | 3.5      | 26.4
AWing-1HG        | –         | 8.3      | –
AWing-2HG        | –         | 15.7     | –
AWing-3HG        | –         | 22.1     | –
AWing            | –         | 29.0     | –
MobileFAN (0.5)  | –         | 4.0      | –
MobileFAN        | –         | 4.2      | –
GEAN             | –         | 58.8     | –
LUVLi            | –         | 17       | –
3. Facial landmark detection algorithm applications
Besides that, the DeFA [17] algorithm can build a whole-face 3D mesh for varied poses and emotions, as has been said previously (fig. 3). Many of the modern neural-network-based algorithms do not use an intermediate 3D face model for realistic image generation, but generate images directly from facial landmark locations via Generative Adversarial Networks (GANs), first introduced in [31]. For instance, in [32] landmark information is explicitly extracted from the image (by means of the algorithm from [24]) and serves as one of the neural network inputs. By using meta-learning approaches from [33], a GAN and a style component [34], the authors obtain high face reenactment quality (fig. 11). They point out that when the source and target images depict the same person, the algorithm generates an image sequence that contains fewer artifacts than when reenactment is transferred between people. According to their report, this method outperforms the competition on the face emotion transfer task in the few- or one-shot setting. Improving the landmark extraction algorithm and adding the gaze direction might further improve the reenactment quality. In [35] the authors use the Pix2PixHD [36] neural network to accomplish the lip sync task. There, it has been proposed to synthesize an intermediate face representation from face boundaries, facial landmarks (using the Dlib library) and a soundtrack-based representation. In [37] FReeNet is used for reenactment between different people, unseen during training. For that, a special Unified Landmark Converter module has been introduced, which adapts facial landmark locations between different people. Landmarks for the source and target people are extracted via the PFLD algorithm, then images are generated via CycleGAN [19] and a special loss function, which reduces model overfitting to the landmarks alone and helps to generate more detailed faces.
The use of the landmark converter module has given the largest performance increase on the test sets. A survey of emotion transfer, face reenactment and other face feature modification methods can be found in [38], section "Expression Swap".
Fig. 11. Expression transfer scheme: a – source character image (the one we want to reenact), b – one of the image sequence frames with the target emotions, c – extracted facial landmarks that are fed to the reenactment algorithm, d – reenactment result [32]
The topic of face recognition is well described in, for example, [43]. We note, in particular, the high interest in face recognition directly on mobile devices [44], [45]. Likewise, since our emotions mostly consist of lip, eye, eyebrow or mouth movements, in certain cases it is fruitful not to force the neural network to learn the face parts on its own during emotion recognition, but to feed this information together with the original image [46], [47].
4. Facial landmark detection algorithm vulnerabilities
Modern computer vision algorithms (including neural networks) are susceptible to so-called "adversarial attacks", first reported in the field of computer vision in [48], where by adding specially crafted noise (invisible to the human eye) to an image the authors were able to drastically change the neural network prediction in a classification task. The attack was conducted by maximizing the network error on the target image via the L-BFGS method. When testing the network on adversarial examples generated for the MNIST dataset, it was possible to make the network misclassify almost all of the examples. It should be stressed that during an adversarial attack the network itself is not modified, only the image fed to it. Moreover, adversarial examples often remain malicious to networks different from the one they were crafted for, given that the other network was trained on the same or a similar dataset. It should be noted that adding random noise has a much lower negative effect on the network's classification accuracy. In [49] it has been shown that, for a successful adversarial attack on the MNIST dataset, a model as simple as logistic regression can be used to generate examples, while the attack remains efficiently transferable to more complicated architectures. While the previous algorithms attacked a digital image (stored in computer memory), in [50] it has been shown that attacks can be performed through a smartphone camera. In [51] binary importance maps have been introduced, which hint where adversarial marks should be placed on a piece of paper to fool a network trained to classify MNIST digits. While the first adversarial attack algorithms were white-box (meaning the network architecture and trained weights are known to the attacker), follow-up works such as [52] have shown that it is possible to perform black-box attacks without such knowledge.
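The gradient-sign idea behind many white-box attacks can be illustrated on the logistic-regression case mentioned in [49]; this is a toy sketch with our own names, not the exact method of any cited paper:

```python
import math

def predict(w, b, x):
    """Toy logistic-regression 'network': probability of class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

def fgsm(w, b, x, y, eps=0.25):
    """Fast-gradient-sign style attack: nudge every input feature by eps
    in the direction that increases the loss. For logistic regression with
    cross-entropy loss, the gradient w.r.t. the input is (p - y) * w, so
    the sign can be computed in closed form. The model (w, b) is never
    modified -- only the input image/vector x."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

Even this closed-form toy shows the key property discussed above: a small, structured perturbation moves the prediction far more than random noise of the same magnitude would.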
Despite the fact that numerous works are devoted to detecting or preventing attacks, new, more advanced algorithms bypass all of the defense methods [53]. A survey of adversarial attack methods can be found in [54]. All of them are applicable to face detection and facial landmark detection algorithms. In the meantime, there exist special methods that can prevent the face from being found or correctly detected in the real world, by using stickers or accessories. In [55] it has been shown that, in a controlled environment, it is possible to fool a face recognition algorithm or the Viola-Jones face detector. The authors used special eyeglasses with a print on the frame (fig. 12, a). In [56] it has been proposed to fool the MTCNN face detection algorithm with the use of stickers on the cheeks or a medical mask (fig. 12, b). In cases when the face cannot be detected, landmark localization cannot be performed either. How to modify an image using facial landmark information is also shown in [27].
Fig. 12. Ways to perform an adversarial attack: a – eyeglasses with an adversarial print [55]; b – medical mask with specially crafted black spots [56]
Conclusion
From a detailed survey of facial landmark detection algorithms, we draw the following conclusions: 1) despite significant growth in method quality, few works focus on real-world applicability: in many cases, even when executed on a GPU, algorithms perform slower than real-time (around 30 fps, or 33 milliseconds per frame); 2) many applications require high performance on mobile or portable devices, yet to the best of our knowledge, the authors of only a single algorithm have targeted a mobile application directly in the original paper; 3) while modern research already focuses on datasets collected in uncontrolled environments with high pose, emotion and lighting variability, such as 300-W or AFLW, a promising research direction is to strengthen algorithms under even harsher conditions, for instance, when significant parts of faces are occluded, while still maintaining high landmark density. We hope that the described modern developments in all of the sections (facial landmark detection algorithms, applications and vulnerabilities) will lead the reader to new ideas of practical use and further research directions in the field.
References
1. Zhang, X., X. Zhou, M. Lin, J. Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 6848-6856. doi: 10.1109/CVPR.2018.00716.
2. Sandler, M., A. Howard, M. Zhu, A. Zhmoginov, L. Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520. doi: 10.1109/CVPR.2018.00474.
3. Sagonas, C., G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge. IEEE International Conference on Computer Vision Workshops, 2013, pp. 397-403. doi: 10.1109/ICCVW.2013.59.
4. Köstinger, M., P. Wohlhart, P. M. Roth, H. Bischof. Annotated Facial Landmarks in the Wild: A Large-Scale, Real-World Database for Facial Landmark Localization. IEEE International Conference on Computer Vision Workshops, Barcelona, 2011, pp. 2144-2151. doi: 10.1109/ICCVW.2011.6130513.
5. Qian, S., K. Sun, W. Wu, C. Qian, J. Jia. Aggregation via Separation: Boosting Facial Landmark Detector with Semi-Supervised Style Translation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10152-10162. doi: 10.1109/ICCV.2019.01025.
6. Kumar, A., et al. LUVLi Face Alignment: Estimating Landmarks' Location, Uncertainty, and Visibility Likelihood. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 8233-8243. doi: 10.1109/CVPR42600.2020.00826.
7. Burgos-Artizzu, X. P., P. Perona, P. Dollar. Robust Face Landmark Estimation under Occlusion. 2013 IEEE International Conference on Computer Vision, Sydney, NSW, 2013, pp. 1513-1520. doi: 10.1109/ICCV.2013.191.
8. Ghiasi, G., C. C. Fowlkes. Occlusion Coherence: Detecting and Localizing Occluded Faces. arXiv preprint, 2015. arXiv:1506.08347.
9. Wu, W., C. Qian, S. Yang, Q. Wang, Y. Cai, Q. Zhou. Look at Boundary: A Boundary-Aware Face Alignment Algorithm. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2129-2138. doi: 10.1109/CVPR.2018.00227.
10. Wang, N., X. Gao, D. Tao, H. Yang, X. Li. Facial Feature Point Detection: A Comprehensive Survey. Neurocomputing, Vol. 275, pp. 50-65. doi: 10.1016/j.neucom.2017.05.013.
11. Wu, Y., Q. Ji. Facial Landmark Detection: A Literature Survey. International Journal of Computer Vision, Vol. 127, 2018, pp. 115-142. doi: 10.1007/s11263-018-1097-z.
12. Kazemi, V., J. Sullivan. One Millisecond Face Alignment with an Ensemble of Regression Trees. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867-1874. doi: 10.1109/CVPR.2014.241.
13. King, D. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research, Vol. 10, 2009, pp. 1755-1758. doi: 10.1145/1577069.1755843.
14. Feng, Z., J. Kittler, M. Awais, P. Huber, X. Wu. Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 2235-2245. doi: 10.1109/CVPR.2018.00238.
15. Zhang, K., Z. Zhang, Z. Li, Y. Qiao. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters, Vol. 23, No. 10, Oct. 2016, pp. 1499-1503. doi: 10.1109/LSP.2016.2603342.
16. Yan, Y., X. Naturel, T. Chateau, S. Duffner, C. Garcia, C. Blanc. A Survey of Deep Facial Landmark Detection. 2018, Paris, France. https://hal.archives-ouvertes.fr/hal-02892002.
17. Liu, Y., A. Jourabloo, W. Ren, X. Liu. Dense Face Alignment. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1619-1628. doi: 10.1109/ICCVW.2017.190.
18. Dong, X., Y. Yan, W. Ouyang, Y. Yang. Style Aggregated Network for Facial Landmark Detection. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 379-388. doi: 10.1109/CVPR.2018.00047.
19. Zhu, J., T. Park, P. Isola, A. A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017, pp. 2242-2251. doi: 10.1109/ICCV.2017.244.
20. Newell, A., K. Yang, J. Deng. Stacked Hourglass Networks for Human Pose Estimation. ECCV 2016, Lecture Notes in Computer Science, Springer, Cham, 2016, pp. 483-499. doi: 10.1007/978-3-319-46484-8_29.
21. Girshick, R. Fast R-CNN. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1440-1448. doi: 10.1109/ICCV.2015.169.
22. Guo, X., S. Li, J. Zhang, J. Ma, L. Ma, W. Liu, H. Ling. PFLD: A Practical Facial Landmark Detector. arXiv preprint, 2019. arXiv:1902.10859.
23. Wang, X., L. Bo, L. Fuxin. Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 6970-6980. doi: 10.1109/ICCV.2019.00707.
24. Bulat, A., G. Tzimiropoulos. How Far Are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks). 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017, pp. 1021-1030. doi: 10.1109/ICCV.2017.116.
25. Liu, R., J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, J. Yosinski. An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution. Advances in Neural Information Processing Systems, 2018, pp. 9605-9616.
26. Iranmanesh, S. M., A. Dabouei, S. Soleymani, H. Kazemi, N. M. Nasrabadi. Robust Facial Landmark Detection via Aggregation on Geometrically Manipulated Faces. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 2020, pp. 319-329. doi: 10.1109/WACV45572.2020.9093508.
27. Dabouei, A., S. Soleymani, J. Dawson, N. Nasrabadi. Fast Geometrically-Perturbed Adversarial Faces. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 2019, pp. 1979-1988. doi: 10.1109/WACV.2019.00215.
28. Li, W., et al. Structured Landmark Detection via Topology-Adapting Deep Graph Learning. arXiv preprint, 2020. arXiv:2004.08190.
29. Zhao, Y., Y. Liu, C. Shen, Y. Gao, S. Xiong. MobileFAN: Transferring Deep Hidden Representation for Face Alignment. Pattern Recognition, 2020. doi: 10.1016/j.patcog.2019.107114.
30. Garrido, P., L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Pérez, C. Theobalt. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track. Computer Graphics Forum, Vol. 34, No. 2, May 2015, pp. 193-204. doi: 10.1111/cgf.12552.
31. Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative Adversarial Networks. arXiv preprint, 2014. arXiv:1406.2661.
32. Zakharov, E., A. Shysheya, E. Burkov, V. Lempitsky. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. doi: 10.1109/ICCV.2019.00955.
33. Finn, C., P. Abbeel, S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1126-1135.
34. Johnson, J., A. Alahi, L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV 2016, Lecture Notes in Computer Science, Springer, Cham, 2016. doi: 10.1007/978-3-319-46475-6_43.
35. Zheng, R., Z. Zhu, B. Song, C. Ji. Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks. arXiv preprint, 2020. arXiv:2002.08700.
36. Wang, T., M. Liu, J. Zhu, A. Tao, J. Kautz, B. Catanzaro. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 8798-8807. doi: 10.1109/CVPR.2018.00917.
37. Zhang, J., et al. FReeNet: Multi-Identity Face Reenactment. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5326-5335.
38. Tolosana, R., R. Vera-Rodriguez, J. Fierrez, A. Morales, J. Ortega-Garcia. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. Information Fusion, 2020, pp. 131-148. doi: 10.1016/j.inffus.2020.06.014.
39. Jabbar, R., M. Shinoy, M. Kharbeche, K. Al-Khalifa, M. Krichen, K. Barkaoui. 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha, Qatar, 2020, pp. 237-242. doi: 10.1109/ICIoT48696.2020.9089484.
40. Wijnands, J. S., J. Thompson, K. A. Nice, et al. Real-Time Monitoring of Driver Drowsiness on Mobile Platforms Using 3D Neural Networks. Neural Computing and Applications, 2020. doi: 10.1007/s00521-019-04506-0.
41. Kim, W., W.-S. Jung, H. K. Choi. Lightweight Driver Monitoring System Based on Multi-Task Mobilenets. Sensors, Vol. 19, 2019, p. 3200. doi: 10.3390/s19143200.
42. Tadashi, H., K. Koichi, N. Kenta, H. Yuki. Driver Status Monitoring System in Autonomous Driving Era. OMRON TECHNICS, 2019.
43. Wang, M., W. Deng. Deep Face Recognition: A Survey. arXiv preprint, 2020. arXiv:1804.06655.
44. Duong, C. N., K. G. Quach, I. Jalata, N. Le, K. Luu. MobiFace: A Lightweight Deep Learning Face Recognition on Mobile Devices. 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), Tampa, FL, USA, 2019, pp. 1-6. doi: 10.1109/BTAS46853.2019.9185981.
45. Chen, S., Y. Liu, X. Gao, Z. Han. MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices. CCBR 2018, Lecture Notes in Computer Science, Springer, Cham, 2018. doi: 10.1007/978-3-319-97909-0_46.
46. Ko, B. C. A Brief Review of Facial Emotion Recognition Based on Visual Information. Sensors, Vol. 18, 2018, p. 401. doi: 10.3390/s18020401.
47. Li, S., W. Deng. Deep Facial Expression Recognition: A Survey. IEEE Transactions on Affective Computing, 2020. doi: 10.1109/TAFFC.2020.2981446.
48. Szegedy, C., W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus. Intriguing Properties of Neural Networks. arXiv preprint, 2014. arXiv:1312.6199.
49. Khabarlak, K. S., L. S. Koriashkina. Scoping Adversarial Attack for Improving Its Quality. Radio Electronics, Computer Science, Control, 2019, pp. 108-118. doi: 10.15588/1607-3274-2019-2-12.
50. Kurakin, A., I. Goodfellow, S. Bengio. Adversarial Examples in the Physical World. arXiv preprint, 2016. arXiv:1607.02533.
51. Khabarlak, K., L. Koriashkina. Minimizing Perceived Image Quality Loss Through Adversarial Attack Scoping. arXiv preprint, 2019. arXiv:1904.10390.
52. Chen, P., H. Zhang, Y. Sharma, J. Yi, C. Hsieh. ZOO: Zeroth Order Optimization Based Black-Box Attacks to Deep Neural Networks without Training Substitute Models. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 15-26. doi: 10.1145/3128572.3140448.
53. Carlini, N., D. Wagner. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, Texas, USA, 2017. doi: 10.1145/3128572.3140444.
54. Akhtar, N., A. Mian. Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey. IEEE Access, Vol. 6, 2018, pp. 14410-14430. doi: 10.1109/ACCESS.2018.2807385.
55. Sharif, M., S. Bhagavatula, L. Bauer, M. K. Reiter. Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 2016, pp. 1528-1540. doi: 10.1145/2976749.2978392.
56. Kaziakhmedov, E., K. Kireev, G. Melnikov, M. Pautov, A. Petiushko. Real-World Attack on MTCNN Face Detection System. 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), Novosibirsk, Russia, 2019, pp. 0422-0427. doi: 10.1109/SIBIRCON48586.2019.8958122.