Facial Expressions Analysis Under Occlusions Based on Specificities of Facial Motion Propagation
Delphine Poux, Benjamin Allaert, Jose Mennesson, Nacim Ihaddadene, Ioan Marius Bilasco, Chaabane Djeraba
Abstract
Although much progress has been made in the field of facial expression analysis, facial occlusions are still challenging. The main innovation brought by this contribution consists in exploiting the specificities of facial movement propagation for recognizing expressions in the presence of important occlusions. The movement induced by an expression extends beyond the movement epicenter. Thus, the movement occurring in an occluded region propagates towards neighboring visible regions. In the presence of occlusions, we compute, per expression, the importance of each unoccluded facial region and construct adapted facial frameworks that boost the performance of per-expression binary classifiers. The output of each expression-dependent binary classifier is then aggregated and fed into a fusion process that aims at constructing, per occlusion, a unique model that recognizes all the facial expressions considered. The evaluations highlight the robustness of this approach in the presence of significant facial occlusions.
Keywords
Facial occlusions, motion propagation, facial framework, facial expressions
D. Poux (corresponding author), B. Allaert, I.M. Bilasco and C. Djeraba
Centre de Recherche en Informatique Signal et Automatique de Lille, Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
E-mail: {delphine.poux, marius.bilasco, chabane.djeraba}@univ-lille1.fr

J. Mennesson
IMT Lille Douai, Centre de Recherche en Informatique Signal et Automatique de Lille, Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
E-mail: [email protected]

N. Ihaddadene
ISEN Lille, Yncrea Hauts-de-France, France
E-mail: [email protected]

1 Introduction

Facial expression analysis is a field of research that has been extensively studied in recent years. Recognizing facial expressions provides information about the emotional state of a person. This information is essential in many application areas such as health, security or communication. For instance, a person with bad intentions in a public place can be detected by the fact that his behaviour is abnormal compared to other people. In this case, it is useful to automatically notify the security officers in order to anticipate the danger.

The majority of the approaches dealing with facial expressions are trained on unoccluded faces and give very good results. However, these approaches perform poorly when deployed on uncontrolled data (e.g., video surveillance systems), where the face can be highly occluded. Two types of approaches have been proposed to address the challenges raised by occlusions. The first type tends to reconstruct the occluded parts of the face in order to simulate an ideal analysis context. The second type consists in characterizing the face despite the facial occlusion and letting the classifier identify the closest expression among the training data. In both cases, facial expression analysis remains challenging when occlusions occur.

In this paper, we propose an innovative approach to overcome the challenges raised by facial occlusions. We assume that the facial movement induced by an expression is relatively similar between individuals, although the texture and facial geometry of each individual differ greatly. The innovation brought by our contribution relies on the propagation properties of facial movement. The movement induced by an expression spreads beyond the movement epicenter to neighbouring regions. Hence, if a region is occluded, it is possible to focus on the movement information that has been propagated to neighbouring regions. Specific facial frameworks (i.e., specific sets of facial regions) are constructed per expression, according to the importance of each facial region for recognizing the underlying expression in the presence of specific occlusions. Only the most relevant regions are selected in order to be robust to both small and large occlusions. The per-expression facial frameworks are then merged into a unique model in order to recognize globally any given facial expression in the presence of a specific occlusion.

In Section 2, we highlight the main contributions of the paper and discuss approaches used to handle facial occlusion challenges. The construction of optimized facial frameworks per expression in the presence of given occlusions is explained in Section 3. The merging of these facial frameworks is introduced in Section 4. In Section 5, we present the data used for learning and the experimental protocol. Then, we present the performances obtained considering one expression at a time or all expressions simultaneously.
In Section 6, we analyze the ability of the facial frameworks to recognize a given expression in the presence of specific occlusions. In Section 7, we evaluate the generic expression recognition performance in the presence of large occlusions and compare our approach to the other approaches from the literature. Finally, we summarize the results and discuss perspectives in Section 8.
2 Related work

This section highlights the main objectives of our contribution and provides a brief overview of existing approaches to handle facial occlusions.

Regarding reconstruction-based approaches, due to the high variability between individuals in the presence of facial expressions, the reconstruction of occluded regions remains relatively complex.

Regarding the approaches characterizing facial expressions despite the presence of occlusions, they can be divided into two categories: sparse representation approaches and sub-region approaches. Sparse representation approaches recognize the facial expression on an occluded face by representing a test image as a linear combination of unoccluded images from a dictionary [10,5]. This dictionary is composed of a set of unoccluded training images. Because the dictionary is composed of unoccluded data, occlusions cause errors in the linear combination. When these errors reach a threshold, they are implicitly considered as occlusions and are represented by an identity matrix which is isolated from the extracted facial features used by the classification process. These approaches have the advantage of implicitly localizing occlusions. However, they require large dictionaries covering the variations of each expression in order to build accurate linear combinations and to have enough characteristics to discriminate between expressions. In the sub-region approaches, the face is divided into different regions and each region is analyzed individually [4]. The results are then merged to recognize the expression. The advantage of these approaches is that they perform well even in the absence of a large set of training data. However, the granularity of the subdivision of the face into local regions and its effect on performance remains an open question, particularly in the presence of important occlusions.
In addition, the large intra-face variability between individuals in the presence of facial expressions increases the complexity of the learning process.

Considering the descriptors used to characterize facial expressions, the majority of approaches are based on texture or geometry descriptors. However, in the presence of an important facial occlusion, the information needed to characterize the facial expression is almost completely lost or has a high probability of being noisy due to estimation errors. Recent approaches have proven the effectiveness of optical flow in characterizing facial expressions [1]. Thanks to the physical properties of the skin, descriptors based on movement seem suited to the case of occlusion. Indeed, even when the epicenter of a movement is situated in an occluded part of the face, the movement related to the expression is still visible in the neighboring regions, as illustrated in Fig. 1 (see the input data part of the image), where the motion induced by the smile has, as a secondary effect, the rise of the cheeks.

Fig. 1: Overview of the proposed approach.

Assuming that the movement within a face region spreads to neighbouring regions, we consider it appropriate to characterize facial expressions based on the evolution of movement through specific regions of the face. Inspired by the sub-region approaches, we propose an innovative approach to overcome facial occlusions. Fig. 1 illustrates an overview of our approach, which consists in recognizing facial expressions in the presence of partial occlusions of the face. This approach is composed of two main steps. The first step consists in building optimized facial frameworks defining the facial regions contributing the most to the recognition of specific expressions in the presence of a given occlusion. These facial frameworks are generated thanks to optimized weights computed for each facial region. These weights represent the contribution of each region to the recognition of a particular expression. The most important regions are selected in order to construct dedicated facial frameworks. The second step, illustrated in the lower part of Fig. 1, takes advantage of the obtained facial frameworks in order to train one binary model per expression. The results obtained with these binary models are then merged and a unique model per occlusion is trained in order to classify all expressions.
3 Construction of optimized facial frameworks

In this section, we investigate the best compromise between the minimum number of facial regions required to recognize a facial expression and the performance obtained under different occlusions.

3.1 Weighting facial region scheme

The weighting algorithm consists of three steps. The first step generates various partial facial frameworks (subsets of facial regions), called configurations, including fewer regions than the initial facial framework. Inspired by [1], we consider a facial framework using 25 regions laid out following the facial muscle scheme. For each configuration $C_j$, the weighting algorithm evaluates the performance of the classification process using only the motion information contained in the regions $R_k$ composing $C_j$. Then, the recognition rate obtained for a given configuration $C_j$ serves to infer the contribution of each region $R_k$ to the classification process.

The choice of the retained configurations in the weighting algorithm is essential. Generating the whole set of configurations that covers all combinations of one to twenty-five regions is heavy and time consuming. Instead, in order to reinforce the motion propagation properties, we decided to consider only configurations containing pair-wise connected regions. As illustrated in Fig. 2-A, from a given region $R$, the singleton $\{R\}$ and the pairs formed by $R$ with each of its four directly connected neighboring regions constitute the combinations of size one and two. Bigger combinations are obtained using the pair-wise connectivity of regions.

Fig. 2: Neighboring configurations.

We have chosen to explore configurations containing at most 8 regions, as these configurations already cover horizontal, vertical and diagonal parts of the face, as illustrated in Fig. 2-B. The configuration construction process guarantees
that the configurations cover several muscles of the face, and enables us to study the correlation between them.
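To make this generation step concrete, the following sketch enumerates pair-wise connected configurations by growing subsets through a region adjacency map. It is a minimal illustration, not the authors' implementation: the toy adjacency stands in for the paper's 25-region muscle-based layout.

```python
from itertools import chain

def connected_configurations(adjacency, max_size=8):
    """Enumerate region subsets in which regions are pair-wise connected
    through the adjacency map, up to max_size regions."""
    configs = {frozenset([r]) for r in adjacency}   # size-1 configurations
    frontier = set(configs)
    while frontier:
        grown = set()
        for config in frontier:
            if len(config) >= max_size:
                continue
            # extend each configuration with any region adjacent to it
            neighbors = set(chain.from_iterable(adjacency[r] for r in config))
            for n in neighbors - config:
                grown.add(config | {n})
        grown -= configs
        configs |= grown
        frontier = grown
    return configs

# Toy adjacency (illustrative only; the paper uses 25 facial regions).
adjacency = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4]}
configs = connected_configurations(adjacency, max_size=3)
print(sorted(sorted(c) for c in configs))
```

Growing subsets only through adjacent regions keeps the number of configurations tractable while guaranteeing that each one describes a connected patch of the face, consistent with the motion propagation hypothesis.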
3.2 Weight computation

The results collected from all configurations are directly used to compute each region weight. At the beginning, each region receives a zero weight. Then, the classification rate obtained for each configuration $C_j$ is used to compute a score according to the mean classification rate of all configurations, normalized by the standard deviation. This score is calculated as follows:

$$\omega(C_j) = \frac{\exp\left((res_{C_j} - mean_i)/std_i\right)}{\exp(i)} \qquad (1)$$

where $i = |C_j|$ is the number of regions of the configuration $C_j$ and $res_{C_j}$ refers to the result obtained with $C_j$. $mean_i$ and $std_i$ are respectively the mean and the standard deviation of the results obtained with all the configurations containing $i$ regions. Finally, $\exp(i)$ moderates the contribution of each configuration with regard to its size. Indeed, configurations covering a larger portion of the face are expected to provide higher recognition rates.

The score obtained is then added to the current weight of each region $R_k$ included in the configuration $C_j$. Finally, each region weight is normalized with regard to the number of appearances of the region in all the combinations.

The obtained weights reflect the importance of each region for recognizing each expression. Fig. 3 illustrates the heatmaps obtained on the CK+ [11] dataset using an SVM classifier with an RBF kernel and a 10-fold cross-validation protocol. This figure reveals that for almost all expressions the bottom of the face is activated, except for anger, which mainly activates the eyebrow regions.

Fig. 3: Heatmaps of the importance of regions per facial expression.

The weight transferring process is represented in Fig. 4. The example shows the construction of the heatmap for the sadness expression. Fig. 4-A (bounded by the purple border) presents the heatmap obtained in the absence of occlusions and Fig. 4-B (bounded by the blue border) presents the heatmap obtained in the presence of one occlusion.

As seen in the weighting heatmap obtained without any occlusion (i.e., considering the entire set of configurations), the most important regions for this expression are situated under the mouth. The weight evaluation in the presence of occlusions is very similar to the unoccluded situation, but during the weight transferring part we only consider configurations that include only unoccluded regions (corresponding to the checked green configurations in Fig. 4-B). Thus, the importance of each visible region is computed independently from the occluded regions. Besides, occluded regions have a zero weight at the end of the process. This result is completely consistent because an occluded region gives no information about the facial expression.

As seen in the heatmap computed in the presence of an occlusion impacting the whole right part of the face, all configurations involving right regions are filtered out before transferring weights to regions. The resulting heatmap has zero weights for all regions on the right side of the face (blank areas) and the weights on the left side of the face differ from the unoccluded heatmap, notably for the cheek regions.

Fig. 4: Weights transfer considering facial occlusion (sadness expression).
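A minimal reimplementation of this weighting scheme (Eq. 1) is sketched below, assuming recognition rates are already available per configuration. The data layout and the zero-std guard are ours, not taken from the paper.

```python
import numpy as np
from collections import defaultdict

def region_weights(config_results):
    """Accumulate region weights from configuration recognition rates (Eq. 1).
    `config_results` maps a frozenset of region ids to its recognition rate."""
    # group recognition rates by configuration size i
    by_size = defaultdict(list)
    for config, res in config_results.items():
        by_size[len(config)].append(res)
    stats = {i: (np.mean(v), np.std(v)) for i, v in by_size.items()}

    weights = defaultdict(float)
    counts = defaultdict(int)
    for config, res in config_results.items():
        i = len(config)
        mean_i, std_i = stats[i]
        # Eq. (1); the `or 1.0` guard avoids dividing by a zero std
        score = np.exp((res - mean_i) / (std_i or 1.0)) / np.exp(i)
        for region in config:
            weights[region] += score
            counts[region] += 1
    # normalize by the number of appearances of each region
    return {r: weights[r] / counts[r] for r in weights}

# toy usage with four configurations
results = {frozenset({1}): 0.60, frozenset({2}): 0.70,
           frozenset({1, 2}): 0.75, frozenset({2, 3}): 0.65}
print(region_weights(results))
```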
Models containing from one to twenty-five regions are then generated; each model contains the n best regions for each facial expression. The obtained results reveal: a) the optimal facial frameworks for each facial expression; and b) the minimal number of regions required to recognize the expressions with performances similar to those obtained in the absence of occlusion. These facial frameworks are illustrated in Fig. 5 for the happiness expression in the presence of different occlusion patterns, by selecting the 6 best regions.

Fig. 5: Best facial frameworks for happiness expression under different occlusions.

4 Fusion of the facial frameworks

The weighting optimization algorithm allows the construction of one model per expression and per occlusion. Each model corresponds to a binary classifier and indicates, per expression, whether the input data corresponds to the underlying expression or not. In order to recognize an expression regardless of the binary classifiers, we add a fusion step and, hence, construct a unified model for all expressions. The overview of the whole process is illustrated in Fig. 6.

Fig. 6: Overview of the fusion process.

As we build a learning architecture using two layers, we proceed with two learning processes: one concerning the binary classifiers and one concerning the fusion layer. For each learning process, an adapted training set is prepared.

At first, six models are trained to recognize one expression against the others. In this case, we need one training set per expression (one expression against the others). For each model, the regions are first selected according to the x best regions that characterize the expression under a specific occlusion.

The constructed facial frameworks are then used to train, per expression, binary classifiers. The outputs of these models represent: a) the probability of an input sample to be classified as the underlying expression; and b) the probability of an input sample to belong to a different expression class.

A new training step is then performed with another training set which covers all expressions. In order to do this second training, raw data is fed to each binary classifier previously trained. The binary classifiers are not trained anymore and their models do not change. They are used only to compute, per expression, the probabilities that the input sample belongs or not to a specific expression class. These probabilities are concatenated into feature vectors that are fed into the fusion process.
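The following sketch illustrates this two-layer training, assuming scikit-learn. The helper names, feature layout and hyperparameters are ours; `frameworks[e]` stands for the feature columns of the facial framework selected for expression `e`.

```python
import numpy as np
from sklearn.svm import SVC

def train_two_layer(X1, y1, X2, y2, frameworks):
    """Layer 1: one binary SVM per expression, on its own framework.
    Layer 2: a fusion SVM fed with the concatenated probabilities."""
    binary = []
    for e, cols in enumerate(frameworks):
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X1[:, cols], (y1 == e).astype(int))   # one vs. the others
        binary.append((cols, clf))

    def probas(X):
        # per-expression (belongs / does not belong) probabilities
        return np.hstack([clf.predict_proba(X[:, cols]) for cols, clf in binary])

    fusion = SVC(kernel="rbf")
    fusion.fit(probas(X2), y2)
    return lambda X: fusion.predict(probas(X))

# toy usage: 120 samples, 25 regions x 8 bins = 200 features, labels 0..5
X = np.random.rand(120, 200)
y = np.random.randint(0, 6, 120)
fw = [list(range(e * 8, e * 8 + 48)) for e in range(6)]   # illustrative columns
predict = train_two_layer(X[:48], y[:48], X[48:], y[48:], fw)
print(predict(X[:5]))
```

Note that the binary classifiers are frozen before the fusion layer is trained, exactly as a stacking architecture: the second training set only passes through them to produce probability features.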
5 Experimental protocol

In this section, we present the protocol used to conduct our evaluations. First, we introduce the descriptors used to characterize facial movements. Then, we detail the dataset and the selected facial occlusions.

5.1 Facial motion characterization

Recently, Allaert et al. [1] proposed a descriptor called Local Motion Patterns (LMP). It characterizes the facial movement by retaining only the main directions related to facial expressions, while avoiding motion discontinuities.

In order to characterize the movement within the face, LMPs are applied to small regions that are laid out on the face according to the facial muscle scheme. Hence, according to the relevance of the movement in the presence of facial expressions and the location of facial muscles, the face is segmented into 25 regions. For the experiments, we use the same segmentation to characterize expressions when occlusions occur.

Within each facial region, LMP filters the movement computed by the optical flow using the coherence in terms of direction and intensity. Considering the elastic properties of the skin, a coherent movement gradually spreads in its neighborhood respecting, to some extent, the initial direction and intensity. Recursively, LMP analyzes the motion distribution (direction and intensity) $FDMH^{LMR}_{x,y}$ on small patches (LMRs) within each facial region. If there is a correlation between the distributions of these LMRs, then a coherent motion propagation is taking place. This implies that the captured movement has a greater probability of reflecting a real expression and that it is not linked to a discontinuity in the movement.

When a facial region associated with an LMP is analyzed, the set of coherent LMR motion distributions are summed. Two motion distributions are coherent if there is a strong correlation in terms of main directions and intensities between them. The summed distribution results in a magnified direction histogram $MD^{LMP}_{x,y}$ calculated as follows:

$$MD^{LMP}_{x,y} = \left\{ \sum_{i=0}^{n} FDMH^{LMR_i}_{x,y} \;\middle|\; FDMH^{LMR_i}_{x,y} \in LMP_{x,y} \right\} \qquad (2)$$

where $n$ represents the number of coherent LMRs that compose the LMP. The intensity of each direction is then represented by the number of co-occurrences of each direction bin within the different LMRs.
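As an illustration of Eq. (2), the sketch below sums the LMR direction histograms that are coherent with a reference distribution. The coherence test is deliberately simplified to a correlation threshold; the actual LMP criterion of [1] also constrains intensities.

```python
import numpy as np

def magnified_direction_histogram(lmr_histograms, ref, threshold=0.8):
    """Sum the LMR direction histograms coherent with a reference one (Eq. 2).
    Coherence is approximated here by a correlation threshold."""
    md = np.zeros_like(ref, dtype=float)
    for h in lmr_histograms:
        corr = np.corrcoef(ref, h)[0, 1]
        if corr >= threshold:          # keep only coherent distributions
            md += h
    return md

# toy usage: 8-bin direction histograms for three LMRs
ref = np.array([5, 9, 3, 1, 0, 0, 1, 2], dtype=float)
lmrs = [ref * 0.9, np.roll(ref, 4), ref + 1]
print(magnified_direction_histogram(lmrs, ref))
```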
The characterization of the global facial movement by LMPs consists in calculating $MD^{LMP}_{x,y}$ for the 25 facial regions. To reinforce the coherence of the movement, each region is analyzed over the entire expression sequence. Each $MD^{LMP}_{x,y}$ of the same region during the video sequence is aggregated within a space-time histogram $STMD^{LMP}_{x,y,k}$, as follows:

$$STMD^{LMP}_{x,y,k} = \sum_{t=1}^{time} MD^{LMP}_{x,y,k,t} \qquad (3)$$

where $t$ is the frame index and $k = 1, 2, \ldots, 25$ is the facial region index. Finally, the histograms $STMD^{LMP}_{x,y,k}$ are concatenated into a one-row vector $GMD = (STMD^{LMP}_{x,y,1}, STMD^{LMP}_{x,y,2}, \ldots, STMD^{LMP}_{x,y,25})$, which is considered as the feature vector for the global facial movement. The feature vector size is equal to the number of ROIs multiplied by the number of bins. This vector is then used in our assessments to characterize facial expressions.

5.2 Dataset

The proposed approach is evaluated on the CK+ dataset as it is one of the datasets most frequently used in the literature to handle occlusions [3,4,5,9] and it contains video sequences, which are adapted to study the movement. CK+ is a controlled dataset which contains 374 labelled video sequences. Each video sequence starts from the neutral face and ends with the apex of the expression.

In this dataset, images do not contain any occlusions, so they have to be simulated. On the one hand, simulated occlusions are not totally realistic and there is a little gap between a real occlusion and a simulated one. But, on the other hand, we can totally control the experiments. By controlling the occlusion process, we can clearly quantify its impact on the overall performance. Besides, it also offers the possibility to construct precise benchmarks for comparison purposes.

5.3 Selected facial occlusions

Generally, the occluded regions are located at the level of the mouth and eyes, under different sizes. In order to simulate head pose variation, some approaches hide half of the face (right or left). Occlusions are often generated by altering parts of the face with white, black or noisy pixels. Sometimes a blur effect can be applied instead. Some examples are presented in the left part of Fig. 7.
Fig. 7: Selected occlusions according to those used in the literature.

Not having a stable and widely accepted baseline to compare the performance of our approach on occluded faces, we choose to simulate important occlusions to challenge our approach, as illustrated in the right part of Fig. 7.

Inspired by the wide range of occlusions used in the literature, we choose a limited set of occlusions which nevertheless covers all the challenges. Indeed, the occlusions considered in our study impact larger facial areas than those usually met in the literature.

The first two configurations (Occ1 and Occ2) present important occlusions on the left and right parts of the face. The third occlusion is inspired by the observations of Kotsia et al. [7], which underline the fact that the mouth has a great importance for recognizing expressions. Hence, in order to strongly challenge our approach, we define an occlusion configuration that impacts the mouth, the cheeks and the nose. Two other configurations consider important occlusions appearing on the upper part of the face and occlusions appearing in the middle part of the face. A sketch of such a simulation is given below.
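A minimal NumPy sketch of occlusion simulation follows. The masked extents are illustrative approximations of Occ1–Occ5, since the paper defines its occlusions over the 25 muscle-based facial regions rather than pixel bands.

```python
import numpy as np

def occlude(image, occlusion):
    """Simulate an occlusion by blackening part of an (H, W) face image."""
    img = image.copy()
    h, w = img.shape[:2]
    if occlusion == "occ1":            # left half of the face
        img[:, : w // 2] = 0
    elif occlusion == "occ2":          # right half of the face
        img[:, w // 2 :] = 0
    elif occlusion == "occ3":          # mouth, cheeks and nose
        img[h // 2 :, :] = 0
    elif occlusion == "occ4":          # upper part of the face
        img[: h // 3, :] = 0
    elif occlusion == "occ5":          # middle band of the face
        img[h // 3 : 2 * h // 3, :] = 0
    return img

# toy usage on a random grayscale "face"
face = np.random.randint(0, 256, (120, 100), dtype=np.uint8)
masked = occlude(face, "occ3")
```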
6 Per-expression evaluation

We propose a per-expression evaluation in order to check whether the constructed facial frameworks provide interesting results. We first evaluate the impact of the region selection when there is no occlusion. This first evaluation allows us to assess the accuracy of our weight calculation and, also, to find the minimal number of regions required to recognize an expression. Then, we evaluate the efficiency of our per-expression recognition method in the presence of the selected occlusions.

6.1 Experimental protocol

From the initial dataset, we generate one subset per expression. In each newly generated subset, all the sequences available for one expression are compared to a randomly stratified combination of all other expressions (a sketch of this subset construction is given after Fig. 8). For example, the happiness subset contains two classes: happiness versus no-happiness. All the videos labeled happiness from the initial dataset are kept. For the no-happiness class, videos labeled with the five other expressions are randomly picked and a stratification scheme is employed in order to guarantee the same representativity of the other expressions as in the initial dataset.

For this evaluation, we have generated 25 configurations using one region, 46 configurations using two regions and so on, until 12827 configurations using 8 regions. A total number of 21294 configurations is generated. The 21294 configurations are generated for each expression and all these models are sent to SVM classifiers. Weights are calculated for the twenty-five regions and for each expression. The regions are ranked according to the computed weights. The ranking is further used to generate twenty-five models by facial expression, containing from one to twenty-five regions.

6.2 Impact of the selection process

Fig. 8 shows the results obtained for each facial expression using the sorted regions. These results show that this approach is robust to really important occlusions. Indeed, the facial expressions corresponding to surprise, happiness and disgust reach nearly optimal results with only one region. Sadness needs at least three regions to be recognized and anger needs at least six regions. This result is related to the complexity of the emotion. The anger and disgust expressions share the same facial regions, which makes it hard to distinguish them with fewer regions. It is then necessary to take into consideration a larger number of regions.
Fig. 8: Expression recognition rate according to the number of regions.
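As referenced in Section 6.1, the sketch below builds one balanced one-vs-rest subset with a stratified sample of the other expressions. The data layout and helper names are illustrative, not the authors' code.

```python
import random
from collections import defaultdict

def one_vs_rest_subset(videos, target, seed=0):
    """Build a balanced binary subset: all `target` videos versus a
    stratified random sample of the other expressions."""
    rng = random.Random(seed)
    positives = [v for v, label in videos if label == target]
    by_label = defaultdict(list)
    for v, label in videos:
        if label != target:
            by_label[label].append(v)
    # sample each other expression proportionally to its share of the
    # initial dataset, so the negative class keeps the same mixture
    total_other = sum(len(pool) for pool in by_label.values())
    negatives = []
    for label, pool in by_label.items():
        k = round(len(positives) * len(pool) / total_other)
        negatives += rng.sample(pool, min(k, len(pool)))
    return positives, negatives

videos = [(f"seq_{i}", i % 6) for i in range(374)]  # toy CK+-like labels
pos, neg = one_vs_rest_subset(videos, target=3)
```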
7 Evaluation of the entire process

In this section, we present an evaluation of the entire process. We first evaluate the effectiveness of our approach in characterizing the six universal expressions (happiness, anger, disgust, fear, sadness and surprise) under different facial occlusions. Then, a comparison with representative approaches from the literature is performed.

7.1 Experimental protocol
In our evaluation, we selected the six best regions computed for each expression and each occlusion. The selected facial frameworks per expression and per occlusion are illustrated in Fig. 10.

In order to evaluate our approach, we split the dataset into two training sets: one to train the per-expression models and another to train the fusion model. We take 40% of the sequences to build the first training sets. The remaining 60% are then used for the second training set, dedicated to the fusion (see the sketch after Fig. 10).

Fig. 9: Performance comparison with occlusion by expression on CK+.

For the first training sets, we need six different training sets: one per expression. In order to build these training sets, we take all the sequences of the current expression. In order to have a balanced distribution, the same amount of data for the expression and for the other expressions is respected. Thus, we randomly pick sequences from the five other expressions.

Fig. 10: The six best regions according to occlusions per expression.
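A minimal sketch of this two-stage split, assuming scikit-learn; whether the authors stratify the 40/60 partition is our assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 374 sequences, 200-dimensional motion features,
# six expression labels (0..5).
X = np.random.rand(374, 200)
y = np.random.randint(0, 6, size=374)

# 40% of the sequences train the per-expression binary models,
# the remaining 60% train the fusion layer.
X_binary, X_fusion, y_binary, y_fusion = train_test_split(
    X, y, train_size=0.4, stratify=y, random_state=0
)
```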
7.2 Performance under facial occlusions

In this section, we analyze the performance of our approach in characterizing the six universal facial expressions (happiness, sadness, disgust, fear, surprise and anger) under different occlusions. Table 1 shows the results obtained with our process with and without occlusion. The process without occlusion considers the six most important regions of the face to recognize each expression. The process with occlusion considers the regions calculated for the several occlusions.

Table 1: Accuracy of our approach on CK+ dataset with and without occlusion.
No occlusion: 91.35% | 71.9% | 88.8% | 88.6% | 88.8% | 90.6%
As observed in Table 1, the proposed approach is relatively robust in the presence of severe facial occlusions. The results obtained with an occlusion of the bottom of the face drop significantly. This demonstrates the importance of the mouth regions for the expression, which is harder to compensate with the information found in the upper part of the face. Concerning the eyes, although these regions are really important for some expressions, our process still gives good results.

Although some occlusions tend to reduce performance, the difference with the unoccluded face is not significant. This shows that the analysis of the propagation of movement is a good solution to overcome partial facial occlusions.
7.3 Comparison with the literature

In this section, we compare the performance of our approach with the other approaches proposed in the literature on the CK+ database. Since there is no predefined baseline to compare the different approaches, we only analyze the occlusions that are closest to those of the other approaches. The results are represented in Fig. 11.

Fig. 11: Comparison of performances with other approaches in the literature.

In view of the results, our approach gives very competitive performances. It is important to note that our occlusions are more severe than those used in other approaches, except for [12], which may explain the difference with some approaches. Indeed, cheek-level information for the first line and forehead-level information for the second line are important to characterize facial expressions.
8 Conclusion

In this paper, we designed an approach that handles expression recognition in the presence of occlusions. We propose, as a first step, a method to calculate a facial framework for each expression, adapted to a considered occlusion. Based on the calculated facial frameworks, we then propose a fusion step in order to build an entire process which takes input data and predicts, at the end, the expression.

In order to do that, we pre-trained several models, one per expression, to obtain the probabilities that the input data belongs to an expression class. These probabilities are then aggregated and used for training the fusion model.

The results obtained with this process are competitive with state-of-the-art methods, although we have considered larger occlusions. Nevertheless, it is still difficult to compare with other approaches, especially due to reproducibility issues. One of our future works consists in building a benchmark regrouping a large set of occlusions, allowing the community to benefit from a stable evaluation framework.
References
1. Allaert, B., Bilasco, I.M., Djeraba, C.: Advanced local motion patterns for macro and micro facial expression recognition. arXiv preprint arXiv:1805.01951 (2018)
2. Chen, Y.A., Chen, W.C., Wei, C.P., Wang, Y.C.F.: Occlusion-aware face inpainting via generative adversarial networks. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 1202–1206. IEEE (2017)
3. Cornejo, J.Y.R., Pedrini, H.: Emotion recognition from occluded facial expressions using weber local descriptor. In: 2018 25th International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–5. IEEE (2018)
4. Dapogny, A., Bailly, K., Dubuisson, S.: Confidence-weighted local expression predictions for occlusion handling in expression recognition and action unit detection. IJCV (2-4), 255–271 (2018)
5. Huang, X., Zhao, G., Zheng, W., Pietikäinen, M.: Towards a dynamic expression recognition system under facial occlusion. Pattern Recognition Letters (16), 2181–2191 (2012)
6. Jampour, M., Li, C., Yu, L.F., Zhou, K., Lin, S., Bischof, H.: Face inpainting based on high-level facial attributes. Computer Vision and Image Understanding, 29–41 (2017)
7. Kotsia, I., Buciu, I., Pitas, I.: An analysis of facial expression recognition under partial facial image occlusion. Image and Vision Computing 26