Spontaneous Subtle Expression Detection and Recognition based on Facial Strain
Sze-Teng Liong a,b,∗, John See c, Raphael C.-W. Phan b, Yee-Hui Oh b, Anh Cat Le Ngo b, KokSheik Wong a, Su-Wei Tan b

a Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia
b Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Malaysia
c Faculty of Computing and Informatics, Multimedia University, 63100 Cyberjaya, Malaysia
Abstract
Optical strain is an extension of optical flow that is capable of quantifying subtle changes on faces and representing the minute facial motion intensities at the pixel level. This is computationally essential for the relatively new field of spontaneous micro-expression, where subtle expressions can be technically challenging to pinpoint. In this paper, we present a novel method for detecting and recognizing micro-expressions by utilizing facial optical strain magnitudes to construct optical strain features and optical strain weighted features. The two sets of features are then concatenated to form the resultant feature histogram. Experiments were performed on the CASME II and SMIC databases. We demonstrate on both databases the usefulness of optical strain information and, more importantly, that our best approaches are able to outperform the original baseline results for both detection and recognition tasks. A comparison of the proposed method with other existing spatio-temporal feature extraction approaches is also presented.
Keywords:
Subtle expressions, Micro-expressions, Facial strain, Detection, Recognition
1. Introduction
Micro-expression is a form of nonverbal communication that occurs for only a fraction of a second [1]. It is an uncontrollable expression that reveals the true emotional state of a person even when she is trying to conceal it. The appearance of a micro-expression is extremely rapid and brief, and it usually lasts for merely one twenty-fifth to one fifth of a second [2]. This is the main reason why ordinary people often face difficulties in recognizing and understanding the genuine emotions of each other during real-time conversations. There are six basic facial expressions, namely happiness, surprise, anger, sadness, fear and disgust, a categorization first proposed by [3]. Recognition of facial micro-expressions is valuable to various applications in the fields of medical diagnosis [4], national safety [5] and police interrogation [6]. The task of automatic recognition of spontaneous subtle expressions is thus of great interest to affective computing in this day and age. To date, many techniques and algorithms have been proposed and implemented for normal facial expression (or macro-expression) detection and recognition [7–11], but analysis of micro-expressions is still a relatively new research topic and very few works have been published [12–16].

In one of our previous works [17] on subtle expression recognition, we proposed a technique that outperforms the baseline methods of both the CASME II [18] and SMIC [19] databases, using optical strain magnitudes as weight matrices to raise the importance of the feature values extracted by Local Binary Patterns with Three Orthogonal Planes (LBP-TOP) in different block regions. In our second paper [20], recognition of micro-expressions was achieved using another technique. While only tested on the SMIC database, this method worked reasonably well by directly utilizing the optical strain features, following the temporal sum pooling and filtering processes. We further extend these two works, with substantial improvements and a more comprehensive evaluation on both detection and recognition tasks.

In this paper, we introduce a novel method for automatic detection and recognition of spontaneous facial micro-expressions using optical strain information. Detection refers to determining the presence of a micro-expression on the face without identification of its type, whereas recognition goes a step further to distinguish the exact state or type of expression shown on the face. The proposed method mainly builds its feature extraction process on the optical flow method proposed by [21], which gives rise to the notion of optical strain. The feature histogram is constructed using optical strain information, following three main processes: (1) all the optical strain images in each video are temporally pooled, then the strain magnitudes of the pooled image are treated as features; (2) optical strain magnitudes are pooled in both spatial and temporal directions to form a weighting matrix, and the respective weights of each video are then multiplied with features from the XY-plane extracted by LBP-TOP; (3) lastly, the feature histograms from processes (1) and (2) are concatenated to form the final resultant feature histogram of the video sample.

The rest of the paper is organized as follows. Section 2 briefly reviews the related work. Section 3 describes the spatio-temporal features used in the paper, followed by Section 4, which explains the proposed algorithm in detail. The databases used are described in Section 5. The experiment results for detection and recognition of micro-expressions are summarized in Section 6. Finally, the conclusion is drawn in Section 7.

∗ Corresponding author
Email addresses: [email protected] (Sze-Teng Liong), [email protected] (John See), [email protected] (Raphael C.-W. Phan), [email protected] (Yee-Hui Oh), [email protected] (Anh Cat Le Ngo), [email protected] (KokSheik Wong), [email protected] (Su-Wei Tan)
Preprint submitted to Elsevier October 9, 2018
2. Related Work
In the paper by [22], optical strain patterns were used for spotting facial micro-expressions automatically. They achieved 100% detection accuracy on the USF micro-expression database, with one false spot. After computing the strain values for each pixel of the frames, a local threshold value was calculated by segmenting each frame into three pre-defined regions (i.e., forehead, cheek and mouth). Optical strain magnitudes that fell within certain threshold boundaries were considered a macro-expression. If the high strain values were detected in fewer than two facial regions and lasted for fewer than 20 frames (the frame rate was 30 fps), they were considered to be a micro-expression. However, the dataset used possesses a small sample size, containing a total of only 7 micro-expressions. Besides, the micro-expressions detected were not spontaneous but rather posed ones, which are less natural and realistic.

The same authors [23] later carried out an extensive test on two larger datasets containing a total of 124 micro-expressions. They also implemented an improved algorithm to spot the micro-expressions on these two datasets. Instead of partitioning the frame into three regions, they divided each frame into eight regions, namely: forehead, left and right eye, left and right cheek, left and right of the mouth, and chin. Some parts of the face were masked to overcome the noise present in the captured image frames. A promising detection accuracy of 74% was achieved.

LBP-TOP [24] describes the space-time texture of a video volume, encoding the local texture pattern by thresholding the center pixel against its neighbouring pixels. Block-based LBP-TOP partitions the three orthogonal image planes into N × N non-overlapping blocks, where the final histogram is a concatenation of histograms from each block volume. This final histogram represents the appearance, horizontal motion and vertical motion of a video.
There has been growing interest in micro-expression recognition recently, with a majority of works [14, 25] employing LBP-TOP and other variants [26] as their choice of feature.

The large number of pixels in an image or video can be summarized into a more compact, lower-dimensional representation. Feature pooling in the spatial domain is one commonly employed technique, which partitions the image into several regions, then sums up or averages the pixel intensities of each region. Spatial pooling has been employed together with state-of-the-art feature descriptors such as SIFT [27] and Histograms of Oriented Gradients (HOG) [28] to enhance their robustness against noise and clutter [29]. Different variations of pooling functions (e.g., maximum, minimum, mean and variance) can also be applied to summarize features over time, as shown to improve the performance of an automatic annotation and ranking music radio system [30].

There are many facial expression databases widely used in the literature [31]. However, spontaneous micro-expression databases that are publicly available are somewhat limited, which is one of the most challenging obstacles to the work of automatic recognition of micro-expressions. Naturally, this can be attributed to the difficulties faced in proper elicitation and manual labeling of micro-expression data. In addition, there remain several flaws with the existing micro-expression databases that hinder the progress of this research. For instance, USF-HD [32] and Polikovsky's database [33] recorded only 100 and 42 videos respectively, both containing posed micro-expressions instead of spontaneous (i.e., natural) ones. Since micro-expressions are typically involuntary and uncontrollable, theoretically they cannot be imitated or acted out [34].
Hence, the spontaneity of these micro-expressions is an essential characteristic for realistic and meaningful usage. Interestingly, the USF-HD also uses an unconventional criterion for determining micro-expressions (i.e., 2/3 second), which is longer than the most accepted durations.

Another database, YorkDDT [35], is a spontaneous micro-expression database collected for a deception detection test (DDT) as part of a psychological study. Although it consists of spontaneously obtained samples, its inadequacy lies in its rather short samples and low frame rate (i.e., 25 fps), while the original videos also had no expression labels. Irrelevant head and face movements are obvious when the subjects are speaking, contributing to more hurdles in the aspect of face alignment and registration for a recognition task. Similarly, spontaneous micro-expressions were also used in the Canal-9 political debate corpus [23], although the reported detection performance was only as good as chance accuracy (i.e., 50%). This was attributed to head movement and talking, while a much larger set of samples could also provide better generalization of patterns.
3. Feature Extraction
Optical flow is a popular method of estimating the image motion between two successive frames, expressed as a two-dimensional vector field [36]. It measures the spatial and temporal changes of intensity to look for a matching pixel in the next frame, under the assumption that all the temporal intensity changes in the image are due to motion only. In general, three common assumptions are made when approximating optical flow values: (1) brightness constancy – the observed brightness of the objects is constant over time (shadow and illumination changes due to any motion are constant); (2) spatial coherence – the surfaces of an object have spatial extent, where neighboring points are likely to lie on the same surface and hence have similar velocity values; and (3) temporal persistence – the image motion of a surface patch changes gradually over time.

The work of [37] compared the performance of the four basic optical flow techniques, namely region-based matching, differential, energy-based and phase-based methods. The local differential method was found to be the most reliable and produced the most consistent results. Therefore, we opt for the differential method [38] to approximate the optical flow motion vectors. The general optical flow constraint equation is defined as:

∇I · p⃗ + I_t = 0,   (1)

where I(x, y, t) is the image intensity at time t located at spatial point (x, y), ∇I = (I_x, I_y) is the spatial gradient and I_t is the temporal gradient of the intensity function. Assume that the point of interest in the image is initially positioned at (x, y), and that it moves through a distance (dx, dy) after a change in time of dt. The flow vector p⃗ = [p = dx/dt, q = dy/dt]^T denotes the horizontal and vertical components of the optical flow.
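As a concrete illustration, the differential approach can be realized with a basic Lucas–Kanade scheme that solves the flow constraint of Eq. (1) in a least-squares sense over small windows. The following is only a minimal numpy sketch of a differential method, not the authors' exact implementation of [38]:

```python
import numpy as np

def lucas_kanade_flow(I1, I2, win=5):
    """Differential (Lucas-Kanade) optical flow between two grayscale frames.

    Solves  Ix*p + Iy*q + It = 0  in a least-squares sense over a
    (win x win) neighbourhood around every pixel. A minimal sketch of the
    differential method; parameters and conditioning test are illustrative.
    """
    I1 = I1.astype(np.float64)
    I2 = I2.astype(np.float64)
    # Spatial gradients (central differences) and temporal gradient.
    Iy, Ix = np.gradient(I1)
    It = I2 - I1
    half = win // 2
    H, W = I1.shape
    p = np.zeros((H, W))
    q = np.zeros((H, W))
    for y in range(half, H - half):
        for x in range(half, W - half):
            ix = Ix[y - half:y + half + 1, x - half:x + half + 1].ravel()
            iy = Iy[y - half:y + half + 1, x - half:x + half + 1].ravel()
            it = It[y - half:y + half + 1, x - half:x + half + 1].ravel()
            A = np.stack([ix, iy], axis=1)        # (win*win, 2) design matrix
            ATA = A.T @ A
            if np.linalg.cond(ATA) < 1e6:         # keep well-conditioned windows only
                px, qx = np.linalg.solve(ATA, -A.T @ it)
                p[y, x], q[y, x] = px, qx
    return p, q
```

For a frame pair related by a one-pixel horizontal translation, the recovered p field should be close to 1 over textured regions, while q stays near 0.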
Optical strain is a technique derived from the concept of optical flow that enables the computation of small and subtle motion on the face by measuring the amount of facial tissue deformation. This is in part motivated by the concept of the strain rate tensor [39] from continuum mechanics, which describes the rate of change of the deformation of a material in the vicinity of a certain point, at a certain moment in time. The work by [22] demonstrated that optical strain is more reliable than optical flow in the task of automatic macro- and micro-expression spotting, producing more consistent results in their experiments. In this work, we intend to leverage the strengths of optical strain to describe suitable features for detection and recognition tasks.

Optical strain can be described by a two-dimensional displacement vector, u = [u, v]^T. The magnitude of the optical strain can be represented in the form of a Lagrangian strain tensor [39]:

ε = ½ [∇u + (∇u)^T],   (2)

which in expanded form is defined as

ε = [ ε_xx = ∂u/∂x                 ε_xy = ½(∂u/∂y + ∂v/∂x)
      ε_yx = ½(∂v/∂x + ∂u/∂y)      ε_yy = ∂v/∂y ],   (3)

where {ε_xx, ε_yy} are the normal strain components while {ε_xy, ε_yx} are the shear components of the optical strain.

Each of these strain components is a function of the displacement vector (u, v). Thus, they can be approximated using the flow vector components (p, q) from Eq. (1) in discrete form, where Δt is the time between two image frames:

p = dx/dt ≈ Δx/Δt = u/Δt,   u = p Δt,   (4)
q = dy/dt ≈ Δy/Δt = v/Δt,   v = q Δt.   (5)

By setting Δt to a constant interval length, the partial derivatives in Eqs. (4) and (5) can be approximated, respectively, as:

∂u/∂x = (∂p/∂x) Δt,   ∂u/∂y = (∂p/∂y) Δt,   (6)
∂v/∂x = (∂q/∂x) Δt,   ∂v/∂y = (∂q/∂y) Δt.   (7)

These derivatives are determined using second-order accurate central finite difference approximations, i.e.,

∂u/∂x = [u(x + Δx) − u(x − Δx)] / (2Δx) ≈ [p(x + Δx) − p(x − Δx)] / (2Δx),   (8)
∂v/∂y = [v(y + Δy) − v(y − Δy)] / (2Δy) ≈ [q(y + Δy) − q(y − Δy)] / (2Δy),   (9)

where Δx = Δy = 1 pixel. The partial derivatives ∂u/∂y and ∂v/∂x are calculated in a similar manner. Finally, the optical strain magnitude for each pixel at a particular time t can be calculated as follows:

ε = √(ε_xx² + ε_yy² + ε_xy² + ε_yx²).   (10)

Fig. 1 shows the optical flow and optical strain images obtained from two sample videos. The values from both images are temporally sum-pooled and normalized for better visualization. In the bottom-row images, the raised left eyebrow, which is somewhat indistinguishable in the optical flow fields, is clearly emphasized by the strain components.

Figure 1: Example of optical flow (left) and optical strain (right) images from the CASME II (top row) and SMIC (bottom row) databases.

Local facial features in a video or sequence of images can be represented using the state-of-the-art dynamic texture descriptor, Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) [24]. The LBP code is extracted from three orthogonal planes (i.e., XY, XT and YT), which encodes the appearance and motion in three directions. Each pixel in an image forms an LBP code by applying a thresholding technique within its neighbourhood:
LBP_{P,R} = Σ_{p=0}^{P−1} s(g_p − g_c) 2^p,   s(x) = { 1, x ≥ 0; 0, x < 0 },   (11)

where P is the number of neighbouring points around the center pixel, (P, R) represents a neighbourhood of P points equally spaced on a circle of radius R, g_c is the gray value of the center pixel and g_p are the gray values of the P sampled points in the neighbourhood.

Block-based LBP-TOP partitions each image appearance plane (XY) into N × N non-overlapping blocks, then concatenates the feature histograms of each volumetric block to construct the final resultant histogram. This is done for each of the three orthogonal planes. Fig. 2 illustrates the process of extracting and concatenating the computed local features from the first two blocks in the three orthogonal planes. The concatenated histogram describing the global motion of the face over a video sequence can be succinctly denoted as

M_{b1,b2,d,c} = Σ_{x,y,t} I{h_d(x, y, t) = c},   c = 0, …, 2^P − 1;   d = 0, 1, 2;   b1, b2 ∈ 1 … N,   (12)

where 2^P is the number of different labels produced by the LBP operator on the d-th plane (d = 0: XY (appearance), d = 1: XT (horizontal motion), d = 2: YT (vertical motion)), and h_d(x, y, t) is the LBP code, i.e., Eq. (11), of the central pixel (x, y, t) on the d-th plane, x ∈ {0, …, X − 1}, y ∈ {0, …, Y − 1}, t ∈ {0, …, T − 1}. X and Y are the width and height of the image (thus, b1 and b2 are the row and column indices, respectively), while T is the video length. The functional term I{A} determines the count of the c-th histogram bin when h_d(x, y, t) = c:

I{A} = { 1, if A is true; 0, if A is false }.   (13)

Hence, the final feature histogram is of dimension 2^P × 3 × N × N.

Figure 2: Block-based feature extraction of the first two block volumes in the three-dimensional space of a video sequence: (a) block volumes; (b) LBP features from three orthogonal planes; (c) concatenation of the histograms from the XY, XT and YT planes into a single histogram; (d) concatenation of the feature histograms from the two block volumes.
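The basic LBP operator of Eq. (11) can be sketched as follows. The bilinear interpolation of the circularly sampled neighbours is a common convention and an assumption here; the paper does not specify the sampling scheme:

```python
import numpy as np

def lbp_code(img, x, y, P=8, R=1.0):
    """Basic LBP operator of Eq. (11): threshold P circularly sampled
    neighbours against the center pixel (x, y) and pack the sign bits.
    Neighbour values are taken by bilinear interpolation (an assumption).
    The caller must pass an interior pixel so sampling stays in bounds."""
    gc = img[y, x]
    code = 0
    for p in range(P):
        # Sample point p on a circle of radius R around the center.
        xs = x + R * np.cos(2 * np.pi * p / P)
        ys = y - R * np.sin(2 * np.pi * p / P)
        x0, y0 = int(np.floor(xs)), int(np.floor(ys))
        dx, dy = xs - x0, ys - y0
        gp = ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
              + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])
        code |= int(gp >= gc) << p   # s(g_p - g_c) contributes bit p
    return code
```

On a flat patch every neighbour ties with the center, so s(·) = 1 for all P bits and the code is 2^P − 1; a bright isolated center pixel yields code 0.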
For an appropriate comparison of the features among video samples of different spatial and temporal lengths, the concatenated histogram is sum-normalized to obtain a coherent description:

M̄_{b1,b2,d,c} = M_{b1,b2,d,c} / Σ_{k=0}^{n_d − 1} M_{b1,b2,d,k}.   (14)

Throughout this paper, we will denote the LBP-TOP parameters by LBP-TOP_{P_XY, P_XT, P_YT, R_X, R_Y, R_T}, where the P parameters indicate the number of neighbour points for each of the three orthogonal planes, while the R parameters denote the radii along the X, Y and T dimensions of the descriptor.
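Putting Eqs. (4)–(10) together, the optical strain magnitude map follows from central differences of the flow field. A minimal numpy sketch, assuming Δt = 1 so that the displacements coincide with the flow components:

```python
import numpy as np

def optical_strain_magnitude(p, q):
    """Optical strain magnitude map, Eq. (10), from a dense flow field.

    p, q : 2-D arrays holding the horizontal and vertical flow components.
    With dt = 1, the displacements are u = p and v = q (Eqs. (4)-(5)); the
    strain components are central differences of the flow (Eqs. (6)-(9)).
    """
    # np.gradient uses second-order central differences in the interior.
    du_dy, du_dx = np.gradient(p)   # gradients along rows (y) and columns (x)
    dv_dy, dv_dx = np.gradient(q)
    e_xx = du_dx                    # normal components, Eq. (3)
    e_yy = dv_dy
    e_xy = 0.5 * (du_dy + dv_dx)    # shear components (e_yx = e_xy)
    # Eq. (10): e_xy and e_yx contribute equally, hence the factor 2.
    return np.sqrt(e_xx**2 + e_yy**2 + 2 * e_xy**2)
```

A rigid translation (constant flow) produces zero strain everywhere, while a pure shear flow such as p = 0.1·y gives a constant shear strain, which is a quick sanity check for the sign and scale conventions.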
4. Proposed Algorithm
We propose three main steps to extract the spatio-temporal features by utilizing facial optical strain information: (1) the optical strain magnitudes in each frame are derived from the optical flow values; all the optical strain maps in each video are then temporally pooled into a composite strain map, and the optical strain magnitudes in the composite strain map are directly used as features; (2) spatio-temporal pooling is applied on the optical strain frames of each video, and the final matrix of normalized coefficient values obtained is used as the weights for each video; the weighting matrix (of N × N dimension after pooling) is then multiplied with the respective LBP-TOP-extracted histogram bins on the XY plane; (3) finally, the feature histograms extracted in steps (1) and (2) are concatenated into a final feature histogram that represents the processed video sample.

The overview of the proposed method is shown in Fig. 3. Note that all the image data from the prior databases [18, 19] were captured under constrained lab conditions, and face registration and alignment have already been performed.

Figure 3: Overview of the proposed algorithm.

As optical strain magnitudes can aptly describe the minute extent of facial deformation at the pixel level, they can be directly employed as features as well. For clarity, we first describe the notations used in the subsequent sections. A micro-expression video clip is expressed as

s_i = {f_{i,j} | i = 1, …, n; j = 1, …, F_i},   (15)

where F_i is the total number of frames in the i-th sequence, which is taken from a collection of n video sequences.

The optical flow field is first estimated by its 2-D motion vector, p⃗ = (p, q). Then, the optical strain magnitude at each pixel location, ε_{x,y} (from Eq. (10)), is calculated for each flow field over two consecutive frames, i.e., {f_{i,j}, f_{i,j+1}}. Hence, each video of resolution X × Y produces a set of F_i − 1 optical strain maps:

m_{i,j} = {ε_{x,y} | x = 1, …, X; y = 1, …, Y},   (16)

for i ∈ 1, …, n and j ∈ 1, …, F_i − 1.

Prior to feature extraction, two pre-processing steps are carried out to reduce unwanted noise and attenuate the strong and weak signals in the optical strain maps. First, the edges in each strain map m_{i,j} are removed. The edges are the gradients of the moving objects, which consist of local maxima. The purpose of eliminating the edges is to remove a large number of irrelevant maxima when the strain map is very noisy [40]. Among the different types of edge detectors, the Sobel filter justifies its feasibility by two main advantages [41]: its ability to detect the edges in a noisy image by introducing a smoothing and blurring effect on the image, and the differencing of two rows or two columns, which enhances the strength of important edges. The Sobel operator is a simple approximation to the concept of the 2-D spatial gradient, obtained by convolving a grayscale input image with a pair of 3 × 3 kernels.

Second, the strain magnitudes in m_{i,j} are clipped to zero for ε_{x,y} ∉ [T_l, T_u], with the two threshold values T_l and T_u denoting the lower and upper thresholds respectively. The values of T_l and T_u are determined using the lower and upper percentages (ρ_l, ρ_u) of the strain magnitude range, [ε_min = min{ε_{x,y}}, ε_max = max{ε_{x,y}}]. The lower and upper thresholds are computed as follows:

T_l = ε_min + ρ_l · (ε_max − ε_min),   ρ_l ∈ [0, 1],   (17)
T_u = ε_max − ρ_u · (ε_max − ε_min),   ρ_u ∈ [0, 1].   (18)

Fig. 4 illustrates the effect of ρ_l and ρ_u on the micro-expression recognition rate. It is observed that ρ_l = ρ_u = 0.05 yields the best results. Therefore, we set the clipping tolerance to 5% of the magnitude range of each processed frame.

Figure 4: Effect of ρ_l and ρ_u values on the micro-expression recognition rate for the SMIC database.

With each frame properly aligned, the optical strain maps can then be segmented vertically into three regions of equal size (i.e., forehead–lower eyelid, lower eyelid–nostril and nostril–mouth) to obtain their individual local threshold values. The purpose of performing this segmentation is to minimize the effects of dominant motions that arise from a particular region, as the range of strain magnitudes differs across the three regions. Fig. 5 shows how an optical strain map is divided into the three vertical segments.

Figure 5: Example of vertical segmentation of the optical strain frame into three regions.
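The clipping step of Eqs. (17)–(18) amounts to zeroing magnitudes outside the tolerated range. A small sketch of this step only (the function name is illustrative, and Sobel edge removal is assumed to have been applied beforehand):

```python
import numpy as np

def clip_strain_map(strain, rho_l=0.05, rho_u=0.05):
    """Noise attenuation of Eqs. (17)-(18): zero out strain magnitudes
    falling outside [T_l, T_u], defined as fractions of the magnitude
    range of the (already edge-removed) strain map."""
    e_min, e_max = strain.min(), strain.max()
    t_l = e_min + rho_l * (e_max - e_min)   # Eq. (17), lower threshold
    t_u = e_max - rho_u * (e_max - e_min)   # Eq. (18), upper threshold
    out = strain.copy()
    out[(strain < t_l) | (strain > t_u)] = 0.0
    return out
```

With the default ρ_l = ρ_u = 0.05 chosen in the paper, the weakest and strongest 5% of the magnitude range are suppressed while mid-range responses pass through unchanged.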
In order to describe the optical strain patterns in a compact and consistent representation, the optical strain maps m_{i,j} are pooled across time (i.e., temporal pooling). We perform temporal mean pooling to obtain a composite strain map,

m̂_i = (1 / (F_i − 1)) Σ_{j=1}^{F_i − 1} m_{i,j},   (19)

where all optical strain magnitudes ε_{x,y} of each strain map m_{i,j} are averaged across the temporal dimension. The intuition behind this pooling step is to help accentuate the minute motions in micro-expressions by aggregating these facial strain patterns. Mean pooling also ensures that the optical strain magnitudes are normalized with respect to their sequence lengths. Then, the composite strain map is max-normalized to increase the significance of its values. In the final step, we resize the composite strain map to 50 × 50 pixels and vectorize its rows to form a 2500-dimensional feature vector. Fig. 6 shows a graphical illustration of the entire process of extracting optical strain features.

Figure 6: Extracting optical strain features (OSF) from a sample video sequence: (a) original images f_{1,j}; (b) optical strain maps m_{1,j}; (c) images after pre-processing; (d) temporally pooled strain image; (e) after applying maximum normalization and resizing.

While the OSF describe pixel-level motion features, LBP-TOP is capable of encoding texture dynamics in larger facial patches. In block-based LBP-TOP [24], the feature histograms obtained from all blocks are given equal treatment. Since subtle expressions typically occur in highly localized regions of the face (and this differs for different expression classes), the feature histograms representing these regions should be amplified. As larger motions generate larger optical strain magnitudes and vice versa, a set of weights can be computed to scale the features in each block proportionally to their respective motion strengths. We proceed to elaborate how optical strain weighted (OSW) features are obtained.
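The OSF extraction just described (temporal mean pooling of Eq. (19), max-normalization, resizing and vectorization) can be sketched as follows. The nearest-neighbour resize is a simplifying assumption, as the paper does not specify the interpolation used:

```python
import numpy as np

def osf_features(strain_maps, out_size=50):
    """Optical Strain Features (OSF): Eq. (19) temporal mean pooling,
    max-normalization, resizing to out_size x out_size and vectorization.

    strain_maps : array of shape (F_i - 1, Y, X) of pre-processed strain maps.
    """
    # Eq. (19): average the F_i - 1 strain maps over time.
    composite = np.mean(strain_maps, axis=0)
    # Max-normalize to increase the significance of the values.
    composite = composite / (composite.max() + 1e-12)
    # Nearest-neighbour resize onto a fixed out_size x out_size grid
    # (a stand-in for a proper interpolated resize).
    H, W = composite.shape
    ys = np.arange(out_size) * H // out_size
    xs = np.arange(out_size) * W // out_size
    resized = composite[np.ix_(ys, xs)]
    return resized.ravel()   # 2500-dimensional feature vector for out_size=50
```

Feeding in a stack of identical unit-strain maps should return a constant vector of length 2500, which checks the pooling and normalization conventions.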
Features are extracted by block-based LBP-TOP from each video clip s_i, whereby the entire video volume is partitioned into N × N non-overlapping block volumes. For each of these block volumes, we compute the LBP features from the three orthogonal planes (concatenated to form LBP-TOP) to obtain dynamic texture features that are local to each particular block region. Finally, the feature histograms from all N × N block volumes are concatenated to form the final feature histogram.

Upon partitioning into blocks, the two blocks located at the bottom left and bottom right corners of the frames are eliminated due to a noticeable amount of movement or noise caused by the background lighting condition, and also the presence of wires from the headset worn by the subjects (see Fig. 7 for a frame sample from both datasets). For ease, we refer to these two blocks as "noise blocks". Therefore, we omit these noise blocks and only the remaining N × N − 2 blocks are considered.

Figure 7: Top row: a sample image from SMIC (left) and the corresponding optical strain map (right). Bottom row: a sample image from CASME II (left) and the corresponding optical strain map (right).

To obtain the weights for each block, we perform spatio-temporal pooling on all optical strain maps m_{i,j} in the video sequence. We consider spatio-temporal pooling in a separable fashion: spatial mean pooling is performed first, followed by temporal mean pooling. Firstly, spatial mean pooling averages all the strain magnitudes ε_{x,y} within each block, resulting in a block-wise strain magnitude:

z_{b1,b2} = (1 / (HL)) Σ_{y=(b2−1)H+1}^{b2·H} Σ_{x=(b1−1)L+1}^{b1·L} ε_{x,y},   (20)

where L = X/N, H = Y/N, the block indices (b1, b2) ∈ 1 … N, and (X, Y) are the dimensions (width and height) of the frame. This process summarizes the encoded features locally in each block area of the face. Secondly, temporal mean pooling is applied on the spatially-pooled frames, where the values of z_{b1,b2} are averaged along the temporal dimension across all video frames.
Therefore, for each video, we derive aunique set of N × N weights, W i = { w b ,b } Nb ,b =1 where each weight coefficient w is described as: w b ,b = 1 F i − F i − (cid:88) t =1 z b ,b = 1( F i − HL F i − (cid:88) t =1 b H (cid:88) y =( b − H +1 b L (cid:88) x =( b − L +1 ε x,y (21) XY -Plane Histogram After obtaining the feature histograms extracted by LBP-TOP and the optical strain weights, the coef-ficients of the weight matrix W are multiplied with the XY-plane feature histograms of their correspondingmatching blocks. This weighting procedure is performed only on features from the XY plane so that themotion strengths are well accentuated in each local area of the face.10igure 8: Derivation of the optical strain weighted feature for the first video: (a) Each j -th frame in m ,j is divided into 5 × ε x,y within each block region are spatiallypooled; (b) The block-wise strain magnitudes z b ,b from all frames ( j ∈ . . . F i − ) are temporally meanpooled; (c) The N × N size of weighting matrix, W is formed; (d) coefficients of W are multiplied to theirrespective XY -plane histogram bins.Concisely, the optical strain weighted histograms can be defined as: G b ,b ,d,c = (cid:40) w b ,b M b ,b ,d,c , if d = 0; M b ,b ,d,c , otherwise. (22)where M is the normalized feature histogram of block ( b , b ) from Equation 14. The whole process flow ofobtaining the optical strain weighted features is graphically shown in Fig. 8.11 .3. Concatenating OSF and OSW Features In the final step, the two extracted features — OSF and OSW features are concatenated into a singlecomposite feature histogram. The concatenation process enriches the variety of features used, providingfurther robustness towards the detection and recognition of facial micro-expressions. The dimension of thefeature histogram in LBP-TOP with 5 × × × × ×
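The weighting pipeline of Eqs. (20)–(22) reduces to two mean-pooling steps and a per-block scaling. A sketch under the assumption that the frame divides evenly into N × N blocks (the helper names are illustrative):

```python
import numpy as np

def osw_weights(strain_maps, N=5):
    """Block weights of Eq. (21): spatial mean pooling of each strain map
    into an N x N grid (Eq. (20)), then temporal mean pooling over frames.

    strain_maps : array of shape (F_i - 1, Y, X).
    """
    F, Y, X = strain_maps.shape
    H, L = Y // N, X // N                    # block height and width
    # Crop so the frame divides evenly into N x N blocks (an assumption).
    maps = strain_maps[:, :H * N, :L * N]
    # Eq. (20) for every frame at once: average within each block.
    z = maps.reshape(F, N, H, N, L).mean(axis=(2, 4))   # (F, N, N)
    return z.mean(axis=0)                    # temporal mean pooling -> (N, N)

def weight_xy_histograms(xy_hists, W):
    """Eq. (22) for d = 0: scale each block's XY-plane histogram by its
    weight; XT- and YT-plane histograms are left untouched.

    xy_hists : array of shape (N, N, bins); W : (N, N) weight matrix.
    """
    return xy_hists * W[:, :, None]
```

Uniform strain maps yield uniform weights, so every XY-plane histogram is scaled equally; in practice, high-motion blocks receive proportionally larger weights.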
5. Experiment
The experiments were performed on two spontaneous micro-expression datasets that are publicly available: CASME II [18] and SMIC [19].
The Chinese Academy of Sciences Micro-Expression (CASME) II dataset contains 247 micro-expression film clips elicited from 26 participants (mean age of 22.03 years, standard deviation of 1.60). Each clip consists of only one type of expression, and the clips were recorded using a Point Grey GRAS-03K2C camera with a high temporal resolution of 200 fps and a spatial resolution of 640 × 480 pixels. To collect the micro-expression videos, some of the participants were asked to keep neutral faces while the rest were asked to suppress their facial motion when watching the film clips. The clips are labeled with five expression classes: happiness, disgust, surprise, repression and tense. The ground truths of the dataset are provided, which include the onset and offset of each expression, the represented expression and the marked action units (AUs). The labeling of the micro-expressions was done by two coders. They determined the micro-expressions from the raw data by first spotting the onset and offset frames of each video. The spotted sequences were then labeled as micro-expressions if the duration lasted for less than 1 s. The decision to mark them as micro-expressions, as well as the identification of their respective classes, was based on the AUs found, the participants' self-reports and the contents of the film clips. A five-class baseline recognition accuracy of 63.41% was reported, with block-based LBP-TOP employed as the feature extractor and images partitioned into 5 × 5 blocks.

The second dataset, the Spontaneous Micro-expression (SMIC) database, consists of 16 participants (mean age of 28.1 years; six females and ten males, with eight Caucasians and eight Asians) and 164 micro-expression videos. The video clips were recorded using a high-speed PixeLINK PL-B774U camera at a reasonably good resolution of 640 × 480 and a frame rate of 100 fps. The micro-expressions were collected by asking the participants to put on a poker face and suppress their genuine feelings when watching the films. The video clips are classified into three main categories: positive (happiness), negative (sadness, fear, disgust) and surprise. Labeling of the micro-expression clips was done by two coders, whereby the clips were first viewed frame by frame at a slower playback speed before being replayed at increasing frame rates. This follows the suggestion established by [46]. Next, the selected clips agreed upon by both coders were compared to the participants' self-reported expressions before the micro-expression clips were included in the database. In the pre-processing step, all the video clips were first interpolated into ten frames using the temporal interpolation model (TIM) [19] to minimize feature complexity and processing time. In addition, TIM has been shown to boost micro-expression recognition performance [35]. Hence, we also apply TIM to standardize each video in the SMIC database to a length of 10 frames. Two tasks were performed on the database, detection and recognition, where block-based LBP-TOP (N × N non-overlapping blocks) is chosen for feature extraction and SVM as the classifier. Experiments were conducted in a leave-one-subject-out cross-validation (LOSOCV) setting. The best baseline performance for this three-class classification task is 48.78%, obtained with an 8 × 8 block partition.
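The LOSOCV protocol with a linear SVM can be sketched as follows, assuming scikit-learn is available; the micro-averaged accuracy is simply the fraction of correctly classified held-out samples across all folds:

```python
import numpy as np
from sklearn.svm import SVC

def losocv_accuracy(features, labels, subjects, C=10000):
    """Leave-one-subject-out cross-validation with a linear SVM.

    All samples of one subject are held out per fold; the classifier is
    trained on the remaining subjects. Returns the micro-averaged accuracy.
    A minimal sketch of the evaluation protocol, not the authors' code.
    """
    correct = 0
    for s in np.unique(subjects):
        test = subjects == s
        clf = SVC(kernel="linear", C=C)
        clf.fit(features[~test], labels[~test])
        correct += np.sum(clf.predict(features[test]) == labels[test])
    return correct / len(labels)
```

Holding out whole subjects (rather than random samples) prevents the classifier from exploiting subject identity, which matters because each subject contributes a different number of clips.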
6. Results and Discussion
We evaluated our methods in two separate experiments: (1) detection of micro-expressions (SMIC only), and (2) recognition of micro-expressions (CASME II and SMIC). The detection task involves determining whether a clip contains a micro-expression or not, regardless of the emotional state it represents, while the recognition task involves identifying the emotional state present in the video clip.

Note that both CASME II and SMIC provide cropped face video sequences, where only the face region is retained and the unnecessary background is removed. We directly use the cropped image frames in our experiments; these frames have an average spatial resolution of 340 × 280 pixels for CASME II and 170 × 140 pixels for SMIC.

Here, we establish the parameter settings used in our experiments; the parameters of the feature extractor and classifier are mostly the values adopted from the original works, i.e., CASME II [18] and SMIC [19]. SVM with a linear kernel (c = 10000) was used as the classifier. The block sizes of LBP-TOP were selected based on the original works [18, 19]: we set the block partitioning of LBP-TOP to both 5 × 5 and 8 × 8. In addition, the number of neighbour points and the radii along the three orthogonal planes were set following the original works, with the temporal radius R_T = 4. The reason for selecting this parameter configuration is explained in Section 6.2.

In SMIC, recognition is a three-class classification (i.e., positive, negative, surprise), while detection of micro-expressions is a binary decision (i.e., yes/no). Evaluations on SMIC were conducted using an SVM classifier in the LOSOCV setting. For CASME II, only the recognition task was performed since the database does not provide non-micro-expression clips to enable a detection task. The recognition task on CASME II is a five-class problem, evaluated using an SVM classifier in the LOVOCV setting.

There are two ways to measure classification performance in the LOSOCV setting, namely macro- and micro-averaging [47]. The macro-averaged result is computed by averaging across all individual subject-wise accuracies, while the micro-averaged result is the overall accuracy across all evaluated samples. We also present further performance metrics such as F1-measure, precision and recall when the LOSOCV setting is used, as suggested by [48]. These metrics provide a more meaningful perspective than accuracy rates because the datasets are naturally imbalanced: each subject has a different number of video samples.

The three proposed methods – (i) Optical Strain Features (OSF), (ii) Optical Strain Weighted LBP-TOP Features (OSW), and (iii) the concatenation of both features (OSF + OSW) – were evaluated and compared against their respective baseline methods in both the detection and recognition experiments.
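The distinction between micro- and macro-averaged accuracy under LOSOCV can be made concrete with a short sketch (function name ours), assuming per-clip predictions and subject labels collected over the cross-validation folds:

```python
import numpy as np

def micro_macro_accuracy(y_true, y_pred, subjects):
    """Micro- and macro-averaged accuracy under LOSOCV.

    Micro-averaging pools all samples: the overall fraction correct.
    Macro-averaging first computes per-subject accuracy, then takes the
    unweighted mean, so each subject contributes equally even when
    subjects have different numbers of clips.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    subjects = np.asarray(subjects)
    correct = (y_true == y_pred)
    micro = correct.mean()
    macro = np.mean([correct[subjects == s].mean()
                     for s in np.unique(subjects)])
    return micro, macro
```

For example, if subject A contributes one correctly classified clip and subject B contributes three misclassified clips, the micro-averaged accuracy is 0.25 while the macro-averaged accuracy is 0.5, which is why the two measures can diverge on imbalanced data.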
From the results shown in Tables 1 and 2, we observe that the OSF method is able to produce reasonably positive results compared to the baselines in some cases. However, better and more consistent results were obtained using the OSW and OSF + OSW methods for both the macro- and micro-averaging measures.

For the detection task on the SMIC database (using 5 × 5 blocks), OSF + OSW outperformed the baseline by 11.28% and 8.29% in the micro- and macro-averaged results respectively. In addition, we obtained an even larger improvement of ≈15% when the 8 × 8 block partition was used. With OSF + OSW, micro- and non-micro-expressions can be better distinguished; for non-micro-expressions in particular, there is a large increase in recognition rate of around 17%.

In the recognition experiment on the SMIC database, we achieve up to ≈5% improvement over the baseline results with the concatenated OSF + OSW method using 5 × 5 blocks.

Table 2: Micro-expression detection and recognition results on the SMIC and CASME II databases with LBP-TOP of 8 × 8 blocks.

Table 3: Confusion matrices of the baseline and OSF + OSW methods for the detection task on the SMIC database with LBP-TOP of 5 × 5 blocks. (a) Baseline, (b) OSF + OSW; classes: micro, non-micro.

This method also registered a performance improvement of ≈10% over the baseline results when 8 × 8 blocks were used. Although the OSF method did not perform as well on its own, its contribution to the concatenated OSF + OSW should not be disregarded. The detailed confusion matrices for the recognition performance on the SMIC database are shown in Table 4. It can be seen that the 'Negative' and 'Surprise' expressions can be recognized with higher accuracy using the
OSF + OSW method, while the accuracy of the 'Positive' expression remains unchanged at 49.02%.

For the recognition experiment on the CASME II dataset, we observe better performance from the OSW method compared to the baseline and the other evaluated methods. There is a substantial improvement with the OSF + OSW method for both the 5 × 5 and 8 × 8 block partitions; most expression classes are better recognized with OSF + OSW, but 'Happiness' and 'Repression' have lower recognition rates compared to the baseline. Nevertheless, the average recognition accuracy of the OSF + OSW method is better than that of the baseline.

Other performance metrics (including F1-score, recall and precision), reported in Table 6 and Table 7, further substantiate the superiority of the proposed methods over the baseline, especially the OSW and OSF + OSW methods in the detection and recognition tasks respectively on the SMIC database. The performance of the concatenated OSF + OSW method on CASME II recognition is slightly better than that of the baseline.

Table 4: Confusion matrices of the baseline and OSF + OSW methods for the recognition task on the SMIC database with LBP-TOP of 8 × 8 blocks. (a) Baseline, (b) OSF + OSW; classes: negative, positive, surprise.

Table 5: Confusion matrices of the baseline and OSF + OSW methods for the recognition task on the CASME II database with LBP-TOP of 5 × 5 blocks. (a) Baseline, (b) OSF + OSW; classes: disgust, happiness, tense, surprise, repression.

In a nutshell, optical strain characterizes the relative amount of displacement of a moving object within a time interval. Its ability to capture small muscular movements on faces is advantageous for subtle expression research. By a simple product of the LBP-TOP histogram bins with the weights, the resulting feature histograms are intuitively scaled to reflect the importance of each block region. The
OSF + OSW approach generated consistently promising results throughout the experiments on the SMIC database. The reason the proposed method did not perform as well on CASME II as on SMIC is probably the high frame rate of its video clips: repetitive frames with very little change in movement may introduce redundancy in the input data, so the extracted strain information may be too insignificant (hence negligible) to offer much discrimination between features of different classes. This is most obvious in the OSF results for CASME II, where there was in fact a significant deterioration in performance. SMIC videos, on the other hand (captured at only half the frame rate of CASME II videos), are able to harness the full capability of the optical strain information, with OSF seen to complement OSW very well, producing even better results when combined.

Since background noise is present in the video frames of both databases, spatial pooling helps to improve robustness against it. Furthermore, high strain magnitudes that exceed the upper threshold (Eq. 18) are treated as noise rather than micro-expression movements, while low strain magnitudes below the lower threshold (Eq. 17) are ignored since they do not contribute sufficient detail towards the micro-expressions.

Table 6: F1-score, recall and precision for detection and recognition performance on the SMIC and CASME II databases with LBP-TOP of 5 × 5 blocks.

(a) Detection - SMIC
             |        Micro         |        Macro
Methods      | F1     Rec    Prec   | F1     Rec    Prec
Baseline     | 0.6070 0.6067 0.6074 | 0.6623 0.6623 0.6697
OSF          | 0.6621 0.6616 0.6626 | 0.6729 0.6634 0.6906
OSW          | 0.6372 0.6372 0.6372 | 0.6938 0.6767 0.7133
OSF + OSW    |

(b) Recognition - SMIC
Methods      | F1     Rec    Prec   | F1     Rec    Prec
Baseline     | 0.4075 0.4108 0.4043 | 0.3682 0.3778 0.3929
OSF          | 0.4171 0.4227 0.4116 | 0.3849 0.3798 0.4165
OSW          | 0.4704 0.4745 0.4664 | 0.4508 0.4375 0.4939
OSF + OSW    |

(c) Recognition - CASME II
Methods      | F1     Rec    Prec
Baseline     | 0.5945 0.5781 0.6118
OSF          | 0.4397 0.4109 0.4729
OSW          | 0.6249 0.6105 0.6399
OSF + OSW    |

Another notable observation lies with the radii parameters of the LBP-TOP feature extractor (which is used by the
OSW method). By varying the value of R_T (the temporal radius), we can observe its importance in the results shown in Fig. 9: the recognition accuracy is highest for both the OSW and baseline (LBP-TOP) methods when R_T = 4. Therefore, all OSW experiments on the SMIC database were conducted with the temporal radius set to 4 to maximize accuracy.

We apply the settings used in the original papers of CASME II [18] and SMIC [19]; the LBP-TOP block partitions employed are 5 × 5 and 8 × 8. Fig. 10 shows how OSF + OSW fares under different block size configurations, with the baselines indicated by dashed lines. It can be seen that larger blocks (i.e., a smaller number of partitions, N = 1, 2, 3) did not produce better results than smaller blocks (i.e., a larger number of partitions, N = 6, 7, 8) in any scenario. This is because the local facial appearance and motion, which carry important details at specific facial locations, are not well described by large block areas. Hence, this analysis justifies our choice of the block settings suggested in the original works, where the best results using the block-based LBP-TOP feature can be achieved. On the other hand, it is also clear in Fig. 10 that the proposed OSF + OSW method outperformed the baseline LBP-TOP (dashed lines) in a majority of the experiments.
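As described above, OSW scales each block's LBP-TOP histogram by a strain-derived weight before concatenation. A minimal sketch of that block-weighting step follows; the function name is ours, and the per-block mean pooling is one plausible choice (the paper's exact pooling follows its own equations, which lie outside this section):

```python
import numpy as np

def strain_weighted_histograms(block_hists, strain_map, n=5):
    """Weight per-block LBP-TOP histograms by facial strain (a sketch).

    block_hists: array of shape (n, n, B) holding the B-bin LBP-TOP
      histogram of each of the n x n face blocks.
    strain_map: 2-D array of per-pixel optical strain magnitudes
      (temporally pooled over the clip).
    Each block's histogram is multiplied by the mean strain of that
    block, so blocks exhibiting stronger subtle motion contribute more.
    """
    H, W = strain_map.shape
    bh, bw = H // n, W // n
    weighted = np.empty_like(block_hists, dtype=np.float64)
    for i in range(n):
        for j in range(n):
            # Mean strain magnitude inside block (i, j) acts as weight.
            w = strain_map[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean()
            weighted[i, j] = block_hists[i, j] * w
    # Concatenate into a single feature vector, as done for LBP-TOP.
    return weighted.reshape(-1)
```

Because the weighting is a per-block scalar multiplication, the LBP-TOP histogram shapes are preserved and the weighted vector can be concatenated with the OSF histogram to form the OSF + OSW feature.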
We compare the results obtained using OSF + OSW against other spatio-temporal features, namely: the optical flow based features OFF + OFW, which we construct in the same manner (with optical flow magnitudes used instead of optical strain magnitudes); STIP (HOG+HOF), i.e., Histograms of Oriented Gradients and Histograms of Optical Flow extracted at spatio-temporal interest points [49]; and HOG3D, i.e., 3D oriented gradients [50]. The last two descriptors are popular spatio-temporal features used in human action recognition [51] and facial expression recognition [52]. For both these methods, interest points were densely sampled with the default parameters specified by the authors, and a bag-of-words (BOW) [49] representation is used to learn the visual vocabulary and build the feature vectors. The number of clusters or "bags" used in vocabulary learning was determined empirically and the best result is reported. For all these methods, we apply an SVM classifier with a linear kernel for fair comparison, except for TICS [13] and MDMO [12], which classify the micro-expressions in CASME II into four categories (i.e., negative, positive, surprise and others) instead of five (i.e., disgust, happiness, tense, surprise and repression). In addition, MDMO utilized a polynomial kernel in SVM with heuristically determined parameter settings.

Table 7: F1-score, recall and precision for detection and recognition performance on the SMIC and CASME II databases with LBP-TOP of 8 × 8 blocks.

(a) Detection - SMIC
             |        Micro         |        Macro
Methods      | F1     Rec    Prec   | F1     Rec    Prec
Baseline     | 0.5732 0.5732 0.5733 | 0.6005 0.5935 0.6219
OSF          | 0.6621 0.6616 0.6626 | 0.6729 0.6634 0.6906
OSW          | 0.6281 0.6280 0.6281 | 0.6499 0.6337 0.6688
OSF + OSW    |

(b) Recognition - SMIC
Methods      | F1     Rec    Prec   | F1     Rec    Prec
Baseline     | 0.4600 0.4613 0.4587 | 0.4111 0.4265 0.4161
OSF          | 0.4171 0.4227 0.4116 | 0.3849 0.3798 0.4165
OSW          | 0.5023 0.5053 0.4993 | 0.4187 0.4276 0.4283
OSF + OSW    |

(c) Recognition - CASME II
Methods      | F1     Rec    Prec
Baseline     | 0.5775 0.5610 0.5950
OSF          | 0.4397 0.4109 0.4729
OSW          | 0.5928 0.5737 0.6132
OSF + OSW    |

Figure 9: Micro-averaged accuracy results of the baseline (LBP-TOP) and OSW methods using different LBP-TOP radii parameters on the SMIC database, based on LOSOCV.

Figure 10: Recognition accuracy results of the baseline (LBP-TOP) and OSF + OSW methods using different block partitions in LBP-TOP. The baseline results are denoted by dashed lines.

The accuracies for the detection and recognition tasks are reported in Table 8. We observe that the STIP and HOG3D features yielded poor results because they are not designed to capture fine appearance and motion changes. The performance of the OFF + OFW features is more comparable to the SMIC baseline, but poorer than the CASME II baseline by a significant amount. Overall, the proposed OSF + OSW features yielded promising detection and recognition results compared to the other spatio-temporal features evaluated. We conclude that the proposed method is capable of describing the spatio-temporal information in micro-expressions more effectively.
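The optical strain magnitude that drives both OSF and OSW can be computed from a dense optical flow field. The sketch below assumes the flow (u, v) has already been estimated (e.g., with a robust estimator such as [21]); the strain tensor is the symmetric part of the flow gradient, consistent with the strain-pattern formulation of [22, 23]:

```python
import numpy as np

def strain_magnitude(u, v):
    """Per-pixel optical strain magnitude from a dense flow field.

    The strain tensor is the symmetric part of the flow gradient,
    eps = 0.5 * (grad(w) + grad(w)^T) for w = (u, v); its magnitude is
    the root of the sum of squared components (the shear term e_xy is
    counted twice, once as e_xy and once as e_yx).
    u, v: 2-D arrays of horizontal / vertical flow (rows are the y axis).
    """
    du_dy, du_dx = np.gradient(u)   # np.gradient returns (d/axis0, d/axis1)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                    # normal strain components
    e_yy = dv_dy
    e_xy = 0.5 * (du_dy + dv_dx)    # shear component (e_xy == e_yx)
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)
```

Because the strain magnitude depends on flow derivatives rather than raw displacements, it responds to local deformation of the skin (stretching and shearing) while being insensitive to uniform translation of the whole face, which is what makes it suitable for quantifying subtle facial motion.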
7. Conclusion
A novel feature extraction approach is proposed for the detection and recognition of facial micro-expressions in video clips. The proposed method describes the fine subtle movements of the face using optical strain in two different ways: first, the direct usage of optical strain information as a feature histogram (OSF), and second, the usage of strain information as weighting coefficients for LBP-TOP features (OSW). The concatenation of the two feature histograms enables us to achieve promising results in both the detection and recognition tasks. Experiments were performed on two recent state-of-the-art databases – SMIC and CASME II.

Table 8: Comparison of micro-expression detection and recognition performance on the SMIC and CASME II databases for different feature extraction methods.

Methods        | Det-SMIC*       | Recog-SMIC*     | Recog-CASME II•
               | Micro   Macro   | Micro   Macro   |
Baselines†     |                 |                 |
TICS♦ [13]     | N/A     N/A     | N/A     N/A     | 62.30
MDMO♦ [12]     | N/A     N/A     | N/A     N/A     | 70.34
OSF [20]       | 66.16   66.34   | 41.46   46.00   | 51.01
OSW [17]       | 62.80   63.37   | 49.39   50.71   | 61.94
OFF + OFW      | 61.59   61.40   | 40.24   41.94   | 55.87
OSF + OSW      |                 |                 |

† Baseline results from [18, 19]. ♦ Used 4 classes for CASME II instead of 5. * LOSO cross-validation. • LOVO cross-validation.
The best detection performance for SMIC was 74.52%, obtained using the 5 × 5 block configuration, while recognition improved over the baselines by +1.41% and +1.22% in the 5 × 5 and 8 × 8 block configurations respectively.

Acknowledgment
This work was funded by Telekom Malaysia (TM) under project 2beAware and by the University of Malaya Research Collaboration Grant (Title: Realistic Video-Based Assistive Living, Grant Number: CG009-2013) under the purview of University of Malaya Research.
References

[1] P. Ekman, W. V. Friesen, Nonverbal leakage and clues to deception, Journal for the Study of Interpersonal Processes 32 (1969) 88–106.
[2] S. Porter, L. ten Brinke, Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions, Psychological Science 19(5) (2008) 508–514.
[3] P. Ekman, W. V. Friesen, Constants across cultures in the face and emotion, Journal of Personality and Social Psychology 17(2) (1971) 124.
[4] M. G. Frank, M. Herbasz, K. Sinuk, A. Keller, A. Kurylo, C. Nolan, I see how you feel: Training laypeople and professionals to recognize fleeting emotions, in: Annual Meeting of the International Communication Association, Sheraton New York, New York City, NY, 2009.
[5] M. G. Frank, C. J. Maccario, V. Govindaraju, Protecting Airline Passengers in the Age of Terrorism, ABC-CLIO, 2009, pp. 86–106.
[6] M. O'Sullivan, M. G. Frank, C. M. Hurley, J. Tiwana, Police lie detection accuracy: The effect of lie scenario, Law and Human Behavior 33(6) (2009) 530–538.
[7] A. Uçar, Y. Demir, C. Güzeliş, A new facial expression recognition based on curvelet transform and online sequential extreme learning machine initialized with spherical clustering, Neural Computing and Applications 27(1) (2016) 131–142.
[8] Q. Jia, X. Gao, H. Guo, Z. Luo, Y. Wang, Multi-layer sparse representation for weighted LBP-patches based facial expression recognition, Sensors 15(3) (2015) 6719–6739.
[9] Z. Zeng, Y. Fu, G. I. Roisman, Z. Wen, Y. Hu, T. S. Huang, Spontaneous emotional facial expression detection, Journal of Multimedia 1(5) (2006) 1–8.
[10] D. Ghimire, S. Jeong, J. Lee, S. H. Park, Facial expression recognition based on local region specific features and support vector machines, Multimedia Tools and Applications 16(1) (2016) 1–19.
[11] Z. Wang, Q. Ruan, G. An, Facial expression recognition using sparse local Fisher discriminant analysis, Neurocomputing 174 (2016) 756–766.
[12] Y. J. Liu, J. K. Zhang, W. J. Yan, S. J. Wang, G. Zhao, X. Fu, A main directional mean optical flow feature for spontaneous micro-expression recognition, IEEE Transactions on Affective Computing, to appear.
[13] S. Wang, W. Yan, X.
Li, G. Zhao, C. Zhou, X. Fu, M. Yang, J. Tao, Micro-expression recognition using color spaces, IEEE Transactions on Image Processing 24(12) (2015) 6034–6047.
[14] S. J. Wang, W. J. Yan, X. Li, G. Zhao, X. Fu, Micro-expression recognition using dynamic textures on tensor independent color space, in: International Conference on Pattern Recognition (ICPR), 2014, pp. 4678–4683.
[15] S. J. Wang, H. L. Chen, W. J. Yan, Y. H. Chen, X. Fu, Face recognition and micro-expression recognition based on discriminant tensor subspace analysis plus extreme learning machine, Neural Processing Letters 39(1) (2014) 25–43.
[16] X. Huang, S. J. Wang, G. Zhao, M. Pietikäinen, Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection, in: Computer Vision Workshops, 2015, pp. 1–9.
[17] S. T. Liong, J. See, R. C.-W. Phan, A. C. Le Ngo, Y. H. Oh, K. Wong, Subtle expression recognition using optical strain weighted features, in: ACCV Workshops on Computer Vision for Affective Computing (CV4AC), 2014, pp. 644–657.
[18] W.-J. Yan, S.-J. Wang, G. Zhao, X. Li, Y.-J. Liu, Y.-H. Chen, X. Fu, CASME II: An improved spontaneous micro-expression database and the baseline evaluation, PLoS ONE 9 (2014) e86041.
[19] X. Li, T. Pfister, X. Huang, G. Zhao, M. Pietikäinen, A spontaneous micro-expression database: Inducement, collection and baseline, in: Automatic Face and Gesture Recognition, 2013, pp. 1–6.
[20] S. T. Liong, R. C. W. Phan, J. See, Y. H. Oh, K. Wong, Optical strain based recognition of subtle emotions, in: Intelligent Signal Processing and Communication Systems (ISPACS), 2014, pp. 180–184.
[21] M. J. Black, P. Anandan, The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields, Computer Vision and Image Understanding 63(1) (1996) 75–104.
[22] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, S. Sarkar, Towards macro- and micro-expression spotting in video using strain patterns, in: Applications of Computer Vision (WACV), 2009, pp.
1–6.
[23] M. Shreve, S. Godavarthy, D. Goldgof, S. Sarkar, Macro- and micro-expression spotting in long videos using spatio-temporal strain, in: Automatic Face, Gesture Recognition and Workshops, 2011, pp. 51–56.
[24] G. Zhao, M. Pietikäinen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6) (2007) 915–928.
[25] A. K. Davison, M. H. Yap, N. Costen, K. Tan, C. Lansley, D. Leightley, Micro-facial movements: An investigation on spatio-temporal descriptors, in: Computer Vision – ECCV Workshops, 2014, pp. 111–123.
[26] Y. Wang, J. See, R. C. W. Phan, Y. H. Oh, Efficient spatio-temporal local binary patterns for spontaneous facial micro-expression recognition, PLoS ONE 10(5) (2015) e0124674.
[27] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60(2) (2004) 91–110.
[28] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, Vol. 1, 2005, pp. 886–893.
[29] Y. L. Boureau, J. Ponce, Y. LeCun, A theoretical analysis of feature pooling in visual recognition, in: Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 111–118.
[30] P. Hamel, S. Lemieux, Y. Bengio, D. Eck, Temporal pooling and multiscale learning for automatic annotation and ranking of music audio, in: International Society for Music Information Retrieval Conference, 2011, pp. 729–734.
[31] C. Anitha, M. K. Venkatesha, B. S. Adiga, A survey on facial expression databases, International Journal of Engineering Science and Technology 2(10) (2010) 5158–5174.
[32] W. J. Yan, S. J. Wang, Y. J. Liu, Q. Wu, X. Fu, For micro-expression recognition: Database and suggestions, Neurocomputing 136 (2014) 82–87.
[33] S. Polikovsky, Y. Kameda, Y.
Ohta, Facial micro-expressions recognition using high speed camera and 3d-gradient descriptor, in: Crime Detection and Prevention, 2009, pp. 16–16.
[34] P. Ekman, Emotions Revealed, 2003.
[35] T. Pfister, X. Li, G. Zhao, M. Pietikäinen, Recognising spontaneous facial micro-expressions, in: International Conference on Computer Vision, 2011, pp. 1449–1456.
[36] B. K. Horn, B. G. Schunck, Determining optical flow, in: International Society for Optics and Photonics, 1981, pp. 319–331.
[37] J. L. Barron, D. J. Fleet, S. S. Beauchemin, Performance of optical flow techniques, International Journal of Computer Vision 12(1) (1994) 43–77.
[38] A. Bainbridge-Smith, R. G. Lane, Determining optical flow using a differential method, Image and Vision Computing 15(1) (1997) 11–22.
[39] R. W. Ogden, Non-Linear Elastic Deformations, Courier Corporation, 1997.
[40] C. A. Z. Barcelos, M. Boaventura, E. C. Silva Jr, A well-balanced flow equation for noise removal and edge detection, IEEE Transactions on Image Processing 12(7) (2003) 751–763.
[41] W. Gao, X. Zhang, L. Yang, H. Liu, An improved Sobel edge detection, in: Computer Science and Information Technology (ICCSIT), Vol. 5, 2010, pp. 67–71.
[42] M. Juneja, P. S. Sandhu, Performance evaluation of edge detection techniques for images in spatial domain, International Journal of Computer Theory and Engineering 1(5) (2009) 614–621.
[43] S. L. Happy, A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE Transactions on Affective Computing 6(1) (2015) 1–12.
[44] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 94–101.
[45] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, J. Budynek, The Japanese Female Facial Expression (JAFFE) database.
[46] P. Ekman, Lie catching and microexpressions, in: The Philosophy of Deception, 2009, pp. 118–133.
[47] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook, Springer, 2010, pp. 667–685.
[48] A. C. Le Ngo, R. C.
W. Phan, J. See, Spontaneous subtle expression recognition: Imbalanced databases and solutions, in: Asian Conference on Computer Vision, 2014, pp. 33–48.
[49] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, in: British Machine Vision Conference, 2009, pp. 124–1.
[50] A. Kläser, M. Marszałek, C. Schmid, A spatio-temporal descriptor based on 3d-gradients, in: British Machine Vision Conference, 2008, pp. 275–1.
[51] A. Kovashka, K. Grauman, Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, in: Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2046–2053.
[52] M. Hayat, M. Bennamoun, A. El-Sallam, Evaluation of spatiotemporal detectors and descriptors for facial expression recognition, in: Human System Interactions (HSI), 2012, pp. 43–47.
[53] Y. H. Oh, A. C. Le Ngo, J. See, S. T. Liong, R. C. W. Phan, H. C. Ling, Monogenic Riesz wavelet representation for micro-expression recognition, in: Digital Signal Processing, 2015, pp. 1237–1241.
[54] Y. Wang, J. See, R. C. W. Phan, Y. H. Oh, LBP with six intersection points: Reducing redundant information in LBP-TOP for micro-expression recognition, in: Computer Vision – ACCV, 2015, pp. 525–537.