Optimizing Filter Size in Convolutional Neural Networks for Facial Action Unit Recognition
Shizhong Han, Zibo Meng, Zhiyuan Li, James O'Reilly, Jie Cai, Xiaofeng Wang, Yan Tong
Department of Computer Science & Engineering, Department of Electrical Engineering, University of South Carolina, Columbia, SC
Abstract
Recognizing facial action units (AUs) during spontaneous facial displays is a challenging problem. Most recently, Convolutional Neural Networks (CNNs) have shown promise for facial AU recognition, where predefined and fixed convolution filter sizes are employed. In order to achieve the best performance, the optimal filter size is often found empirically through extensive experimental validation. Such a training process suffers from expensive training cost, especially as the network becomes deeper. This paper proposes a novel Optimized Filter Size CNN (OFS-CNN), where the filter sizes and weights of all convolutional layers are learned simultaneously from the training data along with learning convolution filters. Specifically, the filter size is defined as a continuous variable, which is optimized by minimizing the training loss. Experimental results on two AU-coded spontaneous databases have shown that the proposed OFS-CNN is capable of estimating the optimal filter size for varying image resolutions and outperforms traditional CNNs with the best filter size obtained by exhaustive search. The OFS-CNN also beats a CNN using multiple filter sizes and, more importantly, is much more efficient during testing with the proposed forward-backward propagation algorithm.
1. Introduction
Facial behavior is a natural and powerful means of human communication. The Facial Action Coding System (FACS) developed by Ekman and Friesen [6] describes facial behavior with a set of facial action units (AUs), each of which is anatomically related to the contraction of a set of facial muscles. An automatic AU recognition system has various applications in human-computer interaction (HCI), such as interactive games, advertisement impact analysis, and synthesizing human expressions. However, it remains a challenging problem to recognize facial AUs from spontaneous facial displays, especially with large variations in facial appearance caused by free head movements, occlusions, and illumination changes.

Extensive efforts have been focused on extracting features that are capable of capturing facial appearance and/or geometrical changes caused by AUs. While most of the earlier approaches employed handcrafted and general-purpose features, deep learning, especially CNN-based methods, has shown great promise in recognizing facial expressions or AUs [7, 24, 19, 15, 9, 12, 34, 17, 30, 21].

In CNNs, the size of the convolution filters determines the size of the receptive field from which information is extracted. CNN-based methods employ predefined and fixed filter sizes in each convolutional layer, which we call the traditional CNN hereafter. In general, larger filter sizes are employed in the lower convolutional layers, whereas smaller filter sizes are used in the upper layers [18, 4]. However, the fixed filter sizes are not necessarily optimal for all applications/tasks, nor for different image resolutions. Specifically, different AUs cause facial appearance changes over various regions at different scales and, therefore, may prefer different filter sizes. For example, long and deep nasolabial furrows are important for recognizing AU10 (upper lip raiser), while short "wrinkles in the skin above and below the lips" and small bulges below the lower lip are cues for recognizing AU23 (lip tightener) [6].

Given a predefined input image size, the best filter size is often selected experimentally or by visualization [32] for each convolutional layer. For example, Kim et al. [17], who achieved the best expression recognition performance in the EmotiW2015 challenge [5], experimentally selected the best filter sizes for the three convolutional layers.
However, with CNNs becoming deeper and deeper [23, 11], it is impractical to find the best filter size by exhaustive search due to the highly expensive training cost.
In this work, we propose a novel and feasible solution in a CNN framework to automatically learn the filter sizes for all convolutional layers simultaneously from the training data along with learning the convolution filters. In particular, we propose an Optimized Filter Size CNN (OFS-CNN), where the optimal filter size of each convolutional layer is estimated iteratively using stochastic gradient descent (SGD) during the backpropagation process. As illustrated in Figure 1, the filter size $k$ of a convolutional layer, which is a constant in traditional CNNs, is defined as a continuous variable in the OFS-CNN. During backpropagation, the filter size $k$ is updated, e.g., decreased when the partial derivative of the CNN loss with respect to the filter size is positive, i.e., $\partial L / \partial k > 0$, and vice versa.

In this work, a forward-backward propagation algorithm is proposed to estimate the filter size iteratively. To facilitate the convolution operation with a continuous filter size, upper-bound and lower-bound filters with integer sizes are defined. In the forward process, an activation resulting from a convolution operation with a continuous filter size can be calculated as the interpolation of the activations using the upper-bound and lower-bound filters. Furthermore, we show that only one convolution operation is needed with the upper-bound and lower-bound filters. Therefore, the proposed OFS-CNN has similar computational complexity to traditional CNNs in the forward process as well as in the testing process. During backpropagation, the partial derivative of the activation with respect to the filter size $k$ is defined, from which $\partial L / \partial k$ can be calculated. With a change in the filter size $k$, the sizes of the upper-bound and lower-bound filters may be updated via a transformation operation proposed in this work.

Experimental results on two benchmark AU-coded spontaneous databases, i.e., the FERA2015 BP4D database [26] and the Denver Intensity of Spontaneous Facial Action (DISFA) database [20], have demonstrated that the proposed OFS-CNN outperforms traditional CNNs with the best filter size obtained by exhaustive search and achieves state-of-the-art performance for AU recognition. Furthermore, the OFS-CNN also beats a deep CNN using multiple filter sizes, with a remarkable improvement in time efficiency during testing, which is highly desirable for real-time applications. In addition, the OFS-CNN is capable of estimating the optimal filter size for varying image resolutions.
2. Related Work
Extensive efforts have been devoted to extracting the most effective features that characterize facial appearance and geometry changes caused by the activation of facial expressions or AUs. Earlier approaches adopted various handcrafted features such as Gabor wavelets [3], histograms of Local Binary Patterns (LBP) [27], Histogram of Oriented Gradients (HOG) [2], Scale Invariant Feature Transform (SIFT) features [31], histograms of Local Phase Quantization (LPQ) [14], and their spatiotemporal extensions [14, 33, 29].

Figure 1. An overview of the proposed method to optimize the convolution filter size $k$ with the CNN loss backpropagation at the $t$-th iteration. $\partial L^t / \partial k^t$ is the partial derivative of the loss with respect to the filter size at the $t$-th iteration ($k^t$). The filter size $k$ will decrease when $\partial L^t / \partial k^t > 0$, and vice versa.

Most recently, CNNs have attracted increasing attention and shown great promise for facial expression and AU recognition [7, 24, 19, 15, 9, 12, 34, 17, 30, 21, 28, 25]. For example, the top 3 methods [17, 30, 21] in the recent EmotiW2015 challenge [5] are all based on CNNs and have been demonstrated to be more robust to real-world conditions for facial expression recognition. All of those CNN-based methods use fixed-size convolution filters.

To achieve the best performance, the optimal filter size is usually chosen empirically, by either experimental validation or visualization, for each convolutional layer [32]. For example, Kim et al. [17] experimentally compared facial expression recognition performance using different filter sizes and found that a CNN with 5×5, 4×4, and 5×5 filters in its three convolutional layers achieved the best performance on 42×42 input images. Zeiler and Fergus [32] found through visualization that 7×7 filters outperform 11×11 filters on the ImageNet dataset. However, such empirically selected filter sizes may not be optimal for all applications, nor for different image resolutions. Furthermore, it is impossible to perform an exhaustive search for the optimal combination of filter sizes of all convolutional layers in deep CNNs.

To achieve scale invariance, CNNs with multiple filter sizes have been developed. The inception module [23] concatenates the activation feature maps produced by 1×1, 3×3, and 5×5 filters. The Multigrid Neural Architecture [16] concatenates the feature maps activated by pyramid filters. However, all of those methods are still based on fixed filter sizes and, more importantly, demand a significant increase in time and space complexity due to the complex model structure.

In contrast, the proposed OFS-CNN is capable of learning and optimizing the filter sizes for all convolutional layers simultaneously in a CNN learning framework, which is desirable, especially as CNNs go deeper and deeper. Furthermore, we show that only one convolution operation is needed in the proposed forward-backward propagation algorithm. Thus, the proposed OFS-CNN has similar computational complexity to traditional CNNs and hence is more efficient than structures using multiple filter sizes.
3. Methodology
In this work, we propose an OFS-CNN, which is capable of optimizing and learning the filter size $k$ from the training data. In the following, we first give a brief review of the CNN, especially the convolutional layer, and then present the forward and backward propagation processes of the OFS-CNN.

A CNN consists of a stack of layers such as convolutional layers, pooling layers, rectification layers, fully connected (FC) layers, and loss layers. These layers transform the input data into highly nonlinear representations. Convolutional layers perform convolution on input images or feature maps from the previous layer with filters. Generally, the first convolutional layer extracts low-level image features such as edges, while the upper layers extract complex and task-related features.

Given an input image/feature map denoted by $\mathbf{x}$, an activation at the $i$-th row and the $j$-th column of a convolutional layer, denoted by $y_{ij}$, can be calculated by the convolution operation, i.e., the inner product of the filter and the input:

$$y_{ij}(k) = \mathbf{w}(k)^\top \mathbf{x}_{ij}(k) + b_{ij} \tag{1}$$

where $\mathbf{w}(k)$ is a convolution filter with filter size $k \times k$; $\mathbf{x}_{ij}(k)$ denotes the input within a $k \times k$ receptive field centered at the $i$-th row and the $j$-th column; and $b_{ij}$ is a bias. Traditionally, the filter size $k$ is a predefined integer and fixed throughout the training/testing process. In this work, $k \in \mathbb{R}_+$ is defined as a continuous variable that can be learned and optimized during CNN training.

In the forward process, convolution operations are conducted to calculate activations using the learned filters as in Eq. 1. However, the convolution operation can only be performed with integer-size filters in the CNN.
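To make Eq. 1 concrete: for one output location, the activation is simply the inner product of the filter with the receptive field, plus a bias. The following minimal NumPy sketch is our own illustration (the helper name conv_activation is hypothetical, not from the paper):

```python
import numpy as np

def conv_activation(w, x_ij, b=0.0):
    """Eq. 1: inner product of a k x k filter with one k x k receptive field."""
    return float(np.sum(w * x_ij) + b)

w = np.random.randn(5, 5)        # a 5x5 convolution filter w(k)
x_ij = np.random.randn(5, 5)     # receptive field x_ij(k) centered at (i, j)
print(conv_activation(w, x_ij))  # the activation y_ij(k)
```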
Upper-bound and lower-bound filters:
In order to build the relationship between the activation $y_{ij}$ and the continuous filter size $k$, we first define an upper-bound filter, denoted by $\mathbf{w}(k_+)$, and a lower-bound filter, denoted by $\mathbf{w}(k_-)$. Specifically, $k_+$ is the upper-bound filter size, i.e., the smallest odd number greater than $k$, while $k_-$ is the lower-bound filter size, i.e., the largest odd number less than or equal to $k$. $k_+$ and $k_-$ can be calculated as

$$k_+ = \left\lfloor \frac{k+1}{2} \right\rfloor \cdot 2 + 1, \qquad k_- = \left\lfloor \frac{k+1}{2} \right\rfloor \cdot 2 - 1 \tag{2}$$

Then, the activation $y_{ij}(k)$ can be defined as the linear interpolation of the activations of the upper-bound and lower-bound filters, denoted by $y_{ij}(k_+)$ and $y_{ij}(k_-)$, respectively:

$$y_{ij}(k) = \alpha\, y_{ij}(k_+) + (1 - \alpha)\, y_{ij}(k_-) \tag{3}$$

where $y_{ij}(k_+)$ and $y_{ij}(k_-)$ are calculated as in Eq. 1 with the same bias but with the upper-bound and lower-bound filters $\mathbf{w}(k_+)$ and $\mathbf{w}(k_-)$, respectively, and $\alpha = \frac{k - k_-}{2}$ is the linear interpolation weight.

Remark 1.
A cubic interpolation could also be used to build the relationship between the activation $y_{ij}$ and the continuous variable $k$. However, it has a higher computational complexity and requires at least three points, whereas the linear interpolation needs only the two points $k_-$ and $k_+$.

Remark 2.
The filter size $k$ is in fact a weight-related filter size in the interval $[k_-, k_+)$ and can be calculated as:

$$k = k_- + 2\alpha \tag{4}$$
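Eqs. 2-4 can be made concrete with a few lines of NumPy. The sketch below, which continues the snippet above and is again our own illustration, computes the bound sizes and the interpolated activation for a single receptive field, with the lower-bound filter sharing the inner coefficients of the upper-bound filter:

```python
def bounds(k):
    """Eq. 2: the odd lower/upper-bound sizes around a continuous size k."""
    k_minus = int((k + 1) // 2) * 2 - 1
    return k_minus, k_minus + 2

def interp_activation(w_plus, x_ij, k, b=0.0):
    """Eq. 3: linearly interpolate the two bound-filter activations."""
    k_minus, k_plus = bounds(k)
    alpha = (k - k_minus) / 2.0                  # interpolation weight
    w_minus = np.pad(w_plus[1:-1, 1:-1], 1)      # zero-padded lower-bound filter
    y_plus = conv_activation(w_plus, x_ij, b)    # Eq. 1 with w(k+)
    y_minus = conv_activation(w_minus, x_ij, b)  # Eq. 1 with zero-padded w(k-)
    return alpha * y_plus + (1 - alpha) * y_minus

k = 4.3                                          # a continuous filter size
k_minus, k_plus = bounds(k)                      # -> (3, 5), so alpha = 0.65
w_plus = np.random.randn(k_plus, k_plus)         # upper-bound filter
x_ij = np.random.randn(k_plus, k_plus)           # k+ x k+ receptive field
print(interp_activation(w_plus, x_ij, k))        # y_ij(k); Eq. 4 gives back k = 3 + 2*0.65
```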
Convolution with a continuous filter size: Following Remark 2, we can explicitly define the filter $\mathbf{w}(k)$ with a continuous size $k$. As shown in Fig. 2, the upper-bound and lower-bound filters share the same coefficients in the inner (green) region and differ only by the (pink) ring boundary, denoted by $\triangle\mathbf{w}(k_+)$. Let $\triangle\mathbf{w}(k_+) = \mathbf{w}(k_+) - \mathbf{w}(k_-)$ be the ring boundary with zeros inside, as shown in Fig. 2; then the filter $\mathbf{w}(k)$ with a continuous size $k$ can be defined as follows:

$$\mathbf{w}(k) = \alpha\, \triangle\mathbf{w}(k_+) + \mathbf{w}(k_-) \tag{5}$$

Remark 3.
In Eq. 5, $\mathbf{w}(k)$ and $\triangle\mathbf{w}(k_+)$ have an actual filter size of $k_+$, while $\mathbf{w}(k_-)$ is zero-padded to the size $k_+$.

Lemma 1.
Given the definition of the filter $\mathbf{w}(k)$ in Eq. 5, the activation $y_{ij}(k)$ in Eq. 3 can be simplified as:

$$y_{ij}(k) = \mathbf{w}(k)^\top \mathbf{x}_{ij}(k_+) + b_{ij} \tag{6}$$
Proof. Eq. 6 can be deduced from Eq. 3 as follows:

$$y_{ij}(k) = \alpha\, y_{ij}(k_+) + (1 - \alpha)\, y_{ij}(k_-) = \alpha\, \mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+) + (1 - \alpha)\, \mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_-) + b_{ij} \tag{7}$$

After padding zeros for $\mathbf{w}(k_-)$, $\mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_-)$ is equivalent to $\mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_+)$. Then, Eq. 7 can be simplified as follows:

$$\begin{aligned} y_{ij}(k) &= \alpha\, \mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+) + (1 - \alpha)\, \mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_+) + b_{ij} \\ &= \left[ \alpha\, \mathbf{w}(k_+)^\top + (1 - \alpha)\, \mathbf{w}(k_-)^\top \right] \mathbf{x}_{ij}(k_+) + b_{ij} \\ &= \left[ \alpha\, \triangle\mathbf{w}(k_+)^\top + \mathbf{w}(k_-)^\top \right] \mathbf{x}_{ij}(k_+) + b_{ij} \end{aligned} \tag{8}$$

By substituting Eq. 5 into Eq. 8, we have

$$y_{ij}(k) = \mathbf{w}(k)^\top \mathbf{x}_{ij}(k_+) + b_{ij} \tag{9}$$

Thus, the activation $y_{ij}(k)$ can be simplified as in Eq. 6.

Figure 2. An illustrative definition of a filter with a continuous filter size $k \in \mathbb{R}_+$. $\mathbf{w}(k_+)$ and $\mathbf{w}(k_-)$ are the upper-bound and lower-bound filters, respectively, and share the same elements in the green region. The pink region $\triangle\mathbf{w}(k_+)$ denotes the difference between the upper-bound and lower-bound filters and has a ring shape with zeros inside. $\alpha$ is the linear interpolation weight associated with the upper-bound filter $\mathbf{w}(k_+)$. $\mathbf{w}(k)$ is a weight-related filter with a continuous filter size $k$.
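Lemma 1 is easy to verify numerically. Continuing the sketch above (our own illustration), a single inner product with $\mathbf{w}(k)$ built via Eq. 5 reproduces the interpolated activation of Eq. 3:

```python
# Build w(k) as in Eq. 5 and check Eq. 6 against Eq. 3.
w_minus = np.pad(w_plus[1:-1, 1:-1], 1)   # zero-padded lower-bound filter
dw = w_plus - w_minus                      # ring boundary dw(k+), zeros inside
alpha = (k - k_minus) / 2.0
w_k = alpha * dw + w_minus                 # Eq. 5
assert np.isclose(conv_activation(w_k, x_ij),          # Eq. 6: one convolution
                  interp_activation(w_plus, x_ij, k))  # Eq. 3: interpolation
```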
Remark 4. According to Eq. 6, only one convolution operation needs to be performed to calculate each activation $y_{ij}(k)$. Therefore, the time complexity does not increase compared with the traditional CNN in the forward training process as well as in the testing process.

In the backward process, since the relationship between the activation and the filter size has been defined in Eq. 3, the partial derivative of the activation $y_{ij}$ w.r.t. the filter size can be calculated from the definition of the derivative:

$$\frac{\partial y_{ij}(k)}{\partial k} = \lim_{\triangle k \to 0} \frac{y_{ij}(k + \triangle k) - y_{ij}(k - \triangle k)}{2 \triangle k} \tag{10}$$

When $k + \triangle k$ and $k - \triangle k$ are in the interval $[k_-, k_+)$, the derivative at each point, $\frac{\partial y_{ij}(k)}{\partial k}$, is equal to the slope of the interpolating line because of the linear interpolation. Hence, the partial derivative can be calculated as follows:

$$\frac{\partial y_{ij}(k)}{\partial k} = \frac{y_{ij}(k_+) - y_{ij}(k_-)}{k_+ - k_-} \tag{11}$$

Substituting Eq. 1 into Eq. 11, we have

$$\frac{\partial y_{ij}(k)}{\partial k} = \frac{\mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+) - \mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_-)}{k_+ - k_-} \tag{12}$$

By padding zeros for $\mathbf{w}(k_-)$, we can simplify Eq. 12 as

$$\frac{\partial y_{ij}(k)}{\partial k} = \frac{\left[ \mathbf{w}(k_+)^\top - \mathbf{w}(k_-)^\top \right] \mathbf{x}_{ij}(k_+)}{k_+ - k_-} = \frac{\triangle\mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+)}{k_+ - k_-} \tag{13}$$

Based on Eq. 13, the partial derivative of the loss $L$ w.r.t. $k$ can be calculated with the chain rule:

$$\frac{\partial L}{\partial k} = \sum_{i,j} \frac{\partial L}{\partial y_{ij}} \frac{\partial y_{ij}}{\partial k} \tag{14}$$
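Because the interpolation is linear in $k$ within $[k_-, k_+)$, the analytic derivative of Eq. 13 agrees exactly with a symmetric finite difference. Continuing the sketch above (our own illustration):

```python
# Eq. 13: dy/dk = dw(k+)^T x(k+) / (k+ - k-)
grad_analytic = np.sum(dw * x_ij) / (k_plus - k_minus)

eps = 1e-3  # small enough that k +/- eps stays inside [k-, k+)
grad_numeric = (interp_activation(w_plus, x_ij, k + eps)
                - interp_activation(w_plus, x_ij, k - eps)) / (2 * eps)
assert np.isclose(grad_analytic, grad_numeric)
```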
Updating the filter size: Given the partial derivative of the loss $L$ w.r.t. $k$, the filter size $k$ can be updated iteratively with the SGD strategy at the $(t+1)$-th iteration as follows:

$$k^{t+1} = k^t - \gamma \frac{\partial L^t}{\partial k^t} \tag{15}$$

where $\gamma$ is the learning rate.

Since the lower-bound filter $\mathbf{w}^t(k_-)$ is defined as the inner part of the upper-bound filter $\mathbf{w}^t(k_+)$, we only need to perform backpropagation for the upper-bound filter $\mathbf{w}^t(k_+)$, which can be divided into two parts as $\mathbf{w}^t(k_+) = \mathbf{w}^t(k_-) + \triangle\mathbf{w}^t(k_+)$, where $\triangle\mathbf{w}^t(k_+)$ is the ring boundary with zeros inside and $\mathbf{w}^t(k_-)$ is padded with zeros. Then, the forward activation function in Eq. 6 can be reorganized as:

$$\begin{aligned} y_{ij}^t(k^t) &= \mathbf{w}^t(k^t)^\top \mathbf{x}_{ij}^t(k_+^t) + b_{ij}^t \\ &= \left[ \alpha^t \triangle\mathbf{w}^t(k_+^t)^\top + \mathbf{w}^t(k_-^t)^\top \right] \mathbf{x}_{ij}^t(k_+^t) + b_{ij}^t \\ &= \alpha^t \triangle\mathbf{w}^t(k_+^t)^\top \triangle\mathbf{x}_{ij}^t(k_+^t) + \mathbf{w}^t(k_-^t)^\top \mathbf{x}_{ij}^t(k_-^t) + b_{ij}^t \end{aligned} \tag{16}$$

where $\triangle\mathbf{x}_{ij}^t(k_+^t)$ is the ring boundary of $\mathbf{x}_{ij}^t(k_+^t)$ in the input image/feature map with zeros inside, and $\mathbf{x}_{ij}^t(k_-^t)$ is padded with zeros.

Hence, the partial derivative of the activation $y_{ij}^t$ w.r.t. the upper-bound filter $\mathbf{w}^t(k_+^t)$ can be calculated as follows:

$$\frac{\partial y_{ij}^t}{\partial \mathbf{w}^t(k_+^t)} = \mathbf{x}_{ij}^t(k_-^t)^\top + \alpha^t \triangle\mathbf{x}_{ij}^t(k_+^t)^\top \tag{17}$$

With the chain rule, the derivative of the CNN loss w.r.t. $\mathbf{w}^t(k_+^t)$ can be calculated as

$$\frac{\partial L^t}{\partial \mathbf{w}^t(k_+^t)} = \sum_{i,j} \frac{\partial L^t}{\partial y_{ij}^t} \frac{\partial y_{ij}^t}{\partial \mathbf{w}^t(k_+^t)} \tag{18}$$

Thus, the upper-bound filter $\mathbf{w}(k_+)$ can be updated iteratively using the SGD strategy. As a result, the filter $\mathbf{w}(k)$ with a continuous size $k$ can be updated as in Eq. 5.

Figure 3. When the filter size $k$ is updated during backpropagation, it may fall outside the interval $[k_-^t, k_+^t)$. In this case, transformation operations are needed to update the sizes of the upper-bound and lower-bound filters after updating their coefficients. Specifically, an expanding operation is employed to increase the sizes of both the upper-bound and lower-bound filters, whereas a shrinking operation is used to decrease the filter sizes.

Transforming the upper-bound and lower-bound filters:
According to Eq. 15, the filter size $k$ is continuously updated over time. As long as $k^{t+1}$ lies in the interval $[k_-^t, k_+^t)$, the upper-bound and lower-bound filters keep the same sizes as in the $t$-th iteration, i.e., $k_-^{t+1} = k_-^t$ and $k_+^{t+1} = k_+^t$. However, as the filter size $k$ is updated, $k^{t+1}$ may fall outside the interval $[k_-^t, k_+^t)$. Consequently, the sizes of both the upper-bound and lower-bound filters should be updated. As illustrated in Fig. 3, we define transformation operations, including expanding and shrinking, to update the upper-bound and lower-bound filters to accommodate a size change. Note that the transformation operations are conducted after updating the coefficients of the upper-bound and lower-bound filters.

Expanding:
When $k^{t+1} > k_+^t$, the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ are updated by an expanding operation as follows:

$$\mathbf{w}^{t+1}(k_-^{t+1}) = \mathbf{w}^{t+1}(k_+^{t}), \qquad \mathbf{w}^{t+1}(k_+^{t+1}) = \mathrm{expand}\left( \mathbf{w}^{t+1}(k_+^{t}) \right) \tag{19}$$

where $\mathbf{w}^{t+1}(k_+^t)$ denotes the upper-bound filter after its coefficients have been updated, and $\mathrm{expand}(\cdot)$ is a function that increases the filter size, specifically by padding values from the nearest neighbors of the original filter, as illustrated in Figure 4.

Shrinking:
As opposed to the $\mathrm{expand}(\cdot)$ function, when $k^{t+1} < k_-^t$, the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ are shrunk as follows:

$$\mathbf{w}^{t+1}(k_+^{t+1}) = \mathbf{w}^{t+1}(k_-^{t}), \qquad \mathbf{w}^{t+1}(k_-^{t+1}) = \mathrm{shrink}\left( \mathbf{w}^{t+1}(k_-^{t}) \right) \tag{20}$$

where $\mathrm{shrink}(\cdot)$ is a function that decreases the filter size, specifically by filling the boundary with zeros, as shown in Figure 4.

Figure 4. An illustration of the shrink and expand operations used to change the filter size. The shrink operation sets the outside boundary to zeros, while the expand operation pads the outside boundary with the nearest neighbors from the original filter.
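A minimal sketch of the two transformation operations on plain NumPy arrays (our own illustration; the function names mirror Eqs. 19 and 20):

```python
import numpy as np

def expand(w):
    """Eq. 19: grow a filter by one ring, padding with its nearest neighbors."""
    return np.pad(w, 1, mode='edge')

def shrink(w):
    """Eq. 20: keep the inner part of a filter and set its boundary to zeros."""
    out = np.zeros_like(w)
    out[1:-1, 1:-1] = w[1:-1, 1:-1]
    return out
```

For instance, expand turns a 3x3 filter into a 5x5 one whose outer ring replicates the nearest original entries, exactly the nearest-neighbor padding described above.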
Remark 5.
There are alternative methods that could be used to expand or shrink the filters. For example, we also tried resizing the filter by bicubic interpolation; however, the recognition performance became worse. The reason is that the filters learned in the previous iterations are distorted after scaling and thus may fail to activate the patterns in the images. In contrast, the proposed expand and shrink functions preserve the learned filters well.
Updating other parameters:
In addition to updating the filter size $k$ and the convolution filter $\mathbf{w}(k)$, we should also update the bias $b_{ij}$ and the feature $\mathbf{x}_{ij}$ during backpropagation. Based on the forward activation function defined in Eq. 6, the derivative of the activation $y_{ij}^t$ w.r.t. $\mathbf{x}_{ij}^t(k_+^t)$ can be calculated as:

$$\frac{\partial y_{ij}^t}{\partial \mathbf{x}_{ij}^t(k_+^t)} = \mathbf{w}^t(k^t) \tag{21}$$

With the chain rule, the derivative of the CNN loss w.r.t. $\mathbf{x}_{ij}^t(k_+^t)$ can be calculated as:

$$\frac{\partial L^t}{\partial \mathbf{x}_{ij}^t(k_+^t)} = \frac{\partial L^t}{\partial y_{ij}^t} \frac{\partial y_{ij}^t}{\partial \mathbf{x}_{ij}^t(k_+^t)} \tag{22}$$

Algorithm 1: The forward-backward propagation algorithm for the OFS-CNN
Input: input images or feature maps from the previous layer $\mathbf{x}$, and an initial filter size $k \in \mathbb{R}_+$.
Initialization: initialize $k_+$ and $k_-$ as in Eq. 2; randomly initialize the convolution filter $\mathbf{w}(k)$.
for iteration $t$ from 1 to $T$ do
    // Forward:
    $\mathbf{w}^t(k_-^t) = \mathrm{shrink}(\mathbf{w}^t(k_+^t))$
    Calculate the convolution filter $\mathbf{w}^t(k^t)$ based on Eq. 5
    Calculate the forward activation $y_{ij}(k)$ based on Eq. 6
    // Backward:
    Calculate the derivatives of the activation w.r.t. $k^t$, $\mathbf{w}^t(k_+^t)$, and $\mathbf{x}^t$ based on Eqs. 13, 17, and 21, respectively
    Calculate the derivatives of the loss w.r.t. $k^t$, $\mathbf{w}^t(k_+^t)$, and $\mathbf{x}^t$ based on Eqs. 14, 18, and 22, respectively
    Update $k^{t+1}$, $\mathbf{w}^{t+1}(k_+^{t+1})$, and $\mathbf{x}^{t+1}$ based on SGD
    Update the bias using standard CNN backpropagation
    // Transformation:
    if $k^{t+1} > k_+^t$ then
        $k_-^{t+1} = k_+^t$; $k_+^{t+1} = k_+^t + 2$
        Expand the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ as in Eq. 19
    else if $k^{t+1} < k_-^t$ then
        $k_+^{t+1} = k_-^t$; $k_-^{t+1} = k_-^t - 2$
        Shrink the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ as in Eq. 20
    end if
end for

Hence, the feature $\mathbf{x}_{ij}$ can be updated using the SGD strategy and further backpropagated to update the parameters in the lower layers. The backpropagation of $b_{ij}^t$ is exactly the same as in traditional CNNs. The forward and backward propagation process of the proposed OFS-CNN is summarized in Algorithm 1.
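The paper's implementation is built on Caffe [13]. Purely as an illustration, the sketch below re-expresses one OFS convolutional layer in PyTorch under our own assumptions: the class name OFSConv2d is hypothetical, and we let autograd learn the interpolation weight $\alpha$ directly, which by Eq. 4 is equivalent to learning $k$ up to a factor of 2 absorbed into the learning rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFSConv2d(nn.Module):
    """Hypothetical sketch of a convolution with a learnable filter size."""
    def __init__(self, in_ch, out_ch, k_init=4.0):
        super().__init__()
        k_minus = 2 * int((k_init + 1) // 2) - 1   # Eq. 2: largest odd <= k_init
        self.k_plus = k_minus + 2                  # Eq. 2: smallest odd > k_init
        # Only the upper-bound filter is stored; the lower-bound filter is its
        # zero-padded inner (k- x k-) part, as in Algorithm 1.
        self.weight = nn.Parameter(
            0.01 * torch.randn(out_ch, in_ch, self.k_plus, self.k_plus))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # alpha = (k - k-)/2 (Eq. 3); k is recovered as k = k- + 2*alpha (Eq. 4)
        self.alpha = nn.Parameter(torch.tensor((k_init - k_minus) / 2.0))

    def forward(self, x):
        w_minus = F.pad(self.weight[:, :, 1:-1, 1:-1], (1, 1, 1, 1))  # shrink(w(k+))
        dw = self.weight - w_minus                                    # ring dw(k+)
        w_k = self.alpha * dw + w_minus                               # Eq. 5
        # Lemma 1 (Eq. 6): a single convolution suffices in the forward pass
        return F.conv2d(x, w_k, self.bias, padding=self.k_plus // 2)

    @torch.no_grad()
    def transform(self):
        """Transformation step of Algorithm 1, called after each SGD update."""
        if self.alpha.item() >= 1.0:     # k grew past k+: expand (Eq. 19)
            self.weight = nn.Parameter(
                F.pad(self.weight, (1, 1, 1, 1), mode='replicate'))
            self.k_plus += 2
            self.alpha -= 1.0
        elif self.alpha.item() < 0.0:    # k fell below k-: shrink (Eq. 20)
            self.weight = nn.Parameter(self.weight[:, :, 1:-1, 1:-1].contiguous())
            self.k_plus -= 2
            self.alpha += 1.0
```

In a training loop, one would call layer.transform() after each optimizer step; resizing the weight creates a new Parameter, so the optimizer (and any momentum state) must be rebuilt afterwards, a bookkeeping detail omitted here.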
4. Experiments
To demonstrate the effectiveness of the proposed model, extensive experiments have been conducted on two benchmark AU-coded databases, i.e., the BP4D database [26] and the DISFA database [20], containing spontaneous facial behavior with moderate head movements. Specifically, the BP4D database [26] has 11 AUs and 41 subjects with 146,847 images, and the DISFA database [20] has 12 AUs and 27 subjects with 130,814 images. Following the experimental setup of the state-of-the-art methods (DRML [34] and PL-CNN [28]), two AUs, i.e., AU5 and AU20, which appear in only a small fraction of the frames in the DISFA database, are not considered in the experiments.

First, facial landmarks are detected, from which face alignment is conducted to reduce the variations from scaling and in-plane rotation. For the DISFA database [20], 66 landmarks are detected using a state-of-the-art method [1]. For the BP4D database [26], the 49 landmarks provided with the database are used for face alignment. Based on the extracted facial landmarks, face regions are aligned based on three fiducial points, i.e., the centers of the two eyes and the mouth, and then scaled to the target resolution. Following the work in [10], each face image is warped to a frontal view to reduce variations from face pose; then sequence normalization is performed by subtracting the mean and dividing by the standard deviation calculated from the video sequence, to reduce identity-related information and to enhance the appearance and geometrical changes caused by AUs.

The proposed OFS-CNN is modified from cifar10_quick in Caffe [13], which consists of three convolutional layers, two average pooling layers, two FC layers, and a weighted sigmoid cross-entropy loss layer for calculating the loss. Specifically, all the convolutional layers have a stride of 1. The first two convolutional layers have 32 filters, whose output feature maps are sent to a rectification layer followed by an average pooling layer with a downsampling stride of 3. The last convolutional layer has 64 filters, whose output feature maps are fed into an FC layer with 64 nodes. Finally, the output of the last FC layer, which contains a single node, is sent to the loss layer. SGD, with a momentum of 0.9 and a mini-batch size of 100, is used for training the CNN. One CNN model is trained for each AU. All filter sizes are 5×5 in the original cifar10_quick [13], which is used as the baseline CNN for comparison. In the OFS-CNN, all filter sizes are initialized to 4, implying $\alpha = 0.5$, $k_+ = 5$, and $k_- = 3$.

The proposed OFS-CNN is compared with the baseline CNN with fixed convolution filter sizes on the two benchmark datasets. Since the BP4D database [26] provides training and development partitions, the average performance of five runs is reported to reduce the influence of randomness during training. For the DISFA database [20], a 9-fold cross-validation strategy is employed, such that the training and testing subjects are mutually exclusive. Experimental results are reported in terms of the average F1 score and the 2AFC score (area under the ROC curve). In the experiments, three resolutions, i.e., 64×48, 128×96, and 256×192, are employed to evaluate the proposed OFS-CNN.
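As an illustration of the sequence normalization step described above, a minimal sketch (our own; the variable names are hypothetical):

```python
import numpy as np

def normalize_sequence(frames):
    """Subtract the per-sequence mean image and divide by the per-sequence
    standard deviation to suppress identity-related appearance."""
    frames = frames.astype(np.float32)   # shape: (num_frames, height, width)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0) + 1e-8      # guard against division by zero
    return (frames - mean) / std

video = np.random.rand(100, 128, 96)     # a toy 100-frame aligned face sequence
normalized = normalize_sequence(video)
```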
Table 1. Performance comparison of the proposed OFS-CNNs and traditional CNNs with varying filter size on the BP4D database [26]. In the 1-layer OFS-CNN, the filter size is learned only for the first layer. The average converged filter size is reported for each AU. The results are calculated from 5 runs in terms of the average F1 score and the 2AFC score. The underline highlights the best performance among the 4 fixed filter sizes; the bold highlights the best performance among all models.

AUs | CNN-Filter3 (F1, 2AFC) | CNN-Filter5 (F1, 2AFC) | CNN-Filter7 (F1, 2AFC) | CNN-Filter9 (F1, 2AFC) | 1-layer OFS-CNN (F1, 2AFC, Converged Size) | 3-layer OFS-CNN (F1, 2AFC, Converged Size)
AU1 | 0.315, 0.577 | 0.313, 0.578 | 0.310, 0.577 | 0.315, 0.583 | 0.320, 0.586, 6.0 | -
Exhaustive search vs optimization of filter size:
We first show that the proposed OFS-CNN is capable of learning the optimal filter sizes. Specifically, baseline CNNs are designed with varying filter sizes, including 3×3, 5×5, 7×7, and 9×9, in the first convolutional layer. In addition to the 3-layer OFS-CNN, where the filter sizes of all three convolutional layers are learned, a 1-layer OFS-CNN is designed, where the filter size is learned only for the first layer. All the baseline CNNs and the 1-layer OFS-CNN use fixed filter sizes (5×5) for the other two convolutional layers. All the models in comparison are trained on the training partition and tested on the development partition of the BP4D database [26]. The results, calculated as the average of 5 runs, are reported in Table 1. The average filter size of the OFS-CNNs is reported for each AU at the 3,000th iteration, by which most of the CNN models have converged in our experiments.

As shown in Table 1, the 1-layer OFS-CNN not only outperforms CNN-Filter5, the configuration of the original cifar10_quick [13], in terms of the average F1 score (0.501 vs. 0.499) and the average 2AFC score (0.666 vs. 0.664), but also achieves performance similar to that of CNN-Filter7, which has the best performance among all baseline CNNs. Furthermore, the 3-layer OFS-CNN beats all models in comparison in terms of the average F1 score and 2AFC score. This demonstrates that the proposed OFS-CNN is superior to the best CNN model obtained by exhaustive search. In addition, the learned filter size is often consistent with the best filter size obtained by exhaustive search, i.e., the best filter size found by exhaustive search is either the upper-bound or the lower-bound filter size in the OFS-CNN.
OFS-CNNs on different image resolutions:
We also show that the learned filter sizes adapt well to changes in image resolution. Specifically, experiments have been conducted to compare the proposed OFS-CNN and the baseline CNN on the BP4D database [26] with different resolutions of the input images. All the CNN models have a network structure similar to that described above. In order to accommodate the changes in resolution, the number of nodes in the first FC layer is set to 64, 128, and 256 for resolutions of 64×48, 128×96, and 256×192, respectively, for all models in comparison. In this set of experiments, the 3-layer OFS-CNN is employed, and the average converged filter sizes for each AU under each resolution are reported in Table 3.

Table 2. Performance comparison of the proposed OFS-CNN and the baseline CNN for varying image resolutions (64×48, 128×96, and 256×192) on the BP4D database [26] in terms of the average F1 score. The bold highlights the best performance among all models.

AU2: 0.277
AU7: 0.643, 0.634, 0.642
AU14: 0.517, 0.532, 0.552
AU23: 0.348, 0.355, 0.381, 0.398, 0.354
AVE: 0.499, 0.515, 0.522, 0.533, 0.478

Table 3. The average converged filter sizes for varying image resolutions on the BP4D database [26]. The bold highlights the filter sizes with the best performance.

AU7: 5.0, 4.8, 4.7
AU14: 5.1, 4.6, 4.5
AU23: 5.4, 4.6, 4.7, 6.0, 4.7, 4.7
AVE (conv1, conv2, conv3): 5.0, 5.0, 4.8 at 64×48; 5.5, 5.0, 5.0 at 128×96; 5.7, 5.0, 5.0 at 256×192

As shown in Table 2, most AUs prefer a higher image resolution, which preserves subtle cues of facial appearance changes. However, the performance of the baseline CNN decreases for the highest resolution, 256×192. When the image resolution increases, the receptive field of the same 5×5 filter covers a smaller actual area of the whole face compared to lower resolutions. In contrast, the proposed OFS-CNN can optimize the filter size at various image resolutions. As shown in Table 3, the OFS-CNN has the largest average filter size, 5.7, for conv1 (the first convolutional layer) at 256×192 and thus benefits from an increased receptive field owing to the 7×7 upper-bound filter. As a result, the OFS-CNN outperforms the baseline CNN for all image resolutions, especially for 256×192 by 6%, in terms of the average F1 score.

Comparison with the CNNs using multiple filter sizes:
We also compare the proposed OFS-CNN to a CNN structure with multiple filter sizes, i.e., the inception module [23]. In particular, the GoogLeNet [23] with 7 inception modules is trained and evaluated on the BP4D database.

Table 4. Comparison with the GoogLeNet on the BP4D database in terms of F1 score.

AUs (occurrence %) | GoogLeNet | OFS-CNN
AU1 (23.1%) | 0.369 | 0.345
AU2 (17.9%) | 0.267 | 0.303
AU4 (22.7%) | - | -
AU23 (17.0%) | 0.376 | 0.398
AVE | 0.531 | 0.533
As shown in Table 4, the OFS-CNN with a shallow structure (15 layers, trained in 3,000 iterations) performs noticeably better than the GoogLeNet (100 layers, trained in 20,000 iterations) in terms of the average F1 score. The improvement becomes more substantial for the AUs with a lower occurrence rate, such as AU2 (17.9%) and AU23 (17.0%). Furthermore, the GoogLeNet is much more complex than the OFS-CNN and thus demands more training data. Note that the proposed OFS-CNN runs more than 8 times faster on a 128×96 image and more than 6 times faster on a 256×192 image than the GoogLeNet during testing, which is critical and hence highly desirable for real-time applications.

Comparison with the baseline CNN on the DISFA database [20]:
As illustrated in Table 5, the proposed OFS-CNN also outperforms the baseline CNN by a notable margin in terms of the average F1 score on the DISFA database [20]. The experiments are conducted using the 3-layer OFS-CNN.

Table 5. Performance comparison with the baseline CNN on the DISFA database [20] in terms of the average F1 score and the 2AFC score.

AUs | CNN (baseline): F1, 2AFC | OFS-CNN: F1, 2AFC
AU1 | - | -
AU4 | - | -
AU12 | 0.786, - | -
AU25 | - | -
AVE | - | -

Comparison with state-of-the-art methods:
In addition to the baseline CNN, we further compare the proposed OFS-CNN with state-of-the-art methods, particularly the most recent approaches based on CNNs [8, 9, 34, 28], on the two benchmark databases. As shown in Table 6, the proposed OFS-CNN achieves state-of-the-art AU recognition performance on both databases.

Table 6. Performance comparison with the state-of-the-art CNN-based methods on the BP4D and the DISFA databases in terms of F1 score and 2AFC score.

BP4D:
Methods | F1 | 2AFC
DL [9] | 0.522 | N/A
AlexNet [34] | 0.384 | 0.422
LCN [34] | 0.466 | 0.544
ConvNet [34] | 0.470 | 0.518
DRML [34] | 0.483 | 0.560
PL-CNN [28] | 0.491 | N/A
OFS-CNN | - | -

DISFA:
Methods | F1 | 2AFC
ML-CNN [8] | N/A | 0.757
AlexNet [34] | 0.236 | 0.491
LCN [34] | 0.240 | 0.468
ConvNet [34] | 0.231 | 0.458
DRML [34] | 0.267 | 0.523
PL-CNN [28] | 0.584 | N/A
OFS-CNN | - | -

Note: the performance of the ML-CNN was reported for 10 AUs on the DISFA database [20].
5. Conclusion and Future Work
Traditional CNNs have a predefined and fixed integer filter size for each convolutional layer, which may not be optimal for all tasks, nor for all image resolutions. In this work, we proposed a novel OFS-CNN with a forward-backward propagation algorithm to iteratively optimize the filter size while learning the convolution filters. Upper-bound and lower-bound filters are defined to facilitate convolution operations with continuous-size filters, and transformation operations are developed to accommodate size changes of the filters. Experimental results on two benchmark AU-coded spontaneous databases have shown that the OFS-CNN outperforms the baseline CNNs with the best filter size found by exhaustive search, and achieves performance better than or at least comparable to state-of-the-art CNN-based methods. Furthermore, the OFS-CNN has been shown to adapt filter sizes automatically to different image resolutions. In the current implementation, all channels of a single convolutional layer share a single filter size. In the future, the OFS-CNN will be extended to learn a filter size for each channel, which would be more effective for learning variously sized patterns.

References

[1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In CVPR, pages 3444-3451, 2013.
[2] T. Baltrusaitis, M. Mahmoud, and P. Robinson. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In FG, volume 6, pages 1-6, 2015.
[3] M. S. Bartlett, G. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel, and J. R. Movellan. Recognizing facial expression: Machine learning and application to spontaneous behavior. In CVPR, pages 568-573, 2005.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[5] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In ICMI, pages 423-426, 2015.
[6] P. Ekman, W. V. Friesen, and J. C. Hager. Facial Action Coding System: the Manual. Research Nexus, Network Information Research Corp., Salt Lake City, UT, 2002.
[7] B. Fasel. Head-pose invariant facial expression recognition using convolutional neural networks. In ICMI, pages 529-534, 2002.
[8] S. Ghosh, E. Laksana, S. Scherer, and L.-P. Morency. A multi-label convolutional neural network approach to cross-domain action unit detection. In ACII, 2015.
[9] A. Gudi, H. E. Tasli, T. M. den Uyl, and A. Maroulis. Deep learning based FACS action unit occurrence and intensity estimation. In FG, 2015.
[10] S. Han, Z. Meng, S. Khan, and Y. Tong. Incremental boosting convolutional neural network for facial action unit recognition. In NIPS, pages 109-117, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[12] S. Jaiswal and M. F. Valstar. Deep learning the dynamic appearance and shape of facial action units. In WACV, 2016.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675-678, 2014.
[14] B. Jiang, B. Martinez, M. F. Valstar, and M. Pantic. Decision level fusion of domain specific regions for facial action recognition. In ICPR, pages 1776-1781, 2014.
[15] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In ICCV, pages 2983-2991, 2015.
[16] T.-W. Ke, M. Maire, and S. X. Yu. Multigrid neural architectures. In CVPR, 2017.
[17] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In ICMI, pages 427-434, 2015.
[18] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In CVPR Workshops, pages 34-42, 2015.
[19] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen. Deeply learning deformable facial action parts model for dynamic expression analysis. In ACCV, 2014.
[20] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. DISFA: A spontaneous facial action intensity database. IEEE Trans. on Affective Computing, 4(2):151-160, 2013.
[21] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In ICMI, pages 443-449, 2015.
[22] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, pages 4053-4061, 2016.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[24] Y. Tang. Deep learning using linear support vector machines. In ICML, 2013.
[25] Z. Tősér, L. A. Jeni, A. Lőrincz, and J. Cohn. Deep learning for facial action unit detection under large head poses. In ECCV, pages 359-371, 2016.
[26] M. Valstar, J. Girard, T. Almaev, G. McKeown, M. Mehu, L. Yin, M. Pantic, and J. Cohn. FERA 2015 - second facial expression recognition and analysis challenge. In FG, 2015.
[27] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer. Meta-analysis of the first facial expression recognition challenge. IEEE T-SMC-B, 42(4):966-979, 2012.
[28] S. Wu, S. Wang, B. Pan, and Q. Ji. Deep facial action unit recognition from partially labeled data. In ICCV, pages 3951-3959, 2017.
[29] P. Yang, Q. Liu, and D. N. Metaxas. Boosting encoded dynamic features for facial expression recognition. Pattern Recognition Letters, 30(2):132-139, Jan. 2009.
[30] Z. Yu and C. Zhang. Image based static facial expression recognition with multiple deep network learning. In ICMI, pages 435-442, 2015.
[31] A. Yuce, H. Gao, and J. Thiran. Discriminant multi-label manifold embedding for facial action unit detection. In FG, 2015.
[32] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818-833, 2014.
[33] G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE T-PAMI, 29(6):915-928, June 2007.
[34] K. Zhao, W. Chu, and H. Zhang. Deep region and multi-label learning for facial action unit detection. In CVPR, pages 3391-3399, 2016.