Optimizing Filter Size in Convolutional Neural Networks for Facial Action Unit Recognition
Shizhong Han, Zibo Meng, Zhiyuan Li, James O'Reilly, Jie Cai, Xiaofeng Wang, Yan Tong
Department of Computer Science & Engineering, Department of Electrical Engineering, University of South Carolina, Columbia, SC
Abstract
Recognizing facial action units (AUs) during spontaneous facial displays is a challenging problem. Most recently, Convolutional Neural Networks (CNNs) have shown promise for facial AU recognition, where predefined and fixed convolution filter sizes are employed. In order to achieve the best performance, the optimal filter size is often found empirically through extensive experimental validation. Such a training process suffers from expensive training cost, especially as the network becomes deeper. This paper proposes a novel Optimized Filter Size CNN (OFS-CNN), where the filter sizes and weights of all convolutional layers are learned simultaneously from the training data along with learning convolution filters. Specifically, the filter size is defined as a continuous variable, which is optimized by minimizing the training loss. Experimental results on two AU-coded spontaneous databases have shown that the proposed OFS-CNN is capable of estimating the optimal filter size for varying image resolutions and outperforms traditional CNNs with the best filter size obtained by exhaustive search. The OFS-CNN also beats a CNN using multiple filter sizes and, more importantly, is much more efficient during testing with the proposed forward-backward propagation algorithm.
1. Introduction
Facial behavior is a natural and powerful means of human communication. The Facial Action Coding System (FACS) developed by Ekman and Friesen [6] describes facial behavior with a set of facial action units (AUs), each of which is anatomically related to the contraction of a set of facial muscles. An automatic AU recognition system has various applications in human-computer interaction (HCI), such as interactive games, advertisement impact analysis, and synthesizing human expressions. However, it remains a challenging problem to recognize facial AUs from spontaneous facial displays, especially with large variations in facial appearance caused by free head movements, occlusions, and illumination changes.

Extensive efforts have been focused on extracting features that are capable of capturing facial appearance and/or geometrical changes caused by AUs. While most of the earlier approaches employed handcrafted and general-purpose features, deep learning, especially CNN-based methods, has shown great promise in recognizing facial expressions or AUs [7, 24, 19, 15, 9, 12, 34, 17, 30, 21].

In CNNs, the size of the convolution filters determines the size of the receptive field from which information is extracted. CNN-based methods employ predefined and fixed filter sizes in each convolutional layer, which we call the traditional CNN hereafter. In general, larger filter sizes are employed in the lower convolutional layers, whereas smaller filter sizes are used in the upper layers [18, 4]. However, the fixed filter sizes are not necessarily optimal for all applications/tasks, nor for different image resolutions. Specifically, different AUs cause facial appearance changes over various regions at different scales and, therefore, may prefer different filter sizes. For example, long and deep nasolabial furrows are important for recognizing AU10 (upper lip raiser), while short "wrinkles in the skin above and below the lips" and small bulges below the lower lip are cues for recognizing AU23 (lip tightener) [6].

Given a predefined input image size, the best filter size is often selected experimentally or by visualization [32] for each convolutional layer. For example, Kim et al. [17], who achieved the best expression recognition performance in the EmotiW2015 challenge [5], experimentally selected the best filter sizes for the three convolutional layers.
However, with CNNs becoming deeper and deeper [23, 11], it is impractical to find the best filter size by exhaustive search due to the highly expensive training cost.
In this work, we propose a novel and feasible solution in a CNN framework to automatically learn the filter sizes for all convolutional layers simultaneously from the training data along with learning the convolution filters. In particular, we propose an Optimized Filter Size CNN (OFS-CNN), where the optimal filter size of each convolutional layer is estimated iteratively using stochastic gradient descent (SGD) during the backpropagation process. As illustrated in Figure 1, the filter size $k$ of a convolutional layer, which is a constant in traditional CNNs, is defined as a continuous variable in the OFS-CNN. During backpropagation, the filter size $k$ is updated, e.g., decreased when the partial derivative of the CNN loss with respect to the filter size is positive, i.e., $\partial L / \partial k > 0$, and vice versa.

In this work, a forward-backward propagation algorithm is proposed to estimate the filter size iteratively. To facilitate the convolution operation with a continuous filter size, upper-bound and lower-bound filters with integer sizes are defined. In the forward process, an activation resulting from a convolution operation with a continuous filter size can be calculated as the interpolation of the activations using the upper-bound and lower-bound filters. Furthermore, we show that only one convolution operation is needed with the upper-bound and lower-bound filters. Therefore, the proposed OFS-CNN has similar computational complexity to traditional CNNs in the forward process as well as in the testing process. During backpropagation, the partial derivative of the activation with respect to the filter size $k$ is defined, from which $\partial L / \partial k$ can be calculated. With a change in the filter size $k$, the sizes of the upper-bound and lower-bound filters may be updated via a transformation operation proposed in this work.

Experimental results on two benchmark AU-coded spontaneous databases, i.e., the FERA2015 BP4D database [26] and the Denver Intensity of Spontaneous Facial Action (DISFA) database [20], have demonstrated that the proposed OFS-CNN outperforms traditional CNNs with the best filter size obtained by exhaustive search and achieves state-of-the-art performance for AU recognition. Furthermore, the OFS-CNN also beats a deep CNN using multiple filter sizes, with a remarkable improvement in time efficiency during testing, which is highly desirable for real-time applications. In addition, the OFS-CNN is capable of estimating the optimal filter size for varying image resolutions.
2. Related Work
Extensive efforts have been devoted to extracting the most effective features that characterize facial appearance and geometry changes caused by the activation of facial expressions or AUs. Earlier approaches adopted various handcrafted features such as Gabor wavelets [3], histograms of Local Binary Patterns (LBP) [27], Histogram of Oriented Gradients (HOG) [2], Scale Invariant Feature Transform (SIFT) features [31], histograms of Local Phase Quantization (LPQ) [14], and their spatiotemporal extensions [14, 33, 29].

Figure 1. An overview of the proposed method to optimize the convolution filter size $k$ with the CNN loss backpropagation at the $t$-th iteration. $\partial L^t / \partial k^t$ is the partial derivative of the loss with respect to the filter size at the $t$-th iteration ($k^t$). The filter size $k$ will decrease when $\partial L^t / \partial k^t > 0$, and vice versa.

Most recently, CNNs have attracted increasing attention and shown great promise for facial expression and AU recognition [7, 24, 19, 15, 9, 12, 34, 17, 30, 21, 28, 25]. For example, the top 3 methods [17, 30, 21] in the recent EmotiW2015 challenge [5] are all based on CNNs and have been demonstrated to be more robust to real-world conditions for facial expression recognition. All of those CNN-based methods use fixed-size convolution filters.

To achieve the best performance, the optimal filter size is usually chosen empirically, by either experimental validation or visualization, for each convolutional layer [32]. For example, Kim et al. [17] experimentally compared facial expression recognition performance using different filter sizes and found that a CNN with 5×5, 4×4, and 5×5 filters in its three convolutional layers achieved the best performance on 42×42 input images. Zeiler and Fergus [32] found through visualization that 7×7 filters outperform 11×11 filters on the ImageNet dataset. However, such empirically selected filter sizes may not be optimal for all applications, nor for different image resolutions. Furthermore, it is impossible to perform an exhaustive search for the optimal combination of filter sizes of all convolutional layers in deep CNNs.

To achieve scale invariance, CNNs with multiple filter sizes have been developed. The inception module [23] concatenates the activation feature maps produced by 1×1, 3×3, and 5×5 filters. The Multigrid Neural Architecture [16] concatenates the feature maps activated by pyramid filters. However, all of those methods are still based on fixed filter sizes and, more importantly, demand a significant increase in time and space complexity due to the complex model structure.

In contrast, the proposed OFS-CNN is capable of learning and optimizing the filter sizes for all convolutional layers simultaneously in a CNN learning framework, which is desirable, especially as CNNs go deeper and deeper. Furthermore, we show that only one convolution operation is needed in the proposed forward-backward propagation algorithm. Thus, the proposed OFS-CNN has similar computational complexity to traditional CNNs and hence is more efficient than structures using multiple filter sizes.
3. Methodology
In this work, we propose an OFS-CNN, which is capable of optimizing and learning the filter size $k$ from the training data. In the following, we first give a brief review of the CNN, especially the convolutional layer, and then present the forward and backward propagation processes of the OFS-CNN.

A CNN consists of a stack of layers such as convolutional layers, pooling layers, rectification layers, fully connected (FC) layers, and loss layers. These layers transform the input data into highly nonlinear representations. Convolutional layers perform convolution on input images or feature maps from the previous layer with filters. Generally, the first convolutional layer extracts low-level image features such as edges, while the upper layers extract complex and task-related features.

Given an input image/feature map denoted by $\mathbf{x}$, an activation at the $i$-th row and the $j$-th column of a convolutional layer, denoted by $y_{ij}$, can be calculated by the convolution operation, i.e., the inner product of the filter and the input:

$$y_{ij}(k) = \mathbf{w}(k)^\top \mathbf{x}_{ij}(k) + b_{ij} \tag{1}$$

where $\mathbf{w}(k)$ is a convolution filter with filter size $k \times k$; $\mathbf{x}_{ij}(k)$ denotes the input within a $k \times k$ receptive field centered at the $i$-th row and the $j$-th column; and $b_{ij}$ is a bias. Traditionally, the filter size $k$ is a predefined integer and fixed throughout the training/testing process. In this work, $k \in \mathbb{R}_+$ is defined as a continuous variable that can be learned and optimized during CNN training.

In the forward process, convolution operations are conducted to calculate activations using the learned filters as in Eq. 1. However, the convolution operation can only be performed with integer-size filters in the CNN.
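To make Eq. 1 concrete: for one output location, the activation is simply the inner product of the filter with the receptive field, plus a bias. The following minimal NumPy sketch is our own illustration (the helper name conv_activation is hypothetical, not from the paper):

```python
import numpy as np

def conv_activation(w, x_ij, b=0.0):
    """Eq. 1: inner product of a k x k filter with one k x k receptive field."""
    return float(np.sum(w * x_ij) + b)

w = np.random.randn(5, 5)        # a 5x5 convolution filter w(k)
x_ij = np.random.randn(5, 5)     # receptive field x_ij(k) centered at (i, j)
print(conv_activation(w, x_ij))  # the activation y_ij(k)
```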
Upper-bound and lower-bound filters:
In order to build the relationship between the activation $y_{ij}$ and the continuous filter size $k$, we first define an upper-bound filter, denoted by $\mathbf{w}(k_+)$, and a lower-bound filter, denoted by $\mathbf{w}(k_-)$. Specifically, $k_+$ is the upper-bound filter size, i.e., the smallest odd number greater than $k$, while $k_-$ is the lower-bound filter size, i.e., the largest odd number less than or equal to $k$. $k_+$ and $k_-$ can be calculated as

$$k_+ = \left\lfloor \frac{k+1}{2} \right\rfloor \cdot 2 + 1, \qquad k_- = \left\lfloor \frac{k+1}{2} \right\rfloor \cdot 2 - 1 \tag{2}$$

Then, the activation $y_{ij}(k)$ can be defined as the linear interpolation of the activations of the upper-bound and lower-bound filters, denoted by $y_{ij}(k_+)$ and $y_{ij}(k_-)$, respectively:

$$y_{ij}(k) = \alpha\, y_{ij}(k_+) + (1 - \alpha)\, y_{ij}(k_-) \tag{3}$$

where $y_{ij}(k_+)$ and $y_{ij}(k_-)$ are calculated as in Eq. 1 with the same bias but with the upper-bound and lower-bound filters $\mathbf{w}(k_+)$ and $\mathbf{w}(k_-)$, respectively, and $\alpha = \frac{k - k_-}{2}$ is the linear interpolation weight.

Remark 1.
A cubic interpolation could also be used to build the relationship between the activation $y_{ij}$ and the continuous variable $k$. However, it has a higher computational complexity and requires at least three points, whereas the linear interpolation needs only the two points $k_-$ and $k_+$.

Remark 2.
The filter size $k$ is in fact a weight-related filter size in the interval $[k_-, k_+)$ and can be calculated as:

$$k = k_- + 2\alpha \tag{4}$$
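Eqs. 2-4 can be made concrete with a few lines of NumPy. The sketch below, which continues the snippet above and is again our own illustration, computes the bound sizes and the interpolated activation for a single receptive field, with the lower-bound filter sharing the inner coefficients of the upper-bound filter:

```python
def bounds(k):
    """Eq. 2: the odd lower/upper-bound sizes around a continuous size k."""
    k_minus = int((k + 1) // 2) * 2 - 1
    return k_minus, k_minus + 2

def interp_activation(w_plus, x_ij, k, b=0.0):
    """Eq. 3: linearly interpolate the two bound-filter activations."""
    k_minus, k_plus = bounds(k)
    alpha = (k - k_minus) / 2.0                  # interpolation weight
    w_minus = np.pad(w_plus[1:-1, 1:-1], 1)      # zero-padded lower-bound filter
    y_plus = conv_activation(w_plus, x_ij, b)    # Eq. 1 with w(k+)
    y_minus = conv_activation(w_minus, x_ij, b)  # Eq. 1 with zero-padded w(k-)
    return alpha * y_plus + (1 - alpha) * y_minus

k = 4.3                                          # a continuous filter size
k_minus, k_plus = bounds(k)                      # -> (3, 5), so alpha = 0.65
w_plus = np.random.randn(k_plus, k_plus)         # upper-bound filter
x_ij = np.random.randn(k_plus, k_plus)           # k+ x k+ receptive field
print(interp_activation(w_plus, x_ij, k))        # y_ij(k); Eq. 4 gives back k = 3 + 2*0.65
```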
Convolution with a continuous filter size: Following Remark 2, we can explicitly define the filter $\mathbf{w}(k)$ with a continuous size $k$. As shown in Fig. 2, the upper-bound and lower-bound filters share the same coefficients in the inner (green) region and differ only by the (pink) ring boundary, denoted by $\triangle\mathbf{w}(k_+)$. Let $\triangle\mathbf{w}(k_+) = \mathbf{w}(k_+) - \mathbf{w}(k_-)$ be the ring boundary with zeros inside, as shown in Fig. 2; then the filter $\mathbf{w}(k)$ with a continuous size $k$ can be defined as follows:

$$\mathbf{w}(k) = \alpha\, \triangle\mathbf{w}(k_+) + \mathbf{w}(k_-) \tag{5}$$

Remark 3.
In Eq. 5, $\mathbf{w}(k)$ and $\triangle\mathbf{w}(k_+)$ have an actual filter size of $k_+$, while $\mathbf{w}(k_-)$ is zero-padded to the size $k_+$.

Lemma 1.
Given the definition of the filter $\mathbf{w}(k)$ in Eq. 5, the activation $y_{ij}(k)$ in Eq. 3 can be simplified as:

$$y_{ij}(k) = \mathbf{w}(k)^\top \mathbf{x}_{ij}(k_+) + b_{ij} \tag{6}$$
Proof. Eq. 6 can be deduced from Eq. 3 as follows:

$$y_{ij}(k) = \alpha\, y_{ij}(k_+) + (1 - \alpha)\, y_{ij}(k_-) = \alpha\, \mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+) + (1 - \alpha)\, \mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_-) + b_{ij} \tag{7}$$

After padding zeros for $\mathbf{w}(k_-)$, $\mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_-)$ is equivalent to $\mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_+)$. Then, Eq. 7 can be simplified as follows:

$$\begin{aligned} y_{ij}(k) &= \alpha\, \mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+) + (1 - \alpha)\, \mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_+) + b_{ij} \\ &= \left[ \alpha\, \mathbf{w}(k_+)^\top + (1 - \alpha)\, \mathbf{w}(k_-)^\top \right] \mathbf{x}_{ij}(k_+) + b_{ij} \\ &= \left[ \alpha\, \triangle\mathbf{w}(k_+)^\top + \mathbf{w}(k_-)^\top \right] \mathbf{x}_{ij}(k_+) + b_{ij} \end{aligned} \tag{8}$$

By substituting Eq. 5 into Eq. 8, we have

$$y_{ij}(k) = \mathbf{w}(k)^\top \mathbf{x}_{ij}(k_+) + b_{ij} \tag{9}$$

Thus, the activation $y_{ij}(k)$ can be simplified as in Eq. 6.

Figure 2. An illustrative definition of a filter with a continuous filter size $k \in \mathbb{R}_+$. $\mathbf{w}(k_+)$ and $\mathbf{w}(k_-)$ are the upper-bound and lower-bound filters, respectively, and share the same elements in the green region. The pink region $\triangle\mathbf{w}(k_+)$ denotes the difference between the upper-bound and lower-bound filters and has a ring shape with zeros inside. $\alpha$ is the linear interpolation weight associated with the upper-bound filter $\mathbf{w}(k_+)$. $\mathbf{w}(k)$ is a weight-related filter with a continuous filter size $k$.
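Lemma 1 is easy to verify numerically. Continuing the sketch above (our own illustration), a single inner product with $\mathbf{w}(k)$ built via Eq. 5 reproduces the interpolated activation of Eq. 3:

```python
# Build w(k) as in Eq. 5 and check Eq. 6 against Eq. 3.
w_minus = np.pad(w_plus[1:-1, 1:-1], 1)   # zero-padded lower-bound filter
dw = w_plus - w_minus                      # ring boundary dw(k+), zeros inside
alpha = (k - k_minus) / 2.0
w_k = alpha * dw + w_minus                 # Eq. 5
assert np.isclose(conv_activation(w_k, x_ij),          # Eq. 6: one convolution
                  interp_activation(w_plus, x_ij, k))  # Eq. 3: interpolation
```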
Remark 4. According to Eq. 6, only one convolution operation needs to be performed to calculate each activation $y_{ij}(k)$. Therefore, the time complexity does not increase compared with the traditional CNN in the forward training process as well as in the testing process.

In the backward process, since the relationship between the activation and the filter size has been defined in Eq. 3, the partial derivative of the activation $y_{ij}$ w.r.t. the filter size can be calculated from the definition of the derivative:

$$\frac{\partial y_{ij}(k)}{\partial k} = \lim_{\triangle k \to 0} \frac{y_{ij}(k + \triangle k) - y_{ij}(k - \triangle k)}{2 \triangle k} \tag{10}$$

When $k + \triangle k$ and $k - \triangle k$ are in the interval $[k_-, k_+)$, the derivative at each point, $\frac{\partial y_{ij}(k)}{\partial k}$, is equal to the slope of the interpolating line because of the linear interpolation. Hence, the partial derivative can be calculated as follows:

$$\frac{\partial y_{ij}(k)}{\partial k} = \frac{y_{ij}(k_+) - y_{ij}(k_-)}{k_+ - k_-} \tag{11}$$

Substituting Eq. 1 into Eq. 11, we have

$$\frac{\partial y_{ij}(k)}{\partial k} = \frac{\mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+) - \mathbf{w}(k_-)^\top \mathbf{x}_{ij}(k_-)}{k_+ - k_-} \tag{12}$$

By padding zeros for $\mathbf{w}(k_-)$, we can simplify Eq. 12 as

$$\frac{\partial y_{ij}(k)}{\partial k} = \frac{\left[ \mathbf{w}(k_+)^\top - \mathbf{w}(k_-)^\top \right] \mathbf{x}_{ij}(k_+)}{k_+ - k_-} = \frac{\triangle\mathbf{w}(k_+)^\top \mathbf{x}_{ij}(k_+)}{k_+ - k_-} \tag{13}$$

Based on Eq. 13, the partial derivative of the loss $L$ w.r.t. $k$ can be calculated with the chain rule:

$$\frac{\partial L}{\partial k} = \sum_{i,j} \frac{\partial L}{\partial y_{ij}} \frac{\partial y_{ij}}{\partial k} \tag{14}$$
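Because the interpolation is linear in $k$ within $[k_-, k_+)$, the analytic derivative of Eq. 13 agrees exactly with a symmetric finite difference. Continuing the sketch above (our own illustration):

```python
# Eq. 13: dy/dk = dw(k+)^T x(k+) / (k+ - k-)
grad_analytic = np.sum(dw * x_ij) / (k_plus - k_minus)

eps = 1e-3  # small enough that k +/- eps stays inside [k-, k+)
grad_numeric = (interp_activation(w_plus, x_ij, k + eps)
                - interp_activation(w_plus, x_ij, k - eps)) / (2 * eps)
assert np.isclose(grad_analytic, grad_numeric)
```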
Updating the filter size: Given the partial derivative of the loss $L$ w.r.t. $k$, the filter size $k$ can be updated iteratively with the SGD strategy at the $(t+1)$-th iteration as follows:

$$k^{t+1} = k^t - \gamma \frac{\partial L^t}{\partial k^t} \tag{15}$$

where $\gamma$ is the learning rate.

Since the lower-bound filter $\mathbf{w}^t(k_-)$ is defined as the inner part of the upper-bound filter $\mathbf{w}^t(k_+)$, we only need to perform backpropagation for the upper-bound filter $\mathbf{w}^t(k_+)$, which can be divided into two parts as $\mathbf{w}^t(k_+) = \mathbf{w}^t(k_-) + \triangle\mathbf{w}^t(k_+)$, where $\triangle\mathbf{w}^t(k_+)$ is the ring boundary with zeros inside and $\mathbf{w}^t(k_-)$ is padded with zeros. Then, the forward activation function in Eq. 6 can be reorganized as:

$$\begin{aligned} y_{ij}^t(k^t) &= \mathbf{w}^t(k^t)^\top \mathbf{x}_{ij}^t(k_+^t) + b_{ij}^t \\ &= \left[ \alpha^t \triangle\mathbf{w}^t(k_+^t)^\top + \mathbf{w}^t(k_-^t)^\top \right] \mathbf{x}_{ij}^t(k_+^t) + b_{ij}^t \\ &= \alpha^t \triangle\mathbf{w}^t(k_+^t)^\top \triangle\mathbf{x}_{ij}^t(k_+^t) + \mathbf{w}^t(k_-^t)^\top \mathbf{x}_{ij}^t(k_-^t) + b_{ij}^t \end{aligned} \tag{16}$$

where $\triangle\mathbf{x}_{ij}^t(k_+^t)$ is the ring boundary of $\mathbf{x}_{ij}^t(k_+^t)$ in the input image/feature map with zeros inside, and $\mathbf{x}_{ij}^t(k_-^t)$ is padded with zeros.

Hence, the partial derivative of the activation $y_{ij}^t$ w.r.t. the upper-bound filter $\mathbf{w}^t(k_+^t)$ can be calculated as follows:

$$\frac{\partial y_{ij}^t}{\partial \mathbf{w}^t(k_+^t)} = \mathbf{x}_{ij}^t(k_-^t)^\top + \alpha^t \triangle\mathbf{x}_{ij}^t(k_+^t)^\top \tag{17}$$

With the chain rule, the derivative of the CNN loss w.r.t. $\mathbf{w}^t(k_+^t)$ can be calculated as

$$\frac{\partial L^t}{\partial \mathbf{w}^t(k_+^t)} = \sum_{i,j} \frac{\partial L^t}{\partial y_{ij}^t} \frac{\partial y_{ij}^t}{\partial \mathbf{w}^t(k_+^t)} \tag{18}$$

Thus, the upper-bound filter $\mathbf{w}(k_+)$ can be updated iteratively using the SGD strategy. As a result, the filter $\mathbf{w}(k)$ with a continuous size $k$ can be updated as in Eq. 5.

Figure 3. When the filter size $k$ is updated during backpropagation, it may fall outside the interval $[k_-^t, k_+^t)$. In this case, transformation operations are needed to update the sizes of the upper-bound and lower-bound filters after updating their coefficients. Specifically, an expanding operation is employed to increase the sizes of both the upper-bound and lower-bound filters, whereas a shrinking operation is used to decrease the filter sizes.

Transforming the upper-bound and lower-bound filters:
According to Eq. 15, the filter size $k$ is continuously updated over time. As long as $k^{t+1}$ lies in the interval $[k_-^t, k_+^t)$, the upper-bound and lower-bound filters keep the same sizes as in the $t$-th iteration, i.e., $k_-^{t+1} = k_-^t$ and $k_+^{t+1} = k_+^t$. However, as the filter size $k$ is updated, $k^{t+1}$ may fall outside the interval $[k_-^t, k_+^t)$. Consequently, the sizes of both the upper-bound and lower-bound filters should be updated. As illustrated in Fig. 3, we define transformation operations, including expanding and shrinking, to update the upper-bound and lower-bound filters to accommodate a size change. Note that the transformation operations are conducted after updating the coefficients of the upper-bound and lower-bound filters.

Expanding:
When $k^{t+1} > k_+^t$, the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ are updated by an expanding operation as follows:

$$\mathbf{w}^{t+1}(k_-^{t+1}) = \mathbf{w}^{t+1}(k_+^{t}), \qquad \mathbf{w}^{t+1}(k_+^{t+1}) = \mathrm{expand}\left( \mathbf{w}^{t+1}(k_+^{t}) \right) \tag{19}$$

where $\mathbf{w}^{t+1}(k_+^t)$ denotes the upper-bound filter after its coefficients have been updated, and $\mathrm{expand}(\cdot)$ is a function that increases the filter size, specifically by padding values from the nearest neighbors of the original filter, as illustrated in Figure 4.

Shrinking:
As opposed to the $\mathrm{expand}(\cdot)$ function, when $k^{t+1} < k_-^t$, the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ are shrunk as follows:

$$\mathbf{w}^{t+1}(k_+^{t+1}) = \mathbf{w}^{t+1}(k_-^{t}), \qquad \mathbf{w}^{t+1}(k_-^{t+1}) = \mathrm{shrink}\left( \mathbf{w}^{t+1}(k_-^{t}) \right) \tag{20}$$

where $\mathrm{shrink}(\cdot)$ is a function that decreases the filter size, specifically by filling the boundary with zeros, as shown in Figure 4.

Figure 4. An illustration of the shrink and expand operations used to change the filter size. The shrink operation sets the outside boundary to zeros, while the expand operation pads the outside boundary with the nearest neighbors from the original filter.
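A minimal sketch of the two transformation operations on plain NumPy arrays (our own illustration; the function names mirror Eqs. 19 and 20):

```python
import numpy as np

def expand(w):
    """Eq. 19: grow a filter by one ring, padding with its nearest neighbors."""
    return np.pad(w, 1, mode='edge')

def shrink(w):
    """Eq. 20: keep the inner part of a filter and set its boundary to zeros."""
    out = np.zeros_like(w)
    out[1:-1, 1:-1] = w[1:-1, 1:-1]
    return out
```

For instance, expand turns a 3x3 filter into a 5x5 one whose outer ring replicates the nearest original entries, exactly the nearest-neighbor padding described above.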
Remark 5.
There are alternative methods that could be used to expand or shrink the filters. For example, we also tried resizing the filter by bicubic interpolation; however, the recognition performance became worse. The reason is that the filters learned in the previous iterations are distorted after scaling and thus may fail to activate the patterns in the images. In contrast, the proposed expand and shrink functions preserve the learned filters well.
Updating other parameters:
In addition to updating the filter size $k$ and the convolution filter $\mathbf{w}(k)$, we should also update the bias $b_{ij}$ and the feature $\mathbf{x}_{ij}$ during backpropagation. Based on the forward activation function defined in Eq. 6, the derivative of the activation $y_{ij}^t$ w.r.t. $\mathbf{x}_{ij}^t(k_+^t)$ can be calculated as:

$$\frac{\partial y_{ij}^t}{\partial \mathbf{x}_{ij}^t(k_+^t)} = \mathbf{w}^t(k^t) \tag{21}$$

With the chain rule, the derivative of the CNN loss w.r.t. $\mathbf{x}_{ij}^t(k_+^t)$ can be calculated as:

$$\frac{\partial L^t}{\partial \mathbf{x}_{ij}^t(k_+^t)} = \frac{\partial L^t}{\partial y_{ij}^t} \frac{\partial y_{ij}^t}{\partial \mathbf{x}_{ij}^t(k_+^t)} \tag{22}$$

Algorithm 1: The forward-backward propagation algorithm for the OFS-CNN
Input: input images or feature maps from the previous layer $\mathbf{x}$, and an initial filter size $k \in \mathbb{R}_+$.
Initialization: initialize $k_+$ and $k_-$ as in Eq. 2; randomly initialize the convolution filter $\mathbf{w}(k)$.
for iteration $t$ from 1 to $T$ do
    // Forward:
    $\mathbf{w}^t(k_-^t) = \mathrm{shrink}(\mathbf{w}^t(k_+^t))$
    Calculate the convolution filter $\mathbf{w}^t(k^t)$ based on Eq. 5
    Calculate the forward activation $y_{ij}(k)$ based on Eq. 6
    // Backward:
    Calculate the derivatives of the activation w.r.t. $k^t$, $\mathbf{w}^t(k_+^t)$, and $\mathbf{x}^t$ based on Eqs. 13, 17, and 21, respectively
    Calculate the derivatives of the loss w.r.t. $k^t$, $\mathbf{w}^t(k_+^t)$, and $\mathbf{x}^t$ based on Eqs. 14, 18, and 22, respectively
    Update $k^{t+1}$, $\mathbf{w}^{t+1}(k_+^{t+1})$, and $\mathbf{x}^{t+1}$ based on SGD
    Update the bias using standard CNN backpropagation
    // Transformation:
    if $k^{t+1} > k_+^t$ then
        $k_-^{t+1} = k_+^t$; $k_+^{t+1} = k_+^t + 2$
        Expand the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ as in Eq. 19
    else if $k^{t+1} < k_-^t$ then
        $k_+^{t+1} = k_-^t$; $k_-^{t+1} = k_-^t - 2$
        Shrink the upper-bound and lower-bound filters $\mathbf{w}^{t+1}(k_+^{t+1})$ and $\mathbf{w}^{t+1}(k_-^{t+1})$ as in Eq. 20
    end if
end for

Hence, the feature $\mathbf{x}_{ij}$ can be updated using the SGD strategy and further backpropagated to update the parameters in the lower layers. The backpropagation of $b_{ij}^t$ is exactly the same as in traditional CNNs. The forward and backward propagation process of the proposed OFS-CNN is summarized in Algorithm 1.
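The paper's implementation is built on Caffe [13]. Purely as an illustration, the sketch below re-expresses one OFS convolutional layer in PyTorch under our own assumptions: the class name OFSConv2d is hypothetical, and we let autograd learn the interpolation weight $\alpha$ directly, which by Eq. 4 is equivalent to learning $k$ up to a factor of 2 absorbed into the learning rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OFSConv2d(nn.Module):
    """Hypothetical sketch of a convolution with a learnable filter size."""
    def __init__(self, in_ch, out_ch, k_init=4.0):
        super().__init__()
        k_minus = 2 * int((k_init + 1) // 2) - 1   # Eq. 2: largest odd <= k_init
        self.k_plus = k_minus + 2                  # Eq. 2: smallest odd > k_init
        # Only the upper-bound filter is stored; the lower-bound filter is its
        # zero-padded inner (k- x k-) part, as in Algorithm 1.
        self.weight = nn.Parameter(
            0.01 * torch.randn(out_ch, in_ch, self.k_plus, self.k_plus))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # alpha = (k - k-)/2 (Eq. 3); k is recovered as k = k- + 2*alpha (Eq. 4)
        self.alpha = nn.Parameter(torch.tensor((k_init - k_minus) / 2.0))

    def forward(self, x):
        w_minus = F.pad(self.weight[:, :, 1:-1, 1:-1], (1, 1, 1, 1))  # shrink(w(k+))
        dw = self.weight - w_minus                                    # ring dw(k+)
        w_k = self.alpha * dw + w_minus                               # Eq. 5
        # Lemma 1 (Eq. 6): a single convolution suffices in the forward pass
        return F.conv2d(x, w_k, self.bias, padding=self.k_plus // 2)

    @torch.no_grad()
    def transform(self):
        """Transformation step of Algorithm 1, called after each SGD update."""
        if self.alpha.item() >= 1.0:     # k grew past k+: expand (Eq. 19)
            self.weight = nn.Parameter(
                F.pad(self.weight, (1, 1, 1, 1), mode='replicate'))
            self.k_plus += 2
            self.alpha -= 1.0
        elif self.alpha.item() < 0.0:    # k fell below k-: shrink (Eq. 20)
            self.weight = nn.Parameter(self.weight[:, :, 1:-1, 1:-1].contiguous())
            self.k_plus -= 2
            self.alpha += 1.0
```

In a training loop, one would call layer.transform() after each optimizer step; resizing the weight creates a new Parameter, so the optimizer (and any momentum state) must be rebuilt afterwards, a bookkeeping detail omitted here.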
4. Experiments
To demonstrate the effectiveness of the proposed model, extensive experiments have been conducted on two benchmark AU-coded databases, i.e., the BP4D database [26] and the DISFA database [20], containing spontaneous facial behavior with moderate head movements. Specifically, the BP4D database [26] has 11 AUs and 41 subjects with 146,847 images, and the DISFA database [20] has 12 AUs and 27 subjects with 130,814 images. Following the experimental setup of the state-of-the-art methods (DRML [34] and PL-CNN [28]), two AUs, i.e., AU5 and AU20, which appear in only a small fraction of the frames in the DISFA database, are not considered in the experiments.

First, facial landmarks are detected, from which face alignment is conducted to reduce the variations from scaling and in-plane rotation. For the DISFA database [20], 66 landmarks are detected using a state-of-the-art method [1]. For the BP4D database [26], the 49 landmarks provided with the database are used for face alignment. Based on the extracted facial landmarks, face regions are aligned based on three fiducial points, i.e., the centers of the two eyes and the mouth, and then scaled to the target resolution. Following the work in [10], each face image is warped to a frontal view to reduce variations from face pose; then sequence normalization is performed by subtracting the mean and dividing by the standard deviation calculated from the video sequence, to reduce identity-related information and to enhance the appearance and geometrical changes caused by AUs.

The proposed OFS-CNN is modified from cifar10_quick in Caffe [13], which consists of three convolutional layers, two average pooling layers, two FC layers, and a weighted sigmoid cross-entropy loss layer for calculating the loss. Specifically, all the convolutional layers have a stride of 1. The first two convolutional layers have 32 filters, whose output feature maps are sent to a rectification layer followed by an average pooling layer with a downsampling stride of 3. The last convolutional layer has 64 filters, whose output feature maps are fed into an FC layer with 64 nodes. Finally, the output of the last FC layer, which contains a single node, is sent to the loss layer. SGD, with a momentum of 0.9 and a mini-batch size of 100, is used for training the CNN. One CNN model is trained for each AU. All filter sizes are 5×5 in the original cifar10_quick [13], which is used as the baseline CNN for comparison. In the OFS-CNN, all filter sizes are initialized to 4, implying $\alpha = 0.5$, $k_+ = 5$, and $k_- = 3$.

The proposed OFS-CNN is compared with the baseline CNN with fixed convolution filter sizes on the two benchmark datasets. Since the BP4D database [26] provides training and development partitions, the average performance of five runs is reported to reduce the influence of randomness during training. For the DISFA database [20], a 9-fold cross-validation strategy is employed, such that the training and testing subjects are mutually exclusive. Experimental results are reported in terms of the average F1 score and the 2AFC score (area under the ROC curve). In the experiments, three resolutions, i.e., 64×48, 128×96, and 256×192, are employed to evaluate the proposed OFS-CNN.
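As an illustration of the sequence normalization step described above, a minimal sketch (our own; the variable names are hypothetical):

```python
import numpy as np

def normalize_sequence(frames):
    """Subtract the per-sequence mean image and divide by the per-sequence
    standard deviation to suppress identity-related appearance."""
    frames = frames.astype(np.float32)   # shape: (num_frames, height, width)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0) + 1e-8      # guard against division by zero
    return (frames - mean) / std

video = np.random.rand(100, 128, 96)     # a toy 100-frame aligned face sequence
normalized = normalize_sequence(video)
```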
Table 1. Performance comparison of the proposed OFS-CNNs and traditional CNNs with varying filter size on the BP4D database [26]. In the 1-layer OFS-CNN, the filter size is learned only for the first layer. The average converged filter size is reported for each AU. The results are calculated from 5 runs in terms of the average F1 score and the 2AFC score. The underline highlights the best performance among the 4 fixed filter sizes; the bold highlights the best performance among all models.

AUs | CNN-Filter3 (F1, 2AFC) | CNN-Filter5 (F1, 2AFC) | CNN-Filter7 (F1, 2AFC) | CNN-Filter9 (F1, 2AFC) | 1-layer OFS-CNN (F1, 2AFC, Converged Size) | 3-layer OFS-CNN (F1, 2AFC, Converged Size)
AU1 | 0.315, 0.577 | 0.313, 0.578 | 0.310, 0.577 | 0.315, 0.583 | 0.320, 0.586, 6.0 | -
Exhaustive search vs optimization of filter size:
We first show that the proposed OFS-CNN is capable of learning the optimal filter sizes. Specifically, baseline CNNs are designed with varying filter sizes, including 3×3, 5×5, 7×7, and 9×9, in the first convolutional layer. In addition to the 3-layer OFS-CNN, where the filter sizes of all three convolutional layers are learned, a 1-layer OFS-CNN is designed, where the filter size is learned only for the first layer. All the baseline CNNs and the 1-layer OFS-CNN use fixed filter sizes (5×5) for the other two convolutional layers. All the models in comparison are trained on the training partition and tested on the development partition of the BP4D database [26]. The results, calculated as the average of 5 runs, are reported in Table 1. The average filter size of the OFS-CNNs is reported for each AU at the 3,000th iteration, by which most of the CNN models have converged in our experiments.

As shown in Table 1, the 1-layer OFS-CNN not only outperforms CNN-Filter5, the configuration of the original cifar10_quick [13], in terms of the average F1 score (0.501 vs. 0.499) and the average 2AFC score (0.666 vs. 0.664), but also achieves performance similar to that of CNN-Filter7, which has the best performance among all baseline CNNs. Furthermore, the 3-layer OFS-CNN beats all models in comparison in terms of the average F1 score and 2AFC score. This demonstrates that the proposed OFS-CNN is superior to the best CNN model obtained by exhaustive search. In addition, the learned filter size is often consistent with the best filter size obtained by exhaustive search, i.e., the best filter size found by exhaustive search is either the upper-bound or the lower-bound filter size in the OFS-CNN.
OFS-CNNs on different image resolutions:
We also show that the learned filter sizes adapt well to changes in image resolution. Specifically, experiments have been conducted to compare the proposed OFS-CNN and the baseline CNN on the BP4D database [26] with different resolutions of the input images. All the CNN models have a network structure similar to that described above. In order to accommodate the changes in resolution, the number of nodes in the first FC layer is set to 64, 128, and 256 for resolutions of 64×48, 128×96, and 256×192, respectively, for all models in comparison. In this set of experiments, the 3-layer OFS-CNN is employed, and the average converged filter sizes for each AU under each resolution are reported in Table 3.

Table 2. Performance comparison of the proposed OFS-CNN and the baseline CNN for varying image resolutions (64×48, 128×96, and 256×192) on the BP4D database [26] in terms of the average F1 score. The bold highlights the best performance among all models.

AU2: 0.277
AU7: 0.643, 0.634, 0.642
AU14: 0.517, 0.532, 0.552
AU23: 0.348, 0.355, 0.381, 0.398, 0.354
AVE: 0.499, 0.515, 0.522, 0.533, 0.478

Table 3. The average converged filter sizes for varying image resolutions on the BP4D database [26]. The bold highlights the filter sizes with the best performance.

AU7: 5.0, 4.8, 4.7
AU14: 5.1, 4.6, 4.5
AU23: 5.4, 4.6, 4.7, 6.0, 4.7, 4.7
AVE (conv1, conv2, conv3): 5.0, 5.0, 4.8 at 64×48; 5.5, 5.0, 5.0 at 128×96; 5.7, 5.0, 5.0 at 256×192

As shown in Table 2, most AUs prefer a higher image resolution, which preserves subtle cues of facial appearance changes. However, the performance of the baseline CNN decreases for the highest resolution, 256×192. When the image resolution increases, the receptive field of the same 5×5 filter covers a smaller actual area of the whole face compared to lower resolutions. In contrast, the proposed OFS-CNN can optimize the filter size at various image resolutions. As shown in Table 3, the OFS-CNN has the largest average filter size, 5.7, for conv1 (the first convolutional layer) at 256×192 and thus benefits from an increased receptive field owing to the 7×7 upper-bound filter. As a result, the OFS-CNN outperforms the baseline CNN for all image resolutions, especially for 256×192 by 6%, in terms of the average F1 score.

Comparison with the CNNs using multiple filter sizes:
We also compare the proposed OFS-CNN to a CNN structure with multiple filter sizes, i.e., the inception module [23]. In particular, the GoogLeNet [23] with 7 inception modules is trained and evaluated on the BP4D database.

Table 4. Comparison with the GoogLeNet on the BP4D database in terms of F1 score.

AUs (occurrence %) | GoogLeNet | OFS-CNN
AU1 (23.1%) | 0.369 | 0.345
AU2 (17.9%) | 0.267 | 0.303
AU4 (22.7%) | - | -
AU23 (17.0%) | 0.376 | 0.398
AVE | 0.531 | 0.533
As shown in Table 4, the OFS-CNN with a shallow structure (15 layers, trained in 3,000 iterations) performs noticeably better than the GoogLeNet (100 layers, trained in 20,000 iterations) in terms of the average F1 score. The improvement becomes more substantial for the AUs with a lower occurrence rate, such as AU2 (17.9%) and AU23 (17.0%). Furthermore, the GoogLeNet is much more complex than the OFS-CNN and thus demands more training data. Note that the proposed OFS-CNN runs more than 8 times faster on a 128×96 image and more than 6 times faster on a 256×192 image than the GoogLeNet during testing, which is critical and hence highly desirable for real-time applications.

Comparison with the baseline CNN on the DISFA database [20]:
As illustrated in Table 5, the proposed OFS-CNN also outperforms the baseline CNN by a notable margin in terms of the average F1 score on the DISFA database [20]. The experiments are conducted using the 3-layer OFS-CNN.

Table 5. Performance comparison with the baseline CNN on the DISFA database [20] in terms of the average F1 score and the 2AFC score.

AUs | CNN (baseline): F1, 2AFC | OFS-CNN: F1, 2AFC
AU1 | - | -
AU4 | - | -
AU12 | 0.786, - | -
AU25 | - | -
AVE | - | -

Comparison with state-of-the-art methods:
In addition to the baseline CNN, we further compare the proposed OFS-CNN with state-of-the-art methods, particularly the most recent approaches based on CNNs [8, 9, 34, 28], on the two benchmark databases. As shown in Table 6, the proposed OFS-CNN achieves state-of-the-art AU recognition performance on both databases.

Table 6. Performance comparison with the state-of-the-art CNN-based methods on the BP4D and the DISFA databases in terms of F1 score and 2AFC score.

BP4D:
Methods | F1 | 2AFC
DL [9] | 0.522 | N/A
AlexNet [34] | 0.384 | 0.422
LCN [34] | 0.466 | 0.544
ConvNet [34] | 0.470 | 0.518
DRML [34] | 0.483 | 0.560
PL-CNN [28] | 0.491 | N/A
OFS-CNN | - | -

DISFA:
Methods | F1 | 2AFC
ML-CNN [8] | N/A | 0.757
AlexNet [34] | 0.236 | 0.491
LCN [34] | 0.240 | 0.468
ConvNet [34] | 0.231 | 0.458
DRML [34] | 0.267 | 0.523
PL-CNN [28] | 0.584 | N/A
OFS-CNN | - | -

Note: the performance of the ML-CNN was reported for 10 AUs on the DISFA database [20].
5. Conclusion and Future Work
Traditional CNNs have a predefined and fixed integer filter size for each convolutional layer, which may not be optimal for all tasks, nor for all image resolutions. In this work, we proposed a novel OFS-CNN with a forward-backward propagation algorithm to iteratively optimize the filter size while learning the convolution filters. Upper-bound and lower-bound filters are defined to facilitate convolution operations with continuous-size filters, and transformation operations are developed to accommodate size changes of the filters. Experimental results on two benchmark AU-coded spontaneous databases have shown that the OFS-CNN outperforms the baseline CNNs with the best filter size found by exhaustive search, and achieves performance better than or at least comparable to state-of-the-art CNN-based methods. Furthermore, the OFS-CNN has been shown to adapt filter sizes automatically to different image resolutions. In the current implementation, all channels of a single convolutional layer share a single filter size. In the future, the OFS-CNN will be extended to learn a filter size for each channel, which would be more effective for learning variously sized patterns.

References

[1] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In CVPR, pages 3444-3451, 2013.
[2] T. Baltrusaitis, M. Mahmoud, and P. Robinson. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In FG, volume 6, pages 1-6, 2015.
[3] M. S. Bartlett, G. Littlewort, M. G. Frank, C. Lainscsek, I. Fasel, and J. R. Movellan. Recognizing facial expression: Machine learning and application to spontaneous behavior. In CVPR, pages 568-573, 2005.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[5] A. Dhall, O. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In ICMI, pages 423-426, 2015.
[6] P. Ekman, W. V. Friesen, and J. C. Hager. Facial Action Coding System: the Manual. Research Nexus, Network Information Research Corp., Salt Lake City, UT, 2002.
[7] B. Fasel. Head-pose invariant facial expression recognition using convolutional neural networks. In ICMI, pages 529-534, 2002.
[8] S. Ghosh, E. Laksana, S. Scherer, and L.-P. Morency. A multi-label convolutional neural network approach to cross-domain action unit detection. In ACII, 2015.
[9] A. Gudi, H. E. Tasli, T. M. den Uyl, and A. Maroulis. Deep learning based FACS action unit occurrence and intensity estimation. In FG, 2015.
[10] S. Han, Z. Meng, S. Khan, and Y. Tong. Incremental boosting convolutional neural network for facial action unit recognition. In NIPS, pages 109-117, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[12] S. Jaiswal and M. F. Valstar. Deep learning the dynamic appearance and shape of facial action units. In WACV, 2016.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675-678, 2014.
[14] B. Jiang, B. Martinez, M. F. Valstar, and M. Pantic. Decision level fusion of domain specific regions for facial action recognition. In ICPR, pages 1776-1781, 2014.
[15] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In ICCV, pages 2983-2991, 2015.
[16] T.-W. Ke, M. Maire, and S. X. Yu. Multigrid neural architectures. In CVPR, 2017.
[17] B.-K. Kim, H. Lee, J. Roh, and S.-Y. Lee. Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In ICMI, pages 427-434, 2015.
[18] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In CVPR Workshops, pages 34-42, 2015.
[19] M. Liu, S. Li, S. Shan, R. Wang, and X. Chen. Deeply learning deformable facial action parts model for dynamic expression analysis. In ACCV, 2014.
[20] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. DISFA: A spontaneous facial action intensity database. IEEE Trans. on Affective Computing, 4(2):151-160, 2013.
[21] H.-W. Ng, V. D. Nguyen, V. Vonikakis, and S. Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In ICMI, pages 443-449, 2015.
[22] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, pages 4053-4061, 2016.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[24] Y. Tang. Deep learning using linear support vector machines. In ICML, 2013.
[25] Z. Tősér, L. A. Jeni, A. Lőrincz, and J. Cohn. Deep learning for facial action unit detection under large head poses. In ECCV, pages 359-371, 2016.
[26] M. Valstar, J. Girard, T. Almaev, G. McKeown, M. Mehu, L. Yin, M. Pantic, and J. Cohn. FERA 2015 - second facial expression recognition and analysis challenge. In FG, 2015.
[27] M. F. Valstar, M. Mehu, B. Jiang, M. Pantic, and K. Scherer. Meta-analysis of the first facial expression recognition challenge. IEEE T-SMC-B, 42(4):966-979, 2012.
[28] S. Wu, S. Wang, B. Pan, and Q. Ji. Deep facial action unit recognition from partially labeled data. In ICCV, pages 3951-3959, 2017.
[29] P. Yang, Q. Liu, and D. N. Metaxas. Boosting encoded dynamic features for facial expression recognition. Pattern Recognition Letters, 30(2):132-139, Jan. 2009.
[30] Z. Yu and C. Zhang. Image based static facial expression recognition with multiple deep network learning. In ICMI, pages 435-442, 2015.
[31] A. Yuce, H. Gao, and J. Thiran. Discriminant multi-label manifold embedding for facial action unit detection. In FG, 2015.
[32] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818-833, 2014.
[33] G. Zhao and M. Pietikäinen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE T-PAMI, 29(6):915-928, June 2007.
[34] K. Zhao, W. Chu, and H. Zhang. Deep region and multi-label learning for facial action unit detection. In CVPR, pages 3391-3399, 2016.