Learning a Target Sample Re-Generator for Cross-Database Micro-Expression Recognition
Yuan Zong, Xiaohua Huang, Wenming Zheng, Zhen Cui, Guoying Zhao
Yuan Zong∗
Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China
[email protected]

Xiaohua Huang
Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu FI-90014, Finland
[email protected]

Wenming Zheng†
Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China
[email protected]

Zhen Cui
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
[email protected]

Guoying Zhao
Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu FI-90014, Finland
[email protected]
ABSTRACT
In this paper, we investigate the cross-database micro-expression recognition problem, where the training and testing samples come from two different micro-expression databases. Under this setting, the training and testing samples have different feature distributions, and hence the performance of most existing micro-expression recognition methods may decrease greatly. To solve this problem, we propose a simple yet effective method called the Target Sample Re-Generator (TSRG). Using TSRG, we are able to re-generate the samples from the target micro-expression database such that the re-generated target samples share the same or similar feature distributions with the original source samples. For this reason, we can then use a classifier learned on the labeled source samples to accurately predict the micro-expression categories of the unlabeled target samples. To evaluate the performance of the proposed TSRG method, extensive cross-database micro-expression recognition experiments based on the SMIC and CASME II databases are conducted. Compared with recent state-of-the-art cross-database emotion recognition methods, the proposed TSRG achieves more promising results.

∗Yuan Zong is also with the Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu FI-90014, Finland.
†Corresponding author.
MM ’17, October 23–27, 2017, Mountain View, CA, USA. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4906-2/17/10...$15.00. DOI: 10.1145/3123266.3123367
CCS CONCEPTS
• Computing methodologies → Computer vision problems; Transfer learning;
KEYWORDS
Cross-database micro-expression recognition, micro-expression recognition, domain adaptation, transfer learning
ACM Reference format:
Yuan Zong, Xiaohua Huang, Wenming Zheng, Zhen Cui, and Guoying Zhao. 2017. Learning a Target Sample Re-Generator for Cross-Database Micro-Expression Recognition. In Proceedings of MM ’17, Mountain View, CA, USA, October 23–27, 2017.
1 INTRODUCTION
Micro-expression is a particular type of facial expression that can reveal the true emotional states people try to conceal [29]. Therefore, recognizing micro-expressions by machine has many valuable applications, e.g., clinical diagnosis [10], interrogation [11], and security [27]. However, compared with ordinary facial expressions, micro-expressions have much lower intensity and shorter duration, which makes automatic micro-expression recognition a very challenging task. Nevertheless, due to its potential value, micro-expression recognition remains one of the attractive recent research topics in the affective computing, multimedia information processing, and pattern recognition communities [26].

Micro-expression recognition research can be traced back to the work of [29], in which Pfister et al. proposed to use the temporal interpolation model (TIM) and local binary patterns from three orthogonal planes (LBP-TOP) [44] to deal with the micro-expression recognition problem. Their experimental results show that LBP-TOP is effective for micro-expression recognition. Following Pfister et al.'s work, Ruiz-Hernandez et al. [30] employed a re-parameterization of the second-order Gaussian jet to boost LBP-TOP such that it is more applicable to micro-expression recognition. For better describing micro-expressions, Wang et al. [39] proposed a novel spatio-temporal descriptor called LBP with six intersection points (LBP-SIP), which reduces the redundant information in LBP-TOP.
Subsequently, many spatio-temporal descriptors were developed for micro-expression recognition tasks, such as spatio-temporal LBP with integral projection (STLBP-IP) [15], completed local quantized pattern-TOP (CLQP-TOP) [16], histogram of oriented gradients-TOP (HOG-TOP) [22], and histogram of image gradient orientations-TOP (HIGO-TOP) [22]. Furthermore, beyond the above spatio-temporal descriptors, other types of micro-expression features have also been investigated. Among them, it is worth mentioning the works of [25, 41], in which Liu et al. and Xu et al. respectively designed novel and effective features, i.e., the main directional mean optical flow (MDMO) and the facial dynamics map (FDM), to describe micro-expressions. Their experimental results demonstrated the effectiveness of these two micro-expression features.

On the other hand, in recent years some researchers have investigated micro-expression recognition from a new angle, i.e., leveraging other important information in micro-expression clips, which contributes to distinguishing micro-expressions, to boost the performance of the spatio-temporal descriptors. In the work of [38], Wang et al. proposed to use robust principal component analysis (RPCA) [40] to extract the background information from the micro-expression video clips and then extract the spatio-temporal descriptor of that information to describe the micro-expression clips. Recently, Wang et al. [36, 37] designed a set of regions of interest (ROIs) according to the facial action coding system (FACS) [9] for micro-expression feature extraction. Meanwhile, they proposed a color space decomposition method called tensor independent color space (TICS) to utilize color information for micro-expression recognition. It is notable that, before Wang et al.'s ROI-based method [36, 37], most researchers employed a fixed grid-based spatial division, e.g., 8 × 8, to boost the performance of spatio-temporal descriptors: the original micro-expression video clip is first divided into a set of spatial facial blocks, and the spatio-temporal descriptors extracted from the blocks are concatenated into a supervector to describe the micro-expression. More recently, deep learning methods have also been applied to micro-expression recognition. Kim et al. [20] proposed a deep learning framework consisting of the popular convolutional neural network (CNN) [21] and a long short-term memory (LSTM) recurrent network [12]. In this framework, representative expression-state frames of the micro-expression video clips are first selected to train a CNN. Then, the CNN feature of each frame in a video clip is extracted to train an LSTM network for recognizing micro-expressions.

Although micro-expression recognition has made great progress in recent years, it should be pointed out that nearly all of the above methods are evaluated on a single micro-expression database, meaning the training and testing samples belong to the same database. In this case, since the training and testing samples are collected with the same equipment and under the same environment, it is commonly assumed that they abide by the same or similar feature distributions. However, in practical applications we face micro-expression samples recorded by different equipment or under different environments, which inevitably introduces a feature distribution difference between the training and testing samples. Because of this, the performance of existing micro-expression recognition methods may drop sharply. Consequently, in order to develop more practical micro-expression recognition methods, it is worthwhile to investigate the cross-database micro-expression recognition problem, in which the training and testing samples come from two different micro-expression databases.
Clearly, the cross-database micro-expression recognition problem is more challenging and difficult than the ordinary one. For convenience, throughout this paper we refer to the training database (samples) as the source database (samples) and the testing database (samples) as the target database (samples).

In the work of [42], Yan et al. roughly divided the cross-database facial expression recognition problem, which is closely related to our topic, into two cases: the semi-supervised case and the unsupervised case. The major difference between the two is whether we have access to the label information of the target domain. Cross-database micro-expression recognition can follow the same categorization. In this paper, we focus on the unsupervised setting, in which the source micro-expression samples are labeled while the label information of the target micro-expression samples is completely unknown. To deal with this challenging problem, we propose a simple yet effective method called the Target Sample Re-Generator (TSRG). TSRG aims to learn a sample re-generator for the source and target micro-expression samples. When the source and target samples are fed to TSRG, the output for the source samples remains the samples themselves, while for the target samples TSRG outputs a new set of samples that differ from their original forms but share the same or similar feature distributions with the source sample set.
After that, we are able to learn a classifier such as a support vector machine (SVM) on the labeled source micro-expression samples and subsequently use it to predict the labels of the re-generated target micro-expression samples.

The rest of this paper is organized as follows: Section 2 reviews recent cross-database emotion (including facial expression and speech emotion) recognition works, which are closely related to the cross-database micro-expression recognition topic. In Section 3, we introduce the proposed TSRG-based cross-database micro-expression recognition method in detail. To evaluate the performance of the proposed TSRG method, extensive cross-database micro-expression recognition experiments between the SMIC and CASME II databases are conducted in Section 4. Finally, the paper is concluded in Section 5.
2 RELATED WORK
Since the cross-database micro-expression recognition problem has not yet been investigated, in this section we review recent works on other modalities of cross-database emotion recognition
Figure 1: The overall picture of the TSRG-based cross-database micro-expression recognition method.

that are closely related to cross-database micro-expression recognition, namely cross-database facial expression recognition and cross-database speech emotion recognition. In recent years, these two challenging and interesting problems have gained a lot of researchers' attention.

For cross-database facial expression recognition and its related problems, various effective methods [3, 4, 32, 42, 46, 47] have been proposed. For example, in the works of [3, 4], Chu et al. proposed a novel method called the selective transfer machine (STM) for the personalized (cross-subject) facial action unit detection problem. STM is able to utilize the target samples to learn a set of weights for the source samples such that the weighted source samples have the same or similar feature distributions as the target samples. Consequently, the classifier trained on the weighted source samples is also suitable for distinguishing the target samples. Recently, Sangineto et al. [32] investigated the cross-domain facial expression recognition problem using a transductive parameter transfer method, which directly transfers knowledge about the parameters of source person-specific classifiers to the target individuals such that the target classifier can accurately predict the expressions of the target samples. More recently, Yan et al. [42] proposed an unsupervised domain-adaptive dictionary learning (UDADL) model to cope with the unsupervised cross-database facial expression recognition problem and achieved promising results. In addition, Zheng et al. [46, 47] proposed a transductive transfer subspace learning framework to deal with cross-pose and cross-database cases in facial expression recognition. In this framework, an auxiliary set is selected from the unlabeled target samples and used to learn a subspace together with the labeled source samples.
In such a subspace, the source and target samples are enforced to abide by the same feature distribution, and hence the classifier trained on the source samples can predict the expressions of the target samples.

The earliest cross-database speech emotion recognition research may be the work of [33], in which Schuller et al. employed a series of normalization methods to investigate the cross-database speech emotion recognition problem and conducted extensive cross-database experiments on many speech emotion databases. Since then, a variety of interesting methods have been proposed to deal with this challenging problem [5–7, 13, 17, 31, 34, 48]. For example, Hassan et al. [13] proposed an importance-weighted support vector machine (IW-SVM) to handle cross-database speech emotion recognition. In this method, they first used three transfer learning [28] methods, i.e., kernel mean matching (KMM) [14], the Kullback-Leibler importance estimation procedure (KLIEP) [35], and unconstrained least-squares importance fitting (uLSIF) [19], to learn importance weights for the target samples with respect to the source samples, and then incorporated the learned weights into an SVM to obtain the IW-SVM. In the works of [5–7], Deng et al. proposed a series of auto-encoder based domain adaptation methods, leveraging various auto-encoder networks to learn a common representation between the source and target samples for cross-database speech emotion recognition. Recently, Song et al. [34] proposed a transfer non-negative matrix factorization method for cross-database speech emotion recognition, in which the maximum mean discrepancy (MMD) [1] is introduced to eliminate the feature distribution difference between the source and target speech databases.

From the above cross-database emotion recognition methods, it is clear that their basic ideas can mostly be categorized into two types.
The first type of method learns importance weights for the source or target samples to balance the feature distribution difference between the source and target databases, while the second type achieves this goal by learning a common subspace. Different from the above methods, our proposed TSRG is designed to this end from a new angle, i.e., re-generating the target samples in the original feature space.
3 PROPOSED METHOD
To better understand the TSRG method, Fig. 1 illustrates its basic idea and shows how TSRG deals with the cross-database micro-expression recognition problem. As depicted in Fig. 1, the goal of the proposed TSRG is to learn a sample re-generator which re-generates the source and target micro-expression samples from the original ones. Interestingly, the re-generated source micro-expression samples are still themselves,
while the re-generated target micro-expression samples are different from their original forms, but their feature distribution becomes the same as, or similar to, that of the source samples. Consequently, once the optimal TSRG is learned, we can train a classifier, e.g., an SVM, on the labeled source samples and then obtain the micro-expression categories of the unlabeled target samples by using the trained classifier to predict the labels of the corresponding target samples re-generated by TSRG.
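This train-on-source, predict-on-re-generated-targets pipeline can be sketched as follows. Here `regenerate` is a hypothetical stand-in for a learned TSRG (the identity map in this toy example), and a simple nearest-class-mean classifier replaces the SVM used in the paper, so the snippet only illustrates the data flow, not the actual method:

```python
import numpy as np

def regenerate(X):
    """Hypothetical stand-in for a trained TSRG re-generator: it should map
    target samples so they match the source feature distribution.
    Identity here, purely for illustrating the pipeline."""
    return X

def fit_nearest_mean(X, y):
    """Toy source classifier (the paper uses an SVM); stores one mean per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict_nearest_mean(model, X):
    classes, means = model
    d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, (100, 20))   # labeled source features
y_src = rng.integers(0, 3, 100)           # 3 micro-expression categories
X_tgt = rng.normal(0.5, 1.0, (40, 20))    # unlabeled target features

model = fit_nearest_mean(X_src, y_src)               # trained on source only
y_pred = predict_nearest_mean(model, regenerate(X_tgt))
print(y_pred.shape)  # (40,)
```

The point of the design is that only `regenerate` depends on the target database; the classifier itself never sees target labels.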
Suppose we have feature matrices $\mathbf{X}_s \in \mathbb{R}^{d \times n_s}$ and $\mathbf{X}_t \in \mathbb{R}^{d \times n_t}$ of source and target micro-expression samples from two different databases, where $d$ is the dimension of the feature vectors and $n_s$ and $n_t$ denote the numbers of source and target samples, respectively. Note that the feature here can be any widely used micro-expression feature such as LBP-TOP [29, 44], LBP-SIP [39], or MDMO [25]. To realize the functions of the sample re-generator shown in Fig. 1, firstly, the sample re-generator must output the source samples themselves when the source samples are the input, which can be formulated as the following optimization problem:

$\min_{G} \; \| \mathbf{X}_s - G(\mathbf{X}_s) \|_F^2,$  (1)

where $G$ denotes the sample re-generator to be learned and $\|\cdot\|_F$ denotes the Frobenius norm.

Secondly, to ensure that the re-generated target samples have the same or similar feature distributions as the source samples, we also design a function $f_G(\mathbf{X}_s, \mathbf{X}_t)$ for TSRG, whose details will be introduced in what follows. Using $f_G(\mathbf{X}_s, \mathbf{X}_t)$ as a regularization term, we obtain the optimization problem of TSRG:

$\min_{G} \; \| \mathbf{X}_s - G(\mathbf{X}_s) \|_F^2 + \lambda f_G(\mathbf{X}_s, \mathbf{X}_t),$  (2)

where $\lambda$ is the trade-off parameter controlling the balance between the two terms of the objective function.

It is notable that the output $G(\mathbf{X}_s)$ of TSRG for the source samples is hoped to remain the original source samples $\mathbf{X}_s$. This goal is easy to achieve by combining a kernel mapping operation and a linear projection operation, two typical operations in subspace learning. More specifically, a suitable sample re-generator $G$ can first map the source samples from the original feature space into a reproducing kernel Hilbert space (RKHS) by a kernel mapping operator $\phi$ and subsequently transform the infinite-dimensional source features in the RKHS back to the original feature space by a projection matrix $\phi(\mathbf{C}) \in \mathbb{R}^{\infty \times d}$.
Following this idea, the sample re-generator $G$ can finally be defined as $G(\cdot) = \phi(\mathbf{C})^T \phi(\cdot)$. Then the optimization problem of TSRG in Eq. (2) can be written as:

$\min_{\phi(\mathbf{C})} \; \| \mathbf{X}_s - \phi(\mathbf{C})^T \phi(\mathbf{X}_s) \|_F^2 + \lambda f_G(\mathbf{X}_s, \mathbf{X}_t).$  (3)

As seen from Eq. (3), the sample re-generator $G$, consisting of kernel mapping and linear projection operators, can re-generate $\mathbf{X}_s$ itself. More importantly, it also brings a benefit in constructing $f_G(\mathbf{X}_s, \mathbf{X}_t)$ for TSRG. It is known that we can eliminate the feature distribution difference between two feature sets by minimizing their maximum mean discrepancy (MMD) [1], which is defined in an RKHS. Therefore, we formulate $f_G(\mathbf{X}_s, \mathbf{X}_t)$ as the MMD between the source and target samples in the RKHS produced by $\phi$:

$\mathrm{MMD}(\mathbf{X}_s, \mathbf{X}_t) = \left\| \tfrac{1}{n_s} \phi(\mathbf{X}_s) \mathbf{1}_s - \tfrac{1}{n_t} \phi(\mathbf{X}_t) \mathbf{1}_t \right\|_{\mathcal{H}},$  (4)

where $\mathcal{H}$ denotes a Hilbert space, and $\mathbf{1}_s$ and $\mathbf{1}_t$ are all-one vectors of lengths $n_s$ and $n_t$, respectively. However, it is hard to directly learn the optimal kernel mapping operator $\phi$. Therefore, we relax the MMD in Eq. (4) to the following formulation to serve as $f_G(\mathbf{X}_s, \mathbf{X}_t)$, such that we only need to learn the optimal $\phi(\mathbf{C})$, which is feasible and also consistent with the model parameter of TSRG to be learned in Eq. (3):

$f_G(\mathbf{X}_s, \mathbf{X}_t) = \left\| \tfrac{1}{n_s} \phi(\mathbf{C})^T \phi(\mathbf{X}_s) \mathbf{1}_s - \tfrac{1}{n_t} \phi(\mathbf{C})^T \phi(\mathbf{X}_t) \mathbf{1}_t \right\|_2^2.$  (5)

The following lemma shows that minimizing the MMD in Eq. (4) also drives the proposed $f_G(\mathbf{X}_s, \mathbf{X}_t)$ in Eq. (5) toward zero, which supports our relaxation.

Lemma 3.1. For
$\mathrm{MMD}(\mathbf{X}_s, \mathbf{X}_t)$ and $f_G(\mathbf{X}_s, \mathbf{X}_t)$ defined as in Eqs. (4) and (5) based on the kernel mapping operator $\phi$, we have $f_G(\mathbf{X}_s, \mathbf{X}_t) \to 0$ if $\mathrm{MMD}(\mathbf{X}_s, \mathbf{X}_t) \to 0$.

Proof. From the condition that $\mathrm{MMD}(\mathbf{X}_s, \mathbf{X}_t)$ is close to 0, we know that $\phi(\mathbf{X}_s)$ and $\phi(\mathbf{X}_t)$ have similar expectations, which can be formulated as $\mathbb{E}(\phi(\mathbf{x}_s^i)) \approx \mathbb{E}(\phi(\mathbf{x}_t^i))$, where $\mathbb{E}(\cdot)$ denotes the expectation operator. Then, according to the linearity of expectation [2], i.e., $\mathbb{E}(\mathbf{A}\mathbf{x}) = \mathbf{A}\,\mathbb{E}(\mathbf{x})$, it is easy to deduce that $\mathbb{E}(\phi(\mathbf{C})^T \phi(\mathbf{x}_s^i)) \approx \mathbb{E}(\phi(\mathbf{C})^T \phi(\mathbf{x}_t^i))$, which guarantees that $f_G(\mathbf{X}_s, \mathbf{X}_t)$ will be close to 0. □

By substituting the proposed $f_G(\mathbf{X}_s, \mathbf{X}_t)$ of Eq. (5) into Eq. (3), the optimization problem of TSRG becomes:

$\min_{\phi(\mathbf{C})} \; \| \mathbf{X}_s - \phi(\mathbf{C})^T \phi(\mathbf{X}_s) \|_F^2 + \lambda \left\| \tfrac{1}{n_s} \phi(\mathbf{C})^T \phi(\mathbf{X}_s) \mathbf{1}_s - \tfrac{1}{n_t} \phi(\mathbf{C})^T \phi(\mathbf{X}_t) \mathbf{1}_t \right\|_2^2.$  (6)

To solve TSRG, let $\phi(\mathbf{C}) = [\phi(\mathbf{X}_s), \phi(\mathbf{X}_t)]\,\mathbf{P}$, where $\mathbf{P} \in \mathbb{R}^{(n_s + n_t) \times d}$. Then, by using the kernel trick, the optimization problem of TSRG can be converted into:

$\min_{\mathbf{P}} \; \| \mathbf{X}_s - \mathbf{P}^T \mathbf{K}_s \|_F^2 + \lambda \left\| \tfrac{1}{n_s} \mathbf{P}^T \mathbf{K}_s \mathbf{1}_s - \tfrac{1}{n_t} \mathbf{P}^T \mathbf{K}_t \mathbf{1}_t \right\|_2^2 + \mu \| \mathbf{P} \|_{2,1},$  (7)

where $\mathbf{K}_s = [\mathbf{K}_{ss}^T, \mathbf{K}_{st}^T]^T$ and $\mathbf{K}_t = [\mathbf{K}_{ts}^T, \mathbf{K}_{tt}^T]^T$. Herein, $\mathbf{K}_{ss} = \phi(\mathbf{X}_s)^T\phi(\mathbf{X}_s)$, $\mathbf{K}_{st} = \phi(\mathbf{X}_s)^T\phi(\mathbf{X}_t)$, $\mathbf{K}_{ts} = \phi(\mathbf{X}_t)^T\phi(\mathbf{X}_s)$, and $\mathbf{K}_{tt} = \phi(\mathbf{X}_t)^T\phi(\mathbf{X}_t)$, and they can be computed by different kernel functions such as the Gaussian kernel. This is the final formulation of the proposed TSRG. Note that in Eq. (7) we also introduce the $l_{2,1}$ norm of $\mathbf{P}$, i.e., $\|\mathbf{P}\|_{2,1} = \sum_{i=1}^{d} \|\mathbf{p}_i\|_2$, where $\mathbf{p}_i$ is the $i$-th column of $\mathbf{P}$, to serve as a regularization term, for two important reasons. First, it helps avoid overfitting [8] when optimizing TSRG. Second, each column of $\phi(\mathbf{C})$ is enforced to be reconstructed sparsely from the columns of $\phi(\mathbf{X}_s)$ and $\phi(\mathbf{X}_t)$, which is more reasonable. The sparsity of $\mathbf{P}$ is controlled by the trade-off parameter $\mu$.
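As an illustration of the quantities in Eq. (7), the following sketch builds the Gaussian-kernel blocks $\mathbf{K}_{ss}, \mathbf{K}_{st}, \mathbf{K}_{ts}, \mathbf{K}_{tt}$ and evaluates the three terms of the TSRG objective for a given $\mathbf{P}$. The parameter values and data are arbitrary; this is an assumption-laden sketch, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)); columns of A, B are samples."""
    sq = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma**2))

def tsrg_objective(P, Xs, Xt, lam=0.1, mu=0.01, sigma=1.0):
    """Evaluate the three terms of Eq. (7): reconstruction + relaxed MMD + l_{2,1}."""
    Kss = gaussian_kernel(Xs, Xs, sigma)
    Kst = gaussian_kernel(Xs, Xt, sigma)
    Ktt = gaussian_kernel(Xt, Xt, sigma)
    Ks = np.vstack([Kss, Kst.T])                 # (n_s + n_t) x n_s
    Kt = np.vstack([Kst, Ktt])                   # (n_s + n_t) x n_t
    recon = np.linalg.norm(Xs - P.T @ Ks, 'fro')**2      # source re-generation term
    delta = Ks.mean(axis=1) - Kt.mean(axis=1)            # (1/n_s)Ks 1_s - (1/n_t)Kt 1_t
    mmd = np.linalg.norm(P.T @ delta)**2                 # relaxed MMD term
    l21 = np.linalg.norm(P, axis=0).sum()                # sum of column 2-norms of P
    return recon + lam * mmd + mu * l21

d, ns, nt = 5, 30, 20
rng = np.random.default_rng(1)
Xs, Xt = rng.normal(size=(d, ns)), rng.normal(size=(d, nt))
P = 0.01 * rng.normal(size=(ns + nt, d))
print(tsrg_objective(P, Xs, Xt) > 0)  # True
```

Note that with $\mathbf{P} = \mathbf{0}$ the objective reduces to $\|\mathbf{X}_s\|_F^2$, which is a quick sanity check on the implementation.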
The optimization problem of TSRG can be easily solved by various methods, such as iterative thresholding (IT) [40], the accelerated proximal gradient (APG) [18], the exact augmented Lagrange multiplier (EALM) [24], and the inexact ALM (IALM) [24]. In this paper, we employ the IALM method to learn the optimal $\mathbf{P}$ of TSRG. More specifically, we introduce a new variable $\mathbf{Q}$ that equals $\mathbf{P}$ and convert the TSRG optimization problem of Eq. (7) into the constrained one:

$\min_{\mathbf{P}, \mathbf{Q}} \; \| \mathbf{X}_s - \mathbf{Q}^T \mathbf{K}_s \|_F^2 + \lambda \left\| \tfrac{1}{n_s} \mathbf{Q}^T \mathbf{K}_s \mathbf{1}_s - \tfrac{1}{n_t} \mathbf{Q}^T \mathbf{K}_t \mathbf{1}_t \right\|_F^2 + \mu \| \mathbf{P} \|_{2,1}, \quad \text{s.t.} \;\; \mathbf{P} = \mathbf{Q}.$  (8)

Subsequently, the Lagrange function can be written as:

$\mathcal{L}(\mathbf{P}, \mathbf{Q}, \mathbf{T}, \kappa) = \| \mathbf{X}_s - \mathbf{Q}^T \mathbf{K}_s \|_F^2 + \lambda \| \mathbf{Q}^T \Delta_{st}^k \|_2^2 + \mu \| \mathbf{P} \|_{2,1} + \mathrm{tr}[\mathbf{T}^T(\mathbf{P} - \mathbf{Q})] + \tfrac{\kappa}{2} \| \mathbf{P} - \mathbf{Q} \|_F^2,$  (9)

where $\mathbf{T}$ is the Lagrange multiplier, $\kappa > 0$ is the penalty parameter, and $\Delta_{st}^k = \tfrac{1}{n_s} \mathbf{K}_s \mathbf{1}_s - \tfrac{1}{n_t} \mathbf{K}_t \mathbf{1}_t$. Finally, to learn the optimal $\mathbf{P}$, we only need to iteratively minimize the Lagrange function in Eq. (9) with respect to one of the variables while fixing the others. Specifically, the following four steps are repeated until convergence:

(1) Fix $\mathbf{P}$, $\mathbf{T}$, $\kappa$ and update $\mathbf{Q}$. The optimization problem with respect to $\mathbf{Q}$ is:

$\min_{\mathbf{Q}} \; \| \mathbf{X}_s - \mathbf{Q}^T \mathbf{K}_s \|_F^2 + \lambda \| \mathbf{Q}^T \Delta_{st}^k \|_F^2 + \mathrm{tr}[\mathbf{T}^T(\mathbf{P} - \mathbf{Q})] + \tfrac{\kappa}{2} \| \mathbf{P} - \mathbf{Q} \|_F^2,$

whose closed-form solution is

$\mathbf{Q} = \left( \mathbf{K}_s \mathbf{K}_s^T + \lambda \Delta_{st}^k (\Delta_{st}^k)^T + \kappa \mathbf{I} \right)^{-1} \left( \mathbf{K}_s \mathbf{X}_s^T + \kappa \mathbf{P} + \mathbf{T} \right).$

(2) Fix $\mathbf{Q}$, $\mathbf{T}$, $\kappa$ and update $\mathbf{P}$:

$\min_{\mathbf{P}} \; \tfrac{\mu}{\kappa} \| \mathbf{P} \|_{2,1} + \tfrac{1}{2} \left\| \mathbf{P} - \left( \mathbf{Q} - \tfrac{\mathbf{T}}{\kappa} \right) \right\|_F^2.$

By using the soft-thresholding operator, we are able to obtain the optimal $\mathbf{P}$ according to the following criterion:

$\mathbf{P}_{ij} = \begin{cases} (\mathbf{Q}_{ij} - \mathbf{T}_{ij}/\kappa) - \mu/\kappa, & \text{if } (\mathbf{Q}_{ij} - \mathbf{T}_{ij}/\kappa) > \mu/\kappa; \\ (\mathbf{Q}_{ij} - \mathbf{T}_{ij}/\kappa) + \mu/\kappa, & \text{if } (\mathbf{Q}_{ij} - \mathbf{T}_{ij}/\kappa) < -\mu/\kappa; \\ 0, & \text{otherwise}, \end{cases}$
where $\mathbf{P}_{ij}$, $\mathbf{Q}_{ij}$, and $\mathbf{T}_{ij}$ are the elements in the $i$-th row and $j$-th column of the corresponding matrices.

(3) Update $\mathbf{T}$ and $\kappa$: $\mathbf{T} = \mathbf{T} + \kappa(\mathbf{P} - \mathbf{Q})$ and $\kappa = \min(\rho\kappa, \kappa_{\max})$, where $\rho$ is a scaling parameter.

(4) Check convergence: $\| \mathbf{P} - \mathbf{Q} \|_{\infty} < \epsilon$, where $\epsilon$ denotes the machine epsilon.

Figure 2: Examples from the SMIC and CASME II micro-expression databases. From left to right: image frames of video clips from (a) SMIC (HS), (b) SMIC (NIR), (c) SMIC (VIS), and (d) CASME II, respectively.
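The IALM updates in steps (1)–(4) can be sketched in NumPy as follows. This is a minimal illustration under the conventions written above, with placeholder hyperparameters ($\rho$, initial $\kappa$, tolerance) rather than the paper's tuned settings, and a linear kernel in the usage example purely for brevity:

```python
import numpy as np

def tsrg_ialm(Xs, Ks, Kt, lam=0.1, mu=0.01, kappa=1.0,
              rho=1.1, kappa_max=1e6, eps=1e-6, max_iter=500):
    """Solve Eq. (8) for P via inexact ALM; Ks, Kt are the stacked kernel blocks."""
    n, d = Ks.shape[0], Xs.shape[0]
    delta = Ks.mean(axis=1, keepdims=True) - Kt.mean(axis=1, keepdims=True)
    P = np.zeros((n, d)); Q = np.zeros((n, d)); T = np.zeros((n, d))
    A = Ks @ Ks.T + lam * (delta @ delta.T)       # fixed part of the Q-update
    for _ in range(max_iter):
        # (1) closed-form Q-update
        Q = np.linalg.solve(A + kappa * np.eye(n), Ks @ Xs.T + kappa * P + T)
        # (2) element-wise soft-thresholding P-update
        V = Q - T / kappa
        P = np.sign(V) * np.maximum(np.abs(V) - mu / kappa, 0.0)
        # (3) multiplier and penalty update
        T = T + kappa * (P - Q)
        kappa = min(rho * kappa, kappa_max)
        # (4) convergence check
        if np.abs(P - Q).max() < eps:
            break
    return P

rng = np.random.default_rng(0)
d, ns, nt = 4, 15, 10
Xall = rng.normal(size=(d, ns + nt))
K = Xall.T @ Xall                 # linear kernel over all samples, (n x n)
Ks, Kt = K[:, :ns], K[:, ns:]
P = tsrg_ialm(Xall[:, :ns], Ks, Kt)
print(P.shape)  # (25, 4)
```

Once $\mathbf{P}$ is learned, a sample $\mathbf{x}$ is re-generated as $G(\mathbf{x}) = \mathbf{P}^T \mathbf{k}_x$, where $\mathbf{k}_x$ stacks the kernel evaluations between all training samples and $\mathbf{x}$.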
4 EXPERIMENTS
In this section, we conduct extensive experiments between the SMIC and CASME II micro-expression databases to evaluate the proposed TSRG-based cross-database micro-expression recognition method. The SMIC database was built by Li et al. [23]. It has three subsets, i.e., SMIC (HS), SMIC (VIS), and SMIC (NIR), recorded by a high-speed (HS) camera at 100 fps, a normal visual (VIS) camera at 25 fps, and a near-infrared (NIR) camera, respectively. SMIC (HS) contains 164 micro-expression clips from 16 subjects, while SMIC (VIS) and SMIC (NIR) both consist of 71 samples from 8 participants. The samples of all three SMIC subsets are divided into three micro-expression categories, i.e., Positive, Negative, and Surprise. The CASME II database was collected by Yan et al. [43]. (CASME II is publicly available and can be downloaded from http://fu.psych.ac.cn/CASME/casme2-en.php.) It contains 247 micro-expression samples from 26 subjects, categorized into five micro-expression classes: Happiness, Surprise, Disgust, Repression, and Others. In this paper, the face images in the video clips from the CASME II database are cropped and then transformed to 308 × 139 pixels for the experiments.

To see the differences among the three SMIC subsets and CASME II, Fig. 2 shows an example image frame from a micro-expression video clip of each of the four datasets.

The cross-database micro-expression experiments in this paper are conducted between CASME II and one subset of SMIC, i.e., CASME II vs. SMIC (HS), CASME II vs. SMIC (VIS), and CASME II vs. SMIC (NIR). The two datasets in each combination alternately serve as the source and target databases, so there are six groups of experiments in total, denoted for convenience by Exp.1, Exp.2, Exp.3, Exp.4, Exp.5, and Exp.6. In order to make CASME II and the three SMIC subsets share the same micro-expression categorization, we select the samples belonging to Happiness, Surprise, Disgust, and Repression from CASME II and re-label them. Specifically, the samples of Happiness are given Positive labels, and the samples of Disgust and

Table 1: The numbers of different micro-expression samples in the CASME II and SMIC datasets for the cross-database micro-expression recognition experiments.

Dataset       Negative   Positive   Surprise
CASME II          91         32         25
SMIC (HS)         70         51         43
SMIC (VIS)        28         23         20
SMIC (NIR)        28         23         20
Repression are categorized into Negative. The labels of
Surprise samples are unchanged.

The sample constitution of SMIC and CASME II under this consistent micro-expression categorization is shown in Table 1. From Table 1, it can be seen that a class imbalance problem exists in the CASME II and SMIC (HS) datasets, i.e., the number of samples of one micro-expression category is significantly larger or smaller than that of the others. Consequently, for better reporting the experimental results, we choose two metrics widely used in cross-database speech emotion recognition, i.e., the unweighted average recall (UAR) and the weighted average recall (WAR) [33]. According to the definitions in [33], WAR is the normal recognition accuracy, while UAR is the sum of the per-class accuracies divided by the number of classes, without consideration of the number of samples per class. Comparing the WAR and UAR of a method comprehensively reveals its true performance. For example, a big gap between a high WAR and a low UAR usually means that most target samples are predicted as the micro-expression category whose sample share is dominant among all the micro-expression samples. Consequently, such a method cannot be deemed to perform well even though it achieves a high WAR (recognition accuracy).

For comparison, some recently proposed well-performing cross-database emotion (speech emotion and facial expression) recognition methods are chosen, including KMM [13, 14], KLIEP [13, 35], uLSIF [13, 19], and STM [3, 4]. Since the STM method has been introduced in the related-work section, we briefly introduce the other three here. KMM was proposed by Huang et al. [14] and aims at learning a set of weights for the source samples such that the weighted source samples and the target samples satisfy the MMD criterion and hence obey the same or similar feature distributions. KLIEP was proposed by Sugiyama et al. [35]; its key idea is to learn importance weights for the target samples based on the KL divergence so as to balance the feature distribution gap between the source and target samples. uLSIF, proposed by Kanamori et al. [19], also learns importance weights for the source samples; its main novelty is that it can be converted into a regularized model with a closed-form solution, which makes it very fast compared with KMM. Note that SVM serves as the classifier for all the methods. Besides, SVM without any domain adaptation is also included for comparison. The detailed parameter settings of all the methods and the micro-expression features are as follows:

(1) For the micro-expression feature, we use uniform LBP-TOP [44]; the neighboring radius R and the number of neighboring points P of the LBP operator on the three orthogonal planes are set as 3 and 8, respectively. Besides, following the work of [45], a multi-scale spatial division scheme consisting of 1 × 1, 2 × 2, 4 × 4, and 8 × 8 grids is adopted. (2) For KMM, B and ϵ are set as 1000 and (√n_tr − 1)/√n_tr, where n_tr denotes the number of training samples. (3) For STM, following the work of [3, 4], the upper limit of the importance weight B and ϵ are set to the same values as those of KMM. (4) For KLIEP, no parameter needs to be set, while for uLSIF, STM, and the proposed TSRG there are trade-off parameters to set. Since the label information of the target domain is entirely unknown, cross-validation is not feasible for determining the trade-off parameters. Consequently, to offer a fair comparison among all the methods, we use a parameter grid search strategy and report the best results, which correspond to the following optimal trade-off parameters: (a) for uLSIF, the trade-off parameter λ is fixed at 570, 36 ×, 89, 1, 79, and 42 × in Exp.1 to Exp.6, respectively; (b) for STM, λ is fixed at 4, 0.04, 8, 0.02, 15, and 0.03 in Exp.1 to Exp.6, respectively; (c) for our TSRG, the optimal trade-off parameters (λ, µ) are (0.001, 0.03), (0.1, 0.007), (1, 0.07), (0.01, 0.005), (0.01, 0.003), and (10, 0.001) in Exp.1 to Exp.6, respectively. The
WAR and UAR results of the six experiments obtained by all the methods are reported in Tables 2 and 3, respectively. From Tables 2 and 3, it is clear that in all the experiments the proposed TSRG method brings promising performance gains over SVM without domain adaptation. Moreover, our method achieves the best WAR, UAR, or both in most cases, e.g., Exp.1 (best WAR), Exp.3 (best UAR), and Exp.4 (best WAR and UAR). In addition, although STM performs better in terms of UAR in Exp.1, and uLSIF outperforms the proposed TSRG in terms of WAR in Exp.2 and in terms of both WAR and UAR in Exp.5, our TSRG is still competitive overall against uLSIF and STM in these experiments.

Learning a TSRG for Cross-Database Micro-Expression Recognition. MM '17, October 23–27, 2017, Mountain View, CA, USA.

Furthermore, most of the methods perform poorly in Exp.6, which indicates that Exp.6 is a tough task for them. However, it is surprising to see that STM and the proposed TSRG achieve WARs of 60.81% and 42.57%, respectively, which are considerably higher than those of the other four methods in this experiment. By further comparing the WAR and UAR of STM and the proposed TSRG in this experiment, we can clearly see that the gap between the WAR and UAR of the proposed TSRG (42.57% and 43.94%) is much narrower than that of STM (60.81% and 34.32%).
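The gap between WAR and UAR discussed above follows directly from the two definitions: WAR is the overall accuracy (per-class recall weighted by class frequency), whereas UAR averages the per-class recalls with equal weight, so a majority-class predictor can look good under WAR but poor under UAR. A minimal sketch of both metrics (the labels below are illustrative toy data, not taken from the paper's experiments):

```python
def war_uar(y_true, y_pred):
    """Weighted average recall (overall accuracy) and
    unweighted average recall (mean of per-class recalls)."""
    classes = sorted(set(y_true))
    # WAR: fraction of all samples predicted correctly.
    war = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # UAR: average the recall of each class, ignoring class frequency.
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    uar = sum(recalls) / len(recalls)
    return war, uar

# A classifier that always predicts the dominant class of an imbalanced
# set scores a high WAR but a low UAR, mirroring the STM behavior above.
y_true = ["neg"] * 8 + ["pos", "sur"]
y_pred = ["neg"] * 10                  # always predict the majority class
war, uar = war_uar(y_true, y_pred)
print(war, uar)                        # WAR = 0.8, UAR = (1 + 0 + 0) / 3 ≈ 0.33
```

This is why the narrow WAR/UAR gap of TSRG is a meaningful sign of balanced per-class performance.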
Table 2: Experimental results on CASME II and one subset of SMIC (HS, VIS, NIR) databases in recognizing three micro-expressions (Negative, Positive, and Surprise). The results are reported in terms of weighted average recall (WAR, %).
Exp.2 (SMIC (HS) → CASME II): 24.32 27.70 24.32
Exp.4 (SMIC (VIS) → CASME II): 36.49 29.73 36.49 39.19 51.35
Exp.5 (CASME II → SMIC (NIR)): 45.07 26.76 47.89

Figure 3: The confusion matrices of all the comparison methods in Exp.5. From (a) to (f), the experimental results correspond to SVM, KMM, KLIEP, uLSIF, STM, and TSRG, respectively.

This gap is very likely due to the extreme class imbalance problem existing in CASME II. As Table 1 shows, the percentage of
Negative samples is dominant in CASME II. Because of this, most CASME II samples may be mistakenly predicted as Negative by STM, and hence the STM method achieves a low UAR although its WAR leads among all the methods. In other words, the gap of our TSRG is actually more acceptable, which indicates that the proposed TSRG method is less affected by the extreme class imbalance problem existing in CASME II and is more applicable to this experiment.

To verify the above analysis and further observe how the class imbalance in CASME II interferes with each method, we select two experiments, i.e., Exp.5 and Exp.6, where CASME II serves as the source and the target database, respectively, as representatives, and draw the confusion matrices of all six methods in these two experiments. The confusion matrices are shown in Figs. 3 (Exp.5) and 4 (Exp.6), respectively. From Figs. 3 and 4, some interesting findings and conclusions can be obtained:

(1) Firstly, as the confusion matrix of STM in Fig. 4 (Exp.6) shows, nearly all the samples of the CASME II database are predicted as the Negative micro-expression by STM, which is consistent with our earlier analysis. Consequently, the above explanation of the big gap between the WAR and UAR of STM is reasonable.

(2) Secondly, for all the methods, the three micro-expressions are much more easily confused when CASME II serves as the target database (Fig. 4, Exp.6) than in the opposite case where CASME II is used as the source database (Fig. 3, Exp.5). This indicates that if the class imbalance problem occurs in the target database, domain adaptation methods are more likely to be interfered with and hence their performance may decrease.

(3) Thirdly, we notice that compared with Exp.6 (Fig. 4), the confusion among different micro-expressions in Exp.5 (Fig. 3) is relieved promisingly for all the methods. However, it should be pointed out that in Exp.5 (Fig. 3), the results of most methods are still unsatisfactory. Besides the class imbalance problem in CASME II, we believe the heterogeneity of facial images between CASME II and SMIC (NIR) may be a major factor as well. From the works of [23, 43], we know that the samples of SMIC (NIR) and CASME II are recorded by a near-infrared camera and a
M ’17, October 23–27, 2017, Mountain View, CA, USA Y. Zong et al.
high-speed color camera, respectively. Consequently, the images of the video clips from these two datasets are considerably heterogeneous and look very different, which adds difficulty to the experiments between them.

(4) Finally, even though most methods perform poorly in Exp.6 (Fig. 4), the proposed TSRG method can still achieve a satisfactory result, in which the extreme micro-expression confusion that occurs with the other comparison methods is promisingly alleviated (see Fig. 4 (f)).

Table 3: Experimental results on CASME II and one subset of SMIC (HS, VIS, NIR) databases in recognizing three micro-expressions (Negative, Positive, and Surprise). The results are reported in terms of unweighted average recall (UAR, %).
Exp.3 (CASME II → SMIC (VIS)): 42.67 25.72 41.93 53.12 53.10
Exp.4 (SMIC (VIS) → CASME II): 48.94 44.02 49.91 53.31 39.99
Exp.5 (CASME II → SMIC (NIR)): 43.67 23.14 45.58
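The heterogeneity between SMIC (NIR) and CASME II discussed above is, in essence, a feature distribution gap, and the MMD criterion that KMM optimizes gives one way to quantify such a gap. A minimal sketch of a biased empirical RBF-kernel MMD estimate on toy Gaussian data (the sample sizes, dimensions, and kernel bandwidth below are illustrative assumptions, not settings from the paper):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased empirical MMD^2 between samples X and Y with an RBF kernel:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def k(A, B):
        # Pairwise squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 5))       # "source" features
tgt_near = rng.normal(0.0, 1.0, size=(200, 5))  # same distribution
tgt_far = rng.normal(3.0, 1.0, size=(200, 5))   # shifted "target" features

# A shifted target distribution yields a much larger MMD estimate.
print(mmd_rbf(src, tgt_near), mmd_rbf(src, tgt_far))
```

Importance-weighting methods such as KMM choose source-sample weights that minimize exactly this kind of discrepancy between the weighted source set and the target set.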
CONCLUSION

In this paper, we have proposed a Target Sample Re-Generator (TSRG) method to deal with the cross-database micro-expression recognition problem, which is more challenging than the conventional micro-expression recognition one. Given source and target micro-expression samples as input, TSRG re-generates both sets, where the re-generated source samples remain the original ones, while the re-generated target samples share a similar feature distribution with the source samples. Consequently, the classifier learned on the source samples can accurately predict the micro-expression categories of the target samples. Extensive cross-database micro-expression recognition experiments between the CASME II and SMIC databases are conducted to evaluate the performance of the TSRG method. Experimental results show that TSRG achieves promising results and outperforms many recently proposed state-of-the-art cross-database emotion recognition methods.
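The re-generation idea summarized above, i.e., transforming target samples so that their feature distribution matches that of the source, can be illustrated with a much simpler stand-in. The sketch below aligns the target's first- and second-order statistics to the source (a CORAL-style whitening and re-coloring; this is only an illustrative assumption, not the TSRG model itself):

```python
import numpy as np

def align_to_source(src, tgt, eps=1e-6):
    """Map target features so their mean and covariance match the source.
    A CORAL-style second-order alignment used only to illustrate the
    're-generation' idea; it is NOT the paper's TSRG formulation."""
    cs = np.cov(src, rowvar=False) + eps * np.eye(src.shape[1])
    ct = np.cov(tgt, rowvar=False) + eps * np.eye(tgt.shape[1])
    # Whiten the target covariance, then re-color with the source covariance.
    whiten = np.linalg.inv(np.linalg.cholesky(ct))
    recolor = np.linalg.cholesky(cs)
    centered = tgt - tgt.mean(0)
    return centered @ whiten.T @ recolor.T + src.mean(0)

rng = np.random.default_rng(1)
src = rng.normal(0, 1, size=(300, 4))
tgt = rng.normal(2, 3, size=(300, 4))   # shifted and rescaled "target"
tgt_aligned = align_to_source(src, tgt)

# After alignment, the target statistics match the source statistics.
print(np.allclose(tgt_aligned.mean(0), src.mean(0)))                    # True
print(np.allclose(np.cov(tgt_aligned, rowvar=False),
                  np.cov(src, rowvar=False), atol=1e-1))                # True
```

A source-trained classifier would then be applied to `tgt_aligned` instead of `tgt`, which is the same workflow the re-generated target samples enable in TSRG.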
ACKNOWLEDGMENTS

This work was supported by the National Basic Research Program of China under Grant 2015CB351704, the National Natural Science Foundation of China under Grants 61231002 and 61572009, the China Scholarship Council, the Academy of Finland, the Tekes Fidipro Program, and Infotech Oulu. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
REFERENCES
[1] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alex J Smola. 2006. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
[2] George Casella and Roger L Berger. 2002. Statistical Inference. Vol. 2. Duxbury, Pacific Grove, CA.
[3] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn. 2013. Selective transfer machine for personalized facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3515–3522.
[4] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn. 2017. Selective transfer machine for personalized facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 3 (2017), 529–545.
[5] Jun Deng, Xinzhou Xu, Zixing Zhang, Sascha Frühholz, and Björn Schuller. 2017. Universum autoencoder-based domain adaptation for speech emotion recognition. IEEE Signal Processing Letters 24, 4 (2017), 500–504.
[6] Jun Deng, Zixing Zhang, Florian Eyben, and Björn Schuller. 2014. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Processing Letters 21, 9 (2014), 1068–1072.
[7] Jun Deng, Zixing Zhang, Erik Marchi, and Björn Schuller. 2013. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In . IEEE, 511–516.
[8] Richard O Duda, Peter E Hart, and David G Stork. 2012. Pattern Classification. John Wiley & Sons.
[9] Paul Ekman and Wallace V Friesen. 1977. Facial Action Coding System. (1977).
[10] MG Frank, Malgorzata Herbasz, Kang Sinuk, A Keller, and Courtney Nolan. 2009. I see how you feel: Training laypeople and professionals to recognize fleeting emotions. In The Annual Meeting of the International Communication Association. Sheraton New York, New York City.
[11] Mark G Frank, Carl J Maccario, and Venugopal Govindaraju. 2009. Behavior and security. Protecting Airline Passengers in the Age of Terrorism. Greenwood Pub Group, Santa Barbara, California (2009), 86–106.
[12] Felix A Gers, Nicol N Schraudolph, and Jürgen Schmidhuber. 2002. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3, Aug (2002), 115–143.
[13] Ali Hassan, Robert Damper, and Mahesan Niranjan. 2013. On acoustic emotion recognition: compensating for covariate shift. IEEE Transactions on Audio, Speech, and Language Processing 21, 7 (2013), 1458–1468.
[14] Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. 2006. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems. 601–608.
[15] Xiaohua Huang, Su-Jing Wang, Guoying Zhao, and Matti Pietikäinen. 2015. Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1–9.
[16] Xiaohua Huang, Guoying Zhao, Xiaopeng Hong, Wenming Zheng, and Matti Pietikäinen. 2016. Spontaneous facial micro-expression analysis using spatiotemporal completed local quantized patterns. Neurocomputing 175 (2016), 564–578.
[17] Zhengwei Huang, Wentao Xue, Qirong Mao, and Yongzhao Zhan. 2016. Unsupervised domain adaptation for speech emotion recognition using PCANet. Multimedia Tools and Applications (2016), 1–15.
[18] Shuiwang Ji and Jieping Ye. 2009. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 457–464.
[19] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. 2009. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research 10 (2009), 1391–1445.
[20] Dae Hoe Kim, Wissam J Baddar, and Yong Man Ro. 2016. Micro-expression recognition with expression-state constrained spatio-temporal feature representations. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 382–386.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[22] Xiaobai Li, Xiaopeng Hong, Antti Moilanen, Xiaohua Huang, Tomas Pfister, Guoying Zhao, and Matti Pietikäinen. 2017. Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods. IEEE Transactions on Affective Computing (2017).
[23] Xiaobai Li, Tomas Pfister, Xiaohua Huang, Guoying Zhao, and Matti Pietikäinen. 2013. A spontaneous micro-expression database: Inducement, collection and baseline. In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 1–6.
[24] Zhouchen Lin, Minming Chen, and Yi Ma. 2010. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055 (2010).
[25] Yong-Jin Liu, Jin-Kai Zhang, Wen-Jing Yan, Su-Jing Wang, Guoying Zhao, and Xiaolan Fu. 2016. A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Transactions on Affective Computing 7, 4 (2016), 299–310.
[26] Ping Lu, Wenming Zheng, Ziyan Wang, Qiang Li, Yuan Zong, Minghai Xin, and Lenan Wu. 2016. Micro-expression recognition by regression model and group sparse spatio-temporal feature learning. IEICE Transactions on Information and Systems 99, 6 (2016), 1694–1697.
[27] Maureen O'Sullivan, Mark G Frank, Carolyn M Hurley, and Jaspreet Tiwana. 2009. Police lie detection accuracy: The effect of lie scenario. Law and Human Behavior 33, 6 (2009), 530–538.
[28] Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
[29] Tomas Pfister, Xiaobai Li, Guoying Zhao, and Matti Pietikäinen. 2011. Recognising spontaneous facial micro-expressions. In International Conference on Computer Vision. IEEE, 1449–1456.
[30] John A Ruiz-Hernandez and Matti Pietikäinen. 2013. Encoding local binary patterns using the re-parametrization of the second-order Gaussian jet. In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 1–6.
[31] Hesam Sagha, Jun Deng, Maryna Gavryukova, Jing Han, and Björn Schuller. 2016. Cross lingual speech emotion recognition using canonical correlation analysis on principal component subspace. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5800–5804.
[32] Enver Sangineto, Gloria Zen, Elisa Ricci, and Nicu Sebe. 2014. We are not all equal: Personalizing models for facial expression analysis with transductive parameter transfer. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 357–366.
[33] Björn Schuller, Bogdan Vlasenko, Florian Eyben, Martin Wöllmer, Andre Stuhlsatz, Andreas Wendemuth, and Gerhard Rigoll. 2010. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Transactions on Affective Computing 1, 2 (2010), 119–131.
[34] Peng Song, Wenming Zheng, Shifeng Ou, Xinran Zhang, Yun Jin, Jinglei Liu, and Yanwei Yu. 2016. Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization. Speech Communication 83 (2016), 34–41.
[35] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. 2008. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems. 1433–1440.
[36] Su-Jing Wang, Wen-Jing Yan, Xiaobai Li, Guoying Zhao, and Xiaolan Fu. 2014. Micro-expression recognition using dynamic textures on tensor independent color space. In . IEEE, 4678–4683.
[37] Su-Jing Wang, Wen-Jing Yan, Xiaobai Li, Guoying Zhao, Chun-Guang Zhou, Xiaolan Fu, Minghao Yang, and Jianhua Tao. 2015. Micro-expression recognition using color spaces. IEEE Transactions on Image Processing 24, 12 (2015), 6034–6047.
[38] Su-Jing Wang, Wen-Jing Yan, Guoying Zhao, Xiaolan Fu, and Chun-Guang Zhou. 2014. Micro-expression recognition using robust principal component analysis and local spatiotemporal directional features. In Workshop at the European Conference on Computer Vision. Springer, 325–338.
[39] Yandan Wang, John See, Raphael C-W Phan, and Yee-Hui Oh. 2014. LBP with six intersection points: Reducing redundant information in LBP-TOP for micro-expression recognition. In Asian Conference on Computer Vision. Springer, 525–537.
[40] John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. 2009. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems. 2080–2088.
[41] Feng Xu, Junping Zhang, and James Z Wang. 2017. Microexpression identification and categorization using a facial dynamics map. IEEE Transactions on Affective Computing 8, 2 (2017), 254–267.
[42] Keyu Yan, Wenming Zheng, Zhen Cui, and Yuan Zong. 2016. Cross-database facial expression recognition via unsupervised domain adaptive dictionary learning. In International Conference on Neural Information Processing. Springer, 427–434.
[43] Wen-Jing Yan, Xiaobai Li, Su-Jing Wang, Guoying Zhao, Yong-Jin Liu, Yu-Hsin Chen, and Xiaolan Fu. 2014. CASME II: An improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE 9, 1 (2014), e86041.
[44] Guoying Zhao and Matti Pietikäinen. 2007. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 6 (2007), 915–928.
[45] Guoying Zhao and Matti Pietikäinen. 2009. Boosted multi-resolution spatiotemporal descriptors for facial expression recognition. Pattern Recognition Letters.
[46] In IEEE International Conference on Image Processing (ICIP). IEEE, 1935–1939.
[47] Wenming Zheng, Yuan Zong, Xiaoyan Zhou, and Minghai Xin. 2016. Cross-domain color facial expression recognition using transductive transfer subspace learning. IEEE Transactions on Affective Computing (2016).
[48] Yuan Zong, Wenming Zheng, Tong Zhang, and Xiaohua Huang. 2016. Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression.