Audio-video Emotion Recognition in the Wild using Deep Hybrid Networks
Xin Guo*    Luisa F. Polanía†    Kenneth E. Barner*
* University of Delaware, Department of Electrical and Computer Engineering, Newark, DE, USA, {guoxin, barner}@udel.edu
† Target Corporation, Sunnyvale, CA, USA
ABSTRACT
This paper presents an audiovisual emotion recognition hybrid network. While most previous work focuses either on deep models or on hand-engineered features extracted from images, we explore multiple deep models built on both images and audio signals. Specifically, in addition to convolutional neural networks (CNN) and recurrent neural networks (RNN) trained on facial images, the hybrid network also contains one SVM classifier trained on holistic acoustic feature vectors, one long short-term memory network (LSTM) trained on short-term feature sequences extracted from segmented audio clips, and one Inception(v2)-LSTM network trained on image-like maps built from short-term acoustic feature sequences. Experimental results show that the proposed hybrid network outperforms the baseline method by a large margin.
Index Terms — Audio-video emotion recognition, multi-modal fusion, long short-term memory networks
1. INTRODUCTION
Emotion recognition is relevant in many computing areas that take into account the affective state of the user, such as human-computer interaction [1], human-robot interaction [2], music and image recommendation [3], affective video summarization [4], and personal wellness and assistive technologies [5]. Although emotion recognition is an interesting problem, it is also very challenging unless the recording conditions are well controlled. Emotion recognition "in the wild" suffers from many issues that need to be overcome, such as cluttered backgrounds, large variations in face pose and illumination, video and audio noise, and occlusions.

Recently, hybrid neural networks combining CNNs and RNNs [6, 7, 8] have become the state of the art for emotion recognition. Of particular interest are the top-performing works of the EmotiW Challenge, whose goal is to advance emotion recognition in unconstrained conditions by providing researchers with a platform to benchmark the performance of their algorithms on "in-the-wild" datasets. One of the sub-challenges of the EmotiW challenge is the audio-video emotion recognition sub-challenge, which is based on an augmented version of the AFEW dataset [9], containing short video clips extracted from movies that have been annotated with seven different emotions.

Deep learning has played an important role in most of the sub-challenge winning submissions. In 2013, the winners presented a method that combines CNNs for static faces, an auto-encoder for human action recognition, a deep belief network for audio information, and a shallow network architecture for feature extraction of the mouth region [10]. The winners of the 2014 sub-challenge used CNNs for feature extraction of the aligned faces provided by the challenge organizers [11], while in 2016, the winners of the sub-challenge proposed a hybrid network architecture that combines 3D CNNs and a CNN-RNN in a late-fusion fashion [6].

While a variety of image-based methods have been proposed, the audio channel has been explored to a lesser extent. Existing approaches that exploit the audio channel for emotion recognition include support vector machines (SVM) [12, 6], random forests [13, 14], and CNNs trained on comprehensive acoustic vectors extracted by openSMILE [15]. In this paper, we propose to fully exploit the audio-channel information. Inspired by the recurrent support vector machines designed by Wang and Metze [16] for event detection, we propose an LSTM [17] trained on short-term audio features extracted from segmented audio clips. Furthermore, a CNN-RNN network trained on image-like maps, formed by stacking short-term audio features, is also presented. The proposed hybrid network (Fig. 1) combines a CNN-RNN network trained on images with audio-based models, and surpasses the audio-video emotion recognition sub-challenge baseline of 38.81% on the validation set with significant gains.

This work is supported by the National Science Foundation under Grant No. 1319598.

Fig. 1. The overall structure of the proposed hybrid network. For visualization purposes, only one VGG-LSTM on faces is shown in the diagram; however, note that the hybrid network contains two VGG-LSTMs with the same network structure but trained on faces detected by different methods.
2. THE PROPOSED METHOD

2.1. VGG-LSTM based on Faces
A traditional CNN-LSTM neural network [6, 14] is explored to learn emotion from faces. Video frames are extracted at a frequency of 60 fps. Faces and facial landmarks are first detected within each frame using the method described in [18]; then a 2D affine transformation is applied that aligns the left and right eye corners of all images to the same positions (the face detection and alignment code is developed based on [19]).

Aligned faces are used as input to the VGG-16 convolutional neural network [20]. The VGG architecture is modified by changing the number of neurons in the last layer to 7, corresponding to the 7 emotion classes. This modified VGG architecture is initialized with the parameters of the VGG-FACE model, except for the last fully connected layer, which is initialized with weights sampled from a zero-mean Gaussian distribution and trained from scratch with the learning rate of its weight and bias filters set to be 10 times larger than the overall learning rate. The VGG-FACE model was obtained by training the 16-layer VGG architecture for face recognition on a large-scale dataset containing 2.6M images of 2.6K celebrities and public figures [21].

The training procedure is three-fold. First, the modified VGG network is trained on the facial expression recognition 2013 (FER-2013) database [22], which contains facial images labeled with the basic emotions. The idea of this step is to transfer the knowledge from face recognition to facial emotion recognition. Second, the resulting model is fine-tuned on the detected faces of the AFEW dataset. Third, the layers of the fine-tuned model after the "fc6" layer are replaced by a one-layer LSTM and a final fully connected layer with 7 output units. The weights of the LSTM are initialized with values drawn from a uniform distribution over [-0.01, 0.01] and the bias terms are initialized to 0. The combined VGG-LSTM is trained end-to-end. Face images extracted at every 8 frames of the original video sequence are selected as input to the VGG-LSTM network. Experimental results show that this frame gap helps improve the classification accuracy, since facial change is more visible in this way.

Unlike some existing works that first train the CNN and use its "fc6" features as input vector sequences for the LSTM network, the proposed structure connects the VGG and LSTM networks end-to-end and learns all the parameters simultaneously. Experimental results show that our VGG-LSTM outperforms the results of the winner of the audio-video emotion recognition sub-challenge in 2016.

Table 1. Confusion matrix of the VGG-LSTM network, trained on aligned faces, on the validation set. Rows are ground-truth classes and columns are predictions, in % (AN: anger, DI: disgust, FE: fear, HA: happiness, NE: neutral, SA: sadness, SU: surprise).

      AN     DI     FE     HA     NE     SA     SU
AN  53.12   6.25   7.81   0     17.19   3.12  12.50
DI  17.50  27.50   7.50   2.50  25.00  15.00   5.00
FE  21.74   4.35  23.91  13.04   6.52  17.39  13.04
HA   7.94   1.59   0     84.13   0      4.76   1.59
NE  11.11  11.11   7.94   6.35  53.97   6.35   3.17
SA   8.20   4.92   1.64   6.56  22.95  55.74   0
SU  21.74   6.52  17.39   4.35  17.39   4.35  28.26
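For concreteness, the following is a minimal sketch of an end-to-end CNN-LSTM of the kind described in Section 2.1, written with tf.keras. The ImageNet initialization, the 16-frame sequence length, and the 128-unit LSTM are stand-in assumptions for this sketch; the paper fine-tunes VGG-FACE weights and uses its own LSTM size and sequence length.

```python
# Sketch of an end-to-end CNN-LSTM ("VGG-LSTM") on face sequences.
# ImageNet weights are used as a stand-in for the VGG-FACE initialization;
# SEQ_LEN and the LSTM width are assumptions of this sketch.
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN = 16          # hypothetical sequence length
NUM_CLASSES = 7       # seven emotion classes

# VGG-16 truncated after the first fully connected layer ("fc6"/"fc1").
vgg = tf.keras.applications.VGG16(include_top=True, weights="imagenet")
backbone = Model(vgg.input, vgg.get_layer("fc1").output)

# Apply the backbone to every frame of the face sequence, then model the
# temporal dynamics with a single LSTM layer and classify into 7 emotions.
frames = layers.Input(shape=(SEQ_LEN, 224, 224, 3))
feats = layers.TimeDistributed(backbone)(frames)      # (batch, SEQ_LEN, 4096)
hidden = layers.LSTM(128)(feats)                       # 128 units is an assumption
logits = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = Model(frames, logits)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Because the backbone is wrapped inside the sequence model rather than used as a fixed feature extractor, the CNN and LSTM parameters are updated jointly, which is the end-to-end property emphasized above.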
Table 2. Confusion matrix of the VGG-LSTM network, trained on the faces provided by the challenge organizers, on the validation set.

      AN     DI     FE     HA     NE     SA     SU
AN  64.06   1.56   7.81   1.56  12.50   6.25   6.25
DI  22.50  15.00   5.00  10.00  25.00  20.00   2.50
FE  32.61   8.70  26.09   4.35  13.04   8.70   6.52
HA   9.52   3.17   0     73.02   6.35   6.35   1.59
NE  14.29  11.11   1.59   3.17  63.49   6.35   0
SA  16.39  11.48   6.56   8.20  13.11  40.98   3.28
SU  32.61   6.52  17.39   0     15.22   8.70  19.57
Table 3. Confusion matrix of the audio SVM model on the validation set.

      AN     DI     FE     HA     NE     SA     SU
AN  76.56   0      3.12   9.38   7.81   3.12   0
DI  25.00   0      0     42.50  20.00  12.50   0
FE  23.91   0     30.43  23.91  13.04   8.70   0
HA  15.87   1.59   9.52  42.86  20.63   9.52   0
NE  12.70   1.59   3.17  34.92  46.03   1.59   0
SA  11.48   0     11.48  26.23  21.31  27.87   1.64
SU  19.57   0     17.39  36.96  13.04  13.04   0
Table 4. Confusion matrix of the audio LSTM model on the validation set.

      AN     DI     FE     HA     NE     SA     SU
AN  48.44   1.56   0     15.62  15.62  18.75   0
DI  15.00   2.50   0     35.00  32.50  15.00   0
FE  30.43   0      0     19.57  28.26  21.74   0
HA  12.70   4.76   0     33.33  30.16  19.05   0
NE   6.35   3.17   0     17.46  52.38  20.63   0
SA  11.48   6.56   0     22.95  32.79  26.23   0
SU  17.39   0      2.17  30.43  36.96  13.04   0
2.2. Audio SVM Model

An SVM classifier, learned on the 1582-dimensional acoustic features extracted using openSMILE, is incorporated into the hybrid network. The acoustic features include low-level descriptors, such as energy, mel-frequency cepstral coefficients (MFCCs), linear predictive coding (LPC), zero-crossing rate (ZCR), spectral flux, spectral roll-off, and chroma vector, and statistical features summarized by functionals, such as mean and standard deviation.
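As an illustration of this stage, the snippet below trains an SVM on pre-extracted holistic acoustic vectors with scikit-learn. The file names, the RBF kernel, and the standardization step are assumptions of the sketch; only the 1582-dimensional openSMILE features and the SVM classifier itself come from the description above.

```python
# Minimal sketch of the audio SVM stage: an SVM trained on utterance-level
# (holistic) acoustic vectors assumed to have been exported beforehand with
# openSMILE. File names and kernel choice are illustrative, not from the paper.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X_train = np.load("opensmile_train.npy")   # shape (n_clips, 1582)
y_train = np.load("labels_train.npy")      # integer labels for the 7 emotions
X_val = np.load("opensmile_val.npy")
y_val = np.load("labels_val.npy")

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))

# Class probabilities are kept so that this model can later be combined with
# the other networks by weighted decision fusion.
val_probs = clf.predict_proba(X_val)
```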
2.3. Audio LSTM Model

Instead of extracting holistic features, each audio signal is divided into segments of length 100 ms, with an overlap of 50%, and segment-level features are extracted to form a sequence of vectors. Specifically, short-term features are extracted for each segment using pyAudioAnalysis [23]. The features include ZCR, energy, entropy of energy, spectral centroid, spectral spread, spectral flux, spectral roll-off, MFCCs, chroma vector, and chroma deviation. For an audio signal of length m ms, the number of segments is n = (m − 50)/50. This feature extraction process therefore results in a sequence of n feature vectors. Since the audio clips have different lengths, a sequence length converter is applied to make the number of segments at least 16 by copying the last feature vector of the sequence 16 − n times whenever the number of segments is less than 16. A one-layer LSTM is trained on the resulting sequence of feature vectors. Unlike the audio SVM model, which focuses on the holistic properties of the signal, the audio LSTM model focuses on learning the dynamic temporal behavior of the audio signal.

2.4. Audio Inception(v2)-LSTM Model

In this section, the sequence of feature vectors from Section 2.3 is converted into sequential image-like maps. Specifically, the feature vectors are organized in matrix form to build an image-like map with n columns. The next step is to segment this image-like map along the time axis into smaller maps using an overlapping factor of 50%. For the architecture proposed in this section, the sequence length n must be a multiple of 17 and no smaller than the map width. If this condition is not satisfied, the last column of the image-like map is replicated n′ − n times, where n′ is the closest multiple of 17 larger than n. This approach results in a sequence of image-like maps of length (n′ − 17)/17.

A network similar to the VGG-LSTM network described in Section 2.1, referred to as Inception(v2)-LSTM, is developed to train on the image-like maps. Instead of using the VGG architecture, we first train Inception-v2 [24] on the individual image-like maps. The number of output units of the last layer is changed to 7, and the training parameters, such as the learning rate and the weight decay, are set the same as the ones used to train Inception-v2 on ImageNet [25].

Table 5. Confusion matrix of the audio Inception(v2)-LSTM model on the validation set.

      AN     DI     FE     HA     NE     SA     SU
AN  56.25   0      0     29.69  10.94   3.12   0
DI  12.50   0      0     57.50  27.50   2.50   0
FE  13.04   0      2.17  45.65  36.96   2.17   0
HA  11.11   0      0     52.38  26.98   9.52   0
NE   6.35   0      0     52.38  39.68   1.59   0
SA   8.20   0      0     63.93  16.39  11.48   0
SU   6.52   2.17   2.17  56.52  26.09   6.52   0
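The sequence length converter of Section 2.3 can be sketched in plain NumPy as follows. The 34-dimensional short-term feature vectors (the default pyAudioAnalysis short-term feature set) are an assumption of this sketch; the minimum of 16 time steps and the repeat-last-vector padding come from the description above, and the random matrix only stands in for a real feature sequence.

```python
# Sketch of the "sequence length converter" for the audio LSTM branch:
# a per-segment feature matrix (assumed 34 features per 100 ms segment) is
# padded to a minimum of 16 time steps by repeating its last column.
import numpy as np

MIN_STEPS = 16
feats = np.random.rand(34, 11)   # placeholder for a 34 x n short-term feature matrix

def pad_sequence(feature_matrix, min_steps=MIN_STEPS):
    """Repeat the last feature vector until the sequence has min_steps columns."""
    n = feature_matrix.shape[1]
    if n >= min_steps:
        return feature_matrix
    pad = np.repeat(feature_matrix[:, -1:], min_steps - n, axis=1)
    return np.concatenate([feature_matrix, pad], axis=1)

padded = pad_sequence(feats)              # 34 x 16
lstm_input = padded.T[np.newaxis, ...]    # (batch=1, time=16, features=34) for an LSTM
print(lstm_input.shape)
```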
Table 6. Confusion matrix of submission 6 when evaluating the hybrid network on the testing set.

      AN     DI     FE     HA     NE     SA     SU
AN  77.55   0      4.08   8.16   7.14   2.04   1.02
DI  32.50  10.00   2.50  12.50  20.00  20.00   2.50
FE  31.43   0     50.00   1.43   5.71   7.14   4.29
HA  20.83   0      1.39  63.89  10.42   3.47   0
NE  16.69   1.04   7.77   6.22  50.78  11.92   2.59
SA  22.50   1.25  11.25  11.25  16.25  36.25   1.25
SU  10.71   3.57  35.71  10.71  14.29  25.00   0
After the training of the modified Inception-v2 on the individual image-like maps, the layers after the "global pool" layer of the Inception-v2 architecture are replaced by a one-layer LSTM and a fully connected layer with 7 outputs. The resulting network is referred to as Inception(v2)-LSTM. This network takes a sequence of 8 image-like maps at a time and learns the features end-to-end to model the dynamic temporal properties of the sequence. Since the sequence of image-like maps has length (n′ − 17)/17 and must contain at least 8 maps to serve as input to the Inception(v2)-LSTM architecture, the last image-like map of the sequence is copied 8 − (n′ − 17)/17 times whenever the sequence is shorter than 8 maps. The initial learning rate is decayed by a constant factor at fixed iteration intervals during training.
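One possible reading of the image-like map construction is sketched below: the short-term feature matrix is padded to the next multiple of 17 columns and cut into square 34 × 34 maps with 50% overlap. The 34 × 34 map size and the 50% overlap are assumptions consistent with, but not quoted from, the description above.

```python
# Sketch of turning a 34 x n short-term feature matrix into a sequence of
# image-like maps for the Inception(v2)-LSTM branch. Map size (34 x 34) and
# 50% overlap (step of 17 columns) are assumptions of this sketch.
import numpy as np

def to_image_like_maps(feature_matrix, height=34, width=34, step=17):
    """Pad the matrix to a multiple of `step` columns (at least `width`),
    then slide a window of `width` columns with 50% overlap."""
    n = feature_matrix.shape[1]
    n_prime = max(width, int(np.ceil(n / step)) * step)
    if n_prime > n:
        pad = np.repeat(feature_matrix[:, -1:], n_prime - n, axis=1)
        feature_matrix = np.concatenate([feature_matrix, pad], axis=1)
    maps = [feature_matrix[:, i:i + width]
            for i in range(0, n_prime - width + 1, step)]
    return np.stack(maps)   # (sequence_length, 34, 34)

maps = to_image_like_maps(np.random.rand(34, 40))
print(maps.shape)   # (2, 34, 34): n = 40 is padded to n' = 51, giving (51 - 17) / 17 = 2 maps
```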
3. EXPERIMENTAL RESULTS

3.1. Database
The AFEW database used in EmotiW 2017 contains 773, 383, and 653 audio-video movie clips in the training, validation, and testing sets, respectively. The task is to assign a single emotion label to each video clip from the seven basic emotions, namely anger, disgust, fear, happiness, neutral, sadness, and surprise. Participants compete on the accuracy on the testing data. Note that since the class distribution is unbalanced, the accuracy participants compete on is the overall accuracy, computed on all the samples of the testing set; the unweighted average of the per-class accuracies is also provided in this paper.

3.2. Results

Confusion matrices for each model are shown in Tables 1 through 5. One VGG-LSTM model is trained on the aligned faces, which are obtained as described in Section 2.1, and another VGG-LSTM model is trained on the faces provided by the challenge organizers. Our best VGG-LSTM model outperforms the accuracy obtained by the winner of the 2016 audio-video emotion recognition sub-challenge, which suggests that the frame gap introduced by the proposed VGG-LSTM model is a better way to represent the dynamics of facial expression in video. The second VGG-LSTM model, trained on the faces provided by the challenge organizers, complements the proposed model, and the combination of the two improves the classification accuracy on the validation set.

The audio models, including the audio SVM, audio LSTM, and audio Inception(v2)-LSTM, have lower accuracy than the VGG-LSTM models trained on faces. However, they perform well on the anger class and therefore improve the overall accuracy of the hybrid network.

The aforementioned deep models are combined using decision fusion. Grid search is employed to find the model weights that maximize the classification accuracy on the validation set. The fused hybrid network surpasses the challenge baseline accuracy of 38.81% on the validation set. When trained on a combination of the training and validation sets, the hybrid network is evaluated on the testing set; the corresponding confusion matrix is shown in Table 6.
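The decision-fusion step can be illustrated with the following sketch: a weighted average of per-class probabilities combined with a brute-force grid search over model weights. The weight grid and variable names are illustrative rather than the exact configuration used for the submissions.

```python
# Illustrative weighted decision fusion with a grid search over model weights.
# Each entry of prob_list is assumed to be an (n_samples x 7) matrix of
# per-class probabilities on the validation set.
import itertools
import numpy as np

def fuse(prob_list, weights):
    """Weighted average of per-class probability matrices."""
    fused = sum(w * p for w, p in zip(weights, prob_list))
    return fused / sum(weights)

def grid_search_weights(prob_list, labels, grid=np.arange(0.0, 1.1, 0.25)):
    """Exhaustively try weight combinations and keep the most accurate one."""
    best_acc, best_w = -1.0, None
    for w in itertools.product(grid, repeat=len(prob_list)):
        if sum(w) == 0:
            continue
        acc = np.mean(np.argmax(fuse(prob_list, w), axis=1) == labels)
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc

# Example use (hypothetical variables):
# prob_list = [vgg_lstm_probs, vgg_lstm2_probs, svm_probs, lstm_probs, inception_lstm_probs]
# weights, acc = grid_search_weights(prob_list, val_labels)
```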
4. CONCLUSIONS
In this paper, we proposed an audiovisual hybrid network that combines the predictions of multiple models for emotion recognition in the wild, with an emphasis on exploring the audio channel. The proposed method surpasses the audio-video emotion recognition sub-challenge baseline of 38.81% on the validation set with significant gains.

5. REFERENCES

[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
[2] D. Kulic and E.A. Croft, "Affective state estimation for human–robot interaction," IEEE Transactions on Robotics, vol. 23, no. 5, pp. 991–1000, 2007.
[3] M. Shan, F. Kuo, M. Chiang, and S. Lee, "Emotion-based music recommendation by affinity discovery from film music," Expert Systems with Applications, vol. 36, no. 4, pp. 7666–7674, 2009.
[4] H. Joho, J.M. Jose, R. Valenti, and N. Sebe, "Exploiting facial expressions for affective video summarisation," in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009, p. 31.
[5] M. Pantic, A. Pentland, A. Nijholt, and T.S. Huang, "Human computing and machine understanding of human behavior: A survey," in Artificial Intelligence for Human Computing, pp. 47–71. Springer, 2007.
[6] Y. Fan, X. Lu, D. Li, and Y. Liu, "Video-based emotion recognition using CNN-RNN and C3D hybrid networks," in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 445–450.
[7] P. Khorrami, T. Le Paine, K. Brady, C. Dagli, and T.S. Huang, "How deep neural networks can improve emotion recognition on video data," in IEEE International Conference on Image Processing. IEEE, 2016, pp. 619–623.
[8] S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, "Recurrent neural networks for emotion recognition in video," in Proceedings of the 2015 ACM International Conference on Multimodal Interaction. ACM, 2015, pp. 467–474.
[9] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Collecting large, richly annotated facial-expression databases from movies," IEEE MultiMedia, vol. 19, no. 3, pp. 34–41, July 2012.
[10] S.E. Kahou et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proceedings of the 15th International Conference on Multimodal Interaction. ACM, 2013, pp. 543–550.
[11] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, "Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild," in Proceedings of the 16th International Conference on Multimodal Interaction. ACM, 2014, pp. 494–501.
[12] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Sept. 1995.
[13] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.
[14] B. Sun, Q. Wei, L. Li, Q. Xu, J. He, and L. Yu, "LSTM for dynamic emotion and group emotion recognition in the wild," in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 451–457.
[15] J. Yan et al., "Multi-clue fusion for emotion recognition in the wild," in Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 2016, pp. 458–463.
[16] Y. Wang and F. Metze, "Recurrent support vector machines for audio-based multimedia event detection," in Proceedings of the 2016 ACM International Conference on Multimedia Retrieval. ACM, 2016, pp. 265–269.
[17] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[18] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in CVPR, 2014.
[19] T. Hassner, S. Harel, E. Paz, and R. Enbar, "Effective face frontalization in unconstrained images," in CVPR, June 2015.
[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[21] O.M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in BMVC, 2015, vol. 1, p. 6.
[22] I.J. Goodfellow et al., "Challenges in representation learning: A report on three machine learning contests," in International Conference on Neural Information Processing. Springer, 2013, pp. 117–124.
[23] T. Giannakopoulos, "pyAudioAnalysis: An open-source Python library for audio signal analysis," PLoS ONE, vol. 10, no. 12, 2015.
[24] C. Szegedy et al., "Rethinking the Inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015.
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.