MUSE2020 Challenge Report
Ruichen Li ([email protected])
School of Information, Renmin University of China, Beijing, China

Jingwen Hu
School of Information, Renmin University of China, Beijing, China

Shuai Guo
School of Information, Renmin University of China, Beijing, China

Jinming Zhao ([email protected])
School of Information, Renmin University of China, Beijing, China
Abstract
This paper is a brief report on our entry to the MuSe 2020 challenge. We present our solution for the MuSe-Wild sub-challenge, whose aim is to investigate sentiment analysis methods in real-world situations. Our solutions achieve the best CCC performance of 0.4670 for arousal and 0.3571 for valence on the challenge validation set, outperforming the baseline system with corresponding CCCs of 0.3078 and 0.1506.
ACM Reference Format:
Ruichen Li, Jingwen Hu, Shuai Guo, and Jinming Zhao. 2020. MUSE2020 Challenge Report. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 Introduction
Multimodal sentiment analysis is a significant task that helps people better leverage everyday data from the visual, textual, and acoustic modalities. Many systems build on it, including conversational agents [3], educational tutoring [2], and others. MuSe-Wild, a sub-challenge of MuSe (Multimodal Sentiment Analysis in Real-life Media), asks participants to predict the levels of the affective dimensions of arousal (a measure of affective activation) and valence (a measure of pleasure). These dimensions are described by dimensional theory, one of the most important theories in affective computing, on which several significant emotion recognition systems have been built. Previous studies on multimodal sentiment analysis have applied LSTM+Self-Att and End2You models, which perform well in this task.
Our contributions to the challenge in this paper are twofold:
• We investigate several efficient features from the acoustic, visual, and textual modalities that differ from the features provided by MuSe.
• Our ensemble model achieves the best mean CCC of 0.4121 over arousal and valence, outperforming the baseline systems' corresponding 0.2047.
2 Features
2.1 Text Features
We tried BERT [4], ALBERT [6], and GloVe models to encode the text. Specifically, pretrained models are used to extract text features; we then average the word or character features whose timestamps overlap a given 250ms frame as a method of alignment. We refer to these features as "bert_cover", "albert_cover", and "glove_cover".
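As an illustration of this alignment step, here is a minimal sketch (function and variable names are our own, not the authors' code) that averages the token features whose time spans overlap each 250ms frame:

```python
import numpy as np

def align_to_frames(features, spans, num_frames, frame_ms=250):
    """Average word/character features whose time spans overlap each frame.

    features: (num_tokens, dim) array of pretrained-model embeddings
    spans:    list of (start_ms, end_ms) timestamps, one per token
    """
    dim = features.shape[1]
    aligned = np.zeros((num_frames, dim), dtype=np.float32)
    for f in range(num_frames):
        f_start, f_end = f * frame_ms, (f + 1) * frame_ms
        # every token whose span overlaps [f_start, f_end)
        idx = [i for i, (s, e) in enumerate(spans) if s < f_end and e > f_start]
        if idx:
            aligned[f] = features[idx].mean(axis=0)
        # frames with no overlapping token keep the zero vector
    return aligned
```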
2.2 Acoustic Features
We tried the eGeMAPS [5] LLD (low-level descriptor) set to describe acoustic emotion features. In addition, a self-supervised model, wav2vec [8], is used to improve acoustic model training. To obtain better results, the wav2vec model is pretrained on the LibriSpeech dataset in advance. We refer to these features by the same names as their sources.
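For the wav2vec features, extraction can follow the usage documented in the fairseq repository for the 2019 wav2vec model; the checkpoint path is a placeholder, and this is a sketch rather than the authors' exact pipeline:

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load a (LibriSpeech-pretrained) wav2vec checkpoint
cp = torch.load('/path/to/wav2vec_large.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

wav_input_16khz = torch.randn(1, 160000)  # 10 s of 16 kHz audio (dummy input)
with torch.no_grad():
    z = model.feature_extractor(wav_input_16khz)  # local encoder features
    c = model.feature_aggregator(z)               # context features used as "wav2vec"
```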
2.3 Visual Features
We use the DenseFace model and the VGGFace feature as visual input. The DenseFace model is pretrained on the FER+ dataset [1], using the same structure and finetuning strategy proposed in [9]. We extract features from the last mean-pooling layer of the finetuned model and refer to them as "denseface". We also use a VGG16 model pretrained on the VGGFace dataset [7] and finetuned in the same way as DenseFace; this feature is referred to as "vggface".
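As a sketch, extracting a "denseface"-style vector from the last mean-pooling stage could look like the following, assuming a torchvision-style DenseNet layout (the actual DenseFace implementation in [9] may organize its layers differently):

```python
import torch
import torch.nn.functional as F
from torchvision.models import densenet121

# Stand-in for the FER+-finetuned DenseFace model described above
model = densenet121(num_classes=8)  # hypothetical: 8 FER+ emotion classes
model.eval()

face_batch = torch.randn(4, 3, 224, 224)  # dummy batch of aligned face crops
with torch.no_grad():
    fmap = F.relu(model.features(face_batch))               # conv trunk output
    denseface = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # (B, C) pooled feature
```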
3 Proposed Method
In this section we describe our proposed method in detail. We use combinations of the different features described in Section 2 to solve the MuSe-Wild challenge.

We first formulate the problem. Each video is framed into segments of 250ms, giving

$v = [X, Y], \quad X = \{x_i^j\}, \quad Y = \{y^j\}, \quad i = 1, 2, \dots, K, \; j = 1, 2, \dots, t \quad (1)$

where $K$ is the number of different feature types used as model input and $t$ is the number of segments in a video. Each $y^j$ is a number in $[-1, 1]$, which can be arousal or valence, and $x_i^j$ is the feature vector of the $i$-th feature type at the $j$-th timestamp.

For the different input features, we first concatenate them by timestamp,

$z^j = \mathrm{concat}([x_1^j, x_2^j, \dots, x_K^j]) \quad (2)$

then a fully-connected layer with ReLU activation maps the features into an embedding space, and an LSTM module encodes the temporal information of the input sequence. Finally, two fully-connected layers are used for label regression:

$\hat{z}^j = \mathrm{ReLU}(W z^j + b) \quad (3)$

$h = \mathrm{LSTM}([\hat{z}^1, \hat{z}^2, \dots, \hat{z}^t]), \quad h = [h^1, h^2, \dots, h^t] \quad (4)$

$\hat{y}^j = \mathrm{fully\_connect}(h^j) \quad (5)$

We compute the MSE (mean squared error) between each $y^j$ and $\hat{y}^j$ as the supervision signal during training:

$loss = \sum_{j=1}^{t} (y^j - \hat{y}^j)^2 \quad (6)$

4 Experiments
4.1 Experimental Setup
As described in Section 3, we use an LSTM as our encoder; the number of layers is set to 1 and the number of hidden units is optimized for each input feature. The Adam optimizer is applied to train the model, and we set the maximum time step to 100 and the dropout rate to 0.5.
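To make the architecture concrete, below is a minimal PyTorch sketch of the model described by Equations (2)-(5). The embedding and hidden sizes are placeholders, since the paper tunes the hidden units per feature set, and the feature dimensions in the example are hypothetical:

```python
import torch
import torch.nn as nn

class MuseRegressor(nn.Module):
    """concat -> FC+ReLU (Eq. 3) -> 1-layer LSTM (Eq. 4) -> 2 FC layers (Eq. 5)."""
    def __init__(self, feat_dims, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sum(feat_dims), embed_dim), nn.ReLU())
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Dropout(0.5),                        # dropout rate from the setup above
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, xs):
        # xs: list of K tensors, each (batch, t, d_i); concatenate per timestamp (Eq. 2)
        z = torch.cat(xs, dim=-1)
        h, _ = self.lstm(self.proj(z))
        return self.head(h).squeeze(-1)             # one prediction per 250 ms segment

# Training follows Eq. (6): MSE loss, optimized with Adam.
model = MuseRegressor(feat_dims=[88, 512, 342])     # hypothetical per-feature dimensions
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
```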
4.2 Results
As shown in Table 1, in the uni-modal experiments the lld feature has the best validation performance on arousal, and the bert feature works best on valence. In the multi-modal experiments, the combination of the lld, wav2vec, denseface, au, and bert features performs best, reaching a CCC of 0.4670 on arousal and 0.3571 on valence.

Features                         Arousal  Valence
Uni-modal
  lld                            –        –
Multi-modal
  lld-bert-glove                 0.3378   0.3447
  wav2vec-bert_base-glove        0.3440   0.3556
  lld-wav2vec-denseface-au-bert  0.4670   0.3571
  lld-wav2vec-vggface-au-bert    0.4514   0.3153

Table 1. Experiment results on the validation set.
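Since CCC is the metric reported throughout, the following is a minimal reference implementation of the standard concordance correlation coefficient (a generic definition, not challenge-specific code):

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D sequences."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()  # population covariance
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))
```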
5 Conclusion
In this paper, we explore different efficient deep learning features from the acoustic, visual, and textual modalities for a real-world sentiment analysis task. Our proposed model reaches a CCC of 0.4670 on arousal and 0.3571 on valence on the validation set.
References
[1] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. 2016. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. (2016), 279–283.
[2] Cristina Conati. 2002. Probabilistic Assessment of User's Emotions in Educational Games. Applied Artificial Intelligence 16, 7 (2002), 555–575.
[3] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor. 2001. Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine 18, 1 (2001), 32–80.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Florian Eyben, Klaus R. Scherer, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, et al. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing 7, 2 (2016), 190–202.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. (2019).
[7] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition. (2015).
[8] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-Training for Speech Recognition. (2019).
[9] Jinming Zhao, Ruichen Li, Jingjun Liang, Shizhe Chen, and Qin Jin. 2019. Adversarial Domain Adaption for Multi-Cultural Dimensional Emotion Recognition in Dyadic Interactions. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop (Nice, France) (AVEC '19). Association for Computing Machinery, New York, NY, USA, 37–45.