MUSE2020 Challenge Report
Ruichen Li ([email protected])
School of Information, Renmin University of China, Beijing, China

Jingwen Hu
School of Information, Renmin University of China, Beijing, China

Shuai Guo
School of Information, Renmin University of China, Beijing, China

Jinming Zhao ([email protected])
School of Information, Renmin University of China, Beijing, China
Abstract
This paper is a brief report on our entry to the MuSe 2020 challenge. We present our solution for the MuSe-Wild sub-challenge, whose aim is to investigate sentiment analysis methods in real-world situations. Our solutions achieve the best CCC performance of 0.4670 for arousal and 0.3571 for valence on the challenge validation set, outperforming the baseline system with corresponding CCCs of 0.3078 and 0.1506.
ACM Reference Format:
Ruichen Li, Jingwen Hu, Shuai Guo, and Jinming Zhao. 2020. MUSE2020 Challenge Report. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 Introduction
Multimodal sentiment analysis is a significant task that helps people better leverage everyday data from the visual, textual, and acoustic modalities. Many systems build on it, including conversational agents [3], educational tutoring [2], and others. MuSe-Wild, a sub-challenge of MuSe (Multimodal Sentiment Analysis in Real-life Media), asks participants to predict the levels of the affective dimensions of arousal (a measure of affective activation) and valence (a measure of pleasure). These dimensions are described by dimensional theory, one of the most important theories in affective computing, on which several significant emotion recognition systems have been built. Previous studies on multimodal sentiment analysis have applied LSTM+Self-Att and End2You models, which perform well in this task.
Our contributions to the challenge in this paper are twofold:
• We investigate several efficient features from the acoustic, visual, and textual modalities that differ from the features provided by MuSe.
• Our ensemble model achieves the best mean CCC of 0.4121 over arousal and valence, outperforming the baseline systems' corresponding 0.2047.
2 Features
2.1 Text Features
We tried BERT [4], ALBERT [6], and GloVe models to encode the text. Specifically, pretrained models are used to extract text features; we then average the word or character features whose timestamps overlap a given 250ms frame as a method of alignment. We refer to these features as "bert_cover", "albert_cover", and "glove_cover".
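As an illustration of this alignment step, here is a minimal sketch (function and variable names are our own, not the authors' code) that averages the token features whose time spans overlap each 250ms frame:

```python
import numpy as np

def align_to_frames(features, spans, num_frames, frame_ms=250):
    """Average word/character features whose time spans overlap each frame.

    features: (num_tokens, dim) array of pretrained-model embeddings
    spans:    list of (start_ms, end_ms) timestamps, one per token
    """
    dim = features.shape[1]
    aligned = np.zeros((num_frames, dim), dtype=np.float32)
    for f in range(num_frames):
        f_start, f_end = f * frame_ms, (f + 1) * frame_ms
        # every token whose span overlaps [f_start, f_end)
        idx = [i for i, (s, e) in enumerate(spans) if s < f_end and e > f_start]
        if idx:
            aligned[f] = features[idx].mean(axis=0)
        # frames with no overlapping token keep the zero vector
    return aligned
```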
2.2 Acoustic Features
We tried the eGeMAPS [5] LLD (low-level descriptor) set to describe acoustic emotion features. In addition, a self-supervised model, wav2vec [8], is used to improve acoustic model training. To obtain better results, the wav2vec model is pretrained on the LibriSpeech dataset in advance. We refer to these features by the same names as their sources.
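For the wav2vec features, extraction can follow the usage documented in the fairseq repository for the 2019 wav2vec model; the checkpoint path is a placeholder, and this is a sketch rather than the authors' exact pipeline:

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load a (LibriSpeech-pretrained) wav2vec checkpoint
cp = torch.load('/path/to/wav2vec_large.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

wav_input_16khz = torch.randn(1, 160000)  # 10 s of 16 kHz audio (dummy input)
with torch.no_grad():
    z = model.feature_extractor(wav_input_16khz)  # local encoder features
    c = model.feature_aggregator(z)               # context features used as "wav2vec"
```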
2.3 Visual Features
We use the DenseFace model and the VGGFace feature as visual input. The DenseFace model is pretrained on the FER+ dataset [1], using the same structure and finetuning strategy proposed in [9]. We extract features from the last mean-pooling layer of the finetuned model and refer to them as "denseface". We also use a VGG16 model pretrained on the VGGFace dataset [7] and finetuned in the same way as DenseFace; this feature is referred to as "vggface".
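As a sketch, extracting a "denseface"-style vector from the last mean-pooling stage could look like the following, assuming a torchvision-style DenseNet layout (the actual DenseFace implementation in [9] may organize its layers differently):

```python
import torch
import torch.nn.functional as F
from torchvision.models import densenet121

# Stand-in for the FER+-finetuned DenseFace model described above
model = densenet121(num_classes=8)  # hypothetical: 8 FER+ emotion classes
model.eval()

face_batch = torch.randn(4, 3, 224, 224)  # dummy batch of aligned face crops
with torch.no_grad():
    fmap = F.relu(model.features(face_batch))               # conv trunk output
    denseface = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # (B, C) pooled feature
```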
3 Proposed Method
In this section we describe our proposed method in detail. We use combinations of the different features described in Section 2 to solve the MuSe-Wild challenge.

We first formulate the problem. Each video is framed into segments of 250ms, giving

$v = [X, Y], \quad X = \{x_i^j\}, \quad Y = \{y^j\}, \quad i = 1, 2, \dots, K, \; j = 1, 2, \dots, t \quad (1)$

where $K$ is the number of different feature types used as model input and $t$ is the number of segments in a video. Each $y^j$ is a number in $[-1, 1]$, which can be arousal or valence, and $x_i^j$ is the feature vector of the $i$-th feature type at the $j$-th timestamp.

For the different input features, we first concatenate them by timestamp,

$z^j = \mathrm{concat}([x_1^j, x_2^j, \dots, x_K^j]) \quad (2)$

then a fully-connected layer with ReLU activation maps the features into an embedding space, and an LSTM module encodes the temporal information of the input sequence. Finally, two fully-connected layers are used for label regression:

$\hat{z}^j = \mathrm{ReLU}(W z^j + b) \quad (3)$

$h = \mathrm{LSTM}([\hat{z}^1, \hat{z}^2, \dots, \hat{z}^t]), \quad h = [h^1, h^2, \dots, h^t] \quad (4)$

$\hat{y}^j = \mathrm{fully\_connect}(h^j) \quad (5)$

We compute the MSE (mean squared error) between each $y^j$ and $\hat{y}^j$ as the supervision signal during training:

$loss = \sum_{j=1}^{t} (y^j - \hat{y}^j)^2 \quad (6)$

4 Experiments
4.1 Experimental Setup
As described in Section 3, we use an LSTM as our encoder; the number of layers is set to 1 and the number of hidden units is optimized for each input feature. The Adam optimizer is applied to train the model, and we set the maximum time step to 100 and the dropout rate to 0.5.
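To make the architecture concrete, below is a minimal PyTorch sketch of the model described by Equations (2)-(5). The embedding and hidden sizes are placeholders, since the paper tunes the hidden units per feature set, and the feature dimensions in the example are hypothetical:

```python
import torch
import torch.nn as nn

class MuseRegressor(nn.Module):
    """concat -> FC+ReLU (Eq. 3) -> 1-layer LSTM (Eq. 4) -> 2 FC layers (Eq. 5)."""
    def __init__(self, feat_dims, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(sum(feat_dims), embed_dim), nn.ReLU())
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Dropout(0.5),                        # dropout rate from the setup above
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, xs):
        # xs: list of K tensors, each (batch, t, d_i); concatenate per timestamp (Eq. 2)
        z = torch.cat(xs, dim=-1)
        h, _ = self.lstm(self.proj(z))
        return self.head(h).squeeze(-1)             # one prediction per 250 ms segment

# Training follows Eq. (6): MSE loss, optimized with Adam.
model = MuseRegressor(feat_dims=[88, 512, 342])     # hypothetical per-feature dimensions
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
```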
4.2 Results
As shown in Table 1, in the uni-modal experiments the lld feature has the best validation performance on arousal, and the bert feature works best on valence. In the multi-modal experiments, the combination of the lld, wav2vec, denseface, au, and bert features performs best, reaching a CCC of 0.4670 on arousal and 0.3571 on valence.

Features                         Arousal  Valence
Uni-modal
  lld                            –        –
Multi-modal
  lld-bert-glove                 0.3378   0.3447
  wav2vec-bert_base-glove        0.3440   0.3556
  lld-wav2vec-denseface-au-bert  0.4670   0.3571
  lld-wav2vec-vggface-au-bert    0.4514   0.3153

Table 1. Experiment results on the validation set.
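Since CCC is the metric reported throughout, the following is a minimal reference implementation of the standard concordance correlation coefficient (a generic definition, not challenge-specific code):

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D sequences."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()  # population covariance
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))
```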
5 Conclusion
In this paper, we explore different efficient deep learning features from the acoustic, visual, and textual modalities for a real-world sentiment analysis task. Our proposed model reaches a CCC of 0.4670 on arousal and 0.3571 on valence on the validation set.
References
[1] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. 2016. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. (2016), 279–283.
[2] Cristina Conati. 2002. Probabilistic Assessment of User's Emotions in Educational Games. Applied Artificial Intelligence 16, 7 (2002), 555–575.
[3] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor. 2001. Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine 18, 1 (2001), 32–80.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Florian Eyben, Klaus R. Scherer, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, et al. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing 7, 2 (2016), 190–202.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. (2019).
[7] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition. (2015).
[8] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-Training for Speech Recognition. (2019).
[9] Jinming Zhao, Ruichen Li, Jingjun Liang, Shizhe Chen, and Qin Jin. 2019. Adversarial Domain Adaption for Multi-Cultural Dimensional Emotion Recognition in Dyadic Interactions. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop (Nice, France) (AVEC '19). Association for Computing Machinery, New York, NY, USA, 37–45.