Multi-modal Attention for Speech Emotion Recognition
Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li

Institute of Data Science, NUS, Singapore
Graduate School for Integrative Sciences and Engineering, NUS, Singapore
Department of Electrical and Computer Engineering, National University of Singapore (NUS), Singapore
Osaka University, Osaka, Japan

pan_zexu@u.nus.edu, luo@ei.sanken.osaka-u.ac.jp, {eleyji, haizhou.li}@nus.edu.sg

Abstract
Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across three modalities and selectively fuses the information. cLSTM-MMA is combined with other uni-modal sub-networks in late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
Index Terms: speech emotion recognition, multi-modal attention, early fusion, hybrid fusion
1. Introduction
Emotions play an important role in speech communication [1]. The recent advancement of artificial intelligence has equipped machines with intelligence quotient. It is equally important for machines to understand emotions, and to improve their emotional intelligence.

The fact that a voice call is more informative than text messaging suggests that the affective prosody of speech delivers additional information that includes emotion. Similarly, speaking face-to-face is more effective than text messaging and voice calls, which suggests that visual cues also play an important role. Humans express emotion through prosody, gesture, and lexical choice. Emotion is quantized by physiological arousal and hedonic valence level [2], which are only partially expressed through speech. The use of specific phrases further indicates our valence level, and our body language carries the remaining arousal and valence. It has been found that humans rely more on multiple modalities than on a single modality [3] to understand emotions.

Multi-modal speech emotion recognition has been an area of research for decades.
Cho et al. [4] used text to aid speech in the MCNN network. Similarly, Hossain et al. [5] and Xue et al. [6] used visual cues to augment speech using SVM and Sym-cHDP networks respectively. It is evident that emotion recognition benefits from the fusion of speech, visual and text information [7–11]. However, it has not been an easy task to fuse the information from different modalities. As the information coming from different modalities is neither completely independent nor correlated, the fusion mechanism is expected to pick up the right information from the right modality.

Early and late fusion are the typical options in multi-modal classifier design for emotion recognition. The state-of-the-art method introduced the contextual long short-term memory block (cLSTM) and built a late fusion network (cLSTM-LF) [12, 13], in which the predictions of uni-modal models are fused to make a final prediction. It is effective at modelling modality-specific interactions but not cross-modal interactions [14]. There are also studies that explore the interaction between modalities with early fusion [15–17].
Sebastian et al. [15] concatenated the low-level features and passed them through a convolutional neural network. Georgiou et al. [17] concatenated features from different modalities at various levels and used a multi-layer perceptron for emotion prediction. With early fusion, we are able to explore the interaction between raw features across modalities, which is desirable. However, the raw features represent different physical properties of the signals in the respective modalities. Therefore, the classifier network has to learn both the feature abstraction of each modality and the interaction between them at the same time, which is not easy. Furthermore, simple concatenation utilizes whatever information comes from the input streams, which may or may not be relevant to the classification task. Early fusion also potentially suppresses modality-specific interactions [18]. In general, concatenation-based early fusion methods do not outperform late fusion methods in emotion recognition [14, 19].

The Transformer has been effective in natural language processing; it features a self-attention mechanism in which each input feature embedding is first projected into query, key and value embeddings [20]. In the multi-modal situation, the query is from one modality while the key and value are from another modality. The attention between two modalities is computed by the cosine similarity between the query and the key, and the values are then fused based on the attention scores. The attention mechanism in the Transformer is one of the effective solutions to learning cross-modal correlation [21, 22].
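To make the attention computation above concrete, here is a minimal numpy sketch of cross-modal scaled dot-product attention; the function name and shapes are illustrative assumptions, not the implementation of any cited work.

    import numpy as np

    # Minimal sketch: the query comes from one modality; the keys and
    # values come from another. Scores are dot products of the query and
    # keys, scaled and softmax-normalised, then used to take a weighted
    # sum of the values.
    def cross_modal_attention(query, key, value):
        # query: (d,), key: (n, d), value: (n, d_v)
        scores = key @ query / np.sqrt(query.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # softmax over the n positions
        return weights @ value                # attention-weighted fusion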
Tsai et al. [21] used directional pairwise cross-modal attention for sentiment analysis and showed positive results with two-modality attention. In this paper, we explore a mechanism for three-modality attention for the first time. We believe that the speech, visual and text modalities provide complementary evidence of emotion, and three-modality cross-modal attention allows us to take advantage of such evidence.

We propose a multi-modal attention mechanism in place of concatenation to model the correlation between the three modalities in cLSTM-MMA. As cLSTM-MMA takes the multi-modal features as input, we consider it an early fusion sub-network. It consists of three parallel directional multi-modal attention modules for multi-modal fusion. In each module, a query is first computed from one modality. It is then used to compute the cross-modal attention and self-attention scores to find the relevant information answering this query. The three parallel modules have distinct queries, one from each of the three modalities, thus allowing the network to jointly attend to different interactions based on the different queries. The multi-modal attention can be easily scaled up if more than three modalities are present. To take advantage of both late fusion and early fusion, and to account for both modality-specific and cross-modal interactions, we propose a hybrid multi-modal attention network (MMAN) which fuses the predictions of the cLSTM-MMA and the uni-modal cLSTM sub-networks for the final prediction.

The rest of the paper is organized as follows. Section 2 presents the details of the proposed multi-modal attention network. Section 3 describes the experimental setup. Section 4 reports the results and evaluations. Finally, conclusions are drawn in Section 5.
Figure 1:
On the left panel is the proposed multi-modal attention network (MMAN). It consists of a multi-modal attention sub-network (cLSTM-MMA) for early fusion and three uni-modal sub-networks, cLSTM-Text, cLSTM-Visual and cLSTM-Speech. The predictions of the four sub-networks are fused with a dense and a softmax layer in late fusion. The architecture of the cLSTM-MMA sub-network is shown in the red dotted box on the right panel. The symbol ⊕ represents concatenation, and S, V, T represent speech, visual and text respectively. The cLSTM-MMA consists of three independent dense layers for uni-modal feature embedding standardisation, multi-modal attention with three parallel directional multi-modal attention modules, and finally a cLSTM with one LSTM layer inside.
2. Multi-modal attention network
The proposed hybrid fusion network MMAN is shown on the left panel of Figure 1. We take the speech, visual and text feature embeddings of the same utterance as the input. The MMAN consists of a cLSTM multi-modal attention sub-network (cLSTM-MMA) for early fusion, and three uni-modal sub-networks, cLSTM-Speech, cLSTM-Visual and cLSTM-Text, for late fusion. The outputs of the four sub-networks are fused with a dense and a softmax layer.
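As a rough sketch of this late-fusion head, assuming each sub-network outputs one prediction vector per utterance (class and attribute names are hypothetical):

    import torch
    import torch.nn as nn

    class FusionHead(nn.Module):
        """Fuses the predictions of the four sub-networks with a dense
        and a softmax layer, as described above."""
        def __init__(self, n_subnets=4, n_classes=4):
            super().__init__()
            self.dense = nn.Linear(n_subnets * n_classes, n_classes)

        def forward(self, subnet_outputs):          # list of (B, n_classes)
            z = torch.cat(subnet_outputs, dim=-1)   # late-fusion concatenation
            return torch.softmax(self.dense(z), dim=-1)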
The architecture of the cLSTM-MMA sub-network is shown in the red dotted box on the right panel of Figure 1. The cLSTM-MMA consists of three independent dense layers for uni-modal feature embedding standardisation, multi-modal attention with three parallel directional multi-modal attention modules, and finally a cLSTM with one LSTM layer inside.
The three inputs that represent one utterance are first encoded as feature embeddings of different dimensions.
Figure 2:
The details of the directional multi-modal attention module $S \rightarrow (S, V, T)$ with query from speech. The inputs to this module are the uni-modal feature embeddings $(\hat{s}_i, \hat{v}_i, \hat{t}_i)$ after the standardization dense layers.

We first standardize all feature embeddings into the same dimension $d_{model}$ to facilitate the subsequent processing. Let us denote the dataset as $D = \{s_i, v_i, t_i, y_i\}_{i=1:M}$, where $s_i$, $v_i$, $t_i$ and $y_i$ represent the speech, visual and text feature embeddings and the emotion label of utterance $i$, and $M$ is the number of utterances in a conversation. We have $s_i \in \mathbb{R}^{d_s}$, $v_i \in \mathbb{R}^{d_v}$ and $t_i \in \mathbb{R}^{d_t}$, where $d_s$, $d_v$, $d_t$ are the dimensions of the corresponding speech, visual and text features. By passing the original feature embeddings through the individual dense feed-forward layers shown in Figure 1, we standardize the outputs into the same dimension: $\hat{s}_i \in \mathbb{R}^{d_{model}}$, $\hat{v}_i \in \mathbb{R}^{d_{model}}$ and $\hat{t}_i \in \mathbb{R}^{d_{model}}$.

Take the directional multi-modal attention module with the speech query for illustration. It is denoted $S \rightarrow (S, V, T)$, as shown in the blue module in Figure 1. This module computes the directional attention from speech to visual and text, as well as the self-attention of speech. The details of this speech-query module are illustrated in Figure 2.

We use the query, key and value representation to compute the attention. We compute the query of speech $q_s$ through learnable weights $W_{sq} \in \mathbb{R}^{d_{model} \times d_q}$, as shown in Equation 1:

$$ q_s = W_{sq}^T \hat{s}_i \qquad (1) $$

where $d_q$ is the dimension of the query vector. The keys $K_s$ and values $V_s$ are computed using learnable weights $W_{sk}, W_{vk}, W_{tk} \in \mathbb{R}^{d_{model} \times d_k}$ and $W_{sv}, W_{vv}, W_{tv} \in \mathbb{R}^{d_{model} \times d_v}$, where $d_k$ and $d_v$ are the dimensions of the key and value vectors. The computation is shown in Equations 2 and 3:

$$ K_s = \mathrm{concat}\{\hat{s}_i^T W_{sk},\; \hat{v}_i^T W_{vk},\; \hat{t}_i^T W_{tk}\} \qquad (2) $$

$$ V_s = \mathrm{concat}\{\hat{s}_i^T W_{sv},\; \hat{v}_i^T W_{vv},\; \hat{t}_i^T W_{tv}\} \qquad (3) $$

The cross-modal and self-attention scores are computed by the dot product of the query $q_s$ and the keys $K_s$. They are then used to compute the weighted sum of the values, $\hat{z}_{is}$, which represents the interaction of the different modalities answering the speech query. The directional multi-modal attention with the speech query, $D_{S \rightarrow (S,V,T)}$, is given in Equation 4 and illustrated in Figure 2:

$$ \hat{z}_{is} = D_{S \rightarrow (S,V,T)}(\hat{s}_i, \hat{v}_i, \hat{t}_i) = \mathrm{softmax}\!\left(\frac{q_s^T K_s^T}{\sqrt{d_k}}\right) V_s \qquad (4) $$

The same computing procedure is applied to the text and visual directional multi-modal attention modules, except that each module has its own learnable weights for computing the query, which facilitates the learning of different interactions based on the different directional queries. The outputs of the three parallel attention modules are concatenated with a skip connection. The output of the multi-modal attention is then passed through a cLSTM block with one LSTM layer, as shown in Figure 1, to capture the contextual cues between consecutive utterances in a conversation [13].
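A hedged PyTorch sketch of one directional multi-modal attention module following Equations 1–4 is given below; the class and variable names, and the choice $d_q = d_k$ (required by the dot product), are our assumptions rather than the authors' code.

    import torch
    import torch.nn as nn

    class DirectionalMMA(nn.Module):
        """One directional module, e.g. S -> (S, V, T): query from one
        modality, keys/values from all three (Equations 1-4)."""
        def __init__(self, d_model, d_k, d_v):
            super().__init__()
            self.w_q = nn.Linear(d_model, d_k, bias=False)   # Eq. (1)
            self.w_k = nn.ModuleList(
                [nn.Linear(d_model, d_k, bias=False) for _ in range(3)])  # Eq. (2)
            self.w_v = nn.ModuleList(
                [nn.Linear(d_model, d_v, bias=False) for _ in range(3)])  # Eq. (3)
            self.d_k = d_k

        def forward(self, s, v, t):   # standardized embeddings, each (B, d_model)
            q = self.w_q(s)           # query from the speech modality
            feats = (s, v, t)
            K = torch.stack([w(x) for w, x in zip(self.w_k, feats)], dim=1)
            V = torch.stack([w(x) for w, x in zip(self.w_v, feats)], dim=1)
            # Eq. (4): scaled dot-product scores over the three modalities
            scores = (K @ q.unsqueeze(-1)).squeeze(-1) / self.d_k ** 0.5
            attn = torch.softmax(scores, dim=-1)             # (B, 3)
            return (attn.unsqueeze(1) @ V).squeeze(1)        # weighted sum, (B, d_v)

The three parallel modules share this structure but hold separate query weights; their outputs are concatenated with a skip connection before the cLSTM, as described above.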
The cLSTM-Speech, cLSTM-Visual and cLSTM-Text sub-networks are all built from a cLSTM block with two LSTM layers; only their inputs differ. Their network hyper-parameters are customized to suit the different modalities. The cLSTM-MMA and the three uni-modal sub-networks are trained separately, and their weights are fixed during the training of the late fusion dense layer in the MMAN.
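A minimal sketch of this two-stage training, assuming the sub-networks are exposed as attributes of an MMAN wrapper (the attribute names are hypothetical):

    import torch

    def freeze_subnetworks(mman):
        """Freeze the separately pre-trained sub-networks so that only the
        late-fusion dense layer is updated; attribute names are hypothetical."""
        for subnet in (mman.clstm_mma, mman.clstm_speech,
                       mman.clstm_visual, mman.clstm_text):
            for p in subnet.parameters():
                p.requires_grad = False      # weights fixed during late fusion
        # Only the fusion head's parameters are passed to the optimizer.
        return torch.optim.Adam(mman.fusion_head.parameters(), lr=1e-3)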
3. Experimental setup
The IEMOCAP dataset [23] is used to evaluate the proposed network. The dataset contains 10K videos split into 5-minute dyadic conversations for human emotion analysis. Each conversation is split into spoken utterances, and each utterance comes with its transcription, speech waveform and visual frames. To align with previous works, we consider the emotion classes angry, happy (excited), sad (frustrated) and neutral for multi-class classification; excited and frustrated are excluded for the binary sentiment classification system (see the sketch after Table 1). The train and test sets are speaker-disjoint: the speakers in the training set are not contained in the test set, as we assume the speakers are unknown at inference time. The details of the dataset are provided in Table 1.

Table 1:
The number of utterances labelled happy (HPY), sad (SAD), neutral (NEU), angry (ANG), excited (EXC) and frustrated (FRU) in the training and testing sets of IEMOCAP
       HPY  SAD   NEU  ANG  EXC   FRU
Train  504  839  1324  933  742  1468
Test   144  245   384  170  299   381
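To make the class setup described above concrete, here is a small illustrative sketch of the label merging for the 4-class system; the raw label strings follow IEMOCAP's common abbreviations, which is our assumption:

    # Hypothetical mapping: "excited" is merged into happy and
    # "frustrated" into sad, as described above.
    FOUR_CLASS = {"hap": 0, "exc": 0, "sad": 1, "fru": 1, "neu": 2, "ang": 3}

    def to_four_class(raw_label: str) -> int:
        return FOUR_CLASS[raw_label]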
We follow Poria et al. [13] for low-level feature extraction. The input video of an utterance is first separated into the corresponding text, video frames and speech modalities, and extraction is done using individual pre-trained networks transferred from other tasks. The features of each utterance are extracted as a fixed-length vector for each modality.
Speech: The openSMILE toolkit [24] with the IS13-ComParE configuration [25] is used for feature extraction. It is performed with a 30 Hz frame rate and a 100 ms sliding window. The features include Mel-frequency cepstral coefficients (MFCC), spectral centroid, spectral flux, beat histogram, beat sum, voice intensity, pitch, mean and root quadratic mean, etc. [12].
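As a hedged sketch, openSMILE is typically invoked from the command line; the config file name and paths below are assumptions that depend on the openSMILE installation, not the authors' exact setup:

    import subprocess

    # Extract the IS13-ComParE feature set for one utterance. The config
    # path is an assumption; it varies across openSMILE releases.
    subprocess.run([
        "SMILExtract",
        "-C", "config/IS13_ComParE.conf",   # IS13-ComParE configuration
        "-I", "utterance.wav",              # input speech of one utterance
        "-O", "features.csv",               # fixed-length feature vector out
    ], check=True)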
Visual: We use a 3D-CNN [26] pre-trained on human action recognition to extract the speaker's body language. The 3D-CNN is applied to consecutive visual frames of the speaker's upper body. It learns the relevant features of each frame and the changes across the given number of consecutive frames, which are the motion cues.
Text: Word2vec [27] is used to embed each word of an utterance's transcript into a word2vec vector. The embedded words are concatenated, padded and standardized to a 1-dimensional vector by passing through a CNN [28].
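A minimal sketch of this text front-end, assuming a pre-trained word2vec model loadable with gensim; the model path and maximum length are assumptions, and the utterance-level CNN is omitted:

    import numpy as np
    from gensim.models import KeyedVectors

    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)  # assumed model

    def embed_transcript(words, max_len=50, dim=300):
        """Embed, pad and stack the words of one transcript; the result is
        the input to the utterance-level CNN described above."""
        mat = np.zeros((max_len, dim), dtype=np.float32)
        for i, w in enumerate(words[:max_len]):
            if w in w2v:                 # out-of-vocabulary words stay zero
                mat[i] = w2v[w]
        return mat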
Three baselines are constructed from the state-of-the-art model [13]. They are all built from the cLSTM block. The first baseline uses speech data only, while the other two use speech, visual and text data.
Speech-only cLSTM (cLSTM-Speech): The speech-only baseline receives speech features only; they are passed through a cLSTM block with two LSTM layers for prediction.
Multi-modal cLSTM with early fusion (cLSTM-EF): The cLSTM-EF baseline receives the concatenated speech, visual and text feature embeddings as input. The concatenated features are passed through a cLSTM block with two LSTM layers for prediction.
Multi-modal cLSTM with late fusion (cLSTM-LF): The cLSTM-LF baseline has a hierarchical structure. The lower level consists of the three uni-modal networks cLSTM-Speech, cLSTM-Text and cLSTM-Visual. At the higher level, the predictions of the three uni-modal networks are concatenated and passed through another cLSTM block for the final prediction.
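All three baselines are built around the same cLSTM block; a minimal sketch follows, where the layer sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    class CLSTMBlock(nn.Module):
        """Contextual LSTM: an LSTM runs over the sequence of utterance-level
        feature vectors in a conversation, so each prediction can use the
        context of surrounding utterances."""
        def __init__(self, in_dim, hidden=128, n_layers=2, n_classes=4):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, num_layers=n_layers,
                                batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, x):        # x: (batch, n_utterances, in_dim)
            h, _ = self.lstm(x)      # contextual utterance representations
            return self.out(h)       # per-utterance emotion logits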
4. Evaluation results
We report the results of our proposed models and the baselines for the multi-class recognition system in terms of accuracy and recall rates in Table 2 and Figure 3.

Table 2:
Accuracy (%) and number of network parameters of the baselines and our models in multi-class classification
Model          Accuracy  No. of parameters (million)
cLSTM-Speech   57.12
cLSTM-EF       69.75     2.00
cLSTM-LF       71.78     4.73
cLSTM-MMA
Table 3: A comparative study of binary sentiment analysis with different multi-modal implementations in terms of classification accuracy (%)
Model       Happy  Sad  Neutral  Angry
MFM [29]    90.2
cLSTM-MMA
We first present the recognition accuracy of the speech-only baseline cLSTM-Speech, which obtains a recognition accuracy of 57%, more than 10 absolute percentage points lower than any of the multi-modal methods in Table 2. We compare its confusion matrix with that of our cLSTM-MMA model in Figure 3, since they have similar network sizes. The recall rate of cLSTM-Speech is very low for neutral but very high for sad, as seen in Figure 3. This imbalance is alleviated by cLSTM-MMA, which uses multi-modal information, as seen from its confusion matrix. This shows that visual and textual cues do complement speech by resolving its ambiguity in emotion recognition.
The cLSTM-MMA is 2% higher than cLSTM-EF in terms of accuracy, as shown in Table 2. This means that the proposed multi-modal attention is more effective at modelling the interaction between modalities than the concatenation method with early fusion. Besides, cLSTM-MMA has 40% fewer parameters than the cLSTM-EF baseline.
Figure 3:
Normalised confusion matrices of the speech-only baseline cLSTM-Speech and the proposed cLSTM-MMA network, with predicted labels on the horizontal axis and true labels (HPY, SAD, NEU, ANG) on the vertical axis. Diagonal entries represent the recall rates of each emotion.
Table 4:
A comparative study of multi-class emotion recognition with different multi-modal implementations
Model                 Accuracy (%)
Rozgic et al. [31]    69.4
Poria et al. [19]     71.59
Tripathi et al. [12]  71.04
MMAN                  73.94
The cLSTM-MMA achieves accuracy comparable to the state-of-the-art late fusion model cLSTM-LF with only a quarter of its parameters, as shown in Table 2. The proposed hybrid MMAN network outperforms all the multi-modal networks and achieves a state-of-the-art accuracy of 73.98% with about the same number of parameters as cLSTM-LF, suggesting that both modality-specific and cross-modal interactions are important in emotion recognition.
We compare the accuracy of our model with other binary sentiment classification systems using speech, visual and text in Table 3. The cLSTM-MMA has superior performance over the pairwise correlation network MulT and the others (with the exception of the sad emotion), showing that correlation among three modalities is superior to pairwise correlation. Interestingly, MMAN performs similarly to cLSTM-MMA, suggesting that modality-specific interactions may not contribute much in the binary sentiment classification case.

Table 4 summarizes the performance of previous multi-class emotion recognition networks using speech, visual and text. Our proposed MMAN achieves a state-of-the-art result of 73.94% on the IEMOCAP dataset. Also, most of the previous methods that achieve comparable results are based on BLSTMs, which have access to future utterance information when deciding on the current utterance; the comparison reported in this paper is therefore in their favour. Nevertheless, MMAN outperforms all reference baselines.
5. Conclusion
In this work, we presented a hybrid fusion model, MMAN, that uses visual and textual cues to aid speech in emotion recognition. We proposed multi-modal attention in early fusion, which features parallel directional attention between modalities in place of concatenation. The attention mechanism enables better data association between modalities and requires significantly fewer parameters. Through experiments, we showed that the multi-modal attention alone is as competitive as other fusion methods with a much more compact network. Our hybrid model achieved a state-of-the-art result on the IEMOCAP dataset for emotion recognition.
6. Acknowledgement
This research work is partially supported by Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain), and in part by Human-Robot Interaction Phase 1 (Grant No. 192 25 00054) from the National Research Foundation, Prime Minister's Office, Singapore, under the National Robotics Programme.

7. References

[1] M. Sreeshakthy and J. Preethi, "Classification of human emotion from DEAP EEG signal using hybrid improved neural networks with cuckoo search," BRAIN. Broad Research in Artificial Intelligence and Neuroscience, vol. 6, no. 3-4, pp. 60–73, 2016.
[2] L. F. Barrett, "Solving the emotion paradox: Categorization and the experience of emotion," Personality and Social Psychology Review, vol. 10, no. 1, pp. 20–46, 2006.
[3] S. Shimojo and L. Shams, "Sensory modalities are not separate modalities: Plasticity and interactions," Current Opinion in Neurobiology, vol. 11, no. 4, pp. 505–509, 2001.
[4] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, and N. Dehak, "Deep neural networks for emotion recognition combining audio and transcripts," arXiv preprint arXiv:1911.00432, 2019.
[5] M. S. Hossain and G. Muhammad, "Emotion recognition using deep learning approach from audio–visual emotional big data," Information Fusion, vol. 49, pp. 69–78, 2019.
[6] J. Xue, Z. Luo, K. Eguchi, T. Takiguchi, and T. Omoto, "A Bayesian nonparametric multimodal data modeling framework for video emotion recognition," IEEE, 2017, pp. 601–606.
[7] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98–125, 2017.
[8] V. Pérez-Rosas, R. Mihalcea, and L.-P. Morency, "Utterance-level multimodal sentiment analysis," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 973–982.
[9] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency, "YouTube movie reviews: Sentiment analysis in an audio-visual context," IEEE Intelligent Systems, vol. 28, no. 3, pp. 46–53, 2013.
[10] S. Poria, E. Cambria, and A. Gelbukh, "Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2539–2544.
[11] E. Cambria, D. Hazarika, S. Poria, A. Hussain, and R. Subramanyam, "Benchmarking multimodal sentiment analysis," in International Conference on Computational Linguistics and Intelligent Text Processing. Springer, 2017, pp. 166–179.
[12] S. Tripathi and H. Beigi, "Multi-modal emotion recognition on IEMOCAP dataset using deep learning," arXiv preprint arXiv:1804.05788, 2019.
[13] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency, "Context-dependent sentiment analysis in user-generated videos," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 873–883.
[14] Y. Wang, Y. Shen, Z. Liu, P. P. Liang, A. Zadeh, and L.-P. Morency, "Words can shift: Dynamically adjusting word representations using nonverbal behaviors," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 7216–7223.
[15] J. Sebastian and P. Pierucci, "Fusion techniques for utterance-level emotion recognition combining speech and transcripts," in Proc. Interspeech, 2019, pp. 51–55.
[16] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain, "Fusing audio, visual and textual clues for sentiment analysis from multimodal content," Neurocomputing, vol. 174, pp. 50–59, 2016.
[17] E. Georgiou, C. Papaioannou, and A. Potamianos, "Deep hierarchical fusion with application in sentiment analysis," Proc. Interspeech 2019, pp. 1646–1650, 2019.
[18] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. B. Zadeh, and L.-P. Morency, "Efficient low-rank multimodal fusion with modality-specific factors," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2247–2256.
[19] S. Poria, N. Majumder, D. Hazarika, E. Cambria, A. Gelbukh, and A. Hussain, "Multimodal sentiment analysis: Addressing key issues and setting up the baselines," IEEE Intelligent Systems, vol. 33, no. 6, pp. 17–25, 2018.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[21] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6558–6569.
[22] H. Le, D. Sahoo, N. Chen, and S. Hoi, "Multimodal transformer networks for end-to-end video-grounded dialogue systems," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5612–5623.
[23] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[24] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 1459–1462.
[25] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi et al., "The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism," in Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, 2013.
[26] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[27] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[28] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[29] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, "Learning factorized multimodal representations," in ICLR, 2019.
[30] A. Zadeh, C. Mao, K. Shi, Y. Zhang, P. P. Liang, S. Poria, and L.-P. Morency, "Factorized multimodal transformer for multimodal sequential learning," arXiv preprint arXiv:1911.09826, 2019.
[31] V. Rozgić, S. Ananthakrishnan, S. Saleem, R. Kumar, and R. Prasad, "Ensemble of SVM trees for multimodal emotion recognition," in