Attention Driven Fusion for Multi-Modal Emotion Recognition
Darshana Priyasad, Tharindu Fernando, Simon Denman, Clinton Fookes, Sridha Sridharan
Speech and Audio Research Lab - SAIVT, Queensland University of Technology, Brisbane, Australia
ABSTRACT
Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition on combined acoustic and text modalities. Baseline systems model emotion information in text and acoustic modes independently using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks (RNN), followed by applying attention, fusion, and classification. In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification. We utilize a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio, followed by a DCNN. This approach learns filter banks tuned for emotion recognition and provides more effective features compared to directly applying convolutions over the raw speech signal. For text processing, we use two branches (a DCNN and a Bi-directional RNN followed by a DCNN) in parallel, where cross attention is introduced to infer the N-gram level correlations on hidden representations received from the Bi-RNN. Following existing state-of-the-art, we evaluate the performance of the proposed system on the IEMOCAP dataset. Experimental results indicate that the proposed system outperforms existing methods, achieving an improvement in weighted accuracy.

Index Terms — Speech emotion recognition, deep learning, multi-modal fusion, cross attention, SincNet
1. INTRODUCTION
With the advance of technology, Human Computer Interaction (HCI) has become a major research area. Within this field, automatic emotion recognition is being pursued as a means to improve the level of user experience by tailoring responses to the emotional context, especially in human-machine interactions [1]. However, this remains challenging due to the ambiguity of expressed emotions. An utterance may contain subject dependent auditory clues regarding expressed emotions which are not captured through speech transcripts alone. With deep learning, architectures can extract higher level and more robust features for accurate speech emotion recognition [2, 3]. In this paper, we propose a model that combines acoustic and textual information for speech emotion recognition. (Note: this paper has been updated from the ICASSP published version due to a small error. Results have been updated to correct the error, but the overall findings are unchanged.)

Recently, multi-modal information has been used in emotion recognition in preference to uni-modal methods [4], since humans express emotion via multiple modes [5, 6, 7, 8]. Most state-of-the-art methods for utterance level emotion recognition have used low-level (energy) and high-level acoustic features (such as Mel Frequency Cepstral Coefficients (MFCC) [5, 9]). However, when the emotion expressed through speech becomes ambiguous, the lexical content may provide complementary information that can address the ambiguity.

Tripathi et al. [10] used a Long-Short Term Memory (LSTM) network along with a DCNN to perform joint acoustic and textual emotion recognition. They used features such as MFCC, zero crossing rate and spectral entropy for acoustic data, while using Glove [11] embeddings to extract a feature vector from speech transcripts. However, the performance gain is minimal due to the lack of robustness in the acoustic features, and the sparse text feature vectors. Yenigalla et al. [12] proposed a spectrogram and phoneme-sequence based DCNN model which is capable of retaining emotional content of the speech that is lost when it is converted to text. Yoon et al. [5] presented a framework using Bidirectional LSTMs to obtain hidden representations of acoustic and textual information. The resultant features are fused with multi-hop attention, where one modality is used to direct attention for the other mode. Higher performance has been achieved due to the attention and fusion, which select relevant segments from the textual model and complementary features from each mode. Yoon et al. [6] have also presented an encoder based method where the fusion of two recurrent encoders is used to combine features from audio and text. However, both methods use manually calculated audio features, which limits their accuracy and the robustness of the acoustic model [13, 14]. Gu et al. [7] presented a multimodal framework where a hybrid deep multimodal structure that considers spatial and temporal information is employed. Features obtained from each model were fused using a DNN to classify the emotion. Li et al. [8] proposed a personalized attribute aware attention mechanism where an attention profile is learned based on acoustic and lexical behavior data. Mirsamadi et al. [15] used deep learning along with local attention to automatically extract relevant features, where segment level acoustic features are aggregated for an utterance level emotion representation.
However, the accuracy could be further improved by fusing the audio features with another modality that carries complementary information.

In this paper, we present a multi-modal emotion recognition model which combines acoustic and textual information using DCNNs, and both cross attention and self-attention. Experiments are performed on the IEMOCAP [16] dataset to enable fair comparison with state-of-the-art methods, and a performance gain in weighted accuracy is achieved.
2. METHODOLOGY

2.1. Acoustic Feature Extraction
In our proposed model, we utilize a SincNet filter layer [17] to learn custom filter banks tuned for emotion recognition from speech audio. This layer has been shown to have fast convergence and higher interpretability with a smaller number of parameters compared to conventional convolution layers. Formally, this layer can be defined as

y[n] = x[n] * g[n, θ],   (1)

where x[n], y[n], g and θ refer to the input signal, the filtered output, the filter-bank function, and the learnable parameters respectively. In SincNet filters, convolution operations are applied over a raw waveform with predefined functions. Each defined filter-bank is composed of rectangular band-pass filters, which can be represented by two low-pass filters with learnable cutoff frequencies. The time-domain representation of the function g can be derived as follows [17],

g[n, f_1, f_2] = 2 f_2 sinc(2π f_2 n) − 2 f_1 sinc(2π f_1 n),   (2)

where f_1 and f_2 refer to the low and high cutoff frequencies, and sinc(x) = sin(x)/x.

The resultant convolution layer outputs are passed through a DCNN which contains several "Convolution1D", "Batch Normalization" and fully connected layers. During the initial training phase, a randomly selected fixed-length chunk of the audio signal is used as the input. During validation and testing, we obtain the final "softmax" response for each chunk and add them together to get the final classification scores, similar to [17]. However, we also retrieve a 2048-D feature vector from the final dense layer before the classification layer for each chunk of an utterance, and average these before fusing them with textual features from the corresponding transcript in a later step (see Section 2.3).
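To make the acoustic front-end concrete, the following is a minimal sketch of a SincNet-style band-pass filter layer implementing Eq. (1) and (2), assuming PyTorch. The filter count, kernel length, sample rate and cutoff initialisation are illustrative assumptions rather than the paper's settings, and the windowing and normalisation details of the reference SincNet implementation [17] are omitted.

```python
# Sketch of a SincNet-style filter layer (Eq. 1-2); hyper-parameters are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincFilterLayer(nn.Module):
    """Band-pass sinc filter bank convolved with raw audio."""

    def __init__(self, num_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size if kernel_size % 2 == 1 else kernel_size + 1
        # Learnable low cutoff f1 and bandwidth (f2 - f1), in Hz.
        self.f1 = nn.Parameter(torch.linspace(30.0, sample_rate / 2 - 200.0, num_filters))
        self.band = nn.Parameter(torch.full((num_filters,), 100.0))
        # Time axis n in seconds, centred on zero.
        n = torch.arange(-(self.kernel_size // 2), self.kernel_size // 2 + 1, dtype=torch.float32)
        self.register_buffer("n", n / sample_rate)

    def _sinc(self, f):
        # sinc(2*pi*f*n), with the n = 0 tap set to its limit value of 1.
        arg = 2 * math.pi * f.unsqueeze(1) * self.n.unsqueeze(0)
        safe = torch.where(arg == 0, torch.ones_like(arg), arg)
        return torch.where(arg == 0, torch.ones_like(arg), torch.sin(arg) / safe)

    def forward(self, x):  # x: (batch, 1, time), amplitude-normalised raw audio
        f1 = torch.abs(self.f1)
        f2 = f1 + torch.abs(self.band)
        # g[n, f1, f2] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n)   (Eq. 2)
        g = 2 * f2.unsqueeze(1) * self._sinc(f2) - 2 * f1.unsqueeze(1) * self._sinc(f1)
        filters = g.unsqueeze(1)  # (num_filters, 1, kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)  # y = x * g   (Eq. 1)


# e.g. a batch of 4 one-second chunks at 16 kHz -> output of shape (4, 80, 16000)
out = SincFilterLayer()(torch.randn(4, 1, 16000))
```

Only the two cutoff frequencies per filter are learned, which is what keeps the layer compact and interpretable compared to a free-form convolution over the raw waveform.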
2.2. Textual Feature Extraction

In our proposed model, after the sequence vector is passed through a common embedding layer, we utilize two parallel branches for textual feature extraction, as illustrated in Figure 1. Bi-RNNs followed by DCNNs have been extensively used in textual emotion analysis [18, 19]. As an alternative, a CNN based architecture which is capable of considering "n" words at a time (n-grams) can be used [20]. Therefore we use two parallel branches, one using Bi-RNNs with DCNNs and the other using DCNNs alone, to increase the effectiveness of the learned features (see Figure 1 (B)). The resultant feature vector from the Bi-RNN is passed through three convolutional layers with filter sizes of 1, 3 and 5, and convolutional layers with the same filter sizes are used in the parallel branch. We introduce cross-attention, where the convolution layers with the same filter size from the right branch provide the attention for the left branch, as illustrated, and are jointly trained with the other components of the network. The cross-attention is calculated using

α_i = exp(b_{i,j}^T a_{i,j}) / Σ_i exp(b_{i,j}^T a_{i,j}),   (3)

H = Σ_i α_i a_{i,j},   (4)

where α_i, b_{i,j}, a_{i,j} and H are the attention score, the context vector from the right branch with a filter size of j, the output of the convolution layer with filter size j in the left branch, and the output, respectively.

The convolution layers from both branches are concatenated together and passed through a DNN consisting of fully connected layers for textual emotion classification. We retrieve a 4800-D feature vector from the final dense layer before the classification layer for multi-modal feature fusion.

2.3. Multi-modal Fusion

Mid-level fusion is used to fuse the textual and acoustic features obtained from the individual networks. A 2048-D feature vector from the acoustic network and a 4800-D feature vector from the textual network are concatenated as illustrated in Figure 1 (C). A neural network with attention is used to identify informative segments in the feature vectors. We have explored fusion without attention (F-I), attention after fusion (F-II), where self-attention is applied on the concatenated features, and attention before fusion (F-III), where attention is applied on the individual feature vectors. For F-III, we calculate attention weights and combine the vectors using [21]

h_t = tanh(W_h c_t + b_h),   (5)

β_t = sigmoid(h_t),   (6)

q = β_t ⊗ c_t,   (7)

where c_t, h_t, β_t and q refer to the merged feature vector, the output of a neural network (which is randomly initialized and jointly trained with the other components of the network), the attention score and the output respectively. Finally, the utterance emotion is classified using a "softmax" activation over the final dense layer of the fusion network.
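As a rough illustration of Eq. (5)-(7), the sketch below gates a feature vector with sigmoid attention scores and shows the attention-before-fusion (F-III) arrangement, assuming PyTorch. The paper does not specify the output size of W_h, so the sketch assumes it matches the input dimensionality so that β_t can gate c_t element-wise; the class name and batch size are hypothetical.

```python
# Sketch of the feature-vector attention in Eq. (5)-(7); dimensions are assumptions.
import torch
import torch.nn as nn


class FeatureAttention(nn.Module):
    """Sigmoid gating of a feature vector c_t, following Eq. (5)-(7)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # W_h and b_h; output size assumed equal to input

    def forward(self, c_t):
        h_t = torch.tanh(self.proj(c_t))   # Eq. (5)
        beta_t = torch.sigmoid(h_t)        # Eq. (6)
        return beta_t * c_t                # Eq. (7): q = beta_t (x) c_t, element-wise


# F-III (attention before fusion): gate each modality, then concatenate.
audio = torch.randn(8, 2048)  # averaged chunk features from the acoustic network
text = torch.randn(8, 4800)   # features from the textual network
fused = torch.cat([FeatureAttention(2048)(audio), FeatureAttention(4800)(text)], dim=1)
# F-II (attention after fusion) would instead apply one FeatureAttention(6848)
# to the concatenation of the raw audio and text feature vectors.
```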
Fig. 1. Proposed architecture - The system contains three main parts: the text network (A); the audio network (B); and the fusion network (C). The raw audio is passed to the acoustic network after amplitude normalization. It contains SincNet filter layers containing parameterized sinc functions with band-pass filters to extract acoustic features, followed by the A-DCNN component containing convolutional and dense layers. The corresponding text is converted to a word vector using Glove embeddings, and then passed through the text network. It contains two parallel branches with bi-LSTM and convolutional layers (8 filters) with different filter sizes to capture n-grams (n-words) in one iteration, where n = {1, 3, 5}. As shown in the figure, convolutional layers in the right branch are used as cross-attention for the left branch. The two branches are fused, followed by the T-DNN component for textual emotion recognition. A-DCNN and T-DNN are then fused using self-attention to get the final emotion classification result.
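For the cross-attention between the two text branches (Eq. (3) and (4), also described in the caption above), the following is a minimal sketch, assuming PyTorch. It interprets b_{i,j}^T a_{i,j} as a per-time-step dot product between the right-branch and left-branch Conv1D outputs with the same filter size, which requires the two branches to produce matching shapes; the shapes and the function name are hypothetical.

```python
# Sketch of the cross-attention in Eq. (3)-(4); shapes are illustrative assumptions.
import torch


def cross_attention(a, b):
    """a: left-branch (Bi-RNN -> Conv1D) output, b: right-branch Conv1D output
    with the same filter size j; both (batch, time, channels)."""
    scores = (b * a).sum(dim=-1)                 # dot product b_{i,j}^T a_{i,j} per step i
    alpha = torch.softmax(scores, dim=1)         # Eq. (3): softmax over the time steps i
    return (alpha.unsqueeze(-1) * a).sum(dim=1)  # Eq. (4): H = sum_i alpha_i a_{i,j}


# Hypothetical shapes: 8 utterances, 100 steps, 8 filters (as in the Figure 1 caption).
H = cross_attention(torch.randn(8, 100, 8), torch.randn(8, 100, 8))  # -> (8, 8)
```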
3. EXPERIMENTS

3.1. Dataset and Experimental Setup
Experiments are conducted on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which includes five sessions of utterances from 10 unique speakers. We follow the evaluation protocol of [5, 6], and select utterances annotated with the four basic emotions "anger", "happiness", "neutral" and "sadness". Samples with "excitement" are merged with "happiness" as per [5, 6].

Initial training is carried out on the acoustic and textual networks separately before the fusion. Each utterance waveform is resampled to a fixed sampling rate, and a random fixed-length segment is used in training the acoustic network. During evaluation, the cumulative sum of all predictions obtained with a sliding window of fixed size and shift is considered. In the textual network, all utterance transcripts are set to a maximum length of 100 tokens and padded with 0s. Glove-300d embeddings are used to convert the word sequence to a matrix of size (100, 300). We utilize 10-fold cross-validation with an 8:1:1 split for the training, validation and test sets respectively for the text model. We select an average performing split and use this split to train the acoustic and fusion networks (we use a single split due to the high computation time of the acoustic model), such that all networks (text, audio and fusion) use the same data splits. The learning rate and batch size are fixed for each network, and the Adam optimiser is used.

3.2. Results

Following [5, 6], the performance of our system is measured in terms of weighted accuracy (WA) and unweighted accuracy (UA). Table 1 and Figure 2 present the performance of our approach for emotion recognition compared with state-of-the-art methods.

Fig. 2. Confusion matrices of the proposed architecture for the separate fusion methods, calculated using an average performing 8:1:1 split. Left, middle and right figures represent Fusion I, Fusion II and Fusion III respectively.

Table 1. Recognition accuracy for IEMOCAP using an average performing 8:1:1 split, compared with state-of-the-art methods.

Model | Modality | WA | UA
Evec-MCNN-LSTM [22] | A + T | .9% | 65.
MDRE [6] | A + T | | −
MHA-2 [5] | A + T | .5% | 77.
Ours - F-I | A + T | .85% | 79.
Ours - F-II | A + T | .98% | 80.
Ours - F-III | A + T | |

MDRE [6] used two RNNs to encode the acoustic and textual data followed by a DNN for classification, while Evec-MCNN-LSTM [22] used an RNN and a DCNN to encode both modalities, followed by fusion and an SVM for classification. MHA-2 [5] used two Bidirectional Recurrent Encoders (BRE) for both modalities followed by a multi-hop attention mechanism. MDRE outperformed Evec-MCNN-LSTM and MHA-2 outperformed MDRE, demonstrating how attention can increase performance. Our proposed model achieves a substantial improvement in overall accuracy compared to MHA-2. We have utilized self-attention before fusion (F-III) and after fusion (F-II) as illustrated in Figure 1. Cross-modal attention has not been utilized after fusion since the dimensionality of the feature vectors from the two modalities is different. A slight increase in classification accuracy is obtained by applying self-attention compared to conventional feature fusion (F-I). Furthermore, the highest accuracy is obtained by F-III, outperforming F-II. Given that F-I slightly outperforms MHA-2, we compare the classification accuracy of the individual modes of MHA-2 with our individual modes in Table 2.

Table 2. Recognition accuracy of individual modes on the IEMOCAP dataset with an average performing 8:1:1 split, comparing the proposed approach with [5].

Model | Modality | WA
MHA-2 [5] | A |
MHA-2 [5] | T |
Ours | A |
Ours | T |

Our acoustic and textual models outperform the corresponding individual modes of MHA-2 [5], with a substantial improvement achieved by the acoustic model. The SincNet layer in the acoustic model is capable of learning and deriving customized filter banks tuned for emotion recognition; it has been successfully applied to speaker recognition [17] as an alternative to i-vectors. The confusion matrices for F-I, F-II and F-III are illustrated in Figure 2. A relative improvement in classification accuracy can be observed when comparing the individual modalities with the fusion network in our model. Given that accuracy is approximately similar for both modalities, each modality has complemented the other to increase recognition accuracy.
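For reference, the sketch below computes weighted accuracy (WA) and unweighted accuracy (UA) using the definitions that are standard in the IEMOCAP literature (WA is the overall accuracy, UA is the mean of the per-class recalls); the paper itself does not spell out the formulas, so these definitions are an assumption, and the function name and labels are hypothetical.

```python
# Sketch of WA/UA computation under the assumed standard definitions.
import numpy as np


def wa_ua(y_true, y_pred, num_classes=4):
    """WA = overall accuracy; UA = mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in range(num_classes) if np.any(y_true == c)]
    ua = float(np.mean(recalls))
    return wa, ua


# 0 = anger, 1 = happiness, 2 = neutral, 3 = sadness
print(wa_ua([0, 1, 2, 3, 1, 2], [0, 1, 2, 2, 1, 2]))  # (0.8333..., 0.75)
```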
4. CONCLUSION
In this paper, we present an attention-based multi-modal emotion recognition model combining acoustic and textual data. The raw audio waveform is utilized in our method, rather than extracting hand-crafted features as done by baseline methods. Combining a DCNN with a SincNet layer, which learns suitable filter parameters over the waveform for emotion recognition, outperforms the hand-crafted feature-based audio emotion detection of the baselines. Cross attention is applied to text-based feature extraction to guide the features derived by RNNs using N-gram level features extracted by a parallel branch. We have used self-attention on the feature vectors obtained from the two networks before fusion, to attend to the informative segments of each feature vector. We achieve a weighted accuracy on the IEMOCAP database that outperforms the existing state-of-the-art model.
5. ACKNOWLEDGEMENTS
This research was supported by an Australian Research Council (ARC) Discovery grant DP140100793.
6. REFERENCES

[1] A. Mohanta and U. Sharma, "Detection of human emotion from speech—tools and techniques," in Speech and Language Processing for Human-Machine Communications. Springer, 2018, pp. 179–186.
[2] P. Tzirakis, J. Zhang, and B. W. Schuller, "End-to-end speech emotion recognition using deep neural networks," IEEE, 2018, pp. 5089–5093.
[3] L. Tarantino, P. N. Garner, and A. Lazaridis, "Self-attention for speech emotion recognition," Proc. Interspeech 2019, pp. 2578–2582, 2019.
[4] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[5] S. Yoon, S. Byun, S. Dey, and K. Jung, "Speech emotion recognition using multi-hop attention mechanism," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2822–2826.
[6] S. Yoon, S. Byun, and K. Jung, "Multimodal speech emotion recognition using audio and text," IEEE, 2018, pp. 112–118.
[7] Y. Gu, S. Chen, and I. Marsic, "Deep multimodal learning for emotion recognition in spoken language," IEEE, 2018, pp. 5079–5083.
[8] J.-L. Li and C.-C. Lee, "Attentive to individual: A multimodal emotion recognition network with personalized attention profile," Proc. Interspeech 2019, pp. 211–215, 2019.
[9] D. Nguyen, K. Nguyen, S. Sridharan, D. Dean, and C. Fookes, "Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition," Computer Vision and Image Understanding, vol. 174, pp. 33–42, 2018.
[10] S. Tripathi and H. Beigi, "Multi-modal emotion recognition on IEMOCAP dataset using deep learning," arXiv preprint arXiv:1804.05788, 2018.
[11] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[12] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, and J. Vepa, "Speech emotion recognition using spectrogram & phoneme embedding," in Interspeech, 2018, pp. 3688–3692.
[13] C. W. Lee, K. Y. Song, J. Jeong, and W. Y. Choi, "Convolutional attention networks for multimodal emotion recognition from speech and text data," ACL 2018, p. 28, 2018.
[14] M. Chen, X. He, J. Yang, and H. Zhang, "3-D convolutional recurrent neural networks with attention model for speech emotion recognition," IEEE Signal Processing Letters, vol. 25, no. 10, pp. 1440–1444, 2018.
[15] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," IEEE, 2017, pp. 2227–2231.
[16] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[17] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," IEEE, 2018, pp. 1021–1028.
[18] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, "CNN+LSTM architecture for speech emotion recognition with data augmentation," arXiv preprint arXiv:1802.05630, 2018.
[19] G. Ramet, P. N. Garner, M. Baeriswyl, and A. Lazaridis, "Context-aware attention mechanism for speech emotion recognition," IEEE, 2018, pp. 126–131.
[20] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[21] D. Priyasad, T. Fernando, S. Denman, S. Sridharan, and C. Fookes, "Learning salient features for multimodal emotion recognition with recurrent neural networks and attention based fusion," 2019.
[22] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, and N. Dehak, "Deep neural networks for emotion recognition combining audio and transcripts."