Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara
Augmented Human Lab, Auckland Bioengineering Institute, The University of Auckland
[email protected], [email protected], [email protected], [email protected]
Abstract
Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the use of modality-specific "BERT-like" pretrained Self Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition. By conducting experiments on three publicly available datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), we show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results. We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones when using SSL models that have similar architectural properties to BERT.
Index Terms: speech emotion recognition, self supervised learning, Transformers, BERT, multimodal deep learning
1. Introduction
Emotion recognition plays a significant role in many intelligent interfaces [1]. Even with the recent advances in Deep Learning (DL), this is still a challenging task, mainly because most publicly available annotated datasets in this domain are small in scale, which makes DL models prone to over-fitting. Another important feature of emotion recognition is the inherent multi-modality in the way we express emotions [2]. Emotional information can be captured by studying many modalities, including facial expressions, body postures, and EEG [3]. Of these, speech is arguably the most accessible. In addition to accessibility, speech signals contain many other emotional cues [4]. Although speech signals contain substantial amounts of information, it can be unrewarding to drop the linguistic component that coexists with them, especially given that the text component can be easily transcribed in real-world applications thanks to the considerable successes in the domain of speech-to-text, with several commercial-scale APIs available [5].

In multimodal emotion recognition, representation learning and fusion of modalities can be identified as a major research area [6, 7, 8]. Recent work has explored the use of deep representations in contrast to low-level representations [9] such as MFCC, COVAREP [10] or GloVe embeddings [11]. Such deep representation techniques fall into two main categories: 1) transfer learning techniques that use pre-trained networks to extract features [12, 13, 14] or fine-tune models [15]; and 2) unsupervised embedding learning techniques, which include variational auto-encoders (VAE) [16] and adversarial auto-encoders (AE) [17]. It is also important to highlight that performance usually degrades in transfer learning techniques due to the mismatch of source and target tasks. Recent work [18] explains the problems related to learning disentangled representations from VAEs when no inductive bias on the model or the dataset exists. In terms of fusing multiple modalities, recent work has explored architectures like attention [2, 7], graph neural networks [19] and Transformers [8]. Multimodal fusion mechanisms, especially those that fuse deep representations, usually result in architecturally complex models [20].

In representation learning, a class of techniques known as SSL has achieved SOTA performance in many areas of Natural Language Processing (NLP) [21, 22], Computer Vision (CV) [23, 24] and speech recognition [25, 26, 27]. SSL enables us to use a large unlabelled dataset to train models that can later be used to extract representations and be fine-tuned for specific problems that may have limited amounts of training data. Prior works [21, 23] have highlighted the effectiveness of fine-tuning pre-trained SSL models for specific tasks, in contrast to using them only as frozen feature extractors. A significant transition happened in the field of NLP with the introduction of SSL models like the Deep Bidirectional Transformers [21] (BERT) and its successors [22]. By adding a single task-specific layer to a pre-trained SSL model like BERT, one can solve multiple downstream tasks. BERT-like models also provide favourable architectural features such as the
CLS token, which can be used as a representation for the entire sequence. Another important factor is the extensive availability of pre-trained models in the open-source community, which leads to both cost and time savings, since these models tend to be very computationally expensive to train from scratch.

Even though several SSL models have been introduced for speech recognition related tasks like speech-to-text [26, 27] and speech emotion recognition [25], prior work has not looked at combining multiple separate SSL models, each specializing in one modality. This may be due to the architectural complexity of such models, brought on by the fact that these SSL networks usually have a large number of parameters. Combining multiple drastically different high-dimensional representations is also not a simple task and may increase the parameter count even further. If, however, the modality-specific SSL architectures share similar properties, then we may be able to use simpler fusion mechanisms to extract information for the desired task. Our work was heavily inspired by recent work [27, 28], which explored the effectiveness of self supervised pre-training with discretized speech representations for the task of Automatic Speech Recognition (ASR).

For the first time in the literature, we jointly fine-tuned modality-specific "BERT-like" SSL models that represent speech [28, 27] and text [22] on the task of multimodal emotion recognition. We further evaluate how simple fusion methods, which add minimal additional trainable parameters, perform when compared with more complex fusion mechanisms such as Co-Attentional [24] fusion. We also conducted a series of ablation studies to explore which factors affect the performance of these models. Please refer to our PyTorch implementation.
2. Pre-Trained SSL Models

We summarise the three pretrained SSL models that were used to process speech and text signals in the proposed framework in the following sections. We use pretrained models available in the Fairseq toolkit [29].
VQ-Wav2Vec [27] is an extension of Wav2Vec [26] that focuses on moving continuous speech representations into the discrete domain. Wav2Vec [26] learns representations from speech signals based on Contrastive Predictive Coding [30] (CPC). The major difference of VQ-Wav2Vec [27] from Wav2Vec [26] is the application of Vector Quantization [31] methods to generate discretized speech representations. In our experiments, we used a pretrained VQ-Wav2Vec [27] model that was trained on Librispeech-960 [32] to represent speech signals as a sequence of tokens, similar to the tokenization step for a sentence in NLP (see the sketch below).
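To make the discretization step concrete, the following is a minimal sketch of tokenizing a waveform with a pretrained vq-wav2vec checkpoint through Fairseq; the checkpoint filename and the random input waveform are placeholders.

    import torch
    import fairseq

    # Load a pretrained vq-wav2vec checkpoint (filename is a placeholder).
    models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
        ["vq-wav2vec.pt"]
    )
    model = models[0]
    model.eval()

    # One second of 16 kHz audio; in practice this is a real waveform tensor.
    wav = torch.randn(1, 16000)

    with torch.no_grad():
        z = model.feature_extractor(wav)                 # continuous features
        _, idxs = model.vector_quantizer.forward_idx(z)  # discrete codebook indices

    # idxs has shape (1, T, G): T discrete time steps, G codebook groups,
    # i.e. the speech is now a token sequence, like words in a sentence.
    print(idxs.shape)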
The term Speech-BERT is used in our work to denote a BERT-like Transformer architecture trained on a set of discretized speech tokens, where the speech signal was discretized and tokenized by a pretrained VQ-Wav2Vec model as mentioned in the above section. We were heavily motivated by recent work [28], which illustrated the effectiveness of BERT-like models in the domain of ASR. We used a pretrained Speech-BERT that was trained on the discretized Librispeech-960 [32] dataset with the pretext task of masked token prediction. The Speech-BERT model architecture is similar to BERT-base [21], consisting of 12 layers and an embedding dimension of 768. During our experiments, we fine-tune the Speech-BERT model for the task of multimodal emotion recognition.
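To show how quantizer output can become BERT input, here is a sketch of one plausible tokenization scheme; the index-pairing convention is an assumption for illustration, in the spirit of the preprocessing described in [28].

    import torch

    # A sketch (the pairing scheme is an assumption): treat each frame's pair of
    # codebook-group indices as a single vocabulary symbol, so a BERT-style model
    # can be pretrained on the resulting sequence with masked token prediction.
    def indices_to_tokens(idxs: torch.Tensor) -> list:
        """idxs: LongTensor of shape (1, T, 2) from vq-wav2vec's forward_idx."""
        return [f"{a}-{b}" for a, b in idxs.squeeze(0).tolist()]

    example = torch.tensor([[[103, 7], [103, 7], [45, 12]]])
    print(indices_to_tokens(example))  # ['103-7', '103-7', '45-12']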
RoBERTa [22] is an extension of the BERT [21] model that does not use the next sentence prediction task [22] during training. The RoBERTa [22] architecture consists of 24 layers and an embedding dimension of 1024. Similar to Speech-BERT, we fine-tune the RoBERTa [22] model for the task of multimodal emotion recognition.
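For the text side, a pretrained RoBERTa can be loaded directly through the Fairseq hub interface; a minimal sketch follows, with the input sentence as a placeholder.

    import torch

    # Load the 24-layer RoBERTa-large model (embedding dimension 1024).
    roberta = torch.hub.load("pytorch/fairseq", "roberta.large")
    roberta.eval()

    tokens = roberta.encode("I am feeling great today!")
    with torch.no_grad():
        feats = roberta.extract_features(tokens)  # shape (1, T, 1024)

    # The first position corresponds to the <s> (CLS-style) token, which we use
    # as the sentence-level representation.
    cls_text = feats[:, 0, :]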
3. Methodology
We explore the use of the Speech-BERT and RoBERTa SSL models for the task of multimodal speech emotion recognition. As the first step, we evaluate two possible fusion mechanisms to combine the two SSL models. The performance of the final proposed model was then compared with published SOTA results on the IEMOCAP [33], CMU-MOSEI [34], and CMU-MOSI [35] datasets. Finally, we conduct an extensive set of ablation studies with the IEMOCAP dataset [33] to understand the behaviour of our proposed framework under different settings. We investigate the effects of fine-tuned and frozen states for Speech-BERT and RoBERTa, as well as the effects of different fusion mechanisms.
3.1. Proposed Framework

Figure 1 gives an overview of the proposed framework. Both the speech signal and the text transcript are simultaneously fed into the model via two different pipelines. The speech signal gets discretized by a pre-trained VQ-Wav2Vec [27] model and the text transcript is tokenized with the GPT-2 tokenizer [36]. Once the speech and text modalities are tokenized, we send them through the pre-trained Speech-BERT and RoBERTa models, whose outputs have embedding sizes of 768 and 1024 and maximum sequence lengths of 2048 and 512, respectively (a sketch of this data flow is given below). The next step is fusing these embeddings prior to the prediction head, which consists of a single fully connected layer. In this paper we explore two possible fusion mechanisms, which we discuss in Section 3.2. Finally, we fine-tune the entire framework, including both the Speech-BERT and RoBERTa SSL models (the components inside the blue dotted box in Figure 1).

Figure 1: Overview of the proposed framework with Shallow-Fusion.
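To summarise the data flow in Figure 1, here is a minimal sketch of the two pipelines up to the fusion step. The encoder call signatures are simplifying assumptions; in practice the discrete indices are mapped into Speech-BERT's vocabulary as in the tokenization sketch of Section 2.

    import torch

    # A sketch of the two pipelines in Figure 1 (component names are placeholders
    # for the pretrained models described in Section 2).
    def encode_example(wav, text, vq_wav2vec, speech_bert, roberta):
        # Speech: waveform -> discrete indices -> Speech-BERT (768-d, max len 2048).
        z = vq_wav2vec.feature_extractor(wav)
        _, idxs = vq_wav2vec.vector_quantizer.forward_idx(z)
        h_speech = speech_bert(idxs)               # assumed to return (1, T_s, 768)

        # Text: GPT-2 BPE tokenization -> RoBERTa (1024-d, max len 512).
        tokens = roberta.encode(text)
        h_text = roberta.extract_features(tokens)  # (1, T_t, 1024)

        # Position 0 of each output is the CLS-style summary token; this pair is
        # what the fusion step (Section 3.2) consumes.
        return h_speech[:, 0, :], h_text[:, 0, :]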
3.2. Fusion Mechanisms

The fusion mechanism plays an essential role in any multimodal speech emotion recognition framework. In this work, we analyse how the following fusion mechanisms affect performance.
3.2.1. Shallow-Fusion

The success of BERT in sentence classification tasks [21] highlights the effective use of the CLS token as a representation of the entire sentence. CLS stands for classification, and it is the first token of every input sequence to BERT. Motivated by recent work in the domain of NLP [21, 22], we concatenate the two CLS tokens computed respectively by the Speech-BERT and RoBERTa models, as described in Figure 1. Finally, we send the concatenated embedding through a classification head that consists of a fully connected layer producing logits, followed by a softmax function (a minimal sketch follows below). Due to the simplicity of this fusion, we describe the mechanism as Shallow-Fusion. We also use Shallow-Fusion as the standard fusion mechanism in the ablation studies, since it achieved superior performance in our experiments.
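A minimal sketch of the Shallow-Fusion head; the input dimensions follow the encoder outputs described above, and the 4-class output size is illustrative.

    import torch
    import torch.nn as nn

    class ShallowFusionHead(nn.Module):
        """Concatenate the two CLS embeddings and classify (a minimal sketch)."""

        def __init__(self, d_speech=768, d_text=1024, n_classes=4):
            super().__init__()
            # A single fully connected layer over the concatenated CLS tokens.
            self.fc = nn.Linear(d_speech + d_text, n_classes)

        def forward(self, cls_speech, cls_text):
            logits = self.fc(torch.cat([cls_speech, cls_text], dim=-1))
            # Softmax turns the logits into class probabilities.
            return torch.softmax(logits, dim=-1)

    # Quick shape check with dummy CLS vectors (batch of 2).
    head = ShallowFusionHead()
    print(head(torch.randn(2, 768), torch.randn(2, 1024)).shape)  # torch.Size([2, 4])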
3.2.2. Co-Attentional Fusion

In order to provide an opportunity for embedding-level interaction between the two modalities, we propose to use a Co-Attentional layer [24]. Co-Attention is a variant of Self-Attention [37] that has been used in visual-linguistic Transformers like ViLBERT [24]. In contrast to Self-Attention, Co-Attention is computed by interchanging the Key-Value vector pairs of one modality with the Query vector of the other modality. Since the CLS token of each modality already aggregates the sequential information [21], we calculate the Query vector from each modality's CLS token and let it attend to the entire sequence of the other modality's embeddings. After the Co-Attentional layer, we concatenate the modified CLS tokens from each modality and send these through a prediction head (a sketch of the layer is given below). Figure 2 gives a detailed illustration of the Co-Attentional layer and the fusion mechanism.

Figure 2: Co-Attentional layer and fusion mechanism.
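A minimal sketch of the Co-Attentional exchange described above, using PyTorch's nn.MultiheadAttention. The head count is an assumption, and the residual "Add" connections shown in Figure 2 are omitted for brevity.

    import torch
    import torch.nn as nn

    class CoAttention(nn.Module):
        """Each modality's CLS token queries the full sequence of the other
        modality (a sketch; head count and dimensions are assumptions)."""

        def __init__(self, d_speech=768, d_text=1024, n_heads=8):
            super().__init__()
            # Query from the speech CLS, keys/values from the text sequence.
            self.s2t = nn.MultiheadAttention(d_speech, n_heads,
                                             kdim=d_text, vdim=d_text,
                                             batch_first=True)
            # Query from the text CLS, keys/values from the speech sequence.
            self.t2s = nn.MultiheadAttention(d_text, n_heads,
                                             kdim=d_speech, vdim=d_speech,
                                             batch_first=True)

        def forward(self, h_speech, h_text):
            cls_s = h_speech[:, :1, :]                         # (B, 1, 768)
            cls_t = h_text[:, :1, :]                           # (B, 1, 1024)
            cross_s, _ = self.s2t(cls_s, h_text, h_text)       # speech CLS attends to text
            cross_t, _ = self.t2s(cls_t, h_speech, h_speech)   # text CLS attends to speech
            # Concatenate the attended CLS tokens for the prediction head.
            return torch.cat([cross_s.squeeze(1), cross_t.squeeze(1)], dim=-1)

    ca = CoAttention()
    fused = ca(torch.randn(2, 50, 768), torch.randn(2, 20, 1024))
    print(fused.shape)  # torch.Size([2, 1792])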
4. Experimental Setup
We implemented our model using PyTorch and the Fairseq [29] toolkit. All models were trained in a distributed setting, using two Tesla V100 32GB GPUs with an effective batch size of 16. The Adam optimiser was used with warm-up updates and a polynomial-decay learning-rate scheduler (a minimal sketch of this setup is given below). The initial learning rate and dropout were set to e− and . .

The IEMOCAP [33] dataset contains conversation data of 10 male and female actors. Similar to prior work [38, 2], we selected the four most commonly used emotion categories of Happy (& Excitement), Sad, Anger, and Neutral. We followed the experimental procedure and evaluation metrics of previous studies [8, 39]. Table 4 provides a comparison of model performance against other SOTA models on Binary Accuracy (BA) and F1-score [39, 8]. Table 5 shows the performance comparison w.r.t. the 4-class unweighted accuracy metric, following recent work by Li et al. [2].
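As a concrete reference for the optimisation setup above, here is a minimal sketch of Adam with linear warm-up followed by polynomial (here degree-1) decay. The learning rate, step counts, and stand-in model are placeholders, not the paper's exact values.

    import torch
    import torch.nn as nn

    model = nn.Linear(1792, 4)  # stands in for the full fine-tuned framework
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # placeholder lr

    warmup_steps, total_steps = 1000, 20000  # placeholder schedule lengths

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warm-up
        # degree-1 polynomial decay down to zero at total_steps
        return max(0.0, 1 - (step - warmup_steps) / (total_steps - warmup_steps))

    # Call scheduler.step() once per parameter update.
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)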
CMU-MOSEI [34] is currently the largest dataset for multimodal emotion recognition, consisting of 23,453 examples created by extracting review videos from YouTube. Each example is annotated with a sentiment score between -3 and +3. CMU-MOSI [35] has properties similar to CMU-MOSEI [34], but with only 2000 examples. To compare our model on both datasets, we follow the latest prior work that has used them [8, 39]. For both, we used 7-class accuracy, Mean Absolute Error (MAE), 2-class (binary) accuracy, and F1-score; a sketch of these metrics follows below. Table 2 and Table 3 present the evaluation results on the CMU-MOSEI [34] and CMU-MOSI [35] datasets, respectively.
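The CMU-MOSEI/CMU-MOSI metrics can be computed from the continuous labels as sketched below. Dropping zero-labelled examples when binarising follows a common convention in prior work [8, 39]; treating that as the convention used here is an assumption.

    import numpy as np
    from sklearn.metrics import f1_score

    def mosei_style_metrics(preds, labels):
        """preds, labels: float arrays of sentiment scores in [-3, 3]."""
        # 7-class accuracy: round scores into the integer bins -3..+3.
        acc7 = np.mean(np.round(np.clip(preds, -3, 3)) ==
                       np.round(np.clip(labels, -3, 3)))
        # Binary accuracy / F1 over positive vs. negative sentiment.
        nz = labels != 0  # common convention: drop zero (neutral) labels
        acc2 = np.mean((preds[nz] > 0) == (labels[nz] > 0))
        f1 = f1_score(labels[nz] > 0, preds[nz] > 0)
        mae = np.mean(np.abs(preds - labels))
        return acc7, acc2, f1, mae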
5. Ablation Studies
We conducted four ablation studies using the IEMOCAP [33] dataset to understand the behaviour of the proposed framework. We use Binary Accuracy and F1 score for each emotion as the evaluation metrics. In order to have a fair setting in our ablation experiments, we use the first three sessions of the IEMOCAP [33] dataset for training, the fourth session for validation, and the fifth session as the test set (a sketch of this split is given below). Table 1 presents the results of our four ablation studies, discussed in the following sections.
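A minimal sketch of the session-based split, assuming each IEMOCAP example carries a session field; the record layout is an assumption.

    # Placeholder records standing in for the loaded IEMOCAP examples.
    iemocap = [{"session": s, "wav": None, "text": "", "label": None}
               for s in (1, 2, 3, 4, 5)]

    train = [ex for ex in iemocap if ex["session"] in (1, 2, 3)]
    valid = [ex for ex in iemocap if ex["session"] == 4]
    test  = [ex for ex in iemocap if ex["session"] == 5]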
5.1. Shallow-Fusion vs. Co-Attentional Fusion (fine-tuned SSL models)

In the first ablation study, Table 1 (5.1), we compare the performance of each fusion mechanism when fine-tuning Speech-BERT and RoBERTa for the downstream task of multimodal emotion recognition. Shallow-Fusion shows a slight improvement over Co-Attentional fusion with respect to binary accuracy and F1-score for each emotion. This illustrates how a simple fusion mechanism followed by a classification head works remarkably well with fine-tuned "BERT-like" pretrained SSL models, even in a multimodal setting.
5.2. Uni-Modal Comparison

Table 1 (5.2) shows the performance comparison between uni-modal inputs. In this experiment, we use the CLS tokens of Speech-BERT and RoBERTa as the sequence representations for speech and text, respectively. As the results suggest, text-only performs better than speech-only. A possible reason might be the availability of strong emotional cues in the linguistic structure. However, we can still see a clear improvement when comparing the text-only results with our best performing multimodal model, which highlights the importance of multi-modality.
5.3. Fine-Tuned vs. Frozen (using Shallow-Fusion)

Table 1 (5.3) looks at the effect of fine-tuning versus keeping the Speech-BERT and RoBERTa models frozen (a minimal freezing sketch is shown below). Unsurprisingly, fine-tuning the two SSL models with Shallow-Fusion for the downstream task leads to better performance.
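Freezing amounts to disabling gradients for the SSL encoders; a minimal sketch:

    import torch.nn as nn

    def freeze(module: nn.Module) -> None:
        """Use an encoder as a fixed feature extractor (a minimal sketch)."""
        for p in module.parameters():
            p.requires_grad = False  # exclude from gradient updates
        module.eval()                # also disable dropout in the forward pass

    # e.g. freeze(speech_bert); freeze(roberta)  (encoder names as used above)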
5.4. Shallow-Fusion vs. Co-Attentional Fusion (frozen SSL models)

Finally, we compare how the two fusion mechanisms behave when we keep Speech-BERT and RoBERTa in a frozen state, using them only as feature extractors. Table 1 (5.4) shows that when the SSL networks are frozen, Co-Attentional fusion performs better. We attribute this to the increased number of interactions prior to the prediction layers, which enables Co-Attentional fusion to adapt better than Shallow-Fusion. Co-Attentional fusion, however, requires a much larger number of new trainable parameters: while the Co-Attentional layer adds nearly 6 million new parameters, Shallow-Fusion adds only close to fourteen thousand (a way to verify such counts is sketched after Table 1).

                                                Happy           Sad             Angry           Neutral
Algorithm                                       Acc    F1       Acc    F1       Acc    F1       Acc    F1
(5.1) Shallow-Fusion vs. Co-Attentional Fusion (fine-tuned SSL models)
Co-Attentional Fusion                           83.64  83.07    90.12  90.13    92.56  93.01    81.06  79.97
Shallow-Fusion (best performing model)          84.13  83.75    90.894 90.7     93.56  93.62    81.55  81.05
(5.2) Uni-Modal Comparison
Speech: Speech-BERT only                        71.15  66.87    85.657 85.01    87.67  87.73    71.78  68.9
Text: RoBERTa only                              81.02  81.62    87.67  87.1     91.02  91.21    78.72  78.85
(5.3) Fine-Tuned vs. Frozen (using Shallow-Fusion)
Frozen SSL                                      69.22  62.98    85.57  83.92    89.68  89.5     75.34  72.95
Fine-tuned SSL                                  (same as best performing model)
(5.4) Shallow-Fusion vs. Co-Attentional Fusion (frozen SSL models)
Co-Attentional Fusion (frozen SSL)              76.55  75.32    86.86  85.72    90     90.2     77.04  76.65
Shallow-Fusion (frozen SSL)                     69.22  62.98    85.57  83.92    89.68  89.5     75.34  72.95

Table 1: Evaluation results of the ablation studies on the IEMOCAP dataset with Binary Accuracy (BA) and F1 score.
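The parameter comparison above can be checked with a small helper. For example, a single linear layer over the concatenated 1792-dimensional CLS vector with an 8-logit output has (768 + 1024) x 8 + 8 = 14,344 trainable parameters, in the ballpark of the fourteen thousand quoted above; the 8-logit head size is an assumption chosen for illustration.

    import torch.nn as nn

    def n_trainable(module: nn.Module) -> int:
        """Count parameters that would be updated during fine-tuning."""
        return sum(p.numel() for p in module.parameters() if p.requires_grad)

    # Shallow-Fusion head: one linear layer over the concatenated CLS tokens
    # (output size 8 is an assumption, not taken from the paper).
    print(n_trainable(nn.Linear(768 + 1024, 8)))  # 14344, i.e. ~fourteen thousand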
Algorithm                                     Acc (7-class)   Acc (2-class)   F1      MAE
Multimodal Transformer (Tsai et al., 2019)    50.7            81.6            81.6    0.691
ICCN (Sun et al., 2019)                       51.58           84.4            84.36   0.713
Shallow-Fusion of SSL models (Ours)           55.971          88.04           88.08   0.491

Table 2: Results for multimodal emotion analysis on CMU-MOSEI with seven-class accuracy, binary accuracy (BA), F1 score, and MAE. Performance of the other models is taken from ICCN (2019) [39] and MulT (2019) [8].
Algorithm                                     Acc (7-class)   Acc (2-class)   F1      MAE
Multimodal Transformer (Tsai et al., 2019)    39.1            81.1            81      0.686
ICCN (Sun et al., 2019)                       39.01           83.07           83.02   0.862
Shallow-Fusion of SSL models (Ours)           48.927          88.275          88.57   0.577

Table 3: Results for multimodal emotion analysis on CMU-MOSI. Performance of the other models is taken from ICCN (2019) [39] and MulT (2019) [8].
                                      Happy           Sad             Angry           Neutral
Algorithm                             Acc    F1       Acc    F1       Acc    F1       Acc    F1
MulT (Tsai et al., 2019)              84.4   81.9     77.7   74.1     73.9   70.2     62.5   59.7
ICCN (Sun et al., 2019)               87.41  84.72    86.26  85.93    88.62  88.02    69.73  68.47
Shallow-Fusion of SSL models (Ours)   89.71  88.34    89.48  89.2     93.82  93.9     80.93  81.01

Table 4: Results for multimodal emotion analysis on IEMOCAP with Binary Accuracy (BA) and F1 score. Performance of the other models is taken from ICCN (2019) [39] and MulT (2019) [8].
                                      Happy   Sad     Angry   Neutral   Unweighted Accuracy
Algorithm                             Acc     Acc     Acc     Acc       Acc
PAaAN (Li et al., 2019)               71.9    73.2    76.9    59.2      70.3
Shallow-Fusion of SSL models (Ours)   77.06   78.414  81.88   64.748    75.458

Table 5: Results for multimodal emotion analysis on IEMOCAP with 4-class unweighted accuracy. Performance of the other model is taken from PAaAN (2019) [2].
6. Conclusion
In this work, we use two pretrained "BERT-like" architectures to solve the downstream task of multimodal emotion recognition. To the best of our knowledge, this is the first time that two SSL models representing speech and text have been jointly fine-tuned for the task of multimodal speech emotion recognition. Through several experiments, we show how a simple fusion mechanism (Shallow-Fusion) makes the overall framework simple and straightforward while improving on more complex fusion mechanisms. We also highlight the importance of introducing BERT-like models to process speech signals, which can easily be used to improve the performance of multimodal tasks like emotion recognition. Having structurally similar "BERT-like" architectures to represent both speech and text allows us to fuse the modalities in a straightforward way and quickly adapt standard practices from the NLP domain.

In future work, we hope to visualize and further explore the behavior of SSL models for the task of multimodal emotion recognition. Exploring the use of BERT-like models to represent speech could enable advances in NLP to be easily transferred to the domain of speech.
7. Acknowledgements
This work is supported by the Assistive Augmentation research grant under the Entrepreneurial Universities (EU) initiative of New Zealand.

8. References

[1] R. W. Picard, Affective Computing. MIT Press, 2000.
[2] J.-L. Li and C.-C. Lee, "Attentive to individual: A multimodal emotion recognition network with personalized attention profile," Proc. Interspeech 2019, pp. 211–215, 2019.
[3] N. Sebe, I. Cohen, T. Gevers, and T. S. Huang, "Multimodal approaches for emotion recognition: a survey," in Internet Imaging VI, vol. 5670. International Society for Optics and Photonics, 2005, pp. 56–67.
[4] J. Kim and R. A. Saurous, "Emotion recognition from human speech using temporal information and deep learning," in Interspeech, 2018, pp. 937–940.
[5] A. Singh, V. Kadyan, M. Kumar, and N. Bassan, "ASRoIL: a comprehensive survey for automatic speech recognition of Indian languages," Artificial Intelligence Review, pp. 1–32, 2019.
[6] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, "Learning factorized multimodal representations," arXiv preprint arXiv:1806.06176, 2018.
[7] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency, "Multi-attention recurrent network for human communication comprehension," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," arXiv preprint arXiv:1906.00295, 2019.
[9] M. Swain, A. Routray, and P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: a review," International Journal of Speech Technology, vol. 21, no. 1, pp. 93–120, 2018.
[10] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer, "COVAREP: a collaborative voice analysis repository for speech technologies," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 960–964.
[11] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[12] Z. Han, H. Zhao, and R. Wang, "Transfer learning for speech emotion recognition," IEEE, 2019, pp. 96–99.
[13] S. Parthasarathy, V. Rozgic, M. Sun, and C. Wang, "Improving emotion classification through variational inference of latent variables," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7410–7414.
[14] K. Feng and T. Chaspari, "A review of generalizable transfer learning in automatic emotion recognition," Frontiers in Computer Science, vol. 2, p. 9, 2020.
[15] Z. Lu, L. Cao, Y. Zhang, C.-C. Chiu, and J. Fan, "Speech sentiment analysis via pre-trained features from end-to-end ASR models," arXiv preprint arXiv:1911.09762, 2019.
[16] S. Latif, R. Rana, J. Qadir, and J. Epps, "Variational autoencoders for learning latent representations of speech emotion: A preliminary study," arXiv preprint arXiv:1712.08708, 2017.
[17] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, "Adversarial auto-encoders for speech based emotion recognition," arXiv preprint arXiv:1806.02146, 2018.
[18] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem, "Challenging common assumptions in the unsupervised learning of disentangled representations," arXiv preprint arXiv:1811.12359, 2018.
[19] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, "DialogueGCN: A graph convolutional neural network for emotion recognition in conversation," arXiv preprint arXiv:1908.11540, 2019.
[20] C. Zhang, Z. Yang, X. He, and L. Deng, "Multimodal intelligence: Representation learning, information fusion, and applications," arXiv preprint arXiv:1911.03977, 2019.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[23] A. Kolesnikov, X. Zhai, and L. Beyer, "Revisiting self-supervised visual representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1920–1929.
[24] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Advances in Neural Information Processing Systems, 2019, pp. 13–23.
[25] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," arXiv preprint arXiv:1904.03416, 2019.
[26] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[27] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," arXiv preprint arXiv:1910.05453, 2019.
[28] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
[29] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," arXiv preprint arXiv:1904.01038, 2019.
[30] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[31] A. van den Oord, O. Vinyals et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[33] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[34] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, "Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246.
[35] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, "MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos," arXiv preprint arXiv:1606.06259, 2016.
[36] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[38] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency, "Context-dependent sentiment analysis in user-generated videos," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 873–883.
[39] Z. Sun, P. Sarma, W. Sethares, and Y. Liang, "Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis," arXiv preprint arXiv:1911.05544, 2019.