Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara
Augmented Human Lab, Auckland Bioengineering Institute, The University of Auckland
[email protected], [email protected], [email protected], [email protected]
Abstract
Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the use of modality-specific "BERT-like" pretrained Self Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition. By conducting experiments on three publicly available datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), we show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results. We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones when using SSL models that have similar architectural properties to BERT.
Index Terms: speech emotion recognition, self supervised learning, Transformers, BERT, multimodal deep learning
1. Introduction
Emotion recognition plays a significant role in many intelligent interfaces [1]. Even with the recent advances in Deep Learning (DL), this is still a challenging task, mainly because most publicly available annotated datasets in this domain are small in scale, which makes DL models prone to over-fitting. Another important feature of emotion recognition is the inherent multi-modality in the way we express emotions [2]. Emotional information can be captured by studying many modalities, including facial expressions, body postures, and EEG [3]. Of these, speech is arguably the most accessible. In addition to accessibility, speech signals contain many other emotional cues [4]. Although speech signals contain substantial amounts of information, it can be unrewarding to drop the linguistic component that coexists with them, especially given that the text component can be easily transcribed in real-world applications thanks to the considerable successes in the domain of speech-to-text, with several commercial-scale APIs available [5].

In multimodal emotion recognition, representation learning and fusion of modalities can be identified as a major research area [6, 7, 8]. Recent work has explored the use of deep representations in contrast to low-level representations [9] such as MFCC, COVAREP [10] or GloVe embeddings [11]. Such deep representation techniques fall into two main categories: 1) transfer learning techniques that use pre-trained networks to extract features [12, 13, 14] or fine-tune models [15]; and 2) unsupervised embedding learning techniques, which include variational auto-encoders (VAE) [16] and adversarial auto-encoders (AE) [17]. It is also important to highlight that performance usually degrades in transfer learning techniques due to the mismatch of source and target tasks. Recent work [18] explains the problems related to learning disentangled representations from VAEs when no inductive bias on the model or the dataset exists. In terms of fusing multiple modalities, recent work has explored architectures like attention [2, 7], graph neural networks [19] and Transformers [8]. Multimodal fusion mechanisms, especially those that fuse deep representations, usually result in architecturally complex models [20].

In representation learning, a class of techniques known as SSL has achieved SOTA performance in many areas of Natural Language Processing (NLP) [21, 22], Computer Vision (CV) [23, 24] and speech recognition [25, 26, 27]. SSL enables us to use a large unlabelled dataset to train models that can later be used to extract representations and be fine-tuned for specific problems that may have limited amounts of training data. Prior works [21, 23] have highlighted the effectiveness of fine-tuning pre-trained SSL models for specific tasks, in contrast to using them only as frozen feature extractors. A significant transition happened in the field of NLP with the introduction of SSL models like the Deep Bidirectional Transformers [21] (BERT) and its successors [22]. By adding a single task-specific layer to a pre-trained SSL model like BERT, one can solve multiple downstream tasks. BERT-like models also provide favourable architectural features such as the
CLS token, which can be used as a representation for the entire sequence. Another important factor is the extensive availability of pre-trained models in the open-source community, which leads to both cost and time savings, since these models tend to be very computationally expensive to train from scratch.

Even though several SSL models have been introduced for speech recognition related tasks like speech-to-text [26, 27] and speech emotion recognition [25], prior work has not looked at combining multiple separate SSL models, each specializing in one modality. This may be due to the architectural complexity of such models, brought on by the fact that these SSL networks usually have a large number of parameters. Combining multiple drastically different high-dimensional representations is also not a simple task and may increase the parameter count even further. If, however, the modality-specific SSL architectures share similar properties, then we may be able to use simpler fusion mechanisms to extract information for the desired task. Our work was heavily inspired by recent work [27, 28], which explored the effectiveness of self supervised pre-training with discretized speech representations for the task of Automatic Speech Recognition (ASR).

For the first time in the literature, we jointly fine-tuned modality-specific "BERT-like" SSL models that represent speech [28, 27] and text [22] on the task of multimodal emotion recognition. We further evaluate how simple fusion methods, which add minimal additional trainable parameters, perform when compared with more complex fusion mechanisms such as Co-Attentional [24] fusion. We also conducted a series of ablation studies to explore which factors affect the performance of these models. Please refer to our PyTorch implementation.
2. Pre-Trained SSL Models

We summarise the three pretrained SSL models that were used to process speech and text signals in the proposed framework in the following sections. We use pretrained models available in the Fairseq toolkit [29].
VQ-Wav2Vec [27] is an extension of Wav2Vec [26] that focuses on moving continuous speech representations into the discrete domain. Wav2Vec [26] learns representations from speech signals based on Contrastive Predictive Coding [30] (CPC). The major difference of VQ-Wav2Vec [27] from Wav2Vec [26] is the application of Vector Quantization [31] methods to generate discretized speech representations. In our experiments, we used a pretrained VQ-Wav2Vec [27] model that was trained on Librispeech-960 [32] to represent speech signals as a sequence of tokens, similar to the tokenization step for a sentence in NLP (see the sketch below).
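To make the discretization step concrete, the following is a minimal sketch of tokenizing a waveform with a pretrained vq-wav2vec checkpoint through Fairseq; the checkpoint filename and the random input waveform are placeholders.

    import torch
    import fairseq

    # Load a pretrained vq-wav2vec checkpoint (filename is a placeholder).
    models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
        ["vq-wav2vec.pt"]
    )
    model = models[0]
    model.eval()

    # One second of 16 kHz audio; in practice this is a real waveform tensor.
    wav = torch.randn(1, 16000)

    with torch.no_grad():
        z = model.feature_extractor(wav)                 # continuous features
        _, idxs = model.vector_quantizer.forward_idx(z)  # discrete codebook indices

    # idxs has shape (1, T, G): T discrete time steps, G codebook groups,
    # i.e. the speech is now a token sequence, like words in a sentence.
    print(idxs.shape)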
The term Speech-BERT is used in our work to denote a BERT-like Transformer architecture trained on a set of discretized speech tokens, where the speech signal was discretized and tokenized by a pretrained VQ-Wav2Vec model as mentioned in the above section. We were heavily motivated by recent work [28], which illustrated the effectiveness of BERT-like models in the domain of ASR. We used a pretrained Speech-BERT that was trained on the discretized Librispeech-960 [32] dataset with the pretext task of masked token prediction. The Speech-BERT model architecture is similar to BERT-base [21], consisting of 12 layers and an embedding dimension of 768. During our experiments, we fine-tune the Speech-BERT model for the task of multimodal emotion recognition.
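To show how quantizer output can become BERT input, here is a sketch of one plausible tokenization scheme; the index-pairing convention is an assumption for illustration, in the spirit of the preprocessing described in [28].

    import torch

    # A sketch (the pairing scheme is an assumption): treat each frame's pair of
    # codebook-group indices as a single vocabulary symbol, so a BERT-style model
    # can be pretrained on the resulting sequence with masked token prediction.
    def indices_to_tokens(idxs: torch.Tensor) -> list:
        """idxs: LongTensor of shape (1, T, 2) from vq-wav2vec's forward_idx."""
        return [f"{a}-{b}" for a, b in idxs.squeeze(0).tolist()]

    example = torch.tensor([[[103, 7], [103, 7], [45, 12]]])
    print(indices_to_tokens(example))  # ['103-7', '103-7', '45-12']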
RoBERTa [22] is an extension of the BERT [21] model that does not use the next sentence prediction task [22] during training. The RoBERTa [22] architecture consists of 24 layers and an embedding dimension of 1024. Similar to Speech-BERT, we fine-tune the RoBERTa [22] model for the task of multimodal emotion recognition.
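For the text side, a pretrained RoBERTa can be loaded directly through the Fairseq hub interface; a minimal sketch follows, with the input sentence as a placeholder.

    import torch

    # Load the 24-layer RoBERTa-large model (embedding dimension 1024).
    roberta = torch.hub.load("pytorch/fairseq", "roberta.large")
    roberta.eval()

    tokens = roberta.encode("I am feeling great today!")
    with torch.no_grad():
        feats = roberta.extract_features(tokens)  # shape (1, T, 1024)

    # The first position corresponds to the <s> (CLS-style) token, which we use
    # as the sentence-level representation.
    cls_text = feats[:, 0, :]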
3. Methodology
We explore the use of the Speech-BERT and RoBERTa SSL models for the task of multimodal speech emotion recognition. As the first step, we evaluate two possible fusion mechanisms to combine the two SSL models. The performance of the final proposed model was then compared with published SOTA results on the IEMOCAP [33], CMU-MOSEI [34], and CMU-MOSI [35] datasets. Finally, we conduct an extensive set of ablation studies with the IEMOCAP dataset [33] to understand the behaviour of our proposed framework under different settings. We investigate the effects of fine-tuned and frozen states for Speech-BERT and RoBERTa, as well as the effects of different fusion mechanisms.
3.1. Proposed Framework

Figure 1 gives an overview of the proposed framework. Both the speech signal and the text transcript are simultaneously fed into the model via two different pipelines. The speech signal gets discretized by a pre-trained VQ-Wav2Vec [27] model and the text transcript is tokenized with the GPT-2 tokenizer [36]. Once the speech and text modalities are tokenized, we send them through the pre-trained Speech-BERT and RoBERTa models, whose outputs have embedding sizes of 768 and 1024 and maximum sequence lengths of 2048 and 512, respectively (a sketch of this data flow is given below). The next step is fusing these embeddings prior to the prediction head, which consists of a single fully connected layer. In this paper we explore two possible fusion mechanisms, which we discuss in Section 3.2. Finally, we fine-tune the entire framework, including both the Speech-BERT and RoBERTa SSL models (the components inside the blue dotted box in Figure 1).

Figure 1: Overview of the proposed framework with Shallow-Fusion.
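To summarise the data flow in Figure 1, here is a minimal sketch of the two pipelines up to the fusion step. The encoder call signatures are simplifying assumptions; in practice the discrete indices are mapped into Speech-BERT's vocabulary as in the tokenization sketch of Section 2.

    import torch

    # A sketch of the two pipelines in Figure 1 (component names are placeholders
    # for the pretrained models described in Section 2).
    def encode_example(wav, text, vq_wav2vec, speech_bert, roberta):
        # Speech: waveform -> discrete indices -> Speech-BERT (768-d, max len 2048).
        z = vq_wav2vec.feature_extractor(wav)
        _, idxs = vq_wav2vec.vector_quantizer.forward_idx(z)
        h_speech = speech_bert(idxs)               # assumed to return (1, T_s, 768)

        # Text: GPT-2 BPE tokenization -> RoBERTa (1024-d, max len 512).
        tokens = roberta.encode(text)
        h_text = roberta.extract_features(tokens)  # (1, T_t, 1024)

        # Position 0 of each output is the CLS-style summary token; this pair is
        # what the fusion step (Section 3.2) consumes.
        return h_speech[:, 0, :], h_text[:, 0, :]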
3.2. Fusion Mechanisms

The fusion mechanism plays an essential role in any multimodal speech emotion recognition framework. In this work, we analyse how the following fusion mechanisms affect performance.
3.2.1. Shallow-Fusion

The success of BERT in sentence classification tasks [21] highlights the effective use of the CLS token as a representation of the entire sentence. CLS stands for classification, and it is the first token of every input sequence to BERT. Motivated by recent work in the domain of NLP [21, 22], we concatenate the two CLS tokens computed respectively by the Speech-BERT and RoBERTa models, as described in Figure 1. Finally, we send the concatenated embedding through a classification head that consists of a fully connected layer producing logits, followed by a softmax function (a minimal sketch follows below). Due to the simplicity of this fusion, we describe the mechanism as Shallow-Fusion. We also use Shallow-Fusion as the standard fusion mechanism in the ablation studies, since it achieved superior performance in our experiments.
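A minimal sketch of the Shallow-Fusion head; the input dimensions follow the encoder outputs described above, and the 4-class output size is illustrative.

    import torch
    import torch.nn as nn

    class ShallowFusionHead(nn.Module):
        """Concatenate the two CLS embeddings and classify (a minimal sketch)."""

        def __init__(self, d_speech=768, d_text=1024, n_classes=4):
            super().__init__()
            # A single fully connected layer over the concatenated CLS tokens.
            self.fc = nn.Linear(d_speech + d_text, n_classes)

        def forward(self, cls_speech, cls_text):
            logits = self.fc(torch.cat([cls_speech, cls_text], dim=-1))
            # Softmax turns the logits into class probabilities.
            return torch.softmax(logits, dim=-1)

    # Quick shape check with dummy CLS vectors (batch of 2).
    head = ShallowFusionHead()
    print(head(torch.randn(2, 768), torch.randn(2, 1024)).shape)  # torch.Size([2, 4])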
3.2.2. Co-Attentional Fusion

In order to provide an opportunity for embedding-level interaction between the two modalities, we propose to use a Co-Attentional layer [24]. Co-Attention is a variant of Self-Attention [37] that has been used in visual-linguistic Transformers like ViLBERT [24]. In contrast to Self-Attention, Co-Attention is computed by interchanging the Key-Value vector pairs of one modality with the Query vector of the other modality. Since the CLS token of each modality already aggregates the sequential information [21], we calculate the Query vector from each modality's CLS token and let it attend to the entire sequence of the other modality's embeddings. After the Co-Attentional layer, we concatenate the modified CLS tokens from each modality and send these through a prediction head (a sketch of the layer is given below). Figure 2 gives a detailed illustration of the Co-Attentional layer and the fusion mechanism.

Figure 2: Co-Attentional layer and fusion mechanism.
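A minimal sketch of the Co-Attentional exchange described above, using PyTorch's nn.MultiheadAttention. The head count is an assumption, and the residual "Add" connections shown in Figure 2 are omitted for brevity.

    import torch
    import torch.nn as nn

    class CoAttention(nn.Module):
        """Each modality's CLS token queries the full sequence of the other
        modality (a sketch; head count and dimensions are assumptions)."""

        def __init__(self, d_speech=768, d_text=1024, n_heads=8):
            super().__init__()
            # Query from the speech CLS, keys/values from the text sequence.
            self.s2t = nn.MultiheadAttention(d_speech, n_heads,
                                             kdim=d_text, vdim=d_text,
                                             batch_first=True)
            # Query from the text CLS, keys/values from the speech sequence.
            self.t2s = nn.MultiheadAttention(d_text, n_heads,
                                             kdim=d_speech, vdim=d_speech,
                                             batch_first=True)

        def forward(self, h_speech, h_text):
            cls_s = h_speech[:, :1, :]                         # (B, 1, 768)
            cls_t = h_text[:, :1, :]                           # (B, 1, 1024)
            cross_s, _ = self.s2t(cls_s, h_text, h_text)       # speech CLS attends to text
            cross_t, _ = self.t2s(cls_t, h_speech, h_speech)   # text CLS attends to speech
            # Concatenate the attended CLS tokens for the prediction head.
            return torch.cat([cross_s.squeeze(1), cross_t.squeeze(1)], dim=-1)

    ca = CoAttention()
    fused = ca(torch.randn(2, 50, 768), torch.randn(2, 20, 1024))
    print(fused.shape)  # torch.Size([2, 1792])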
4. Experimental Setup
We implemented our model using PyTorch and the Fairseq [29] toolkit. All models were trained in a distributed setting, using two Tesla V100 32GB GPUs with an effective batch size of 16. The Adam optimiser was used with warm-up updates and a polynomial-decay learning-rate scheduler (a minimal sketch of this setup is given below). The initial learning rate and dropout were set to e− and . .

The IEMOCAP [33] dataset contains conversation data of 10 male and female actors. Similar to prior work [38, 2], we selected the four most commonly used emotion categories of Happy (& Excitement), Sad, Anger, and Neutral. We followed the experimental procedure and evaluation metrics of previous studies [8, 39]. Table 4 provides a comparison of model performance against other SOTA models on Binary Accuracy (BA) and F1-score [39, 8]. Table 5 shows the performance comparison w.r.t. the 4-class unweighted accuracy metric, following recent work by Li et al. [2].
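As a concrete reference for the optimisation setup above, here is a minimal sketch of Adam with linear warm-up followed by polynomial (here degree-1) decay. The learning rate, step counts, and stand-in model are placeholders, not the paper's exact values.

    import torch
    import torch.nn as nn

    model = nn.Linear(1792, 4)  # stands in for the full fine-tuned framework
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # placeholder lr

    warmup_steps, total_steps = 1000, 20000  # placeholder schedule lengths

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warm-up
        # degree-1 polynomial decay down to zero at total_steps
        return max(0.0, 1 - (step - warmup_steps) / (total_steps - warmup_steps))

    # Call scheduler.step() once per parameter update.
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)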
CMU-MOSEI [34] is currently the largest dataset for multimodal emotion recognition, consisting of 23,453 examples created by extracting review videos from YouTube. Each example is annotated with a sentiment score between -3 and +3. CMU-MOSI [35] has properties similar to CMU-MOSEI [34], but with only 2000 examples. To compare our model on both datasets, we follow the latest prior work that has used them [8, 39]. For both, we used 7-class accuracy, Mean Absolute Error (MAE), 2-class (binary) accuracy, and F1-score; a sketch of these metrics follows below. Table 2 and Table 3 present the evaluation results on the CMU-MOSEI [34] and CMU-MOSI [35] datasets, respectively.
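The CMU-MOSEI/CMU-MOSI metrics can be computed from the continuous labels as sketched below. Dropping zero-labelled examples when binarising follows a common convention in prior work [8, 39]; treating that as the convention used here is an assumption.

    import numpy as np
    from sklearn.metrics import f1_score

    def mosei_style_metrics(preds, labels):
        """preds, labels: float arrays of sentiment scores in [-3, 3]."""
        # 7-class accuracy: round scores into the integer bins -3..+3.
        acc7 = np.mean(np.round(np.clip(preds, -3, 3)) ==
                       np.round(np.clip(labels, -3, 3)))
        # Binary accuracy / F1 over positive vs. negative sentiment.
        nz = labels != 0  # common convention: drop zero (neutral) labels
        acc2 = np.mean((preds[nz] > 0) == (labels[nz] > 0))
        f1 = f1_score(labels[nz] > 0, preds[nz] > 0)
        mae = np.mean(np.abs(preds - labels))
        return acc7, acc2, f1, mae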
5. Ablation Studies
We conducted four ablation studies using the IEMOCAP [33] dataset to understand the behaviour of the proposed framework. We use Binary Accuracy and F1 score for each emotion as the evaluation metrics. In order to have a fair setting in our ablation experiments, we use the first three sessions of the IEMOCAP [33] dataset for training, the fourth session for validation, and the fifth session as the test set (a sketch of this split is given below). Table 1 presents the results of our four ablation studies, discussed in the following sections.
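A minimal sketch of the session-based split, assuming each IEMOCAP example carries a session field; the record layout is an assumption.

    # Placeholder records standing in for the loaded IEMOCAP examples.
    iemocap = [{"session": s, "wav": None, "text": "", "label": None}
               for s in (1, 2, 3, 4, 5)]

    train = [ex for ex in iemocap if ex["session"] in (1, 2, 3)]
    valid = [ex for ex in iemocap if ex["session"] == 4]
    test  = [ex for ex in iemocap if ex["session"] == 5]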
5.1. Shallow-Fusion vs. Co-Attentional Fusion (fine-tuned SSL models)

In the first ablation study, Table 1 (5.1), we compare the performance of each fusion mechanism when fine-tuning Speech-BERT and RoBERTa for the downstream task of multimodal emotion recognition. Shallow-Fusion shows a slight improvement over Co-Attentional fusion with respect to binary accuracy and F1-score for each emotion. This illustrates how a simple fusion mechanism followed by a classification head works remarkably well with fine-tuned "BERT-like" pretrained SSL models, even in a multimodal setting.
5.2. Uni-Modal Comparison

Table 1 (5.2) shows the performance comparison between uni-modal inputs. In this experiment, we use the CLS tokens of Speech-BERT and RoBERTa as the sequence representations for speech and text, respectively. As the results suggest, text-only performs better than speech-only. A possible reason might be the availability of strong emotional cues in the linguistic structure. However, we can still see a clear improvement when comparing the text-only results with our best performing multimodal model, which highlights the importance of multi-modality.
5.3. Fine-Tuned vs. Frozen (using Shallow-Fusion)

Table 1 (5.3) looks at the effect of fine-tuning versus keeping the Speech-BERT and RoBERTa models frozen (a minimal freezing sketch is shown below). Unsurprisingly, fine-tuning the two SSL models with Shallow-Fusion for the downstream task leads to better performance.
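Freezing amounts to disabling gradients for the SSL encoders; a minimal sketch:

    import torch.nn as nn

    def freeze(module: nn.Module) -> None:
        """Use an encoder as a fixed feature extractor (a minimal sketch)."""
        for p in module.parameters():
            p.requires_grad = False  # exclude from gradient updates
        module.eval()                # also disable dropout in the forward pass

    # e.g. freeze(speech_bert); freeze(roberta)  (encoder names as used above)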
5.4. Shallow-Fusion vs. Co-Attentional Fusion (frozen SSL models)

Finally, we compare how the two fusion mechanisms behave when we keep Speech-BERT and RoBERTa in a frozen state, using them only as feature extractors. Table 1 (5.4) shows that when the SSL networks are frozen, Co-Attentional fusion performs better. We attribute this to the increased number of interactions prior to the prediction layers, which enables Co-Attentional fusion to adapt better than Shallow-Fusion. Co-Attentional fusion, however, requires a much larger number of new trainable parameters: while the Co-Attentional layer adds nearly 6 million new parameters, Shallow-Fusion adds only close to fourteen thousand (a way to verify such counts is sketched after Table 1).

                                                Happy           Sad             Angry           Neutral
Algorithm                                       Acc    F1       Acc    F1       Acc    F1       Acc    F1
(5.1) Shallow-Fusion vs. Co-Attentional Fusion (fine-tuned SSL models)
Co-Attentional Fusion                           83.64  83.07    90.12  90.13    92.56  93.01    81.06  79.97
Shallow-Fusion (best performing model)          84.13  83.75    90.894 90.7     93.56  93.62    81.55  81.05
(5.2) Uni-Modal Comparison
Speech: Speech-BERT only                        71.15  66.87    85.657 85.01    87.67  87.73    71.78  68.9
Text: RoBERTa only                              81.02  81.62    87.67  87.1     91.02  91.21    78.72  78.85
(5.3) Fine-Tuned vs. Frozen (using Shallow-Fusion)
Frozen SSL                                      69.22  62.98    85.57  83.92    89.68  89.5     75.34  72.95
Fine-tuned SSL                                  (same as best performing model)
(5.4) Shallow-Fusion vs. Co-Attentional Fusion (frozen SSL models)
Co-Attentional Fusion (frozen SSL)              76.55  75.32    86.86  85.72    90     90.2     77.04  76.65
Shallow-Fusion (frozen SSL)                     69.22  62.98    85.57  83.92    89.68  89.5     75.34  72.95

Table 1: Evaluation results of the ablation studies on the IEMOCAP dataset with Binary Accuracy (BA) and F1 score.
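The parameter comparison above can be checked with a small helper. For example, a single linear layer over the concatenated 1792-dimensional CLS vector with an 8-logit output has (768 + 1024) x 8 + 8 = 14,344 trainable parameters, in the ballpark of the fourteen thousand quoted above; the 8-logit head size is an assumption chosen for illustration.

    import torch.nn as nn

    def n_trainable(module: nn.Module) -> int:
        """Count parameters that would be updated during fine-tuning."""
        return sum(p.numel() for p in module.parameters() if p.requires_grad)

    # Shallow-Fusion head: one linear layer over the concatenated CLS tokens
    # (output size 8 is an assumption, not taken from the paper).
    print(n_trainable(nn.Linear(768 + 1024, 8)))  # 14344, i.e. ~fourteen thousand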
Algorithm                                     Acc (7-class)   Acc (2-class)   F1      MAE
Multimodal Transformer (Tsai et al., 2019)    50.7            81.6            81.6    0.691
ICCN (Sun et al., 2019)                       51.58           84.4            84.36   0.713
Shallow-Fusion of SSL models (Ours)           55.971          88.04           88.08   0.491

Table 2: Results for multimodal emotion analysis on CMU-MOSEI with seven-class accuracy, binary accuracy (BA), F1 score, and MAE. Performance of the other models is taken from ICCN (2019) [39] and MulT (2019) [8].
Algorithm                                     Acc (7-class)   Acc (2-class)   F1      MAE
Multimodal Transformer (Tsai et al., 2019)    39.1            81.1            81      0.686
ICCN (Sun et al., 2019)                       39.01           83.07           83.02   0.862
Shallow-Fusion of SSL models (Ours)           48.927          88.275          88.57   0.577

Table 3: Results for multimodal emotion analysis on CMU-MOSI. Performance of the other models is taken from ICCN (2019) [39] and MulT (2019) [8].
                                      Happy           Sad             Angry           Neutral
Algorithm                             Acc    F1       Acc    F1       Acc    F1       Acc    F1
MulT (Tsai et al., 2019)              84.4   81.9     77.7   74.1     73.9   70.2     62.5   59.7
ICCN (Sun et al., 2019)               87.41  84.72    86.26  85.93    88.62  88.02    69.73  68.47
Shallow-Fusion of SSL models (Ours)   89.71  88.34    89.48  89.2     93.82  93.9     80.93  81.01

Table 4: Results for multimodal emotion analysis on IEMOCAP with Binary Accuracy (BA) and F1 score. Performance of the other models is taken from ICCN (2019) [39] and MulT (2019) [8].
                                      Happy   Sad     Angry   Neutral   Unweighted Accuracy
Algorithm                             Acc     Acc     Acc     Acc       Acc
PAaAN (Li et al., 2019)               71.9    73.2    76.9    59.2      70.3
Shallow-Fusion of SSL models (Ours)   77.06   78.414  81.88   64.748    75.458

Table 5: Results for multimodal emotion analysis on IEMOCAP with 4-class unweighted accuracy. Performance of the other model is taken from PAaAN (2019) [2].
6. Conclusion
In this work, we use two pretrained "BERT-like" architectures to solve the downstream task of multimodal emotion recognition. To the best of our knowledge, this is the first time that two SSL models representing speech and text have been jointly fine-tuned for the task of multimodal speech emotion recognition. Through several experiments, we show how a simple fusion mechanism (Shallow-Fusion) makes the overall framework simple and straightforward while improving on more complex fusion mechanisms. We also highlight the importance of introducing BERT-like models to process speech signals, which can easily be used to improve the performance of multimodal tasks like emotion recognition. Having structurally similar "BERT-like" architectures to represent both speech and text allows us to fuse the modalities in a straightforward way and quickly adapt standard practices from the NLP domain.

In future work, we hope to visualize and further explore the behavior of SSL models for the task of multimodal emotion recognition. Exploring the use of BERT-like models to represent speech could enable advances in NLP to be easily transferred to the domain of speech.
7. Acknowledgements
This work is supported by the Assistive Augmentation research grant under the Entrepreneurial Universities (EU) initiative of New Zealand.

8. References

[1] R. W. Picard, Affective Computing. MIT Press, 2000.
[2] J.-L. Li and C.-C. Lee, "Attentive to individual: A multimodal emotion recognition network with personalized attention profile," Proc. Interspeech 2019, pp. 211–215, 2019.
[3] N. Sebe, I. Cohen, T. Gevers, and T. S. Huang, "Multimodal approaches for emotion recognition: a survey," in Internet Imaging VI, vol. 5670. International Society for Optics and Photonics, 2005, pp. 56–67.
[4] J. Kim and R. A. Saurous, "Emotion recognition from human speech using temporal information and deep learning," in Interspeech, 2018, pp. 937–940.
[5] A. Singh, V. Kadyan, M. Kumar, and N. Bassan, "ASRoIL: a comprehensive survey for automatic speech recognition of Indian languages," Artificial Intelligence Review, pp. 1–32, 2019.
[6] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, "Learning factorized multimodal representations," arXiv preprint arXiv:1806.06176, 2018.
[7] A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, and L.-P. Morency, "Multi-attention recurrent network for human communication comprehension," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," arXiv preprint arXiv:1906.00295, 2019.
[9] M. Swain, A. Routray, and P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: a review," International Journal of Speech Technology, vol. 21, no. 1, pp. 93–120, 2018.
[10] G. Degottex, J. Kane, T. Drugman, T. Raitio, and S. Scherer, "COVAREP: a collaborative voice analysis repository for speech technologies," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 960–964.
[11] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[12] Z. Han, H. Zhao, and R. Wang, "Transfer learning for speech emotion recognition," IEEE, 2019, pp. 96–99.
[13] S. Parthasarathy, V. Rozgic, M. Sun, and C. Wang, "Improving emotion classification through variational inference of latent variables," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7410–7414.
[14] K. Feng and T. Chaspari, "A review of generalizable transfer learning in automatic emotion recognition," Frontiers in Computer Science, vol. 2, p. 9, 2020.
[15] Z. Lu, L. Cao, Y. Zhang, C.-C. Chiu, and J. Fan, "Speech sentiment analysis via pre-trained features from end-to-end ASR models," arXiv preprint arXiv:1911.09762, 2019.
[16] S. Latif, R. Rana, J. Qadir, and J. Epps, "Variational autoencoders for learning latent representations of speech emotion: A preliminary study," arXiv preprint arXiv:1712.08708, 2017.
[17] S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, "Adversarial auto-encoders for speech based emotion recognition," arXiv preprint arXiv:1806.02146, 2018.
[18] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem, "Challenging common assumptions in the unsupervised learning of disentangled representations," arXiv preprint arXiv:1811.12359, 2018.
[19] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, "DialogueGCN: A graph convolutional neural network for emotion recognition in conversation," arXiv preprint arXiv:1908.11540, 2019.
[20] C. Zhang, Z. Yang, X. He, and L. Deng, "Multimodal intelligence: Representation learning, information fusion, and applications," arXiv preprint arXiv:1911.03977, 2019.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[22] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[23] A. Kolesnikov, X. Zhai, and L. Beyer, "Revisiting self-supervised visual representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1920–1929.
[24] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Advances in Neural Information Processing Systems, 2019, pp. 13–23.
[25] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," arXiv preprint arXiv:1904.03416, 2019.
[26] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862, 2019.
[27] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," arXiv preprint arXiv:1910.05453, 2019.
[28] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
[29] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," arXiv preprint arXiv:1904.01038, 2019.
[30] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[31] A. van den Oord, O. Vinyals et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[33] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
[34] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, "Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246.
[35] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, "MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos," arXiv preprint arXiv:1606.06259, 2016.
[36] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[38] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency, "Context-dependent sentiment analysis in user-generated videos," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 873–883.
[39] Z. Sun, P. Sarma, W. Sethares, and Y. Liang, "Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis," arXiv preprint arXiv:1911.05544, 2019.