Disentanglement for audio-visual emotion recognition using multitask setup
Raghuveer Peri †∗, Srinivas Parthasarathy ⋆, Charles Bradshaw ⋆, Shiva Sundaram ⋆
† University of Southern California, Los Angeles, CA, USA
⋆ Amazon Inc., Sunnyvale, CA, USA
∗ Work was completed while interning at Amazon
ABSTRACT
Deep learning models trained on audio-visual data have been successfully used to achieve state-of-the-art performance for emotion recognition. In particular, models trained with multitask learning have shown additional performance improvements. However, such multitask models entangle information between the tasks, encoding the mutual dependencies present in the label distributions of the real-world data used for training. This work explores the disentanglement of multimodal signal representations for the primary task of emotion recognition and a secondary person identification task. In particular, we developed a multitask framework to extract low-dimensional embeddings that aim to capture emotion-specific information, while containing minimal information related to person identity. We evaluate three different techniques for disentanglement and report results of up to 13% disentanglement while maintaining emotion recognition performance.
Index Terms— Emotion recognition, multimodal learning, disentanglement, multitask learning
1. INTRODUCTION
Emotions play an important role in human communication. Humans externalize their reactions to surrounding stimuli through changes in the tone of their voice, facial expressions, and hand and body gestures. Therefore, automatic emotion recognition is of interest for building natural interfaces and effective human-machine interaction [1]. With regards to human communication, emotion is primarily manifested through speech and facial expressions, each providing complementary information [2]. Therefore, multimodal techniques have been widely used for reliable emotion prediction [3, 4, 5].

Several studies have shown that emotion recognition benefits from training with secondary related tasks through multitask learning (MTL). In Parthasarathy and Busso [6], predicting the continuous affective attributes of valence, arousal and dominance is treated as the multiple tasks, trained jointly. In Li et al. [7] and Kim et al. [8], gender prediction as a secondary task improves emotion recognition performance by up to 7.7%, as measured by weighted accuracy on a standard corpus. A more comprehensive study involving domain, gender and corpus differences was performed by Zhang et al. [9], where cross-corpus evaluations showed that, in general, information sharing across tasks yields improvements in emotion recognition performance across corpora. These studies indicate that several paralinguistic tasks help generalize shared representations that improve the overall performance of the primary task. This motivates us to use person identification as a secondary task to help improve performance on the primary emotion task.
With MTL, the shared representations among tasks retain information pertaining to all the tasks. While this generalizes the overall architecture, it does so by entangling information between multiple tasks [10, 11, 12]. Since most machine learning models are trained on human-annotated, unconstrained real-world data, several factors that should theoretically be independent end up being dependent. For example, in the case of emotions, studies have shown correlations with demographic information [13]. Therefore, MTL inherently captures the joint dependencies between different factors in the data. This is problematic, as the gains through generalization across tasks may lead to bias and subsequently poor performance on unseen data.

To address the entanglement of information in MTL, this paper develops a multimodal emotion recognition model, improves its performance using person identification as a secondary task, and subsequently disentangles the learned person identity information while maintaining the improved emotion recognition performance. As an additional contribution, we analyze how much emotion information is present in the identity representations when models are trained in an MTL setup. For disentanglement, we experiment with three distinct techniques to minimize the information transfer between speaker embeddings and emotion labels and vice versa. We present experiments that make use of an alternate adversarial training strategy, a gradient reversal based technique adapted from the Domain Adversarial Training (DAT) literature, and a confusion loss based technique inspired by [14]. We evaluate the models before and after disentanglement, showing that disentanglement retains or improves performance on the primary tasks by up to 2% absolute, while reducing the leakage of information between the tasks by up to 13% as measured by F-score.
2. RELATED WORK
In the context of representation learning for emotion recognition, the goal is to extract low-dimensional embeddings that are invariant to factors such as domain and speaker. Abdelwahab and Busso [15] used gradient reversal (GR) to extract emotion representations that are invariant to domain. Mao et al. [16] imposed an explicit orthogonality criterion to encourage the learning of domain-invariant and emotion-discriminative features. Similarly, to extract speaker-invariant emotion representations, an adversarial learning approach was explored in addition to an online data augmentation technique by Tu et al. [17]; they showed improvements in emotion recognition performance when testing on speakers unseen during training. More recently, Li et al. [18] proposed an entropy-based loss function along with GR and showed improved performance compared to [17]. Kang et al. [19] proposed channel- and emotion-invariant speaker embeddings. However, most of these works consider emotion recognition using the speech modality alone. Jaiswal and Provost [20] explored privacy-preserving multimodal emotion representations, where audio and text modalities were utilized. Our study differs from previous studies by using a secondary task to improve primary emotion recognition performance while being invariant to the auxiliary factors.

With regards to identity embeddings, Williams and King [12] have shown that speaker embeddings capture a significant amount of affect information. It has been found that differences in the affective state of a person between training and testing conditions can degrade performance on the task of identity verification from speech [21, 22]. Techniques have been proposed to compensate for this by transforming features from the expressive speech domain to the neutral speech domain [23, 24]. While most existing works learn identity representations separately and then try to make them invariant to emotional states, we co-learn identity representations with an emotion recognition task while simultaneously removing emotion information from them.
3. METHODOLOGY
Fig. 1. Block diagram depicting the baseline multimodal, multitask training. Audio and video are split into overlapping segments (length 3.6 s, stride 0.72 s) and pre-processed; per-modality convolution and temporal pooling blocks (2D convolution for audio, 3D for video, with 1D-convolution temporal aggregation) summarize each clip; the pooled audio and video features are concatenated and passed through fully connected layers to produce a speaker embedding and an emotion embedding, each with its own output layer predicting the speaker identity and emotion labels.
Fig. 2. Block diagram depicting the baseline model with auxiliary disentanglement tasks. On top of the shared input and embedding-extraction layers, the emotion embedding feeds a primary emotion recognition output and an auxiliary speaker identification output, while the speaker embedding feeds a primary speaker identification output and an auxiliary emotion recognition output.

3.1. Multitask architecture

Fig. 1 illustrates the multitask architecture for emotion recognition and person identification. The inputs to the model are time-synchronized audio and video frames. The first step is a shared convolutional feature extraction stage, where a data-driven representation is extracted for audio and video independently; the architectures of these first-stage blocks are adopted from [25]. A second-level temporal aggregation block pools the feature representations for audio and video separately over entire clips into fixed-dimensional representations. The outputs of the audio and video pooling blocks are concatenated, feeding two independent embedding layers: the emotion embedding and the speaker embedding. The final task-specific output layers are fully connected layers with a softmax activation function that predict the emotion and person identity labels respectively. Please note that we use the terms speaker identity and person identity interchangeably throughout the paper.

Fig. 2 illustrates the addition of auxiliary branches to the baseline multitask architecture. The auxiliary branches are used to assess the amount of emotion information in the speaker embeddings and vice versa. These auxiliary branches are also used for disentanglement, as explained in Section 3.2.
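To make the data flow concrete, the following PyTorch sketch mirrors the layout of Fig. 1 under stated assumptions: the layer shapes, the mean-pooling stand-in for the 1D-convolution temporal aggregation, and the embedding and class dimensions are illustrative placeholders, not the paper's exact architecture (whose convolutional blocks are adopted from [25]).

```python
# Illustrative skeleton of the Fig. 1 multitask architecture.
# All sizes and module choices are assumptions for exposition only.
import torch
import torch.nn as nn

class MultitaskAVModel(nn.Module):
    def __init__(self, emo_dim=128, spk_dim=256, num_emotions=5, num_speakers=1000):
        super().__init__()
        # Stage 1: per-modality convolutional feature extractors, applied per segment.
        self.audio_conv = nn.Sequential(          # 2D conv over (time, mel)
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.video_conv = nn.Sequential(          # 3D conv over (frames, H, W)
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        # Stage 2 outputs are concatenated and feed two separate embedding heads.
        self.emo_head = nn.Linear(64, emo_dim)
        self.spk_head = nn.Linear(64, spk_dim)
        self.emo_out = nn.Linear(emo_dim, num_emotions)
        self.spk_out = nn.Linear(spk_dim, num_speakers)

    def forward(self, audio, video):
        # audio: (B, N, T, D_a) log-Mel segments; video: (B, N, 3, F, H, W) face crops
        B, N = audio.shape[:2]
        a = self.audio_conv(audio.flatten(0, 1).unsqueeze(1)).view(B, N, -1)
        v = self.video_conv(video.flatten(0, 1)).view(B, N, -1)
        # Mean over the N segments stands in for the 1D-conv temporal aggregation.
        fused = torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
        emo_emb, spk_emb = self.emo_head(fused), self.spk_head(fused)
        return self.emo_out(emo_emb), self.spk_out(spk_emb), emo_emb, spk_emb
```

A call such as `emo_logits, spk_logits, emo_emb, spk_emb = model(audio, video)` then exposes both embeddings, which is what the auxiliary branches of Fig. 2 attach to.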
The input audio and face-crop streams from a video clip are first fed into corresponding pre-processing blocks. On the audio stream, pre-processing includes extracting log Mel-frequency spectrogram features on overlapping segments of fixed length and stride. This results in one feature vector per segment, with a varying number of segments per video clip depending on the length of the clip. In order to perform efficient batch processing, we pad the features with a constant value to ensure that each video clip contains the same number of segments, N. The resulting features have dimensions B × N × D_a, where B is the minibatch size and D_a is the dimension of the Mel spectrogram features. On the face crops, pre-processing includes resizing them to a fixed size of D_v × D_v pixels and rescaling the values to lie between −1 and 1. The resulting face crops have dimensions B × N × D_v × D_v.
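As a rough illustration of this pipeline, the sketch below extracts log-Mel features over overlapping 3.6 s segments with a 0.72 s stride (the values given in Fig. 1) and pads every clip to N segments. The choice of librosa, the padding constant, and the averaging of each segment's spectrogram into a single D_a-dimensional vector are assumptions for exposition, since the paper does not specify them.

```python
# Hypothetical reconstruction of the audio pre-processing described above.
import numpy as np
import librosa

SEG_LEN_S, STRIDE_S, N_MELS, PAD_VALUE = 3.6, 0.72, 40, 0.0

def clip_to_padded_features(wav, sr, n_segments):
    """Return an (N, D_a) array of per-segment log-Mel vectors,
    padded/truncated to exactly n_segments for uniform batching."""
    seg, hop = int(SEG_LEN_S * sr), int(STRIDE_S * sr)
    feats = []
    for start in range(0, max(len(wav) - seg, 0) + 1, hop):
        window = wav[start:start + seg]
        mel = librosa.feature.melspectrogram(y=window, sr=sr, n_mels=N_MELS)
        logmel = np.log(mel + 1e-6)
        feats.append(logmel.mean(axis=1))  # one D_a-dim vector per segment (assumed)
    feats = np.stack(feats) if feats else np.zeros((0, N_MELS), dtype=np.float32)
    # Pad with a constant so every clip contributes the same number of segments.
    out = np.full((n_segments, N_MELS), PAD_VALUE, dtype=np.float32)
    n = min(len(feats), n_segments)
    out[:n] = feats[:n]
    return out  # stacking across a minibatch gives the B x N x D_a tensor
```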
3.2. Disentanglement

The multitask outputs are built on top of the common embedding layers for the emotion and person identification tasks respectively. As a result, training the model tends to produce entangled embeddings that are optimized for both tasks. This form of entanglement can lead to learning needless dependencies present in the train set, which may hurt overall generalization. In this work, for both person identification and emotion recognition, the second output or auxiliary task is used to disentangle the emotion information from the speaker embeddings and vice versa (Fig. 2). Disentanglement is achieved using the auxiliary branch. The basic intuition is similar to domain adversarial training: the goal is to learn representations that are optimized for the primary task while simultaneously performing poorly on the auxiliary task. To this end, we experiment with three techniques for disentanglement: (1) gradient reversal (GR), (2) alternate primary-auxiliary training (ALT) and (3) confusion loss (CONF).

Gradient reversal was originally developed by Ganin and Lempitsky [26] to make a digit recognition task invariant to domain through adversarial training. As discussed in Section 2, it was adapted to extract speaker-invariant speech emotion representations by Tu et al. [17]. Gradient reversal is implemented by introducing a layer at the point in the network where the auxiliary branch separates from the primary branch. This layer has no effect in the forward pass of training, while in the backward pass the gradients from the auxiliary branch are multiplied by a negative value before being backpropagated to the embedding layer.

Alternate training for disentanglement is inspired by the adversarial training literature [27], where two models are trained with competing objectives. In our setup, for the emotion embeddings, the primary task is to predict the emotion labels, while the auxiliary task is to predict the person identity labels. Equations 1 and 2 show the loss functions of the primary and auxiliary branches respectively, both modeled as cross-entropy losses. $\hat{e}_{prim}$ and $\hat{s}_{prim}$ denote the primary predictions from the emotion and speaker identification branches respectively. Similarly, $\hat{e}_{aux}$ and $\hat{s}_{aux}$ denote the auxiliary predictions from the speaker identification and emotion recognition branches respectively. $e_{target}$ and $s_{target}$ denote the ground-truth emotion and speaker identity labels.

$$\mathcal{L}_{primary} = w^{em}_{prim} \cdot \mathcal{L}(\hat{e}_{prim}, e_{target}) + w^{spk}_{prim} \cdot \mathcal{L}(\hat{s}_{prim}, s_{target}) \qquad (1)$$

$$\mathcal{L}_{auxiliary} = w^{spk}_{aux} \cdot \mathcal{L}(\hat{e}_{aux}, e_{target}) + w^{em}_{aux} \cdot \mathcal{L}(\hat{s}_{aux}, s_{target}) \qquad (2)$$

Alternate training proceeds in a minimax fashion: the auxiliary branch is trained to minimize $\mathcal{L}_{auxiliary}$, while the primary branch is trained to minimize $\mathcal{L}_{primary}$ and simultaneously maximize $\mathcal{L}_{auxiliary}$.

Confusion loss for disentanglement was introduced by Tzeng et al. [28] and adapted for disentangling person identity and spoken content representations by Nagrani et al. [25]. We apply a similar strategy to disentangle the emotion and person identity representations. At a high level, the loss forces the embeddings to be such that, for the auxiliary task, each class is predicted with the same probability. Similar to [25], we implement the confusion loss as the cross-entropy between the predictions and a uniform distribution.
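All three mechanisms reduce to a few lines of PyTorch. The sketch below shows the generic form of each: a gradient-reversal layer in the style of [26], a confusion loss implemented as cross-entropy against a uniform target as in [25, 28], and the two alternate-training losses of Eqs. (1) and (2). The reversal strength `lam` and the loss weights are placeholders; the paper's tuned values are not reproduced here.

```python
# Generic sketches of the three disentanglement mechanisms; hyperparameters
# (lam, loss weights) are illustrative placeholders.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam in the
    backward pass, so the shared embedding is trained to hurt the auxiliary
    classifier while the classifier itself still learns normally."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

def confusion_loss(aux_logits):
    """Cross-entropy between auxiliary predictions and a uniform distribution;
    minimized when every auxiliary class is predicted with equal probability."""
    return -F.log_softmax(aux_logits, dim=-1).mean()

def alternate_losses(e_prim, s_prim, e_aux, s_aux, e_tgt, s_tgt,
                     w_em_prim=1.0, w_spk_prim=1.0,
                     w_spk_aux=1.0, w_em_aux=1.0):
    """Cross-entropy losses of Eqs. (1) and (2); weights are placeholders."""
    l_primary = (w_em_prim * F.cross_entropy(e_prim, e_tgt)
                 + w_spk_prim * F.cross_entropy(s_prim, s_tgt))    # Eq. (1)
    l_auxiliary = (w_spk_aux * F.cross_entropy(e_aux, e_tgt)
                   + w_em_aux * F.cross_entropy(s_aux, s_tgt))     # Eq. (2)
    return l_primary, l_auxiliary
```

In alternate training, one step minimizes `l_auxiliary` with respect to the auxiliary heads only, and the next minimizes `l_primary − l_auxiliary` with respect to the main network; gradient reversal achieves a similar effect in a single backward pass by routing the auxiliary gradients through `grad_reverse`.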
4. EXPERIMENTAL FRAMEWORK

4.1. Dataset
For the primary task and disentanglement experiments on multimodal emotion recognition, we use the EmoVox dataset [29]. The EmoVox dataset comprises emotional labels on the VoxCeleb dataset, obtained as predictions of a strong teacher network over eight emotional states: neutral, happiness, surprise, sadness, anger, disgust, fear and contempt. Note that the teacher model was trained only using facial features (visual only). Overall, the dataset consists of interview videos of celebrities spanning a wide range of ages and nationalities. For each video clip, we find the most dominant emotion in the predicted distribution and use it as our ground-truth label, similar to [29]. The label distribution is heavily skewed towards a few emotion classes, because emotions such as disgust, fear, contempt and surprise are rarely exhibited in interviews. Following previous approaches that deal with such imbalanced datasets [30], we combine these labels into a single class 'other', resulting in five emotion classes. Further, to reduce the imbalance in the number of speech segments per speaker, we discard videos corresponding to speakers in the bottom percentile with respect to the number of segments. We create three splits from the database: EmoVox-Train to train models, EmoVox-Validation for hyperparameter tuning, and EmoVox-Test to evaluate models on held-out speech segments from speakers present in the train set. The EmoVox-Train subset corresponds to the Train partition in [29], whereas EmoVox-Validation and EmoVox-Test were created from the Heard-Val partition in [29].
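A minimal sketch of this label construction, under the assumption that the teacher emits one distribution over the eight states per clip (the class names follow the paper; the code itself is an illustrative reconstruction):

```python
# Dominant-emotion labeling with rare classes folded into 'other'.
import numpy as np

EMOTIONS = ["neutral", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear", "contempt"]
RARE = {"disgust", "fear", "contempt", "surprise"}  # rarely seen in interviews

def clip_label(teacher_probs: np.ndarray) -> str:
    """teacher_probs: (8,) teacher distribution over EMOTIONS for one clip."""
    dominant = EMOTIONS[int(np.argmax(teacher_probs))]
    return "other" if dominant in RARE else dominant
```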
4.2. Implementation details

The model architecture for the shared 2D convolutional layers and the fully connected layers was adapted from [25] and modified to suit the dimensions of our inputs and outputs. We use uniform-duration videos of 12 seconds each as input to our system. For the audio features we use D_a = 40, and for the visual features we use D_v = 224. We fix the emotion embedding dimension, while varying the speaker embedding dimension over 2048, 256 and 64. We use the Adam optimizer, with separate initial learning rates for the primary-branch and auxiliary-branch updates, decaying exponentially with a factor γ. For alternate training (Eqs. 1 and 2), we chose equal weights for w^em_prim and w^spk_prim, and equal weights for w^em_aux and w^spk_aux. All parameters were chosen based on preliminary experiments on a subset of EmoVox-Train. Emotion recognition performance was evaluated using the unweighted F-score averaged across the emotion classes, and person identification using identification accuracy. Disentanglement is measured by combining both the F-score of emotion recognition using speaker embeddings and the accuracy of person identification using emotion embeddings. Optimal models were chosen to give the best disentanglement (lowest score) on the EmoVox-Validation set. All results are presented on the EmoVox-Test set.
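The two-optimizer layout might look as follows; the concrete learning rates and decay factor γ are placeholders, since the tuned values are not reproduced in this text, and the module shapes are stand-ins for the real branches.

```python
# Separate Adam optimizers with exponential LR decay for the primary
# branch and the auxiliary heads (Fig. 2). All values are placeholders.
import torch
import torch.nn as nn

primary_net = nn.Linear(64, 5)                      # stand-in primary branch
aux_heads = nn.ModuleList([nn.Linear(256, 5),       # emotion from speaker emb.
                           nn.Linear(128, 1000)])   # speaker from emotion emb.

opt_primary = torch.optim.Adam(primary_net.parameters(), lr=1e-4)  # placeholder LR
opt_aux = torch.optim.Adam(aux_heads.parameters(), lr=1e-3)        # placeholder LR
sched_primary = torch.optim.lr_scheduler.ExponentialLR(opt_primary, gamma=0.95)
sched_aux = torch.optim.lr_scheduler.ExponentialLR(opt_aux, gamma=0.95)
```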
5. RESULTS

5.1. Baseline models without disentanglement

Emotion Recognition:
Figure 3(a) illustrates the primary emotion recognition results. The blue bars show the performance of all models trained using MTL, and the dashed line shows the performance of the single-task learning (STL) setup, where the models are not trained on person identification. It is evident that MTL gives substantial gains in performance compared to the STL setup. It is also observed that emotion recognition performance improves as the person identification embedding dimension is reduced, which may indicate better regularization with fewer embedding dimensions.
Person identification:
Table 1 shows the person identification accuracy for models trained with varying speaker embedding dimensions. It is worth noting that, despite the reduction in speaker embedding dimension, the models retain performance, pointing to the fact that learning identity representations when both audio and visual modalities are available does not require many degrees of freedom.
Identity information in emotion embeddings:
Our preliminary experiments showed that the amount of person identity information entangled in the emotion embeddings was minimal. Evaluating the person identification task using emotion embeddings produced an accuracy of 0.1%, which was close to random-chance performance. Therefore, we focus on disentangling emotion information from the identity embeddings.
Emotion information in identity embeddings:
To baseline the amount of emotion information entangled in the speaker embeddings, we separately train single-hidden-layer neural network classifiers that predict the emotion labels from the speaker embeddings. Figure 3(b) illustrates the performance. First, it is worth noting that speaker embeddings from models trained on the single task of person identification retain a substantial amount of emotion information, as shown by the red dashed line, compared to a random-chance F-score of 17.4% if all samples were predicted as the 'neutral' class (shown by the green dashed line). Further, the blue bars illustrate the performance in the MTL setup, where the F-scores are well above random chance as there is more information entanglement. This motivates the need for disentanglement, to minimize the emotion information present in the speaker embeddings without compromising performance on the emotion recognition and person identification tasks.
Fig. 3. Unweighted Average F-scores for AER on EmoVox-Test by varying speaker embedding dimension, using (a) emotion embeddings (higher is better; random chance: 17.4%, single-task training: 33.87%) and (b) speaker embeddings (lower is better; random chance: 17.4%, single-task training: 38.48%). Bars compare the multitask baseline against ALT, GR and CONF. A Stuart-Maxwell marginal homogeneity test found, with statistical significance (p < 0.05), that all models with disentanglement were different from the baseline model.
Table 1. Person identification accuracy (%) comparing models with varying speaker embedding dimensions, without and with disentanglement, on EmoVox-Test.

Emb Dim   Baseline   ALT     GR      CONF
2048      90.98      92.40   93.19   93.12
256       94.75      95.04   95.86   95.42
64        90.62      92.83   91.17   90.75
5.2. Models with disentanglement

Next, we report the results of the proposed disentanglement techniques and compare them to the baseline models. We trained each disentanglement technique for all three configurations of the speaker embedding dimension (2048, 256 and 64) to investigate their effect on disentanglement performance.

Emotion Recognition:
From Fig. 3(a), we observe that models trained with all three disentanglement strategies outperform the baseline models trained without disentanglement in all but one case. In particular, the ALT and CONF methods provide consistent gains across the various embedding dimensions. We performed a Stuart-Maxwell marginal homogeneity test comparing the results and found, with statistical significance, that all models with disentanglement were different from the baseline models (null hypothesis: the predictions of the compared models are the same; rejected when p < α with α = 0.05). We also observe that, as with the baseline models, models trained with disentanglement tend to perform better for reduced speaker embedding dimensions, though with smaller gains.

Person identification:
Table 1 shows the person identification accuracy for the models with disentanglement compared to the baseline without disentanglement. We observe that, in general, all models perform better after disentanglement than the baseline without disentanglement. There is no clear evidence of one technique performing better than another, though GR and ALT seem to perform marginally better than CONF.

Emotion information in identity embeddings:
Fig. 3(b) illustrates the amount of emotion information in the person identity embeddings after explicit disentanglement. The drop in unweighted average F-score for emotion recognition measures the obtained disentanglement. Compared to the models trained without disentanglement, the models trained with explicit disentanglement show a reduction in the F-score of predicting emotions from speaker embeddings. This is noticeable for all three disentanglement techniques; ALT and CONF training show better disentanglement than GR. Overall, these results show the efficacy of using a separate auxiliary branch to disentangle the emotion information from the speaker embeddings. Furthermore, the models trained using the smallest speaker embedding dimension of 64 show the least amount of emotion information. This is expected, because a reduced person identity embedding dimension creates a bottleneck for capturing the primary identity information and thus retains less entangled emotion information. Considering the person identity dimension of 64, we see absolute gains of 2% for emotion recognition, while ALT training gives 13.5% disentanglement.
6. CONCLUSIONS
This study analyzed disentanglement techniques for emotion recognition in a multitask learning setup where person identification is the secondary task. We showed with an audio-visual architecture that person identification helps emotion recognition performance. This comes at a cost, as there is significant information transfer between the tasks, which lets us predict emotion categories from speaker embeddings well above chance. To combat this, we studied three disentanglement techniques, each reducing the amount of entangled information while maintaining or improving performance on the primary task. For our next steps, we will explore and validate these methods on other databases that have stronger emotion labels. Furthermore, it is of interest to dig deeper into the reasons for the performance differences across the various disentanglement methods. Finally, this paper shows that there is significant emotional information in the speaker embeddings while the converse is not necessarily true; we will therefore explore a hierarchical structure in which emotion recognition sits further downstream than the person identification task.

7. REFERENCES

[1] M. Pantic and L. J. M. Rothkrantz, "Toward an affect-sensitive multimodal human-computer interaction," Proceedings of the IEEE, vol. 91, no. 9, pp. 1370–1390, 2003.
[2] A. Mehrabian, "Communication without words," Communication Theory, vol. 6, pp. 193–200, 2008.
[3] Y. Kim, H. Lee, and E. M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in ICASSP, Vancouver, BC, Canada, 2013, IEEE, pp. 3687–3691.
[4] Y. Wang and L. Guan, "Recognizing human emotional state from audiovisual signals," IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 936–946, 2008.
[5] M. Song, J. Bu, C. Chen, and N. Li, "Audio-visual based emotion recognition: a new approach," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Washington, DC, USA, 2004, IEEE, vol. 2, pp. II–II.
[6] S. Parthasarathy and C. Busso, "Jointly predicting arousal, valence and dominance with multi-task learning," in Interspeech, Stockholm, Sweden, 2017, pp. 1103–1107.
[7] Y. Li, T. Zhao, and T. Kawahara, "Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning," in Interspeech, Graz, Austria, 2019, pp. 2803–2807.
[8] J. Kim, G. Englebienne, K. P. Truong, and V. Evers, "Towards speech emotion recognition 'in the wild' using aggregated corpora and deep multi-task learning," arXiv preprint arXiv:1708.03920, 2017.
[9] B. Zhang, E. M. Provost, and G. Essl, "Cross-corpus acoustic emotion recognition with multi-task learning: Seeking common ground while preserving differences," IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 85–99, 2017.
[10] J. Liang, Z. Liu, J. Zhou, X. Jiang, C. Zhang, and F. Wang, "Model-protected multi-task learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[11] L. Xiao, H. Zhang, W. Chen, Y. Wang, and Y. Jin, "Learning what to share: Leaky multi-task network for text classification," in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 2018, pp. 2055–2065.
[12] J. Williams and S. King, "Disentangling style factors from speaker representations," in Interspeech, Graz, Austria, 2019, pp. 3945–3949.
[13] T. M. Chaplin, "Gender and emotion expression: A developmental contextual perspective," Emotion Review, vol. 7, no. 1, pp. 14–21, 2015.
[14] M. Alvi, A. Zisserman, and C. Nellåker, "Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings," in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018, pp. 556–572.
[15] M. Abdelwahab and C. Busso, "Domain adversarial for acoustic emotion recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, pp. 2423–2435, 2018.
[16] Q. Mao, G. Xu, W. Xue, J. Gou, and Y. Zhan, "Learning emotion-discriminative and domain-invariant features for domain adaptation in speech emotion recognition," Speech Communication, vol. 93, pp. 1–10, 2017.
[17] M. Tu, Y. Tang, J. Huang, X. He, and B. Zhou, "Towards adversarial learning of speaker-invariant representation for speech emotion recognition," arXiv preprint arXiv:1903.09606, 2019.
[18] H. Li, M. Tu, J. Huang, S. Narayanan, and P. Georgiou, "Speaker-invariant affective representation learning via adversarial training," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, IEEE, pp. 7144–7148.
[19] W. H. Kang, S. H. Mun, M. H. Han, and N. S. Kim, "Disentangled speaker and nuisance attribute embedding for robust speaker verification," IEEE Access, vol. 8, pp. 141838–141849, 2020.
[20] M. Jaiswal and E. M. Provost, "Privacy enhanced multimodal neural representations for emotion recognition," in AAAI, 2020, pp. 7985–7993.
[21] S. Parthasarathy, C. Zhang, J. H. L. Hansen, and C. Busso, "A study of speaker verification performance with expressive speech," in ICASSP, New Orleans, LA, USA, 2017, IEEE, pp. 5540–5544.
[22] W. Wu, T. F. Zheng, M. X. Xu, and H. J. Bao, "Study on speaker verification on emotional speech," in Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 2006, pp. 2102–2105.
[23] H. Bao, M. X. Xu, and T. F. Zheng, "Emotion attribute projection for speaker recognition on emotional speech," in Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 2007, pp. 758–761.
[24] S. R. Krothapalli, J. Yadav, S. Sarkar, G. S. Koolagudi, and A. K. Vuppala, "Neural network based feature transformation for emotion independent speaker identification," International Journal of Speech Technology, vol. 15, no. 3, pp. 335–349, 2012.
[25] A. Nagrani, J. S. Chung, S. Albanie, and A. Zisserman, "Disentangled speech embeddings using cross-modal self-supervision," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, IEEE, pp. 6829–6833.
[26] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in International Conference on Machine Learning, Lille, France, 2015, PMLR, pp. 1180–1189.
[27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680.
[28] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, "Simultaneous deep transfer across domains and tasks," in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 4068–4076.
[29] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, "Emotion recognition in speech using cross-modal transfer in the wild," in Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 2018, pp. 292–301.
[30] C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, "MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception," IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 67–80, 2017.