Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Jingjun Liang
School of Information, Renmin University of China
[email protected]
Ruichen Li
School of Information, Renmin University of China
[email protected]
Qin Jin∗
School of Information, Renmin University of China
[email protected]
Abstract
Automatic emotion recognition is an active research topic with a wide range of applications. Due to the high manual annotation cost and inevitable label ambiguity, the development of emotion recognition datasets is limited in both scale and quality. Therefore, one of the key challenges is how to build effective models with limited data resources. Previous works have explored different approaches to tackle this challenge, including data enhancement, transfer learning, and semi-supervised learning. However, these existing approaches suffer from weaknesses such as training instability, large performance loss during transfer, or marginal improvement. In this work, we propose a novel semi-supervised multi-modal emotion recognition model based on cross-modality distribution matching, which leverages abundant unlabeled data to enhance model training under the assumption that the inner emotional status is consistent at the utterance level across modalities. We conduct extensive experiments to evaluate the proposed model on two benchmark datasets, IEMOCAP and MELD. The experiment results prove that the proposed semi-supervised learning model can effectively utilize unlabeled data and combine multiple modalities to boost the emotion recognition performance, outperforming other state-of-the-art approaches under the same condition. The proposed model also achieves competitive capacity compared with existing approaches that take advantage of additional auxiliary information such as speaker and interaction context.
CCS Concepts
• Computing methodologies → Semi-supervised learning settings; Semantic networks; • Human-centered computing → HCI design and evaluation methods.
Keywords
Multimodal Emotion Recognition, Cross-Modality Distribution Matching, Semi-supervised Learning
ACM Reference Format:
Jingjun Liang, Ruichen Li, and Qin Jin. 2020. Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching. In Proceedings of the 28th ACM International Conference on Multimedia (MM '20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3394171.3413579
∗ Corresponding Author
Figure 1: The latent representations of the acoustic, visual and lexical modalities of the same video are expected to be close when they are mapped into a common emotion space.
Emotion is an important part of daily interpersonal human interactions. Automatic recognition or detection of human emotion has attracted much research interest in the fields of computer vision, speech processing, and multimedia computing. Emotion recognition technology has a wide range of applications, including assisting mental health analysis [14], improving natural human machine interaction [15], enabling emotional robot design and intelligent education tutoring [31, 54], etc.

Emotion recognition can be generally categorized into two types of tasks, namely discrete (categorical) emotion recognition and continuous (dimensional) emotion recognition. Discrete emotion recognition normally divides the emotion space into several basic emotion classes such as happiness, sadness, anger and neutral [12], while continuous emotion recognition treats the emotional state as a distribution in a continuous space, normally described by two or three dimensions such as arousal, valence and dominance [37]. Although continuous emotion representation can model more flexible and complicated emotional states, it is not as easy to understand as discrete emotion representation, which is reflected in the quite high variance of continuous emotion annotations from different human annotators [8, 41]. We therefore focus on discrete emotion modeling in this work.

We humans convey emotions in various ways, including both spoken words and nonverbal behaviors such as facial expression and body language [16]. Such rich information from multiple modalities can be used to understand the emotional state [29]. Previous research works have shown that different modalities are complementary for emotion recognition [23, 36]. All modalities carry emotion-relevant information, and how to effectively combine multiple modalities has been an active research focus.

Besides multi-modality, another challenge for emotion recognition is the limitation of supervised data. Although we can easily collect large amounts of emotional data from online social media, emotion annotation requires heavy manual effort and usually involves inevitable label ambiguity. Therefore, the shortage of high quality supervised data has been a big obstacle to developing generalized and robust emotion models. There have been some endeavors to tackle the data shortage challenge. For example, Albanie et al. [1] apply transfer learning to obtain supervision from another labeled modality. However, the improvement is very marginal. Data augmentation and semi-supervised learning through generative adversarial networks [8, 18, 39] have also been explored. However, such models are hard to optimize due to the unstable training procedure and non-intuitive synthetic samples.

Inspired by research on the cross-modal retrieval task [6, 48, 55], in this paper we propose a novel semi-supervised training strategy for discrete multi-modal emotion recognition. We assume that different modalities are expected to express similar emotion information at the coarse-grained level (such as the utterance level) under a certain scenario, as shown in Figure 1. Under this assumption, we can regard this latent relationship as an auxiliary task to obtain guidance from unlabeled data to enhance the fully-supervised training procedure. Specifically, we use an auto-encoder structure [38] to extract utterance-level representations from different modalities and apply Maximum Mean Discrepancy (MMD) [43] to restrict their distribution difference.
We conduct extensive experiments to compare with other state-of-the-art techniques on two benchmark datasets, IEMOCAP [5] and MELD [35], and demonstrate the effectiveness of our proposed semi-supervised learning approach. We also carry out detailed analysis experiments to study the performance impact of each model component.

The remainder of this paper is organized as follows. Section 2 introduces related works. Section 3 describes the details of our emotion recognition system based on the proposed semi-supervised learning strategy, including the representation learning with DAE and the design of multiple loss functions. Section 4 then presents the extensive experiment results on two benchmark datasets. Finally, Section 6 presents our conclusions.

The quality of multi-modal features plays a decisive role in emotion recognition. Thus previous works have explored effective features in the acoustic, visual and lexical modalities for emotion recognition tasks. Brady et al. [4] derive high-level acoustic, visual and physiological features from low-level descriptors using sparse coding and deep learning. Seng et al. [44] use a mixture of rule-based and machine learning techniques upon prosodic and spectral features to determine the emotion state contained in the audio and visual signal. The granularity of these features varies from the frame level to the sentence level.

For modality aggregation, Viktor et al. [36] use early fusion to concatenate multi-modal features as the input for the inference models, but this ignores the mismatch between different modalities. Considering the inner relationship alignment, Yoon et al. [51] propose a deep dual recurrent encoder to combine text information and speech signals concurrently to gain a better understanding of emotion. Xu et al. [50] propose to learn the frame-level alignment between speech and text signals via an attention mechanism. Both works learn the similarity between these two modalities to compress the acoustic sequence and align the speech with the text. However, the speech sequence is much longer than the text sequence, so the accurate alignment is quite hard to learn. To avoid this problem, we use utterance-level acoustic features in this work.

For modeling the long-term dependency in emotion expression, Mao et al. [30] aggregate segment-level decisions to improve utterance-level classification. Li et al. [27] propose a novel representation learning component with a residual convolutional network, multi-head self-attention and a global context-aware attention LSTM. Following their suggestion, we utilize the self-attention mechanism to capture temporal information as well.

Besides, emotion recognition in conversational scenarios has become a popular sub-task recently. It emphasizes the extraction and modeling of interaction context in a human dialogue. Several approaches have been proposed to capture contextual and speaker cues to assist emotion recognition [20, 21, 28, 33, 53]. Although our proposed method is applied in a non-interactive scenario, we also compare its emotion recognition performance on the same dataset with these interactive models.
Du et al. [11] propose a semi-supervised multi-modal generative framework with a non-uniformly weighted Gaussian mixture posterior approximation for the shared latent variable. They use a conditional probabilistic distribution for the unknown labels in the semi-supervised classification algorithm. Salimans et al. [39] use generative adversarial networks to implement semi-supervised learning, and Chang et al. [8] apply a similar GAN-based semi-supervised framework to acoustic representation learning, which helps to improve emotion recognition. Besides generative models, Albanie et al. [2] explore transferring emotion labels from one modality to the other, assuming that supervised annotation exists in one modality.
Distribution matching [17, 24] has been proposed and developed for cross-modal retrieval recently. Several methods [6, 55] are proposed to map distributions from different domains into a shared space so that representations of similar distributions from different domains can be aligned. They use various similarity metrics such as Asymmetric Quantizer Distance (AQD) [25] and Maximum Mean Discrepancy (MMD) [43]. Wang et al. [48] use a distribution matching loss to incorporate labeled and unlabeled data into one framework simultaneously and apply it to semi-supervised cross-modal retrieval. Although all these previous works address the cross-modal search problem, we find that this concept can be seamlessly integrated with semi-supervised emotion recognition. We therefore conduct experiments and demonstrate the effectiveness of distribution matching across modalities.
We assume that a video database naturally consists of information from three modalities (acoustic, visual and lexical). Given a labeled video database {X_L, Y} = {(x^a_i, x^v_i, x^l_i, y_i)}_{i=1}^{n_L} and an unlabeled video database X̃_uL = {(x̃^a_i, x̃^v_i, x̃^l_i)}_{i=1}^{n_uL}, where x^a, x^v, x^l denote the feature representations from the acoustic, visual and lexical modalities respectively, L and uL distinguish labeled and unlabeled data, and n_L and n_uL denote the size of the labeled and unlabeled dataset respectively, our goal is to involve the unlabeled data in training to improve model performance.

Previous study [5] has shown that the emotional status is kept unchanged during an utterance and that the average duration of utterances in the dataset is 4.5 seconds. Rigoulot et al. [45] also studied the time course of human emotion expression and found that emotions can usually be classified correctly from 4 seconds of speech. Based on such findings, we make the following assumption: although the emotion expression is not necessarily aligned at the frame level across modalities, the overall emotional state should be similar at the coarse-grained utterance level. We can utilize this assumption to extract supervision from unlabeled data. During training, we improve the classification accuracy on labeled data, and simultaneously reduce the inter-modality distribution difference on both labeled and unlabeled data:

Objective = Classification(X_L, Y) + Reconstruction(X_L, X̃_uL) + Matching(X_L, X̃_uL)   (1)

Figure 2: The overall structure of the proposed semi-supervised multi-modal emotion recognition framework. Both labeled and unlabeled data participate in the model learning (solid line for labeled data, dash line for unlabeled data).

As shown in Fig. 2, both types of data (labeled and unlabeled) participate in model learning. The additional unlabeled data helps learn more robust and emotion-salient latent representations. We use Maximum Mean Discrepancy (MMD) [43] to measure distribution similarity, motivated by its previous success in transfer learning and feature representation learning. Formally, the training objective of our semi-supervised model (Eq. 1) consists of three components corresponding to emotion classification, data reconstruction and data distribution matching respectively, among which only the emotion classification component requires labeled data, while the other two components are unsupervised. We present the details of the model architecture (Section 3.1) and the loss function design (Section 3.2) in the following subsections. As the model structure is related to the feature characteristics of each modality, we first introduce the features and then present the network design.
Multi-modal Features
We first extract raw features from the acoustic, visual and lexical modalities respectively.
• Acoustic: We utilize the toolkit OpenSMILE [13] to extract utterance-level features with the INTERSPEECH 2010 configuration [42]. The extracted feature vector has 1,582 dimensions.
• Visual: We utilize the state-of-the-art Dense Convolutional Neural Network (DenseNet) [22] to extract facial features. The DenseNet is pretrained on the FERPlus [3] dataset for facial expression recognition. We extract the 342-dimensional activation from the last pooling layer for each face image as in [9].
• Lexical: We use the state-of-the-art word embeddings trained by Bidirectional Encoder Representations from Transformers (BERT) [10]. Each word is represented as a 1,024-dimensional vector.
We apply z-normalization on each dimension of the raw features to reduce data discrepancy.
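To make the preprocessing step concrete, the following minimal sketch z-normalizes each feature dimension using statistics estimated on the training split; the function names and the per-split organization are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np

def fit_znorm(train_feats, eps=1e-8):
    """Estimate per-dimension mean/std on the training split."""
    mean = train_feats.mean(axis=0)
    std = train_feats.std(axis=0) + eps
    return mean, std

def apply_znorm(feats, mean, std):
    """Apply z-normalization to any split using the training statistics."""
    return (feats - mean) / std

# Example: utterance-level acoustic features, shape (num_utterances, 1582)
acoustic_train = np.random.randn(100, 1582).astype(np.float32)  # placeholder data
mu, sigma = fit_znorm(acoustic_train)
acoustic_norm = apply_znorm(acoustic_train, mu, sigma)
```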
DAE for Representation Learning
Deep Auto-encoder (DAE) is proposed to learn high quality latent representations by encoding and reconstructing its input data. It can capture the data manifolds smoothly without losing too much original information [38]. Cross-modal distribution matching is applied to avoid the latent representation collapsing into zero space.
Acoustic:
As the extracted acoustic features are at the utterance level, we use stacked linear layers as the encoder structure. We first transform the input acoustic feature x^a to the latent representation z^a with a set of linear layers and then obtain the reconstructed output x̂^a with symmetric layers. The network structure is shown in Figure 3.

Figure 3: Acoustic DAE: symmetric stacked linear layers.

Please note that we do not apply frame-level distribution matching across modalities for two reasons. Firstly, as mentioned above, previous research has shown that emotion expression is not necessarily aligned at the frame level or word level across modalities [19, 32, 34, 46]; forcing the matching at the frame level would lead to inherently poor optimization of the model. Secondly, reconstruction at the frame level would result in a very large number of trainable parameters.
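Below is a minimal PyTorch sketch of such a symmetric stacked-linear auto-encoder. The hidden layer widths and latent size are illustrative placeholders (the paper's exact settings are listed in Table 3), and the class name is our own.

```python
import torch
import torch.nn as nn

class AcousticDAE(nn.Module):
    """Symmetric stacked-linear auto-encoder for the 1,582-d acoustic feature.

    Hidden sizes are illustrative; the paper's exact widths come from Table 3.
    """
    def __init__(self, in_dim=1582, hidden_dims=(512, 256), latent_dim=128):
        super().__init__()
        dims = [in_dim, *hidden_dims, latent_dim]
        enc, dec = [], []
        for i in range(len(dims) - 1):
            enc += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        for i in reversed(range(len(dims) - 1)):
            dec += [nn.Linear(dims[i + 1], dims[i]), nn.ReLU()]
        self.encoder = nn.Sequential(*enc[:-1])   # no activation on the latent code
        self.decoder = nn.Sequential(*dec[:-1])   # no activation on the reconstruction

    def forward(self, x):
        z = self.encoder(x)          # latent representation z^a
        x_hat = self.decoder(z)      # reconstruction of the input
        return z, x_hat

x_a = torch.randn(8, 1582)           # a batch of utterance-level acoustic features
z_a, x_a_hat = AcousticDAE()(x_a)
```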
Visual and Lexical:
As both the raw visual and lexical features are sequences of features, we adopt a seq2seq type of structure for the encoder and decoder. The Transformer [47] is one of the state-of-the-art seq2seq models; it is entirely based on the attention mechanism and does not need recurrence or convolution, capturing the relative dependencies between elements of the sequence. We therefore design a Transformer architecture for the DAE of the visual and lexical modalities. The detailed component structure of the visual and lexical DAE is shown in Figure 4.

Figure 4: Visual and Lexical DAE: modified Transformer structure. The part in the blue dotted box is the original component in the Transformer.

We make three modifications to the standard Transformer (a simplified sketch follows this list):
(1) Firstly, since the input and output of the visual/lexical DAE are the raw and reconstructed visual/lexical features, we drop the embedding layer of the input and set the number of hidden units to 1 in the last linear output layer.
(2) Secondly, following Srivastava et al. [45], we take the reversed input sequence as the reconstruction target instead of the original sequence. Reversing the reconstruction target makes the optimization easier because the model can get off the ground by looking at low range correlations.
(3) Lastly, the latent representations for different modalities are expected to have the same shape, but the encoder output in the visual/lexical DAE is still stacked in time order. So in the middle of the Transformer, we set up a set of convolutional layers for down-sampling and, symmetrically, a set of deconvolutional layers for reconstruction.
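The following simplified PyTorch sketch illustrates the encoder half of this design: a Transformer encoder over the input sequence, a temporal convolution for down-sampling, and a linear layer producing the fixed-size latent code. All dimensions, the input projection, and the single convolution layer are assumptions made to keep the sketch self-contained and runnable; the real model mirrors this structure with deconvolutions and a Transformer decoder that reconstructs the time-reversed input.

```python
import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    """Simplified encoder for a sequential (visual or lexical) modality:
    Transformer encoder -> temporal conv down-sampling -> flatten -> latent.
    """
    def __init__(self, feat_dim=342, d_model=256, n_heads=4, n_layers=2,
                 seq_len=18, latent_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)      # illustrative input projection
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.down = nn.Conv1d(d_model, 64, kernel_size=4, stride=2, padding=1)
        self.to_latent = nn.Linear(64 * (seq_len // 2), latent_dim)

    def forward(self, x):                              # x: (batch, seq_len, feat_dim)
        h = self.transformer(self.proj(x))             # (batch, seq_len, d_model)
        h = self.down(h.transpose(1, 2))               # (batch, 64, seq_len // 2)
        return self.to_latent(h.flatten(1))            # (batch, latent_dim)

x_v = torch.randn(8, 18, 342)                          # 18 face frames per utterance
z_v = SeqEncoder()(x_v)
# Reconstruction target for the matching decoder: the time-reversed input sequence.
target = torch.flip(x_v, dims=[1])
```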
The training objective of our semi-supervised model (Eq. 1) consists of three components corresponding to emotion classification, data reconstruction and data distribution matching respectively.
Reconstruction loss.
The loss function of reconstruction is the Mean Squared Error (MSE):

L_DAE(x) = (x − rec(x))²   (2)

where x is the input to the DAE and rec(x) is the reconstructed output from the corresponding DAE.
Unsupervised Distribution Matching Loss.
Given a number of video samples segmented by utterance, we assume that the latent representations of the acoustic, visual and lexical modalities from the same video can be mapped into a similar space, while the distributions of modalities from videos with different emotion status should be diverse. We employ Maximum Mean Discrepancy (MMD) [43] to measure the distribution similarity. The distribution matching loss is:

L_MMD(p, q) = 1/(m(m−1)) Σ_{i≠j} k(p_i, p_j) + 1/(n(n−1)) Σ_{i≠j} k(q_i, q_j) − 2/(mn) Σ_{i,j} k(p_i, q_j)   (3)

k(x, x′) = exp(−‖x − x′‖² / (2σ²))   (4)

where p, q are the latent representations from two different modalities. The latent representation of each modality is mapped to a Reproducing Kernel Hilbert Space (RKHS) before computing the distance. We use the Gaussian kernel (Eq. 4) to calculate the dot product in RKHS. This formula does not contain any annotation information, so it can be enforced on both labeled and unlabeled data to build an unsupervised training target.
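A sketch of this MMD term in PyTorch is given below. It uses a single Gaussian kernel with a fixed bandwidth σ, which is an assumption; the paper does not state the bandwidth or whether multiple kernels are combined.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)) for all pairs of rows."""
    dist = torch.cdist(x, y, p=2) ** 2
    return torch.exp(-dist / (2 * sigma ** 2))

def mmd_loss(p, q, sigma=1.0):
    """Unbiased estimate of MMD^2 between two sets of latent representations (Eq. 3)."""
    m, n = p.size(0), q.size(0)
    k_pp = gaussian_kernel(p, p, sigma)
    k_qq = gaussian_kernel(q, q, sigma)
    k_pq = gaussian_kernel(p, q, sigma)
    # Exclude the diagonal (i != j) for the within-set terms.
    term_pp = (k_pp.sum() - k_pp.diagonal().sum()) / (m * (m - 1))
    term_qq = (k_qq.sum() - k_qq.diagonal().sum()) / (n * (n - 1))
    term_pq = 2.0 * k_pq.mean()
    return term_pp + term_qq - term_pq

z_a, z_v = torch.randn(32, 128), torch.randn(32, 128)   # latent codes of two modalities
loss = mmd_loss(z_a, z_v)
```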
Supervised Emotion Classification Loss.
As we assume the emotion status is aligned across modalities at the utterance level, we apply multi-modal fusion by directly concatenating the latent representations and feeding them into the classifier. For the labeled data, we compute the cross-entropy loss (Eq. 5) for optimization:

L_cls = −(1/n_L) Σ_{i=1}^{n_L} Σ_{k=1}^{K} y_{i,k} log(p_{i,k}),   (p_{i,1}, p_{i,2}, ..., p_{i,K}) = softmax(C([z^a_i; z^v_i; z^l_i]))   (5)

where C is a neural network based emotion classifier, K is the number of emotion classes, n_L is the total number of supervised samples, y_i and p_i are the annotated and predicted emotion class probabilities for input data x_i, and [z^a_i; z^v_i; z^l_i] is the concatenation of the latent representations of x_i from all modalities.
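A minimal sketch of the fusion classifier over the concatenated latent codes is shown below; the two-layer MLP and its hidden size are assumptions, as the paper does not detail the classifier C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    """MLP over the concatenated acoustic/visual/lexical latent codes."""
    def __init__(self, latent_dim=128, n_classes=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, z_a, z_v, z_l):
        return self.net(torch.cat([z_a, z_v, z_l], dim=-1))   # class logits

clf = EmotionClassifier()
logits = clf(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
labels = torch.randint(0, 4, (8,))
cls_loss = F.cross_entropy(logits, labels)    # cross-entropy of Eq. 5
```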
Joint Loss Function.
We combine all the losses into the joint loss function. For the supervised part, the loss function is set as:

L_s = L_cls + α L_rec + β L_pair
L_rec = L_DAE(z^a) + L_DAE(z^v) + L_DAE(z^l)
L_pair = L_MMD(z^a, z^v) + L_MMD(z^a, z^l)   (6)

However, we find that the latent representation can still collapse into zero space. To avoid meaningless matching of modality representations, we add unpaired samples into the training. 'Unpaired' means that the features extracted from different modalities do not belong to the same video, i.e., they are not aligned. The gap between the latent distributions of paired and unpaired samples should be obviously large. We shuffle the features across the acoustic, visual and lexical modalities so that they are no longer aligned, and thus form unpaired samples. We expect the unpaired samples to be mapped into different emotion spaces, which means we enlarge their distribution distance during training. Based on this idea, the loss function is modified as:

L_s = L_cls + α L_rec + β (L_pair + L_unpair)
L_unpair = −(L_MMD(z^a, z^v) + L_MMD(z^a, z^l))   (7)

where the MMD terms in L_unpair are computed on the unpaired (shuffled) samples. We then add unsupervised data into training; the loss function for the unsupervised part is set as:

L_u = α L^u_rec + β (L^u_pair + L^u_unpair)   (8)

where α, β are hyper-parameters. Finally, we form the semi-supervised loss function by combining the supervised and unsupervised losses:

L_semi = L_s + ω L_u   (9)

where ω is the hyper-parameter balancing the two losses.
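The sketch below shows one way the supervised and unsupervised objectives could be assembled, reusing the mmd_loss helper sketched earlier. Forming unpaired samples by permuting one modality within the batch is our simplification of the cross-modal shuffling described above, and the helper names are assumptions.

```python
import torch
import torch.nn.functional as F

alpha, beta, omega = 0.2, 0.1, 0.3   # loss weights from the hyper-parameter setting below

def paired_mmd(z_a, z_v, z_l):
    return mmd_loss(z_a, z_v) + mmd_loss(z_a, z_l)                 # L_pair (Eq. 6)

def unpaired_mmd(z_a, z_v, z_l):
    perm = torch.randperm(z_a.size(0))                              # break cross-modal pairing
    return -(mmd_loss(z_a[perm], z_v) + mmd_loss(z_a[perm], z_l))   # L_unpair (Eq. 7)

def supervised_loss(logits, y, rec, pair, unpair):
    return F.cross_entropy(logits, y) + alpha * rec + beta * (pair + unpair)   # L_s

def unsupervised_loss(rec_u, pair_u, unpair_u):
    return alpha * rec_u + beta * (pair_u + unpair_u)               # L_u (Eq. 8)

# Eq. 9: total objective for one training step would be
# loss = supervised_loss(...) + omega * unsupervised_loss(...)
```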
In this section, we present a series of comparison experiments on the discrete emotion recognition task under fully-supervised and semi-supervised settings. We utilize both labeled and unlabeled data in our experiments.
IEMOCAP [5] contains 12 hours of video recordings of situational dialogues. The videos are divided into five sessions. Each session contains only two actors, so there are ten actors in total in the database. The recorded dialogues are manually segmented into 10039 utterances with 9 discrete emotion classes, namely happiness, anger, sadness, fear, surprise, excitement, frustration, neutral and others. To compare with the state-of-the-art approaches, we follow the data split setting in [27] and use 5531 utterances from the top 4 emotion classes: happiness, anger, sadness and neutral (the 'excitement' utterances are merged into the 'happiness' class). The data distribution is shown in Table 1. We follow the speaker-independent setting to avoid actor overlap between the validation and testing sets. Under this consideration, four sessions are chosen as the training set and the remaining session is divided into a validation set and a testing set.
Table 1: Data distribution of IEMOCAP dataset
Happy | Anger | Sadness | Neutral | Total
1636 | 1103 | 1084 | 1708 | 5531
Table 2: Data distribution of MELD dataset
Emotion | Train | Dev | Test | Total
Anger | 1109 | 153 | 345 | 1607
Disgust | 271 | 22 | 68 | 361
Fear | 268 | 40 | 50 | 358
Joy | 1743 | 163 | 402 | 2308
Neutral | 4710 | 470 | 1256 | 6436
Sadness | 683 | 111 | 208 | 1002
Surprise | 1205 | 150 | 281 | 1636
MELD [35] is a multi-modal conversational dataset. It extracts more than 1300 dialogues and 13000 utterances from the Friends TV series with 304 speakers in total. Each utterance segment contains an audio track, visual scene and text transcript, and is labeled with one of seven discrete emotions: joy, anger, sadness, surprise, disgust, fear and neutral. The data distribution is shown in Table 2. The videos contain multiple faces in a scene and the speaker label is not provided in the dataset. Thus, in this work, we cannot match the speaker with his/her face exactly, and we only use the acoustic and lexical modalities in the related experiments.
The AMI [7] dataset consists of about 100 hours of unlabeled meeting recordings. It provides video recordings of each speaker, voice tracks and transcripts of their speech. There is no emotion annotation in the dataset, so we use it as the unsupervised dataset.
Unlabeled Sample Sampling:
We follow several basic rules to select unsupervised data for semi-supervised learning. Firstly, the spoken language, cultural background and age range of the speakers should be similar. Secondly, the camera setting should be as consistent as possible (e.g. lighting, shooting position, resolution, etc.). To avoid sampling silent regions from AMI, we randomly select three continuous words in the transcript and look up the corresponding time region in the video. We then extract the audio and video segment with the middle word as the center of the segment (a small sketch of this cropping rule is given below). Due to the varied video durations and utterance lengths in the AMI and IEMOCAP datasets, we pre-define the duration of each unlabeled sample in advance. Because the duration of 80% of the utterances in the IEMOCAP dataset is within 7.2s, we extract unlabeled samples with a 7.2s crop width for the experiments on the IEMOCAP dataset. Similarly, for experiments on the MELD dataset, whose utterances are shorter, the crop width is set to 3.5s. This step ensures that there is no significant difference in duration between labeled and unlabeled samples. Additionally, people tend to keep a neutral emotional status during meetings and are less likely to express sadness or anger. We apply a vanilla emotion classifier pre-trained on the IEMOCAP dataset to the AMI dataset and observe that about 80% of the 20,000 unlabeled samples are classified as happiness or neutral. We therefore apply sub-sampling on these two emotion types, selecting 5000 samples for the happiness and neutral classes respectively. The number of samples for the sadness and anger classes is less than 5000, so we apply over-sampling on these two classes and collect 5000 samples for each class respectively. After this filtering process, the unlabeled dataset is more balanced.

Table 3: Model architecture setting. In convolutional and deconvolutional layers, we denote kernel size as k, stride length as s, padding length as p and the number of channels as c. In the Transformer, we denote the number of self-attention heads as h, the number of transformer blocks as b and the size of hidden embedding as e.
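As an illustration of the cropping rule, the sketch below picks three consecutive words from a time-stamped transcript and centers a fixed-width crop on the middle word; the transcript tuple format and function name are assumptions.

```python
import random

def sample_crop(words, crop_width=7.2):
    """words: list of (token, start_sec, end_sec) tuples from the AMI transcript.

    Returns (crop_start, crop_end) centered on the middle of three
    consecutive words, so the sampled region is unlikely to be silence.
    """
    i = random.randint(1, len(words) - 2)          # index of the middle word
    _, start, end = words[i]
    center = (start + end) / 2.0
    crop_start = max(0.0, center - crop_width / 2.0)
    return crop_start, crop_start + crop_width

words = [("so", 10.1, 10.3), ("we", 10.3, 10.4), ("decided", 10.4, 10.9), ("to", 10.9, 11.0)]
print(sample_crop(words))                          # e.g. (7.05, 14.25) for a 7.2 s crop
```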
Face Extraction:
We apply face detection and extraction on all datasets with the SeetaFace toolkit [49]. Each face image is transformed into gray scale with a size of 64x64. The videos contain 30 frames per second and there is almost no change between adjacent frames. To reduce computation cost without losing too much information, we set the sampling rate to 1/10, which gives about 3 frames per second. For frames where no face can be detected, we use the detection result of the previous frame. For the AMI dataset, however, there are about 11% of videos in which no face can be detected at all; we simply drop these samples. Finally, we obtain 20,000 unlabeled samples from AMI in total.
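A small sketch of the frame sub-sampling and previous-detection fallback is given below; detect_face stands in for the SeetaFace-based detector and is hypothetical.

```python
def sample_faces(frames, detect_face, step=10):
    """Keep every `step`-th frame (~3 fps for 30 fps video); if no face is
    detected in a frame, reuse the face from the previous sampled frame."""
    faces, last_face = [], None
    for frame in frames[::step]:
        face = detect_face(frame)      # hypothetical detector, e.g. a SeetaFace wrapper
        if face is None:
            face = last_face           # fall back to the previous detection
        if face is not None:
            faces.append(face)
            last_face = face
    return faces                       # videos with no detectable face at all are dropped
```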
Table 4: Fully-supervised experiment results under speaker-independent settings on IEMOCAP
Modality | Model | WAP | UA
A | ARE [51] | 54.6% | 58.0%
A | LSTM+Att [50] | 55.5% | 57.4%
A | Ours | 57.2% | –
V | Ours | 52.5% | 45.4%
L | TRE [51] | 63.5% | 59.1%
L | LSTM+Att [50] | 59.0% | 57.8%
L | Ours | 65.4% | 64.8%
A+V | Ours | 63.2% | 60.8%
A+L | MDRE [51] | – | 67.2%
A+L | MDRE+Att [51] | 69.0% | 68.1%
A+L | Concat [50] | 67.1% | 67.7%
A+L | Alignment [50] | 68.4% | –
A+L | Ours | 70.3% | 68.6%
A+V+L | Ours | 73.0% | 72.5%

Hyper-parameters:
For the IEMOCAP dataset, the number of face images and the number of words in one utterance are fixed at 18 and 22 respectively. For the MELD dataset, the number of words in a single utterance is fixed at 12. We pad with zeros when an utterance is not long enough and cut it off when it is too long. We set the batch size to 128, the weight of the reconstruction loss α to 0.2, the weight of the MMD loss β to 0.1 and the weight of the unsupervised part ω to 0.3. We apply the Adam algorithm with a learning rate of 1e-3 to optimize the parameters. The detailed network setting is presented in Table 3. The structure of the encoder and decoder is completely symmetric. We select the best model based on 5-fold cross validation on the validation set and report its performance on the testing set.

We first compare the proposed semi-supervised framework with several recent state-of-the-art approaches on the IEMOCAP dataset.
1) Yoon et al. [51] propose a deep dual recurrent encoder to combine text information and audio signals. They first investigate the performance of uni-modal recurrent encoders on audio and text (ARE, TRE). Then they propose a multi-modal dual recurrent encoder with and without the attention technique (MDRE+Att, MDRE). All their results are reported under speaker-independent settings.
2) Xu et al. [50] propose to learn the frame-level alignment between audio and text via the attention mechanism in order to produce more accurate multi-modal feature representations. They conduct uni-modal experiments on acoustic and lexical data using an LSTM with attention (LSTM+Att). For multi-modal settings, they compare the performance of direct concatenation (Concat) and of applying alignment (Alignment) via attention computing. They report results under speaker-independent settings as well.

For fair comparison with the above baselines, we also conduct speaker-independent experiments and report the performance with weighted average precision (WAP) and unweighted accuracy (UA) [26]. Table 4 presents the experiment results under the fully-supervised speaker-independent setting. We can see from the results that our method outperforms the recurrent encoder and the LSTM with attention under the uni-modal scenario, which indicates that the feature selection and the DAE architecture can capture emotional characteristics well and that the long-term dependency in the transcript can be modeled effectively.

Table 5: Semi-supervised experiment results under speaker-independent settings on IEMOCAP
Modality | Supervised mode | WAP | UA
A+V | Ours-fully | 63.2% | 60.8%
A+V | Ours-semi | 63.7% | 61.2%
A+L | Ours-fully | 70.3% | 68.6%
A+L | Ours-semi | 72.6% | 72.1%
A+V+L | Ours-fully | 73.0% | 72.5%
A+V+L | Ours-semi | 75.6% | 74.5%

Figure 5: An example frame from videos in the IEMOCAP dataset. We cannot capture the full face of the right person, which reduces the capability of the visual model.

The visual modality achieves worse performance compared with the acoustic and lexical modalities. A possible reason might be that nearly half of the speakers only show part of their faces, as exemplified by the speaker on the right in Figure 5. The low quality of face images limits the visual model's performance, which is also why little prior research explores the visual modality on the IEMOCAP dataset. The better performance of the multi-modal model demonstrates that multiple modalities are complementary to each other for emotion expression.

The semi-supervised results are shown in Table 5. We compare the performance between the fully-supervised setting and the semi-supervised setting to verify the feasibility of our modality distribution matching assumption. Since the proposed semi-supervised approach needs at least two modalities, we did not run semi-supervised experiments for uni-modal settings. Our semi-supervised training strategy boosts the model capability in all modality combinations and outperforms all the baseline approaches. This demonstrates the rationality and effectiveness of the proposed assumption and model, which can take advantage of unsupervised data and extract more emotion-salient latent representations.

We then present the experiment results on the MELD dataset. Because MELD is a new emotion dataset collected in an interaction scenario, the majority of existing approaches evaluated on it consider auxiliary information from the interaction, such as interaction context or speaker information. In this work, however, our focus is on how to effectively utilize unlabeled data, so our proposed approach does not consider context or speaker information. In this set of experiments, we therefore compare not only to approaches that do not use interaction-related information, but also to those interactive approaches. We use the following state-of-the-art baselines for comparison.

Table 6: Performance comparison (weighted F1 score) on the MELD dataset. * indicates that the corresponding approach uses conversation context information, and △ indicates that the corresponding approach uses speaker information.
(1) Zadeh et al. propose the Memory Fusion Network (MFN) [52], which focuses on improving multi-modal fusion effectiveness. This method does not use any context or speaker information.
(2) Poria et al. propose the Bidirectional Contextual LSTM (BC-LSTM) [33]. It performs contextual information fusion in the conversational scenario.
(3) Hazarika et al. propose the Conversational Memory Network (CMN) [21] and the Interactive Conversational Memory Network (ICON) [20]. Both models utilize the context of the speaker and the interlocutor during two-speaker interaction. The former ignores global contextual information while the latter incorporates the global context.
(4) Majumder et al. propose DialogueRNN [28], which models the preceding emotion status of two speakers and global contextual information through three GRUs.
(5) Zhang et al. propose a context-sensitive and speaker-sensitive graph-based convolutional neural network (ConGCN) [53] to model dialogue relationships. They aggregate multiple speakers and multiple conversations into a graph and explore the latent connections. They also report the results of the experiment without the conversation-sensitive component.

Since the number of samples in each emotion category is unbalanced in the MELD dataset, following Zhang et al. [53], we report performance with the weighted average F1 score [40]. As shown in Table 6, our semi-supervised model significantly outperforms MFN, which also does not use any conversation context or speaker information. Our model also achieves better performance than CMN, ICON and BC-LSTM, which use additional auxiliary information, either conversation context information or speaker information. This demonstrates the advantage of our model in the isolated emotion recognition scenario. Furthermore, our model achieves comparable performance with DialogueRNN and ConGCN, the two state-of-the-art interactive emotion recognition approaches proposed recently. Although ConGCN outperforms our model by 2.3% when it makes full use of both speaker and contextual information, the overall experiment results show that even though our model does not utilize any auxiliary interaction information, it is still very competitive in conversational scenarios.
Unlabeled data quantity analysis.
To gain more insights about the impact of unlabeled data, we analyze how the classification performance changes with different quantities of unlabeled training data. We conduct the semi-supervised experiment combining all three modalities. To show the impact of unlabeled data, we keep the hyper-parameters unchanged except for the number of unlabeled samples.
Figure 6: Confusion matrices of experiments on the IEMOCAP dataset. (a) experiments on the lexical modality; (b) fully-supervised experiment on the acoustic and lexical modalities; (c) fully-supervised experiment on the acoustic, visual and lexical modalities; (d) semi-supervised experiment on the acoustic, visual and lexical modalities.
We first train a fully-supervised model and then add 5000 unlabeled samples step by step until all 20,000 unlabeled samples from the AMI corpus are used. As shown in Fig. 7(a), the performance of the semi-supervised model gradually improves with the increase of unlabeled samples, which indicates that the additional samples benefit the generalization and robustness of the recognition model.
Confusion matrix analysis.
As shown in Fig. 6, neutral samples are more likely to be identified as emotional categories. According to the confusion matrices, the multi-modal combination boosts the performance on all classes, and semi-supervised learning mainly improves performance on the neutral class.
Training procedure analysis.
We present the loss curves in Fig. 7(b). We can see that the classification loss, our main optimization target, decreases rapidly and converges within 15 epochs. Due to the limited scale of training data, most of the best models on the validation set appear between the 10th and 15th epoch in our experiments. Too many training iterations lead to overfitting on the training set. The curve of the reconstruction loss changes smoothly and converges stably. The change of the distribution matching loss also meets our expectation: the paired term and the unpaired term change in opposite directions and then converge to a similar value.
To explore the contribution of each loss component, we fix the supervised emotion classification loss and conduct an ablation study on the reconstruction loss and the unsupervised distribution matching loss. We take the speaker-independent experiment with the acoustic, visual and lexical modalities on the IEMOCAP dataset as an example. Table 7 presents the results. The experiments are divided into a fully-supervised part and a semi-supervised part. In the fully-supervised experiments, the model only employing the distribution matching loss
Figure 7: (a) Influence of unlabeled data quantity. (b) The changing curves of the loss values.
Table 7: Experimental results for component contribution evaluation (based on A+V+L).
Setting | Reconstruction | MMD | WAP | UA
In this work, under the assumption that the emotion status is consistent across different modalities at the coarse-grained utterance level, we propose a novel semi-supervised learning method based on cross-modal distribution matching for multi-modal emotion recognition. We jointly optimize the emotion classification, utterance-level cross-modal distribution matching and feature reconstruction objectives. Extensive experiments on the IEMOCAP and MELD datasets prove the effectiveness of our proposed semi-supervised model and demonstrate that unlabeled data and multi-modality fusion both benefit the classification performance. Our model without contextual information outperforms existing state-of-the-art models in the non-interactive scenario and is competitive with interactive methods.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 61772535) and the Beijing Natural Science Foundation (No. 4192028).
References
[1] Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. 2018. Emotion Recognition in Speech using Cross-Modal Transfer in the Wild. In Proceedings of the 26th ACM International Conference on Multimedia. 292–301. https://doi.org/10.1145/3240508.3240578
[2] Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. 2018. Emotion recognition in speech using cross-modal transfer in the wild. In Proceedings of the 26th ACM International Conference on Multimedia. 292–301.
[3] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. 2016. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In ACM International Conference on Multimodal Interaction (ICMI).
[4] Kevin Brady, Youngjune Gwon, Pooya Khorrami, Elizabeth Godoy, William Campbell, Charlie Dagli, and Thomas S. Huang. 2016. Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction. In International Workshop on Audio/Visual Emotion Challenge. 97–104.
[5] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 4 (2008), 335–359.
[6] Yue Cao, Mingsheng Long, Jianmin Wang, and Shichen Liu. 2017. Collective deep quantization for efficient cross-modal retrieval. In Thirty-First AAAI Conference on Artificial Intelligence.
[7] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, and Melissa Kronenthal. 2006. The AMI meeting corpus: a pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction. 28–39.
[8] Jonathan Chang and Stefan Scherer. 2017. Learning representations of emotional speech with deep convolutional generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing.
[9] Shizhe Chen, Qin Jin, Jinming Zhao, and Shuai Wang. 2017. Multimodal Multi-task Learning for Dimensional and Continuous Emotion Recognition. In The Workshop on Audio/Visual Emotion Challenge. 19–26.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[11] Changde Du, Changying Du, Hao Wang, Jinpeng Li, Wei-Long Zheng, Bao-Liang Lu, and Huiguang He. 2018. Semi-supervised deep generative modelling of incomplete multi-modality emotional data. In Proceedings of the 26th ACM International Conference on Multimedia. 108–116.
[12] Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion 6, 3-4 (1992), 169–200.
[13] Florian Eyben. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In ACM International Conference on Multimedia. 1459–1462.
[14] Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, Heysem Kaya, Maximilian Schmitt, Shahin Amiriparian, Nicholas Cummins, Denis Lalanne, Adrien Michaud, Elvan Çiftçi, Hüseyin Güleç, Albert Ali Salah, and Maja Pantic. 2018. AVEC 2018 Workshop and Challenge: Bipolar Disorder and Cross-Cultural Affect Recognition. In Proceedings of the 8th International Workshop on Audio/Visual Emotion Challenge, AVEC'18, co-located with the 26th ACM International Conference on Multimedia, MM 2018, Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, and Maja Pantic (Eds.). ACM, Seoul, Korea.
[15] N. Fragopanagos and J. G. Taylor. 2002. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine 18, 1 (2002), 32–80.
[16] Kathleen R Gibson, Kathleen Rita Gibson, and Tim Ingold. 1993. Tools, Language and Cognition in Human Evolution. Cambridge University Press.
[17] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. 2012. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2012), 2916–2929.
[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In International Conference on Neural Information Processing Systems. 2672–2680.
[19] Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018. Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, Vol. 2018. NIH Public Access, 2225–2235.
[20] Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. 2018. ICON: interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2594–2604.
[21] Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2122–2132.
[22] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition. 2261–2269.
[23] Zhaocheng Huang, Ting Dang, Nicholas Cummins, Brian Stasak, and Julien Epps. 2015. An Investigation of Annotation Delay Compensation and Output-Associative Fusion for Multimodal Continuous Emotion Prediction. In International Workshop on Audio/Visual Emotion Challenge.
[24] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2010), 117–128.
[25] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2010), 117–128.
[26] Allen Kent, Madeline M Berry, Fred U Luehrs Jr, and James W Perry. 1955. Machine literature searching VIII. Operational criteria for designing information retrieval systems. American Documentation 6, 2 (1955), 93–101.
[27] Runnan Li, Zhiyong Wu, Jia Jia, Yaohua Bu, Sheng Zhao, and Helen Meng. 2019. Towards discriminative representation learning for speech emotion recognition. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI). 5060–5066.
[28] Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6818–6825.
[29] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 55–60.
[30] Shuiyang Mao, P. C. Ching, and Tan Lee. 2019. Deep Learning of Segment-Level Feature Representation with Multiple Instance Learning for Utterance-Level Speech Emotion Recognition. Proc. Interspeech 2019 (2019), 1686–1690.
[31] Debra K Meyer and Julianne C Turner. 2002. Discovering emotion in classroom motivation research. Educational Psychologist 37, 2 (2002), 107–114.
[32] Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6892–6899.
[33] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 873–883.
[34] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 873–883.
[35] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508 (2018).
[36] Viktor Rozgić, Sankaranarayanan Ananthakrishnan, Shiri Saleem, Rohit Kumar, Aravind Namandi Vembu, and Rohit Prasad. 2012. Emotion recognition using acoustic and lexical features. In INTERSPEECH 2012. 366–369.
[37] James A Russell and Albert Mehrabian. 1977. Evidence for a three-factor theory of emotions. Journal of Research in Personality 11, 3 (1977), 273–294.
[38] Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50, 7 (2009), 969–978.
[39] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain.
[40] Yutaka Sasaki et al. 2007. The truth of the F-measure. Teach Tutor Mater 1, 5 (2007), 1–5.
[41] Stefan Scherer, John Kane, Christer Gobl, and Friedhelm Schwenker. 2013. Investigating fuzzy-input fuzzy-output support vector machines for robust voice quality classification. Computer Speech & Language 27, 1 (2013), 263–287.
[42] Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. 2011. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication 53, 9 (2011), 1062–1087.
[43] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. 2013. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics (2013), 2263–2291.
[44] Kah Phooi Seng, Li-Minn Ang, and Chien Shing Ooi. 2016. A combined rule-based & machine learning audio-visual emotion recognition approach. IEEE Transactions on Affective Computing 9, 1 (2016), 3–13.
[45] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning. 843–852.
[46] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. arXiv preprint arXiv:1906.00295 (2019).
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[48] Xin Wang, Wenwu Zhu, and Chenghao Liu. 2019. Semi-supervised Deep Quantization for Cross-modal Search. In Proceedings of the 27th ACM International Conference on Multimedia. 1730–1739.
[49] Shuzhe Wu, Meina Kan, Zhenliang He, Shiguang Shan, and Xilin Chen. 2017. Funnel-Structured Cascade for Multi-View Face Detection with Alignment-Awareness. Neurocomputing.
[50] arXiv preprint arXiv:1909.05645 (2019).
[51] Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung. 2018. Multimodal speech emotion recognition using audio and text. In IEEE Spoken Language Technology Workshop (SLT). IEEE, 112–118.
[52] Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Memory fusion network for multi-view sequential learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[53] Dong Zhang, Liangqing Wu, Changlong Sun, Shoushan Li, Qiaoming Zhu, and Guodong Zhou. 2019. Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI. 10–16.
[54] Feiran Zhang, Panos Markopoulos, and Tilde Bekker. 2018. The role of children's emotions during design-based learning activity: a case study at a Dutch high school. In . SCITEPRESS-Science and Technology Publications, Lda., 198–205.
[55] Ting Zhang and Jingdong Wang. 2016. Collaborative quantization for cross-modal similarity search. In