Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis
Wenmeng Yu, Hua Xu,* Ziqi Yuan, Jiele Wu
State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, China. Email: [email protected], [email protected]
Abstract
Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should contain two parts of characteristics: the consistency and the difference. Due to the unified multimodal annotation, existing methods are restricted in capturing differentiated information. However, additional unimodal annotations are high time- and labor-cost. In this paper, we design a label generation module based on the self-supervised learning strategy to acquire independent unimodal supervisions. Then, we jointly train the multimodal and unimodal tasks to learn the consistency and difference, respectively. Moreover, during the training stage, we design a weight-adjustment strategy to balance the learning progress among different subtasks, which guides the subtasks to focus on samples with a larger difference between modality supervisions. Last, we conduct extensive experiments on three public multimodal baseline datasets. The experimental results validate the reliability and stability of the auto-generated unimodal supervisions. On the MOSI and MOSEI datasets, our method surpasses the current state-of-the-art methods. On the SIMS dataset, our method achieves performance comparable to that obtained with human-annotated unimodal labels. The full codes are available at https://github.com/thuiar/Self-MM.
Introduction
Multimodal Sentiment Analysis (MSA) has attracted more and more attention in recent years (Zadeh et al. 2017; Tsai et al. 2019; Poria et al. 2020). Compared with unimodal sentiment analysis, multimodal models are more robust and achieve salient improvements when dealing with social media data. With the booming of user-generated online content, MSA has been introduced into many applications such as risk management, video understanding, and video transcription.
Though previous works have made impressive improvements on benchmark datasets, MSA is still full of challenges. Baltrušaitis, Ahuja, and Morency (2019) identified five core challenges for multimodal learning: alignment, translation, representation, fusion, and co-learning. Among them, representation learning stands in a fundamental position. In recent work, Hazarika, Zimmermann, and Poria (2020) stated that unimodal representations should contain both consistent and complementary information. According to the difference of guidance in representation learning, we classify existing methods into two categories: forward guidance and backward guidance. In forward-guidance methods, researchers devote themselves to designing interactive modules for capturing cross-modal information (Zadeh et al. 2018a; Sun et al. 2020; Tsai et al. 2019; Rahman et al. 2020). However, due to the unified multimodal annotation, it is difficult for them to capture modality-specific information. In backward-guidance methods, researchers propose additional loss functions as prior constraints, which lead modality representations to contain both consistent and complementary information (Yu et al. 2020a; Hazarika, Zimmermann, and Poria 2020).
Yu et al. (2020a) introduced independent unimodal human annotations. By jointly learning unimodal and multimodal tasks, the proposed multi-task multimodal framework learned modality-specific and modality-invariant representations simultaneously. Hazarika, Zimmermann, and Poria (2020) designed two distinct encoders projecting each modality into modality-invariant and modality-specific spaces, with two regularization components claimed to aid modality-invariant and modality-specific representation learning. However, the former requires additional labor cost for unimodal annotations, and in the latter, spatial differences can hardly represent the modality-specific difference. Moreover, both require manually balanced weights between constraint components in the global loss function, which highly relies on human experience.
In this paper, we focus on the backward-guidance method. Motivated by the independent unimodal annotations and advanced modality-specific representation learning, we propose a novel self-supervised multi-task learning strategy. Different from Yu et al. (2020a), our method does not need human-annotated unimodal labels but uses auto-generated unimodal labels. It is based on two intuitions. First, the label difference is positively correlated with the difference between the distances from the modality representations to the class centers. Second, unimodal labels are highly related to multimodal labels. Hence, we design a unimodal label generation module based on multimodal labels and modality representations. The details are shown in Section 3.3.
Considering that auto-generated unimodal labels are not stable enough in the early epochs, we design a momentum-based update method, which applies a larger weight to the unimodal labels generated later.
Furthermore, we introduce a self-adjustment strategy to adjust each subtask's weight when integrating the final multi-task loss function. We believe that it is difficult for subtasks with small label differences, between auto-generated unimodal labels and human-annotated multimodal labels, to learn modality-specific representations. Therefore, the weight of a subtask is positively correlated with its label difference.
The novel contributions of our work can be summarized as follows:
• We propose the relative distance value, based on the distance between modality representations and class centers, which is positively correlated with the model outputs.
• We design a unimodal label generation module based on the self-supervised strategy. Furthermore, a novel weight self-adjusting strategy is introduced to balance different task loss constraints.
• Extensive experiments on three benchmark datasets validate the stability and reliability of auto-generated unimodal labels. Moreover, our method outperforms current state-of-the-art results.

Related Work
In this section, we mainly discuss related works in the domain of multimodal sentiment analysis and multi-task learning. We also emphasize the innovation of our work.
Multimodal sentiment analysis has become a significant research topic that integrates verbal and nonverbal information such as visual and acoustic signals. Previous researchers mainly focus on representation learning and multimodal fusion. For representation learning, Wang et al. (2019) constructed a recurrent attended variation embedding network to generate multimodal shifting. Hazarika, Zimmermann, and Poria (2020) presented modality-invariant and modality-specific representations for multimodal representation learning. For multimodal fusion, according to the fusion stage, previous works can be classified into two categories: early fusion and late fusion. Early fusion methods usually use delicate attention mechanisms for cross-modal fusion. Zadeh et al. (2018a) designed a memory fusion network for cross-view interactions. Tsai et al. (2019) proposed cross-modal transformers, which learn cross-modal attention to reinforce a target modality. Late fusion methods learn intra-modal representations first and perform inter-modal fusion last. Zadeh et al. (2017) used a tensor fusion network that obtains a tensor representation by computing the outer product between unimodal representations. Liu et al. (2018) proposed a low-rank multimodal fusion method to decrease the computational complexity of tensor-based methods.
Our work aims at representation learning based on a late-fusion structure. Different from previous studies, we jointly learn unimodal and multimodal tasks with a self-supervised strategy. Our method learns similarity information from the multimodal task and differentiated information from the unimodal tasks.
The Transformer is a sequence-to-sequence architecture without a recurrent structure (Vaswani et al. 2017). It is used for modeling sequential data and outperforms recurrent structures in results, speed, and depth. BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018) is a successful application of the Transformer. The pre-trained BERT model has achieved significant improvements on multiple NLP tasks. In multimodal learning, pre-trained BERT has also achieved remarkable results. Currently, there are two ways to use pre-trained BERT. The first is to use the pre-trained BERT as a language feature extraction module (Hazarika, Zimmermann, and Poria 2020). The second is to integrate acoustic and visual information into its middle layers (Tsai et al. 2019; Rahman et al. 2020). In this paper, we adopt the first way and finetune the pre-trained BERT for our tasks, as sketched below.
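As an illustration of this first usage pattern, the following is a minimal sketch of extracting a sentence representation with a pre-trained BERT. It assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, which are our own illustrative choices; the paper only specifies a pre-trained 12-layer BERT.

```python
import torch
from transformers import BertModel, BertTokenizer

# Illustrative checkpoint; the paper does not name a specific one.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def sentence_representation(text: str) -> torch.Tensor:
    """Return the first-token vector of BERT's last hidden layer,
    used here as the whole-sentence representation."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # shape: (1, 768)

print(sentence_representation("It was really really funny.").shape)
```

When used for finetuning, this module is trained end-to-end with the rest of the network rather than kept frozen.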
Multi-task learning aims to improve the generalization performance of multiple related tasks by utilizing the knowledge contained in different tasks (Zhang and Yang 2017). Compared with single-task learning, there are two main challenges for multi-task learning in the training stage. The first is how to share network parameters, including hard-sharing and soft-sharing methods. The second is how to balance the learning processes of different tasks. Recently, multi-task learning has been widely applied in MSA (Liu et al. 2015; Zhang et al. 2016; Akhtar et al. 2019; Yu et al. 2020b). In this work, we introduce unimodal subtasks to aid modality-specific representation learning. We adopt the hard-sharing strategy and design a weight-adjustment method to address the balancing problem.
In this section, we explain the Self-Supervised Multi-task Multimodal sentiment analysis network (Self-MM) in detail. The goal of Self-MM is to acquire information-rich unimodal representations by jointly learning one multimodal task and three unimodal subtasks. Different from the multimodal task, the labels of the unimodal subtasks are auto-generated in a self-supervised manner. For the convenience of the following sections, we refer to the human-annotated multimodal labels as m-labels and the auto-generated unimodal labels as u-labels.
Multimodal Sentiment Analysis (MSA) aims to judge sentiment using multimodal signals, including text ($I_t$), audio ($I_a$), and vision ($I_v$). Generally, MSA can be regarded as either a regression task or a classification task. In this work, we regard it as a regression task. Therefore, Self-MM takes $I_t$, $I_a$, and $I_v$ as inputs and outputs one sentiment intensity result $\hat{y}_m \in \mathbb{R}$. In the training stage, to aid representation learning, Self-MM has three extra unimodal outputs $\hat{y}_s \in \mathbb{R}$, where $s \in \{t, a, v\}$. Though there is more than one output, we only use $\hat{y}_m$ as the final predictive result.
Figure 1: The overall architecture of Self-MM. The $\hat{y}_m$, $\hat{y}_t$, $\hat{y}_a$, and $\hat{y}_v$ are the predictive outputs of the multimodal task and the three unimodal tasks, respectively. The $y_m$ is the multimodal annotation by humans. The $y_t$, $y_a$, and $y_v$ are the unimodal supervisions generated by the self-supervised strategy. Finally, $\hat{y}_m$ is used as the sentiment output.
As shown in Figure 1, Self-MM consists of one multimodal task and three independent unimodal subtasks. Between the multimodal task and the different unimodal tasks, we adopt the hard-sharing strategy to share the bottom representation learning network.
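To make the hard-sharing structure concrete, here is a minimal PyTorch sketch of the overall layout: shared modality encoders feeding one multimodal head and three unimodal heads. All class, parameter, and dimension names are illustrative assumptions rather than the released implementation; the individual modules are described in the following subsections.

```python
import torch
import torch.nn as nn

class SelfMMSketch(nn.Module):
    """Hard-sharing multi-task layout: shared encoders, four output heads."""

    def __init__(self, text_encoder, d_t=768, d_a=16, d_v=32, d_m=128):
        super().__init__()
        self.text_encoder = text_encoder              # e.g. a pre-trained BERT
        self.audio_lstm = nn.LSTM(d_a, d_a, batch_first=True)
        self.video_lstm = nn.LSTM(d_v, d_v, batch_first=True)

        # Fusion + multimodal regression head.
        self.fusion = nn.Sequential(nn.Linear(d_t + d_a + d_v, d_m), nn.ReLU())
        self.out_m = nn.Linear(d_m, 1)

        # Per-modality projection + regression heads (training-time only).
        dims = {"t": d_t, "a": d_a, "v": d_v}
        self.proj = nn.ModuleDict({s: nn.Sequential(nn.Linear(d, d), nn.ReLU())
                                   for s, d in dims.items()})
        self.out_uni = nn.ModuleDict({s: nn.Linear(d, 1) for s, d in dims.items()})

    def forward(self, text_inputs, audio, video):
        # Shared representations: BERT first-token vector and LSTM end states.
        F_t = self.text_encoder(**text_inputs).last_hidden_state[:, 0]
        _, (h_a, _) = self.audio_lstm(audio)
        _, (h_v, _) = self.video_lstm(video)
        F = {"t": F_t, "a": h_a[-1], "v": h_v[-1]}

        F_m_star = self.fusion(torch.cat([F["t"], F["a"], F["v"]], dim=-1))
        y_m = self.out_m(F_m_star)
        F_uni = {s: self.proj[s](F[s]) for s in F}
        y_uni = {s: self.out_uni[s](F_uni[s]) for s in F}
        return y_m, y_uni, F_m_star, F_uni
```

At inference time only the multimodal prediction is used; the unimodal heads and the returned representations are needed during training for the label generation and loss described below.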
Multimodal Task.
For the multimodal task, we adopt a classical multimodal sentiment analysis architecture. It contains three main parts: the feature representation module, the feature fusion module, and the output module. In the text modality, given the great success of pre-trained language models, we use the pre-trained 12-layer BERT to extract sentence representations. Empirically, the first-word vector in the last layer is selected as the whole-sentence representation $F_t$:

$$F_t = \mathrm{BERT}(I_t; \theta_t^{bert}) \in \mathbb{R}^{d_t}$$

In the audio and vision modalities, following Zadeh et al. (2017) and Yu et al. (2020b), we use pre-trained toolkits to extract the initial vector features $I_a \in \mathbb{R}^{l_a \times d_a}$ and $I_v \in \mathbb{R}^{l_v \times d_v}$ from the raw data. Here, $l_a$ and $l_v$ are the sequence lengths of audio and vision, respectively. Then, we use a single-directional Long Short-Term Memory network (sLSTM) (Hochreiter and Schmidhuber 1997) to capture the timing characteristics. Finally, the end-state hidden vectors are adopted as the whole-sequence representations:

$$F_a = \mathrm{sLSTM}(I_a; \theta_a^{lstm}) \in \mathbb{R}^{d_a}$$
$$F_v = \mathrm{sLSTM}(I_v; \theta_v^{lstm}) \in \mathbb{R}^{d_v}$$

Then, we concatenate all unimodal representations and project them into a lower-dimensional space $\mathbb{R}^{d_m}$:

$$F_m^* = \mathrm{ReLU}(W_{l1}^{m\top}[F_t; F_a; F_v] + b_{l1}^m)$$

where $W_{l1}^m \in \mathbb{R}^{(d_t + d_a + d_v) \times d_m}$ and ReLU is the relu activation function.
Last, the fusion representation $F_m^*$ is used to predict the multimodal sentiment:

$$\hat{y}_m = W_{l2}^{m\top} F_m^* + b_{l2}^m$$

where $W_{l2}^m \in \mathbb{R}^{d_m \times 1}$.
Uni-modal Task.
The three unimodal tasks share the modality representations with the multimodal task. In order to reduce the dimensional difference between different modalities, we project them into a new feature space and then obtain the unimodal results with linear regression:

$$F_s^* = \mathrm{ReLU}(W_{l1}^{s\top} F_s + b_{l1}^s)$$
$$\hat{y}_s = W_{l2}^{s\top} F_s^* + b_{l2}^s$$

where $s \in \{t, a, v\}$.
To guide the training process of the unimodal tasks, we design a Unimodal Label Generation Module (ULGM) to obtain the u-labels. Details of the ULGM are discussed in Section 3.3:

$$y_s = \mathrm{ULGM}(y_m, F_m^*, F_s^*)$$

where $s \in \{t, a, v\}$.
Finally, we jointly learn the multimodal task and the three unimodal tasks under the supervision of m-labels and u-labels. It is worth noting that the unimodal tasks only exist in the training stage; therefore, we use $\hat{y}_m$ as the final output.
Table 1: Results on MOSI and MOSEI. (B) means the language features are based on BERT; the compared results are from Hazarika, Zimmermann, and Poria (2020) and from Rahman et al. (2020). Models with * are reproduced under the same conditions. In Acc-2 and F1-Score, the left of the "/" is calculated as "negative/non-negative" and the right as "negative/positive".
Figure 2: Unimodal label generation example. The multimodal representation $F_m^*$ is closer to the positive center (m-pos) while the unimodal representation is closer to the negative center (s-neg). Therefore, the unimodal supervision $y_s$ is obtained by adding a negative offset $\delta_{sm}$ to the multimodal label $y_m$.
ULGM.
The ULGM aims to generate unimodal supervision values based on multimodal annotations and modality representations. In order to avoid unnecessary interference with the update of the network parameters, the ULGM is designed as a non-parametric module. Generally, unimodal supervision values are highly correlated with multimodal labels. Therefore, the ULGM calculates the offset according to the relative distance from the modality representations to the class centers, as shown in Figure 2.
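One design point worth making explicit is that, being non-parametric, the ULGM should not feed gradients back into the encoders. A minimal sketch of this, with an assumed callable `ulgm` implementing $y_s = \mathrm{ULGM}(y_m, F_m^*, F_s^*)$ (both names are illustrative), is:

```python
import torch

def make_unimodal_labels(ulgm, y_m, F_m_star, F_uni):
    """Generate u-labels outside the computation graph: the representations
    are detached so that label generation never interferes with the update
    of the network parameters (the ULGM itself has nothing trainable)."""
    with torch.no_grad():
        return {s: ulgm(y_m, F_m_star.detach(), F_uni[s].detach())
                for s in ("t", "a", "v")}
```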
Relative Distance Value.
Since different modality representations exist in different feature spaces, using the absolute distance value is not accurate enough. Therefore, we propose the relative distance value, which does not depend on the space difference.
First, during the training process, we maintain the positive center ($C_i^p$) and the negative center ($C_i^n$) of the different modality representations:

$$C_i^p = \frac{\sum_{j=1}^{N} \mathbb{I}(y_i(j) > 0) \cdot F_{ij}^g}{\sum_{j=1}^{N} \mathbb{I}(y_i(j) > 0)} \quad (1)$$

$$C_i^n = \frac{\sum_{j=1}^{N} \mathbb{I}(y_i(j) < 0) \cdot F_{ij}^g}{\sum_{j=1}^{N} \mathbb{I}(y_i(j) < 0)} \quad (2)$$

where $i \in \{m, t, a, v\}$, $N$ is the number of training samples, $\mathbb{I}(\cdot)$ is an indicator function, and $F_{ij}^g$ is the global representation of the $j$-th sample in modality $i$.
For the modality representations, we use the L2 norm as the distance between $F_i^*$ and the class centers:

$$D_i^p = \frac{\|F_i^* - C_i^p\|_2}{\sqrt{d_i}} \quad (3)$$

$$D_i^n = \frac{\|F_i^* - C_i^n\|_2}{\sqrt{d_i}} \quad (4)$$

where $i \in \{m, t, a, v\}$ and $d_i$ is the representation dimension, used as a scale factor.
Then, we define the relative distance value, which evaluates the relative distance from the modality representation to the positive center and the negative center:

$$\alpha_i = \frac{D_i^n - D_i^p}{D_i^p + \epsilon} \quad (5)$$

where $i \in \{m, t, a, v\}$ and $\epsilon$ is a small constant to avoid division by zero.

Algorithm 1: Unimodal Supervisions Update Policy
Input: unimodal inputs $I_t$, $I_a$, $I_v$; m-labels $y_m$
Output: u-labels $y_t^{(i)}$, $y_a^{(i)}$, $y_v^{(i)}$, where $i$ is the number of training epochs
Initialize model parameters $M(\theta; x)$
Initialize u-labels $y_t^{(1)} = y_m$, $y_a^{(1)} = y_m$, $y_v^{(1)} = y_m$
Initialize global representations $F_t^g = 0$, $F_a^g = 0$, $F_v^g = 0$, $F_m^g = 0$
for $n \in [1, end]$ do
  for each mini-batch in dataLoader do
    Compute the mini-batch modality representations $F_t^*$, $F_a^*$, $F_v^*$, $F_m^*$
    Compute the loss $L$ using Equation (10)
    Compute the parameter gradients $\frac{\partial L}{\partial \theta}$
    Update the model parameters: $\theta = \theta - \eta \frac{\partial L}{\partial \theta}$
    if $n \neq 1$ then
      Compute the relative distance values $\alpha_m$, $\alpha_t$, $\alpha_a$, and $\alpha_v$ using Equations (1∼5)
      Compute $y_t$, $y_a$, $y_v$ using Equation (8)
      Update $y_t^{(n)}$, $y_a^{(n)}$, $y_v^{(n)}$ using Equation (9)
    end if
    Update the global representations $F_s^g$ using $F_s^*$, where $s \in \{m, t, a, v\}$
  end for
end for

Shifting Value.
It is intuitive that $\alpha_i$ is positively related to the final result. To obtain the link between supervisions and predicted values, we consider the following two relationships:

$$\frac{y_s}{y_m} \propto \frac{\hat{y}_s}{\hat{y}_m} \propto \frac{\alpha_s}{\alpha_m} \;\Rightarrow\; y_s = \frac{\alpha_s \cdot y_m}{\alpha_m} \quad (6)$$

$$y_s - y_m \propto \hat{y}_s - \hat{y}_m \propto \alpha_s - \alpha_m \;\Rightarrow\; y_s = y_m + \alpha_s - \alpha_m \quad (7)$$

where $s \in \{t, a, v\}$.
Specifically, Equation (7) is introduced to avoid the "zero value problem": in Equation (6), when $y_m$ equals zero, the generated unimodal supervision values $y_s$ are always zero. Then, jointly considering the above relationships, we obtain the unimodal supervisions by an equal-weight summation:

$$y_s = \frac{1}{2}\left(\frac{y_m \cdot \alpha_s}{\alpha_m} + y_m + \alpha_s - \alpha_m\right) = y_m + \frac{\alpha_s - \alpha_m}{2} \cdot \frac{y_m + \alpha_m}{\alpha_m} = y_m + \delta_{sm} \quad (8)$$

where $s \in \{t, a, v\}$ and $\delta_{sm} = \frac{(\alpha_s - \alpha_m)(y_m + \alpha_m)}{2\alpha_m}$ represents the offset of the unimodal supervision from the multimodal annotation.

Momentum-based Update Policy.
Due to the dynamic changes of the modality representations, the u-labels calculated by Equation (8) alone are not stable enough. To mitigate the adverse effects, we design a momentum-based update policy, which combines the newly generated value with the historical values:

$$y_s^{(i)} = \begin{cases} y_m & i = 1 \\ \frac{i-1}{i+1}\, y_s^{(i-1)} + \frac{2}{i+1}\, y_s^{i} & i > 1 \end{cases} \quad (9)$$

where $s \in \{t, a, v\}$, $y_s^{i}$ is the newly generated u-label at the $i$-th epoch, and $y_s^{(i)}$ is the final u-label after the $i$-th epoch.
Formally, assuming the total number of epochs is $n$, the weight of $y_s^{i}$ in the final u-label is $\frac{2i}{n(n+1)}$. This means that u-labels generated later receive larger weights than earlier ones, which accords with our experience, since the generated unimodal supervisions become more reliable as training proceeds.
Finally, we use the L1 loss as the basic optimization objective. For the unimodal tasks, we use the difference between the u-labels and the m-labels as the weight of the loss function, indicating that the network should pay more attention to the samples with a larger difference:

$$L = \frac{1}{N} \sum_{i}^{N} \left( |\hat{y}_m^i - y_m^i| + \sum_{s}^{\{t,a,v\}} W_s^i \cdot |\hat{y}_s^i - y_s^{(i)}| \right) \quad (10)$$

where $N$ is the number of training samples and $W_s^i = \tanh(|y_s^{(i)} - y_m|)$ is the weight of the $i$-th sample for the auxiliary task $s$.
In this section, we introduce our experimental settings, including the experimental datasets, baselines, and evaluation metrics.
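For reference, the following sketch pulls Equations (1)–(10) together: class centers, relative distances, the shifting value, the momentum update, and the weighted multi-task loss. Function names, tensor shapes, and the extra epsilon guard in the shifting value are our own illustrative assumptions, not the authors' released code.

```python
import torch

def class_centers(F_global, y):
    """Positive/negative class centers over the training set (Eqs. 1-2)."""
    y = y.view(-1)
    return F_global[y > 0].mean(dim=0), F_global[y < 0].mean(dim=0)

def relative_distance(F, c_pos, c_neg, eps=1e-8):
    """Relative distance value (Eqs. 3-5): > 0 means closer to the positive center."""
    d = F.shape[-1]
    D_pos = torch.norm(F - c_pos, dim=-1) / d ** 0.5
    D_neg = torch.norm(F - c_neg, dim=-1) / d ** 0.5
    return (D_neg - D_pos) / (D_pos + eps)

def shift_labels(y_m, alpha_m, alpha_s, eps=1e-8):
    """Unimodal supervision from the multimodal label (Eq. 8);
    the eps guard against alpha_m == 0 is an assumption of this sketch."""
    delta_sm = (alpha_s - alpha_m) * (y_m + alpha_m) / (2 * (alpha_m + eps))
    return y_m + delta_sm

def momentum_update(y_prev, y_new, epoch):
    """Momentum-based update of the u-labels (Eq. 9). Per Algorithm 1 this is
    only called from the second epoch on; before that, u-labels stay at y_m."""
    return (epoch - 1) / (epoch + 1) * y_prev + 2 / (epoch + 1) * y_new

def self_mm_loss(y_hat_m, y_m, y_hat_uni, y_uni):
    """Weighted multi-task L1 loss (Eq. 10)."""
    loss = torch.mean(torch.abs(y_hat_m - y_m))
    for s in y_hat_uni:                                # s in {"t", "a", "v"}
        w = torch.tanh(torch.abs(y_uni[s] - y_m))      # per-sample weight W_s
        loss = loss + torch.mean(w * torch.abs(y_hat_uni[s] - y_uni[s]))
    return loss
```

Following Algorithm 1, these pieces are applied per mini-batch from the second epoch onward, with the global representations and class centers refreshed as training proceeds.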
In this work, we use three public multimodal sentiment analysis datasets: MOSI (Zadeh et al. 2016), MOSEI (Zadeh et al. 2018b), and SIMS (Yu et al. 2020a). The basic statistics are shown in Table 2. Here, we give a brief introduction to these datasets.
MOSI.
The CMU-MOSI dataset (Zadeh et al. 2016) is one of the most popular benchmark datasets for MSA. It comprises 2,199 short monologue video clips taken from 93 YouTube movie review videos. Human annotators label each sample with a sentiment score from -3 (strongly negative) to 3 (strongly positive).
MOSEI.
The CMU-MOSEI dataset (Zadeh et al. 2018b) expands on CMU-MOSI with a larger number of utterances and greater variety in samples, speakers, and topics. The dataset contains 23,453 annotated video segments (utterances) from 5,000 videos, 1,000 distinct speakers, and 250 different topics.
SIMS.
The SIMS dataset (Yu et al. 2020a) is a distinctive Chinese MSA benchmark with fine-grained annotations of modality. The dataset consists of 2,281 refined video clips collected from different movies, TV serials, and variety shows, with spontaneous expressions, various head poses, occlusions, and illuminations. Human annotators label each sample with a sentiment score from -1 (strongly negative) to 1 (strongly positive).
Figure 3: The distribution update process of u-labels on the different datasets (MOSI, SIMS, and MOSEI).
To fully validate the performance of Self-MM, we make a fair comparison with the following baselines and state-of-the-art models in multimodal sentiment analysis.
TFN.
The Tensor Fusion Network (TFN) (Zadeh et al. 2017) calculates a multi-dimensional tensor (based on the outer product) to capture uni-, bi-, and tri-modal interactions.
LMF.
The Low-rank Multimodal Fusion (LMF) (Liu et al. 2018) is an improvement over TFN, where a low-rank multimodal tensor fusion technique is used to improve efficiency.
MFN.
The Memory Fusion Network (MFN) (Zadeh et al. 2018a) continuously models the view-specific and cross-view interactions and summarizes them through time with a Multi-view Gated Memory.
MFM.
The Multimodal Factorization Model (MFM) (Tsai et al. 2018) learns modality-specific generative representations along with discriminative representations for classification.
RAVEN.
The Recurrent Attended Variation Embedding Network (RAVEN) (Wang et al. 2019) uses an attention-based model to re-adjust word embeddings according to auxiliary non-verbal signals.
MulT.
The Multimodal Transformer (MulT) (Tsai et al. 2019) extends the Transformer architecture with directional pairwise cross-modal attention, which translates one modality to another.
MAG-BERT.
The Multimodal Adaptation Gate for BERT (MAG-BERT) (Rahman et al. 2020) is an improvement over RAVEN on aligned data, applying a multimodal adaptation gate at different layers of the BERT backbone.

Model     | MAE   | Corr  | Acc-2 | F1-Score
TFN       | 0.428 | 0.605 | 79.86 | 80.15
LMF       | 0.431 | 0.600 | 79.37 | 78.65
Human-MM  | 0.408 | 0.647 | 81.32 | 81.73
Self-MM   | 0.419 | 0.616 | 80.74 | 80.78
Table 3: Results on SIMS.
MISA.
The Modality-Invariant and -Specific Representations (MISA) model (Hazarika, Zimmermann, and Poria 2020) incorporates a combination of losses, including distributional similarity, orthogonal, reconstruction, and task prediction losses, to learn modality-invariant and modality-specific representations.
Experimental Details.
We use Adam as the optimizer, with a separate initial learning rate for BERT and for the other parameters. For a fair comparison, for our model (Self-MM) and the two state-of-the-art methods (MISA and MAG-BERT), we run each model five times and report the average performance.
Evaluation Metrics.
Following previous works (Hazarika, Zimmermann, and Poria 2020; Rahman et al. 2020), we report our experimental results in two forms: classification and regression. For classification, we report the weighted F1 score (F1-Score) and the binary classification accuracy (Acc-2). Specifically, for the MOSI and MOSEI datasets, we calculate Acc-2 and F1-Score in two ways: negative/non-negative (zero included) (Zadeh et al. 2017) and negative/positive (zero excluded) (Tsai et al. 2019). For regression, we report the Mean Absolute Error (MAE) and the Pearson correlation (Corr). Except for MAE, higher values denote better performance for all metrics.

Tasks    | MAE   | Corr  | Acc-2       | F1-Score
M        | 0.730 | 0.781 | 82.38/83.67 | 82.48/83.70
M, V     | 0.732 | 0.775 | 82.67/83.52 | 82.76/83.55
M, A     | 0.728 | 0.790 | 82.80/84.76 | 82.85/84.75
M, T     | 0.731 | 0.789 | 82.65/84.15 | 82.66/84.10
M, A, V  | 0.719 | 0.789 | 82.94/84.76 | 83.05/84.81
M, T, V  | 0.714 | 0.797 |      /85.91 | 84.33/
M, T, A  |       |       |      /85.95 |
Table 4: Results for multimodal sentiment analysis with different task combinations using Self-MM. M, T, A, V represent the multimodal, text, audio, and vision tasks, respectively.
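To make the two Acc-2 conventions concrete, the following is an illustrative re-implementation under our own reading of them, not the official evaluation script.

```python
import numpy as np

def acc2(y_pred, y_true, exclude_zero=False):
    """Binary accuracy on sentiment regression outputs.
    exclude_zero=False: negative vs. non-negative (zero-label samples kept).
    exclude_zero=True:  negative vs. positive (zero-label samples dropped)."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    if exclude_zero:
        mask = y_true != 0
        y_pred, y_true = y_pred[mask], y_true[mask]
        return np.mean((y_pred > 0) == (y_true > 0))
    return np.mean((y_pred >= 0) == (y_true >= 0))
```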
In this section, we make a detailed analysis and discussion of our experimental results.
Table 1 shows the comparative results on the MOSI and MOSEI datasets. For a fair comparison, according to the difference in the "Data Setting", we split the models into two categories: unaligned and aligned. Generally, models using aligned corpora can obtain better results (Tsai et al. 2019). In our experiments, first, compared with the unaligned models (TFN and LMF), we achieve a significant improvement on all evaluation metrics. Even compared with the aligned models, our method obtains competitive results. Moreover, we reproduce the two best baselines, MISA and MAG-BERT, under the same conditions and find that our model surpasses them on most of the evaluation metrics.
Since the SIMS dataset only contains unaligned data, we compare Self-MM with TFN and LMF. Besides, we use the human-annotated unimodal labels to replace the auto-generated u-labels, referred to as Human-MM. The experimental results are shown in Table 3. We can see that Self-MM obtains better results than TFN and LMF and achieves comparable performance to Human-MM. The above results show that our model can be applied to different data scenarios and achieves significant improvements.
To further explore the contributions of Self-MM, we compare the effectiveness of combining different unimodal tasks. The results are shown in Table 4. Overall, compared with the single-task model, the introduction of unimodal subtasks significantly improves model performance. From the results, we can see that "M, T, V" and "M, T, A" achieve comparable or even better results than "M, T, A, V". Moreover, we find that the subtasks "T" and "A" help more than the subtask "V".
Figure 4 (content): three MOSI examples with transcripts ("And the crack on you know in the preview is like so much type.", "And he did a great job.", "Just not enough depth to be interesting."), nonverbal behaviors (frown, raised eyes, nodding, smile, head down), and M-/U-labels of (M: 0.80, V: -0.21, T: -0.27, A: -0.97), (M: -0.5, V: -0.31, T: 0.91, A: 0.85), and (M: 1.40, V: -0.55, T: 0.28, A: -1.08).
Figure 4: Case study for the Self-MM on MOSI. The "M" labels are human-annotated, and the "V, T, A" labels are auto-generated.
To validate the reliability and reasonability of the auto-generated u-labels, we analyze their update process, shown in Figure 3. We can see that as the number of iterations increases, the distributions of the u-labels tend to stabilize, which is in line with our expectations. Compared with the MOSI and SIMS datasets, the update process on MOSEI converges faster. This shows that the larger dataset has more stable class centers, which is more suitable for self-supervised methods.
To further show the reasonability of the u-labels, we select three multimodal examples from the MOSI dataset, as shown in Figure 4. In the first and third cases, the human-annotated m-labels are 0.80 and 1.40. However, the single modalities are inclined to negative sentiments. In line with expectations, the u-labels obtain negative offsets on the m-labels. A positive offset is achieved in the second case. Therefore, the auto-generated u-labels are meaningful, and we believe that these independent u-labels can aid in learning modality-specific representations.

Conclusion
In this paper, we introduce unimodal subtasks to aid in learning modality-specific representations. Different from previous works, we design a unimodal label generation strategy based on the self-supervised method, which saves a lot of human annotation cost. Extensive experiments validate the reliability and stability of the auto-generated unimodal labels. We hope this work can provide a new perspective on multimodal representation learning.
We also find that the generated audio and vision labels are not significant enough, limited by the pre-processed features. In future work, we will build an end-to-end multimodal learning network and explore the relationship between unimodal and multimodal learning.
Acknowledgments
This paper is supported by the National Key R&D Program Projects of China (Grant No: 2018YFC1707605) and the seed fund of Tsinghua University (Department of Computer Science and Technology)-Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things. We would like to thank the anonymous reviewers for their valuable suggestions.
References
Akhtar, M. S.; Chauhan, D.; Ghosal, D.; Poria, S.; Ekbal, A.; and Bhattacharyya, P. 2019. Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 370–379.
Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2019. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
Hazarika, D.; Zimmermann, R.; and Poria, S. 2020. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis. CoRR abs/2005.03545. URL https://arxiv.org/abs/2005.03545.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.
Liu, Z.; Shen, Y.; Lakshminarasimhan, V. B.; Liang, P. P.; Zadeh, A. B.; and Morency, L.-P. 2018. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2247–2256.
Poria, S.; Hazarika, D.; Majumder, N.; and Mihalcea, R. 2020. Beneath the Tip of the Iceberg: Current Challenges and New Directions in Sentiment Analysis Research. arXiv preprint arXiv:2005.00357.
Rahman, W.; Hasan, M. K.; Lee, S.; Zadeh, A. B.; Mao, C.; Morency, L.-P.; and Hoque, E. 2020. Integrating Multimodal Information in Large Pretrained Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2359–2369.
Sun, Z.; Sarma, P.; Sethares, W.; and Liang, Y. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 8992–8999.
Tsai, Y.-H. H.; Bai, S.; Liang, P. P.; Kolter, J. Z.; Morency, L.-P.; and Salakhutdinov, R. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 6558.
Tsai, Y.-H. H.; Liang, P. P.; Zadeh, A.; Morency, L.-P.; and Salakhutdinov, R. 2018. Learning Factorized Multimodal Representations. In International Conference on Learning Representations.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wang, Y.; Shen, Y.; Liu, Z.; Liang, P. P.; Zadeh, A.; and Morency, L.-P. 2019. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 7216–7223.
Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; and Yang, K. 2020a. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3718–3727.
Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; and Morency, L.-P. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
Zadeh, A.; Liang, P. P.; Mazumder, N.; Poria, S.; Cambria, E.; and Morency, L.-P. 2018a. Memory fusion network for multi-view sequential learning. arXiv preprint arXiv:1802.00927.
Zadeh, A.; Zellers, R.; Pincus, E.; and Morency, L.-P. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
Zadeh, A. B.; Liang, P. P.; Poria, S.; Cambria, E.; and Morency, L.-P. 2018b. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2236–2246.
Zhang, W.; Li, R.; Zeng, T.; Sun, Q.; Kumar, S.; Ye, J.; and Ji, S. 2016. Deep model based transfer and multi-task learning for biological image analysis. IEEE Transactions on Big Data.
Zhang, Y.; and Yang, Q. 2017. A Survey on Multi-Task Learning.