A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation
Jiangnan Li, Zheng Lin, Peng Fu, Qingyi Si, Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{lijiangnan, linzheng, fupeng, siqingyi, wangweiping}@iie.ac.cn

Abstract
Emotion Recognition in Conversation (ERC) is a more challenging task than conventional text emotion recognition. It can be regarded as a personalized and interactive emotion recognition task, which is supposed to consider not only the semantic information of the text but also the influences from speakers. The current method models speakers' interactions by building a relation between every two speakers. However, this fine-grained but complicated modeling is computationally expensive, hard to extend, and can only consider the local context. To address this problem, we simplify the complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies, without identifying every unique speaker for the targeted speaker. To better achieve this simplified interaction modeling of speakers in Transformer, which shows an excellent ability to settle long-distance dependencies, we design three types of masks and respectively utilize them in three independent Transformer blocks. The designed masks respectively model conventional context, Intra-Speaker dependency, and Inter-Speaker dependency. Furthermore, the different speaker-aware information extracted by the Transformer blocks contributes diversely to the prediction, and therefore we utilize the attention mechanism to automatically weight it. Experiments on two ERC datasets indicate that our model is efficacious in achieving better performance.
Introduction
Nowadays, intelligent machines that precisely capture speakers' emotions in conversations are gaining popularity, thus driving the development of Emotion Recognition in Conversation (ERC). ERC is a task to predict the emotion of the current utterance expressed by a specific speaker according to the context (Poria et al. 2019b), which is more challenging than conventional emotion recognition that only considers the semantic information of an independent utterance.

To precisely predict the emotion of a targeted utterance, both the semantic information of the utterance and the information provided by utterances in the context are critical. Nowadays, a number of works (Hazarika et al. 2018a,b; Majumder et al. 2019; Ghosal et al. 2019) demonstrate that the interactions between speakers can facilitate extracting information from contextual utterances. We denote this kind of information, obtained by modeling speakers' interactions, as speaker-aware contextual information.
Figure 1: (a) illustrates a conversation clip of 3 speakers, with the utterance in the yellow frame selected as the targeted utterance; (b) and (c) illustrate the relation graphs of u3, u4, and u5 under different dependencies: (b) the relational graph under Self and Inter-Speaker dependencies, where 7 relations are involved; (c) the relational graph under the simplified dependencies, where only 2 relations are involved.

To capture speaker-aware contextual information, the state-of-the-art model DialogueGCN (Ghosal et al. 2019) introduces Self and Inter-Speaker dependencies, which capture the influences from different speakers. As illustrated in Fig. 1 (a), Self and Inter-Speaker dependencies establish a specific relation between every two speakers and construct a fully connected relational graph. A Relational Graph Convolutional Network (RGCN) (Schlichtkrull et al. 2018) is then applied to process such a graph.

Although DialogueGCN can achieve excellent performance with Self and Inter-Speaker dependencies, this speaker modeling easily becomes complicated as the number of speakers increases. As shown in Fig. 1 (b), for a conversation clip with two speakers, the number of considered relations already reaches 7, and it can drastically increase with more speakers involved. Thus this complicated speaker modeling can hardly handle conditions where the number of speakers changes dynamically, and it is not flexible enough to be deployed in other models. In addition, an RGCN processing the fully connected graph with multiple relations requires a tremendous consumption of computation. This limitation leads DialogueGCN to consider only the local context in a conversation (Ghosal et al. 2019). Therefore, it is appealing to introduce a simple and general speaker modeling that is easy to extend to all scenarios and to realize in other models, so that the long-distance context becomes available.

To address the above problem, we propose a TRansforMer with Speaker Modeling (TRMSM). First, we simplify the Self and Inter-Speaker dependencies to a binary version, which only contains two relations: Intra-Speaker dependency and Inter-Speaker dependency. As illustrated in Fig. 1, for the speaker of the targeted utterance, Intra-Speaker dependency focuses on the influence from the same speaker, and Inter-Speaker dependency treats the other speakers as a whole group instead of building a relation between every two speakers. In this way, our simplified modeling is easy to extend in other models and can deal with scenes where the number of speakers changes dynamically, without introducing new relations between speakers.

Furthermore, with its ability to settle long-distance dependencies, Transformer (Vaswani et al. 2017) achieves excellent performance on a great number of Natural Language Processing (NLP) problems. To better model the long-distance contextual utterances in a conversation, we utilize a hierarchical Transformer with two levels: the sentence level and the dialogue level. At the sentence level, BERT (Devlin et al. 2019) encodes the semantic representation of a targeted utterance, and at the dialogue level, Transformer is used to capture information from contextual utterances. To better model our simplified dependencies in the dialogue-level Transformer, we design three masks: Conventional Mask for conventional context modeling, Intra-Speaker Mask for Intra-Speaker dependency, and Inter-Speaker Mask for Inter-Speaker dependency. To realize the functions of the masks, we deploy three independent Transformer blocks at the dialogue level, and the designed masks are respectively used in these Transformer blocks. As the different speaker-aware contextual information extracted by these Transformer blocks contributes diversely to the final prediction, we utilize the attention mechanism to automatically weight and fuse it. Besides, we also apply two other simple fusing methods, Add and Concatenation, to demonstrate the advancement of the attention.

Specifically, our contributions are concluded as follows:
• We simplify Self and Inter-Speaker dependencies to a binary version, so that the speaker interaction modeling can be extended to a hierarchical Transformer and the long-distance context can be considered.
• We design three types of masks to achieve speakers' interaction modeling in Transformer and utilize the attention mechanism to automatically pick up the important speaker-aware contextual information.
• We conduct experiments on two ERC datasets: IEMOCAP and MELD. Our method achieves state-of-the-art performance on both datasets on average.
Related Work
Two aspects are strongly related to our work: emotion recognition in conversation and the utilization of masks in Transformer.
Emotion recognition in conversation
Hierarchical structures based on RNNs (Jiao, Lyu, and King 2020; Jiao et al. 2019) or Transformer (Zhong, Wang, and Miao 2019; Li et al. 2020) are leveraged in ERC to capture contextual information. Besides contextual information, speaker information is proven to be important to ERC. Speakers can be regarded either as objects related to utterances or as additional information for utterances. As objects, speakers are involved in the graph of a conversation as nodes to interact with utterances (Zhang et al. 2019). As additional information, speaker information is modeled via utterances. Specifically, Hazarika et al. (2018b,a) employ GRUs and Memory Networks (Memnet) (Sukhbaatar et al. 2015) to model speakers' interactions in dyadic conversations, which is difficult to extend to multi-speaker conditions. Therefore, Majumder et al. (2019) generalize speakers as parties, track them with GRUs, and utilize the attention mechanism to gather interactive information in multi-speaker conversations. Even so, Ghosal et al. (2019) argue that Majumder et al. (2019) ignored the influences from other speakers and propose Self and Inter-Speaker dependencies to formalize interactions within and between speakers. However, this complicated modeling of speakers' interactions is difficult to apply in other models, thus requiring a simplified version.
Utilization of mask in Transformer
Masks in Transformer are utilized to mask the unattended elements in self-attention. Recently, well-designed masks have been leveraged in language modeling (Dong et al. 2019; Devlin et al. 2019; Radford et al. 2018) and conversation structure modeling (Zhu et al. 2020). Masks are flexible and convenient to implement, and we choose them to model the interactions of speakers in Transformer.
Methodology
In this section, we elaborate on the task definition and the structure of TRMSM, which is illustrated in Fig. 2. Our model contains 4 parts: the Sentence-Level Encoder, the Dialogue-Level Encoder, the Fusing Method, and the Classifier.
Task Definition
The ERC task includes $K$ emotions, whose set is $E = \{emo_1, emo_2, \ldots, emo_K\}$. Given a conversation $C = [u_1, u_2, \ldots, u_N]$ containing $N$ textual utterances, each utterance $u_n = [w_1, w_2, \ldots, w_{L_n}]$ within it is sequentially formed by $L_n$ words. Particularly, $M$ speakers, whose set is $SPK = \{spk_1, spk_2, \ldots, spk_M\}$, participate in the conversation. For each utterance $u_n$, an emotion label $e_n \in E$ and a speaker annotation $p_n \in SPK$ are assigned. The ERC task aims to predict the emotion of every utterance in $C$ with the information provided above.
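For concreteness, a minimal sketch of how a conversation instance under this definition might be represented is given below; the class and field names are illustrative rather than taken from the paper's implementation, and the example utterances are borrowed from Fig. 1.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    words: List[str]   # [w_1, ..., w_{L_n}]
    speaker: str       # speaker annotation p_n in SPK
    emotion: str       # gold emotion label e_n in E

# A conversation C = [u_1, ..., u_N]; the model predicts the emotion of every utterance in C.
conversation: List[Utterance] = [
    Utterance(["Hey", "!"], speaker="Phoebe", emotion="joy"),
    Utterance(["What", "are", "you", "doing", "?"], speaker="Chandler", emotion="neutral"),
]
```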
Sentence-Level Encoder

To encode a more informative and context-aware representation of a single utterance based on Transformer, we utilize a BERT encoder.
Figure 2: The structure of our proposed model, which is based on the Transformer structure. Our proposed masks are utilized in the Multi-Head Attention of the Dialogue-Level Encoder and are illustrated for 3 types: (a) Conventional, (b) Intra-Speaker, and (c) Inter-Speaker masks. The fusing methods include Attention, (i) Add, and (ii) Concatenation.

Limited by the maximum sequence length supported by BERT, we cannot input the concatenated sequence of all utterances in a conversation, whose length frequently exceeds 768 in cases of long conversations, to capture the global contextual information. Therefore, BERT is solely used to encode the sentence-level context within a single utterance. An utterance $u_n = [w_1, w_2, \ldots, w_{L_n}]$ is fed into BERT to obtain the contextualized representations of its words:

$W = \mathrm{BERT}(w_1, w_2, \ldots, w_{L_n})$ (1)

where $W \in \mathbb{R}^{L_n \times d_w}$ is the output of the top layer of BERT and $d_w$ is the dimension of the word representation. To obtain an utterance representation for $u_n$, a max-pooling operation followed by a projection is deployed:

$u_n = \mathrm{Linear}(\mathrm{Maxpooling}(W))$ (2)

where $u_n \in \mathbb{R}^{d_u}$ represents the utterance and $d_u$ is the dimension of the utterance representation. By processing every utterance in a conversation, we finally obtain the representation matrix $C \in \mathbb{R}^{N \times d_u}$.
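A minimal PyTorch sketch of this sentence-level encoder (Eqs. 1-2) is shown below, assuming the Hugging Face transformers library; the class name, the masking of padded positions before max-pooling, and the default dimension are our own illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SentenceEncoder(nn.Module):
    """Sentence-level encoder: BERT + max-pooling + linear projection (Eqs. 1-2)."""
    def __init__(self, d_u: int = 300, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.proj = nn.Linear(self.bert.config.hidden_size, d_u)  # d_w -> d_u

    def forward(self, input_ids, attention_mask):
        # W: (1, L_n, d_w) top-layer word representations of one utterance
        W = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mask padded positions with -inf so they never win the max-pooling
        W = W.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        return self.proj(W.max(dim=1).values)  # u_n: (1, d_u)
```

Running this encoder over all $N$ utterances of a conversation and stacking the outputs yields the matrix $C \in \mathbb{R}^{N \times d_u}$ consumed by the dialogue-level encoder.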
Dialogue-Level Encoder

At the dialogue level, we utilize three types of Transformer blocks: Conventional Blocks for conventional context modeling, Intra-Speaker Blocks for Intra-Speaker dependency, and Inter-Speaker Blocks for Inter-Speaker dependency. As all Transformer blocks share the same structure, we simply introduce the general process of the first layer of a Transformer block. Given the conversation matrix $C$ processed by the sentence-level encoder, to avoid the absence of positional information in $C$, an absolute positional embedding is added to every representation in $C$:

$C = C + PE(0:N)$ (3)

where $PE(0:N)$ has the same dimension as $C$.

Self-attention intuitively provides an interactive pattern for the contextual modeling of conversations. Taking advantage of the mechanism of self-attention, the targeted utterances can be processed in parallel: the targeted utterances are regarded as a query matrix and the contextual utterances act as a key matrix, so that every utterance simultaneously assesses how much information shall be obtained from every contextual utterance. In this way, $C$ is projected to a query matrix $Q \in \mathbb{R}^{N \times d_a}$, a key matrix $K \in \mathbb{R}^{N \times d_a}$, and a value matrix $V \in \mathbb{R}^{N \times d_a}$ by linear projections without bias: $[Q; K; V] = \mathrm{Linear}([C; C; C])$, where $[\,;\,]$ is the concatenating operation. Self-attention is calculated by:

$A(Q, K, V, M) = \mathrm{softmax}\left(\frac{(QK^{T}) * M}{\sqrt{d_a}}\right)V$ (4)

where $*$ denotes element-wise multiplication and $M \in \mathbb{R}^{N \times N}$ is the utilized mask, a square matrix whose non-infinite elements equal 1. We will introduce the different masks used by the diverse blocks later. Transformer employs multiple self-attention heads (Multi-Head Attention, MHA) to model different aspects of information, and the outputs of all heads are concatenated and projected to $O$ with the same size as $C$. After the attention module, a position-wise Feed-Forward Network (FFN) module is deployed to produce the output $F \in \mathbb{R}^{N \times d_u}$. MHA and FFN are both residually connected. Therefore, the output $O_C$ of the first layer of the Transformer is:

$A' = \mathrm{LayerNorm}(O + C)$ (5)
$F = \max(0, A'W_1 + b_1)W_2 + b_2$ (6)
$O_C = \mathrm{LayerNorm}(F + A')$ (7)

$O_C$ acts as the input of the second Transformer layer, and by this analogy we obtain the final output $O_C \in \mathbb{R}^{N \times d_u}$ after multiple layers. Therefore, the outputs of our 3 blocks can be denoted as $O_C^{C}$ for the Conventional Block, $O_C^{RA}$ for the Intra-Speaker Block, and $O_C^{ER}$ for the Inter-Speaker Block. Due to the limited space in this paper, more details about Transformer can be reviewed in Vaswani et al. (2017).
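A simplified, single-head PyTorch sketch of one such dialogue-level layer (Eqs. 4-7) is given below. Eq. (4) multiplies the score matrix by a mask whose blocked entries are -INF; here we realize the same effect by filling blocked positions with -inf before the softmax. The feed-forward width and default dimensions are assumed values, and the positional embedding of Eq. (3) is assumed to have been added to $C$ beforehand.

```python
import math
import torch
import torch.nn as nn

class MaskedTransformerLayer(nn.Module):
    """One dialogue-level layer: masked (single-head) self-attention plus FFN,
    both residually connected with LayerNorm (Eqs. 4-7), as a simplified sketch."""
    def __init__(self, d_u: int = 300, d_a: int = 300, d_ff: int = 600):
        super().__init__()
        self.qkv = nn.Linear(d_u, 3 * d_a, bias=False)  # joint projection to Q, K, V (no bias)
        self.out = nn.Linear(d_a, d_u)
        self.norm1 = nn.LayerNorm(d_u)
        self.norm2 = nn.LayerNorm(d_u)
        self.ffn = nn.Sequential(nn.Linear(d_u, d_ff), nn.ReLU(), nn.Linear(d_ff, d_u))

    def forward(self, C, mask):
        # C: (N, d_u) utterance representations; mask: (N, N) with 1 = attend, 0 = blocked
        Q, K, V = self.qkv(C).chunk(3, dim=-1)
        scores = Q @ K.transpose(0, 1) / math.sqrt(K.size(-1))
        scores = scores.masked_fill(mask == 0, float("-inf"))  # blocked positions get -INF
        O = self.out(torch.softmax(scores, dim=-1) @ V)        # attention output, same size as C
        A = self.norm1(O + C)                                  # Eq. (5)
        return self.norm2(self.ffn(A) + A)                     # Eqs. (6)-(7)
```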
Masks can prompt the Transformer blocks to realize their different functions, and we introduce how the 3 masks are formed:

Conventional Mask sets all of its elements to 1, which means that every targeted utterance can get access to all the contextual utterances. The Conventional Mask is applied in the multi-head attention of the Conventional Blocks and is illustrated in Fig. 2 (a). We denote the Conventional Mask as $M_C$.

Intra-Speaker Mask only considers those contextual utterances tagged with $p_n$, the speaker tag of the targeted utterance. Therefore, based on $M_C$, the Intra-Speaker Mask $M_{RA}$ sets the positions representing other speakers to -INF. The Intra-Speaker Mask is illustrated in Fig. 2 (b).

Inter-Speaker Mask regards the speakers different from the one of the targeted utterance as one unit, due to our simplification. Therefore, based on $M_C$, the Inter-Speaker Mask $M_{ER}$ sets the positions whose speaker is the same as the speaker tag of the targeted utterance to -INF. The Inter-Speaker Mask is illustrated in Fig. 2 (c).
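The following sketch shows one way these three masks could be derived from the per-utterance speaker annotations; the 1/0 encoding is converted to -INF inside the attention (as in the layer sketch above), and the function name and integer speaker ids are illustrative.

```python
import torch

def build_masks(speakers):
    """Build the three (N, N) masks from the speaker annotations p_1..p_N.
    Entry [i, j] is 1 if utterance i may attend to utterance j, else 0."""
    p = torch.tensor(speakers)                              # integer speaker ids, one per utterance
    same = p.unsqueeze(0) == p.unsqueeze(1)                 # same[i, j] = True iff p_i == p_j
    conventional = torch.ones_like(same, dtype=torch.long)  # M_C: attend to every utterance
    intra = same.long()                                     # M_RA: only utterances of the same speaker
    inter = (~same).long()                                  # M_ER: only utterances of other speakers
    return conventional, intra, inter

# Speakers of u1..u5 in Fig. 1 (Phoebe, Joey, Chandler, Phoebe, Chandler) mapped to ids 0, 1, 2, 0, 2.
m_c, m_ra, m_er = build_masks([0, 1, 2, 0, 2])
```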
Fusing Method

As the blocks produce different outputs that carry various speaker-aware contextual information, we utilize 3 simple methods to fuse the information.
Add
As illustrated in Fig. 2 (i), Add regards the contributions of all block outputs equally. Therefore, the fusing representation is:

$R = O_C^{C} + O_C^{RA} + O_C^{ER}$ (8)
Concatenation
Concatenation (illustrated in Fig. 2 (ii)) is also a simple but effective method to combine different information. Different from the Add operation, Concatenation can implicitly choose the information that is important for the final prediction, owing to the following linear projection of the classifier. Therefore, the fusing representation $R \in \mathbb{R}^{N \times 3d_u}$ is:

$R = \mathrm{Concat}(O_C^{C}, O_C^{RA}, O_C^{ER}, dim = 1)$ (9)
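Both of these parameter-free fusing methods reduce to one-line tensor operations; a sketch with assumed tensor shapes follows.

```python
import torch

# O_c, O_ra, O_er: (N, d_u) outputs of the Conventional, Intra-Speaker, and Inter-Speaker blocks.
def fuse_add(O_c, O_ra, O_er):
    return O_c + O_ra + O_er                    # Eq. (8): R has shape (N, d_u)

def fuse_concat(O_c, O_ra, O_er):
    return torch.cat([O_c, O_ra, O_er], dim=1)  # Eq. (9): R has shape (N, 3 * d_u)
```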
Attention
As the contributions of the different speaker parties are diversely weighted, it is desirable that the model automatically chooses the more important information. Therefore, we utilize the widely used attention mechanism (Lian et al. 2019) to achieve this goal.
Dataset   | Num. of dialogues (train / dev / test) | Num. of utterances (train / dev / test) | Avg. dialogue length (train/dev, test)
IEMOCAP   | 120 / -- / 31                          | 5810 / -- / 1623                        | 48, 52
MELD      | 1039 / 114 / 280                       | 9989 / 1109 / 2610                      | 10, 9
Table 1: Statistics about IEMOCAP and MELD.

The attention mechanism takes the 3 block output representations as inputs and produces an attention score for each representation. For simplicity, we take the representations $O_{C_i}^{C} \in \mathbb{R}^{1 \times d_u}$, $O_{C_i}^{RA} \in \mathbb{R}^{1 \times d_u}$, and $O_{C_i}^{ER} \in \mathbb{R}^{1 \times d_u}$ of utterance $i$ as an example. The attention score and the fusing representation are computed as:

$O_i = \mathrm{Concat}(O_{C_i}^{C}, O_{C_i}^{RA}, O_{C_i}^{ER}, dim = 0)$ (10)
$\alpha = \mathrm{softmax}(w_F O_i^{T})$ (11)
$R_i = \alpha O_i$ (12)

where $O_i \in \mathbb{R}^{3 \times d_u}$ is the concatenated representation, $\alpha \in \mathbb{R}^{1 \times 3}$ is the attention score, $w_F \in \mathbb{R}^{1 \times d_u}$ is a trainable parameter, and $R_i \in \mathbb{R}^{1 \times d_u}$ is the fusing representation. Finally, the fusing representations of all utterances are concatenated as $R \in \mathbb{R}^{N \times d_u}$.
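A vectorized PyTorch sketch of this attention-based fusing (Eqs. 10-12), computed for all $N$ utterances at once, might look as follows; the module name and default dimension are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-based fusing (Eqs. 10-12): score the three block outputs per utterance
    and take their weighted sum."""
    def __init__(self, d_u: int = 300):
        super().__init__()
        self.w_f = nn.Linear(d_u, 1, bias=False)  # trainable w_F in R^{1 x d_u}

    def forward(self, O_c, O_ra, O_er):
        O = torch.stack([O_c, O_ra, O_er], dim=1)                # (N, 3, d_u), Eq. (10) per utterance
        alpha = torch.softmax(self.w_f(O).squeeze(-1), dim=1)    # (N, 3) attention scores, Eq. (11)
        return (alpha.unsqueeze(-1) * O).sum(dim=1)              # (N, d_u) fused representations, Eq. (12)
```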
Classifier

With the sentence-level and dialogue-level contextual information fully modeled by the encoders, the dialogue-level output is fed to a classifier that predicts the final emotion distributions:

$\hat{Y} = \mathrm{softmax}(RW_{clf} + b_{clf})$ (13)

where $W_{clf} \in \mathbb{R}^{d_u \times K}$ ($W_{clf} \in \mathbb{R}^{3d_u \times K}$ for Concatenation), $b_{clf} \in \mathbb{R}^{K}$, and $\hat{Y}$ is the matrix of emotion distributions of all utterances in conversation $C$. The model is trained with a cross-entropy loss function, which is calculated as:

$\mathcal{L} = -\frac{1}{\sum_{l=1}^{T} N_l} \sum_{l=1}^{T} \sum_{i=1}^{N_l} \sum_{e=1}^{K} y_i^{e} \log(\hat{Y}_i^{e})$ (14)

where $y_i$ is the one-hot vector denoting the emotion label of utterance $i$ in a conversation, $e$ indexes the emotion dimensions, $N_l$ denotes the length of the $l$-th conversation, and $T$ denotes the number of conversations in the dataset.
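A compact sketch of the classifier and the training objective (Eqs. 13-14) follows; PyTorch's cross_entropy applies the softmax internally, and averaging over utterances realizes the normalization by the total number of utterances. The class name and default sizes are assumptions.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Linear classifier over the fused representations (Eq. 13)."""
    def __init__(self, d_u: int = 300, num_emotions: int = 6):
        super().__init__()
        self.clf = nn.Linear(d_u, num_emotions)  # use 3 * d_u as input size for the Concatenation variant

    def forward(self, R):        # R: (N, d_u) fused utterance representations
        return self.clf(R)       # logits; softmax over them gives the rows of \hat{Y}

def erc_loss(logits, labels):
    # labels: (N,) gold emotion indices; mean reduction matches the 1 / sum_l N_l factor in Eq. (14)
    return nn.functional.cross_entropy(logits, labels, reduction="mean")
```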
Experimental Setup

Datasets
We evaluate our models on two datasets, IEMOCAP (Busso et al. 2008) and MELD (Poria et al. 2019a); both are multi-modal datasets that contain three modalities. We solely consider the textual modality, following Ghosal et al. (2019). Statistics about the datasets are shown in Tab. 1.
• IEMOCAP: This dataset contains a series of dyadic conversations between 10 unique speakers. 6 categories of emotions are considered in our experiments: neutral, happy, sad, angry, excited, and frustrated. Following Majumder et al. (2019), the training set is split into a new training set and a validation set by the ratio of 80:20.
• MELD: This dataset contains over 1400 multi-speaker conversations collected from the TV series Friends. Emotions in this dataset are annotated into 7 categories: neutral, joy, surprise, anger, disgust, sadness, and fear.

Table 2: The results of our models on IEMOCAP. The weighted-F1 (wF1) score is used as the metric. † means the result is quoted from Ghosal et al. (2019).

Compared Methods
To distinguish our models with different fusing methods, we construct 3 model variants: TRMSM-Add, TRMSM-Cat, and TRMSM-Att. To show the importance of the speaker-related information, we also construct our model without Intra-Speaker Blocks and Inter-Speaker Blocks, which we denote as TRM. Besides, our models are compared with the baselines below:
• CMN (Hazarika et al. 2018b): CMN is proposed to model dyadic conversations, using two sets of RNNs and Memnets to respectively track the different speakers.
• DialogueRNN (Majumder et al. 2019): DialogueRNN is the first state-of-the-art model to address ERC with multiple speakers. RNNs are deployed to track the speakers' states and the global state during conversations.
• AGHMN (Jiao, Lyu, and King 2020): AGHMN is the state-of-the-art unidirectional model in real-time ERC. To retain the positional information from its hierarchical structure, a GRU is constructed for the attention mechanism.
• DialogueGCN (Ghosal et al. 2019): To fully model the interactive information between speakers, DialogueGCN models detailed dependencies between speakers using a Relational GCN.
• KET (Zhong, Wang, and Miao 2019): KET introduces the Transformer structure to model context in conversations. It also proposes an effective graph attention to extract information from commonsense knowledge bases.
• BERT (Devlin et al. 2019): A vanilla BERT followed by a classifier is fine-tuned to show the importance of context.
• Other baselines: Both based on CNNs to extract semantic information, scLSTM (Poria et al. 2017) utilizes an LSTM (Hochreiter and Schmidhuber 1997) and Memnet (Sukhbaatar et al. 2015) utilizes a memory network to model the conversational context.
Implementation
For the BERT-based sentence-level encoder, an uncased BERT-base model (https://github.com/huggingface/transformers) is adopted. For the dialogue-level encoder, the dimension of the dialogue-level representation is set to 300 for IEMOCAP and 200 for MELD; the number of Transformer layers is set to 6 for IEMOCAP and 1 for MELD; the number of heads is set to 6 for IEMOCAP and 4 for MELD; dropout is also applied. Additionally, the models are trained with AdamW (Kingma and Ba 2015; Loshchilov and Hutter 2019) for 10000 steps with 1000 steps of warm-up, and the learning rate, which decays linearly after the warm-up, is set to 1e-5 for IEMOCAP and 8e-6 for MELD. Due to the parallel prediction of all utterances in one conversation, the batch size is set to 1, following Jiao, Lyu, and King (2020). Besides, DialogueGCN was originally trained with a 90:10 data split on IEMOCAP, so for a fair comparison we re-run DialogueGCN with the 80:20 data split using its open-source code (https://github.com/declare-lab/conv-emotion/tree/master/DialogueGCN). All of our reported results are the average values of 5 runs.
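Under these settings, the optimization loop could be sketched as follows (using the IEMOCAP values); `model` and `loader` are hypothetical stand-ins for the TRMSM model and a per-conversation data iterator, and the scheduler comes from the transformers library.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` and `loader` are assumed to exist: TRMSM and an iterator yielding one conversation per step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)        # 8e-6 for MELD
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=10000)   # linear decay after warm-up

for step, conversation in zip(range(10000), loader):              # batch size 1: one conversation per step
    loss = model(conversation)                                    # assumed to return the cross-entropy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```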
Results and Discussions

Overall Results
For IEMOCAP, the weighted-F1 (wF1) score is used as the metric. However, the class distribution of MELD is severely imbalanced, so the weighted-F1 score alone is not sufficient for MELD. To balance the contributions of large classes and small classes, we follow Zhang et al. (2020) and also use the average of the macro F1 score and the micro F1 score as a metric, calculated as $mF1 = (F_{macro} + F_{micro}) / 2$.

For IEMOCAP, as shown in Tab. 2, BERT attains a wF1 of 54.01, which is substantially worse than our models and most state-of-the-art models that consider the dialogue-level context. This result may indicate that IEMOCAP contains a considerable number of utterances that cannot be predicted depending only on the semantic information, without the conversational context.
Figure 3: The results of models with different blocks (wF1 for IEMOCAP; mF1 for MELD).

Furthermore, TRMSM-Att outperforms AGHMN by 2.24 wF1 and DialogueGCN (80:20) by 3.28 wF1, which benefits from the powerful Transformer and from our speaker modeling with long-distance information considered. Regarding individual emotions, TRMSM-Att achieves the best F1 on Frustrated, and TRMSM-Add achieves the best F1 on Neutral. Besides, our models attain the second- or third-highest results among the other emotions. This demonstrates that our models are competitive in achieving comprehensive performance.

Table 3: The results of our models on MELD. MELD uses the weighted-F1 (wF1) score, and the average value (mF1) of Macro-F1 and Micro-F1, as the metrics. ♦ means the result is quoted from Jiao, Lyu, and King (2020).

For MELD, as shown in Tab. 3, BERT outperforms the other state-of-the-art models by a great margin, which indicates the importance of the external knowledge brought by BERT. Compared with BERT, TRM attains only marginally better results, which may be attributed to the limited conversational contextual information in MELD. To confirm this, comparing the results of CNN and scLSTM (based on CNN), we can notice that the improvement is also limited (CNN achieves 55.02 wF1 on MELD, as reported by Poria et al. (2019b)). Although MELD provides limited contextual information in conversations, TRMSM-Att still outperforms BERT by 1.29 wF1 and 1.17 mF1, which indicates the effectiveness of our model in capturing such information. Regarding individual emotions, BERT beats the other state-of-the-art models by a great margin in such an imbalanced circumstance, and TRMSM-Att attains the best F1 on 4 emotions, including the large classes Joy, Anger, and Surprise, and the small class Disgust. This demonstrates that BERT can alleviate data imbalance and that our model can take advantage of this feature.

For both datasets, all TRMSM variants outperform TRM, which shows the importance of speaker-aware contextual information. TRMSM-Att and TRMSM-Cat outperform TRMSM-Add, which indicates that the different aspects of speaker information need to be treated differently. TRMSM-Att outperforming TRMSM-Cat demonstrates that automatically and explicitly picking up speaker-related information is better than the implicit way.

Figure 4: wF1 of TRMSM-Att and DialogueGCN with different ranges of context; all means using the global context.
Model Analysis
Ablation Study
To better understand the influence of the masks on our models, we report the results of models with the Transformer blocks of different masks removed, on IEMOCAP and MELD. In this part, we denote the Conventional Mask as CM and the Intra-Speaker and Inter-Speaker Masks together as SM. Accordingly, TRMSM w/o SM is equivalent to TRM. As seen in Fig. 3, on both datasets, TRMSM w/o CM (solely applying SM) achieves better performance than TRMSM w/o SM (solely applying CM). We attribute this to the fact that speaker modeling does not discard contextual information from conversations; on the contrary, it guides the model to extract information that is more effective for the final prediction. Furthermore, TRMSM outperforms both TRMSM w/o CM and TRMSM w/o SM, which demonstrates that all of our designed masks are critical to achieving better performance.
Effect of Range of Context
To find out the influence of the range of context on our model, we train TRMSM-Att with different ranges of available context on IEMOCAP and take the results for DialogueGCN from Ghosal et al. (2019). We utilize different windows $(-x, y)$ to limit the context, where $x$ and $y$ are respectively the number of utterances in the prior context and the post context. As illustrated in Fig. 4, both models improve as the window widens. For DialogueGCN, (-10, 10) is the maximum context window, and therefore it cannot get access to the long-distance context.
M M M
Oh, thank God... on the phone for half an hour ... talk to a freaking human being . I got it in the email
Frustrated
Frustrated
TRMSM:
Fru.
TRM:
Ang.
TRMSM:
Fru.
TRM:
Neu. yeah
Frustrated
TRMSM:
Fru.
TRM:
Neu.
Intra-s. Inter-s.
Intra-speaker Block Attention :Speaker:Attention score
All right, all right, I’ll ... my girlfriend. But I'm just doing it for you guys.
Neutral
TRMSM:
Neu.
TRM:
Sad. Yeah, you should, really.
Neutral
TRMSM:
Neu.
TRM:
Neu.
So big deal, so Joey's had a lot of girlfriends, it doesn’t ...
Neutral
TRMSM:
Neu.
TRM:
Neu.
Ch Ch Ch Ch Ch Ro RoMo Mo RaRa Mo
Intra-s. Inter-s.
Attention from fusing methodInter-speaker Block Attention :Speaker :Attention score (a) (b)
M: Male F: Female
Ch: Chandler
Ro: RossMo: Monica
Ra: Rachel
Attention from fusing method
On the contrary, the performance of TRMSM-Att is further improved with all of the context available. This indicates that the local contextual information is critical for the prediction, and that the long-distance information is also important for contextual modeling to further improve the performance.

Figure 5: Heatmaps of attention from the fusing method and of the self-attention in the Intra- and Inter-Speaker Blocks for the targeted utterances (whose speakers are marked in yellow). Labels of utterances are tagged below the utterances. Predictions of TRMSM and TRM are marked in green when correct and in red when mistaken.

Figure 6: The F1 score on every emotion class by TRMSM-Att and TRM; (a)-(c) for IEMOCAP and (d)-(f) for MELD, with 1, 3, and 6 layers respectively.
Effect of Number of Layers
We study the effect of the number of layers on our model for the different datasets. Fig. 6 illustrates the radar graphs of the F1 scores per emotion in IEMOCAP and MELD for TRM and TRMSM-Att. As the number of layers increases, the F1 scores of the emotions in IEMOCAP generally expand. In MELD, however, increasing the number of layers gradually hurts the performance, dropping the F1 to 0 on the emotions Fear and Disgust, which are the classes with the fewest data. We think the reason may be that MELD suffers from data imbalance, and increasing the number of layers leads to more severe overfitting on the small classes. For data imbalance, methods like re-balancing can be applied to alleviate it. Re-balancing is out of the scope of this paper, and our future work will study the data imbalance of ERC.
Case Study
To better understand how our model captures Intra-Speaker and Inter-Speaker dependencies, we illustrate two conversation clips ending with the targeted utterances, so that the targeted utterances can only refer to the prior context. We choose TRMSM-Att without Conventional Blocks so that only the speaker-related blocks are considered. Specifically, we illustrate heatmaps of the attention from the fusing method and of the self-attention in the Transformer blocks (we average the attention scores of all self-attention heads in the top layer of the Transformer). For simplicity, we denote the attention from the fusing method as FAtt.

In the scene of Fig. 5 (a), the speaker M stays frustrated throughout the conversation, and F, who is in a neutral state, has little influence on M. Therefore, FAtt pays more attention to Intra-Speaker dependency so that the Intra-Speaker Blocks can extract information from M himself. We can see from the heatmap that the targeted utterance "yeah" assigns the highest score to the farthest contextual utterance, whose emotion is also frustration and which is out of the range of context that DialogueGCN can refer to. In a sense, this indicates the importance of long-distance information.

In the condition of Fig. 5 (b), the speakers in this conversation basically stay in a neutral state, except that Chandler shows other emotions like anger and surprise before the targeted utterance. Although the targeted utterance by Chandler shows slight sadness from a semantic view, it is supposed to be predicted as neutral according to the conversational context. Specifically, FAtt gives the Inter-Speaker Blocks a higher score, and the self-attention in the Inter-Speaker Blocks extracts information from the neutral utterances of the other speakers. This case indicates the effectiveness of our model in extracting inter-speaker information.
Conclusion
In this work, we simplify the Self and Inter-Speaker dependencies to a binary version. To achieve this simplified modeling of speakers' interactions, we design three masks: the Conventional Mask, the Intra-Speaker Mask, and the Inter-Speaker Mask. These masks are utilized in the self-attention modules of the second-level Transformer blocks of a hierarchical Transformer. As the speaker-aware information extracted with different masks contributes diversely to the prediction, the attention mechanism is utilized to weight and fuse it. Finally, our model achieves state-of-the-art results on 2 ERC datasets, and further analysis shows that our model is efficacious for ERC.

References
Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Evaluation.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of NAACL-HLT, 4171–4186.
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In Proc. of NeurIPS, 13042–13054.
Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; and Gelbukh, A. F. 2019. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proc. of EMNLP-IJCNLP, 154–164.
Hazarika, D.; Poria, S.; Mihalcea, R.; Cambria, E.; and Zimmermann, R. 2018a. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. In Proc. of EMNLP, 2594–2604.
Hazarika, D.; Poria, S.; Zadeh, A.; Cambria, E.; Morency, L.; and Zimmermann, R. 2018b. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. In Proc. of NAACL-HLT, 2122–2132.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation.
Jiao, W.; Lyu, M. R.; and King, I. 2020. Real-Time Emotion Recognition via Attention Gated Hierarchical Memory Network. In Proc. of AAAI.
Jiao, W.; Yang, H.; King, I.; and Lyu, M. R. 2019. HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition. In Proc. of NAACL-HLT, 397–406.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Proc. of ICLR.
Li, Q.; Wu, C.; Zheng, K.; and Wang, Z. 2020. Hierarchical Transformer Network for Utterance-level Emotion Recognition. arXiv preprint arXiv:2002.07551.
Lian, Z.; Tao, J.; Liu, B.; and Huang, J. 2019. Conversational Emotion Analysis via Attention Mechanisms. In Proc. of Interspeech, 1936–1940.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In Proc. of ICLR.
Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A. F.; and Cambria, E. 2019. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In Proc. of AAAI, 6818–6825.
Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; and Morency, L. 2017. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proc. of ACL, 873–883.
Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; and Mihalcea, R. 2019a. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proc. of ACL, 527–536.
Poria, S.; Majumder, N.; Mihalcea, R.; and Hovy, E. H. 2019b. Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances. IEEE Access 7: 100943–100953.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
Schlichtkrull, M. S.; Kipf, T. N.; Bloem, P.; van den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling Relational Data with Graph Convolutional Networks. In Proc. of ESWC, 593–607.
Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015. End-To-End Memory Networks. In Proc. of NeurIPS, 2440–2448.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Proc. of NeurIPS, 5998–6008.
Zhang, B.; Yang, M.; Li, X.; Ye, Y.; Xu, X.; and Dai, K. 2020. Enhancing Cross-target Stance Detection with Transferable Semantic-Emotion Knowledge. In Proc. of ACL, 3188–3197.
Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; and Zhou, G. 2019. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proc. of IJCAI, 5415–5421.
Zhong, P.; Wang, D.; and Miao, C. 2019. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proc. of EMNLP-IJCNLP, 165–176.
Zhu, H.; Nan, F.; Wang, Z.; Nallapati, R.; and Xiang, B. 2020. Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer. In Proc. of AAAI.