A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation
Jiangnan Li, Zheng Lin, Peng Fu, Qingyi Si, Weiping Wang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{lijiangnan, linzheng, fupeng, siqingyi, wangweiping}@iie.ac.cn

Abstract
Emotion Recognition in Conversation (ERC) is a more challenging task than conventional text emotion recognition. It can be regarded as a personalized and interactive emotion recognition task, which is supposed to consider not only the semantic information of the text but also the influences from speakers. The current method models speakers' interactions by building a relation between every two speakers. However, this fine-grained but complicated modeling is computationally expensive, hard to extend, and can only consider the local context. To address this problem, we simplify the complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies, without identifying every unique speaker for the targeted speaker. To better achieve this simplified interaction modeling of speakers in Transformer, which shows an excellent ability to settle long-distance dependencies, we design three types of masks and respectively utilize them in three independent Transformer blocks. The designed masks respectively model conventional context, Intra-Speaker dependency, and Inter-Speaker dependency. Furthermore, the different speaker-aware information extracted by the Transformer blocks contributes diversely to the prediction, and therefore we utilize the attention mechanism to automatically weight it. Experiments on two ERC datasets indicate that our model is efficacious in achieving better performance.
Introduction
Nowadays, intelligent machines that precisely capture speakers' emotions in conversations are gaining popularity, thus driving the development of Emotion Recognition in Conversation (ERC). ERC is a task to predict the emotion of the current utterance expressed by a specific speaker according to the context (Poria et al. 2019b), which is more challenging than conventional emotion recognition that only considers the semantic information of an independent utterance.

To precisely predict the emotion of a targeted utterance, both the semantic information of the utterance and the information provided by utterances in the context are critical. Nowadays, a number of works (Hazarika et al. 2018a,b; Majumder et al. 2019; Ghosal et al. 2019) demonstrate that the interactions between speakers can facilitate extracting information from contextual utterances. We denote this kind of information, obtained by modeling speakers' interactions, as speaker-aware contextual information.
Figure 1: (a) illustrates a conversation clip of 3 speakers, with the utterance in the yellow frame selected as the targeted utterance; (b) and (c) illustrate the relation graphs of u3, u4, and u5 under different dependencies: (b) the relational graph under Self and Inter-Speaker dependencies, where 7 relations are involved; (c) the relational graph under the simplified dependencies, where only 2 relations are involved.

To capture speaker-aware contextual information, the state-of-the-art model DialogueGCN (Ghosal et al. 2019) introduces Self and Inter-Speaker dependencies, which capture the influences from different speakers. As illustrated in Fig. 1 (a), Self and Inter-Speaker dependencies establish a specific relation between every two speakers and construct a fully connected relational graph. A Relational Graph Convolutional Network (RGCN) (Schlichtkrull et al. 2018) is then applied to process such a graph.

Although DialogueGCN can achieve excellent performance with Self and Inter-Speaker dependencies, this speaker modeling easily becomes complicated as the number of speakers increases. As shown in Fig. 1 (b), for a conversation clip with two speakers, the number of considered relations already reaches 7, and it can drastically increase with more speakers involved. Thus this complicated speaker modeling can hardly handle conditions where the number of speakers changes dynamically, and it is not flexible enough to be deployed in other models. In addition, an RGCN processing the fully connected graph with multiple relations requires a tremendous consumption of computation. This limitation leads DialogueGCN to consider only the local context in a conversation (Ghosal et al. 2019). Therefore, it is appealing to introduce a simple and general speaker modeling that is easy to extend to all scenarios and to realize in other models, so that the long-distance context becomes available.

To address the above problem, we propose a TRansforMer with Speaker Modeling (TRMSM). First, we simplify the Self and Inter-Speaker dependencies to a binary version, which only contains two relations: Intra-Speaker dependency and Inter-Speaker dependency. As illustrated in Fig. 1, for the speaker of the targeted utterance, Intra-Speaker dependency focuses on the influence from the same speaker, and Inter-Speaker dependency treats the other speakers as a whole group instead of building a relation between every two speakers. In this way, our simplified modeling is easy to extend in other models and can deal with scenes where the number of speakers changes dynamically, without introducing new relations between speakers.

Furthermore, with its ability to settle long-distance dependencies, Transformer (Vaswani et al. 2017) achieves excellent performance on a great number of Natural Language Processing (NLP) problems. To better model the long-distance contextual utterances in a conversation, we utilize a hierarchical Transformer with two levels: the sentence level and the dialogue level. At the sentence level, BERT (Devlin et al. 2019) encodes the semantic representation of a targeted utterance, and at the dialogue level, Transformer is used to capture information from contextual utterances. To better model our simplified dependencies in the dialogue-level Transformer, we design three masks: Conventional Mask for conventional context modeling, Intra-Speaker Mask for Intra-Speaker dependency, and Inter-Speaker Mask for Inter-Speaker dependency. To realize the functions of the masks, we deploy three independent Transformer blocks at the dialogue level, and the designed masks are respectively used in these Transformer blocks. As the different speaker-aware contextual information extracted by these Transformer blocks contributes diversely to the final prediction, we utilize the attention mechanism to automatically weight and fuse it. Besides, we also apply two other simple fusing methods, Add and Concatenation, to demonstrate the advancement of the attention.

Specifically, our contributions are concluded as follows:
• We simplify Self and Inter-Speaker dependencies to a binary version, so that the speaker interaction modeling can be extended to a hierarchical Transformer and the long-distance context can be considered.
• We design three types of masks to achieve speakers' interaction modeling in Transformer and utilize the attention mechanism to automatically pick up the important speaker-aware contextual information.
• We conduct experiments on two ERC datasets: IEMOCAP and MELD. Our method achieves state-of-the-art performance on both datasets on average.
Related Work
Two aspects are strongly related to our work: emotion recognition in conversation and the utilization of masks in Transformer.
Emotion recognition in conversation
Hierarchical structures based on RNNs (Jiao, Lyu, and King 2020; Jiao et al. 2019) or Transformer (Zhong, Wang, and Miao 2019; Li et al. 2020) are leveraged in ERC to capture contextual information. Besides contextual information, speaker information is proven to be important to ERC. Speakers can be regarded either as objects related to utterances or as additional information for utterances. As objects, speakers are involved in the graph of a conversation as nodes to interact with utterances (Zhang et al. 2019). As additional information, speaker information is modeled via utterances. Specifically, Hazarika et al. (2018b,a) employ GRUs and Memory Networks (Memnet) (Sukhbaatar et al. 2015) to model speakers' interactions in dyadic conversations, which is difficult to extend to multi-speaker conditions. Therefore, Majumder et al. (2019) generalize speakers as parties, track them with GRUs, and utilize the attention mechanism to gather interactive information in multi-speaker conversations. Even so, Ghosal et al. (2019) argue that Majumder et al. (2019) ignored the influences from other speakers and propose Self and Inter-Speaker dependencies to formalize interactions within and between speakers. However, this complicated modeling of speakers' interactions is difficult to apply in other models, thus requiring a simplified version.
Utilization of mask in Transformer
Masks in Transformer are utilized to mask the unattended elements in self-attention. Recently, well-designed masks have been leveraged in language modeling (Dong et al. 2019; Devlin et al. 2019; Radford et al. 2018) and conversation structure modeling (Zhu et al. 2020). Masks are flexible and convenient to implement, and we choose them to model the interactions of speakers in Transformer.
Methodology
In this section, we elaborate on the task definition and the structure of TRMSM, which is illustrated in Fig. 2. Our model contains 4 parts: the Sentence-Level Encoder, the Dialogue-Level Encoder, the Fusing Method, and the Classifier.
Task Definition
The ERC task includes $K$ emotions, whose set is $E = \{emo_1, emo_2, \ldots, emo_K\}$. Given a conversation $C = [u_1, u_2, \ldots, u_N]$ containing $N$ textual utterances, each utterance $u_n = [w_1, w_2, \ldots, w_{L_n}]$ within it is sequentially formed by $L_n$ words. Particularly, $M$ speakers, whose set is $SPK = \{spk_1, spk_2, \ldots, spk_M\}$, participate in the conversation. For each utterance $u_n$, an emotion label $e_n \in E$ and a speaker annotation $p_n \in SPK$ are assigned. The ERC task aims to predict the emotion of every utterance in $C$ with the information provided above.
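For concreteness, a minimal sketch of how a conversation instance under this definition might be represented is given below; the class and field names are illustrative rather than taken from the paper's implementation, and the example utterances are borrowed from Fig. 1.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    words: List[str]   # [w_1, ..., w_{L_n}]
    speaker: str       # speaker annotation p_n in SPK
    emotion: str       # gold emotion label e_n in E

# A conversation C = [u_1, ..., u_N]; the model predicts the emotion of every utterance in C.
conversation: List[Utterance] = [
    Utterance(["Hey", "!"], speaker="Phoebe", emotion="joy"),
    Utterance(["What", "are", "you", "doing", "?"], speaker="Chandler", emotion="neutral"),
]
```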
Sentence-Level Encoder

To encode a more informative and context-aware representation of a single utterance based on Transformer, we utilize a BERT encoder.
Figure 2: The structure of our proposed model, which is based on the Transformer structure. Our proposed masks are utilized in the Multi-Head Attention of the Dialogue-Level Encoder and are illustrated for 3 types: (a) Conventional, (b) Intra-Speaker, and (c) Inter-Speaker masks. The fusing methods include Attention, (i) Add, and (ii) Concatenation.

Limited by the maximum sequence length supported by BERT, we cannot input the concatenated sequence of all utterances in a conversation, whose length frequently exceeds 768 in cases of long conversations, to capture the global contextual information. Therefore, BERT is solely used to encode the sentence-level context within a single utterance. An utterance $u_n = [w_1, w_2, \ldots, w_{L_n}]$ is fed into BERT to obtain the contextualized representations of its words:

$W = \mathrm{BERT}(w_1, w_2, \ldots, w_{L_n})$ (1)

where $W \in \mathbb{R}^{L_n \times d_w}$ is the output of the top layer of BERT and $d_w$ is the dimension of the word representation. To obtain an utterance representation for $u_n$, a max-pooling operation followed by a projection is deployed:

$u_n = \mathrm{Linear}(\mathrm{Maxpooling}(W))$ (2)

where $u_n \in \mathbb{R}^{d_u}$ represents the utterance and $d_u$ is the dimension of the utterance representation. By processing every utterance in a conversation, we finally obtain the representation matrix $C \in \mathbb{R}^{N \times d_u}$.
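A minimal PyTorch sketch of this sentence-level encoder (Eqs. 1-2) is shown below, assuming the Hugging Face transformers library; the class name, the masking of padded positions before max-pooling, and the default dimension are our own illustrative choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class SentenceEncoder(nn.Module):
    """Sentence-level encoder: BERT + max-pooling + linear projection (Eqs. 1-2)."""
    def __init__(self, d_u: int = 300, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.proj = nn.Linear(self.bert.config.hidden_size, d_u)  # d_w -> d_u

    def forward(self, input_ids, attention_mask):
        # W: (1, L_n, d_w) top-layer word representations of one utterance
        W = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mask padded positions with -inf so they never win the max-pooling
        W = W.masked_fill(attention_mask.unsqueeze(-1) == 0, float("-inf"))
        return self.proj(W.max(dim=1).values)  # u_n: (1, d_u)
```

Running this encoder over all $N$ utterances of a conversation and stacking the outputs yields the matrix $C \in \mathbb{R}^{N \times d_u}$ consumed by the dialogue-level encoder.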
Dialogue-Level Encoder

At the dialogue level, we utilize three types of Transformer blocks: Conventional Blocks for conventional context modeling, Intra-Speaker Blocks for Intra-Speaker dependency, and Inter-Speaker Blocks for Inter-Speaker dependency. As all Transformer blocks share the same structure, we simply introduce the general process of the first layer of a Transformer block. Given the conversation matrix $C$ processed by the sentence-level encoder, to avoid the absence of positional information in $C$, an absolute positional embedding is added to every representation in $C$:

$C = C + PE(0:N)$ (3)

where $PE(0:N)$ has the same dimension as $C$.

Self-attention intuitively provides an interactive pattern for the contextual modeling of conversations. Taking advantage of the mechanism of self-attention, the targeted utterances can be processed in parallel: the targeted utterances are regarded as a query matrix and the contextual utterances act as a key matrix, so that every utterance simultaneously assesses how much information shall be obtained from every contextual utterance. In this way, $C$ is projected to a query matrix $Q \in \mathbb{R}^{N \times d_a}$, a key matrix $K \in \mathbb{R}^{N \times d_a}$, and a value matrix $V \in \mathbb{R}^{N \times d_a}$ by linear projections without bias: $[Q; K; V] = \mathrm{Linear}([C; C; C])$, where $[\,;\,]$ is the concatenating operation. Self-attention is calculated by:

$A(Q, K, V, M) = \mathrm{softmax}\left(\frac{(QK^{T}) * M}{\sqrt{d_a}}\right)V$ (4)

where $*$ denotes element-wise multiplication and $M \in \mathbb{R}^{N \times N}$ is the utilized mask, a square matrix whose non-infinite elements equal 1. We will introduce the different masks used by the diverse blocks later. Transformer employs multiple self-attention heads (Multi-Head Attention, MHA) to model different aspects of information, and the outputs of all heads are concatenated and projected to $O$ with the same size as $C$. After the attention module, a position-wise Feed-Forward Network (FFN) module is deployed to produce the output $F \in \mathbb{R}^{N \times d_u}$. MHA and FFN are both residually connected. Therefore, the output $O_C$ of the first layer of the Transformer is:

$A' = \mathrm{LayerNorm}(O + C)$ (5)
$F = \max(0, A'W_1 + b_1)W_2 + b_2$ (6)
$O_C = \mathrm{LayerNorm}(F + A')$ (7)

$O_C$ acts as the input of the second Transformer layer, and by this analogy we obtain the final output $O_C \in \mathbb{R}^{N \times d_u}$ after multiple layers. Therefore, the outputs of our 3 blocks can be denoted as $O_C^{C}$ for the Conventional Block, $O_C^{RA}$ for the Intra-Speaker Block, and $O_C^{ER}$ for the Inter-Speaker Block. Due to the limited space in this paper, more details about Transformer can be reviewed in Vaswani et al. (2017).
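A simplified, single-head PyTorch sketch of one such dialogue-level layer (Eqs. 4-7) is given below. Eq. (4) multiplies the score matrix by a mask whose blocked entries are -INF; here we realize the same effect by filling blocked positions with -inf before the softmax. The feed-forward width and default dimensions are assumed values, and the positional embedding of Eq. (3) is assumed to have been added to $C$ beforehand.

```python
import math
import torch
import torch.nn as nn

class MaskedTransformerLayer(nn.Module):
    """One dialogue-level layer: masked (single-head) self-attention plus FFN,
    both residually connected with LayerNorm (Eqs. 4-7), as a simplified sketch."""
    def __init__(self, d_u: int = 300, d_a: int = 300, d_ff: int = 600):
        super().__init__()
        self.qkv = nn.Linear(d_u, 3 * d_a, bias=False)  # joint projection to Q, K, V (no bias)
        self.out = nn.Linear(d_a, d_u)
        self.norm1 = nn.LayerNorm(d_u)
        self.norm2 = nn.LayerNorm(d_u)
        self.ffn = nn.Sequential(nn.Linear(d_u, d_ff), nn.ReLU(), nn.Linear(d_ff, d_u))

    def forward(self, C, mask):
        # C: (N, d_u) utterance representations; mask: (N, N) with 1 = attend, 0 = blocked
        Q, K, V = self.qkv(C).chunk(3, dim=-1)
        scores = Q @ K.transpose(0, 1) / math.sqrt(K.size(-1))
        scores = scores.masked_fill(mask == 0, float("-inf"))  # blocked positions get -INF
        O = self.out(torch.softmax(scores, dim=-1) @ V)        # attention output, same size as C
        A = self.norm1(O + C)                                  # Eq. (5)
        return self.norm2(self.ffn(A) + A)                     # Eqs. (6)-(7)
```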
Masks can prompt the Transformer blocks to realize their different functions, and we introduce how the 3 masks are formed:

Conventional Mask sets all of its elements to 1, which means that every targeted utterance can get access to all the contextual utterances. The Conventional Mask is applied in the multi-head attention of the Conventional Blocks and is illustrated in Fig. 2 (a). We denote the Conventional Mask as $M_C$.

Intra-Speaker Mask only considers those contextual utterances tagged with $p_n$, the speaker tag of the targeted utterance. Therefore, based on $M_C$, the Intra-Speaker Mask $M_{RA}$ sets the positions representing other speakers to -INF. The Intra-Speaker Mask is illustrated in Fig. 2 (b).

Inter-Speaker Mask regards the speakers different from the one of the targeted utterance as one unit, due to our simplification. Therefore, based on $M_C$, the Inter-Speaker Mask $M_{ER}$ sets the positions whose speaker is the same as the speaker tag of the targeted utterance to -INF. The Inter-Speaker Mask is illustrated in Fig. 2 (c).
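The following sketch shows one way these three masks could be derived from the per-utterance speaker annotations; the 1/0 encoding is converted to -INF inside the attention (as in the layer sketch above), and the function name and integer speaker ids are illustrative.

```python
import torch

def build_masks(speakers):
    """Build the three (N, N) masks from the speaker annotations p_1..p_N.
    Entry [i, j] is 1 if utterance i may attend to utterance j, else 0."""
    p = torch.tensor(speakers)                              # integer speaker ids, one per utterance
    same = p.unsqueeze(0) == p.unsqueeze(1)                 # same[i, j] = True iff p_i == p_j
    conventional = torch.ones_like(same, dtype=torch.long)  # M_C: attend to every utterance
    intra = same.long()                                     # M_RA: only utterances of the same speaker
    inter = (~same).long()                                  # M_ER: only utterances of other speakers
    return conventional, intra, inter

# Speakers of u1..u5 in Fig. 1 (Phoebe, Joey, Chandler, Phoebe, Chandler) mapped to ids 0, 1, 2, 0, 2.
m_c, m_ra, m_er = build_masks([0, 1, 2, 0, 2])
```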
Fusing Method

As the blocks produce different outputs that carry various speaker-aware contextual information, we utilize 3 simple methods to fuse the information.
Add
As illustrated in Fig. 2 (i), Add regards the contributions of all block outputs equally. Therefore, the fusing representation is:

$R = O_C^{C} + O_C^{RA} + O_C^{ER}$ (8)
Concatenation
Concatenation (illustrated in Fig. 2 (ii)) is also a simple but effective method to combine different information. Different from the Add operation, Concatenation can implicitly choose the information that is important for the final prediction, owing to the following linear projection of the classifier. Therefore, the fusing representation $R \in \mathbb{R}^{N \times 3d_u}$ is:

$R = \mathrm{Concat}(O_C^{C}, O_C^{RA}, O_C^{ER}, dim = 1)$ (9)
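Both of these parameter-free fusing methods reduce to one-line tensor operations; a sketch with assumed tensor shapes follows.

```python
import torch

# O_c, O_ra, O_er: (N, d_u) outputs of the Conventional, Intra-Speaker, and Inter-Speaker blocks.
def fuse_add(O_c, O_ra, O_er):
    return O_c + O_ra + O_er                    # Eq. (8): R has shape (N, d_u)

def fuse_concat(O_c, O_ra, O_er):
    return torch.cat([O_c, O_ra, O_er], dim=1)  # Eq. (9): R has shape (N, 3 * d_u)
```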
Attention
As the contributions of the different speaker parties are diversely weighted, it is desirable that the model automatically chooses the more important information. Therefore, we utilize the widely used attention mechanism (Lian et al. 2019) to achieve this goal.
Dataset   | Num. of dialogues (train / dev / test) | Num. of utterances (train / dev / test) | Avg. dialogue length (train/dev, test)
IEMOCAP   | 120 / -- / 31                          | 5810 / -- / 1623                        | 48, 52
MELD      | 1039 / 114 / 280                       | 9989 / 1109 / 2610                      | 10, 9
Table 1: Statistics about IEMOCAP and MELD.

The attention mechanism takes the 3 block output representations as inputs and produces an attention score for each representation. For simplicity, we take the representations $O_{C_i}^{C} \in \mathbb{R}^{1 \times d_u}$, $O_{C_i}^{RA} \in \mathbb{R}^{1 \times d_u}$, and $O_{C_i}^{ER} \in \mathbb{R}^{1 \times d_u}$ of utterance $i$ as an example. The attention score and the fusing representation are computed as:

$O_i = \mathrm{Concat}(O_{C_i}^{C}, O_{C_i}^{RA}, O_{C_i}^{ER}, dim = 0)$ (10)
$\alpha = \mathrm{softmax}(w_F O_i^{T})$ (11)
$R_i = \alpha O_i$ (12)

where $O_i \in \mathbb{R}^{3 \times d_u}$ is the concatenated representation, $\alpha \in \mathbb{R}^{1 \times 3}$ is the attention score, $w_F \in \mathbb{R}^{1 \times d_u}$ is a trainable parameter, and $R_i \in \mathbb{R}^{1 \times d_u}$ is the fusing representation. Finally, the fusing representations of all utterances are concatenated as $R \in \mathbb{R}^{N \times d_u}$.
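A vectorized PyTorch sketch of this attention-based fusing (Eqs. 10-12), computed for all $N$ utterances at once, might look as follows; the module name and default dimension are illustrative.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-based fusing (Eqs. 10-12): score the three block outputs per utterance
    and take their weighted sum."""
    def __init__(self, d_u: int = 300):
        super().__init__()
        self.w_f = nn.Linear(d_u, 1, bias=False)  # trainable w_F in R^{1 x d_u}

    def forward(self, O_c, O_ra, O_er):
        O = torch.stack([O_c, O_ra, O_er], dim=1)                # (N, 3, d_u), Eq. (10) per utterance
        alpha = torch.softmax(self.w_f(O).squeeze(-1), dim=1)    # (N, 3) attention scores, Eq. (11)
        return (alpha.unsqueeze(-1) * O).sum(dim=1)              # (N, d_u) fused representations, Eq. (12)
```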
Classifier

With the sentence-level and dialogue-level contextual information fully modeled by the encoders, the dialogue-level output is fed to a classifier that predicts the final emotion distributions:

$\hat{Y} = \mathrm{softmax}(RW_{clf} + b_{clf})$ (13)

where $W_{clf} \in \mathbb{R}^{d_u \times K}$ ($W_{clf} \in \mathbb{R}^{3d_u \times K}$ for Concatenation), $b_{clf} \in \mathbb{R}^{K}$, and $\hat{Y}$ is the matrix of emotion distributions of all utterances in conversation $C$. The model is trained with a cross-entropy loss function, which is calculated as:

$\mathcal{L} = -\frac{1}{\sum_{l=1}^{T} N_l} \sum_{l=1}^{T} \sum_{i=1}^{N_l} \sum_{e=1}^{K} y_i^{e} \log(\hat{Y}_i^{e})$ (14)

where $y_i$ is the one-hot vector denoting the emotion label of utterance $i$ in a conversation, $e$ indexes the emotion dimensions, $N_l$ denotes the length of the $l$-th conversation, and $T$ denotes the number of conversations in the dataset.
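A compact sketch of the classifier and the training objective (Eqs. 13-14) follows; PyTorch's cross_entropy applies the softmax internally, and averaging over utterances realizes the normalization by the total number of utterances. The class name and default sizes are assumptions.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Linear classifier over the fused representations (Eq. 13)."""
    def __init__(self, d_u: int = 300, num_emotions: int = 6):
        super().__init__()
        self.clf = nn.Linear(d_u, num_emotions)  # use 3 * d_u as input size for the Concatenation variant

    def forward(self, R):        # R: (N, d_u) fused utterance representations
        return self.clf(R)       # logits; softmax over them gives the rows of \hat{Y}

def erc_loss(logits, labels):
    # labels: (N,) gold emotion indices; mean reduction matches the 1 / sum_l N_l factor in Eq. (14)
    return nn.functional.cross_entropy(logits, labels, reduction="mean")
```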
Experimental Setup

Datasets
We evaluate our models on two datasets, IEMOCAP (Busso et al. 2008) and MELD (Poria et al. 2019a); both are multi-modal datasets that contain three modalities. We solely consider the textual modality, following Ghosal et al. (2019). Statistics about the datasets are shown in Tab. 1.
• IEMOCAP: This dataset contains a series of dyadic conversations between 10 unique speakers. 6 categories of emotions are considered in our experiments: neutral, happy, sad, angry, excited, and frustrated. Following Majumder et al. (2019), the training set is split into a new training set and a validation set by the ratio of 80:20.
• MELD: This dataset contains over 1400 multi-speaker conversations collected from the TV series Friends. Emotions in this dataset are annotated into 7 categories: neutral, joy, surprise, anger, disgust, sadness, and fear.

Table 2: The results of our models on IEMOCAP. The weighted-F1 (wF1) score is used as the metric. † means the result is quoted from Ghosal et al. (2019).

Compared Methods
To distinguish our models with different fusing methods, we construct 3 model variants: TRMSM-Add, TRMSM-Cat, and TRMSM-Att. To show the importance of the speaker-related information, we also construct our model without Intra-Speaker Blocks and Inter-Speaker Blocks, which we denote as TRM. Besides, our models are compared with the baselines below:
• CMN (Hazarika et al. 2018b): CMN is proposed to model dyadic conversations, using two sets of RNNs and Memnets to respectively track the different speakers.
• DialogueRNN (Majumder et al. 2019): DialogueRNN is the first state-of-the-art model to address ERC with multiple speakers. RNNs are deployed to track the speakers' states and the global state during conversations.
• AGHMN (Jiao, Lyu, and King 2020): AGHMN is the state-of-the-art unidirectional model in real-time ERC. To retain the positional information from its hierarchical structure, a GRU is constructed for the attention mechanism.
• DialogueGCN (Ghosal et al. 2019): To fully model the interactive information between speakers, DialogueGCN models detailed dependencies between speakers using a Relational GCN.
• KET (Zhong, Wang, and Miao 2019): KET introduces the Transformer structure to model context in conversations. It also proposes an effective graph attention to extract information from commonsense knowledge bases.
• BERT (Devlin et al. 2019): A vanilla BERT followed by a classifier is fine-tuned to show the importance of context.
• Other baselines: Both based on CNNs to extract semantic information, scLSTM (Poria et al. 2017) utilizes an LSTM (Hochreiter and Schmidhuber 1997) and Memnet (Sukhbaatar et al. 2015) utilizes a memory network to model the conversational context.
Implementation
For the BERT-based sentence-level encoder, an uncased BERT-base model (https://github.com/huggingface/transformers) is adopted. For the dialogue-level encoder, the dimension of the dialogue-level representation is set to 300 for IEMOCAP and 200 for MELD; the number of Transformer layers is set to 6 for IEMOCAP and 1 for MELD; the number of heads is set to 6 for IEMOCAP and 4 for MELD; dropout is also applied. Additionally, the models are trained with AdamW (Kingma and Ba 2015; Loshchilov and Hutter 2019) for 10000 steps with 1000 steps of warm-up, and the learning rate, which decays linearly after the warm-up, is set to 1e-5 for IEMOCAP and 8e-6 for MELD. Due to the parallel prediction of all utterances in one conversation, the batch size is set to 1, following Jiao, Lyu, and King (2020). Besides, DialogueGCN was originally trained with a 90:10 data split on IEMOCAP, so for a fair comparison we re-run DialogueGCN with the 80:20 data split using its open-source code (https://github.com/declare-lab/conv-emotion/tree/master/DialogueGCN). All of our reported results are the average values of 5 runs.
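Under these settings, the optimization loop could be sketched as follows (using the IEMOCAP values); `model` and `loader` are hypothetical stand-ins for the TRMSM model and a per-conversation data iterator, and the scheduler comes from the transformers library.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` and `loader` are assumed to exist: TRMSM and an iterator yielding one conversation per step.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)        # 8e-6 for MELD
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=10000)   # linear decay after warm-up

for step, conversation in zip(range(10000), loader):              # batch size 1: one conversation per step
    loss = model(conversation)                                    # assumed to return the cross-entropy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```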
Results and Discussions

Overall Results
For IEMOCAP, the weighted-F1 (wF1) score is used as the metric. However, the class distribution of MELD is severely imbalanced, so the weighted-F1 score alone is not sufficient for MELD. To balance the contributions of large classes and small classes, we follow Zhang et al. (2020) and also use the average of the macro F1 score and the micro F1 score as a metric, calculated as $mF1 = (F_{macro} + F_{micro}) / 2$.

For IEMOCAP, as shown in Tab. 2, BERT attains a wF1 of 54.01, which is substantially worse than our models and most state-of-the-art models that consider the dialogue-level context. This result may indicate that IEMOCAP contains a considerable number of utterances that cannot be predicted depending only on the semantic information, without the conversational context.
Figure 3: The results of models with different blocks (wF1 for IEMOCAP; mF1 for MELD).

Furthermore, TRMSM-Att outperforms AGHMN by 2.24 wF1 and DialogueGCN (80:20) by 3.28 wF1, which benefits from the powerful Transformer and from our speaker modeling with long-distance information considered. Regarding individual emotions, TRMSM-Att achieves the best F1 on Frustrated, and TRMSM-Add achieves the best F1 on Neutral. Besides, our models attain the second- or third-highest results among the other emotions. This demonstrates that our models are competitive in achieving comprehensive performance.

Table 3: The results of our models on MELD. MELD uses the weighted-F1 (wF1) score, and the average value (mF1) of Macro-F1 and Micro-F1, as the metrics. ♦ means the result is quoted from Jiao, Lyu, and King (2020).

For MELD, as shown in Tab. 3, BERT outperforms the other state-of-the-art models by a great margin, which indicates the importance of the external knowledge brought by BERT. Compared with BERT, TRM attains only marginally better results, which may be attributed to the limited conversational contextual information in MELD. To confirm this, comparing the results of CNN and scLSTM (based on CNN), we can notice that the improvement is also limited (CNN achieves 55.02 wF1 on MELD, as reported by Poria et al. (2019b)). Although MELD provides limited contextual information in conversations, TRMSM-Att still outperforms BERT by 1.29 wF1 and 1.17 mF1, which indicates the effectiveness of our model in capturing such information. Regarding individual emotions, BERT beats the other state-of-the-art models by a great margin in such an imbalanced circumstance, and TRMSM-Att attains the best F1 on 4 emotions, including the large classes Joy, Anger, and Surprise, and the small class Disgust. This demonstrates that BERT can alleviate data imbalance and that our model can take advantage of this feature.

For both datasets, all TRMSM variants outperform TRM, which shows the importance of speaker-aware contextual information. TRMSM-Att and TRMSM-Cat outperform TRMSM-Add, which indicates that the different aspects of speaker information need to be treated differently. TRMSM-Att outperforming TRMSM-Cat demonstrates that automatically and explicitly picking up speaker-related information is better than the implicit way.

Figure 4: wF1 of TRMSM-Att and DialogueGCN with different ranges of context; all means using the global context.
Model Analysis
Ablation Study
To better understand the influence of the masks on our models, we report the results of models with the Transformer blocks of different masks removed, on IEMOCAP and MELD. In this part, we denote the Conventional Mask as CM and the Intra-Speaker and Inter-Speaker Masks together as SM. Accordingly, TRMSM w/o SM is equivalent to TRM. As seen in Fig. 3, on both datasets, TRMSM w/o CM (solely applying SM) achieves better performance than TRMSM w/o SM (solely applying CM). We attribute this to the fact that speaker modeling does not discard contextual information from conversations; on the contrary, it guides the model to extract information that is more effective for the final prediction. Furthermore, TRMSM outperforms both TRMSM w/o CM and TRMSM w/o SM, which demonstrates that all of our designed masks are critical to achieving better performance.
Effect of Range of Context
To find out the influence of the range of context on our model, we train TRMSM-Att with different ranges of available context on IEMOCAP and take the results for DialogueGCN from Ghosal et al. (2019). We utilize different windows $(-x, y)$ to limit the context, where $x$ and $y$ are respectively the number of utterances in the prior context and the post context. As illustrated in Fig. 4, both models improve as the window widens. For DialogueGCN, (-10, 10) is the maximum context window, and therefore it cannot get access to the long-distance context.
M M M
Oh, thank God... on the phone for half an hour ... talk to a freaking human being . I got it in the email
Frustrated
Frustrated
TRMSM:
Fru.
TRM:
Ang.
TRMSM:
Fru.
TRM:
Neu. yeah
Frustrated
TRMSM:
Fru.
TRM:
Neu.
Intra-s. Inter-s.
Intra-speaker Block Attention :Speaker:Attention score
All right, all right, I’ll ... my girlfriend. But I'm just doing it for you guys.
Neutral
TRMSM:
Neu.
TRM:
Sad. Yeah, you should, really.
Neutral
TRMSM:
Neu.
TRM:
Neu.
So big deal, so Joey's had a lot of girlfriends, it doesn’t ...
Neutral
TRMSM:
Neu.
TRM:
Neu.
Ch Ch Ch Ch Ch Ro RoMo Mo RaRa Mo
Intra-s. Inter-s.
Attention from fusing methodInter-speaker Block Attention :Speaker :Attention score (a) (b)
M: Male F: Female
Ch: Chandler
Ro: RossMo: Monica
Ra: Rachel
Attention from fusing method
On the contrary, the performance of TRMSM-Att is further improved with all of the context available. This indicates that the local contextual information is critical for the prediction, and that the long-distance information is also important for contextual modeling to further improve the performance.

Figure 5: Heatmaps of attention from the fusing method and of the self-attention in the Intra- and Inter-Speaker Blocks for the targeted utterances (whose speakers are marked in yellow). Labels of utterances are tagged below the utterances. Predictions of TRMSM and TRM are marked in green when correct and in red when mistaken.

Figure 6: The F1 score on every emotion class by TRMSM-Att and TRM; (a)-(c) for IEMOCAP and (d)-(f) for MELD, with 1, 3, and 6 layers respectively.
Effect of Number of Layers
We study the effect of the number of layers on our model for the different datasets. Fig. 6 illustrates the radar graphs of the F1 scores per emotion in IEMOCAP and MELD for TRM and TRMSM-Att. As the number of layers increases, the F1 scores of the emotions in IEMOCAP generally expand. In MELD, however, increasing the number of layers gradually hurts the performance, dropping the F1 to 0 on the emotions Fear and Disgust, which are the classes with the fewest data. We think the reason may be that MELD suffers from data imbalance, and increasing the number of layers leads to more severe overfitting on the small classes. For data imbalance, methods like re-balancing can be applied to alleviate it. Re-balancing is out of the scope of this paper, and our future work will study the data imbalance of ERC.
Case Study
To better understand how our model captures Intra-Speaker and Inter-Speaker dependencies, we illustrate two conversation clips ending with the targeted utterances, so that the targeted utterances can only refer to the prior context. We choose TRMSM-Att without Conventional Blocks so that only the speaker-related blocks are considered. Specifically, we illustrate heatmaps of the attention from the fusing method and of the self-attention in the Transformer blocks (we average the attention scores of all self-attention heads in the top layer of the Transformer). For simplicity, we denote the attention from the fusing method as FAtt.

In the scene of Fig. 5 (a), the speaker M stays frustrated throughout the conversation, and F, who is in a neutral state, has little influence on M. Therefore, FAtt pays more attention to Intra-Speaker dependency so that the Intra-Speaker Blocks can extract information from M himself. We can see from the heatmap that the targeted utterance "yeah" assigns the highest score to the farthest contextual utterance, whose emotion is also frustration and which is out of the range of context that DialogueGCN can refer to. In a sense, this indicates the importance of long-distance information.

In the condition of Fig. 5 (b), the speakers in this conversation basically stay in a neutral state, except that Chandler shows other emotions like anger and surprise before the targeted utterance. Although the targeted utterance by Chandler shows slight sadness from a semantic view, it is supposed to be predicted as neutral according to the conversational context. Specifically, FAtt gives the Inter-Speaker Blocks a higher score, and the self-attention in the Inter-Speaker Blocks extracts information from the neutral utterances of the other speakers. This case indicates the effectiveness of our model in extracting inter-speaker information.
Conclusion
In this work, we simplify the Self and Inter-Speaker dependencies to a binary version. To achieve this simplified modeling of speakers' interactions, we design three masks: the Conventional Mask, the Intra-Speaker Mask, and the Inter-Speaker Mask. These masks are utilized in the self-attention modules of the second-level Transformer blocks of a hierarchical Transformer. As the speaker-aware information extracted with different masks contributes diversely to the prediction, the attention mechanism is utilized to weight and fuse it. Finally, our model achieves state-of-the-art results on 2 ERC datasets, and further analysis shows that our model is efficacious for ERC.

References
Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Evaluation.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of NAACL-HLT, 4171–4186.
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In Proc. of NeurIPS, 13042–13054.
Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; and Gelbukh, A. F. 2019. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proc. of EMNLP-IJCNLP, 154–164.
Hazarika, D.; Poria, S.; Mihalcea, R.; Cambria, E.; and Zimmermann, R. 2018a. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. In Proc. of EMNLP, 2594–2604.
Hazarika, D.; Poria, S.; Zadeh, A.; Cambria, E.; Morency, L.; and Zimmermann, R. 2018b. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. In Proc. of NAACL-HLT, 2122–2132.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation.
Jiao, W.; Lyu, M. R.; and King, I. 2020. Real-Time Emotion Recognition via Attention Gated Hierarchical Memory Network. In Proc. of AAAI.
Jiao, W.; Yang, H.; King, I.; and Lyu, M. R. 2019. HiGRU: Hierarchical Gated Recurrent Units for Utterance-Level Emotion Recognition. In Proc. of NAACL-HLT, 397–406.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Proc. of ICLR.
Li, Q.; Wu, C.; Zheng, K.; and Wang, Z. 2020. Hierarchical Transformer Network for Utterance-level Emotion Recognition. arXiv preprint arXiv:2002.07551.
Lian, Z.; Tao, J.; Liu, B.; and Huang, J. 2019. Conversational Emotion Analysis via Attention Mechanisms. In Proc. of Interspeech, 1936–1940.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In Proc. of ICLR.
Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A. F.; and Cambria, E. 2019. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In Proc. of AAAI, 6818–6825.
Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; and Morency, L. 2017. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proc. of ACL, 873–883.
Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; and Mihalcea, R. 2019a. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proc. of ACL, 527–536.
Poria, S.; Majumder, N.; Mihalcea, R.; and Hovy, E. H. 2019b. Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances. IEEE Access 7: 100943–100953.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
Schlichtkrull, M. S.; Kipf, T. N.; Bloem, P.; van den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling Relational Data with Graph Convolutional Networks. In Proc. of ESWC, 593–607.
Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015. End-To-End Memory Networks. In Proc. of NeurIPS, 2440–2448.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Proc. of NeurIPS, 5998–6008.
Zhang, B.; Yang, M.; Li, X.; Ye, Y.; Xu, X.; and Dai, K. 2020. Enhancing Cross-target Stance Detection with Transferable Semantic-Emotion Knowledge. In Proc. of ACL, 3188–3197.
Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; and Zhou, G. 2019. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proc. of IJCAI, 5415–5421.
Zhong, P.; Wang, D.; and Miao, C. 2019. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proc. of EMNLP-IJCNLP, 165–176.
Zhu, H.; Nan, F.; Wang, Z.; Nallapati, R.; and Xiang, B. 2020. Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer. In Proc. of AAAI.