CHARET: Character-centered Approach to Emotion Tracking in Stories
Diogo S. Carvalho, Joana Campos, Manuel Guimarães, Ana Antunes, João Dias, Pedro A. Santos
A Preprint
Diogo S. Carvalho
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa
[email protected]

Joana Campos

Manuel Guimarães
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa
[email protected]

Ana Antunes
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa
[email protected]

João Dias
INESC-ID & Faculdade de Ciências e Tecnologia, Universidade do Algarve
[email protected]

Pedro A. Santos
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa
[email protected]
ABSTRACT
Autonomous agents that can engage in social interactions with a human are the ultimate goal of a myriad of applications. A key challenge in the design of these applications is to define the social behavior of the agent, which requires extensive content creation. In this research, we explore how we can leverage current state-of-the-art tools to make inferences about the emotional state of a character in a story as events unfold, in a coherent way. We propose a character role-labelling approach to emotion tracking that accounts for the semantics of emotions. We show that by identifying actors and objects of events and considering the emotional state of the characters, we can achieve better performance in this task compared to end-to-end approaches.
INTRODUCTION

Intelligent Virtual Agents (IVAs) have an ever-increasing range of applications, from conversational interfaces on websites to tutors or teammates in educational environments [6, 28], where they are equipped with tools to conduct human-like interactions in closed-context environments. It is in the role of commercial automated assistants (e.g., Siri, Alexa, Google Home) that IVAs are currently the most popular. Their conversational skills are a result of technological advances in Natural Language Processing (NLP) that allow IVAs to support the user in everyday tasks. Although these systems are getting more and more sophisticated, their communication abilities are still limited. End-to-end deep learning approaches to dialog generation for IVAs focus only on response quality and do not explicitly control the social factors present in natural human-like interactions. Among other aspects, to be socially resonant [19], IVAs have to understand the user's beliefs, emotions, goals and intentions, while maintaining and sharing their own, to produce consistent behaviour over time.

Building computer artifacts that engage in this complex social dance, particularly in open-ended domains, is a compelling prospect that has attracted many researchers. Agent-based attempts to simulate individual cognitive and affective processes [9, 16, 23, 30] model how individual traits, goals, beliefs and actions interact to produce intelligent and emotionally plausible behaviour in any scenario that the user can imagine. Yet, it is up to the author of a scenario to manually describe them for each character and guarantee plot adaptability and consistency as events unfold.
While this can be manageable in narrow domains of application, scenario complexity can escalate rapidly, limiting the power of such architectures, which rely heavily on the accessibility of the tool [24] and, foremost, on the authors' creativity and ability to anticipate all interaction paths.

In our project, the ultimate goal is to create agents that can effectively act socially on the users' actions without relying on hand-generated content only. At the same time, it is our stance that we cannot depend solely on machine learning to create socially resonant agents: we need to leverage symbolic models, semantic networks and other conceptual models to create more natural interactions. In this work, we explore this idea by focusing on a central element of social interactions: emotions. We are interested in detecting and understanding the user's (or a character's) emotional state as interactions evolve in ways that were not accounted for. To that end, we explore how we can leverage state-of-the-art tools to accurately infer the emotions of characters in stories as events unfold.

In this paper, the emotion classification task is modeled as a character role-labeling problem, because we are interested in who felt the emotion and why. Our character-centered approach diverges from common approaches to sentiment analysis in text that attempt to infer an emotion from a set of words, ignoring the semantics and subjectiveness of emotions: emotional reactions are caused by an event that the actor and the object (of that event, if one exists) may experience differently. Furthermore, in our perspective, emotions are not static reactions to events, but dynamic constructs that evolve as interactions unfold. We find that by identifying the characters' roles (actors and objects of an event) in a story and keeping track of their emotional state, we can perform better than end-to-end approaches in the task of emotion tracking in stories.
From the results of this work, we draw particular challenges for the domain authoring of open-ended, socially resonant human-agent interactions.
RELATED WORK

Research in psychology, neurology and cognitive science shows that people not only use their cognitive functions, but also heavily rely on their emotions when taking decisions [31]. Damasio et al. showed that if these two parts do not interconnect in a proper manner, multiple options are harder to filter and bad decisions are easier to take [22]. These findings influenced the design of intelligent characters and led agent-based modelling systems to build their frameworks around emotions. Affective agents aim not only at being more realistic and providing a more engaging experience in human-computer interaction, but also at improving the performance of rational agents.

The FAtiMA Toolkit is a collection of open-source tools designed to facilitate the creation and use of cognitive agents with socio-emotional skills [15]. Its objective is to help researchers, developers and roboticists incorporate a computational model of emotion and decision-making in their projects. In particular, it enables developers to easily create Role Play Characters: socially intelligent characters with detailed AI modules that make them autonomous regarding social interactions [36]. Both the Emotional and the Decision Making processes behind FAtiMA-designed characters are defined by logical rules. By expressing the agent's decisions as conditions with logical variables, the action space of the agent grows and adapts according to its beliefs [24].

The Virtual Human Toolkit [16] is a well-known architecture that is also designed to facilitate the creation of autonomous conversational characters, and it is also highly modular. In this case, the provided modules focus on handling aspects related to the embodiment of a character rather than its cognitive abilities.
This functionality includes aspects and services such as speech processing, emotional modeling of the learner, emotional modeling of the virtual human, the gestures of the virtual human, rendering, and other services [4].

GAMYGDALA [30] is a computational model of emotions based on the OCC theory [7]. Similar to the FAtiMA Toolkit, this emotional appraisal engine was designed to be more accessible to game developers. Essentially, authors need only provide a list of goals for each character and then specify which events will block or facilitate each goal. Based on that information, the engine determines the changes made to the character's emotional state [12].

As we have mentioned before, agent modelling architectures rely on a type of authoring that is oriented towards cognitive concepts such as goals, beliefs and emotions, among others [24]. In turn, they are heavily dependent on the designer's ability to imagine a variety of social situations and to use their intuition to specify social behaviours and execution rules for the agent, which may be difficult to articulate [21], particularly for people outside of the scientific field.

In the past, when authoring agent modelling architectures, this task would be manually performed by a team of engineers or by the developers of the architecture themselves. However, recent developments in the Natural Language Processing (NLP) and Machine Learning (ML) fields have led to the automation, or at least partial automation, of this task. One of the most recent and successful approaches to this problem is the translation of natural language descriptions into agent-readable concepts. Authors are asked to provide textual input such as stories or scripts, and the system transforms it into intelligent characters' scenarios. For example, in the AI planning field, much like in the IVA field, there is the underlying assumption that users can formulate the problem using some formal language [20].
Here, knowledge acquisition tools have been used to extract the domain model through NLP. Framer uses natural language descriptions written by users as input and is able to learn planning domain models [13]. Janghorbani et al. [17] introduced an authoring assistant tool to automate the process of domain generation from natural language descriptions of virtual characters.

Given the importance of emotions in intelligent agent modelling systems, we believe that the first step towards easing the authoring burden should focus on the detection of emotion in natural language texts describing a story or sequence of events. This will allow us to replace the traditional hand-authored appraisal rules used in socio-emotional agent systems [15] with a learned model that is able to subjectively appraise an event according to different perspectives.
Sentiment analysis is the umbrella term for a series of tasks focused on detecting valence, emotions and other affective states in text. Some tasks explore the positive or negative orientation of words [38], while others intend to infer a driving sentiment or opinion in a whole document [1]. These works focus on extracting an emotional label (or overall sentiment) from a set of words and do not consider the emotional state of the different entities in the text. Different methods, including supervised machine learning, lexicon-based approaches and linguistic analysis, are the common techniques used to tackle sentiment analysis of text, but recently developed tools have opened avenues for sentiment detection in text. In particular, a newly created commonsense tool, ATOMIC [34], which allows one to reason about the causes and effects of events, has shown its applicability in question answering (QA) tasks [35] and the potential to reason about events and their related emotions. While emotion classification of events can be considered a QA task under this setting, it is important to note that an individual's emotional reaction to an event is not static and may depend on other aspects. As shown by Rashkin et al. [32], story context is relevant for emotion tracking in stories, and methods should be devised to capture it. More recently [2], the power of Transformer language models allowed the creation of COMET [3], a new framework for training neural representations of knowledge graphs that leverages the ATOMIC knowledge base. This new tool makes it possible to draw contextualized inferences (using the information in the graph), which are the scaffold of a character's emotional state classification.

Given the underlying semantics of emotions, we consider that an important form of context is who felt what given a certain event. As noted by others [26], identifying semantic roles in a sentence can improve the sentiment analysis task, because sentiment is not always explicit in text.
This implies that some pre-processing is necessary and that end-to-end approaches may not suffice for this task [5].
RESEARCH QUESTION
Following the related work, we introduce a new mechanism that considers the semantics of emotions to track and predict the emotional state of a character in a story. We consider that a character-centered approach to emotion classification that leverages state-of-the-art commonsense tools designed around cognitive processes (e.g., beliefs, emotions, intentions, causes and effects) will assist in the task of creating intelligent characters' scenarios. This consideration drives the research question for the evaluation presented later in the paper:
RQ1
Can state-of-the-art commonsense tools allow an agent to keep track of the emotional state of a character, as events unfold, in a coherent way?
This question can be broken down into more specific sub-questions:

(1) Is it possible to make better use of commonsense inference tools by considering the semantics of emotions?
(2) Does a layered approach to emotion tracking, i.e., an approach that breaks down the problem into meaningful sub-problems, yield better results than end-to-end approaches?

This research question seeks to address whether character-specific inferences and context improve emotion classification in stories, when compared to previous work [2] that uses the same context across characters when making new inferences. A layered approach to sentiment analysis refers to the set of pre-processing steps required to apply the proposed approach. These steps are detailed in Section 4 and constitute the approach pipeline.
APPROACH

In this paper we propose a character-centered approach to emotion recognition and tracking in stories that identifies stimuli and their objects, aided by semantic role-labelling [18]. We formalize it in the following way. Given a story $S$, consisting of $n$ events $(s_1, \ldots, s_n)$, and a set of $m$ characters $C = \{c_1, \ldots, c_m\}$, we assume that emotional episodes are defined by an event-character pair $(s_t, c_i)$ in $S \times C$, which has a corresponding set of emotional reactions $Y_{s_t,c_i}$. The set $Y_{s_t,c_i}$ is a subset of a previously defined set of emotions $P = \{emotion_1, \ldots, emotion_N\}$. We highlight that the empty set, corresponding to no emotion, and the whole set, corresponding to all emotions, are allowed. Here, we follow the OCC theory of emotions [27], which posits that multiple emotions can be experienced simultaneously as the result of the appraisal of an event. Possible choices of $P$ include Ekman's basic emotions [11] or Plutchik's wheel of emotions [29].

Provided $S$ and $C$, we set ourselves the task of tracking the emotions in $Y_{s_t,c_i}$ with the support of commonsense inference tools. Our approach is as follows. For each event-character pair $(s_t, c_i)$ and each emotion $y$ in $P$, a function $f \colon S \times C \times P \to \{0, 1\}$ predicts whether $y$ is an emotional reaction of the event-character pair $(s_t, c_i)$, or in other words whether $y \in Y_{s_t,c_i}$, based on a score $score_{s_t,c_i,y} \in [0, 1]$. If the score is sufficiently high, we classify $y$ as an emotional reaction to $(s_t, c_i)$. Once we have followed this procedure for every emotion $y$ in $P$, we are left with the predicted set of emotional reactions $\hat{Y}_{s_t,c_i}$.

The function $f$ encompasses a pipeline with three components: a) character role-labeling, b) commonsense inference, and c) emotion classification.
The pipeline consists of identifying stimuli (events) and their objects, and then using a language model, COMET [3], to make inferences about life events and identify the triggered emotions, depending on the perspective of the character in the story. We detail our pipeline in the following subsections.

To capture character-specific context and detect more coherent emotions as events unfold, we use information from the previous event $s_{t-1}$. Specifically, we consider the effects that the previous event $s_{t-1}$ had on each character to model how the character feels about $s_t$. This entails that emotions are not momentary reactions to an event, but constructs that incrementally unfold. Figure 1 shows a diagram of our approach over the first two events of a story from StoryCommonsense, Hot Coffee.

Figure 1: Diagram of our approach. The inferences of effects from the first event are used as context to classify the emotions of the characters on the second event.
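As an illustration, the three-component pipeline above can be sketched as follows. All function names are hypothetical stand-ins, not the actual implementation: the real system uses NeuralCoref and PredPatt for step (a) and COMET for step (b).

```python
# Illustrative sketch of the pipeline: (a) character role-labeling,
# (b) commonsense inference, (c) emotion classification.

def label_roles(event, characters):
    """(a) Map each character mentioned in the event to 'actor' or 'object'.
    Simplification: the first of the given characters found in the event
    is taken as the actor; the real system reads roles off PredPatt output."""
    mentioned = [c for c in characters if c in event]
    return {c: ("actor" if i == 0 else "object") for i, c in enumerate(mentioned)}

def build_event_set(event, role, prev_effects):
    """(b) Character-specific event set E_{s_t,c_i}: the event itself, the
    role-dependent ATOMIC relations (recorded symbolically here instead of
    being expanded by COMET), and the effect inferences carried over from
    the previous event s_{t-1}."""
    relations = ("xIntent", "xReact", "xEffect") if role == "actor" else ("oReact", "oEffect")
    return [event] + [f"{rel}({event})" for rel in relations] + list(prev_effects)

def classify_emotions(event_set, emotions, score_fn, thresholds):
    """(c) Keep every emotion y whose score exceeds its threshold k_y."""
    return {y for y in emotions if score_fn(event_set, y) > thresholds[y]}
```

A caller would chain the three steps per event-character pair, feeding the effect inferences of one event into the next as context.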
To establish a character's role with respect to an event in a sentence (a character can be either the actor or the object of an event) we use PredPatt. PredPatt [37, 40] is a tool that can be used to perform semantic role labeling (SRL): it defines a set of interpretable, extensible and non-lexicalized patterns based on Universal Dependencies [8, 10] and extracts predicates and their arguments using these rule-based patterns. While other (deep) tools, such as the AllenNLP semantic role labeller [14], are available to perform this task, it has been shown in the literature that rule-based systems produce better results [39]. PredPatt is an attractive tool for our project because it allows SRL to be performed in languages other than English without consequences to the other blocks in the pipeline.

Note that, as a pre-processing step, we had to resolve co-references within a story (see Figure 2). We used NeuralCoref (https://github.com/huggingface/neuralcoref) to assist in this task. NeuralCoref annotates and resolves co-reference clusters using a neural network. The system receives a set of ordered sentences, which constitute a story, and substitutes the pronouns he, his, they and him by the corresponding entity (see Figure 3). From there, we use PredPatt to classify the character with respect to its role in the event and keep track of the characters in the story.

Figure 2: Co-reference resolution example. The box on top is a five-line story present in the StoryCommonsense dataset (https://uwnlp.github.io/storycommonsense/). The box at the bottom demonstrates how the co-references were resolved.

Figure 3: Character Role-Labeling example. Given the two entities present in this story, Tom and People, the algorithm identifies who is the actor of the event (in green).
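A drastically simplified stand-in for this pre-processing step is sketched below. Real co-reference resolution (NeuralCoref) works over mention clusters rather than a single given entity, and PredPatt extracts predicate-argument structures from dependency parses; this sketch only mimics the shape of their outputs.

```python
# Simplified stand-in for co-reference resolution and role labeling.

PRONOUNS = {"he", "his", "him", "she", "her", "they", "them"}

def resolve_pronouns(sentences, entity):
    """Substitute third-person pronouns by `entity` in each sentence of a story.
    Simplification: assumes a single protagonist; NeuralCoref instead resolves
    each pronoun to its own co-reference cluster."""
    return [" ".join(entity if w.lower() in PRONOUNS else w for w in s.split())
            for s in sentences]

def character_role(character, triple):
    """Label `character` as the actor (subject) or object of an extracted
    predicate, given a PredPatt-style (predicate, subject, object) triple."""
    predicate, subject, obj = triple
    if character == subject:
        return "actor"
    if character == obj:
        return "object"
    return None
```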
COMET is a tool for automatic knowledge graph construction. It is a Transformer language model fine-tuned on the ATOMIC knowledge graph. In ATOMIC, each event is annotated with possible intents (xIntent), needs (xNeed), reactions (xReact), attributes (xAttr) and effects (xWant, xEffect), with respect to the actors of the event. With respect to others, each event is annotated with possible reactions (oReact) and effects (oWant, oEffect). Given an event, COMET is thus able to perform commonsense inferences about events described in natural language. Such inferences can, again, include intents, attributes, effects and, even more importantly for our work, emotional reactions.

We build on the work of Bosselut et al. [2], who leveraged commonsense inference to reason about unstated events in question answering tasks. In the cited work, the authors propose COMET-CGA, an approach which uses COMET to reason over commonsense inferences.

For each story event-character pair $(s_t, c_i)$, we perform commonsense inferences to produce unstated events, using COMET [3]. In our approach, if a character is an actor of the event, we use COMET to infer its most likely intent, reaction and effect, producing a set of character-specific events $E^x_{s_t,c_i} = \{s_t, xIntent, xReact, xEffect\}$. If, on the contrary, a character is an object of the event, we use COMET to infer only its reaction and the effect, producing the set of character-specific events $E^o_{s_t,c_i} = \{s_t, oReact, oEffect\}$. Figure 4 shows an example of the commonsense inference step of our pipeline.
Figure 4: Commonsense Inference example. Given $s_1$ we infer the most likely intent, reaction and effect of the character Tom, using COMET.
As stated before, to capture character-specific context we use the effect inferences, both xEffect and oEffect, from the previous story line $s_{t-1}$. For the first event of a story, there is no previous event, so we do not use any effect inference. The set of character-specific events $E_{s_t,c_i}$ can be seen as the set of character-specific information available, where each piece of information is an event described in natural language. While the attempt to add story context to emotion classification in stories is not novel (see Section 5.2), using unstated events inferred from previous events is. Adding story context is intended to classify emotions more coherently throughout the story.

While built with the aim of commonsense inference for knowledge graph construction, COMET can also be used in emotion classification tasks (see Section 5.2). The use of COMET for emotion classification is not new; the main difference in our approach is that, by including a character role-labeling step, we can make character-specific commonsense inferences from events and classify different emotions for different characters according to their role.

Being a language model, COMET produces a reaction inference for an input event using a probability distribution over a vocabulary $V$, corresponding to the probabilities of outputting each word in $V$. As such, for example, we can model the likelihood that surprise is an emotion of a certain event-character pair as the probability $p_{e,surprise}$ that the first word of the reaction inference is surprised, or the likelihood of joy as $p_{e,happy}$. More generally, we can define a dictionary $d$ that, for each emotion $y$ in $P$, maps the emotion to a word from the vocabulary $V$. The choice of dictionary is ad hoc, and there are various alternatives for each set $P$ of emotions considered.

For each event $e$ in the set $E_{s_t,c_i}$, and for each emotion $y$ in $P$, we compute the probability $p_{e,y}$ that emotion $y$ is the reaction inference from event $e$.
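As a small illustration, deriving $p_{e,y}$ from a next-word distribution can be sketched as follows. The dictionary entries follow the spirit of the mapping described above, and `first_word_probs` is a hypothetical stand-in for COMET's distribution over the vocabulary $V$ for the first word of the reaction inference of event $e$.

```python
# Sketch of extracting p_{e,y} from a language model's next-word distribution.
# EMOTION_WORDS plays the role of the dictionary d (emotion label -> word in V).

EMOTION_WORDS = {
    "surprise": "surprised",
    "joy": "happy",
    "fear": "fearful",
}

def emotion_probability(first_word_probs, emotion):
    """p_{e,y}: probability that the reaction inference starts with d(y).
    `first_word_probs` maps vocabulary words to probabilities."""
    return first_word_probs.get(EMOTION_WORDS[emotion], 0.0)
```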
If, for the event-character pair $(s_t, c_i)$, the character $c_i$ is classified as an actor in the character role-labeling step, we use the probabilities from COMET's xReact inference; if, on the other hand, the character $c_i$ is an object, we use the probabilities from COMET's oReact inference. Finally, the score of the emotion is the geometric mean of these probabilities,

$$score_{s_t,c_i,y} = \sqrt[|E_{s_t,c_i}|]{\prod_{e \in E_{s_t,c_i}} p_{e,y}},$$

the same as in COMET-CGA. Other alternatives to the geometric mean exist, such as the arithmetic mean, the maximum and the minimum.

Based on the emotion score, we must decide whether the emotion $y$ is a reaction of character $c_i$ to event $s_t$. We establish that $y$ is an emotional reaction of the event-character pair $(s_t, c_i)$ if $score_{s_t,c_i,y}$ is greater than an emotion-specific threshold $k_y \in [0, 1]$. Conversely, we establish that $y$ is not an emotional reaction of the event-character pair if $score_{s_t,c_i,y}$ is lower than or equal to $k_y$. The emotion-specific thresholds $k_y$ make up for any bias towards more frequent emotions, such as happy and sad, as opposed to fearful and trusting, since we can use lower thresholds for the least frequent emotions.

A character's emotion classification can be executed under two settings, zero-shot and few-shot, as can COMET-CGA [2]. In the zero-shot setting, the emotion-specific thresholds are not optimized, to preserve the generality of the approach and ease its applicability; we define each emotion threshold as a percentile. Specifically, if an emotion $y$ appears in $q\%$ of the training set, we define $k_y$ as the $q$-th percentile of the cumulative distribution function of scores for emotion $y$. In the few-shot setting, each emotion-specific threshold is optimized by sweeping a previously defined space of possible thresholds, and is defined as the lowest value maximizing the F1-score on the training set.
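The scoring and thresholding described above amount to a few lines of code; a minimal sketch, assuming the per-event probabilities $p_{e,y}$ have already been obtained:

```python
import math

def emotion_score(event_probs):
    """score_{s_t,c_i,y}: geometric mean of p_{e,y} over the event set
    E_{s_t,c_i}. `event_probs` holds one probability per event in the set."""
    return math.prod(event_probs) ** (1.0 / len(event_probs))

def is_reaction(score, threshold):
    """y is classified as an emotional reaction iff score > k_y."""
    return score > threshold
```

For example, two events with probabilities 0.5 and 0.125 yield a score of 0.25, which would pass a threshold of 0.2 but not one of 0.3.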
Figure 5 shows an example of the emotion classification step.

Figure 5: Emotion Classification example. Each event in the set on top gives each emotion a probability. For each emotion, the probabilities from each event are combined to produce a score. The emotions with a score greater than their specific threshold are classified; in this case, Joy and Anticipation.

EXPERIMENTAL EVALUATION

In this section we describe the experimental evaluation conducted on the StoryCommonsense dataset [32], which was annotated using a character-centered approach. The semantics of emotions were taken into account in the annotation of this dataset, allowing access to emotional states from the point of view of the actor and object of an event. We test and compare CHARET against previous approaches, which are discussed below (see Section 5.2).
The StoryCommonsense dataset [32] consists of short commonsense stories, with five natural language events each, annotated with the mental states of the characters: motivations and emotional reactions. To produce the dataset, annotators were asked to describe the mental states of the characters both using free-form natural language and emotion theory labels: Maslow's needs [25], Reiss' motives for motivations [33] and Plutchik's basic emotions [29]. The annotations describe how humans think about events and how they infer motives and emotions. People use a character-based approach to reason about their social worlds, which is reflected in this dataset. For that reason, we believe that the results obtained with this dataset transfer well to other interaction scenarios that we aim to model in the context of our project.

The task on which our approach (see Section 4) is evaluated consists of labeling the emotional reactions of the characters in each story of the StoryCommonsense dataset. In particular, we try to label each story event-character pair with a subset of the eight Plutchik basic emotions used to annotate the dataset: surprise, disgust, sadness, joy, anger, fear, trust and anticipation.

The most straightforward approach to the task is to train a classifier that receives an unprocessed story event and a character and outputs its emotional reactions. Such classifiers were trained and tested by Rashkin et al. [32], namely TF-IDF features, max-pooled GloVe embeddings, an LSTM and a CNN. The mentioned classifiers were trained both with and without a form of story context. In particular, besides the event-character pair $(s_t, c_i)$, the classifiers are also given the previous events of the story where the character is explicitly mentioned. Even though results across all classifiers suggested the benefits of including this form of story context, the recursive input of story events is computationally expensive.
We also include story context in our approach, although in a different way, as described in Section 4.

We also set as a baseline the work of Bosselut et al. [2], who experimented with COMET, a GPT language model fine-tuned on commonsense knowledge graph completion [34], on StoryCommonsense. Their approach (COMET-CGA) also included commonsense inference and was evaluated in a zero-shot setting, showing results similar to the aforementioned supervised approaches. Additionally, in the few-shot setting, their approach outperformed the supervised baselines. Finally, the same authors used the GPT language model fine-tuned on StoryCommonsense, outperforming every baseline [2]. While producing better empirical results, this approach has increased computational costs and a bigger loss of generality when compared with the zero-shot setting. These limitations may constrain applications of the model to small datasets.
We test our approach under the zero-shot and few-shot settings and compare it with the previously discussed baselines. To show that our approach can be further optimized for a specific task, beyond the few-shot setting, in a supervised manner, we also experiment with fine-tuning the COMET model on a StoryCommonsense training set, consisting of 20% of the development set, skipping the commonsense inference step of the pipeline.

Table 1 shows the dictionary we use to map the Plutchik emotions to words from COMET's vocabulary V.

Table 1: Plutchik emotion dictionary.

Emotion        Word
Surprise       surprised
Disgust        disgusted
Sadness        sad
Joy            happy
Anger          angry
Fear           fearful
Trust          trusting
Anticipation   excited

The metrics we use are Precision, Recall and F1-score. We give an example of how these metrics are computed in this task. Suppose that an event-character pair $(s_t, c_i)$ has an annotated set of emotions $Y_{s_t,c_i} = \{emotion_1, emotion_2, emotion_3\}$ and that, using CHARET, we predict the set of emotions $\hat{Y}_{s_t,c_i} = \{emotion_1, emotion_2, emotion_4\}$. In this case, we have 2 True Positives, 1 False Positive, 1 False Negative and 4 True Negatives. After summing each of these quantities across all event-character pairs, we compute the metrics.

We show the relative frequency at which each emotion is annotated for a character in a story event on the training set in Table 2. These frequencies are used to define the emotion-specific thresholds $k_y$ in the zero-shot setting.

Table 2: Relative frequency (%) at which each emotion is annotated for a character in a story event on the training set.

Emotion        Actors   Objects
Surprise       38.8     32.6
Disgust        18.3     13.6
Sadness        25.2     19.9
Joy            53.0     33.4
Anger          19.3     15.1
Fear           26.3     20.1
Trust          34.1     24.0
Anticipation   56.4     33.7

Table 3 shows the main results obtained on the task, compared with the discussed baselines. Our approach appears in the table under the name CHARET. The best results of each learning setting (zero-shot, few-shot and supervised) are in bold.

Table 3: Performance of our approach and of previous approaches on the StoryCommonsense emotion classification task.

Model              Precision   Recall   F1
Zero-shot
Random             20.6        20.8     20.7
COMET-DynaGen      31.2        65.1     42.2
CHARET             –           –        –
Supervised
TF-IDF             20.1        24.1     21.9
GloVe              15.2        30.6     20.3
LSTM               20.3        30.4     24.3
CNN                21.2        23.4     22.2
BERT               –           –        –
CHARET             46.4        –        –

The results indicate the benefits of our approach compared to the others across all learning settings. Additionally, we notice that performance increases as the level of task-specific knowledge increases: from zero-shot to few-shot, and from few-shot to supervised. It is important to note that the pipeline proposed in this paper may introduce errors that affect the overall performance of the approach.
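The micro-averaged metric computation described in the worked example above can be sketched as follows; `micro_prf` is a hypothetical helper name, not part of the actual evaluation code.

```python
def micro_prf(pairs):
    """Micro-averaged Precision/Recall/F1 over (gold, predicted) emotion-set
    pairs, counting TP/FP/FN per emotion exactly as in the worked example."""
    tp = fp = fn = 0
    for gold, pred in pairs:
        tp += len(gold & pred)   # emotions both annotated and predicted
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but not predicted
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Running it on the example pair from the text (2 TP, 1 FP, 1 FN) yields precision, recall and F1 all equal to 2/3.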
Table 4: Performance of character role-labeling.

Precision   Recall   F1
89.0        63.5     74.1

The reason is that each step of the pipeline uses tools that may not produce an accurate output at all times. The output of NeuralCoref feeds PredPatt, which in turn is responsible for creating the event-character pairs $(s_t, c_i)$. We evaluate how much error is introduced by the two tools used for labeling a character's role in a story event, NeuralCoref and PredPatt, on the training set. Table 4 shows the results. They show that our character role-labeling step is precise to some extent, as 89% of the characters we classify as actors of events are correctly classified. However, recall is not as high as we would like. This represents an additional avenue for improvement: if we can improve the character role-labelling process (e.g., by using improved language models), we believe we will be able to achieve better results in the downstream task of emotion tracking and prediction.

DISCUSSION

Our approach explored how data-driven and semantic tools, combined, allow an agent to keep track of the emotional state of characters in a story [RQ1] the way humans do. The results indicate that our character-based approach to emotion inference outperforms previous machine learning approaches on a dataset annotated for concepts that people use to decode social interactions.

Revisiting the sub-research questions in Section 3, we can affirm that by considering the semantics of emotions, and thus conducting semantic-aware reasoning, we can make better use of a commonsense inference tool that links causes and effects. Emotion classification is a task that entails many other pre-processing steps (e.g., personality recognition, anaphora resolution, polarity detection, etc.) that are not considered in this work. Yet, our work points out that an approach combining data and semantic analysis should be considered in order to create more believable interactions between IVAs and humans. At the same time, in real-world applications we should expect noisier data points with a poorer structure, which could lead to poorer results. Predicate and argument extraction, the backbone of our approach, may be more challenging under those conditions, as shown in open-information scenarios.
CONCLUSION

Advances in NLP have propelled the use of Intelligent Virtual Agents in a myriad of contexts, in particular in the role of automated assistants. These systems are a good example that data-driven approaches to human-agent interaction alone are not sufficient to produce socially resonant behavior. We proposed CHARET, a character-centered approach that leverages a simple form of semantic role labelling and state-of-the-art commonsense and language-processing tools for the task of emotion tracking and prediction from events. We validated our approach on a well-structured dataset with clear-cut events that represent a coherent story. Our approach outperforms previous works that do not account for the semantics and subjectiveness of emotions when inferring the emotional state of characters. Although not yet ideal, the results obtained in the few-shot setting, requiring only a small amount of training, reached an F1 score of 53.1, a 25.8% increase over the COMET-DynaGen baseline. These results are promising, as they suggest that a layered approach to emotion tracking and prediction can yield better results than end-to-end approaches.

In future work, we intend to explore whether this approach performs well in more challenging domains, namely Fairy Tales. Additionally, we will consider including other aspects that define a character and a situation to further capture emotional content from events. We are also interested in exploring whether the same commonsense tools facilitate this type of inference and what their limitations are.
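As a rough illustration of the layered approach described above, the sketch below keys a commonsense lookup on a character's role in an event, in the spirit of ATOMIC's xReact (actor's reaction) and oReact (reaction of others) relations. The lookup table, event names, and function name are toy assumptions for illustration, not outputs of COMET or part of the actual system.

```python
# Toy cause-effect table in the spirit of ATOMIC's xReact/oReact
# relations (hypothetical entries; a real system would query a
# commonsense model such as COMET instead of a static dict).
REACTIONS = {
    "win prize": {"xReact": "joy", "oReact": "admiration"},
    "break toy": {"xReact": "guilt", "oReact": "sadness"},
}

def infer_emotion(event, role, prior="neutral"):
    """Pick the relation by role: actors get xReact, affected
    characters get oReact; bystanders keep their prior emotion."""
    if role == "actor":
        relation = "xReact"
    elif role == "object":
        relation = "oReact"
    else:
        return prior
    return REACTIONS.get(event, {}).get(relation, prior)

print(infer_emotion("break toy", "actor"))   # guilt
print(infer_emotion("break toy", "object"))  # sadness
```

This role-conditioned lookup is what distinguishes a character-centered layer from an end-to-end classifier: the same event yields different emotions for different participants.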
ACKNOWLEDGMENTS
This work was partially supported by national funds through Fundação para a Ciência e Tecnologia under project SLICE with reference PTDC/CCI-COM/30787/2017, and by University of Lisbon, Instituto Superior Técnico and INESC-ID multi-annual funding with reference UIDB/50021/2020.