Emotion-Aware, Emotion-Agnostic, or Automatic: Corpus Creation Strategies to Obtain Cognitive Event Appraisal Annotations
Jan Hofmann, Enrica Troiano, and Roman Klinger
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany
{jan.hofmann,enrica.troiano,roman.klinger}@ims.uni-stuttgart.de

Abstract
Appraisal theories explain how the cognitive evaluation of an event leads to a particular emotion. In contrast to theories of basic emotions or affect (valence/arousal), this theory has not received a lot of attention in natural language processing. Yet, in psychology it has been proven powerful: Smith and Ellsworth (1985) showed that the appraisal dimensions attention, certainty, anticipated effort, pleasantness, responsibility/control and situational control discriminate between (at least) 15 emotion classes. We study different annotation strategies for these dimensions, based on the event-focused enISEAR corpus (Troiano et al., 2019). We analyze two manual annotation settings: (1) showing the text to annotate while masking the experienced emotion label; (2) revealing the emotion associated with the text. Setting 2 enables the annotators to develop a more realistic intuition of the described event, while Setting 1 is a more standard annotation procedure, purely relying on text. We evaluate these strategies in two ways: by measuring inter-annotator agreement and by fine-tuning RoBERTa to predict appraisal variables. Our results show that knowledge of the emotion increases annotators' reliability. Further, we evaluate a purely automatic rule-based labeling strategy (inferring appraisal from annotated emotion classes). Training on automatically assigned labels leads to a competitive performance of our classifier, even when tested on manual annotations. This is an indicator that it might be possible to automatically create appraisal corpora for every domain for which emotion corpora already exist.

Introduction

Automatically detecting emotions in written texts consists of mapping textual units, like documents, paragraphs, or sentences, to a predefined set of emotions. Common sets of classes used for this purpose rely on psychological theories such as those proposed by Ekman (1992) (anger, disgust, fear, joy, sadness, surprise) or Plutchik (2001). These theories are based on the assumption that there is a restricted number of emotions that have prototypical realizations. However, not all sets of emotions are appropriate for every domain. For instance, Dittrich and Zepf (2019) argue that some of the basic emotions are too strong for measuring how people feel when driving a car and, based on that, Cevher et al. (2019) resort to joy, annoyance (instead of anger), insecurity (instead of fear), boredom, and relaxation to classify in-car utterances. Haider et al. (2020) model the emotional perception of poetry and opt for the categories beauty/joy, sadness, uneasiness, vitality, awe/sublime, suspense, humor, nostalgia, and annoyance, following the definition of aesthetic emotions (Schindler et al., 2017; Menninghaus et al., 2019). Demszky et al. (2020) define a taxonomy of emotions, reaching a high coverage while maintaining inter-class relations.

An alternative to the use of categorical variables are the so-called "dimensional" approaches. The most popular of them models affective experiences along the variables of dominance, valence, and arousal (Russell and Mehrabian, 1977, VAD). Feldman Barrett (2006, 2017) theorizes that emotions are interpretations of continuous affective states experiencers find themselves in. Still, as Smith and Ellsworth (1985) note, not all emotions can be distinguished based on valence and arousal.
One might argue that predicting three continuous variables instead of a richer set of categories is a simplification and can be limiting for downstream applications of emotion analysis models. Smith and Ellsworth (1985) particularly argue that the VAD model does not capture all relevant aspects of an emotion in the context of an event. In a fight or flight situation (Cannon, 1929), for instance, the decision to take one of these two actions is mostly made based on the effort that the emotion experiencer anticipates, but this is not represented by VAD. Therefore, Smith and Ellsworth (1985) propose a dimensional approach with the appraisal variables of how pleasant an event is (pleasantness, likely to be associated with joy, but unlikely to appear with disgust), how much effort an event can be expected to cause (anticipated effort, likely to be high when anger or fear is experienced), how certain the experiencer is in a specific situation (certainty, low, e.g., in the context of hope or surprise), how much attention is devoted to the event (attention, likely to be low, e.g., in the case of boredom or disgust), how much responsibility the experiencer of the emotion has for what has happened (self-other responsibility/control, high for feeling guilt or pride), and how much the experiencer feels to be controlled by the situation (situational control, high in anger).

Emotion | Appraisals | Text
Joy | Attention, Certainty, Pleasant, Sit. Ctrl. | I felt ... when I knew that I was going back to Florida a year earlier than I thought I would.
Disgust | Attention, Certainty, Effort, Sit. Ctrl. | I felt ... when my kitten was sick and I had to clean it up.
Fear | Attention, Effort, Sit. Ctrl. | I felt ... when I was having a hard attach.
Guilt | Attention, Certainty, Effort, Respons., Control | I felt ... when I went on holiday and left our cat behind.
Sadness | Attention, Certainty, Sit. Ctrl. | I felt ... when I found out one of my favourite shops had shut down.

Table 1: Examples from the corpus of Hofmann et al. (2020).

As the cognitive appraisal is a fundamental subcomponent of emotions, we deem that appraisal dimensions are useful to perform emotion recognition, and that even the prediction of appraisals themselves can contribute to computational approaches to affective states. These appraisal dimensions have only recently found application in automatic emotion analysis in text: Hofmann et al. (2020) re-annotated a corpus of 1001 English emotional event descriptions (Troiano et al., 2019) for which the experienced emotion has been disclosed by the author of the description (Table 1 shows examples from their corpus). Their annotation is designed as a preliminary step for inferring discrete emotion categories. In contrast, we argue that the prediction of appraisal dimensions is in itself valuable. This intuition has an impact on our annotation strategy. While Hofmann et al.
(2020) did not show any emotion label to the annotators, thus avoiding information leaks, we hypothesize that knowing such an emotion helps in understanding how the described events were originally appraised by their experiencers: at times, properly annotating appraisals as a third party might be unfeasible without having prior access to emotions.

We test this by comparing three annotation procedures: (1) we give the annotator access only to the text but not to its emotion label; (2) we give the annotator access to the text and the emotion, and evaluate if this additional information has an impact on annotation reliability and on the performance of a pretrained transformer-based classifier fine-tuned on these data; and (3) we automatically infer the appraisal dimensions from existing emotion annotations, investigating the hypothesis that manual annotation might not be necessary.

Our main contributions are that we show that (a) appraisal annotation is more reliable when annotators have access to the emotion label of the original experiencer; hence, the event description itself does not carry sufficient information for annotation.

Related Work

There is a wealth of literature in psychology surrounding emotions, specifically regarding the way they are elicited, their universal validity, their number and stereotypical expressions, and their function (Scherer, 2000; Gendron and Feldman Barrett, 2009). The two prominent traditions which have dominated the field of emotion classification in natural language processing are discrete and dimensional models (Kim and Klinger, 2019).

Next to the creation of lexicons for emotion analysis (Pennebaker et al., 2001; Strapparava and Valitutti, 2004; Mohammad et al., 2013; Mohammad, 2018, i.a.), the annotation of text corpora received substantial attention (Bostan and Klinger, 2018). These corpora vary across emotion categories and domains, with discrete classes being dominant; some exceptions focused on valence and arousal annotations are Buechel and Hahn (2017), Preoţiuc-Pietro et al. (2016), and Yu et al. (2016). For instance, the ISEAR study by Scherer and Wallbott (1994) led to self-reports of emotionally connotated events. Its creators aimed at understanding which aspects of emotions are universal and which are relative to culture. It was built by asking students to recall an emotion-inducing event and to describe it.

Other efforts focused more on creating corpora specifically for emotion analysis in NLP. Troiano et al. (2019) built enISEAR and deISEAR, whose 1001 event descriptions were collected via crowdsourcing, with a questionnaire inspired by ISEAR, both in English and in German. TEC (Mohammad, 2012), another popular resource, is bigger in size (≈
21k instances), contains tweets and was automatically annotated with hashtags. The Blogs corpus by Aman and Szpakowicz (2007) has sentence-level annotations for 5205 texts, annotated by multiple raters. While ISEAR, enISEAR and deISEAR are focused on describing specific emotion-inducing events, the Blogs corpus and TEC are more general. This is also the set of corpora that we use in our study (a more comprehensive resource overview was made available by Hakak et al. (2017) and Bostan and Klinger (2018)).
A richer perspective on emotions and their experience than affect models or fundamental emotion sets is provided by appraisal models (Scherer, 2009b), which did not receive a lot of attention from the NLP community so far. Appraisals are immediate evaluations of situations which guide the emotion felt by the experiencer (Scherer, 2009a). More precisely, an emotion is a synchronized change in five organismic subsystems (i.e., cognitive, peripheral efference, motivational, motor expression and subjective feeling) in response to the evaluation of a stimulus event important to an individual. Emotion states can be distinguished on the basis of their accompanying appraisals. For instance, fear emerges when an event is appraised as unforeseen and disagreeable: a frightening event is one appraised as unforeseen, unpleasant, and contrary to one's goals (Mortillaro et al., 2012). The cognitive part of the emotion is the one guiding the evaluation of the stimulus along different dimensions. According to Scherer et al. (2001), they are relevance (i.e., the pleasantness of the event, and its relevance for one's goals), implication (i.e., its potential consequences), coping potential (i.e., one's ability to adjust to or control the situation) and normative significance (i.e., its congruity with one's values and beliefs). In a similar vein, Smith and Ellsworth (1985) argue that six cognitive appraisal dimensions can differentiate emotional experiences, as there is a relationship between the way situations are appraised along such dimensions and the experienced emotion. They are pleasantness, anticipated effort, certainty, attention, responsibility/control and situational control. We use their model to explore appraisals in text.

We adhere to the annotation guidelines and the appraisal dimensions of Hofmann et al. (2020), splitting the original situational control from Smith and Ellsworth (1985) into control and circumstance. Our judges take binary decisions with respect to seven appraisal dimensions. We ask them the following questions: "Most probably, at the time when the event happened, the writer ... wanted to devote further attention to the event (Attention); ... was certain about what was happening (Certainty); ... had to expend mental or physical effort to deal with the situation (Effort); ... found that the event was pleasant (Pleasantness); ... was responsible for the situation (Responsibility); ... found that he/she was in control of the situation (Control); ... found that the event could not have been changed or influenced by anyone (Circumstance)."
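For concreteness, the following is a minimal sketch of how one annotated instance could be represented with these seven binary dimensions; the class and field names are illustrative assumptions, not the authors' data format.

```python
# Hypothetical representation of one annotated enISEAR instance; the field
# names mirror the seven appraisal questions above and are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AppraisalAnnotation:
    text: str                  # event description ("I felt ... when ...")
    emotion: Optional[str]     # shown to the annotator in EmoVis, None in EmoHide
    attention: bool
    certainty: bool
    effort: bool
    pleasantness: bool
    responsibility: bool
    control: bool
    circumstance: bool
```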
To annotate the appraisal dimensions, judges need to make assumptions about the experienced situation. We believe this is possible, at times, purely from the textual description that needs to be judged. Other times, knowing which emotion a person developed might be necessary to understand how the overall experience was originally appraised.

To analyze this assumption and measure the importance of emotion labels for reliably assigning appraisal dimensions, we build our experiment on top of the English enISEAR corpus by Troiano et al. (2019). Its authors asked workers on a crowdsourcing platform to complete sentences like "I felt [emotion name], when/that/because ...", where [emotion name] is replaced by a concrete emotion. In a later annotation round, other annotators had to infer the emotion of the text, and for this reason the creators of the corpus replaced emotion words with "...". The resulting 1001 instances of enISEAR are labeled by the experiencers of the emotion themselves and have masked emotion words in the text.

We use these data to perform two annotation experiments on 210 instances, randomly sampled from enISEAR and stratified by emotion. Two annotators judge all of these instances in two different settings. Setting 1, EmoHide, replicates the study by Hofmann et al. (2020): the emotion label is not available to the annotator. In Setting 2, EmoVis, the emotion is presented along with the text. The two rounds of annotation (first EmoHide, later EmoVis, which makes the emotion available) were separated by six months, to avoid a bias from recalling the previous round. We evaluate the reliability of the annotation via inter-annotator agreement with Cohen's κ (1960), under the hypothesis that having knowledge of the emotion leads to more reliable human annotations.
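As an illustration, agreement per dimension could be computed as in the following sketch; this is not the authors' code, and the function and variable names are assumptions.

```python
# Sketch: Cohen's kappa per appraisal dimension for two annotators.
from sklearn.metrics import cohen_kappa_score

DIMENSIONS = ["attention", "certainty", "effort", "pleasantness",
              "responsibility", "control", "circumstance"]

def agreement_per_dimension(labels_a1, labels_a2):
    """labels_a1, labels_a2: dicts mapping a dimension name to a list of 0/1 labels."""
    return {dim: cohen_kappa_score(labels_a1[dim], labels_a2[dim])
            for dim in DIMENSIONS}
```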
Computational Modelling. One of the annotators annotated the full 1001 instances twice, that is, for the EmoHide and the EmoVis approaches, as a basis to evaluate how well the realization of the appraisal concepts in the corpus can be modelled automatically. As we expect the EmoVis annotations to be more reliable, we also expect the model to perform better.

We use RoBERTa (Liu et al., 2019) with the abstraction layer for TensorFlow as provided by ktrain (Maiya, 2020), and choose the number of epochs to be 5, based on the appraisal prediction and emotion classification tasks in the data by Hofmann et al. (2020) (which we annotate in this paper). Only minor differences in performance can be seen between epochs 4–7. We keep this number of epochs fixed across all experiments and all other parameters at their default. The batch size is 5. More concretely, we opt for a 3 × 10 cross-validation setup. (We acknowledge that model performance on an annotated corpus can only to some degree be used to assess data quality. However, in combination with our inter-annotator agreement assessments, it serves as an indicator of the amount of noise.)
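A minimal sketch of such a fine-tuning run with ktrain is shown below; it is not the authors' code, and everything beyond the stated settings (RoBERTa, 5 epochs, batch size 5), such as the learning rate, the sequence length, and the choice to train one binary classifier per appraisal dimension, is an assumption.

```python
# Hedged sketch: fine-tuning RoBERTa via ktrain for one binary appraisal
# dimension (e.g., "pleasantness"). Learning rate, maxlen, and the
# one-model-per-dimension setup are illustrative assumptions.
import ktrain
from ktrain import text

def train_appraisal_classifier(train_texts, train_labels, val_texts, val_labels):
    # train_labels / val_labels: lists of "0"/"1" strings for one dimension
    t = text.Transformer("roberta-base", maxlen=128, class_names=["0", "1"])
    trn = t.preprocess_train(train_texts, train_labels)
    val = t.preprocess_test(val_texts, val_labels)
    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=5)
    learner.fit_onecycle(5e-5, 5)  # 5 epochs, as stated in the paper
    return ktrain.get_predictor(learner.model, preproc=t)
```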
As Smith and Ellsworth (1985) showed, appraisal dimensions are sufficient to discriminate emotion categories: this is knowledge which we can make use of, and we can leverage their findings to automatically assign discrete appraisal labels to enISEAR (see Table 2) in a rule-based manner. For comparability with the manual annotation setup, we opt for discrete labels which we infer from the continuous principal component analysis values from the original paper.

Emotion | Attention | Certainty | Effort | Pleasant | Resp./Contr. | Sit. Control
Anger | 1 | 1 | 1 | 0 | 0 | 0
Disgust | 0 | 1 | 1 | 0 | 0 | 0
Fear | 1 | 0 | 1 | 0 | 0 | 1
Guilt | 0 | 1 | 1 | 0 | 1 | 0
Joy | 1 | 1 | 0 | 1 | 1 | 0
Sadness | 0 | 1 | 0 | 0 | 0 | 1
Shame | 0 | 0 | 1 | 0 | 1 | 0
Surprise | 1 | 0 | 0 | 1 | 0 | 1

Table 2: Discretized associations between appraisal dimensions and emotion categories, following Smith and Ellsworth (1985), as we use them for automatic annotation in Exp. 2.

The question to be answered is whether this rule-based annotation actually represents the same concepts as the manual annotation. To answer this, we compare the automatic annotation (AutoAppr) with both annotations that have been performed manually (EmoHide, EmoVis). Further, we train a model to predict these automatic annotations and evaluate on the manually annotated labels.

Since the automatic method relies on emotion labels, we expect its annotations to be more similar to EmoVis, where the annotators also have access to this information. For the same reason, we also assume that the model trained on automatically annotated labels performs better on EmoVis than on EmoHide. Finding that models trained on labels assigned in such a rule-based manner show comparable performance to manual annotations (when tested on manual annotations) would suggest that the latter might not be necessary to obtain appraisal prediction models.
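A sketch of this rule-based assignment, as we read it from Table 2, could look as follows; the function name and the input format are illustrative assumptions.

```python
# Emotion -> discretized appraisal profile from Table 2, in the order:
# attention, certainty, effort, pleasantness, responsibility/control,
# situational control.
APPRAISAL_RULES = {
    "anger":    (1, 1, 1, 0, 0, 0),
    "disgust":  (0, 1, 1, 0, 0, 0),
    "fear":     (1, 0, 1, 0, 0, 1),
    "guilt":    (0, 1, 1, 0, 1, 0),
    "joy":      (1, 1, 0, 1, 1, 0),
    "sadness":  (0, 1, 0, 0, 0, 1),
    "shame":    (0, 0, 1, 0, 1, 0),
    "surprise": (1, 0, 0, 1, 0, 1),
}

def auto_appraisal_labels(instances):
    """instances: iterable of (text, emotion) pairs with emotions from Table 2."""
    return [(text, APPRAISAL_RULES[emotion.lower()]) for text, emotion in instances]
```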
In this automatic setup, we merge responsibility and control. While they are divided in the manually annotated corpora, this separation is not available in the results by Smith and Ellsworth (1985). This affects the comparability of the averages of performance measures between Exp. 1 and 2.

Further, under the assumption that automatic annotation shows competitive results on the manually annotated corpus enISEAR, we extend this analysis to other resources for corpus generalization. In addition to enISEAR, we use the original ISEAR dataset (Scherer and Wallbott, 1994), the German event corpus deISEAR (Troiano et al., 2019) and, as resources without a focus on events, the Twitter Emotion Corpus (TEC) (Mohammad, 2012) and the Blogs corpus (Aman and Szpakowicz, 2007). Since these corpora are not manually annotated for appraisals, we only evaluate on automatic appraisal annotations.

In Experiment 1, we compare the reliability of the annotation with and without access to the emotion label. We show the inter-annotator agreement results in Table 3. As we hypothesized, the agreement on EmoVis is clearly higher than on EmoHide, with a κ of .68 in comparison to .55. The highest agreement increase is observed for attention (+.25) and certainty (+.28), followed by responsibility (+.16). The only decrease in agreement, for control, is comparably small (−.05).

Appraisal | EmoVis κ | EmoHide κ | ∆κ | EmoVis P/R/F | EmoHide P/R/F | ∆F
Attentional Activity | .55 | .30 | +.25 | .79/.84/.82 | .84/.88/.86 | −.04
Certainty | .71 | .43 | +.28 | .94/.97/.96 | .81/.93/.87 | +.09
Anticipated Effort | .44 | .38 | +.06 | .77/.83/.80 | .66/.58/.62 | +.18
Pleasantness | .93 | .87 | +.06 | .92/.94/.93 | .91/.92/.92 | +.01
Responsibility | .80 | .64 | +.16 | .85/.79/.82 | .83/.81/.82 | ±.00
Control | .66 | .71 | −.05 | .64/.49/.56 | .74/.68/.71 | −.15
Circumstance | .65 | .54 | +.11 | .80/.72/.76 | .76/.74/.75 | +.01
Macro ∅ | .68 | .55 | +.13 | .82/.80/.80 | .79/.79/.79 | +.01
Micro ∅ | | | | .84/.85/.84 | .80/.81/.81 | +.03

Table 3: Experiment 1: Cohen's κ between annotators on EmoVis and EmoHide, and modelling experiments. The model is separately trained and tested on EmoVis and EmoHide.

Figure 1 (and Table 9 in the Appendix) shows the distribution of emotions for the different appraisal dimensions: for most dimensions, the annotation becomes more clearly connected to emotions with its availability, with certainty and anticipated effort being exceptions; here, the number of instances of a set of emotion classes partially increases. This confirms that knowledge of the emotion "denoises" the annotation.
Modelling. Table 3 also reports the prediction performance on appraisal classes using RoBERTa. We observe that the performances on EmoVis are higher than on EmoHide (+.03pp on micro F, +.01 on macro F). This is in line with our assumptions, but the improvement is actually lower than we expected, given the more substantial difference in inter-annotator agreement. However, for certainty (+.09) and anticipated effort (+.18), the change is substantial. Attentional activity, which shows a high increase in agreement, has a small decrease in modelling performance (−.04). Control, which does not improve in agreement, has a considerable loss in prediction performance (−.15).

From this experiment, we conclude that the annotation is more reliable with access to the emotion: this is reflected in the modelling results, and to different extents for different appraisal dimensions.

We now evaluate the rule-based annotation procedure, in which appraisal classes are purely assigned by the automatic procedure shown in Table 2. The agreement between the rule-based annotation
AutoAppr and both manual annotations is shown in Table 4. As expected, we observe a higher agreement with EmoVis. Again, the differences are not equally distributed across emotions and they resemble the changes in the other experiments, but agreement is lower between AutoAppr and the manual annotations than between the manual annotations themselves, suggesting that the automatic process does not lead to the same conceptual annotation.

[Figure 1: Frequency distribution of appraisals across emotions for EmoVis (Visual) and EmoHide (Hidden); one panel per appraisal dimension (Attention, Certainty, Effort, Pleasantness, Responsibility, Control, Circumstance), each over the emotions anger, disgust, fear, guilt, joy, sadness, and shame.]

Appraisal | κ AutoAppr vs. EmoVis (A1/A2) | κ AutoAppr vs. EmoHide (A1/A2) | EmoVis P/R/F | EmoHide P/R/F | AutoAppr P/R/F
Attentional Activity | .47/.63 | .33/.48 | .85/.59/.70 | .88/.59/.71 | .86/.85/.85
Certainty | .35/.53 | .50/.51 | .97/.81/.88 | .81/.83/.82 | .88/.90/.89
Anticipated Effort | .31/.08 | .19/.16 | .61/.77/.68 | .39/.79/.52 | .95/.96/.96
Pleasantness | .93/1.00 | .91/.96 | .90/.93/.92 | .91/.95/.93 | .96/.94/.95
Responsibility/Control | .39/.63 | .39/.59 | .66/.81/.72 | .74/.78/.76 | .91/.89/.90
Situational Control | .44/.64 | .34/.59 | .77/.74/.75 | .78/.59/.67 | .84/.83/.84
Macro ∅ | .48/.58 | .44/.54 | .79/.77/.78 | .75/.75/.74 | .90/.89/.90
Micro ∅ | | | .78/.75/.77 | .70/.73/.72 | .90/.90/.90

Table 4: Experiment 2 main results: Cohen's κ between the annotators of EmoHide/EmoVis and AutoAppr, on the subset of 210 instances from enISEAR. The classifier (right-hand columns, modelling with RoBERTa trained on AutoAppr) is trained on the full set of 1001 instances annotated automatically (AutoAppr) and evaluated on all other annotations (cross-validation splits remain the same).
Modelling. To answer the question of how well a model trained on rule-based annotations performs on manual annotations, we test the model three times: on EmoVis, EmoHide, and AutoAppr. (These labels could be compared to human annotations only on EmoVis, which turned out to be more reliable. We also consider EmoHide because it represents standard emotion annotation procedures, where judges assess texts without further information.) The right side of Table 4 reports the results. Note that responsibility and control have been merged, as explained in the experimental setting.

We see that the highest macro-average F is, unsurprisingly, achieved when testing on AutoAppr (.90 F). When testing the same model on EmoVis, the performance drops by 12pp (.78 F), but is still substantially higher than for the corpus in which the emotions were not available to the annotators (.72 F). Note that the performances of .78 F (EmoVis) and .74 F (EmoHide) are not too different from the model trained on manually annotated data, with .80 F and .79 F. We therefore conclude that automatically labeling a corpus with appraisal dimensions leads to a meaningful model.
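The cross-annotation evaluation can be pictured as in the following sketch, which is our assumption about the setup rather than the authors' code: predictions of the model trained on AutoAppr are scored per dimension against a manual gold standard such as EmoVis.

```python
# Sketch: per-dimension precision/recall/F1 of predictions against manual labels.
from sklearn.metrics import precision_recall_fscore_support

def evaluate_dimension(gold_labels, predicted_labels):
    """gold_labels, predicted_labels: lists of 0/1 values for one appraisal dimension."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold_labels, predicted_labels, average="binary", zero_division=0)
    return precision, recall, f1
```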
Corpus Generalization. Finally, we apply the automatic labeling procedure to other emotion corpora. Results are shown in Table 5. Given the different nature of the domains and languages (German vs. English for deISEAR/enISEAR; tweets vs. blog texts for TEC vs. the Blogs corpus), these numbers cannot be directly compared, but we can observe that they are comparably high, similar to the other experiments. We carefully infer (without having compared the predictions on these corpora to manual annotations) that this is an indicator that automatic annotation of appraisal dimensions also works across different corpora and languages.

Appraisal | deISEAR P/R/F | ISEAR P/R/F | TEC P/R/F | Blogs P/R/F
Attentional Activity | 79/68/73 | 83/82/83 | 90/91/91 | 94/94/94
Certainty | 79/90/84 | 89/92/90 | 87/89/88 | 97/97/97
Anticipated Effort | 88/93/91 | 94/95/94 | 74/68/71 | 91/93/92
Pleasantness | 80/69/77 | 91/90/91 | 85/86/86 | 97/97/97
Responsibility/Control | 80/69/74 | 88/85/86 | 79/79/79 | 94/96/95
Situational Control | 73/69/71 | 83/81/82 | 79/79/79 | 88/86/87
Macro ∅ | 81/76/78 | 88/87/88 | 82/82/82 | 94/94/94
Micro ∅ | 82/81/81 | 89/89/89 | 84/84/84 | 94/95/94

Table 5: Experiment 2, generalization to other corpora. All results are averages across 3 × 10 cross-validations. Note that the last three columns of Table 4 correspond to the same setting as here.
The data that we use was made available to support appraisal-based research in emotion analysis. It consists of the same instances we annotated in Hofmann et al. (2020). However, in this previous work, each instance was judged by three annotators who did not have access to the emotion labels of the texts, and the experiments were performed on labels obtained with the majority vote of the annotators. Instead, for the current experiments, the labels by only one annotator on all instances have been used. Therefore, the experiments of the two papers are not strictly comparable. In addition, Hofmann et al. (2020) adopted a CNN-based classifier. In brief, there are two sources of non-comparability in our experiments: different label sets and different models. We aimed at leveraging a more state-of-the-art transformer-based model, but at the same time, we needed different label sets to better understand the appraisal annotation processes.

For transparency reasons, we show the performance of our RoBERTa model on the original labels against the results by Hofmann et al. (2020). Table 6 compares the two studies with respect to appraisals and Table 7 with respect to emotion predictions. The emotion recognition models consist of a text-based model (T→E) and a pipeline that first predicts the appraisal and then classifies the emotion, without access to the text, with a two-layer dense neural network (T→A, A→E). To measure the complementarity of these two settings, a third model is an oracle ensemble (T→A→E + T→E) which accepts a prediction as a true positive if one of the two models provides the correct prediction.

On this original data set by Hofmann et al. (2020), our model constitutes a new state of the art. The micro-averaged appraisal prediction with RoBERTa is 10pp higher than the original CNN-based model; the emotion classification shows similar improvements, and the overall relation between the model configurations remains comparable.

Appraisal | CNN P/R/F | RoBERTa P/R/F
Attention | 81/84/82 | 86/90/88
Certainty | 84/86/85 | 87/94/91
Effort | 68/68/68 | 79/77/78
Pleasantness | 79/63/70 | 92/92/92
Responsibility | 74/68/71 | 86/85/85
Control | 63/49/55 | 81/73/77
Circumstance | 65/58/61 | 74/69/71
Macro ∅ | 73/68/70 | 83/83/83
Micro ∅ | 77/74/75 | 84/85/85

Table 6: RoBERTa model performance on predicting appraisals on the original data by Hofmann et al. (2020), compared to their CNN results.

Emotion | T→E CNN | T→E RoBERTa | T→A,A→E CNN | T→A,A→E RoBERTa | T→A→E + T→E CNN | T→A→E + T→E RoBERTa
Anger | 51/52/52 | 62/62/62 | 34/62/44 | 44/72/54 | 66/81/73 | 72/85/78
Disgust | 65/63/64 | 70/71/71 | 59/34/43 | 65/38/48 | 78/68/73 | 86/74/80
Fear | 69/71/70 | 80/82/81 | 55/55/55 | 65/76/70 | 76/77/77 | 85/92/88
Guilt | 47/42/44 | 60/60/60 | 38/50/43 | 50/68/58 | 60/63/62 | 72/80/76
Joy | 74/80/77 | 92/96/94 | 77/69/72 | 92/95/93 | 79/80/80 | 94/98/96
Sadness | 69/67/68 | 82/81/81 | 58/40/47 | 74/55/63 | 74/70/72 | 87/84/86
Shame | 44/45/45 | 57/54/55 | 36/24/29 | 51/23/32 | 58/51/54 | 77/59/67
Macro ∅ | 60/60/60 | 72/72/72 | 51/48/48 | 63/61/60 | 70/70/70 | 82/82/82
Micro ∅ (F) | 60 | 72 | 48 | 61 | 70 | 82

Table 7: Comparison of the CNN (Hofmann et al., 2020) and our RoBERTa model (P/R/F) on the Text-to-Emotion baseline (T→E), the pipeline experiment (T→A, A→E) and the oracle ensemble experiment (T→A→E + T→E). These experiments follow the model configurations by Hofmann et al. (2020).
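The oracle-ensemble scoring can be read as in the following sketch, which reflects our reading of the description above with illustrative names: an instance counts as correct if either the text-based model or the appraisal pipeline predicts the gold emotion.

```python
# Sketch of the oracle ensemble (T->A->E + T->E): accept the prediction of
# whichever model is correct; otherwise fall back to the text-based model.
def oracle_ensemble_predictions(gold, text_model_preds, pipeline_preds):
    return [g if (t == g or p == g) else t
            for g, t, p in zip(gold, text_model_preds, pipeline_preds)]
```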
To better understand how revealing emotions affects the annotations in Experiment 1, we provide some concrete examples. Table 8 reports instances from enISEAR. We show for which appraisal variables the agreement changes, by marking the appraisal with + or −. For instance, −attention means that the annotators came to disagree on that appraisal dimension when the emotion was uncovered, while +pleasantness indicates that they came to agree thanks to the knowledge of the emotion label. Note that + does not mean that the dimension was marked as 1 by both annotators. The examples are sorted by the sum of changes in agreement.

ID | Score | Agreement Change | Input Text
1 | + | +e +p +r +ci | I felt ... when I was abseiling down a cliff-face.
2 | + | +ce +r +ci | I felt ... when I got a new job.
4 | ± | −a −ce −r +ci | I felt ... when I found out that my daughter had been having a difficult time and I didn't realise straight away what she was going through.
5 | − | −a −e | I felt ... when we were charged by a care home for the three months after my father had died, even though we had emptied his room the day after his death.
6 | − | −a −ce −r | I felt ... when cycling home after a long ride one evening, unaware how dark it had become, and thus relying on some very weak led lights that I'd never tested in complete darkness - I could barely see ten feet ahead of me.

Table 8: Examples of differences between annotations with masked and visible emotion labels. + and − indicate the agreement and the disagreement on a specific dimension which is reached after making the emotion visible. The score is the sum of agreement changes, either improvements (+1 for each dimension) or degradations (−1). a: attention, ce: certainty, e: anticipated effort, p: pleasantness, r: responsibility, ci: circumstance.

In Example (1), an event is described in a way which leaves open whether there is responsibility, pleasantness, anticipated effort, and even whether the experiencer is entirely certain about what is happening. With the knowledge that the emotion is fear, it becomes clear that the situation does involve anticipated effort, is not pleasant, and that circumstance is not likely. The annotators also agree here that there is responsibility involved, which is likely an interpretation based on world knowledge.

In the situation of getting a new job (Example (2)), knowing the emotion adds agreement regarding certainty, which is in line with added agreement that the person was responsible. Example (3) is an instance in which the situation was already entirely clear without knowing the emotion: the experiencer participated in gossip. That is a certain, non-pleasurable situation (when recognizing this) which is under their own control. Knowledge that guilt has been experienced does not add anything.

Example (4) shows that complex situations cause more disagreement in annotations, which is not necessarily resolved by knowing the emotion. The described event is about a negative emotion felt because the experiencer did not recognize the bad mood of the daughter. Annotators come to disagree about attention, certainty and responsibility.

Example (5) describes another situation in which a negative event is discussed. Knowledge of the emotion puts a clear focus away from the sad part of the description (the father dying) and puts it on something that causes anger. However, this shift does not resolve appraisal disagreements but indeed adds on top of them, with attention and anticipated effort.

Finally, Example (6) is another long description with the annotators' focus on different aspects, one on the darkness (hence, no responsibility), the other on the cycling (responsibility). Here, knowledge of the emotion does not change the interpretability of the event. However, it informs on which part of the described situation the original author focused.

These examples show that EmoVis helps solve ambiguities when events can be associated with multiple emotions; other times, it helps people give more weight to specific portions of the texts. In the first case, agreement tends to be reached more easily.

Conclusion & Future Work
We analyzed how to build corpora of text annotated with appraisal variables, and we evaluated how well such concepts can be modelled. By doing so, we brought together emotion analysis and a strand of emotion research in psychology which has received little attention from the computational linguistics community. We propose that, in addition to well-established approaches to emotion analysis, like affect-oriented dimensional approaches or classification into predefined emotion inventories, psychological models of appraisals be considered in future emotion-based studies, particularly those relying on event-oriented resources.

The use of appraisals is interesting from a theoretical perspective: motivated by psychology, we leveraged the cognitive mechanisms underlying emotions, thus accounting for many complex patterns in which humans appraise an event and emotionally react to it. In this light, it is interesting in itself that our annotators were able to empathically reconstruct the event appraisals experienced by others, even without knowing their emotion.

From a practical perspective, appraisal annotations are less prone to being poorly chosen for particular domains, in comparison to regular emotion classes, as the actual feeling develops based on the cognitive evaluation of an event. We also have shown that event descriptions alone might not be sufficient to properly annotate the hypothetical appraisal of the experiencer (which is, however, also an issue with traditional emotion analysis annotations and models: we cannot look into the feeler, we deal with private states). This shows, and is presumably the reason, that additional context (e.g., the emotion label) is required.

Some implications for future research and development follow: ideally, appraisal (and emotion) annotations should stem directly from the experiencer. This is not doable in many NLP settings. For instance, when analyzing literature, it is impossible to ask fictional characters for their current event appraisal. However, we presume that settings on social media might be realistic, for instance by probing appropriate distant labeling methods, e.g., a careful choice of hashtags. If this is unfeasible, because text authors do not disclose their appraisals, the available emotion labels still represent a valuable source of information: as we have shown, they can guide the interpretation of the described events, and hence the way in which these are post-assigned appraisal dimensions.

Finally, we provided evidence that even if a model is trained on automatically obtained appraisal labels, it is still capable of substantial performance. Therefore, we conclude that more corpora with appraisal dimensions from different languages and domains should be created from scratch. In the meantime, one can build on top of the rich set of available emotion corpora and automatically create appraisal-annotated resources out of them.
Acknowledgements
This work was supported by Deutsche Forschungsgemeinschaft (projects SEAT, KL 2869/1-1, and CEAT, KL 2869/1-2) and has partially been conducted within the Leibniz WissenschaftsCampus Tübingen "Cognitive Interfaces". We thank Kai Sassenberg and Laura Oberländer for inspiration and fruitful discussions.
References
Saima Aman and Stan Szpakowicz. 2007. Identifying expressions of emotion in text. In Text, Speech and Dialogue, pages 196–205, Berlin, Heidelberg. Springer Berlin Heidelberg.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain. Association for Computational Linguistics.

Walter B. Cannon. 1929. Bodily Changes in Pain, Hunger, Fear, and Rage. Appleton-Century, New York, US.

Deniz Cevher, Sebastian Zepf, and Roman Klinger. 2019. Towards multimodal emotion recognition in German speech events in cars using transfer learning. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 79–90, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, Online. Association for Computational Linguistics.

Monique Dittrich and Sebastian Zepf. 2019. Exploring the validity of methods to track emotions behind the wheel. In Persuasive Technology: Development of Persuasive and Behavior Change Support Systems, pages 115–127, Cham. Springer International Publishing.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Lisa Feldman Barrett. 2006. Solving the emotion paradox: Categorization and the experience of emotion. Personality and Social Psychology Review, 10(1):20–46.

Lisa Feldman Barrett. 2017. How Emotions Are Made. Houghton Mifflin Harcourt, New York, USA.

Maria Gendron and Lisa Feldman Barrett. 2009. Reconstructing the past: A century of ideas about emotion in psychology. Emotion Review, 1(4):316–339.

Thomas Haider, Steffen Eger, Evgeny Kim, Roman Klinger, and Winfried Menninghaus. 2020. PO-EMO: Conceptualization, annotation, and modeling of aesthetic emotions in German and English poetry. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1652–1663, Marseille, France. European Language Resources Association.

Nida Manzoor Hakak, Mohsin Mohd, Mahira Kirmani, and Mudasir Mohd. 2017. Emotion analysis: A survey. In 2017 International Conference on Computer, Communications and Electronics (Comptelix), pages 397–402.

Jan Hofmann, Enrica Troiano, Kai Sassenberg, and Roman Klinger. 2020. Appraisal theories for emotion classification in text.

Evgeny Kim and Roman Klinger. 2019. A survey on sentiment and emotion analysis for computational literary studies. Zeitschrift für digitale Geisteswissenschaften, 4.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Arun S. Maiya. 2020. ktrain: A low-code library for augmented machine learning.

Winfried Menninghaus, Valentin Wagner, Eugen Wassiliwizky, Ines Schindler, Julian Hanich, Thomas Jacobsen, and Stefan Koelsch. 2019. What are aesthetic emotions? Psychological Review, 126(2):171.

Saif Mohammad. 2012. #Emotional tweets. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 246–255, Montréal, Canada. Association for Computational Linguistics.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184, Melbourne, Australia. Association for Computational Linguistics.

Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 321–327, Atlanta, Georgia, USA. Association for Computational Linguistics.

Marcello Mortillaro, Ben Meuleman, and Klaus R. Scherer. 2012. Advocating a componential appraisal model to guide emotion recognition. International Journal of Synthetic Emotions (IJSE), 3(1):18–32.

James W. Pennebaker, Martha E. Francis, and Roger J. Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71:2001.

Robert Plutchik. 2001. The nature of emotions. American Scientist, 89(4):344–350.

Daniel Preoţiuc-Pietro, H. Andrew Schwartz, Gregory Park, Johannes Eichstaedt, Margaret Kern, Lyle Ungar, and Elisabeth Shulman. 2016. Modelling valence and arousal in Facebook posts. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 9–15, San Diego, California. Association for Computational Linguistics.

James A. Russell and Albert Mehrabian. 1977. Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3):273–294.

Klaus R. Scherer. 2000. Psychological models of emotion. In The Neuropsychology of Emotion, Series in Affective Science, pages 137–162. Oxford University Press, New York, NY, US.

Klaus R. Scherer. 2009a. The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion, 23(7):1307–1351.

Klaus R. Scherer. 2009b. Emotions are emergent processes: They require a dynamic computational architecture. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1535):3459–3474.

Klaus R. Scherer, Angela Schorr, and Tom Johnstone. 2001. Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Ines Schindler, Georg Hosoya, Winfried Menninghaus, Ursula Beermann, Valentin Wagner, Michael Eid, and Klaus R. Scherer. 2017. Measuring aesthetic emotions: A review of the literature and a new assessment tool. PLoS ONE, 12(6):e0178899.

Craig A. Smith and Phoebe C. Ellsworth. 1985. Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology, 48(4):813–838.

Carlo Strapparava and Alessandro Valitutti. 2004. WordNet Affect: An affective extension of WordNet. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal. European Language Resources Association (ELRA).

Enrica Troiano, Sebastian Padó, and Roman Klinger. 2019. Crowdsourcing and validating event-focused emotion corpora for German and English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4005–4011, Florence, Italy. Association for Computational Linguistics.

Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016. Building Chinese affective resources in valence-arousal dimensions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 540–545, San Diego, California. Association for Computational Linguistics.
A Corpus Statistics for Manual Annotations of enISEAR
Table 9 shows the corpus statistics for the manually annotated corpora. In particular, it depicts how the availability of the emotion to the annotators influences the distribution of appraisal labels. The same numbers are also shown in a comparative manner in Figure 1. For most appraisal dimensions, the annotations become more specific, i.e., narrower across the emotion counts, mostly manifesting in lower counts. An exception is certainty, which shows higher counts for all emotions, and anticipated effort, which receives higher counts for shame, sadness, fear, and guilt, but not for anger.