Emotion-Aware, Emotion-Agnostic, or Automatic: Corpus Creation Strategies to Obtain Cognitive Event Appraisal Annotations
Jan Hofmann, Enrica Troiano, and Roman Klinger
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany
{jan.hofmann,enrica.troiano,roman.klinger}@ims.uni-stuttgart.de

Abstract
Appraisal theories explain how the cognitive evaluation of an event leads to a particular emotion. In contrast to theories of basic emotions or affect (valence/arousal), this theory has not received a lot of attention in natural language processing. Yet, in psychology it has been proven powerful: Smith and Ellsworth (1985) showed that the appraisal dimensions attention, certainty, anticipated effort, pleasantness, responsibility/control and situational control discriminate between (at least) 15 emotion classes. We study different annotation strategies for these dimensions, based on the event-focused enISEAR corpus (Troiano et al., 2019). We analyze two manual annotation settings: (1) showing the text to annotate while masking the experienced emotion label; (2) revealing the emotion associated with the text. Setting 2 enables the annotators to develop a more realistic intuition of the described event, while Setting 1 is a more standard annotation procedure, purely relying on text. We evaluate these strategies in two ways: by measuring inter-annotator agreement and by fine-tuning RoBERTa to predict appraisal variables. Our results show that knowledge of the emotion increases annotators' reliability. Further, we evaluate a purely automatic rule-based labeling strategy (inferring appraisal from annotated emotion classes). Training on automatically assigned labels leads to a competitive performance of our classifier, even when tested on manual annotations. This is an indicator that it might be possible to automatically create appraisal corpora for every domain for which emotion corpora already exist.

Introduction

Automatically detecting emotions in written texts consists of mapping textual units, like documents, paragraphs, or sentences, to a predefined set of emotions. Common sets of classes used for this purpose rely on psychological theories such as those proposed by Ekman (1992) (anger, disgust, fear, joy, sadness, surprise) or Plutchik (2001). These theories are based on the assumption that there is a restricted number of emotions that have prototypical realizations. However, not all sets of emotions are appropriate for every domain. For instance, Dittrich and Zepf (2019) argue that some of the basic emotions are too strong for measuring how people feel when driving a car and, based on that, Cevher et al. (2019) resort to joy, annoyance (instead of anger), insecurity (instead of fear), boredom, and relaxation to classify in-car utterances. Haider et al. (2020) model the emotional perception of poetry and opt for the categories beauty/joy, sadness, uneasiness, vitality, awe/sublime, suspense, humor, nostalgia, and annoyance, following the definition of aesthetic emotions (Schindler et al., 2017; Menninghaus et al., 2019). Demszky et al. (2020) define a taxonomy of emotions, reaching a high coverage while maintaining inter-class relations.

An alternative to the use of categorical variables are the so-called "dimensional" approaches. The most popular of them models affective experiences along the variables of dominance, valence, and arousal (Russell and Mehrabian, 1977, VAD). Feldman Barrett (2006, 2017) theorizes that emotions are interpretations of continuous affective states experiencers find themselves in. Still, as Smith and Ellsworth (1985) note, not all emotions can be distinguished based on valence and arousal.
One might argue that predicting three continuous variables instead of a richer set of categories is a simplification and can be limiting for downstream applications of emotion analysis models. Smith and Ellsworth (1985) particularly argue that the VAD model does not capture all relevant aspects of an emotion in the context of an event. In a fight or flight situation (Cannon, 1929), for instance, the decision to take one of these two actions is mostly made based on the effort that the emotion experiencer anticipates, but this is not represented by VAD. Therefore, Smith and Ellsworth (1985) propose a dimensional approach with the appraisal variables of how pleasant an event is (pleasantness, likely to be associated with joy, but unlikely to appear with disgust), how much effort an event can be expected to cause (anticipated effort, likely to be high when anger or fear is experienced), how certain the experiencer is in a specific situation (certainty, low, e.g., in the context of hope or surprise), how much attention is devoted to the event (attention, likely to be low, e.g., in the case of boredom or disgust), how much responsibility the experiencer of the emotion has for what has happened (self-other responsibility/control, high for feeling guilt or pride), and how much the experiencer feels to be controlled by the situation (situational control, high in anger).

Emotion | Appraisals | Text
Joy | Attention, Certainty, Pleasant, Sit. Ctrl. | I felt ... when I knew that I was going back to Florida a year earlier than I thought I would.
Disgust | Attention, Certainty, Effort, Sit. Ctrl. | I felt ... when my kitten was sick and I had to clean it up.
Fear | Attention, Effort, Sit. Ctrl. | I felt ... when I was having a hard attach.
Guilt | Attention, Certainty, Effort, Respons., Control | I felt ... when I went on holiday and left our cat behind.
Sadness | Attention, Certainty, Sit. Ctrl. | I felt ... when I found out one of my favourite shops had shut down.

Table 1: Examples from the corpus of Hofmann et al. (2020).

As the cognitive appraisal is a fundamental subcomponent of emotions, we deem that appraisal dimensions are useful to perform emotion recognition, and that even the prediction of appraisals themselves can contribute to computational approaches to affective states. These appraisal dimensions have only recently found application in automatic emotion analysis in text: Hofmann et al. (2020) re-annotated a corpus of 1001 English emotional event descriptions (Troiano et al., 2019) for which the experienced emotion has been disclosed by the author of the description (Table 1 shows examples from their corpus). Their annotation is designed as a preliminary step for inferring discrete emotion categories. In contrast, we argue that the prediction of appraisal dimensions is in itself valuable. This intuition has an impact on our annotation strategy. While Hofmann et al.
(2020) did not show any emotion label to the annotators, thus avoiding information leaks, we hypothesize that knowing such an emotion helps in understanding how the described events were originally appraised by their experiencers: at times, properly annotating appraisals as a third party might be unfeasible without having prior access to emotions.

We test this by comparing three annotation procedures: (1) we give the annotator access only to the text but not to its emotion label; (2) we give the annotator access to the text and the emotion, and evaluate if this additional information has an impact on annotation reliability and on the performance of a pretrained transformer-based classifier fine-tuned on these data; and (3) we automatically infer the appraisal dimensions from existing emotion annotations, investigating the hypothesis that manual annotation might not be necessary.

Our main contributions are that we show that (a) appraisal annotation is more reliable when annotators have access to the emotion label of the original experiencer; hence, the event description itself does not carry sufficient information for annotation.

Related Work

There is a wealth of literature in psychology surrounding emotions, specifically regarding the way they are elicited, their universal validity, their number and stereotypical expressions, and their function (Scherer, 2000; Gendron and Feldman Barrett, 2009). The two prominent traditions which have dominated the field of emotion classification in natural language processing are discrete and dimensional models (Kim and Klinger, 2019).

Next to the creation of lexicons for emotion analysis (Pennebaker et al., 2001; Strapparava and Valitutti, 2004; Mohammad et al., 2013; Mohammad, 2018, i.a.), the annotation of text corpora received substantial attention (Bostan and Klinger, 2018). These corpora vary across emotion categories and domains, with discrete classes being dominant; some exceptions focused on valence and arousal annotations are Buechel and Hahn (2017), Preoţiuc-Pietro et al. (2016), and Yu et al. (2016). For instance, the ISEAR study by Scherer and Wallbott (1994) led to self-reports of emotionally connotated events. Its creators aimed at understanding which aspects of emotions are universal and which are relative to culture. It was built by asking students to recall an emotion-inducing event and to describe it.

Other efforts focused more on creating corpora specifically for emotion analysis in NLP. Troiano et al. (2019) built enISEAR and deISEAR, whose 1001 event descriptions were collected via crowdsourcing, with a questionnaire inspired by ISEAR, both in English and in German. TEC (Mohammad, 2012), another popular resource, is bigger in size (≈
21k instances), contains tweets and was automatically annotated with hashtags. The Blogs corpus by Aman and Szpakowicz (2007) has sentence-level annotations for 5205 texts, annotated by multiple raters. While ISEAR, enISEAR and deISEAR are focused on describing specific emotion-inducing events, the Blogs corpus and TEC are more general. This is also the set of corpora that we use in our study (a more comprehensive resource overview was made available by Hakak et al. (2017) and Bostan and Klinger (2018)).
A richer perspective on emotions and their experience than affect models or fundamental emotion sets is provided by appraisal models (Scherer, 2009b), which did not receive a lot of attention from the NLP community so far. Appraisals are immediate evaluations of situations which guide the emotion felt by the experiencer (Scherer, 2009a). More precisely, an emotion is a synchronized change in five organismic subsystems (i.e., cognitive, peripheral efference, motivational, motor expression and subjective feeling) in response to the evaluation of a stimulus event important to an individual. Emotion states can be distinguished on the basis of their accompanying appraisals. For instance, fear emerges when an event is appraised as unforeseen and disagreeable: a frightening event is one appraised as unforeseen, unpleasant, and contrary to one's goals (Mortillaro et al., 2012). The cognitive part of the emotion is the one guiding the evaluation of the stimulus along different dimensions. According to Scherer et al. (2001), they are relevance (i.e., the pleasantness of the event, and its relevance for one's goals), implication (i.e., its potential consequences), coping potential (i.e., one's ability to adjust to or control the situation) and normative significance (i.e., its congruity with one's values and beliefs). In a similar vein, Smith and Ellsworth (1985) argue that six cognitive appraisal dimensions can differentiate emotional experiences, as there is a relationship between the way situations are appraised along such dimensions and the experienced emotion. They are pleasantness, anticipated effort, certainty, attention, responsibility/control and situational control. We use their model to explore appraisals in text.

We adhere to the annotation guidelines and the appraisal dimensions of Hofmann et al. (2020), splitting the original situational control from Smith and Ellsworth (1985) into control and circumstance. Our judges take binary decisions with respect to seven appraisal dimensions. We ask them the following questions: "Most probably, at the time when the event happened, the writer ... wanted to devote further attention to the event (Attention); ... was certain about what was happening (Certainty); ... had to expend mental or physical effort to deal with the situation (Effort); ... found that the event was pleasant (Pleasantness); ... was responsible for the situation (Responsibility); ... found that he/she was in control of the situation (Control); ... found that the event could not have been changed or influenced by anyone (Circumstance)."
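For concreteness, the following is a minimal sketch of how one annotated instance could be represented with these seven binary dimensions; the class and field names are illustrative assumptions, not the authors' data format.

```python
# Hypothetical representation of one annotated enISEAR instance; the field
# names mirror the seven appraisal questions above and are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AppraisalAnnotation:
    text: str                  # event description ("I felt ... when ...")
    emotion: Optional[str]     # shown to the annotator in EmoVis, None in EmoHide
    attention: bool
    certainty: bool
    effort: bool
    pleasantness: bool
    responsibility: bool
    control: bool
    circumstance: bool
```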
To annotate the appraisal dimensions, judges need to make assumptions about the experienced situation. We believe this is possible, at times, purely from the textual description that needs to be judged. Other times, knowing which emotion a person developed might be necessary to understand how the overall experience was originally appraised.

To analyze this assumption and measure the importance of emotion labels for reliably assigning appraisal dimensions, we build our experiment on top of the English enISEAR corpus by Troiano et al. (2019). Its authors asked workers on a crowdsourcing platform to complete sentences like "I felt [emotion name], when/that/because ...", where [emotion name] is replaced by a concrete emotion. In a later annotation round, other annotators had to infer the emotion of the text, and for this reason the creators of the corpus replaced emotion words with "...". The resulting 1001 instances of enISEAR are labeled by the experiencers of the emotion themselves and have masked emotion words in the text.

We use these data to perform two annotation experiments on 210 instances, randomly sampled from enISEAR and stratified by emotion. Two annotators judge all of these instances in two different settings. Setting 1, EmoHide, replicates the study by Hofmann et al. (2020): the emotion label is not available to the annotator. In Setting 2, EmoVis, the emotion is presented along with the text. The two rounds of annotation (first EmoHide, later EmoVis, which makes the emotion available) were separated by six months, to avoid a bias from recalling the previous round. We evaluate the reliability of the annotation via inter-annotator agreement with Cohen's κ (1960), under the hypothesis that having knowledge of the emotion leads to more reliable human annotations.
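As an illustration, agreement per dimension could be computed as in the following sketch; this is not the authors' code, and the function and variable names are assumptions.

```python
# Sketch: Cohen's kappa per appraisal dimension for two annotators.
from sklearn.metrics import cohen_kappa_score

DIMENSIONS = ["attention", "certainty", "effort", "pleasantness",
              "responsibility", "control", "circumstance"]

def agreement_per_dimension(labels_a1, labels_a2):
    """labels_a1, labels_a2: dicts mapping a dimension name to a list of 0/1 labels."""
    return {dim: cohen_kappa_score(labels_a1[dim], labels_a2[dim])
            for dim in DIMENSIONS}
```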
Computational Modelling. One of the annotators annotated the full 1001 instances twice, that is, for the EmoHide and the EmoVis approaches, as a basis to evaluate how well the realization of the appraisal concepts in the corpus can be modelled automatically. As we expect the EmoVis annotations to be more reliable, we also expect the model to perform better.

We use RoBERTa (Liu et al., 2019) with the abstraction layer for TensorFlow as provided by ktrain (Maiya, 2020), and choose the number of epochs to be 5, based on the appraisal prediction and emotion classification tasks in the data by Hofmann et al. (2020) (which we annotate in this paper). Only minor differences in performance can be seen between epochs 4–7. We keep this number of epochs fixed across all experiments and all other parameters at their default. The batch size is 5. More concretely, we opt for a 3 × 10 cross-validation setup. (We acknowledge that model performance on an annotated corpus can only to some degree be used to assess data quality. However, in combination with our inter-annotator agreement assessments, it serves as an indicator of the amount of noise.)
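A minimal sketch of such a fine-tuning run with ktrain is shown below; it is not the authors' code, and everything beyond the stated settings (RoBERTa, 5 epochs, batch size 5), such as the learning rate, the sequence length, and the choice to train one binary classifier per appraisal dimension, is an assumption.

```python
# Hedged sketch: fine-tuning RoBERTa via ktrain for one binary appraisal
# dimension (e.g., "pleasantness"). Learning rate, maxlen, and the
# one-model-per-dimension setup are illustrative assumptions.
import ktrain
from ktrain import text

def train_appraisal_classifier(train_texts, train_labels, val_texts, val_labels):
    # train_labels / val_labels: lists of "0"/"1" strings for one dimension
    t = text.Transformer("roberta-base", maxlen=128, class_names=["0", "1"])
    trn = t.preprocess_train(train_texts, train_labels)
    val = t.preprocess_test(val_texts, val_labels)
    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=5)
    learner.fit_onecycle(5e-5, 5)  # 5 epochs, as stated in the paper
    return ktrain.get_predictor(learner.model, preproc=t)
```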
As Smith and Ellsworth (1985) showed, appraisal dimensions are sufficient to discriminate emotion categories: this is knowledge which we can make use of, and we can leverage their findings to automatically assign discrete appraisal labels to enISEAR (see Table 2) in a rule-based manner. For comparability with the manual annotation setup, we opt for discrete labels which we infer from the continuous principal component analysis values from the original paper.

Emotion | Attention | Certainty | Effort | Pleasant | Resp./Contr. | Sit. Control
Anger | 1 | 1 | 1 | 0 | 0 | 0
Disgust | 0 | 1 | 1 | 0 | 0 | 0
Fear | 1 | 0 | 1 | 0 | 0 | 1
Guilt | 0 | 1 | 1 | 0 | 1 | 0
Joy | 1 | 1 | 0 | 1 | 1 | 0
Sadness | 0 | 1 | 0 | 0 | 0 | 1
Shame | 0 | 0 | 1 | 0 | 1 | 0
Surprise | 1 | 0 | 0 | 1 | 0 | 1

Table 2: Discretized associations between appraisal dimensions and emotion categories, following Smith and Ellsworth (1985), as we use them for automatic annotation in Exp. 2.

The question to be answered is whether this rule-based annotation actually represents the same concepts as the manual annotation. To answer this, we compare the automatic annotation (AutoAppr) with both annotations that have been performed manually (EmoHide, EmoVis). Further, we train a model to predict these automatic annotations and evaluate on the manually annotated labels.

Since the automatic method relies on emotion labels, we expect its annotations to be more similar to EmoVis, where the annotators also have access to this information. For the same reason, we also assume that the model trained on automatically annotated labels performs better on EmoVis than on EmoHide. Finding that models trained on labels assigned in such a rule-based manner show comparable performance to manual annotations (when tested on manual annotations) would suggest that the latter might not be necessary to obtain appraisal prediction models.
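A sketch of this rule-based assignment, as we read it from Table 2, could look as follows; the function name and the input format are illustrative assumptions.

```python
# Emotion -> discretized appraisal profile from Table 2, in the order:
# attention, certainty, effort, pleasantness, responsibility/control,
# situational control.
APPRAISAL_RULES = {
    "anger":    (1, 1, 1, 0, 0, 0),
    "disgust":  (0, 1, 1, 0, 0, 0),
    "fear":     (1, 0, 1, 0, 0, 1),
    "guilt":    (0, 1, 1, 0, 1, 0),
    "joy":      (1, 1, 0, 1, 1, 0),
    "sadness":  (0, 1, 0, 0, 0, 1),
    "shame":    (0, 0, 1, 0, 1, 0),
    "surprise": (1, 0, 0, 1, 0, 1),
}

def auto_appraisal_labels(instances):
    """instances: iterable of (text, emotion) pairs with emotions from Table 2."""
    return [(text, APPRAISAL_RULES[emotion.lower()]) for text, emotion in instances]
```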
In this automatic setup, we merge responsibility and control. While they are divided in the manually annotated corpora, this separation is not available in the results by Smith and Ellsworth (1985). This affects the comparability of the averages of performance measures between Exp. 1 and 2.

Further, under the assumption that automatic annotation shows competitive results on the manually annotated corpus enISEAR, we extend this analysis to other resources for corpus generalization. In addition to enISEAR, we use the original ISEAR dataset (Scherer and Wallbott, 1994), the German event corpus deISEAR (Troiano et al., 2019) and, as resources without a focus on events, the Twitter Emotion Corpus (TEC) (Mohammad, 2012) and the Blogs corpus (Aman and Szpakowicz, 2007). Since these corpora are not manually annotated for appraisals, we only evaluate on automatic appraisal annotations.

In Experiment 1, we compare the reliability of the annotation with and without access to the emotion label. We show the inter-annotator agreement results in Table 3. As we hypothesized, the agreement on EmoVis is clearly higher than on EmoHide, with a κ of .68 in comparison to .55. The highest agreement increase is observed for attention (+.25) and certainty (+.28), followed by responsibility (+.16). The only decrease in agreement, for control, is comparably small (−.05).

Appraisal | EmoVis κ | EmoHide κ | ∆κ | EmoVis P/R/F | EmoHide P/R/F | ∆F
Attentional Activity | .55 | .30 | +.25 | .79/.84/.82 | .84/.88/.86 | −.04
Certainty | .71 | .43 | +.28 | .94/.97/.96 | .81/.93/.87 | +.09
Anticipated Effort | .44 | .38 | +.06 | .77/.83/.80 | .66/.58/.62 | +.18
Pleasantness | .93 | .87 | +.06 | .92/.94/.93 | .91/.92/.92 | +.01
Responsibility | .80 | .64 | +.16 | .85/.79/.82 | .83/.81/.82 | ±.00
Control | .66 | .71 | −.05 | .64/.49/.56 | .74/.68/.71 | −.15
Circumstance | .65 | .54 | +.11 | .80/.72/.76 | .76/.74/.75 | +.01
Macro ∅ | .68 | .55 | +.13 | .82/.80/.80 | .79/.79/.79 | +.01
Micro ∅ | | | | .84/.85/.84 | .80/.81/.81 | +.03

Table 3: Experiment 1: Cohen's κ between annotators on EmoVis and EmoHide, and modelling experiments. The model is separately trained and tested on EmoVis and EmoHide.

Figure 1 (and Table 9 in the Appendix) shows the distribution of emotions for the different appraisal dimensions: for most dimensions, the annotation becomes more clearly connected to emotions with its availability, with certainty and anticipated effort being exceptions; here, the number of instances of a set of emotion classes partially increases. This confirms that knowledge of the emotion "denoises" the annotation.
Modelling. Table 3 also reports the prediction performance on appraisal classes using RoBERTa. We observe that the performances on EmoVis are higher than on EmoHide (+.03pp on micro F, +.01 on macro F). This is in line with our assumptions, but the improvement is actually lower than we expected, given the more substantial difference in inter-annotator agreement. However, for certainty (+.09) and anticipated effort (+.18), the change is substantial. Attentional activity, which shows a high increase in agreement, has a small decrease in modelling performance (−.04). Control, which does not improve in agreement, has a considerable loss in prediction performance (−.15).

From this experiment, we conclude that the annotation is more reliable with access to the emotion: this is reflected in the modelling results, and to different extents for different appraisal dimensions.

We now evaluate the rule-based annotation procedure, in which appraisal classes are purely assigned by the automatic procedure shown in Table 2. The agreement between the rule-based annotation
AutoAppr and both manual annotations is shown in Table 4. As expected, we observe a higher agreement with EmoVis. Again, the differences are not equally distributed across emotions and they resemble the changes in the other experiments, but agreement is lower between AutoAppr and the manual annotations than between the manual annotations themselves, suggesting that the automatic process does not lead to the same conceptual annotation.

[Figure 1: Frequency distribution of appraisals across emotions for EmoVis (Visual) and EmoHide (Hidden); one panel per appraisal dimension (Attention, Certainty, Effort, Pleasantness, Responsibility, Control, Circumstance), each over the emotions anger, disgust, fear, guilt, joy, sadness, and shame.]

Appraisal | κ AutoAppr vs. EmoVis (A1/A2) | κ AutoAppr vs. EmoHide (A1/A2) | EmoVis P/R/F | EmoHide P/R/F | AutoAppr P/R/F
Attentional Activity | .47/.63 | .33/.48 | .85/.59/.70 | .88/.59/.71 | .86/.85/.85
Certainty | .35/.53 | .50/.51 | .97/.81/.88 | .81/.83/.82 | .88/.90/.89
Anticipated Effort | .31/.08 | .19/.16 | .61/.77/.68 | .39/.79/.52 | .95/.96/.96
Pleasantness | .93/1.00 | .91/.96 | .90/.93/.92 | .91/.95/.93 | .96/.94/.95
Responsibility/Control | .39/.63 | .39/.59 | .66/.81/.72 | .74/.78/.76 | .91/.89/.90
Situational Control | .44/.64 | .34/.59 | .77/.74/.75 | .78/.59/.67 | .84/.83/.84
Macro ∅ | .48/.58 | .44/.54 | .79/.77/.78 | .75/.75/.74 | .90/.89/.90
Micro ∅ | | | .78/.75/.77 | .70/.73/.72 | .90/.90/.90

Table 4: Experiment 2 main results: Cohen's κ between the annotators of EmoHide/EmoVis and AutoAppr, on the subset of 210 instances from enISEAR. The classifier (right-hand columns, modelling with RoBERTa trained on AutoAppr) is trained on the full set of 1001 instances annotated automatically (AutoAppr) and evaluated on all other annotations (cross-validation splits remain the same).
Modelling. To answer the question of how well a model trained on rule-based annotations performs on manual annotations, we test the model three times: on EmoVis, EmoHide, and AutoAppr. (These labels could be compared to human annotations only on EmoVis, which turned out to be more reliable. We also consider EmoHide because it represents standard emotion annotation procedures, where judges assess texts without further information.) The right side of Table 4 reports the results. Note that responsibility and control have been merged, as explained in the experimental setting.

We see that the highest macro-average F is, unsurprisingly, achieved when testing on AutoAppr (.90 F). When testing the same model on EmoVis, the performance drops by 12pp (.78 F), but is still substantially higher than for the corpus in which the emotions were not available to the annotators (.72 F). Note that the performances of .78 F (EmoVis) and .74 F (EmoHide) are not too different from the model trained on manually annotated data, with .80 F and .79 F. We therefore conclude that automatically labeling a corpus with appraisal dimensions leads to a meaningful model.
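The cross-annotation evaluation can be pictured as in the following sketch, which is our assumption about the setup rather than the authors' code: predictions of the model trained on AutoAppr are scored per dimension against a manual gold standard such as EmoVis.

```python
# Sketch: per-dimension precision/recall/F1 of predictions against manual labels.
from sklearn.metrics import precision_recall_fscore_support

def evaluate_dimension(gold_labels, predicted_labels):
    """gold_labels, predicted_labels: lists of 0/1 values for one appraisal dimension."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold_labels, predicted_labels, average="binary", zero_division=0)
    return precision, recall, f1
```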
Corpus Generalization. Finally, we apply the automatic labeling procedure to other emotion corpora. Results are shown in Table 5. Given the different nature of the domains and languages (German vs. English for deISEAR/enISEAR; tweets vs. blog texts for TEC vs. the Blogs corpus), these numbers cannot be directly compared, but we can observe that they are comparably high, similar to the other experiments. We carefully infer (without having compared the predictions on these corpora to manual annotations) that this is an indicator that automatic annotation of appraisal dimensions also works across different corpora and languages.

Appraisal | deISEAR P/R/F | ISEAR P/R/F | TEC P/R/F | Blogs P/R/F
Attentional Activity | 79/68/73 | 83/82/83 | 90/91/91 | 94/94/94
Certainty | 79/90/84 | 89/92/90 | 87/89/88 | 97/97/97
Anticipated Effort | 88/93/91 | 94/95/94 | 74/68/71 | 91/93/92
Pleasantness | 80/69/77 | 91/90/91 | 85/86/86 | 97/97/97
Responsibility/Control | 80/69/74 | 88/85/86 | 79/79/79 | 94/96/95
Situational Control | 73/69/71 | 83/81/82 | 79/79/79 | 88/86/87
Macro ∅ | 81/76/78 | 88/87/88 | 82/82/82 | 94/94/94
Micro ∅ | 82/81/81 | 89/89/89 | 84/84/84 | 94/95/94

Table 5: Experiment 2, generalization to other corpora. All results are averages across 3 × 10 cross-validations. Note that the last three columns of Table 4 correspond to the same setting as here.
The data that we use was made available to support appraisal-based research in emotion analysis. It consists of the same instances we annotated in Hofmann et al. (2020). However, in this previous work, each instance was judged by three annotators who did not have access to the emotion labels of the texts, and the experiments were performed on labels obtained with the majority vote of the annotators. Instead, for the current experiments, the labels by only one annotator on all instances have been used. Therefore, the experiments of the two papers are not strictly comparable. In addition, Hofmann et al. (2020) adopted a CNN-based classifier. In brief, there are two sources of non-comparability in our experiments: different label sets and different models. We aimed at leveraging a more state-of-the-art transformer-based model, but at the same time, we needed different label sets to better understand the appraisal annotation processes.

For transparency reasons, we show the performance of our RoBERTa model on the original labels against the results by Hofmann et al. (2020). Table 6 compares the two studies with respect to appraisals and Table 7 with respect to emotion predictions. The emotion recognition models consist of a text-based model (T→E) and a pipeline that first predicts the appraisal and then classifies the emotion, without access to the text, with a two-layer dense neural network (T→A, A→E). To measure the complementarity of these two settings, a third model is an oracle ensemble (T→A→E + T→E) which accepts a prediction as a true positive if one of the two models provides the correct prediction.

On this original data set by Hofmann et al. (2020), our model constitutes a new state of the art. The micro-averaged appraisal prediction with RoBERTa is 10pp higher than the original CNN-based model; the emotion classification shows similar improvements, and the overall relation between the model configurations remains comparable.

Appraisal | CNN P/R/F | RoBERTa P/R/F
Attention | 81/84/82 | 86/90/88
Certainty | 84/86/85 | 87/94/91
Effort | 68/68/68 | 79/77/78
Pleasantness | 79/63/70 | 92/92/92
Responsibility | 74/68/71 | 86/85/85
Control | 63/49/55 | 81/73/77
Circumstance | 65/58/61 | 74/69/71
Macro ∅ | 73/68/70 | 83/83/83
Micro ∅ | 77/74/75 | 84/85/85

Table 6: RoBERTa model performance on predicting appraisals on the original data by Hofmann et al. (2020), compared to their CNN results.

Emotion | T→E CNN | T→E RoBERTa | T→A,A→E CNN | T→A,A→E RoBERTa | T→A→E + T→E CNN | T→A→E + T→E RoBERTa
Anger | 51/52/52 | 62/62/62 | 34/62/44 | 44/72/54 | 66/81/73 | 72/85/78
Disgust | 65/63/64 | 70/71/71 | 59/34/43 | 65/38/48 | 78/68/73 | 86/74/80
Fear | 69/71/70 | 80/82/81 | 55/55/55 | 65/76/70 | 76/77/77 | 85/92/88
Guilt | 47/42/44 | 60/60/60 | 38/50/43 | 50/68/58 | 60/63/62 | 72/80/76
Joy | 74/80/77 | 92/96/94 | 77/69/72 | 92/95/93 | 79/80/80 | 94/98/96
Sadness | 69/67/68 | 82/81/81 | 58/40/47 | 74/55/63 | 74/70/72 | 87/84/86
Shame | 44/45/45 | 57/54/55 | 36/24/29 | 51/23/32 | 58/51/54 | 77/59/67
Macro ∅ | 60/60/60 | 72/72/72 | 51/48/48 | 63/61/60 | 70/70/70 | 82/82/82
Micro ∅ (F) | 60 | 72 | 48 | 61 | 70 | 82

Table 7: Comparison of the CNN (Hofmann et al., 2020) and our RoBERTa model (P/R/F) on the Text-to-Emotion baseline (T→E), the pipeline experiment (T→A, A→E) and the oracle ensemble experiment (T→A→E + T→E). These experiments follow the model configurations by Hofmann et al. (2020).
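The oracle-ensemble scoring can be read as in the following sketch, which reflects our reading of the description above with illustrative names: an instance counts as correct if either the text-based model or the appraisal pipeline predicts the gold emotion.

```python
# Sketch of the oracle ensemble (T->A->E + T->E): accept the prediction of
# whichever model is correct; otherwise fall back to the text-based model.
def oracle_ensemble_predictions(gold, text_model_preds, pipeline_preds):
    return [g if (t == g or p == g) else t
            for g, t, p in zip(gold, text_model_preds, pipeline_preds)]
```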
To better understand how revealing emotions affects the annotations in Experiment 1, we provide some concrete examples. Table 8 reports instances from enISEAR. We show for which appraisal variables the agreement changes, by marking the appraisal with + or −. For instance, −attention means that the annotators came to disagree on that appraisal dimension when the emotion was uncovered, while +pleasantness indicates that they came to agree thanks to the knowledge of the emotion label. Note that + does not mean that the dimension was marked as 1 by both annotators. The examples are sorted by the sum of changes in agreement.

ID | Score | Agreement Change | Input Text
1 | + | +e +p +r +ci | I felt ... when I was abseiling down a cliff-face.
2 | + | +ce +r +ci | I felt ... when I got a new job.
4 | ± | −a −ce −r +ci | I felt ... when I found out that my daughter had been having a difficult time and I didn't realise straight away what she was going through.
5 | − | −a −e | I felt ... when we were charged by a care home for the three months after my father had died, even though we had emptied his room the day after his death.
6 | − | −a −ce −r | I felt ... when cycling home after a long ride one evening, unaware how dark it had become, and thus relying on some very weak led lights that I'd never tested in complete darkness - I could barely see ten feet ahead of me.

Table 8: Examples of differences between annotations with masked and visible emotion labels. + and − indicate the agreement and the disagreement on a specific dimension which is reached after making the emotion visible. The score is the sum of agreement changes, either improvements (+1 for each dimension) or degradations (−1). a: attention, ce: certainty, e: anticipated effort, p: pleasantness, r: responsibility, ci: circumstance.

In Example (1), an event is described in a way which leaves open whether there is responsibility, pleasantness, anticipated effort, and even whether the experiencer is entirely certain about what is happening. With the knowledge that the emotion is fear, it becomes clear that the situation does involve anticipated effort, is not pleasant, and that circumstance is not likely. The annotators also agree here that there is responsibility involved, which is likely an interpretation based on world knowledge.

In the situation of getting a new job (Example (2)), knowing the emotion adds agreement regarding certainty, which is in line with added agreement that the person was responsible. Example (3) is an instance in which the situation was already entirely clear without knowing the emotion: the experiencer participated in gossip. That is a certain, non-pleasurable situation (when recognizing this) which is under their own control. Knowledge that guilt has been experienced does not add anything.

Example (4) shows that complex situations cause more disagreement in annotations, which is not necessarily resolved by knowing the emotion. The described event is about a negative emotion felt because the experiencer did not recognize the bad mood of the daughter. Annotators come to disagree about attention, certainty and responsibility.

Example (5) describes another situation in which a negative event is discussed. Knowledge of the emotion puts a clear focus away from the sad part of the description (the father dying) and puts it on something that causes anger. However, this shift does not resolve appraisal disagreements but indeed adds on top of them, with attention and anticipated effort.

Finally, Example (6) is another long description with the annotators' focus on different aspects, one on the darkness (hence, no responsibility), the other on the cycling (responsibility). Here, knowledge of the emotion does not change the interpretability of the event. However, it informs on which part of the described situation the original author focused.

These examples show that EmoVis helps solve ambiguities when events can be associated with multiple emotions; other times, it helps people give more weight to specific portions of the texts. In the first case, agreement tends to be reached more easily.

Conclusion & Future Work
We analyzed how to build corpora of text annotated with appraisal variables, and we evaluated how well such concepts can be modelled. By doing so, we brought together emotion analysis and a strand of emotion research in psychology which has received little attention from the computational linguistics community. We propose that, in addition to well-established approaches to emotion analysis, like affect-oriented dimensional approaches or classification into predefined emotion inventories, psychological models of appraisals be considered in future emotion-based studies, particularly those relying on event-oriented resources.

The use of appraisals is interesting from a theoretical perspective: motivated by psychology, we leveraged the cognitive mechanisms underlying emotions, thus accounting for many complex patterns in which humans appraise an event and emotionally react to it. In this light, it is interesting in itself that our annotators were able to empathically reconstruct the event appraisals experienced by others, even without knowing their emotion.

From a practical perspective, appraisal annotations are less prone to being poorly chosen for particular domains, in comparison to regular emotion classes, as the actual feeling develops based on the cognitive evaluation of an event. We also have shown that event descriptions alone might not be sufficient to properly annotate the hypothetical appraisal of the experiencer (which is, however, also an issue with traditional emotion analysis annotations and models: we cannot look into the feeler, we deal with private states). This shows, and is presumably the reason, that additional context (e.g., the emotion label) is required.

Some implications for future research and development follow: ideally, appraisal (and emotion) annotations should stem directly from the experiencer. This is not doable in many NLP settings. For instance, when analyzing literature, it is impossible to ask fictional characters for their current event appraisal. However, we presume that settings on social media might be realistic, for instance by probing appropriate distant labeling methods, e.g., a careful choice of hashtags. If this is unfeasible, because text authors do not disclose their appraisals, the available emotion labels still represent a valuable source of information: as we have shown, they can guide the interpretation of the described events, and hence the way in which these are post-assigned appraisal dimensions.

Finally, we provided evidence that even if a model is trained on automatically obtained appraisal labels, it is still capable of substantial performance. Therefore, we conclude that more corpora with appraisal dimensions from different languages and domains should be created from scratch. In the meantime, one can build on top of the rich set of available emotion corpora and automatically create appraisal-annotated resources out of them.
Acknowledgements
This work was supported by Deutsche Forschungsgemeinschaft (projects SEAT, KL 2869/1-1, and CEAT, KL 2869/1-2) and has partially been conducted within the Leibniz WissenschaftsCampus Tübingen "Cognitive Interfaces". We thank Kai Sassenberg and Laura Oberländer for inspiration and fruitful discussions.
References
Saima Aman and Stan Szpakowicz. 2007. Identifying expressions of emotion in text. In Text, Speech and Dialogue, pages 196–205, Berlin, Heidelberg. Springer Berlin Heidelberg.

Laura-Ana-Maria Bostan and Roman Klinger. 2018. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2104–2119, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Sven Buechel and Udo Hahn. 2017. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain. Association for Computational Linguistics.

Walter B. Cannon. 1929. Bodily Changes in Pain, Hunger, Fear, and Rage. Appleton-Century, New York, US.

Deniz Cevher, Sebastian Zepf, and Roman Klinger. 2019. Towards multimodal emotion recognition in German speech events in cars using transfer learning. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 79–90, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, Online. Association for Computational Linguistics.

Monique Dittrich and Sebastian Zepf. 2019. Exploring the validity of methods to track emotions behind the wheel. In Persuasive Technology: Development of Persuasive and Behavior Change Support Systems, pages 115–127, Cham. Springer International Publishing.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Lisa Feldman Barrett. 2006. Solving the emotion paradox: Categorization and the experience of emotion. Personality and Social Psychology Review, 10(1):20–46.

Lisa Feldman Barrett. 2017. How Emotions Are Made. Houghton Mifflin Harcourt, New York, USA.

Maria Gendron and Lisa Feldman Barrett. 2009. Reconstructing the past: A century of ideas about emotion in psychology. Emotion Review, 1(4):316–339.

Thomas Haider, Steffen Eger, Evgeny Kim, Roman Klinger, and Winfried Menninghaus. 2020. PO-EMO: Conceptualization, annotation, and modeling of aesthetic emotions in German and English poetry. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1652–1663, Marseille, France. European Language Resources Association.

Nida Manzoor Hakak, Mohsin Mohd, Mahira Kirmani, and Mudasir Mohd. 2017. Emotion analysis: A survey. In 2017 International Conference on Computer, Communications and Electronics (Comptelix), pages 397–402.

Jan Hofmann, Enrica Troiano, Kai Sassenberg, and Roman Klinger. 2020. Appraisal theories for emotion classification in text.

Evgeny Kim and Roman Klinger. 2019. A survey on sentiment and emotion analysis for computational literary studies. Zeitschrift für digitale Geisteswissenschaften, 4.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Arun S. Maiya. 2020. ktrain: A low-code library for augmented machine learning.

Winfried Menninghaus, Valentin Wagner, Eugen Wassiliwizky, Ines Schindler, Julian Hanich, Thomas Jacobsen, and Stefan Koelsch. 2019. What are aesthetic emotions? Psychological Review, 126(2):171.

Saif Mohammad. 2012. #Emotional tweets. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 246–255, Montréal, Canada. Association for Computational Linguistics.

Saif Mohammad. 2018. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184, Melbourne, Australia. Association for Computational Linguistics.

Saif Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 321–327, Atlanta, Georgia, USA. Association for Computational Linguistics.

Marcello Mortillaro, Ben Meuleman, and Klaus R. Scherer. 2012. Advocating a componential appraisal model to guide emotion recognition. International Journal of Synthetic Emotions (IJSE), 3(1):18–32.

James W. Pennebaker, Martha E. Francis, and Roger J. Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71:2001.

Robert Plutchik. 2001. The nature of emotions. American Scientist, 89(4):344–350.

Daniel Preoţiuc-Pietro, H. Andrew Schwartz, Gregory Park, Johannes Eichstaedt, Margaret Kern, Lyle Ungar, and Elisabeth Shulman. 2016. Modelling valence and arousal in Facebook posts. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 9–15, San Diego, California. Association for Computational Linguistics.

James A. Russell and Albert Mehrabian. 1977. Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3):273–294.

Klaus R. Scherer. 2000. Psychological models of emotion. In The Neuropsychology of Emotion, Series in Affective Science, pages 137–162. Oxford University Press, New York, NY, US.

Klaus R. Scherer. 2009a. The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion, 23(7):1307–1351.

Klaus R. Scherer. 2009b. Emotions are emergent processes: They require a dynamic computational architecture. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1535):3459–3474.

Klaus R. Scherer, Angela Schorr, and Tom Johnstone. 2001. Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press.

Klaus R. Scherer and Harald G. Wallbott. 1994. Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310.

Ines Schindler, Georg Hosoya, Winfried Menninghaus, Ursula Beermann, Valentin Wagner, Michael Eid, and Klaus R. Scherer. 2017. Measuring aesthetic emotions: A review of the literature and a new assessment tool. PLoS ONE, 12(6):e0178899.

Craig A. Smith and Phoebe C. Ellsworth. 1985. Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology, 48(4):813–838.

Carlo Strapparava and Alessandro Valitutti. 2004. WordNet Affect: An affective extension of WordNet. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal. European Language Resources Association (ELRA).

Enrica Troiano, Sebastian Padó, and Roman Klinger. 2019. Crowdsourcing and validating event-focused emotion corpora for German and English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4005–4011, Florence, Italy. Association for Computational Linguistics.

Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016. Building Chinese affective resources in valence-arousal dimensions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 540–545, San Diego, California. Association for Computational Linguistics.
A Corpus Statistics for Manual Annotations of enISEAR
Table 9 shows the corpus statistics for the manually annotated corpora. In particular, it depicts how the availability of the emotion to the annotators influences the distribution of appraisal labels. The same numbers are also shown in a comparative manner in Figure 1. For most appraisal dimensions, the annotations become more specific, i.e., narrower across the emotion counts, mostly manifesting in lower counts. An exception is certainty, which shows higher counts for all emotions, and anticipated effort, which receives higher counts for shame, sadness, fear, and guilt, but not for anger.