Uncertainty and Surprisal Jointly Deliver the Punchline: Exploiting Incongruity-Based Features for Humor Recognition
Yubo Xie, Junze Li, and Pearl Pu
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
{yubo.xie,junze.li,pearl.pu}@epfl.ch
Abstract
Humor recognition has been widely studied as a text classification problem using data-driven approaches. However, most existing work does not examine the actual joke mechanism to understand humor. We break down any joke into two distinct components: the set-up and the punchline, and further explore the special relationship between them. Inspired by the incongruity theory of humor, we model the set-up as the part developing semantic uncertainty, and the punchline as the part disrupting audience expectations. With increasingly powerful language models, we were able to feed the set-up along with the punchline into the GPT-2 language model, and calculate the uncertainty and surprisal values of the jokes. By conducting experiments on the SemEval 2021 Task 7 dataset, we found that these two features are better at telling jokes from non-jokes, compared with existing baselines.
1 Introduction

Humor, regardless of age, gender, or cultural background, is perhaps one of the most fascinating human behaviors. Besides being able to provide entertainment, humor can also be beneficial to mental health by serving as a moderator of life stress (Lefcourt and Martin, 2012), and plays an important role in regulating human-human interaction. As Reeves and Nass (1996) have pointed out, people respond to computers in the same way as they do to real people, which indicates that modeling humor computationally could bring positive effects in human-computer interaction (Nijholt et al., 2003).

One of the important aspects of computational humor is to develop computer programs capable of recognizing humor in text. Early work on humor recognition (Mihalcea and Strapparava, 2005) proposed heuristic-based humor-specific stylistic features, for example alliteration, antonymy, and
adult slang. More recent work (Yang et al., 2015; Chen and Soo, 2018; Weller and Seppi, 2019) regarded the problem as a text classification task, and adopted statistical machine learning methods and neural networks to train models on humor datasets. However, only a few of the deep learning methods have tried to establish a connection between humor recognition and humor theories. Thus, one research direction in humor recognition is to bridge the disciplines of linguistics and artificial intelligence.

Figure 1: A joke example consisting of a set-up ("I would never cheat in a relationship.") and a punchline ("Because that would require two people to find me attractive."). A violation can be observed between the punchline and the expectation raised by the set-up ("I am loyal to my partner.").

In this paper, we restrict the subject of investigation to jokes, one of the most common humor types in text form. As shown in Figure 1, these jokes usually consist of a set-up and a punchline. The set-up creates a situation that introduces the hearer into the story framework, and the punchline concludes the joke in a succinct way, intended to make the hearer laugh. Perhaps the most suitable humor theory for explaining such humor phenomena is the incongruity theory, which states that the cause of laughter is the perception of something incongruous (the punchline) that violates the hearer's expectation (the set-up).

Based on the incongruity theory, we propose two features for humor recognition, by calculating the degree of incongruity between the set-up and the punchline. Recently popular pre-trained language models enable us to study such a relationship based on large-scale corpora. Specifically, we fed the set-up along with the punchline into the GPT-2 language model (Radford et al., 2019), and obtained the surprisal and uncertainty values of the joke, indicating how surprising it is for the model to generate the punchline, and the uncertainty while generating it.
We conducted experiments on a manually labeled humor dataset, and the results showed that these two features could better distinguish jokes from non-jokes, compared with existing baselines. Our work made an attempt to bridge humor theories and humor recognition by applying large-scale pre-trained language models, and we hope it could inspire future research in computational humor.

2 Related Work

Humor Data.
Mihalcea and Strapparava (2005) created a one-liner dataset with humorous examples extracted from webpages with humor theme and non-humorous examples from Reuters titles, British National Corpus (BNC) sentences, and English Proverbs. Yang et al. (2015) scraped puns from the Pun of the Day website and negative examples from various news websites. There is also work on the curation of non-English humor datasets (Zhang et al., 2019; Blinov et al., 2019). Hasan et al. (2019) developed UR-FUNNY, a multimodal humor dataset that involves text, audio and video information extracted from TED talks.

Humor Recognition.
Most of the existing work on humor recognition in text focuses on one-liners, a type of joke that delivers the laughter in a single line. The methodologies typically fall into two categories: feature engineering and deep learning. Mihalcea and Strapparava (2005) designed three human-centric features (alliteration, antonymy and synonym) for recognizing humor in the curated one-liner dataset. Mihalcea et al. (2010) approached the problem by calculating the semantic relatedness between the set-up and the punchline (they evaluated 150 one-liners by manually splitting them into "set-up" and "punchline"). Morales and Zhai (2017) proposed a probabilistic model and leveraged background text sources (such as Wikipedia) to identify humorous Yelp reviews. Liu et al. (2018) proposed to model sentiment association between elementary discourse units and designed features based on discourse relations. With neural networks being popular in recent years, some deep learning structures have been developed for the recognition of humor in text. Chen and Lee (2017) and Chen and Soo (2018) adopted convolutional neural networks, while Weller and Seppi (2019) used a Transformer architecture to do the classification task. Fan et al. (2020a,b) incorporated extra phonetic and semantic (ambiguity) information into the deep learning framework.

Although the work of Mihalcea et al. (2010) is the closest to ours, we are the first to bridge the incongruity theory of humor and large-scale pre-trained language models. Other work (Bertero and Fung, 2016) has attempted to predict punchlines in conversations extracted from TV series, but their subject of investigation is inherently different from ours: punchlines in conversations largely depend on the preceding utterances, while jokes are much more succinct and self-contained.
3 Humor Theories

The attempts to explain humor date back to the age of ancient Greece, where philosophers like Plato and Aristotle regarded the enjoyment of comedy as a form of scorn, and held critical opinions towards laughter. These philosophical comments on humor, also followed by early Christian thinkers, were summarized as the superiority theory, which states that laughter expresses a feeling of superiority over other people's misfortunes or shortcomings. Starting from the 18th century, two other humor theories began to challenge the dominance of the superiority theory: the relief theory and the incongruity theory. The relief theory argues that laughter serves to facilitate the relief of pressure for the nervous system. This explains why laughter is caused when people recognize taboo subjects; one typical example is the wide usage of sexual terms in jokes. The incongruity theory, supported by Kant (1790), Schopenhauer (1883), and many later philosophers and psychologists, states that laughter comes from the perception of something incongruous that violates the expectations. This view of humor fits well the types of jokes commonly found in stand-up comedies, where the set-up establishes an expectation, and then the punchline violates it. Morreall (2020) gives a more comprehensive examination of these traditional humor theories.

As an expansion of the incongruity theory, Raskin (1979) proposed the Semantic Script-based Theory of Humor (SSTH) by applying the semantic script theory. It posits that, in order to produce verbal humor, two requirements should be fulfilled: (1) the text is compatible with two different scripts; (2) the two scripts with which the text is compatible are opposite. The General Theory of Verbal Humor (GTVH) (Attardo and Raskin, 1991) expanded the range of descriptive and explanatory dimensions of SSTH to six, called knowledge resources.

4 Method

The incongruity theory attributes humor to the violation of expectation.
This means the punchline delivers the incongruity that turns over the expectation established by the set-up, making it possible to interpret the set-up in a completely different way. With neural networks blooming in recent years, pre-trained language models make it possible to study such a relationship between the set-up and the punchline based on large-scale corpora. Given the set-up, language models are capable of writing expected continuations, enabling us to measure the degree of incongruity by comparing the actual punchline with what the language model is likely to generate.

In this paper, we leverage the GPT-2 language model (Radford et al., 2019), a Transformer-based architecture trained on the WebText dataset consisting of over 8 million documents for a total of 40 GB of text. WebText was curated without making any assumptions on the genres of the text; thus the resulting model is domain independent, and is shown to learn many NLP tasks without explicit supervision. We chose GPT-2 as our research tool because: (1) GPT-2 is already pre-trained on massive data and publicly available online, which spares us the training process; (2) it is domain independent, thus suitable for modeling various styles of English text. Our goal is to model the set-up and the punchline as a whole piece of text using GPT-2, and analyze the probability of generating the punchline given the set-up. In the following text, we denote the set-up as x, and the punchline as y. Basically, we are interested in two quantities regarding the probability distribution p(y | x): uncertainty and surprisal, which are elaborated in the next two sections.

4.1 Uncertainty

The first question we are interested in is: given the set-up, how uncertain is it for the language model to continue? This question is related to SSTH, which states that, for a piece of text to be humorous, it
should be compatible with two different scripts. To put it under the framework of set-up and punchline, this means the set-up could have multiple ways of interpretation, according to the following punchline. Thus, one would expect a higher uncertainty value when the language model tries to continue the set-up and generate the punchline.

Figure 2: The set-up x and the punchline y are concatenated and fed into GPT-2 for predicting the next token. The v_i's are probability distributions over the vocabulary.

We propose to calculate the averaged entropy of the probability distributions at all token positions of the punchline, to represent the degree of uncertainty. As shown in Figure 2, the set-up x and the punchline y are concatenated and then fed into GPT-2 to predict the next token. While predicting the tokens of y, GPT-2 produces a probability distribution v_i over the vocabulary. The averaged entropy is then defined as

U(x, y) = -\frac{1}{|y|} \sum_{i=1}^{n} \sum_{w \in V} v_i^w \log v_i^w,   (1)

where V is the vocabulary.

4.2 Surprisal

The second question we would like to address is: how surprising is it when the language model actually generates the punchline? As the incongruity theory states, laughter is caused when something incongruous is observed and it violates the previously established expectation. Therefore, we expect the probability of the language model generating the actual punchline to be relatively low, which indicates the surprisal value should be high. Formally, the surprisal is defined as

S(x, y) = -\frac{1}{|y|} \log p(y | x) = -\frac{1}{|y|} \sum_{i=1}^{n} \log v_i^{y_i}.   (2)

5 Experiments
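As a concrete illustration (not the authors' released code), Equations (1) and (2) can be computed directly from the per-token next-token distributions v_i. In practice these distributions would come from GPT-2 (e.g., the softmax over the logits of a pre-trained language model); the sketch below uses small hand-made distributions so the arithmetic is easy to follow.

```python
import math

def uncertainty(dists):
    """Eq. (1): entropy of each next-token distribution over the
    punchline positions, averaged over the punchline length |y|."""
    total = -sum(p * math.log(p) for v in dists for p in v.values() if p > 0)
    return total / len(dists)

def surprisal(dists, punchline):
    """Eq. (2): negative log-probability of the actual punchline
    tokens, averaged over the punchline length |y|."""
    return -sum(math.log(v[tok]) for v, tok in zip(dists, punchline)) / len(dists)

# Hand-made distributions over a 3-token vocabulary (illustrative only);
# with GPT-2, each v would be the model's output distribution at one
# punchline position.
dists = [
    {"a": 0.5, "b": 0.25, "c": 0.25},
    {"a": 0.1, "b": 0.8, "c": 0.1},
]
print(uncertainty(dists))            # averaged entropy of the two distributions
print(surprisal(dists, ["c", "b"]))  # lower-probability tokens give higher surprisal
```

A joke whose punchline tokens sit in the low-probability tail of these distributions yields a high surprisal, which is exactly the incongruity signal the paper exploits.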
We evaluated and compared the proposed features with several baselines by conducting experiments in two settings: predicting using individual features, and combining the features with a content-based text classifier.
5.1 Baselines

Similar to our approach of analyzing the relationship between the set-up and the punchline, Mihalcea et al. (2010) proposed to calculate the semantic relatedness between the set-up and the punchline. The intuition is that the punchline (which delivers the surprise) will have a minimum relatedness to the set-up. For our experiments, we chose the two relatedness metrics that perform the best in their paper as our baselines, plus another similarity metric based on shortest paths in WordNet (Miller, 1995):

- Leacock & Chodorow similarity (Leacock and Chodorow, 1998), defined as

  Sim_lch = -\log \frac{length}{2D},   (3)

  where length is the length of the shortest path between two concepts using node-counting, and D is the maximum depth of WordNet.

- Wu & Palmer similarity (Wu and Palmer, 1994) calculates similarity by considering the depths of the two synsets in WordNet, along with the depth of their LCS (Least Common Subsumer), and is defined as

  Sim_wup = \frac{2 \cdot depth(LCS)}{depth(C_1) + depth(C_2)},   (4)

  where C_1 and C_2 denote synset 1 and synset 2, respectively.

- Path similarity (Rada et al., 1989) is also based on the length of the shortest path between two concepts in WordNet, and is defined as

  Sim_path = \frac{1}{1 + length}.   (5)

In addition to the metrics mentioned above, we also consider the following two baselines related to the phonetic and semantic styles of the input text:

- Alliteration. The alliteration value is computed as the total number of alliteration chains and rhyme chains found in the input text (Mihalcea and Strapparava, 2005).

- Ambiguity. Semantic ambiguity is found to be a crucial part of humor (Miller and Gurevych, 2015). We follow the work of Liu et al. (2018) to compute the ambiguity value:

  \log \prod_{w \in s} num\_of\_senses(w),   (6)

  where w is a word in the input text s.

5.2 Dataset

We took the dataset from SemEval 2021 Task 7. The released training set contains 8,000 manually labeled examples in total, with 4,932 being positive, and 3,068 negative. To adapt the dataset for our purpose, we only considered positive examples with exactly two sentences, and negative examples with at least two sentences. For positive examples (jokes), the first sentence was treated as the set-up and the second the punchline. For negative examples (non-jokes), two consecutive sentences were treated as the set-up and the punchline, respectively. After splitting, we cleaned the data with the following rules: (1) we restricted the length of set-ups and punchlines to be under 20 (by counting the number of tokens); (2) we only kept punchlines whose percentage of alphabetical letters is greater than or equal to 75%; (3) we discarded punchlines that do not begin with an alphabetical letter. As a result, we obtained 3,341 examples in total, consisting of 1,815 jokes and 1,526 non-jokes.
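Equations (3)-(6) above reduce to simple functions of path lengths, taxonomy depths, and sense counts; the minimal sketch below implements them directly (in practice, NLTK's wordnet module provides ready-made lch_similarity, wup_similarity, and path_similarity implementations).

```python
import math

def sim_lch(length, max_depth):
    # Eq. (3): Leacock & Chodorow similarity from the shortest path
    # length (node counting) and the maximum taxonomy depth D.
    return -math.log(length / (2 * max_depth))

def sim_wup(depth_lcs, depth_c1, depth_c2):
    # Eq. (4): Wu & Palmer similarity from the two synset depths and
    # the depth of their least common subsumer (LCS).
    return 2 * depth_lcs / (depth_c1 + depth_c2)

def sim_path(length):
    # Eq. (5): path similarity from the shortest path length.
    return 1 / (1 + length)

def ambiguity(sense_counts):
    # Eq. (6): log of the product of per-word sense counts, computed
    # as a sum of logs for numerical stability.
    return sum(math.log(n) for n in sense_counts)
```

The path lengths and depths here would be looked up in WordNet for each word pair; the sense counts for the ambiguity feature are the number of WordNet senses per word.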
To further balance the data, we randomly selected 1,526 jokes, and thus the final dataset contains 3,052 labeled examples in total. For the following experiments, we used 10-fold cross validation, and the averaged scores are reported.
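The three cleaning rules above can be sketched as a filter function (a hypothetical reimplementation; whitespace tokenization and the exact character accounting are assumptions, not the authors' stated choices):

```python
def keep_example(setup, punchline):
    # Rule 1: set-up and punchline must each be under 20 tokens
    # (whitespace tokenization is an assumption here).
    if len(setup.split()) >= 20 or len(punchline.split()) >= 20:
        return False
    # Rule 2: at least 75% of the punchline's characters must be
    # alphabetical letters.
    if sum(c.isalpha() for c in punchline) / max(len(punchline), 1) < 0.75:
        return False
    # Rule 3: the punchline must begin with an alphabetical letter.
    return punchline[:1].isalpha()

print(keep_example("I would never cheat in a relationship.",
                   "Because that would require two people to find me attractive."))  # True
```

Applying such a filter to the two-sentence examples yields the 3,341 cleaned examples reported above.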
5.3 Results

To test the effectiveness of our features in distinguishing jokes from non-jokes, we built an SVM classifier for each individual feature (uncertainty and surprisal, plus the baselines). The resulting scores are reported in Table 1. Compared with the baselines, both of our features (uncertainty and surprisal) achieved higher scores on all four metrics. In addition, we also tested the performance of uncertainty combined with surprisal (last row of the table), and the resulting classifier shows a further increase in performance. This suggests that, by jointly considering the uncertainty and surprisal of the set-up and the punchline, we are better at recognizing jokes.

Footnote 1: https://semeval.github.io/SemEval2021/
Footnote 2: We refer to them as set-up and punchline for the sake of convenience, but since they are not jokes, the two sentences are not a real set-up and punchline.

Table 1: Performance of individual features. The last row (U+S) is the combination of uncertainty and surprisal. P: Precision, R: Recall, F1: F1-score, Acc: Accuracy. P, R, and F1 are macro-averaged, and the scores are reported on 10-fold cross validation. The Random baseline scores P 0.4973, R 0.4973, F1 0.4958, Acc 0.4959.

Table 2: Performance of the features when combined with a content-based classifier. U denotes uncertainty and S denotes surprisal. P: Precision, R: Recall, F1: F1-score, Acc: Accuracy. P, R, and F1 are macro-averaged, and the scores are reported on 10-fold cross validation. The GloVe-only classifier scores P 0.8233, R 0.8232, F1 0.8229, Acc 0.8234.
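The per-feature evaluation protocol (one SVM per feature, scored with 10-fold cross validation) might look like the sketch below. The one-dimensional feature values are synthetic stand-ins generated for illustration; real values would be the uncertainty or surprisal scores computed from GPT-2 as described in Section 4.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: one feature value (e.g. surprisal) per example.
rng = np.random.default_rng(0)
jokes = rng.normal(3.90, 0.6, size=500)      # jokes: higher surprisal on average
non_jokes = rng.normal(3.65, 0.6, size=500)  # non-jokes: lower surprisal

X = np.concatenate([jokes, non_jokes]).reshape(-1, 1)
y = np.array([1] * 500 + [0] * 500)

# One SVM classifier per feature, evaluated with 10-fold cross validation.
scores = cross_val_score(SVC(), X, y, cv=10, scoring="accuracy")
print(scores.mean())
```

With only partial overlap between the two distributions, a single feature already separates the classes better than chance, mirroring the gap between the individual-feature rows and the Random row in Table 1.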
We also evaluated the proposed features as well as the baselines under the framework of a content-based classifier. The idea is to see whether the features could further boost the performance of existing text classifiers. To create a starting point, we encoded each set-up and punchline into vector representations by aggregating the GloVe (Pennington et al., 2014) embeddings of the tokens (summing up and then normalizing by the length). We used the GloVe embeddings with dimension 50, and then concatenated the set-up vector and the punchline vector, to represent the whole piece of text as a vector of dimension 100. For each of the features (uncertainty and surprisal, plus the baselines), we appended it to the GloVe vector, and built an SVM classifier to do the prediction. Scores are reported in Table 2. As we can see, compared with the baselines, our features produce larger increases in the performance of the content-based classifier, and similar to what we have observed in Table 1, jointly considering uncertainty and surprisal gives a further increase in performance.

Figure 3: Histograms of uncertainty (left) and surprisal (right), plotted separately for jokes and non-jokes. Median uncertainty: 3.64 (jokes) vs. 3.47 (non-jokes); median surprisal: 3.90 (jokes) vs. 3.65 (non-jokes).
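The vector construction described above (sum the token embeddings, normalize by length, then concatenate the set-up and punchline vectors) can be sketched as follows; the tiny embedding table is made up for illustration, and treating out-of-vocabulary words as zero vectors is an assumption, not necessarily the authors' choice.

```python
import numpy as np

DIM = 50  # GloVe dimension used in the paper

def sentence_vector(tokens, emb):
    # Sum the GloVe vectors of the tokens, then normalize by the
    # number of tokens.
    v = np.zeros(DIM)
    for t in tokens:
        v += emb.get(t, np.zeros(DIM))
    return v / max(len(tokens), 1)

# Made-up embeddings; real ones would be loaded from a GloVe file.
emb = {"i": np.full(DIM, 0.1), "am": np.full(DIM, 0.3)}

setup_vec = sentence_vector("i am".split(), emb)
punch_vec = sentence_vector("am".split(), emb)
text_vec = np.concatenate([setup_vec, punch_vec])  # dimension 100
print(text_vec.shape)  # (100,)
```

Each scalar feature (uncertainty, surprisal, or a baseline) is then appended to this 100-dimensional vector before training the SVM.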
To get an intuitive picture of the uncertainty and surprisal values for jokes versus non-jokes, we plot their histograms in Figure 3 (for all 3,052 labeled examples). It can be observed that, for both uncertainty and surprisal, jokes tend to have higher values than non-jokes, which is consistent with our expectations in Section 4.
6 Conclusion

In this paper, we made an attempt at establishing a connection between humor theories and today's popular pre-trained language models. Based on that, we proposed two features related to the incongruity theory of humor: uncertainty and surprisal. We conducted experiments on a humor dataset, and the results suggest that our approach has an advantage in humor recognition over the baselines. The proposed features can also provide insight for the task of two-line joke generation: when designing the text generation algorithm, one could exert extra constraints so that the set-up is chosen to be compatible with multiple possible interpretations, and the punchline should be surprising in a way that violates the most obvious interpretation. We hope our work can inspire future research in the community of computational humor.
References
Salvatore Attardo and Victor Raskin. 1991. Script theory revis(it)ed: Joke similarity and joke representation model. Humor: International Journal of Humor Research.

Dario Bertero and Pascale Fung. 2016. A long short-term memory framework for predicting humor in dialogues. In Proceedings of NAACL-HLT 2016, pages 130–135.

Vladislav Blinov, Valeria Bolotova-Baranova, and Pavel Braslavski. 2019. Large dataset and language model fun-tuning for humor recognition. In Proceedings of ACL 2019, pages 4027–4032.

Lei Chen and Chong Min Lee. 2017. Convolutional neural network for humor recognition. CoRR, abs/1702.02584.

Peng-Yu Chen and Von-Wun Soo. 2018. Humor recognition using deep learning. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pages 113–117.

Xiaochao Fan, Hongfei Lin, Liang Yang, Yufeng Diao, Chen Shen, Yonghe Chu, and Tongxuan Zhang. 2020a. Phonetics and ambiguity comprehension gated attention network for humor recognition. Complexity, 2020:2509018:1–2509018:9.

Xiaochao Fan, Hongfei Lin, Liang Yang, Yufeng Diao, Chen Shen, Yonghe Chu, and Yanbo Zou. 2020b. Humor detection via an internal and external neural network. Neurocomputing, 394:105–111.

Md. Kamrul Hasan, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong, Md. Iftekhar Tanveer, Louis-Philippe Morency, and Mohammed (Ehsan) Hoque. 2019. UR-FUNNY: A multimodal language dataset for understanding humor. In Proceedings of EMNLP-IJCNLP 2019, pages 2046–2056.

Immanuel Kant. 1790. Critique of judgment, ed. and trans. WS Pluhar, Indianapolis: Hackett.

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic Lexical Database. The MIT Press.

Herbert M. Lefcourt and Rod A. Martin. 2012. Humor and life stress: Antidote to adversity. Springer Science & Business Media.

Lizhen Liu, Donghai Zhang, and Wei Song. 2018. Modeling sentiment association in discourse for humor recognition. In Proceedings of ACL 2018, Volume 2 (Short Papers), pages 586–591.

Rada Mihalcea and Carlo Strapparava. 2005. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of HLT/EMNLP 2005, pages 531–538.

Rada Mihalcea, Carlo Strapparava, and Stephen G. Pulman. 2010. Computational models for incongruity detection in humour. In Proceedings of CICLing 2010, volume 6008 of Lecture Notes in Computer Science, pages 364–374.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.

Tristan Miller and Iryna Gurevych. 2015. Automatic disambiguation of English puns. In Proceedings of ACL 2015, pages 719–729.

Alex Morales and Chengxiang Zhai. 2017. Identifying humor in reviews using background text sources. In Proceedings of EMNLP 2017, pages 492–501.

John Morreall. 2020. Philosophy of Humor. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2020 edition. Metaphysics Research Lab, Stanford University.

Anton Nijholt, Oliviero Stock, Alan J. Dix, and John Morkes. 2003. Humor modeling in the interface. In CHI 2003 Extended Abstracts on Human Factors in Computing Systems, pages 1050–1051.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.

Roy Rada, Hafedh Mili, Ellen Bicknell, and Maria Blettner. 1989. Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern., 19(1):17–30.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Victor Raskin. 1979. Semantic mechanisms of humor. In Annual Meeting of the Berkeley Linguistics Society, volume 5, pages 325–335.

Byron Reeves and Clifford Nass. 1996. The media equation: How people treat computers, television, and new media like real people and places. Cambridge University Press.

Arthur Schopenhauer. 1883. The world as will and idea (vols. I, II, & III). Haldane, R.B., & Kemp, J. (3 Vols.). London: Kegan Paul, Trench, Trubner, 6.

Orion Weller and Kevin D. Seppi. 2019. Humor detection: A transformer gets the last laugh. In Proceedings of EMNLP-IJCNLP 2019, pages 3619–3623.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of ACL 1994, pages 133–138.

Diyi Yang, Alon Lavie, Chris Dyer, and Eduard H. Hovy. 2015. Humor recognition and humor anchor extraction. In Proceedings of EMNLP 2015, pages 2367–2376.

Dongyu Zhang, Heting Zhang, Xikai Liu, Hongfei Lin, and Feng Xia. 2019. Telling the whole story: A manually annotated Chinese dataset for the analysis of humor in jokes. In Proceedings of EMNLP-IJCNLP 2019, pages 6401–6406.