Uncertainty and Surprisal Jointly Deliver the Punchline: Exploiting Incongruity-Based Features for Humor Recognition
Yubo Xie, Junze Li, and Pearl Pu
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
{yubo.xie,junze.li,pearl.pu}@epfl.ch
Abstract
Humor recognition has been widely studied as a text classification problem using data-driven approaches. However, most existing work does not examine the actual joke mechanism to understand humor. We break down any joke into two distinct components: the set-up and the punchline, and further explore the special relationship between them. Inspired by the incongruity theory of humor, we model the set-up as the part developing semantic uncertainty, and the punchline as the part disrupting audience expectations. With increasingly powerful language models, we were able to feed the set-up along with the punchline into the GPT-2 language model, and calculate the uncertainty and surprisal values of the jokes. By conducting experiments on the SemEval 2021 Task 7 dataset, we found that these two features are better at telling jokes from non-jokes, compared with existing baselines.
1 Introduction

Humor, regardless of age, gender, or cultural background, is perhaps one of the most fascinating human behaviors. Besides being able to provide entertainment, humor can also be beneficial to mental health by serving as a moderator of life stress (Lefcourt and Martin, 2012), and plays an important role in regulating human-human interaction. As Reeves and Nass (1996) have pointed out, people respond to computers in the same way as they do to real people, which indicates that modeling humor computationally could bring positive effects in human-computer interaction (Nijholt et al., 2003).

One of the important aspects of computational humor is to develop computer programs capable of recognizing humor in text. Early work on humor recognition (Mihalcea and Strapparava, 2005) proposed heuristic-based humor-specific stylistic features, for example alliteration, antonymy, and
adult slang. More recent work (Yang et al., 2015; Chen and Soo, 2018; Weller and Seppi, 2019) regarded the problem as a text classification task, and adopted statistical machine learning methods and neural networks to train models on humor datasets. However, only a few of the deep learning methods have tried to establish a connection between humor recognition and humor theories. Thus, one research direction in humor recognition is to bridge the disciplines of linguistics and artificial intelligence.

Figure 1: A joke example consisting of a set-up ("I would never cheat in a relationship.") and a punchline ("Because that would require two people to find me attractive."). A violation can be observed between the punchline and the expectation raised by the set-up ("I am loyal to my partner.").

In this paper, we restrict the subject of investigation to jokes, one of the most common humor types in text form. As shown in Figure 1, these jokes usually consist of a set-up and a punchline. The set-up creates a situation that introduces the hearer into the story framework, and the punchline concludes the joke in a succinct way, intended to make the hearer laugh. Perhaps the most suitable humor theory for explaining such humor phenomena is the incongruity theory, which states that the cause of laughter is the perception of something incongruous (the punchline) that violates the hearer's expectation (the set-up).

Based on the incongruity theory, we propose two features for humor recognition, by calculating the degree of incongruity between the set-up and the punchline. Recently popular pre-trained language models enable us to study such a relationship based on large-scale corpora. Specifically, we fed the set-up along with the punchline into the GPT-2 language model (Radford et al., 2019), and obtained the surprisal and uncertainty values of the joke, indicating how surprising it is for the model to generate the punchline, and the uncertainty while generating it.
We conducted experiments on a manually labeled humor dataset, and the results showed that these two features could better distinguish jokes from non-jokes, compared with existing baselines. Our work made an attempt to bridge humor theories and humor recognition by applying large-scale pre-trained language models, and we hope it could inspire future research in computational humor.

2 Related Work

Humor Data.
Mihalcea and Strapparava (2005) created a one-liner dataset with humorous examples extracted from webpages with humor theme and non-humorous examples from Reuters titles, British National Corpus (BNC) sentences, and English Proverbs. Yang et al. (2015) scraped puns from the Pun of the Day website and negative examples from various news websites. There is also work on the curation of non-English humor datasets (Zhang et al., 2019; Blinov et al., 2019). Hasan et al. (2019) developed UR-FUNNY, a multimodal humor dataset that involves text, audio and video information extracted from TED talks.

Humor Recognition.
Most of the existing work on humor recognition in text focuses on one-liners, a type of joke that delivers the laughter in a single line. The methodologies typically fall into two categories: feature engineering and deep learning. Mihalcea and Strapparava (2005) designed three human-centric features (alliteration, antonymy and synonym) for recognizing humor in the curated one-liner dataset. Mihalcea et al. (2010) approached the problem by calculating the semantic relatedness between the set-up and the punchline (they evaluated 150 one-liners by manually splitting them into "set-up" and "punchline"). Morales and Zhai (2017) proposed a probabilistic model and leveraged background text sources (such as Wikipedia) to identify humorous Yelp reviews. Liu et al. (2018) proposed to model sentiment association between elementary discourse units and designed features based on discourse relations. With neural networks being popular in recent years, some deep learning structures have been developed for the recognition of humor in text. Chen and Lee (2017) and Chen and Soo (2018) adopted convolutional neural networks, while Weller and Seppi (2019) used a Transformer architecture to do the classification task. Fan et al. (2020a,b) incorporated extra phonetic and semantic (ambiguity) information into the deep learning framework.

Although the work of Mihalcea et al. (2010) is the closest to ours, we are the first to bridge the incongruity theory of humor and large-scale pre-trained language models. Other work (Bertero and Fung, 2016) has attempted to predict punchlines in conversations extracted from TV series, but their subject of investigation is inherently different from ours: punchlines in conversations largely depend on the preceding utterances, while jokes are much more succinct and self-contained.
3 Humor Theories

The attempts to explain humor date back to the age of ancient Greece, where philosophers like Plato and Aristotle regarded the enjoyment of comedy as a form of scorn, and held critical opinions towards laughter. These philosophical comments on humor, also followed by early Christian thinkers, were summarized as the superiority theory, which states that laughter expresses a feeling of superiority over other people's misfortunes or shortcomings. Starting from the 18th century, two other humor theories began to challenge the dominance of the superiority theory: the relief theory and the incongruity theory. The relief theory argues that laughter serves to facilitate the relief of pressure for the nervous system. This explains why laughter is caused when people recognize taboo subjects; one typical example is the wide usage of sexual terms in jokes. The incongruity theory, supported by Kant (1790), Schopenhauer (1883), and many later philosophers and psychologists, states that laughter comes from the perception of something incongruous that violates the expectations. This view of humor fits well the types of jokes commonly found in stand-up comedies, where the set-up establishes an expectation, and then the punchline violates it. Morreall (2020) gives a more comprehensive examination of these traditional humor theories.

As an expansion of the incongruity theory, Raskin (1979) proposed the Semantic Script-based Theory of Humor (SSTH) by applying the semantic script theory. It posits that, in order to produce verbal humor, two requirements should be fulfilled: (1) the text is compatible with two different scripts; (2) the two scripts with which the text is compatible are opposite. The General Theory of Verbal Humor (GTVH) (Attardo and Raskin, 1991) expanded the range of descriptive and explanatory dimensions of SSTH to six, called knowledge resources.

4 Method

The incongruity theory attributes humor to the violation of expectation.
This means the punchline delivers the incongruity that turns over the expectation established by the set-up, making it possible to interpret the set-up in a completely different way. With neural networks blooming in recent years, pre-trained language models make it possible to study such a relationship between the set-up and the punchline based on large-scale corpora. Given the set-up, language models are capable of writing expected continuations, enabling us to measure the degree of incongruity by comparing the actual punchline with what the language model is likely to generate.

In this paper, we leverage the GPT-2 language model (Radford et al., 2019), a Transformer-based architecture trained on the WebText dataset consisting of over 8 million documents for a total of 40 GB of text. WebText was curated without making any assumptions on the genres of the text; thus the resulting model is domain independent, and is shown to learn many NLP tasks without explicit supervision. We chose GPT-2 as our research tool because: (1) GPT-2 is already pre-trained on massive data and publicly available online, which spares us the training process; (2) it is domain independent, thus suitable for modeling various styles of English text. Our goal is to model the set-up and the punchline as a whole piece of text using GPT-2, and analyze the probability of generating the punchline given the set-up. In the following text, we denote the set-up as x, and the punchline as y. Basically, we are interested in two quantities regarding the probability distribution p(y | x): uncertainty and surprisal, which are elaborated in the next two sections.

4.1 Uncertainty

The first question we are interested in is: given the set-up, how uncertain is it for the language model to continue? This question is related to SSTH, which states that, for a piece of text to be humorous, it
should be compatible with two different scripts. To put it under the framework of set-up and punchline, this means the set-up could have multiple ways of interpretation, according to the following punchline. Thus, one would expect a higher uncertainty value when the language model tries to continue the set-up and generate the punchline.

Figure 2: The set-up x and the punchline y are concatenated and fed into GPT-2 for predicting the next token. The v_i's are probability distributions over the vocabulary.

We propose to calculate the averaged entropy of the probability distributions at all token positions of the punchline, to represent the degree of uncertainty. As shown in Figure 2, the set-up x and the punchline y are concatenated and then fed into GPT-2 to predict the next token. While predicting the tokens of y, GPT-2 produces a probability distribution v_i over the vocabulary. The averaged entropy is then defined as

U(x, y) = -\frac{1}{|y|} \sum_{i=1}^{n} \sum_{w \in V} v_i^w \log v_i^w,   (1)

where V is the vocabulary.

4.2 Surprisal

The second question we would like to address is: how surprising is it when the language model actually generates the punchline? As the incongruity theory states, laughter is caused when something incongruous is observed and it violates the previously established expectation. Therefore, we expect the probability of the language model generating the actual punchline to be relatively low, which indicates the surprisal value should be high. Formally, the surprisal is defined as

S(x, y) = -\frac{1}{|y|} \log p(y | x) = -\frac{1}{|y|} \sum_{i=1}^{n} \log v_i^{y_i}.   (2)

5 Experiments
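As a concrete illustration (not the authors' released code), Equations (1) and (2) can be computed directly from the per-token next-token distributions v_i. In practice these distributions would come from GPT-2 (e.g., the softmax over the logits of a pre-trained language model); the sketch below uses small hand-made distributions so the arithmetic is easy to follow.

```python
import math

def uncertainty(dists):
    """Eq. (1): entropy of each next-token distribution over the
    punchline positions, averaged over the punchline length |y|."""
    total = -sum(p * math.log(p) for v in dists for p in v.values() if p > 0)
    return total / len(dists)

def surprisal(dists, punchline):
    """Eq. (2): negative log-probability of the actual punchline
    tokens, averaged over the punchline length |y|."""
    return -sum(math.log(v[tok]) for v, tok in zip(dists, punchline)) / len(dists)

# Hand-made distributions over a 3-token vocabulary (illustrative only);
# with GPT-2, each v would be the model's output distribution at one
# punchline position.
dists = [
    {"a": 0.5, "b": 0.25, "c": 0.25},
    {"a": 0.1, "b": 0.8, "c": 0.1},
]
print(uncertainty(dists))            # averaged entropy of the two distributions
print(surprisal(dists, ["c", "b"]))  # lower-probability tokens give higher surprisal
```

A joke whose punchline tokens sit in the low-probability tail of these distributions yields a high surprisal, which is exactly the incongruity signal the paper exploits.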
We evaluated and compared the proposed features with several baselines by conducting experiments in two settings: predicting using individual features, and combining the features with a content-based text classifier.
5.1 Baselines

Similar to our approach of analyzing the relationship between the set-up and the punchline, Mihalcea et al. (2010) proposed to calculate the semantic relatedness between the set-up and the punchline. The intuition is that the punchline (which delivers the surprise) will have a minimum relatedness to the set-up. For our experiments, we chose the two relatedness metrics that perform the best in their paper as our baselines, plus another similarity metric based on shortest paths in WordNet (Miller, 1995):

- Leacock & Chodorow similarity (Leacock and Chodorow, 1998), defined as

  Sim_lch = -\log \frac{length}{2D},   (3)

  where length is the length of the shortest path between two concepts using node-counting, and D is the maximum depth of WordNet.

- Wu & Palmer similarity (Wu and Palmer, 1994) calculates similarity by considering the depths of the two synsets in WordNet, along with the depth of their LCS (Least Common Subsumer), and is defined as

  Sim_wup = \frac{2 \cdot depth(LCS)}{depth(C_1) + depth(C_2)},   (4)

  where C_1 and C_2 denote synset 1 and synset 2, respectively.

- Path similarity (Rada et al., 1989) is also based on the length of the shortest path between two concepts in WordNet, and is defined as

  Sim_path = \frac{1}{1 + length}.   (5)

In addition to the metrics mentioned above, we also consider the following two baselines related to the phonetic and semantic styles of the input text:

- Alliteration. The alliteration value is computed as the total number of alliteration chains and rhyme chains found in the input text (Mihalcea and Strapparava, 2005).

- Ambiguity. Semantic ambiguity is found to be a crucial part of humor (Miller and Gurevych, 2015). We follow the work of Liu et al. (2018) to compute the ambiguity value:

  \log \prod_{w \in s} num\_of\_senses(w),   (6)

  where w is a word in the input text s.

5.2 Dataset

We took the dataset from SemEval 2021 Task 7. The released training set contains 8,000 manually labeled examples in total, with 4,932 being positive, and 3,068 negative. To adapt the dataset for our purpose, we only considered positive examples with exactly two sentences, and negative examples with at least two sentences. For positive examples (jokes), the first sentence was treated as the set-up and the second the punchline. For negative examples (non-jokes), two consecutive sentences were treated as the set-up and the punchline, respectively. After splitting, we cleaned the data with the following rules: (1) we restricted the length of set-ups and punchlines to be under 20 (by counting the number of tokens); (2) we only kept punchlines whose percentage of alphabetical letters is greater than or equal to 75%; (3) we discarded punchlines that do not begin with an alphabetical letter. As a result, we obtained 3,341 examples in total, consisting of 1,815 jokes and 1,526 non-jokes.
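Equations (3)-(6) above reduce to simple functions of path lengths, taxonomy depths, and sense counts; the minimal sketch below implements them directly (in practice, NLTK's wordnet module provides ready-made lch_similarity, wup_similarity, and path_similarity implementations).

```python
import math

def sim_lch(length, max_depth):
    # Eq. (3): Leacock & Chodorow similarity from the shortest path
    # length (node counting) and the maximum taxonomy depth D.
    return -math.log(length / (2 * max_depth))

def sim_wup(depth_lcs, depth_c1, depth_c2):
    # Eq. (4): Wu & Palmer similarity from the two synset depths and
    # the depth of their least common subsumer (LCS).
    return 2 * depth_lcs / (depth_c1 + depth_c2)

def sim_path(length):
    # Eq. (5): path similarity from the shortest path length.
    return 1 / (1 + length)

def ambiguity(sense_counts):
    # Eq. (6): log of the product of per-word sense counts, computed
    # as a sum of logs for numerical stability.
    return sum(math.log(n) for n in sense_counts)
```

The path lengths and depths here would be looked up in WordNet for each word pair; the sense counts for the ambiguity feature are the number of WordNet senses per word.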
To further balance the data, we randomly selected 1,526 jokes, and thus the final dataset contains 3,052 labeled examples in total. For the following experiments, we used 10-fold cross validation, and the averaged scores are reported.
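The three cleaning rules above can be sketched as a filter function (a hypothetical reimplementation; whitespace tokenization and the exact character accounting are assumptions, not the authors' stated choices):

```python
def keep_example(setup, punchline):
    # Rule 1: set-up and punchline must each be under 20 tokens
    # (whitespace tokenization is an assumption here).
    if len(setup.split()) >= 20 or len(punchline.split()) >= 20:
        return False
    # Rule 2: at least 75% of the punchline's characters must be
    # alphabetical letters.
    if sum(c.isalpha() for c in punchline) / max(len(punchline), 1) < 0.75:
        return False
    # Rule 3: the punchline must begin with an alphabetical letter.
    return punchline[:1].isalpha()

print(keep_example("I would never cheat in a relationship.",
                   "Because that would require two people to find me attractive."))  # True
```

Applying such a filter to the two-sentence examples yields the 3,341 cleaned examples reported above.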
5.3 Results

To test the effectiveness of our features in distinguishing jokes from non-jokes, we built an SVM classifier for each individual feature (uncertainty and surprisal, plus the baselines). The resulting scores are reported in Table 1. Compared with the baselines, both of our features (uncertainty and surprisal) achieved higher scores on all four metrics. In addition, we also tested the performance of uncertainty combined with surprisal (last row of the table), and the resulting classifier shows a further increase in performance. This suggests that, by jointly considering the uncertainty and surprisal of the set-up and the punchline, we are better at recognizing jokes.

Footnote 1: https://semeval.github.io/SemEval2021/
Footnote 2: We refer to them as set-up and punchline for the sake of convenience, but since they are not jokes, the two sentences are not a real set-up and punchline.

Table 1: Performance of individual features. The last row (U+S) is the combination of uncertainty and surprisal. P: Precision, R: Recall, F1: F1-score, Acc: Accuracy. P, R, and F1 are macro-averaged, and the scores are reported on 10-fold cross validation. The Random baseline scores P 0.4973, R 0.4973, F1 0.4958, Acc 0.4959.

Table 2: Performance of the features when combined with a content-based classifier. U denotes uncertainty and S denotes surprisal. P: Precision, R: Recall, F1: F1-score, Acc: Accuracy. P, R, and F1 are macro-averaged, and the scores are reported on 10-fold cross validation. The GloVe-only classifier scores P 0.8233, R 0.8232, F1 0.8229, Acc 0.8234.
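The per-feature evaluation protocol (one SVM per feature, scored with 10-fold cross validation) might look like the sketch below. The one-dimensional feature values are synthetic stand-ins generated for illustration; real values would be the uncertainty or surprisal scores computed from GPT-2 as described in Section 4.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: one feature value (e.g. surprisal) per example.
rng = np.random.default_rng(0)
jokes = rng.normal(3.90, 0.6, size=500)      # jokes: higher surprisal on average
non_jokes = rng.normal(3.65, 0.6, size=500)  # non-jokes: lower surprisal

X = np.concatenate([jokes, non_jokes]).reshape(-1, 1)
y = np.array([1] * 500 + [0] * 500)

# One SVM classifier per feature, evaluated with 10-fold cross validation.
scores = cross_val_score(SVC(), X, y, cv=10, scoring="accuracy")
print(scores.mean())
```

With only partial overlap between the two distributions, a single feature already separates the classes better than chance, mirroring the gap between the individual-feature rows and the Random row in Table 1.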
We also evaluated the proposed features as well as the baselines under the framework of a content-based classifier. The idea is to see whether the features could further boost the performance of existing text classifiers. To create a starting point, we encoded each set-up and punchline into vector representations by aggregating the GloVe (Pennington et al., 2014) embeddings of the tokens (summing up and then normalizing by the length). We used the GloVe embeddings with dimension 50, and then concatenated the set-up vector and the punchline vector, to represent the whole piece of text as a vector of dimension 100. For each of the features (uncertainty and surprisal, plus the baselines), we appended it to the GloVe vector, and built an SVM classifier to do the prediction. Scores are reported in Table 2. As we can see, compared with the baselines, our features produce larger increases in the performance of the content-based classifier, and similar to what we have observed in Table 1, jointly considering uncertainty and surprisal gives a further increase in performance.

Figure 3: Histograms of uncertainty (left) and surprisal (right), plotted separately for jokes and non-jokes. Median uncertainty: 3.64 (jokes) vs. 3.47 (non-jokes); median surprisal: 3.90 (jokes) vs. 3.65 (non-jokes).
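The vector construction described above (sum the token embeddings, normalize by length, then concatenate the set-up and punchline vectors) can be sketched as follows; the tiny embedding table is made up for illustration, and treating out-of-vocabulary words as zero vectors is an assumption, not necessarily the authors' choice.

```python
import numpy as np

DIM = 50  # GloVe dimension used in the paper

def sentence_vector(tokens, emb):
    # Sum the GloVe vectors of the tokens, then normalize by the
    # number of tokens.
    v = np.zeros(DIM)
    for t in tokens:
        v += emb.get(t, np.zeros(DIM))
    return v / max(len(tokens), 1)

# Made-up embeddings; real ones would be loaded from a GloVe file.
emb = {"i": np.full(DIM, 0.1), "am": np.full(DIM, 0.3)}

setup_vec = sentence_vector("i am".split(), emb)
punch_vec = sentence_vector("am".split(), emb)
text_vec = np.concatenate([setup_vec, punch_vec])  # dimension 100
print(text_vec.shape)  # (100,)
```

Each scalar feature (uncertainty, surprisal, or a baseline) is then appended to this 100-dimensional vector before training the SVM.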
To get an intuitive picture of the uncertainty and surprisal values for jokes versus non-jokes, we plot their histograms in Figure 3 (for all 3,052 labeled examples). It can be observed that, for both uncertainty and surprisal, jokes tend to have higher values than non-jokes, which is consistent with our expectations in Section 4.
6 Conclusion

In this paper, we made an attempt at establishing a connection between humor theories and today's popular pre-trained language models. Based on that, we proposed two features related to the incongruity theory of humor: uncertainty and surprisal. We conducted experiments on a humor dataset, and the results suggest that our approach has an advantage in humor recognition over the baselines. The proposed features can also provide insight for the task of two-line joke generation: when designing the text generation algorithm, one could exert extra constraints so that the set-up is chosen to be compatible with multiple possible interpretations, and the punchline should be surprising in a way that violates the most obvious interpretation. We hope our work can inspire future research in the community of computational humor.
References
Salvatore Attardo and Victor Raskin. 1991. Script theory revis(it)ed: Joke similarity and joke representation model. Humor: International Journal of Humor Research.

Dario Bertero and Pascale Fung. 2016. A long short-term memory framework for predicting humor in dialogues. In Proceedings of NAACL-HLT 2016, pages 130–135.

Vladislav Blinov, Valeria Bolotova-Baranova, and Pavel Braslavski. 2019. Large dataset and language model fun-tuning for humor recognition. In Proceedings of ACL 2019, pages 4027–4032.

Lei Chen and Chong Min Lee. 2017. Convolutional neural network for humor recognition. CoRR, abs/1702.02584.

Peng-Yu Chen and Von-Wun Soo. 2018. Humor recognition using deep learning. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pages 113–117.

Xiaochao Fan, Hongfei Lin, Liang Yang, Yufeng Diao, Chen Shen, Yonghe Chu, and Tongxuan Zhang. 2020a. Phonetics and ambiguity comprehension gated attention network for humor recognition. Complexity, 2020:2509018:1–2509018:9.

Xiaochao Fan, Hongfei Lin, Liang Yang, Yufeng Diao, Chen Shen, Yonghe Chu, and Yanbo Zou. 2020b. Humor detection via an internal and external neural network. Neurocomputing, 394:105–111.

Md. Kamrul Hasan, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong, Md. Iftekhar Tanveer, Louis-Philippe Morency, and Mohammed (Ehsan) Hoque. 2019. UR-FUNNY: A multimodal language dataset for understanding humor. In Proceedings of EMNLP-IJCNLP 2019, pages 2046–2056.

Immanuel Kant. 1790. Critique of judgment, ed. and trans. WS Pluhar, Indianapolis: Hackett.

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet sense similarity for word sense identification. In WordNet, An Electronic Lexical Database. The MIT Press.

Herbert M. Lefcourt and Rod A. Martin. 2012. Humor and life stress: Antidote to adversity. Springer Science & Business Media.

Lizhen Liu, Donghai Zhang, and Wei Song. 2018. Modeling sentiment association in discourse for humor recognition. In Proceedings of ACL 2018, Volume 2 (Short Papers), pages 586–591.

Rada Mihalcea and Carlo Strapparava. 2005. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of HLT/EMNLP 2005, pages 531–538.

Rada Mihalcea, Carlo Strapparava, and Stephen G. Pulman. 2010. Computational models for incongruity detection in humour. In Proceedings of CICLing 2010, volume 6008 of Lecture Notes in Computer Science, pages 364–374.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.

Tristan Miller and Iryna Gurevych. 2015. Automatic disambiguation of English puns. In Proceedings of ACL 2015, pages 719–729.

Alex Morales and Chengxiang Zhai. 2017. Identifying humor in reviews using background text sources. In Proceedings of EMNLP 2017, pages 492–501.

John Morreall. 2020. Philosophy of Humor. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2020 edition. Metaphysics Research Lab, Stanford University.

Anton Nijholt, Oliviero Stock, Alan J. Dix, and John Morkes. 2003. Humor modeling in the interface. In CHI 2003 Extended Abstracts on Human Factors in Computing Systems, pages 1050–1051.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.

Roy Rada, Hafedh Mili, Ellen Bicknell, and Maria Blettner. 1989. Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern., 19(1):17–30.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Victor Raskin. 1979. Semantic mechanisms of humor. In Annual Meeting of the Berkeley Linguistics Society, volume 5, pages 325–335.

Byron Reeves and Clifford Nass. 1996. The media equation: How people treat computers, television, and new media like real people and places. Cambridge University Press.

Arthur Schopenhauer. 1883. The world as will and idea (vols. I, II, & III). Haldane, R.B., & Kemp, J. (3 Vols.). London: Kegan Paul, Trench, Trubner, 6.

Orion Weller and Kevin D. Seppi. 2019. Humor detection: A transformer gets the last laugh. In Proceedings of EMNLP-IJCNLP 2019, pages 3619–3623.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of ACL 1994, pages 133–138.

Diyi Yang, Alon Lavie, Chris Dyer, and Eduard H. Hovy. 2015. Humor recognition and humor anchor extraction. In Proceedings of EMNLP 2015, pages 2367–2376.

Dongyu Zhang, Heting Zhang, Xikai Liu, Hongfei Lin, and Feng Xia. 2019. Telling the whole story: A manually annotated Chinese dataset for the analysis of humor in jokes. In Proceedings of EMNLP-IJCNLP 2019, pages 6401–6406.