Birds of a Feather Flock Together: Satirical News Detection via Language Model Differentiation
Yigeng Zhang, Fan Yang, Yifan Zhang, Eduard Dragut, Arjun Mukherjee
University of Houston, Houston, TX 77004, USA
{yzhang168, fyang11, yzhang114}@uh.edu, [email protected]
Temple University, Philadelphia, PA 19122, USA
[email protected]
Abstract.
Satirical news is regularly shared in modern social media because it is entertaining, with smartly embedded humor. However, it can be harmful to society because it can sometimes be mistaken for factual news, due to its deceptive character. We found that in satirical news, the lexical and pragmatic attributes of the context are the key factors in amusing the readers. In this work, we propose a method that differentiates satirical news from true news. It takes advantage of satirical writing evidence by leveraging the difference between the prediction loss of two language models, one trained on true news and the other on satirical news, when given a new news article. We compute several statistical metrics of language model prediction loss as features, which are then used to conduct downstream classification. The proposed method is computationally effective because the language models capture the language usage differences between satirical news documents and traditional news documents, and are sensitive when applied to documents outside their domains.
Keywords:
Satirical news detection · Text classification · Deception detection.
1 Introduction

Satirical news is a kind of literary work that consists of parodies of mainstream journalism, mundane events, or other humor. In the modern world, satirical news can be harmful to a society because it is deceptive in nature and hard to distinguish on many occasions. The creative approach of satire is witty, metaphorical, and subtle, and people without a related cultural or contextual background may have difficulty telling it apart from factual news items. Satirical news may have unintentional consequences similar to fake news [11] and, thus, investigating methods of filtering satirical news has drawn people's attention.

In recent years, there has been a surge of work on fake news detection; however, the aim of producing news satire is not to contradict the truth and mislead people as fake news does, but rather to entertain as a form of parody. Satirical news articles have these characteristics:

– Imaginative Content: Similar to fake news, satirical news also entails fictional content [11]. Fake news, pretending to report a real story, is intended to deliver false information to mislead people. The fake stories are created to seem reasonable, so people who are fooled take them as fact without being skeptical. Although satirical news is also fictional, the purpose of creating satirical fake content is to make the readers aware of the irony and the humor behind it.
– Seemingly Formal and Serious Writing Style: Satirical news is usually written in a formal form and subjective tone in the same way as true news. Yang et al. reported that news satire is written in subjective tones, suggesting a formal form and mimicking true news [16]. This gives satire a kind of humor by contrasting the serious writing style with the ridiculous story.
– Contradicting Common Sense: Once a reader deciphers the irony and deadpan humor in a satire article, the reader will realize its ridiculousness, since the content violates common sense.
Satirical news stories sometimes combine irrelevant subjects to create unexpectedness and humor [10]. For example:

'Father spends joyful afternoon throwing son around backyard.' —Onion

This sentence gives a sense of ridiculousness because the object 'son' is not usually 'thrown' for fun around the backyard. Sometimes the stories are made up of impossible events that would never happen in the real world, if one knows the context. For example:

'Vice President Mike Pence reportedly visited his conversion therapist Thursday for a routine gay-preventative checkup.' —Onion

– Humoristic and Amusing: The purpose of creating news satire is to criticize or comment on some social affair in a humorous way [1]. Readers usually find funny metaphors and comical stories in it. This entertaining property increases the popularity of news satire in social media outlets.

Although this seemingly formal writing style makes satirical news hard to detect, the breach of lexical and pragmatic information can address the problem. Since satirical news usually makes up stories with preposterous content, it is easy for people with the corresponding knowledge and cultural background to recognize it. Using a similar idea to how people discern illogical information, we can leverage computational models which have the ability to gain domain knowledge and 'judge' like human beings. Language models (LMs) are one possible solution because an LM encodes knowledge from the context it is trained on [15]. In this work, by making use of the characteristics of satirical content, we propose a new method to detect satirical news, in which language models play an important part.
2 Related Work

In recent years, there have been many works on news and sentiment analysis [5][14][13][4]. One particular area has been fake and satirical news detection using machine learning methods. Burfoot & Baldwin conducted research on differentiating satire news and true news using an SVM with targeted lexical features and semantic validity on a 4000-article dataset [1]. Rubin et al. categorized news satire as a kind of 'humorous fakes' in deceptive news [12]. They proposed an SVM-based algorithm for satire detection and tested it with 360 pieces of news [11]. Yang et al. built a dataset for satirical news detection [16], which contains a much larger number of satirical and true news articles from various sources. They also proposed a method with Hierarchical Attention Networks and many linguistic features, such as writing stylistic features and readability features. They argued that satirical cues often appear in certain paragraphs instead of the whole article. De Sarkar et al. work at the level of sentence and document embeddings, and they also use many syntactic features such as part-of-speech tags and named entity features [3].

Compared to existing satirical news detection works, we propose an approach that seeks to capture the characteristics of news satire mentioned in Section 1 and use them to distinguish satire from other news genres. Our method utilizes two separately trained language models, one from satirical news and one from true news, as the 'brain' of domain knowledge. Then we obtain a series of satire/non-satire measurement scores (surprise scores) for each news article from the language models. With these scores as features, a classifier is trained for future prediction. As the idiom 'Birds of a feather flock together' goes, we find that news of the same category will have similar feature representations, while news from different categories will show a significant difference when viewed through the corresponding language models.
The dataset we use for satirical news detection is from the work by Yang et al. [16]. Experimental results on this dataset show that our method can achieve state-of-the-art performance on the validation dataset and competitive results on the test data. Moreover, it only uses classical neural language models with shallow layers and a small number of features computed from several basic statistics of the language model output, instead of sophisticated feature engineering.
3 Method

In this section, we first present the underlying hypothesis of our approach. Then we present our model and classification pipeline.
Satirical news is written in a seemingly formal and serious way, just like true news. But people can distinguish satirical news from true news because its content violates people's common sense. As discussed in Section 1, satirical content contains stories which can hardly happen in real life. Looking at this phenomenon at the language level, the pairing of subjects and objects and the word collocations in a large number of satirical news articles appear to be significantly different from those in a true news collection. Since language models (LMs) encode knowledge from the data they are trained on, they are expected to act differently (i.e., produce different scores) when fed with uncommon text. We expect this to be reflected in the entropy (the logarithm of perplexity) when a pre-trained true news LM is used to fit true news versus satire news. We assume that, because of the lexical and pragmatic differences, when true or satire news is fed to a pre-trained language model, news samples from a different category will result in an obvious difference as judged by the output of the language model.
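The entropy/perplexity relationship invoked here can be made concrete with a small numeric sketch (pure Python; the per-token probabilities are invented for illustration, not taken from the paper's models):

```python
import math

def mean_cross_entropy(token_probs):
    """Mean negative log-probability (in nats) the LM assigns to each token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the mean cross-entropy,
    so entropy equals the logarithm of perplexity."""
    return math.exp(mean_cross_entropy(token_probs))

# Hypothetical token probabilities an in-domain vs. out-of-domain
# sentence might receive from a pre-trained true news LM.
in_domain = [0.20, 0.30, 0.25, 0.40]
out_of_domain = [0.01, 0.05, 0.02, 0.03]

# Out-of-domain text yields higher entropy (and thus higher perplexity).
assert mean_cross_entropy(out_of_domain) > mean_cross_entropy(in_domain)
assert math.isclose(math.log(perplexity(in_domain)),
                    mean_cross_entropy(in_domain))
```
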
Fig. 1: Basic encoder-decoder LM (one-hot word index → word embedding → recurrent module, a 2-layer LSTM → projection → output).
Fig. 2: Pipeline (each sentence of a news article is scored by the true news LM and the satirical news LM; the true and satire score statistics are concatenated into a feature vector for the classifier, used for training and testing).

By applying the Wilcoxon signed-rank test to the output pairs from the two language models, we expect to show a statistically significant difference [7]. Here we define a metric named the surprise score. A surprise score is the arithmetic mean of the token-level entropy loss values that the LM produces when fed a piece of sequential text. The more distinct the new text piece is from the training data, the higher we expect the surprise score obtained with the LM to be. For example, a piece of satirical news will have a higher surprise score than a true news article after being fed to a language model trained on true news documents. By leveraging these surprise scores as features, we can perform the classification effectively.
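To make the surprise score concrete, here is a minimal sketch using a toy Laplace-smoothed unigram model as a stand-in for the paper's LSTM language model (the corpus, sentences, and function names are invented for illustration):

```python
import math
from collections import Counter

def train_unigram_lm(corpus_tokens):
    """Toy stand-in for the paper's LSTM LM: a Laplace-smoothed unigram model."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def log_prob(token):
        return math.log((counts[token] + 1) / (total + vocab))
    return log_prob

def surprise_score(log_prob, tokens):
    """Arithmetic mean of the token-level entropy loss, as defined above."""
    return -sum(log_prob(t) for t in tokens) / len(tokens)

# Hypothetical miniature "true news" training text.
true_corpus = ("the senate passed the budget bill on tuesday "
               "officials said the vote was close").split()
lm_true = train_unigram_lm(true_corpus)

factual = "the senate passed the bill".split()
satire = "man throws son around backyard".split()

# The out-of-domain (satirical) sentence surprises the true news LM more.
assert surprise_score(lm_true, satire) > surprise_score(lm_true, factual)
```
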
A language model (LM) can be defined as a probability distribution over a sequence of words. It is usually trained to estimate the likelihood of the next word w_t in a sequence given the previous k words of context:

p(w_t | w_{t-k}, ..., w_{t-1}) = LM(w_{t-k}, ..., w_{t-1})    (1)

where w_t ∈ V, the finite vocabulary of the context. The recurrent neural network based language model (RNN LM) was first proposed by Mikolov et al. [9]. The neural language model used in this work follows a basic encoder-decoder language model with an LSTM as the recurrent module (shown in Figure 1). The prediction procedure is derived by:

w_t = L y*_{t-1}    (2)
h_t = f(w_t, h_{t-1})    (3)
y_t = W h_t + b    (4)

The matrix L ∈ R^{d_x × |V|} is the word embedding matrix and y*_{t-1} is the one-hot vector of the previous word; f refers to the LSTM module, and W is the decoder projection matrix. After the linear transformation in the decoder, the output y_t is obtained. The cross-entropy loss on a sequence is calculated as below. In this loss function, x is the raw output score vector from the linear projection layer (normalized into the prediction ŷ), and i is the dictionary index of each corresponding word, which serves as the class index:

L(y_target, ŷ) = − Σ_{i=1}^{|V|} y_target,i log ŷ_i    (5)

We favor a less complex classification pipeline in this work. The input news article is fed as a word sequence into the two language models, which are trained on true news and satirical news documents, respectively. From each language model, we get an output sequence of loss values, one for each sentence of the article.
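For a one-hot target y_target, the loss in Eq. (5) reduces to the negative log of the probability the model assigns to the correct word. A minimal sketch (pure Python; the raw scores and the 4-word vocabulary are invented):

```python
import math

def softmax(scores):
    """Turn raw decoder scores x into a probability distribution over V."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(target_index, scores):
    """Eq. (5) with a one-hot target: -log of the predicted probability
    of the correct vocabulary index."""
    probs = softmax(scores)
    return -math.log(probs[target_index])

# Hypothetical raw output scores over a 4-word vocabulary.
scores = [2.0, 0.5, 0.1, -1.0]
# The loss is small when the target is the highest-scoring word ...
low = cross_entropy(0, scores)
# ... and large when the target word was considered unlikely.
high = cross_entropy(3, scores)
assert low < high
```
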
Now each article of n sentences is represented by two corresponding sequences, describing the impressions of the true and satirical news language models:

Article_true^i = score_{t,1}, score_{t,2}, score_{t,3}, ..., score_{t,n}
Article_satire^i = score_{s,1}, score_{s,2}, score_{s,3}, ..., score_{s,n}

We calculate descriptive statistics of each sequence of surprise scores as satire/true features: the sample size N, the arithmetic mean X̄, the median X̃, the sample variance s², and the range R, where the median is the middle value and the sample size is the number of sentences of each news article. We then concatenate all of the features into a 9-dimensional vector to represent one news article. For each article, we finally obtain a low-dimensional feature vector:

[N, X̄_t, X̃_t, s²_t, R_t, X̄_s, X̃_s, s²_s, R_s]    (6)

where the subscript s (for satirical) or t (for true) indicates which language model the corresponding feature comes from. An SVM classifier is trained and tested using the above feature vectors. Figure 2 shows the proposed classification pipeline.

4 Experiments

In this section, we first present the dataset and hypothesis testing. Then we introduce the implementation details. Finally, we dive into the evaluation and analysis.
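The 9-dimensional feature vector of Eq. (6) above can be assembled directly from the two surprise-score sequences; a minimal sketch (pure Python; the per-sentence scores are invented, and `statistics.variance` computes the sample variance):

```python
import statistics

def score_statistics(scores):
    """Mean, median, sample variance, and range of one score sequence."""
    return [statistics.mean(scores),
            statistics.median(scores),
            statistics.variance(scores),
            max(scores) - min(scores)]

def article_features(true_scores, satire_scores):
    """9-dimensional vector per Eq. (6): sentence count N plus the four
    statistics from each language model's surprise-score sequence."""
    assert len(true_scores) == len(satire_scores)
    return ([len(true_scores)]
            + score_statistics(true_scores)
            + score_statistics(satire_scores))

# Hypothetical per-sentence surprise scores for a 4-sentence article.
t = [3.4, 4.1, 3.9, 4.6]
s = [5.2, 6.0, 5.5, 6.3]
vec = article_features(t, s)
assert len(vec) == 9
assert vec[0] == 4  # N, the number of sentences
```
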
The news dataset we use in this paper is from the work by Yang et al. [16]. The news articles are crawled from both satirical news providers (such as the Onion) and true news websites (such as CNN). The news headline, creation time, and author information are removed so as not to introduce obvious features of a news source. In this work, we concentrate on the binary classification task. The usage of this news dataset can be found in Table 1. We divide the original training data from [16] into two parts: one part to train the two language models and one part to train the classifier. The former takes 2/3 of the original training data, while the latter takes the remaining 1/3. The data for validation and test remains the same. The part of the data for training the classifier
Table 1: The usage distribution of the News dataset.
[Table 1: article counts per split. Columns: Original Train (Train LM / Train SVM), Validation, Test; the True news row begins with 101268, the remaining counts were lost in extraction.]

is roughly the same size as the validation data and the test data. This division is meant to balance each part of the data usage and also to demonstrate the effectiveness and generalization ability of the classifier. Moreover, this dataset draws on various news sources. Train: Onion, the Spoof; Validation: Daily Current, Daily Report, Enduring Vision, Gomerblog, National Report, Satire Tribune, Satire Wire, Syruptrap, and Unconfirmed Source; Test: Satire World, Beaverton, and Ossurworld. This makes the data distribution and its characteristics non-uniform. In this case, the adaptability of the proposed method is tested, since the language models and the classifier are trained on different news sources than those used for evaluation. Therefore, because of the diversity of the news sources in this dataset, it can help disclose whether our method has the ability to generally capture the knowledge difference between satirical content and true content.
As mentioned in Section 3, we must show that the statistics computed from the true/satire surprise scores of each article, produced by the true and satire LMs respectively, have a significant difference. Therefore, we use statistical hypothesis testing to examine whether the score pairs differ in a statistically significant way.
Null Hypothesis (H0): For all news articles, each statistical feature calculated from the surprise scores of the true LM has no statistically significant difference, pairwise, from the corresponding feature of the satire LM.
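One way to test H0 on paired features is a signed-rank statistic. Here is a simplified pure-Python sketch of the Wilcoxon signed-rank statistic with invented paired values; it is not the implementation used in the paper, and a real analysis would use a vetted routine such as scipy.stats.wilcoxon:

```python
def wilcoxon_w(xs, ys):
    """Wilcoxon signed-rank statistic: rank the absolute differences,
    then sum the ranks of positive and negative differences separately.
    A small W (relative to n) suggests a systematic pairwise difference.
    Simplification: zero differences are dropped; ties get average ranks."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical paired surprise-score statistics from the two LMs: the
# satire-LM values are systematically higher, so W is 0 (all ranks on one side).
true_lm = [3.4, 4.1, 3.9, 4.6, 3.7]
satire_lm = [5.2, 6.0, 5.5, 6.3, 5.8]
assert wilcoxon_w(true_lm, satire_lm) == 0.0
```
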
Here we use the Wilcoxon signed-rank test [7], which is often used to determine whether two related samples have the same distribution. By applying this test to every pair of statistical features, we obtained vanishingly small p-values for both the satire and the true news sample test pairs, which means the Null Hypothesis H0 is rejected: there is a significant difference between the true/satire surprise score statistics produced by the true and satire LMs.

In this work, we implement the pipeline using typical neural network modules and modest settings in each part, in contrast to other classification methods with complex network structures or delicate embeddings. For the language model part, the size of the word embeddings in the encoder is 200. The RNN module is a typical 2-layer LSTM with 200 hidden units per layer. A dropout rate of 0.2 is applied when training the language models. Both the true news and the satirical news LMs are trained for 6 epochs. For the classifier, a typical SVM with a linear/polynomial kernel (sklearn.svm.SVC) is used.

True News Sentence/Paragraph Examples (True LM Score | Satire LM Score):
– "We are not going to comment on the timeline of the evidence that comes in, Lynch said." (3.458497763 | 4.547165394)
– "Daley Blind rescued Manchester United with a strike deep into injury time on Sunday" (5.857816696 | 7.925971508)
– "It is risky. Both situations defy easy solution. The Iranians have changed their tone but must go a long way to prove they are changing their intent, embracing transparency and adhering to international standards. Even if they do, if they continue to support terrorist groups like Hezbollah they will be at loggerheads with the United States." (4.679250264 | 5.665326705)

Satirical News Sentence/Paragraph Examples (True LM Score | Satire LM Score):
– "His job is raking the ocean waves flat so the sun can shine through." (6.591171265 | 6.478577614)
– "After moving several shelves of canned goods to his garage workbench area, Svoboda attempted to break open a can of lentil soup using a pair of needle nose pliers and a blowtorch." (7.182848454 | 6.970126152)
– "Although not a regular reader of Der Spiegel, Matherson said he gleaned information about the publication from the celebrity news program Insider, which he typically watches alone in his room while eating cold cereal." (6.212657452 | 6.450323237)

Fig. 3: Some examples of selected news sentences/paragraphs with their surprise scores specified by the true news LM and the satirical news LM.
The two language models of true/satirical news give surprise scores to each sentence, which together form a feature vector for one piece of news. Figure 3 shows some example sentences with their surprise scores from the two kinds of LMs. For the true news samples, the surprise scores are generally lower than for the satirical news samples. Moreover, their scores from the true news LM are lower than those from the satire LM, which confirms our hypothesis. For the satirical news samples, given their characteristics, the scores not only rise higher but also become harder to tell apart between the two LMs. This is also abstractly reflected in Figure 5.
Fig. 4: A t-SNE visualization of 1000 randomly sampled feature vectors of true news (red)/satirical news (blue) from the Train SVM part of the data.
Fig. 5: An illustration of the comparison of the average and standard deviation of each feature between true news (red) and satirical news (blue). The feature N is scaled.

The statistics calculated from the score sequences depict the different behavior on the two LMs from a higher perspective. Figure 4 shows that, by using the proposed feature vector, it is easy to see the separation of true news samples and satirical news
Table 2: Performance results with an increasing number of features/feature-pairs. 1: Mean; 2: Mean + Median; 3: Mean + Median + Sample Variance; 4: Mean + Median + Sample Variance + Range; 5: Mean + Median + Sample Variance + Range + Sample Size.
   Validation                     Test
#  Acc    Pre    Rec    F1       Acc    Pre    Rec    F1
1  96.35  92.01  82.83  86.74    94.82  90.51  77.37  82.35
2  96.39  92.08  83.09  86.93    94.76  90.45  77.04  82.08
3  96.90  92.61  86.32  89.16    95.23  91.32  79.35  84.06
4  97.38  93.17  89.26  91.10    96.06  92.23  83.92  87.50
5  …

Table 3: Experimental result comparison of four different methods on the satirical news detection task in Accuracy, Precision, Recall, and F1 Score. The compared results are originally listed in the works by Yang et al. [16] and De Sarkar et al. [3].

                       Validation                     Test
Method                 Acc    Pre    Rec    F1       Acc    Pre    Rec    F1
Rubin et al. [11]      97.73  90.21  81.92  85.86    97.79  93.47  82.95  87.90
Yang et al. [16]       …
De Sarkar et al. [3]   98.18  94.15  86.55  90.19    98.31  93.45  86.01  89.57
This work SVM-Linear   97.93  94.46  91.79  93.07    96.67  93.41  86.62  89.65
This work SVM-Poly     97.97  …

samples even in the 3D t-SNE illustration. Table 2 provides further evidence: as features are added, the classification performance improves accordingly. It also reflects feature importance under different selections of features. Here we report the experimental results on both the validation dataset and the test dataset, in order to present the upper bound as well as the generalization performance of this method. The results show coherent behavior on both datasets as the number of features/feature-pairs increases. The features generated using this method can reflect statistical differences between the two kinds of news data. Figure 5 is an illustration from a macro perspective using the arithmetic mean of each feature. A visible difference between the two series of data can be seen in the histogram plot.

A further consideration concerns the level of importance of each investigated feature. Here we look into a univariate feature selection metric, mutual information (MI) [6], which is described in Eq. 2.28 of [2]:

I(X; Y) = Σ_{y ∈ Y} Σ_{x ∈ X} p(x, y) log( p(x, y) / (p(x) p(y)) )    (7)

where the feature X and the target Y are discrete random variables. The MI score depicts how much of the uncertainty about the classification is eliminated given feature X. We calculate the MI scores between each of the features and the classification target; thus, the higher the MI score, the more the corresponding feature contributes to the classification.

The MI scores are illustrated in pairs in Figure 6 for the LM training data, validation data, and test data. Interpreting the scores, we find that features such as N and all of the features from the true LM are of significant importance on the validation data, and R_s plays a great role in determining the news category on all datasets, while features such as X̄_s and X̃_s on the validation data are clearly of less utility for classification. Although, as shown in Figure 5, there is a less significant difference in the sample variance feature pair, our model has the potential to distinguish and utilize these features. Therefore, the contribution of each feature varies across classification on different datasets. Furthermore, there is a visible complementarity within each feature pair: if one feature from the true news LM has a low MI score, its paired counterpart from the satire LM rises.
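Eq. (7) can be computed directly from empirical counts when the feature is discretized; a minimal sketch (pure Python, with invented toy data):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) per Eq. (7): sum over the empirical joint distribution of
    p(x, y) * log( p(x, y) / (p(x) p(y)) ), in nats."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Hypothetical discretized feature vs. class label.
label = [0, 0, 1, 1, 0, 1]
# A feature identical to the label carries maximal information ...
assert mutual_information(label, label) > 0
# ... while a constant feature eliminates no uncertainty at all.
assert mutual_information([1, 1, 1, 1, 1, 1], label) == 0
```
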
Fig. 6: An illustration of the mutual information feature analysis on the sample-size feature, i.e., the sentence (or paragraph) count N (yellow), and the paired features from the true news LM (red) and the satirical news LM (blue). For each feature, the Train/Validation/Test data appear from left to right within each group of the histogram. The Y-axis is log-scaled.

As shown in Table 3, our method outperforms the other three methods of Rubin et al. [11], Yang et al. [16], and De Sarkar et al. [3] in precision, recall, and F1 score on the validation dataset and achieves competitive results on the test dataset. Upon further investigation of the corpora we use, we found that one of the satirical news sources used by Yang et al. [16] and De Sarkar et al. [3], Ossurworld (https://ossurworld.com/) from the test dataset, is questionable to categorize and use as 'satirical news'. This website describes itself in its masthead as a blog of 'Irreverence, Irony, Insouciance & Iconoclasm' rather than news. Although it does focus on irony and publishes some satirical content and fake news, a considerable number of its blog posts, such as ironic film reviews, are neither satire nor in the form of news. It is therefore likely that some of this irrelevance in the data has a negative impact on the performance of our method.

Moreover, our method is not influenced by the potential problems mentioned by McHardy et al. [8] in their Section 2, because the language model outputs, the surprise scores, are just numerical values, which do not contain any semantic information or learn fine-grained details the way the models of [16] and [3] possibly do. Also, the satire training and testing data are from diverse sources. For the satire part, the LM training and classifier training data are from
Onion and the Spoof. The validation and testing data are from the many satire sources listed in Section 4. By using data from different news sources, the objectivity of the data and of the method can be mutually guaranteed.

5 Conclusion

Inspired by the idiom 'Birds of a feather flock together', we proposed a new method for satirical news classification that leverages the divergence between language model output distributions. By leveraging the surprise scores from different language models, satirical news can be differentiated from true news articles effectively. This method is not only free from extracting numerous linguistic features as previous works did, but also does not require any sophisticated neural network structures or advanced embeddings. More importantly, the proposed method demonstrates the value of the selected statistical features of the language model output and shows the effectiveness of these features in depicting the characteristics of the corresponding document category.

Acknowledgments. This work is supported in part by the U.S. NSF grants 1838147 and 1838145. We also thank the anonymous reviewers for their helpful feedback.
References
1. Burfoot, C., Baldwin, T.: Automatic satire detection: Are you having a laugh? In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 161–164 (2009)
2. Cover, T.M., Thomas, J.A.: Elements of Information Theory (2006)
3. De Sarkar, S., Yang, F., Mukherjee, A.: Attending sentences to detect satirical fake news. In: Proceedings of COLING 2018, pp. 3371–3380 (2018)
4. Dragut, E.C., Yu, C.T., Sistla, A.P., Meng, W.: Construction of a sentimental word dictionary. In: CIKM, pp. 1761–1764 (2010)
5. He, L., Han, C., Mukherjee, A., Obradovic, Z., Dragut, E.C.: On the dynamics of user engagement in news comment media. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. (1) (2020)
6. Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Physical Review E (6), 066138 (2004)
7. McDonald, J.H.: Handbook of Biological Statistics, vol. 2 (2009)
8. McHardy, R., Adel, H., Klinger, R.: Adversarial training for satire detection: Controlling for confounding variables. In: Proceedings of NAACL-HLT 2019, pp. 660–665 (2019)
9. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Proceedings of Interspeech 2010 (2010)
10. Reyes, A., Rosso, P., Buscaldi, D.: From humor recognition to irony detection: The figurative language of social media. Data & Knowledge Engineering 74