Generating Fact Checking Summaries for Web Claims
Preprint, compiled October 20, 2020
Rahul Mishra, University of Stavanger, Norway ([email protected])
Dhruv Gupta, Max Planck Institute for Informatics, Germany ([email protected])
Markus Leippold, University of Zurich, Switzerland ([email protected])

Abstract
We present SUMO, a neural attention-based approach that learns to establish the correctness of textual claims based on evidence in the form of text documents (e.g., news articles or Web documents). SUMO further generates an extractive summary by presenting a diversified set of sentences from the documents that explain its decision on the correctness of the textual claim. Prior approaches to the problems of fact checking and evidence extraction have relied on a simple concatenation of claim and document word embeddings as input to the claim-driven attention weight computation, so as to extract salient words and sentences from the documents that help establish the correctness of the claim. However, this design of claim-driven attention does not properly capture the contextual information in the documents. We improve on the prior art by using claim- and title-guided hierarchical attention to model effective contextual cues. We show the efficacy of our approach on datasets concerning political, healthcare, and environmental issues.
1 Introduction
Most of the information consumed by the world is in the form of digital news, blogs, and social media posts available on the Web. However, much of this information is written in the absence of facts and evidence. Our ever-increasing reliance on information from the Web is becoming a severe problem, as we base personal decisions relating to politics, the environment, and health on unverified information available online. For example, consider the following unverified claim on the Web: "Smoking may protect against COVID-19."

A user attempting to verify the correctness of the above claim will often take the following steps: issue keyword queries for the claim to a search engine; go through the top reliable news articles; and finally make an informed decision based on the gathered information. Clearly, this approach is laborious, time-consuming, and error-prone. In this work, we present SUMO, a neural approach that assists the user in establishing the correctness of claims by automatically generating explainable summaries for fact checking. Example summaries generated by SUMO for a couple of Web claims are given in Figure 1.
Prior approaches to automatic fact checking rely on predicting the credibility of facts [20], stance detection [14, 31], and fact entailment in supporting documents [18]. The majority of these methods rely on linguistic features [20, 22, 23], social contexts, or user responses [13] and comments. However, these approaches do not help explain the decisions generated by the machine learning models. Recent works such as [2, 16, 21] overcome the explainability gap by extracting snippets from text documents that support or refute the claim. [16, 21] apply claim-based and latent aspect-based attention to model the context of text documents. [16] models latent aspects such as the speaker or author of the claim, the topic of the claim, and the domains of the Web documents retrieved for the claim. We observe in our experiments that the design of claim-guided attention in prior works [16, 21] is not effective, and that latent aspects such as the topic and speaker of a claim are not always available. The snippets extracted by such models are neither comprehensive nor topically diverse. To overcome these limitations, we propose a novel design of claim- and document-title-driven attention, which better captures the contextual cues in relation to the claim. In addition, we propose an approach for generating summaries for fact checking that are non-redundant and topically diverse.
Contributions. The contributions made in this work are as follows. First, we introduce SUMO, a method that improves upon the previously used claim-guided attention to model effective contextual representations. Second, we propose a novel attention-on-top-of-attention (Atop) method to improve the overall attention effectiveness. Third, we present an approach to generate topically diverse multi-document summaries, which help in explaining the decision SUMO makes for establishing the correctness of claims. Fourth, we provide a novel testbed for the task of fact checking in the domains of climate change and health care.
Outline. The outline for the rest of the article is as follows. In Section 2, we describe prior work in relation to our problem setting. In Section 3, we formalize the problem definition and describe our approach, SUMO, to generate explainable summaries for fact checking of textual claims. In Sections 4 and 5, we describe the experimental setup, which includes a description of the novel datasets that we make available to the research community, and an analysis of the results we have obtained. In Section 6, we present the concluding remarks of our study.
2 Related Work
We now describe prior work related to our problem setting. First, we describe works that rely only on features derived from documents that support the input textual claim. Second, we describe works that additionally include features derived from social media posts in connection to the claim. Third and finally, we describe works that rely on extracting textual snippets from text documents to explain a model's decision on the claim's correctness.

Figure 1: Example summaries generated by SUMO for unverified claims on the Web.

Claim: Smoking may protect against COVID-19 (Label: False, Verdict: False)
Summary: The current evidence suggests that the severity of COVID is higher among smokers, and urges prevention of the health risk linked to the "excessive consumption or misuse" of nicotine products by people hoping to protect themselves from COVID-19. Evidence from China, where COVID-19 originated, shows that people who have cardiovascular and respiratory conditions caused by tobacco use, or otherwise, are at higher risk of developing severe COVID-19 symptoms. WHO urges researchers, scientists and the media to be cautious about amplifying unproven claims that tobacco or nicotine could reduce the risk of COVID-19. Smoking is also associated with increased development of acute respiratory distress syndrome, a key complication for severe cases of COVID-19.

Claim: Deforestation has made humans more vulnerable to pandemics (Label: True, Verdict: True)
Summary: Deforestation can directly increase the likelihood that a pathogen will be transferred from wildlife species to humans through the creation of suitable habitats for vector species. Climate change, including the deforestation which drives it, is a key driver of cross-species transmission, which is where zoonotic emerging diseases come from. There is a correlation between deforestation and the rise in the spread of infectious diseases affecting humans. Deforestation forces various species into smaller, shared habitats and increases encounters between wildlife and humans. Habitat destruction and fragmentation due to deforestation can also increase the frequency of contact between humans, wildlife species, and the pathogens they carry. This can occur through direct transfer of pathogens from animals to humans or indirectly through cross-species transfer of pathogens from wildlife to domesticated species. Deforestation could be to blame for the rise of infectious diseases like the novel coronavirus.
Prior approaches to fact checking vary from simple machine learning methods, such as SVMs and decision trees, to highly sophisticated deep learning methods. These works largely utilize features that model the linguistic and stylistic content of the facts to learn a classifier [4, 12, 23, 25]. The key shortcomings of these approaches are as follows. First, classifiers trained on linguistic and stylistic features perform poorly, as they can be misguided by the writing style of false claims, which are deliberately made to look similar to true claims but are factually false. Second, these methods lack the user responses and social context pertaining to the claims, which are very helpful in establishing the correctness of facts.
Works such as [24, 26, 32] overcome the lack of user feedback by using a combination of content-based and context-based features derived from related social media posts. Specifically, the features derived from social media include propagation patterns of claim-related posts and user responses in the form of replies, likes, sentiments, and shares. These methods outperform content-based methods significantly. In [32], the authors propose a probabilistic graphical model for causal mappings among a post's credibility, users' opinions, and users' credibility. In [24], the authors introduce a user response generator based on a deep neural network that leverages users' past actions, such as comments, replies, and posts, to generate a synthetic response for new social media posts.
Explaining a machine learning model's decisions is becoming an important problem, because modern neural network based methods are increasingly being used as black boxes. There exist only a few machine learning models for fact checking that explain their decisions via summaries. Related works [16, 21] achieve significant improvements in establishing the credibility of textual claims by using external evidence from the Web. They additionally extract snippets from the evidence that explain their model's decision. However, we find that the claim-driven attention design used in these methods is inadequate and does not capture sufficient context of the documents in relation to the input claim. The snippets extracted by these methods are often redundant and lack the topical diversity offered by Web evidence. In contrast, our method enhances the claim-driven attention mechanism and generates a topically diverse, coherent multi-document summary for explaining the correctness of claims.
3 Approach

We now formally describe the task of fact checking and explain SUMO in detail. SUMO works in two stages. In the first stage, it predicts the correctness of the claim. In the second stage, it generates a topically diverse summary for the claim. As input, we are provided with a Web claim c ∈ C, where C is a collection of Web claims, and a pseudo-relevant set of documents D = {d_1, d_2, ..., d_m}, where m is the number of results retrieved for claim c. The documents d ∈ D are retrieved from the Web as potential evidence, using claim c as a query. Each retrieved document d is accompanied by its title t and text body bd, i.e., d = ⟨t, bd⟩. We define the representation of each document's body as a collection of k sentences, bd = {s_1, s_2, ..., s_k}, and each sentence as a collection of l words {w_1, w_2, ..., w_l} ∈ W, where W is the overall word vocabulary of the corpus. By k and l, we denote the maximum number of sentences in a document and the maximum number of words in a sentence, respectively. We use both word2vec and pre-trained GloVe embeddings to obtain the vector representations for each claim, title, and document body. The objective is to classify the claim as either true or false and to automatically generate a topically diverse summary pieced together from D for establishing the correctness of the claim.

We now describe SUMO's neural architecture (see Figure 2), which helps in predicting the correctness of the input claim from its pseudo-relevant set of documents. The model additionally learns weights for the words and sentences in a document's body that help ascertain the claim's correctness. First, we need to encode the pseudo-relevant documents that support a claim. To this end, as a sequence encoder, we use a Gated Recurrent Unit (GRU) to encode the document's body content. The claim and the document's title are not encoded using the sequence encoder; we explain how they are represented in the upcoming sections.
Claim-driven Hierarchical Attention. Claim-driven attention at the word level aims to attend to salient words that are significant and relevant to the content of the claim; at the sentence level, we similarly aim to attend to salient sentences. Recent works have used claim-guided attention to model the contextual representation of the documents retrieved from the Web. These approaches provide claim-guided attention by first concatenating the claim word embeddings with the document word embeddings and then applying a dense softmax layer to learn the attention weights as follows:

    r_i = c_i ‖ d_i,   a_i = tanh(W_a r_i + b_a),   α = softmax(a_i),   (1)

where c_i and d_i are the i-th claim and document embeddings, W_a and b_a are the weight matrix and bias, and α is the learned attention weight. However, during experiments, we observe that this form of claim-based attention yields an inferior overall document representation. Therefore, we do not concatenate the claim and document embeddings before computing the attention weights. Each claim c_i consists of at most l words {w_1, w_2, ..., w_l}. We represent each claim c_i as the summation of the embeddings of all the words contained in it: Cl_i = Σ_{j=1}^{l} f(w_j), where f(w_j) is the word embedding of the j-th word of claim c_i. The claim representation Cl_i and the hidden states h_j from the GRU are used to compute the word-level claim-driven attention weights as:

    u_{j,i} = tanh(W_{j,i} h_j + b_{j,i}),   α^C_{j,i} = softmax(u_{j,i}^⊤ Cl_i),   (2)

where W_{j,i} and b_{j,i} are the weight matrix and bias, α^C_{j,i} is the word-level claim-driven attention weight vector, and h_j = (h_{j,1}, h_{j,2}, ..., h_{j,l})^⊤ represents the tuple of all hidden states of the words contained in the j-th sentence. To compute the sentence-level claim-driven attention weights, we use the claim representation Cl_i and the hidden states h^S_j from the sentence-level GRU units, formed as concatenations of the forward and backward hidden states, h^S_j = →h^S_j ‖ ←h^S_j, as follows:

    u_j = tanh(W_j h^S + b_j),   α^C_j = softmax(u_j^⊤ Cl_i),   (3)

where W_j and b_j are the weight matrix and bias, h^S = (h^S_1, h^S_2, ..., h^S_k)^⊤ is the combination of all sentence hidden states, and α^C_j = (α_{j,1}, α_{j,2}, ..., α_{j,k})^⊤ is the sentence-level claim-driven attention weight vector for the j-th document.
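As an illustration, a word-level claim-driven attention step in the spirit of Eq. (2) could look as follows in TensorFlow. The tensor shapes and the Dense scoring layer are our assumptions, not the authors' released code.

```python
import tensorflow as tf

def claim_driven_word_attention(word_states, claim_word_embs, score_layer):
    """Word-level claim-driven attention, following Eq. (2).

    word_states:     [batch, l, dim]   GRU hidden states h_{j,1..l} of one sentence
    claim_word_embs: [batch, l_c, dim] embeddings f(w_j) of the claim's words
    score_layer:     a tf.keras.layers.Dense(dim, activation="tanh") layer,
                     computing tanh(W h + b)
    """
    claim_vec = tf.reduce_sum(claim_word_embs, axis=1)   # Cl_i: [batch, dim]
    u = score_layer(word_states)                         # [batch, l, dim]
    scores = tf.einsum("bld,bd->bl", u, claim_vec)       # u^T Cl_i per word
    return tf.nn.softmax(scores, axis=-1)                # alpha^C: [batch, l]
```

The sentence-level weights of Eq. (3) follow the same pattern, with sentence hidden states in place of word hidden states.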
Title-driven Hierarchical Attention. The objective of using the document title is to guide the attention towards the sections of the document that are most critical and relevant to the title. Articles convey multiple perspectives, which are often reflected in their titles. With title-driven attention, we attend to words and sentences that are not covered by the claim-driven attention. Title-driven attention at both the word and sentence levels is computed in the same fashion as claim-driven attention. Each title t_i comprises at most l words {w_1, w_2, ..., w_l}. We represent each title t_i as the summation of the embeddings of all the words contained in it: T_i = Σ_{j=1}^{l} f(w_j). The title-driven attention weights at the word and sentence levels are then computed as follows:

    u_{j,i} = tanh(W_{j,i} h_j + b_{j,i}),   α^T_{j,i} = softmax(u_{j,i}^⊤ T_i),
    u_j = tanh(W_j h^S + b_j),   α^T_j = softmax(u_j^⊤ T_i).   (4)

Hierarchical Self-Attention.
Self-attention is a simpler form of attention: it attends to salient words in a sequence of words, and to salient sentences in a collection of sentences, based only on the context of that sequence or collection itself. In addition to claim-driven and title-driven attention, we apply self-attention to capture the otherwise unattended words and sentences that are not directly related to the claim or title but are very useful for classification and summarization. The self-attention weights at the word and sentence levels are computed as follows:

    u_{j,i} = tanh(W_{j,i} h_j + b_{j,i}),   α^{Sl}_{j,i} = softmax(u_{j,i}^⊤),
    u_j = tanh(W_j h^S + b_j),   α^{Sl}_j = softmax(u_j^⊤),   (5)

where α^{Sl}_{j,i} and α^{Sl}_j are the self-attention weight vectors at the word and sentence levels, respectively.
Fusion of Attention Weights.
We combine the attention weights from the three kinds of attention mechanisms (claim-driven, title-driven, and self-attention) at both the word and sentence levels. At the word level, we set:

    α_j = (α^C_{j,i} + α^T_{j,i} + α^{Sl}_{j,i}) / 3,   S_j = α_j^⊤ h_j,   (7)

where α^C_{j,i}, α^T_{j,i}, and α^{Sl}_{j,i} are the attention weight vectors from claim-, title-, and self-attention at the word level, and S_j is the resulting representation of the j-th sentence after overall attention. At the sentence level, we set:

    α^S_j = (α^C_j + α^T_j + α^{Sl}_j) / 3,   doc = (α^S_j)^⊤ h^S,   (9)

where α^C_j, α^T_j, and α^{Sl}_j are the attention weight vectors from claim-, title-, and self-attention at the sentence level, and doc is the resulting document representation after overall attention.

Figure 2: SUMO's neural network architecture for establishing the correctness of Web claims. (The diagram shows Bi-GRU word and sentence encoders, claim-guided, title-guided, and self-guided attention blocks at both levels, the resulting sentence and document representations, the predicted label, and a sentence selector coupled with a topic model.)
Attention on top of Attention (Atop).
Although fusing the three kinds of attention weights by taking their average works well, we lose some context by averaging. To deal with this issue, we use a novel attention-on-top-of-attention (Atop) method. We concatenate all three kinds of attention weights into α_con and α^S_con at the word and sentence levels, respectively, apply a tanh-activated dense layer as a scoring function, and subsequently apply a softmax layer to compute an attention weight for each of the three kinds of attention.

At the word level:

    α_con = α^C_{j,i} ‖ α^T_{j,i} ‖ α^{Sl}_{j,i},
    u_wa = tanh(W_wa α_con + b_wa),   β_w = softmax(u_wa),
    S_j = β_w(1) α^C_{j,i} + β_w(2) α^T_{j,i} + β_w(3) α^{Sl}_{j,i}.

At the sentence level:

    α^S_con = α^C_j ‖ α^T_j ‖ α^{Sl}_j,
    u_sa = tanh(W_sa α^S_con + b_sa),   β_s = softmax(u_sa),
    doc = β_s(1) α^C_j + β_s(2) α^T_j + β_s(3) α^{Sl}_j,   (10)

where β_w and β_s are the learned attention weight vectors over the three kinds of attention at the word and sentence levels, and doc is the resulting document representation after Atop attention.
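A minimal sketch of the Atop fusion at one level, assuming TensorFlow and fixed-length attention vectors; the Dense(3) scoring head is our reading of W_wa as a map from the concatenated attention weights to one score per attention type.

```python
import tensorflow as tf

class AtopFusion(tf.keras.layers.Layer):
    """Attention-on-top-of-attention (Eq. 10): learn a softmax-weighted
    mix of the claim-, title-, and self-attention weight vectors."""

    def __init__(self):
        super().__init__()
        # tanh scoring layer producing one logit per attention type
        self.score = tf.keras.layers.Dense(3, activation="tanh")

    def call(self, alpha_claim, alpha_title, alpha_self, hidden_states):
        # alpha_*: [batch, n] attention weights; hidden_states: [batch, n, dim]
        concat = tf.concat([alpha_claim, alpha_title, alpha_self], axis=-1)
        beta = tf.nn.softmax(self.score(concat), axis=-1)       # [batch, 3]
        fused = (beta[:, 0:1] * alpha_claim +
                 beta[:, 1:2] * alpha_title +
                 beta[:, 2:3] * alpha_self)                     # [batch, n]
        # weighted sum of the hidden states gives the fused representation
        return tf.einsum("bn,bnd->bd", fused, hidden_states)
```

The same layer can be instantiated twice, once over word-level weights (producing S_j) and once over sentence-level weights (producing doc).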
Prediction and Optimization. We use the overall document representation doc in a softmax layer for classification. To train the model, we use the standard softmax cross-entropy with logits as the loss function, and we compute the predicted label ŷ as:

    ŷ = softmax(W_cl doc + b_cl).   (11)

Recent works retrieve documents from the Web as external evidence to support or refute claims, and thereafter extract snippets as explanations of the model's decision [16, 21]. However, the snippets extracted by these methods are often redundant and lack topical diversity. The objective of our summarization algorithm is to provide a ranked list of sentences that are novel, non-redundant, and diverse across the topics identified from the text of the documents. In this section, we outline the method we utilize for achieving this objective.
Multi-topic Sentence Model. Each sentence in the documents retrieved for the claim is modeled as a collection of topics: s = ⟨a^(1), a^(2), ..., a^(k)⟩. Let A be the set of topics a_i across all candidate sentences from the pseudo-relevant set of documents D for the claim.

Objective. We formulate the summarization task as a diversification objective. Given the set of relevant sentences R attended by the Atop attention in SUMO while establishing the claim's correctness, we have to find the smallest subset of sentences S ⊆ R such that all topics a_i ∈ A are covered. This is a variation of the Set Cover problem [1, 10, 29, 30, 8, 11, 5]. However, unlike IA-Select [1], we do not utilize the Max Coverage variation of the Set Cover problem; instead, we formulate it as Set Cover itself [10, 29]. That is, given a set of topics A, find a minimal set of sentences S ⊆ R that covers those topics [29]. Additionally, the inclusion of each sentence in the subset S has an associated cost, given by:

    cost(s) = 1 − Score,   Score = λ θ_s + (1 − λ)(W̄_wa + W_sa),   (12)

where θ_s is the topic distribution score for sentence s computed using a topic model (e.g., Latent Dirichlet Allocation [3]), W̄_wa = (1/l) Σ_{i=1}^{l} W_wa(i) is the average of the attention weights of the words contained in sentence s, W_sa is the attention weight of sentence s, and λ is a parameter to be tuned. We briefly describe our adaptation of the greedy algorithm, which provides an approximate solution to the Set Cover problem, based on the discussion in [10, 29, 30, 8, 11, 5].
Algorithm 1: Adaptation of the approximate greedy algorithm for the Set Cover problem [10, 29, 30, 8, 11, 5] to our topical diversification setting. At each iteration, a sentence is chosen that covers the largest number of topics, as reflected by the topic distribution score, and that has the highest attention weights. As output, we obtain a non-redundant, novel, and diversified set of sentences.

Input: A: set of topics learned from the topic model for diversification; R: set of sentences attended by Atop.
Output: S ⊆ R: diversified set of sentences over A.
S ← ∅                      // S contains the diversified sentences
A′ ← ∅                     // A′ contains the topics covered by S
while A′ ≠ A do
    /* identify the sentence that covers the most topics
       and is highly relevant for fact checking */
    s* ← argmin_{s ∈ R \ S} cost(s) / |A − A′|
    A′ ← A′ ∪ {a_{s*}}     // a_{s*} is the dominant topic of sentence s*
    S ← S ∪ {s*}
end
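Below is a minimal Python sketch of this greedy loop. It assumes, as in Algorithm 1, that each sentence contributes its dominant topic; we additionally restrict the candidates to sentences whose dominant topic is still uncovered so that every iteration makes progress, which is our own simplification. All names are illustrative.

```python
def diversify(sentences, topics, dominant_topic, cost):
    """Greedy topical diversification (a sketch of Algorithm 1).

    sentences:      candidate sentences R attended by Atop
    topics:         set A of topics from the topic model
    dominant_topic: dict sentence -> its dominant topic a_s
    cost:           dict sentence -> cost(s) from Eq. (12)
    """
    selected = []       # S: the diversified summary
    covered = set()     # A': topics covered so far
    remaining = set(sentences)
    while covered != set(topics):
        # only sentences whose dominant topic is still uncovered
        candidates = [s for s in remaining if dominant_topic[s] not in covered]
        if not candidates:          # no sentence can cover a new topic
            break
        # lowest cost = highest Score = most relevant for fact checking
        best = min(candidates, key=lambda s: cost[s])
        covered.add(dominant_topic[best])
        selected.append(best)
        remaining.remove(best)
    return selected
```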
4 Evaluation

Datasets. We use two publicly available datasets, the PolitiFact political claims dataset and the Snopes political claims dataset [21], for evaluating SUMO's capability for fact checking. Dataset statistics for both datasets are shown in Table 1. In the case of PolitiFact, claims have one of the following labels: 'true', 'mostly true', 'half true', 'mostly false', 'false', and 'pants-on-fire'. We convert the 'true', 'mostly true', and 'half true' labels to 'true' and the rest to 'false'. For the Snopes dataset, each claim is labeled either 'true' or 'false'. We evaluate SUMO on the task of summarization on the PolitiFact, Snopes, Climate, and Health datasets. The two new datasets, Climate and Health, are about climate change and health care, respectively. We test SUMO only on the PolitiFact and Snopes datasets for the task of fact checking, as they are magnitudes larger than the new datasets that we release. The climate change dataset contains claims broadly related to climate change and global warming from climatefeedback.org. We use each claim as a query with the Google API to search the Web and retrieve external evidence in the form of search results. Similarly, we create a dataset related to health care that additionally contains claims pertaining to the current global COVID-19 pandemic from healthfeedback.org. Examples of claims from these two datasets are shown in Figure 3. We make the new datasets publicly available to the research community at the following URL: https://github.com/rahulOmishra/SUMO/.

Table 1: Dataset statistics.

Public datasets
Statistics     PolitiFact   Snopes
Claims         –            –
Documents      –            –
Domains        –            –

New datasets
Statistics     Climate      Health
Claims         104          100
Documents      –            –
Domains        97           83

SUMO Implementation.
We use TensorFlow to implement SUMO. We use per-class accuracy and macro F1 scores as performance metrics for evaluation. We use a bi-directional Gated Recurrent Unit (GRU) with a hidden size of 200, word2vec [15] and GloVe [19] embeddings with an embedding size of 200, and softmax cross-entropy with logits as the loss function. We set the learning rate to 0.001, the batch size to 64, and gradient clipping to 5. All parameters are tuned using a grid search. We train each model for 50 epochs and apply early stopping if the validation loss does not change for more than 5 epochs. We keep the maximum sentence length at 45 words and the maximum number of sentences in a document at 35. For the task of summarization, we use Latent Dirichlet Allocation (LDA) [3] as the topic model to compute the topic distribution score and the dominant topic for each candidate sentence.
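For illustration, θ_s and the dominant topic a_s of each candidate sentence could be computed along the following lines with the gensim LDA implementation; gensim and the topic count of 10 are our assumptions, as the paper only specifies LDA [3].

```python
from gensim import corpora
from gensim.models import LdaModel

# candidate_sentences: the sentences attended by Atop for one claim
texts = [sentence.lower().split() for sentence in candidate_sentences]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

dominant_topic, theta = {}, {}
for sentence, bow in zip(candidate_sentences, corpus):
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    topic, score = max(dist, key=lambda pair: pair[1])
    dominant_topic[sentence] = topic   # a_s used in Algorithm 1
    theta[sentence] = score            # topic distribution score θ_s in Eq. (12)
```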
5 Results

We experiment with five variants of our proposed SUMO model and compare them with six state-of-the-art methods. The six state-of-the-art methods are as follows. First, we have a basic Long Short-Term Memory (LSTM) [7] unit, which is used with the claim and document contents for classification. Second, we have a convolutional neural network (CNN) [9] for document classification. Third, we compare against the model proposed in [27], which uses a hierarchical representation of the documents with hierarchical LSTM units (Hi-LSTM). Fourth, we compare against the model proposed in [33], which uses hierarchical neural attention on top of hierarchical LSTMs (HAN) to learn better representations of documents for classification. Fifth, we compare against the model proposed in [21], which uses a claim-guided attention method (DeClarE) to predict the correctness of claims in the presence of external evidence. Sixth and finally, we compare against the recent work [16], which improves on the DeClarE method by using latent aspect (speaker, topic, or domain) based attention.
Figure 3: Example claims from the climate change and health care datasets.

‣ Global warming slowing down? 'Ironic' study finds more CO2 has slightly cooled the planet.
‣ The ozone layer is healing.
‣ Deforestation has made humans more vulnerable to pandemics.
‣ Historical data of temperature in the U.S. destroys global warming myth.
‣ New evidence shows wearing face mask can help coronavirus enter the brain and pose more health risk, warn expert.
‣ Boil weed and ginger for Covid-19 victims, the virus will vanish.
‣ Smoking may protect against COVID-19.
‣ Wearing face masks can cause carbon dioxide toxicity; can weaken immune system.
The five proposed variants of our method SUMO are as follows. First, we have the SUMO-AW2V variant, which corresponds to the basic SUMO model with word2vec embeddings. Second, we have the SUMO-AtopW2V variant, which consists of the SUMO model with word2vec embeddings; furthermore, in SUMO-AtopW2V we use the Atop method of attention fusion rather than a simple average. Third, we have the SUMO-AGlove variant, which is the basic SUMO model with GloVe embeddings. Fourth, we have the SUMO-AtopGlove variant, which consists of the SUMO model with GloVe embeddings; moreover, in SUMO-AtopGlove we use the Atop method of attention fusion rather than a simple average. Fifth and finally, we have the SUMO-AtopGlove+source-Emb variant, which is similar to SUMO-AtopGlove but adds source embeddings (domains of the retrieved documents).
The results for establishing claim correctness are shown in Table 2. We observe that the basic LSTM-based model achieves 57.89% and 69.89% in terms of macro F1 in predicting claim correctness for PolitiFact and Snopes, respectively. The CNN model performs slightly better than the LSTM, as it captures local contextual features better. The hierarchical attention network outperforms the CNN with macro F1 scores of 63.44% and 73.84%, DeClarE achieves macro F1 scores of 67.10% and 70.47%, and SADHAN achieves macro F1 scores of 75.69% and 80.09%. Even the basic SUMO model with word2vec embeddings performs better than DeClarE with source embeddings. This observation is a clear indication of the superiority of our claim- and title-driven attention design. SUMO with Atop attention fusion is more effective than a simple average fusion of the attention weights, which becomes apparent from the gain in macro F1 on both datasets. SUMO with pre-trained GloVe embeddings outperforms the word2vec versions of SUMO, as the GloVe embeddings are trained on a large corpus and therefore capture better context for the words. SUMO-AtopGlove+source-Emb outperforms all the other models, and the improvement is statistically significant for both PolitiFact and Snopes; the significance values were computed using a two-sample Student's t-test. We notice that SUMO could not outperform SADHAN without source embeddings, as SADHAN uses a very complex structure with three parallel models and hierarchical latent-aspect-guided attention. However, SADHAN has several drawbacks. First, it is challenging to train and requires more hardware resources and time. Second, the latent aspects are not available for all Web claims; therefore, it is not generalizable. Third, it fails to accommodate new values of the latent variables at test time.
For the evaluation of SUMO's summarization capability, we create gold reference summaries for the claims. For creating the gold reference summaries, we include all facts related to the claim that are important for the claim correctness prediction, non-redundant, and topically diverse. We find that the descriptions provided for a claim on fact-checking websites such as snopes.com and politifact.com are suitable for this purpose. We use a cosine similarity threshold of 0.4 between the claims and the sentences of the description to filter out irrelevant or noisy sentences. As evaluation metrics, we use ROUGE-1, ROUGE-2, and ROUGE-L scores. The ROUGE-1 score represents the overlap of unigrams, and the ROUGE-2 score the overlap of bigrams, between the summaries generated by the SUMO system and the gold reference summaries. The ROUGE-L score measures the longest matching sequence of words using the Longest Common Subsequence algorithm.
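For illustration, the three ROUGE variants can be computed with the rouge-score package as sketched below; the package choice is our assumption, since the paper does not name an implementation.

```python
from rouge_score import rouge_scorer

# reference: gold reference summary; candidate: summary generated by the system
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # unigram overlap, bigram overlap, and longest common subsequence
    print(name, round(result.fmeasure, 4))
```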
Standard summarization techniques are not useful in this scenario, as the objective of summarization with standard techniques is usually not fact checking. Hence, we compare the SUMO results with an information retrieval method (BM25) and a natural language processing based method (QuerySum). BM25 is a ranking function that uses a probabilistic retrieval framework and ranks documents based on their relevance to a given search query. We use the Web claims as queries and apply BM25 to retrieve the most relevant sentences from all the documents retrieved for the claim. We also compare the results with the query-driven attention-based abstractive summarization method QuerySum [17], which also uses a diversity objective to create a diverse summary. We use the ROUGE metrics with the gold reference summaries to evaluate the generated summaries.
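A minimal sketch of such a BM25 baseline, assuming the rank_bm25 package and simple whitespace tokenization; the variable names are illustrative.

```python
from rank_bm25 import BM25Okapi

# sentences: all sentences from the documents retrieved for the claim
tokenized = [s.lower().split() for s in sentences]
bm25 = BM25Okapi(tokenized)

query = claim.lower().split()                           # the Web claim as query
top_sentences = bm25.get_top_n(query, sentences, n=5)   # BM25 baseline summary
```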
Table 2: Comparison of the proposed models with the state-of-the-art baseline models on the two publicly available datasets.

PolitiFact
Model                        True Accuracy   False Accuracy   Macro F1
LSTM                         53.51           56.32            57.89
CNN                          55.92           57.33            59.39
HAN                          60.13           65.78            63.44
DeClarE (full)               68.18           66.01            67.10
SADHAN-agg                   68.37           78.23            75.69
SUMO-AW2V                    67.30           69.22            70.74
SUMO-AtopW2V                 67.81           70.09            71.15
SUMO-AGlove                  68.03           72.57            72.39
SUMO-AtopGlove               68.93           73.43            72.79
SUMO-AtopGlove+source-Emb    –               –                –

Snopes
Model                        True Accuracy   False Accuracy   Macro F1
LSTM                         69.23           70.67            69.89
CNN                          72.05           74.29            72.63
HAN                          72.89           76.25            73.84
DeClarE (full)               60.16           80.78            70.47
SADHAN-agg                   79.47           84.26            80.09
SUMO-AW2V                    77.32           80.67            75.56
SUMO-AtopW2V                 78.02           81.66            76.86
SUMO-AGlove                  78.74           82.03            77.22
SUMO-AtopGlove               78.89           82.46            78.45
SUMO-AtopGlove+source-Emb    –               –                –

Table 3: Results for the task of summarization.

Model       ROUGE-1   ROUGE-2   ROUGE-L
BM25        26.08     14.78     29.98
QuerySum    29.78     16.49     30.16
SUMO        –         –         35.92
Results for the task of summarization are shown in Table 3. The QuerySum method performs significantly better than BM25, with a ROUGE-L score of 30.16, as it uses query-driven attention and a diversity objective, which results in a diverse and query-oriented summary. The proposed model SUMO outperforms QuerySum with a ROUGE-L score of 35.92. We attribute this gain to the use of word- and sentence-level weights, which are trained using back-propagation with the correctness label. We also notice that in QuerySum some sentences are related to the claim but are not useful for fact checking, and are therefore absent from the gold reference summary. The results for SUMO are statistically significant, computed using a pairwise Student's t-test.

6 Conclusion
We presented SUMO, a neural network based approach to generate explainable and topically diverse summaries for verifying Web claims. SUMO uses an improved version of hierarchical claim-driven attention, along with title-driven and self-attention, to learn an effective representation of the external evidence retrieved from the Web. Learning this effective representation in turn assists us in establishing the correctness of textual claims. Using the overall attention weights from the novel Atop attention method and the topical distributions of the sentences, we generate extractive summaries for the claims. In addition, we release two important datasets pertaining to climate change and healthcare claims. In future work, we plan to investigate BERT [6] and other Transformer [28] based embedding methods in place of GloVe [19] embeddings for a better contextual representation of words.

References

[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In
Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM '09, pages 5–14, New York, NY, USA. Association for Computing Machinery.

[2] Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2019. Generating Fact Checking Explanations.

[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

[4] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In WWW.

[5] Václav Chvátal. 1979. A greedy heuristic for the set-covering problem. Math. Oper. Res., 4(3):233–235.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780. MIT Press, Cambridge, MA, USA.

[8] David S. Johnson. 1974. Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci., 9(3):256–278.

[9] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

[10] Bernhard Korte and Jens Vygen. 2002. Approximation algorithms. In Combinatorial Optimization: Theory and Algorithms, pages 361–396, Berlin, Heidelberg. Springer Berlin Heidelberg.

[11] László Lovász. 1975. On the ratio of optimal integral and fractional covers. Discret. Math., 13(4):383–390.

[12] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, and Meeyoung Cha. 2016. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI '16, pages 3818–3824. AAAI Press.

[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, pages 3111–3119, Red Hook, NY, USA. Curran Associates Inc.

[16] Rahul Mishra and Vinay Setty. 2019. SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection. ICTIR '19, pages 197–204.

[17] Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1063–1072, Vancouver, Canada. Association for Computational Linguistics.

[18] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference.

[19] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

[20] Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In WWW, pages 1003–1012.

[21] Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In EMNLP, pages 22–32.

[22] Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2018. A stylometric inquiry into hyperpartisan and fake news. In ACL, volume 1, pages 231–240.

[23] Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. EMNLP '11.

[24] Feng Qian, Chengyue Gong, Karishma Sharma, and Yan Liu. 2018. Neural user response generator: Fake news detection with collective user intelligence. IJCAI '18, pages 3834–3840.

[25] Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937, Copenhagen, Denmark. Association for Computational Linguistics.

[26] Kai Shu, Suhang Wang, and Huan Liu. 2019. Beyond news contents: The role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 312–320, New York, NY, USA. Association for Computing Machinery.

[27] Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal. Association for Computational Linguistics.

[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. abs/1706.03762.

[29] Vijay V. Vazirani. 2001. Approximation Algorithms. Springer-Verlag, New York, NY, USA.

[30] David P. Williamson and David B. Shmoys. 2011. The Design of Approximation Algorithms. Cambridge University Press, New York, NY, USA.

[31] Brian Xu, Mitra Mohtarami, and James Glass. 2018. Adversarial domain adaptation for stance detection. (NIPS):1–6.

[32] Shuo Yang, Kai Shu, Suhang Wang, Renjie Gu, Fan Wu, and Huan Liu. 2019. Unsupervised fake news detection on social media: A generative approach.

[33] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016, San Diego, California. Association for Computational Linguistics.