Generating Fact Checking Summaries for Web Claims
Preprint, compiled October 20, 2020
Rahul Mishra, University of Stavanger, Norway ([email protected])
Dhruv Gupta, Max Planck Institute for Informatics, Germany ([email protected])
Markus Leippold, University of Zurich, Switzerland ([email protected])

Abstract
We present SUMO, a neural attention-based approach that learns to establish the correctness of textual claims based on evidence in the form of text documents (e.g., news articles or Web documents). SUMO further generates an extractive summary by presenting a diversified set of sentences from the documents that explain its decision on the correctness of the textual claim. Prior approaches to the problems of fact checking and evidence extraction have relied on a simple concatenation of claim and document word embeddings as input to the claim-driven attention weight computation, so as to extract salient words and sentences from the documents that help establish the correctness of the claim. However, this design of claim-driven attention does not properly capture the contextual information in the documents. We improve on the prior art by using claim- and title-guided hierarchical attention to model effective contextual cues. We show the efficacy of our approach on datasets concerning political, healthcare, and environmental issues.
1 Introduction
Most of the information consumed by the world is in the form of digital news, blogs, and social media posts available on the Web. However, much of this information is written in the absence of facts and evidence. Our ever-increasing reliance on information from the Web is becoming a severe problem, as we base personal decisions relating to politics, the environment, and health on unverified information available online. For example, consider the following unverified claim on the Web: "Smoking may protect against COVID-19."

A user attempting to verify the correctness of the above claim will often take the following steps: issue keyword queries for the claim to a search engine; go through the top reliable news articles; and finally make an informed decision based on the gathered information. Clearly, this approach is laborious, time-consuming, and error-prone. In this work, we present SUMO, a neural approach that assists the user in establishing the correctness of claims by automatically generating explainable summaries for fact checking. Example summaries generated by SUMO for a couple of Web claims are given in Figure 1.
Prior approaches to automatic fact checking rely on predicting the credibility of facts [20], stance detection [14, 31], and fact entailment in supporting documents [18]. The majority of these methods rely on linguistic features [20, 22, 23], social contexts, or user responses [13] and comments. However, these approaches do not help explain the decisions generated by the machine learning models. Recent works such as [2, 16, 21] overcome the explainability gap by extracting snippets from text documents that support or refute the claim. [16, 21] apply claim-based and latent aspect-based attention to model the context of text documents. [16] models latent aspects such as the speaker or author of the claim, the topic of the claim, and the domains of the Web documents retrieved for the claim. We observe in our experiments that the design of claim-guided attention in prior works [16, 21] is not effective, and that latent aspects such as the topic and speaker of a claim are not always available. The snippets extracted by such models are neither comprehensive nor topically diverse. To overcome these limitations, we propose a novel design of claim- and document-title-driven attention, which better captures the contextual cues in relation to the claim. In addition, we propose an approach for generating summaries for fact checking that are non-redundant and topically diverse.
Contributions. The contributions made in this work are as follows. First, we introduce SUMO, a method that improves upon the previously used claim-guided attention to model effective contextual representations. Second, we propose a novel attention-on-top-of-attention (Atop) method to improve the overall attention effectiveness. Third, we present an approach to generate topically diverse multi-document summaries, which help in explaining the decision SUMO makes for establishing the correctness of claims. Fourth, we provide a novel testbed for the task of fact checking in the domains of climate change and health care.
Outline. The outline for the rest of the article is as follows. In Section 2, we describe prior work in relation to our problem setting. In Section 3, we formalize the problem definition and describe our approach, SUMO, to generate explainable summaries for fact checking of textual claims. In Sections 4 and 5, we describe the experimental setup, which includes a description of the novel datasets that we make available to the research community, and an analysis of the results we have obtained. In Section 6, we present the concluding remarks of our study.
2 Related Work
We now describe prior work related to our problem setting. First, we describe works that rely only on features derived from documents that support the input textual claim. Second, we describe works that additionally include features derived from social media posts in connection to the claim. Third and finally, we describe works that rely on extracting textual snippets from text documents to explain a model's decision on the claim's correctness.

Figure 1: Example summaries generated by SUMO for unverified claims on the Web.

Claim: Smoking may protect against COVID-19 (Label: False, Verdict: False)
Summary: The current evidence suggests that the severity of COVID is higher among smokers, and urges prevention of the health risk linked to the "excessive consumption or misuse" of nicotine products by people hoping to protect themselves from COVID-19. Evidence from China, where COVID-19 originated, shows that people who have cardiovascular and respiratory conditions caused by tobacco use, or otherwise, are at higher risk of developing severe COVID-19 symptoms. WHO urges researchers, scientists and the media to be cautious about amplifying unproven claims that tobacco or nicotine could reduce the risk of COVID-19. Smoking is also associated with increased development of acute respiratory distress syndrome, a key complication for severe cases of COVID-19.

Claim: Deforestation has made humans more vulnerable to pandemics (Label: True, Verdict: True)
Summary: Deforestation can directly increase the likelihood that a pathogen will be transferred from wildlife species to humans through the creation of suitable habitats for vector species. Climate change, including the deforestation which drives it, is a key driver of cross-species transmission, which is where zoonotic emerging diseases come from. There is a correlation between deforestation and the rise in the spread of infectious diseases affecting humans. Deforestation forces various species into smaller, shared habitats and increases encounters between wildlife and humans. Habitat destruction and fragmentation due to deforestation can also increase the frequency of contact between humans, wildlife species, and the pathogens they carry. This can occur through direct transfer of pathogens from animals to humans or indirectly through cross-species transfer of pathogens from wildlife to domesticated species. Deforestation could be to blame for the rise of infectious diseases like the novel coronavirus.
Prior approaches to fact checking vary from simple machine learning methods, such as SVMs and decision trees, to highly sophisticated deep learning methods. These works largely utilize features that model the linguistic and stylistic content of the facts to learn a classifier [4, 12, 23, 25]. The key shortcomings of these approaches are as follows. First, classifiers trained on linguistic and stylistic features perform poorly, as they can be misguided by the writing style of false claims, which are deliberately made to look similar to true claims but are factually false. Second, these methods lack the user responses and social context pertaining to the claims, which are very helpful in establishing the correctness of facts.
Works such as [24, 26, 32] overcome the lack of user feedback by using a combination of content-based and context-based features derived from related social media posts. Specifically, the features derived from social media include propagation patterns of claim-related posts and user responses in the form of replies, likes, sentiments, and shares. These methods outperform content-based methods significantly. In [32], the authors propose a probabilistic graphical model for causal mappings among a post's credibility, users' opinions, and users' credibility. In [24], the authors introduce a user response generator based on a deep neural network that leverages users' past actions, such as comments, replies, and posts, to generate a synthetic response for new social media posts.
Explaining a machine learning model's decisions is becoming an important problem, because modern neural network based methods are increasingly being used as black boxes. There exist only a few machine learning models for fact checking that explain their decisions via summaries. Related works [16, 21] achieve significant improvements in establishing the credibility of textual claims by using external evidence from the Web. They additionally extract snippets from the evidence that explain their model's decision. However, we find that the claim-driven attention design used in these methods is inadequate and does not capture sufficient context of the documents in relation to the input claim. The snippets extracted by these methods are often redundant and lack the topical diversity offered by Web evidence. In contrast, our method enhances the claim-driven attention mechanism and generates a topically diverse, coherent multi-document summary for explaining the correctness of claims.
3 Approach

We now formally describe the task of fact checking and explain SUMO in detail. SUMO works in two stages. In the first stage, it predicts the correctness of the claim. In the second stage, it generates a topically diverse summary for the claim. As input, we are provided with a Web claim c ∈ C, where C is a collection of Web claims, and a pseudo-relevant set of documents D = {d_1, d_2, ..., d_m}, where m is the number of results retrieved for claim c. The documents d ∈ D are retrieved from the Web as potential evidence, using claim c as a query. Each retrieved document d is accompanied by its title t and text body bd, i.e., d = ⟨t, bd⟩. We define the representation of each document's body as a collection of k sentences, bd = {s_1, s_2, ..., s_k}, and each sentence as a collection of l words {w_1, w_2, ..., w_l} ∈ W, where W is the overall word vocabulary of the corpus. By k and l, we denote the maximum number of sentences in a document and the maximum number of words in a sentence, respectively. We use both word2vec and pre-trained GloVe embeddings to obtain the vector representations for each claim, title, and document body. The objective is to classify the claim as either true or false and to automatically generate a topically diverse summary pieced together from D for establishing the correctness of the claim.

We now describe SUMO's neural architecture (see Figure 2), which helps in predicting the correctness of the input claim from its pseudo-relevant set of documents. The model additionally learns weights for the words and sentences in a document's body that help ascertain the claim's correctness. First, we need to encode the pseudo-relevant documents that support a claim. To this end, as a sequence encoder, we use a Gated Recurrent Unit (GRU) to encode the document's body content. The claim and the document's title are not encoded using the sequence encoder; we explain how they are represented in the upcoming sections.
Claim-driven Hierarchical Attention. Claim-driven attention at the word level aims to attend to salient words that are significant and relevant to the content of the claim; at the sentence level, we similarly aim to attend to salient sentences. Recent works have used claim-guided attention to model the contextual representation of the documents retrieved from the Web. These approaches provide claim-guided attention by first concatenating the claim word embeddings with the document word embeddings and then applying a dense softmax layer to learn the attention weights as follows:

    r_i = c_i ‖ d_i,   a_i = tanh(W_a r_i + b_a),   α = softmax(a_i),   (1)

where c_i and d_i are the i-th claim and document embeddings, W_a and b_a are the weight matrix and bias, and α is the learned attention weight. However, during experiments, we observe that this form of claim-based attention yields an inferior overall document representation. Therefore, we do not concatenate the claim and document embeddings before computing the attention weights. Each claim c_i consists of at most l words {w_1, w_2, ..., w_l}. We represent each claim c_i as the summation of the embeddings of all the words contained in it: Cl_i = Σ_{j=1}^{l} f(w_j), where f(w_j) is the word embedding of the j-th word of claim c_i. The claim representation Cl_i and the hidden states h_j from the GRU are used to compute the word-level claim-driven attention weights as:

    u_{j,i} = tanh(W_{j,i} h_j + b_{j,i}),   α^C_{j,i} = softmax(u_{j,i}^⊤ Cl_i),   (2)

where W_{j,i} and b_{j,i} are the weight matrix and bias, α^C_{j,i} is the word-level claim-driven attention weight vector, and h_j = (h_{j,1}, h_{j,2}, ..., h_{j,l})^⊤ represents the tuple of all hidden states of the words contained in the j-th sentence. To compute the sentence-level claim-driven attention weights, we use the claim representation Cl_i and the hidden states h^S_j from the sentence-level GRU units, formed as concatenations of the forward and backward hidden states, h^S_j = →h^S_j ‖ ←h^S_j, as follows:

    u_j = tanh(W_j h^S + b_j),   α^C_j = softmax(u_j^⊤ Cl_i),   (3)

where W_j and b_j are the weight matrix and bias, h^S = (h^S_1, h^S_2, ..., h^S_k)^⊤ is the combination of all sentence hidden states, and α^C_j = (α_{j,1}, α_{j,2}, ..., α_{j,k})^⊤ is the sentence-level claim-driven attention weight vector for the j-th document.
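As an illustration, a word-level claim-driven attention step in the spirit of Eq. (2) could look as follows in TensorFlow. The tensor shapes and the Dense scoring layer are our assumptions, not the authors' released code.

```python
import tensorflow as tf

def claim_driven_word_attention(word_states, claim_word_embs, score_layer):
    """Word-level claim-driven attention, following Eq. (2).

    word_states:     [batch, l, dim]   GRU hidden states h_{j,1..l} of one sentence
    claim_word_embs: [batch, l_c, dim] embeddings f(w_j) of the claim's words
    score_layer:     a tf.keras.layers.Dense(dim, activation="tanh") layer,
                     computing tanh(W h + b)
    """
    claim_vec = tf.reduce_sum(claim_word_embs, axis=1)   # Cl_i: [batch, dim]
    u = score_layer(word_states)                         # [batch, l, dim]
    scores = tf.einsum("bld,bd->bl", u, claim_vec)       # u^T Cl_i per word
    return tf.nn.softmax(scores, axis=-1)                # alpha^C: [batch, l]
```

The sentence-level weights of Eq. (3) follow the same pattern, with sentence hidden states in place of word hidden states.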
Title-driven Hierarchical Attention. The objective of using the document title is to guide the attention towards the sections of the document that are most critical and relevant to the title. Articles convey multiple perspectives, which are often reflected in their titles. With title-driven attention, we attend to words and sentences that are not covered by the claim-driven attention. Title-driven attention at both the word and sentence levels is computed in the same fashion as claim-driven attention. Each title t_i comprises at most l words {w_1, w_2, ..., w_l}. We represent each title t_i as the summation of the embeddings of all the words contained in it: T_i = Σ_{j=1}^{l} f(w_j). The title-driven attention weights at the word and sentence levels are then computed as follows:

    u_{j,i} = tanh(W_{j,i} h_j + b_{j,i}),   α^T_{j,i} = softmax(u_{j,i}^⊤ T_i),
    u_j = tanh(W_j h^S + b_j),   α^T_j = softmax(u_j^⊤ T_i).   (4)

Hierarchical Self-Attention.
Self-attention is a simpler form of attention: it attends to salient words in a sequence of words, and to salient sentences in a collection of sentences, based only on the context of that sequence or collection itself. In addition to claim-driven and title-driven attention, we apply self-attention to capture the otherwise unattended words and sentences that are not directly related to the claim or title but are very useful for classification and summarization. The self-attention weights at the word and sentence levels are computed as follows:

    u_{j,i} = tanh(W_{j,i} h_j + b_{j,i}),   α^{Sl}_{j,i} = softmax(u_{j,i}^⊤),
    u_j = tanh(W_j h^S + b_j),   α^{Sl}_j = softmax(u_j^⊤),   (5)

where α^{Sl}_{j,i} and α^{Sl}_j are the self-attention weight vectors at the word and sentence levels, respectively.
Fusion of Attention Weights.
We combine the attention weights from the three kinds of attention mechanisms (claim-driven, title-driven, and self-attention) at both the word and sentence levels. At the word level, we set:

    α_j = (α^C_{j,i} + α^T_{j,i} + α^{Sl}_{j,i}) / 3,   S_j = α_j^⊤ h_j,   (7)

where α^C_{j,i}, α^T_{j,i}, and α^{Sl}_{j,i} are the attention weight vectors from claim-, title-, and self-attention at the word level, and S_j is the resulting representation of the j-th sentence after overall attention. At the sentence level, we set:

    α^S_j = (α^C_j + α^T_j + α^{Sl}_j) / 3,   doc = (α^S_j)^⊤ h^S,   (9)

where α^C_j, α^T_j, and α^{Sl}_j are the attention weight vectors from claim-, title-, and self-attention at the sentence level, and doc is the resulting document representation after overall attention.

Figure 2: SUMO's neural network architecture for establishing the correctness of Web claims. (The diagram shows Bi-GRU word and sentence encoders, claim-guided, title-guided, and self-guided attention blocks at both levels, the resulting sentence and document representations, the predicted label, and a sentence selector coupled with a topic model.)
Attention on top of Attention (Atop).
Although fusing the three kinds of attention weights by taking their average works well, we lose some context by averaging. To deal with this issue, we use a novel attention-on-top-of-attention (Atop) method. We concatenate all three kinds of attention weights into α_con and α^S_con at the word and sentence levels, respectively, apply a tanh-activated dense layer as a scoring function, and subsequently apply a softmax layer to compute an attention weight for each of the three kinds of attention.

At the word level:

    α_con = α^C_{j,i} ‖ α^T_{j,i} ‖ α^{Sl}_{j,i},
    u_wa = tanh(W_wa α_con + b_wa),   β_w = softmax(u_wa),
    S_j = β_w(1) α^C_{j,i} + β_w(2) α^T_{j,i} + β_w(3) α^{Sl}_{j,i}.

At the sentence level:

    α^S_con = α^C_j ‖ α^T_j ‖ α^{Sl}_j,
    u_sa = tanh(W_sa α^S_con + b_sa),   β_s = softmax(u_sa),
    doc = β_s(1) α^C_j + β_s(2) α^T_j + β_s(3) α^{Sl}_j,   (10)

where β_w and β_s are the learned attention weight vectors over the three kinds of attention at the word and sentence levels, and doc is the resulting document representation after Atop attention.
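A minimal sketch of the Atop fusion at one level, assuming TensorFlow and fixed-length attention vectors; the Dense(3) scoring head is our reading of W_wa as a map from the concatenated attention weights to one score per attention type.

```python
import tensorflow as tf

class AtopFusion(tf.keras.layers.Layer):
    """Attention-on-top-of-attention (Eq. 10): learn a softmax-weighted
    mix of the claim-, title-, and self-attention weight vectors."""

    def __init__(self):
        super().__init__()
        # tanh scoring layer producing one logit per attention type
        self.score = tf.keras.layers.Dense(3, activation="tanh")

    def call(self, alpha_claim, alpha_title, alpha_self, hidden_states):
        # alpha_*: [batch, n] attention weights; hidden_states: [batch, n, dim]
        concat = tf.concat([alpha_claim, alpha_title, alpha_self], axis=-1)
        beta = tf.nn.softmax(self.score(concat), axis=-1)       # [batch, 3]
        fused = (beta[:, 0:1] * alpha_claim +
                 beta[:, 1:2] * alpha_title +
                 beta[:, 2:3] * alpha_self)                     # [batch, n]
        # weighted sum of the hidden states gives the fused representation
        return tf.einsum("bn,bnd->bd", fused, hidden_states)
```

The same layer can be instantiated twice, once over word-level weights (producing S_j) and once over sentence-level weights (producing doc).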
Prediction and Optimization. We use the overall document representation doc in a softmax layer for classification. To train the model, we use the standard softmax cross-entropy with logits as the loss function, and we compute the predicted label ŷ as:

    ŷ = softmax(W_cl doc + b_cl).   (11)

Recent works retrieve documents from the Web as external evidence to support or refute claims, and thereafter extract snippets as explanations of the model's decision [16, 21]. However, the snippets extracted by these methods are often redundant and lack topical diversity. The objective of our summarization algorithm is to provide a ranked list of sentences that are novel, non-redundant, and diverse across the topics identified from the text of the documents. In this section, we outline the method we utilize for achieving this objective.
Multi-topic Sentence Model. Each sentence in the documents retrieved for the claim is modeled as a collection of topics: s = ⟨a^(1), a^(2), ..., a^(k)⟩. Let A be the set of topics a_i across all candidate sentences from the pseudo-relevant set of documents D for the claim.

Objective. We formulate the summarization task as a diversification objective. Given the set of relevant sentences R attended by the Atop attention in SUMO while establishing the claim's correctness, we have to find the smallest subset of sentences S ⊆ R such that all topics a_i ∈ A are covered. This is a variation of the Set Cover problem [1, 10, 29, 30, 8, 11, 5]. However, unlike IA-Select [1], we do not utilize the Max Coverage variation of the Set Cover problem; instead, we formulate it as Set Cover itself [10, 29]. That is, given a set of topics A, find a minimal set of sentences S ⊆ R that covers those topics [29]. Additionally, the inclusion of each sentence in the subset S has an associated cost, given by:

    cost(s) = 1 − Score,   Score = λ θ_s + (1 − λ)(W̄_wa + W_sa),   (12)

where θ_s is the topic distribution score for sentence s computed using a topic model (e.g., Latent Dirichlet Allocation [3]), W̄_wa = (1/l) Σ_{i=1}^{l} W_wa(i) is the average of the attention weights of the words contained in sentence s, W_sa is the attention weight of sentence s, and λ is a parameter to be tuned. We briefly describe our adaptation of the greedy algorithm, which provides an approximate solution to the Set Cover problem, based on the discussion in [10, 29, 30, 8, 11, 5].
Algorithm 1: Adaptation of the approximate greedy algorithm for the Set Cover problem [10, 29, 30, 8, 11, 5] to our topical diversification setting. At each iteration, a sentence is chosen that covers the largest number of topics, as reflected by the topic distribution score, and that has the highest attention weights. As output, we obtain a non-redundant, novel, and diversified set of sentences.

Input: A: set of topics learned from the topic model for diversification; R: set of sentences attended by Atop.
Output: S ⊆ R: diversified set of sentences over A.
S ← ∅                      // S contains the diversified sentences
A′ ← ∅                     // A′ contains the topics covered by S
while A′ ≠ A do
    /* identify the sentence that covers the most topics
       and is highly relevant for fact checking */
    s* ← argmin_{s ∈ R \ S} cost(s) / |A − A′|
    A′ ← A′ ∪ {a_{s*}}     // a_{s*} is the dominant topic of sentence s*
    S ← S ∪ {s*}
end
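Below is a minimal Python sketch of this greedy loop. It assumes, as in Algorithm 1, that each sentence contributes its dominant topic; we additionally restrict the candidates to sentences whose dominant topic is still uncovered so that every iteration makes progress, which is our own simplification. All names are illustrative.

```python
def diversify(sentences, topics, dominant_topic, cost):
    """Greedy topical diversification (a sketch of Algorithm 1).

    sentences:      candidate sentences R attended by Atop
    topics:         set A of topics from the topic model
    dominant_topic: dict sentence -> its dominant topic a_s
    cost:           dict sentence -> cost(s) from Eq. (12)
    """
    selected = []       # S: the diversified summary
    covered = set()     # A': topics covered so far
    remaining = set(sentences)
    while covered != set(topics):
        # only sentences whose dominant topic is still uncovered
        candidates = [s for s in remaining if dominant_topic[s] not in covered]
        if not candidates:          # no sentence can cover a new topic
            break
        # lowest cost = highest Score = most relevant for fact checking
        best = min(candidates, key=lambda s: cost[s])
        covered.add(dominant_topic[best])
        selected.append(best)
        remaining.remove(best)
    return selected
```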
4 Evaluation

Datasets. We use two publicly available datasets, the PolitiFact political claims dataset and the Snopes political claims dataset [21], for evaluating SUMO's capability for fact checking. Dataset statistics for both datasets are shown in Table 1. In the case of PolitiFact, claims have one of the following labels: 'true', 'mostly true', 'half true', 'mostly false', 'false', and 'pants-on-fire'. We convert the 'true', 'mostly true', and 'half true' labels to 'true' and the rest to 'false'. For the Snopes dataset, each claim is labeled either 'true' or 'false'. We evaluate SUMO on the task of summarization on the PolitiFact, Snopes, Climate, and Health datasets. The two new datasets, Climate and Health, are about climate change and health care, respectively. We test SUMO only on the PolitiFact and Snopes datasets for the task of fact checking, as they are magnitudes larger than the new datasets that we release. The climate change dataset contains claims broadly related to climate change and global warming from climatefeedback.org. We use each claim as a query with the Google API to search the Web and retrieve external evidence in the form of search results. Similarly, we create a dataset related to health care that additionally contains claims pertaining to the current global COVID-19 pandemic from healthfeedback.org. Examples of claims from these two datasets are shown in Figure 3. We make the new datasets publicly available to the research community at the following URL: https://github.com/rahulOmishra/SUMO/.

Table 1: Dataset statistics.

Public datasets
Statistics     PolitiFact   Snopes
Claims         –            –
Documents      –            –
Domains        –            –

New datasets
Statistics     Climate      Health
Claims         104          100
Documents      –            –
Domains        97           83

SUMO Implementation.
We use TensorFlow to implement SUMO. We use per-class accuracy and macro F1 scores as performance metrics for evaluation. We use a bi-directional Gated Recurrent Unit (GRU) with a hidden size of 200, word2vec [15] and GloVe [19] embeddings with an embedding size of 200, and softmax cross-entropy with logits as the loss function. We set the learning rate to 0.001, the batch size to 64, and gradient clipping to 5. All parameters are tuned using a grid search. We train each model for 50 epochs and apply early stopping if the validation loss does not change for more than 5 epochs. We keep the maximum sentence length at 45 words and the maximum number of sentences in a document at 35. For the task of summarization, we use Latent Dirichlet Allocation (LDA) [3] as the topic model to compute the topic distribution score and the dominant topic for each candidate sentence.
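For illustration, θ_s and the dominant topic a_s of each candidate sentence could be computed along the following lines with the gensim LDA implementation; gensim and the topic count of 10 are our assumptions, as the paper only specifies LDA [3].

```python
from gensim import corpora
from gensim.models import LdaModel

# candidate_sentences: the sentences attended by Atop for one claim
texts = [sentence.lower().split() for sentence in candidate_sentences]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

dominant_topic, theta = {}, {}
for sentence, bow in zip(candidate_sentences, corpus):
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    topic, score = max(dist, key=lambda pair: pair[1])
    dominant_topic[sentence] = topic   # a_s used in Algorithm 1
    theta[sentence] = score            # topic distribution score θ_s in Eq. (12)
```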
5 Results

We experiment with five variants of our proposed SUMO model and compare them with six state-of-the-art methods. The six state-of-the-art methods are as follows. First, we have a basic Long Short-Term Memory (LSTM) [7] unit, which is used with the claim and document contents for classification. Second, we have a convolutional neural network (CNN) [9] for document classification. Third, we compare against the model proposed in [27], which uses a hierarchical representation of the documents with hierarchical LSTM units (Hi-LSTM). Fourth, we compare against the model proposed in [33], which uses hierarchical neural attention on top of hierarchical LSTMs (HAN) to learn better representations of documents for classification. Fifth, we compare against the model proposed in [21], which uses a claim-guided attention method (DeClarE) to predict the correctness of claims in the presence of external evidence. Sixth and finally, we compare against the recent work [16], which improves on the DeClarE method by using latent aspect (speaker, topic, or domain) based attention.
Figure 3: Example claims from the climate change and health care datasets.

‣ Global warming slowing down? 'Ironic' study finds more CO2 has slightly cooled the planet.
‣ The ozone layer is healing.
‣ Deforestation has made humans more vulnerable to pandemics.
‣ Historical data of temperature in the U.S. destroys global warming myth.
‣ New evidence shows wearing face mask can help coronavirus enter the brain and pose more health risk, warn expert.
‣ Boil weed and ginger for Covid-19 victims, the virus will vanish.
‣ Smoking may protect against COVID-19.
‣ Wearing face masks can cause carbon dioxide toxicity; can weaken immune system.
The five proposed variants of our method SUMO are as follows. First, we have the SUMO-AW2V variant, which corresponds to the basic SUMO model with word2vec embeddings. Second, we have the SUMO-AtopW2V variant, which consists of the SUMO model with word2vec embeddings; furthermore, in SUMO-AtopW2V we use the Atop method of attention fusion rather than a simple average. Third, we have the SUMO-AGlove variant, which is the basic SUMO model with GloVe embeddings. Fourth, we have the SUMO-AtopGlove variant, which consists of the SUMO model with GloVe embeddings; moreover, in SUMO-AtopGlove we use the Atop method of attention fusion rather than a simple average. Fifth and finally, we have the SUMO-AtopGlove+source-Emb variant, which is similar to SUMO-AtopGlove but adds source embeddings (domains of the retrieved documents).
The results for establishing claim correctness are shown in Table 2. We observe that the basic LSTM-based model achieves 57.89% and 69.89% in terms of macro F1 in predicting claim correctness for PolitiFact and Snopes, respectively. The CNN model performs slightly better than the LSTM, as it captures local contextual features better. The hierarchical attention network outperforms the CNN with macro F1 scores of 63.44% and 73.84%, DeClarE achieves macro F1 scores of 67.10% and 70.47%, and SADHAN achieves macro F1 scores of 75.69% and 80.09%. Even the basic SUMO model with word2vec embeddings performs better than DeClarE with source embeddings. This observation is a clear indication of the superiority of our claim- and title-driven attention design. SUMO with Atop attention fusion is more effective than a simple average fusion of the attention weights, which becomes apparent from the gain in macro F1 on both datasets. SUMO with pre-trained GloVe embeddings outperforms the word2vec versions of SUMO, as the GloVe embeddings are trained on a large corpus and therefore capture better context for the words. SUMO-AtopGlove+source-Emb outperforms all the other models, and the improvement is statistically significant for both PolitiFact and Snopes; the significance values were computed using a two-sample Student's t-test. We notice that SUMO could not outperform SADHAN without source embeddings, as SADHAN uses a very complex structure with three parallel models and hierarchical latent-aspect-guided attention. However, SADHAN has several drawbacks. First, it is challenging to train and requires more hardware resources and time. Second, the latent aspects are not available for all Web claims; therefore, it is not generalizable. Third, it fails to accommodate new values of the latent variables at test time.
For the evaluation of SUMO's summarization capability, we create gold reference summaries for the claims. For creating the gold reference summaries, we include all facts related to the claim that are important for the claim correctness prediction, non-redundant, and topically diverse. We find that the descriptions provided for a claim on fact-checking websites such as snopes.com and politifact.com are suitable for this purpose. We use a cosine similarity threshold of 0.4 between the claims and the sentences of the description to filter out irrelevant or noisy sentences. As evaluation metrics, we use ROUGE-1, ROUGE-2, and ROUGE-L scores. The ROUGE-1 score represents the overlap of unigrams, and the ROUGE-2 score the overlap of bigrams, between the summaries generated by the SUMO system and the gold reference summaries. The ROUGE-L score measures the longest matching sequence of words using the Longest Common Subsequence algorithm.
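For illustration, the three ROUGE variants can be computed with the rouge-score package as sketched below; the package choice is our assumption, since the paper does not name an implementation.

```python
from rouge_score import rouge_scorer

# reference: gold reference summary; candidate: summary generated by the system
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # unigram overlap, bigram overlap, and longest common subsequence
    print(name, round(result.fmeasure, 4))
```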
Standard summarization techniques are not useful in this scenario, as the objective of summarization with standard techniques is usually not fact checking. Hence, we compare the SUMO results with an information retrieval method (BM25) and a natural language processing based method (QuerySum). BM25 is a ranking function that uses a probabilistic retrieval framework and ranks documents based on their relevance to a given search query. We use the Web claims as queries and apply BM25 to retrieve the most relevant sentences from all the documents retrieved for the claim. We also compare the results with the query-driven attention-based abstractive summarization method QuerySum [17], which also uses a diversity objective to create a diverse summary. We use the ROUGE metrics with the gold reference summaries to evaluate the generated summaries.
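A minimal sketch of such a BM25 baseline, assuming the rank_bm25 package and simple whitespace tokenization; the variable names are illustrative.

```python
from rank_bm25 import BM25Okapi

# sentences: all sentences from the documents retrieved for the claim
tokenized = [s.lower().split() for s in sentences]
bm25 = BM25Okapi(tokenized)

query = claim.lower().split()                           # the Web claim as query
top_sentences = bm25.get_top_n(query, sentences, n=5)   # BM25 baseline summary
```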
Table 2: Comparison of the proposed models with the state-of-the-art baseline models on the two publicly available datasets.

PolitiFact
Model                        True Accuracy   False Accuracy   Macro F1
LSTM                         53.51           56.32            57.89
CNN                          55.92           57.33            59.39
HAN                          60.13           65.78            63.44
DeClarE (full)               68.18           66.01            67.10
SADHAN-agg                   68.37           78.23            75.69
SUMO-AW2V                    67.30           69.22            70.74
SUMO-AtopW2V                 67.81           70.09            71.15
SUMO-AGlove                  68.03           72.57            72.39
SUMO-AtopGlove               68.93           73.43            72.79
SUMO-AtopGlove+source-Emb    –               –                –

Snopes
Model                        True Accuracy   False Accuracy   Macro F1
LSTM                         69.23           70.67            69.89
CNN                          72.05           74.29            72.63
HAN                          72.89           76.25            73.84
DeClarE (full)               60.16           80.78            70.47
SADHAN-agg                   79.47           84.26            80.09
SUMO-AW2V                    77.32           80.67            75.56
SUMO-AtopW2V                 78.02           81.66            76.86
SUMO-AGlove                  78.74           82.03            77.22
SUMO-AtopGlove               78.89           82.46            78.45
SUMO-AtopGlove+source-Emb    –               –                –

Table 3: Results for the task of summarization.

Model       ROUGE-1   ROUGE-2   ROUGE-L
BM25        26.08     14.78     29.98
QuerySum    29.78     16.49     30.16
SUMO        –         –         35.92
Results for the task of summarization are shown in Table 3. The QuerySum method performs significantly better than BM25, with a ROUGE-L score of 30.16, as it uses query-driven attention and a diversity objective, which results in a diverse and query-oriented summary. The proposed model SUMO outperforms QuerySum with a ROUGE-L score of 35.92. We attribute this gain to the use of word- and sentence-level weights, which are trained using back-propagation with the correctness label. We also notice that in QuerySum some sentences are related to the claim but are not useful for fact checking, and are therefore absent from the gold reference summary. The results for SUMO are statistically significant, computed using a pairwise Student's t-test.

6 Conclusion
We presented SUMO, a neural network based approach to generate explainable and topically diverse summaries for verifying Web claims. SUMO uses an improved version of hierarchical claim-driven attention, along with title-driven and self-attention, to learn an effective representation of the external evidence retrieved from the Web. Learning this effective representation in turn assists us in establishing the correctness of textual claims. Using the overall attention weights from the novel Atop attention method and the topical distributions of the sentences, we generate extractive summaries for the claims. In addition, we release two important datasets pertaining to climate change and healthcare claims. In future work, we plan to investigate BERT [6] and other Transformer [28] based embedding methods in place of GloVe [19] embeddings for a better contextual representation of words.

References

[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In
Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM '09, pages 5–14, New York, NY, USA. Association for Computing Machinery.

[2] Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. 2019. Generating Fact Checking Explanations.

[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

[4] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In WWW.

[5] Václav Chvátal. 1979. A greedy heuristic for the set-covering problem. Math. Oper. Res., 4(3):233–235.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

[7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780. MIT Press, Cambridge, MA, USA.

[8] David S. Johnson. 1974. Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci., 9(3):256–278.

[9] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

[10] Bernhard Korte and Jens Vygen. 2002. Approximation algorithms. In Combinatorial Optimization: Theory and Algorithms, pages 361–396, Berlin, Heidelberg. Springer Berlin Heidelberg.

[11] László Lovász. 1975. On the ratio of optimal integral and fractional covers. Discret. Math., 13(4):383–390.

[12] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J. Jansen, Kam-Fai Wong, and Meeyoung Cha. 2016. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI '16, pages 3818–3824. AAAI Press.

[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS '13, pages 3111–3119, Red Hook, NY, USA. Curran Associates Inc.

[16] Rahul Mishra and Vinay Setty. 2019. SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection. ICTIR '19, pages 197–204.

[17] Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1063–1072, Vancouver, Canada. Association for Computational Linguistics.

[18] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference.

[19] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

[20] Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In WWW, pages 1003–1012.

[21] Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In EMNLP, pages 22–32.

[22] Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2018. A stylometric inquiry into hyperpartisan and fake news. In ACL, volume 1, pages 231–240.

[23] Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. EMNLP '11.

[24] Feng Qian, Chengyue Gong, Karishma Sharma, and Yan Liu. 2018. Neural user response generator: Fake news detection with collective user intelligence. IJCAI '18, pages 3834–3840.

[25] Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937, Copenhagen, Denmark. Association for Computational Linguistics.

[26] Kai Shu, Suhang Wang, and Huan Liu. 2019. Beyond news contents: The role of social context for fake news detection. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 312–320, New York, NY, USA. Association for Computing Machinery.

[27] Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal. Association for Computational Linguistics.

[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. abs/1706.03762.

[29] Vijay V. Vazirani. 2001. Approximation Algorithms. Springer-Verlag, New York, NY, USA.

[30] David P. Williamson and David B. Shmoys. 2011. The Design of Approximation Algorithms. Cambridge University Press, New York, NY, USA.

[31] Brian Xu, Mitra Mohtarami, and James Glass. 2018. Adversarial domain adaptation for stance detection. (NIPS):1–6.

[32] Shuo Yang, Kai Shu, Suhang Wang, Renjie Gu, Fan Wu, and Huan Liu. 2019. Unsupervised fake news detection on social media: A generative approach.

[33] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016, San Diego, California. Association for Computational Linguistics.