Hierarchical Multi-head Attentive Network for Evidence-aware Fake News Detection
Nguyen Vo
Worcester Polytechnic Institute, Computer Science Department, Worcester, MA, USA 01609
[email protected]
Kyumin Lee
Worcester Polytechnic Institute, Computer Science Department, Worcester, MA, USA 01609
[email protected]
Abstract
The widespread dissemination of fake news and misinformation in domains ranging from politics and economics to public health has created an urgent need to fact-check information automatically. A recent trend in fake news detection is to utilize evidence from external sources. However, existing evidence-aware fake news detection methods focus on either word-level attention or evidence-level attention alone, which may result in suboptimal performance. In this paper, we propose a Hierarchical Multi-head Attentive Network to fact-check textual claims. Our model jointly combines multi-head word-level attention and multi-head document-level attention, which aids explanation at both the word level and the evidence level. Experiments on two real-world datasets show that our model outperforms seven state-of-the-art baselines. Improvements over the baselines range from 6% to 18%. Our source code and datasets are released at https://github.com/nguyenvo09/EACL2021.

Introduction

The proliferation of biased news, misleading claims, disinformation and fake news has had heightened negative effects on modern society in domains ranging from politics and economics to public health. A recent study showed that maliciously fabricated and partisan stories possibly caused citizens' misperception of political candidates during the 2016 U.S. presidential elections (Allcott and Gentzkow, 2017). In economics, the spread of fake news has manipulated stock prices (Kogan et al., 2019). For example, $139 billion was wiped out when the Associated Press (AP)'s hacked Twitter account posted a rumor about a White House explosion injuring Barack Obama. Recently, misinformation has caused infodemics in public health (Ashoka, 2020) and has even led to fatalities in the physical world (Alluri, 2019).

To reduce the spread of misinformation and its detrimental influence, many fact-checking systems have been developed to fact-check textual claims.
It is estimated that the number of fact-checking outlets has increased 400% across 60 countries since 2014 (Stencel, 2019). Several fact-checking systems such as snopes.com and politifact.com are widely used by both online users and major corporations. Facebook recently incorporated third-party fact-checking sites into social media posts (CNN, 2020), and Google integrated fact-checking articles into its search engine (Wang et al., 2018). These fact-checking systems debunk claims by manually assessing their credibility based on collected webpages used as evidence. However, this manual process is laborious and does not scale to the large volume of false claims produced on communication platforms. Therefore, in this paper, our goal is to build an automatic fake news detection system that fact-checks textual claims based on collected evidence, speeding up the fact-checking process of the above fact-checking sites.

To detect fake news, researchers proposed to use linguistics and textual content (Castillo et al., 2011; Zhao et al., 2015; Liu et al., 2015). Since textual claims are usually deliberately written to deceive readers, it is hard to detect fake news by relying solely on the claims' content. Therefore, multiple works utilized other signals such as temporal spreading patterns (Liu and Wu, 2018), network structures (Wu and Liu, 2018; Vo and Lee, 2018; Shu et al., 2020) and users' feedback (Vo and Lee, 2019; Shu et al., 2019; Vo and Lee, 2020a). However, limited work used external webpages as documents, which could provide interpretive explanations to users. Several recent works (Popat et al., 2018; Ma et al., 2019; Vo and Lee, 2020b) started to utilize documents to fact-check textual claims. Popat et al. (2018) used word-level attention in documents but treated all documents with equal importance, whereas Ma et al.
(2019) only focused on which documents are more crucial, without considering which words help explain the credibility of textual claims.

Observing the drawbacks of the existing work, we propose a Hierarchical Multi-head Attentive Network which jointly utilizes word attention and evidence attention. The overall semantics of a document may be generated by multiple parts of the document. Therefore, we propose a multi-head word attention mechanism to capture different semantic contributions of words to the meaning of the documents. Since a document may have different semantic aspects corresponding to various information related to the credibility of a claim, we propose a multi-head document-level attention mechanism to capture the contributions of the different semantic aspects of the documents. In our attention mechanism, we also use speaker and publisher information to further improve the effectiveness of our model. To our knowledge, our work is the first to apply a multi-head attention mechanism to both words and documents in evidence-aware fake news detection. Our work makes the following contributions:

• We propose a novel hierarchical multi-head attention network which jointly combines word attention and evidence attention for evidence-aware fake news detection.
• We propose a novel multi-head attention mechanism to capture important words and evidence.
• Experiments on two public datasets demonstrate the effectiveness and generality of our model over state-of-the-art fake news detection techniques.
Related Work

Many methods have been proposed to detect fake news in recent years. These methods can be placed into three groups: (1) human-based fact-checking sites (e.g. Snopes.com, Politifact.com), (2) machine learning based methods, and (3) hybrid systems (e.g. content moderation on social media sites). Among machine-learning-based methods, researchers mainly used linguistics and textual content (Zellers et al., 2019; Zhao et al., 2015; Wang, 2017; Shu et al., 2019), temporal spreading patterns (Liu and Wu, 2018), network structures (Wu and Liu, 2018; Vo and Lee, 2018; You et al., 2019), users' feedback (Vo and Lee, 2019; Shu et al., 2019) and multimodal signals (Gupta et al., 2013; Vo and Lee, 2020b). Recently, researchers have focused on fact-checking claims based on evidence from different sources. Thorne and Vlachos (2017) and Vlachos and Riedel (2015) fact-check claims using subject-predicate-object triplets extracted from a knowledge graph as evidence. Chen et al. (2020) assess claims' credibility using tabular data. Our work is closely related to the fact verification task (Thorne et al., 2018; Nie et al., 2019; Soleimani et al., 2020), which aims to classify a pair of a claim and an evidence sentence extracted from Wikipedia into three classes: supported, refuted, or not enough info. For the fact verification task, Nie et al. (2019) used ELMo (Peters et al., 2018) to extract contextual embeddings of words and used a modified ESIM model (Chen et al., 2017). Soleimani et al. (2020) used the BERT model (Devlin et al., 2018) to retrieve and verify claims. Zhou et al. (2019) used graph-based models for semantic reasoning. Our work is different from these works since our goal is to classify a pair of a claim and a list of relevant evidence into true or false.

Our work is also close to existing work on evidence-aware fake news detection (Popat et al., 2018; Ma et al., 2019; Wu et al., 2020; Mishra and Setty, 2019). Popat et al.
(2018) used an average pooling layer to derive claims' representations to attend to words in evidence, Mishra and Setty (2019) focused on words and sentences in each piece of evidence, and Ma et al. (2019) proposed a semantic entailment model to attend to important evidence. However, to the best of our knowledge, our work is the first to jointly use multi-head attention mechanisms to focus on important words in each piece of evidence and on important evidence within a set of relevant articles. Our attention mechanism is different from these works since we use multiple attention heads to capture different semantic contributions of words and evidence.

Problem Statement

We denote an evidence-based fact-checking dataset C as a collection of tuples (c, s, D, P) where c is a textual claim originated from a speaker s, D = {d_i}_{i=1}^{k} is a collection of k documents relevant to the claim c, and P = {p_i}_{i=1}^{k} is the corresponding publishers of the documents in D. Note, |D| = |P|. Our goal is to classify each tuple (c, s, D, P) into a pre-defined class (i.e. true news/fake news). We use the terms "documents", "articles", and "evidence" interchangeably.
Figure 1: The architecture of our proposed model
MAC, in which we show a claim c, two associated relevant articles d_1 and d_2, and the sources of the claim and the two documents. h_1 and h_2 are the numbers of heads of word-level attention and document-level attention, respectively.

Framework

In this section, we describe our Hierarchical Multi-head Attentive Network for Fact-Checking (MAC), which jointly considers word-level attention and document-level attention. Our framework consists of four main components: (1) an embedding layer, (2) a multi-head word attention layer, (3) a multi-head document attention layer and (4) an output layer. These components are illustrated in Fig. 1, where we show a claim and two documents as an example.
Embedding Layer

Each claim c is modeled as a sequence of n words [w_1^c, w_2^c, ..., w_n^c] and each document d_i is viewed as another sequence of m words [w_1^d, w_2^d, ..., w_m^d]. Each word w_i^c and w_j^d is projected into a D-dimensional vector e_i^c and e_j^d respectively by an embedding matrix W_e ∈ R^{V×D}, where V is the vocabulary size. Each speaker s and publisher p_i, modeled as one-hot vectors, are transformed into dense vectors s ∈ R^{D_1} and p_i ∈ R^{D_2} respectively by two matrices W_s ∈ R^{S×D_1} and W_p ∈ R^{P×D_2}, where S and P are the numbers of speakers and publishers in the training set respectively. Both W_s and W_p are uniformly initialized and jointly learned with the other parameters of our MAC.

Multi-head Word Attention Layer

We input the word embeddings e_i^c of the claim c into a bidirectional LSTM (Graves et al., 2005), which generates a contextual representation of each token: h_i^c = [←h_i; →h_i] ∈ R^{2H}, where ←h_i and →h_i are the hidden states of the forward and backward passes of the BiLSTM, the symbol ; denotes concatenation, and H is the hidden size. We derive the claim's representation in R^{2H} by an average pooling layer:

    c = (1/n) Σ_{i=1}^{n} h_i^c    (1)

Applying a similar process on top of each document d_i with a different BiLSTM, we obtain a contextual representation h_j^d ∈ R^{2H} for each word in d_i. After going through the BiLSTM, d_i is modeled as a matrix H = [h_1^d ⊕ h_2^d ⊕ ... ⊕ h_m^d] ∈ R^{m×2H}, where ⊕ denotes stacking. To understand what information in a document helps us fact-check a claim, we need to guide our model to focus on crucial keywords or phrases of the document.
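The embedding-lookup and average-pooling steps above can be sketched numerically. The snippet below is an illustrative NumPy mock, not the released implementation: all dimension values are arbitrary stand-ins, and a fixed linear projection substitutes for the BiLSTM purely to produce contextual states of the shape Eq. 1 expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (the paper tunes V, D, H; these are stand-ins).
V, D, H, n = 50, 8, 6, 5           # vocab size, embedding dim, LSTM hidden size, claim length

W_e = rng.normal(size=(V, D))      # word embedding matrix W_e in R^{V x D}

claim_ids = np.array([3, 17, 4, 9, 21])   # token indices of the claim
E = W_e[claim_ids]                 # embedded claim, shape (n, D)

# Stand-in for the BiLSTM: any map from (n, D) to contextual states (n, 2H).
# Here two fixed projections play the forward/backward roles, for illustration only.
W_fwd = rng.normal(size=(D, H))
W_bwd = rng.normal(size=(D, H))
H_ctx = np.concatenate([np.tanh(E @ W_fwd), np.tanh(E @ W_bwd)], axis=1)  # (n, 2H)

# Eq. 1: claim representation = average pooling over contextual states.
c = H_ctx.mean(axis=0)             # c in R^{2H}, here shape (12,)
```

The same pattern, with a second encoder, yields the document matrix H stacked row by row.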
Drawing inspiration from Luong et al. (2015), we first replicate the vector c (Eq. 1) m times to create a matrix C ∈ R^{m×2H} and propose an attention mechanism to attend to important words in the document d_i:

    a = softmax( tanh([H; C] · W_1) · w_2 )    (2)

where w_2 ∈ R^{a_1}, W_1 ∈ R^{4H×a_1}, [H; C] is the concatenation of the two matrices along the last dimension, and a ∈ R^m is the attention distribution over the m words. However, the overall semantics of the document might be generated by multiple parts of the document (Lin et al., 2017). Therefore, we propose a multi-head word attention mechanism to capture different semantic contributions of words by extending the vector w_2 into a matrix W_2 ∈ R^{a_1×h_1}, where h_1 is the number of attention heads shown in Fig. 1. We modify Eq. 2 as follows:

    A = softmax_col( tanh([H; C] · W_1) · W_2 )    (3)

where A ∈ R^{m×h_1} and each column of A is normalized by the softmax operation. Intuitively, A stands for h_1 different attention distributions over the m words of the document d_i, helping us capture different aspects of the document. After computing A, we derive the representation of document d_i as follows:

    d_i = flatten(A^T · H)    (4)

where d_i ∈ R^{2h_1H} and the function flatten(·) flattens A^T · H into a vector. We also implemented the more sophisticated multi-head attention of Vaswani et al. (2017) but did not achieve good results.

Multi-head Document Attention Layer

This layer consists of three components: (1) extending the representations of claims, (2) extending the representations of evidence and (3) a multi-head document attention mechanism.
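The multi-head word attention of Eqs. 2–4 reduces to a few matrix operations. The sketch below is a hypothetical NumPy rendering with toy dimensions, not the released code; `colwise_softmax` normalizes each head's column to sum to one, as Eq. 3 requires.

```python
import numpy as np

def colwise_softmax(X):
    """Softmax over axis 0, so every column is a distribution."""
    Z = np.exp(X - X.max(axis=0, keepdims=True))
    return Z / Z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
m, twoH, a1, h1 = 7, 12, 10, 3               # words per document, 2H, attention dim a_1, heads h_1

H_doc = rng.normal(size=(m, twoH))           # contextual word states of document d_i
c = rng.normal(size=(twoH,))                 # claim representation from Eq. 1
C = np.tile(c, (m, 1))                       # replicate c m times -> (m, 2H)

W1 = rng.normal(size=(2 * twoH, a1))         # W_1 in R^{4H x a_1}
W2 = rng.normal(size=(a1, h1))               # W_2 in R^{a_1 x h_1}

# Eq. 3: h_1 attention distributions over the m words (each column sums to 1).
A = colwise_softmax(np.tanh(np.concatenate([H_doc, C], axis=1) @ W1) @ W2)   # (m, h1)

# Eq. 4: attended document representation, flattened to a vector in R^{2 h_1 H}.
d_i = (A.T @ H_doc).reshape(-1)              # shape (h1 * 2H,) = (36,)
```

Each of the `h1` rows of `A.T @ H_doc` is one head's weighted average of the word states; flattening concatenates the heads.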
Extending representations of claims.
So far the representation of the claim c (Eq. 1) comes only from textual content. In reality, the speaker who made a claim may affect the credibility of the claim. For example, claims from some politicians are controversial and inaccurate (Allcott and Gentzkow, 2017). Therefore, we enrich the vector c by concatenating it with the speaker's embedding s to generate c_ext ∈ R^x, where x = 2H + D_1, as shown in Eq. 5.

    c_ext = [c; s] ∈ R^x    (5)

Extending representations of evidence.
Intuitively, an article published by nytimes.com might be more reliable than a piece of news published by breitbart.com, which is known to be a less credible site. Therefore, to capture more information, we further enrich the representations of evidence with publishers' information by concatenating d_i (Eq. 4) with its publisher's embedding p_i as follows:

    d_i^ext = [d_i; p_i] ∈ R^y    (6)

where y = 2h_1H + D_2. From Eq. 6, we can generate representations of the k relevant articles and stack them as shown in Eq. 7.

    D = [d_1^ext ⊕ ... ⊕ d_k^ext] ∈ R^{k×y}    (7)

Multi-head Document Attention Mechanism.
In real life, a journalist from snopes.com or politifact.com may use all k articles relevant to the claim c to fact-check it, but she may focus on some key articles to determine the verdict of the claim c while other articles carry negligible information. To capture this intuition, we need to downgrade uninformative documents and concentrate on more meaningful articles. Similar to Section 4.2, we use a multi-head attention mechanism which produces different attention distributions representing the diverse contributions of articles toward determining the veracity of the claim c.

We first create a matrix C ∈ R^{k×x} by replicating the vector c_ext (Eq. 5) k times. Second, the matrix C is concatenated with the matrix D (Eq. 7) along the last dimension of the two matrices, denoted as [D; C] ∈ R^{k×(x+y)}. Our proposed multi-head document-level attention mechanism applies h_2 different attention heads as shown in Eq. 8.

    A_2 = softmax_col( tanh([D; C] · W_3) · W_4 )    (8)

where W_3 ∈ R^{(x+y)×a_2} and W_4 ∈ R^{a_2×h_2}. The matrix A_2 ∈ R^{k×h_2}, where each column is normalized by the softmax operator, is a collection of h_2 different attention distributions over the k documents. Using the attention weights, we generate the attended representation of the k pieces of evidence, denoted d_rich ∈ R^{h_2·y}, as shown in Eq. 9.

    d_rich = flatten(A_2^T · D)    (9)

where the flatten(·) function flattens A_2^T · D into a vector. We finally generate the representation of a tuple (c, s, D, P) by concatenating the vector c_ext (Eq. 5) and the vector d_rich (Eq. 9), denoted as [c_ext; d_rich]. To the best of our knowledge, our work is the first to utilize a multi-head attention mechanism integrated with speaker and publisher information to capture the various semantic contributions of evidence toward the fact-checking process.
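Document-level attention follows the same template as word-level attention, only over the k evidence vectors instead of the m word states. A toy NumPy sketch of Eqs. 8–9 (dimensions are arbitrary stand-ins, not values from the paper):

```python
import numpy as np

def colwise_softmax(X):
    """Softmax over axis 0, so every column is a distribution."""
    Z = np.exp(X - X.max(axis=0, keepdims=True))
    return Z / Z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(2)
k, x, y, a2, h2 = 4, 14, 20, 10, 2   # documents, dim of c_ext, dim of d_i^ext, a_2, heads h_2

D_mat = rng.normal(size=(k, y))      # stacked evidence representations (Eq. 7)
c_ext = rng.normal(size=(x,))        # claim representation + speaker embedding (Eq. 5)
C2 = np.tile(c_ext, (k, 1))          # replicate c_ext k times -> (k, x)

W3 = rng.normal(size=(x + y, a2))    # W_3 in R^{(x+y) x a_2}
W4 = rng.normal(size=(a2, h2))       # W_4 in R^{a_2 x h_2}

# Eq. 8: h_2 attention distributions over the k documents.
A2 = colwise_softmax(np.tanh(np.concatenate([D_mat, C2], axis=1) @ W3) @ W4)  # (k, h2)

# Eq. 9: attended evidence representation in R^{h_2 * y}; the tuple
# representation is the concatenation [c_ext; d_rich].
d_rich = (A2.T @ D_mat).reshape(-1)  # shape (h2 * y,) = (40,)
tuple_rep = np.concatenate([c_ext, d_rich])
```

`tuple_rep` is what the output layer consumes next.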
Output Layer

In this layer, we input the tuple representation [c_ext; d_rich] into a multilayer perceptron (MLP) to compute the probability ŷ that the claim c is true news:

    ŷ = σ( W_6 · (W_5 · [c_ext; d_rich] + b_5) + b_6 )    (10)

where W_5, W_6, b_5, b_6 are the weights and biases of the MLP, and σ(·) is the sigmoid function. We optimize our model by minimizing the standard cross-entropy loss:

    L_θ(y, ŷ) = −( y log ŷ + (1 − y) log(1 − ŷ) )    (11)

where y ∈ {0, 1} is the ground-truth label of a tuple (c, s, D, P). During training, we sample a mini-batch of 32 tuples and compute the average loss over the tuples.

Datasets

We employed two public datasets released by Popat et al. (2018). Each of these datasets is a collection of tuples (c, s, D, P, y), where each textual claim c and its credibility label y are collected from the two major fact-checking websites snopes.com and politifact.com. The articles pertinent to the claim c are retrieved using search engines. Each Snopes claim was labeled as true or false, while PolitiFact originally had six labels: true, mostly true, half true, false, mostly false, pants on fire. Following Popat et al. (2018), we merge true, mostly true and half true into true claims and the rest into false claims. Details of our datasets are presented in Table 1. Note that Snopes does not have speakers' information.

Table 1: Statistics of our experimental datasets

                  Snopes   PolitiFact
    True claims   1,164    1,867
    False claims  3,177    1,701
    |Speakers|    N/A      664
    |Documents|
    |Publishers|

Baselines

We compare our MAC model with seven state-of-the-art baselines divided into two groups. The first group of baselines uses only the textual content of claims, and the second group utilizes relevant articles to fact-check textual claims. A related method (Mishra and Setty, 2019) used subject information of articles (e.g.
politics, entertainment), which was not available in our datasets. We tried to compare with it but achieved poor results, perhaps due to the missing information. Therefore, we do not report its results in this paper. Details of the baselines are as follows:
Using only claims' text:
• BERT (Devlin et al., 2018) is a pre-trained language model achieving state-of-the-art results on many NLP tasks. The representation of the [CLS] token is input to a trainable linear layer to classify claims.
• LSTM-Last is a model proposed in (Rashkin et al., 2017).
LSTM-Last takes the last hidden state of the LSTM as the representation of a claim. These representations are input to a linear layer for classification.
• LSTM-Avg is another model proposed in (Rashkin et al., 2017), which uses an average pooling layer on top of the hidden states to derive representations of claims.
• CNN (Wang, 2017) is a state-of-the-art model which applies a 1D convolutional neural network to the word vectors of claims.
Using both claims' text and articles' text:
• DeClare (Popat et al., 2018) computes a credibility score for each pair of a claim c and a document d_i. The overall credibility rating is averaged over all k relevant articles.
• HAN (Ma et al., 2019) is a hierarchical attention network based on representations of relevant documents. It uses attention mechanisms to determine which document is more important, without considering which words in a document should be focused on.
• NSMN (Nie et al., 2019) is a state-of-the-art model designed to determine the stance of a document d_i with respect to a claim c. We apply NSMN to our dataset by predicting a score for each pair (c, d_i) and computing the average score over the documents in D, the same as DeClare.

Note that we also applied BERT, LSTM-Last, LSTM-Avg and CNN using both claims' text and articles' text. For each of these baselines, we concatenated a claim's text and a document's text, and input the concatenated content into the baseline to compute the likelihood that the claim is fake news. We computed the average probability over all documents of the claim and used it as the final prediction. However, we did not observe considerable improvements for these baselines. In addition to the deep-learning-based baselines, we compared our MAC with feature-based techniques (e.g. SVM). As expected, these traditional techniques had inferior performance compared with neural models. Therefore, we only report the seven baselines' performance.

Table 2: Performance of MAC and baselines on Snopes dataset.
MAC outperforms the baselines significantly (one-sided paired Wilcoxon test). Columns marked (True) treat true news as the positive class; columns marked (Fake) treat fake news as the positive class.

Using only claims' text:
  Methods    AUC      F1 Macro  F1 Micro  F1(True)  P(True)   R(True)   F1(Fake)  P(Fake)   R(Fake)
  BERT       0.60852  0.56096   0.69806   0.31574   0.40318   0.26050   0.80618   0.76011   0.85839
  LSTM-Avg   0.69124  0.62100   0.71877   0.42953   0.48415   0.39692   0.81246   0.79139   0.83671
  LSTM-Last  0.70142  0.63122   0.72415   0.44650   0.48935   0.41412   0.81594   0.79594   0.83776
  TextCNN    0.70537  0.63081   0.72005   0.45001   0.48164   0.43035   0.81160   0.79882   0.82622
Using both claims' text & articles' text:
  HAN        0.70365  0.62510   0.72800   0.42884   0.49192   0.38161   0.82136   0.79058   0.85490
  NSMN       0.77270  0.68006   0.76127   0.51954   0.57558   0.48182   0.84058   0.82011   0.86364
  DeClare    0.81036  0.72445   0.78813   0.59250   0.61235   0.58096   0.85640   0.85023   0.86399
Ours:
  MAC
Imprv. over the best baseline 9.47% 8.58% 5.71% 16.01% 14.27% 18.08% 3.43% 4.23% 2.67%
Experimental Settings

For each dataset, we randomly select 10% of the claims from each class to form a validation set, which is used for tuning hyper-parameters. We report 5-fold stratified cross-validation results on the remaining 90% of the data: we train our model and the baselines on 4 folds and test them on the remaining fold. We use AUC, macro/micro F1, and class-specific F1, Precision and Recall as evaluation metrics. To mitigate overfitting and reduce training time, we stop the training process early when F1 macro on the validation data decreases continuously for 10 epochs. When we get the same F1 macro between consecutive epochs, we rely on AUC for early stopping.

For fair comparison, we use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 and regularize the parameters of all methods with the ℓ2 norm and weight decay λ. As the maximum lengths of claims and articles are 30 and 100 words respectively for both datasets, we set n = 30 and m = 100. For HAN and our model, we set k = 30 since the number of articles for each claim is at most 30 in both datasets. The batch size is set to 32 and we trained all models until convergence. We tune all models, including ours, over the hidden size H; pre-trained word embeddings are from GloVe (Pennington et al., 2014) with D = 300. Both D_1 and D_2 are tuned, the numbers of attention heads h_1 and h_2 are tuned, and a_1 and a_2 are set as multiples of H. In addition to GloVe, we also utilized contextual embeddings from pre-trained language models such as ELMo and BERT but achieved comparable performance. We implemented all methods in PyTorch 0.4.1 and ran experiments on an NVIDIA GTX 1080.

Performance Comparison

We show the experimental results of our model and the baselines in Tables 2 and 3. In Table 2, MAC outperforms all baselines on the Snopes dataset according to a one-sided paired Wilcoxon test. MAC achieves the best result when h_1 = 5, h_2 = 2, H = 300 and D_1 = D_2 = 128.
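The early-stopping rule described in the experimental settings (stop after 10 consecutive epochs without validation F1-macro improvement, with AUC as the tie-breaker) can be sketched as a small monitor. The class name and structure below are our own illustration, not the authors' code:

```python
class EarlyStopper:
    """Stop when validation F1-macro fails to improve for `patience` epochs.

    Ties on F1-macro are broken by AUC, mirroring the protocol described
    in the text. A hypothetical sketch, not the released training loop.
    """

    def __init__(self, patience=10):
        self.patience = patience
        self.best = (-1.0, -1.0)   # (f1_macro, auc) of the best epoch so far
        self.bad_epochs = 0

    def update(self, f1_macro, auc):
        """Record one epoch's validation scores; return True to stop training."""
        if (f1_macro, auc) > self.best:       # lexicographic: F1 first, AUC tie-break
            self.best = (f1_macro, auc)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

For example, with `patience=3`, an epoch that improves neither F1-macro nor (on an F1 tie) AUC counts as a bad epoch, and the third consecutive bad epoch triggers the stop.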
In Table 3, MAC also significantly outperforms all baselines according to a one-sided paired Wilcoxon test on the PolitiFact dataset. The hyperparameters we selected for MAC are h_1 = 3, h_2 = 1, H = 300 and D_1 = D_2 = 128.

Among the baselines, BERT is used as a static encoder. We tried to fine-tune it but achieved even worse results, perhaps because we do not have sufficient data to tune it. Since the papers for HAN and DeClare do not release their source code, we tried our best to reproduce the results of these two models. The HAN model derives the representation of each document from the last hidden state of a GRU (Chung et al., 2014) without any attention mechanism on words to downgrade unimportant words (e.g. stop words), leading to poor representations of documents. Therefore, the document-level attention mechanism in the HAN model did not perform well. Similar patterns can be observed in the two baselines LSTM-Avg and LSTM-Last. DeClare performed best among the baselines, indicating the importance of applying word-level attention to reduce the impact of less informative words.

Our MAC outperforms all baselines on all metrics. When viewing true news as the positive class, MAC has an average increase of 16.0% and 7.1% over the best baselines on Snopes and PolitiFact respectively. We also have an average increase of 4.7% over the baselines, with a maximum improvement of 10.1%, on PolitiFact.

Table 3: Performance of MAC and baselines on PolitiFact dataset.
MAC outperforms the baselines with statistical significance (one-sided paired Wilcoxon test). Columns marked (True) treat true news as the positive class; columns marked (Fake) treat fake news as the positive class.

Using only claims' text:
  Methods    AUC      F1 Macro  F1 Micro  F1(True)  P(True)   R(True)   F1(Fake)  P(Fake)   R(Fake)
  BERT       0.58822  0.56021   0.56446   0.56364   0.59206   0.54968   0.55678   0.54354   0.58069
  LSTM-Avg   0.65465  0.60564   0.60866   0.61821   0.63192   0.61267   0.59307   0.59046   0.60425
  LSTM-Last  0.64289  0.60196   0.60493   0.61703   0.62634   0.61456   0.58690   0.58763   0.59434
  TextCNN    0.65152  0.60380   0.60740   0.61521   0.63010   0.61030   0.59238   0.59049   0.60421
Using both claims' text & articles' text:
  HAN        0.63201  0.58655   0.59121   0.59193   0.61502   0.58290   0.58117   0.57573   0.60034
  NSMN       0.64237  0.60211   0.60431   0.61123   0.63051   0.59912   0.59299   0.58213   0.60999
  DeClare    0.70642  0.65213   0.65350   0.67230   0.66548   0.67997   0.63195   0.64053   0.62444
Ours:
  MAC
Imprv. over the best baseline 7.24% 5.26% 5.76% 6.78% 3.47% 11.02% 3.64% 10.14% 0.21%
Table 4: Impact of word attention and evidence attention on our MAC on the two datasets
                       Snopes              PolitiFact
  Methods              AUC      F1 Macro   AUC      F1 Macro
  Only Word Att        0.87278  0.77831    0.74483  0.67818
  Only Evidence Att    0.82531  0.72885    0.71790  0.65187
  Word & Doc Att
Table 5: Impact of speakers and publishers on the performance of MAC on the two datasets
                       Snopes              PolitiFact
  Methods              AUC      F1 Macro   AUC      F1 Macro
  Text Only            0.88186  0.77146    0.72401  0.66844
  Text + Publishers

Figure 2: Sensitivity of MAC with respect to the number of heads in word-level attention, h_1, and the number of heads in document-level attention, h_2: (a) Snopes; (b) PolitiFact

These improvements are measured when considering fake news as the negative class. In terms of AUC, the average improvements of MAC over the baselines are 7.9% and 6.1% on Snopes and PolitiFact respectively. The improvements of MAC over the baselines can be explained by our multi-head attention mechanism shown in Eq. 3 and Eq. 8: after attending to words in documents, we generate better representations of documents/evidence, leading to more effective document-level attention compared with the HAN model.

Ablation Study

We study the impact of the attention layers on the performance of MAC by (1) using only word attention and replacing evidence attention with an average pooling layer on top of the documents' representations, and (2) using only evidence attention and replacing word attention with an average pooling layer on top of the words' representations. As we can see in Table 4, using only word attention performs much better than using only evidence attention. This is because without downgrading less informative words in evidence, irrelevant information can be captured, leading to low-quality representations of evidence. This experiment aligns with our observation that the HAN model, which used only evidence attention, did not perform well. When combining both attention mechanisms hierarchically, we consistently achieve the best results on both datasets in Table 4. In particular, the model Word & Doc Att outperformed both
Only Word Att and
Only Evidence Att significantly (one-sided paired Wilcoxon test).

Impact of Speakers and Publishers on MAC.
To study how speakers and publishers impact the performance of MAC, we experiment with four models: (1) using text only (Text Only), (2) using text and publishers (Text + Publishers), (3) using text and speakers (Text + Speakers) and (4) using text, publishers and speakers (Text + Pubs + Spkrs). In Table 5, Text + Publishers performs better than using only text on both datasets. On PolitiFact, Text + Speakers achieves 2∼
3% improvements over Text + Publishers, indicating that the speakers who made the claims are

Figure 3: Visualization of the attention weights of the first attention head on three documents relevant to the false claim "Actor Christopher Walken planning making bid US presidency 2008" in the word-level attention layer
crucial to determining the verdict of the claims. Finally, using all information (Text + Pubs + Spkrs) helps us achieve the best result on PolitiFact. On Snopes, we omit the results of Text + Speakers and Text + Pubs + Spkrs because the dataset does not contain speakers' information.

Figure 4: Visualization of the attention weights of the second attention head on three documents relevant to the same false claim in the word-level attention layer

In particular, the model
Text + Pubs + Spkrs outperformed the methods
Text Only and
Text + Publishers significantly (one-sided paired Wilcoxon test). Based on these results, we conclude that integrating information about speakers and publishers is useful for detecting misinformation.

Sensitivity Analysis

In this section, we examine the sensitivity of MAC with respect to the number of heads h_1 in the word attention layer and the number of heads h_2 in the document attention layer, varying both over five settings. Since AUC is less sensitive to any threshold, we report the AUC of MAC on the two datasets in Fig. 2(a) and 2(b). A common pattern in the two figures is that the performance of MAC tends to improve when we increase the number of heads h_1 in the word attention layer, while it tends to decrease when increasing h_2. This phenomenon indicates that word attention is more important than evidence attention. On Snopes, MAC has the best AUC when h_1 = 5, h_2 = 2. On PolitiFact, MAC reaches the peak when h_1 = 3, h_2 = 1.

Case Study

To understand how the multi-head attention mechanism works, we visualize the attention weights on three documents of a false claim from the testing set:
Actor Christopher Walken planning making bid US presidency 2008. Note that our MAC correctly classifies the claim as fake news. In Fig. 3 and Fig. 4, we show the claim and the visualizations of two different heads in the word attention layer. Note that Popat et al.
Figure 5: Visualization of the five attention heads in the document-level attention layer for three documents

(2018), who released the datasets, already lowercased the text and removed punctuation. To conduct a fair comparison, we directly used the datasets without any additional preprocessing. In Fig. 3, the attention weights are sparse, indicating that the first attention head focuses on the most important words which determine the credibility of the claim (e.g. hoax, false). Differently, in Fig. 4, the second attention head has more diffuse attention weights to capture more useful phrases from the documents (e.g. walken not running, its obviously not). Moving on to the attention heads in the evidence attention layer in Fig. 5, we show a heat map where the x-axis is the five heads extracted from the evidence attention layer and the y-axis is the three documents relevant to the same claim as in Fig. 3 and 4. As we can see in Fig. 5, Head 1, Head 3 and
Head 5 emphasize
Doc 3, which contains refuting phrases (e.g. its obviously not), while
Head 4 focuses on
Doc 1, which has negating information such as walken not running. Both
Doc 1 and
Doc 3 have crucial signals to fact-check the claim. From these analyses, we conclude that the heads in the word attention layer capture different semantic contributions of words, and the different heads in the document attention layer capture important documents.
Conclusion

In this paper, we propose a novel evidence-aware model to fact-check textual claims. Our MAC is designed by hierarchically stacking two attention layers: the first is a word attention layer and the second is a document attention layer. In both layers, we propose multi-head attention mechanisms to capture different semantic contributions of words and documents. Our MAC outperforms the baselines significantly, with an average increase of 6% to 9% over the best baseline results and a maximum improvement of 18%. We conduct ablation studies to understand the performance of MAC and provide a case study to show the effectiveness of the attention mechanisms. In future work, we will examine other data types, such as images, to improve the performance of our model.
Acknowledgment
This work was supported in part by NSF grant CNS-1755536, AWS Cloud Credits for Research, and Google Cloud. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
References
Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31(2):211–236.

Aparna Alluri. 2019. Whatsapp: The 'black hole' of fake news in India's election.

Ashoka. 2020. Misinformation spreads faster than coronavirus: How a social organization in Turkey is fighting fake news. https://bit.ly/36qqmmH.

Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, pages 675–684. ACM.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Neural Information Processing Systems.

CNN. 2020. How Facebook is combating spread of covid-19 misinformation. https://cnn.it/3gjtBkg.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks, pages 799–804. Springer.

Aditi Gupta, Hemank Lamba, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. Faking Sandy: Characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web, pages 729–736.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Shimon Kogan, Tobias J. Moskowitz, and Marina Niessner. 2019. Fake news: Evidence from financial markets. Available at SSRN 3237763.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In The 5th International Conference on Learning Representations.

Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, Rui Fang, and Sameena Shah. 2015. Real-time rumor debunking on Twitter. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1867–1870. ACM.

Yang Liu and Yi-Fang Brook Wu. 2018. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In Thirty-Second AAAI Conference on Artificial Intelligence.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing.

Jing Ma, Wei Gao, Shafiq Joty, and Kam-Fai Wong. 2019. Sentence-level evidence embedding for claim verification with hierarchical attention networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2561–2571.

Rahul Mishra and Vinay Setty. 2019. SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 197–204.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22–32.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937.

Kai Shu, Limeng Cui, Suhang Wang, Dongwon Lee, and Huan Liu. 2019. dEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 395–405.

Kai Shu, Deepak Mahudeswaran, Suhang Wang, and Huan Liu. 2020. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 626–637.

Amir Soleimani, Christof Monz, and Marcel Worring. 2020. BERT for evidence retrieval and claim verification. In European Conference on Information Retrieval, pages 359–366. Springer.

Mark Stencel. 2019. Number of fact-checking outlets surges to 188 in more than 60 countries. https://bit.ly/36y3S3l.

James Thorne and Andreas Vlachos. 2017. An extensible framework for verification of numerical claims. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 37–40.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Andreas Vlachos and Sebastian Riedel. 2015. Identification and verification of simple claims about statistical properties. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2596–2601. Association for Computational Linguistics.

Nguyen Vo and Kyumin Lee. 2018. The rise of guardians: Fact-checking URL recommendation to combat fake news. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 275–284.

Nguyen Vo and Kyumin Lee. 2019. Learning from fact-checkers: Analysis and generation of fact-checking language. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–344.

Nguyen Vo and Kyumin Lee. 2020a. Standing on the shoulders of guardians: Novel methodologies to combat fake news. In Disinformation, Misinformation, and Fake News in Social Media, pages 183–210. Springer.

Nguyen Vo and Kyumin Lee. 2020b. Where are the facts? Searching for fact-checked information to alleviate the spread of fake news. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7717–7731.

William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426.

Xuezhi Wang, Cong Yu, Simon Baumgartner, and Flip Korn. 2018. Relevant document discovery for fact-checking articles. In Companion Proceedings of The Web Conference 2018, pages 525–533.

Liang Wu and Huan Liu. 2018. Tracing fake-news footprints: Characterizing social media messages by how they propagate. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 637–645.

Lianwei Wu, Yuan Rao, Xiong Yang, Wanzhen Wang, and Ambreen Nazir. 2020. Evidence-aware hierarchical interactive attention networks for explainable claim verification. In International Joint Conferences on Artificial Intelligence.

Di You, Nguyen Vo, Kyumin Lee, and Qiang Liu. 2019. Attributed multi-relational attention network for fact-checking URL recommendation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1471–1480.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062.

Zhe Zhao, Paul Resnick, and Qiaozhu Mei. 2015. Enquiring minds: Early detection of rumors in social media from enquiry posts. In