DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention
Aniketh Janardhan Reddy*
Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA
[email protected]
Gil Rocha
LIACC/DEI, Faculdade de Engenharia, Universidade do Porto, Portugal
[email protected]
Diego Esteves
SDA Research, University of Bonn, Bonn, Germany
[email protected]
Abstract
In this paper, we describe DeFactoNLP, the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency-Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to those of the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information to assess the veracity of the claim. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score.

* Work was completed while the author was a student at the Birla Institute of Technology and Science, India and was interning at SDA Research.
https://github.com/DeFacto/DeFactoNLP
The scores and ranks reported in this paper are provisional and were determined prior to any human evaluation of those evidences that were retrieved by the proposed systems but were not identified in the previous rounds of annotation. The organizers of the task plan to update these results after an additional round of annotation.
The FEVER score measures the fraction of claims for which both the correct label and at least one complete set of evidences have been retrieved by the fact verification system.
1 Introduction

Given the current trend of massive fake news propagation on social media, the world is desperately in need of automated fact checking systems. Automatically determining the authenticity of a fact is a challenging task that requires the collection and assimilation of a large amount of information. To perform the task, a system is required to find relevant documents, detect and label evidences, and finally output a score which represents the truthfulness of the given claim. The numerous design challenges associated with such systems are discussed by Thorne and Vlachos (2018) and Esteves et al. (2018).

The Fact Extraction and Verification (FEVER) dataset (Thorne et al., 2018) is the first publicly available large-scale dataset designed to facilitate the training and testing of automated fact verification systems. The FEVER 2018 Shared Task required us to design such systems using this dataset. The organizers provided us with a preprocessed version of the June 2017 Wikipedia dump in which the pages only contained the introductory sections of the respective Wikipedia pages. Given a claim, we were asked to build systems which could determine whether there are sentences supporting the claim (labelled "SUPPORTS") or sentences refuting it (labelled "REFUTES"). If conclusive evidence either supporting or refuting the claim could not be found in the dump, the system should report the same (labelled "NOT ENOUGH INFO"). However, if conclusive evidence was found, the system should also retrieve the sentences which either support or refute the claim.
2 Our Approach

Our approach has four main steps: Relevant Document Retrieval, Relevant Sentence Retrieval, Textual Entailment Recognition, and Final Scoring and Classification. Given a claim, Named Entity Recognition (NER) and TFIDF vector comparison are first used to retrieve the relevant documents and sentences, as delineated in Section 2.1. The relevant sentences are then supplied to the textual entailment recognition module (Section 2.2), which returns a set of probabilities. Finally, a Random Forest classifier (Breiman, 2001) is employed to assign a label to the claim using certain features derived from the probabilities returned by the entailment model, as detailed in Section 2.3. The proposed architecture is depicted in Figure 1.

[Figure 1: The main steps of our approach.]

2.1 Relevant Document and Sentence Retrieval

We used two methods to identify which Wikipedia documents may contain relevant evidences. Information about the NEs mentioned in a claim can be helpful in determining the claim's veracity. In order to retrieve the Wikipedia documents which describe them, the first method initially uses the Conditional Random Fields-based Stanford NER software (Finkel et al., 2005) to recognize the NEs mentioned in the claim. Then, for every NE which is recognized, it finds the document whose name has the least Levenshtein distance (Levenshtein, 1966) to that of the NE. Hence, we obtain a set of documents which contain information about the NEs mentioned in a claim. Since all of the sentences in such documents might aid the verification, they are all returned as possible evidences.

The second method used to retrieve candidate evidences is identical to that used in the baseline system (Thorne et al., 2018) and is based on the rationale that sentences which contain terms similar to those present in the claim are likely to help the verification process. Directly evaluating all of the sentences in the dump is computationally expensive. Hence, the system first retrieves the five most similar documents based on the cosine similarity between binned unigram and bigram TFIDF vectors of the documents and the claim using the DrQA system (Chen et al., 2017). Of all the sentences present in these documents, the five most similar sentences based on the cosine similarity between the binned bigram TFIDF vectors of the sentences and the claim are finally chosen as possible sources of evidence. The number of documents and sentences chosen is based on the analysis presented in the aforementioned work by Thorne et al. (2018).

The sets of sentences returned by the two methods are combined and fed to the textual entailment recognition module described in Section 2.2.
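The following is a minimal sketch of the two retrieval strategies, not the system's actual code: the NER step and the Wikipedia title index are assumed to be given, and scikit-learn's TfidfVectorizer stands in for DrQA's binned TFIDF index; docs_for_named_entities and top_k_sentences are illustrative names.

```python
# A sketch of Section 2.1 under the assumptions stated above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def docs_for_named_entities(named_entities, titles):
    """Map each recognized NE to the document title with the least edit distance."""
    return {ne: min(titles, key=lambda t: levenshtein(ne, t)) for ne in named_entities}

def top_k_sentences(claim, sentences, k=5):
    """Rank candidate sentences by TFIDF cosine similarity to the claim."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform([claim] + sentences)
    sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    order = sims.argsort()[::-1][:k]
    return [sentences[i] for i in order]
```

In the actual system, the document-level ranking is performed by DrQA over the whole dump and only bigram TFIDF is used at the sentence level; the sketch conflates the two stages for brevity.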
2.2 Textual Entailment Recognition

Recognizing Textual Entailment (RTE) is the process of determining whether a text fragment (Hypothesis H) can be inferred from another fragment (Text T) (Sammons et al., 2012). The RTE module receives the claim and the set of possible evidential sentences from the previous step. Let there be n possible sources of evidence for verifying a claim. For the i-th possible evidence, let s_i denote the probability of it entailing the claim, let r_i denote the probability of it contradicting the claim, and let u_i be the probability of it being uninformative. The RTE module calculates each of these probabilities.

The SNLI corpus (Bowman et al., 2015) is used for training the RTE model. This corpus is composed of sentence pairs ⟨T, H⟩, where T corresponds to the literal description of an image and H is a manually created sentence. If H can be inferred from T, the "Entailment" label is assigned to the pair. If H contradicts the information in T, the pair is labelled as "Contradiction". Otherwise, the label "Neutral" is assigned.

We chose to employ the state-of-the-art RTE model proposed by Peters et al. (2018), which is a re-implementation of the widely used decomposable attention model developed by Parikh et al. (2016). The model achieves an accuracy of 86.4% on the SNLI test set. We selected it because, at the time of development of this work, it was one of the best performing systems on the task with publicly available code. Additionally, it does not require preprocessing with parsing tools and is faster to train than the other approaches we tried.

Although the model achieved good scores on the SNLI dataset, we noticed that it does not generalize well when employed to predict the relationships between the candidate claim-evidence pairs present in the FEVER data. In order to improve the generalization capabilities of the RTE model, we decided to fine-tune it using a newly synthesized FEVER SNLI-style dataset, following a transfer learning approach (Pratt and Jennings, 1996). This was accomplished in two steps: the RTE model was initially trained using the SNLI dataset and then re-trained using the FEVER SNLI-style dataset.

The FEVER SNLI-style dataset was created using the information present in the FEVER dataset while retaining the format of the SNLI dataset. Let us consider each learning instance in the FEVER dataset to be of the form ⟨c, l, E⟩, where c is the claim, l ∈ {SUPPORTS, REFUTES, NOT ENOUGH INFO} is the label and E is the set of evidences. While constructing the FEVER SNLI-style dataset, we only considered the learning instances labelled as "SUPPORTS" or "REFUTES" because these were the instances that provided us with evidences. Given such an instance, we proceeded as follows: for each evidence e ∈ E, we created an SNLI-style example ⟨c, e⟩ labelled as "Entailment" if l = "SUPPORTS" or "Contradiction" if l = "REFUTES". If e contained more than one sentence, we made a simplifying assumption and only considered the first sentence of e. For each "Entailment" or "Contradiction" example which was added to this dataset, a "Neutral" learning instance of the form ⟨c, n⟩ was also created, where n is a randomly selected sentence present in the same document from which e was retrieved. We also ensured that n was not included in any of the other evidences in E. Following this procedure, we obtain examples that are similar (retrieved from the same document) but should be labelled differently.
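The following is a minimal sketch of this conversion procedure under two assumptions that are not spelled out in the paper: that each FEVER evidence set is available as a list of (document id, sentence) pairs, and that a hypothetical helper sentences_of(doc_id) returns the sentences of a Wikipedia document.

```python
# A sketch of the FEVER -> SNLI-style conversion described above.
import random

LABEL_MAP = {"SUPPORTS": "Entailment", "REFUTES": "Contradiction"}

def to_snli_style(claim, label, evidence_sets, sentences_of, rng=random):
    """Convert one FEVER instance <c, l, E> into SNLI-style (T, H, label) triples."""
    examples = []
    if label not in LABEL_MAP:  # NOT ENOUGH INFO instances are skipped
        return examples
    # Sentences used by any evidence set; a Neutral sentence must avoid them.
    used = {sent for evidence in evidence_sets for _, sent in evidence}
    for evidence in evidence_sets:
        doc_id, sent = evidence[0]  # simplifying assumption: first sentence only
        examples.append((claim, sent, LABEL_MAP[label]))
        # Pair every example with a Neutral one drawn from the same document.
        candidates = [s for s in sentences_of(doc_id) if s not in used]
        if candidates:
            examples.append((claim, rng.choice(candidates), "Neutral"))
    return examples
```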
Split       Entailment   Contradiction   Neutral
Training    122,892      48,825          147,588
Dev         4,685        4,921           8,184
Test        4,694        4,930           8,432

Table 1: FEVER SNLI-style dataset split sizes for the Entailment, Contradiction and Neutral classes.
Model        Macro   Entailment   Contradiction   Neutral
Vanilla      0.45    0.54         0.44            0.37
Fine-tuned   0.70    0.70         0.64            0.77

Table 2: Macro and class-specific F1 scores achieved on the FEVER SNLI-style test set.
Thus, we obtained a dataset with the characteristics depicted in Table 1. To correct the unbalanced nature of the dataset, we performed random undersampling (He and Garcia, 2009). The fine-tuning had a huge positive impact on the generalization capabilities of the model, as shown in Table 2. Using the fine-tuned model, the aforementioned set of probabilities is finally computed.
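A minimal sketch of the class balancing mentioned above, assuming the dataset is held as a list of (premise, hypothesis, label) triples; undersample is an illustrative name, not the system's actual API.

```python
# A sketch of random undersampling: every class is cut down to the size of
# the smallest class, then the result is shuffled.
import random
from collections import defaultdict

def undersample(examples, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[2]].append(example)
    n = min(len(group) for group in by_label.values())  # smallest class size
    balanced = [ex for group in by_label.values() for ex in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced
```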
2.3 Final Scoring and Classification

Twelve features were derived using the probabilities computed by the RTE module. We define the following indicator variables for notational convenience:

cs_i = 1 if s_i ≥ r_i and s_i ≥ u_i, and 0 otherwise
cr_i = 1 if r_i ≥ s_i and r_i ≥ u_i, and 0 otherwise
cu_i = 1 if u_i ≥ s_i and u_i ≥ r_i, and 0 otherwise

The twelve features which were computed are:

f1 = Σ_{i=1..n} cs_i        f4 = Σ_{i=1..n} (s_i × cs_i)    f7 = max_i s_i
f2 = Σ_{i=1..n} cr_i        f5 = Σ_{i=1..n} (r_i × cr_i)    f8 = max_i r_i
f3 = Σ_{i=1..n} cu_i        f6 = Σ_{i=1..n} (u_i × cu_i)    f9 = max_i u_i

f10 = f4/f1 if f1 ≠ 0, and 0 otherwise
f11 = f5/f2 if f2 ≠ 0, and 0 otherwise
f12 = f6/f3 if f3 ≠ 0, and 0 otherwise

Each of the possible evidential sentences supports a certain label more than the other labels (this can be determined by looking at the computed probabilities). The indicator variables cs_i, cr_i and cu_i are used to capture this fact. The most obvious way to label a claim would be to assign the label with the highest support to the claim. Hence, we chose to use the features f1, f2 and f3, which represent the number of possible evidential sentences which support each label. The amount of support lent to a certain label by supporting sentences could also be useful in performing the labelling. This motivated us to use the features f4, f5 and f6, which quantify the amount of support for each label. If a certain sentence can strongly support a label, it might be prudent to assign that label to the claim. Hence, we use the features f7, f8 and f9, which capture how strongly a single sentence can support the claim. Finally, we used the features f10, f11 and f12 because the average strength of the support lent by supporting sentences to a given label could also help the classifier.

These features were used by a Random Forest classifier (Breiman, 2001) to determine the label to be assigned to the claim. The classifier was composed of 50 decision trees and the maximum depth of each tree was limited to 3. Information gain was used to measure the quality of a split. 3,000 claims labelled as "SUPPORTS", 3,000 claims labelled as "REFUTES" and 4,000 claims labelled as "NOT ENOUGH INFO" were randomly sampled from the training set. Relevant sentences were then retrieved as detailed in Section 2.1 and supplied to the RTE module (Section 2.2). The probabilities calculated by this module were used to generate the aforementioned features. The classifier was then trained using these features and the actual labels of the claims.

We used the trained classifier to label the claims in the test set. If the "SUPPORTS" label was assigned to a claim, the five sentences with the highest (s_i × cs_i) products were returned as evidences. However, if cs_i = 0 for all i, then the label was changed to "NOT ENOUGH INFO" and a null set was returned as evidence. A similar process was employed when the "REFUTES" label was assigned to a claim. If the "NOT ENOUGH INFO" label was assigned, a null set was returned as evidence.
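To make the scoring step concrete, the following is a minimal sketch of the feature computation and the classifier configuration, assuming probs is an (n, 3) NumPy array whose rows hold the RTE probabilities (s_i, r_i, u_i) for the n candidate sentences; claim_features is an illustrative name, not the system's actual code.

```python
# A sketch of the twelve-feature computation (f1-f12) and the Random Forest
# configuration reported in the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def claim_features(probs: np.ndarray) -> np.ndarray:
    """Compute f1..f12 from an (n, 3) array whose columns hold s_i, r_i, u_i."""
    s, r, u = probs[:, 0], probs[:, 1], probs[:, 2]
    # Indicator variables cs_i, cr_i, cu_i (ties can set several indicators).
    cs = (s >= r) & (s >= u)
    cr = (r >= s) & (r >= u)
    cu = (u >= s) & (u >= r)
    counts = np.array([cs.sum(), cr.sum(), cu.sum()], dtype=float)     # f1-f3
    sums = np.array([(s * cs).sum(), (r * cr).sum(), (u * cu).sum()])  # f4-f6
    maxima = np.array([s.max(), r.max(), u.max()])                     # f7-f9
    # f10-f12: average support per label, defined as 0 when the count is 0.
    means = np.divide(sums, counts, out=np.zeros(3), where=counts != 0)
    return np.concatenate([counts, sums, maxima, means])

# Classifier configuration from the paper: 50 trees, depth at most 3,
# information gain (entropy) as the split criterion.
classifier = RandomForestClassifier(n_estimators=50, max_depth=3, criterion="entropy")
```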
3 Results and Discussion

Our system was evaluated using a blind test set which contained 19,998 claims. Table 3 compares the performance of our system with that of the baseline system and also lists the best performance achieved for each metric.

Metric           DeFactoNLP   Baseline   Best
Label Accuracy   0.5136       0.4884     0.6821
Evidence F1      0.4277       0.1826     0.6485
FEVER Score      0.3833       0.2745     0.6421

Table 3: System performance.
The evidence precision of our system was 0.5191 and its evidence recall was 0.3636. All of these results were obtained upon submitting our predictions to an online evaluator. DeFactoNLP had the th best evidence F1 score, the th best label accuracy and the th best FEVER score out of the 24 participating systems.

The results show that the evidence F1 score of our system is much better than that of the baseline system. However, the label accuracy of our system is only marginally better than that of the baseline, suggesting that our final classifier is not very reliable. The low label accuracy may have negatively affected the other scores. Our system's low evidence recall can be attributed to the primitive methods employed to retrieve the candidate documents and sentences. Additionally, the RTE module can only detect entailment between pairs of sentences. Hence, claims which require more than one sentence to verify them cannot be easily labelled by our system. This is another reason behind our low evidence recall, FEVER score and label accuracy. We aim to study more sophisticated ways to combine the information obtained from the RTE module in the near future.

To better assess the performance of the system, we performed a manual analysis of the predictions made by the system. We observed that for some simple claims (e.g. "Tilda Swinton is a vegan") which were labelled as "NOT ENOUGH INFO" in the gold standard, the sentence retrieval module found many sentences related to the NEs in the claim but none of them had any useful information regarding the claim object (e.g. "vegan"). In some of these cases, the RTE module would label certain sentences as either supporting or refuting the claim, even if they were not relevant to the claim. In the future, we aim to address this shortcoming by exploring triple extraction-based methods to weed out certain sentences (Gerber et al., 2015).

We also noticed that the usage of coreference in the Wikipedia articles was responsible for the system missing some evidences, as the RTE module could not accurately assess the sentences which used coreference. Employing a coreference resolution system at the article level is a promising direction to address this problem.

The incorporation of named entity disambiguation into the sentence and document retrieval modules could also boost performance. This is because we noticed that, in some cases, the system used information from unrelated Wikipedia pages whose names were similar to those of the NEs mentioned in a claim to incorrectly label it (e.g. a claim was related to the movie "Soul Food" but some of the retrieved evidences were from the Wikipedia page related to the soundtrack "Soul Food").

4 Conclusions and Future Work

In this work, we described our fact verification system, DeFactoNLP, which was designed for the FEVER 2018 Shared Task. When supplied with a claim, it makes use of NER and TFIDF vector comparison to retrieve candidate Wikipedia sentences which might help in the verification process. An RTE module and a Random Forest classifier are then used to determine the veracity of the claim based on the information present in these sentences. The proposed system achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score. After analyzing our results, we have identified many ways of improving the system in the future.
For instance, triple extraction-based methods can be used to improve the sentence retrieval component as well as to improve the identification of evidential sentences. We also wish to explore more sophisticated methods to combine the information obtained from the RTE module and employ entity linking methods to perform named entity disambiguation.
Acknowledgments
This research was partially supported by an EU H2020 grant provided for the WDAqua project (GA no. 642795) and by the DAAD under the "International promovieren in Deutschland für alle" (IPID4all) project.
References
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP, pages 632–642. The Association for Computational Linguistics.

Leo Breiman. 2001. Random forests. Mach. Learn., 45(1):5–32.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Diego Esteves, Anisa Rula, Aniketh Janardhan Reddy, and Jens Lehmann. 2018. Toward veracity assessment in RDF knowledge bases: An exploratory analysis. Journal of Data and Information Quality (JDIQ), 9(3):16.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics.

Daniel Gerber, Diego Esteves, Jens Lehmann, Lorenz Bühmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, and René Speck. 2015. DeFacto - temporal and multilingual deep fact validation. Web Semantics: Science, Services and Agents on the World Wide Web.

Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Trans. on Knowl. and Data Eng., 21(9):1263–1284.

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of NAACL: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.

Lorien Pratt and Barbara Jennings. 1996. A survey of transfer between connectionist networks. Connection Science, 8(2):163–184.

Mark Sammons, V.G. Vinod Vydiswaran, and Dan Roth. 2012. Recognizing textual entailment. In Daniel M. Bikel and Imed Zitouni, editors, Multilingual Natural Language Applications: From Theory to Practice, pages 209–258. Prentice Hall.

James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. CoRR, abs/1806.07687.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.