Entity-level Factual Consistency of Abstractive Text Summarization
Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, Bing Xiang
Amazon Web Services; Columbia University
{nanfen, rnallapa, zhiguow, cicnog, henghui, dejiaoz, mckeownk, bxiang}@amazon.com

Abstract
A key challenge for abstractive summarization is ensuring factual consistency of the generated summary with respect to the original document. For example, state-of-the-art models trained on existing datasets exhibit entity hallucination, generating names of entities that are not present in the source document. We propose a set of new metrics to quantify the entity-level factual consistency of generated summaries, and we show that the entity hallucination problem can be alleviated by simply filtering the training data. In addition, we propose adding a summary-worthy entity classification task to the training process, as well as a joint entity and summary generation approach, which yield further improvements in entity-level metrics.
1 Introduction

Many recent advances in deep neural networks have led to significant improvements in the quality of abstractive summarization (Radford et al., 2019; Gehrmann et al., 2019; Lewis et al., 2019). Despite this progress, neural text summarization still faces many limitations (Kryscinski et al., 2019), the most serious of which is the tendency of models to generate summaries that are not factually consistent with the input document; a factually consistent summary only contains statements that can be derived from the source document. Recent studies show that about 30% of the summaries generated by neural sequence-to-sequence models suffer from fact fabrication (Cao et al., 2018). Unfortunately, the widely used ROUGE score is inadequate to quantify factual consistency (Kryscinski et al., 2019).

Factual inconsistency can occur at either the entity level or the relation level. At the entity level, a model-generated summary may contain named-entities that never appeared in the source document. We call this the entity hallucination problem. For example, consider the following model-generated summary:
"People in Italy and the Netherlands are more likely to consume fewer cups of coffee than those in the UK, a study suggests."

"UK" never appeared in the input source document (taken from the test set of the XSUM dataset (Narayan et al., 2018)). In fact, the source document mentioned a study involving people in Italy and the Netherlands; "UK" was a result of model hallucination. Another type of inconsistency occurs when the entities indeed exist in the source document but the relations between them are not in the source document. This type of inconsistency is much harder to identify. Open Information Extraction (OpenIE) and dependency parsing tools have been used (Cao et al., 2018) to identify the underlying relations in a summary, but they are not yet accurate enough for practical use. Ultimately, these researchers relied on manually classifying generated summaries as faithful, fake, or unclear.

In this paper, we propose a set of simple metrics to quantify factual consistency at the entity level. We analyze the factual quality of summaries produced by the state-of-the-art BART model (Lewis et al., 2019) on three news datasets. We then propose several techniques, including data filtering, multi-task learning and joint sequence generation, to improve performance on these metrics. We leave relation-level consistency to future work.

2 Related work

Large transformer-based neural architectures combined with pre-training have set new records across many natural language processing tasks (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019). In contrast to approaches that condition generation on user-provided entities (Fan et al., 2018), our model must predict which entities are summary-worthy while generating the summary that contains them. In this view we are trying to solve a more challenging problem.

3 Entity-level factual consistency metrics

We propose three new metrics that rely on off-the-shelf tools to perform Named-Entity Recognition (NER); we use spaCy (Honnibal and Montani, 2017). We use N(t) and N(h) to denote the number of named-entities in the target (gold summary) and hypothesis (generated summary), respectively. We use N(h ∩ s) to denote the number of entities found in the generated summary that can find a match in the source document. If a named-entity in the summary consists of multiple words, we consider it a match as long as any n-gram of the named-entity can be found in the source document. This is meant to capture the situation where the named-entity can be shortened; for example, "Obama" is a match for "Barack Obama" and "Harvard" is a match for "Harvard University". When the match is at the unigram level, we make sure that it is not a stop word such as "the". We also make the match case-insensitive to accommodate casing variances.
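For concreteness, the matching rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: the function name entity_matches, the whitespace tokenization and the abbreviated stop-word list are our own choices.

```python
from typing import List

# Illustrative subset; a real implementation would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at"}

def ngrams(tokens: List[str], n: int) -> List[str]:
    """All n-grams of a token list, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def entity_matches(entity: str, source: str) -> bool:
    """Case-insensitive match: any n-gram of the entity found in the source
    counts as a match, except unigrams that are stop words."""
    source_lower = source.lower()
    tokens = entity.lower().split()
    for n in range(len(tokens), 0, -1):        # try longest n-grams first
        for gram in ngrams(tokens, n):
            if n == 1 and gram in STOP_WORDS:  # skip stop-word unigrams
                continue
            if gram in source_lower:
                return True
    return False

# "Obama" alone in the source is enough to match the entity "Barack Obama".
assert entity_matches("Barack Obama", "President Obama spoke on Monday.")
```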
Precision-source:
We propose precision-source (prec_s) to quantify the degree of hallucination with respect to the source: prec_s = N(h ∩ s) / N(h). It is simply the percentage of named-entities in the summary that can be found in the source; a low prec_s means hallucination is severe. We first evaluate the prec_s score on the ground truth summaries of the three datasets: Newsroom (Grusky et al., 2018), CNN/DailyMail (Nallapati et al., 2016) and XSUM (Narayan et al., 2018).

             Newsroom           CNNDM              XSUM
             train  val   test  train  val   test  train  val   test
prec_s (%)   90.6   90.6  90.5  96.5   96.7  96.6  79.0   79.5  79.3

Table 1: prec_s scores (%) of the ground truth summaries.

Table 1 shows that among the three datasets, the ground truth summaries in XSUM have the lowest prec_s score. This is because the ground truth summaries in the XSUM dataset often use the first sentence of the article as the summary; the source document is constructed to be the rest of the article and may not repeat the named-entities that appeared in the summary. We hypothesize that the hallucination problem is largely caused by the training data itself. Thus, we propose to perform entity-based data filtering to construct a "clean" version of these datasets, as described next.
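The metric itself is then a straightforward ratio. Below is a sketch using spaCy NER and the entity_matches helper from the previous snippet; the entity-type filter follows Appendix A.1, while the convention of returning 1.0 for summaries with no entities is our assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy English model with NER
KEEP_TYPES = {"PERSON", "FAC", "GPE", "ORG", "NORP", "LOC", "EVENT"}

def named_entities(text: str) -> list:
    """Named-entities of the types considered by the metrics (Appendix A.1)."""
    return [ent.text for ent in nlp(text).ents if ent.label_ in KEEP_TYPES]

def precision_source(summary: str, source: str) -> float:
    """prec_s = N(h ∩ s) / N(h): fraction of summary entities with a source match."""
    ents = named_entities(summary)
    if not ents:
        return 1.0  # assumed convention: no entities, nothing hallucinated
    matched = sum(entity_matches(e, source) for e in ents)
    return matched / len(ents)
```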
Entity-based data filtering: For each dataset, we apply spaCy NER on the gold summary to identify all the named-entities. We ignore certain types of entities, such as dates, times and numerals, because they tend to have large variations in representation and it is difficult to determine a match in the source document; the appendix contains more details. If any of the entities cannot find a match in the source document, we discard the sentence that contains the entity from the ground truth summary. If the ground truth summary consists of only one sentence and it needs to be discarded, we remove the document-summary pair from the dataset. This way, we ensure that our filtered dataset does not contain hallucination of entities (prec_s = 1) in the ground truth summary. The dataset sizes before and after filtering are shown in Table 2. About a third of the examples are filtered out for XSUM; again, this is because of the way the XSUM dataset is constructed, as mentioned in the previous paragraph. As we shall see in Table 3, entity-based data filtering reduces hallucination of the trained model, and the effect is especially significant on the XSUM dataset.
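The filtering procedure can be sketched as follows, reusing the helpers above and spaCy's sentence segmentation; filter_example is a hypothetical name, and the released implementation may differ in details.

```python
from typing import Optional, Tuple

def filter_example(source: str, summary: str) -> Optional[Tuple[str, str]]:
    """Drop gold-summary sentences containing an entity with no source match;
    drop the whole document-summary pair if no sentence survives."""
    kept = []
    for sent in nlp(summary).sents:  # spaCy sentence segmentation
        ents = named_entities(sent.text)
        if all(entity_matches(e, source) for e in ents):
            kept.append(sent.text)
    if not kept:
        return None  # remove the pair from the dataset
    return source, " ".join(kept)
```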
Precision-target and recall-target:
Although the precision-source (prec_s) metric quantifies the degree of entity hallucination with respect to the source document, it does not capture the entity-level accuracy of the generated summary with respect to the ground truth summary. To get a complete picture of the entity-level accuracy of the generated summary, we propose the precision-target (prec_t) score: prec_t = N(h ∩ t) / N(h), where N(h ∩ t) is the number of named-entities in the generated summary that can find a match in the ground truth summary; and the recall-target (recall_t) score: recall_t = N(h ∩ t) / N(t), where N(t) is the number of named-entities in the ground truth summary. We compute the F1 score as F_t = 2 · prec_t · recall_t / (prec_t + recall_t).

4 Summary-worthy entity classification

In addition to entity-based data filtering, we also explore another method to further improve summarization quality. In particular, we incorporate an additional task of classifying summary-worthy named-entities in the source document. A summary-worthy named-entity in the source document is one that appears in the ground truth summary and is thus a salient entity, worthy of inclusion in the generated summary. Intuitively, if we can identify these summary-worthy named-entities using the encoder representation, we may potentially increase the entity-level precision and recall metrics as well as the overall quality of the summary. We achieve this by adding a classification head to the encoder of BART. To prepare the classification labels, we first identify the named-entities in the ground truth summary and find the matching tokens in the source document. We then assign (B)eginning-(I)nside-(O)utside labels to each token of the source document to denote whether the token is at the beginning, inside or outside of a summary-worthy named-entity, respectively. During training, we simply add the classification loss for each token at the encoder, weighted by a hyper-parameter α, to the original sequence-to-sequence loss.

More precisely, let $\{(x^i, y^i)\}_{i=1}^{N}$ be a dataset of $N$ examples, where $x^i = x^i_1, \ldots, x^i_{ts(i)}$ are the tokens of the $i$-th source document and $y^i = y^i_1, \ldots, y^i_{tt(i)}$ are the tokens of the target (ground truth summary). The standard sequence-to-sequence training minimizes the maximum log-likelihood estimation (MLE) loss:

$$L^i_{\mathrm{MLE}}(\theta, x^i, y^i) = -\sum_{t=1}^{tt(i)} \log p_\theta\left(y^i_t \mid x^i, y^i_{<t}\right)$$

In the joint entity and summary generation (JAENS) approach, the decoder generates the summary-worthy named-entities before the summary itself, so that summary generation can attend to the generated salient entities.
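A sketch of the label construction and the combined objective is shown below. The span-alignment input, the module names and the exact loss combination are our reading of the paragraph above (with the α weighting described in Appendix A.2), not the released implementation.

```python
import torch
import torch.nn.functional as F

O, B, I = 0, 1, 2  # label ids for (O)utside, (B)eginning, (I)nside

def bio_labels(num_tokens: int, entity_spans) -> torch.Tensor:
    """One BIO label per source token; entity_spans holds (start, end) token
    index pairs (end exclusive) of summary-worthy entities that were matched
    back into the source document."""
    labels = [O] * num_tokens
    for start, end in entity_spans:
        labels[start] = B
        for i in range(start + 1, end):
            labels[i] = I
    return torch.tensor(labels)

def multitask_loss(mle_loss, encoder_states, labels, cls_head, alpha=0.3):
    """Sequence-to-sequence MLE loss plus the alpha-weighted token-level
    classification loss computed on the encoder states; cls_head can be,
    e.g., torch.nn.Linear(hidden_dim, 3). alpha = 0.3 is the value the
    appendix reports for Newsroom and CNNDM."""
    logits = cls_head(encoder_states)           # (num_tokens, 3)
    cls_loss = F.cross_entropy(logits, labels)  # token-level BIO loss
    return mle_loss + alpha * cls_loss
```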
5 Results

We use the pre-trained BART-large model in the Fairseq library (Ott et al., 2019) to fine-tune on the three summarization datasets; the appendix contains additional details of the experimental setup. Our code is available at https://github.com/amazon-research/fact-check-summarization.

In Table 3, we show the effect of the entity-based data filtering. For each dataset, we train two separate models, using the training data before and after entity-based data filtering as shown in Table 2. We evaluate both models on the "clean" test set after entity-based data filtering. We choose this filtered version of the original test set because we only want to measure entity-level consistency against the correct set of entities; using the unfiltered dataset means we could count a hallucinated entity as correct.

                 Newsroom                                        CNNDM                                        XSUM
                 train           val            test            train           val            test           train          val           test
original         922,500 (1.58)  100,968 (1.60) 100,933 (1.59)  287,112 (3.90)  13,368 (4.13)  11,490 (3.92)  203,540 (1.0)  11,301 (1.0)  11,299 (1.0)
after filtering  855,975 (1.62)  93,678 (1.64)  93,486 (1.64)   286,791 (3.77)  13,350 (3.99)  11,483 (3.77)  135,155 (1.0)  7,639 (1.0)   7,574 (1.0)

Table 2: Number of examples in the three datasets, together with the average number of sentences in the ground truth summary (in parentheses), before and after entity-based filtering.

[Table 3 (numeric values lost in this copy): Comparison of models trained using original data, with entity-based data filtering, with an additional classification task and with JAENS, for each of Newsroom, CNNDM and XSUM. Columns report ROUGE-1/2/L and the macro and micro averages of prec_s, prec_t, recall_t and F_t. Scores are all percentages, averaged over 5 runs and shown with standard deviations; numbers are bolded when significantly better, in the sense that the means are separated by at least the standard deviations. In all datasets, data filtering leads to higher prec_s scores, indicating that entity hallucination can be alleviated by this simple technique. In addition, data filtering generally improves the other entity-level metrics: prec_t, recall_t and F_t. Adding the classification task (multi-task) or JAENS on top of data filtering further improves prec_t and recall_t, and therefore the overall entity-level F_t.]

We observe improvements in prec_s across all three datasets when training on the filtered subset of the data. For example, on XSUM, prec_s increases from 93.6% to 98.2%, indicating a significant reduction in entity hallucination. In addition, entity-based data filtering generally improves the other entity-level metrics as well. Even with less training data, entity-based data filtering maintains the ROUGE scores quite well. For XSUM, about 34% of the training data is filtered out (cf. Table 2), which explains the more noticeable impact on the ROUGE scores. The results in Table 3 suggest that entity-level data filtering is a simple yet effective approach to achieving higher entity-level factual consistency as well as general summarization quality. In Table 4 we provide qualitative examples where the model trained on the original data produces hallucination and entity-level data filtering removes it.

Table 3 also shows that adding the classification task (multi-task) further increases the prec_t and recall_t metrics, and therefore the overall entity-level F_t, on top of the improvements from data filtering. Similar gains can be observed with JAENS, which outperforms the multi-task approach on the CNNDM and Newsroom datasets. This result confirms our intuition that the summaries in JAENS benefit from attending to the generated salient entities in terms of the entity-level metrics. However, the additional complexity during decoding may have hurt the ROUGE scores. For interested readers, we also evaluate the PEGASUS (Zhang et al., 2020) models on the ROUGE and entity-level metrics for these three datasets in the appendix.
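Table 3 reports both macro and micro averages of the entity-level scores. The sketch below shows one way to compute the micro-averaged target metrics, assuming that micro averaging pools the entity counts N(h ∩ t), N(h) and N(t) over the whole test set (macro averaging would instead average per-example scores); the aggregation details are our assumption.

```python
def entity_f1(prec: float, rec: float) -> float:
    """F_t = 2 * prec_t * recall_t / (prec_t + recall_t)."""
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def micro_target_scores(pairs):
    """Micro-averaged prec_t, recall_t and F_t over (generated, target) pairs,
    reusing named_entities and entity_matches from the earlier sketches."""
    n_match = n_hyp = n_tgt = 0
    for generated, target in pairs:
        hyp_ents = named_entities(generated)
        n_match += sum(entity_matches(e, target) for e in hyp_ents)  # N(h ∩ t)
        n_hyp += len(hyp_ents)                                       # N(h)
        n_tgt += len(named_entities(target))                         # N(t)
    prec_t = n_match / n_hyp if n_hyp else 1.0
    recall_t = n_match / n_tgt if n_tgt else 1.0
    return prec_t, recall_t, entity_f1(prec_t, recall_t)
```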
Accuracy of entity-level metrics: As our entity-level metrics are based on automatic NER tools and heuristic matching rules, errors in both steps can lead to inaccuracy in the metrics. By manually checking 10 random ground truth summaries together with the source documents in the validation split of the XSUM dataset, we found that all of the named-entities were correctly identified by the NER tool and the matches were correct. We therefore believe that even our current NER tool and matching rule produce high accuracy in practice.

Table 4: Generated and ground truth summary examples from the test set of XSUM. The first three columns are generated by the model trained without entity-based data filtering, with entity-based data filtering and with the additional classification task, respectively; the right column contains the ground truth summaries. The hallucinated named-entities are underscored. The proposed data filtering overcomes hallucination in these examples.

Example 1.
Before data filtering: People in Italy and the Netherlands are more likely to consume fewer cups of coffee than those in the UK, a study suggests.
After data filtering: The desire to drink coffee may be encoded in our DNA, according to scientists.
With classification: People with a particular gene are more likely to consume fewer cups of coffee, a study has suggested.
Ground truth summary: Researchers have identified a gene that appears to curb coffee consumption.

Example 2.
Before data filtering: A cathedral in Surrey is set to be restored after more than £5m was raised to pay for repairs and improvements.
After data filtering: A £7m project to save a Grade II-listed cathedral from demolition is set to go ahead.
With classification: A cathedral which has been threatened with demolition is set to be saved by a £5m fundraising campaign.
Ground truth summary: A 1960s-built cathedral that was "at serious risk of closure" has raised more than 90% of its £7m target for urgent repairs and development.

Example 3.
Before data filtering: More than 800,000 chemists in the Indian capital, Delhi, have gone on strike in protest against online drug sales.
After data filtering: More than 800,000 chemists in India will go on strike on Wednesday to protest against illegal online drug sales.
With classification: More than 800,000 chemists in India are set to go on strike on Wednesday in a row over the sale of drugs online.
Ground truth summary: At least 800,000 pharmacies in India are on a one-day strike, demanding an end to online drug sales which they say is affecting their business.

Example 4.
Before data filtering: Police officers in Pembrokeshire are to be issued with body-worn cameras.
After data filtering: Police officers in Powys are to be issued with body-worn cameras in a bid to improve transparency in the force.
With classification: Police officers in Powys are to be issued with body cameras in a bid to improve transparency in the force.
Ground truth summary: A police force has begun the rollout of body cameras for 800 officers and community support officers.

Example 5.
Before data filtering: Wales midfielder Becky Lawrence has been speaking to BBC Sport about her time as a player-manager with Melbourne City.
After data filtering: It's been a great few weeks for me as a player-manager and now I'm heading home to Wales ahead of the Cyprus Cup.
With classification: It's been a very busy few weeks for me as I'm heading home to Wales ahead of the Cyprus Cup.
Ground truth summary: I have certainly had worse 24 hours in my life than winning the Grand Final with Melbourne City and then being named in the Wales squad for the Cyprus Cup.

6 Conclusion

In this paper we study the entity-level factual consistency of state-of-the-art summarization models. We propose the precision-source score prec_s to quantify the degree of entity hallucination, and the additional metrics prec_t and recall_t to measure the entity-level accuracy of the generated summary with respect to the ground truth summary. We found that the ground truth summaries of the XSUM dataset contain a high level of entity hallucination.
We propose a simple entity-level data filtering technique to remove such hallucination from the training data. Experiments show that this data filtering leads to significant improvements in prec_s (for example, prec_s increases from below 94% to above 98% on XSUM). We further propose a multi-task learning approach and a joint sequence generation approach to further improve the entity-level metrics. Overall, combining our proposed approaches significantly reduces entity hallucination and leads to higher entity-level metrics with minimal degradation of the ROUGE scores.

References

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.

Sebastian Gehrmann, Zachary Ziegler, and Alexander Rush. 2019. Generating abstractive summaries with finetuned language models. In Proceedings of the 12th International Conference on Natural Language Generation, pages 516–522, Tokyo, Japan. Association for Computational Linguistics.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Evaluating the factual consistency of abstractive text summarization.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2019. Don't say that! Making inconsistent dialogue unlikely with unlikelihood training. CoRR, abs/1911.03860.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741, Florence, Italy. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.

Supplementary material for Entity-level Factual Consistency of Abstractive Text Summarization

A.1 Details of NER filtering

We only consider named-entities of the following types: 'PERSON' (people, including fictional), 'FAC' (buildings, airports, highways, bridges, etc.), 'GPE' (countries, cities, states), 'ORG' (companies, agencies, institutions, etc.), 'NORP' (nationalities or religious or political groups), 'LOC' (non-GPE locations, mountain ranges, bodies of water), 'EVENT' (named hurricanes, battles, wars, sports events, etc.). We ignore other types of entities such as dates, times and numerals because they tend to have large variations in representation and it is difficult to determine a match in the source document.

A.2 Details of experimental setup

We use the pre-trained BART-large model in the Fairseq library (Ott et al., 2019) to fine-tune on the three summarization datasets. In all experiments, we validate the ROUGE scores of the generated summaries on the validation split and early-stop at the epoch with the highest validation score. We use the standard learning rate of 3e-5 for fine-tuning, with a linear decay schedule and 500 warmup steps.

For Newsroom, we use 4 p3.16xlarge EC2 instances on AWS with a total of 32 Tesla V100 GPUs for fine-tuning, giving an effective batch size of 32; for XSUM, we use 1 p3.16xlarge instance with a total of 8 Tesla V100 GPUs and an update frequency of 4, giving an effective batch size of 32; for CNNDM, we use 1 p3.16xlarge instance with a total of 8 Tesla V100 GPUs, giving an effective batch size of 8.

We chose the α parameter for multi-task learning between 0.1 and 0.5 with a step of 0.05, based on ROUGE scores on the validation set. We found the best values to be 0.3, 0.3 and 0.15 for Newsroom, CNNDM and XSUM, respectively. We observe that the ROUGE and entity-level metrics on the validation and test sets are very close, with the former slightly higher.

During decoding, we use a beam size of 1 for Newsroom, 4 for CNNDM and 6 for XSUM (to be consistent with the settings in Lewis et al. (2019)). We did not use trigram blocking in beam search, as we did not see much need for this additional step.
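As an illustration of the decoding setup, the following sketch uses fairseq's Python interface for BART. The checkpoint and data paths are placeholders, and any generation arguments beyond the beam size are assumptions, since the appendix specifies only the beam sizes and the absence of trigram blocking.

```python
from fairseq.models.bart import BARTModel

# Placeholder paths for a fine-tuned checkpoint and its binarized dataset.
bart = BARTModel.from_pretrained(
    "checkpoints/",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="xsum-bin",
)
bart.eval()

sources = ["Full text of the source document ..."]
# Beam sizes from the appendix: 1 for Newsroom, 4 for CNNDM, 6 for XSUM.
summaries = bart.sample(sources, beam=6)
print(summaries[0])
```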
A.3 Evaluation of PEGASUS (Zhang et al., 2020)