Multi-Reward Reinforced Summarization with Saliency and Entailment
Ramakanth Pasunuru and Mohit Bansal
UNC Chapel Hill
{ram, mbansal}@cs.unc.edu

Abstract
Abstractive text summarization is the task of compressing and rewriting a long document into a short summary while maintaining saliency, directed logical entailment, and non-redundancy. In this work, we address these three important aspects of a good summary via a reinforcement learning approach with two novel reward functions: ROUGESal and Entail, on top of a coverage-based baseline. The ROUGESal reward modifies the ROUGE metric by up-weighting the salient phrases/words detected via a keyphrase classifier. The Entail reward gives high (length-normalized) scores to logically-entailed summaries using an entailment classifier. Further, we show superior performance improvement when these rewards are combined with traditional metric (ROUGE) based rewards, via our novel and effective multi-reward approach of optimizing multiple rewards simultaneously in alternate mini-batches. Our method achieves the new state-of-the-art results (including human evaluation) on the CNN/Daily Mail dataset as well as strong improvements in a test-only transfer setup on DUC-2002.
1 Introduction

Abstractive summarization, the task of generating a natural short summary of a long document, is more challenging than the extractive paradigm, which only involves selection of important sentences or grammatical sub-sentences (Jing, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015). The advent of sequence-to-sequence deep neural networks and large human summarization datasets (Hermann et al., 2015; Nallapati et al., 2016) made the abstractive summarization task more feasible and accurate, with recent ideas ranging from copy-pointer mechanisms and redundancy coverage to metric-reward based reinforcement learning (Rush et al., 2015; Chopra et al., 2016; Ranzato et al., 2015; Nallapati et al., 2016; See et al., 2017).

A good abstractive summary requires several important properties: e.g., it should choose the most salient information from the input document, be logically entailed by it, and avoid redundancy. Coverage-based models address the latter redundancy issue (Suzuki and Nagata, 2016; Nallapati et al., 2016; See et al., 2017), but there is still a lot of scope to teach current state-of-the-art models about saliency and logical entailment. Towards this goal, we improve the task of abstractive summarization via a reinforcement learning approach with the introduction of two novel rewards, 'ROUGESal' and 'Entail', and also demonstrate that these saliency and entailment skills allow for better generalizability and transfer.

Our ROUGESal reward gives higher weight to the important, salient words in the summary, in contrast to the traditional ROUGE metric, which gives equal weight to all tokens. These weights are obtained from a novel saliency scorer, which is trained on a reading comprehension dataset's answer spans to give a saliency-based probability score to every token in the sentence. Our Entail reward gives higher weight to summaries whose sentences logically follow from the ground-truth summary. Further, we also add a length normalization constraint to our Entail reward, importantly to avoid misleadingly high entailment scores for very short sentences.

Empirically, we show that our new rewards with policy gradient approaches perform significantly better than a cross-entropy based state-of-the-art pointer-coverage baseline. We show further performance improvements by combining these rewards via our novel multi-reward optimization approach, where we optimize multiple rewards simultaneously in alternate mini-batches (hence avoiding complex scaling and weighting issues in reward combination), inspired by how humans take multiple concurrent types of rewards (feedback) to learn a task. Overall, our methods achieve the new state-of-the-art (including human evaluation) on the CNN/Daily Mail dataset as well as strong improvements in a test-only transfer setup on DUC-2002. Lastly, we present several analyses of our model's saliency, entailment, and abstractiveness skills.

2 Related Work

Earlier summarization work was based on extraction and compression-based approaches (Jing, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015), with more focus on graph-based (Giannakopoulos, 2009; Ganesan et al., 2010) and discourse tree-based (Gerani et al., 2014) models.
Recent focus has shifted towards abstractive, rewriting-based summarization based on parse trees (Cheung and Penn, 2014; Wang et al., 2016), Abstract Meaning Representations (Liu et al., 2015; Dohare and Karnick, 2017), and neural network models with pointer-copy mechanisms and coverage (Rush et al., 2015; Chopra et al., 2016; Chen et al., 2016; Nallapati et al., 2016; See et al., 2017), as well as reinforce-based metric rewards (Ranzato et al., 2015; Paulus et al., 2017). We also use reinforce-based models, but with novel reward functions and a better simultaneous multi-reward optimization method.

Recognizing Textual Entailment (RTE), the task of classifying two sentences as entailment, contradiction, or neutral, has been used for Q&A and IE tasks (Harabagiu and Hickl, 2006; Dagan et al., 2006; Lai and Hockenmaier, 2014; Jimenez et al., 2014). Recent neural network models and large datasets (Bowman et al., 2015; Williams et al., 2017) enabled stronger accuracies. Some previous work (Mehdad et al., 2013; Gupta et al., 2014) has explored the use of RTE by modeling graph-based relationships between sentences to select the most non-redundant sentences for summarization. Recently, Pasunuru and Bansal (2017) improved video captioning with entailment-corrected rewards. We instead directly use multi-sentence entailment knowledge (with additional length constraints) as a separate RL reward to improve abstractive summarization, while avoiding their penalty hyperparameter tuning.

For our saliency prediction model, we make use of the SQuAD reading comprehension dataset (Rajpurkar et al., 2016), where the answer spans annotated by humans for important questions serve as an interesting and effective proxy for keyphrase-style salient information in summarization. Some related previous work has incorporated document topic/subject classification (Isonuma et al., 2017) and webpage keyphrase extraction (Zhang et al., 2004) to improve saliency in summarization. Some recent work (Subramanian et al., 2017) has also used answer probabilities in a document to improve question generation.
3 Models

Our abstractive text summarization model is a simple sequence-to-sequence single-layer bidirectional-encoder and unidirectional-decoder LSTM-RNN, with attention (Bahdanau et al., 2015), pointer-copy, and coverage mechanisms; please refer to See et al. (2017) for details.
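As a brief illustration of the pointer-copy mechanism this baseline relies on, here is a minimal PyTorch sketch of the final output distribution of a pointer-generator at one decoder step, in the spirit of See et al. (2017); the tensor names and shapes are our own illustrative assumptions, not the paper's code:

```python
import torch

def final_distribution(p_vocab, attn, p_gen, src_ids):
    """Pointer-generator output mixture for one decoder step (a sketch).
      p_vocab: (batch, vocab)   softmax over the fixed vocabulary
      attn:    (batch, src_len) attention distribution over source tokens
      p_gen:   (batch, 1)       generation probability from the decoder state
      src_ids: (batch, src_len) vocabulary ids of the source tokens
    A full implementation would extend the vocabulary with per-article
    OOV ids; we omit that here for brevity.
    """
    gen_dist = p_gen * p_vocab
    copy_dist = (1.0 - p_gen) * attn
    # Add each source token's copy probability onto its vocabulary entry.
    return gen_dist.scatter_add(1, src_ids, copy_dist)
```

The final distribution mixes generating from the vocabulary with copying from the source, which is what lets the model reproduce rare source tokens while still rewriting abstractively.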
3.1 Policy Gradient Reinforce

Traditional cross-entropy loss optimization for sequence generation has an exposure bias issue, and the model is not optimized for the evaluated metrics (Ranzato et al., 2015). Reinforce-based policy gradient approaches address both of these issues by using the model's own distribution during training and by directly optimizing the non-differentiable evaluation metrics as rewards. We use the REINFORCE algorithm (Williams, 1992; Zaremba and Sutskever, 2015) to learn a policy $p_\theta$ defined by the model parameters $\theta$ to predict the next action (word) and update its internal (LSTM) states. We minimize the loss function $L_{RL} = -\mathbb{E}_{w^s \sim p_\theta}[r(w^s)]$, where $w^s$ is the sequence of sampled words, with $w^s_t$ sampled at time step $t$ of the decoder. The derivative of this loss function, approximated using a single sample along with variance reduction via a bias estimator, is:

$$\nabla_\theta L_{RL} = -(r(w^s) - b_e) \nabla_\theta \log p_\theta(w^s) \quad (1)$$

There are several ways to calculate the baseline estimator; we employ the effective SCST approach (Rennie et al., 2016), as depicted in Fig. 1, where $b_e = r(w^a)$ is based on the reward obtained by the current model using the test-time inference algorithm, i.e., choosing the arg-max word $w^a_t$ of the final vocabulary distribution at each time step $t$ of the decoder.

[Figure 1: Our sequence generator with RL training (an LSTM decoder with a sampler path and an arg-max path, whose respective rewards feed the RL loss).]

We use the joint cross-entropy and reinforce loss so as to optimize the non-differentiable evaluation metric as reward while also maintaining the readability of the generated sentence (Wu et al., 2016; Paulus et al., 2017; Pasunuru and Bansal, 2017); this mixed loss is defined as $L_{Mixed} = \gamma L_{RL} + (1-\gamma) L_{XE}$, where $\gamma$ is a tunable hyperparameter.

3.2 Multi-Reward Optimization

Optimizing multiple rewards at the same time is important and desirable for many language generation tasks. One approach would be to use a weighted combination of these rewards, but this has the issue of finding the complex scaling and weight balance among the reward combinations. To address this issue, we instead introduce a simple multi-reward optimization approach inspired by multi-task learning, where we have different tasks that all share all model parameters while having their own optimization function (different reward functions in this case). If $r_1$ and $r_2$ are two reward functions that we want to optimize simultaneously, then we train the two loss functions of Eqn. 2 in alternate mini-batches:

$$\nabla_\theta L_{RL_1} = -(r_1(w^s) - r_1(w^a)) \nabla_\theta \log p_\theta(w^s)$$
$$\nabla_\theta L_{RL_2} = -(r_2(w^s) - r_2(w^a)) \nabla_\theta \log p_\theta(w^s) \quad (2)$$
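To make the alternating scheme concrete, here is a minimal PyTorch-style sketch of SCST training with two rewards swapped across mini-batches; the model/sampling API (`model.sample`, `model.greedy_decode`) and the reward functions are hypothetical placeholders, not the paper's actual code:

```python
import torch

def scst_loss(log_probs, sample_reward, greedy_reward):
    """Self-critical loss (Eqs. 1-2): advantage = r(w^s) - r(w^a).
    log_probs:     (batch, T) log p_theta of each sampled token
    sample_reward: (batch,) reward of the sampled sequence w^s
    greedy_reward: (batch,) reward of the greedy (arg-max) sequence w^a
    """
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * log_probs.sum(dim=1)).mean()

def train_step(model, optimizer, batch, reward_fn, gamma=0.99):
    """One mixed-loss update: L_Mixed = gamma * L_RL + (1 - gamma) * L_XE.
    `model.sample` / `model.greedy_decode` are assumed APIs."""
    sample_ids, log_probs, xe_loss = model.sample(batch)
    with torch.no_grad():
        greedy_ids = model.greedy_decode(batch)
    r_s = reward_fn(sample_ids, batch["refs"])
    r_a = reward_fn(greedy_ids, batch["refs"])
    loss = gamma * scst_loss(log_probs, r_s, r_a) + (1.0 - gamma) * xe_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Multi-reward optimization (Eq. 2): alternate rewards across mini-batches,
# e.g. rewards = [rouge_sal_reward, entail_reward] (placeholder functions):
# for step, batch in enumerate(loader):
#     train_step(model, optimizer, batch, rewards[step % len(rewards)])
```

Because each mini-batch optimizes a single reward with the shared parameters, no reward-scaling or weighting hyperparameters are needed.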
4 Rewards

ROUGE Reward

The first basic reward is based on the primary summarization metric of the ROUGE package (Lin, 2004). Similar to Paulus et al. (2017), we found that the ROUGE-L metric as a reward works better compared to ROUGE-1 and ROUGE-2 in terms of improving all the metric scores. Since these metrics are based on simple phrase matching/n-gram overlap, they do not focus on important summarization factors such as salient phrase inclusion and directed logical entailment. Addressing these issues, we next introduce two new reward functions. (For the rest of the paper, we mean ROUGE-L whenever we mention ROUGE-reward models.)
[Figure 2: Overview of our saliency predictor model (a bidirectional encoder tagging each token of an input sentence, e.g., "John is playing with a dog", as salient or not).]
Saliency Reward
ROUGE-based rewards have no knowledge about what information is salient in the summary, hence we introduce a novel reward function called 'ROUGESal', which gives higher weight to the important, salient words/phrases when calculating the ROUGE score (which by default assumes all words are equally weighted). To learn these saliency weights, we train our saliency predictor on sentence and answer-span pairs from the popular SQuAD reading comprehension dataset (Rajpurkar et al., 2016) (Wikipedia domain), where we treat the human-annotated answer spans for important questions as representative salient information in the document. As shown in Fig. 2, given a sentence as input, the predictor assigns a saliency probability to every token, using a simple bidirectional encoder with a softmax layer at every time step of the encoder hidden states to classify the token as salient or not. Finally, we use the probabilities given by this saliency prediction model as weights in the ROUGE matching formulation to achieve the final ROUGESal score (see the appendix for details about our ROUGESal weighted precision, recall, and F-1 formulations).
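A minimal PyTorch sketch of such a token-level saliency predictor follows; the layer sizes are illustrative assumptions, not the paper's reported hyperparameters:

```python
import torch
import torch.nn as nn

class SaliencyPredictor(nn.Module):
    """Token-level saliency tagger: a bidirectional LSTM encoder with a
    per-time-step softmax classifying each token as salient or not."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        self.classify = nn.Linear(2 * hidden_dim, 2)  # {not salient, salient}

    def forward(self, token_ids):
        states, _ = self.encoder(self.emb(token_ids))    # (B, T, 2H)
        probs = torch.softmax(self.classify(states), dim=-1)
        return probs[..., 1]                             # P(salient) per token
```

Such a model would be trained with token-level cross-entropy against 0/1 labels derived from SQuAD answer spans; at reward time, the per-token probabilities become the η(w) weights used in the appendix's ROUGESal formulation.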
Entailment Reward

A good summary should also be logically entailed by the given source document, i.e., contain no contradictory or unrelated information. Pasunuru and Bansal (2017) used entailment-corrected phrase-matching metrics (CIDEnt) to improve the task of video captioning; we instead directly use the entailment knowledge from an entailment scorer and its multi-sentence, length-normalized extension as our 'Entail' reward, to improve the task of abstractive text summarization. We train the entailment classifier (Parikh et al., 2016) on the SNLI (Bowman et al., 2015) and Multi-NLI (Williams et al., 2017) datasets, calculate the entailment probability score between the ground-truth (GT) summary (as premise) and each sentence of the generated summary (as hypothesis), and use the average score as our Entail reward. Finally, we add a length normalization constraint to avoid very short sentences achieving misleadingly high entailment scores:
$$\text{Entail} = \text{Entail} \times \frac{\#\text{tokens in generated summary}}{\#\text{tokens in reference summary}} \quad (3)$$

(Footnote: Since the GT summary is correctly entailed by the source document, we directly (by transitivity) use this GT as premise for easier (shorter) encoding. We also tried using the full input document as premise, but this did not perform as well, most likely because the entailment classifiers are not trained on such long premises. We also tried summary-to-summary entailment scoring (similar to ROUGE-L) as well as pairwise sentence-to-sentence avg. scoring, but we found that avg. scoring of the ground-truth summary (as premise) w.r.t. each generated summary's sentence (as hypothesis) works better; this is intuitive because each sentence in the generated summary might be a compression of multiple sentences of the GT summary or source document.)

5 Experimental Setup

5.1 Datasets

The CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) is a collection of online news articles and their summaries. We use the non-anonymized version of the dataset as described in See et al. (2017). For test-only generalization experiments, we use the DUC-2002 single-document summarization dataset. For the entailment reward classifier, we use a combination of the full Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and the recent Multi-NLI corpus (Williams et al., 2017) training datasets. For our saliency prediction model, we use the Stanford Question Answering (SQuAD) dataset (Rajpurkar et al., 2016). All dataset splits and other training details (dimension sizes, learning rates, etc.) for reproducibility are in the appendix.

5.2 Evaluation Metrics

We use the standard ROUGE package (Lin, 2004) and Meteor package (Denkowski and Lavie, 2014) for reporting the results on all of our summarization models. Following previous work (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017), we use the ROUGE full-length F1 variant.
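For illustration, scores of this kind can be computed with the `rouge-score` Python package, a reimplementation of the metric; the paper itself reports numbers from the original ROUGE package of Lin (2004), so treat this as a stand-in:

```python
from rouge_score import rouge_scorer

# Full-length F1 variants of ROUGE-1/2/L, as reported in the paper's tables.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

reference = "the quick brown fox jumps over the lazy dog"
generated = "a quick brown fox jumped over a dog"
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```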
Human Evaluation Criteria:
We also performed human evaluation of summary relevance and readability via Amazon Mechanical Turk (AMT). We selected human annotators that were located in the US and had a high approval rate with a large number of approved HITs. For the pairwise model comparisons discussed in Sec. 6, we showed the annotators the input article, the ground-truth summary, and the two model summaries (randomly shuffled to anonymize model identities); we then asked them to choose the better among the two model summaries or choose 'Non-distinguishable' if both summaries are equally good/bad. Instructions for relevance were based on the summary containing salient/important information from the given article, being correct (i.e., avoiding contradictory/unrelated information), and avoiding redundancy. Instructions for readability were based on the summary's fluency, grammaticality, and coherence.

[Table 1: Results on CNN/Daily Mail (non-anonymous). * represents previous work on the anonymous version. 'XE': cross-entropy loss, 'RL': reinforce mixed loss (XE+RL). Columns 'R': ROUGE (R-1, R-2, R-L), 'M': METEOR. Rows cover PREVIOUS WORK (Nallapati (2016)* and starred XE/RL models) and OUR MODELS (Baseline (XE) plus five reward-based (RL) models); numeric scores omitted.]
6 Results

Baseline Cross-Entropy Model Results
Our abstractive summarization model has attention, pointer-copy, and coverage mechanisms. First, we apply cross-entropy optimization and achieve comparable results on CNN/Daily Mail w.r.t. previous work (See et al., 2017).

ROUGE Reward Results
First, using ROUGE-L as the RL reward (shown as ROUGE in Table 1) improves the performance on CNN/Daily Mail in all metrics with statistically significant scores as compared to the cross-entropy baseline (and is also stat. signif. w.r.t. See et al. (2017)). Similar to Paulus et al. (2017), we use the mixed loss function (XE+RL) for all our reinforcement experiments, to ensure good readability of generated summaries.

(Footnote: Our baseline is statistically equal to the paper-reported scores of See et al. (2017) (see Table 1) on ROUGE-1 and ROUGE-2, based on the bootstrap test (Efron and Tibshirani, 1994). Our baseline is stat. significantly better in all ROUGE metrics w.r.t. the github scores (R-1: 38.82, R-2: 16.81, R-L: 35.71, M: 18.14) of See et al. (2017).)

[Table 2: ROUGE F1 full-length scores of our models on the test-only DUC-2002 generalizability setup. Rows: Baseline (XE) and two reward-based (RL) models; numeric scores omitted.]
Models | Relevance | Readability | Total
ROUGESal+Ent | 55 | 54 | 109
See et al. (2017) | 34 | 33 | 67
Non-distinguishable | 11 | 13 | 24

Table 3: Human evaluation: pairwise comparison of relevance and readability between our ROUGESal+Entail multi-reward model and See et al. (2017).
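Statistical significance throughout these results is assessed with the bootstrap test (Efron and Tibshirani, 1994); below is a minimal sketch of one common paired-bootstrap recipe, our illustration rather than the authors' script:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=10000, seed=0):
    """Paired bootstrap test: resample the test set with replacement and
    count how often system A's total score beats system B's.
    scores_a / scores_b: per-example metric scores for the two systems."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_boot  # approximate p-value for "A better than B"
```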
ROUGESal and Entail Reward Results
With our novel ROUGESal reward, we achieve stat. signif. improvements in all metrics w.r.t. the baseline as well as w.r.t. the ROUGE-reward results, showing that saliency knowledge is strongly improving the summarization model. For our Entail reward, we achieve stat. signif. improvements in ROUGE-L w.r.t. the baseline and achieve the best METEOR score by a large margin. See Sec. 7 for analysis of the saliency/entailment skills learned by our models.

Multi-Reward Results
Similar to ROUGESal, Entail is a better reward when combined with the complementary phrase-matching metric information in ROUGE; Table 1 shows that the ROUGE+Entail multi-reward combination performs stat. signif. better than the ROUGE-reward model in ROUGE-1, ROUGE-L, and METEOR, and better than the Entail-reward model in all ROUGE metrics. Finally, we combined our two rewards as ROUGESal+Entail to incorporate both saliency and entailment knowledge, and this gives the best results overall (stat. signif. in all metrics w.r.t. both the baseline and ROUGE-reward models), setting the new state-of-the-art.

Human Evaluation
Table 3 shows the MTurk anonymous human evaluation study (based on samples), where we do a pairwise comparison between our ROUGESal+Entail multi-reward model's output summaries and the See et al. (2017) summaries on CNN/Daily Mail (see setup details in Sec. 5.2). As shown, our multi-reward model is better on both relevance and readability.
Test-Only Transfer (DUC-2002) Results
Finally, we also tested our model's generalizability/transfer skills, where we take the models trained on CNN/Daily Mail and directly test them on DUC-2002 in a test-only setup. As shown in Table 2, our final ROUGESal+Entail multi-reward RL model is statistically significantly better than both the cross-entropy (pointer-generator + coverage) baseline as well as the ROUGE-reward RL model, in terms of all 4 metrics, with a large margin. This demonstrates that our ROUGESal+Entail model learned better transferable and generalizable skills of saliency and logical entailment.

(Footnote: Our last three rows in Table 1 are all stat. signif. better in all metrics compared to See et al. (2017).)

7 Output Analysis

Saliency Analysis
We analyzed the output summaries generated by See et al. (2017) and by our baseline, ROUGE-reward, and ROUGESal-reward models, using our saliency prediction model (Sec. 4) as the keyword detection classifier. We annotated the ground-truth and model summaries with this keyword classifier and computed the % match, i.e., how many salient words from the ground-truth summary were also generated in the model summary. We also used the original CNN/Daily Mail Cloze Q&A setup (Hermann et al., 2015), with the fill-in-the-blank answers treated as salient information, to score the four models. Further, we also calculated the ROUGESal scores (based on our reward formulation in Sec. 4) for the four models. All three of these saliency analysis experiments illustrate that our ROUGESal-reward model is stat. signif. better in saliency than the See et al. (2017), baseline, and ROUGE-reward models.

(Footnote: In order to select the keywords for this analysis, we used a probability threshold on the saliency classifier, chosen based on the scale of the classifier's distribution.)
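A minimal sketch of the %-match computation described above; the threshold value here is a placeholder for the distribution-based value chosen in the paper:

```python
def salient_match(ref_tokens, out_tokens, saliency, thresh=0.5):
    """% of salient reference words also present in the model summary.
    saliency: callable mapping a token to its predicted P(salient);
    thresh: placeholder cutoff for calling a token a keyword."""
    salient = {w for w in ref_tokens if saliency(w) > thresh}
    if not salient:
        return 0.0
    return 100.0 * len(salient & set(out_tokens)) / len(salient)
```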
Entailment Analysis

We also analyzed the entailment scores of the generated summaries from See et al. (2017) and from our baseline, ROUGE-reward, and Entail-reward models. We observe that our Entail-reward model achieves stat. significantly better entailment scores w.r.t. all of the other three models.

(Footnote: Based on the average entailment score of the ground-truth summary w.r.t. the output summary's sentences (see Sec. 4); similar trends hold for document-to-summary entailment scores.)

[Table 4 rows (novel n-gram percentages for 2/3/4-grams): See et al. (2017): 2.24 / 6.03 / 9.72; Baseline (XE) and our three (RL) models: scores omitted.]
Table 4: Abstractiveness: novel n-gram percentage.
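The measure reported in Table 4 and analyzed next can be sketched as follows (our rendition of the novel n-gram percentage of See et al. (2017); whether duplicate n-grams are counted is our simplification):

```python
def novel_ngram_pct(summary_tokens, article_tokens, n=3):
    """% of summary n-grams that do not appear in the source article,
    a proxy for abstractiveness (See et al., 2017), sketched with sets."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    s, a = grams(summary_tokens), grams(article_tokens)
    return 100.0 * len(s - a) / max(len(s), 1)
```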
Abstractiveness Analysis

In order to measure the abstractiveness of our models, we followed the 'novel n-gram counts' approach suggested in See et al. (2017). First, we found that all our reward-based RL models have significantly more novel n-grams than our cross-entropy baseline (see Table 4). Next, the Entail-reward model maintains statistically equal abstractiveness w.r.t. the ROUGE-reward model, likely because it encourages rewriting to create logical subsets of information, while the ROUGESal-reward model does a bit worse, probably because it focuses on copying more salient information (e.g., names). Compared to previous work (See et al., 2017), our Entail-reward and ROUGE-reward models achieve statistically significant improvements, while ROUGESal is comparable.

8 Conclusion

We presented a summarization model trained with novel RL reward functions to improve the saliency and directed logical entailment aspects of a good summary. Further, we introduced the novel and effective multi-reward approach of optimizing multiple rewards simultaneously in alternate mini-batches. We achieve the new state-of-the-art on CNN/Daily Mail and also strong test-only improvements on a DUC-2002 transfer setup.
Acknowledgments
We thank the reviewers for their helpful comments. This work was supported by DARPA (YFA17-D17AP00022), a Google Faculty Research Award, a Bloomberg Data Science Research Grant, and NVidia GPU awards. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the funding agency.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In IJCAI.

Jackie Chi Kit Cheung and Gerald Penn. 2014. Unsupervised sentence enhancement for automatic summarization. In EMNLP, pages 775–786.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In HLT-NAACL.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, Springer, pages 177–190.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In EACL.

Shibhansh Dohare and Harish Karnick. 2017. Text summarization using abstract meaning representation. arXiv preprint arXiv:1706.01678.

Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP, pages 360–368.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, ACL, pages 340–348.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In EMNLP, volume 14, pages 1602–1613.

George Giannakopoulos. 2009. Automatic summarization from multiple documents. Ph.D. dissertation.

Anand Gupta, Manpreet Kaur, Adarsh Singh, Aseem Goel, and Shachar Mirkin. 2014. Text summarization through entailment-based minimum vertex cover. In *SEM 2014 (Lexical and Computational Semantics), page 75.

Sanda Harabagiu and Andrew Hickl. 2006. Methods for using textual entailment in open-domain question answering. In ACL, pages 905–912.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693–1701.

Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In EMNLP, pages 2091–2100.

Sergio Jimenez, George Duenas, Julia Baquero, and Alexander Gelbukh. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In SemEval, pages 732–742.

Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In ANLP.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proc. SemEval.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In NAACL: HLT, pages 1077–1086.

Yashar Mehdad, Giuseppe Carenini, Frank W. Tompa, and Raymond T. Ng. 2013. Abstractive meeting summarization with entailment and fusion. In Proc. of the 14th European Workshop on Natural Language Generation, pages 136–146.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.

Ramakanth Pasunuru and Mohit Bansal. 2017. Reinforced video captioning with entailment rewards. In EMNLP.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. In ICLR.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2016. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In CoRR.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.

Sandeep Subramanian, Tong Wang, Xingdi Yuan, and Adam Trischler. 2017. Neural models for key phrase detection and question generation. arXiv preprint arXiv:1706.04560.

Jun Suzuki and Masaaki Nagata. 2016. RNN-based encoder-decoder approach with word frequency estimation. In EACL.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2016. A sentence compression based framework to query-focused multi-document summarization. arXiv preprint arXiv:1606.07548.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Yonghui Wu, Mike Schuster, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Wojciech Zaremba and Ilya Sutskever. 2015. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521.

Yongzheng Zhang, Nur Zincir-Heywood, and Evangelos Milios. 2004. World Wide Web site summarization. Web Intelligence and Agent Systems: An International Journal.
Supplementary Material
A.1 Saliency Rewards
Here, we describe the ROUGE-L formulation at summary level and then describe how we incorporate saliency information into it. Given a reference summary of $u$ sentences containing a total of $m$ tokens ($\{w_{r,k}\}_{k=1}^{m}$) and a generated summary of $v$ sentences with a total of $n$ tokens ($\{w_{c,k}\}_{k=1}^{n}$), let $r_i$ be a reference summary sentence and $c_j$ be a generated summary sentence. Then, the precision ($P_{lcs}$), recall ($R_{lcs}$), and F-score ($F_{lcs}$) for ROUGE-L are defined as follows:

$$P_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{n} \quad (4)$$
$$R_{lcs} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{m} \quad (5)$$
$$F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}} \quad (6)$$

where $LCS_{\cup}$ takes the union Longest Common Subsequence (LCS) between a reference summary sentence $r_i$ and every generated summary sentence $c_j$ ($c_j \in C$), and $\beta$ is defined in Lin (2004). In the above ROUGE-L scores, we assume that every token has equal weight, i.e., 1. However, every summary has salient tokens which should be rewarded with more weight. Hence, we use the weights obtained from our novel saliency predictor to modify the ROUGE-L scores with salient information as follows:

$$P^{s}_{lcs} = \frac{\sum_{i=1}^{u} LCS^{*}_{\cup}(r_i, C)}{\sum_{k=1}^{n} \eta(w_{c,k})} \quad (7)$$
$$R^{s}_{lcs} = \frac{\sum_{i=1}^{u} LCS^{*}_{\cup}(r_i, C)}{\sum_{k=1}^{m} \eta(w_{r,k})} \quad (8)$$
$$F^{s}_{lcs} = \frac{(1 + \beta^2)\, R^{s}_{lcs} P^{s}_{lcs}}{R^{s}_{lcs} + \beta^2 P^{s}_{lcs}} \quad (9)$$

where $\eta(w)$ is the weight assigned by the saliency predictor to token $w$, and $\beta$ is defined in Lin (2004). Let $\{w_k\}_{k=1}^{p}$ be the union LCS set; then $LCS^{*}_{\cup}(r_i, C)$ is defined as follows:

$$LCS^{*}_{\cup}(r_i, C) = \sum_{k=1}^{p} \eta(w_k) \quad (10)$$

(Footnote: If a token is repeated multiple times in the input sentence, we average the probabilities of those instances.)
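A minimal Python sketch of this weighted, summary-level ROUGE-L follows; it implements Eqs. 7-10 under our own reading, with `beta=1.2` as a placeholder since the paper defers to Lin (2004) for that value:

```python
def lcs_match_indices(ref, cand):
    """Indices of `ref` tokens on a Longest Common Subsequence with `cand`."""
    m, n = len(ref), len(cand)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    idx, i, j = set(), m, n
    while i and j:                       # backtrack one optimal LCS
        if ref[i - 1] == cand[j - 1]:
            idx.add(i - 1); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return idx

def rouge_sal_f(ref_sents, cand_sents, eta, beta=1.2):
    """Saliency-weighted summary-level ROUGE-L (Eqs. 7-10), a sketch.
    eta: token -> saliency weight eta(w) from the saliency predictor."""
    weighted_hits = 0.0
    for ref in ref_sents:
        union = set()
        for cand in cand_sents:          # union LCS over candidate sentences
            union |= lcs_match_indices(ref, cand)
        weighted_hits += sum(eta(ref[k]) for k in union)   # LCS*_union (Eq. 10)
    p = weighted_hits / (sum(eta(w) for c in cand_sents for w in c) + 1e-12)
    r = weighted_hits / (sum(eta(w) for s in ref_sents for w in s) + 1e-12)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p + 1e-12)
```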
A.2 Experimental Setup

A.2.1 Datasets

CNN/Daily Mail Dataset
The CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) is a collection of online articles and their summaries. The summaries are based on the human-written highlights of these articles. The dataset has 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs. We use the non-anonymized version of the dataset as described in See et al. (2017).
We use the DUC-2002 single-document summarization dataset as a test-only setup, where we directly take the pretrained models trained on the CNN/Daily Mail dataset and test them on DUC-2002, in order to check our model's domain transfer capabilities. This corpus consists of 567 documents with one or two human-annotated reference summaries.
We use the full Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and the recent Multi-NLI corpus (Williams et al., 2017) for building our entailment classifier. We use the standard splits, following previous work.
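As an illustration of how the Entail reward of Sec. 4 can be computed once such a classifier is trained, here is a minimal sketch; `entail_prob` is an assumed callable (not the paper's API), and the `min(1, ...)` cap on the length ratio is our reading of Eq. 3:

```python
def entail_reward(gen_sents, ref_summary, entail_prob):
    """Length-normalized Entail reward (Sec. 4, Eq. 3), sketched.
    entail_prob: assumed callable returning P(entailment) for a
    (premise, hypothesis) pair from a pretrained NLI classifier."""
    if not gen_sents:
        return 0.0
    # Average entailment of each generated sentence (hypothesis)
    # against the ground-truth summary (premise).
    entail = sum(entail_prob(ref_summary, s) for s in gen_sents) / len(gen_sents)
    # Scale by the generated/reference length ratio so that very short
    # outputs cannot get misleadingly high scores.
    gen_len = sum(len(s.split()) for s in gen_sents)
    ref_len = max(len(ref_summary.split()), 1)
    return entail * min(1.0, gen_len / ref_len)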
SQuAD Dataset
We use the Stanford Question Answering Dataset (SQuAD) for our saliency prediction model. We process the SQuAD dataset to collect sentences and their corresponding salient-phrase pairs. Here again, we use the standard split, following previous work.
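A small sketch of how sentence/answer-span pairs could be turned into per-token training labels for the saliency predictor; the token-index span format is our assumption (SQuAD provides character offsets, and the conversion is assumed to happen upstream):

```python
def span_labels(sentence_tokens, answer_spans):
    """0/1 saliency labels for one sentence from SQuAD answer spans.
    answer_spans: inclusive (start, end) token-index pairs; tokens inside
    any human-annotated answer span are labeled salient (1)."""
    labels = [0] * len(sentence_tokens)
    for start, end in answer_spans:
        for i in range(start, min(end + 1, len(sentence_tokens))):
            labels[i] = 1
    return labels
```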
A.2.2 Training Details
During training, all our LSTM-RNNs are set with a hidden state size of 256. We use a vocabulary size of 50k, where word embeddings are represented in 128 dimensions, and both the encoder and the decoder share the same embedding for each word. We encode the source document using a 400-time-step unrolled LSTM-RNN and use a 100-time-step unrolled LSTM-RNN for the decoder. We clip the gradients to a maximum gradient norm value of 2.0 and use the Adam optimizer (Kingma and Ba, 2015), with separate learning rates for the pointer baseline, for training along with the coverage loss, and for reinforcement learning. Following See et al. (2017), we add the coverage mechanism to a converged pointer model. For mixed-loss (XE+RL) optimization, we use a separately tuned γ value for each of the ROUGE, Entail, ROUGE+Entail, ROUGESal, and ROUGESal+Entail rewards. For reinforcement learning, we only use a small fraction of the training samples to speed up convergence, but we found this to work well in practice. During inference time, we use beam search.

A.3 Results

A.3.1 Saliency and Entailment Scorer

Models | Accuracy
Entailment Classifier | 74.50%
Saliency Predictor | 16.87%

Table 5: Performance of our entailment classifier and saliency predictor.