Contextualized Rewriting for Text Summarization
Guangsheng Bao, Yue Zhang
School of Engineering, Westlake University
Institute of Advanced Technology, Westlake Institute for Advanced Study
{baoguangsheng, zhangyue}@westlake.edu.cn

Abstract
Extractive summarization suffers from irrelevance, redundancy and incoherence. Existing work shows that abstractive rewriting of extractive summaries can improve conciseness and readability. These rewriting systems consider extracted summaries as their only input, which is relatively focused but can lose important background knowledge. In this paper, we investigate contextualized rewriting, which ingests the entire original document. We formalize contextualized rewriting as a seq2seq problem with group alignments, introducing group tags as a solution to model the alignments, identifying extracted summaries through content-based addressing. Results show that our approach significantly outperforms non-contextualized rewriting systems without requiring reinforcement learning, achieving strong improvements on ROUGE scores upon multiple extractive summarizers.
Introduction
Extractive text summarization systems (Nallapati, Zhai, and Zhou 2017; Narayan, Cohen, and Lapata 2018; Liu and Lapata 2019) work by identifying salient text segments (typically sentences) from an input document as its summary. They have been shown to outperform abstractive systems (Rush, Chopra, and Weston 2015; Nallapati et al. 2016; Chopra, Auli, and Rush 2016) in terms of content selection and faithfulness to the input. However, extractive summarizers exhibit several limitations. First, sentences extracted from the input document tend to contain irrelevant and redundant phrases (Durrett, Berg-Kirkpatrick, and Klein 2016; Chen and Bansal 2018; Gehrmann, Deng, and Rush 2018). Second, extracted sentences can be weak in their coherence with regard to discourse relations and cross-sentence anaphora (Dorr, Zajic, and Schwartz 2003; Cheng and Lapata 2016).

To address these issues, a line of work investigates post-editing of extractive summarizer outputs. While grammar tree trimming has been considered for reducing irrelevant content within sentences (Dorr, Zajic, and Schwartz 2003), rule-based methods have also been investigated for reducing redundancy and enhancing coherence (Durrett, Berg-Kirkpatrick, and Klein 2016). With the rise of neural networks, a more recent line of work considers using abstractive models for rewriting extracted outputs sentence by sentence (Chen and Bansal 2018; Bae et al. 2019; Wei, Huang, and Gao 2019; Xiao et al. 2020). Human evaluation shows that such rewriting systems effectively improve conciseness and readability. Interestingly, existing rewriters do not improve ROUGE scores compared with their extractive baselines.

Source Document: thousands of live earthworms have been falling from the sky in norway ... a biology teacher discovered the worms on the surface of the snow while he was skiing in the mountains near bergen at the weekend ... teacher karstein erstad told norwegian news website the local ...
Gold Summary: teacher karstein erstad found thousands of live worms on top of the snow.
Extractive Summary: a biology teacher discovered the worms on the surface of the snow while he was skiing in the mountains near bergen at the weekend .
Rewritten Summary: biology teacher karstein erstad discovered the worms on the snow.

Figure 1: Example showing that contextual information can benefit summary rewriting.

Existing abstractive rewriting systems take extracted summaries as the only input. On the other hand, information from the original document can serve as useful background knowledge for inferring factual details. Take Figure 1 for example. A salient summary can be made by extracting the sentence "a biology teacher ... weekend." While a rewriter can simplify the sentence to make a better summary, it cannot provide additional details beyond the sentence unless the document context is also considered. For example, the name of the teacher is not given by the extractive summary, but we can infer that the teacher's name is "karstein erstad" from the context sentences, thereby making the summary more informative.

We propose contextualized rewriting, which uses the full input document as context for rewriting extractive summary sentences. Rather than encoding only the extractive summary, we use a neural representation model to encode the whole input document, representing the extractive summary as part of the document representation. To inform the rewriter of the current sentence being rewritten, we use
content-based addressing (Graves, Wayne, and Danihelka 2014). Specifically, as Figure 2 shows, a unique group tag is used to index each extracted sentence in the source document, matching an increasing sentence index in the abstractive rewriter as it generates the output, where the group tags 1, 2 and 3 are used to guide the first, second and third rewritten summary sentences, respectively.

Source Document: our resident coach and technical expert chris meadows has plenty of experience in the sport and has worked with some of the biggest names in golf. 1 chris has worked with more than 100,000 golfers throughout his career. growing up beside nick faldo, meadows learned that success in golf comes through developing a clear understanding of, and being committed to, your objective. a dedicated coach from an early age, he soon realized his gift was the development of others. meadows simple and holistic approach to learning has been personally shared with more than 100,000 golfers in a career spanning three decades. 2 many of his instructional books have become best-sellers, his career recently being recognized by the professional golfers' association when he was made an advanced fellow of the pga. 3 chris has been living golf's resident golf expert since 2003.
Rewritten Summary: chris meadows has worked with some of golf's big names. 1 he has personally coached more than 100,000 golfers. 2 chris was made an advanced fellow of the pga. 3

Figure 2: Example of the three-step summarization process: selecting, grouping and rewriting.

We choose the BERT (Devlin et al. 2019) base model as the document encoder, building both the extractive summarizer and the abstractive rewriter following the basic models of Liu and Lapata (2019). Our models are evaluated on the CNN/DM dataset (Hermann et al. 2015). Results show that the contextualized rewriter gives significantly improved ROUGE (Lin 2004) scores compared with a state-of-the-art extractive baseline, outperforming a traditional rewriter baseline by a large margin. In addition, our method gives better compression, lower redundancy and better coherence. The contextualized rewriter achieves strong and consistent improvements on multiple extractive summarizers. To our knowledge, we are the first to report improved ROUGE by rewriting extractive summaries. We release our code at https://github.com/baoguangsheng/ctx-rewriter-for-summ.git.
Related Work

Extractive summarizers have received constant research attention. Early approaches such as TextRank (Mihalcea and Tarau 2004) select sentences based on weighted similarities. Recently, Nallapati, Zhai, and Zhou (2017) use a neural classifier to choose sentences and a selector to rank them. Chen and Bansal (2018) use a Pointer Network (Vinyals, Fortunato, and Jaitly 2015) to extract sentences. Liu and Lapata (2019) use a linear classifier upon BERT. This method gives the current state-of-the-art result in extractive summarization, and we choose it for our baseline.

Rewriting systems manipulate extractive summaries to reduce irrelevance, redundancy and incoherence. Durrett, Berg-Kirkpatrick, and Klein (2016) use compression rules to reduce unimportant content within a sentence and apply anaphoricity constraints to improve cross-sentence coherence. Dorr, Zajic, and Schwartz (2003) trim unnecessary phrases in a sentence without hurting grammatical correctness by finding the syntactic structures of sentences. In contrast to their work, we consider neural abstractive rewriting, which can address all of the above issues more systematically.

Recently, neural rewriting has attracted much research attention. Chen and Bansal (2018) use a seq2seq model with the copy mechanism (See, Liu, and Manning 2017) to rewrite extractive summaries sentence by sentence. A reranking post-process is applied to avoid repetition, and the extractive model is also tuned by reinforcement learning with reward signals from each rewritten sentence. Bae et al. (2019) use a similar strategy, but with a BERT document encoder and reward signals from the whole summary. Wei, Huang, and Gao (2019) use a binary classifier upon a BERT document encoder to select sentences, and a Transformer decoder (Vaswani et al. 2017) with the copy mechanism to generate summary sentences. Xiao et al. (2020) build a hierarchical representation of the input document, designing a pointer network and a copy-or-rewrite mechanism to choose sentences for copying or rewriting, followed by a vanilla seq2seq model as the rewriter; the model decisions on sentence selecting, copying and rewriting are tuned by reinforcement learning. Compared with these methods, our method is computationally simpler, since it uses neither reinforcement learning nor the copy mechanism, which most of the methods above rely on. In addition, as mentioned earlier and in contrast to these methods, we consider rewriting with a document-level context, and therefore can potentially improve details and factual faithfulness.

Some hybrid extractive and abstractive summarization models are also in line with our work. Cheng and Lapata (2016) use a hierarchical encoder for extracting words, constraining a conditioned language model to generate fluent summaries. Gehrmann, Deng, and Rush (2018) consider a bottom-up method, using a neural classifier to select important words from the input document, and informing an abstractive summarizer by restricting the copy source in a pointer-generator network to the selected content. Similar to our work, they use extracted content to guide the abstractive summary. However, different from their work, which focuses on the word level, we investigate sentence-level constraints for guiding abstractive rewriting.

Our method can also be regarded as using group tags to guide the reading context during abstractive summarization (Rush, Chopra, and Weston 2015; Nallapati et al. 2016; See, Liu, and Manning 2017), where the group tags are obtained using an extractive summary.
Compared with vanilla abstractive summarization, the advantages are three-fold. First, extractive summaries can guide the abstractive summarizer with more salient information. Second, the training difficulty of the abstractive model can be reduced when important contents are marked as inputs. Third, the summarization procedure is made more interpretable by associating a crucial source sentence with each target sentence.

Figure 3: Architecture of the contextualized rewriter. The group tag embeddings are tied between the encoder (left) and the decoder (right), through which the decoder can address the corresponding tokens in the document.
Seq2seq with Group Alignments
As a key contribution of our method, we model contextualized rewriting as a seq2seq mapping problem with group alignments. For an input sequence $X$ and an output sequence $Y$, a group set $G$ describes a set of segment-wise alignments between $X$ and $Y$. The mapping problem is defined as finding the estimation

$\hat{Y} = \arg_Y \max_{Y,G} P(Y, G \mid X),$   (1)

where

$X = \{w_i\}\big|_{i=1}^{|X|}, \quad Y = \{w_j\}\big|_{j=1}^{|Y|}, \quad G = \{G_k\}\big|_{k=1}^{|G|},$   (2)

in which $|X|$ denotes the number of elements in $X$, $|Y|$ the number of elements in $Y$, and $|G|$ the number of groups. Each group $G_k$ denotes a pair of text segments, one from $X$ and one from $Y$, which belong to the same group. Taking Figure 2 as an example, the first extracted sentence from the document and the first sentence from the summary form a group $G_1$.

The problem can be simplified given the fact that for each group $G_k$, the text segment from $X$ is known, while the corresponding segment from $Y$ is dynamically decided during the generation of $Y$. We thus separate $G$ into two components $G^X$ and $G^Y$, and redefine the mapping problem as

$\hat{Y} = \arg_Y \max_{Y, G^Y} P(Y, G^Y \mid X, G^X),$   (3)

where

$G^X = \{g_i = k \text{ if } w_i \in G_k \text{ else } 0\}\big|_{i=1}^{|X|},$   (4)

$G^Y = \{g_j = k \text{ if } w_j \in G_k \text{ else } 0\}\big|_{j=1}^{|Y|},$   (5)

so that for each group $G_k$ a group tag $k$ is assigned, through which the text segment from $X$ in group $G_k$ is linked to the segment from $Y$ in the same group. For the example in Figure 2, $G^X = \{1, ..., 1, 0, ..., 0, 2, ..., 2, 3, ..., 3, 0, ..., 0\}$ and $G^Y = \{1, ..., 1, 2, ..., 2, 3, ..., 3\}$.

In the encoder-decoder framework, we convert $G^X$ and $G^Y$ into vector representations through a shared embedding table, which is randomly initialized and jointly trained with the encoder and decoder. The vector representations of $G^X$ and $G^Y$ are used to enrich the vector representations of $X$ and $Y$, respectively. As a result, all the tokens tagged with $k$ in both $X$ and $Y$ share the same vector component, through which content-based addressing can be done by the attention mechanism (Garg et al. 2019). Here, the group tag serves as a mechanism to constrain the attention from $Y$ to the corresponding part of $X$ during decoding. Unlike approaches that modify a seq2seq model using rules (Hsu et al. 2018; Gehrmann, Deng, and Rush 2018), group tags make the modification flexible and trainable.
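To make the group tag construction of Eqs. 4-5 concrete, here is a minimal illustrative sketch (ours, not the released implementation; tokenization is simplified to whitespace splitting):

```python
def group_tags_for_doc(doc_sents, extracted_ids):
    """Eq. 4 (and Eq. 8): tag k for every token of the k-th extracted
    sentence, tag 0 for all other tokens."""
    tag_of = {sid: k for k, sid in enumerate(sorted(extracted_ids), start=1)}
    tags = []
    for i, sent in enumerate(doc_sents):
        tags += [tag_of.get(i, 0)] * len(sent.split())
    return tags

def group_tags_for_summary(summary_sents):
    """Eq. 5 (and Eq. 10): every token of the k-th summary sentence gets tag k."""
    tags = []
    for k, sent in enumerate(summary_sents, start=1):
        tags += [k] * len(sent.split())
    return tags

doc = ["a b c", "d e", "f g", "h"]
print(group_tags_for_doc(doc, extracted_ids=[0, 2]))   # [1, 1, 1, 0, 0, 2, 2, 0]
print(group_tags_for_summary(["x y", "z"]))            # [1, 1, 2]
```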
Contextualized Rewriting System

We take a three-step process to generate a summary. First, an extractive summarization model is used to select a set of sentences from the original document as a guiding source. Second, the guiding source text is matched with the original document, whereby a group tag is assigned to each token. Third, an abstractive rewriter is applied to the tagged document, where the group tags serve as guidance for summary generation.

Formally, we use $X = \{w_i\}\big|_{i=1}^{|X|}$ to represent a document $X$, which contains $|X|$ tokens, and $Y = \{w_j\}\big|_{j=1}^{|Y|}$ to represent the final resulting summary $Y$, which contains $|Y|$ tokens.
Extractive Summarizer

Following Liu and Lapata (2019), we use BERT to encode the input document, with a special [CLS] token added to the beginning of each sentence and interval segments applied to distinguish successive sentences. On top of the BERT representations of the [CLS] tokens, an extractor is stacked to select sentences. The extractor uses a Transformer (Vaswani et al. 2017) encoder to generate inter-sentence representations, on which an output layer with sigmoid activation calculates the probability of each sentence being extracted.
Encoder. We use the BERT encoder BertEnc to convert the source document $X$ into a sequence of token embeddings $H_X$, taking the [CLS] embeddings as representations of the source sentences, denoted as $H_C$:

$H_X = \mathrm{BertEnc}(X), \quad H_C = \{H_X^{(i)} \mid w_i = \mathrm{[CLS]}\}\big|_{i=1}^{|X|}.$   (6)

Extractor. We use a Transformer encoder TransEnc to convert the sentence embeddings $H_C$ into final inter-sentence representations $H_F$, and calculate the extraction probability of each sentence according to $H_F$:

$H_F = \mathrm{TransEnc}(H_C), \quad P(ext_k \mid X) = \sigma(W \cdot H_F^{(k)} + b),$   (7)

where $ext_k$ means the $k$-th sentence is extracted, and $W$ and $b$ are trainable model parameters.

Given the sequence of extraction probabilities $\{P(ext_k \mid X)\}\big|_{k=1}^{C}$, where $C$ denotes the number of sentences in $X$, we make a decision on each sentence according to three hyper-parameters: the minimum number of sentences to extract $min\_sel$, the maximum number of sentences to extract $max\_sel$, and a probability threshold. In particular, we sort the $C$ sentences in descending order of $P(ext_k \mid X)$; the top $min\_sel$ sentences are selected by default, while sentences ranking between $min\_sel$ and $max\_sel$ are selected only if their probability is above the threshold. We decide the hyper-parameter values using dev experiments.

Note that our method is slightly different from the extractive model of Liu and Lapata (2019), which extracts the 3 most probable sentences as the summary. For the purpose of rewriting with strong compression, our method allows extracting more sentences as the summary for better recall.
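The selection rule can be sketched as follows (a minimal illustration, not the released implementation; the default threshold value below is a hypothetical placeholder, since the actual value is tuned on the dev set):

```python
def select_sentences(probs, min_sel=3, max_sel=5, threshold=0.15):
    """probs[k] is P(ext_k | X); returns indices of selected sentences."""
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    selected = ranked[:min_sel]               # top min_sel, selected by default
    for k in ranked[min_sel:max_sel]:         # ranks min_sel+1 .. max_sel
        if probs[k] > threshold:              # kept only if above the threshold
            selected.append(k)
    return sorted(selected)

print(select_sentences([0.9, 0.1, 0.5, 0.4, 0.05, 0.2]))
# [0, 2, 3, 5]: the three most probable sentences plus one above-threshold one
```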
Source Group Tagging

We match the extracted summary with the original document for group tagging, taking each sentence of the extracted summary as a group, so that the first summary sentence and its matched document sentence form group one, the second form group two, and so on. Formally, for document $X$ and extractive summary $E$, the $k$-th summary sentence $E_k$ ($k \in [1, ..., K]$) is matched to $X$, where every token in $E_k$ is assigned the group tag $k$. In particular, Eq. 4 is instantiated as

$G^X = \{g_i = k \text{ if } w_i \in E_k \text{ else } 0\}\big|_{i=1}^{|X|},$   (8)

where $G^X$ is the sequence of group tags for document $X$.
Contextualized Rewriter

The contextualized rewriter extends the abstractive summarizer of Liu and Lapata (2019), which is a standard Transformer sequence-to-sequence model with BERT as the encoder. As Figure 3 shows, to integrate group tag guidance, group tag embeddings are added to both the encoder and the decoder. Formally, for an extractive summary $E$, the set of group tags is a closed set $[1, ..., K]$. We use a lookup table $W_G$ to represent the embeddings of the group tags, which is shared by the encoder and the decoder.
Encoder. The original document is processed in the same way as for the extractive model, where a [CLS] token is added for each sentence and interval segments are used to distinguish successive sentences. After BERT encoding BertEnc, the representation of each token is added to its group tag embedding, producing the final representation

$H_{X+G} = \mathrm{BertEnc}(X) + \mathrm{Emb}_{W_G}(G^X),$   (9)

where $\mathrm{Emb}_{W_G}(G^X)$ denotes the embeddings retrieved from the lookup table $W_G$ for the group tag sequence $G^X$.
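A minimal PyTorch sketch of Eq. 9 (illustrative only; `bert` stands for any encoder returning per-token hidden states, and the maximum number of groups is an assumed bound):

```python
import torch
import torch.nn as nn

class GroupTagEncoder(nn.Module):
    def __init__(self, bert, hidden=768, max_groups=32):
        super().__init__()
        self.bert = bert  # any encoder: (batch, seq) ids -> (batch, seq, hidden)
        # Lookup table W_G; index 0 is reserved for untagged tokens.
        self.tag_emb = nn.Embedding(max_groups + 1, hidden)

    def forward(self, token_ids, group_tags):
        h_x = self.bert(token_ids)             # BertEnc(X)
        return h_x + self.tag_emb(group_tags)  # Eq. 9: H_{X+G}
```

Passing the same `tag_emb` module to the decoder side would tie the group tag embeddings between encoder and decoder, as in Figure 3.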
Decoder. Summary sentences are synthesized in a single sequence, with a special token [BOS] at the beginning, [SEP] between sentences, and [EOS] at the end. The decoder follows a standard Transformer architecture.

We treat each sentence in the summary as a group. Consequently, the group tag sequence $G^Y$ is fully determined by the summary $Y$. In particular, all the tokens in the $k$-th summary sentence $Y_k$ ($k \in [1, ..., K]$) are assigned the group tag $k$. Therefore, Eq. 5 is instantiated as

$G^Y = \{g_j = k \text{ if } w_j \in Y_k \text{ else } 0\}\big|_{j=1}^{|Y|}.$   (10)

During decoding, the group tag is generated at each beam search step, starting with 1 after the special token [BOS] and increasing by 1 after each special token [SEP].
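This tag update amounts to a counter over generated [SEP] tokens. A minimal sketch (ours; the token ids and the convention that a special token carries the tag of the sentence it closes are assumptions):

```python
def decode_time_group_tags(generated_ids, bos_id=1, sep_id=2):
    """Group tag per decoded position: 1 after [BOS], +1 after each [SEP]."""
    tags, k = [], 0
    for tok in generated_ids:
        if tok == bos_id:
            k = 1           # the first summary sentence starts after [BOS]
            tags.append(0)  # [BOS] itself belongs to no group
            continue
        tags.append(k)
        if tok == sep_id:
            k += 1          # the next sentence gets the next group tag
    return tags

# [BOS] w w [SEP] w w  ->  [0, 1, 1, 1, 2, 2]
print(decode_time_group_tags([1, 5, 6, 2, 7, 8]))
```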
The embedding of group tag $g_j$ is retrieved from the lookup table $W_G$ by $\mathrm{Emb}_{W_G}(g_j)$, and added to the token embedding $\mathrm{Emb}(w_j)$ and the position embedding:

$H_{Y+G} = \mathrm{Emb}(Y) + \mathrm{Emb}_{W_G}(G^Y),$   (11)

from which the decoder predicts the next-token distribution $P(w_j \mid w_{<j}, X, G^X)$.

Training

We train our extractive summarizer and abstractive rewriter separately on a pre-processed dataset labeled with gold-standard extractions. To generate a gold-standard extraction, we match each sentence of the human summary to each document sentence, choosing the sentence with the best matching score as the gold extraction for that summary sentence. Specifically, we use the average recall of ROUGE-1/2/L as the scoring function, following Wei, Huang, and Gao (2019). Differing from existing work (Liu and Lapata 2019), which aims to find a set of sentences that maximizes ROUGE against the human summary, we find the best match for each summary sentence. As a result, the number of extracted sentences equals the number of sentences in the human summary. This strategy is also adopted by Wei, Huang, and Gao (2019) and Bae et al. (2019).

After matching summary $Y$ to document $X$, we obtain a gold-standard extraction $E = \{E_k\}\big|_{k=1}^{K}$. For training our extractive model, we convert the gold-standard extraction $E$ into a label $l_k$ on each sentence $X_k$ in $X$, setting $l_k = 1$ if $X_k \in E$ and $l_k = 0$ otherwise. We train the model with a binary cross-entropy loss

$L_{ext} = \frac{1}{C} \sum_{k=1}^{C} \big[ -l_k \cdot \log P(ext_k \mid X) - (1 - l_k) \cdot \log(1 - P(ext_k \mid X)) \big],$   (12)

where $C$ denotes the number of sentences in $X$.

For training our abstractive rewriter, we convert gold-standard extractions $E$ into group tags $G^X$ following Eq. 8, and train the model with a negative log-likelihood loss

$L_{wrt} = \frac{1}{|Y|} \sum_{j=1}^{|Y|} -\log P(w_j \mid w_{<j}, X, G^X).$   (13)
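A minimal sketch of the per-sentence matching used to build gold extractions (ours, illustrative; the set-based n-gram recall only approximates ROUGE recall, and ROUGE-L is omitted for brevity):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def recall(ref, cand, n):
    # Fraction of reference n-grams that also occur in the candidate.
    ref_ng = ngrams(ref, n)
    if not ref_ng:
        return 0.0
    cand_ng = set(ngrams(cand, n))
    return sum(g in cand_ng for g in ref_ng) / len(ref_ng)

def match_gold_extraction(summary_sents, doc_sents):
    """One gold document sentence per summary sentence, by best average recall."""
    gold = []
    for ref in summary_sents:
        r = ref.split()
        scores = [(recall(r, d.split(), 1) + recall(r, d.split(), 2)) / 2
                  for d in doc_sents]
        gold.append(max(range(len(doc_sents)), key=scores.__getitem__))
    return gold

doc = ["the cat sat on the mat .", "dogs bark loudly .", "it rained all day ."]
print(match_gold_extraction(["a cat sat on a mat .", "rain fell all day ."], doc))
# [0, 2]
```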
Experimental Settings

We evaluate our model on the CNN/Daily Mail dataset (Hermann et al. 2015), which comprises online news articles with several human-written highlights (on average 3.75 per article). There are 312,085 samples in total. We use the non-anonymized version and follow the standard split of Hermann et al. (2015), with 287,227 samples for training, 13,368 for development, and 11,490 for testing. We preprocess the dataset following See, Liu, and Manning (2017) after splitting sentences with Stanford CoreNLP (Manning et al. 2014). We tokenize sentences into subword tokens, and truncate documents to 512 tokens.

We evaluate our models automatically using ROUGE (Lin 2004), reporting the unigram overlap ROUGE-1 and the bigram overlap ROUGE-2 as metrics for informativeness, and the longest common subsequence ROUGE-L as an indicator of fluency. All scores are calculated using pyrouge (https://pypi.org/project/pyrouge/0.1.3/).

Extractive Summarizer

The document encoder is initialized with pre-trained uncased BERT-base, which has 12 Transformer layers and an output embedding size of 768. The Transformer extractor has 2 layers with an embedding size of 768 and is randomly initialized. We use the Adam optimizer (Kingma and Ba 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The encoder and extractor are jointly trained for a total of 50,000 steps with the learning rate schedule (Vaswani et al. 2017)

$lr = 2 \cdot 10^{-3} \cdot \min(step^{-0.5},\ step \cdot warmup^{-1.5}),$

where $warmup = 10{,}000$. The model is trained with 2 V100 GPUs for about 9 hours. For inference, we select sentences according to the hyper-parameters $min\_sel = 3$, $max\_sel = 5$, and a probability threshold, all chosen by a grid search for the best average ROUGE-1/2/L score on the dev set.

Method                                     ROUGE-1  ROUGE-2  ROUGE-L
Extractive
  LEAD-3 (See, Liu, and Manning 2017)       40.34    17.70    36.57
  BERTSUMEXT (Liu and Lapata 2019)          43.25    20.24    39.63
Abstractive
  BERTSUMABS (Liu and Lapata 2019)          41.72    19.39    38.76
  BERTSUMEXTABS (Liu and Lapata 2019)       42.13    19.60    39.18
  RNN-Ext+Abs+RL (Chen and Bansal 2018)     40.88    17.80    38.54
  BERT-Hybrid (Wei, Huang, and Gao 2019)    41.76    19.31    38.86
  BERT-Ext+Abs+RL (Bae et al. 2019)         41.90    19.08    39.64
  BERT+Copy/Rewrite+HRL (Xiao et al. 2020)  42.92    19.43    39.35
Our Models
  BERT-Ext                                  41.04    19.56    37.66
  Oracle                                    46.77    26.78    43.32
  BERT-Abs                                  41.70    19.06    38.71
  BERT-Ext+ContextRewriter                  43.52*   20.57*   40.56*
  Oracle+ContextRewriter                    52.57    29.71    49.69

Table 1: Results. Significantly better scores are marked with * (p < 0.05, t-test). Ext and Abs denote extractive and abstractive models, respectively, and RL means reinforcement learning.

Contextualized Rewriter

We initialize the document encoder with the pre-trained uncased BERT-base model, and initialize the decoder randomly. The Transformer decoder has 6 layers with an embedding size of 768 and tied input-output embeddings (Press and Wolf 2017). We use the Adam optimizer with the default setting $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The model is trained for a total of 240,000 steps, with 20,000 warm-up steps for the encoder and 10,000 for the decoder:

$lr_{ENC} = 2 \cdot 10^{-3} \cdot \min(step^{-0.5},\ step \cdot warmup_{ENC}^{-1.5})$
$lr_{DEC} = 0.2 \cdot \min(step^{-0.5},\ step \cdot warmup_{DEC}^{-1.5}),$

that is, a base learning rate of 2e-3 for the encoder and 0.2 for the decoder. We apply dropout with a probability of 0.2, label smoothing (Szegedy et al. 2016) with a factor of 0.1, and word dropout (Bowman et al. 2016) with a probability of 0.3 on the decoder. We train the model with 2 GPUs on a V100 machine for about 60 hours.

For inference, we constrain the decoding sequence to a minimum length of 50 and a maximum length of 200, applying a length penalty (Wu et al. 2016) and a beam size of 5. During beam search, we block the paths on which a repeated trigram would be generated, namely Trigram Blocking (Paulus, Xiong, and Socher 2017).
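The trigram blocking check applied at each beam step can be sketched as follows (a minimal illustration, not the released implementation):

```python
def violates_trigram_blocking(tokens, next_token):
    """True if appending next_token would repeat a trigram already generated
    (Paulus, Xiong, and Socher 2017); such beam paths are pruned."""
    if len(tokens) < 2:
        return False
    candidate = tuple(tokens[-2:] + [next_token])
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}
    return candidate in seen

print(violates_trigram_blocking(["a", "b", "c", "a", "b"], "c"))  # True
print(violates_trigram_blocking(["a", "b", "c", "a", "b"], "d"))  # False
```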
Results and Analysis

We compare our models with existing summarization models before analysing the contextualized rewriter.

Automatic Evaluation

The results are shown in Table 1. The top section consists of extractive models, the middle section contains abstractive models and hybrid systems with a rewriter, and the bottom section lists our models. In comparison with BERTSUMEXT, our extractive model BERT-Ext gives a lower result due to the difference in extraction goal discussed earlier.

Compared with the extractive baseline BERT-Ext, our model BERT-Ext+ContextRewriter improves ROUGE-1/2/L by 2.48, 1.01 and 2.90, respectively, which shows the effectiveness of contextualized rewriting. To isolate the effect of the rewriter from that of the extractive summarizer, we also ran an experiment using the Oracle extractive summary as the input to our contextualized rewriter (Oracle+ContextRewriter). The gap between the BERT-Ext+ContextRewriter and Oracle+ContextRewriter results shows the room for further improvement as the extractive summarizer becomes stronger.

The row BERT-Abs shows the result of a BERT-based abstractive summarizer that copies the structure and settings of BERT-Ext+ContextRewriter, excluding the components related to group tags in Figure 3. The contrast between our BERT-Ext+ContextRewriter model and the BERT-Abs model shows the usefulness of the extractive summary for guiding abstractive rewriting.

Compared to the rewriting system BERT-Hybrid, our BERT-Ext+ContextRewriter increases ROUGE-1/2/L by 1.76, 1.26 and 1.70, respectively, demonstrating the effectiveness of contextualized over non-contextualized rewriting. Although better results can be achieved for non-contextualized rewriting systems with the help of reinforcement learning, as the results of BERT-Ext+Abs+RL and BERT+Copy/Rewrite+HRL show, the complexity of the algorithm is inevitably increased. Compared with the best such rewriting system, BERT+Copy/Rewrite+HRL, our contextualized rewriter still shows a significant improvement of 0.60, 1.14 and 1.21 on ROUGE-1/2/L, respectively, despite the fact that our model is purely generative, without copying tokens from the source document.

Compared with the strong extractive model BERTSUMEXT, BERT-Ext+ContextRewriter gives better scores across all three ROUGE metrics, with significant margins of 0.27, 0.33 and 0.93 on ROUGE-1/2/L, respectively. Considering the different lengths of extractive and rewritten summaries, we also normalize ROUGE scores following Sun et al. (2019). The relative improvement of our model after normalization is even larger: it improves over BERTSUMEXT by 4.1% relatively (from 1.47 to 1.53) on the normalized score, compared with a relative improvement of 0.6% (from 43.25 to 43.52) on ROUGE-1. To our knowledge, we are the first to report improved ROUGE scores over a state-of-the-art extractive baseline by using abstractive rewriting. Human evaluation is given in the next section.

We did not include BART (Lewis et al. 2020) in the table. It reports ROUGE-1/2/L of 44.16, 21.28 and 40.90, respectively, but uses a different pre-training method and different data from the models in Table 1. First, we use BERT-base, while BART for summarization uses a large model. Second, the models in Table 1 use only the first 512 tokens of the document, while BART uses 1024 tokens.

Human Evaluation

Intuitively, our model can paraphrase extractive summaries instead of generating summaries from scratch, thereby improving faithfulness. Furthermore, the abstractiveness of the contextualized rewriter can enhance readability, and its strong compression can improve conciseness. To confirm these hypotheses, we conduct a human evaluation on 30 samples randomly selected from the test set, scoring faithfulness, readability, informativeness, and conciseness from 1 (worst) to 5 (best) with 3 independent annotators. We report the final result by averaging across annotators. The result is shown in Table 2.

Method                      Faith.  Read.  Info.  Conc.
RNN-Ext+Abs+RL               4.71   3.62   3.22   3.35
BERTSUMEXT                   5.00   3.45   3.90   3.55
BERTSUMEXTABS                4.86   4.22   3.78   3.85
BERT-Ext+ContextRewriter     5.00   4.15   4.01   3.80

Table 2: Human evaluation on faithfulness (Faith.), readability (Read.), informativeness (Info.), and conciseness (Conc.).
Compared with the non-contextualized rewriter RNN-Ext+Abs+RL, our contextualized rewriter shows an obvious advantage across all four aspects. Compared to the extractive baseline BERTSUMEXT, our rewriter enhances readability, informativeness and conciseness by a significant margin while keeping faithfulness. The enhancement in readability is mainly contributed by reduced redundancy and improved coherence, and the improvement in conciseness confirms the strong compression of the rewriter. In comparison with the abstractive baseline BERTSUMEXTABS, our rewriter improves faithfulness and informativeness while keeping readability and conciseness close. The conciseness of the rewriter is 0.05 lower, since it generates summaries about one word longer than the abstractive model on average; however, with the extra text, the rewriter improves informativeness substantially, from 3.78 to 4.01.

Universality of the Rewriter

Our contextualized abstractive rewriter can serve as a general summary rewriter. We evaluate the rewriter with four different extractive summarizers: LEAD-3, BERTSUMEXT, BERT-Ext and Oracle.

Method                        ROUGE-1        ROUGE-2        ROUGE-L        Words
Oracle                        46.77          26.78          43.32          112
+ ContextRewriter (ours)      52.57 (+5.80)  29.71 (+2.93)  49.69 (+6.37)   63
LEAD-3                        40.34          17.70          36.57           85
+ ContextRewriter (ours)      41.09 (+0.75)  18.19 (+0.49)  38.06 (+1.49)   55
BERTSUMEXT w/o Tri-Bloc       42.50          19.88          38.91           80
+ ContextRewriter (ours)      43.31 (+0.81)  20.44 (+0.56)  40.33 (+1.42)   54
BERT-Ext (ours)               41.04          19.56          37.66          105
+ ContextRewriter (ours)      43.52 (+2.48)  20.57 (+1.01)  40.56 (+2.90)   66

Table 3: Results of four extractive summarizers applied with the contextualized rewriter. Tri-Bloc means Trigram Blocking.

As Table 3 shows, the contextualized rewriter improves the summaries generated by all four extractive summarizers. In particular, using LEAD-3 as the basic extractive summarizer, the ROUGE scores improve by a large margin. Even with the best extractive summarizer, BERTSUMEXT, the rewriter still enhances summary quality, especially on ROUGE-L, with a 1.42-point improvement. All the extractive summaries are improved by more than 1.4 points on ROUGE-L, which indicates a significant improvement in fluency.

In Table 3, the ROUGE scores for BERTSUMEXT without trigram blocking are much worse than those for BERTSUMEXT because of redundant information. However, when both are applied with our rewriter, they give similar scores, with differences of less than 0.03 points across ROUGE-1/2/L, which is further evidence that our rewriter is robust to redundant extractive summaries.

Analysis

We further quantitatively evaluate our contextualized rewriter on its ability to reduce redundancy, compress sentences, improve abstractiveness, and enhance coherence.

Redundancy. Redundancy has been a major problem for automatic summarization. Here we study the impact of trigram blocking on model performance by comparing with the work of Liu and Lapata (2019).

Figure 4: Comparison of the ability to generate non-redundant summaries.

Extractive Summary: oratilwe hlongwane , whose dj name is aj , is still learning to put together words but the toddler is already able to select and play music from a laptop and has become a phenomenon in south africa . 1 two-year-old oratilwe hlongwane , from johannesburg , south africa , whose dj name is aj , is still learning to put together words but is already able to play music from a laptop , making him a worldwide phenomenon . 2
Rewritten Summary: oratilwe hlongwane , whose dj name is aj , is still learning to put together words . 1 he is already able to play music from a laptop , making him a worldwide phenomenon . 2

Figure 5: Example of the ability to reduce redundancy.

As Figure 4 shows, when the trigram-blocking post-process is removed, all the models give lower ROUGE scores. BERTSUMEXT experiences the most significant drop, while BERTSUMEXTABS has a smaller drop because an abstractive summarizer produces less redundancy.
ContextRewriter suffers the least drop, almost halving the drop of BERTSUMEXTABS, which shows that the contextualized rewriter effectively reduces redundancy. An example in Figure 5 demonstrates this ability.

Compression. As the Words column of Table 3 shows, the contextualized rewriter significantly compresses the summaries of all four extractive summarizers. For Oracle extractive summaries, it compresses the size by almost half; for the other models, it compresses the summaries to almost 2/3 of their original length on average.

Looking into the summaries generated by BERT-Ext+ContextRewriter, we find that some extractive summary sentences are left unchanged by the rewriter, while the others are either compressed into shorter versions or rewritten into new sentences. We obtain these statistics on the test dataset by adopting the edit-sequence-generation algorithm (Zhang and Litman 2014) to generate a sequence of word editing actions mapping an extracted summary sentence to the rewritten one. We categorize a sentence as "Rewritten" if the sequence contains an action of adding or modifying, "Compressed" if it contains an action of deleting, and "Unchanged" otherwise (see the sketch at the end of this section).

According to 20 samples from the test dataset, all the compressions are of phrases instead of single words. Furthermore, most removed phrases are unimportant, given that only a small fraction of the removed words are included in the reference summaries. For instance, "they returned to find hargreaves and the girl, who has not been named, lying on top of each other." is compressed into "they returned to find hargreaves and the girl lying on top of each other."

Novel n-grams. As a measure of abstractiveness, we calculate the percentage of novel n-grams, shown in Table 4. The contextualized rewriter generates summaries with more novel n-grams (1.82, 10.74 and 19.30 for 1/2/3-grams) than BERTSUMEXTABS, which suggests better abstractiveness.

Method                      1-grams  2-grams  3-grams
GOLD                         20.66    56.55    73.48
BERTSUMEXTABS                 1.39     9.81    17.79
BERT-Ext+ContextRewriter      1.82    10.74    19.30

Table 4: Percentage of novel n-grams.

Coherence. The text generation process of the contextualized rewriter can be controlled by the extractive input, through which we can observe the behavior of the rewriter. Figure 6 uses one output example to demonstrate how the rewriter maintains coherence.

Source Document: a university of iowa student has died nearly three months after a fall ... andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for ... 1 he was flown back to chicago via ... but he died on sunday . 2 ...
Rewritten Summary: andrew mogni , 20 , from glen ellyn , illinois , had only just arrived for a semester program in italy when the incident happened in january 1 he was flown back to chicago via air ambulance on march 20 , but he died on sunday 2
Swap Group Tags: andrew mogni , 20 , was flown back to chicago via air ambulance on march 20 , but he died on sunday 1 he had only just arrived for a semester program in italy when the incident happened in january 2

Figure 6: Example of the ability to maintain coherence.

We can see that the student's name is mentioned in the first summary sentence, while a pronoun is used in the second sentence. As the "Swap Group Tags" section shows, when we swap the group tags in the source document, the content of the two summary sentences swaps position, but the student's name is still presented in the first sentence and a pronoun is used in the second. From this case, we can see that cross-sentence anaphora is maintained correctly.
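Returning to the compression analysis above, the categorization rule can be sketched with standard edit operations (a minimal illustration; difflib stands in for the edit-sequence algorithm of Zhang and Litman (2014)):

```python
from difflib import SequenceMatcher

def categorize(extracted, rewritten):
    """'Rewritten' if tokens were added or modified, 'Compressed' if tokens
    were only deleted, 'Unchanged' otherwise."""
    matcher = SequenceMatcher(None, extracted.split(), rewritten.split())
    kinds = {op for op, *_ in matcher.get_opcodes()}
    if 'insert' in kinds or 'replace' in kinds:
        return 'Rewritten'
    if 'delete' in kinds:
        return 'Compressed'
    return 'Unchanged'

print(categorize(
    "they returned to find hargreaves and the girl , who has not been named ,"
    " lying on top of each other .",
    "they returned to find hargreaves and the girl lying on top of each other ."))
# Compressed
```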
Conclusion

We investigated contextualized rewriting of extractive summaries using a neural abstractive rewriter, formalizing the task as a seq2seq problem with group alignments, using group tags to represent the alignments, and constraining the attention to the sentence being rewritten through content-based addressing. Results on standard benchmarks show that contextual information from the original document is highly beneficial for summary rewriting. Our model outperforms existing abstractive rewriters by a significant margin, achieving strong ROUGE improvements upon multiple extractive summarizers for the first time. Our method of seq2seq with group alignments is general and can potentially be applied to other NLG tasks.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback. We thank Wenyu Du for the inspiring discussion.

References

Bae, S.; Kim, T.; Kim, J.; and Lee, S.-g. 2019. Summary Level Training of Sentence Rewriting for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, 10–20. Hong Kong, China: Association for Computational Linguistics.
Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A.; Jozefowicz, R.; and Bengio, S. 2016. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 10–21. Berlin, Germany: Association for Computational Linguistics.
Chen, Y.-C.; and Bansal, M. 2018. Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 675–686. Melbourne, Australia: Association for Computational Linguistics.
Cheng, J.; and Lapata, M. 2016. Neural Summarization by Extracting Sentences and Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 484–494. Berlin, Germany: Association for Computational Linguistics.
Chopra, S.; Auli, M.; and Rush, A. M. 2016. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 93–98. San Diego, California: Association for Computational Linguistics.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.
Dorr, B.; Zajic, D.; and Schwartz, R. 2003. Hedge Trimmer: A Parse-and-Trim Approach to Headline Generation. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, 1–8.
Durrett, G.; Berg-Kirkpatrick, T.; and Klein, D. 2016. Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1998–2008. Berlin, Germany: Association for Computational Linguistics.
Garg, S.; Peitz, S.; Nallasamy, U.; and Paulik, M. 2019. Jointly Learning to Align and Translate with Transformer Models.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4453–4462. Hong Kong, China: Association for Computational Linguistics.
Gehrmann, S.; Deng, Y.; and Rush, A. 2018. Bottom-Up Abstractive Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4098–4109. Brussels, Belgium: Association for Computational Linguistics.
Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural Turing Machines. CoRR abs/1410.5401.
Hermann, K. M.; Kočiský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching Machines to Read and Comprehend. In Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 1693–1701.
Hsu, W.-T.; Lin, C.-K.; Lee, M.-Y.; Min, K.; Tang, J.; and Sun, M. 2018. A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 132–141. Melbourne, Australia: Association for Computational Linguistics.
Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Bengio, Y.; and LeCun, Y., eds., 3rd International Conference on Learning Representations (ICLR 2015).
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. Online: Association for Computational Linguistics.
Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics.
Liu, Y.; and Lapata, M. 2019. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3730–3740. Hong Kong, China: Association for Computational Linguistics.
Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; and McClosky, D. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60. Baltimore, Maryland: Association for Computational Linguistics.
Mihalcea, R.; and Tarau, P. 2004. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411. Barcelona, Spain: Association for Computational Linguistics.
Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In Singh, S. P.; and Markovitch, S., eds., Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 3075–3081. AAAI Press.
Nallapati, R.; Zhou, B.; dos Santos, C.; Gülçehre, Ç.; and Xiang, B. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 280–290.
Berlin, Germany: Association for Computational Linguistics.
Narayan, S.; Cohen, S. B.; and Lapata, M. 2018. Ranking Sentences for Extractive Summarization with Reinforcement Learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1747–1759. New Orleans, Louisiana: Association for Computational Linguistics.
Paulus, R.; Xiong, C.; and Socher, R. 2017. A Deep Reinforced Model for Abstractive Summarization. CoRR abs/1705.04304.
Press, O.; and Wolf, L. 2017. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 157–163. Valencia, Spain: Association for Computational Linguistics.
Rush, A. M.; Chopra, S.; and Weston, J. 2015. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 379–389. Lisbon, Portugal: Association for Computational Linguistics.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1073–1083. Vancouver, Canada: Association for Computational Linguistics.
Sun, S.; Shapira, O.; Dagan, I.; and Nenkova, A. 2019. How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, 21–29. Minneapolis, Minnesota: Association for Computational Linguistics.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 5998–6008. Curran Associates, Inc.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer Networks. In Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2692–2700.
Wei, R.; Huang, H.; and Gao, Y. 2019. Sharing Pre-trained BERT Decoder for a Hybrid Summarization. In Sun, M.; Huang, X.; Ji, H.; Liu, Z.; and Liu, Y., eds., Chinese Computational Linguistics - 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings, volume 11856 of Lecture Notes in Computer Science, 169–180. Springer.
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; Shah, A.; Johnson, M.; Liu, X.; Kaiser, L.; Gouws, S.; Kato, Y.; Kudo, T.; Kazawa, H.; Stevens, K.; Kurian, G.; Patil, N.; Wang, W.; Young, C.; Smith, J.; Riesa, J.; Rudnick, A.; Vinyals, O.; Corrado, G.; Hughes, M.; and Dean, J. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR abs/1609.08144.
Xiao, L.; Wang, L.; He, H.; and Jin, Y. 2020.
Copy or Rewrite: Hybrid Summarization with Hierarchical Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Zhang, F.; and Litman, D. 2014. Sentence-level Rewriting Detection. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications.