Length-controllable Abstractive Summarization by Guiding with Summary Prototype
Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Atsushi Otsuka, Hisako Asano, Junji Tomita, Hiroyuki Shindo, Yuji Matsumoto
NTT Media Intelligence Laboratories, NTT Corporation
Nara Institute of Science and Technology
RIKEN Center for Advanced Intelligence Project
[email protected]

Abstract
We propose a new length-controllable abstractive summarization model. Recent state-of-the-art abstractive summarization models based on encoder-decoder models generate only one summary per source text. However, controllable summarization, especially of the length, is an important aspect for practical applications. Previous studies on length-controllable abstractive summarization incorporate length embeddings in the decoder module for controlling the summary length. Although the length embeddings can control where to stop decoding, they do not decide which information should be included in the summary within the length constraint. Unlike the previous models, our length-controllable abstractive summarization model incorporates a word-level extractive module in the encoder-decoder model instead of length embeddings. Our model generates a summary in two steps. First, our word-level extractor extracts a sequence of important words (we call it the "prototype text") from the source text according to the word-level importance scores and the length constraint. Second, the prototype text is used as additional input to the encoder-decoder model, which generates a summary by jointly encoding and copying words from both the prototype text and the source text. Since the prototype text is a guide to both the content and length of the summary, our model can generate an informative and length-controlled summary. Experiments with the CNN/Daily Mail dataset and the NEWSROOM dataset show that our model outperformed previous models in length-controlled settings.
Neural summarization has made great progress in recent years. It has two main approaches: extractive and abstractive. Extractive methods generate summaries by selecting important sentences (Zhang et al. 2018; Zhou et al. 2018). They produce grammatically correct summaries; however, they do not give much flexibility to the summarization because they only extract sentences from the source text. By contrast, abstractive summarization enables more flexible summarization, and it is expected to generate more fluent and readable summaries than extractive models. The most commonly used abstractive summarization model is the pointer-generator (See, Liu, and Manning 2017), which generates a summary word-by-word while copying words from
Source Text
various types of renewable energy such as solar and wind are often touted as being the solution to the world 's growing energy crisis . but one researcher has come up with a novel idea that could trump them all - a biological solar panel that works around the clock . by harnessing the electrons generated by plants such as moss , he said he can create useful energy that could be used at home or elsewhere . a university of cambridge scientist has revealed his green source of energy . by using just moss he is able to generate enough power to run a clock -lrb- shown -rrb- . he said panels of plant material could power appliances in our homes . and the technology could help farmers grow crops where electricity is scarce . (...)
Reference Summary
university of cambridge scientist has revealed his green source of energy . by using just moss he is able to generate enough power to run a clock . he said panels of plant material could power appliances in our homes . and the tech could help farmers grow crops where electricity is scarce .
Outputs (K=10)
[Extracted prototype] he said panels of plant material could power in our
[Abstractive summary] panels of plant material could power appliances .

Outputs (K=30)
[Extracted prototype] university of cambridge scientist has revealed his he said panels of plant material could power appliances in our homes and the technology could help farmers grow crops where is scarce
[Abstractive summary] university of cambridge scientist has revealed his green source of energy . he said panels of plant material could power appliances in our homes .
Figure 1: Output examples of our model. Our model extracts the top-K important words, which are colored red (K = 10) and blue (K = 30), as a prototype from the source text. It generates an abstractive summary based on the prototype and source texts. The length of the generated summary is controlled in accordance with the length of the prototype text.

the source text and generating words from a pre-defined vocabulary set. This model can generate an accurate summary by combining word-level extraction and generation.

Although the idea of controlling the length of the summary was mostly neglected in the past, it was recently pointed out that it is actually an important aspect of abstractive summarization (Liu, Luo, and Zhu 2018; Fan, Grangier, and Auli 2018). In practical applications, the summary length should be controllable in order for it to fit the device that displays it. However, there have only been a few studies on controlling the summary length. Kikuchi et al. (2016) proposed a length-controllable model that uses length embeddings. In the length embedding approach, the summary length is encoded either as an embedding that represents the remaining length at each decoding step or as an initial embedding to the decoder that represents the desired length. Liu, Luo, and Zhu (2018) proposed a model that uses the desired length as an input to the initial state of the decoder. These previous models control the length in the decoding module by using length embeddings. However, length embeddings only add length information on the decoder side. Consequently, they may miss important information because it is difficult to take into account which content should be included in the summary for certain length constraints.

We propose a new length-controllable abstractive summarization model that is guided by the prototype text. Our idea is to use a word-level extractive module instead of length embeddings to control the summary length.
Figure 2 compares the previous length-controllable models and the proposed one. The yellow blocks are the modules responsible for length control. Since the word-level extractor controls which contents are to be included in the summary when a length constraint is given, it is possible to generate a summary including the important contents. Our model consists of two steps. First, the word-level extractor predicts the word-level importance of the source text and extracts important words according to the importance scores and the desired length. The extracted word sequence is used as a "prototype" of the summary; we call it the prototype text. Second, we use the prototype text as an additional input of the encoder-decoder model. The length of the summary is kept close to that of the prototype text because the summary is generated by referring to the prototype text. Figure 1 shows examples of output generated by our model. Our abstractive summaries are similar to the extracted prototypes. The extractive module produces a rough overview of the summary, and the encoder-decoder module produces a fluent summary based on the extracted prototype.

Our idea is inspired by extractive-and-abstractive summarization. Extractive-and-abstractive summarization incorporates an extractive model in an abstractive encoder-decoder model. While in the simple encoder-decoder model one model both identifies the important contents and generates fluent summaries, the extractive-and-abstractive model has an encoder-decoder part that generates fluent summaries and a separate part that extracts important contents. Several studies have shown that separating the problem of finding the important content from the problem of generating fluent summaries improves the accuracy of the summary (Gehrmann, Deng, and Rush 2018; Chen and Bansal 2018).
Our model can be regarded as an extension of models that work in this way; however, ours is the first to extend the extractive module such that it can control the summary length.

Figure 2: Comparison of previous length-controllable models and proposed model. Our model controls the summary length in accordance with the length of the prototype text.

Ours is the first method that controls the summary length using an extractive module and that achieves both high accuracy and length controllability in abstractive summarization. Our contributions are summarized as follows:

• We propose a new length-controllable prototype-guided abstractive summarization model, called LPAS (Length-controllable Prototype-guided Abstractive Summarization). Our model effectively guides the abstractive summarization using a summary prototype. Our model controls the summary length by controlling the number of words in the prototype text.

• Our model achieved state-of-the-art ROUGE scores in length-controlled abstractive summarization settings on the CNN/DM and NEWSROOM datasets.
Our study defines length-controllable abstractive summarization as two pipelined tasks: prototype extraction and prototype-guided abstractive summarization. The problem formulations of each task are described below.
Task 1 (Prototype Extraction). Given a source text X^C with L words, X^C = (x^C_1, ..., x^C_L), and a desired summary length K, the model estimates importance scores P^ext = (p^ext_1, ..., p^ext_L) and extracts the top-K important words X^P = (x^P_1, ..., x^P_K) as a prototype text on the basis of P^ext. The desired summary length K can be set to an arbitrary value. Note that the original word order is preserved in X^P (X^P is not a bag of words).

Task 2 (Prototype-guided Abstractive Summarization). Given the source text and the extracted prototype text X^P, the model generates a length-controlled abstractive summary Y = (y_1, ..., y_T). The length of the summary, T, is controlled in accordance with the prototype length K.

Our model consists of three modules: the prototype extractor, joint encoder, and summary decoder (Figure 3). The last two modules comprise Task 2, the prototype-guided abstractive summarization.
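The selection step of Task 1 can be sketched as follows. This is a minimal illustration assuming the importance scores are already available; the function and variable names are hypothetical, not from the paper.

```python
def extract_prototype(words, scores, k):
    """Select the top-k words by importance score, keeping source order.

    words  -- tokenized source text X^C
    scores -- one importance score per word (higher = more important)
    k      -- desired prototype length K
    """
    # Rank positions by score and take the k best...
    top_positions = sorted(range(len(words)),
                           key=lambda i: scores[i], reverse=True)[:k]
    # ...then restore the original word order: the prototype is a
    # subsequence of the source, not a bag of words.
    return [words[i] for i in sorted(top_positions)]

source = "the cat sat on the mat near the door".split()
scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.1, 0.6]
prototype = extract_prototype(source, scores, 4)
# -> ['cat', 'sat', 'mat', 'door']
```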
Figure 3: Architecture of the proposed model.

The prototype extractor uses BERT, while the joint encoder and summary decoder use the Transformer architecture (Vaswani et al. 2017).
Prototype extractor (§3.2): extracts the top-K important words from the source text.

Joint encoder (§3.3): encodes both the source text and the prototype text.

Summary decoder (§3.4): based on the pointer-generator model; generates an abstractive summary by using the output of the joint encoder.
Since our model extracts the prototype at the word level, the prototype extractor estimates an importance score p^ext_l for each word x^C_l ∈ X^C. BERT has achieved state-of-the-art results on many classification tasks, so it is a natural choice for the prototype extractor. Our model uses BERT and a task-specific feed-forward network on top of BERT. We tokenize the source text using the BERT tokenizer and fine-tune the BERT model. The importance score p^ext_l is defined as

p^ext_l = σ(W⊤ BERT(X^C)_l + b),   (1)

where BERT() is the last hidden state of the pre-trained BERT, W ∈ R^{d_bert} and b are learnable parameters, σ is a sigmoid function, and d_bert is the dimension of the last hidden state of the pre-trained BERT.

To extract a more fluent prototype than when using only the word-level importance, we define a new weighted importance score p^ext-w_l that incorporates a sentence-level importance score as a weight for the word-level importance score:

p^ext-w_l = p^ext_l · p^ext_{S_j},   p^ext_{S_j} = (1 / N_{S_j}) Σ_{l: x_l ∈ S_j} p^ext_l,   (2)

where N_{S_j} is the number of words in the j-th sentence S_j ∈ X^C. Our model extracts the top-K important words as a prototype from the source text on the basis of p^ext-w_l. It controls the length of the summary in accordance with the number of words in the prototype text, K.

https://github.com/google-research/bert/
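The sentence-weighted score of Eq. 2 can be sketched with a few lines of NumPy. The BERT scoring of Eq. 1 is assumed to have already produced `word_scores`; `sent_ids`, which maps each word to the index of its sentence, is an illustrative input, not part of the paper's notation.

```python
import numpy as np

def weighted_importance(word_scores, sent_ids):
    """Sketch of Eq. 2: weight each word-level importance score by the
    mean score of the sentence that contains the word.

    word_scores[l] -- p^ext for word l (e.g., from the BERT scorer, Eq. 1)
    sent_ids[l]    -- index j of the sentence S_j containing word l
    """
    word_scores = np.asarray(word_scores, dtype=float)
    sent_ids = np.asarray(sent_ids)
    weighted = np.empty_like(word_scores)
    for j in np.unique(sent_ids):
        in_sent = sent_ids == j
        # Sentence-level score = mean of the word scores in the sentence.
        weighted[in_sent] = word_scores[in_sent] * word_scores[in_sent].mean()
    return weighted

# Two sentences of two words each; sentence means are 0.5 and 0.4,
# so the weighted scores become [0.4, 0.1, 0.16, 0.16].
scores = weighted_importance([0.8, 0.2, 0.4, 0.4], [0, 0, 1, 1])
```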
Embedding layer. This layer projects each of the one-hot vectors of words x^C_l (of size V) into a d_word-dimensional vector space with a pre-trained weight matrix W_e ∈ R^{d_word × V} such as GloVe (Pennington, Socher, and Manning 2014). Then, the word embeddings are mapped to d_model-dimensional vectors by using a fully connected layer, and the mapped embeddings are passed to a ReLU function. This layer also adds positional encoding to the word embeddings (Vaswani et al. 2017).
Transformer encoder blocks. The encoder encodes the embedded source and prototype texts with a stack of Transformer blocks (Vaswani et al. 2017). Our model encodes the two texts with the encoder stack independently. We denote these outputs as E^C_s ∈ R^{d_model × L} and E^P_s ∈ R^{d_model × K}, respectively.
Transformer dual encoder blocks. This block calculates the interactive alignment between the encoded source and prototype texts. Specifically, it encodes the source and prototype texts and then performs multi-head attention on the other output of the encoder stack (i.e., E^C_s and E^P_s). We denote the outputs of the dual encoder stack of the source text and prototype text by M^C ∈ R^{d_model × L} and M^P ∈ R^{d_model × K}, respectively.
Embedding layer. The decoder receives a sequence of words in an abstractive summary Y, which is generated through an auto-regressive process. At each decoding step t, this layer projects each of the one-hot vectors of the words y_t in the same way as the embedding layer in the joint encoder.
Transformer decoder blocks. The decoder uses a stack of decoder Transformer blocks (Vaswani et al. 2017) that perform multi-head attention on the encoded representations of the prototype, M^P. It uses another stack of decoder Transformer blocks that perform multi-head attention on those of the source text, M^C, on top of the first stack. The first stack rewrites the prototype text, and the second one complements the rewritten prototype with the original source information. The subsequent mask is used in the stacks since this component is used in a step-by-step manner at test time. The output of the stacks is M^S ∈ R^{d_model × T}.
Copying mechanism. Our pointer-generator model copies the words from the source and prototype texts on the basis of the copy distributions, for efficient reuse.
Copy distributions. The copy distributions of the source and prototype words are described as follows:

p_p(y_t) = Σ_{k: x^P_k = y_t} α^P_{t,k},   p_c(y_t) = Σ_{l: x^C_l = y_t} α^C_{t,l},

where α^P_{t,k} and α^C_{t,l} are respectively the first attention heads of the last block in the first and second stacks of the decoder.

Final vocabulary distribution. The final vocabulary distribution is described as follows:

p(y_t) = λ_g p_g(y_t) + λ_c p_c(y_t) + λ_p p_p(y_t),
(λ_g, λ_c, λ_p) = softmax(W_v [M^S_t ; c^C_t ; c^P_t] + b_v),
c^C_t = Σ_l α^C_{t,l} M^C_l,   c^P_t = Σ_k α^P_{t,k} M^P_k,
p_g(y_t) = softmax(W_g M^S_t + b_g),

where W_v ∈ R^{3 × 3d_model}, b_v ∈ R^3, W_g ∈ R^{d_model × V}, and b_g ∈ R^V are learnable parameters.

Our model is not trained in an end-to-end manner: the prototype extractor is trained first, and then the encoder and decoder are trained.
Prototype extractor. Since there are no supervised data for the prototype extractor, we created pseudo training data as in (Gehrmann, Deng, and Rush 2018). The training data consists of word and label pairs (x^C_l, r_l) for all x^C_l ∈ X^C. r_l is 1 if x^C_l is included in the summary; otherwise, it is 0. To construct the paired data automatically, we first extract oracle source sentences S_oracle that maximize the ROUGE-R score in the same way as in (Hsu et al. 2018). Then, we calculate the word-by-word alignment between the reference summary and S_oracle using a dynamic programming algorithm to take the word order into account. Finally, we label all aligned words with 1 and all other words, including the words that are not in the oracle sentences, with 0.
Joint encoder and summary decoder. We have to create triples (X^C, X^P, Y), consisting of the source text, the gold prototype text, and the target text, for training our encoder and decoder. We use the top-K words (in terms of p^ext-w_l; Eq. 2) in the oracle sentences S_oracle as the gold prototype text to extract a prototype closer to the reference summary and improve the quality of the encoder-decoder training. K is decided using the reference summary length T. To obtain a natural summary close to the desired length, we quantize the length T into discrete bins, where each bin represents a size range. We set the size range to 5 in this study. That is, the value nearest to the summary length T among multiples of 5 is selected for K.
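The quantization of T into bins of size 5 amounts to rounding to the nearest multiple of 5. Clamping to at least one bin is our assumption, to avoid K = 0 for very short references.

```python
def quantize_length(t, bin_size=5):
    """Snap a reference-summary length T to the nearest multiple of
    bin_size (5 in this study) to choose the prototype size K.
    Clamping to at least one bin is an assumption, not from the paper."""
    return max(bin_size, round(t / bin_size) * bin_size)

quantize_length(23)  # -> 25
quantize_length(12)  # -> 10
```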
Prototype extractor. We use the binary cross-entropy loss, because the extractor estimates the importance score of each word (Eq. 1), which is a binary classification task:

L_ext = −(1 / (N L)) Σ_{n=1}^{N} Σ_{l=1}^{L} ( r_l log p^ext_l + (1 − r_l) log(1 − p^ext_l) ),

where N is the number of training examples.
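For one training example, the extractor loss is the standard binary cross-entropy averaged over the L words; a plain-Python sketch (the clipping constant is our addition to avoid log(0)):

```python
import math

def extractor_bce_loss(p_ext, labels):
    """Binary cross-entropy over word-level importance scores for one
    example; the paper additionally averages over N training examples."""
    eps = 1e-7  # clip to avoid log(0)
    total = 0.0
    for p, r in zip(p_ext, labels):
        p = min(max(p, eps), 1 - eps)
        total += r * math.log(p) + (1 - r) * math.log(1 - p)
    return -total / len(p_ext)

extractor_bce_loss([0.5, 0.5], [1, 0])  # -> log 2 ≈ 0.693
```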
Joint encoder and summary decoder. The main loss for the encoder-decoder is the cross-entropy loss:

L^main_gen = −(1 / (N T)) Σ_{n=1}^{N} Σ_{t=1}^{T} log p(y_t | y_{<t}, X^C, X^P).

Moreover, we add the attention guide losses of the summary decoder. These losses are designed to guide the estimated attention distribution to the reference attention:

L^sum_attn = −(1 / (N T)) Σ_{n=1}^{N} Σ_{t=1}^{T} log α^C_{t,l(t)},
L^proto_attn = −(1 / (N T)) Σ_{n=1}^{N} Σ_{t=1}^{T} log α^proto_{t,l(t)},

where α^proto_{t,l(t)} is the first attention head of the last block in the joint encoder stack for the prototype, and l(t) denotes the absolute position in the source text corresponding to the t-th word in the sequence of summary words. The overall loss of the generation model is a linear combination of these three losses:

L_gen = L^main_gen + λ_1 L^sum_attn + λ_2 L^proto_attn.

λ_1 and λ_2 were set to 0.5 in the experiments.

During the inference period, we use a beam search and re-ranking (Chen and Bansal 2018). We keep all N_beam summary candidates provided by the beam search, where N_beam is the size of the beam, and generate the N_beam-best summaries. The summaries are then re-ranked by the number of repeated N-grams, the smaller the better. The beam search and this re-ranking improve the ROUGE score of the output, as they eliminate candidates that contain repetitions. For the length-controlled setting, we set the value of K to the desired length. For the standard setting, we set it to the average length of the reference summary in the validation data.
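The repetition-based re-ranking can be sketched as follows. The trigram choice and the toy candidates are illustrative; the paper only specifies ranking by the number of repeated N-grams, smaller first.

```python
from collections import Counter

def repeated_ngrams(tokens, n=3):
    """Count distinct n-grams that occur more than once in a candidate."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return sum(1 for c in counts.values() if c > 1)

def rerank(candidates, n=3):
    """Order beam candidates by repetition, fewest repeated n-grams first.
    Python's stable sort keeps the original beam order as the tie-breaker."""
    return sorted(candidates, key=lambda cand: repeated_ngrams(cand, n))

beams = [
    "the cat sat on the mat the cat sat on".split(),  # repeats two trigrams
    "the cat sat on the mat near the door".split(),   # repetition-free
]
best = rerank(beams)[0]  # the repetition-free candidate wins
```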
Dataset. We used the CNN/DM dataset (Hermann et al. 2015), a standard corpus for news summarization. The summaries are bullet points for the articles shown on their respective websites. Following See, Liu, and Manning (2017), we used the non-anonymized version of the corpus and truncated the source documents to 400 tokens and the target summaries to 120 tokens. The dataset includes 286,817 training pairs, 13,368 validation pairs, and 11,487 test pairs.

We also used the NEWSROOM dataset (Grusky, Naaman, and Artzi 2018). NEWSROOM contains various news sources (38 different news sites). We used 973,042 pairs of data for training. We sampled 30,000 pairs for validation data, and the number of test pairs was 106,349. To evaluate the length-controlled setting for the NEWSROOM dataset, we randomly sampled 10,000 samples from the test set.
Table 1: ROUGE scores (F1) of abstractive summarization models with different lengths on the CNN/DM dataset (10, 30, 50, 70, 90 words). AVG indicates the average ROUGE score for the five different lengths. (Liu, Luo, and Zhu 2018)
Model Configurations. We used the same configurations for the two datasets. The extractor used the pre-trained BERT-large model (Devlin et al. 2018). We fine-tuned BERT for two epochs with the default settings. Our encoder and decoder used pre-trained 300-dimensional GloVe embeddings. The encoder and decoder Transformers have four blocks. The number of heads was 8, and the number of dimensions of the FFN was 2048. d_model was set to 512. We used the Adam optimizer (Kingma and Ba 2015) with a scheduled learning rate (Vaswani et al. 2017). We set the size of the input vocabulary to 100,000 and the output vocabulary to 1,000.

We used the ROUGE scores (F1), including ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L), as the evaluation metrics (Lin 2004). We used the files2rouge toolkit for calculating the ROUGE scores.

Does our model improve the ROUGE score in the length-controlled setting?
We used two types of length-controllable models as baselines. The first one is a CNN-based length-controllable model (LC) that uses the desired length as an input to the initial state of the CNN-based decoder (Liu, Luo, and Zhu 2018). The second one (LenEmb) embeds the remaining length and adds it to each decoder step (Kikuchi et al. 2016). Since there are no previous results on applying LenEmb to the CNN/DM dataset, we implemented it as a Transformer-based encoder-decoder model. Specifically, we simply added the embeddings of the remaining length to the word embeddings at each decoding step.

https://github.com/pltrdy/files2rouge
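The LenEmb conditioning we implemented can be sketched as follows. The random initialization and dimensions are illustrative only; in a real model the length-embedding table is learned jointly with the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, max_len = 8, 128
# One embedding per possible remaining length. Randomly initialized here
# for illustration; in a real model this table would be trained.
length_table = rng.normal(size=(max_len + 1, d_model))

def add_remaining_length(word_embs, desired_len):
    """LenEmb-style conditioning: at decoding step t, add the embedding
    of the remaining length (desired_len - t, clamped at 0) to the word
    embedding for that step."""
    out = np.array(word_embs, dtype=float)
    for t in range(len(out)):
        remaining = max(desired_len - t, 0)
        out[t] += length_table[remaining]
    return out

word_embs = rng.normal(size=(5, d_model))  # embeddings for five steps
conditioned = add_remaining_length(word_embs, desired_len=30)
```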
Figure 4: Results in the length-controlled setting on CNN/DM. a): ROUGE-L recall, precision and F scores for different lengths (left). b): Output length distribution (right).

Table 1 shows that our model achieved high ROUGE scores for different lengths and outperformed the previous length-controllable models in most cases. Our model was about 2 points more accurate on average than LenEmb. Our model selected the most important words from the source text in accordance with the desired length. It was thus effective at keeping the important information even in the length-controlled setting. Figure 4a shows the precision, recall, and F score of ROUGE for different lengths. Our model maintained a high F score around the average length (around 60 words); this indicates that it can select important information and generate stable results with different lengths.
Does our model generate a summary with the desired length?
Figure 4b shows the relationship between the desired length and the output length. The x-axis indicates the desired length, and the y-axis indicates the average length and standard deviation of the length-controlled output summary. The results show that our model properly controls the summary length. This controllable nature comes from the training procedure. When training our encoder-decoder, we set the number of words K in the prototype text according to the length of the reference summary; therefore, the model learns to generate a summary that has a similar length to the prototype text.

How good is the quality of the prototype text?
To evaluate the quality of the prototype, we evaluated the ROUGE scores of the extracted prototype text. Table 2 shows the results. In the table, LPAS-ext (top-3 sents) means the top three sentences were extracted using p^ext_{S_j}. Interestingly, the ROUGE-1 and ROUGE-2 scores of LPAS-ext (top-K words) were higher than those of the sentence-level extractive models. This indicates that word-level LPAS-ext is effective at finding not only important words (ROUGE-1) but also important phrases (ROUGE-2). Also, we can see from Table 5 that the whole LPAS improved the ROUGE-L score of LPAS-ext. This indicates that our joint encoder and summary decoder generate more fluent summaries with the help of the prototype text.

Does our abstractive model improve if the quality of the prototype is improved?
We evaluated our model in the following two settings in order to analyze the relationship between the quality of the abstractive summary and that of the prototype. In the gold-length setting, we only gave the gold length K to the prototype extractor. In the gold sentences + gold-length setting, we gave the gold sentences S_oracle and the gold length (see 4.1).

Model   R-1    R-2    R-L
Lead3   40.3   17.7   36.6

Table 2: ROUGE scores (F1) of our prototype extractor (LPAS-ext) on CNN/DM. (Gehrmann, Deng, and Rush 2018); (Zhou et al. 2018); (Liu 2019); (Zhang, Wei, and Zhou 2019)

Setting                        R-1     R-2     R-L
Average length                 42.55   20.09   39.36
Gold length                    43.23   20.46   40.00
Gold sentences + Gold length   46.68   23.52   43.41

Table 3: ROUGE scores (F1) of abstractive summarization models with gold settings on the CNN/DM dataset.

Table 3 shows the results. These results indicate that selecting the correct number of words in the prototype improves the ROUGE scores. In this study, we simply selected the average length when extracting the prototype for all examples in the standard setting; however, there would be an improvement if we adaptively selected the number of words in the prototype for each source text. Moreover, the ROUGE scores largely improved in the gold sentences + gold-length setting. This indicates that the quality of the generated summary will significantly improve by increasing the accuracy of the extractive model.

Is our model effective on other datasets?
To verify the effectiveness of our model on various other summary styles, we evaluated it on a large and varied news summary dataset, NEWSROOM. Table 4 and Figure 5 show the results in the length-controlled setting for NEWSROOM. Our model achieved higher ROUGE scores than those of LenEmb. From Figure 5a, we can see that the F-value of the ROUGE score was highest around 30 words. This is because the average word number is about 30 words. Moreover, Figure 5b shows that our model also acquired a length-control capability for a dataset with various styles.
How well does our model perform in the standard setting?
Table 5 shows that our model achieved ROUGE scores comparable to previous models that do not consider the length constraint on the CNN/DM dataset. We note that the current state-of-the-art models use pre-trained encoder-decoder models, while the encoder and decoder of our model (except for the prototype extractor) were not pre-trained. We also examined the results of generating a summary from only the prototype (LPAS w/o Source) or only the source
Table 4: ROUGE scores (F1) of abstractive summarization models with different lengths on the NEWSROOM dataset.
Figure 5: Results in the length-controlled setting on NEWSROOM. a): ROUGE-L recall, precision and F scores for different lengths (left). b): Output length distribution (right).

Here, using only the prototype turned out to have the same accuracy as using only the source, but the model using the source and the prototype simultaneously had higher accuracy. These results indicate that our prototype extraction and joint encoder effectively incorporated the source text and prototype information and contributed to improving the accuracy.

The results for the NEWSROOM dataset under standard settings are shown in Table 6. To consider differences in summary length between news domains, we evaluated our model in the average length and domain-level average length (denoted as domain length) settings. The results indicate that our model had significantly higher ROUGE scores compared with the official baselines and outperformed our baseline (LPAS w/o Prototype). They also indicate that our model is effective on datasets containing text in various styles. Moreover, we found that considering the domain length has positive effects on the ROUGE scores. This indicates that our model can easily reflect differences in summary length among various styles.
Length control for summarization
Kikuchi et al. (2016) were the first to propose using length embeddings for length-controlled abstractive summarization. Fan, Grangier, and Auli (2018) also used length embeddings at the beginning of the decoder module for length control. Liu, Luo, and Zhu (2018) proposed a CNN-based length-controllable summarization model that uses the desired length as an input to the initial state of the decoder. Takase and Okazaki (2019) introduced positional encoding that represents the remaining length at each decoder step of the Transformer-based encoder-decoder model. It is almost equivalent to the LenEmb model we implemented. These previous models use length embeddings for controlling the length in the decoding module, whereas we use the prototype extractor for controlling the summary length and for including important information in the summary.

Model                R-1     R-2     R-L
LPAS w/o Prototype   40.71   18.43   37.32
LPAS w/o Source      40.08   18.32   37.08

Table 5: ROUGE scores (F1) of abstractive summarization models on CNN/DM. (See, Liu, and Manning 2017); (Li et al. 2018); (Hsu et al. 2018); (Chen and Bansal 2018); (Gehrmann, Deng, and Rush 2018); (Mendes et al. 2019); (You et al. 2019); (Wang et al. 2019); (Dong et al. 2019); (Raffel et al. 2019); (Lewis et al. 2019). LPAS w/o Prototype denotes a simple Transformer-based pointer-generator, which is our model without the prototype extractor and the joint encoder. LPAS w/o Source denotes a model that generates a summary only from the prototype text.

Model                       R-1     R-2     R-L
LPAS (K = average length)   39.24   27.20   35.84
LPAS (w/o Prototype)        38.48   26.99   35.30

Table 6: ROUGE scores (F1) of proposed models on the NEWSROOM dataset. (Grusky, Naaman, and Artzi 2018)

Neural extractive-and-abstractive summarization
Hsu et al. (2018), Gehrmann, Deng, and Rush (2018), and You et al. (2019) incorporated a sentence- and word-level extractive model in the pointer-generator model. Their models weight the copy probability for the source text by using an extractive model and guide the pointer-generator model to copy important words. Li et al. (2018) proposed a keyword-guided abstractive summarization model. Chen and Bansal (2018) proposed a sentence extraction and re-writing model that trains in an end-to-end manner by using reinforcement learning. Cao et al. (2018) proposed a search-and-rewrite model. Mendes et al. (2019) proposed a combination of sentence-level extraction and compression. The idea behind these models is word-level weighting for the entire source text or sentence-level re-writing. On the other hand, our model guides the summarization with a length-controllable prototype text by using the prototype extractor and joint encoder. Utilizing extractive results to control the length of the summary is a new idea.
Large-scale pre-trained language model
BERT (Devlin et al. 2018) is a pre-trained language model that uses bidirectional encoder representations from Transformers. BERT has performed well in many natural language understanding tasks such as the GLUE benchmarks (Wang et al. 2018) and natural language inference (Williams, Nangia, and Bowman 2018). Liu (2019) used BERT for their sentence-level extractive summarization model. Zhang, Wei, and Zhou (2019) trained a new pre-trained model that considers document-level information for sentence-level extractive summarization. We used BERT for the word-level prototype extractor and verified the effectiveness of using it in the word-level extractive module. Several researchers have published pre-trained encoder-decoder models very recently (Wang et al. 2019; Lewis et al. 2019; Raffel et al. 2019). Wang et al. (2019) pre-trained a Transformer-based pointer-generator model. Lewis et al. (2019) pre-trained a standard Transformer-based encoder-decoder model using large unlabeled data and achieved state-of-the-art results. Dong et al. (2019) extended the BERT structure to handle sequence-to-sequence tasks.
Reinforcement learning for summarization
Reinforcement learning (RL) is a key summarization technique. RL can be used to optimize non-differentiable metrics or multiple non-differentiable networks. Narayan, Cohen, and Lapata (2018) and Dong et al. (2018) used RL for extractive summarization. For abstractive summarization, Paulus, Xiong, and Socher (2017) used RL to mitigate the exposure bias of abstractive summarization. Chen and Bansal (2018) used RL to combine sentence-extraction and pointer-generator models. Our model achieved high ROUGE scores without RL. In the future, we may incorporate RL into our models to obtain further improvement.
Conclusion
We proposed a new length-controllable abstractive summarization model. Our model consists of a word-level prototype extractor and a prototype-guided abstractive summarization model. The prototype extractor identifies the important part of the source text within the length constraint, and the abstractive model is guided by the prototype text. This characteristic enabled it to achieve high ROUGE scores in standard summarization tasks. Moreover, our prototype extractor ensures the summary will have the desired length. Experiments with the CNN/DM dataset and the NEWSROOM dataset show that our model outperformed previous models in standard and length-controlled settings. In the future, we would like to incorporate a pre-trained language model into the abstractive model to build a higher-quality summarization model.
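To make the core idea concrete, the following is a minimal, illustrative sketch of word-level prototype extraction under a length constraint: given per-token importance scores, keep the top-scoring tokens up to the length budget while preserving source order. This is only a toy approximation, not the paper's implementation; in the actual model the scores come from a BERT-based extractor, and the function and variable names here are hypothetical.

```python
# Toy sketch of length-constrained word-level prototype extraction.
# Assumption: `scores[i]` is the importance of `tokens[i]`, as would be
# produced by a trained word-level extractor (e.g., BERT-based).

def extract_prototype(tokens, scores, max_len):
    """Keep at most max_len highest-scoring tokens, preserving source order."""
    # Rank token positions by importance score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Take the top max_len positions, then restore the original word order
    # so the prototype reads as a subsequence of the source text.
    keep = sorted(ranked[:max_len])
    return [tokens[i] for i in keep]

tokens = "the cat sat on the mat near the door".split()
scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.1, 0.6]
print(extract_prototype(tokens, scores, 4))  # -> ['cat', 'sat', 'mat', 'door']
```

Because the prototype's length is fixed by `max_len`, varying the budget directly controls how much content is handed to the abstractive model, which is what makes the summary length controllable.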
References
Cao, Z.; Li, W.; Li, S.; and Wei, F. 2018. Retrieve, rerank and rewrite: Soft template based neural summarization. In ACL, 152–161.
Chen, Y.-C., and Bansal, M. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL, 675–686.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR.
Dong, Y.; Shen, Y.; Crawford, E.; van Hoof, H.; and Cheung, J. C. K. 2018. BanditSum: Extractive summarization as a contextual bandit. In EMNLP, 3739–3748.
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32, 13042–13054.
Fan, A.; Grangier, D.; and Auli, M. 2018. Controllable abstractive summarization. In NMT@ACL, 45–54.
Gehrmann, S.; Deng, Y.; and Rush, A. 2018. Bottom-up abstractive summarization. In EMNLP, 4098–4109.
Grusky, M.; Naaman, M.; and Artzi, Y. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In ACL, 708–719.
Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In NIPS, 1693–1701.
Hsu, W.-T.; Lin, C.-K.; Lee, M.-Y.; Min, K.; Tang, J.; and Sun, M. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In ACL, 132–141.
Kikuchi, Y.; Neubig, G.; Sasano, R.; Takamura, H.; and Okumura, M. 2016. Controlling output length in neural encoder-decoders. In EMNLP, 1328–1338.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv e-prints.
Li, C.; Xu, W.; Li, S.; and Gao, S. 2018. Guiding generation for abstractive text summarization based on key information guide network. In ACL, 55–60.
Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL.
Liu, Y.; Luo, Z.; and Zhu, K. 2018. Controlling length in abstractive summarization using a convolutional neural network. In EMNLP, 4110–4119.
Liu, Y. 2019. Fine-tune BERT for extractive summarization. CoRR abs/1903.10318.
Mendes, A.; Narayan, S.; Miranda, S.; Marinho, Z.; Martins, A. F. T.; and Cohen, S. B. 2019. Jointly extracting and compressing documents with summary state representations. In NAACL, 3955–3966.
Narayan, S.; Cohen, S. B.; and Lapata, M. 2018. Ranking sentences for extractive summarization with reinforcement learning. In NAACL, 1747–1759.
Paulus, R.; Xiong, C.; and Socher, R. 2017. A deep reinforced model for abstractive summarization. CoRR abs/1705.04304.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In ACL, 1073–1083.
Takase, S., and Okazaki, N. 2019. Positional encoding to control output sequence length. In NAACL, 3999–4004.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS, 5998–6008.
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman, S. R. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, 353–355.
Wang, L.; Zhao, W.; Jia, R.; Li, S.; and Liu, J. 2019. Denoising based sequence-to-sequence pre-training for text generation. In EMNLP, to appear.
Williams, A.; Nangia, N.; and Bowman, S. R. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 1112–1122.
You, Y.; Jia, W.; Liu, T.; and Yang, W. 2019. Improving abstractive document summarization with salient information modeling. In ACL, 2132–2141.
Zhang, X.; Lapata, M.; Wei, F.; and Zhou, M. 2018. Neural latent extractive document summarization. In EMNLP, 779–784.
Zhang, X.; Wei, F.; and Zhou, M. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, 5059–5069.
Zhou, Q.; Yang, N.; Wei, F.; Huang, S.; Zhou, M.; and Zhao, T. 2018. Neural document summarization by jointly learning to score and select sentences. In