Discriminative Adversarial Search for Abstractive Summarization
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
Abstract
We introduce a novel approach for sequence decoding, Discriminative Adversarial Search (DAS), which has the desirable properties of alleviating the effects of exposure bias without requiring external metrics. Inspired by Generative Adversarial Networks (GANs), wherein a discriminator is used to improve the generator, our method differs from GANs in that the generator parameters are not updated at training time and the discriminator is only used to drive sequence generation at inference time. We investigate the effectiveness of the proposed approach on the task of Abstractive Summarization: the results obtained show that DAS improves over state-of-the-art methods, with further gains obtained via discriminator retraining. Moreover, we show how DAS can be effective for cross-domain adaptation. Finally, all reported results are obtained without the additional rule-based filtering strategies commonly used by the best performing systems available: this indicates that DAS can effectively be deployed without relying on post-hoc modifications of the generated outputs.
1. Introduction
In the context of Natural Language Generation (NLG), a majority of approaches propose sequence-to-sequence models trained via maximum likelihood estimation; a Teacher Forcing (Williams & Zipser, 1989) strategy is applied during training: ground-truth tokens are sequentially fed into the model to predict the next token. Conversely, at inference time, ground-truth tokens are not available: the model can only access its previous outputs. In the literature (Bengio et al., 2015; Ranzato et al., 2015), this mismatch is referred to as exposure bias: as mistakes accumulate, the model can diverge from the distribution seen at training time, resulting in poor generation outputs.

Several works have focused on alleviating this issue, proposing to optimize a sequence-level metric such as BLEU or ROUGE: Wiseman & Rush (2016) used beam search optimisation, while Ranzato et al. (2015) framed text generation as a reinforcement learning problem, using the chosen metric as reward. Still, these automated metrics suffer from known limitations: Sulem et al. (2018) showed that BLEU does not reflect meaning preservation, while Novikova et al. (2017) pointed out that, for NLG tasks, such metrics do not map well to human judgements.

Similar findings have been reported for ROUGE in the context of abstractive summarization (Paulus et al., 2017): for the same input, several correct outputs are possible; nonetheless, the generated output is often compared to a single human reference, given the lack of annotated data. Complementary metrics have been proposed to evaluate NLG tasks, based on Question Answering (Scialom et al., 2019) or learned from human evaluation data (Böhm et al., 2019). Arguably, though, the correlation of such metrics with human judgments is still unsatisfactory.

To tackle exposure bias, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) represent a natural alternative to the approaches above: rather than optimizing a specific metric, the model learns to generate text that a discriminator cannot differentiate from human-produced content. However, the discrete nature of text makes the classifier signal non-differentiable. A solution would be to use reinforcement learning with the classifier prediction as a reward signal; however, due to reward sparsity and mode collapse (Zhou et al., 2020), text GANs have so far failed to be competitive with state-of-the-art models trained with teacher forcing on NLG tasks (Caccia et al., 2018; Clark et al., 2019), and are mostly evaluated on synthetic datasets.

Inspired by Generative Adversarial Networks, we propose an alternative approach for sequence decoding: first, a discriminator is trained to distinguish human-produced texts from machine-generated ones. Then, this discriminator is integrated into a beam search: at each decoding step, the generator output probabilities are refined according to the likelihood that the candidate sequence is human-produced. This is equivalent to optimizing the search for a custom and dynamic metric, learnt to fit the human examples.

Under the proposed paradigm, the discriminator causes the output sequences to diverge from those originally produced by the generator.
These sequences, adversarial to the discriminator, can be used to further fine-tune the discriminator: following the procedure used for GANs, the discriminator can be iteratively trained on the new predictions it has contributed to improve. This effectively creates a positive feedback loop for training the discriminator: until convergence, the generated sequences improve and become harder to distinguish from human-produced text. Additionally, the proposed approach allows dispensing with the custom rule-based strategies commonly used at decoding time, such as length penalty and n-gram repetition avoidance.

In GANs, the discriminator is used to improve the generator and is dropped at inference time. Our proposed approach differs in that we do not modify the generator parameters at training time, and instead use the discriminator at inference time to drive the generation towards human-like textual content.

The main contributions of this work can be summarized as follows:
1. we propose Discriminative Adversarial Search (DAS), a novel sequence decoding approach that alleviates the effects of exposure bias and optimizes on the data distribution itself rather than for external metrics;
2. we apply DAS to the abstractive summarization task, showing that even without the self-retraining procedure, our discriminated beam search procedure improves over the state of the art for various metrics;
3. we report further significant improvements when applying discriminator retraining;
4. finally, we show how the proposed approach can be effectively used for domain adaptation.
2. Related Work
Several research efforts have tackled the issue of exposure bias resulting from Teacher Forcing. Inspired by Venkatraman et al. (2015), Bengio et al. (2015) proposed a variation of Teacher Forcing wherein the ground-truth tokens are incrementally replaced by the predicted words. Further, Professor Forcing (Lamb et al., 2016) was devised as an adversarial approach in which the model learns to generate without distinction between training and inference time, when it no longer has access to the ground-truth tokens. Using automated metrics at coarser (sequence) rather than finer (token) granularity to optimize the model, Wiseman & Rush (2016) proposed a beam search variant to optimise the BLEU score in neural translation. Framing NLG as a Reinforcement Learning problem, Ranzato et al. (2015) used the metric to optimise as the reward. Paulus et al. (2017) applied a similar approach to abstractive summarization, using the ROUGE metric as a reward; the authors observed that, despite the successful application of reinforcement learning, higher ROUGE does not yield better models: other metrics for NLG are needed. Finally, Zhang et al. (2019) proposed to select, among the different decoded beams, the one obtaining the highest BLEU score, and then to fine-tune the model on that sequence.
Recent works have applied text classifiers as discriminators for different NLG tasks. Kryscinski et al. (2019) used them to detect factual consistency in the context of abstractive summarization; Zellers et al. (2019) applied discriminators to detect fake news in a news generation scenario, reporting high accuracy (over 90%). Recently, Clark et al. (2019) proposed to train encoders as discriminators rather than as language models, as an alternative to BERT (Devlin et al., 2019); they obtained better performance while also improving in terms of training time. Closest to our work, Chen et al. (2020) leverage discriminators to improve unconditional text generation, following the work of Gabriel et al. (2019) on summarization. Abstractive summarization systems tend to be too extractive (Kryściński et al., 2018), mainly because of the copy mechanism (Vinyals et al., 2015). To improve the abstractiveness of the generated outputs, Gehrmann et al. (2018) proposed to train a classifier to detect which words from the input could be copied, and applied it as a filter during inference: to some extent, our work can be seen as a generalisation of this approach.
Beam search is the de-facto algorithm used to decode generated sequences of text in NLG tasks. This decoding strategy selects the sequence with the highest probability, offering more flexibility than a greedy approach. Beam search has contributed to performance improvements of state-of-the-art models for many tasks, such as Neural Machine Translation, Summarization, and Question Generation (Ott et al., 2018; Dong et al., 2019). However, external rules are usually added to further constrain the generation, like the filtering mechanism for copy described above (Gehrmann et al., 2018) or the inclusion of a length penalty factor (Wu et al., 2016). Hokamp & Liu (2017) reported improvements when adding lexical constraints to beam search. Observing that neural models are prone to repetitions, while human-produced summaries contain more than 99% unique 3-grams, Paulus et al. (2017) introduced a rule in the beam search forbidding the repetition of 3-grams.
Table 1. Statistics of the CNN/DM and TL;DR summarization datasets. We report lengths in tokens for source articles (len src) and summaries (len tgt). Abstractiveness (abstr.) is the percentage of tokens in the target summary that are not present in the source article.

            len src   len tgt   abstr. (%)
  CNN/DM     810.69     61.04      10.23
  TL;DR      248.95     30.71      36.88
Whether trained from scratch (Paulus et al., 2017; Gehrmann et al., 2018) or based on pre-trained language models (Dong et al., 2019), the current state-of-the-art results in abstractive summarization have been achieved using length penalty and 3-gram repetition avoidance.
3. Datasets
While the proposed approach is applicable to any Natural Language Generation (NLG) task, we focus on Abstractive Summarization in this study. One of the most popular datasets for summarization is the CNN/Daily Mail (CNN/DM) dataset (Hermann et al., 2015; Nallapati et al., 2016). It is composed of news articles paired with multi-sentence summaries. The summaries were written by professional writers and consist of several bullet points covering the important information present in the paired articles. For fair comparison, we used the exact same dataset version as previous works (See et al., 2017; Gehrmann et al., 2018; Dong et al., 2019), publicly available at https://github.com/microsoft/unilm.

Furthermore, to assess the possible benefits of the proposed approach in a domain adaptation setup, we conduct experiments on TL;DR, a large-scale summarization dataset built from social media data (Völske et al., 2017). We choose this dataset for two main reasons: first, its data is relatively out-of-domain compared to the samples in CNN/DM; second, its characteristics are also quite different: compared to CNN/DM, the TL;DR summaries are half as long and three times more abstractive, as detailed in Table 1. The training set is composed of around 3M examples and is publicly available, while the test set is kept hidden because of an ongoing public leaderboard evaluation. Hence, we randomly sampled 100k examples for training, 5k for validation and 5k for test. For reproducibility purposes, we make the TL;DR split used in this work publicly available at https://zenodo.org/record/1168855.
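To make the abstractiveness statistic of Table 1 concrete, a minimal sketch follows; the whitespace tokenization and lower-casing are illustrative assumptions, not necessarily the preprocessing actually used to compute the table.

```python
def abstractiveness(source: str, summary: str) -> float:
    """Percentage of summary tokens that do not appear in the source.

    Whitespace tokenization is an illustrative simplification; the exact
    tokenizer behind Table 1 is an assumption here.
    """
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    novel = sum(1 for tok in summary_tokens if tok not in source_tokens)
    return 100.0 * novel / len(summary_tokens)

# Example: 2 of the 4 summary tokens are absent from the source -> 50%.
print(abstractiveness("the cat sat on the mat", "cat rests on grass"))
```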
4. Discriminative Adversarial Search
The proposed model is composed of a generator G coupled with a sequential discriminator D: at inference time, for every new token generated by G, the score and the label assigned by D are used to refine the probabilities, within a beam search, so as to select the top candidate sequences. While the proposed approach is applicable to any Natural Language Generation (NLG) task, we focus here on Abstractive Summarization.

Generator
Abstractive summarization is usually cast as a sequence-to-sequence task:

$$P_\gamma(y \mid x) = \prod_{t=1}^{|y|} P_\gamma(y_t \mid x, y_{1:t-1}) \qquad (1)$$

where x is the input text, y is the summary composed of tokens $y_1 \ldots y_{|y|}$, and $\gamma$ represents the parameters of the generator. Under this framework, an abstractive summarizer is thus trained using article (x) and summary (y) pairs (e.g., via log-likelihood maximization).
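As an illustration, Eq. 1 can be evaluated in log space as follows; the `next_token_log_probs` callable standing in for the trained generator is a hypothetical placeholder, not the authors' API.

```python
from typing import Callable, Dict, List

def sequence_log_prob(
    x: str,
    y: List[str],
    next_token_log_probs: Callable[[str, List[str]], Dict[str, float]],
) -> float:
    """log P_gamma(y | x) = sum over t of log P_gamma(y_t | x, y_{1:t-1}),
    i.e. Eq. 1 evaluated in log space."""
    total = 0.0
    for t, token in enumerate(y):
        # The generator conditions on the source x and the prefix y_{1:t-1}.
        log_dist = next_token_log_probs(x, y[:t])
        total += log_dist[token]
    return total
```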
Discriminator

The objective of the discriminator is to label a sequence y as being human-produced or machine-generated. We use the discriminator to obtain a label at each generation step, rather than only for the entire generated sequence. For simplicity, we cast the problem as sequence to sequence, with a slight modification with respect to our generator: at each generation step, the discriminator, instead of predicting the next token among the entire vocabulary V, outputs the probability that the input summary was generated by a human.

Learning the neural discriminator $D_\delta$, with parameters $\delta$, corresponds to the following logistic regression problem:

$$\frac{1}{|H|} \sum_{(x,y) \in H} \log D_\delta(x, y) + \frac{1}{|G|} \sum_{(x,y) \in G} \log\big(1 - D_\delta(x, y)\big) \qquad (2)$$

where H and G are sets of pairs (x, y) associating each text $x \in X$ to be summarized with any sub-sequence y (from the start to any token index t), taken respectively from ground-truth summaries and from generated ones:

$$H = \{(x, y_{1:t}) \mid x \in X \wedge y \in H(x) \wedge t \le |y|\}$$
$$G = \{(x, y_{1:t}) \mid x \in X \wedge y \in G(x) \wedge t \le |y|\}$$

where $x \in X$ is a text from the training set X, and H(x) and G(x) respectively correspond to the associated human-written summary and to a set of generated summaries for text x. We refer to $D_\delta$ as a sequential discriminator, since it learns to discriminate any partial sequence (up to the t tokens generated at step t) of any summary y. We cut all summaries to T = 140 tokens if longer, consistently with previous works (Dong et al., 2019).
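The following sketch illustrates how the prefix sets H and G of Eq. 2 can be materialized and the logistic objective evaluated; `disc_prob` is an assumed stand-in for $D_\delta$, and the plain-Python form is a simplification of a batched, tokenized implementation.

```python
import math
from typing import Callable, List, Tuple

def prefix_examples(x: str, y: List[str], label: int,
                    max_len: int = 140) -> List[Tuple[str, List[str], int]]:
    """Expand one (article, summary) pair into all (x, y_{1:t}) prefixes,
    truncating summaries to T = 140 tokens as in the text."""
    tokens = y[:max_len]
    return [(x, tokens[:t], label) for t in range(1, len(tokens) + 1)]

def discriminator_objective(batch: List[Tuple[str, List[str], int]],
                            disc_prob: Callable[[str, List[str]], float]) -> float:
    """Eq. 2: mean log D over human prefixes plus mean log(1 - D) over
    generated ones. disc_prob(x, prefix) is assumed to return
    P(human | x, prefix)."""
    human = [disc_prob(x, p) for x, p, label in batch if label == 1]
    generated = [disc_prob(x, p) for x, p, label in batch if label == 0]
    objective = 0.0
    if human:
        objective += sum(math.log(d) for d in human) / len(human)
    if generated:
        objective += sum(math.log(1.0 - d) for d in generated) / len(generated)
    return objective  # maximized during training (minimize its negation)
```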
At inference time, the aim is usually to maximize the probability of the output y according to the generator (Eq. 1): the best candidate sequence is the one that maximizes $P_\gamma(y \mid x)$. The beam search procedure is a greedy process that iteratively constructs sequences from $y_1$ to $y_n$, while maintaining at each step a pool of the B best hypotheses generated so far, to allow exploration (when B > 1). At each step t, the process assigns a score, for every sub-sequence $y_{1:t-1}$ from the pool, to every candidate new token $y_t$ from the vocabulary V:

$$S_{gen}(\hat{y}) = \log P_\gamma(y_{1:t-1} \mid x) + \log P_\gamma(y_t \mid x, y_{1:t-1}) \qquad (3)$$

where $\hat{y}$ results from the concatenation of a new token $y_t$ at the end of a sequence $y_{1:t-1}$. The B sequences $\hat{y}$ with the best $S_{gen}(\hat{y})$ scores are kept to form the pool of hypotheses at the next step. Finally, when all sequences in the pool are ended sequences (with the end token $ as the last token $\hat{y}_{-1}$), the one with the best $S_{gen}$ score is returned. The beam size B is a hyper-parameter which controls the exploration and the complexity of the process; it ranges between 1 and 5 in the literature.

Algorithm 1. DAS: a beam search algorithm with the proposed discriminator re-ranking mechanism highlighted.
Require: B, T, K_rerank, α
    C ← {Start-Of-Sentence}
    for t = 1, ..., T do
        C ← {ŷ | (ŷ_{1:t-1} ∈ C ∧ ŷ_t ∈ V) ∨ (ŷ ∈ C ∧ ŷ_{-1} = $)}
        # K_rerank sequences with top S_gen:
        C ← argmax over C̃ ⊆ C, |C̃| = K_rerank of Σ_{ŷ ∈ C̃} S_gen(ŷ)
        # B sequences with top S_DAS:
        C ← argmax over C̃ ⊆ C, |C̃| = B of Σ_{ŷ ∈ C̃} S_DAS(ŷ)
        if only ended sequences in C then
            return C
        end if
    end for

In our method, we propose a new score $S_{DAS}$ which refines the score $S_{gen}$ during the beam search w.r.t. the log-probability of the discriminator:

$$S_{DAS}(\hat{y}) = S_{gen}(\hat{y}) + \alpha \times S_{dis}(\hat{y}) \qquad (4)$$

where $S_{dis}(\hat{y}) = \log D_\delta(x, \hat{y})$ is the discriminator log-probability that the sequence $\hat{y}$ is human-written, and $\alpha \ge 0$ is used as a weighting factor. While theoretically such scores could be computed for the entire vocabulary, in practice applying the discriminator to all of the $|V| \times B$ candidate sequences $\hat{y}$ at every step t would be too time-consuming. For complexity reasons, we thus limit the re-ranking to the pool of the K_rerank sequences with the best $S_{gen}(\hat{y})$ scores, as detailed in Algorithm 1.
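A minimal sketch of one iteration of the loop in Algorithm 1; the `s_gen` and `s_dis` scorers (Eq. 3 and the discriminator log-probability) are assumed to be provided, and a real decoder would operate on batched token-id tensors rather than Python tuples.

```python
import heapq
from typing import Callable, List, Tuple

END = "$"  # end-of-sequence token, as in Algorithm 1

def das_step(
    pool: List[Tuple[str, ...]],                  # current hypotheses
    vocab: List[str],
    s_gen: Callable[[Tuple[str, ...]], float],    # generator score, Eq. 3
    s_dis: Callable[[Tuple[str, ...]], float],    # log D_delta(x, y_hat)
    beam_size: int,
    k_rerank: int,
    alpha: float,
) -> List[Tuple[str, ...]]:
    """One iteration of the loop in Algorithm 1 (a sketch)."""
    # Ended hypotheses are carried over; the others are expanded with V.
    candidates = [h for h in pool if h and h[-1] == END]
    candidates += [h + (tok,) for h in pool
                   if not (h and h[-1] == END) for tok in vocab]
    # Keep the K_rerank candidates with the best S_gen ...
    top = heapq.nlargest(k_rerank, candidates, key=s_gen)
    # ... then keep the B best under S_DAS = S_gen + alpha * S_dis (Eq. 4).
    return heapq.nlargest(beam_size, top,
                          key=lambda h: s_gen(h) + alpha * s_dis(h))
```

Iterating this step until every hypothesis ends with $ (or the length budget T is reached), and returning the best remaining hypothesis, reproduces the full search.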
Figure 1. DAS self-training procedure: the generated examples are improved by the discriminator, and then fed back to the discriminator in a self-training loop.

Under the proposed paradigm, as mentioned in Section 1, the discriminator can be fine-tuned on the outputs generated through the re-ranking process, so as to match the new generation distribution. Inspired by the GAN paradigm, we iteratively retrain the discriminator on the new predictions until convergence. We detail the full procedure in Figure 1, where the discriminator is retrained iteratively following Equation 2 at each step: the set of generated summaries G used in Equation 2 to train the discriminator at step t+1 corresponds to the outputs of our DAS process using the discriminator at step t. This allows each step to use a discriminator that attempts to correct the output distributions of the previous step, so as to incrementally converge towards realistic distributions (w.r.t. human summaries), without requiring any retraining of the generator.
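The retraining loop of Figure 1 can be sketched as follows; `das_decode` and `train_discriminator` are hypothetical callables standing in for Algorithm 1 and for the optimization of Eq. 2, and the fixed round count is an assumption (the paper iterates until convergence).

```python
def das_self_training(articles, human_summaries, discriminator,
                      das_decode, train_discriminator, n_rounds=5):
    """Figure 1 loop: the generator is never updated; only the
    discriminator is iteratively refit on the outputs it helped produce."""
    for _ in range(n_rounds):
        # Summaries decoded by DAS with the current discriminator
        # (Algorithm 1); they form the generated set G of Eq. 2.
        generated = [das_decode(x, discriminator) for x in articles]
        # Refit the discriminator on human vs. newly generated summaries.
        discriminator = train_discriminator(articles, human_summaries,
                                            generated)
    return discriminator
```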
5. Experimental Protocol
Generator
We build upon the Unified Language Model for natural language understanding and generation (UniLM) proposed by Dong et al. (2019); it is the current state-of-the-art model for summarization. This model can be described as a Transformer (Vaswani et al., 2017) whose weights are first initialised from BERT. However, BERT is an encoder trained with bi-directional self-attention: it can be used for Natural Language Understanding (NLU) tasks, but not directly for generation (NLG). Dong et al. (2019) proposed to unify it for NLU and NLG by resuming its training, this time with a unidirectional loss; after this step, the model can be directly fine-tuned on any NLG task. Code and models are available at https://github.com/microsoft/unilm.

For our ablation experiments, to save time and computation, we do not use UniLM (345 million parameters). Instead, we follow the approach proposed by the authors (Dong et al., 2019), with the difference that we start from BERT-base (110 million parameters) and do not extend the pre-training. We observed only little degradation compared to starting from UniLM. We refer to this smaller model as BERT-gen. For our final results, we instead use the exact same UniLM checkpoint made publicly available by Dong et al. (2019) for Abstractive Summarization.
Discriminator
As detailed in Section 4, the discriminator model is also based on a sequence-to-sequence architecture. We can thus again use BERT-gen, initializing it in the same way as the generator. The training data from CNN/DM is used to train the model; for each sample, the discriminator has access to two training examples: the human reference and a generated summary. Hence, the full training data available to the discriminator amounts to roughly twice the number of CNN/DM training pairs. However, as detailed in the following Section, the discriminator does not need a lot of data to achieve high accuracy. We therefore only used 150k training examples, split into 90% for training, 5% for validation and 5% for test. Unless otherwise specified, this data is only used to train and evaluate the discriminator.
All models are implemented in PyText (Aly et al., 2018). For all our experiments we used a single RTX 2080 Ti GPU.

To train the discriminator, we used the Adam optimiser with the learning rate and hyper-parameters recommended for BERT, a batch size of 4, and gradient accumulation up to an effective batch size of 32. We trained it for 5 epochs; each epoch took 100 minutes on 150k samples.

During discriminator retraining, the generator is needed and thus additional memory is required: all else being equal, we decreased the batch size to 2. The self-training process takes one epoch to converge, in about 500 minutes: 200 minutes for training the discriminator and 300 minutes to generate the summaries with the search procedure described in Algorithm 1.

Metrics
The evaluation of NLG models remains an open research question. Most previous works report n-gram based metrics such as ROUGE (Lin, 2004) or BLEU (Papineni et al., 2002). ROUGE-n is a recall-oriented metric counting the percentage of n-grams in the gold summaries that are present in the evaluated summary; conversely, BLEU is precision-oriented.

However, as discussed in Section 1, these metrics do not correlate sufficiently with human judgments. For summarization, Louis & Nenkova (2013) showed how this issue
gets even more relevant when few gold references are given. Unfortunately, annotating a large-scale dataset is not realistic: in practice, all the large-scale summarization datasets rely on web scraping to gather text-summary pairs.

Figure 2. Accuracy of two discriminators: one is given access to the source context x while the other is not. The x-axis corresponds to the length of the discriminated sub-sequences.

For these reasons, See et al. (2017) suggested to systematically compare summarization systems with other measures, such as novelty and the number of repetitions. Following the authors' recommendation, we report the following measures for all our experiments: i) Novelty (nov-n), the percentage of novel n-grams w.r.t. the source text, indicating the abstractiveness of a system; ii)
Repetition (rep-n), the percentage of n-grams that occur more than once in the summary; and iii) Length (len), the length of the summary in tokens.

It is important to note that the objective is not to maximize these measures, but to minimize the difference w.r.t. human-quality summaries. Hence, for any measure m above, we report the difference $\Delta m = m_{human} - m_{model}$.
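A possible implementation of these measures is sketched below; whitespace-tokenized inputs and the exact counting conventions are our assumptions, not a specification from the paper.

```python
from collections import Counter
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(source: List[str], summary: List[str], n: int) -> float:
    """nov-n: percentage of summary n-grams absent from the source."""
    grams = ngrams(summary, n)
    source_grams = set(ngrams(source, n))
    return 100.0 * sum(g not in source_grams for g in grams) / max(len(grams), 1)

def repetition(summary: List[str], n: int) -> float:
    """rep-n: percentage of summary n-grams occurring more than once."""
    counts = Counter(ngrams(summary, n))
    total = max(sum(counts.values()), 1)
    return 100.0 * sum(c for c in counts.values() if c > 1) / total

def delta(m_human: float, m_model: float) -> float:
    """Reported difference w.r.t. human summaries; closer to 0 is better."""
    return m_human - m_model
```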
6. Preliminary study
High discriminator accuracy is of utmost importance for DAS to improve the decoding search. In Fig. 2 we plot the discriminator accuracy against the generation step t, with t corresponding to the prediction for the partial summary sequence $y_1, \ldots, y_t$ (see Eq. 2). As an ablation, we report the accuracy of a discriminator which is not given access to the source article x.

As one would expect, the scores improve with higher t, from 65% for t = 1 to 98% for t = 140: the longer the evaluated sub-sequence $y_1, \ldots, y_t$, the easier it is to discriminate. This observed high accuracy indicates the potential benefit of using the discriminator signal to improve the generated summaries.
Table 2. Scores (BLEU, Δnov, Δlen, Δrep) obtained with varying K_rerank, for DAS-single and DAS-retrain; K_rerank = 1 corresponds to the BERT-gen baseline, which obtains a BLEU of 27.70.

When trained without access to the source article x (orange plot), the discriminator has little contextual and semantic information at its disposal, and its accuracy is lower than that of a discriminator with access to x. In Fig. 2, the shaded area between the two curves represents the discrimination performance improvement attributable to the source article x. It increases up to around t = 60 and starts shrinking afterwards. After t = 60, corresponding to the average length of the human summaries (see Table 1), the performance of the discriminator without context quickly increases, indicating that the generated sequences contain relatively easy-to-spot mistakes. This might be due to the increased difficulty, for the generator, of producing longer and still correct sequences, as errors may accumulate over time.

Impact of K_rerank and α

To assess the behavior of DAS, we conducted experiments with
BERT-gen for both the generator and the discriminator, using different values for α and K_rerank. All models are trained using the same training data from CNN/DM, and the figures reported in Tables 2 and 3 are evaluation results averaged across 3 runs on three different subsets (of size 1k) randomly sampled from the validation split. We compare: i) BERT-gen, i.e. the model without a discriminator; ii) DAS-single, where the discriminator is not self-retrained; and iii)
DAS-retrain, where the discriminator is iteratively retrained. As previously mentioned, for the repetition, novelty and length measures we report the difference w.r.t. human summaries: the closer to 0 the better, with 0 indicating no difference w.r.t. humans.

The parameter K_rerank corresponds to the number of possibilities explored by the discriminator (see Sec. 4.1). With K_rerank = 1, no re-ranking is performed, and the model is equivalent to BERT-gen.
Table 3. Scores (BLEU, Δnov, Δlen, Δrep) obtained with varying α, for DAS-single and DAS-retrain; α = 0 corresponds to the BERT-gen baseline.

In Table 2, for which α is held fixed, we observe that both increasing K_rerank and retraining the discriminator help to better fit the human distribution (i.e. lower Δ): compared to BERT-gen, the DAS models generate more novel words, are shorter and less repetitive, and also obtain performance gains in terms of BLEU.

Further, in Table 3 we report results for DAS models with a fixed K_rerank = 10, while varying α. The parameter α controls the impact of the discriminator predictions when selecting the next token to generate (see Eq. 4).

With α = 0, the discriminator is deactivated and only the generator probabilities S_gen are used (corresponding to Eq. 1): the model is effectively equivalent to BERT-gen. Consistently with the results obtained when varying K_rerank, we observe DAS-retrain > DAS-single > BERT-gen for α ≠ 5. However, when α = 5, BLEU scores decrease. This could indicate that a limit is reached: the higher the α, the more the discriminator influences the selection of the next word. With α = 5, the generated sequences drift too far from the generator's top probabilities: tokens selected at step t no longer lead to useful sequences among the best K_rerank candidates at the following steps, and the generation process struggles with sequences too far from what was seen during training.
Table 4. Results on the CNN/DM test set for previous works as well as our proposed models.

                      Δlen     Δnov-1   Δnov-3   Δrep-1   Δrep-3   R1      RL      B1
  See et al.            -         -        -        -        -     36.38   34.24    -
  Gehrmann et al.       -         -        -        -        -     41.22   38.34    -
  Kryściński et al.     -       10.10    32.84      -        -     40.19   37.52    -
  UniLM              -40.37      8.35     7.98   -27.99      -       -       -      -
  DAS-retrain        -16.81     -2.40      -        -        -       -       -      -

Table 5. Results on the TL;DR test set for our proposed model in a transfer learning scenario.

                      Δlen     Δnov-1   Δnov-3   Δrep-1   Δrep-3   R1      RL      B1
  UniLM              -12.11     27.16     5.49    -6.87      -       -       -      -
  DAS-retrain         -2.72     19.05     1.01    -3.42    -1.33     -       -      -
7. Results and discussion
In our preliminary study, the best performing DAS configuration was found with K_rerank = 10 and α = 1. We apply this configuration in our main experiments, for fair comparison, using the state-of-the-art UniLM model checkpoint as publicly released by the authors.

Results on the CNN/DM test set are reported in Table 4. Confirming our preliminary study, DAS compares favorably to previous works on all the metrics. Compared to UniLM, we can observe that both DAS-single and DAS-retrain are closer to the target data distribution: they significantly reduce the gap with human-produced summaries over all metrics. The generated summaries are, on average, 16.81 tokens longer than the human ones, as opposed to a difference of 40.37 tokens for UniLM (45.57 without the length penalty). DAS-retrain is also more abstractive, averaging only 2.59 novel 3-grams less than the human summaries, as opposed to 7.98 for UniLM. Notably, the proposed approach also outperforms Kryściński et al. (2018) in terms of novelty, although their model was trained with novelty as a reward in a reinforcement learning setup. UniLM applies a 3-gram repetition avoidance rule, which is why it generates even fewer 3-gram repetitions than human summaries; without this post-hoc rule, DAS-retrain generation is less repetitive than UniLM. Incidentally, our approach also outperforms previous works and achieves, to the best of our knowledge, a new state of the art for ROUGE.
Domain Adaptation
Further, in Table 5, we explore a domain adaptation scenario, applying DAS-retrain on a second dataset, TL;DR. This dataset is built from social media data, as opposed to the news articles of CNN/DM, and differs from the latter in several respects, as described in Section 3. In this scenario, we keep the previously used generator (i.e. the
Figure 3. Learning curves for discriminators trained on TL;DR with 1k, 10k and 100k examples. The x-axis corresponds to the length of the discriminated sub-sequences.
UniLM checkpoint trained on CNN/DM), and only train the discriminator on a subset of the TL;DR training samples. This setup has practical applications in scenarios where limited data is available: indeed, learning to generate is harder than learning to discriminate, and requires a large amount of examples (Gehrmann et al., 2018). A discriminator can be trained with relatively few samples: in Fig. 3 we show the learning curves for discriminators trained from scratch on TL;DR training subsets of varying size. The samples are balanced: a training set size of 10k means that 5k gold summaries are used, along with 5k generated ones. We observe that only 1k examples allow the discriminator to reach an accuracy of 82.5% at step t = 1. This score, higher than the one obtained on CNN/DM (see Fig. 2), is due to the relatively lower quality of the out-of-domain generator outputs, which makes the task easier for the discriminator.
Figure 4.
Vocabulary frequency for the k = 100 most frequent words generated by the models, for CNN/DM (left) and TL;DR (right). t0.000.020.040.060.080.100.120.14 % - g r a m s r e p e t i t i o n s HumanUniLM (no rules)DAS-singleDAS-retrain
Figure 5. Distribution of 3-gram repetitions over their position t in the sequence (CNN/DM data).

The results on TL;DR (Table 5) show larger improvements of DAS-retrain over UniLM than those observed on CNN/DM. Due to the high accuracy of the discriminator, the generated summaries are only 2.72 tokens shorter than the human ones, as opposed to 12.11 for UniLM. They also contain more novelty and fewer repetitions. In terms of ROUGE and BLEU, DAS-retrain also compares favorably, with the exception of ROUGE-L. This might be due to the shorter length of DAS-retrain summaries compared to UniLM: ROUGE is a recall-oriented metric, and ROUGE-L is computed over the longest common sub-sequence w.r.t. the ground truth.

Models participating in the public TL;DR leaderboard (https://tldr.webis.de/) are omitted here, since they are trained on TL;DR data and evaluated on a hidden test set. Nonetheless, assuming that the distribution of our sampled test set is similar to that of the official test set, we observe that our approach obtains performance comparable to the state of the art, under a domain-adaptation setup and using only 1k training examples (exclusively for the discriminator) out of an available training set of 3M examples.

Discussion
In Fig. 4 we report the word frequency distributions for the different models and the human summaries. We observe that DAS-retrain comes closest to the human distribution, followed by DAS-single, and significantly outperforms UniLM. This shows the benefit of DAS at inference time in producing relatively more human-like summaries. Further, the distribution of 3-gram repetitions across their relative position in the sequence (Fig. 5) shows that the gap between UniLM and humans grows faster than that between DAS-retrain and humans, indicating that our approach contributes to reducing the exposure bias effect. Rather than exclusively targeting exposure bias (as in Scheduled Sampling or Professor Forcing), or relying on automatic metrics as in reinforcement learning approaches, we optimize towards a discriminator instead of discrete metrics: besides reducing the exposure bias issue, this allows improvements along the other aspects captured by a discriminator.
8. Conclusion
We introduced a novel sequence decoding approach, which directly optimizes on the data distribution rather than on external metrics. Applied to Abstractive Summarization, the distribution of the generated sequences is found to be closer to that of human-written summaries over several measures, while also obtaining improvements over the state of the art. We reported extensive ablation analyses, and showed the benefits of our approach in a domain-adaptation setup. Importantly, all these improvements are obtained without any costly generator retraining. In future work, we plan to apply DAS to other tasks such as machine translation and dialogue systems.
References
Aly, A., Lakhotia, K., Zhao, S., Mohit, M., Oguz, B., Arora, A., Gupta, S., Dewan, C., Nelson-Lindall, S., and Shah, R. PyText: A seamless path from NLP research to production. arXiv preprint arXiv:1812.08729, 2018.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171-1179, 2015.

Böhm, F., Gao, Y., Meyer, C. M., Shapira, O., Dagan, I., and Gurevych, I. Better rewards yield better summaries: Learning to summarise without references. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3101-3111, 2019.

Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., and Charlin, L. Language GANs falling short. arXiv preprint arXiv:1811.02549, 2018.

Chen, X., Cai, P., Jin, P., Wang, H., Dai, X., and Chen, J. A discriminator improves unconditional text generation without updating the generator. arXiv preprint arXiv:2004.02135, 2020.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, 2019.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13042-13054, 2019.

Gabriel, S., Bosselut, A., Holtzman, A., Lo, K., Celikyilmaz, A., and Choi, Y. Cooperative generator-discriminator networks for abstractive summarization with narrative flow. arXiv preprint arXiv:1907.01272, 2019.

Gehrmann, S., Deng, Y., and Rush, A. M. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693-1701, 2015.

Hokamp, C. and Liu, Q. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1535-1546, 2017.

Kryściński, W., Paulus, R., Xiong, C., and Socher, R. Improving abstraction in text summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1808-1817, 2018.

Kryscinski, W., McCann, B., Xiong, C., and Socher, R. Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840, 2019.

Lamb, A. M., Goyal, A. G. A. P., Zhang, Y., Zhang, S., Courville, A. C., and Bengio, Y. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601-4609, 2016.

Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74-81, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Louis, A. and Nenkova, A. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267-300, 2013.

Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023, 2016.

Novikova, J., Dušek, O., Cercas Curry, A., and Rieser, V. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2241-2252, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D17-1238.

Ott, M., Auli, M., Grangier, D., and Ranzato, M. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047, 2018.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, 2002.

Paulus, R., Xiong, C., and Socher, R. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.

Scialom, T., Lamprier, S., Piwowarski, B., and Staiano, J. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3237-3247, 2019.

See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368, 2017.

Sulem, E., Abend, O., and Rappoport, A. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 738-744, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D18-1081.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692-2700, 2015.

Völske, M., Potthast, M., Syed, S., and Stein, B. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pp. 59-63, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W17-4508.

Williams, R. J. and Zipser, D. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.

Wiseman, S. and Rush, A. M. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1296-1306, 2016.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. Defending against neural fake news. In Advances in Neural Information Processing Systems, pp. 9051-9062, 2019.

Zhang, W., Feng, Y., Meng, F., You, D., and Liu, Q. Bridging the gap between training and inference for neural machine translation. arXiv preprint arXiv:1906.02448, 2019.

Zhou, W., Ge, T., Xu, K., Wei, F., and Zhou, M. Self-adversarial learning with comparative discrimination for text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1l8L6EtDS.