Incremental Beam Manipulation for Natural Language Generation
James Hargreaves
The Trade Desk & University of Cambridge [email protected]
Andreas Vlachos
University of Cambridge [email protected]
Guy Emerson
University of Cambridge [email protected]
Abstract
The performance of natural language generation systems has improved substantially with modern neural networks. At test time they typically employ beam search to avoid locally optimal but globally suboptimal predictions. However, due to model errors, a larger beam size can lead to deteriorating performance according to the evaluation metric. For this reason, it is common to rerank the output of beam search, but this relies on beam search to produce a good set of hypotheses, which limits the potential gains. Other alternatives to beam search require changes to the training of the model, which restricts their applicability compared to beam search. This paper proposes incremental beam manipulation, i.e. reranking the hypotheses in the beam during decoding instead of only at the end. This way, hypotheses that are unlikely to lead to a good final output are discarded, and in their place hypotheses that would have been ignored will be considered instead. Applying incremental beam manipulation leads to an improvement of 1.93 and 5.82 BLEU points over vanilla beam search for the test sets of the E2E and WebNLG challenges respectively. The proposed method also outperformed a strong reranker by 1.04 BLEU points on the E2E challenge, while being on par with it on the WebNLG dataset.
1 Introduction

In natural language generation (NLG), the goal is to generate text representing structured information (e.g. a database record or a meaning representation) that is both fluent and contains the right information. Sequence-to-sequence (seq2seq) models have been effective on many tasks in NLG (for example: Wen et al., 2015; Dušek and Jurčíček, 2016). These systems first create an embedding for the input information. This embedding is used incrementally during decoding, generating one token at a time. Seq2seq models are generally decoded using beam search, to mitigate the effect of locally optimal but globally suboptimal decisions made by greedy search.

The performance of NLG systems can plateau or even decrease when beam sizes larger than 10 are used, which is counter-intuitive since larger beams produce more likely sequences according to the model. For example, Dušek and Jurčíček (2016) used a beam size of 10, and Asghar et al. (2017) found a size of 5 to be optimal. Decreasing performance has been found across a range of tasks (Cohen and Beck, 2019), and it was given by Koehn and Knowles (2017) as one of the six main challenges facing neural machine translation. To investigate this, Stahlberg and Byrne (2019) presented an exact search algorithm to find the most likely output according to a seq2seq model. However, this performed poorly compared to beam search, demonstrating that search errors (from beam search) can mask model errors (from the seq2seq model).

To mitigate the limitations of beam search, it is common practice to apply a reranker to the final set of hypotheses. This can be done by defining a reranking criterion (for example: Kumar and Byrne, 2004; Blain et al., 2017; Borgeaud and Emerson, 2020) or by training a reranker to predict the best hypothesis in a beam (for example: Dušek and Jurčíček, 2016; Agarwal et al., 2018).
Training a reranker allows us to take into account information from outside the model and mitigate model errors. However, rerankers can only choose a hypothesis from the final beam, which limits their potential. To quantify this, we trained the seq2seq model proposed by Dušek and Jurčíček (2016), and applied it to the E2E validation set (Novikova et al., 2017b). For each instance, we recorded the point at which all gold-standard references fell out of the beam, meaning that none of the partial hypotheses in the beam could be extended to a gold reference. A final beam containing at least one of the references would score optimally with an oracle reranker (providing an upper bound on performance).

Figure 1: The percentage of beams which contain a reference (orange), or which could still lead to a reference (blue), using the model of Dušek and Jurčíček (2016) with beam size 3 on the E2E validation set.

Figure 1 shows the results for beam size 3. The final beam contained a reference in only 60 out of 547 cases (11%). For the remaining 89% of the cases, even an oracle reranker would be unable to give optimal results. The figure also shows that in over half of the cases, all references fell out in the first 6 steps. In contrast, references that were still in the beam at step 15 were almost certain to stay in the beam until the end. These observations suggest that an early manipulation of the beam has a strong potential to improve performance.

In this paper, we propose a method for manipulating which items are pruned from the beam at each stage of decoding. We then present evidence that this is a successful approach: it led to an improvement of 1.93 and 5.82 BLEU points over vanilla beam search on the E2E and WebNLG challenges, respectively. When comparing to a strong reranker, the performance of incremental beam manipulation was similar on the WebNLG dataset, whilst increasing the performance on the E2E challenge by 1.04 points.
We also applied beam manipulation on top of length normalisation (Murray and Chiang, 2018), and incremental beam manipulation was able to improve its performance. For larger beam sizes, the same general trends were observed; see Appendix A for beam size 10.
2 Related Work

This paper is far from the first to try to improve beam search for natural language generation. One modification is to use a variable beam size instead of a fixed one (Freitag and Al-Onaizan, 2017). However, this can only improve decoding speed, as the ranking of the hypotheses in the beam remains unchanged, and thus model errors are exposed by the reduction of search errors.

Length normalisation (Murray and Chiang, 2018) is a widely used strategy that often improves the performance of a beam search decoder, by mitigating the fact that seq2seq models are biased towards generating shorter sequences. Rather than directly using model probabilities to order the hypotheses in the beam, each probability is normalised according to the length of the hypothesis, so that shorter hypotheses are penalised. However, this only has an impact once the hypotheses within the beam have different lengths. This only occurs towards the end of the decoding process, and we showed in the previous section that the reference hypotheses often fall out of the beam relatively early. Furthermore, Stahlberg and Byrne (2019) showed that the biases causing the deteriorating model performance are more complex than a simple length bias.

Wiseman and Rush (2016) modified the training procedure for seq2seq models. They ran beam search and introduced a loss each time the gold-standard sequence fell out of the beam. Goyal et al. (2018) and Collobert et al. (2019) also modified the training procedure. They added a term to the loss function that approximated the loss that the model would receive when generating using a beam search method for each example in the training set. However, one of the reasons that beam search has been so widely used is that it can be applied on top of a language model without changing the training procedure, and this is lost with these approaches.

Gu et al. (2017) manipulated the hidden state of the language model at each step of the decoding. This was achieved via a multi-output regressor that produced a vector that is added to the hidden state used in decoding. The regressor was trained via reinforcement learning, and the training signal was gathered by injecting unstructured noise into the hidden state. Chen et al. (2018) also manipulate the hidden state. For each training instance, they apply beam search and take the hypothesis with the highest BLEU score. The manipulator network is trained to encourage a greedy decoder to produce this output. Both of these approaches rely on inferring a better hidden state to be used in the decoding, which is not straightforward to define. We instead manipulate the hypotheses in the beam directly.

Finally, Negrinho et al. (2018) presented a framework for learning a beam search algorithm via imitation learning. This resulted in a beam-aware algorithm which was proved to have no-regret guarantees. While this paper makes a compelling argument for this method in theory, putting it into practice requires a number of further engineering decisions. Our work can be seen as a way of applying this general framework using a simple and computationally efficient roll-out strategy.
3 Incremental Beam Manipulation

In order to describe our method for incremental beam manipulation, we first introduce terminology to describe a standard beam search decoder. The decoder produces a sequence iteratively, token by token. At each iteration it performs three actions: expand, rank and prune. The expand step generates all possible next-step hypotheses. The rank step orders these hypotheses from most likely to least likely. The prune step then removes the hypotheses that are near the end of this order.

This formulation of the beam search algorithm enables us to view beam manipulation as a ranking problem, since the expand step is determined by the (fixed) decoder, and the chosen beam size determines the pruning. The rank step determines which hypotheses will not be kept in the next iteration and hence discarded. By modifying the ranking method used, we can choose the partial hypotheses expanded during beam search, taking into account the current state of the beam as well as signals beyond model scores.

It is worth noting that while this paper applies beam manipulation on top of a seq2seq model, the techniques used could be applied without change to any conditional or unconditional neural language model that can be decoded using beam search.
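The expand/rank/prune formulation can be made concrete with a minimal sketch of beam search with a pluggable rank function. This is not the paper's implementation; `step_scores` is a hypothetical stand-in for the seq2seq model's next-token log-probabilities, and the `rank` argument is the step that beam manipulation would replace.

```python
import math
from typing import Callable, List, Tuple

# A hypothesis is a (token_ids, log_prob) pair.
Hyp = Tuple[List[int], float]

def beam_search(step_scores: Callable[[List[int]], List[float]],
                rank: Callable[[List[Hyp]], List[Hyp]],
                beam_size: int, max_len: int, eos: int) -> List[Hyp]:
    beam: List[Hyp] = [([], 0.0)]
    for _ in range(max_len):
        # Expand: generate every one-token extension of every live hypothesis.
        expanded: List[Hyp] = []
        for toks, lp in beam:
            if toks and toks[-1] == eos:
                expanded.append((toks, lp))   # finished hypotheses pass through
                continue
            for tok, tok_lp in enumerate(step_scores(toks)):
                expanded.append((toks + [tok], lp + tok_lp))
        # Rank: any ordering criterion can be plugged in here -- this is the
        # step that incremental beam manipulation modifies at chosen iterations.
        ordered = rank(expanded)
        # Prune: keep only the top `beam_size` hypotheses.
        beam = ordered[:beam_size]
        if all(toks and toks[-1] == eos for toks, _ in beam):
            break
    return beam

# Vanilla beam search ranks purely by model log-probability:
def rank_by_model_score(hyps: List[Hyp]) -> List[Hyp]:
    return sorted(hyps, key=lambda h: h[1], reverse=True)

# Usage with a toy 3-token vocabulary (token 0 is EOS); the distribution
# below is invented purely for illustration:
def toy_scores(toks: List[int]) -> List[float]:
    if len(toks) < 2:
        return [math.log(0.1), math.log(0.6), math.log(0.3)]
    return [math.log(0.8), math.log(0.1), math.log(0.1)]

beam = beam_search(toy_scores, rank_by_model_score, beam_size=2, max_len=5, eos=0)
```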
3.1 Ranking Partial Hypotheses

Partial hypotheses are more difficult to rank than complete hypotheses, since the rest of the generated text is unknown. For example, consider the following partial hypotheses:

Loch Fyne is a restaurant located...
There is a family friendly...

Both of these convey some information about a family-friendly restaurant named ‘Loch Fyne’. It is hard to know which partial sequence will lead to a better complete sentence, which is what we would like a ranker to tell us. Existing rerankers often rely on detecting missing information, but some information may still be to come for partial hypotheses.

What we need is a way to rank partial hypotheses based on how the seq2seq model is likely to complete them. We propose ranking partial hypotheses based on a greedy roll-out. This is a computationally efficient approximation of how the seq2seq model might complete the partial hypothesis.

In the existing literature, roll-outs are generally used at training time (Chang et al., 2015), for the situation where the model's subsequent decisions influence the loss function for an individual decision. The roll-outs are used to produce an approximation to the final sequence that would be reached if the original action was taken. This enables a value for the loss of the original decision to be predicted.

On the other hand, incremental beam manipulation aims to predict which partial hypotheses will lead to good completed sequences. Similar to traditional roll-outs, this is impacted by the generating model's subsequent decisions. In this case, the difference is that roll-outs are used to provide features in addition to obtaining a training signal. It is also worth noting that for incremental beam manipulation we use roll-outs at test time as well as at training time.

Beam manipulation can be applied after any step in the beam search decoding. Figure 2 illustrates a single manipulation.
The roll-outs are used to produce approximations to the completed hypotheses that would be produced if the partial hypothesis remained in the beam. These completed sequences are then ranked to define an order of the partial sequences. Since this may result in different hypotheses remaining in the beam, the area of the search space considered during the decoding has been manipulated.

Figure 2: A single beam manipulation, showing the current beam, the next token and greedy roll-out of each partial hypothesis, and the resulting rank.
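A single manipulation of this kind can be sketched in a few lines. This is an illustrative sketch, not the paper's code: `step_scores` again stands in for the model's next-token log-probabilities, and `score_completion` is a hypothetical reranker that scores a completed sequence.

```python
from typing import Callable, List

def greedy_rollout(step_scores: Callable[[List[int]], List[float]],
                   partial: List[int], max_len: int, eos: int) -> List[int]:
    """Complete a partial hypothesis by repeatedly taking the argmax token."""
    toks = list(partial)
    while len(toks) < max_len and (not toks or toks[-1] != eos):
        scores = step_scores(toks)
        toks.append(max(range(len(scores)), key=scores.__getitem__))
    return toks

def manipulate_beam(step_scores: Callable[[List[int]], List[float]],
                    score_completion: Callable[[List[int]], float],
                    beam: List[List[int]], max_len: int, eos: int) -> List[List[int]]:
    """Reorder the partial hypotheses in `beam` by the reranker's score of
    their greedy completions (higher score = kept nearer the top)."""
    rolled = [(hyp, greedy_rollout(step_scores, hyp, max_len, eos)) for hyp in beam]
    rolled.sort(key=lambda pair: score_completion(pair[1]), reverse=True)
    return [hyp for hyp, _ in rolled]
```

Note that only the ordering of the partial hypotheses changes; the roll-outs themselves are discarded, and decoding continues from the reordered beam.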
3.2 Ranker Architecture

Incremental beam manipulation requires a method to rank completed hypotheses. There are many existing rerankers designed for this task, such as the TGEN reranker (Dušek and Jurčíček, 2016). However, these rerankers are unlikely to be effective when used in incremental beam manipulation. In the latter, the partial hypotheses need to be ranked according to their potential to produce good completed hypotheses. This is a related but different task to that of a traditional reranker, which aims to identify the hypotheses that will score best against some metric such as BLEU. Rerankers of completed hypotheses typically rely on input information missing from the output as the signal; however, this is not necessarily useful when reranking partial hypotheses. For example, it is more useful to identify partial hypotheses which are indicative of model failure at an early stage of decoding.

To rank the partial hypotheses via roll-outs as introduced in the previous section, we explored two commonly used techniques from the field of information retrieval: pointwise ranking and pairwise ranking (Liu, 2009). Pointwise approaches predict a numerical value for each item, and the items are ordered by sorting them according to this value. Pairwise approaches, given a pair of hypotheses, output which of them would rank more highly, and techniques such as the Copeland method (Saari and Merlin, 1996) are then used to produce a total ordering from these pairwise comparisons. In preliminary experiments, the pointwise approach outperformed the pairwise approach. These results are summarised in Appendix B. For the remainder of this paper, we will focus on the pointwise approach.

The inputs to the reranker were:

• The meaning representation (i.e. the structured information input for NLG). This was passed as a sequence.

• The generated text produced by the roll-out of the partial hypothesis. This was passed as a sequence surrounded by the start and end tokens.
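The Copeland method mentioned above can be sketched as follows. This is a generic illustration rather than anything from the paper: `prefer(a, b)` is a hypothetical pairwise ranker returning True when `a` should rank above `b`.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def copeland_order(items: List[T], prefer: Callable[[T, T], bool]) -> List[T]:
    """Produce a total order from pairwise comparisons: each pairwise win
    counts +1 and each loss -1 (equivalent to ordering by wins when every
    pair is compared); items with equal scores keep their original order."""
    score = [0] * len(items)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if prefer(items[i], items[j]):
                score[i] += 1
                score[j] -= 1
            else:
                score[j] += 1
                score[i] -= 1
    order = sorted(range(len(items)), key=lambda i: -score[i])
    return [items[i] for i in order]
```

A pointwise ranker, by contrast, needs only a single `sorted(items, key=score)` call, which is one practical attraction of the approach the paper adopts.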
3.3 Training the Ranker

We trained the reranker on completed hypotheses that were ranked from best to worst. The sequences were produced by generating text sequences using the seq2seq model. A beam search decoding, with a large beam size, was applied to each instance in the training set. The set of hypotheses present in the final beam of the search was ranked from best to worst and recorded.

The notion of the best hypothesis was simplified to the one that received the highest BLEU score (Papineni et al., 2002) against the manually written references. BLEU was chosen due to its wide adoption as an automatic evaluation measure, but any automatic metric could have been used in its place.

As discussed at the end of the previous section, we need the reranker to distinguish between hypotheses that should be pruned and those that should be kept. Furthermore, for the purpose of reranking, only relative differences in BLEU score matter, not absolute values. Therefore, when generating training data, hypotheses in the bottom section of the beam (according to BLEU) were assigned the target value −1, and the rest were assigned the target value 1. (In preliminary experiments, we held back a portion of the training set from the seq2seq model's training phase, so that the beam manipulation ranker was trained on outputs that the seq2seq model had not seen. However, the system was unable to recover from the lower performance of the seq2seq model.)

Similarly, it is only the differences in the reranker's scores that matter, and not the absolute values. Therefore, after applying the reranker to each hypothesis in a beam, we normalise the scores to have a mean of 0.

Using the normalised BLEU targets (−1 or 1) and normalised reranker scores (with a mean of 0), we use relative mean absolute error (RMAE) as the training objective, as shown in Equation (1), where b is the set of hypotheses in the beam, x is a hypothesis, x̂_ranker is the normalised score predicted by the reranker, and x̂_BLEU is the normalised target derived from the BLEU score ordering of the beam.
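The target assignment and the RMAE objective can be sketched as follows. This is our own minimal sketch: the function names are hypothetical, and we take the "bottom section" of the beam to be the bottom half, which is an assumption on our part.

```python
from typing import List

def bleu_targets(bleu_scores: List[float]) -> List[float]:
    """Assign target +1 to the top section of the beam by BLEU and -1 to
    the bottom section (here assumed to be the bottom half)."""
    order = sorted(range(len(bleu_scores)), key=lambda i: -bleu_scores[i])
    half = len(bleu_scores) // 2
    targets = [0.0] * len(bleu_scores)
    for rank, i in enumerate(order):
        targets[i] = 1.0 if rank < half else -1.0
    return targets

def rmae(ranker_scores: List[float], targets: List[float]) -> float:
    """Relative mean absolute error: normalise the ranker scores to mean 0,
    then sum absolute differences against the +/-1 BLEU targets (Eq. 1)."""
    mean = sum(ranker_scores) / len(ranker_scores)
    centred = [s - mean for s in ranker_scores]
    return sum(abs(c - t) for c, t in zip(centred, targets))
```

Because both the targets and the scores are normalised, the loss only penalises disagreement about the relative ordering of the beam, matching the observation that absolute values do not matter for reranking.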
RMAE(b) = Σ_{x ∈ b} | x̂_ranker − x̂_BLEU |    (1)

Several other relative loss functions have been shown to be successful in other situations (Zhang et al., 2019). In preliminary experiments, we evaluated a number of these, including log-cosh error and mean squared error, but they did not outperform RMAE.

3.4 When to Manipulate the Beam

In theory, it would be desirable to manipulate the beam at every step of hypothesis generation, but in practice, the difficulty of ranking partial hypotheses could limit its benefits. While manipulating the beam can avoid certain model errors, it might also introduce other errors, either from the greedy roll-out strategy or the reranker. Reranking at every step may compound such errors. Empirically, we found it was more effective to apply beam manipulation at some rather than all steps.

Choosing when to manipulate is thus an important decision. It is advisable to avoid manipulating the beam too early: not only is it harder to rank hypotheses with very few tokens, but it is also less likely to be beneficial. As shown in Figure 1, in the first few steps even a relatively small beam size can keep hypotheses that could lead to the reference outputs. On the other hand, it is also advisable not to manipulate too late: once hypotheses have fallen out of the beam, they cannot be put back in. As the optimal choice of when to manipulate the beam is dependent on the dataset and the model, we treat this as a hyperparameter to be tuned on the validation set.
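Scheduling the manipulation at selected steps only can be expressed as a thin wrapper around the decode loop. This is an abstract sketch under our own assumptions: `advance` stands in for one expand/rank/prune iteration, and `manipulate` for one roll-out-based reordering of the beam.

```python
from typing import Callable, List, Set

def decode_with_manipulation(advance: Callable[[List[List[int]]], List[List[int]]],
                             manipulate: Callable[[List[List[int]]], List[List[int]]],
                             manipulation_steps: Set[int],
                             beam: List[List[int]], max_len: int) -> List[List[int]]:
    """Run a decode loop, applying beam manipulation only at the chosen
    steps (a hyperparameter tuned on the validation set, e.g. {5, 10, 15, 20})."""
    for step in range(1, max_len + 1):
        beam = advance(beam)
        if step in manipulation_steps:
            # Reordering here changes which hypotheses survive later pruning.
            beam = manipulate(beam)
    return beam
```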
4 Experiments

In this section we present results on the E2E (Novikova et al., 2017b) and WebNLG (Gardent et al., 2017) challenges. We evaluate the systems using the BLEU implementation used in the original E2E challenge. For all the experiments reported, we use the seq2seq architecture from Dušek and Jurčíček (2016) as the underlying model that we are trying to manipulate.

It is well known that BLEU is not a completely reliable method for predicting human perceptions of the quality of individual NLG outputs (for example: Callison-Burch et al., 2006; Novikova et al., 2017a). However, in this case, we are comparing outputs from variants of the same system, and thus BLEU is more likely to provide reasonable estimates of their felicity to the references, as argued by both Callison-Burch et al. and Novikova et al.

To support the idea behind our approach, i.e. manipulating the beam during decoding instead of only at the end, we compare against existing rerankers applied to the beam at the end of decoding. These include the TGEN reranker, proposed by Dušek and Jurčíček (2016), which has achieved state-of-the-art BLEU scores on NLG tasks, as well as the reranker architecture defined in Section 3.2. Both architectures are trained to perform reranking of the final beam only. When comparing between these two methods of reranking the final beam, no significant difference in performance was found. In this section, we report results for the architecture defined in Section 3.2.
For completeness, results for both rerankers are included in Appendix C. By using the same architecture both for the final-beam reranker and for beam manipulation, we can be more confident that any difference in results is due to the beam manipulation strategy (reranking during decoding, not just at the end).

In our experiments, we also consider length normalisation (Murray and Chiang, 2018), a well-known technique that often increases the BLEU score of the resulting sequences, since it addresses the bias of language models to favour short sequences, as discussed in Section 2. Although the values assigned to a sequence when using length normalisation are no longer probabilities, the decoder still performs the expand, rank and prune steps at each iteration. Hence, we can apply beam manipulation in tandem with length normalisation. Finally, we also considered nucleus sampling (Holtzman et al., 2020) as a baseline. However, it was found to decrease performance even when compared to vanilla beam search.

In what follows, we will not only comment on test results when the beam size is tuned on the validation sets, but also on test results across all beam sizes. The reason for doing this is that considering all beam sizes assesses whether the technique is robust to changes in beam size. In our opinion, this makes for a more convincing result than just indicating a difference in performance at a single beam size. This is especially pertinent since a well-documented issue of beam search is that larger beam sizes can lead to deteriorating performance. The results table in Appendix C indicates the optimal beam size on the validation set for each of the systems.

The BLEU implementation is available at https://github.com/tuetschek/e2e-metrics. The source code for this paper is at https://github.com/jamesHargreaves12/incremental_beam_manipulation.

Figure 4: Results on the test set of the E2E challenge. LN = Length Normalisation.
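Because length normalisation only changes the rank step, it slots into the same expand/rank/prune view. A minimal sketch of the normalised ranking (our illustration; the exponent `alpha` generalises plain per-token normalisation, and Murray and Chiang (2018) discuss tuned variants):

```python
from typing import List, Tuple

def length_normalised_rank(hyps: List[Tuple[List[int], float]],
                           alpha: float = 1.0) -> List[Tuple[List[int], float]]:
    """Rank (tokens, log_prob) hypotheses by log-probability divided by
    length**alpha rather than by raw log-probability, so that shorter
    sequences are penalised. alpha = 1.0 is plain per-token normalisation."""
    return sorted(hyps,
                  key=lambda h: h[1] / max(len(h[0]), 1) ** alpha,
                  reverse=True)

# A short hypothesis with higher raw log-probability can lose to a longer
# one under normalisation (scores here are invented for illustration):
short = ([1, 0], -1.0)            # -0.50 per token
long = ([1, 1, 1, 0], -1.6)       # -0.40 per token
ranked = length_normalised_rank([short, long])
```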
4.1 E2E Results

Figure 4 shows the results on the E2E test set. The first thing to note is that increasing the beam size did not lead to any considerable gain in performance for the vanilla beam search strategy. For all beam sizes except 10, the performance is worse than greedy decoding, while using a beam size of 10 only increased performance by 0.06 points. Compared to vanilla beam search, reranking was an effective strategy, increasing the performance at all beam sizes. Similarly, applying incremental beam manipulation was able to outperform both methods at all beam sizes. Using the validation set to tune the beam size, the BLEU scores are 0.89 and 1.93 BLEU points higher for the reranker and incremental beam manipulation, respectively. The difference in BLEU scores between the incremental beam manipulation and reranker methods was found to be significant (using a permutation test with significance level 0.01).

Length normalisation was the strongest baseline, increasing the BLEU score of vanilla beam search by 1.69 points. Adding the reranker on top of length normalisation decreases performance for all beam sizes less than 30. The strong performance of length normalisation is likely due to the fact that the E2E test set contained longer and more complex inputs (and hence references) than the training and validation sets (Dušek et al., 2020). Nevertheless, applying incremental beam manipulation on top of length normalisation was able to increase the BLEU score for all beam sizes except 5.

It is worth pointing out that while incremental beam manipulation improved both vanilla beam search and length normalisation, the overall BLEU score for the combination with the latter was lower for all sizes other than size 3. This is surprising considering that vanilla beam search performed worse than length normalisation when not combined with incremental beam manipulation.
This could be due to the fact that the greedy roll-out approximation is less accurate for length normalisation than for vanilla beam search, since length normalisation only has an impact once some items in the beam have been completed.
4.2 WebNLG Results

Figure 5 shows the results on the WebNLG test set. As in the results for E2E, we can see that increasing the beam size of vanilla beam search was not an effective way to increase the BLEU score; a greedy decode outperformed it at all beam sizes. Reranking the final beam was more effective, increasing the BLEU score by 5.83 points. Applying incremental beam manipulation had a very similar performance to reranking, increasing the performance at beam sizes 3 and 10 but reducing it at size 5.
Figure 5: Results on the test set of the WebNLG challenge. LN = Length Normalisation.
The length normalisation baseline improved upon the vanilla baseline, increasing the BLEU score by 5.01 points. Reranking the final beam of the length-normalised beam search was more effective on the WebNLG dataset than on the E2E dataset: applying the reranker outperformed length normalisation at every beam size. Focusing on the beam sizes that performed optimally on the validation set, the BLEU score on the test set was increased by 0.43 points. Applying incremental beam manipulation on top of length normalisation achieved a yet higher BLEU score than reranked length normalisation for all beam sizes, increasing the BLEU score by 1.33 points compared to length normalisation. The improvement in BLEU scores achieved by applying incremental beam manipulation to the length-normalised beam search was found to be significant when compared to length normalisation (with or without final-beam reranking).

Unlike on the E2E dataset, beam manipulation had higher performance when applied on top of length normalisation rather than vanilla beam search, outperforming it at all beam sizes except 3. The BLEU score was 0.52 points higher when taking the values at the beam sizes with the highest performance on the validation set.
4.3 Beam Retention Analysis

In Section 1, we explained that references often fall out of the beam relatively early during decoding, and reported results on the E2E task. We repeated the same experiment when applying incremental beam manipulation. A beam size of 3 was used so that the results could be directly compared to those for vanilla beam search in Figure 1.

Figure 6: The percentage of beams which contain a reference (orange), or which could still lead to a reference (blue), using incremental beam manipulation with beam size 3 on the E2E validation set. This improves on vanilla beam search, shown in Figure 1.
The results are shown in Figure 6. The graph indicates that beam manipulation indeed ameliorates this issue. The final beam contains a (correct) reference in 100 out of 547 cases (approximately 18%), a large increase from the 60 cases of vanilla beam search. This is mainly due to reducing the number of references that fall out in steps 5 to 15, which is consistent with the fact that we are manipulating at steps 5, 10, 15 and 20. We also observed that most of the retention gain is due to the earlier manipulation steps.
4.4 Human Evaluation

To further investigate the differences between the systems, we conducted a human evaluation comparing the incremental beam manipulation system's output against the output of the strongest baseline on the E2E dataset: length normalisation.

The human evaluation was performed by the second and third authors of the paper. While the annotators had been involved in the design of the system, they had not seen textual outputs from the system prior to the annotation. The outputs were presented in a random order, without indicating which output came from which system. The systems were compared in terms of both fluency and adequacy. For fluency, each annotator compared the system outputs for the same meaning representation input (without seeing it) and indicated their preference. Both annotators annotated 50 examples from the E2E test set.

Little difference between the outputs was found. The systems were labelled as equally fluent in 76% of cases (incremental beam manipulation was preferred in 12% of all cases).

To judge the adequacy of the generations, human raters were presented with the meaning representation input and the text generated by a system. They were asked to label any hallucinations and any repetitions. They were asked to ignore missing information, as the human references that the system had been trained on contained frequent examples of missing information (Dušek et al., 2020), so for this dataset, missing information is better seen as content selection. Between the annotators, a combined total of 524 examples were labelled for both hallucination and repetition.

Once again, the results were not conclusive in support of either system, with no statistically significant difference between them. The overall performance was very high: for 95% of the inputs, both systems exhibited no signs of hallucination or repetition. This error analysis did, however, highlight that some errors are repeated multiple times, almost word for word.
For example, all 5 cases of repetition for the incremental beam manipulation system had the following form: "There is a pub called X located near Y. It is a pub."

It is worth reiterating that the system was optimised for BLEU, and not fluency or adequacy. The fact that an improvement in BLEU has not led to an improvement in a human evaluation suggests that BLEU may not be an accurate enough metric for this task, even when comparing similar systems. Therefore, BLEU may be even more limited in usefulness than Callison-Burch et al. (2006) and Novikova et al. (2017a) suggested.
4.5 Example Outputs

We now present a couple of examples where manipulating the beam during decoding led to an improvement in the quality of the output. These were selected from the set of examples for which the output of the beam manipulator was preferred by the human annotators in terms of adequacy. The examples are given in Figure 7.

Input: name = The Cricketers | eat type = restaurant | food = English | price range = cheap | rating = average | area = city centre | family friendly = yes | near = Café Rouge

BM: The Cricketers serves cheap English food in the city centre near Café Rouge. It has an average customer rating and is family-friendly.

LN: The Cricketers is a cheap, family-friendly, English restaurant with an average customer rating. It is located in the city centre near Café Rouge and has an average customer rating.

RR: The Cricketers is a cheap, family-friendly restaurant located in city centre near Café Rouge.

Input: name = The Phoenix | eat type = pub | food = French | price range = £20-25 | rating = 3 out of 5 | area = riverside | family friendly = no | near = Café Sicilia

BM: The Phoenix is a French restaurant in riverside near Café Sicilia. It has a moderate price range and a customer rating of 3 out of 5. It is not kid friendly.

LN: The Phoenix is a restaurant providing French food in the £20-25 price range. It is located in the riverside. It is near Café Sicilia. Its customer rating is 5 out of 5.

RR: The Phoenix is a restaurant providing French food in the £20-25 price range. It is located in the riverside. It is near Café Sicilia. Its customer rating is high.

Figure 7: Example outputs for different systems. BM = incremental beam manipulation system, LN = vanilla beam search with length normalisation, RR = vanilla beam search with reranking applied to the final beam.

In the first example, we can see that length normalisation leads to a repetition of the fact that The Cricketers had an 'average customer rating', showing the downsides of a technique that just favours longer outputs. Neither the output from the beam manipulator nor the reranked approach contains repetitions, although we can see that more of the input information is realised in the case of the beam manipulator.

The second example contains a hallucination for both the length-normalised and reranked systems: the input clearly states that the customer rating was '3 out of 5', whereas these systems claim that it was '5 out of 5' and 'high' respectively. The beam manipulation system avoided this issue.
5 Conclusion

Rerankers are commonly used to increase the performance of NLG systems decoded by beam search, by modifying which hypothesis from the final beam is chosen. This means that rerankers are dependent on good hypotheses reaching the final beam. However, this is often not the case: on the validation set of the E2E challenge, only 11% of references were present in the final beam when the seq2seq model of Dušek and Jurčíček (2016) was decoded with a beam size of 3.

To address this limitation, we proposed incremental beam manipulation, which modifies the ranking of partial hypotheses within the beam at intermediate steps of the decoding, and hence chooses which are pruned. We evaluated this method on both the E2E and WebNLG challenges. The results showed that applying beam manipulation, instead of a reranker, was able to increase the BLEU score by 1.04 points on the E2E challenge. We further showed that incremental beam manipulation was able to increase performance when applied on top of length normalisation.

The optimal reranker for incremental beam manipulation may differ at each step of generation (for example, token 5 vs. token 20). In future work, we intend to refine our method further by conditioning the reranker on how far through the beam search we are.
References
Shubham Agarwal, Marc Dymetman, and Eric Gaussier. 2018. Char2char generation with reranking for the E2E NLG challenge. pages 451–456.
Nabiha Asghar, Pascal Poupart, Xin Jiang, and Hang Li. 2017. Deep active learning for dialogue generation. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 78–83.
Frédéric Blain, Lucia Specia, and Pranava Madhyastha. 2017. Exploring hypotheses spaces in neural machine translation. In Proceedings of the 16th Machine Translation Summit (MT Summit XVI). Asia-Pacific Association for Machine Translation (AAMT).
Sebastian Borgeaud and Guy Emerson. 2020. Leveraging sentence similarity in natural language generation: Improving beam search using range voting. In Proceedings of the 4th Workshop on Neural Generation and Translation (WNGT), pages 97–109. Association for Computational Linguistics.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of JMLR Proceedings, pages 2058–2066. JMLR.org.
Yun Chen, Kyunghyun Cho, Samuel R. Bowman, and Victor O.K. Li. 2018. Stable and effective trainable greedy decoding for sequence to sequence learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018).
Eldan Cohen and Christopher Beck. 2019. Empirical analysis of beam search performance degradation in neural sequence models. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1290–1299.
Ronan Collobert, Awni Hannun, and Gabriel Synnaeve. 2019. A fully differentiable beam search decoder. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1341–1350.
Ondřej Dušek and Filip Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 45–51, Berlin, Germany. Association for Computational Linguistics.
Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2020. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. Computer Speech & Language, 59:123–156.
Markus Freitag and Yaser Al-Onaizan. 2017. Beam search strategies for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 56–60, Vancouver. Association for Computational Linguistics.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain. Association for Computational Linguistics.
Kartik Goyal, Graham Neubig, Chris Dyer, and Taylor Berg-Kirkpatrick. 2018. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3045–3052.
Jiatao Gu, Kyunghyun Cho, and Victor O.K. Li. 2017. Trainable greedy decoding for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1968–1978.
Ari Holtzman, Jan Buys, Leo Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39.
Shankar Kumar and William Byrne. 2004. Minimum Bayes-Risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176.
Tie-Yan Liu. 2009. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225–331.
Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223.
Renato Negrinho, Matthew R. Gormley, and Geoffrey J. Gordon. 2018. Learning beam search policies via imitation learning.
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017a. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252. Association for Computational Linguistics.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017b. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Donald G. Saari and Vincent R. Merlin. 1996. The Copeland method. Economic Theory, 8(1):51–76.
Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3347–3353.
Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 275–284.
Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1296–1306.
Ning Zhang, Shui-Long Shen, Annan Zhou, and Ye-Shuang Xu. 2019. Investigation on performance of neural networks using quadratic relative error cost function. IEEE Access, 7:106642–106652.

Figure 8: The percentage of beams that contain a reference sentence after each step of beam search. A beam size of 10 was used to decode the model proposed by Dušek and Jurčíček (2016). Results are for the E2E validation dataset. The orange bars indicate the number of completed references within the beam.
A Fallout experiment with larger beam size
Figure 1 indicates the step at which the reference sentences drop out of the beam (for a beam size of 3). Figure 8 shows the same results for a larger beam size of 10.

The figure indicates that the number of references contained in the final beam was higher for a beam size of 10. For the early iterations of decoding, the number of references that fell out of the beam was far lower for a beam size of 10. A larger beam size means the beam contains more hypotheses, and so has more chances to match against a reference.

However, the shape of the graphs is very similar. The majority of references that fell out did so relatively early in the process: 54% of references fell out by step 7, increasing to 79% by step 9. At step 21 the last reference fell out of the beam, despite the fact that the beam contained partial hypotheses up to step 40.
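The fallout statistic itself is straightforward to compute: at each step, a reference is still "in" the beam if some hypothesis in the beam is a prefix of it (or equals it). A minimal sketch, where `beams_per_step` and `reference` are assumed inputs rather than the paper's actual data structures:

```python
def reference_fallout(beams_per_step, reference):
    """For each decoding step, report whether the reference is still in the
    beam, i.e. some hypothesis in the beam is a prefix of (or equal to) the
    reference, so the reference remains reachable by further decoding.

    beams_per_step: list over steps; each beam is a list of token lists.
    reference: the tokenised reference as a token list.
    Returns a list of booleans, one per step.
    """
    in_beam = []
    for beam in beams_per_step:
        reachable = any(
            len(hyp) <= len(reference) and reference[: len(hyp)] == hyp
            for hyp in beam
        )
        in_beam.append(reachable)
    return in_beam
```

Averaging this indicator over all references in the validation set gives the per-step percentages plotted in the fallout figures.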
Figure 9: Comparison between the performance of the pointwise and pairwise rankers when used as rerankers on the E2E validation set.
B Pointwise vs. pairwise rerankers
This paper required a method of ranking completed hypotheses from worst to best. During preliminary experiments we implemented rerankers based on the pairwise and pointwise strategies from the information retrieval field; see Section 3.2 for more details.

To evaluate the performance of the different rankers, we applied each of them as a reranker of the final beam of a vanilla beam search over the E2E validation set, and calculated the BLEU score of each reranker at each beam size. The results are shown in Figure 9.

We can see that there was very little difference in performance between the two reranking methods for beam sizes up to 10. However, for beam size 30 the pointwise reranker significantly outperforms the pairwise reranker. The larger the beam size, the greater the number of hypotheses that the reranker can pick as top, and hence the greater the impact of the reranker.

The pointwise approach requires O(k) runs of the ranker to produce a total ordering of k hypotheses. On the other hand, the Copeland method, which produces a total ordering from pairwise comparisons, requires O(k^2) pairwise comparisons.

These factors led us to choose the pointwise ranker over the pairwise ranker for the experiments in the results section.

Beam size Vanilla Rerank TGEN LN LN+Rerank BM LN+BM
30 63.65 65.25 65.44 65.58 66.05 66.61*

Table 1: BLEU scores for each of the different systems on the E2E test set. * indicates the beam size which scored highest on the respective validation sets. Bold indicates the highest scoring system for each beam size.
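The two ranking strategies discussed in Appendix B can be sketched as follows. This illustrates the complexity argument rather than the paper's implementation: `score` and `prefer` are hypothetical stand-ins for the pointwise ranker and the pairwise comparator.

```python
def pointwise_order(hyps, score):
    """Total order from one independent score per hypothesis: O(k) ranker calls."""
    return sorted(hyps, key=score, reverse=True)

def copeland_order(hyps, prefer):
    """Total order from pairwise comparisons via the Copeland method: each
    hypothesis is ranked by wins minus losses over all O(k^2) pairs.
    prefer(a, b) -> True if a is judged better than b (assumed interface).
    """
    copeland = {i: 0 for i in range(len(hyps))}
    for i in range(len(hyps)):
        for j in range(i + 1, len(hyps)):
            if prefer(hyps[i], hyps[j]):
                copeland[i] += 1
                copeland[j] -= 1
            else:
                copeland[j] += 1
                copeland[i] -= 1
    ranked = sorted(range(len(hyps)), key=lambda i: copeland[i], reverse=True)
    return [hyps[i] for i in ranked]
```

The quadratic number of comparator calls in `copeland_order` is what makes the pairwise approach expensive at larger beam sizes.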
Beam size Vanilla Rerank TGEN LN LN+Rerank BM LN+BM
10 41.33 47.33 46.50 47.11* 47.70 47.66
30 41.20 47.42 46.61 47.18 47.81 47.41*

Table 2: BLEU scores for each of the different systems on the WebNLG test set. * indicates the beam size which scored highest on the respective validation sets. Bold indicates the highest scoring system for each beam size.
C Numerical results
This section presents the numerical results for the E2E and WebNLG datasets so that they can be more readily compared in future work. The results are given in Table 1 and Table 2.