Data-to-text Generation with Macro Planning
Ratish Puduppully and
Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected] [email protected]
Abstract
Recent approaches to data-to-text generation have adopted the very successful encoder-decoder architecture or variants thereof. These models generate text which is fluent (but often imprecise) and perform quite poorly at selecting appropriate content and ordering it coherently. To overcome some of these issues, we propose a neural model with a macro planning stage followed by a generation stage reminiscent of traditional methods which embrace separate modules for planning and surface realization. Macro plans represent high level organization of important content such as entities, events and their interactions; they are learnt from data and given as input to the generator. Extensive experiments on two data-to-text benchmarks (ROTOWIRE and MLB) show that our approach outperforms competitive baselines in terms of automatic and human evaluation.
Introduction

Data-to-text generation refers to the task of generating textual output from non-linguistic input (Reiter and Dale, 1997, 2000; Gatt and Krahmer, 2018) such as databases of records, simulations of physical systems, accounting spreadsheets, or expert system knowledge bases. As an example, Figure 1 shows various statistics describing a major league baseball (MLB) game, including extracts from the box score (i.e., the performance of the two teams and individual team members who played as batters, pitchers or fielders; Tables (A)), play-by-play (i.e., the detailed sequence of each play of the game as it occurred; Table (B)), and a human written game summary (Table (C)).

Traditional methods for data-to-text generation (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997) follow a pipeline architecture, adopting separate stages for text planning (determining which content to talk about and how it might be organized in discourse), sentence planning (aggregating content into sentences, deciding specific words to describe concepts and relations, and generating referring expressions), and linguistic realisation (applying the rules of syntax, morphology and orthographic processing to generate surface forms). Recent neural network-based approaches (Lebret et al., 2016; Mei et al., 2016; Wiseman et al., 2017) make use of the encoder-decoder architecture (Sutskever et al., 2014), are trained end-to-end, and have no special-purpose modules for how to best generate a text, aside from generic mechanisms such as attention and copy (Bahdanau et al., 2015; Gu et al., 2016). The popularity of end-to-end models has been further boosted by the release of new datasets with thousands of input-document training pairs. The example shown in Figure 1 is taken from the MLB dataset (Puduppully et al., 2019b) which contains baseball game statistics and human written summaries (~25K instances).
ROTOWIRE (Wiseman et al., 2017) is another widely used benchmark, which contains NBA basketball game statistics and their descriptions (~5K instances).

Despite being able to generate fluent text, neural data-to-text generation models are often imprecise, prone to hallucination (i.e., generate text that is not supported by the input), and poor at content selection and document structuring (Wiseman et al., 2017). Attempts to remedy some of these issues focus on changing the way entities are represented (Puduppully et al., 2019b; Iso et al., 2019), allowing the decoder to skip low-confidence tokens to enhance faithful generation (Tian et al., 2019), and making the encoder-decoder architecture more modular by introducing micro planning (Puduppully et al., 2019a; Moryossef et al., 2019). Micro planning operates at the record level (see Tables (A) in Figure 1; e.g., ⟨C.Mullins, BH, 2⟩, ⟨J.Villar, TEAM, Orioles⟩); it determines which facts should be mentioned within a textual unit (e.g., a sentence) and how these should be structured (e.g., the sequence of records). An explicit content planner essentially makes the job of the neural network less onerous, allowing it to concentrate on producing fluent natural language output, without expending too much effort on content organization.
(A) Box score tables:

TEAM     Inn1  Inn2  Inn3  Inn4  ...  TR   TH   E
Orioles  1     0     0     0     ...  2    ...  ...
Royals   1     0     0     3     ...  9    ...  ...

BATTER        H/V  AB  BR  BH  RBI  TEAM     ...
C.Mullins     H    4   2   2   1    Orioles  ...
J.Villar      H    4   0   0   0    Orioles  ...
W.Merrifield  V    2   3   2   1    Royals   ...
R.O'Hearn     V    5   1   3   4    Royals   ...
...

PITCHER    H/V  W  L   IP   PH  PR  ER  BB  K  ...
A.Cashner  H    4  13  5.1  9   4   4   3   1  ...
B.Keller   V    7  5   8.0  4   2   2   2   4  ...
...

Key: Inn1: runs in innings; TR: team runs; TH: team hits; E: errors; AB: at-bats; RBI: runs-batted-in; BR: batter runs; BH: batter hits; H/V: home or visiting; W: wins; L: losses; IP: innings pitched; PH: hits given; PR: runs given; ER: earned runs; BB: walks; K: strike outs; INN: inning with (T)op/(B)ottom; PL-ID: play id.

(B) Play-by-play table:

BATTER        PITCHER    SCORER        ACTION    TEAM     INN  PL-ID  SCORE  ...
C.Mullins     B.Keller   -             Home run  Orioles  1-T  1      1      ...
H.Dozier      A.Cashner  W.Merrifield  Grounded  Royals   1-B  3      1      ...
W.Merrifield  A.Cashner  B.Goodwin     Sac fly   Royals   4-B  5      2      ...
H.Dozier      A.Cashner  -             Home run  Royals   5-B  1      3      ...
...

(C) Game summary:

KANSAS CITY, Mo. – Brad Keller kept up his recent pitching surge with another strong outing. <P> Keller gave up a home run to the first batter of the game – Cedric Mullins – but quickly settled in to pitch eight strong innings in the Kansas City Royals' 9–2 win over the Baltimore Orioles in a matchup of the teams with the worst records in the majors. <P> Keller (7–5) gave up two runs and four hits with two walks and four strikeouts to improve to 3–0 with a 2.16 ERA in his last four starts. <P> Ryan O'Hearn homered among his three hits and drove in four runs, Whit Merrifield scored three runs, and Hunter Dozier and Cam Gallagher also went deep to help the Royals win for the fifth time in six games on their current homestand. <P> With the score tied 1–1 in the fourth, Andrew Cashner (4–13) gave up a sacrifice fly to Merrifield after loading the bases on two walks and a single. Dozier led off the fifth inning with a 423-foot home run to left field to make it 3–1. <P> The Orioles pulled within a run in the sixth when Mullins led off with a double just beyond the reach of Dozier at third, advanced to third on a fly ball and scored on Trey Mancini's sacrifice fly to the wall in right. <P> ...

(D) Candidate paragraph plans (single entities/events, and their combinations):

V(Orioles), V(Royals), V(C.Mullins), V(J.Villar), V(W.Merrifield), V(R.O'Hearn), V(A.Cashner), V(B.Keller), V(H.Dozier), ..., V(1-T), V(1-B), V(2-T), V(2-B), V(3-T), V(3-B), ...

V(Royals) V(Orioles), V(Orioles) V(C.Mullins), V(Orioles) V(J.Villar), V(Royals) V(W.Merrifield), V(Royals) V(R.O'Hearn), V(Orioles) V(A.Cashner), V(Royals) V(B.Keller), ..., V(C.Mullins) V(Royals) V(Orioles), V(J.Villar) V(Royals) V(Orioles), ...
(E) Macro plan for the game summary:

V(B.Keller) <P> V(B.Keller) V(C.Mullins) V(Royals) V(Orioles) <P> V(B.Keller) <P> V(R.O'Hearn) V(W.Merrifield) V(H.Dozier) V(C.Gallagher) <P> V(4-B, 5-B) <P> V(6-T) <P>

Figure 1: MLB statistics tables and game summary. Tables summarize the performance of teams and individual team members who played as batters and pitchers, as well as the most important actions (and their actors) in each play (Tables (A) and (B)). The macro plan for the game summary is shown at the bottom (Table (E)). <P> indicates paragraph delimiters. There is a plan for every paragraph in the game summary (correspondence shown in same color); ⟨V(entity)⟩ verbalizes entities, while ⟨V(inning-T/B)⟩ verbalizes events related to the top/bottom side of an inning (see Section 3.1). The set of candidate paragraph plans is shown above the macro plan (Table (D)) and grouped into two types: plans describing a single entity/event or their combinations. Best viewed in color.

In this work, we focus on macro planning, the high-level organization of information and how it should be presented, which we argue is important for the generation of long, multi-paragraph documents (see text (C) in Figure 1). Problematically, modern datasets like MLB (Puduppully et al. 2019b; and also Figure 1) and ROTOWIRE (Wiseman et al., 2017) do not naturally lend themselves to document planning as there is no explicit link between the summary and the content of the game (which is encoded in tabular form). In other words, the underlying plans are latent, and it is not clear how they might be best represented, i.e., as sequences of records from a table, or simply words. Nevertheless, game summaries through their segmentation into paragraphs (and lexical overlap with the input) give clues as to how content might be organized. Paragraphs are a central element of discourse (Chafe, 1979; Longacre, 1979; Halliday and Hasan, 1976), the smallest domain where coherence and topic are defined and anaphora resolution is possible (Zadrozny and Jensen, 1991). We therefore operationalize the macro plan for a game summary as a sequence of paragraph plans.

Although resorting to paragraphs describes the summary plan at a coarse level, we still need to specify individual paragraph plans. In the sports domain, paragraphs typically mention entities (e.g., players important in the game), key events (e.g., scoring a run), and their interaction. And most of this information is encapsulated in the statistics accompanying game summaries (see Tables (A) and (B) in Figure 1). We thus define paragraph plans such that they contain verbalizations of entity and event records (see plan (E) in Figure 1). Given a set of paragraph plans and their corresponding game summary (see Tables (D) and summary (C) in Figure 1), our task is twofold.
At training time, we must learn how content was selected in order to give rise to specific game summaries (e.g., how input (D) led to plan (E) for summary (C) in Figure 1), while at test time, given input for a new game, we first predict a macro plan for the summary and then generate the corresponding document.

We present a two-stage approach where macro plans are induced from training data (by taking the table and corresponding summaries into account) and then fed to the text generation stage. Aside from making data-to-text generation more interpretable, the task of generating a document from a macro plan (rather than a table) affords greater control over the output text and plays to the advantage of encoder-decoder architectures which excel at modeling sequences. We evaluate model performance on the ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) benchmarks. Experimental results show that our plan-and-generate approach produces output which is more factual, coherent, and fluent compared to existing state-of-the-art models. Our code, trained models and dataset with macro plans can be found at https://github.com/ratishsp/data2text-macro-plan-py.

Related Work

Content planning has been traditionally considered a fundamental component in natural language generation. Not only does it determine which information-bearing units to talk about, but it also arranges them into a structure that creates coherent output. Many content planners have been based on theories of discourse coherence (Hovy, 1993), schemas (McKeown et al., 1997) or have relied on generic planners (Dale, 1989). Plans are mostly based on hand-crafted rules after analyzing the target text, although a few approaches have recognized the need for learning-based methods. For example, Duboue and McKeown (2001) learn ordering constraints in a content plan, Konstas and Lapata (2013) represent plans as grammar rules whose probabilities are estimated empirically, while others make use of semantically annotated corpora to bootstrap content planners (Duboue and McKeown, 2002; Kan and McKeown, 2002).

More recently, various attempts have been made to improve neural generation models (Wiseman et al., 2017) based on the encoder-decoder architecture (Bahdanau et al., 2015) by adding various planning modules. Puduppully et al. (2019a) propose a model for data-to-text generation which first learns a plan from the records in the input table and then generates a summary conditioned on this plan. Shao et al. (2019) introduce a Planning-based Hierarchical Variational Model where a plan is a sequence of groups, each of which contains a subset of input items to be covered in a sentence. The content of each sentence is verbalized, conditioned on the plan and previously generated context. In their case, input items are a relatively small list of attributes (~28) and the output document is also short (~110 words).

There have also been attempts to incorporate neural modules in a pipeline architecture for data-to-text generation. Moryossef et al. (2019) develop a model with a symbolic text planning stage followed by a neural realization stage. They experiment with the WebNLG dataset (Gardent et al., 2017) which consists of RDF ⟨Subject, Object, Predicate⟩ triples paired with corresponding text.
Their document plan is a sequence of sentence plans which in turn determine the division of facts into sentences and their order. Along similar lines, Castro Ferreira et al. (2019) propose an architecture comprising multiple steps including discourse ordering, text structuring, lexicalization, referring expression generation, and surface realization. Both approaches show the effectiveness of pipeline architectures; however, their task does not require content selection and the output texts are relatively short (24 tokens on average).

Although it is generally assumed that task-specific parallel data is available for model training, Laha et al. (2020) do away with this assumption and present a three-stage pipeline model which learns from monolingual corpora. They first convert the input to a form of tuples, which in turn are expressed in simple sentences, followed by the third stage of merging simple sentences to form more complex ones by aggregation and referring expression generation. They also evaluate on data-to-text tasks which have relatively short outputs. There have also been efforts to improve the coherence of the output, especially when dealing with longer documents. Puduppully et al. (2019b) make use of hierarchical attention over entity representations which are updated dynamically, while Iso et al. (2019) explicitly keep track of salient entities and memorize which ones have been mentioned.

Our work also attempts to alleviate deficiencies in neural data-to-text generation models. In contrast to previous approaches (Puduppully et al., 2019a; Moryossef et al., 2019; Laha et al., 2020), we place emphasis on macro planning and create plans representing the high-level organization of a document including both its content and structure. We share with previous work (e.g., Moryossef et al. 2019) the use of a two-stage architecture. We show that macro planning can be successfully applied to long document data-to-text generation, resulting in improved factuality, coherence, and fluency without any postprocessing (e.g., to smooth referring expressions) or recourse to additional tools (e.g., parsing or information extraction).
Problem Formulation

We hypothesize that generation based on plans should fare better compared to generating from a set of records, since macro plans offer a bird's-eye view, a high-level organization of the document content and structure. We also believe that macro planning will work well for long-form text generation, i.e., for datasets which have multi-paragraph target texts, a large vocabulary space, and require content selection.

We assume the input to our model is a set of paragraph plans $E = \{e_i\}_{i=1}^{|E|}$, where $e_i$ is a paragraph plan. We model the process of generating output summary $y$ given $E$ as a two-step process, namely the construction of a macro plan $x$ based on the set of paragraph plans, followed by the generation of a summary given a macro plan as input. We now explain how $E$ is obtained and how each step is realized. We discuss our model considering mainly an example from the MLB dataset (Puduppully et al., 2019b) but also touch on how the approach can be straightforwardly adapted to ROTOWIRE (Wiseman et al., 2017).
Macro Plan Definition

A macro plan consists of a sequence of paragraph plans separated by a paragraph discourse marker <P>, i.e., x = e_i <P> e_j ... <P> e_k, where e_i, e_j, e_k ∈ E. A paragraph plan in turn is a sequence of entities and events describing the game. By entities we mean individual players or teams and the information provided about them in box score statistics (see rows and column headings in Figure 1 Table (A)), while events refer to information described in play-by-play (see Table (B)). In baseball, plays are grouped in half-innings. During each half of an inning, a team takes its turn to bat (the visiting team bats in the top half and the home team in the bottom half). An example macro plan is shown at the bottom of Figure 1. Within a paragraph plan, entities and events are verbalized into a text sequence along the lines of Saleh et al. (2019). We make use of special tokens for the <TYPE> of record followed by the value of the record from the table. We retain the same position for each record type and value. For example, batter C.Mullins from Figure 1 would be verbalized as <PLAYER> C.Mullins <H/V> H <AB> 4 <BR> 2 <BH> 2 <RBI> 1 <TEAM> Orioles .... For the sake of brevity we use the shorthand <V(C.Mullins)> for the full entity.

Paragraph Plan for Entities
For a paragraph containing entities, the corresponding plan will be a verbalization of the entities in sequence. For paragraphs with multiple mentions of the same entity, the plan will verbalize an entity only once, at its first position of mention. The paragraph "Keller gave up a home run ... the teams with the worst records in the majors" from the summary in Figure 1 describes four entities including B.Keller, C.Mullins, Royals and Orioles. The respective plan is the verbalization of the four entities in sequence: <V(B.Keller)> <V(C.Mullins)> <V(Royals)> <V(Orioles)>, where V stands for verbalization, <V(B.Keller)> is a shorthand for <PLAYER> B.Keller <H/V> V <W> 7 <L> 5 <IP> 8.0 ..., <V(Royals)> is a shorthand for the team <TEAM> Royals <TR> 9 <TH> ... <E> ..., and so on.
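For concreteness, here is a minimal Python sketch of this record verbalization scheme. The record order and token names follow Figure 1, but the helper itself is our illustration, not the authors' released code:

```python
# Minimal sketch of entity verbalization (illustrative, not the released code).
# A box-score row is rendered as "<TYPE> value" pairs in a fixed order, so that
# every entity verbalization keeps the same position for each record type.

BATTER_FIELDS = ["PLAYER", "H/V", "AB", "BR", "BH", "RBI", "TEAM"]  # order from Fig. 1(A)

def verbalize_entity(row: dict, fields=BATTER_FIELDS) -> str:
    """Turn one table row into the flat token sequence <F1> v1 <F2> v2 ..."""
    return " ".join(f"<{f}> {row.get(f, 'N/A')}" for f in fields)

row = {"PLAYER": "C.Mullins", "H/V": "H", "AB": 4, "BR": 2,
       "BH": 2, "RBI": 1, "TEAM": "Orioles"}
print(verbalize_entity(row))
# <PLAYER> C.Mullins <H/V> H <AB> 4 <BR> 2 <BH> 2 <RBI> 1 <TEAM> Orioles
```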
Paragraph Plan for Events  A paragraph may also describe one or more events. For example, the paragraph "With the score tied 1–1 in the fourth ..." discusses what happened in the bottom halves of the fourth and fifth innings. We verbalize an event by first describing the participating entities followed by the plays in the event. Entities are described in the order in which they appear in a play, and within the same play we list the batter followed by the pitcher, fielder, scorer, and basemen. The paragraph plan corresponding to the bottom halves of the fourth and fifth innings is <V(4-B, 5-B)>. Here, <V(4-B, 5-B)> is a shorthand for <V(W.Merrifield)> <V(A.Cashner)> <V(B.Goodwin)> <V(H.Dozier)> ... <V(4-B,1)> <V(4-B,2)> ... <V(5-B,1)> <V(5-B,2)>, and so on. The entities <V(W.Merrifield)>, <V(A.Cashner)>, <V(B.Goodwin)>, and <V(H.Dozier)> correspond in turn to W.Merrifield, A.Cashner, B.Goodwin, and H.Dozier, while <V(5-B,1)> refers to the first play in the bottom half of the fifth inning (see the play-by-play table in Figure 1) and abbreviates the following detailed plan: <INN> 5 <HALF> B <BATTING> Royals <PITCHING> Orioles <PL-ID> 1 <BATTER> H.Dozier <PITCHER> A.Cashner <ACTION> Home-run <SCORES> Royals-3-Orioles-1, etc.

The procedure described above is not specific to MLB and can be ported to other datasets with similar characteristics such as ROTOWIRE. However, ROTOWIRE does not provide play-by-play information, and as a result there is no event verbalization for this dataset.
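A similar sketch for event verbalization, reusing verbalize_entity from the earlier snippet; the helpers only mirror the ordering described above (participating entities first, then the plays) and are hypothetical:

```python
# Illustrative sketch of event (half-inning) verbalization. Field names follow
# the detailed plan shown above; the code is our reading, not the released one.

PLAY_FIELDS = ["INN", "HALF", "BATTING", "PITCHING", "PL-ID",
               "BATTER", "PITCHER", "ACTION", "SCORES"]

def verbalize_play(play: dict) -> str:
    return " ".join(f"<{f}> {play[f]}" for f in PLAY_FIELDS if f in play)

def verbalize_event(entities: list, plays: list, entity_table: dict) -> str:
    # entities: names in order of first appearance in a play; plays: game order
    ent_part = " ".join(verbalize_entity(entity_table[e]) for e in entities)
    play_part = " ".join(verbalize_play(p) for p in plays)
    return f"{ent_part} {play_part}"
```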
Macro Plan Construction

We provided our definition of macro plans in the previous sections; however, it is important to note that such macro plans are not readily available in data-to-text benchmarks like MLB (Puduppully et al., 2019b) and ROTOWIRE (Wiseman et al., 2017), which consist of tables of records r paired with a gold summary y (see Tables (A)–(C) in Figure 1). We now describe our method for obtaining macro plans x from r and y.

Similar to Moryossef et al. (2019), we define macro plans to be conformant with gold summaries such that (1) they have the same splits into paragraphs — entities and events within a paragraph in y are grouped into a paragraph plan in x; and (2) the order of events and entities in a paragraph and its corresponding plan are identical. We construct macro plans by matching entities and events in the summary to records in the tables. Furthermore, paragraph delimiters within summaries form natural units which taken together give rise to a high-level document plan.

We match entities in summaries with entities in tables using exact string match, allowing for some degree of variation in the expression of team names (e.g., A's for Athletics and D-backs for Diamondbacks). Information pertaining to innings appears in the summaries in the form of ordinal numbers (e.g., first, ninth) modifying the noun inning and can be relatively easily identified via pattern matching (e.g., in sentences like "Dozier led off the fifth inning"). However, there are instances where the mention of innings is more ambiguous (e.g., "With the score tied 1–1 in the fourth, Andrew Cashner (4–13) gave up a sacrifice fly"). We could disambiguate such mentions manually and then train a classifier to learn to predict whether an inning is mentioned. Instead, we explore a novel annotation-free method which makes use of the pretrained language model GPT2 (Radford et al., 2019). Specifically, we feed the context preceding the ordinal number to GPT2 (i.e., the current paragraph up to the ordinal number and the paragraph preceding it), and if "inning" appears in the top 10 next word predictions, we consider it a positive match. On a held-out dataset, this method achieves 98% precision and 98% recall at disambiguating inning mentions.

To resolve whether the summary discusses the top or bottom side of an inning, we compare the entities in the paragraph with the entities in each half-inning (play-by-play Table (B) in Figure 1) and choose the side with the greater number of entity matches. For instance, Andrew Cashner, Merrifield and fourth inning uniquely resolve to the bottom half of the fourth inning.
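The GPT2-based check can be sketched as follows, assuming the HuggingFace transformers library; inspecting the top-10 next-token predictions is our simplification of "top 10 next word predictions":

```python
# Sketch of the annotation-free inning disambiguation described above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def mentions_inning(context: str, k: int = 10) -> bool:
    """True if 'inning' is among the top-k next-word predictions for the
    text preceding the ordinal number (e.g., '... in the fourth')."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]             # next-token distribution
    top = torch.topk(logits, k).indices.tolist()
    words = {tokenizer.decode([t]).strip().lower() for t in top}
    return "inning" in words

print(mentions_inning("With the score tied 1-1 in the fourth"))
```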
Paragraph Plan Construction

Figure 1 shows the macro plan we obtain for game summary (C). Importantly, macro plan (E) is the outcome of a content selection process after considering several candidate paragraph plans as input. So, what are the candidate paragraph plans which give rise to macro plan (E)? To answer this question, we examined the empirical distribution of paragraph plans in MLB and ROTOWIRE (training portion). Interestingly, we found that ~79% of the paragraph plans in MLB refer to a single event or a single player (and team(s)). In ROTOWIRE, ~92% of paragraphs are about a singleton player (and team(s)) or a pair of players.

Based on this analysis, we assume that paragraph plans can be either one (verbalized) entity/event or a combination of at most two. Under this assumption, we explicitly enumerate the set of candidate paragraph plans in a game. For the game in Figure 1, candidate paragraph plans are shown in Tables (D). The first table groups plans based on individual verbalizations describing the team(s), players, and events taking place in specific innings. The second table groups pairwise combinations thereof. In MLB, such combinations are between team(s) and players. In ROTOWIRE, we also create combinations between players. Such paragraph plans form set E based on which macro plan x is constructed to give rise to game summary y.
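The enumeration of candidate paragraph plans admits a simple sketch; we reduce combinations to plain pairs (the combinations in Figure 1(D) may also group a player with both teams), so this is illustrative only:

```python
# Sketch of candidate paragraph plan enumeration: singletons plus pairwise
# combinations, following the ~79%/~92% analysis above. Names are ours.
from itertools import product

def candidate_paragraph_plans(teams, players, events, pair_players=False):
    singletons = [[s] for s in teams + players + events]  # one entity/event
    pairs = [[t, p] for t, p in product(teams, players)]  # team-player (MLB)
    if pair_players:                                      # ROTOWIRE: player pairs too
        pairs += [[a, b] for i, a in enumerate(players) for b in players[i + 1:]]
    return singletons + pairs

E = candidate_paragraph_plans(
    teams=["V(Orioles)", "V(Royals)"],
    players=["V(C.Mullins)", "V(B.Keller)"],
    events=["V(1-T)", "V(1-B)"])
```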
Model Description

The input to our model is a set of paragraph plans, each of which is a sequence of tokens. We first compute paragraph plan representations $\in \mathbb{R}^n$, and then apply a contextualization and content planning mechanism similar to planning modules introduced in earlier work (Puduppully et al., 2019a; Chen and Bansal, 2018). Predicted macro plans serve as input to our text generation model, which adopts an encoder-decoder architecture (Bahdanau et al., 2015; Luong et al., 2015).

[Figure 2: Paragraph plan representation and contextualization for macro planning. Computation of $e_i$ is detailed in Equations (1) and (2), $e_i^{att}$ in Equation (3), and $e_i^c$ in Equation (4).]

We encode tokens in a verbalized paragraph plan $e_i$ as $\{e_{i,j}\}_{j=1}^{|e_i|}$ with a BiLSTM (Figure 2, bottom part). To reflect the fact that some records will be more important than others, we compute an attention weighted sum of $\{e_{i,j}\}_{j=1}^{|e_i|}$ following Yang et al. (2016). Let $d \in \mathbb{R}^n$ denote a randomly initialized query vector learnt jointly with the rest of the parameters. We compute attention values $\alpha_{i,j}$ over $d$ and paragraph plan token representation $e_{i,j}$:

$$\alpha_{i,j} \propto \exp(d^\top e_{i,j}) \qquad (1)$$

Paragraph plan vector $e_i$ is the attention weighted sum of $e_{i,j}$ (with $\sum_j \alpha_{i,j} = 1$):

$$e_i = \sum_j \alpha_{i,j} e_{i,j} \qquad (2)$$

Next, we contextualize each paragraph plan representation vis-à-vis other paragraph plans (Figure 2, top left part). First, we compute attention scores $\beta_{i,k}$ over paragraph plan representations to obtain an attentional vector $e_i^{att}$ for each:

$$\beta_{i,k} \propto \exp(e_i^\top W_a e_k), \qquad c_i = \sum_{k \neq i} \beta_{i,k} e_k, \qquad e_i^{att} = W_g [e_i; c_i] \qquad (3)$$

where $W_a \in \mathbb{R}^{n \times n}$ and $W_g \in \mathbb{R}^{n \times 2n}$ are parameter matrices, and $\sum_{k \neq i} \beta_{i,k} = 1$. Then, we compute a content selection gate, and apply this gate to $e_i$ to obtain a new paragraph plan representation $e_i^c$:

$$g_i = \mathrm{sigmoid}(e_i^{att}), \qquad e_i^c = g_i \odot e_i \qquad (4)$$

where $\odot$ denotes element-wise multiplication. Thus, each element in $e_i$ is weighted by the corresponding element of $g_i \in [0,1]^n$ to obtain a contextualized paragraph plan representation $e_i^c$.
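A minimal PyTorch sketch of Equations (1)-(4) follows; dimensions, class names, and the exact parametrization of $W_a$ are our assumptions rather than the released implementation:

```python
# Token-level attention pooling over a BiLSTM-encoded paragraph plan,
# then contextualization and a content selection gate (Eqs. (1)-(4)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanEncoder(nn.Module):
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.d = nn.Parameter(torch.randn(dim))      # query vector d, Eq. (1)
        self.W_a = nn.Linear(dim, dim, bias=False)   # bilinear scores, Eq. (3)
        self.W_g = nn.Linear(2 * dim, dim, bias=False)

    def pool(self, tokens):                          # Eqs. (1)-(2)
        h, _ = self.bilstm(self.emb(tokens))         # (plans, len, dim)
        alpha = F.softmax(h @ self.d, dim=-1)        # attention over tokens
        return (alpha.unsqueeze(-1) * h).sum(1)      # e_i for each plan

    def forward(self, tokens):
        e = self.pool(tokens)                        # (plans, dim)
        scores = self.W_a(e) @ e.T                   # beta_{i,k}, Eq. (3)
        scores.fill_diagonal_(float("-inf"))         # exclude k == i
        beta = F.softmax(scores, dim=-1)
        c = beta @ e
        e_att = self.W_g(torch.cat([e, c], dim=-1))
        g = torch.sigmoid(e_att)                     # content selection gate, Eq. (4)
        return g * e                                 # e_i^c
```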
Content Planning  Our model learns to predict macro plans, after having been trained on pairs of sets of paragraph plans and corresponding macro plans (Sections 3.2 and 3.3 explain how we obtain these for data-to-text datasets like ROTOWIRE and MLB). More formally, we model macro plan $z = z_1 \ldots z_{|z|}$ as a sequence of pointers, with each $z_k$ pointing to an input paragraph plan, i.e., $z_k \in \{e_i\}_{i=1}^{|E|}$.
[Figure 3: Macro planning model; the paragraph plan representation and contextualization mechanism are detailed in Figure 2. The output pointer selects paragraph plans from $E$ (see Equations (5) and (6)). EOM is the end-of-macro-plan token.]

We decompose $p(z \mid E)$, the probability of macro plan $z$ given paragraph plans $E$, as:

$$p(z \mid E) = \prod_{k=1}^{|z|} p(z_k \mid z_{<k}, E) \qquad (5)$$

where $z_{<k} = z_1 \ldots z_{k-1}$. We use Pointer Networks (Vinyals et al., 2015) to model $p(z_k \mid z_{<k}, E)$ as:

$$p(z_k = e_i \mid z_{<k}, E) \propto \exp(h_k^\top W_b\, e_i^c) \qquad (6)$$

where $p(z_k \mid z_{<k}, E)$ is normalized to 1 and $W_b \in \mathbb{R}^{n \times n}$. Rather than computing a weighted representation, Pointer Networks make use of attention to point to specific elements in the input (see Figure 3). We use a decoder LSTM to compute the hidden representation $h_k$ at time step $k$. We initialize $h_0$ with the mean paragraph plan representation, $\mathrm{avg}(\{e_i^c\}_{i=1}^{|E|})$. Once the output points to $e_i$, its representation $e_i^c$ is used as input to the next step of the LSTM decoder. The process stops when the model points to EOM, a token indicating the end of the macro plan.
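One greedy decoding step of the pointer network (Equations (5) and (6)) might look as follows; the paper uses beam search at inference time, and all names here are illustrative:

```python
# Greedy pointer-network decoding: the decoder LSTM state is scored against
# every gated plan representation e_i^c, and the argmax picks the next plan.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
W_b = nn.Linear(dim, dim, bias=False)
decoder = nn.LSTMCell(dim, dim)

def decode_plan(e_c, max_len=20):
    """e_c: (|E|+1, dim) gated plan representations; last row stands for EOM."""
    h = e_c.mean(0)                       # h_0: mean plan representation
    c = torch.zeros_like(h)
    inp, out = e_c.mean(0), []
    eom = e_c.size(0) - 1
    for _ in range(max_len):
        h, c = decoder(inp.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
        h, c = h.squeeze(0), c.squeeze(0)
        scores = e_c @ W_b(h)             # h_k^T W_b e_i^c for every plan
        k = int(F.softmax(scores, dim=-1).argmax())
        if k == eom:                      # pointed to EOM: plan is complete
            break
        out.append(k)
        inp = e_c[k]                      # pointed plan feeds the next step
    return out
```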
Text Generation  Recall that $z$ is a sequence of pointers with each entry $z_k$ pointing to a paragraph plan, i.e., $z_k \in \{e_i\}_{i=1}^{|E|}$. We can deterministically obtain macro plan $x$ from $z$ by retrieving the paragraph plans being pointed to, adding <P> separators in between. The conditional output probability $p(y \mid x)$ is modeled as:

$$p(y \mid x) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x)$$

where $y_{<t} = y_1 \ldots y_{t-1}$. To compute $p(y \mid x)$, we use an encoder-decoder architecture enhanced with an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). We encode macro plan $x$ with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). At time step $t$, we look up the embedding of the previously predicted word $y_{t-1}$ and feed it as input to the decoder, which is another LSTM unit. The decoder attends over the hidden states of the macro plan to predict $y_t$. We further incorporate a copy mechanism (Gulcehre et al., 2016) in the decoder to enable copying values directly from the macro plan.

We expect the text generation model to learn to generate summary tokens while focusing on the corresponding macro plan, and that the output summary will indeed follow the plan in terms of the entities and events being described and their order. At the same time, we believe that text generation is relatively easier as the encoder-decoder model is relieved from the tasks of document structuring and information selection.

Training and Inference  We train two independent models for macro planning and text generation. Our training objective for macro planning aims to maximize the log likelihood of the macro plan given the paragraph plans:

$$\max_\theta \sum_{(E,z) \in D} \log p(z \mid E; \theta)$$

where $D$ is the training set consisting of pairs of (sets of) paragraph plans and macro plans, and $\theta$ are model parameters. Our training objective for text generation aims to maximize the log likelihood of the output text given the macro plan:

$$\max_\phi \sum_{(x,y) \in F} \log p(y \mid x; \phi)$$

where $F$ is the training set consisting of pairs of macro plans and game summaries, and $\phi$ are model parameters.

During inference, we employ beam search to find the most likely macro plan $\hat{z}$ among candidate macro plans $z'$ given paragraph plans as input:

$$\hat{z} = \arg\max_{z'} p(z' \mid E; \theta)$$

We deterministically obtain $\hat{x}$ from $\hat{z}$, and output summary $\hat{y}$ among candidate outputs $y'$ given macro plan $\hat{x}$ as input:

$$\hat{y} = \arg\max_{y'} p(y' \mid \hat{x}; \phi)$$
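Putting the two stages together, inference can be sketched as below; plan_model and text_model stand for the two independently trained models, and their beam_search methods are placeholders:

```python
# Sketch of the two-stage inference pipeline: beam-search the macro plan,
# stitch it into text with <P> separators, then condition the text generator
# on the macro plan alone.

def infer_summary(paragraph_plans, plan_model, text_model, beam=5):
    # Stage 1: most likely sequence of pointers into the paragraph plans
    z_hat = plan_model.beam_search(paragraph_plans, beam_size=beam)
    # Deterministically realize the macro plan x from the pointer sequence
    x_hat = " <P> ".join(paragraph_plans[k] for k in z_hat)
    # Stage 2: generate the summary conditioned on the macro plan
    return text_model.beam_search(x_hat, beam_size=beam)
```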
             ROTOWIRE   MLB
Vocab Size   11.3K      38.9K

Table 1: Dataset statistics for ROTOWIRE and MLB. Vocabulary size, number of tokens, number of instances (i.e., table-summary pairs), number of record types, average number of records, average number of paragraph plans, and average summary length.
Experimental Setup

Data  We performed experiments on the ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b) benchmarks. The details of these two datasets are given in Table 1. We can see that MLB is around 5 times bigger, has a richer vocabulary, and longer game summaries. We use the official splits of 3,398/727/728 for ROTOWIRE and 22,821/1,739/1,744 for MLB. We make use of a tokenization script (https://github.com/neulab/DGT) to detokenize and retokenize the summaries in both ROTOWIRE and MLB.

We reconstructed the MLB dataset, as the version released by Puduppully et al. (2019b) had removed all paragraph delimiters from game summaries. Specifically, we followed their methodology and downloaded the same summaries from the ESPN website and added the <P> delimiter to paragraphs in the summaries. ROTOWIRE does not have paragraph delimiters in game summaries either. We reverse engineered these as follows: (1) we split summaries into sentences using the NLTK (Bird et al., 2009) sentence tokenizer; (2) initialized each paragraph with a separate sentence; (3) merged two paragraphs into one if the entities in the former were a superset of entities in the latter; (4) repeated Step 3 until no merges were possible. Although our model is trained on game summaries with paragraph delimiters, and also predicts these at generation time, for evaluation we strip <P> from model output.
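The four-step paragraph reconstruction heuristic can be sketched as follows, assuming an entity extractor entities(sentence) that returns the set of entity mentions in a sentence (our assumption; the merge is applied to adjacent paragraphs):

```python
# Sketch of the ROTOWIRE paragraph reconstruction heuristic (steps 1-4 above).
from nltk.tokenize import sent_tokenize

def rebuild_paragraphs(summary, entities):
    # Steps 1-2: one paragraph per sentence
    paras = [[s] for s in sent_tokenize(summary)]
    merged = True
    while merged:                          # Step 4: repeat until no merges
        merged = False
        for i in range(len(paras) - 1):
            former = set().union(*(entities(s) for s in paras[i]))
            latter = set().union(*(entities(s) for s in paras[i + 1]))
            if former >= latter:           # Step 3: former's entities ⊇ latter's
                paras[i] += paras.pop(i + 1)
                merged = True
                break
    return paras
```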
Training Configuration  We tuned the model hyperparameters on the development set. For training the macro planning and the text generation stages, we used the Adagrad (Duchi et al., 2011) optimizer. Furthermore, the text generation stage made use of truncated BPTT (Williams and Peng, 1990) with truncation length 100. We learn a subword vocabulary (Sennrich et al., 2016) for paragraph plans in the macro planning stage: we used 2.5K merge operations for ROTOWIRE and 8K merge operations for MLB. In text generation, we learn a joint subword vocabulary for the macro plan and game summaries: we used 6K merge operations for ROTOWIRE and 16K merge operations for MLB. All models were implemented on OpenNMT-py (Klein et al., 2017). We add to set E the paragraph plans corresponding to the output summary paragraphs, to ensure full coverage during training of the macro planner. During inference for predicting macro plans, we employ length normalization (Bahdanau et al., 2015) to avoid penalizing longer outputs; specifically, we divide the scores of beam search by the length of the output. In addition, we adopt bigram blocking (Paulus et al., 2018). For MLB, we further block beams containing more than two repetitions of a unigram. This helps improve the diversity of the predicted macro plans.
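The inference-time heuristics (length normalization, bigram blocking, and the MLB unigram cap) admit a simple sketch; the data structures are our illustration:

```python
# hyp: a candidate pointer sequence; logprob: its cumulative log-probability.

def normalized_score(logprob, hyp):
    # divide the beam score by the output length, as described above
    return logprob / max(len(hyp), 1)

def violates_bigram_block(hyp):
    # a beam is pruned if it repeats an already-seen bigram
    bigrams = list(zip(hyp, hyp[1:]))
    return len(bigrams) != len(set(bigrams))

def violates_unigram_cap(hyp, cap=2):
    # MLB only: prune beams with more than `cap` repetitions of a unigram
    return any(hyp.count(tok) > cap for tok in set(hyp))
```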
System Comparisons  We compared our model against the following systems: (1) the Template-based generators (Templ) from Wiseman et al. (2017) for ROTOWIRE and Puduppully et al. (2019b) for MLB. Both systems apply the same principle: they emit a sentence about the teams playing in the game, followed by player-specific sentences, and a closing sentence. MLB additionally contains a description of play-by-play; (2) ED+CC, the best performing system in Wiseman et al. (2017), is a vanilla encoder-decoder model equipped with an attention and copy mechanism; (3) NCP+CC, the micro planning model of Puduppully et al. (2019a), generates content plans from the table by making use of Pointer Networks (Vinyals et al., 2015) to point to records; content plans are encoded with a BiLSTM and the game summary is decoded using another LSTM with attention and copy; (4) ENT, the entity-based model of Puduppully et al. (2019b), creates dynamically updated entity-specific representations; the text is generated conditioned on the data input and entity memory representations using hierarchical attention at each time step.
Results
ROTOWIRE       RG #   RG P%   CS P%   CS R%   CS F%   CO     BLEU
ED+CC          35.9   82.6    19.8    33.8    24.9    12.0   14.99
NCP+CC         40.8   87.6    28.0    51.1    36.2    15.8   16.50
ENT            32.7   91.7    –       –       –       –      –
Macro          42.1   97.6    34.1    –       –       –      –
−Plan(4)       36.2   81.3    22.1    38.6    28.1    12.1   14.00

MLB            RG #   RG P%   CS P%   CS R%   CS F%   CO     BLEU
ED+CC          32.5   91.3    27.8    40.6    33.0    17.1   9.68
NCP+CC         19.6   81.3    –       –       –       –      –
−Plan(SP,4)    25.1   92.7    40.0    44.6    42.2    –      –

Table 2: Evaluation on ROTOWIRE and MLB test sets; relation generation (RG) count (#) and precision (P%), content selection (CS) precision (P%), recall (R%) and F-measure (F%), content ordering (CO), and BLEU.

Automatic Evaluation  For automatic evaluation, following earlier work (Wiseman et al. 2017; Puduppully et al. 2019a,b, inter alia) we report BLEU (Papineni et al., 2002) with the gold summary as reference, but also make use of the Information Extraction (IE) metrics from Wiseman et al. (2017), which are defined over the output of an IE system; the latter extracts entity (players, teams) and value (numbers) pairs in a summary, and then predicts the type of relation. For instance, given the pair
⟨Kansas City Royals, 9⟩, it would predict their relation as TR (i.e., Team Runs). Training data for the IE system is obtained by checking for matches between entity, value pairs in the gold summary and entity, value, record type triplets in the table.

Let ŷ be the gold summary and y the model output. Relation Generation (RG) measures the precision and count of relations extracted from y that also appear in records r. Content Selection (CS) measures the precision and recall of relations extracted from y that are also extracted from ŷ. Content Ordering (CO) measures the normalized Damerau-Levenshtein distance between the sequences of relations extracted from y and ŷ.

We reused the IE model from Puduppully et al. (2019a) for ROTOWIRE but retrained it for MLB to improve its precision and recall. Furthermore, the implementation of Wiseman et al. (2017) computes RG, CS, and CO excluding duplicate relations. This artificially inflates the performance of models whose outputs contain repetition. We include duplicates in the computation of the IE metrics (and recreate them for all comparison systems).
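For reference, a plain implementation of the (restricted) Damerau-Levenshtein distance underlying the CO metric; the normalization shown is one common choice and may differ in detail from the implementation of Wiseman et al. (2017):

```python
# Restricted (optimal string alignment) Damerau-Levenshtein distance between
# the relation sequences extracted from model output and gold summary.

def damerau_levenshtein(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[len(a)][len(b)]

def content_ordering(pred_relations, gold_relations):
    dist = damerau_levenshtein(pred_relations, gold_relations)
    return 1.0 - dist / max(len(pred_relations), len(gold_relations), 1)
```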
Table 2 (top) presents our results on the ROTOWIRE test set. In addition to Templ, NCP+CC, ENT, and ED+CC we include the best performing model of Wiseman et al. (2017) (WS-2017; note that ED+CC is an improved re-implementation of their model), and the model of Rebuffel et al. (2020) (RBF-2020), which represents the state of the art on ROTOWIRE. This model has a Transformer encoder (Vaswani et al., 2017) with a hierarchical attention mechanism over entities and records within entities. The models of Saleh et al. (2019), Iso et al. (2019), and Gong et al. (2019) make use of additional information not present in the input (e.g., previous/next games, summary writer) and are not directly comparable to the systems in Table 2. Results for the MLB test set are in the bottom portion of Table 2.

Templ has the highest RG precision and count on both datasets. This is not surprising: by design Templ is always faithful to the input. However, notice that it achieves the lowest BLEU amongst comparison systems, indicating that it mostly regurgitates facts with low fluency. Macro achieves the highest RG precision amongst all neural models for ROTOWIRE and MLB. We obtain an absolute improvement of 5.9% over ENT for ROTOWIRE and 13.3% for MLB. In addition, Macro achieves the highest CS F-measure for both datasets. Macro achieves the highest CO score on ROTOWIRE, and the highest BLEU on MLB. On ROTOWIRE, in terms of BLEU, Macro is worse than comparison models (e.g., NCP+CC or ENT). Inspection of the output showed that the opening paragraph, which mostly describes how the two teams fared, is generally shorter in Macro, leading to shorter summaries and thus lower BLEU. There is high variance in the length of the opening paragraph in the training data and Macro verbalizes the corresponding plan conservatively. Ideas such as length normalisation (Wu et al., 2016) or length control (Kikuchi et al., 2016; Takeno et al., 2017; Fan et al., 2018) could help alleviate this; however, we do not pursue them further for fair comparison with the other models.
The Contribution of Macro Planning
To study the effect of macro planning in more detail, we further compared Macro against text generation models (see Section 4.2) which are trained on verbalizations of the tabular data (and gold summaries) but do not make use of document plans or a document planning mechanism. On ROTOWIRE, the model was trained on verbalizations of players and teams, with the input arranged such that the verbalization of the home team was followed by the visiting team, the home team players and the visiting team players. Mention of players was limited to the four best ones, following Saleh et al. (2019) (see −Plan(4) in Table 2). For MLB, we additionally include verbalizations of innings focusing on scoring plays which are likely to be discussed in game summaries (see −Plan(SP,4) in Table 2). Note that by preprocessing the input in such a way, some simple form of content selection takes place simply by removing extraneous information which the model does not need to consider.

[Table 3: Evaluation of macro planning stage; content selection precision (CS-P), recall (CS-R), F-measure (CS-F) and content ordering (CO) between the inferred plans and gold plans in terms of entities and events for ROTOWIRE (RW) and MLB test sets.]

Across both datasets, −Plan variants appear competitive. On
ROTOWIRE, −Plan(4) is better than ED+CC in terms of content selection but worse compared to ENT. On MLB, −Plan(SP,4) is again superior to ED+CC in terms of content selection but not ENT, whose performance lags behind when considering RG precision. Taken together, these results confirm that verbalizing entities and events into a text sequence is effective. At the same time, we see that −Plan variants are worse than Macro across most metrics, which underlines the importance of an explicit planning component.

Table 3 presents an intrinsic evaluation of the macro planning stage. Here, we compare the inferred macro plans with the gold macro plans, using the CS and CO metrics with regard to entities and events instead of relations. We see that our macro planning model (Macro) achieves high scores for CS and CO for both ROTOWIRE and MLB. We further used the CS and CO metrics to check how well the generated summary follows the (predicted) plan. We followed the steps in Section 3.2 and reverse engineered macro plans from the model summaries and compared these extracted plans with the original macro plans with regard to entities and events. We found that Macro creates summaries which follow the plan closely: for ROTOWIRE, the CS F-score and CO are greater than 98%; for MLB, the CS F-score is greater than 94% and CO is greater than 89%. We show an output summary for Macro in Table 4, together with the predicted document plan.
V(Red Sox) > < P > < V(8-B) > < P > < V(8-T) >< P > < V(9-T) > < P > < V(Clay Buchholz) > < P > < V(5-T) >< P > < V(Edwin Jackson) > < P > < V(5-T) > < P > < V(8-T) > ST. PETERSBURG, Fla. (AP) – The
Tampa Bay Rays are makingthe most of it. < P > Akinori Iwamura hit a two-run homer in the eighth inning and the Rays beat the Boston Red Sox 2-1 on Sundayto complete a three-game sweep. < P > The
Rays , who have the bestrecord in the majors, have won six of their last seven games. < P > The
Rays have won four of their last five series, including three ina row against the
Red Sox , who have won six of their last sevenoverall. < P > Dioner Navarro singled with one out in the eighth offClay Buchholz (1-2) and moved to third on Jason Bartlett’s flyout tocenter. Iwamura then drove a 1-1 pitch into the left-field stands forhis second homer of the season. < P > Scott Dohmann (2-0) got thewin in relief , striking out Manny Ramirez with runners on first andthird to end the eighth . < P > Troy Percival worked the ninth forhis fifth save in five opportunities. < P > Clay Buchholz (1-2) gaveup two runs and three hits in eight innings. He struck out nine andwalked two. < P > The Red Sox loaded the bases with one out in the fifth on a single by Coco Crisp, a wild pitch and a walk to Jed Lowrie.Jacoby Ellsbury drove in Crisp with a two-out single to center. < P > Jackson struck out four and walked three. < P > The Red Sox loadedthe bases with one out in the fifth on a single by Coco Crisp, a walkto Jed Lowrie and a one-out walk to Jed Lowrie. Jackson struck outJulio Lugo, but Jacoby Ellsbury singled to center to put the Red Soxup 1-0. < P > The Red Sox threatened in the eighth when J. D. Drewdrew a two-out walk against Trever Miller, but Ramirez struck out toend the inning.
Table 4: Predicted macro plan (top) with correspondingmodel output (bottom). Entities and events in summarycorresponding to those in the macro plan are bold faced.
Human-Based Evaluation
We also asked participants to assess model output in terms of relation generation, grammaticality, coherence, and conciseness (Wiseman et al., 2017; Puduppully et al., 2019a,b). For ROTOWIRE, we compared Macro against RBF-2020, ED+CC, Gold, and Templ. (We are grateful to Clément Rebuffel for providing us with the output of their system.) For MLB, we compared Macro against ENT, ED+CC, Gold, and Templ.

We conducted our study on the Amazon Mechanical Turk (AMT) crowdsourcing platform, following best practices for human evaluation in NLG (van der Lee et al., 2019). Specifically, to ensure consistent ratings, we required crowdworkers to have an approval rating greater than 98% and a minimum of 1,000 previously completed tasks. Raters were restricted to English speaking countries (i.e., US, UK, Canada, Ireland, Australia, or NZ). Participants were allowed to provide feedback on the task or field questions (our interface accepts free text).

In our first study, we presented crowdworkers with sentences randomly selected from summaries along with their corresponding box score (and play-
[Table 5: Average number of supported and contradicting facts in game summaries, and best-worst scaling evaluation (higher is better). Systems significantly different from Macro are marked with an asterisk * (using a one-way ANOVA with posthoc Tukey HSD tests; p ≤ 0.05).]

by-play in case of MLB) and asked them to count supported and contradicting facts (ignoring hallucinations, i.e., unsupported facts). We did not require crowdworkers to be familiar with NBA or MLB. Instead, we provided a cheat sheet explaining the semantics of box score tables. In addition, we provided examples of sentences with supported/contradicting facts. We evaluated 40 summaries from the test set (20 per dataset), 4 sentences from each summary, and elicited 3 responses per summary. This resulted in a total of 480 judgments (40 summaries × 4 sentences × 3 raters); inter-annotator agreement (α) was 0.44 for supported and 0.42 for contradicting facts.

As shown in Table 5, Macro yields the smallest number of contradicting facts among neural models on both datasets. On ROTOWIRE the number of contradicting facts for Macro is comparable to Gold and Templ (the difference is not statistically significant) and significantly smaller compared to RBF-2020 and ED+CC. The count of supported facts for Macro is comparable to Gold and ED+CC, and significantly lower than Templ and RBF-2020. On MLB, Macro has significantly fewer contradicting facts than ENT and ED+CC and is comparable to Templ and Gold (the difference is not statistically significant). The count of supported facts for Macro is comparable to Gold, ENT, ED+CC and Templ. For both datasets, Templ has the lowest number of contradicting facts. This is expected as Templ essentially parrots facts (aka records) from the table.

We also conducted a second study to evaluate the quality of the generated summaries. We presented crowdworkers with a pair of summaries and asked them to choose the better one in terms of
Grammaticality (is the summary written in well-formed English?), Coherence (is the summary well structured and well organized and does it have a natural ordering of the facts?) and Conciseness (does the summary avoid unnecessary repetition including whole sentences, facts or phrases?). We provided example summaries showcasing good and bad output. For this task, we required that the crowdworkers be able to comfortably comprehend NBA/MLB game summaries. We elicited preferences with Best-Worst Scaling (Louviere and Woodworth, 1991; Louviere et al., 2015), a method shown to be more reliable than rating scales. The score of a system is computed as the number of times it is rated best minus the number of times it is rated worst (Orme, 2009). The scores range from −100 (absolutely worst) to +100 (absolutely best). We divided the five competing systems into ten pairs of summaries and elicited ratings for 40 summaries (20 per dataset). Each summary pair was rated by 3 raters. This resulted in judgments for 40 summaries × 10 system pairs × 3 raters (inter-annotator agreement α was 0.47).
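The Best-Worst Scaling score admits a one-function sketch; the judgment layout and per-system normalization are our assumptions:

```python
# Best-Worst Scaling: times-rated-best minus times-rated-worst, scaled to
# [-100, +100] by the number of comparisons each system took part in.
from collections import Counter

def bws_scores(judgments, appearances):
    """judgments: (best_system, worst_system) pairs;
    appearances: number of comparisons each system appeared in."""
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    return {s: 100.0 * (best[s] - worst[s]) / appearances[s]
            for s in appearances}
```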
As shown in Table 5, on ROTOWIRE, Macro is comparable to Gold, RBF-2020, and ED+CC in terms of Grammaticality but significantly better than Templ. In terms of Coherence, Macro is comparable to RBF-2020 and ED+CC but significantly better than Templ and significantly worse than Gold. With regard to Conciseness, Macro is comparable to Gold, RBF-2020, and ED+CC, and significantly better than Templ. On MLB, Macro is comparable to Gold in terms of Grammaticality and significantly better than ED+CC, ENT and Templ. Macro is comparable to Gold in terms of Coherence and significantly better than ED+CC, ENT and Templ. In terms of Conciseness, raters found Macro comparable to Gold and Templ and significantly better than ED+CC and ENT. Taken together, our results show that macro planning leads to improvement in data-to-text generation in comparison to other systems for both the ROTOWIRE and MLB datasets.
Discussion
In this work we presented a plan-and-generate approach for data-to-text generation which consists of a macro planning stage representing high-level document organization in terms of structure and content, followed by a text generation stage. Extensive automatic and human evaluation shows that our approach achieves better results than existing state-of-the-art models and generates summaries which are factual, coherent, and concise.

Our results show that macro planning is more advantageous for generation tasks expected to produce longer texts with multiple discourse units, and could be easily extended to other sports domains such as cricket (Kelly et al., 2009) or American football (Barzilay and Lapata, 2005). Other approaches focusing on micro planning (Puduppully et al., 2019a; Moryossef et al., 2019) might be better tailored for generating shorter texts. There has been a surge of datasets recently focusing on single-paragraph outputs and the task of content selection, such as E2E (Novikova et al., 2017), WebNLG (Gardent et al., 2017), and WikiBio (Lebret et al., 2016; Perez-Beltrachini and Lapata, 2018). We note that in our model content selection takes place during macro planning and text generation. The results in Table 2 show that Macro achieves the highest CS F-measure on both datasets, indicating that the document as a whole and individual sentences discuss appropriate content.

Throughout our experiments we observed that template-based systems score poorly in terms of CS (but also CO and BLEU). This is primarily due to the inflexibility of the template approach, which is limited to the discussion of a fixed number of (high-scoring) players. Yet human writers (and neural models to a certain extent) synthesize summaries taking into account the particulars of a specific game (where some players might be more important than others even if they scored less) and are able to override global defaults. Template sentences are fluent on their own, but since it is not possible to perform aggregation (Reiter, 1995), the whole summary appears stilted; it lacks coherence and variability, contributing to low BLEU scores. The template baseline is worse for MLB than ROTOWIRE, which reflects the greater difficulty of manually creating a good template for MLB. Overall, we observe that neural models are more fluent and coherent, being able to learn a better ordering of facts, which is in turn reflected in better CO scores.

Despite promising results, there is ample room to improve macro planning, especially in terms of the precision of RG (see Table 2, P% column of RG). We should not underestimate that Macro must handle relatively long inputs (the average input length in the MLB development set is ~3100 tokens) which are challenging for the attention mechanism. Consider the following output of our model on the MLB dataset:
"Ramirez's two-run double off Joe Blanton tied it in the sixth, and Brandon Moss added a two-out RBI single off Alan Embree to give Boston a 3-2 lead." Here, the name of the pitcher should have been Joe Blanton instead of Alan Embree. In fact, Alan Embree is the pitcher for the following play in the half inning. In this case, attention diffuses over the relatively long MLB macro plan, leading to inaccurate content selection. We could alleviate this problem by adopting a noisy channel decomposition (Yee et al., 2019; Yu et al., 2020), i.e., by learning two different distributions: a conditional model which provides the probability of translating a paragraph plan to text, and a language model which provides an unconditional estimate of the output (i.e., the whole game summary). However, we leave this to future work.

For ROTOWIRE, the main source of errors is the model's inability to understand numbers. For example, Macro generates the following output: "The Lakers were the superior shooters in this game, going 48 percent from the field and 30 percent from the three-point line, while the Jazz went 47 percent from the floor and 30 percent from beyond the arc." Here, 30 percent should have been 24 percent for the Lakers, but the language model expects a higher score for the three-point line, and since 24 is low (especially compared to the 30 scored by the Jazz), it simply copies the 30 scored by the Jazz instead. A mechanism for learning better representations for numbers (Wallace et al., 2019) or executing operations such as argmax or minus (Nie et al., 2018) should help alleviate this problem.

Finally, although our focus so far has been on learning document plans from data, the decoupling of planning from generation allows us to flexibly generate output according to specification. For example, we could feed the model with manually constructed macro plans, consequently controlling the information content and structure of the output summary (e.g., for generating short or long texts, or focusing on specific aspects of the game).

Acknowledgements

We thank the Action Editor, Claire Gardent, and the three anonymous reviewers for their constructive feedback. We also thank Laura Perez-Beltrachini for her comments on an earlier draft of this paper, and Parag Jain, Hao Zheng, Stefanos Angelidis and Yang Liu for helpful discussions. We acknowledge the financial support of the European Research Council (Lapata; award number 681760, "Translating Multiple Modalities into Text").
References
Dzmitry Bahdanau, Kyunghyun Cho, and YoshuaBengio. 2015. Neural machine translation byjointly learning to align and translate. In .Regina Barzilay and Mirella Lapata. 2005. Collec-tive content selection for concept-to-text genera-tion. In
Proceedings of Human Language Tech-nology Conference and Conference on EmpiricalMethods in Natural Language Processing , pages331–338, Vancouver, British Columbia, Canada.Association for Computational Linguistics.Steven Bird, Ewan Klein, and Edward Loper.2009.
Natural Language Processing withPython . O’Reilly Media.Thiago Castro Ferreira, Chris van der Lee, Emielvan Miltenburg, and Emiel Krahmer. 2019. Neu-ral data-to-text generation: A comparison be-tween pipeline and end-to-end architectures. In
Proceedings of the 2019 Conference on Em-pirical Methods in Natural Language Process-ing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) , pages 552–562, Hong Kong, China.Association for Computational Linguistics.Wallace L. Chafe. 1979. The flow of thought andthe flow of language. In Talmy Givón, editor,
Syntax and Semantics , volume 12, pages 159–181. Academic Press Inc.Yen-Chun Chen and Mohit Bansal. 2018. Fast ab-stractive summarization with reinforce-selectedsentence rewriting. In
Proceedings of the 56thAnnual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers) , pages 675–686, Melbourne, Australia. Associa-tion for Computational Linguistics.Robert Dale. 1989. Generating referring expres-sions in a domain of objects and processes.Pablo Duboue and Kathleen McKeown. 2002. Con-tent planner construction via evolutionary algo-rithms and a corpus-based fitness function. In
Pablo A. Duboue and Kathleen R. McKeown. 2001. Empirically estimating order constraints for content planning in generation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 172–179, Toulouse, France. Association for Computational Linguistics.
John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia. Association for Computational Linguistics.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-text generation with effective hierarchical encoder on three dimensions (row, column and time). In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3143–3152, Hong Kong, China. Association for Computational Linguistics.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany. Association for Computational Linguistics.
M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9:1735–1780.
Eduard H. Hovy. 1993. Automated discourse generation using discourse structure relations. Artificial Intelligence, 63(1-2):341–385.
Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2019. Learning to select, track, and generate for data-to-text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2102–2113, Florence, Italy. Association for Computational Linguistics.
Min-Yen Kan and Kathleen R. McKeown. 2002. Corpus-trained text generation for summarization. In Proceedings of the International Natural Language Generation Conference, pages 1–8, Harriman, New York, USA. Association for Computational Linguistics.
Colin Kelly, Ann Copestake, and Nikiforos Karamanis. 2009. Investigating content selection for language generation using machine learning. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 130–137, Athens, Greece. Association for Computational Linguistics.
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas. Association for Computational Linguistics.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.
Ioannis Konstas and Mirella Lapata. 2013. Inducing document plans for concept-to-text generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1503–1514, Seattle, Washington, USA. Association for Computational Linguistics.
Karen Kukich. 1983. Design of a knowledge-based report generator. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pages 145–150, Cambridge, Massachusetts, USA. Association for Computational Linguistics.
Anirban Laha, Parag Jain, Abhijit Mishra, and Karthik Sankaranarayanan. 2020. Scalable micro-planned generation of discourse from structured data. Computational Linguistics, 45(4):737–763.
Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, Texas. Association for Computational Linguistics.
Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368, Tokyo, Japan. Association for Computational Linguistics.
R. E. Longacre. 1979. The paragraph as a grammatical unit. In Talmy Givón, editor, Syntax and Semantics, volume 12, pages 115–133. Academic Press Inc.
Jordan J. Louviere, Terry N. Flynn, and A. A. J. Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.
Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
Kathleen R. McKeown. 1992. Text Generation. Studies in Natural Language Processing. Cambridge University Press.
Kathleen R. McKeown, Desmond A. Jordan, Shimei Pan, James Shaw, and Barry A. Allen. 1997. Language generation for multimedia healthcare briefings. In Fifth Conference on Applied Natural Language Processing, pages 277–282, Washington, DC, USA. Association for Computational Linguistics.
Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730, San Diego, California. Association for Computational Linguistics.
Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277, Minneapolis, Minnesota. Association for Computational Linguistics.
Feng Nie, Jinpeng Wang, Jin-Ge Yao, Rong Pan, and Chin-Yew Lin. 2018. Operation-guided neural networks for high fidelity data-to-text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3879–3889, Brussels, Belgium. Association for Computational Linguistics.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics.
Bryan Orme. 2009. MaxDiff analysis: Simple counting, individual-level logit, and HB. Sawtooth Software.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
Laura Perez-Beltrachini and Mirella Lapata. 2018. Bootstrapping generators from noisy data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1516–1527, New Orleans, Louisiana. Association for Computational Linguistics.
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019a. Data-to-text generation with content selection and planning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, Hawaii.
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019b. Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2023–2035, Florence, Italy. Association for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In European Conference on Information Retrieval, pages 65–80. Springer.
Ehud Reiter. 1995. NLG vs. templates. CoRR, cmp-lg/9504013v1.
Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Studies in Natural Language Processing. Cambridge University Press.
Fahimeh Saleh, Alexandre Berard, Ioan Calapodescu, and Laurent Besacier. 2019. Naver Labs Europe's systems for the document-level generation and translation task at WNGT 2019. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 273–279, Hong Kong. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Zhihong Shao, Minlie Huang, Jiangtao Wen, Wenfei Xu, and Xiaoyan Zhu. 2019. Long and diverse text generation with planning-based hierarchical variational model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3257–3268, Hong Kong, China. Association for Computational Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27, pages 3104–3112. Curran Associates, Inc.
Shunsuke Takeno, Masaaki Nagata, and Kazuhide Yamamoto. 2017. Controlling target features in neural machine translation via prefix constraints. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 55–63, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P. Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. CoRR, abs/1910.08684v2.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.
Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics.
Ronald J. Williams and Jing Peng. 1990. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501.
Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144v2.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.
Kyra Yee, Yann Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5696–5701, Hong Kong, China. Association for Computational Linguistics.
Lei Yu, Laurent Sartran, Wojciech Stokowiec, Wang Ling, Lingpeng Kong, Phil Blunsom, and Chris Dyer. 2020. Better document-level machine translation with Bayes' rule. Transactions of the Association for Computational Linguistics, 8:346–360.
Wlodek Zadrozny and Karen Jensen. 1991. Semantics of paragraphs.