Controlling Hallucinations at Word Level in Data-to-Text Generation
Clément Rebuffel, Marco Roberti, Laure Soulier, Geoffrey Scoutheeten, Rossella Cancelliere, Patrick Gallinari
Abstract
Data-to-Text Generation (DTG) is a subfield of Natural Language Generation aiming at transcribing structured data into natural language descriptions. The field has recently been boosted by the use of neural-based generators which exhibit, on the one hand, great syntactic skills without the need for hand-crafted pipelines; on the other hand, the quality of the generated text reflects the quality of the training data, which in realistic settings only offers imperfectly aligned structure-text pairs. Consequently, state-of-the-art neural models include misleading statements, usually called hallucinations, in their outputs. The control of this phenomenon is today a major challenge for DTG, and is the problem addressed in this paper. Previous work deals with this issue at the instance level, using an alignment score for each table-reference pair. In contrast, we propose a finer-grained approach, arguing that hallucinations should rather be treated at the word level. Specifically, we propose a Multi-Branch Decoder which is able to leverage word-level labels to learn the relevant parts of each training instance. These labels are obtained following a simple and efficient scoring procedure based on co-occurrence analysis and dependency parsing. Extensive evaluations, via automated metrics and human judgment on the standard WikiBio benchmark, show the accuracy of our alignment labels and the effectiveness of the proposed Multi-Branch Decoder. Our model is able to reduce and control hallucinations, while keeping fluency and coherence in generated texts. Further experiments on a degraded version of ToTTo show that our model can be successfully used in very noisy settings.

⋆ Clément Rebuffel and Marco Roberti contributed equally.

Clément Rebuffel: Sorbonne Université, CNRS, LIP6, F-75005 Paris, France. E-mail: clement.rebuff[email protected]
Marco Roberti: University of Turin, Italy. E-mail: [email protected]
Laure Soulier: Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
Geoffrey Scoutheeten: BNP Paribas, Paris
Rossella Cancelliere: University of Turin, Italy
Patrick Gallinari: Sorbonne Université, CNRS, LIP6, F-75005 Paris, France; Criteo AI Lab, Paris
Keywords
Data-to-Text Generation · Hallucinations · Controlled Text Generation
1 Introduction

Data-to-Text Generation (DTG) is the subfield of Computational Linguistics and Natural Language Generation (NLG) concerned with transcribing structured data into natural language descriptions or, said otherwise, transcribing machine-understandable information into a human-understandable description [14]. DTG objectives include coverage, i.e. all the required information should be present in the text, and adequacy, i.e. the text should not contain information that is not covered by the input data. DTG is a domain distinct from other NLG tasks (e.g. machine translation [60], text summarization [25]) with its own challenges [60], starting with the nature of its inputs [48, 34]. Such inputs include, but are not limited to, databases of records, spreadsheets, knowledge bases, and sensor readings. As an example, Fig. 1 shows an instance of the WikiBio dataset, i.e. a data table containing information about Kian Emadi, paired with its corresponding natural language description found on Wikipedia.

Early approaches to DTG relied on static rules hand-crafted by experts, in which content selection (what to say) and surface realization (how to say it) are typically two separate tasks [48, 9]. In recent years, neural models have blurred this distinction: various approaches showed that both content selection and surface realization can be learned in an end-to-end, data-driven fashion [33, 29, 49, 42]. Based on the now-standard encoder-decoder architecture, with attention and copy mechanisms [1, 51], neural methods for DTG are able to produce fluent text conditioned on structured data in a number of domains [26, 60, 43], without relying on heavy manual work from field experts.

Such advances have gone hand in hand with the introduction of larger and more complex benchmarks. In particular, surface-realization abilities have been well studied on hand-crafted datasets such as E2E [37] and WebNLG [13], while content selection has been addressed by automatically constructed datasets such as WikiBio [26] or RotoWire [60]. These large corpora are often constructed from internet sources, which, while easy to access and aggregate, do not consist of perfectly aligned source-target pairs [40, 6]. Consequently, model outputs are often subject to over-generation: misaligned fragments from training instances, namely divergences, can induce similarly misaligned outputs during inference, the so-called hallucinations. Hallucinations are currently regarded as a major issue in DTG [34], with experimental surveys showing that real-life end users of DTG systems care more about accuracy than about readability [47], as inaccurate texts can potentially mislead decision makers, with dire consequences.

key         | value
name        | kian emadi
fullname    | kian emadi-coffin
currentteam | retired
discipline  | track
role        | rider
ridertype   | sprinter
proyears    | 2012-present
proteams    | sky track cycling

Ref.: kian emadi (born 29 july 1992) is a british track cyclist .
Fig. 1
An example of a WikiBio instance, composed of an input table and its (partially aligned) description.
When corpora include a mild amount of noise, as in hand-crafted ones (e.g. E2E, WebNLG), dataset regularization techniques [35, 8] or hand-crafted rules [19] can help to reduce hallucinations. Unfortunately, these techniques are not suited to more realistic and noisier datasets such as WikiBio [26] or RotoWire [60]. On these benchmarks, several techniques have been proposed, such as reconstruction loss terms [60, 59] or Reinforcement Learning (RL) based methods [41, 30, 45]. These approaches however suffer from different issues: (1) the reconstruction loss relies on a hypothesis of one-to-one alignment between source and target which does not fit with content selection in DTG; (2) RL-trained models are based on instance-level rewards (e.g. BLEU [38], PARENT [6]) which can lead to a loss of signal, because divergences occur at the word level. In practice, parts of the target sentence express source attributes (in Fig. 1 the name and occupation fields are correctly realized), while others diverge (the birth date and nationality of Kian Emadi are not supported by the source table).

Interestingly, one can view DTG models as Controlled Text Generation (CTG) ones focused on controlling content, as most CTG techniques condition the generation on several key-value pairs of control factors (e.g. tone, tense, length) [7, 17, 10]. Recently, Filippova [11] explicitly introduced CTG to DTG by leveraging a hallucination score, simply attached as an additional attribute, which reflects the amount of noise in the instance. As an example, the table from Fig. 1 can be augmented with an additional (hallucination_score, ·) line.¹ However, this approach requires a strict alignment at the instance level, namely between control factors and target text. A first attempt towards word-level approaches is proposed by Perez-Beltrachini and Lapata [41]. They design word-level alignment labels, denoting the correspondence between the text and the input table, to bootstrap DTG systems. However, they incorporate these labels into a sentence-level RL reward, which ultimately leads to a loss of this finer-grained signal.

In this paper, we go further in this direction with a DTG model fully leveraging word-level alignment labels from a CTG perspective. We propose an original approach in which the word level is integrated at all phases:

– we propose a word-level labeling procedure (Section 3), based on co-occurrences and sentence structure through dependency parsing. This mitigates the failure of strict word-matching procedures, while still producing relevant labels in complex settings.
– we introduce a weighted multi-branch neural decoder (Section 4), guided by the proposed alignment labels, acting as word-level control factors. During training, the model is able to distinguish between aligned and unaligned words and learns to generate accurate descriptions without being misled by unfactual reference information. Furthermore, our multi-branch weighting approach enables control at inference time.

We carry out extensive experiments on WikiBio, to evaluate both our labeling procedure and our decoder (Section 6).
We also test our framework on ToTTo [39], in which models are trained with noisy reference texts, and evaluated on references reviewed and cleaned by human annotators to ensure accuracy. Evaluations are based on a range of automated metrics as well as human judgments, and show increased performances regarding hallucination reduction, while preserving fluency.

¹ The reader may disagree with such a strong hallucination score. Indeed, while the birth date and nationality are clearly divergences, the rest of the sentence is correct. This illustrates the complexity of handling divergences in complex datasets, where alignment cannot be framed as a simple word-matching task.

2 Related Work

The use of Deep Learning based methods to solve DTG tasks has led to sudden improvements in state-of-the-art performances [26, 60, 31, 42]. As a key aspect in determining a model's performance is the quality of training data, several large corpora have been introduced to train and evaluate models' abilities on diverse tasks. E2E [37] evaluates surface realization, i.e. the strict transcription of input attributes into natural language; RotoWire [60] pairs statistics of basketball games with their journalistic descriptions, while WikiBio [26] maps a Wikipedia info-box to the first paragraph of its associated article. Contrary to E2E, the latter datasets are not limited to surface realization. They were not constructed by human annotators, but rather created from Internet sources, and consist of loosely aligned table-reference pairs: in WikiBio, almost two thirds of the training instances contain divergences [6], and no instance has a one-to-one source-target alignment [40].

On datasets with a moderate amount of noise, such as E2E, data pre-processing has proven effective for reducing hallucinations. Indeed, rule-based [8] or neural-based methods [35] have been proposed, specifically with table regularization techniques, where attributes are added or removed to re-align the table and the target description. Several successful attempts have also been made at automatically learning alignments between source tables and reference texts, benefiting from the regularity of the examples [19, 53, 15]. For instance, Juraska et al. [19] leverage templating and hand-crafted rules to re-rank the top outputs of a model decoding via beam search, while Gehrmann et al. [15] also leverage the possible templating formats of E2E's reference texts, and train an ensemble of decoders where each decoder is associated to one template. The previous techniques are not applicable in more complex, general settings. The work of Dusek et al. [8] hints in this direction, as the authors found that neural models trained on E2E were principally prone to omissions rather than hallucinations. In this direction, Shen et al. [53] were able to obtain good results at increasing the coverage of neural outputs, by constraining the decoder to focus its attention exclusively on each table cell sequentially until the whole table was realized.

On more complex datasets (e.g. WikiBio), a wide range of methods has been explored to deal with factualness, such as loss design, either with a reconstruction term [60, 59] or with RL-based methods [41, 30, 45]. Similarly to coverage constraints, a reconstruction loss has proven only marginally efficient in these settings, as it contradicts the content selection task [59], and needs to be well calibrated using expert insight in order to bring improvements. Regarding RL, Perez-Beltrachini and Lapata [41] build an instance-level reward which sums up word-level scores; Liu et al. [30] propose a reward based on document frequency to favor words from the source table over rare words; and Rebuffel et al. [45] train a network with a variant of PARENT [6] using self-critical RL.
Note that data regularization techniques have also been proposed [57, 59], but these methods require heavy manual work and expert insights, and are not readily transposable from one domain to another.
From CTG to controlling hallucinations.
Controlled Text Generation (CTG) is concerned with constraining a language model's output during inference on a number of desired attributes, or control factors, such as the identity of the speaker in a dialog setting [28], the politeness of the generated text or the text length in machine translation [52, 21], or the tense in generated movie reviews [17]. Earlier attempts at neural CTG can even be seen as direct instances of DTG as it is currently defined: models are trained to generate text conditioned on attributes of interest, where attributes are key-value pairs. For instance, in the movie review domain, Ficler and Goldberg [10] proposed an expertly crafted dataset, where sentences are strictly aligned with control factors, being either content or linguistic style aspects (e.g. tone, length).

In the context of dealing with hallucinations in DTG, Filippova [11] recently proposed a similar framework, augmenting source tables with an additional attribute that reflects the degree of hallucinated content in the associated target description. During inference, this attribute acts as a hallucination handle used to produce more or less factual text. As mentioned in Section 1, we argue that a unique value cannot accurately represent the correspondence between a table and its description, due to the phrase-based nature of divergences.

Based on this literature review, the lack of model control can be evidenced when loss modification methods are used [59, 29, 45], although these approaches can be efficient and transposed from one domain to another. On the other hand, while CTG deals with control and enables choosing the defining features of generated texts [11], standard approaches rely on instance-level control factors that do not fit hallucinations, which rather appear due to divergences at the word level. Our approach aims at gathering the merits of both trends of models, and is guided by the previous statements highlighting that the word level is primary in hallucination control. More particularly, our model differs from previous ones in several aspects:

(1) Contrasting with data-driven approaches (i.e. dataset regularization), which are costly in expert time, and loss-driven approaches (i.e. reconstruction or RL losses), which often do not take into account key subtasks of DTG (content selection, word-level correspondences), we propose a multi-branch modeling procedure which allows the controllability of the hallucination factor in DTG. This multi-branch model can be integrated seamlessly into current approaches, allowing to keep the peculiarities of existing DTG models, while deferring hallucination management to a parallel decoding branch.

(2) Unlike previous CTG approaches [28, 52, 10, 11], which propose instance-level control factors, the control of the hallucination factor is performed at the word level, to enable a finer-grained signal to be sent to the model.
Fig. 2
The reference sentence of the example shown in Fig. 1. Every token is associated with its Part-of-Speech tag and hallucination score s_t. Words in red denote s_t < τ. The dependency parse is represented by labeled arrows flowing from parents to children. Important words are kian, emadi, 29, july, 1992, british, track, and cyclist.

Our model is composed of two main components: (1) a word-level alignment labeling mechanism, which makes the correspondence between the input table and the text explicit, and (2) a multi-branch decoder guided by these alignment labels. The branches separately integrate co-dependent control factors (namely content, hallucination and fluency). We describe these components in Sections 3 and 4, respectively.

3 Word-Level Alignment Labels

We consider a DTG task in which the corpus C is composed of a set of entity-description pairs (e, y). A single-entity table e is a variable-sized set of T_e key-value pairs x := (k, v). A description y := y_{1:T_y} is a sequence of T_y tokens representing the natural language description of the entity; we refer to the tokens spanning from indices t to t' of a description y as y_{t:t'}. Fig. 1 shows a WikiBio entity made of 8 key-value pairs, together with its associated description.

First, we aim at labeling each word of a description, depending on the presence of a correspondence with its associated table. We call such labels alignment labels. We drive the word-level labeling procedure with two intuitive constraints: (1) important words (nouns, adjectives and numbers) should be labeled depending on their alignment with the table, and (2) words from the same statement (i.e. a text span expressing one idea) should have the same label.

With this in mind, the alignment label for the t-th token y_t is a binary label l_t := 1{s_t > τ}, where s_t refers to the alignment score between y_t and the table, and τ is set experimentally (see Sec. 5.3). The alignment score s_t acts as a normalized measure of correspondence between a token y_t and the table e:

    s_t := norm(max_{x ∈ e} align(y_t, x), y)    (1)

where the function align estimates the alignment between token y_t and a key-value pair x from the input table e, and norm is a normalization function based on the dependency structure of the description y. Fig. 2 illustrates our approach: under each word we show its word alignment score, and words are colored in red if this score is lower than τ, denoting an alignment label equal to 0. Below, we describe these functions (Appendix A contains reproducibility details).
Fig. 3 Our proposed decoder with three branches, associated to content (in blue, left), hallucination (in red, middle) and fluency (in yellow, right). Semi-transparent branches are assigned the weight 0.
Co-occurrence-based alignment function (align(·, x)). This function assigns to important words a score in the interval [0, 1], proportional to their co-occurrence count (a proxy for alignment) with the key-value pair from the input table. If the word y_t appears in the key-value pair x := (k, v), align(y_t, x) outputs 1; otherwise, the output is obtained by scaling the number of co-occurrences co_{y_t,x} between y_t and x across the dataset:

    align(y_t, x) := 1                     if y_t ∈ x
                     a · (co_{y_t,x} − m)  if m ≤ co_{y_t,x} ≤ M      (2)
                     0                     if 0 ≤ co_{y_t,x} ≤ m

where M is the maximum number of word co-occurrences in the dataset vocabulary and the row x, m is a threshold value, and a := 1/(M − m).

Score normalization (norm(·, y)). According to the assumption that words inside the same statement should have the same score, we first split the sentence y into statements y_{t_i : t_{i+1}−1}, via dependency parsing and its rule-based conversion to constituency trees [16, 63, 18, 2]. Given a word y_t associated with the score s_t and belonging to statement y_{t_i : t_{i+1}−1}, its normalized score corresponds to the average score of all words in this statement:

    norm(s_t, y) = (1 / (t_{i+1} − t_i)) · Σ_{j=t_i}^{t_{i+1}−1} s_j    (3)
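For concreteness, the following minimal Python sketch illustrates the labeling pipeline of Eqs. (1)-(3). It is an illustrative reconstruction, not the authors' released code: the co-occurrence statistics and the statement spans are assumed to be precomputed (the latter via the dependency-based segmentation of Appendix A), and the POS-based restriction to important words is omitted for brevity.

```python
# Illustrative sketch of the word-level labeling procedure (Eqs. 1-3).
# Assumed inputs (not taken from the paper's code): `table` is a list of
# (key, value) string pairs, `cooc` maps (token, pair) to a dataset-wide
# co-occurrence count, and `statements` lists (start, end) token spans.

def align(token, pair, cooc, m, M):
    """Eq. (2): alignment between one token and one (key, value) pair."""
    key, value = pair
    if token in key.split() or token in value.split():
        return 1.0                         # the token appears in the pair
    c = cooc.get((token, pair), 0)         # co-occurrence count co_{y_t,x}
    if c <= m:                             # below the threshold m
        return 0.0
    return (c - m) / (M - m)               # a * (c - m), with a = 1/(M - m)

def alignment_labels(tokens, table, statements, cooc, m, M, tau):
    """Eqs. (1) and (3): score tokens, average per statement, threshold.
    `tau` is the experimentally tuned threshold of Sec. 5.3."""
    scores = [max(align(tok, x, cooc, m, M) for x in table) for tok in tokens]
    for start, end in statements:          # spans y_{t_i : t_{i+1}-1}
        avg = sum(scores[start:end]) / (end - start)
        scores[start:end] = [avg] * (end - start)
    return [1 if s > tau else 0 for s in scores]
```

Since M is the dataset-wide maximum count, the scaled branch never exceeds 1, so the scores remain a normalized measure of correspondence.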
4 Multi-Branch Decoder

The proposed Multi-Branch Decoder (MBD) architecture aims at separating targeted co-dependent factors during generation. Each factor is modeled via a single RNN, or branch, whose hidden state can be weighted according to its desired importance. Figure 3 exemplifies how this model works: weights change at training time depending on the word currently being decoded, inducing the desired specialization of each branch. In Section 4.1, the standard DTG encoder-decoder architecture is presented; Section 4.2 shows how it can be extended to MBD, together with its peculiarities and the underlying objectives and assumptions.

4.1 Standard Encoder-Decoder Architecture

Standard neural DTG systems follow a two-module design: (1) the encoder encodes the input table into hidden states h_j (elements of the input table are first embedded into T_e N-dimensional vectors, and then fed sequentially to the RNN [60]), and (2) the decoder generates a textual description y using an RNN augmented with attention and copy mechanisms [51]. Words are generated in an auto-regressive way. The decoder's RNN updates its hidden state d_t as:

    d_t := RNN(d_{t−1}, [y_{t−1}, c_t])    (4)

where y_{t−1} is the previous word and c_t is the context vector obtained through the attention mechanism. Finally, a word is drawn from the distribution computed via a copy mechanism [51].

4.2 Controlling Hallucinations via a Multi-Branch Model

Our objective is to enrich the decoder in order to be able to tune the content/hallucination ratio during generation, aiming at enabling the generation of hallucination-free text when needed. Our key assumption is that the decoder's generation is conditioned by three co-dependent factors:

– the content factor constrains the generation to realize only the information included in the input;
– the hallucinating factor favors lexically richer and more diverse text, but may lead to hallucinations not grounded in the input;
– the fluency factor conditions the generated sentences toward global syntactic correctness, regardless of relevance.²

Based on this assumption, we propose a multi-branch encoder-decoder network, whose branches are constrained on the above factors at the word level, as illustrated in Fig. 3. Our network has a single encoder and F = 3 distinct decoding RNNs, noted RNN^f, one for each factor. During each decoding step, the previously decoded word y_{t−1} is fed to all RNNs, and a final decoder state d_t is computed using a weighted sum of all the corresponding hidden states:

    d_t^f := RNN^f(d_{t−1}^f, [y_{t−1}, c_t])    (5)

    d_t := Σ_{f=1}^{F} ω_t^f d_t^f    (6)

where d_t^f and ω_t^f are respectively the hidden state and the weight of the f-th RNN at time t. Weights are used to constrain the decoder branches to the desired control factors (ω_t^1, ω_t^2, ω_t^3 for the content, hallucination and fluency factors respectively) and sum to one.

During training, their values are dynamically set depending on the alignment label l_t ∈ {0, 1} of the target token y_t (see Sec. 5.3). While a number of mappings can be used to set the weights given the alignment label, early experiments have shown that better results were achieved when using a binary switch for each factor, i.e. activating/deactivating each branch, as shown in Fig. 3 (note that fluency should not depend on content, and therefore its associated branch is always active).

During inference, the weights of the decoder's branches are set manually by the user, according to the desired trade-off between information reliability, sentence diversity and global fluency. Text generation is then controllable and consistent with the control factors.

² Wiseman et al. [61] showed that the explicit modeling of a fluency latent factor improves performance.
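The decoding step of Eqs. (5)-(6) can be sketched in PyTorch as follows. This is a minimal reconstruction under our own naming, with the attention context c_t assumed to be computed elsewhere and the copy mechanism omitted; the actual implementation is the OpenNMT-based model described in Appendix B.

```python
import torch
import torch.nn as nn

class MultiBranchDecoderCell(nn.Module):
    """Minimal sketch of the multi-branch decoding step (Eqs. 5-6)."""

    def __init__(self, emb_size, ctx_size, hidden_size, n_branches=3):
        super().__init__()
        # One RNN branch per control factor: content, hallucination, fluency.
        self.branches = nn.ModuleList(
            [nn.LSTMCell(emb_size + ctx_size, hidden_size)
             for _ in range(n_branches)]
        )

    def forward(self, y_prev, c_t, states, weights):
        # y_prev: (B, emb_size) embedding of the previously decoded word.
        # c_t: (B, ctx_size) attention context; states: per-branch (h, c).
        # weights: (B, n_branches) control weights, each row summing to 1.
        inp = torch.cat([y_prev, c_t], dim=-1)
        new_states, hiddens = [], []
        for f, cell in enumerate(self.branches):
            h_f, c_f = cell(inp, states[f])              # Eq. (5): d_t^f
            new_states.append((h_f, c_f))
            hiddens.append(h_f)
        stacked = torch.stack(hiddens, dim=1)            # (B, F, H)
        d_t = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # Eq. (6)
        return d_t, new_states
```

The combined state d_t then feeds the attention/copy output layer exactly as in a single-branch decoder, which is why the mechanism can be dropped into existing architectures; the weights come from the training-time switch of Sec. 5.3, or from a user-chosen constant triple at inference.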
5 Experimental Setup

5.1 Datasets

WikiBio [26] contains 728,321 tables, automatically paired with the first sentence of the corresponding English Wikipedia article. Reference texts' average length is 26 words, and tables have on average 12 key-value pairs. We use the original data partition: 80% for the train set, and 10% each for the validation and test sets. This dataset has been automatically built from the Internet; concerning divergences, 62% of the references mention extra information not grounded in the table [6].

ToTTo [39] contains 120,761 training examples, and 7,700 validation and test examples. Inputs are Wikipedia tables, paired with candidate sentences from the corresponding article, extracted via simple similarity heuristics. ToTTo pairs every table with both the source sentence extracted from Wikipedia, which exhibits divergences w.r.t. the table, and a version of the same sentence manually cleaned of hallucinations. For cleaning, annotators highlighted the table cells corresponding to the information present in the source sentence, and then removed any hallucinated information from the sentence.
Reference texts' average length is 17.4 words, and 3.55 table cells are highlighted on average. We train models using the source sentences of the training set, with early stopping on the overlap portion of the development set, and evaluate performance on the cleaned sentences of the non-overlap dev set. Note that the official ToTTo task evaluates models trained on the complete tables, while we only use the highlighted cells.

5.2 Baselines

We assess the accuracy and relevance of our alignment labels against the labels proposed by Perez-Beltrachini and Lapata [41], which is, to the best of our knowledge, the only work proposing such fine-grained alignment labels.
To evaluate our Multi-Branch Decoder (MBD), we consider five baselines:

– stnd [51], an LSTM-based encoder-decoder model with attention and copy mechanisms. This is the standard sequence-to-sequence recurrent architecture.
– stnd_filtered, the previous model trained on a filtered version of the training set: tokens deemed hallucinated according to their hallucination scores are removed from target sentences.
– hsmm [61], an encoder-decoder model with a multi-branch decoder. The branches are not constrained by explicit control factors. This is used as a baseline to show that the multi-branch architecture by itself does not guarantee the absence of hallucinations.
– hier [29], a hierarchical sequence-to-sequence model, with a coarse-to-fine attention mechanism to better fit the attribute-value structure of the tables. This model is trained with three auxiliary tasks to capture more accurate semantic representations of the tables.
– hal_WO [11], a stnd-like model trained by augmenting each source table with an additional (hallucination ratio, value) attribute.

Note that the last three baselines were already originally implemented and evaluated on the WikiBio benchmark. We ran our own implementation of hal_WO on ToTTo.

5.3 Implementation Details

During training of our multi-branch decoder, the fluency branch is always active (ω_t^3 = 0.5), while the content and hallucination branches are alternately activated, depending on the alignment label l_t: ω_t^1 = 0.5 and ω_t^2 = 0 (hallucination factor) when l_t = 1, and conversely. The threshold τ used to obtain l_t is set experimentally on the validation set.
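As a sketch, the binary switch described above reduces to the following mapping (the weight values are those stated in this section; the function name is ours):

```python
def branch_weights(l_t):
    """Training-time (content, hallucination, fluency) weights per token."""
    if l_t == 1:                 # token aligned with the input table
        return (0.5, 0.0, 0.5)   # content + fluency branches active
    return (0.0, 0.5, 0.5)       # hallucinated token: hallucination + fluency
```

At inference, the triple is instead fixed by the user for the whole generation, e.g. the [0.4, 0.1, 0.5] setting used in the experiments below.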
5.4 Metrics

We rely on the following automatic metrics:

– BLEU [38] is a length-penalized precision score over n-grams, n ∈ ⟦1, 4⟧, optionally improved with a smoothing technique [3]. Despite being the standard choice, recent findings show that it correlates poorly with human evaluation, especially at the sentence level [36, 46], and that it is a proxy for sentence grammar and fluency rather than semantics [6].
– PARENT [6] computes smoothed n-gram precision and recall over both the reference and the input table. It is explicitly designed for DTG tasks, and its F-measure shows “the highest correlation with humans across a range of settings with divergent references in WikiBio” [6].
– The hallucination rate, i.e. the percentage of tokens labeled as hallucinations (Sec. 3).
– The average generated sentence length, in number of words.
– The classic Flesch readability index [12], which is based on words per sentence and syllables per word, and is still used as a standard benchmark [24, 54, 55, 56].
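Two of these metrics are straightforward to compute from material already introduced; a minimal sketch, assuming a labeler implementing Section 3 and precomputed word, sentence and syllable counts for the Flesch reading-ease formula:

```python
def hallucination_rate(corpus, labeler):
    """Percentage of tokens whose alignment label (Sec. 3) is 0.
    `labeler(tokens, table)` is assumed to return binary labels."""
    labels = [l for tokens, table in corpus for l in labeler(tokens, table)]
    return 100.0 * labels.count(0) / max(len(labels), 1)

def flesch(n_words, n_sentences, n_syllables):
    """Flesch reading-ease score [12]: higher means simpler text."""
    return 206.835 - 1.015 * (n_words / n_sentences) \
                   - 84.6 * (n_syllables / n_words)
```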
Finally, we perform qualitative evaluations of the results obtained on WikiBio and ToTTo, following the best practices outlined by van der Lee et al. [27]. Our human annotators come from several countries across Europe, are between 20 and 55 years old, and are proficient in English. They were assigned two different tasks: (i) hallucination labeling, i.e. the selection of sentence pieces which include incorrect information, and (ii) sentence analysis, i.e. evaluating different realizations of the same table according to their fluency, factualness and coverage. Scores are presented on a 3-level Likert scale for Fluency (Fluent, Mostly fluent, or Not fluent) and Factualness (likewise), while Coverage is the number of cells from the table that have been realized in the description. To avoid bias, annotators are shown one randomly selected table at a time, together with its corresponding descriptions, both from the dataset and from the models being evaluated. Sentences are presented each time in a different order. Following Tian et al. [58], we first tasked three expert annotators with annotating a pilot batch of 50 sentences. Once confirmed that Inter-Annotator Agreement was approximately 75% (a similar finding to Tian et al. [58]), we asked 16 annotators to annotate a bigger sample of 300 instances (where each instance consists of one table and four associated outputs), as in Liu et al. [29].
Table 1 Performances of hallucination scores on the WikiBio test set, w.r.t. human-designated labels (upper table), and of MBD trained with different labeling procedures (lower table). Our model always significantly outperforms PB&L (T-test).

Labels | Accuracy | Precision | Recall | F-measure
PB&L   | 46.9%    | 21.3%     | 49.2%  | 29.7%
ours   |          | 80.6%     | 59.9%  | 68.7%

Labels | BLEU   | PARENT Precision | PARENT Recall | PARENT F-measure
PB&L   | 32.15% | 76.91%           | 39.28%        | 48.75%
ours   | 40.5%  |                  | 45%           |

6 Results

We perform an extensive evaluation of our scoring procedure and multi-branch architecture on the WikiBio dataset: we evaluate the quality of the proposed alignment labels, both intrinsically using human judgment and extrinsically by means of the DTG downstream task, as well as the performance of our model with respect to the baselines. Additionally, we assess the applicability of our framework on the noisier ToTTo benchmark, which represents a harder challenge for today's DTG models.

6.1 Validation of Alignment Labels

To assess the effectiveness of our alignment labels (Sec. 3), we first compare the alignment labels against human judgment, and then explore their impact on a DTG task. As a baseline for comparison we report the performances of
PB&L.

Intrinsic performance.
Tab. 1 (top) compares the labeling performance of our method and PB&L against human judgment. Our scoring procedure significantly improves over PB&L: the latter only achieves 46.9% accuracy and 29.7% F-measure, against 68.7% F-measure for our proposed procedure. Perez-Beltrachini and Lapata [41] report an F-measure of 36%, a discrepancy that can be explained by the difference between the evaluation procedures: PB&L evaluate on 132 sentences, several of which can be tied to the same table, whereas we explicitly chose to evaluate on 300 sentences, all from different tables, in order to minimize correlation.

We remark that, beyond F-measure, the precision of PB&L's scoring procedure is at 21.3%, compared to 80.6% for ours, and recall stands at 49.2% against 59.9%. Their procedure moreover leads the network to incoherently label words, without apparent justification; see Figure 4 for two examples of this phenomenon, and Appendix D for further comparisons. In contrast, our method is able to detect hallucinated statements inside a sentence, without incorrectly labeling the whole sentence as hallucinated. A screenshot of our annotation platform is available in Appendix C.

(a)
key           | value
name          | patricia flores fuentes
birth date    | 25 july 1977
birth place   | state of mexico , mexico
occupation    | politician
nationality   | mexican
article title | patricia flores fuentes

Ref.: patricia flores fuentes -lrb- born 25 july 1977 -rrb- is a mexican politician affiliated to the national action party .

(b)
key           | value
name          | ryan moore
spouse        | nichole olson -lrb- m. 2011 -rrb-
children      | tucker
college       | unlv
yearpro       | 2005
tour          | pga tour
prowins       | 4
pgawins       | 4
masters       | t12 2015
usopen        | t10 2009
open          | t10 2009
pga           | t9 2006
article title | ryan moore -lrb- golfer -rrb-

Ref.: ryan david moore -lrb- born december 5 , 1982 -rrb- is an american professional golfer , currently playing on the pga tour .

(The word-level labelings produced by PB&L and by our procedure are shown as colored highlights in the original figure.)
Fig. 4
WikiBio instances' hallucinated words according either to our scoring procedure or to the method proposed by Perez-Beltrachini and Lapata [41]. PB&L labels words incoherently (a), and sometimes labels the whole reference text (b). In comparison, our approach leads to a fluent breakdown of the sentences into hallucinated and factual statements.
Impact on a DTG downstream task.
Additionally, we assess the difference between the two scoring procedures through their impact on the WikiBio DTG task. Specifically, Tab. 1 (bottom) shows the results of training MBD using either PB&L's labels or ours. We observe significant improvements, especially in BLEU and PARENT-recall (40.5% vs 32.2%, and 45% vs 39.3%, respectively).

6.2 Data-to-Text Generation on WikiBio

Comparison with SOTA systems.
Tab. 2 shows the performances of our model and of all baselines according to the metrics of Sec. 5.4. Two qualitative examples are presented in Figure 5, and more are available in Appendix D.
Table 2 Comparison results on WikiBio. ↑ (resp. ↓) means higher (resp. lower) is better.

Model         | BLEU ↑ | PARENT Precision ↑ | PARENT Recall ↑ | PARENT F-measure ↑ | Halluc. rate ↓ | Mean sent. length | Flesch ↓
Gold          | -      | -                  | -               | -                  | 23.82%         | 19.20             |
stnd          |        |                    |                 |                    | 4.20%          |                   |
stnd_filtered |        |                    |                 |                    |                |                   |
hsmm          |        |                    |                 |                    |                |                   |
hier          |        |                    |                 |                    |                |                   |
hal_WO        |        |                    |                 |                    |                |                   |
MBD           |        |                    |                 |                    | 1.43%          |                   |

First of all, reducing hallucinations is reached with success, as highlighted by the hallucination rate (1.43% vs. 4.20% for a standard encoder-decoder, and over 10% for hier). The only baseline to reach a comparable reduction, stnd_filtered, achieves such a result at a high cost. As can be seen in Figure 5, where its output is factual but cut short, its sentences are the shortest and the most naive in terms of the Flesch readability index, which is also reflected by a lower BLEU score. The high PARENT precision, mostly due to the shortness of the outputs, is counterbalanced by a low recall: the F-measure indicates the overall lack of competitiveness of this trade-off. This shows that the naive approach of simply filtering training instances is not an appropriate solution for hallucination reduction. This echoes [11], who trained a vanilla network on the cleanest 20% of the data and found that predictions are more precise than those of a model trained on 100%, but that PARENT-recall and BLEU scores are low.

At the other extreme, the best model in terms of BLEU, hier, falls short regarding precision, suggesting that the generated text is often not matched in the input table, an issue also reflected by the highest hallucination rate of all models (over 10%). The behavior of hsmm, and especially its relatively weak performance both in terms of BLEU and PARENT, highlights the insufficiency of the multi-branch architecture by itself. This reinforces the need for additional supervision regarding hallucinations, satisfied by our labeling procedure.

Finally, in the comparison with hal_WO, we can see that while it achieves one of the highest performances in terms of precision (around 79%), instance-level control proves too coarse for the generation procedure, which requires finer-grained annotations, as shown by our model's recall (around 46%).

Weight impact on decoding.
(a)
key          | value
name         | zack lee
birth name   | zack lee jowono
nationality  | indonesian
occupation   | actor , boxer , model
birth date   | 15 august 1984
birth place  | liverpool , merseyside , england , uk
years active | 2003 -- present
parents      | hendra and ayu jowono
spouse       | nafa urbach ( 2007 -- present )
article title| zack lee

Gold: zack lee ( born 15 august 1984 ) is an indonesian actor , model and boxer of british descent .
stnd: zack lee jowono ( born 15 august 1984 ) is an indonesian actor and model .
stnd_filtered: zack lee ( born zack lee jowono ; 15 august 1984 ) is an indonesian actor .
hsmm: zack lee jowono ( born 15 august 1984 ) is an indonesian actor who has appeared in tamil films .
hier: zack lee jowono ( born 15 august 1984 ) , better known by his stage name zack lee , is an indonesian actor , model and model .
MBD[.4, .1, .5]: zack lee ( born zack lee jowono ; 15 august 1984 ) is an indonesian actor , boxer and model .

(b)
key          | value
name         | wayne r. dynes
birth date   | 23 august 1934
occupation   | professor , historian , and encyclopedist
article title| wayne r. dynes

Gold: wayne r. dynes ( born august 23 , 1934 ) is an american art historian , encyclopedist , and bibliographer .
stnd: wayne r. dynes ( born august 23 , 1934 ) is an american historian and encyclopedist .
stnd_filtered: wayne r. dynes is a professor .
hsmm: wayne r. dynes ( born august 23 , 1934 ) is an american historian , historian and encyclopedist .
hier: wayne r. dynes ( born august 23 , 1934 ) is an american professor of history at the university of texas at austin .
MBD[.4, .1, .5]: wayne r. dynes ( born august 23 , 1934 ) is an american professor , historian , and encyclopedist .

Fig. 5 Qualitative examples of our model and baselines on the WikiBio test set. Note that: (1) gold references may contain divergences; (2) stnd and hsmm seem to perform well superficially, but often hallucinate; (3) stnd_filtered does not hallucinate but struggles with fluency; (4) hier overgenerates “human-sounding” statements that lack factualness; (5) MBD sticks to the facts contained in the table, in concise and fluent sentences.

As a CTG system, we can guide our network at inference to generate sentences following desired attributes. We explore the impact of different weight combinations in Tab. 3. In particular, we can see that changing
weights in favor of the hallucination factor (top five lines) leads to decreases in both precision and recall (from 80.37% to 57.88%, and from 44.96% to 4.82%, respectively). We also observe that strongly relying on the hallucinating branch dramatically impacts performances ([0, 0.5, 0.5] obtains near-0 BLEU and F-measure), as it is never fed with complete, coherent sentences during training. However, some performance can still be restored via the fluency branch: [0, 0.1, 0.9] performs at 15.51% BLEU and 36.88% F-measure.

Interestingly, we note that relaxing the strict constraint on the content factor in favor of the hallucination factor, i.e. choosing [0.4, 0.1, 0.5] instead of [0.5, 0, 0.5], barely affects the F-measure (55.16% vs 55.29%). This highlights that strictly constraining on content yields sensibly more factual outputs (79% vs 80.37% precision), at the cost of constraining the model's generation creativity (46.40% vs 44.96% recall). The [0.4, 0.1, 0.5] variant has more “freedom of speech” and sticks more faithfully to domain lingo (recall and BLEU) without compromising too much in terms of content.
Table 3 Performances of MBD on the WikiBio validation set, with various weight settings. Weights' order is (content, hallucination, fluency).

Weights         | BLEU ↑ | PARENT Precision | Recall | F-measure
[0.5, 0, 0.5]   |        | 80.37%           | 44.96% | 55.29%
[0.4, 0.1, 0.5] |        | 79%              | 46.40% | 55.16%
[0, 0.5, 0.5]   | ≈0     |                  |        | ≈0
[0, 0.1, 0.9]   | 15.51% |                  |        | 36.88%
Table 4 Results of the human evaluation on WikiBio.

Model         | Fluency | Factualness | Coverage
Gold          | 98.7%   | 32.0%       |
stnd_filtered |         |             |
hier          |         |             |
MBD           |         |             |

6.3 Human Evaluation

To measure subtleties which are not captured by automatic metrics, we report in Tab. 4 human ratings of our model, two baselines, and the gold references.³ The selected baselines show interesting behaviors: hier obtains the best BLEU score but a poor precision, and stnd_filtered gets the best precision but poor BLEU, length and Flesch index.

First, coherently with [6], we found that around two thirds of gold references contain divergences from their associated tables. Such data also confirm our analysis of the stnd_filtered baseline: its unquestionable capability of avoiding hallucinations dramatically impacts both fluency and coverage, leading to less desirable outputs overall, despite the high PARENT-precision score.

The comparison between hier and MBD shows that both approaches lead to similar coverage, with
MBD obtaining significantly better performances in terms of factualness. We also highlight that MBD is evaluated as being the most fluent model, even better than the reference (which can be explained by the imperfect pre-processing done by Lebret et al. [26]).

6.4 ToTTo: A Considerably Noisy Setting

The ToTTo dataset includes both noisy and cleaned reference texts. Furthermore, the cells realized by the output are highlighted. In order to recreate a realistic setting, we train our model on those cells only, using the noisy references.⁴

³ Fluency reports the sum of “fluent” and “mostly fluent”, as “mostly fluent” often comes from misplaced punctuation and does not really impact readability. However, Factualness reports only the count of “factual”, as “mostly factual” sentences contain hallucinations and cannot be considered “factual”.
⁴ We do not use the complete table, because content selection is arbitrary in this task.
Table 5 Comparison results on ToTTo. ↑ (resp. ↓) means higher (resp. lower) is better. In the human evaluation, Fluency reports the rate of “Fluent” plus “Mostly Fluent” judgments, with the rate of “Fluent” alone in parentheses; likewise for Factualness.

Model         | BLEU ↑ | PARENT Precision | Recall | F-measure | Fluency ↑    | Factualness ↑ | Coverage
Gold          | -      | -                | -      | -         | 97.1% (97.1) | 91.2% (79.4)  | 3.618
stnd          |        |                  |        |           |              |               |
stnd_filtered |        |                  |        |           |              |               |
hal_WO        | 17%    | 77%              |        |           | 61.7%        |               |
MBD           |        |                  |        |           | 91%          | 55%           |

(a)
page title              | Huge (TV series)
section title           | Episodes
Original air date       | June 28 2010
U.S. viewers (millions) | 2.53

Gold: The TV series , Huge , premiered on June 28 , 2010 with 2.53 million viewers .
stnd: On June 28 , 2010 , it was watched by 2.53 million viewers .
stnd_filtered: was watched by 2.53 on June 28 , 2010 .
hal_WO: June 28 , 2010 : Huge million viewers .
MBD[.4, .1, .5]: Huge 's first episode , aired on June 28 , 2010 , was watched by 2.53 million .

(b)
page title    | LM317
section title | Specification
Parameter     | Output voltage range
Value         | 1.25 - 37

Gold: LM317 produces a voltage of 1.25 V .
stnd: The Output is a Output range of 1.25 – 37 .
stnd_filtered: range from 1.25 to 37 .
hal_WO: Output voltage range 1.25 – 37 – 37 .
MBD[.4, .1, .5]: The Output 's range is approximately 1.25 .

Fig. 6 Qualitative examples of MBD and hal_WO on ToTTo. hal_WO's poor generation quality is not detected by discrete metrics. In contrast, MBD generates fluent and naively factual sentences. Note that stnd and stnd_filtered show the same behavior as on WikiBio: the former produces fluent but nonsensical text; the latter generates very unfluent, but factual, text.
We report the performances of hal_WO and MBD with regard to automatic metrics in Table 5. As one can see, these metrics appear incoherent and contradictory (e.g. hal_WO obtains 17% BLEU, but 77% PARENT precision). Manual inspection reveals that models struggle to guess the exact cleaned sentence (hence the low BLEU and PARENT recall), but still manage to rely on the table (PARENT precision). Human evaluation is reported in the same table.

First, note that no model achieves a credible accuracy for a real-life setting. The best performance is obtained by MBD, with only 55% of hallucination-free texts. However, note that hal_WO obtains a 61.7% Fluency score, contrasting with MBD's 91%, which is close to its performance on WikiBio. We show examples of these behaviors in Figure 6: generations from hal_WO are often nonsensical and consist of incorrectly ordered sequences of words extracted from the table, and of repetitions, explaining the low BLEU and high PARENT-precision; generations from MBD, while far from perfect, are still mostly syntactically correct.

Discrepancies in the results of hal_WO could be explained by the low number of instances with a global hallucination score of 0 in the training set, leaving too few examples for the model to generalize correctly. In contrast, our approach is able to leverage the most of each training instance, leading to better performances overall. Taken altogether, these statements highlight the difficulty of current models (ours and baselines alike) to learn on very noisy datasets with a big diversity in training instances. For instance, in Figure 6b, our model treats “Output” as an entity, and does not understand that a range should be two numbers.

7 Conclusion

We proposed a Multi-Branch Decoder, able to leverage word-level alignment labels in order to produce factual and coherent outputs. Our proposed labeling procedure is more accurate than previous work, and outputs from our model are estimated, by automatic metrics and human judgment alike, to be more fluent, factual, and relevant. We obtain state-of-the-art performances on WikiBio for PARENT F-measure, and show that our approach is promising in the context of a noisier setting.

We designed our alignment procedure to be general and easily reproducible on any DTG dataset. One strength of our approach is that co-occurrences and dependency parsing can intuitively be used to extract more information from the tables than a naive word-matching procedure. However, in the context of tables mainly containing numbers (e.g. RotoWire), the effectiveness of the co-occurrence analysis is not guaranteed. Future work will be to improve upon the co-occurrence analysis to generalize to tables which contain less semantic input. For instance, the labeling procedure of Perez-Beltrachini and Lapata [41] might be revised so that adverse instances are not selected randomly, which we hypothesize would result in more relevant labels.

Finally, experiments on ToTTo outline the narrow exposure to language of current models when used on very noisy datasets. Our model has shown interesting properties through the human evaluation, but is still perfectible. Recently introduced large pretrained language models, which have seen significantly more varied texts, may attenuate this problem. In this direction, adapting the work of [4, 20] to our model could bring improvements to the results presented in this paper.
Declarations

• Funding: We would like to thank the H2020 project AI4EU (825619) and the ANR JCJC SESAMS project (ANR-18-CE23-0001) for supporting this work. This research has been partially carried out in the context of the Visiting Professor Program of the Italian Istituto Nazionale di Alta Matematica (INdAM).
• Conflict of interest: No conflict of interest.
• Code availability: Code is available at https://github.com/KaijuML/dtt-multi-branch
• Other items are not applicable (Availability of data and material, Additional declarations for articles in life science journals that report the results of studies involving humans and/or animals, Ethics approval, Consent to participate, Consent for publication).
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
2. Borensztajn, G., Zuidema, W.H., Bod, R.: Children's grammars grow more abstract with age - evidence from an automatic procedure for identifying the productive units of language. topiCS (2009)
3. Chen, B., Cherry, C.: A systematic comparison of smoothing techniques for sentence-level BLEU. In: WMT@ACL (2014)
4. Chen, Z., Eavani, H., Chen, W., Liu, Y., Wang, W.Y.: Few-shot NLG with pre-trained language model. In: ACL (2020)
5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
6. Dhingra, B., Faruqui, M., Parikh, A., Chang, M.W., Das, D., Cohen, W.: Handling divergent reference texts when evaluating table-to-text generation. In: ACL (2019)
7. Dong, L., Huang, S., Wei, F., Lapata, M., Zhou, M., Xu, K.: Learning to generate product reviews from attributes. In: EACL (2017)
8. Dusek, O., Howcroft, D.M., Rieser, V.: Semantic noise matters for neural natural language generation. In: INLG (2019)
9. Ferreira, T.C., van der Lee, C., van Miltenburg, E., Krahmer, E.: Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. In: EMNLP-IJCNLP (2019)
10. Ficler, J., Goldberg, Y.: Controlling linguistic style aspects in neural language generation. In: Workshop on Stylistic Variation @ ACL (2017)
11. Filippova, K.: Controlled hallucinations: Learning to generate faithfully from noisy data. In: Findings of EMNLP (2020)
12. Flesch, R.: The Art of Readable Writing (1962)
13. Gardent, C., Shimorina, A., Narayan, S., Perez-Beltrachini, L.: Creating training corpora for NLG micro-planners. In: ACL (2017)
14. Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J. Artif. Intell. Res. (2018)
15. Gehrmann, S., Dai, F., Elder, H., Rush, A.: End-to-end content and plan selection for data-to-text generation. In: INLG (2018)
16. Han, C., Lavoie, B., Palmer, M.S., Rambow, O., Kittredge, R.I., Korelsky, T., Kim, N., Kim, M.: Handling structural divergences and recovering dropped arguments in a Korean/English machine translation system. In: AMTA (2000)
17. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Toward controlled generation of text. In: ICML (2017)
18. Hwa, R., Resnik, P., Weinberg, A., Cabezas, C.I., Kolak, O.: Bootstrapping parsers via syntactic projection across parallel texts. Nat. Lang. Eng. (2005)
19. Juraska, J., Karagiannis, P., Bowden, K.K., Walker, M.A.: A deep ensemble model with slot alignment for sequence-to-sequence natural language generation. In: NAACL-HLT (2018)
20. Kale, M., Rastogi, A.: Text-to-text pre-training for data-to-text tasks. In: INLG (2020)
21. Kikuchi, Y., Neubig, G., Sasano, R., Takamura, H., Okumura, M.: Controlling output length in neural encoder-decoders. In: EMNLP (2016)
22. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
23. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: Open-source toolkit for neural machine translation. In: Proc. ACL (2017)
24. Kosmajac, D., Keselj, V.: Twitter user profiling: Bot and gender identification. In: CLEF (2019)
25. Kryscinski, W., McCann, B., Xiong, C., Socher, R.: Evaluating the factual consistency of abstractive text summarization. arXiv abs/1910.12840 (2019)
26. Lebret, R., Grangier, D., Auli, M.: Neural text generation from structured data with application to the biography domain. In: EMNLP (2016)
27. van der Lee, C., Gatt, A., van Miltenburg, E., Wubben, S., Krahmer, E.: Best practices for the human evaluation of automatically generated text. In: INLG (2019)
28. Li, J., Galley, M., Brockett, C., Spithourakis, G.P., Gao, J., Dolan, B.: A persona-based neural conversation model. In: ACL (2016)
29. Liu, T., Luo, F., Xia, Q., Ma, S., Chang, B., Sui, Z.: Hierarchical encoder with auxiliary supervision for neural table-to-text generation: Learning better representation for tables. In: AAAI (2019)
30. Liu, T., Luo, F., Yang, P., Wu, W., Chang, B., Sui, Z.: Towards comprehensive description generation from factual attribute-value tables. In: ACL (2019)
31. Liu, T., Wang, K., Sha, L., Chang, B., Sui, Z.: Table-to-text generation by structure-aware seq2seq learning. In: AAAI (2018)
32. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP (2015)
33. Mei, H., Bansal, M., Walter, M.R.: What to talk about and how? selective generation using lstms with coarse-to-fine alignment. In: NAACL-HLT (2016)
34. Narayan, S., Gardent, C.: Deep Learning Approaches to Text Production (2020)
35. Nie, F., Yao, J.G., Wang, J., Pan, R., Lin, C.Y.: A simple recipe towards reducing hallucination in neural surface realisation. In: ACL (2019)
36. Novikova, J., Dusek, O., Curry, A.C., Rieser, V.: Why we need new evaluation metrics for NLG. In: EMNLP (2017)
37. Novikova, J., Dusek, O., Rieser, V.: The E2E dataset: New challenges for end-to-end generation. In: SIGdial Meeting on Discourse and Dialogue (2017)
38. Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a method for automatic evaluation of machine translation. In: ACL (2002)
39. Parikh, A.P., Wang, X., Gehrmann, S., Faruqui, M., Dhingra, B., Yang, D., Das, D.: ToTTo: A controlled table-to-text generation dataset. In: EMNLP (2020)
40. Perez-Beltrachini, L., Gardent, C.: Analysing data-to-text generation benchmarks. In: INLG (2017)
41. Perez-Beltrachini, L., Lapata, M.: Bootstrapping generators from noisy data. In: NAACL-HLT (2018)
42. Puduppully, R., Dong, L., Lapata, M.: Data-to-text generation with content selection and planning. In: AAAI (2019)
43. Puduppully, R., Dong, L., Lapata, M.: Data-to-text generation with entity modeling. In: ACL (2019)
44. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: A Python natural language processing toolkit for many human languages. In: System Demonstrations @ ACL (2020)
45. Rebuffel, C., Soulier, L., Scoutheeten, G., Gallinari, P.: PARENTing via model-agnostic reinforcement learning to correct pathological behaviors in data-to-text generation. In: INLG (2020)
46. Reiter, E.: A structured review of the validity of BLEU. Computational Linguistics (2018)
47. Reiter, E., Belz, A.: An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics (2009)
48. Reiter, E., Dale, R.: Building applied natural language generation systems. Nat. Lang. Eng. (1997)
49. Roberti, M., Bonetta, G., Cancelliere, R., Gallinari, P.: Copy mechanism and tailored training for character-based data-to-text generation. In: ECML-PKDD (2019)
50. Sanguinetti, M., Bosco, C.: ParTUT: The Turin University parallel treebank. In: Basili, R., Bosco, C., Delmonte, R., Moschitti, A., Simi, M. (eds.) PARLI (2015)
51. See, A., Liu, P.J., Manning, C.D.: Get to the point: Summarization with pointer-generator networks. In: ACL (2017)
52. Sennrich, R., Haddow, B., Birch, A.: Controlling politeness in neural machine translation via side constraints. In: NAACL-HLT (2016)
53. Shen, X., Chang, E., Su, H., Zhou, J., Klakow, D.: Neural data-to-text generation via jointly learning the segmentation and correspondence. In: ACL (2020)
54. Smeuninx, N., Clerck, B.D., Aerts, W.: Measuring the readability of sustainability reports: A corpus-based analysis through standard formulae and nlp. International Journal of Business Communication (2020)
55. Stajner, S., Hulpus, I.: When shallow is good enough: Automatic assessment of conceptual text complexity using shallow semantic features. In: LREC (2020)
56. Stajner, S., Nisioi, S., Hulpus, I.: CoCo: A tool for automatically assessing conceptual complexity of texts. In: LREC (2020)
57. Thomson, C., Zhao, Z., Sripada, S.: Studying the impact of filling information gaps on the output quality of neural data-to-text. In: INLG (2020)
58. Tian, R., Narayan, S., Sellam, T., Parikh, A.P.: Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv abs/1910.08684 (2019)
59. Wang, H.: Revisiting challenges in data-to-text generation with fact grounding. In: INLG (2019)
60. Wiseman, S., Shieber, S.M., Rush, A.M.: Challenges in data-to-document generation. In: EMNLP (2017)
61. Wiseman, S., Shieber, S.M., Rush, A.M.: Learning neural templates for text generation. In: EMNLP (2018)
62. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv abs/1910.03771 (2019)
63. Xia, F., Palmer, M.: Converting dependency structures to phrase structures. In: HLT (2001)
A Alignment labels reproducibility
We consider as important words, i.e. nouns, adjectives or numbers, those which are Part-of-Speech tagged as NUM, ADJ, NOUN or PROPN.

In order to apply the score normalization function norm(·, y), we separate sentences y into statements y_{t_i : t_{i+1}−1}. To do so, we identify the set of introductory dependency relation labels,⁵ following previous work on rule-based systems for the conversion of dependency relation trees to constituency trees [16, 63, 18, 2]. Our segmentation algorithm considers every leaf token in the dependency tree and seeks its nearest ancestor which is the root of a statement.

Two heuristics complement the score normalization: (i) conjunctions and commas next to hallucinated tokens acquire the hallucination scores of the latter, and (ii) paired parentheses and quotes acquire the minimum hallucination score of the tokens they enclose.

Part-of-Speech tagging has been done using the HuggingFace Transformers library [62] to fine-tune a BERT model [5] on the UD English ParTUT dataset [50]; Stanza [44] has been exploited for dependency parsing.

⁵ acl, advcl, amod, appos, ccomp, conj, csubj, iobj, list, nmod, nsubj, obj, orphan, parataxis, reparandum, vocative, xcomp; every dependency relation is documented on the Universal Dependencies website.
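A sketch of this segmentation, assuming Stanza for dependency parsing; the function name and the head-climbing loop are our reconstruction of the algorithm described above, not the authors' exact code:

```python
import stanza

INTRODUCTORY = {"acl", "advcl", "amod", "appos", "ccomp", "conj", "csubj",
                "iobj", "list", "nmod", "nsubj", "obj", "orphan",
                "parataxis", "reparandum", "vocative", "xcomp"}

def statement_root(sentence, i):
    """Index of the statement root for token i: the nearest ancestor (or the
    token itself) whose incoming dependency relation is introductory."""
    words = sentence.words
    j = i
    while words[j].head != 0 and \
            words[j].deprel.split(":")[0] not in INTRODUCTORY:
        j = words[j].head - 1        # climb to the head (heads are 1-based)
    return j

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("kian emadi ( born 29 july 1992 ) is a british track cyclist .")
sent = doc.sentences[0]
# Tokens sharing the same statement root belong to the same statement span.
print([statement_root(sent, i) for i in range(len(sent.words))])
```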
B Implementation details

Our system is implemented in Python 3.8 and PyTorch 1.4.0. In particular, our multi-branch architecture is developed, trained and tested as an OpenNMT [23] model. Sentence lengths and the Flesch index [12] are computed using the standard style Unix command.

Unlike Perez-Beltrachini and Lapata [41], we did not adapt the original WikiBio dataset in any manner: as we work on the model side, we fully preserve the dataset's noisiness.

We share the vocabulary between input and output, limiting its size to 20,000 tokens. Hyperparameters were tuned using performance on the development set: Tab. B1 reports the performance of our best-performing MBD on the development set. Our encoder consists of a 600-dimensional embedding layer followed by a 2-layer bidirectional LSTM network with hidden states of size 600. We use the general attention mechanism with input feeding [32] and the same copy mechanism as See et al. [51]. Each branch of the multi-branch decoder is likewise a 2-layer LSTM network with hidden states of size 600.

Training is performed using the Adam algorithm [22] with learning rate η = 10⁻³, β₁ = 0.9 and β₂ = 0.999. We used beam search during inference, with a beam size of 10.
MBD has 55M parameters. Its training took less than 10 hours on a single NVIDIA Titan Xp GPU.
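To make these sizes concrete, here is a minimal PyTorch sketch of a multi-branch decoder with the dimensions above. The fixed mixture of branch outputs follows the MBD[.4, .1, .5] weighting used in the tables below; the class name, the per-branch state handling and the omission of attention, input feeding and copy are our simplifying assumptions, not the actual OpenNMT implementation.

# A minimal sketch of a multi-branch decoder with the sizes reported above
# (600-dimensional embeddings, 2-layer LSTM branches with hidden size 600).
# Attention, input feeding and the copy mechanism are deliberately omitted.
import torch
import torch.nn as nn

class MultiBranchDecoder(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=600, hidden_size=600,
                 branch_weights=(0.4, 0.1, 0.5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # One 2-layer LSTM per branch.
        self.branches = nn.ModuleList(
            nn.LSTM(emb_dim, hidden_size, num_layers=2, batch_first=True)
            for _ in branch_weights
        )
        self.register_buffer("weights", torch.tensor(branch_weights))
        self.generator = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_tokens, states=None):
        # prev_tokens: (batch, seq) ids of the already-generated prefix.
        emb = self.embedding(prev_tokens)
        if states is None:
            states = [None] * len(self.branches)
        outputs, new_states = [], []
        for branch, state in zip(self.branches, states):
            out, new_state = branch(emb, state)
            outputs.append(out)
            new_states.append(new_state)
        # Fixed weighted mixture of the branch outputs, e.g. [.4, .1, .5].
        mixed = sum(w * out for w, out in zip(self.weights, outputs))
        return self.generator(mixed), new_states

decoder = MultiBranchDecoder()
logits, _ = decoder(torch.randint(0, 20000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 20000])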
C Annotation interface
The human annotation procedure was carried out via a web application specifically developed for this research. Fig. C1a shows what the hallucination tagging user interface looked like in practice, while Fig. C1b shows a typical sentence analysis page.

Fig. C1  The human annotation tasks, as presented to the annotators: (a) hallucination tagging; (b) sentence analysis.

Model            BLEU    PARENT Precision    PARENT Recall    PARENT F-measure
MBD[.4, .1, .5]
Table B1  The performance of our model on the WikiBio validation set.
D Qualitative examples
Tables D1 and D2 show word-level labeling of WikiBio training examples; WikiBio is available at https://github.com/DavidGrangier/wikipedia-biography-dataset. Underlined, red words are hallucinated according either to our scoring procedure or to the method proposed by Perez-Beltrachini and Lapata [41]. In the subsequent tables, some WikiBio (D3 to D12) and ToTTo (D13 to D15) inputs are shown, coupled with the corresponding sentences, either as found in the dataset or as generated by our models and baselines.

key            value
name           susan blu
birth name     susan maria blupka
birth date     12 july 1948
birth place    st paul , minnesota , u.s.
occupation     actress , director , casting director
yearsactive    1968 -- present
article title  susan blu

Ref.:  susan maria blu -lrb- born july 12 , 1948 -rrb- , sometimes credited as sue blu , is an american voice actress , voice director and casting director in american and canadian cinema and television .
PB&L:  susan maria blu -lrb- born july 12 , 1948 -rrb- , sometimes credited as sue blu , is an american voice actress , voice director and casting director in american and canadian cinema and television .
Ours:  susan maria blu -lrb- born july 12 , 1948 -rrb- , sometimes credited as sue blu , is an american voice actress , voice director and casting director in american and canadian cinema and television .

key            value
name           patricia flores fuentes
birth date     25 july 1977
birth place    state of mexico , mexico
occupation     politician
nationality    mexican
article title  patricia flores fuentes

Ref.:  patricia flores fuentes -lrb- born 25 july 1977 -rrb- is a mexican politician affiliated to the national action party .
PB&L:  patricia flores fuentes -lrb- born 25 july 1977 -rrb- is a mexican politician affiliated to the national action party .
Ours:  patricia flores fuentes -lrb- born 25 july 1977 -rrb- is a mexican politician affiliated to the national action party .

key            value
name           ate faber
birth date     19 march 1894
birth place    leeuwarden , netherlands
death date     19 march 1962
death place    zutphen , netherlands
sport          fencing
article title  ate faber

Ref.:  ate faber -lrb- 19 march 1894 -- 19 march 1962 -rrb- was a dutch fencer .
PB&L:  ate faber -lrb- 19 march 1894 -- 19 march 1962 -rrb- was a dutch fencer .
Ours:  ate faber -lrb- 19 march 1894 -- 19 march 1962 -rrb- was a dutch fencer .

Table D1  Hallucinated words according either to our scoring procedure or to the method proposed by Perez-Beltrachini and Lapata [41].

key            value
name           alex wilmot sitwell
birth date     16 march 1961
birth place    uk
occupation     president , europe and emerging markets -lrb- ex-asia -rrb- of bank of america merrill lynch
article title  alex wilmot-sitwell

Ref.:  alex wilmot-sitwell heads bank of america merrill lynch 's businesses across europe and emerging markets excluding asia .
PB&L:  alex wilmot-sitwell heads bank of america merrill lynch 's businesses across europe and emerging markets excluding asia .
Ours:  alex wilmot-sitwell heads bank of america merrill lynch 's businesses across europe and emerging markets excluding asia .

key            value
name           ryan moore
spouse         nichole olson -lrb- m. 2011 -rrb-
children       tucker
college        unlv
yearpro        2005
tour           pga tour
prowins        4
pgawins        4
masters        t12 2015
usopen         t10 2009
open           t10 2009
pga            t9 2006
article title  ryan moore -lrb- golfer -rrb-

Ref.:  ryan david moore -lrb- born december 5 , 1982 -rrb- is an american professional golfer , currently playing on the pga tour .
PB&L:  ryan david moore -lrb- born december 5 , 1982 -rrb- is an american professional golfer , currently playing on the pga tour .
Ours:  ryan david moore -lrb- born december 5 , 1982 -rrb- is an american professional golfer , currently playing on the pga tour .
Table D2
Hallucinated words according either to our scoring procedure or to the method proposed by Perez-Beltrachini and Lapata [41].

key            value
title          prince of noër
name           prince frederick
image          prinsen af noer.jpg
image size     200px
spouse         countess henriette of danneskjold-samsøe mary esther lee
issue          prince frederick , count of noer prince christian louise , princess michael vlangali-handjeri princess marie
house          house of schleswig-holstein-sonderburg-augustenburg
father         frederick christian ii , duke of schleswig-holstein-sonderburg-augustenburg
mother         princess louise auguste of denmark
birth date     23 august 1800
birth place    kiel
death date     2 july 1865
death place    beirut
article title  prince frederick of schleswig-holstein-sonderburg-augustenburg
Gold            prince frederick emil august of schleswig-holstein-sonderburg-augustenburg ( kiel , 23 august 1800 – beirut , 2 july 1865 ) , usually simply known by just his first name , frederick , “ prince of noër ” , was a prince of the house of schleswig-holstein-sonderburg-augustenburg and a cadet-line descendant of the danish royal house .
stnd            prince frederick of schleswig-holstein-sonderburg-augustenburg ( 23 august 1800 – 2 july 1865 ) was a member of the house of schleswig-holstein-sonderburg-augustenburg .
stnd filtered   prince frederick of schleswig-holstein-sonderburg-augustenburg ( 23 august 1800 – 2 july 1865 ) was a german .
hsmm            prince frederick of schleswig-holstein-sonderburg-augustenburg ( 23 august 1800 – 2 july 1865 ) was a danish noblewoman .
hier            prince frederick of schleswig-holstein-sonderburg-augustenburg ( ) ( 23 august 1800 – 2 july 1865 ) was a german prince of the house of schleswig-holstein-sonderburg-augustenburg .
MBD[.4, .1, .5] prince frederick of schleswig-holstein-sonderburg-augustenburg ( ; 23 august 1800 – 2 july 1865 ) was the son of frederick christian ii , duke of schleswig-holstein-sonderburg-augustenburg and princess louise auguste of denmark .
Table D3
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key             value
name            godgory
background      group or band
origin          karlstad , sweden
genre           melodic death metal doom metal gothic metal
years active    1992 -- 2004
label           nuclear blast
current members matte andersson erik andersson
past members    mikael dahlqvist stefan grundel henrik lindström fredric danielsson thomas heder
article title   godgory
Gold            godgory was a swedish melodic death metal or death doom metal band that was formed august 1992 by erik andersson ( drums ) and matte andersson ( vocals ) of no relation .
stnd            godgory was a melodic death metal band from karlstad , sweden .
stnd filtered   godgory was a swedish metal band .
hsmm            godgory ( 10 august 2004 – 4 january 2010 ) was a melodic death metal band from karlstad , sweden .
hier            godgory was a melodic death metal band from sweden , sweden .
MBD[.4, .1, .5] godgory was a swedish melodic death metal band from karlstad .
Table D4
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           christy mihos
image          christy2006wp.jpeg
office         massachusetts turnpike commission member
term start     july 1 , 1999
term end       july 1 , 2004
successor      daniel grabauskas
birth date     13 june 1949
birth place    brockton , massachusetts
nationality    american
occupation     convenience store franchise owner
spouse         andrea mihos ( 1974-2013 ; divorce )
religion       greek orthodox
alma mater     stonehill college
party          republican
article title  christy mihos
Gold            christy peter mihos is an american politician and businessman from massachusetts .
stnd            christy mihos ( born june 13 , 1949 in brockton , massachusetts ) is a former republican member of the massachusetts house of representatives .
stnd filtered   christy mihos ( born june 13 , 1949 in brockton , massachusetts ) is an american politician .
hsmm            christy mihos ( born june 13 , 1949 ) is an american politician from the state of massachusetts .
hier            christy mihos ( born june 13 , 1949 ) is an american politician who served as the massachusetts state representative from 1999 to 2004 .
MBD[.4, .1, .5] christy mihos ( born june 13 , 1949 , in brockton , massachusetts ) is a former member of the massachusetts turnpike commission .
Table D5
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           duncan eagleson
nationality    american
area           painter , writer , penciller , colorist , sculptor , designer
article title  duncan eagleson
Gold            duncan eagleson is an american self-trained painter and former graffiti artist .
stnd            duncan eagleson is an american painter , writer , and penciller .
stnd filtered   duncan eagleson is an american painter .
hsmm            duncan eagleson ( born 1975 ) is an american comic book painter and writer .
hier            duncan eagleson is an american painter , illustrator , and designer .
MBD[.4, .1, .5] duncan eagleson is an american painter , writer , and sculptor .
Table D6
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           gerald warner brace
imagesize      208px
birth date     24 september 1901
birth place    islip , long island , suffolk county , new york
death date     20 july 1978
death place    blue hill , maine
occupation     novelist , writer , educator , sailor , boat builder
nationality    american
genre          fiction , non-fiction
article title  gerald warner brace
Gold            gerald warner brace ( september 24 , 1901 – july 20 , 1978 ) was an american novelist , writer , educator , sailor and boat builder .
stnd            gerald warner brace ( september 24 , 1901 – july 20 , 1978 ) was an american novelist , writer , and boat builder .
stnd filtered   gerald warner brace ( september 24 , 1901 – july 20 , 1978 ) was an american novelist .
hsmm            gerald warner brace ( september 24 , 1901 – july 20 , 1978 ) was an american novelist and writer .
hier            gerald warner brace ( september 24 , 1901 – july 20 , 1978 ) was an american novelist , short story writer , educator , and sailor .
MBD[.4, .1, .5] gerald warner brace ( september 24 , 1901 – july 20 , 1978 ) was an american author , educator , sailor , and boat builder .
Table D7
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           robert b. murrett
image          robertbmurrett.jpg
office         4th director of the national geospatial-intelligence agency director of the office of naval intelligence
president      george w. bush barack obama george w. bush
term start     2006 2005
term end       2010 2006
predecessor    james r. clapper richard b. porterfield
successor      letitia long tony l. cothron
alma mater     university at buffalo georgetown university joint military intelligence college
branch         united states navy
rank           vice admiral 20px
article title  robert b. murrett
Gold            vice admiral robert b. murrett was the fourth director of the national geospatial-intelligence agency , from 7 july 2006 through july 2010 .
stnd            robert b. murrett is a retired vice admiral of the united states navy .
stnd filtered   robert b. murrett is the director of the national geospatial-intelligence agency .
hsmm            robert b. “ bob ” murrett ( born 1956 ) is an american naval officer and the director .
hier            robert b. murrett is a retired vice admiral in the united states navy .
MBD[.4, .1, .5] robert b. murrett is a vice admiral in the united states navy .
Table D8
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           rosane ferreira
image          deputada federal rosane ferreira.jpg
office         federal deputy for state of parná
term start     1 february 2011
term end       actual
president      dilma rousseff
order          federal deputy for the state of roraima
birth date     31 july 1963
birth place    clevelândia , parná , brazil
dead           alive
nationality    brazilian
party          green party ( brazil )
article title  rosane ferreira
Gold            rosane ferreira ( cleusa rosane ribas ferreira , born clevelândia , paraná , july 31 , 1963 ) , is a nurse and a brazilian politician .
stnd            rosane ferreira ( born 31 july 1963 in clevelândia , parná ) is a brazilian politician .
stnd filtered   rosane ferreira ( born 31 july 1963 ) is a brazilian politician .
hsmm            rosane ferreira ( born july 31 , 1963 ) is a brazilian politician and the federal deputy .
hier            rosane ferreira ( born 31 july 1963 ) is a brazilian politician and the current federal deputy for the state of roraima .
MBD[.4, .1, .5] rosane ferreira ( born 31 july 1963 in clevelândia , parná , brazil ) is a brazilian politician .
Table D9
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           polina miller
gender         f
birth date     23 november 1988
birth place    saint petersburg , russian sfsr , soviet union
discipline     wag
article title  polina miller
Gold            polina miller ( , born november 23 , 1988 in saint petersburg ) is a russian gymnast .
stnd            polina miller ( born november 23 , 1988 ) is a russian artistic gymnast .
stnd filtered   polina miller ( born november 23 , 1988 ) is a .
hsmm            polina miller ( born 23 november 1988 in saint petersburg ) is a russian artistic gymnast .
hier            polina miller ( born 23 november 1988 ) is a russian rhythmic gymnast .
MBD[.4, .1, .5] polina miller ( born 23 november 1988 in saint petersburg , russian sfsr , soviet union ) is a russian gymnast .
Table D10
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           pat burke
irish          pádraig de búrca
sport          gaelic football
code           football
county         dublin
province       leinster
clposition     corner forward
club           kilmacud crokes
clubs          kilmacud crokes
counties       dublin
icprovince     1
birth place    dublin , ireland
article title  pat burke ( gaelic footballer )
Gold            pat burke is an irish gaelic footballer who plays for dublin and kilmacud crokes .
stnd            pat burke is a gaelic footballer from dublin , ireland .
stnd filtered   pat burke is a gaelic footballer for dublin .
hsmm            pat burke ( born in dublin ) is a former irish gaelic footballer who played as a gaelic footballer .
hier            pat burke is a former gaelic footballer for dublin .
MBD[.4, .1, .5] pat burke is a gaelic footballer from county dublin .
Table D11
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.

key            value
name           odiakes
background     non vocal instrumentalist
birth date     march 22
origin         tokyo , japan
instrument     keyboard , synthesizer
genre          j-pop , techno
occupation     composer
years active   1998 -- present
article title  odiakes
Gold            odiakes ( born march 22 ) is a japanese composer from tokyo , japan who has worked for a variety of visual novel companies .
stnd            , better known by his stage name odiakes , is a japanese composer .
stnd filtered   odiakes is a japanese composer .
hsmm            odiakes “ odiakes ” ( born march 22 ) is a japanese composer .
hier            composer ( born march 22 ) is a japanese j-pop player .
MBD[.4, .1, .5] odiakes ( born march 22 in tokyo , japan ) is a japanese composer .
Table D12
A WikiBio input table, coupled with the corresponding sentence and the model-generated outputs.
key            value
Club           Istiklol
Season         2015
League         Tajik League
page title     Parvizdzhon Umarbayev
section title  Club
section text   As of match played 29 July 2018
Gold            In 2015 , Umarbayev signed for Tajik League FC Istiklol .
stnd            Umarbayev joined Tajik League side Istiklol in 2015 .
stnd filtered
hal WO          Parvizdzhon joined Tajik League club Istiklol in 2015 .
MBD[.4, .1, .5] Umarbayev signed with Istiklol ahead of the 2015 Tajik League season .
Table D13
A ToTTo input table, coupled with the corresponding sentence and the model-generated outputs.
key            value
Rank           5
Island         Hulhumeedhoo
page title     List of islands of the Maldives
section title  Islands by area size
section text   This list ranks the top 10 islands of the Maldives by area . Some islands in the Maldives , although geographically one island , are divided into two administrative islands ( for example , Gan and Maandhoo in Laamu Atoll ) .
Gold            Hulhumeedhoo is the 5th largest island in Maldives .
stnd            It has a area of Hulhumeedhoo km ² ( Islands sq mi ) .
stnd filtered   is the fourth of the Maldives in Maldives .
hal WO          Hulhumeedhoo is the largest islands of the Maldives by area size .
MBD[.4, .1, .5] Hulhumeedhoo is the fifth largest island by area size .
Table D14
A ToTTo input table, coupled with the corresponding sentence and the model-generated outputs.
key            value
Single         24.7 ( Twenty-Four Seven )
page title     Singular ( band )
section title  2010
Gold            In 2010 , Singular released its first single , “ 24.7 ( Twenty-Four Seven ) ” .
stnd            The first single , 24.7 ( Twenty-Four Seven ) , was released in 2010 .
stnd filtered   The band won the 24.7 ( Twenty-Four Seven ) .
hal WO
MBD[.4, .1, .5] Singular released their first album , 24.7 ( Twenty-Four Seven ) .