Neural Data-to-Text Generation with LM-based Text Augmentation
†Ernie Chang, ♢Xiaoyu Shen∗, †Dawei Zhu, †Vera Demberg, ⊗Hui Su
†Dept. of Language Science and Technology, Saarland University
{cychang,xiaoyu}@coli.uni-saarland.de
♢Amazon Alexa AI, Berlin
⊗Pattern Recognition Center, WeChat AI, Tencent Inc., China
Abstract
For many new application domains for data-to-text generation, the main obstacle in training neural models consists of a lack of training data. While usually large numbers of instances are available on the data side, often only very few text samples are available. To address this problem, we here propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples based on replacing specific values by alternative ones from the same category, (ii) generating new text samples based on GPT-2, and (iii) proposing an automatic method for pairing the new text samples with data samples. As the text augmentation can introduce noise to the training data, we use cycle consistency as an objective, in order to make sure that a given data sample can be correctly reconstructed after having been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% of the annotations. By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points, establishing a new state-of-the-art on both datasets.
Introduction

Neural data-to-text generation has been the subject of much recent research. The task aims at transforming source-side structured data into target-side natural language text (Reiter and Dale, 2000; Barzilay and Lapata, 2005). While neural end-to-end systems afford the advantage of easy adaptability (Lebret et al., 2016; Wiseman et al., 2017), huge amounts of data-text pairs are still necessary to perform on par with their rule-based counterparts (van der Lee et al., 2018).

∗ Work done prior to joining Amazon.
Figure 1: Few-shot scenario: the model is expected to learn data-to-text generation from only a few labeled instances (i.e., table-text pairs), e.g., producing "The Blue Spice is a restaurant that serves English cuisine." from the corresponding table. The example is taken from the E2E dataset.

This makes using neural systems less appealing: oftentimes, in-domain text samples are not readily available, and there is a high cost to collecting in-domain texts which fit the data samples and annotating these texts with the data labels; the cost of collecting this data might hence even outweigh the effort of designing a rule-based system (Gkatzia, 2016). The goal of this work is to improve the performance of neural data-to-text models in scenarios where only very few text samples exist (we assume that these text samples are paired with corresponding data samples). We aim to answer how we can make the most of the scarce annotations, together with large amounts of unlabelled data, in order to push the limit of neural data-to-text models. Figure 1 illustrates the scenario.

To address the limited-data challenge, we propose a simple yet effective way of augmenting the text side with the pretrained language model (LM) GPT-2 (Radford et al., 2019). Unlike other text augmentation work employed in data-to-text generation systems (Freitag and Roy, 2018; Agarwal et al., 2018), our proposal assumes little to no domain-dependent heuristics. It consists of two steps: (1) information augmentation by slot-value replacement and (2) LM augmentation by GPT-2 generation.

Once we have augmented the set of text samples, we are essentially in a similar setting as previously proposed semi-supervised approaches to data-to-text generation (Schmitt and Schütze, 2019; Qader et al., 2019; Su et al., 2020b), which assume the presence of vast amounts of unpaired data and text instances. These approaches exploit a cycle consistency objective in order to learn a pairing for the data samples. The cycle consistency objective tries to make sure that data samples can be reconstructed correctly from their textual formulations, and similarly that texts can be reconstructed after having been parsed into a data representation.

As the automatically generated text samples from GPT-2 might be very noisy and not pair well with data samples, we align each augmented text sample with its most similar unlabeled data sample, as defined in their encoded vector space. This idea is inspired by recent work on representation matching in MT (Artetxe and Schwenk, 2019; Ruiter et al., 2019). To ensure good quality of the training data, only pairs above a certain similarity threshold ε are retained as pseudo pairs for training. The quality of the pseudo pairs will gradually improve as the encoder improves during training; in return, the learning of the encoder is also facilitated by the improved quality of the pseudo pairs, forming a virtuous cycle.

On two data-to-text benchmarks, E2E (Novikova et al., 2017) and WebNLG (Gardent et al., 2017), we show that our LM-augmented weakly supervised model succeeds in outperforming a fully supervised seq2seq model, though utilizing less than 10% of the data annotations. It even outperforms previous work which additionally has access to all unpaired text samples. When trained with full data annotations, it is able to boost the model performance by up to 5 BLEU points, establishing a new state-of-the-art on both datasets.

In summary, this work makes the following contributions:

1. We study the few-shot data-to-text scenario where, unlike previous works, no further target-side text is available.
2. We present an effective way of automatically augmenting target text by resorting to the pretrained LM GPT-2.

3. We propose utilizing the augmented text through a combination of cycle consistency and representation matching. The resulting model outperforms a standard seq2seq model with less than 10% of the data annotations.

4. The proposed model is shown to be complementary with current seq2seq pretraining techniques, and can offer orthogonal improvements when combining both.

Related Work

Building neural data-to-text systems with few paired samples (but a large set of unpaired samples) has been a hot research topic recently. Most works adopt the idea of cycle consistency (Zhu et al., 2017), which has been used in many text generation tasks like machine translation (Artetxe et al., 2017; Lample et al., 2017) and style transfer (Prabhumoye et al., 2018; Subramanian et al., 2018). Schmitt and Schütze (2019); Qader et al. (2019); Su et al. (2020b); Chang et al. (2020b, 2021a,b) applied this idea to the task of data-to-text generation and reported promising results. Ma et al. (2019) separate the generation process into few-shot content selection and surface realization components and learn them separately. Nonetheless, all of these approaches assume the existence of a huge quantity of unpaired text samples, which, as we mentioned, is an unrealistic assumption for the task of data-to-text generation. Freitag and Roy (2018) propose to reconstruct usable sequences re-written from data with rules for unsupervised data-to-text generation. Unfortunately, designing these rules requires effort similar to building a template-based system. Budzianowski and Vulić (2019); Chen et al. (2020); Peng et al. (2020) tackle the few-shot challenge by finetuning a pretrained LM to incorporate prior knowledge from general-domain text or data-text pairs. We show that our technique is complementary with theirs and can offer orthogonal improvements when combining both.
Problem Formulation

We represent the data samples as D and the text samples as T. In our work, we do not restrict the format of the data. Each d ∈ D can be a set of key-value pairs, as in Figure 1, or in the form of RDF triples as in Gardent et al. (2017). Each text t ∈ T consists of a sequence of words. In few-shot settings, we are assumed to have (1) k labeled pairs (D_L, T_L) and (2) large quantities of unlabeled data D_U where |D_U| ≫ k > 0 (we force k > 0, as we believe a reasonable generation system needs at least a few demonstrations of the annotation). This, we believe, is a more realistic setting, as unlabeled data are usually abundant and can also be easily fabricated from predefined schemata. Notably, we assume no access to outside resources containing in-domain text: the k annotations are all we know about the text side.

Approach

In this section, we first explain our proposed new method for text sample augmentation, and then discuss methods to remove noise and automatically align the data by elaborating on the ideas of cycle consistency and representation matching. Finally, we summarize the approach and present the detailed algorithm.
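To make the notation concrete, the following is a minimal sketch (in Python; slot names and values are illustrative, not taken from the paper) of how data samples, their flattened form, and the few-shot split could be represented:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Sample:
    data: Dict[str, str]            # slot -> value (or flattened RDF triples)
    text: Optional[str] = None      # None for the unlabeled data corpus D_U

def flatten(data: Dict[str, str]) -> str:
    """Linearize a meaning representation into a token sequence for the encoder."""
    return " ".join(f"[{slot}] {value}" for slot, value in data.items())

# k labeled pairs (D_L, T_L) and a much larger unlabeled data corpus D_U
labeled: List[Sample] = [
    Sample({"name": "Blue Spice", "food": "English", "eatType": "restaurant"},
           "The Blue Spice is a restaurant that serves English cuisine."),
]
unlabeled: List[Sample] = [
    Sample({"name": "The Punter", "priceRange": "£20-25", "area": "city centre"}),
]
assert len(labeled) > 0   # |D_U| >> k > 0 in the actual few-shot setting
print(flatten(labeled[0].data))
```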
Text Augmentation

To mitigate the paucity of the set of text samples T, we propose a pipeline approach that augments the text samples by (1) information augmentation and (2) LM augmentation.

Information augmentation. We generate additional text samples by performing slot-value replacements. As many data values are copied verbatim into the text samples, this copied information can be easily detected and replaced with other values (of the same slot type) to enrich the information space of the text samples (a code sketch is given at the end of this subsection). This can be considered a simplified version of traditional template mining methods, where key words are extracted to construct templates (Kondadadi et al., 2013; Oya et al., 2014). An example is shown in Figure 2. Each text sample is augmented with 10 more distinct text samples, or with all possible values being replaced.

The slot-value replacement is efficient to implement. However, it can only detect identical values and augment text with the same combinatorial patterns as the few-shot annotations. To enrich the linguistic realizations of the text and enable new combinations of information, we further propose an LM augmentation approach using GPT-2.
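A minimal sketch of the slot-value replacement step just described, assuming values are copied verbatim into the text (the slot-value inventory below is illustrative, not from the dataset):

```python
import random
from typing import Dict, List, Tuple

# Illustrative inventory of alternative values per slot type (hypothetical).
SLOT_VALUES = {
    "name": ["Blue Spice", "The Punter", "Alimentum"],
    "food": ["English", "Chinese", "Italian"],
    "area": ["city centre", "riverside"],
}

def replace_slot_values(data: Dict[str, str], text: str,
                        n_aug: int = 10, seed: int = 0) -> List[Tuple[Dict[str, str], str]]:
    """Detect values copied verbatim into the text and swap them for other
    values of the same slot type, yielding new (data, text) pairs."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_aug):
        new_data, new_text = dict(data), text
        for slot, value in data.items():
            if value in text and slot in SLOT_VALUES:
                alternatives = [v for v in SLOT_VALUES[slot] if v != value] or [value]
                choice = rng.choice(alternatives)
                new_data[slot], new_text = choice, new_text.replace(value, choice)
        if new_text != text:
            augmented.append((new_data, new_text))
    # Keep only distinct augmented texts (up to 10 per original sample).
    return list({t: (d, t) for d, t in augmented}.values())

pairs = replace_slot_values(
    {"name": "Blue Spice", "food": "English", "area": "city centre"},
    "The Blue Spice is a restaurant that serves English cuisine in the city centre.")
for d, t in pairs[:3]:
    print(t)
```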
LM augmentation. GPT-2 (Radford et al., 2019) is a language model pretrained on the collected WebText. It has demonstrated remarkable zero-shot multitask adaptability by simply feeding the input of each task into the LM and continuing to generate words. It has also been shown that GPT-2 can improve classification tasks via in-domain text augmentation (Papanikolaou and Pierleoni, 2020; Sun et al., 2020). We use a similar technique by first finetuning GPT-2 on the few-shot annotations (Wolf et al., 2019), and then applying it to produce synthetic text through an iterative conditional generation process: with the initial seeds being samples of T_L plus the new samples from information augmentation, the LM iteratively conditions on the previous output sentence to generate in-domain text (we adopt the top-k random sampling setting with k = 2 to encourage diversity and reduce repetition; Radford et al., 2019). Each synthetic sentence is pruned if it (1) is shorter than 5 words or (2) contains only special tokens. The iterative generation is terminated when all tokens in the initial seeds are covered or when the maximum of 100 runs is reached. All the unpruned synthetic text samples are added to the space of T to benefit the learning directions t → d′ → t and t̃ → t. Figure 2 depicts the generation process of GPT-2. In practice, obtaining clean in-domain text requires extreme efforts of designing heuristic rules. Nonetheless, the synthetic text from GPT-2 makes decent sense and can already provide useful signals to drive the learning process.

Cycle Consistency

The core idea of encouraging cycle consistency is that, starting from one sample in a domain, the model first maps it into the other domain, then maps it back (He et al., 2016); the resulting sample should be identical to the original sample. Specifically, let p_θ(t | d) be the probability distribution mapping a data sample d to its corresponding text t, and p_φ(d | t) be the probability distribution mapping text back to data. Starting from a data sample d ∈ D, the objective is:

max_φ  E_{d ∼ p(D)}  log p_φ(d | t′),   t′ ∼ p_θ(t | d)        (1)

which ensures consistency in the direction d → t′ → d. Note that only p_φ is updated in this direction; p_θ serves only as an auxiliary function to provide pseudo samples t′ from d. Though it is also possible to update θ at the same time through tricks like Gumbel-softmax (Jang et al., 2016) or REINFORCE (Williams, 1992), we find this did not lead to better performance, yet complicated the training. Similar observations have been made in Lample et al. (2018); He et al. (2020); Garcia et al. (2020).
Figure 2:
Depiction of text augmentation and representation matching. Each text sample first goes through information augmentation by slot-value replacement, then is passed to GPT-2 with iterative conditional generation. The augmented text samples are paired with the most similar data samples from the corpus, with a threshold cutoff.
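As a concrete illustration of the LM augmentation step depicted in Figure 2, the following is a minimal sketch using the HuggingFace transformers library (Wolf et al., 2019). It assumes a GPT-2 checkpoint already finetuned on the few-shot annotations (the vanilla `gpt2` checkpoint stands in here) and omits the seed-coverage stopping criterion for brevity:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in for the GPT-2 model finetuned on the few-shot in-domain annotations.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def iterative_augment(seed_text: str, max_rounds: int = 100, min_words: int = 5):
    """Iteratively condition GPT-2 on its previous output to collect synthetic
    in-domain sentences, pruning degenerate ones."""
    synthetic, current = [], seed_text
    for _ in range(max_rounds):
        inputs = tokenizer(current, return_tensors="pt")
        out = model.generate(**inputs,
                             do_sample=True, top_k=2,          # top-k sampling as in the paper
                             max_new_tokens=40,
                             pad_token_id=tokenizer.eos_token_id)
        new_text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True).strip()
        if len(new_text.split()) >= min_words:                 # prune overly short outputs
            synthetic.append(new_text)
            current = new_text                                  # condition on the previous output
    return synthetic

print(iterative_augment("The Blue Spice is a restaurant that serves English cuisine.",
                        max_rounds=3)[:2])
```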
Figure 3:
Four directions of cycle consistency. Gradients are backpropagated only through solid lines.

Similarly, starting from a text t ∈ T, the objective ensures consistency in the direction t → d′ → t (in the MT community, the equivalent step is usually called back translation; Sennrich et al., 2016; Lample et al., 2018):

max_θ  E_{t ∼ p(T)}  log p_θ(t | d′),   d′ ∼ p_φ(d | t)        (2)

Finally, we further add two denoising autoencoding objectives on both the data and text sides:

max_{θ,φ}  E_{d ∼ p(D), t ∼ p(T)}  log p_φ(d | d̃) p_θ(t | t̃)        (3)

where d̃ and t̃ are corrupted versions of d and t. We use the same noise function as in Lample et al. (2018), which randomly permutes and pads a portion of the input. This encourages the encoder to learn meaningful latent representations by reconstructing the input itself (Currey et al., 2017; Lample et al., 2018). Figure 3 illustrates all four directions of the cycle consistency objective.
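The following schematic sketch shows how the four training directions (Eqs. 1–3) might be combined in a single training step. The generation and scoring functions are toy stand-ins for the shared encoder and the two decoders; in the real model, gradients are not backpropagated through the sampled t′ and d′:

```python
import random

# Toy stand-ins for the learned modules (shared encoder with two decoders).
def generate_text(d):        # sample t' ~ p_theta(t | d), no gradient through sampling
    return "text for " + d

def generate_data(t):        # sample d' ~ p_phi(d | t)
    return "data for " + t

def log_p_theta(t, d):       # log p_theta(t | d), from the text decoder in the real model
    return -1.0

def log_p_phi(d, t):         # log p_phi(d | t), from the data decoder in the real model
    return -1.0

def corrupt(x, noise_prob=0.2, rng=random.Random(0)):
    """Simplified noise function: pad out a portion of the tokens and permute them."""
    toks = ["<pad>" if rng.random() < noise_prob else w for w in x.split()]
    rng.shuffle(toks)
    return " ".join(toks)

def cycle_consistency_losses(d, t):
    t_prime = generate_text(d)                # d -> t'
    loss_d_cycle = -log_p_phi(d, t_prime)     # Eq. 1: reconstruct d from t' (updates phi only)
    d_prime = generate_data(t)                # t -> d'
    loss_t_cycle = -log_p_theta(t, d_prime)   # Eq. 2: reconstruct t from d' (updates theta only)
    loss_dae = -log_p_phi(d, corrupt(d)) - log_p_theta(t, corrupt(t))   # Eq. 3: denoising AE
    return loss_d_cycle + loss_t_cycle + loss_dae

print(cycle_consistency_losses("[name] Blue Spice [food] English",
                               "Blue Spice serves English food."))
```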
Architecture. We use one shared encoder Enc for both the data and text sides. Each data sample is flattened into a sequence by listing its slot-value pairs and fed into the same encoder. Using the same encoder for both types of input gives the model an inductive bias to project similar data/text into nearby regions of the latent space. We will show later that encoder sharing is essential for good performance in the few-shot scenario. (The shared encoding has also been shown effective in other tasks like machine translation (Lample et al., 2018) and image translation (Zhu et al., 2017). We further tried sharing the decoder as in Johnson et al. (2017) but found no improvement; see Table 2.) From the shared encoded space, two separate decoders Dec_d and Dec_t are used to decode d and t respectively.

Representation Matching

Apart from training under cycle consistency, we further consider matching each synthetic text with its most similar data sample and treating them as supplementary training pairs. Compared with the pseudo d′ obtained from back translation (Eq. 2), the matched data samples are extracted from the existing corpus D_U and are thereby guaranteed to be clean. This provides a much more stable training signal, especially at the initial training stage. (In theory, as we can fabricate arbitrary possible data samples from the predefined schema and add them to the corpus, we can always find one matched data sample for a text sample.) Previous work has used representation matching to automatically extract pseudo training pairs for machine translation (Artetxe and Schwenk, 2019; Ruiter et al., 2019). Baziotis et al. (2019) and Chu and Liu (2019) also demonstrate that the representation similarity between input-output pairs can serve as a useful regularization for unsupervised text summarization. We adopt a similar idea to create pseudo pairs based on their cosine similarity in the representation space. To summarize, the process of representation matching can be described as:

max_{θ,φ}  E_{t ∼ p(T′)}  1[cos(d*, t) > ε] (log p_θ(t | d*) + log p_φ(d* | t)),   d* = argmax_{d ∈ D} cos(d, t)        (4)

where T′ is the augmented text from the LM and 1[·] is the indicator function. We perform mean pooling over the encoded representations before matching them. ε is a threshold: pseudo pairs with a cosine similarity below ε are discarded. Ideally, as the encoder improves, the pseudo pairs created by representation matching will make more sense, which in turn benefits the training of the encoder.

Training

Apart from the above unsupervised objectives, on the few annotated data-text pairs we impose the supervised objective:

max_{θ,φ}  E_{(d,t) ∼ p(D_L,T_L)}  log p_θ(t | d) + log p_φ(d | t)        (5)

where (D_L, T_L) contains the k data annotations. Putting it all together, we summarize the approach in Algorithm 1. In the training stage, we optimize the objectives of cycle consistency, representation matching and supervised learning sequentially, to maintain a constant ratio of signals from all sides.
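As an illustration of the representation matching step (Eq. 4), the following sketch pairs LM-augmented texts with their most similar data samples under a similarity threshold. The encoder here is a toy bag-of-token-vectors stand-in for the shared, mean-pooled encoder:

```python
import numpy as np

def token_vec(tok, dim=16):
    # Toy hash-based token embedding (not deterministic across Python runs).
    rng = np.random.default_rng(abs(hash(tok)) % (2 ** 32))
    return rng.standard_normal(dim)

def encode(text, dim=16):
    # Stand-in for the shared encoder: mean-pooled per-token vectors.
    return np.mean([token_vec(t, dim) for t in text.split()], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def flatten(d):
    return " ".join(f"{k} {v}" for k, v in d.items())

def match_pseudo_pairs(synthetic_texts, data_corpus, eps=0.5):
    """Pair each LM-augmented text with its most similar data sample (Eq. 4),
    keeping the pair only if the cosine similarity exceeds the threshold eps."""
    pairs = []
    for t in synthetic_texts:
        t_vec = encode(t)
        best_d = max(data_corpus, key=lambda d: cosine(encode(flatten(d)), t_vec))
        if cosine(encode(flatten(best_d)), t_vec) > eps:
            pairs.append((best_d, t))
    return pairs

corpus = [{"name": "Blue Spice", "area": "riverside"},
          {"name": "The Punter", "area": "city centre"}]
print(match_pseudo_pairs(["The Punter is in the city centre."], corpus, eps=0.1))
```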
Experiments

Data. We conduct experiments on the E2E (Novikova et al., 2017) and WebNLG (Colin et al., 2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain; the inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities belonging to fifteen distinct DBpedia categories; the inputs are up to seven RDF triples of the form (subject, relation, object).

Configuration.
The model is implemented based on fairseq (Ott et al., 2019), using learned token embeddings and the Adam optimizer.
Algorithm 1
Few-shot Data-to-Text Framework

Input: D_U, (D_L, T_L)
Create (D_a, T_a) by information augmentation; (D_L, T_L) ← (D_L, T_L) ∪ (D_a, T_a)
Create T′ by LM augmentation; T ← T_L ∪ T′
repeat
    Sample a batch from (D_L, T)
    Cycle consistency: optimize by Eq. 1 + Eq. 2 + Eq. 3
    Representation matching: optimize by Eq. 4
    Supervised training: optimize by Eq. 5
until convergence
We employ beam search for decoding and select models based on BLEU-4 scores on the development set. The score is averaged over 10 runs with different random initializations. In this work, the seq2seq models are built upon the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997). Both the encoder and decoder are LSTMs, amounting to 18M parameters for the seq2seq model (600-dimensional embeddings and 1024 hidden units). The maximum sequence length is set to 100 for E2E and 200 for WebNLG (SPM-based). All encoder parameters are shared between data and text samples. All models were trained on 1 Nvidia V100 GPU (32GB, CUDA 10.2) for 4k steps, using the Adam optimizer with linear learning rate decay scheduling; models are selected based on the best validation BLEU-4. At decoding time, sentences are generated using greedy decoding.

Results

In this section, we present experiment results and analysis. We first compare our model with other baselines on both datasets, then perform a set of ablation studies on the E2E dataset to see the effects of each component. Finally, we analyze how text augmentation helps improve the model, include example outputs, and show the human evaluation results.

Model                 E2E - 10%                         E2E - 100%
                      BLEU   NIST  METEOR  ROUGE-L      BLEU   NIST  METEOR  ROUGE-L
SLUG                  -      -     -       -            66.19  8.61  44.54   67.72
Seq2seq               53.38  6.10  38.10   60.53        63.32  6.81  41.25   62.91
Qader et al. (2019)   58.10  6.24  41.32   62.84        64.20  7.14  44.68   65.31
Chen et al. (2020)    59.10  7.49  40.25   63.23        63.72  7.76  40.25   66.23
Proposed (LSTM)       64.24  7.71  43.53   66.81        68.88  8.89  48.53   72.12

Model                 WebNLG - 10%                      WebNLG - 100%
                      BLEU   NIST  METEOR  ROUGE-L      BLEU   NIST  METEOR  ROUGE-L
MELBOURNE             -      -     -       -            44.93  8.98  36.58   60.40
Seq2seq               36.54  7.3   35      54.61        44.60  8.49  38.23   59.67
Qader et al. (2019)   38.66  7.81  34.1    56.95        47.19  8.71  37.90   58.61
Chen et al. (2020)    39.40  7.84  37.25   56.23        46.15  8.52  39.1    58.5
Proposed (LSTM)       43.75  8.29  33.58   58.49        50.26  8.86  40.71   61.29
Table 1:
Performance on E2E and WebNLG with 10% and 100% of the data. Qader et al. (2019) utilizes all ground-truth unpaired text samples, while our proposed model only has access to the few-shot data annotations.
Comparison with Other Models
In Table 1, we compare our model with (1) a seq2seq baseline, (2) the cycle consistency model as in Qader et al. (2019), and (3) a finetuned GPT-2 model as in Chen et al. (2020) (the original code was not open-sourced, so we reproduced the model based on our implementation; the results on 10k annotations match the reports in the paper; cf. https://github.com/czyssrs/Few-Shot-NLG). For all models, we run with 10% and 100% of the annotations to see how they perform under different data sizes. Our model is implemented with LSTM encoder-decoders, the same as the seq2seq baseline, for a fair comparison. Note that Qader et al. (2019) further utilizes all the ground-truth unpaired text samples, while the other models run only on the few-shot annotations. We also include the results of SLUG (Juraska et al., 2018) and MELBOURNE (Gardent et al., 2017), the overall winners on automatic metrics in the E2E and WebNLG challenges respectively (both seq2seq-based). SLUG uses a heuristic slot aligner based on a set of handcrafted rules and combines a complex pipeline of data augmentation, selection, model ensembling and reranking.

The results show that our proposed model significantly improves over the baseline in both the few-shot and the fully supervised setting. The improvement is more evident when only 10% of the annotations are available, with a leap of 11 and 7 BLEU points on E2E and WebNLG respectively. It also outperforms systems relying on task-dependent heuristics. In comparison, Qader et al. (2019), though with access to all text samples at all percentages, still underperforms our model by a tangible margin. In the fully supervised setting, it brings little to no difference compared with the seq2seq baseline, as no extra data is incorporated in the training process. Likewise, we observe that the text augmentation from the finetuned GPT-2 model helps the proposed model in the few-shot setting, but its advantage also vanishes when all data annotations are available.

Model / Share    None    Enc    Dec    Both
Supervised
+ t → d′ → t
+ d → t′ → d
+ t → t
+ d → d
Table 2: Ablation study for cycle consistency (10% annotations). BLEU-4 score is reported. Each line adds one condition on top of the previous one. Supervised is the supervised seq2seq baseline.

In Figure 4, we plot the model performance with varying numbers of data annotations. All models are trained from scratch with 10 different random initializations, and the standard deviation of the BLEU-4 score is visualized. We can see that our model (LSTM-based), though with a relatively larger standard deviation due to the uncertainty of text augmentation sampling, still consistently and significantly outperforms the other baselines, and even surpasses the fully supervised seq2seq model with less than 10% of the data annotations.
Ablation Study on Cycle Consistency
In Table 2, we study how the four directions, input noise and parameter sharing affect the performance of cycle consistency. The experiments are conducted with 10% annotations and no further unpaired text samples available.
Figure 4:
Model performance with varying numbers of data annotations. Our model with 10% annotations outperforms the seq2seq model trained on 100% of the pairs (dotted line) on both datasets. Shades show the sample standard deviation based on 10 runs with different model initializations.

As can be observed, adding the training direction t → d′ → t (i.e., back translation) has little effect on top of the supervised seq2seq baseline. This is expected, since back translation is naturally designed to incorporate additional unpaired text samples; when run only on the few-shot annotations, its power is very limited. The backward direction d → t′ → d is surprisingly useful when the encoder is shared between data and text. Though this direction does not affect the text decoder at all, the improvement suggests the model can benefit considerably from simply structuring its encoded space and mapping aligned data-text pairs to nearby regions of the vector space. The autoencoding directions bring a small improvement; when combined with input noise, the performance increases further. This is in line with previous findings that denoising autoencoding is more helpful in inducing a meaningful latent space (Lample et al., 2018) than simply learning to copy the original input.

The results also suggest that encoder sharing is important for the cycle consistency objective to work in our few-shot setting. Decoder sharing, in contrast, makes little or even negative difference. This is similar to multilingual machine translation, where sharing the decoder among languages can negatively interfere with performance (Johnson et al., 2017).

Ablation Study on Text Augmentation
On top of the four-direction cycle consistency training, we study the effects of text augmentation in Table 3. We compare our proposed info + LM augmentation with (1) random augmentation, where random text from Wikipedia is sampled into the augmented text space, (2) unsupervised data augmentation (UDA; Xie et al., 2019), where text samples are augmented with paraphrases of the current annotations, and (3) ground-truth augmentation with references obtained from the remaining training corpus, which serves as an upper bound of text augmentation techniques. We test the performance with 1%, 5%, 10% and 20% annotations to see the effects with varying amounts of supervision.
Table 3:
Ablation study for text augmentation with varying numbers of annotations. Experiments are performed on the E2E dataset. LM augmentation outperforms Random by a large margin, and on some occasions even outperforms augmentation with ground-truth references. Representation matching (RM) further boosts the overall performance.

Model            E2E                      WebNLG
                 Fluency  Miss  Wrong     Fluency  Miss  Wrong
Seq2Seq          3.68     49    63        3.95     57    48
Cycle-only       4.08     46    66        4.23     48    44
Finetuned GPT-2  4.21     43    57        4.10     39    45
Proposed (LSTM)
Table 4:
Human evaluation on the sampled outputs (100 instances) for models with 10% annotated data. Cycle-only indicates the approach in Qader et al. (2019); Finetuned GPT-2 refers to Chen et al. (2020).
[name] Blue Spice [eat type] restaurant [food] Chinese [area] city centre [family friendly] no [near] Rainbow Vegetarian Café
Reference: at the [city centre], there is a [restaurant] called the [Blue Spice].
seq2seq / info-aug: Blue Spice restaurant near Rainbow Vegetarian Café has a .
+LM-aug: located near Rainbow Vegetarian Café is a Chinese theme eatery and restaurant called Blue Spice. It is in the city centre area.

Figure 5:
Generation examples with different text augmentation techniques, trained on 5 data annotations (the toy training set shown above). The attribute combination of the input data is unseen in the 5 annotations. Hallucinated contents are italicized.

Figure 6:
Number of additional decoded unique tokens (non-copied) and unique combinations of information on the test set, with varying numbers of annotations.

As can be seen, the random augmentation even harms the model performance, suggesting that reasonable in-domain text augmentation is necessary for the model to improve. UDA augmentation also makes rather little difference, as it simply paraphrases the currently available annotations but cannot bring in any new information. The information augmentation by slot-value replacement improves performance a bit. When combined with the LM augmentation, performance can be further boosted, especially in lower-resource scenarios. The representation matching always helps lift the performance, with gains of up to 10 BLEU points. As expected, the benefit from text augmentation gradually vanishes as more annotations are collected, especially for datasets with relatively simple patterns such as E2E.
How text augmentation helps
Intuitively, the GPT-2 augmentation is expected to impose new tokens and combination patterns on top of the few-shot annotations. To investigate whether this is the case, for the decoded text in the test phase we count the number of unique tokens (excluding copied data values) and of unique information combination patterns (attribute combinations in E2E). The results in Figure 6 show that LM augmentation indeed greatly enriches the vocabulary space, even doubling the number of generated unique tokens in low-resource scenarios. The same happens for new combination patterns. In contrast, info-aug cannot insert new tokens or combinations at all, since all it does is replace data values within the same text annotations. UDA can impose new tokens by paraphrasing the annotations, but it hardly helps the model generalize to new combinations of information.

Figure 5 shows generation examples with different text augmentation techniques. We train the systems in a toy setting with only 5 data annotations (trainset in the Appendix) and pick an input with an attribute combination that is unseen in the 5 annotations, to test whether the models can generalize correctly. Seq2seq and info-aug produce wrong generations that overfit to the information in the 5 training instances. With LM augmentation, the model adapts to the new combination and connects the information correctly.
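A sketch of how the two statistics reported in Figure 6 could be computed from decoded test outputs (the exact counting scheme in the paper may differ; this is an assumed approximation):

```python
def augmentation_analysis(generated, data_inputs):
    """Count unique non-copied tokens and unique attribute combinations realized
    in the generated outputs, as in the Figure 6 style of analysis."""
    unique_tokens, unique_combos = set(), set()
    for text, data in zip(generated, data_inputs):
        copied = {tok for value in data.values() for tok in value.split()}
        unique_tokens.update(tok for tok in text.split() if tok not in copied)
        realized = frozenset(slot for slot, value in data.items() if value in text)
        unique_combos.add(realized)
    return len(unique_tokens), len(unique_combos)

gen = ["Blue Spice is a restaurant in the riverside area ."]
inp = [{"name": "Blue Spice", "area": "riverside"}]
print(augmentation_analysis(gen, inp))
```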
Human Evaluation
We further run a human evaluation on the model outputs to closely check the generation quality. We compare four types of models: the seq2seq baseline, seq2seq plus cycle consistency as in Qader et al. (2019), finetuned GPT-2 as in Chen et al. (2020), and our proposed model. All models are LSTM-based apart from the finetuned GPT-2 one. We sample 100 data instances from the test set and apply all the models to generate corresponding text. The data and generated text are evaluated by crowdworkers on Prolific. For each data-text pair, the annotator is instructed to evaluate (1) whether the text is fluent (score 0-5, with 5 being fully fluent), (2) whether it misses information contained in the source data, and (3) whether it includes wrong information. The average fluency scores and the counts of information misses and wrong information are presented in Table 4. The scores are generally consistent with the automatic evaluation results: our proposed model outperforms the other ones by a large margin, even though Cycle-only can access all unpaired text and finetuned GPT-2 is significantly larger than our LSTM-based seq2seq. The generated texts are more fluent, while maintaining information completeness and correctness to a large extent.

Conclusion

We study few-shot data-to-text generation with only limited annotated data. We propose text augmentation with slot-value replacement followed by GPT-2 generation. The augmented text, when combined with cycle consistency and representation matching, is shown to help the model generalize to unseen new tokens and patterns of token combinations. With less than 10% of the annotations, it outperforms a supervised seq2seq model trained on 100% of the annotations, and it is extensible enough to be combined with pretraining techniques. For future work, we hope to apply the techniques to annotation scenarios (Hong et al., 2019; de Souza et al., 2018; Zhuang and Chang, 2017; Chang et al., 2020a; Wiehr et al., 2020; Shen et al., 2020; Chang et al., 2020b; Su et al., 2020a; Chang et al., 2021a,b; Crook et al., 2018).
Acknowledgements
This research was funded in part by the German Research Foundation (DFG) as part of SFB 248 "Foundations of Perspicuous Software Systems". We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.

References
Shubham Agarwal, Marc Dymetman, and Eric Gaussier. 2018. Char2char generation with reranking for the E2E NLG challenge. In
Proceedings of the11th International Conference on Natural LanguageGeneration , pages 451–456.Mikel Artetxe, Gorka Labaka, Eneko Agirre, andKyunghyun Cho. 2017. Unsupervised neural ma-chine translation. arXiv preprint arXiv:1710.11041 .Mikel Artetxe and Holger Schwenk. 2019. Margin-based parallel corpus mining with multilingual sen-tence embeddings. In
Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics , pages 3197–3203.Regina Barzilay and Mirella Lapata. 2005. Model-ing local coherence: An entity-based approach. In
Proceedings of the 43rd Annual Meeting of the As-sociation for Computational Linguistics (ACL’05) ,pages 141–148, Ann Arbor, Michigan. Associationfor Computational Linguistics.Christos Baziotis, Ion Androutsopoulos, Ioannis Kon-stas, and Alexandros Potamianos. 2019. Seqˆ3: Dif-ferentiable sequence-to-sequence-to-sequence au-toencoder for unsupervised abstractive sentencecompression. In
Proceedings of the 2019 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers) , pages 673–681.Paweł Budzianowski and Ivan Vuli´c. 2019. Hello, it’sgpt-2-how can i help you? towards the use of pre-trained language models for task-oriented dialoguesystems. In
Proceedings of the 3rd Workshop onNeural Generation and Translation , pages 15–22.Ernie Chang, David Adelani, Xiaoyu Shen, and VeraDemberg. 2020a. Unsupervised pidgin text genera-tion by pivoting english data and self-training. In
InProceedings of Workshop at ICLR .Ernie Chang, Jeriah Caplinger, Alex Marin, XiaoyuShen, and Vera Demberg. 2020b. Dart: Alightweight quality-suggestive data-to-text annota-tion tool. In
COLING 2020 , pages 12–17.Ernie Chang, Vera Demberg, and Alex Marin. 2021a.Jointly improving language understanding and gen-eration with quality-weighted weak supervision ofautomatic labeling. In
EACL 2021 .Ernie Chang, Hui-Syuan Yeh, and Vera Demberg.2021b. Does the order of training samples matter?improving neural data-to-text generation with cur-riculum learning. In
EACL 2021 .Zhiyu Chen, Harini Eavani, Yinyin Liu, andWilliam Yang Wang. 2020. Few-shot nlg with pre-trained language model.
ACL. Eric Chu and Peter Liu. 2019. MeanSum: a neural model for unsupervised multi-document abstractive summarization. In
International Conference on Ma-chine Learning , pages 1223–1232.Emilie Colin, Claire Gardent, Yassine M’rabet, ShashiNarayan, and Laura Perez-Beltrachini. 2016. TheWebNLG challenge: Generating text from DBPediadata. In
Proceedings of the 9th International Nat-ural Language Generation conference , pages 163–167, Edinburgh, UK. Association for ComputationalLinguistics.Paul A Crook, Alex Marin, Vipul Agarwal, Saman-tha Anderson, Ohyoung Jang, Aliasgar Lanewala,Karthik Tangirala, and Imed Zitouni. 2018. Con-versational semantic search: Looking beyond websearch, q&a and dialog systems. In
Proceedings ofthe Eleventh ACM International Conference on WebSearch and Data Mining , pages 763–766.Anna Currey, Antonio Valerio Miceli-Barone, and Ken-neth Heafield. 2017. Copied monolingual data im-proves low-resource neural machine translation. In
Proceedings of the Second Conference on MachineTranslation , pages 148–156.Markus Freitag and Scott Roy. 2018. Unsupervisednatural language generation with denoising autoen-coders. In
Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing ,pages 3922–3929.Xavier Garcia, Pierre Foret, Thibault Sellam, andAnkur P Parikh. 2020. A multilingual view ofunsupervised machine translation. arXiv preprintarXiv:2002.02955 .Claire Gardent, Anastasia Shimorina, Shashi Narayan,and Laura Perez-Beltrachini. 2017. The webnlgchallenge: Generating text from rdf data. In
Pro-ceedings of the 10th International Conference onNatural Language Generation , pages 124–133.Dimitra Gkatzia. 2016. Content selection in data-to-text systems: A survey. arXiv preprintarXiv:1610.08375 .Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu,Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learn-ing for machine translation. In
Advances in neuralinformation processing systems , pages 820–828.Junxian He, Xinyi Wang, Graham Neubig, and TaylorBerg-Kirkpatrick. 2020. A probabilistic formulationof unsupervised text style transfer.
ICLR .Sepp Hochreiter and J¨urgen Schmidhuber. 1997.Long short-term memory.
Neural computation ,9(8):1735–1780.Xudong Hong, Ernie Chang, and Vera Demberg. 2019.Improving language generation from feature-richtree-structured data with relational graph convolu-tional encoders. In
Proceedings of the 2nd Work-shop on Multilingual Surface Realisation (MSR2019) , pages 75–80. Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categor-ical reparameterization with gumbel-softmax. arXivpreprint arXiv:1611.01144 .Melvin Johnson, Mike Schuster, Quoc V Le, MaximKrikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat,Fernanda Vi´egas, Martin Wattenberg, Greg Corrado,et al. 2017. Google’s multilingual neural machinetranslation system: Enabling zero-shot translation.
Transactions of the Association for ComputationalLinguistics , 5:339–351.Juraj Juraska, Panagiotis Karagiannis, Kevin Bowden,and Marilyn Walker. 2018. A deep ensemble modelwith slot alignment for sequence-to-sequence natu-ral language generation. In
Proceedings of the 2018Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long Papers) ,pages 152–162.Ravi Kondadadi, Blake Howald, and Frank Schilder.2013. A statistical nlg framework for aggregatedplanning and realization. In
Proceedings of the51st Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers) , vol-ume 1, pages 1406–1415.Guillaume Lample, Alexis Conneau, Ludovic Denoyer,and Marc’Aurelio Ranzato. 2017. Unsupervised ma-chine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043 .Guillaume Lample, Myle Ott, Alexis Conneau, Lu-dovic Denoyer, et al. 2018. Phrase-based & neu-ral unsupervised machine translation. In
Proceed-ings of the 2018 Conference on Empirical Methodsin Natural Language Processing , pages 5039–5049.R´emi Lebret, David Grangier, and Michael Auli. 2016.Neural text generation from structured data with ap-plication to the biography domain. In
Proceedingsof the 2016 Conference on Empirical Methods inNatural Language Processing , pages 1203–1213.Chris van der Lee, Emiel Krahmer, and SanderWubben. 2018. Automated learning of templates fordata-to-text generation: comparing rule-based, sta-tistical and neural methods. In
Proceedings of the11th International Conference on Natural LanguageGeneration , pages 35–45.Shuming Ma, Pengcheng Yang, Tianyu Liu, Peng Li,Jie Zhou, and Xu Sun. 2019. Key fact as pivot: Atwo-stage model for low resource table-to-text gen-eration. In
Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguistics ,pages 2047–2057.Jekaterina Novikova, Ondˇrej Duˇsek, and Verena Rieser.2017. The e2e dataset: New challenges for end-to-end generation. In
Proceedings of the 18th AnnualSIGdial Meeting on Discourse and Dialogue , pages201–206.yle Ott, Sergey Edunov, Alexei Baevski, AngelaFan, Sam Gross, Nathan Ng, David Grangier, andMichael Auli. 2019. fairseq: A fast, extensibletoolkit for sequence modeling. In
Proceedings ofNAACL-HLT 2019: Demonstrations .Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, andRaymond Ng. 2014. A template-based abstractivemeeting summarization: Leveraging summary andsource text relationships. In
Proceedings of the 8thInternational Natural Language Generation Confer-ence (INLG) , pages 45–53.Yannis Papanikolaou and Andrea Pierleoni. 2020.Dare: Data augmented relation extraction with gpt-2. arXiv preprint arXiv:2004.13845 .Baolin Peng, Chenguang Zhu, Chunyuan Li, Xi-ujun Li, Jinchao Li, Michael Zeng, and Jian-feng Gao. 2020. Few-shot natural language gen-eration for task-oriented dialog. arXiv preprintarXiv:2002.12328 .Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhut-dinov, and Alan W Black. 2018. Style transferthrough back-translation. In
Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers) , pages866–876.Raheel Qader, Franc¸ois Portet, and Cyril Labb´e. 2019.Semi-supervised neural text generation by jointlearning of natural language generation and naturallanguage understanding models. In
Proceedings ofthe 12th International Conference on Natural Lan-guage Generation , pages 552–562.Alec Radford, Jeffrey Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. 2019. Languagemodels are unsupervised multitask learners.
OpenAIBlog , 1(8).Ehud Reiter and Robert Dale. 2000.
Building naturallanguage generation systems . Cambridge universitypress.Dana Ruiter, Cristina Espa˜na-Bonet, and Josef vanGenabith. 2019. Self-supervised neural machinetranslation. In
Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguistics ,pages 1828–1834, Florence, Italy. Association forComputational Linguistics.Martin Schmitt and Hinrich Sch¨utze. 2019. Unsuper-vised text generation from structured data. arXivpreprint arXiv:1904.09447 .Rico Sennrich, Barry Haddow, and Alexandra Birch.2016. Improving neural machine translation mod-els with monolingual data. In
Proceedings of the54th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers) , vol-ume 1, pages 86–96. Xiaoyu Shen, Ernie Chang, Hui Su, Jie Zhou,and Dietrich Klakow. 2020. Neural data-to-text generation via jointly learning the seg-mentation and correspondence. In
ACL 2020
INLG , pages 233–243.Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, ErnieChang, Cheng Zhang, Cheng Niu, and Jie Zhou.2020a. Moviechats: Chat like humans in a closeddomain. In
EMNLP 2020 , pages 6605–6619.Shang-Yu Su, Chao-Wei Huang, and Yun-Nung Chen.2020b. Towards unsupervised language understand-ing and generation by joint dual learning.
ACL .Sandeep Subramanian, Guillaume Lample,Eric Michael Smith, Ludovic Denoyer,Marc’Aurelio Ranzato, and Y-Lan Boureau.2018. Multiple-attribute text style transfer. arXivpreprint arXiv:1811.00552 .Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee.2020. { LAMAL } : { LA } nguage modeling is all youneed for lifelong language learning. In InternationalConference on Learning Representations .Frederik Wiehr, Anke Hirsch, Florian Daiber, AntonioKruger, Alisa Kovtunova, Stefan Borgwardt, ErnieChang, Vera Demberg, Marcel Steinmetz, and Hoff-mann Jorg. 2020. Safe handover in mixed-initiativecontrol for cyber-physical systems. In Proceedingsof Workshop at CHI.Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforce-ment learning.
Machine learning , 8(3-4):229–256.Sam Wiseman, Stuart Shieber, and Alexander Rush.2017. Challenges in data-to-document generation.In
Proceedings of the 2017 Conference on Empiri-cal Methods in Natural Language Processing , pages2253–2263.Thomas Wolf, Lysandre Debut, Victor Sanh, JulienChaumond, Clement Delangue, Anthony Moi, Pier-ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow-icz, and Jamie Brew. 2019. Huggingface’s trans-formers: State-of-the-art natural language process-ing.
ArXiv , abs/1910.03771.Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Lu-ong, and Quoc V Le. 2019. Unsupervised data aug-mentation for consistency training. arXiv preprintarXiv:1904.12848 .Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei AEfros. 2017. Unpaired image-to-image translationusing cycle-consistent adversarial networks. In
Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232. WenLi Zhuang and Ernie Chang. 2017. Neobility at SemEval-2017 task 1: An attention-based sentence similarity model.