Jointly Improving Language Understanding and Generation with Quality-Weighted Weak Supervision of Automatic Labeling
†Ernie Chang, †Vera Demberg, *Alex Marin
†Dept. of Language Science and Technology, Saarland University
{cychang,vera}@coli.uni-saarland.de
*Microsoft Corporation, Redmond, WA
{alemari}@microsoft.com

Abstract
Neural natural language generation (NLG) and understanding (NLU) models are data-hungry and require massive amounts of annotated data to be competitive. Recent frameworks address this bottleneck with generative models that synthesize weak labels at scale, where a small amount of training labels are expert-curated and the rest of the data is automatically annotated. We follow that approach by automatically constructing a large-scale weakly-labeled dataset with a fine-tuned GPT-2, and we employ a semi-supervised framework to jointly train the NLG and NLU models. The proposed framework adapts the parameter updates to the models according to the estimated label quality. On both the E2E and Weather benchmarks, we show that this weakly supervised training paradigm is an effective approach under low-resource scenarios with as few as 10 data instances, and that it outperforms benchmark systems on both datasets when 100% of the training data is used.

1 Introduction

Natural language generation (NLG) is the task of transforming meaning representations (MR) into natural language descriptions (Reiter and Dale, 2000; Barzilay and Lapata, 2005), while natural language understanding (NLU) is the opposite process, where text is converted into MR (Zhang and Wang, 2016). These two processes can thus constrain each other: recent exploration of the duality of neural NLG and NLU has led to successful semi-supervised learning techniques where both labeled and unlabeled data can be used for training (Su et al., 2020b; Tseng et al., 2020; Schmitt and Schütze, 2019; Qader et al., 2019). Standard supervised learning for NLG and NLU depends on access to labeled training data, a major bottleneck in developing new applications.
Figure 1: Training scenario: each × represents a labeled data instance. The goal is to learn both from few human-labeled instances (inner) and large amounts of weakly labeled data (outer).

In particular, neural methods require a large annotated dataset for each specific task. The collection process is often prohibitively expensive, especially when specialized domain expertise is required. On the other hand, learning with weak supervision from noisy labels offers a potential solution, as it automatically builds imperfect training sets from low-cost labeling rules or pretrained models (Zhou, 2018; Ratner et al., 2017; Fries et al., 2020). Further, labeled data and large unlabeled data can be utilized in semi-supervised learning (Lample et al., 2017; Tseng et al., 2020) as a way to jointly improve both NLU and NLG models.

To this end, we target a weak supervision scenario (shown in Figure 1) consisting of small, high-quality expert-labeled data and a large set of unlabeled MR instances. We propose to expand the labeled data by automatically annotating the MR samples with noisy text labels. These noisy text labels are generated by a weak annotator, which is built upon recent works that directly fine-tune GPT-2 (Radford et al., 2019) on joint meaning representation (MR) and text (Mager et al., 2020; Harkous et al., 2020). Then, we jointly train the NLG and NLU models in a two-step process with semi-supervised learning objectives (Tseng et al., 2020). First, we use pretrained models to estimate quality scores for each sample. Then, we down-weight the loss updates in the back-propagation phase using the estimated quality scores. This way, the models are guided to avoid the mistakes of the weak annotator. On two benchmarks, E2E (Novikova et al., 2017b) and Weather (Balakrishnan et al., 2019), we utilize varying amounts of labeled data and show that the framework is able to successfully learn from the synthetic data generated by the weak annotator, thereby allowing jointly-trained NLG and NLU models to outperform other baseline systems.

This work makes the following contributions:

1. We propose an automatic method to overcome the lack of text labels by using a fine-tuned language model as a weak annotator to construct text labels for the vast amount of MR samples, resulting in a much larger labeled dataset.

2. We propose an effective two-step weak supervision using the dual mutual information (DMI) measure, which can be used to modulate parameter updates on the weakly labeled data by providing quality estimates.

3. We show that the approach can even be used to improve upon baselines with 100% data to establish new state-of-the-art performance.

2 Related Work

Learning with Weak Supervision.
Learning with weak supervision is a well-studied area that has been popularized by the rise of data-driven neural approaches (Ratner et al., 2017; Safranchik et al., 2020; Bach et al., 2017; Wu et al., 2018; Dehghani et al., 2018; Jiang et al., 2018; Chang et al., 2020a; de Souza et al., 2018). Our approach follows a similar line of work: we provide noisy labels (text) with a fine-tuned LM that incorporates prior knowledge from general-domain text and data-text pairs (Budzianowski and Vulić, 2019; Chen et al., 2020; Peng et al., 2020; Mager et al., 2020; Harkous et al., 2020; Shen et al., 2020; Chang et al., 2020b, 2021b,a), and use it as the weak annotator, similar in functionality to fidelity-weighted learning (Dehghani et al., 2017) or the data creation tool Snorkel (Ratner et al., 2017).
Learning with Semi-Supervision.
Work on semi-supervised learning considers settings with some labeled data and a much larger set of unlabeled data, and then leverages both labeled and unlabeled data, as in machine translation (Artetxe et al., 2017; Lample et al., 2017), data-to-text generation (Schmitt and Schütze, 2019; Qader et al., 2019), or, most relevantly, the joint learning framework for training NLU and NLG (Tseng et al., 2020; Su et al., 2020b). Nonetheless, these approaches all assume that a large collection of text is available, which is an unrealistic assumption for our task due to the need for expert curation. In our work, we show that both NLU and NLG models can benefit from (1) automatically labeling MRs with text, and (2) learning from these samples in a semi-supervised manner while accounting for their quality.

3 Problem Formulation
We represent the set of meaning representations (MR) as X and the text samples as Y. There are no restrictions on the format of the MR: each x ∈ X can be a set of slot-value pairs, or can take the form of tree-structured semantic definitions as in Balakrishnan et al. (2019). Each text y ∈ Y consists of a sequence of words.

In our setting, we have (1) k labeled pairs and (2) a large quantity of unlabeled MRs X_U, where |X_U| ≫ k > 0. (We force k > 0 as we believe a reasonable generation system needs at least a few demonstrations of the annotation.) This is a realistic setting for novel application domains, as unlabeled MRs are usually abundant and can also be easily constructed from predefined schemata. Notably, we assume no access to outside resources containing in-domain text: the k annotations are all we know about in-domain text.

The core of our approach consists of first labeling MR samples with text, and then training on the expanded dataset. We start by describing the process of creating weakly labeled data (§4). Next, we delve into the semi-supervised training objectives for the NLU and NLG models, which allow the models to learn from labeled and unlabeled data (§5). Lastly, we explain the training process where the NLG and NLU models are jointly optimized in two steps: in step 1, we pretrain the models on the weakly-labeled corpus; in step 2, we continue updating the models on the combined data consisting of the weak and real data. Importantly, to account for the noise that comes with the automatic weak annotation, step 2 trains the models with quality-weighted updates (§6). We depict this process in Figure 2.
Figure 2: Depiction of the proposed framework. In joint learning, gradients are back-propagated through solid lines.
4 Creating Weakly Labeled Data

We construct synthetic data in two ways: (1) by creating more MR samples (§4.1), and (2) by creating a larger parallel set of MRs with texts (§4.2).
4.1 MR Augmentation

We consider a simple way of MR augmentation via value swapping. This creates more unlabeled MRs to be annotated by the weak annotator, and also provides a substantial augmentation that benefits the autoencoding on MR samples (see Equation 3) by exposing it to a larger set of MRs.

Figure 3: Depiction of MR augmentation in the E2E corpus.
Since each slot in the MR samples corresponds to multiple possible values, we pair each slot with a randomly sampled value collected from the set of all MR samples to obtain new combinations of slot-value pairs. This way, we create a large synthetic MR set.
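The following minimal sketch illustrates this value-swapping augmentation. The dict-based MR format and the helper names are illustrative assumptions, not the released implementation.

import random

def collect_slot_values(mrs):
    """Pool every observed value for each slot across all MR samples."""
    pool = {}
    for mr in mrs:
        for slot, value in mr.items():
            pool.setdefault(slot, set()).add(value)
    return {slot: sorted(values) for slot, values in pool.items()}

def augment_mr(mr, pool, rng=random):
    """Create a new MR by re-sampling each slot's value from the global pool."""
    return {slot: rng.choice(pool[slot]) for slot in mr}

mrs = [
    {"name": "Blue Spice", "eat_type": "coffee shop", "area": "city centre"},
    {"name": "Giraffe", "eat_type": "pub", "area": "riverside"},
]
pool = collect_slot_values(mrs)
synthetic_mrs = [augment_mr(mr, pool) for mr in mrs for _ in range(3)]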
4.2 Weak Annotation with GPT-2

GPT-2 (Radford et al., 2019) is a powerful language model pretrained on the large WebText corpus. Recent work on conditional data-to-text generation (Harkous et al., 2020; Mager et al., 2020) demonstrated that fine-tuning GPT-2 on the joint distribution of MR and text yields impressive performance for text generation.

The fine-tuned model generates in-domain text by conditioning on samples from the augmented MR set (X_U). Rather than using GPT-2 outputs directly, we employ them in a process analogous to knowledge distillation (Tan et al., 2018; Tang et al., 2019; Baziotis et al., 2020), where the fine-tuned GPT-2 provides supervisory signals instead of being used directly for generation.

We now describe the process of GPT-2 fine-tuning. Given the sequential MR representation x_1 · · · x_M and a sentence y_1 · · · y_N in the labeled dataset (X_L, Y_L), we maximize the joint probability p_GPT-2(X_L, Y_L), where each sequence is concatenated into "[MR] x_1 · · · x_M [TEXT] y_1 · · · y_N". In addition, following Mager et al. (2020), we freeze the input embeddings during fine-tuning, which had a positive impact on performance. At test time, we provide the MR samples as context, as in conventional conditional text generation:

ỹ_j = argmax_{y_j} p_GPT-2(y_j | y_{<j}, x_1 · · · x_M)

The fine-tuned LM conditions on the augmented MR sample set X_U to generate the in-domain text, forming the weak label dataset D_W = (X_U, Ỹ_L) with noisy labels ỹ_i ∈ Ỹ_L. (We adopt top-k random sampling with k = 2 to encourage diversity and reduce repetition (Radford et al., 2019).) In practice, the fine-tuned LM produces malformed synthetic text that does not fully match the MR it was conditioned on, as it might hallucinate additional values not consistent with its MR counterpart. Thus, it is necessary to check for factual consistency (Moryossef, et al., 2019). We address this point next.

Past findings (e.g., Wang, 2019) showed that the removal of utterances with "hallucinated" facts (MR values) leads to considerable performance gains, since inconsistent MR-text correspondences might misguide systems to generate incorrect facts and deteriorate the NLG outputs. We filter out the synthetic, poor-quality MR-text pairs by training a separate NLU model on the original labeled data to predict MRs from the generated text labels. These MRs can then be checked against the paired MRs in D_W via pattern matching, as inspired by Cai and Knight (2013) and Wiseman et al. (2017). Specifically, we use a measure of semantic similarity, an f-score over the matching of slots between the two MRs. We keep all MR-text pairs with f-scores above a fixed threshold, as we found empirically that this criterion retains a sufficiently large amount of high-quality data. The removed text sentences are still used for the unsupervised training objectives in Eq. 1-3. Using this method, we create a collection of parallel MR-text samples (~40k for E2E and ~25k for Weather).
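The sketch below illustrates two pieces of this pipeline under assumed formats: the "[MR] ... [TEXT] ..." linearization used for fine-tuning, and the slot-matching f-score used to filter inconsistent weak pairs. The 0.7 threshold is an illustrative placeholder, not the paper's exact value.

def linearize(mr: dict, text: str) -> str:
    """Concatenate an (MR, text) pair into one GPT-2 training string."""
    mr_str = ", ".join(f"{slot}={value}" for slot, value in mr.items())
    return f"[MR] {mr_str} [TEXT] {text}"

def slot_f1(predicted_mr: dict, reference_mr: dict) -> float:
    """F1 over exact slot-value matches between two MRs."""
    pred, ref = set(predicted_mr.items()), set(reference_mr.items())
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision, recall = tp / len(pred), tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def filter_weak_pairs(pairs, nlu_parse, threshold=0.7):
    """Keep a weak (mr, text) pair only if the NLU parse of the generated
    text agrees with the conditioning MR; rejected texts are still kept
    for the unsupervised objectives (Eq. 1-3)."""
    kept, removed_texts = [], []
    for mr, text in pairs:
        if slot_f1(nlu_parse(text), mr) >= threshold:
            kept.append((mr, text))
        else:
            removed_texts.append(text)
    return kept, removed_texts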
5 Semi-Supervised Joint Learning

For both NLU and NLG models, we adopt the same architecture as Tseng et al. (2020), which uses a Bi-LSTM-based (Hochreiter and Schmidhuber, 1997) encoder for each model. The NLU decoder for slot-value structured data (e.g., E2E; Mrkšić et al., 2017) contains a 1-layer feed-forward classifier per slot, while for the tree-structured meaning representations of Balakrishnan et al. (2019), the decoder is LSTM-based. In this framework, both NLU and NLG models are trained to infer the shared latent variable repeatedly, starting from either MR or text, in order to encourage semantic consistency. Each model can be improved via gradient passing between them using REINFORCE (Williams, 1992). This way, the models benefit from each other's training in a process known as dual learning (Su et al., 2020b), which consists of both unsupervised and supervised learning objectives. We now describe them in detail.
Unsupervised Learning.
Starting from either an MR sample or a text sample, the models project the sample from one space into the other, then map it back to the original space (either MR or text, respectively), and compute the reconstruction loss after the two operations. This repetition results in aligned pairs between the MR samples and corresponding text (He et al., 2016). Specifically, let p_θ(y | x) be the probability distribution mapping x to its corresponding y (NLG), and p_φ(x | y) be the probability distribution mapping y back to x (NLU). Starting from x ∈ X, the objective is:

max_φ E_{x∼p(X)} log p_φ(x | y′),  y′ ∼ p_θ(y | x)    (1)

which ensures semantic consistency by first performing NLG followed by NLU in the direction x → y′ → x. Note that only p_φ is updated in this direction; p_θ serves only as an auxiliary function providing pseudo samples y′ from x. Similarly, starting from y ∈ Y, the objective ensures semantic consistency in the direction where the NLU step is followed by NLG, y → x′ → y (this direction is usually termed back-translation in the MT community (Sennrich et al., 2016; Lample et al., 2018)):

max_θ E_{y∼p(Y)} log p_θ(y | x′),  x′ ∼ p_φ(x | y)    (2)

We further add two autoencoding objectives on both MR and text samples:

max_{θ,φ} E_{x∼p(X), y∼p(Y)} log p_φ(x | x) p_θ(y | y)    (3)

Unlabeled text samples can thus be used, as they are shown to benefit the text space (Y) by introducing new signals into the learning directions y → x′ → y and ỹ → y. We therefore use all in-domain text data, whether they have a corresponding MR or not. Note that, following Tseng et al. (2020), we also adopt the variational optimization objective upon the latent variable z, which was shown to pull the inferred posteriors q(z | x) and q(z | y) closer to each other. In this case, the parameters of both the NLG and NLU models are updated.

Supervised Learning.
Apart from the above unsupervised objectives, we impose a supervised objective on the k labeled pairs:

max_{θ,φ} E_{(x,y)∼p(X_L,Y_L)} log p_θ(y | x) + log p_φ(x | y)    (4)

Each MR is flattened into a sequence and fed into the NLG encoder, giving the NLG and NLU models an inductive bias to project similar MRs/texts into the surrounding latent space (Chisholm et al., 2017). As we observed anecdotally (Tseng et al. (2020) noticed a similar trend in their experiments), the information flow enabled by REINFORCE allows the models to utilize unlabeled MRs and text, boosting performance in our scenarios.
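The schematic sketch below summarizes how Eqs. 1-4 combine. The nlg/nlu objects and their sample/log_prob interfaces are assumptions for illustration, and the REINFORCE gradient passing described above is replaced by a simple stop-gradient for brevity.

import torch

def cycle_losses(nlg, nlu, mr_batch, text_batch):
    # Eq. 1: x -> y' -> x. Only the NLU parameters (phi) receive gradients;
    # the paper propagates through the sampling step with REINFORCE, which
    # this sketch omits.
    with torch.no_grad():
        pseudo_text = nlg.sample(mr_batch)          # y' ~ p_theta(y | x)
    loss_eq1 = -nlu.log_prob(mr_batch, given=pseudo_text).mean()

    # Eq. 2: y -> x' -> y. Only the NLG parameters (theta) receive gradients.
    with torch.no_grad():
        pseudo_mr = nlu.sample(text_batch)          # x' ~ p_phi(x | y)
    loss_eq2 = -nlg.log_prob(text_batch, given=pseudo_mr).mean()

    # Eq. 3: autoencoding objectives on both the MR and the text space.
    loss_eq3 = -(nlu.log_prob(mr_batch, given=mr_batch).mean()
                 + nlg.log_prob(text_batch, given=text_batch).mean())
    return loss_eq1 + loss_eq2 + loss_eq3

def supervised_loss(nlg, nlu, mr_batch, text_batch):
    # Eq. 4 on the k labeled pairs.
    return -(nlg.log_prob(text_batch, given=mr_batch).mean()
             + nlu.log_prob(mr_batch, given=text_batch).mean())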
6 Learning with Weak Supervision
The primary challenge that arises from the synthetic data is the noise introduced during the generation process. Noisy, poor-quality labels tend to bring little to no improvement (Elman, 1993; Frénay and Verleysen, 2013). To better train on the large and noisy corpus described in §4, we adopt a two-step process: (1) pretraining and (2) quality-weighted fine-tuning to account for the heterogeneous data quality.

Step 1:
Pre-train two sets of models on weak and clean data, respectively. We train the first set of models (teacher), consisting of NLU, NLG, and autoencoder (AUTO) models, on the clean data. The second set of models (NLU and NLG) is the student, which pretrains on the weak data.
Step 2:
Fine-tune the student model parameters on the combined clean and weak datasets.
We use each teacher model to determine the step size for each iteration of stochastic gradient descent (SGD) by down-weighting the training step of the corresponding student model using the sample quality given by the teacher. Data points with true labels will have high quality, and thus will be given a larger step size when updating the parameters; conversely, we down-weight the training steps of the student for data points where the teacher is not confident. For this fine-tuning process, we update the parameters of the student (the NLG and NLU models) at time t by training with SGD, where L(·) is the loss of predicting ŷ for an input x_i when the label is ỹ_i. The weighted step is then c(x_i, ỹ_i) ∇L(ŷ, ỹ_i), where c(·) is a scoring function learned by the teacher, taking as input the MR x_i and its noisy text label ỹ_i. In essence, we control the degree of parameter updates to the student based on how reliable its labels are according to the teacher.

We define c(·) as a function of label quality based on the dual mutual information (DMI), defined as the absolute difference between the mutual information (MI) in the inference directions x → y and y → x. Bugliarello et al. (2020) show that MI_{x→y} correlates with the difficulty of predicting y from x, and vice versa; the mutual information for x → y can be seen as H(x → y) = H_AUTO(y) − H_NLG(y | x) (Bugliarello et al., 2020). Thus we expect the difference between MI_{x→y} and MI_{y→x} for a clean sample (x, y) to be relatively small compared to noisy samples, since the level of difficulty is largely proportional between NLU and NLG on the same sample: difficulty in inferring x from y will result in harder prediction of y from x. Based on this intuition, the DMI score of the sample (x, y) is defined as:

DMI(x, y) = exp{ | MI_{x→y} − MI_{y→x} | },
where MI_{x→y} = log [ q_AUTO(y) / q_NLG(y | x) ] and MI_{y→x} = log [ q_AUTO(x) / q_NLU(x | y) ]

and q(·) are the respective teacher models. The DMI for a clean MR-text pair should be relatively small, as the two sides contain proportional semantic information; poor-quality samples tend to have higher DMI scores and lower c(·), as they are less semantically aligned. (In practice, we found that the mutual information for x → y is usually greater than that of y → x, since NLG is a one-to-many and more difficult process compared to NLU.) Thus, c(·) defines the confidence (quality) the teacher has about the current MR-text sample. We use c(·) to scale the learning rate η_t. Note that η_t does not depend on each data point, whereas c(·) does. We define c(x_t, y_t) as:

c(x_t, y_t) = 1 − N(DMI(x_t, y_t))

where N(·) normalizes DMI over all samples in both the clean and the weak data to lie in [0, 1].
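A minimal sketch of this DMI-based weighting follows, assuming teacher models that expose (length-normalized) log-likelihood functions; all function names are illustrative, not the authors' released code.

import math

def dmi(x, y, log_q_auto_mr, log_q_auto_text, log_q_nlg, log_q_nlu):
    """DMI(x, y) = exp(|MI_{x->y} - MI_{y->x}|) from teacher log-likelihoods."""
    mi_x_to_y = log_q_auto_text(y) - log_q_nlg(y, x)  # log q_AUTO(y) - log q_NLG(y|x)
    mi_y_to_x = log_q_auto_mr(x) - log_q_nlu(x, y)    # log q_AUTO(x) - log q_NLU(x|y)
    return math.exp(abs(mi_x_to_y - mi_y_to_x))

def quality_weights(samples, teacher_fns):
    """c(x, y) = 1 - N(DMI(x, y)), with N a min-max normalization over all
    samples from both the clean and the weak data."""
    scores = [dmi(x, y, *teacher_fns) for x, y in samples]
    lo, hi = min(scores), max(scores)
    return [1.0 - (s - lo) / (hi - lo + 1e-8) for s in scores]

# In step 2, each SGD update on a sample (x_i, y_i) is scaled by its weight:
#   theta <- theta - eta_t * c(x_i, y_i) * grad L(y_hat, y_i)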
7 Experiments

Data. We conduct experiments on the Weather (Balakrishnan et al., 2019) and E2E (Novikova et al., 2017b) datasets. Weather contains 25k instances with tree-structured annotations. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain; its inputs are dialogue acts consisting of three to eight slot-value pairs.
Configurations.
Both NLU and NLG models are implemented in PyTorch (Paszke et al., 2019) with 2 Bi-LSTM layers and token embeddings, trained with the Adam optimizer (Kingma and Ba, 2014). Batch size is kept fixed, and we employ beam search for decoding. Scores are averaged over 10 random initialization runs. In our implementation, the sequence-to-sequence models are built upon the bi-directional long short-term memory (Bi-LSTM) (Hochreiter and Schmidhuber, 1997); both the encoder and decoder have 2 LSTM layers, amounting to 18M parameters for the seq2seq model. All models were trained on 1 Nvidia V100 GPU (32GB, CUDA version 10.2) for 10k steps. The average training time for the seq2seq model was approximately 1 hour, and roughly 2 hours for the proposed semi-supervised training with 100% data. The total number of updates is set to 10k steps for all training, and patience for early stopping is defined in update steps. At decoding time, sentences are generated using greedy decoding.

Model            E2E (NLG)                           E2E (NLU)
                 10    50    1%    5%    50%         10    50    1%    5%    50%
WA               0.195 0.287 0.563 0.649 0.714       9.48  11.66 13.20 45.21 65.81
JUG*
decoupled        0.261 0.279 0.648 0.693 0.793       0.00  0.00  20.51 52.77 73.68
joint            0.218 0.336 0.732 0.764 0.775       0.00  6.18  24.98 49.66 70.33
joint+aug        0.275 0.381 0.748 0.781 0.797       5.88  15.79 25.15 53.20 69.68
step 1           0.441 0.487 0.610 0.642 0.685       13.18 14.28 15.37 44.72 65.20
Ours (step 1+2)

Model            Weather (NLG)                       Weather (NLU)
                 10    50    1%    5%    50%         10    50    1%    5%    50%
WA
JUG*
decoupled        0.250 0.288 0.598 0.632 0.719       0.00  28.21 70.24 73.46 88.45
joint            0.270 0.348 0.577 0.639 0.658       0.00  24.52 64.30 69.92 86.86
joint+aug        0.329 0.361 0.589 0.662 0.671       4.21  26.33 67.43 71.19 87.10
step 1           0.371 0.429 0.570 0.607 0.632       12.19 35.89 72.90 72.01 84.73
Ours (step 1+2)
Table 1: Performance for NLG (BLEU-4) and NLU (joint accuracy (%)) on the E2E and Weather datasets with increasing amounts of labeled data, from 10 and 50 labeled instances to 1%, 5%, and 100% of the labeled data (D_L). Models that have access to unlabeled ground-truth text labels are marked with *. We provide results for the NLG and NLU models trained separately using supervised objectives alone (decoupled), our semi-supervised joint-learning model (joint), joint with all unlabeled data (joint+aug), and weakly-supervised models (step 1). Step 1+2 denotes the full proposed approach.

Model            D_L  D_W  X_U  Y_SL  Y_WL
JUG              ✓    ✗    ✓    ✓     ✗
WA               ✓    ✗    ✗    ✗     ✗
decoupled        ✓    ✗    ✗    ✗     ✗
joint            ✓    ✗    ✓    ✗     ✗
joint+aug        ✓    ✗    ✓    ✗     ✓
step 1           ✗    ✓    ✓    ✗     ✓
Ours (step 1+2)  ✓    ✓    ✓    ✗     ✓
Table 2: Summary of training data used in each model. Sources of data include labeled data (D_L), unlabeled MR (X_U), weakly labeled data (D_W), 100% real text (Y_SL), and weak text labels (Y_WL).

We first compare our model with the other baselines on both datasets, then perform a set of ablation studies on the E2E dataset to see the effects of each component. Finally, we analyze the strength of the weak annotator and the effect of the quality-weighted weak supervision, before concluding with an analysis of the dual mutual information.
E2E                                          NLG BLEU-4
TGEN (Dušek and Jurcicek, 2016)              0.6593
SLUG (Juraska et al., 2018)                  0.6619
Dual supervised learning (Su et al., 2019)   0.5716
JUG (Tseng et al., 2020)                     0.6855
GPT2-FT (Chen et al., 2020)                  0.6562
WA (Harkous et al., 2020)                    0.6445
Ours (step 1+2)

Weather                                      NLG BLEU-4
S2S-CONSTR (Balakrishnan et al., 2019)       0.7660
JUG (Tseng et al., 2020)                     0.7768
Ours (step 1+2)
Table 3: For comparison, we show the performance of previous systems on the datasets following the original split, so the scores are not comparable to Table 1.
In particular, we experiment with various low-resource conditions of the training set (10 instances, 50 instances, 1% of all data, 5% of all data). To show that our proposed approach is consistently better, we also include the scenario of 0-100% of the data at 10% intervals, where performance does not deteriorate as more training samples are added (Figure 4). Table 2 summarizes the training data used for all models in Table 1. We compare our model with (1) a fine-tuned GPT-2 model (GPT2-FT) that uses a switch mechanism to select between the input and GPT-2 knowledge (Chen et al., 2020; https://github.com/czyssrs/Few-Shot-NLG); (2) a fine-tuning approach used as the weak annotator (WA), which predicts text from MR or MR from text depending on the input format during fine-tuning (Harkous et al., 2020; no source code was released, so we re-implemented it based on the paper); and (3) the semi-supervised model (JUG) of Tseng et al. (2020) (https://github.com/andy194673/Joint-NLU-NLG). Note that the specialized encoder in GPT2-FT cannot be easily adapted to the tree-structured input in Weather, so we do not provide its score on the Weather dataset.
Figure 4: Model performance (BLEU-4) on E2E with varying percentages of strong and weak data, with and without DMI-based quality weighting. The left plot begins with models trained on the labeled data, while the right plot starts with the weak synthesized dataset instead.
Method           NLU: Miss  Redundant  Wrong    NLG: Fluency  Miss  Wrong
decoupled        72         78         87       4.10          69    73
JUG              65                    75       4.23          64    65
Ours (step 1+2)             68                  4.50          63    61

Table 4: Human evaluation on the sampled E2E outputs (100 instances) for models with 1% training data. Numbers of missing, redundant, and wrong predictions on slot-value pairs are reported for NLU; fluency and numbers of missing or wrong generated slot values are listed for NLG.

In Table 1, we show that our proposed approach (step 1+2) generally performs better than the baselines on both tasks (NLG and NLU) for most of the selected labeled data sizes. Even with only 10 labeled instances, our approach (step 1+2) is able to yield decent results compared to the baselines. The difference between models tends to be larger in settings with few training instances, and the advantage of the method diminishes as the amount of labeled data available to JUG increases, to the point where JUG is able to outperform the proposed approach. Overall, the noisy supervisory signal from the weak data is able to boost performance, especially in lower-resource conditions.

We observe that training with weakly labeled data alone (step 1) is not sufficient; strong data is required to provide the necessary supervisory signals (step 2). Further, the fact that joint+aug displays noticeable improvements over joint suggests that simply having augmented text helps to improve the encoded latent space as projected by both the NLU and NLG encoders. This also shows an alternative way to introduce additional in-domain information to both models, even though the NLU model does not benefit directly from additional text. Importantly, our approach shows that the weak annotator is able to bridge the gap defined by the access to ground-truth text labels in JUG, outperforming it significantly in low-resource conditions (10, 50, 1%, 5%), with the difference in NLG being as large as 48.7 BLEU points with 10 instances. We find that the proposed model also performs well in the high-resource (100% of labeled data) condition, as shown in Table 3. Moreover, with 100% labeled data, our model is still able to produce superior performance over some of the baselines, which shows that weak annotation does capture additional useful patterns that benefit the NLG process.

Error Analysis.
Since word-level overlap scores usually correlate rather poorly with human judgments of fluency and information accuracy (Reiter and Belz, 2009; Novikova et al., 2017a), we perform a human evaluation on 100 sampled generation outputs from the E2E corpus. For each MR-text pair, the annotator is instructed to evaluate fluency (score 1-5, with 5 being most fluent), miss (count of MR slots that were missed), and wrong (count of included slots not in the MR); the results are presented in Table 4, where fluency scores are averaged over 50 crowdworkers. We show that with 1% of the data, both NLU and NLG models yield significantly fewer errors in terms of misses and wrong facts, while producing more fluent outputs. However, our model generates more redundant slot-value pairs, which we attribute to the noisy augmentation that "misguided" the NLU model.
Figure 5: Visualization of dual mutual information (DMI) on both datasets, where × markers are random samples from the annotated data and ◦ markers are random samples from the weak dataset. Dotted lines are trend lines for the ◦ markers, and solid black lines are the diagonal reference corresponding to the perfect NLG-NLU balance where both tasks have equal difficulty.

Method   NLG: BLEU-4 (Accuracy (%))   NLU: Accuracy (%) (F1)
DF
WS
WS+CW
Table 5: Ablation study of weak supervision (1% E2E labeled data D_L), including data fidelity (DF), the proposed model (step 1+2) with weak supervision (WS), and WS with quality-weighted weak supervision (WS+CW).

How Strong is the Weak Annotator?
To assess the strength of the weak annotator (WA) itself, we also computed its NLG scores with varying amounts of labeled data (see Table 1). We observe that the WA suffers a performance drop in lower-resource conditions (i.e., 0.195 BLEU with 10 labeled instances), when the given training samples are not sufficient for the pretrained model to converge upon a region of in-domain generation. However, it yields quality data when conditioned on a large number of possible MRs (i.e., 100% data), forming a useful in-domain text set (see Table 6).

Analysis of Weak Supervision.
In Table 5, we present the results of an ablation study on weak supervision (see §6). The effect of data fidelity is stronger on NLU than on NLG, owing to the nature of the filtering process, which removes faulty text labels and thereby influences both the x → y and y → y training directions. Next, though weak supervision boosts the model by giving direct supervision in the training directions x → y and y → x, the noisy nature of the augmentation limits its effectiveness. The model is further improved with the proposed quality-weighted update, which takes the sample quality into account and alleviates the influence of poor-quality samples. Refer to Table 7 for an output comparison.

Analysis of the Two-Step Training Process.
Inspired by Dehghani et al. (2018), we justify the two-step training process by performing two types of experiments (see Figure 4): in the first experiment, we use all the available strong data but consider different ratios of the entire weak dataset, as used in our two-step approach; in the second, we fix the amount of weak data and provide the model with varying amounts of strong data. The results show that the student models are generally better off having the teacher's supervision. Further, pretraining on weak data prior to fine-tuning on strong data appears to be the better approach, which motivates the reasoning behind our two-step design.

Analysis of the Dual Mutual Information.
Figure 5 depicts DMI, with MI_{x→y} on the x-axis and MI_{y→x} on the y-axis; randomly sampled noisy and ground-truth samples are plotted for both datasets. On the plot, the diagonal reference represents the scenario in which NLG and NLU inference are equally difficult, and we see that annotated data cluster more around this diagonal reference. This means that expert-labeled samples' DMI scores tend to be smaller: NLU and NLG inference for these samples carry similar levels of difficulty. Importantly, since DMI scores are normalized over both clean and noisy samples, the proximity of data points to the trend lines can be used to estimate sample quality, with clean data lying closer than the noisy samples. Thus clean data will have smaller normalized scores, higher c(·), and a larger update step. This further supports the use of the proposed sample-quality-based updates on the parameters.

mr                   [name] Giraffe, [eat type] pub, [area] riverside
synthetic reference  Giraffe is a pub in the riverside of the city just down the street.
mr                   [name] Strada, [eat type] restaurant, [food] Italian, [area] city centre, [family friendly] no, [near] Avalon
synthetic reference  Strada is an Italian restaurant not for the families! it is near Avalon in the city centre.
mr                   [name] Cocum, [eat type] restaurant, [food] French, [area] riverside, [family friendly] no, [near] Raja Indian Cuisine
synthetic reference  Cocum sells French food near Raja Indian Cuisine.
Table 6: Display of weakly-labeled data samples.

mr         [name] Blue Spice, [eat type] coffee shop, [area] city centre
step 1+2   Blue Spice is a coffee shop in the city centre that of the city.
JUG        Blue Spice serves Italian food and is family friendly.
decoupled  Blue Spice is an adult Italian coffee shop with high customer rating located in
Table 7: Display of text generations from different models.
10 Conclusion and Future Work
In this paper, we show the efficacy of a framework in which data is automatically labeled and both NLU and NLG models learn with quality-weighted weak supervision so as to account for individual data quality. Most importantly, we show that not only is the two-step training process useful in improving the model, it also yields text of decent quality. This work serves as a starting point for weakly-supervised learning in natural language generation, especially for topics related to instance-based weighting approaches.

For future work, we hope to extend the framework and propose ways in which it can be incorporated into existing text annotation systems. Specifically, we would like to see its effectiveness in human-in-the-loop settings (Crook et al., 2018; Hong et al., 2019; de Souza et al., 2018; Zhuang and Chang, 2017; Chang et al., 2020a; Wiehr et al., 2020; Shen et al., 2020; Chang et al., 2020b; Su et al., 2020a; Chang et al., 2021b,a), where the quality estimation metrics can take signals from human feedback into account.
Acknowledgements
This research was funded in part by the German Research Foundation (DFG) as part of SFB 248 "Foundations of Perspicuous Software Systems". We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.
References
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Stephen H. Bach, Bryan He, Alexander Ratner, and Christopher Ré. 2017. Learning the structure of generative models without labeled data. Proceedings of Machine Learning Research, 70:273.

Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani, Michael White, and Rajen Subba. 2019. Constrained decoding for neural NLG from compositional representations in task-oriented dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 831-844.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 141-148, Ann Arbor, Michigan. Association for Computational Linguistics.

Christos Baziotis, Barry Haddow, and Alexandra Birch. 2020. Language model prior for low-resource neural machine translation. arXiv preprint arXiv:2004.14928.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15-22.

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It's easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. arXiv preprint arXiv:2005.02354.

Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 748-752.

Ernie Chang, David Adelani, Xiaoyu Shen, and Vera Demberg. 2020a. Unsupervised pidgin text generation by pivoting English data and self-training. In Proceedings of Workshop at ICLR.

Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen, and Vera Demberg. 2020b. DART: A lightweight quality-suggestive data-to-text annotation tool. In COLING 2020, pages 12-17.

Ernie Chang, Xiaoyu Shen, Dawei Zhu, Vera Demberg, and Hui Su. 2021a. Neural data-to-text generation with LM-based text augmentation. EACL 2021.

Ernie Chang, Hui-Syuan Yeh, and Vera Demberg. 2021b. Does the order of training samples matter? Improving neural data-to-text generation with curriculum learning. In EACL 2021.

Zhiyu Chen, Harini Eavani, Yinyin Liu, and William Yang Wang. 2020. Few-shot NLG with pre-trained language model. ACL.

Andrew Chisholm, Will Radford, and Ben Hachey. 2017. Learning to generate one-sentence biographies from Wikidata. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 633-642.

Paul A. Crook, Alex Marin, Vipul Agarwal, Samantha Anderson, Ohyoung Jang, Aliasgar Lanewala, Karthik Tangirala, and Imed Zitouni. 2018. Conversational semantic search: Looking beyond web search, Q&A and dialog systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 763-766.

Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. 2017. Fidelity-weighted learning. arXiv preprint arXiv:1711.02799.

Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. 2018. Fidelity-weighted learning. In International Conference on Learning Representations.

Ondřej Dušek and Filip Jurcicek. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 45-51.

Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71-99.

Benoît Frénay and Michel Verleysen. 2013. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845-869.

Jason A. Fries, Ethan Steinberg, Saelig Khattar, Scott L. Fleming, Jose Posada, Alison Callahan, and Nigam H. Shah. 2020. Trove: Ontology-driven weak supervision for medical entity classification. arXiv preprint arXiv:2008.01972.

Hamza Harkous, Isabel Groves, and Amir Saffari. 2020. Have your text and use it too! End-to-end neural data-to-text generation with semantic fidelity. arXiv preprint arXiv:2004.06577.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820-828.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Xudong Hong, Ernie Chang, and Vera Demberg. 2019. Improving language generation from feature-rich tree-structured data with relational graph convolutional encoders. In Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), pages 75-80.

Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pages 2304-2313.

Juraj Juraska, Panagiotis Karagiannis, Kevin Bowden, and Marilyn Walker. 2018. A deep ensemble model with slot alignment for sequence-to-sequence natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 152-162.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, et al. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039-5049.

Manuel Mager, Ramón Fernandez Astudillo, Tahira Naseem, Md Arafat Sultan, Young-Suk Lee, Radu Florian, and Salim Roukos. 2020. GPT-too: A language-model-first approach for AMR-to-text generation. arXiv preprint arXiv:2005.09123.

Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267-2277.

Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777-1788, Vancouver, Canada. Association for Computational Linguistics.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017a. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241-2252.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017b. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201-206.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024-8035.

Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.

Raheel Qader, François Portet, and Cyril Labbé. 2019. Semi-supervised neural text generation by joint learning of natural language generation and natural language understanding models. In Proceedings of the 12th International Conference on Natural Language Generation, pages 552-562.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, page 269. NIH Public Access.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529-558.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Esteban Safranchik, Shiying Luo, and Stephen H. Bach. 2020. Weakly supervised sequence tagging from noisy rules. In AAAI, pages 5570-5578.

Martin Schmitt and Hinrich Schütze. 2019. Unsupervised text generation from structured data. arXiv preprint arXiv:1904.09447.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 86-96.

Xiaoyu Shen, Ernie Chang, Hui Su, Jie Zhou, and Dietrich Klakow. 2020. Neural data-to-text generation via jointly learning the segmentation and correspondence. In ACL 2020.

Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Chang, Cheng Zhang, Cheng Niu, and Jie Zhou. 2020a. MovieChats: Chat like humans in a closed domain. In EMNLP 2020, pages 6605-6619.

Shang-Yu Su, Chao-Wei Huang, and Yun-Nung Chen. 2019. Dual supervised learning for natural language understanding and generation. arXiv preprint arXiv:1905.06196.

Shang-Yu Su, Chao-Wei Huang, and Yun-Nung Chen. 2020b. Towards unsupervised language understanding and generation by joint dual learning. ACL.

Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2018. Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations.

Raphael Tang, Yao Lu, and Jimmy Lin. 2019. Natural language generation for effective knowledge distillation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 202-208.

Bo-Hsiang Tseng, Jianpeng Cheng, Yimai Fang, and David Vandyke. 2020. A generative model for joint natural language understanding and generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1795-1807.

Hongmin Wang. 2019. Revisiting challenges in data-to-text generation with fact grounding. In Proceedings of the 12th International Conference on Natural Language Generation, pages 311-322.

Frederik Wiehr, Anke Hirsch, Florian Daiber, Antonio Krüger, Alisa Kovtunova, Stefan Borgwardt, Ernie Chang, Vera Demberg, Marcel Steinmetz, and Jörg Hoffmann. 2020. Safe handover in mixed-initiative control for cyber-physical systems. In Proceedings of Workshop at CHI.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253-2263, Copenhagen, Denmark. Association for Computational Linguistics.

Yu Wu, Wei Wu, Zhoujun Li, and Ming Zhou. 2018. Learning matching models with weak supervision for response selection in retrieval-based chatbots. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 420-425.

Xiaodong Zhang and Houfeng Wang. 2016. A joint model of intent determination and slot filling for spoken language understanding. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2993-2999.

Zhi-Hua Zhou. 2018. A brief introduction to weakly supervised learning. National Science Review, 5(1):44-53.

WenLi Zhuang and Ernie Chang. 2017. Neobility at SemEval-2017 task 1: An attention-based sentence similarity model. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).