Does the Order of Training Samples Matter? Improving Neural Data-to-Text Generation with Curriculum Learning
Ernie Chang, Hui-Syuan Yeh, Vera Demberg
Dept. of Language Science and Technology, Saarland University
{cychang,yehhui,vera}@coli.uni-saarland.de

Abstract
Recent advancements in data-to-text generation largely take the form of neural end-to-end systems. Efforts have been dedicated to improving text generation systems by changing the order of training samples in a process known as curriculum learning. Past research on sequence-to-sequence learning showed that curriculum learning helps to improve both performance and convergence speed. In this work, we delve into the same idea for training samples consisting of structured data and text pairs, where at each update the curriculum framework selects training samples based on the model's competence. Specifically, we experiment with various difficulty metrics and put forward a soft edit distance metric for ranking training samples. Our benchmarks show faster convergence, with reduced training time and improved BLEU.
Introduction

Neural data-to-text generation has been the subject of much recent research. The task aims at transforming source-side structured data into target-side natural language descriptions (Reiter and Dale, 2000; Barzilay and Lapata, 2005). Training typically involves mini-batches of a fixed size that are randomly sampled from the training set and fed into the model at each training step. In this paper, we apply curriculum learning to this process, as previously explored in neural machine translation (Platanios et al., 2019; Zhou et al., 2020), and show how it can help in neural data-to-text generation.

The main idea in curriculum learning is to present the training data in a specific order, starting from easy examples and moving on to more difficult ones as the learner becomes more competent. When starting out with easier instances, the risk of getting stuck in local optima early on in training is reduced, since the loss functions in neural models are typically highly non-convex (Bengio et al., 2009). This learning paradigm enables flexible batch configurations by considering the properties of the material as well as the state of the learner. The idea brings two potential benefits: (1) it speeds up convergence and reduces the computational cost; (2) it boosts model performance without having to change the model or add data.

With the release of large data-to-text datasets (e.g. WikiBio (Lebret et al., 2016), ToTTo (Parikh et al., 2020), E2E (Novikova et al., 2017)), neural data-to-text generation is now at a point where training speed and the order of samples may begin to make a real difference. We here show the efficacy of curriculum learning with a general LSTM-based sequence-to-sequence model and define difficulty metrics that can assess the training instances, using a successful competence function which estimates the model's capability during training. Such metrics have not yet been explored in neural data-to-text generation.

In this paper, we explore the effectiveness of various difficulty metrics and propose a soft edit distance metric, which leads to substantial improvements over other metrics. Crucially, we observe that difficulty metrics that consider data-text samples jointly lead to stronger improvements than metrics that consider text or data samples alone. In summary, this work makes the following contributions towards neural data-to-text generation:

1. We show that by simply changing the order of samples during training, neural models can be improved via the use of curriculum learning.
2. We explore various difficulty metrics at the level of the data, text, and data-text pairs, and propose an effective novel metric.

Related work
The idea of teaching algorithms in a similar manner as humans, incrementally from easy concepts to more difficult ones, dates back to incremental learning, which was discussed in light of theories of cognitive development relating to the processes of acquisition in young children (Elman, 1993; Krueger and Dayan, 2009; Plunkett and Marchman, 1993). Bengio et al. (2009) first demonstrated empirically that curriculum learning approaches can decrease training times and improve generalization; later approaches address these issues by changing the mini-batch sampling strategy to also take model competence into account (Kocmi and Bojar, 2017; Zhou et al., 2020; Platanios et al., 2019; Liu et al., 2020; Zhang et al., 2018, 2019). While sample difficulty can be assessed for text samples and data samples separately or jointly, various measures have been proposed for text samples, including n-gram frequency (Haffari, 2009; Platanios et al., 2019), token rarity, and sentence length (Liu et al., 2020; Platanios et al., 2019). Our approach considers data and text jointly, similar to edit distance metrics such as the Levenshtein distance (Levenshtein, 1966) and the Damerau-Levenshtein distance (Damerau, 1964; Brill and Moore, 2000a), which was used as a content-ordering metric in Wiseman et al. (2017) to measure the extent of alignment between data slots and text tokens.
Curriculum Learning Framework

We base our curriculum learning framework on two standard components: (1) model competence (how capable the current model is at time t), and (2) sample difficulty, which makes an independent judgement on each sample's difficulty. Specifically, we adopt the competence function c(t) for a model at time t as in Platanios et al. (2019); Liu et al. (2020):

c_{\mathrm{sqrt}}(t) = \min\left(1, \sqrt{t \, \frac{1 - c_0^2}{\lambda_t} + c_0^2}\right) \in (0, 1],  (1)

where \lambda_t is a hyperparameter defining the length of the curriculum, set following Liu et al. (2020), and c_0 is the initial competence, set following Platanios et al. (2019). Under this formulation, the number of new training examples per unit time is reduced as training progresses, giving the learner sufficient time to absorb new knowledge. The sequence-to-sequence model learns using the curriculum as outlined in Algorithm 1, primarily making batch-wise decisions about which samples to add to each batch; this decision is determined by comparing the competence score with the difficulty score.
Algorithm 1: Curriculum Learning Algorithm

Input: Training set D = {(s_d, s_t)_i}_{i=1}^{M} consisting of M samples, model T, difficulty metric d, and competence function c.
Compute the difficulty d(s_i) for each data-text pair in D (Section 4).
Compute the CDF score \bar{d}(s_i) of d(s_i), where \bar{d}(s_i) \in [0, 1] (see Figure 2).
for training step t = 1, ... do
    Compute the model competence c(t) with T.
    Train T on a sampled data batch B_t, drawn uniformly from all s_i \in D such that \bar{d}(s_i) \le c(t).
    if c(t) = 1 then break.
Output: Trained model.
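As a reading aid, the following Python sketch shows how Algorithm 1 might be realized. The helper name train_step, the batch size, the curriculum length, and the initial competence value are illustrative assumptions rather than the settings used in the paper.

```python
import bisect
import math
import random


def competence_sqrt(t, curriculum_length, c0):
    """Square-root competence schedule c_sqrt(t) from Eq. (1)."""
    return min(1.0, math.sqrt(t * (1.0 - c0 ** 2) / curriculum_length + c0 ** 2))


def cdf_scores(difficulties):
    """Map raw difficulty scores to their empirical CDF values in (0, 1]."""
    sorted_d = sorted(difficulties)
    n = len(sorted_d)
    return [bisect.bisect_right(sorted_d, d) / n for d in difficulties]


def curriculum_training(samples, difficulties, train_step,
                        batch_size=32, curriculum_length=10000, c0=0.01):
    """Curriculum loop: at step t, sample batches only from examples whose
    CDF difficulty does not exceed the current model competence."""
    cdf = cdf_scores(difficulties)
    t = 0
    while True:
        t += 1
        c_t = competence_sqrt(t, curriculum_length, c0)
        # Keep only samples whose difficulty is within the current competence.
        eligible = [s for s, d_bar in zip(samples, cdf) if d_bar <= c_t]
        batch = random.sample(eligible, min(batch_size, len(eligible)))
        train_step(batch)  # one gradient update on the batch (placeholder)
        if c_t >= 1.0:
            break
```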
Difficulty Metrics

For ease of discussion, we denote a sequence as s, which can be either the data or the text, or their concatenation. For comparability, the difficulty metrics operate over tokens as produced by the SpaCy tokenizer (https://spacy.io/api/tokenizer). We begin with length and word rarity, which were previously applied to text sentences by Kocmi and Bojar (2017); Platanios et al. (2019).

Length.
Length-based difficulty is based on the intuition that longer sequences are harder to encode, and that early errors may propagate during the decoding process, making longer sentences also harder to generate. It is defined as

d_{\mathrm{length}}(s) = N,  (2)

where N is the number of tokens in s.

Rarity.
Word rarity of a sentence is defined via the product of its unigram probabilities (Platanios et al., 2019). This metric implicitly incorporates information about sentence length, since longer sentences sum over more terms and are thus likely to receive larger scores. The difficulty metric for word rarity of a sequence s is defined as

d_{\mathrm{rarity}}(s) = -\sum_{k=1}^{N} \log p(w_k).  (3)
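A minimal sketch of these two metrics follows; estimating unigram probabilities by simple corpus counts and the smoothing floor for unseen tokens are assumptions, not details from the paper.

```python
import math
from collections import Counter


def unigram_probs(token_sequences):
    """Estimate unigram probabilities by counting tokens over the corpus."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def d_length(tokens):
    """Eq. (2): difficulty is simply the number of tokens."""
    return len(tokens)


def d_rarity(tokens, probs, floor=1e-8):
    """Eq. (3): negative sum of log unigram probabilities (rarer words -> harder)."""
    return -sum(math.log(probs.get(tok, floor)) for tok in tokens)
```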
Addition / Deletion ... ... ...
AdditionDeletion
Figure 1:
Depiction of the process of soft edit distance metric with the
Wagner-Fischer table . Each cell in the table representsthe edit distance to convert data substring’s into the text substring.
Damerau-Levenshtein Distance.
To consider data and text jointly, we measure the alignment between data slots and text using the Damerau-Levenshtein distance (d_dld) (Brill and Moore, 2000a). We calculate the minimum number of edit operations needed to transform the data (s_d) into the text (s_t), relying on only four operations: (a) substitute a word in s_d with a different word, (b) insert a word into s_d, (c) delete a word from s_d, and (d) transpose two adjacent words of s_d. The process involves recursive calls that compute the distance between substrings s_d^i \in s_d and s_t^i \in s_t at the i-th comparison.
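For reference, below is a minimal sketch of the (restricted) Damerau-Levenshtein distance over token sequences; this is the textbook dynamic-programming formulation rather than the exact implementation used in the paper.

```python
def damerau_levenshtein(src_tokens, tgt_tokens):
    """Minimum number of substitutions, insertions, deletions, and adjacent
    transpositions needed to turn src_tokens into tgt_tokens."""
    n, m = len(src_tokens), len(tgt_tokens)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src_tokens[i - 1] == tgt_tokens[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and src_tokens[i - 1] == tgt_tokens[j - 2]
                    and src_tokens[i - 2] == tgt_tokens[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]
```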
Soft Data-to-Text Edit Distance.

We here present the proposed soft edit distance (SED). (1) We include only the basic add and delete edit operations, as in the Levenshtein distance (Levenshtein, 1966), which was used in the Levenshtein Transformer (Gu et al., 2019) as the only two operations necessary for decoding sequences, since this correlates well with human text writing, where humans "can revise, replace, revoke or delete any part of their generated text". We call this variant the plain edit distance (PED). (2) Next, we weight the indicator function 1(s_d^i, s_t^i) of each edit operation with the negative logarithmic unigram probability -\log p(w) of the token w involved in the operation, in order to incorporate the idea of word rarity into the edit distance metric: for the delete operation we use w \in s_d^i, and for the add operation we use w \in s_t^i. This is unlike the previous proposal by Brill and Moore (2000b), in which edits are weighted by token transition probabilities; that is not suitable for our scenario because there is no natural order of the slot sequence in data samples. (Previous work applies the Damerau-Levenshtein distance to the slots in the data, e.g. "[name]" in Figure 1, and extracts slots from the text.)
The soft distance metric d_sed is in principle similar to calculating the logarithmic sum defined in the rarity function, but it instead incrementally compares all substrings and accumulates their edit distances. In this way, d_sed captures the information of both length and rarity while also combining the edit operations. We show this process in Figure 1. Note that we can compute length and rarity on the concatenation of the input data and text sequence, or on the individual sequences, whereas the Damerau-Levenshtein distance and the soft edit distance are computed jointly on data and text.
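To illustrate, here is a minimal Python sketch of a weighted insert/delete edit distance in the spirit of SED: a Wagner-Fischer style dynamic program in which each insertion or deletion is weighted by the negative log unigram probability of the affected token. Treating matching tokens as zero-cost and the smoothing floor for unseen tokens are assumptions, not details from the paper.

```python
import math


def soft_edit_distance(data_tokens, text_tokens, unigram_prob, floor=1e-8):
    """Wagner-Fischer table over data/text tokens with weighted insert/delete.

    Deleting a data-side token costs -log p(token); inserting a text-side
    token costs -log p(token); identical tokens align at zero cost (assumption).
    """
    def cost(tok):
        return -math.log(unigram_prob.get(tok, floor))

    n, m = len(data_tokens), len(text_tokens)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                       # delete all data tokens
        table[i][0] = table[i - 1][0] + cost(data_tokens[i - 1])
    for j in range(1, m + 1):                       # insert all text tokens
        table[0][j] = table[0][j - 1] + cost(text_tokens[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if data_tokens[i - 1] == text_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1]   # match, no cost
            else:
                table[i][j] = min(
                    table[i - 1][j] + cost(data_tokens[i - 1]),  # delete
                    table[i][j - 1] + cost(text_tokens[j - 1]),  # insert
                )
    return table[n][m]
```

Replacing the weighted costs with a constant cost of 1 recovers the unweighted plain edit distance (PED) variant described above.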
Experiments

Data.
We conduct experiments on the E2E (Novikova et al., 2017) and WebNLG (Colin et al., 2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain; the inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities from a range of distinct DBpedia categories, where each input consists of a set of RDF triples of the form (subject, relation, object).

Configurations.
The LSTM-based model is implemented in PyTorch (Paszke et al., 2019). We use token embeddings and the Adam optimizer with a fixed initial learning rate. The batch size is kept constant, and we decode with beam search. Performance scores are averaged over several runs with different random initializations.

Settings.
We first perform ablation studies (Table 1) on the impact of difficulty metrics computed on data, text, or both (joint). We also analyse the average bin size for each metric: a metric that gives the same score to many instances creates large bins, which means that the order of samples within a bin will still be random. On the other hand, a metric that assigns many different difficulty scores to the instances can yield a more complete ordering (and a smaller step size in moving from one level of difficulty to the next). We present the change in performance (BLEU) as training progresses in order to compare the various difficulty metrics on both datasets (see Figure 3).

Figure 2: The histogram of the cumulative density function for the difficulty metrics, shown for (a) length, (b) rarity, (c) Damerau-Levenshtein distance, and (d) soft edit distance, each computed on data, text, and joint sequences.
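For reference, the average bin size reported in Table 1 can be computed along these lines; rounding or normalizing scores before counting is an assumption.

```python
from collections import Counter


def average_bin_size(difficulty_scores):
    """Average number of training samples sharing the same difficulty score.
    A large value means the metric leaves many samples indistinguishable."""
    bins = Counter(difficulty_scores)   # score -> number of samples in that bin
    return sum(bins.values()) / len(bins)
```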
In Table 1, we observe that the soft edit distance (SED) yields the best performance, outperforming a model that does not use curriculum learning as well as all other difficulty metrics in BLEU. In general, we see that models perform better with joint and text metrics than with data metrics. This correlates with how a difficulty function relates to the average bin size of the scores it generates: a metric that distinguishes samples in a more fine-grained manner has a smaller average bin size, so the probability of encountering overly difficult samples at every competence threshold is lower. From this, we see that length and DLD have larger average bin sizes across their source difficulty scores, which makes samples less distinguishable from one another; thus, they result in the smallest improvement over plain. We show reordered samples in Table ?? for all difficulty metrics computed jointly on data and text, including length (L), rarity (R), Damerau-Levenshtein distance (DLD), and the proposed soft edit distance (SED).

Table 1: Ablation studies for the impact of difficulty metrics on data, text, or both (joint), with normalized scores, including length (L), rarity (R), Damerau-Levenshtein distance (DLD), plain edit distance (PED), and the proposed soft edit distance (SED). Plain means that no curriculum learning techniques are added. All scores are computed on the E2E corpus and comprise both performance (BLEU) and the average bin size; each bin is defined by the number of training samples with the same difficulty score.

Bin size    plain        L         R      DLD     PED    SED
data          -      7010.17   385.88      -       -      -
text          -       737.91     1.04      -       -      -
joint         -        25.0      1.04    32.26   21.74

Human evaluation: fluency 4.35 / 4.28 / 4.32; wrong information 9 / 5 / 7 / 10 / 6.
Figure 3: Performance (BLEU) versus the number of training steps for the E2E and WebNLG datasets. Vertical bars indicate where the maximum BLEU scores are attained for plain and SED.
On the other hand, we also justify the use of weighting for the edit operations: PED, the "hard" variant of SED that does not weight edit operations as SED does, is shown to be far inferior to SED, with a clear score margin in BLEU. Moreover, we further examine the differences in sample order and observe that SED yields a more intuitive and better sample ordering than the other metrics.
Human Evaluation.
For the human evaluation, three annotators are instructed to evaluate 100 samples from the joint variant and judge (1) whether the text is fluent (scored 0-5, with 5 being fully fluent), (2) whether it misses information contained in the source data, and (3) whether it includes wrong information. These scores are averaged and presented in Table 1.
On Training Speed.
We define speed by the number of updates it takes to reach a performance plateau. In Figure 3, the speedup is measured by the difference between the vertical bars. It can be observed that curriculum learning reduces the number of training steps needed to converge, taking up to 38.7% of the total updates of the same model without curriculum learning (on E2E). Further, we see that the use of curriculum learning yields slightly worse performance in the initial training steps, but the scores then rise higher and flatten out as the model converges.
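A sketch of how the convergence point and the resulting speedup could be read off a BLEU-versus-steps curve follows; taking the step of maximum BLEU as the convergence point mirrors the vertical bars in Figure 3, while the ratio-based comparison is an assumption about how the numbers are derived.

```python
def convergence_step(steps, bleu_scores):
    """Training step at which the maximum BLEU score is attained."""
    best_index = max(range(len(bleu_scores)), key=lambda i: bleu_scores[i])
    return steps[best_index]


def relative_updates(steps_curriculum, bleu_curriculum, steps_plain, bleu_plain):
    """Fraction of the plain model's updates needed by the curriculum model."""
    return (convergence_step(steps_curriculum, bleu_curriculum)
            / convergence_step(steps_plain, bleu_plain))
```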
Conclusion

To conclude, we show that sample order does indeed matter when taking model competence into account during training. Further, we demonstrate that the proposed metrics are effective in speeding up model convergence. Given that curriculum learning can be combined with virtually any neural architecture, we recommend the use of curriculum learning for data-to-text generation. We believe this work also offers insights into the annotation process of data with text labels, where a reduced number of labels is needed (Hong et al., 2019; de Souza et al., 2018; Zhuang and Chang, 2017; Chang et al., 2020a; Wiehr et al., 2020; Shen et al., 2020; Chang et al., 2020b; Su et al., 2020; Chang et al., 2021b,a).
Acknowledgements
This research was funded in part by the German Research Foundation (DFG) as part of SFB 248 "Foundations of Perspicuous Software Systems". We sincerely thank the anonymous reviewers for their insightful comments that helped us to improve this paper.
References
Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 141-148, Ann Arbor, Michigan. Association for Computational Linguistics.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48.

Eric Brill and Robert C. Moore. 2000a. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286-293.

Eric Brill and Robert C. Moore. 2000b. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 286-293, Hong Kong. Association for Computational Linguistics.

Ernie Chang, David Adelani, Xiaoyu Shen, and Vera Demberg. 2020a. Unsupervised pidgin text generation by pivoting English data and self-training. In Proceedings of Workshop at ICLR.

Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen, and Vera Demberg. 2020b. DART: A lightweight quality-suggestive data-to-text annotation tool. In COLING 2020, pages 12-17.

Ernie Chang, Vera Demberg, and Alex Marin. 2021a. Jointly improving language understanding and generation with quality-weighted weak supervision of automatic labeling. In EACL 2021.

Ernie Chang, Xiaoyu Shen, Dawei Zhu, Vera Demberg, and Hui Su. 2021b. Neural data-to-text generation with LM-based text augmentation. In EACL 2021.

Emilie Colin, Claire Gardent, Yassine M'rabet, Shashi Narayan, and Laura Perez-Beltrachini. 2016. The WebNLG challenge: Generating text from DBPedia data. In Proceedings of the 9th International Natural Language Generation Conference, pages 163-167, Edinburgh, UK. Association for Computational Linguistics.

Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171-176.

José G. C. de Souza et al. 2018. Generating e-commerce product titles and predicting their quality. In INLG, pages 233-243.

Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71-99.

Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein transformer. In Advances in Neural Information Processing Systems, pages 11181-11191.

Gholam Reza Haffari. 2009. Machine learning approaches for dealing with limited bilingual data in statistical machine translation. Ph.D. thesis, School of Computing Science, Simon Fraser University.

Xudong Hong, Ernie Chang, and Vera Demberg. 2019. Improving language generation from feature-rich tree-structured data with relational graph convolutional encoders. In Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), pages 75-80.

Tom Kocmi and Ondřej Bojar. 2017. Curriculum learning and minibatch bucketing in neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 379-386.

Kai A. Krueger and Peter Dayan. 2009. Flexible shaping: How learning in small steps helps. Cognition, 110(3):380-394.

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203-1213.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Xuebo Liu, Houtim Lai, Derek F. Wong, and Lidia S. Chao. 2020. Norm-based curriculum learning for neural machine translation. arXiv preprint arXiv:2006.02014.

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201-206.

Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024-8035.

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom M. Mitchell. 2019. Competence-based curriculum learning for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1162-1172.

Kim Plunkett and Virginia Marchman. 1993. From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48(1):21-69.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.

Xiaoyu Shen, Ernie Chang, Hui Su, Jie Zhou, and Dietrich Klakow. 2020. Neural data-to-text generation via jointly learning the segmentation and correspondence. In ACL 2020.

Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Chang, Cheng Zhang, Cheng Niu, and Jie Zhou. 2020. MovieChats: Chat like humans in a closed domain. In EMNLP 2020, pages 6605-6619.

Frederik Wiehr, Anke Hirsch, Florian Daiber, Antonio Krüger, Alisa Kovtunova, Stefan Borgwardt, Ernie Chang, Vera Demberg, Marcel Steinmetz, and Jörg Hoffmann. 2020. Safe handover in mixed-initiative control for cyber-physical systems. In Proceedings of Workshop at CHI.

Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253-2263.

Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An empirical exploration of curriculum learning for neural machine translation. arXiv preprint arXiv:1811.00739.

Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. 2019. Curriculum learning for domain adaptation in neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1903-1915.

Yikai Zhou, Baosong Yang, Derek F. Wong, Yu Wan, and Lidia S. Chao. 2020. Uncertainty-aware curriculum learning for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6934-6944.

WenLi Zhuang and Ernie Chang. 2017. Neobility at SemEval-2017 Task 1: An attention-based sentence similarity model. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).