Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer
Yanshuai Cao∗   Peng Xu∗
Borealis AI
∗ Equal Contribution.
Abstract
In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. Applied to language modelling, our regularizer expresses the inductive bias that sequence variables should have high mutual information even though the model might not see abundant observations for complex long-range dependency. We show how the "next sentence prediction (classification)" heuristic can be derived in a principled way from our mutual information estimation framework, and be further extended to maximize the mutual information of sequence variables. The proposed approach not only is effective at increasing the mutual information of segments under the learned model but, more importantly, leads to a higher likelihood on holdout data and improved generation quality. Code is released at https://github.com/BorealisAI/BMI.

Introduction

Transformer-based large-scale pre-training (Devlin et al., 2018; Yang et al., 2019; Zhang et al., 2019; Sun et al., 2019; Radford et al., 2018a,b) has yielded impressive successes on many NLP tasks. Among the many components originally introduced by BERT (Devlin et al., 2018), the auxiliary task of next sentence prediction (NSP) is regarded as a heuristic: it is a binary classification task that distinguishes whether another sentence is the correct next sentence or one randomly sampled from the corpus. As an ad-hoc heuristic, NSP is often dropped by subsequent works on large-scale pre-training (Joshi et al., 2019; Liu et al., 2019b) based on empirical performance, but is picked up in other NLP problems (Xu et al., 2019; Liu et al., 2019a). This work explores a hidden connection of NSP to mutual information maximization, providing a more principled justification for those applications where NSP is used. The new insight is independent of the transformer architecture, and it allows us to design a new algorithm that shows additional improvements beyond NSP for RNN language modelling, in terms of learning long-range dependency.

Learning long-range dependency in sequential data such as text is challenging, and the difficulty has mostly been attributed to the vanishing gradient problem in autoregressive neural networks such as RNNs (Hochreiter et al., 2001). There is a vast literature trying to solve this gradient flow problem through better architecture (Hochreiter et al., 2001; Mikolov et al., 2014; Vaswani et al., 2017), better optimization (Martens and Sutskever, 2011) or better initialization (Le et al., 2015). There is, however, an orthogonal issue that has received less attention: statistical dependency over a short span is usually abundant in data, e.g., bigrams, common phrases and idioms, whereas long-range dependency typically involves more complex or abstract relationships among a large number of tokens (high-order interactions). In other words, there is a sampling mismatch between observations supporting local correlations and evidence for high-order interactions, while the latter requires more samples to learn from in the first place because they involve more variables.
We conjecture that in addition to the gradient flow issue, this sparse sampling of high-order statistical relations makes long-range dependency hard to learn in natural language processing. Take language modelling as an example: with a vocabulary of size $K$, the number of possible sequences grows as $K^m$ with sequence length $m$. Neural language models use distributed representations to overcome this issue (Bengio et al., 2003), as not all $K^m$ sequences form plausible natural language utterances, and there is shared semantics and compositionality across different texts. However, the parametrization does not change the fundamental fact that in the training data, there is an abundance of observations for local patterns but much sparser observations for the different high-level ideas. As language evolved to express the endless possibilities of the world, even among the set of "plausible" long sequences, a training set can only cover a small fraction. Therefore, there is an inherent imbalance of sampling between short- and long-range dependencies. Because it is a data sparsity issue at its core, it cannot be completely solved by better architecture or optimization.

The natural remedy facing limited data is to regularize the model using prior knowledge. In this work, we propose a novel approach for incorporating into the usual maximum likelihood objective the additional prior that long-range dependency exists in texts. We achieve this by bootstrapping a lower bound on the mutual information (MI) over groups of variables (segments or sentences) and subsequently applying the bound to encourage high MI. The first step of bootstrapping the lower bound is exactly the NSP task. Both the bootstrapping and the application of the bound improve long-range dependency learning: first, the bootstrap step helps the neural network's hidden representation recognize evidence for the high mutual information that exists in the data distribution; second, the information lower bound value, used as a reward, encourages the model distribution to exhibit high mutual information as well. We apply the proposed method to language modelling, although the general framework could apply to other problems as well.

Our work offers a new perspective on why the heuristic of next sentence prediction used in previous works (Trinh et al., 2018; Devlin et al., 2018) is a useful auxiliary task, while revealing missing ingredients, which we complete in the proposed algorithm. We demonstrate improved perplexity on two established benchmarks, reflecting the positive regularizing effect. We also show that our proposed method helps the model generate higher-quality samples with more diversity, measured by reversed perplexity (Zhao et al., 2018), and more dependency, measured by an empirical lower bound of mutual information.

A language model (LM) assigns a probability to a sequence of tokens (characters, bytes, or words). Let $\tau_i$ denote the token variables; an LM $Q$ factorizes the joint distribution of the $\tau_i$'s into a product of conditionals from left to right, leveraging the inherent order of text:

$$Q(\tau_1, \ldots, \tau_k) = \prod_{i=1}^{k} Q(\tau_i \mid \tau_{<i})$$
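To make this factorization concrete, the following minimal sketch scores a sequence by accumulating the log-probabilities of the left-to-right conditionals. It is an illustration rather than the paper's code; the `model` interface (a callable mapping a prefix of token ids to next-token logits) is our assumption, and the first token is treated as given context.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokens):
    """Score a sequence under Q(tau_1..tau_k) = prod_i Q(tau_i | tau_<i).

    `model` is assumed (hypothetically) to map a prefix of token ids,
    shape (1, t), to next-token logits of shape (1, vocab_size).
    """
    log_prob = 0.0
    for i in range(1, len(tokens)):
        prefix = torch.tensor(tokens[:i]).unsqueeze(0)  # tau_<i
        logits = model(prefix)                          # (1, vocab_size)
        log_q = F.log_softmax(logits, dim=-1)
        log_prob = log_prob + log_q[0, tokens[i]]       # log Q(tau_i | tau_<i)
    return log_prob
```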
Because the MINE bound holds for any parameters, we can instead use the binary classification form to optimize the parameters, similar to what we do for $I^{P}_{\theta,\omega}$ and as done in Hjelm et al. (2018). The proxy objective has the form:

$$\tilde{I}^{Q}_{\theta,\omega} = \mathbb{E}_{Q_{XY}}\left[ R^{+}_{\theta,\omega} \right] - \mathbb{E}_{Q_X \otimes Q_Y}\left[ R^{-}_{\theta,\omega} \right]$$

where

$$R^{+}_{\theta,\omega} = -\mathrm{SP}\left(-D_\theta(\phi^{X}_{\omega}, \phi^{Y}_{\omega})\right) \quad (7)$$
$$R^{-}_{\theta,\omega} = \mathrm{SP}\left(D_\theta(\phi^{X}_{\omega}, \phi^{Y}_{\omega})\right) \quad (8)$$

To optimize $\tilde{I}^{Q}_{\theta,\omega}$ with respect to $\zeta = (\theta, \omega)$, the gradient has two terms, $\nabla_\zeta \tilde{I}^{Q}_{\theta,\omega} = g_1 + g_2$, where

$$g_1 = \mathbb{E}_{Q_{XY}}\left[\nabla R^{+}_{\theta,\omega}\right] - \mathbb{E}_{Q_X \otimes Q_Y}\left[\nabla R^{-}_{\theta,\omega}\right] \quad (9)$$
$$g_2 = \mathbb{E}_{Q_{XY}}\left[ R^{+}_{\theta,\omega} \nabla \log Q_{XY} \right] - \mathbb{E}_{Q_X \otimes Q_Y}\left[ R^{-}_{\theta,\omega} \left(\nabla \log Q_X + \nabla \log Q_Y\right) \right] \quad (10)$$

$g_2$ uses the policy gradient (i.e., the likelihood ratio estimator) with $Q$ as the policy and $R^{+}$ and $R^{-}$ as the reward (and penalty). $g_2$ can be variance-reduced by control-variate methods, e.g., Rennie et al. (2017). However, deep RL is known to converge slowly due to high variance, and our trials confirm the difficulty in this particular case. Furthermore, sampling from $Q$ is generally slow for autoregressive models as it cannot be easily parallelized. These two issues compounded mean that we would like to avoid sampling from $Q$. To this end, we develop a modification of reward augmented maximum likelihood (RAML) (Norouzi et al., 2016), which avoids the high variance and the slow $Q$-sampling.

For the $g_1$ part (Eq. 9), if we simply replace the $Q$ distributions with $P$ in the expectations, we recover the Phase-I regularizer of Eq. 6, which we can use to approximate $g_1$. The bias of this approximation is:

$$\sum_{X,Y} \left( Q(X,Y) - P(X,Y) \right) \nabla R^{+} - \sum_{X,Y} \left( Q(X)Q(Y) - P(X)P(Y) \right) \nabla R^{-} \quad (11)$$

which becomes small as maximum likelihood learning progresses, because in both terms the total variation distance $\sum |Q - P|$ is bounded by $\sqrt{2\,\mathrm{KL}(P \| Q)}$ via Pinsker's inequality (Tsybakov, 2008).

RAML can be viewed as optimizing the reverse direction of the KL divergence compared to the entropy-regularized policy gradient RL objective. We leave the details of RAML to Appendix A.1 and refer readers to Norouzi et al. (2016). For our purpose here, the important facts are the RAML gradient and the policy gradient:

$$\nabla \mathcal{L}_{\mathrm{RAML}} = -\mathbb{E}_{p^{\star}_{\beta}(Y \mid Y^{\star})}\left\{ \nabla \log Q_\omega(Y \mid X) \right\} \quad (12)$$
$$\nabla \mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{Q_\omega(Y \mid X)}\left\{ r(Y, Y^{\star}) \nabla \log Q_\omega(Y \mid X) \right\} \quad (13)$$

where $p^{\star}_{\beta}(Y \mid Y^{\star})$ is the exponentiated pay-off distribution defined as:

$$p^{\star}_{\beta}(Y \mid Y^{\star}) = \exp\{ r(Y, Y^{\star}) / \beta \} \big/ Z(Y^{\star}, \beta) \quad (14)$$

and $r(Y, Y^{\star})$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^{\star}$ (e.g., negative edit-distance). The RAML gradient (Eq. 12) samples from a stationary distribution, while the policy gradient (Eq. 13) samples from the changing $Q_\omega$ distribution. Furthermore, by definition, samples from $p^{\star}_{\beta}(Y \mid Y^{\star})$ have a higher chance of high reward, while samples from $Q_\omega(Y \mid X)$ rely on exploration. For these reasons, RAML has much lower variance than RL.

Unfortunately, sampling from $p^{\star}_{\beta}(Y \mid Y^{\star})$ can only be done efficiently for some special classes of reward, such as the edit-distance used in Norouzi et al. (2016). Here, we would like to use the learned MI estimator, more specifically the classifier scores, as the reward. Assume $Y^{\star}$ is the sentence following $X$ in the corpus; then for any other $Y$, the reward is:

$$r(Y, Y^{\star}; X) = D_\theta(\phi^{X}_{\omega}, \phi^{Y}_{\omega}) - D_\theta(\phi^{X}_{\omega}, \phi_\omega(Y^{\star})) \quad (15)$$
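The binary classification form of the proxy objective is straightforward to implement. Below is a minimal PyTorch sketch (our own naming, not the released code), assuming `D` is the scalar-valued test function $D_\theta$ and the `phi_*` arguments are precomputed segment encodings $\phi_\omega$; maximizing the returned value is equivalent to minimizing the cross-entropy of the next-sentence classifier.

```python
import torch.nn.functional as F

def proxy_mi_objective(D, phi_x, phi_y, phi_y_neg):
    """Binary-classification proxy for the MI lower bound (Eqs. 7-8 form).

    (phi_x, phi_y) encode pairs drawn from the joint (consecutive segments);
    (phi_x, phi_y_neg) pair X with a randomly drawn segment (product of
    marginals). Maximizing this is the NSP classification objective.
    """
    r_pos = -F.softplus(-D(phi_x, phi_y))     # R+ = -SP(-D) = log sigmoid(D)
    r_neg = F.softplus(D(phi_x, phi_y_neg))   # R- =  SP(D)  = -log sigmoid(-D)
    return r_pos.mean() - r_neg.mean()
```

The same functional form gives the Phase-I regularizer when the pairs are drawn from the data distribution $P$ instead of the model $Q$.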
In the illustration of Fig. 1b, $X$ would be $S_{m-1}$, $Y^{\star} = S_m$ the segment that follows it, and another segment $Y = S_k$ is sampled to be evaluated. $Y$ could also be any other sentence/segment not in the dataset.

As the deep-neural-net-computed scores lack the simple structure of edit-distance that can be exploited for efficient sampling from $p^{\star}_{\beta}(Y \mid Y^{\star})$, direct application of RAML to the MI reward is not possible. We instead develop an efficient alternative based on importance sampling.

Intuitively, a sentence that is near $X$ in the text tends to be more related to it, and vice versa. Therefore, we can use a geometric distribution based at the index of $Y^{\star}$ as the proposal distribution, as illustrated in Fig. 1b. Let $Y^{\star}$ have sentence/segment index $m$; then

$$G(Y = S_k \mid Y^{\star} = S_m) = (1-\lambda)^{(k-m)} \lambda \quad (16)$$

where $\lambda$ is a hyperparameter (which we fix without tuning). Other proposals are also possible. With $G$ as the proposal, our importance-weighted RAML (IW-RAML) gradient is then:

$$\nabla \mathcal{L}_{\mathrm{RAML}} = -\mathbb{E}_{G}\left( \nabla \log Q_\omega(Y \mid X) \, p^{\star}_{\beta}(Y \mid Y^{\star}) \big/ G(Y \mid Y^{\star}) \right) \quad (17)$$

Because the reward in Eq. 15 is shift-standardized with respect to the discriminator score at $Y^{\star}$, we assume that the normalization constant $Z$ in Eq. 14 does not vary heavily across different $Y^{\star}$, so that we can perform self-normalized importance sampling by averaging across the mini-batches.

A side benefit of introducing $G$ is to re-establish the stationarity of the sampling distribution in the RAML gradient estimator. Because the reward function of Eq. 15 depends on $(\theta, \omega)$, the exponentiated pay-off distribution is no longer stationary as in the original RAML with a simple reward (Norouzi et al., 2016), but we regain stationarity through the fixed proposal $G$, keeping the variance low. Stationarity of the sampling distribution is one of the reasons for the lower variance of RAML.

Choosing IW-RAML over RL is a bias-variance trade-off. The RL objective gradient in Eqs. 9-10 is the unbiased one, while IW-RAML as introduced has a few biases: it uses the opposite direction of the KL divergence (analyzed in Norouzi et al. (2016)), and the support of $G$ is smaller than that of $p^{\star}_{\beta}(Y \mid Y^{\star})$. Each of these approximations introduces some bias, but the overall variance is significantly reduced, as the empirical analysis in Sec. 5.3 shows.
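With self-normalized importance sampling, Eq. 17 reduces, per mini-batch, to a softmax over $\log p^{\star}_{\beta} - \log G$. The sketch below shows one way this could look in PyTorch; the names are illustrative, and detaching the weights (treating them as constants for the gradient, even though the reward depends on $(\theta, \omega)$) is our simplifying assumption.

```python
import math
import torch

def iw_raml_loss(log_q, rewards, offsets, beta, lam):
    """Self-normalized importance-weighted RAML surrogate for Eq. 17.

    log_q   : (M,) tensor, log Q_omega(Y_i | X_i) for samples Y_i ~ G
    rewards : (M,) tensor, shift-standardized rewards r(Y_i, Y*_i; X_i) (Eq. 15)
    offsets : (M,) tensor, index gaps (k - m) between Y_i = S_k and Y*_i = S_m
    """
    log_p_star = rewards / beta                            # log p*_beta up to log Z
    log_g = offsets * math.log(1.0 - lam) + math.log(lam)  # log G(Y|Y*), Eq. 16
    w = torch.softmax(log_p_star - log_g, dim=0)           # self-normalized IS weights
    return -(w.detach() * log_q).sum()                     # gradient mirrors Eq. 17
```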
Algorithm 1: Language Model Learning with the BMI regularizer

Input: batch size $M$, dataset $\Omega$, proposal distribution $G$, maximum number of iterations $N$.

phase-two := false
for itr = 1, ..., N do
  1. Compute the LM objective $\mathcal{L}_{\mathrm{MLE}}(\omega)$ from Eq. 1 and its gradient;
  2. Sample a mini-batch of consecutive sentences $\{(X_i, Y_i)\}_{i=1}^{M}$ from $\Omega$ as samples from $P_{XY}$; sample another mini-batch of $\{Y^{-}_i\}_{i=1}^{M}$ from $\Omega$ to form $\{(X_i, Y^{-}_i)\}_{i=1}^{M}$ as samples from $P_X \otimes P_Y$; extract the features $\phi^{X}_{\omega}$, $\phi^{Y}_{\omega}$ and $\phi^{Y^{-}}_{\omega}$ and compute $\tilde{I}^{P}_{\theta,\omega}$ according to Eq. 6 and its gradient;
  3. if phase-two then sample a mini-batch of $\{\tilde{Y}_i\}_{i=1}^{M}$ from $\Omega$ according to $G$, each with corresponding $Y^{\star} = Y_i$, and compute the IW-RAML gradients according to Eq. 17, with $Y^{\star} = Y_i$, $Y = \tilde{Y}_i$, and $X = X_i$; end if
  Add the gradient contributions from steps 1, 2 and 3 and update the parameters $\omega$ and $\theta$.
  if not phase-two and the switch condition is met then phase-two := true end if
end for
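For concreteness, a condensed sketch of this training loop is given below. It reuses `proxy_mi_objective` and `iw_raml_loss` from the earlier sketches; `batch_fn` and `proposal_fn` are hypothetical stand-ins for the data pipeline, and the plain SGD optimizer and iteration-count switch condition are placeholders, not the paper's actual choices.

```python
import torch

def train_bmi(batch_fn, proposal_fn, D, params, n_iters, switch_iter,
              reg_weight, beta, lam):
    """Condensed Algorithm 1 (a sketch, not the released code)."""
    opt = torch.optim.SGD(params, lr=0.1)      # placeholder optimizer / lr
    phase_two = False
    for it in range(n_iters):
        # Steps 1-2: MLE loss plus Phase-I MI proxy on data samples.
        nll, phi_x, phi_y, phi_y_neg = batch_fn()
        loss = nll - reg_weight * proxy_mi_objective(D, phi_x, phi_y, phi_y_neg)
        if phase_two:                          # Step 3: IW-RAML term (Eq. 17)
            log_q, rewards, offsets = proposal_fn()
            loss = loss + reg_weight * iw_raml_loss(log_q, rewards, offsets,
                                                    beta, lam)
        opt.zero_grad()
        loss.backward()                        # update omega and theta jointly
        opt.step()
        if not phase_two and it + 1 >= switch_iter:  # one possible switch condition
            phase_two = True
```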
Related Work

Long-Range Dependency and Gradient Flow

Capturing long-range dependency has been a major challenge in sequence learning. Most works have focused on the gradient flow in backpropagation through time (BPTT). The LSTM architecture (Hochreiter and Schmidhuber, 1997) was invented to address the very problem of vanishing and exploding gradients in RNNs (Hochreiter et al., 2001). There is a vast literature on improving the gradient flow with new architectural modifications or regularization (Mikolov et al., 2014; Koutnik et al., 2014; Wu et al., 2016; Li et al., 2018). Seq-to-seq models with attention or memory (Bahdanau et al., 2014; Cho et al., 2015; Sukhbaatar et al., 2015; Joulin and Mikolov, 2015) are a major neural architecture advance that improves the gradient flow by shortening the path that relevant information needs to traverse in the computation graph. The recent invention of the Transformer architecture (Vaswani et al., 2017), and the subsequent large-scale pre-training successes (Devlin et al., 2018; Radford et al., 2018a,b), are further examples of better architectures improving gradient flow.
Regularization via Auxiliary Tasks
Closer to our method are works that use auxiliary prediction tasks as regularization (Trinh et al., 2018; Devlin et al., 2018). Trinh et al. (2018) use an auxiliary task of predicting some random future or past subsequence with a reconstruction loss. Their focus is still on vanishing/exploding gradients and issues caused by BPTT. Their method is justified empirically, and it is unclear whether the auxiliary task losses are compatible with the maximum likelihood objective of language modelling, on which they did not experiment. Devlin et al. (2018) add a "next sentence prediction" task to their masked language model objective, which tries to classify whether a sentence is the correct next one or randomly sampled. This task is the same as our Phase-I for learning the lower bound $I^{P}_{\theta,\omega}$, but we are the first to draw the theoretical connection to mutual information, explaining its regularization effect on the model (Sec. 3.1.1), and applying the bootstrapped MI bound for more direct regularization in Phase-II is completely novel to our method.

Language Modeling with Extra Context
Modeling long-range dependency is crucial for language models, since capturing the larger context effectively can help predict the next token. In order to capture this dependency, some works feed an additional representation of the larger context into the network, including an additional block, document- or corpus-level topic, or discourse information (Mikolov and Zweig, 2012; Wang and Cho, 2015; Dieng et al., 2016; Wang et al., 2017). Our work is orthogonal to them and can be combined with them.
Experiments

We experiment on two widely-used benchmarks for word-level language modeling, Penn Treebank (PTB) (Mikolov and Zweig, 2012) and WikiText-2 (WT2) (Merity et al., 2016). We choose the recent state-of-the-art model among RNN-based models on these two benchmarks, AWD-LSTM-MoS (Yang et al., 2017), as our baseline. We compare the baseline against the same model with variants of our proposed Bootstrapping Mutual Information (BMI) regularizer: (1) BMI-base: apply Phase-I throughout training; and (2)
BMI-full: apply Phase-I until a sufficiently good $D_\theta$ is learned, then switch on Phase-II, adding the IW-RAML gradient.

Table 1: Perplexity and reverse perplexity on PTB and WT2.

                       PTB                              WT2
                   PPL          Reverse PPL         PPL          Reverse PPL
Model              Valid  Test  Valid  Test         Valid  Test  Valid  Test
AWD-LSTM-MoS         –     –      –     –             –     –      –     –
BMI-base             –     –      –     –             –     –      –     –
BMI-full           56.85  54.65  78.46  73.73       63.86  61.37  90.20  85.11
AWD-LSTM-MoS (ft.)   –     –      –     –             –     –      –     –
BMI-base (ft.)       –     –      –     –             –     –      –     –
BMI-full (ft.)     55.61  53.67  75.81  71.81       62.99  60.51  88.27  83.43
Table 2: Estimated MI (lower bounds) of X and Y, two random segments of fixed length separated by a fixed number of tokens. Estimations use cross-validation and testing.

Generations        PTB          WT2
Real data          – ± –        – ± –
AWD-LSTM-MoS       – ± –        – ± –
BMI-base           – ± –        – ± –
BMI-full           – ± –        – ± –

Experimental Setup
We apply max-pooling over the hidden states of all the layers in the LSTM and concatenate them as our $\phi_\omega$-encoding. For the test function $D_\theta$ we use a one-layer feedforward network on features similar to Conneau et al. (2017), namely $[\phi^{X}_{\omega}, \phi^{Y}_{\omega}, \phi^{X}_{\omega} - \phi^{Y}_{\omega}, |\phi^{X}_{\omega} - \phi^{Y}_{\omega}|, \phi^{X}_{\omega} * \phi^{Y}_{\omega}]$ (see the code sketch following the results discussion below). The ADAM optimizer (Kingma and Ba, 2014) with a small learning rate and weight decay is applied to $\theta$, while $\omega$ is optimized in the same way as in Merity et al. (2017) and Yang et al. (2017), with SGD followed by ASGD (Polyak and Juditsky, 1992). All the above hyperparameters are chosen by validation perplexity on PTB and applied directly to WT2. The weight of the regularizer term is chosen by validation perplexity on each respective dataset. The remaining architecture and hyperparameters follow exactly the code released by Yang et al. (2017). As mentioned previously, we set the temperature hyperparameter $\beta$ in RAML and the $\lambda$ hyperparameter of the importance-sampling proposal $G$ without tuning. All experiments are conducted on single 1080Ti GPUs with PyTorch. We manually tune the following hyperparameters based on validation perplexity: the BMI regularizer weight, the $D_\theta$ hidden state size, and the Adam learning rate.

Table 1 presents the main results of language modeling. We evaluate the baseline and variants of our approach with and without the finetuning described in the baseline paper (Yang et al., 2017). In all settings, the models with BMI outperform the baseline, and BMI-full (with IW-RAML) yields further improvement on top of BMI-base (without IW-RAML).

Following Zhao et al. (2018), we use reverse perplexity to measure the diversity aspect of generation quality. We generate a chunk of text of M tokens from each model, train a second RNN language model (RNN-LM) on the generated text, then evaluate the perplexity of the held-out data from PTB and WikiText-2 under this second language model. Note that the second RNN-LM is a regular LM trained from scratch and used for evaluation only. As shown in Table 1, the models with the BMI regularizer improve the reverse perplexity over the baseline by a significant margin, indicating better generation diversity, which is to be expected as the MI regularizer encourages higher marginal entropy (in addition to lower conditional entropy).

Fig. 2 shows the learning curves of each model on both datasets after switching to ASGD, as mentioned earlier in the Experimental Setup. The validation perplexities of the BMI models decrease faster than that of the baseline AWD-LSTM-MoS. In addition, BMI-full is consistently better than BMI-base and can further decrease the perplexity after BMI-base and AWD-LSTM-MoS stop decreasing.
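As referenced above, here is a sketch of how the test function $D_\theta$ could be realized with the listed features. The class name is ours, and the hidden size default of 100 (one of the candidate values mentioned in the setup) is a placeholder; the released code may differ.

```python
import torch
import torch.nn as nn

class TestFunctionD(nn.Module):
    """One-layer feedforward test function D_theta on Conneau-style features
    [x, y, x - y, |x - y|, x * y]; `hidden` is a placeholder size."""

    def __init__(self, feat_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # scalar score per (X, Y) pair
        )

    def forward(self, phi_x, phi_y):
        feats = torch.cat(
            [phi_x, phi_y, phi_x - phi_y, (phi_x - phi_y).abs(), phi_x * phi_y],
            dim=-1)
        return self.net(feats).squeeze(-1)
```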
To verify that BMI indeed increases $I_Q$, we measure the sample MI of generated texts as well as of the training corpus. The MI of long sequence pairs cannot be directly computed from samples; we instead estimate lower bounds by learning evaluation discriminators $D_{\mathrm{eval}}$ on the generated text. $D_{\mathrm{eval}}$ is completely separate from the learned model, and is much smaller in size. We train the $D_{\mathrm{eval}}$'s using the proxy objective in Eq. 6 and early-stop based on the MINE lower bound (Eq. 4) on a validation set, then report the MINE bound value on the test set. This estimated lower bound essentially measures the degree of dependency. Table 2 shows that BMI generations exhibit higher MI than those of the baseline AWD-LSTM-MoS, while BMI-full improves over BMI-base.

Figure 2: Learning curves for validation perplexity on (a) PTB and (b) WT2 after the optimizer switch.

Fig. 3 compares the gradient variance under RL and IW-RAML on PTB. The gradient variance for each parameter is estimated over a number of iterations after the initial learning stops and switches to ASGD; the ratios of the variances of corresponding parameters are then aggregated into a histogram. For RL, we use the policy gradient with a self-critical baseline for variance reduction (Rennie et al., 2017). Only gradient contributions from the regularizers are measured, while the language model MLE objective is excluded.

Figure 3: Gradient variance ratio (RL / IW-RAML). The red dotted line indicates a ratio of 1, the green lines indicate ratios of 0.1 and 10, and the orange line indicates the average ratio of RL against IW-RAML.

The histogram shows that the RL variance is many times larger than that of IW-RAML on average, with almost all of the parameters having higher gradient variance under RL. A significant portion also has orders of magnitude higher variance under RL than under IW-RAML. For this reason, policy gradient RL did not contribute to learning when applied in Phase-II in our trials.

Conclusion

We have proposed a principled mutual information regularizer for improving long-range dependency in sequence modelling. The work also provides a more principled explanation for the next sentence prediction (NSP) heuristic, and improves on it with a method for directly maximizing the mutual information of sequence variables. Finally, driven by this new connection, a number of extensions are possible in future work: for example, encouraging high MI between the title, the first sentence of a paragraph, or the first sentence of an article, and the other sentences in the same context.
Acknowledgements
We thank all the anonymous reviewers for their valuableinputs.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. 2018. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Thomas M Cover and Joy A Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.

Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork RNN. arXiv preprint arXiv:1402.3511.

Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.

Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. 2018. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466.

Jingyun Liu, Jackie CK Cheung, and Annie Louis. 2019a. What comes next? Extractive summarization by next-sentence prediction. arXiv preprint arXiv:1901.03859.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

James Martens and Ilya Sutskever. 2011. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12(234-239):8.

Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pages 1723–1731.

Boris T Polyak and Anatoli B Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018a. Improving language understanding by generative pre-training. OpenAI Blog.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018b. Language models are unsupervised multitask learners. OpenAI Blog.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019. ERNIE 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.

Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. 2018. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144.

Alexandre B. Tsybakov. 2008. Introduction to Nonparametric Estimation, 1st edition. Springer Publishing Company, Incorporated.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2017. Topic compositional neural language model. arXiv preprint arXiv:1712.09783.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 2856–2864.

Peng Xu, Hamidreza Saghir, Jin Sung Kang, Teng Long, Avishek Joey Bose, Yanshuai Cao, and Jackie Chi Kit Cheung. 2019. A cross-domain transferable neural coherence model. In ACL.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129.

Jake Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning.

A Appendix
A.1 RAML Background
The key idea behind RAML is to observe that the entropy-regularized policy gradient RL objective $\mathcal{L}_{\mathrm{RL}}$ can be written as (up to a constant and scaling):

$$\mathcal{L}_{\mathrm{RL}} = \sum_{(X, Y^{\star}) \in \mathcal{D}} \mathrm{KL}\left( Q_\omega(Y \mid X) \,\|\, p^{\star}_{\beta}(Y \mid Y^{\star}) \right) \quad (18)$$

where $p^{\star}_{\beta}(Y \mid Y^{\star})$ is the exponentiated pay-off distribution defined as:

$$p^{\star}_{\beta}(Y \mid Y^{\star}) = \exp\{ r(Y, Y^{\star}) / \beta \} \big/ Z(Y^{\star}, \beta) \quad (19)$$

and $r(Y, Y^{\star})$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^{\star}$ (e.g., negative edit-distance). In RAML (Norouzi et al., 2016), one instead optimizes the KL in the reverse direction:

$$\mathcal{L}_{\mathrm{RAML}} = \sum_{(X, Y^{\star}) \in \mathcal{D}} \mathrm{KL}\left( p^{\star}_{\beta}(Y \mid Y^{\star}) \,\|\, Q_\omega(Y \mid X) \right) \quad (20)$$

It was shown that these two losses have the same global extremum, and that away from it their gap is bounded under some conditions (Norouzi et al., 2016). Compare the RAML gradient with the policy gradient:

$$\nabla \mathcal{L}_{\mathrm{RAML}} = -\mathbb{E}_{p^{\star}_{\beta}(Y \mid Y^{\star})}\left\{ \nabla \log Q_\omega(Y \mid X) \right\} \quad (21)$$
$$\nabla \mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{Q_\omega(Y \mid X)}\left\{ r(Y, Y^{\star}) \nabla \log Q_\omega(Y \mid X) \right\} \quad (22)$$

The RAML gradient samples from a stationary distribution, while the policy gradient samples from the changing $Q_\omega$ distribution. Furthermore, samples from $p^{\star}_{\beta}(Y \mid Y^{\star})$ have a higher chance of landing in configurations of high reward by definition, while samples from $Q_\omega(Y \mid X)$ rely on exploration.
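For a finite candidate set, the exponentiated pay-off distribution of Eq. 19 is just a temperature-scaled softmax over the rewards. A small self-contained example (illustrative, not from the paper's code):

```python
import math

def exp_payoff_distribution(rewards, beta):
    """p*_beta(Y | Y*) = exp(r(Y, Y*) / beta) / Z over a finite candidate set."""
    m = max(rewards)                                   # subtract max for stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    z = sum(weights)                                   # normalization constant Z
    return [w / z for w in weights]
```

For instance, rewards [0.0, -1.0, -2.0] with beta = 1.0 give approximately [0.665, 0.245, 0.090]; lowering beta concentrates the distribution on the highest-reward candidate.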