Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer
Yanshuai Cao∗   Peng Xu∗
Borealis AI
∗ Equal Contribution.
Abstract
In this work, we develop a novel regularizer to improve the learning of long-range dependency of sequence data. Applied to language modelling, our regularizer expresses the inductive bias that sequence variables should have high mutual information even though the model might not see abundant observations for complex long-range dependency. We show how the "next sentence prediction (classification)" heuristic can be derived in a principled way from our mutual information estimation framework, and be further extended to maximize the mutual information of sequence variables. The proposed approach not only is effective at increasing the mutual information of segments under the learned model but, more importantly, leads to a higher likelihood on holdout data and improved generation quality. Code is released at https://github.com/BorealisAI/BMI.

Introduction

Transformer-based large-scale pre-training (Devlin et al., 2018; Yang et al., 2019; Zhang et al., 2019; Sun et al., 2019; Radford et al., 2018a,b) has yielded impressive successes on many NLP tasks. Among the many components originally introduced by BERT (Devlin et al., 2018), the auxiliary task of next sentence prediction (NSP) is regarded as a heuristic: it is a binary classification task that distinguishes whether another sentence is the correct next sentence or one randomly sampled from the corpus. As an ad-hoc heuristic, NSP is often dropped by subsequent works on large-scale pre-training (Joshi et al., 2019; Liu et al., 2019b) based on empirical performance, but is picked up in other NLP problems (Xu et al., 2019; Liu et al., 2019a). This work explores a hidden connection of NSP to mutual information maximization, providing a more principled justification for those applications where NSP is used. The new insight is independent of the transformer architecture, and it allows us to design a new algorithm that shows additional improvements beyond NSP for RNN language modelling, in terms of learning long-range dependency.

Learning long-range dependency in sequential data such as text is challenging, and the difficulty has mostly been attributed to the vanishing gradient problem in autoregressive neural networks such as RNNs (Hochreiter et al., 2001). There is a vast literature trying to solve this gradient flow problem through better architecture (Hochreiter et al., 2001; Mikolov et al., 2014; Vaswani et al., 2017), better optimization (Martens and Sutskever, 2011) or better initialization (Le et al., 2015). There is, however, an orthogonal issue that has received less attention: statistical dependency over a short span is usually abundant in data, e.g., bigrams, common phrases and idioms, whereas long-range dependency typically involves more complex or abstract relationships among a large number of tokens (high-order interactions). In other words, there is a sampling mismatch between observations supporting local correlations and evidence for high-order interactions, while the latter requires more samples to learn from in the first place because they involve more variables.
We conjecture that in addition to the gradient flow issue, this sparse sampling of high-order statistical relations makes long-range dependency hard to learn in natural language processing. Take language modelling as an example: with a vocabulary of size $K$, the number of possible sequences grows as $K^m$ with sequence length $m$. Neural language models use distributed representations to overcome this issue (Bengio et al., 2003), as not all $K^m$ sequences form plausible natural language utterances, and there is shared semantics and compositionality across different texts. However, the parametrization does not change the fundamental fact that in the training data, there is an abundance of observations for local patterns but much sparser observations for the different high-level ideas. As language evolved to express the endless possibilities of the world, even among the set of "plausible" long sequences, a training set can only cover a small fraction. Therefore, there is an inherent imbalance of sampling between short- and long-range dependencies. Because it is a data sparsity issue at its core, it cannot be completely solved by better architecture or optimization.

The natural remedy facing limited data is to regularize the model using prior knowledge. In this work, we propose a novel approach for incorporating into the usual maximum likelihood objective the additional prior that long-range dependency exists in texts. We achieve this by bootstrapping a lower bound on the mutual information (MI) over groups of variables (segments or sentences) and subsequently applying the bound to encourage high MI. The first step of bootstrapping the lower bound is exactly the NSP task. Both the bootstrapping and the application of the bound improve long-range dependency learning: first, the bootstrap step helps the neural network's hidden representation recognize evidence for the high mutual information that exists in the data distribution; second, the information lower bound value, used as a reward, encourages the model distribution to exhibit high mutual information as well. We apply the proposed method to language modelling, although the general framework could apply to other problems as well.

Our work offers a new perspective on why the heuristic of next sentence prediction used in previous works (Trinh et al., 2018; Devlin et al., 2018) is a useful auxiliary task, while revealing missing ingredients, which we complete in the proposed algorithm. We demonstrate improved perplexity on two established benchmarks, reflecting the positive regularizing effect. We also show that our proposed method helps the model generate higher-quality samples with more diversity, measured by reversed perplexity (Zhao et al., 2018), and more dependency, measured by an empirical lower bound of mutual information.

A language model (LM) assigns a probability to a sequence of tokens (characters, bytes, or words). Let $\tau_i$ denote the token variables; an LM $Q$ factorizes the joint distribution of the $\tau_i$'s into a product of conditionals from left to right, leveraging the inherent order of text:

$$Q(\tau_1, \ldots, \tau_k) = \prod_{i=1}^{k} Q(\tau_i \mid \tau_{<i})$$
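To make this factorization concrete, the following minimal sketch scores a sequence by accumulating the log-probabilities of the left-to-right conditionals. It is an illustration rather than the paper's code; the `model` interface (a callable mapping a prefix of token ids to next-token logits) is our assumption, and the first token is treated as given context.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, tokens):
    """Score a sequence under Q(tau_1..tau_k) = prod_i Q(tau_i | tau_<i).

    `model` is assumed (hypothetically) to map a prefix of token ids,
    shape (1, t), to next-token logits of shape (1, vocab_size).
    """
    log_prob = 0.0
    for i in range(1, len(tokens)):
        prefix = torch.tensor(tokens[:i]).unsqueeze(0)  # tau_<i
        logits = model(prefix)                          # (1, vocab_size)
        log_q = F.log_softmax(logits, dim=-1)
        log_prob = log_prob + log_q[0, tokens[i]]       # log Q(tau_i | tau_<i)
    return log_prob
```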
Because the MINE bound holds for any parameters, we can instead use the binary classification form to optimize the parameters, similar to what we do for $I^{P}_{\theta,\omega}$ and as done in Hjelm et al. (2018). The proxy objective has the form:

$$\tilde{I}^{Q}_{\theta,\omega} = \mathbb{E}_{Q_{XY}}\left[ R^{+}_{\theta,\omega} \right] - \mathbb{E}_{Q_X \otimes Q_Y}\left[ R^{-}_{\theta,\omega} \right]$$

where

$$R^{+}_{\theta,\omega} = -\mathrm{SP}\left(-D_\theta(\phi^{X}_{\omega}, \phi^{Y}_{\omega})\right) \quad (7)$$
$$R^{-}_{\theta,\omega} = \mathrm{SP}\left(D_\theta(\phi^{X}_{\omega}, \phi^{Y}_{\omega})\right) \quad (8)$$

To optimize $\tilde{I}^{Q}_{\theta,\omega}$ with respect to $\zeta = (\theta, \omega)$, the gradient has two terms, $\nabla_\zeta \tilde{I}^{Q}_{\theta,\omega} = g_1 + g_2$, where

$$g_1 = \mathbb{E}_{Q_{XY}}\left[\nabla R^{+}_{\theta,\omega}\right] - \mathbb{E}_{Q_X \otimes Q_Y}\left[\nabla R^{-}_{\theta,\omega}\right] \quad (9)$$
$$g_2 = \mathbb{E}_{Q_{XY}}\left[ R^{+}_{\theta,\omega} \nabla \log Q_{XY} \right] - \mathbb{E}_{Q_X \otimes Q_Y}\left[ R^{-}_{\theta,\omega} \left(\nabla \log Q_X + \nabla \log Q_Y\right) \right] \quad (10)$$

$g_2$ uses the policy gradient (i.e., the likelihood ratio estimator) with $Q$ as the policy and $R^{+}$ and $R^{-}$ as the reward (and penalty). $g_2$ can be variance-reduced by control-variate methods, e.g., Rennie et al. (2017). However, deep RL is known to converge slowly due to high variance, and our trials confirm the difficulty in this particular case. Furthermore, sampling from $Q$ is generally slow for autoregressive models as it cannot be easily parallelized. These two issues compounded mean that we would like to avoid sampling from $Q$. To this end, we develop a modification of reward augmented maximum likelihood (RAML) (Norouzi et al., 2016), which avoids the high variance and the slow $Q$-sampling.

For the $g_1$ part (Eq. 9), if we simply replace the $Q$ distributions with $P$ in the expectations, we recover the Phase-I regularizer of Eq. 6, which we can use to approximate $g_1$. The bias of this approximation is:

$$\sum_{X,Y} \left( Q(X,Y) - P(X,Y) \right) \nabla R^{+} - \sum_{X,Y} \left( Q(X)Q(Y) - P(X)P(Y) \right) \nabla R^{-} \quad (11)$$

which becomes small as maximum likelihood learning progresses, because in both terms the total variation distance $\sum |Q - P|$ is bounded by $\sqrt{2\,\mathrm{KL}(P \| Q)}$ via Pinsker's inequality (Tsybakov, 2008).

RAML can be viewed as optimizing the reverse direction of the KL divergence compared to the entropy-regularized policy gradient RL objective. We leave the details of RAML to Appendix A.1 and refer readers to Norouzi et al. (2016). For our purpose here, the important facts are the RAML gradient and the policy gradient:

$$\nabla \mathcal{L}_{\mathrm{RAML}} = -\mathbb{E}_{p^{\star}_{\beta}(Y \mid Y^{\star})}\left\{ \nabla \log Q_\omega(Y \mid X) \right\} \quad (12)$$
$$\nabla \mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{Q_\omega(Y \mid X)}\left\{ r(Y, Y^{\star}) \nabla \log Q_\omega(Y \mid X) \right\} \quad (13)$$

where $p^{\star}_{\beta}(Y \mid Y^{\star})$ is the exponentiated pay-off distribution defined as:

$$p^{\star}_{\beta}(Y \mid Y^{\star}) = \exp\{ r(Y, Y^{\star}) / \beta \} \big/ Z(Y^{\star}, \beta) \quad (14)$$

and $r(Y, Y^{\star})$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^{\star}$ (e.g., negative edit-distance). The RAML gradient (Eq. 12) samples from a stationary distribution, while the policy gradient (Eq. 13) samples from the changing $Q_\omega$ distribution. Furthermore, by definition, samples from $p^{\star}_{\beta}(Y \mid Y^{\star})$ have a higher chance of high reward, while samples from $Q_\omega(Y \mid X)$ rely on exploration. For these reasons, RAML has much lower variance than RL.

Unfortunately, sampling from $p^{\star}_{\beta}(Y \mid Y^{\star})$ can only be done efficiently for some special classes of reward, such as the edit-distance used in Norouzi et al. (2016). Here, we would like to use the learned MI estimator, more specifically the classifier scores, as the reward. Assume $Y^{\star}$ is the sentence following $X$ in the corpus; then for any other $Y$, the reward is:

$$r(Y, Y^{\star}; X) = D_\theta(\phi^{X}_{\omega}, \phi^{Y}_{\omega}) - D_\theta(\phi^{X}_{\omega}, \phi_\omega(Y^{\star})) \quad (15)$$
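The binary classification form of the proxy objective is straightforward to implement. Below is a minimal PyTorch sketch (our own naming, not the released code), assuming `D` is the scalar-valued test function $D_\theta$ and the `phi_*` arguments are precomputed segment encodings $\phi_\omega$; maximizing the returned value is equivalent to minimizing the cross-entropy of the next-sentence classifier.

```python
import torch.nn.functional as F

def proxy_mi_objective(D, phi_x, phi_y, phi_y_neg):
    """Binary-classification proxy for the MI lower bound (Eqs. 7-8 form).

    (phi_x, phi_y) encode pairs drawn from the joint (consecutive segments);
    (phi_x, phi_y_neg) pair X with a randomly drawn segment (product of
    marginals). Maximizing this is the NSP classification objective.
    """
    r_pos = -F.softplus(-D(phi_x, phi_y))     # R+ = -SP(-D) = log sigmoid(D)
    r_neg = F.softplus(D(phi_x, phi_y_neg))   # R- =  SP(D)  = -log sigmoid(-D)
    return r_pos.mean() - r_neg.mean()
```

The same functional form gives the Phase-I regularizer when the pairs are drawn from the data distribution $P$ instead of the model $Q$.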
In the illustration of Fig. 1b, $X$ would be $S_{m-1}$, $Y^{\star} = S_m$ the segment that follows it, and another segment $Y = S_k$ is sampled to be evaluated. $Y$ could also be any other sentence/segment not in the dataset.

As the deep-neural-net-computed scores lack the simple structure of edit-distance that can be exploited for efficient sampling from $p^{\star}_{\beta}(Y \mid Y^{\star})$, direct application of RAML to the MI reward is not possible. We instead develop an efficient alternative based on importance sampling.

Intuitively, a sentence that is near $X$ in the text tends to be more related to it, and vice versa. Therefore, we can use a geometric distribution based at the index of $Y^{\star}$ as the proposal distribution, as illustrated in Fig. 1b. Let $Y^{\star}$ have sentence/segment index $m$; then

$$G(Y = S_k \mid Y^{\star} = S_m) = (1-\lambda)^{(k-m)} \lambda \quad (16)$$

where $\lambda$ is a hyperparameter (which we fix without tuning). Other proposals are also possible. With $G$ as the proposal, our importance-weighted RAML (IW-RAML) gradient is then:

$$\nabla \mathcal{L}_{\mathrm{RAML}} = -\mathbb{E}_{G}\left( \nabla \log Q_\omega(Y \mid X) \, p^{\star}_{\beta}(Y \mid Y^{\star}) \big/ G(Y \mid Y^{\star}) \right) \quad (17)$$

Because the reward in Eq. 15 is shift-standardized with respect to the discriminator score at $Y^{\star}$, we assume that the normalization constant $Z$ in Eq. 14 does not vary heavily across different $Y^{\star}$, so that we can perform self-normalized importance sampling by averaging across the mini-batches.

A side benefit of introducing $G$ is to re-establish the stationarity of the sampling distribution in the RAML gradient estimator. Because the reward function of Eq. 15 depends on $(\theta, \omega)$, the exponentiated pay-off distribution is no longer stationary as in the original RAML with a simple reward (Norouzi et al., 2016), but we regain stationarity through the fixed proposal $G$, keeping the variance low. Stationarity of the sampling distribution is one of the reasons for the lower variance of RAML.

Choosing IW-RAML over RL is a bias-variance trade-off. The RL objective gradient in Eqs. 9-10 is the unbiased one, while IW-RAML as introduced has a few biases: it uses the opposite direction of the KL divergence (analyzed in Norouzi et al. (2016)), and the support of $G$ is smaller than that of $p^{\star}_{\beta}(Y \mid Y^{\star})$. Each of these approximations introduces some bias, but the overall variance is significantly reduced, as the empirical analysis in Sec. 5.3 shows.
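With self-normalized importance sampling, Eq. 17 reduces, per mini-batch, to a softmax over $\log p^{\star}_{\beta} - \log G$. The sketch below shows one way this could look in PyTorch; the names are illustrative, and detaching the weights (treating them as constants for the gradient, even though the reward depends on $(\theta, \omega)$) is our simplifying assumption.

```python
import math
import torch

def iw_raml_loss(log_q, rewards, offsets, beta, lam):
    """Self-normalized importance-weighted RAML surrogate for Eq. 17.

    log_q   : (M,) tensor, log Q_omega(Y_i | X_i) for samples Y_i ~ G
    rewards : (M,) tensor, shift-standardized rewards r(Y_i, Y*_i; X_i) (Eq. 15)
    offsets : (M,) tensor, index gaps (k - m) between Y_i = S_k and Y*_i = S_m
    """
    log_p_star = rewards / beta                            # log p*_beta up to log Z
    log_g = offsets * math.log(1.0 - lam) + math.log(lam)  # log G(Y|Y*), Eq. 16
    w = torch.softmax(log_p_star - log_g, dim=0)           # self-normalized IS weights
    return -(w.detach() * log_q).sum()                     # gradient mirrors Eq. 17
```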
Algorithm 1: Language Model Learning with the BMI regularizer

Input: batch size $M$, dataset $\Omega$, proposal distribution $G$, maximum number of iterations $N$.

phase-two := false
for itr = 1, ..., N do
  1. Compute the LM objective $\mathcal{L}_{\mathrm{MLE}}(\omega)$ from Eq. 1 and its gradient;
  2. Sample a mini-batch of consecutive sentences $\{(X_i, Y_i)\}_{i=1}^{M}$ from $\Omega$ as samples from $P_{XY}$; sample another mini-batch of $\{Y^{-}_i\}_{i=1}^{M}$ from $\Omega$ to form $\{(X_i, Y^{-}_i)\}_{i=1}^{M}$ as samples from $P_X \otimes P_Y$; extract the features $\phi^{X}_{\omega}$, $\phi^{Y}_{\omega}$ and $\phi^{Y^{-}}_{\omega}$ and compute $\tilde{I}^{P}_{\theta,\omega}$ according to Eq. 6 and its gradient;
  3. if phase-two then sample a mini-batch of $\{\tilde{Y}_i\}_{i=1}^{M}$ from $\Omega$ according to $G$, each with corresponding $Y^{\star} = Y_i$, and compute the IW-RAML gradients according to Eq. 17, with $Y^{\star} = Y_i$, $Y = \tilde{Y}_i$, and $X = X_i$; end if
  Add the gradient contributions from steps 1, 2 and 3 and update the parameters $\omega$ and $\theta$.
  if not phase-two and the switch condition is met then phase-two := true end if
end for
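For concreteness, a condensed sketch of this training loop is given below. It reuses `proxy_mi_objective` and `iw_raml_loss` from the earlier sketches; `batch_fn` and `proposal_fn` are hypothetical stand-ins for the data pipeline, and the plain SGD optimizer and iteration-count switch condition are placeholders, not the paper's actual choices.

```python
import torch

def train_bmi(batch_fn, proposal_fn, D, params, n_iters, switch_iter,
              reg_weight, beta, lam):
    """Condensed Algorithm 1 (a sketch, not the released code)."""
    opt = torch.optim.SGD(params, lr=0.1)      # placeholder optimizer / lr
    phase_two = False
    for it in range(n_iters):
        # Steps 1-2: MLE loss plus Phase-I MI proxy on data samples.
        nll, phi_x, phi_y, phi_y_neg = batch_fn()
        loss = nll - reg_weight * proxy_mi_objective(D, phi_x, phi_y, phi_y_neg)
        if phase_two:                          # Step 3: IW-RAML term (Eq. 17)
            log_q, rewards, offsets = proposal_fn()
            loss = loss + reg_weight * iw_raml_loss(log_q, rewards, offsets,
                                                    beta, lam)
        opt.zero_grad()
        loss.backward()                        # update omega and theta jointly
        opt.step()
        if not phase_two and it + 1 >= switch_iter:  # one possible switch condition
            phase_two = True
```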
Related Work

Long-Range Dependency and Gradient Flow

Capturing long-range dependency has been a major challenge in sequence learning. Most works have focused on the gradient flow in backpropagation through time (BPTT). The LSTM architecture (Hochreiter and Schmidhuber, 1997) was invented to address the very problem of vanishing and exploding gradients in RNNs (Hochreiter et al., 2001). There is a vast literature on improving the gradient flow with new architectural modifications or regularization (Mikolov et al., 2014; Koutnik et al., 2014; Wu et al., 2016; Li et al., 2018). Seq-to-seq models with attention or memory (Bahdanau et al., 2014; Cho et al., 2015; Sukhbaatar et al., 2015; Joulin and Mikolov, 2015) are a major neural architecture advance that improves the gradient flow by shortening the path that relevant information needs to traverse in the computation graph. The recent invention of the Transformer architecture (Vaswani et al., 2017), and the subsequent large-scale pre-training successes (Devlin et al., 2018; Radford et al., 2018a,b), are further examples of better architectures improving gradient flow.
Regularization via Auxiliary Tasks
Closer to our method are works that use auxiliary prediction tasks as regularization (Trinh et al., 2018; Devlin et al., 2018). Trinh et al. (2018) use an auxiliary task of predicting some random future or past subsequence with a reconstruction loss. Their focus is still on vanishing/exploding gradients and issues caused by BPTT. Their method is justified empirically, and it is unclear whether the auxiliary task losses are compatible with the maximum likelihood objective of language modelling, on which they did not experiment. Devlin et al. (2018) add a "next sentence prediction" task to their masked language model objective, which tries to classify whether a sentence is the correct next one or randomly sampled. This task is the same as our Phase-I for learning the lower bound $I^{P}_{\theta,\omega}$, but we are the first to draw the theoretical connection to mutual information, explaining its regularization effect on the model (Sec. 3.1.1), and applying the bootstrapped MI bound for more direct regularization in Phase-II is completely novel to our method.

Language Modeling with Extra Context
Modeling long-range dependency is crucial for language models, since capturing the larger context effectively can help predict the next token. In order to capture this dependency, some works feed an additional representation of the larger context into the network, including an additional block, document- or corpus-level topic, or discourse information (Mikolov and Zweig, 2012; Wang and Cho, 2015; Dieng et al., 2016; Wang et al., 2017). Our work is orthogonal to them and can be combined with them.
Experiments

We experiment on two widely-used benchmarks for word-level language modeling, Penn Treebank (PTB) (Mikolov and Zweig, 2012) and WikiText-2 (WT2) (Merity et al., 2016). We choose the recent state-of-the-art model among RNN-based models on these two benchmarks, AWD-LSTM-MoS (Yang et al., 2017), as our baseline. We compare the baseline against the same model with variants of our proposed Bootstrapping Mutual Information (BMI) regularizer: (1) BMI-base: apply Phase-I throughout training; and (2)
BMI-full: apply Phase-I until a sufficiently good $D_\theta$ is learned, then switch on Phase-II, adding the IW-RAML gradient.

Table 1: Perplexity and reverse perplexity on PTB and WT2.

                       PTB                              WT2
                   PPL          Reverse PPL         PPL          Reverse PPL
Model              Valid  Test  Valid  Test         Valid  Test  Valid  Test
AWD-LSTM-MoS         –     –      –     –             –     –      –     –
BMI-base             –     –      –     –             –     –      –     –
BMI-full           56.85  54.65  78.46  73.73       63.86  61.37  90.20  85.11
AWD-LSTM-MoS (ft.)   –     –      –     –             –     –      –     –
BMI-base (ft.)       –     –      –     –             –     –      –     –
BMI-full (ft.)     55.61  53.67  75.81  71.81       62.99  60.51  88.27  83.43
Table 2: Estimated MI (lower bounds) of X and Y, two random segments of fixed length separated by a fixed number of tokens. Estimations use cross-validation and testing.

Generations        PTB          WT2
Real data          – ± –        – ± –
AWD-LSTM-MoS       – ± –        – ± –
BMI-base           – ± –        – ± –
BMI-full           – ± –        – ± –

Experimental Setup
We apply max-pooling over the hidden states of all the layers in the LSTM and concatenate them as our $\phi_\omega$-encoding. For the test function $D_\theta$ we use a one-layer feedforward network on features similar to Conneau et al. (2017), namely $[\phi^{X}_{\omega}, \phi^{Y}_{\omega}, \phi^{X}_{\omega} - \phi^{Y}_{\omega}, |\phi^{X}_{\omega} - \phi^{Y}_{\omega}|, \phi^{X}_{\omega} * \phi^{Y}_{\omega}]$ (see the code sketch following the results discussion below). The ADAM optimizer (Kingma and Ba, 2014) with a small learning rate and weight decay is applied to $\theta$, while $\omega$ is optimized in the same way as in Merity et al. (2017) and Yang et al. (2017), with SGD followed by ASGD (Polyak and Juditsky, 1992). All the above hyperparameters are chosen by validation perplexity on PTB and applied directly to WT2. The weight of the regularizer term is chosen by validation perplexity on each respective dataset. The remaining architecture and hyperparameters follow exactly the code released by Yang et al. (2017). As mentioned previously, we set the temperature hyperparameter $\beta$ in RAML and the $\lambda$ hyperparameter of the importance-sampling proposal $G$ without tuning. All experiments are conducted on single 1080Ti GPUs with PyTorch. We manually tune the following hyperparameters based on validation perplexity: the BMI regularizer weight, the $D_\theta$ hidden state size, and the Adam learning rate.

Table 1 presents the main results of language modeling. We evaluate the baseline and variants of our approach with and without the finetuning described in the baseline paper (Yang et al., 2017). In all settings, the models with BMI outperform the baseline, and BMI-full (with IW-RAML) yields further improvement on top of BMI-base (without IW-RAML).

Following Zhao et al. (2018), we use reverse perplexity to measure the diversity aspect of generation quality. We generate a chunk of text of M tokens from each model, train a second RNN language model (RNN-LM) on the generated text, then evaluate the perplexity of the held-out data from PTB and WikiText-2 under this second language model. Note that the second RNN-LM is a regular LM trained from scratch and used for evaluation only. As shown in Table 1, the models with the BMI regularizer improve the reverse perplexity over the baseline by a significant margin, indicating better generation diversity, which is to be expected as the MI regularizer encourages higher marginal entropy (in addition to lower conditional entropy).

Fig. 2 shows the learning curves of each model on both datasets after switching to ASGD, as mentioned earlier in the Experimental Setup. The validation perplexities of the BMI models decrease faster than that of the baseline AWD-LSTM-MoS. In addition, BMI-full is consistently better than BMI-base and can further decrease the perplexity after BMI-base and AWD-LSTM-MoS stop decreasing.
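As referenced above, here is a sketch of how the test function $D_\theta$ could be realized with the listed features. The class name is ours, and the hidden size default of 100 (one of the candidate values mentioned in the setup) is a placeholder; the released code may differ.

```python
import torch
import torch.nn as nn

class TestFunctionD(nn.Module):
    """One-layer feedforward test function D_theta on Conneau-style features
    [x, y, x - y, |x - y|, x * y]; `hidden` is a placeholder size."""

    def __init__(self, feat_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # scalar score per (X, Y) pair
        )

    def forward(self, phi_x, phi_y):
        feats = torch.cat(
            [phi_x, phi_y, phi_x - phi_y, (phi_x - phi_y).abs(), phi_x * phi_y],
            dim=-1)
        return self.net(feats).squeeze(-1)
```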
To verify that BMI indeed increases $I_Q$, we measure the sample MI of generated texts as well as of the training corpus. The MI of long sequence pairs cannot be directly computed from samples; we instead estimate lower bounds by learning evaluation discriminators $D_{\mathrm{eval}}$ on the generated text. $D_{\mathrm{eval}}$ is completely separate from the learned model, and is much smaller in size. We train the $D_{\mathrm{eval}}$'s using the proxy objective in Eq. 6 and early-stop based on the MINE lower bound (Eq. 4) on a validation set, then report the MINE bound value on the test set. This estimated lower bound essentially measures the degree of dependency. Table 2 shows that BMI generations exhibit higher MI than those of the baseline AWD-LSTM-MoS, while BMI-full improves over BMI-base.

Figure 2: Learning curves for validation perplexity on (a) PTB and (b) WT2 after the optimizer switch.

Fig. 3 compares the gradient variance under RL and IW-RAML on PTB. The gradient variance for each parameter is estimated over a number of iterations after the initial learning stops and switches to ASGD; the ratios of the variances of corresponding parameters are then aggregated into a histogram. For RL, we use the policy gradient with a self-critical baseline for variance reduction (Rennie et al., 2017). Only gradient contributions from the regularizers are measured, while the language model MLE objective is excluded.

Figure 3: Gradient variance ratio (RL / IW-RAML). The red dotted line indicates a ratio of 1, the green lines indicate ratios of 0.1 and 10, and the orange line indicates the average ratio of RL against IW-RAML.

The histogram shows that the RL variance is many times larger than that of IW-RAML on average, with almost all of the parameters having higher gradient variance under RL. A significant portion also has orders of magnitude higher variance under RL than under IW-RAML. For this reason, policy gradient RL did not contribute to learning when applied in Phase-II in our trials.

Conclusion

We have proposed a principled mutual information regularizer for improving long-range dependency in sequence modelling. The work also provides a more principled explanation for the next sentence prediction (NSP) heuristic, and improves on it with a method for directly maximizing the mutual information of sequence variables. Finally, driven by this new connection, a number of extensions are possible in future work: for example, encouraging high MI between the title, the first sentence of a paragraph, or the first sentence of an article, and the other sentences in the same context.
Acknowledgements
We thank all the anonymous reviewers for their valuableinputs.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. 2018. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Thomas M Cover and Joy A Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.

Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork RNN. arXiv preprint arXiv:1402.3511.

Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.

Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. 2018. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466.

Jingyun Liu, Jackie CK Cheung, and Annie Louis. 2019a. What comes next? Extractive summarization by next-sentence prediction. arXiv preprint arXiv:1901.03859.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

James Martens and Ilya Sutskever. 2011. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1033–1040. Citeseer.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12(234-239):8.

Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. 2016. Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems, pages 1723–1731.

Boris T Polyak and Anatoli B Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018a. Improving language understanding by generative pre-training. OpenAI Blog.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018b. Language models are unsupervised multitask learners. OpenAI Blog.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019. ERNIE 2.0: A continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412.

Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. 2018. Learning longer-term dependencies in RNNs with auxiliary losses. arXiv preprint arXiv:1803.00144.

Alexandre B. Tsybakov. 2008. Introduction to Nonparametric Estimation, 1st edition. Springer Publishing Company, Incorporated.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2017. Topic compositional neural language model. arXiv preprint arXiv:1712.09783.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 2856–2864.

Peng Xu, Hamidreza Saghir, Jin Sung Kang, Teng Long, Avishek Joey Bose, Yanshuai Cao, and Jackie Chi Kit Cheung. 2019. A cross-domain transferable neural coherence model. In ACL.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129.

Jake Zhao, Yoon Kim, Kelly Zhang, Alexander Rush, and Yann LeCun. 2018. Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning.

A Appendix
A.1 RAML Background
The key idea behind RAML is to observe that the entropy-regularized policy gradient RL objective $\mathcal{L}_{\mathrm{RL}}$ can be written as (up to a constant and scaling):

$$\mathcal{L}_{\mathrm{RL}} = \sum_{(X, Y^{\star}) \in \mathcal{D}} \mathrm{KL}\left( Q_\omega(Y \mid X) \,\|\, p^{\star}_{\beta}(Y \mid Y^{\star}) \right) \quad (18)$$

where $p^{\star}_{\beta}(Y \mid Y^{\star})$ is the exponentiated pay-off distribution defined as:

$$p^{\star}_{\beta}(Y \mid Y^{\star}) = \exp\{ r(Y, Y^{\star}) / \beta \} \big/ Z(Y^{\star}, \beta) \quad (19)$$

and $r(Y, Y^{\star})$ is a reward function that measures some similarity of $Y$ with respect to the ground truth $Y^{\star}$ (e.g., negative edit-distance). In RAML (Norouzi et al., 2016), one instead optimizes the KL in the reverse direction:

$$\mathcal{L}_{\mathrm{RAML}} = \sum_{(X, Y^{\star}) \in \mathcal{D}} \mathrm{KL}\left( p^{\star}_{\beta}(Y \mid Y^{\star}) \,\|\, Q_\omega(Y \mid X) \right) \quad (20)$$

It was shown that these two losses have the same global extremum, and that away from it their gap is bounded under some conditions (Norouzi et al., 2016). Compare the RAML gradient with the policy gradient:

$$\nabla \mathcal{L}_{\mathrm{RAML}} = -\mathbb{E}_{p^{\star}_{\beta}(Y \mid Y^{\star})}\left\{ \nabla \log Q_\omega(Y \mid X) \right\} \quad (21)$$
$$\nabla \mathcal{L}_{\mathrm{RL}} = -\mathbb{E}_{Q_\omega(Y \mid X)}\left\{ r(Y, Y^{\star}) \nabla \log Q_\omega(Y \mid X) \right\} \quad (22)$$

The RAML gradient samples from a stationary distribution, while the policy gradient samples from the changing $Q_\omega$ distribution. Furthermore, samples from $p^{\star}_{\beta}(Y \mid Y^{\star})$ have a higher chance of landing in configurations of high reward by definition, while samples from $Q_\omega(Y \mid X)$ rely on exploration.
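For a finite candidate set, the exponentiated pay-off distribution of Eq. 19 is just a temperature-scaled softmax over the rewards. A small self-contained example (illustrative, not from the paper's code):

```python
import math

def exp_payoff_distribution(rewards, beta):
    """p*_beta(Y | Y*) = exp(r(Y, Y*) / beta) / Z over a finite candidate set."""
    m = max(rewards)                                   # subtract max for stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    z = sum(weights)                                   # normalization constant Z
    return [w / z for w in weights]
```

For instance, rewards [0.0, -1.0, -2.0] with beta = 1.0 give approximately [0.665, 0.245, 0.090]; lowering beta concentrates the distribution on the highest-reward candidate.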