Moment Matching Training for Neural Machine Translation: A Preliminary Study
Cong Duy Vu Hoang∗
Computing and Information Systems, University of Melbourne, Australia
[email protected]
Ioan Calapodescu and Marc Dymetman
NAVER Labs Europe, Grenoble, France
{first.last}@naverlabs.com

∗ Work done during an internship at NAVER Labs Europe, Grenoble, France.
Preprint. Work in progress.
Abstract
In previous works, neural sequence models have been shown to improve significantly if external prior knowledge can be provided, for instance by allowing the model to access the embeddings of explicit features during both training and inference. In this work, we propose a different point of view on how to incorporate prior knowledge in a principled way, using a moment matching framework. In this approach, the standard local cross-entropy training of the sequential model is combined with a moment matching training mode that encourages the equality of the expectations of certain predefined features between the model distribution and the empirical distribution. In particular, we show how to derive unbiased estimates of some stochastic gradients that are central to the training, and compare our framework with a formally related one: policy gradient training in reinforcement learning, pointing out some important differences in terms of the kinds of prior assumptions in both approaches. Our initial results are promising, showing the effectiveness of our proposed framework.
1 Introduction

Standard training of neural sequence-to-sequence (seq2seq) models requires the construction of a cross-entropy loss (Sutskever et al., 2014; Lipton et al., 2015). This loss normally operates at the level of generating individual tokens in the target sequence, and hence potentially suffers from label or observation bias (Wiseman and Rush, 2016; Pereyra et al., 2017). Thus, it may be difficult for neural seq2seq models to capture semantics at the sequence level. This can be detrimental when the generated sequence is missing or lacking some desired properties, for example: avoiding repetitions; preserving the consistency of the length ratio between source and target; scoring well on external evaluation measures such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) in summarisation and translation tasks, respectively; or avoiding omissions and additions of semantic material in natural language generation. Such sequence properties may be associated with prior knowledge about the sequences the model aims to generate. However, directly augmenting the cross-entropy loss with such sequence-level constraints is intractable.

In order to inject such prior knowledge into seq2seq models, methods from reinforcement learning (RL) (Sutton and Barto, 1998) emerge as reasonable choices. In principle, RL is a general-purpose framework for sequential decision making processes. In RL, an agent interacts with an environment $E$ over a certain number of discrete timesteps (Sutton and Barto, 1998). The ultimate goal of the agent is to select actions according to a policy $\pi$ that maximises a future cumulative reward. This reward is the objective function of RL guided by the policy $\pi$, and is defined specifically for the application task. In the case of seq2seq models, the action of choosing the next word prediction is guided by a stochastic policy and receives a task-specific, real-valued reward. The agent tries to maximise the expected reward over $T$ timesteps, e.g., $R = \sum_{t=1}^{T} r_t$. The idea of RL has recently been applied to a variety of neural seq2seq tasks. For instance, Ranzato et al. (2015) applied this idea to abstractive summarisation with neural seq2seq models, using the ROUGE evaluation measure (Lin, 2004) as a reward. Similar successes have been achieved for neural machine translation, e.g., (Ranzato et al., 2015; He et al., 2016, 2017). Ranzato et al. (2015) and He et al. (2017) used the BLEU score (Papineni et al., 2002) as a reward function in their RL setups, whereas He et al. (2016) used a reward interpolating the probabilistic scores from reverse translation and language models.

The main motivation of moment matching (MM) is to inject prior knowledge into the model in a way that takes the properties of whole sequences into consideration. We aim to develop a generic method that is applicable to any seq2seq model. Inspired by the method of moments in statistics (https://en.wikipedia.org/wiki/Method_of_moments_(statistics)), we propose the following moment matching approach. The underlying idea of moment matching is to seek optimal parameters reconciling two distributions, namely: one over the samples generated by the model, and another one over the empirical data. These distributions are compared on generated sequences as a whole, via the use of feature functions (or constraints) that one would like to behave similarly between the two distributions, based on the encoding of prior knowledge about the sequences.
It is worth noting that the proposed moment matching technique is not stand-alone, but is to be used in alternation or combination with standard cross-entropy training. This is similar to the way RL is typically applied in seq2seq models (Ranzato et al., 2015). Here, we first discuss some important differences with RL; we then present the details of how the MM technique works in the next sections.

The first difference is that RL assumes that one has defined some reward function $R$, which is done quite independently of what the training data tells us. By contrast, MM only assumes that one has defined certain features that are deemed important for the task; one then relies on the actual training data to tell us how to use these features. One could say that the "arbitrariness" in MM is just in the choice of the features to focus on, while the arbitrariness in RL is that we want the model to get a good reward, even if that reward is not connected to the training data at all. Suppose that we are in the context of NLG and are trying to reconcile several objectives at the same time, such as (1) avoiding omissions of semantic material, (2) avoiding additions of semantic material, and (3) avoiding repetitions (Agarwal and Dymetman, 2017). In general, in order to address this kind of problem in an RL framework, we need to "invent" a reward function based on certain computable features of the model outputs, which in particular means inventing a formula for combining the different objectives we have in mind into a single real number. This can be a rather arbitrary process, and it does not guarantee any fit with the actual training data. The point of MM is that the only arbitrariness is in choosing the features to focus on; after that, it is the actual training data that tells us what should be done.

The second difference is that RL tries to maximise a reward, and is only sensitive to the rewards of individual instances, while MM tries to maximise the fit of the model distribution with the empirical distribution, where the fit is on specific features. This difference is especially clear in the case of language modelling, where RL will try to find a model that is strongly peaked on the $x$ which has the strongest reward (assuming no ties in the rewards), while MM will try to find a distribution over $x$ which has certain properties in common with the empirical distribution, e.g., for generating diverse outputs. For language modelling, RL is a strange method, because language modelling requires the model to be able to produce different outputs. For MT, the situation is a bit less clear, in case one wanted to argue that for each source sentence there is a single best translation; but in principle the observation also holds for MT, which is a conditional language model.

In this section, we will describe our formulation of moment matching for seq2seq modeling in detail.

2.1 Moment Matching for Sequence to Sequence Models

Recall the sequence-to-sequence problem, whose goal is to generate an output sequence given an input sequence. In the context of neural machine translation, which is our main focus here, the input sequence is a source-language sentence, and the output sequence is a target-language sentence. Suppose that we are modeling the target sequence $y = y_1, \ldots, y_t, \ldots, y_{|y|}$ given a source sequence $x = x_1, \ldots, x_t, \ldots, x_{|x|}$, using a sequential process $p_\Theta(y \mid x)$.
This sequential process can be implemented via a neural mechanism, e.g., recurrent neural networks within an (attentional) encoder-decoder framework (Bahdanau et al., 2015) or a transformer framework (Vaswani et al., 2017). Regardless of its implementation, such a neural mechanism depends on model parameters $\Theta$.

Our proposal is that we would like this sequential process to satisfy some moment constraints. Such moment constraints can be modeled based on features that encode prior (or external) knowledge or semantics about the generated target sentence. Mathematically, features can be represented through vectors, e.g., $\Phi(y|x) \equiv (\phi_1(y|x), \ldots, \phi_j(y|x), \ldots, \phi_m(y|x))$, where $\phi_j(y|x)$ is the $j$-th conditional feature function of a target sequence $y$ given a source sequence $x$, and $m$ is the number of features or moment constraints. As a simple example, a moment feature for controlling the length of a target sequence would just return the number of elements in that target sequence.

In order to incorporate such constraints into the seq2seq learning process, we introduce a new objective function, namely the moment matching loss $J_{MM}$. Generally speaking, given a vector of features $\Phi(y|x)$, the goal of the moment matching loss is to encourage the identity of the model average estimate

$$\hat{\Phi}_n(\Theta) \equiv \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}\left[\Phi(y|x_n)\right] \quad \text{with the empirical average estimate} \quad \bar{\Phi}_n \equiv \mathbb{E}_{y \sim p_D(\cdot|x_n)}\left[\Phi(y|x_n)\right], \tag{1}$$

where $D$ is the training data, $x, y \in D$ are source and target sequences, respectively, and $n$ is the data index in $D$. This can be formulated as minimising a squared distance between the two estimates with respect to the model parameters $\Theta$:

$$J_{MM}(\Theta) := \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{\Phi}_n(\Theta) - \bar{\Phi}_n \right\|_2^2 = \frac{1}{N} \sum_{n=1}^{N} \left\| \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}\left[\Phi(y|x_n)\right] - \mathbb{E}_{y \sim p_D(\cdot|x_n)}\left[\Phi(y|x_n)\right] \right\|_2^2. \tag{2}$$

To elaborate, $\hat{\Phi}_n(\Theta)$ is the model average estimate over samples drawn i.i.d. from the model distribution $p_\Theta(\cdot|x_n)$ given the source sequence $x_n$, and $\bar{\Phi}_n$ is the empirical average estimate given the $n$-th training instance, where our data are drawn i.i.d. from the empirical distribution $p_D(\cdot|x)$.
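To make Equations 1 and 2 concrete, here is a minimal sketch (our illustration, not code from the paper): a toy single-component feature $\Phi$, an assumed model_sampler callable standing in for drawing from $p_\Theta(\cdot|x)$, and a Monte Carlo estimate of $J_{MM}$ with one reference per source.

```python
import numpy as np

def phi(y, x):
    """Toy feature vector Phi(y|x): here a single length-ratio moment.
    In the paper's setting Phi can contain m arbitrary features."""
    return np.array([len(y) / max(len(x), 1)])

def mm_loss(model_sampler, data, K=5, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of J_MM(Theta) in Eq. (2).

    model_sampler(x, rng) -> one target sequence sampled from p_Theta(.|x)
    data -> list of (x_n, y_n) pairs; y_n gives the empirical average (Eq. 1).
    """
    total = 0.0
    for x_n, y_n in data:
        # Model average estimate: mean feature over K samples from p_Theta(.|x_n)
        phi_hat = np.mean([phi(model_sampler(x_n, rng), x_n) for _ in range(K)], axis=0)
        # Empirical average estimate: with a single reference, just Phi(y_n|x_n)
        phi_bar = phi(y_n, x_n)
        total += np.sum((phi_hat - phi_bar) ** 2)
    return total / len(data)

# Toy usage: a "model" that emits token lists of random length
toy_model = lambda x, rng: ["w"] * int(rng.integers(1, 2 * len(x) + 1))
print(mm_loss(toy_model, [("a b c".split(), "x y z w".split())]))
```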
We now show how to compute the gradient of $J_{MM}$ in Equation 2, denoted $\nabla_\Theta J_{MM}$, which will be required for optimisation. We first define

$$\Gamma_{\Theta,n} \equiv \nabla_\Theta \left( \|\Delta_n\|_2^2 \right), \quad \text{where } \Delta_n \equiv \hat{\Phi}_n(\Theta) - \bar{\Phi}_n;$$

then the gradient $\nabla_\Theta J_{MM}$ can be computed as:

$$\nabla_\Theta J_{MM} = \frac{1}{N} \sum_n \Gamma_{\Theta,n}. \tag{3}$$

Next, we need to proceed with the computation of $\Gamma_{\Theta,n}$. By derivation, we have the following:

$$\Gamma_{\Theta,n} = 2 \sum_y p_\Theta(y|x_n) \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta(y|x_n) = 2\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)} \left[ \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta(y|x_n) \right]. \tag{4}$$

Proof. Mathematically, $\Gamma_{\Theta,n}$ is the gradient of the composition $F \circ G$ of the two functions $F(\cdot) = \|\cdot\|_2^2 : \mathbb{R}^m \to \mathbb{R}$ and $G(\cdot) = \hat{\Phi}_n(\cdot) - \bar{\Phi}_n : \mathbb{R}^{|\Theta|} \to \mathbb{R}^m$. Noting that the gradient $\nabla_\Theta (F \circ G)$ is equal to the Jacobian $J_{F \circ G}[\Theta]$, and applying the chain rule for Jacobians, we have:

$$J_{F \circ G}[\Theta] = \left( \nabla_\Theta \left\| \hat{\Phi}_n(\Theta) - \bar{\Phi}_n \right\|_2^2 \right)[\Theta] = J_F[G(\Theta)] \cdot J_G[\Theta]. \tag{5}$$

Next, we need to compute $J_F[G(\Theta)]$ and $J_G[\Theta]$ in Equation 5. First, we have:

$$J_F[G(\Theta)] = 2 \left( \hat{\Phi}_n(\Theta) - \bar{\Phi}_n \right)^{\top} = 2 \left( \hat{\phi}_{n,1}(\Theta) - \bar{\phi}_{n,1}, \ldots, \hat{\phi}_{n,j}(\Theta) - \bar{\phi}_{n,j}, \ldots, \hat{\phi}_{n,m}(\Theta) - \bar{\phi}_{n,m} \right), \tag{6}$$

where $\hat{\Phi}_n(\Theta)$ and $\bar{\Phi}_n$ are vectors of size $m$. We also have the $m \times |\Theta|$ matrix:

$$J_G[\Theta] = \begin{pmatrix} \frac{\partial\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}[\phi_1(y|x_n) - \bar{\phi}_{n,1}]}{\partial \Theta_1} & \cdots & \frac{\partial\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}[\phi_1(y|x_n) - \bar{\phi}_{n,1}]}{\partial \Theta_{|\Theta|}} \\ \vdots & \ddots & \vdots \\ \frac{\partial\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}[\phi_m(y|x_n) - \bar{\phi}_{n,m}]}{\partial \Theta_1} & \cdots & \frac{\partial\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}[\phi_m(y|x_n) - \bar{\phi}_{n,m}]}{\partial \Theta_{|\Theta|}} \end{pmatrix} = M. \tag{7}$$

A key part of these identities in Equation 7 is the value of the entry $M_{ji} = \frac{\partial\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}[\phi_j(y|x_n) - \bar{\phi}_{n,j}]}{\partial \Theta_i}$, which can be expressed as:

$$M_{ji} = \frac{\partial \sum_y p_\Theta(y|x_n) \left( \phi_j(y|x_n) - \bar{\phi}_{n,j} \right)}{\partial \Theta_i} = \sum_y \left( \phi_j(y|x_n) - \bar{\phi}_{n,j} \right) \frac{\partial\, p_\Theta(y|x_n)}{\partial \Theta_i}. \tag{8}$$

Next, using the well-known "log-derivative trick",

$$p_\Theta(y|x)\, \frac{\partial \log p_\Theta(y|x)}{\partial \Theta_i} = \frac{\partial\, p_\Theta(y|x)}{\partial \Theta_i},$$

familiar from the policy gradient technique in reinforcement learning (Sutton et al., 2000), we can rewrite Equation 8 as follows:

$$M_{ji} = \sum_y \left( \phi_j(y|x_n) - \bar{\phi}_{n,j} \right) p_\Theta(y|x_n)\, \frac{\partial \log p_\Theta(y|x_n)}{\partial \Theta_i} = \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)} \left[ \left( \phi_j(y|x_n) - \bar{\phi}_{n,j} \right) \frac{\partial \log p_\Theta(y|x_n)}{\partial \Theta_i} \right]. \tag{9}$$

Combining Equations 8 and 9, we obtain the computation of $J_G[\Theta]$. Note that the expectation $\mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}[\cdot]$ is easy to sample from, and the gradient $\frac{\partial \log p_\Theta(y|x_n)}{\partial \Theta_i}$ is easy to evaluate as well.
Since we now have both $J_F[G(\Theta)]$ and $J_G[\Theta]$, we can finalise the gradient computation $\Gamma_{\Theta,n}$ as follows:

$$\begin{aligned} \Gamma_{\Theta,n} &= 2 \left( \hat{\phi}_{n,1}(\Theta) - \bar{\phi}_{n,1}, \ldots, \hat{\phi}_{n,m}(\Theta) - \bar{\phi}_{n,m} \right) \cdot M \\ &= 2 \sum_y p_\Theta(y|x_n) \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta(y|x_n) \\ &= 2\, \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)} \left[ \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta(y|x_n) \right], \end{aligned}$$

where the second equality moves the sum over $y$ and the factor $p_\Theta(y|x_n)$ (from Equation 9) outside, and collects the $m$ coordinates into an inner product. This establishes the computation of $\Gamma_{\Theta,n}$, which is the central formula of the proposed moment matching technique.

Based on Equation 4, and ignoring the constant factor, we can use as our gradient update, for each pair $\left(x_n,\; y^{(j)} \sim p_\Theta(\cdot|x_n)\right)$ ($j \in [1, J]$), the value

$$\underbrace{\left\langle \hat{\Phi}_{n,y}(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle}_{\text{multiplicative score}}\; \underbrace{\nabla_\Theta \log p_\Theta(y|x_n)}_{\text{standard gradient update}},$$

where $\bar{\Phi}_n$, the empirical average of $\Phi(\cdot|x_n)$, can be estimated through the observed value $y_n$, i.e., $\bar{\Phi}_n \simeq \Phi(y_n|x_n)$.

Note that the above gradient update has a very close connection to RL with the policy gradient method (Sutton et al., 2000), where the "multiplicative score" plays a role similar to the reward $R(y|x)$. However, unlike RL training with a predefined reward, in MM training the multiplicative score does depend on the model parameters $\Theta$, and looks at what the empirical data tells the model via explicit prior features. Table 1 compares the three methods, namely CE, RL with policy gradient (PG), and our MM proposal, for neural seq2seq models in both the unconditional (e.g., language modelling) and conditional (e.g., NMT, summarisation) cases.

| Method | Formulation | Note |
| Unconditional case | | |
| CE | $\nabla_\Theta \log p_\Theta(y)$ | $y \sim D$ |
| RL w/ PG | $R(y)\, \nabla_\Theta \log p_\Theta(y)$ | $y \sim p_\Theta(\cdot)$ |
| MM | $\langle \hat{\Phi}(\Theta) - \bar{\Phi},\; \Phi(y) - \bar{\Phi} \rangle\, \nabla_\Theta \log p_\Theta(y)$ | $y \sim p_\Theta(\cdot)$ |
| Conditional case | | |
| CE | $\nabla_\Theta \log p_\Theta(y|x)$ | $x, y \sim D$ |
| RL w/ PG | $R(y)\, \nabla_\Theta \log p_\Theta(y|x)$ | $x \sim D,\; y \sim p_\Theta(\cdot|x)$ |
| MM | $\langle \hat{\Phi}(\Theta) - \bar{\Phi},\; \Phi(y|x) - \bar{\Phi} \rangle\, \nabla_\Theta \log p_\Theta(y|x)$ | $x \sim D,\; y \sim p_\Theta(\cdot|x)$ |

Table 1: Comparing different methods for training seq2seq models. CE denotes cross-entropy, RL reinforcement learning, PG policy gradient, and MM moment matching.

We have derived the gradient of the moment matching loss, as shown in Equation 4. In order to compute it, we still need to evaluate two estimates, namely the model average estimate $\hat{\Phi}_n(\Theta)$ and the empirical average estimate $\bar{\Phi}_n$.
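Before turning to these two estimates, it is worth sanity-checking the sign of the multiplicative score on a single scalar feature. The following worked example is ours, not the paper's: take $\Phi(y|x) = |y|$, the target length, so that $m = 1$. The multiplicative score then reduces to a product of two scalar deviations:

$$\left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle = \left( \hat{\ell}_n(\Theta) - \bar{\ell}_n \right) \left( |y| - \bar{\ell}_n \right),$$

where $\hat{\ell}_n(\Theta)$ is the average length of the model's samples and $\bar{\ell}_n$ the reference length. If the model currently over-generates on average ($\hat{\ell}_n > \bar{\ell}_n$), samples longer than the reference receive a positive score, and the descent update on $J_{MM}$ lowers their log-probability, while shorter-than-reference samples receive a negative score and are encouraged; the update vanishes once $\hat{\ell}_n = \bar{\ell}_n$, i.e., once the moments match. This is precisely the sense in which the multiplicative score behaves like a reward that adapts to both the current model and the data.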
Empirical Average Estimate. First, we need to estimate the empirical average $\bar{\Phi}_n$. In the general case, given a source sequence $x_n$, suppose there are multiple target sequences $y \in \mathcal{Y}$ associated with $x_n$; then $\bar{\Phi}_n \equiv \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \Phi(y|x_n)$. In particular, when we have only one reference sequence $y_n$ per source sequence $x_n$, then $\bar{\Phi}_n \equiv \Phi(y_n|x_n)$, which is the standard case in the context of neural machine translation training.
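As a one-function sketch (ours; phi is any feature function as above, names illustrative), covering both the multi-reference and the standard single-reference case:

```python
import numpy as np

def empirical_average(phi, x_n, references):
    """Phi_bar_n = (1/|Y|) * sum over references y in Y of Phi(y|x_n).
    With a single reference this reduces to Phi(y_n|x_n), the standard NMT case."""
    return np.mean([phi(y, x_n) for y in references], axis=0)
```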
Model Average Estimate. In practice, it is impossible to obtain a full computation of $\Gamma_{\Theta,n}$, due to the intractable search over $y$. Therefore, we resort to estimating $\Gamma_{\Theta,n}$ by a sampling process. There are several possible options for doing this.

The simplistic approach would be the following. First, estimate the model average $\hat{\Phi}_n(\Theta)$ by sampling $y^{(1)}, y^{(2)}, \ldots, y^{(K)}$ and computing:

$$\hat{\Phi}_n(\Theta) \approx \frac{1}{K} \sum_{k \in [1,K]} \Phi\left(y^{(k)}|x_n\right).$$

Next, estimate the expectation $\mathbb{E}_{y \sim p_\Theta}$ in Equation 4 by independently sampling $J$ second values of $y$, and then estimate:

$$\Gamma_{\Theta,n} \approx \frac{1}{J} \sum_{j \in [1,J]} \left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi\left(y^{(j)}|x_n\right) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta\left(y^{(j)}|x_n\right).$$

Note that the two sample sets are separate (one may take $J = K$). This would provide an unbiased estimate of $\Gamma_{\Theta,n}$, but at the cost of producing two independent sample sets of sizes $K$ and $J$, used for two different purposes, which would be computationally wasteful.

A more economical approach might consist in using the same sample set of size $J$ for both purposes. However, this would produce a biased estimate of $\Gamma_{\Theta,n}$. This can be illustrated by considering the case $J = 1$. In this case, the dot product $\left\langle \hat{\Phi}_n(\Theta) - \bar{\Phi}_n,\; \Phi(y^{(1)}|x_n) - \bar{\Phi}_n \right\rangle$ in $\Gamma_{\Theta,n}$ is always non-negative (and strictly positive unless $\Phi(y^{(1)}|x_n) = \bar{\Phi}_n$), since $\hat{\Phi}_n(\Theta)$ is then equal to $\Phi(y^{(1)}|x_n)$; hence, in this case, the current sample $y^{(1)}$ would be systematically discouraged by the model.

Here, we propose a better approach, resulting in an unbiased estimate of $\Gamma_{\Theta,n}$, formulated as follows. First, we sample $J$ values of $y$: $y^{(1)}, y^{(2)}, \ldots, y^{(J)}$, with $J \geq 2$; then:

$$\Gamma_{\Theta,n} \approx \frac{1}{J} \sum_{j \in [1,J]} \left\langle \hat{\Phi}_{n,j}(\Theta) - \bar{\Phi}_n,\; \Phi\left(y^{(j)}|x_n\right) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta\left(y^{(j)}|x_n\right), \tag{10}$$

where

$$\hat{\Phi}_{n,j}(\Theta) \equiv \frac{1}{J-1} \sum_{j' \in [1,J],\, j' \neq j} \Phi\left(y^{(j')}|x_n\right). \tag{11}$$

We can then prove that this computation provides an unbiased estimate of $\Gamma_{\Theta,n}$ (see the proof in the Appendix). Note that here we have exploited the same $J$ samples for both purposes, but have taken care not to exploit the exact same $y^{(j)}$ for both, akin to a Jackknife resampling estimator (https://en.wikipedia.org/wiki/Jackknife_resampling).
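The estimator in Equations 10 and 11 is straightforward to implement. The sketch below is our illustration, not the paper's code: phi, phi_bar, and grad_log_p are assumed callables/values, with grad_log_p(y, x) standing in for $\nabla_\Theta \log p_\Theta(y|x)$ as a flat vector.

```python
import numpy as np

def mm_gradient_estimate(samples, x_n, phi, phi_bar, grad_log_p):
    """Unbiased estimate of Gamma_{Theta,n} via Eqs. (10)-(11).

    samples   : J >= 2 sequences drawn i.i.d. from p_Theta(.|x_n)
    phi       : feature function, phi(y, x) -> vector of size m
    phi_bar   : empirical average feature vector (Phi_bar_n)
    grad_log_p: maps (y, x) to nabla_Theta log p_Theta(y|x), a flat vector
    """
    J = len(samples)
    assert J >= 2, "the leave-one-out estimate needs at least two samples"
    feats = np.stack([phi(y, x_n) for y in samples])      # J x m feature matrix
    total = feats.sum(axis=0)
    grad = 0.0
    for j, y_j in enumerate(samples):
        phi_hat_nj = (total - feats[j]) / (J - 1)         # Eq. (11): excludes y_j itself
        score = np.dot(phi_hat_nj - phi_bar, feats[j] - phi_bar)
        grad = grad + score * grad_log_p(y_j, x_n)        # Eq. (10) summand
    return grad / J
```

Excluding the current sample from its own model-average estimate is what removes the systematic self-discouragement described above for the naive shared-sample variant.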
For simplicity, we will consider here the "unconditional case", i.e., $p(\cdot)$ instead of $p(\cdot|x)$, but the conditional case follows easily.

Lemma 2.1. Let $p(\cdot)$ be a probability distribution over $y$, let $\zeta(y)$ be any function of $y$, and let $\Phi(y)$ be a feature vector over $y$. We wish to compute the quantity $A = \mathbb{E}_{y \sim p(\cdot)} \left[ \langle \hat{\Phi}, \Phi(y) \rangle\, \zeta(y) \right]$, where $\hat{\Phi} \equiv \mathbb{E}_{y \sim p(\cdot)}[\Phi(y)]$. Let us sample $J$ sequences $y^{(1)}, \ldots, y^{(J)}$ (where $J$ is a predefined number of generated samples) independently from $p(\cdot)$, and let us compute:

$$B\left(y^{(1)}, \ldots, y^{(J)}\right) \equiv \frac{1}{J} \left[ \left\langle \tilde{\Phi}^{(-1)}, \Phi\left(y^{(1)}\right) \right\rangle \zeta\left(y^{(1)}\right) + \ldots + \left\langle \tilde{\Phi}^{(-J)}, \Phi\left(y^{(J)}\right) \right\rangle \zeta\left(y^{(J)}\right) \right],$$

where $\tilde{\Phi}^{(-i)}$ is formulated as:

$$\tilde{\Phi}^{(-i)} \equiv \frac{1}{J-1} \left[ \Phi\left(y^{(1)}\right) + \ldots + \Phi\left(y^{(i-1)}\right) + \Phi\left(y^{(i+1)}\right) + \ldots + \Phi\left(y^{(J)}\right) \right].$$

Then we have:

$$A = \mathbb{E}_{\{y^{(i)}\}_{i=1}^{J} \overset{iid}{\sim} p(\cdot)} \left[ B\left(y^{(1)}, \ldots, y^{(J)}\right) \right];$$

in other words, $B(y^{(1)}, \ldots, y^{(J)})$ provides an unbiased estimate of $A$.

Proof. See the Appendix.

To ground this lemma in our problem setting, consider the case where $p = p_\Theta$ and $\zeta(y) = \nabla_\Theta \log p_\Theta(y)$; then the quantity $A = \mathbb{E}_{y \sim p(\cdot)}\left[ \langle \hat{\Phi}, \Phi(y) \rangle\, \zeta(y) \right]$ is equal to the overall gradient of the MM loss, for a given value of the model parameters $\Theta$ (by formula (4) obtained earlier, and up to a constant factor). We would like to obtain an unbiased stochastic estimator of this gradient; in other words, we want an unbiased estimator of the quantity $A$. By Lemma 2.1, $A$ is equal to the expectation of $B(y^{(1)}, \ldots, y^{(J)})$, where $y^{(1)}, \ldots, y^{(J)}$ are drawn i.i.d. from the distribution $p$. In other words, if we sample one set of $J$ samples from $p$ and compute $B(y^{(1)}, \ldots, y^{(J)})$ on this set, then we obtain an unbiased estimate of $A$. As a result, we obtain an unbiased estimate of the gradient of the overall MM loss, which is exactly what we need.

In principle, therefore, we need to first sample $y^{(1)}, \ldots, y^{(J)}$ and compute

$$B\left(y^{(1)}, \ldots, y^{(J)}\right) \equiv \frac{1}{J} \left[ \left\langle \tilde{\Phi}^{(-1)}, \Phi\left(y^{(1)}\right) \right\rangle \nabla_\Theta \log p_\Theta\left(y^{(1)}\right) + \ldots + \left\langle \tilde{\Phi}^{(-J)}, \Phi\left(y^{(J)}\right) \right\rangle \nabla_\Theta \log p_\Theta\left(y^{(J)}\right) \right],$$

and then use this quantity as our stochastic gradient. In practice, what we do is to first sample $y^{(1)}, \ldots, y^{(J)}$, and then use the components of the sum,

$$\left\langle \tilde{\Phi}^{(-j)}, \Phi\left(y^{(j)}\right) \right\rangle \nabla_\Theta \log p_\Theta\left(y^{(j)}\right),$$

as our individual stochastic gradients. Note that this computation differs from the original one by a constant factor $\frac{1}{J}$, which can be accounted for by adjusting the learning rate.
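Lemma 2.1 is also easy to check numerically on a small discrete distribution. The sketch below (our illustration, not from the paper; the distribution, features, and $\zeta$ values are arbitrary) compares the exact quantity $A$ with the Monte Carlo average of the leave-one-out estimator $B$, and with a naive estimator that reuses the full sample mean and is therefore biased.

```python
import numpy as np

rng = np.random.default_rng(1)
# A small discrete distribution over 3 "sequences", with arbitrary
# feature vectors Phi(y) and an arbitrary scalar function zeta(y).
p    = np.array([0.2, 0.5, 0.3])
Phi  = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
zeta = np.array([0.5, -1.0, 2.0])

Phi_hat = p @ Phi                                    # E[Phi(y)]
A = np.sum(p * (Phi @ Phi_hat) * zeta)               # E[<Phi_hat, Phi(y)> zeta(y)]

J, runs = 4, 100000
B_loo, B_naive = 0.0, 0.0
for _ in range(runs):
    idx = rng.choice(3, size=J, p=p)                 # y^(1..J) i.i.d. from p
    F = Phi[idx]
    total = F.sum(axis=0)
    loo = (total[None, :] - F) / (J - 1)             # Phi_tilde^(-i) from Lemma 2.1
    B_loo   += np.mean(np.sum(loo * F, axis=1) * zeta[idx])
    naive    = total / J                             # reuses every sample: biased
    B_naive += np.mean((F @ naive) * zeta[idx])
print(A, B_loo / runs, B_naive / runs)
```

On this example the leave-one-out average converges to $A$ (exactly $-0.17$ here), while the naive average settles a visible distance away (about $-0.30$), illustrating why Equations 10 and 11 exclude the current sample.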
Algorithm 1: General algorithm for training with the moment matching technique.

Input: a pre-trained model $\Theta$; parallel training data $D$; a balancing factor $\lambda$ for the interpolation training mode, if used.
for step = 1, ..., M do   ▷ M is the maximum number of steps
    Select a batch of N source and target sequences X and Y from D.
    if MM mode is required then
        Sample J translations for the batch of source sequences X.   ▷ Random sampling has to be used.
        Compute the total MM gradients $\Gamma^{MM}_{\Theta,n} \equiv \mathbb{E}_{y \sim p_\Theta(\cdot|x_n)}\left[ \left\langle \hat{\Phi}_{n,y}(\Theta) - \bar{\Phi}_n,\; \Phi(y|x_n) - \bar{\Phi}_n \right\rangle \nabla_\Theta \log p_\Theta(y|x_n) \right]$ according to Equations 10 and 11.
    if alternation mode then
        if MM mode then
            Update the model parameters according to the MM gradients $\Gamma^{MM}_{\Theta,n}$ with SGD.
        else if CE mode then
            Update the model parameters according to the standard CE gradients $\Gamma^{CE}_{\Theta,n} \equiv \mathbb{E}_{x \sim X;\, y \sim Y}\left[ \nabla_\Theta \log p_\Theta(y|x) \right]$ with SGD, as usual.
    else   ▷ interpolation mode
        Compute the standard CE gradients $\Gamma^{CE}_{\Theta,n} \equiv \mathbb{E}_{x \sim X;\, y \sim Y}\left[ \nabla_\Theta \log p_\Theta(y|x) \right]$.
        Update the model parameters according to $\Gamma^{interp}_{\Theta,n} \equiv \Gamma^{CE}_{\Theta,n} + \lambda\, \Gamma^{MM}_{\Theta,n}$.
    Every few steps, save the model parameters w.r.t. the best score based on $J^{dev}_{MM}$ in Equation 12.
return the newly-trained model $\Theta_{new}$

Recall that the goal of our technique is to preserve certain aspects of generated target sequences according to prior knowledge. In principle, the technique does not teach the model how to generate a proper target sequence based on the given source sequence (Ranzato et al., 2015). For that reason, it has to be used along with standard CE training of the seq2seq model. In order to train the seq2seq model with the proposed technique, we suggest one of two training modes: alternation and interpolation. In the alternation mode, the seq2seq model is trained alternately with the CE loss and the moment matching loss; more specifically, the model is first trained with the CE loss for some iterations, then switches to the moment matching loss, and vice versa. In the interpolation mode, the model is trained with an interpolated objective combining the two losses, with an additional hyper-parameter balancing them. In summary, the general technique is described in Algorithm 1 above.

After some iterations of the algorithm, we can approximate $J_{MM}$ over the development data (or sampled training data) through:

$$J^{dev}_{MM}(\Theta) \approx \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{\Phi}^{approx}_n(\Theta) - \bar{\Phi}_n \right\|_2^2. \tag{12}$$

We expect $J_{MM}$ to decrease over iterations, potentially improving the explicit evaluation measure(s), e.g., BLEU (Papineni et al., 2002) in NMT.
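A condensed sketch of one interpolation-mode step of Algorithm 1, in PyTorch-style Python (our illustration under assumptions: model.log_prob, model.sample, the global features function, and the optimiser are stand-ins, not the paper's actual Transformer-DyNet implementation):

```python
import torch

def mm_training_step(model, batch, optimizer, J=5, lam=0.5):
    """One interpolation-mode step: CE loss + lambda * MM surrogate loss.

    Minimising score.detach() * log p(y_j|x) reproduces the per-sample
    MM stochastic gradient <Phi_hat_{n,j} - Phi_bar, Phi(y_j) - Phi_bar>
    * grad log p (Eqs. 10-11), since the score is treated as a constant.
    """
    xs, ys = batch
    ce_loss = -model.log_prob(ys, xs).mean()             # standard CE term

    mm_loss = 0.0
    for x_n, y_n in zip(xs, ys):
        samples = [model.sample(x_n) for _ in range(J)]  # random sampling, not beam
        feats = torch.stack([features(y, x_n) for y in samples])  # J x m
        phi_bar = features(y_n, x_n)                     # single-reference estimate
        total = feats.sum(dim=0)
        for j, y_j in enumerate(samples):
            phi_hat_nj = (total - feats[j]) / (J - 1)    # leave-one-out model average
            score = torch.dot(phi_hat_nj - phi_bar, feats[j] - phi_bar)
            mm_loss = mm_loss + score.detach() * model.log_prob(y_j, x_n)
    mm_loss = mm_loss / (len(xs) * J)

    loss = ce_loss + lam * mm_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Treating the score as a constant via detach() is what makes a single backward pass reproduce the policy-gradient-style update of Equations 10 and 11; the alternation mode would simply apply the two loss terms in separate steps instead of summing them.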
Maximum Mean Discrepancies (MMD). Our MM approach is related to the technique of Maximum Mean Discrepancy, which has been successfully applied in computer vision, e.g., as an alternative to learning generative adversarial networks (Li et al., 2015, 2017). MMD is a way of measuring the discrepancy between two distributions (for example, the empirical distribution and the model distribution) based on kernel similarities. Such kernels could potentially be useful in the long term for extending our approach, which can be seen as using a simple linear kernel over our predefined features, but in the specific context of seq2seq models, and in tandem with a generative process based on an auto-regressive model.
The Method of Moments. Recently, Ravuri et al. (2018) proposed using a moment matching technique in situations where maximum likelihood is difficult to apply. A strong difference from our use of MM is that they define feature functions parameterised by some parameters and let these be learned along with the model parameters. In effect, they apply the method of moments to situations in which ML (maximum likelihood, or CE) is not applicable, but where MM can find the correct model distribution on its own; hence their focus on having (and learning) a large number of features, because only many features allow approximating the actual distribution. In our case, we do not rely on MM to model the target distribution on its own. Doing so with a small number of features would be doomed (e.g., think of the length-ratio feature: it would only guarantee that the translation has a correct length, irrespective of the lexical content). We use MM to complement ML, in such a way that task-related important features are attended to, even if that means obtaining a (slightly) worse likelihood (or perplexity) on the training set. One can in fact see our use of MM as a form of regularization complementing MLE training, and this is an important aspect of our proposed MM approach.
In order to validate the proposed technique, we re-applied two prior features used for training NMT in Zhang et al. (2017): the source-target length ratio and lexical bilingual features. Zhang et al. (2017) showed in their experiments that these two are the most effective features for improving NMT systems.

The first feature is straightforward, simply measuring the ratio between source and target lengths. It aims at forcing the model to produce translations with a consistent length ratio between source and target sentences, so that too short or too long translations are avoided. Given respective source and target sequences $x$ and $y$, we define the length-ratio feature function $\Phi_{\text{len-ratio}}$ as follows:

$$\Phi_{\text{len-ratio}} := \begin{cases} \frac{\beta \cdot |x|}{|y|} & \text{if } \beta \cdot |x| < |y| \\ \frac{|y|}{\beta \cdot |x|} & \text{otherwise}, \end{cases} \tag{13}$$

where $\beta$ is an additional hyper-parameter, normally set empirically based on prior knowledge about the source and target languages. In this case, the feature function is a real value.

The second feature is based on a word-to-word lexical translation dictionary produced by an off-the-shelf SMT system, e.g., Moses (https://github.com/moses-smt/mosesdecoder). The goal of this feature is to ask the model to take external lexical translations into consideration. It is potentially useful in cases such as the translation of rare words, and in low-resource settings in which parallel data can be scarce (NMT has been empirically found to be less robust in such settings than SMT). Following Zhang et al. (2017), we define the sparse feature functions $\Phi_{\text{bd}} \equiv \left[ \phi_{\langle w_x, w_y \rangle} \right]_{\langle w_x, w_y \rangle \in D_{\text{lex}}}$, where

$$\phi_{\langle w_x, w_y \rangle} := \begin{cases} 1 & \text{if } w_x \in x \wedge w_y \in y \\ 0 & \text{otherwise}, \end{cases}$$

and where $D_{\text{lex}}$ is a lexical translation dictionary produced by Moses.
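To make these two prior features concrete, here is a sketch in Python (our illustration; the tokenisation, the set-based dictionary format, and the example sentence pair are simplified assumptions, not the paper's implementation):

```python
def length_ratio_feature(x, y, beta=1.0):
    """Eq. (13): a score in (0, 1] that is maximal when |y| = beta * |x|."""
    if beta * len(x) < len(y):
        return (beta * len(x)) / len(y)
    return len(y) / (beta * len(x))

def bilingual_dict_features(x, y, lex_dict):
    """Sparse features: 1 for each dictionary pair (w_x, w_y) with
    w_x appearing in the source and w_y in the target, else 0."""
    xs, ys = set(x), set(y)
    return {(wx, wy): 1.0 for (wx, wy) in lex_dict if wx in xs and wy in ys}

src = "the cat sat".split()
tgt = "con mèo ngồi".split()
print(length_ratio_feature(src, tgt, beta=1.0))
print(bilingual_dict_features(src, tgt, {("cat", "mèo"), ("dog", "chó")}))
```

As in Equation 13, the length-ratio feature equals 1 exactly when $|y| = \beta \cdot |x|$ and decays towards 0 as the lengths diverge, so matching its moments pushes sampled translations towards the reference length behaviour.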
We proceed to validate the proposed technique with small-scale experiments. We used the IWSLT'15 dataset, translating from English to Vietnamese. This dataset is relatively small, containing approximately 133K sentence pairs for training, 1.5K for development, and 1.3K for testing. We re-implemented the transformer architecture (Vaswani et al., 2017) in our open-source toolkit, Transformer-DyNet (https://github.com/duyvuleo/Transformer-DyNet), for training our NMT models, with the following hyper-parameters: 4 encoder and 4 decoder layers, hidden dimension 512, and dropout probability 0.1 throughout the network. For the sampling process, we generated 5 samples for each moment matching training step. We used the interpolation training mode with a balancing hyper-parameter of 0.5; changing this hyper-parameter only slightly affects the overall result. For the length-ratio feature, we used the length factor $\beta = 1$. For the bilingual lexical dictionary feature, we extracted the dictionary with Moses's training scripts; in this dictionary, we filtered out bad entries based on the word alignment probabilities produced by Moses, removing entries with probability below 0.5, following Zhang et al. (2017).

| System | BLEU | MM Loss |
| tensor2tensor (Vaswani et al., 2017) | 27.69 | |
| base (our reimplementation, Transformer-DyNet) | 28.53 | 0.0094808 |
| base+mm† | | |

Table 2: Evaluation scores for training moment matching with the length ratio between source and target sequences; bold: statistically significantly better than the baselines; †: best performance on the dataset.

| System | BLEU | MM Loss |
| tensor2tensor (Vaswani et al., 2017) | 27.69 | |
| base (our reimplementation, Transformer-DyNet) | 28.53 | 0.7384 |
| base+mm† | | |

Table 3: Evaluation scores for training moment matching with the bilingual lexical dictionary; bold: statistically significantly better than the baselines; †: best performance on the dataset.

Our results can be found in Tables 2 and 3. As can be seen from the tables, as the model reduced the moment matching loss, the BLEU scores (Papineni et al., 2002) improved in a statistically significant way (Koehn, 2004). This was consistent across both experiments, an encouraging validation of our proposed training technique with moment matching.

We have shown some useful mathematical properties of the proposed moment matching training technique (in particular, the unbiasedness of its stochastic gradient estimates) and believe it is promising. Our initial experiments indicate its potential for improving existing NMT systems using simple prior features. Future work includes exploiting more advanced features for improving NMT and evaluating the proposed technique on larger-scale datasets.
Acknowledgments
Cong Duy Vu Hoang would like to thank NAVER Labs Europe for supporting his internship, and Reza Haffari and Trevor Cohn for their insightful discussions. Marc Dymetman wishes to thank Eric Gaussier and Shubham Agarwal for early discussions on the topic of moment matching.
References
Shubham Agarwal and Marc Dymetman. 2017. A surprisingly effective out-of-the-box char2char model on the E2E NLG Challenge dataset. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, pages 158–163.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the 3rd International Conference on Learning Representations (ICLR 2015).

Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tieyan Liu. 2017. Decoding with value networks for neural machine translation. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pages 178–187.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., pages 820–828. http://papers.nips.cc/paper/6469-dual-learning-for-machine-translation.pdf.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Barcelona, Spain, pages 388–395.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. 2017. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pages 2203–2213.

Yujia Li, Kevin Swersky, and Richard Zemel. 2015. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15). JMLR.org, pages 1718–1727.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Zachary C. Lipton, John Berkowitz, and Charles Elkan. 2015. A critical review of recurrent neural networks for sequence learning. ArXiv e-prints.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02). Association for Computational Linguistics, Stroudsburg, PA, USA, pages 311–318. https://doi.org/10.3115/1073083.1073135.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. CoRR abs/1701.06548. http://arxiv.org/abs/1701.06548.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732. http://arxiv.org/abs/1511.06732.

Suman V. Ravuri, Shakir Mohamed, Mihaela Rosca, and Oriol Vinyals. 2018. Learning implicit generative models with the method of learned moments. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, pages 4311–4320. http://proceedings.mlr.press/v80/ravuri18a.html.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14). MIT Press, Cambridge, MA, USA, pages 3104–3112.

Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. IEEE Transactions on Neural Networks.

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12. MIT Press, pages 1057–1063.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pages 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1296–1306. https://aclweb.org/anthology/D16-1137.

Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 1514–1523. https://doi.org/10.18653/v1/P17-1139.

Appendix - Proof of Unbiasedness
Proof.
Let us define:

$$E = \mathbb{E}_{\{y^{(i)}\}_{i=1}^{J} \overset{iid}{\sim} p(\cdot)} \left[ B\left(y^{(1)}, \ldots, y^{(J)}\right) \right] = \sum_{y^{(1)}, \ldots, y^{(J)}} p\left(y^{(1)}\right) \cdots p\left(y^{(J)}\right)\; \frac{1}{J} \left[ \left\langle \tilde{\Phi}^{(-1)}, \Phi\left(y^{(1)}\right) \right\rangle \zeta\left(y^{(1)}\right) + \ldots + \left\langle \tilde{\Phi}^{(-J)}, \Phi\left(y^{(J)}\right) \right\rangle \zeta\left(y^{(J)}\right) \right] = \frac{1}{J} \sum_{i \in [1,J]} \sum_{y^{(1)}, \ldots, y^{(J)}} p\left(y^{(1)}\right) \cdots p\left(y^{(J)}\right) \left\langle \tilde{\Phi}^{(-i)}, \Phi\left(y^{(i)}\right) \right\rangle \zeta\left(y^{(i)}\right). \tag{14}$$

For a given value of $i$, we have:

$$\begin{aligned} & \sum_{y^{(1)}, \ldots, y^{(J)}} p\left(y^{(1)}\right) \cdots p\left(y^{(J)}\right) \left\langle \tilde{\Phi}^{(-i)}, \Phi\left(y^{(i)}\right) \right\rangle \zeta\left(y^{(i)}\right) \\ &= \sum_{y^{(i)}} p\left(y^{(i)}\right) \zeta\left(y^{(i)}\right) \sum_{y^{(1)}, \ldots, y^{(i-1)}, y^{(i+1)}, \ldots, y^{(J)}} p\left(y^{(1)}\right) \cdots p\left(y^{(i-1)}\right) p\left(y^{(i+1)}\right) \cdots p\left(y^{(J)}\right)\; \frac{1}{J-1} \left\langle \sum_{j \neq i} \Phi\left(y^{(j)}\right),\; \Phi\left(y^{(i)}\right) \right\rangle \\ &= \sum_{y^{(i)}} p\left(y^{(i)}\right) \zeta\left(y^{(i)}\right)\; \frac{1}{J-1} \underbrace{\left[ \left\langle \hat{\Phi}, \Phi\left(y^{(i)}\right) \right\rangle + \ldots + \left\langle \hat{\Phi}, \Phi\left(y^{(i)}\right) \right\rangle \right]}_{J-1 \text{ times}} \\ &= \sum_{y^{(i)}} p\left(y^{(i)}\right) \zeta\left(y^{(i)}\right) \left\langle \hat{\Phi}, \Phi\left(y^{(i)}\right) \right\rangle \\ &= \mathbb{E}_{y \sim p(\cdot)} \left[ \left\langle \hat{\Phi}, \Phi(y) \right\rangle \zeta(y) \right], \end{aligned} \tag{15}$$

where the second equality uses the fact that, for each $j \neq i$, the inner sum marginalises out all samples except $y^{(j)}$, leaving $\sum_{y^{(j)}} p(y^{(j)}) \langle \Phi(y^{(j)}), \Phi(y^{(i)}) \rangle = \langle \hat{\Phi}, \Phi(y^{(i)}) \rangle$.

Finally, by collecting the results for the $J$ values of the index $i$, we obtain:

$$E = \frac{1}{J} \cdot J \cdot \mathbb{E}_{y \sim p(\cdot)} \left[ \left\langle \hat{\Phi}, \Phi(y) \right\rangle \zeta(y) \right] = \mathbb{E}_{y \sim p(\cdot)} \left[ \left\langle \hat{\Phi}, \Phi(y) \right\rangle \zeta(y) \right] = A. \;\blacksquare$$