Exploring Supervised and Unsupervised Rewards in Machine Translation
Julia Ive, Zixu Wang, Marina Fomicheva, Lucia Specia
Imperial College London, University of Sheffield, ADAPT - Dublin City University
[email protected], [email protected], [email protected], [email protected]

Abstract
Reinforcement Learning (RL) is a powerful framework to address the discrepancy between loss functions used during training and the final evaluation metrics to be used at test time. When applied to neural Machine Translation (MT), it minimises the mismatch between the cross-entropy loss and non-differentiable evaluation metrics like BLEU. However, the suitability of these metrics as reward functions at training time is questionable: they tend to be sparse and biased towards the specific words used in the reference texts. We propose to address this problem by making models less reliant on such metrics in two ways: (a) with an entropy-regularised RL method that does not only maximise a reward function but also explores the action space to avoid peaky distributions; (b) with a novel RL method that explores a dynamic unsupervised reward function to balance between exploration and exploitation. We base our proposals on the Soft Actor-Critic (SAC) framework, adapting the off-policy maximum entropy model for language generation applications such as MT. We demonstrate that SAC with a BLEU reward tends to overfit less to the training data and performs better on out-of-domain data. We also show that our dynamic unsupervised reward can lead to better translation of ambiguous words.
1 Introduction

Autoregressive sequence-to-sequence (seq2seq) neural architectures have become the de facto approach in Machine Translation (MT). Such models include Recurrent Neural Networks (RNN) (Sutskever et al., 2014; Bahdanau et al., 2014) and Transformer networks (Vaswani et al., 2017), among others. However, these models have as a serious limitation the discrepancy between their training-time and inference-time regimes. They are traditionally trained using Maximum Likelihood Estimation (MLE), which aims to maximise the log-likelihood of a categorical ground-truth distribution (samples in the training corpus) using loss functions such as cross-entropy. These loss functions are very different from the evaluation metrics used at inference time, which generally compare string similarity between the system output and reference outputs. Moreover, during training, the generator receives the ground truth as input and is trained to minimise the loss of a single token at a time, without taking the sequential nature of language into account. At inference time, however, the generator takes the previously sampled output, rather than the ground-truth word, as the input at the next time step. MLE training thus causes: (a) the problem of "exposure bias", where the model recursively conditions on its own errors at test time, since it has never been exposed to its own predictions during training; (b) a mismatch between the training objective and the test objective, where the latter relies on evaluation using discrete and non-differentiable measures such as BLEU (Papineni et al., 2002).

The current solution for both problems is mainly based on Reinforcement Learning (RL), where a seq2seq model (Sutskever et al., 2014; Bahdanau et al., 2014) is used as the policy which generates actions (tokens) and at each step receives rewards based on a discrete metric, taking into account the importance of immediate and future rewards. However, RL methods for seq2seq MT models also have their challenges: a high-dimensional discrete action space, efficient sampling and exploration, and the choice of baseline reward, among others (Choshen et al., 2020). The typical metrics used as rewards (e.g., BLEU) are often biased and sparse. They are measured against one or a few human references and do not take into account alternative translation options that are not present in the references.

One way to address this problem is to use entropy-regularised RL frameworks. They incorporate the entropy of the policy into the reward to encourage exploration. The expectation is that this leads to learning a policy that acts as stochastically as possible while still succeeding at the task. Specifically, we focus on the Soft Actor-Critic (SAC) (Haarnoja et al., 2018a,b) RL framework, which to the best of our knowledge has not yet been explored for MT or other natural language processing (NLP) tasks. The main advantage of this architecture, compared to other entropy-regularised architectures (Haarnoja et al., 2017; Ziebart et al., 2008), is that it is formulated in the off-policy setting, which enables reusing previously collected samples for more stability and better exploration. We demonstrate that SAC prevents the model from overfitting and, as a consequence, leads to better performance on out-of-domain data.

Another way to address the problem of sparse or biased rewards is to design an unsupervised reward.
Recently, in Robotics, SAC has been successfully used in unsupervised reward architectures, such as the "Diversity is All You Need" (DIAYN) framework (Eysenbach et al., 2018). DIAYN allows the learning of latent-conditioned sub-policies ("skills") in an unsupervised manner, which allows target distributions to be better explored and modelled. Inspired by this work, we propose a formulation of an unsupervised reward for MT. We thoroughly investigate the effects of this reward and conclude that it is useful for lexical choice, particularly the translation of rare senses of ambiguous words.

Our main contributions are thus twofold: (a) the re-framing of the SAC framework such that it can be applied to MT and other natural language generation tasks (Section 3). We demonstrate that SAC results in improved generalisation compared to MLE training, leading to better translation of out-of-domain data; (b) the proposal of a dynamic unsupervised reward within the SAC framework (Section 3.4). We demonstrate its efficacy in translating ambiguous words, particularly the rare senses of such words. Our datasets and settings are described in Section 4, and our experiments in Section 5.

2 Related Work

Reinforcement Learning for MT
RL has been successfully applied to MT to bridge the gap between training and testing by optimising the sequence-level objective directly (Yu et al., 2017; Ranzato et al., 2015; Bahdanau et al., 2016). However, thus far mainly the REINFORCE (Williams, 1992) algorithm and its variants have been used (Ranzato et al., 2015; Kreutzer et al., 2018). These are simpler algorithms that handle the large natural language action space, but they employ a sequence-level reward which tends to be sparse.

To reduce model variance, Actor-Critic (AC) models consider the reward at each decoding step and use the Critic model to guide future actions (Konda and Tsitsiklis, 2000). This approach has also been explored for MT (Bahdanau et al., 2016; He et al., 2017). However, more advanced AC models with Q-learning are rarely applied to language generation problems. This is due to the difficulty of approximating the Q-function for the large action space. The large action space is one of the bottlenecks for RL for text generation in general. Pre-training the agent parameters to be close to the true distribution is thus necessary to make RL work (Choshen et al., 2020). Further RL training of the agent makes the overfitting problem even more pronounced, resulting in peaky distributions. Such problems are traditionally addressed by entropy-regularised RL.
Entropy Regularised RL
The main goal of this type of RL is to learn an efficient policy while keeping the entropy of the agent's actions as high as possible. The paradigm promotes exploration of actions, suppresses peaky distributions and improves robustness. In this work, we explore the effectiveness of the maximum entropy SAC framework (Haarnoja et al., 2018a).

The work closest to ours is that of Dai et al. (2018), where the Entropy-Regularised AC (ERAC) model leads to better MT performance. The major difference between ERAC and SAC is that the former is an on-policy model and the latter is an off-policy model. On-policy approaches use consecutive samples collected in real time that are correlated with each other. In the off-policy setting, our SAC algorithm uses samples from the memory that are drawn uniformly, with reduced correlation. This key characteristic of SAC ensures better model generalisation and stability (Mnih et al., 2015). There are also differences in the architectures of SAC and ERAC, inter alia, using four Q-value networks instead of two. These differences will be covered in detail in Section 3.

Unsupervised Reward RL
Significant work has been done in Robotics to improve the learning capability of robots. These approaches do not rely on a single objective but rather promote intrinsic motivation and exploration. Such an approach to learning diverse skills (latent-conditioned sub-policies; in practice, skills like walking or jumping) in an unsupervised manner was recently proposed by Eysenbach et al. (2018). The approach relies on the SAC model and inspired our approach to designing our unsupervised reward for MT. We are not aware of other attempts to design dynamic unsupervised RL rewards (learnt together with the network) in seq2seq in general, or MT in particular. Recent work on unsupervised rewards in NLP (Gao et al., 2020) explores mainly static rewards computed against synthetic references.
3 Method

In this section, we start by describing the underlying MT architecture and its variant using RL, and then introduce our SAC formulation and the reward functions used.
A typical Neural Machine Translation (NMT) system is a seq2seq architecture (Sutskever et al., 2014; Bahdanau et al., 2014), where each source sentence x = (x_1, x_2, ..., x_n) is encoded by the encoder into a series of hidden states. At each decoding step t, a target word y_t is generated according to p(y_t | y_{<t}, x).

• Critic Training

The Q-function estimates the value of an action at a given state based on its future rewards. The soft Q-value is computed recursively by applying a modified Bellman backup operator:

Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim D}[V(s_{t+1})]    (7)

where

V(s_t) = \mathbb{E}_{a_t \sim \pi}[Q(s_t, a_t) - \alpha \log \pi(a_t | s_t)]    (8)

is the expected future reward of a state and \log \pi(a_t | s_t) is the entropy of the policy.

The parameters of the Q-function are updated towards minimising the mean squared error between the estimated Q-values and the assumed ground-truth Q-values. The assumed ground-truth Q-values are estimated based on the current reward r(s_t, a_t) and the discounted future reward of the next state \gamma V_{\bar\theta}(s_{t+1}). This mean squared error objective function of the Q network is as follows:

L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D,\, a_{t+1} \sim \pi_\phi}\Big[\big(Q_\theta(s_t, a_t) - [r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim D}[V_{\bar\theta}(s_{t+1})]]\big)^2\Big]    (9)

Note that the parameters of the networks are denoted as \theta and \bar\theta, respectively. This follows the best practice where the critic is modelled with two neural networks with the exact same architecture but independent parameters (Mnih et al., 2015).

The parameters of the target critic network (Q_{\bar\theta}) are iteratively updated with the exponential moving average of the parameters of the main critic network (Q_\theta). This constrains the parameters of the target network to update at a slower pace towards the parameters of the main critic, which has been shown to stabilise the training process (Lillicrap et al., 2016).

Another advantage of SAC is double Q-learning (Hasselt, 2010). In this approach, two Q-networks are maintained for both the main and the target critic functions. When estimating the current Q-values or the discounted future rewards, the minimum of the outputs of the two Q-networks is used. Thus the estimated Q-values do not grow too large, which improves the policy training (Haarnoja et al., 2018a).

• Actor Training

SAC updates the policy to minimise the KL-divergence, making the distribution of the policy function \pi_\phi(s_t) look more like the distribution of the Q-function:

L_\pi(\phi) = \mathbb{E}_{s_t \sim D}\big[\pi_\phi(s_t)^{\top} [\alpha \log \pi_\phi(s_t) - Q_\theta(s_t)]\big]    (10)

where softmax is used in the final layer of the policy to output a probability distribution over the actions.

We note that some versions of the SAC algorithm allow the \alpha parameter to be tuned automatically, so that while maximising the expected return, the policy satisfies a minimum entropy criterion. In our experiments, however, we used a fixed \alpha: updating \alpha during training resulted in too-short sentences in the output.

Finally, we note that Eq. 10 does not simply add an entropy term to the standard Policy Gradient: the critic Q_\theta trained with Eq. 9 additionally captures the entropy from future steps.

For more details on SAC in the discrete setting (like MT), we refer to Christodoulou (2019). For more formal details on the architecture, see Haarnoja et al. (2018a,b).
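To make the discrete-setting objectives above concrete, the following is a minimal PyTorch sketch of the critic loss (Eqs. 7-9 with double Q-learning), the actor loss (Eq. 10) and the target-network update. It is an illustration under our own naming assumptions, not the authors' released code: q1, q2 (main critics), q1_tgt, q2_tgt (target critics) and policy are assumed to be modules mapping a state encoding to per-token scores over the vocabulary.

import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q1_tgt, q2_tgt, policy, s_t, a_t, r_t, s_next,
                alpha=0.01, gamma=0.99):
    # Bootstrapped target (Eq. 7) using the soft value of the next state (Eq. 8);
    # double Q-learning takes the element-wise minimum of the two target critics.
    # (Episode-termination masking is omitted for brevity.)
    with torch.no_grad():
        log_pi = F.log_softmax(policy(s_next), dim=-1)      # [B, V]
        q_next = torch.min(q1_tgt(s_next), q2_tgt(s_next))  # [B, V]
        v_next = (log_pi.exp() * (q_next - alpha * log_pi)).sum(-1)
        target = r_t + gamma * v_next
    # MSE between the Q-values of the taken actions and the target (Eq. 9)
    q1_a = q1(s_t).gather(-1, a_t.unsqueeze(-1)).squeeze(-1)
    q2_a = q2(s_t).gather(-1, a_t.unsqueeze(-1)).squeeze(-1)
    return F.mse_loss(q1_a, target) + F.mse_loss(q2_a, target)

def actor_loss(q1, q2, policy, s_t, alpha=0.01):
    # Discrete form of Eq. 10: exact expectation over the vocabulary,
    # pulling the policy towards the soft-Q distribution.
    log_pi = F.log_softmax(policy(s_t), dim=-1)
    q = torch.min(q1(s_t), q2(s_t)).detach()
    return (log_pi.exp() * (alpha * log_pi - q)).sum(-1).mean()

@torch.no_grad()
def soft_update(main, target, tau=0.005):
    # Exponential moving average of the main critic into the target critic
    for p, p_t in zip(main.parameters(), target.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)

Because actions are discrete, the expectations in Eqs. 8 and 10 can be computed exactly by summing over the vocabulary rather than sampling, which is what the element-wise products above do.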
3.4 Reward Functions

Below we define the reward functions we use in our SAC architecture.

Supervised BLEU reward (SAC BLEU). In the supervised setup, we employ the sequence-level BLEU score (Papineni et al., 2002) with add-1 smoothing (Chen and Cherry, 2014). As an additional length constraint, at each time step we deduct from the respective score the length penalty lp = |l_y - l_{\hat y}|, where y is the reference translation. This penalty prevents longer translations that are not penalised by the brevity penalty of BLEU. BLEU has been chosen in our study to ensure better comparability with related work in RL for MT, which traditionally uses the BLEU reward (Bahdanau et al., 2016; Dai et al., 2018).

Unsupervised reward (SAC unsuper). As discussed above, using automatic metrics as reward functions can lead to a number of issues, e.g., reward sparsity and overfitting towards a single reference. Moreover, designing a good reward can be challenging. Inspired by recent work on the SAC algorithm in unsupervised RL (Eysenbach et al., 2018), we have designed an unsupervised reward that balances quality and diversity in the model search space. The pseudo-reward function we use is as follows:

r_z(x, a) = \log q_\delta(z | x, a) - \log p(z)    (11)

where p(z) is a categorical uniform distribution for a latent variable z, and q_\delta(z | x, a) is provided by a discriminator parametrised by a neural network. z is randomly assigned to a word sampled at each step from the actor distribution. The discriminator is a Bag-of-Words model that takes as input the encoded source sequence and the word itself to predict its z.

More intuitively, every time a word appears in the translation hypothesis for a source sentence (within the Bag-of-Words formulation), it is randomly assigned a certain value of z. The more times this word appears in the sampled hypotheses (for a given source), the closer \log q_\delta(z | x, a) will be to the uniform prior p(z), hence the reward r_z(x, a) will be close to 0. Thus, frequent translations will be suppressed and the search for less frequent translations will be encouraged, in order to receive a reward larger than 0.

Such a reward is less sparse than the traditional ones and is also dynamic, which prevents memorising and overfitting.
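As an illustration of the two rewards, here is a hedged Python sketch under our own assumptions, not the released implementation: the BLEU reward uses NLTK's smoothed sentence-level BLEU (we take SmoothingFunction().method2 as an approximation of the add-1 smoothing of Chen and Cherry (2014)), lp_weight is a hypothetical knob for the length penalty, and discriminator is any module producing logits over the num_z latent values.

import math
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_smooth = SmoothingFunction().method2  # add-1 style smoothing

def bleu_reward(hyp_tokens, ref_tokens, lp_weight=1.0):
    # Sequence-level smoothed BLEU minus the length penalty lp = |l_y - l_yhat|
    bleu = sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=_smooth)
    return bleu - lp_weight * abs(len(ref_tokens) - len(hyp_tokens))

def unsupervised_reward(discriminator, src_enc, word_emb, z, num_z=4):
    # Eq. 11: r_z(x, a) = log q_delta(z | x, a) - log p(z), with p(z) uniform
    log_q = F.log_softmax(discriminator(src_enc, word_emb), dim=-1)
    log_q_z = log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1)
    return log_q_z - math.log(1.0 / num_z)

Note that when the discriminator's posterior for a word collapses to the uniform prior (the word appears under every value of z), log_q_z equals log(1/num_z) and the reward is exactly 0, matching the intuition above.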
4 Datasets and Settings

We perform experiments on the Multi30K dataset (Elliott et al., 2016) of image description translations and focus on the English-German (EN-DE) and English-French (EN-FR) (Elliott et al., 2017) language directions. Following best practices, we use sub-word segmentation (BPE (Sennrich et al., 2016)) only on the target side of the corpus (https://github.com/multi30k/dataset). The dataset contains 29,000 instances for training, 1,014 for development, and 1,000 for testing. We use the flickr2016 (2016), flickr2017 (2017) and coco2017 (COCO) test sets for model evaluation. 2016 is the most in-domain test set, since it was taken from the same superset of descriptions as the training set, whereas 2017 and COCO are from different image description corpora and are thus considered out-of-domain.

For a more fine-grained assessment of our models with the unsupervised reward, we use the MLT test set (Lala and Specia, 2018; Lala et al., 2019), an annotated subset of the Multi30K corpus where each instance is a 3-tuple consisting of an ambiguous source word, its textual context (a source sentence), and its correct translation. The test set contains 1,298 sentences for English-French and 1,708 for English-German. It was designed to benchmark models in their ability to select the right lexical choice for words with multiple translations, especially when some of these translations are rarer.

Additionally, to allow for comparison with previous work, we evaluate on the IWSLT 2014 German-to-English dataset (Cettolo et al., 2012) from TED talks, which has been used as a testbed in most work on RL for MT. The training set contains K sentence pairs. We followed the pre-processing procedure described in Dai et al. (2018).

Compared to the IWSLT 2014 dataset, all three Multi30K test sets are more out-of-domain. This was established through an analysis of the perplexities of language models trained on the respective training data of each dataset (see Appendix A.4).

We modify the original SAC architecture to adapt it to MT, following best practices in the area (Bahdanau et al., 2016). The functions \pi_\phi and Q_\theta are parameterised with neural networks: \pi_\phi is an RNN seq2seq model with a 2-layer GRU (Cho et al., 2014) encoder and a 2-layer Conditional GRU decoder (Sennrich et al., 2017) with attention (Bahdanau et al., 2014). For SAC BLEU, Q_\theta duplicates the structure of the former, but encodes the reference instead of the source sentence, to mimic the inputs to the actual BLEU function.

We first pretrain the actor and then pretrain the critic, before the actor-critic training. The pretraining of actors is done until convergence, according to an early stopping criterion of 10 epochs with respect to the MLE loss. We have also found that our critics require much less pretraining (3-5 epochs, compared to 10-20 epochs in general for AC architectures with the MSE loss). Also, to prevent divergence during the actor-critic training, we continue performing MLE training with a smaller weight \lambda_{mle}. We set \alpha to 0.01. Following Haarnoja et al. (2018a), we rescale the reward by the inverse of \alpha (this scale is tuned on the validation set; it typically varies from 2 to several hundred in the related work (Haarnoja et al., 2018b)). Note that we did not find it useful to add to SAC the smoothing objective minimising the variance of Q-values (Bahdanau et al., 2016; Dai et al., 2018). We presume that double Q-learning contributes significantly to the stability of the network, so that additional smoothing is not required.

For SAC unsuper, we parameterise q_\delta with a 2-layer feed-forward neural network, which takes the source as encoded by the actor together with a_t, and outputs q_\delta(z | x, a). We set z to take one of 4 values. For this unsupervised setting, we do not train a Q-function. We instead operate in oracle mode and, following Keneshloo et al. (2018), define true Q-value estimates and use them to update our actor. Details on training are given in Appendix A.
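For concreteness, a small hedged sketch of this mixed objective (all names are ours, not the authors'): the SAC actor loss is combined with a down-weighted token-level cross-entropy term.

import torch.nn.functional as F

def mixed_actor_objective(sac_loss, logits, targets, lambda_mle=0.1, pad_id=0):
    # logits: [B, T, V] decoder outputs; targets: [B, T] reference token ids.
    # The MLE term is kept during actor-critic training, scaled by lambda_mle,
    # to prevent divergence. lambda_mle is a tuned hyperparameter; 0.1 is an
    # arbitrary placeholder here, not the paper's value.
    mle = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
    return sac_loss + lambda_mle * mle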
We use pysimt (Caglayan et al., 2020) (https://github.com/ImperialNLP/pysimt) with PyTorch (Paszke et al., 2019) v1.4 for our experiments. We use the standard set of MT evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and TER (Snover et al., 2006). We perform significance testing via bootstrap resampling using the Multeval tool (Clark et al., 2011).

For the lexical translation task, we measure the Lexical Translation Accuracy (LTA) score (Lala et al., 2019). The score provides an average estimation of how accurately the words have been translated. For each ambiguous word, a score of +1 is awarded if the correct translation of the word is found in the output translation; a score of 0 is assigned if a known incorrect translation is found, or if none of the candidate words are found in the translation. We also propose a metric that not only rewards correctly translated ambiguous words, but also penalises words translated with the wrong sense: the Ambiguous Lexical Index (ALI). ALI assigns -1 for wrong translations in the given context, whereas LTA simply does not reward them.
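The sketch below is our own plausible rendering of the scoring rules just described, not the official scorer; in particular, matching candidate translations in the output is simplified to token-set intersection, which is an assumption on our part.

def lta_ali(samples):
    # samples: iterable of (output_tokens, correct_translations, wrong_translations)
    lta_total, ali_total, n = 0, 0, 0
    for tokens, correct, wrong in samples:
        toks = set(tokens)
        if toks & set(correct):
            lta_total += 1        # LTA: +1 when a correct translation appears
            ali_total += 1
        elif toks & set(wrong):
            ali_total -= 1        # ALI additionally penalises wrong senses
        n += 1                    # otherwise both metrics score 0 for this item
    return lta_total / n, ali_total / n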
5 Experiments

5.1 Overall Performance

We first compare our SAC models against the MLE model (baseline) and ERAC (state of the art, SOTA), all trained and tested on the Multi30K data (Table 1). Compared to SAC, ERAC differs in that it uses the on-policy setting (i.e., samples collected in real time). Our SAC algorithm is an off-policy algorithm and uses samples from the memory to promote generalisation.

                2016                  2017                  COCO
model           BLEU  METEOR  TER     BLEU  METEOR  TER     BLEU  METEOR  TER
EN-FR MLE       57.5  71.7    27.5    50.9  66.8    33.0    42.8  61.5    37.3

Table 1: Performance of SAC BLEU on the Multi30K test sets (EN-FR, EN-DE) trained on the Multi30K train set. * marks statistically significant changes (p-value ≤ 0.05) compared to MLE. Bold highlights best results. ERAC (ours) indicates results obtained by us using the code openly provided by Dai et al. (2018).

We clearly observe the tendency of ERAC models to perform better on the more in-domain data (+1.9 BLEU, +1.6 METEOR, -0.8 TER against MLE for EN-FR) and the tendency of SAC BLEU models to outperform the other models on the more out-of-domain 2017 and COCO sets (+2.7 BLEU, +3.0 METEOR, -1.5 TER against ERAC on COCO for EN-DE). For ERAC, we present results that we reproduced ourselves using the code publicly provided by the authors; we had to perform several modifications to this code to make it conform to recent deep learning framework updates. The performance of this model is on par with that reported by the authors.

SAC unsuper results are, however, worse than the baseline and SOTA. We thus focus on the investigation of SAC BLEU and come back to SAC unsuper in Section 5.2.

To further confirm our hypothesis that SAC reduces overfitting and performs better on out-of-domain data, we train our models on the IWSLT 2014 train set and test on the out-of-domain Multi30K test sets (in the reverse direction, German into English, Table 2).

Table 2: Performance of SAC BLEU on Multi30K (German-English) trained on the IWSLT 2014 train set. UNK indicates standard output containing the UNK symbol; noUNK indicates outputs with sentences containing UNK not taken into account. * marks statistically significant changes (p-value ≤ 0.05) compared to MLE. Bold highlights best results.

We observe similar performance on the complete set of outputs (including sentences with UNK tokens) for MLE and SAC BLEU. If the lines with UNK words are not taken into account, we observe an improvement for the 2016 and 2017 test sets (+0.5 BLEU, +0.1 METEOR, -0.5 TER on average), and a much bigger improvement for the more out-of-domain COCO set (+2.5 BLEU, +0.3 METEOR, -2 TER on average). This confirms our hypothesis that SAC helps to reduce overfitting. (The original corpus pre-processing pipeline that we followed to increase comparability does not include subword segmentation. We take the intersection of hypothesis sentences across Multi30K test setups that contain no generated UNK token with respect to the IWSLT 2014 vocabulary. Reference files may still contain the UNK token; we focus on the generated text here.)

Finally, we compare SAC to the SOTA AC-based RL architectures, namely ERAC and AC, on the IWSLT 2014 set that is commonly used for this task. Compared to SAC, AC differs in that it does not use entropy regularisation. We also provide the performance of the popular MIXER algorithm. Results are shown in Table 3.

Model                               BLEU   METEOR  TER
MLE (ours)                          29.8   31.2    48.9
MIXER (Ranzato et al., 2015)        20.73  -       -
AC (Bahdanau et al., 2016)          28.53  -       -
ERAC (w/feed) (Dai et al., 2018)    29.36  -       -
ERAC (w/o feed) (Dai et al., 2018)  28.42  -       -
ERAC (w/o feed, ours)               29.0*  30.6*   51.5*
SAC BLEU

Table 3: Performance of MLE and different RL algorithms on the IWSLT 2014 test set, trained on the IWSLT 2014 train set. * marks statistically significant changes (p-value ≤ 0.05) compared to MLE. Bold highlights best RL results. MIXER, AC and ERAC scores were taken from the original papers. ERAC (ours) indicates our results using the code provided by Dai et al. (2018).

In terms of general performance, our SAC performs on par with the MLE model; SAC BLEU even slightly lowers this score (-0.2 BLEU, -0.2 METEOR). We note that SAC BLEU outputs contain an increased count of UNK words compared to MLE (+2.8%). This increased generation of UNK words, due to the entropy regularisation, is partially responsible for the similar performance. Another cause is that SAC does not overfit to the BLEU distribution of the target data (i.e., the model does not have a tendency to select certain words simply to boost BLEU rather than picking words that reflect the correct meaning).

5.2 Lexical Choice

To further investigate the effect of the unsupervised reward, we evaluated SAC unsuper on the MLT dataset. Results are shown in Table 4. We calculate the scores under two conditions: All Cases takes into account all possible lexical translations, while Rare Cases takes only the instances where the gold-standard translation is not the most frequent translation for that particular ambiguous word.

All Cases       2016           2017           COCO
Model           LTA    ALI     LTA    ALI     LTA    ALI
EN-FR MLE       81.60  63.19   79.65  59.31   74.60  49.21
EN-DE MLE       65.34  30.68   70.91  41.82   67.45  34.91

Table 4: Performance of SAC BLEU on the MLT test sets (EN-FR, EN-DE). We report Ambiguous Words Accuracy: LTA and ALI. Rare Cases indicates the cases where the correct translation is not the most frequent translation in the training set.

We observe that both SAC BLEU and SAC unsuper outperform the MLE baseline across metrics in all setups except for the COCO EN-FR translation in Rare Cases, where MLE performs better. For SAC BLEU, this observation is also supported by the general evaluation metrics BLEU, METEOR and TER on all MLT test sets (see Table 10 in the Appendix). Moreover, SAC unsuper is particularly successful when evaluated on 2016 and 2017, and outperforms both MLE and SAC BLEU across setups. This demonstrates the potential of the unsupervised reward function for the cases where we have to choose between possible translations for an ambiguous word (i.e., better exploration of the search space). The BLEU reward, on the other hand, is more reliable when we have to adjust distributions to produce one single possible translation.
Manual inspection of these SAC unsuper improvements confirmed their increased accuracy (see Table 5). For example, the ambiguous French source word 'hill' ('colline') is translated as 'pente' ('slope') by both MLE and SAC BLEU, while only SAC unsuper produces the correct sentence: 'adolescent saute la colline avec son vélo'.

EN-FR
source word         hill
gold target word    colline
source sentence     the teen jumps the hill with his bicycle .
reference sentence  ado saute sur la colline 'hill' avec son vélo .
MLE                 adolescent saute sur la pente 'slope' avec son vélo .
SAC BLEU            adolescent saute la pente 'slope' avec son vélo .
SAC unsuper         adolescent saute la colline 'hill' avec son vélo .

EN-DE
source word         outfit
gold target word    outfit
source sentence     a rhythmic gymnast in a blue and pink outfit performs a ribbon routine .
reference sentence  eine rhythmische sportgymnastin in einem blauen und pinken outfit vollführt eine bewegung mit dem band .
MLE                 ein begeisterter turner in blau-rosa kleidung 'dress' führt eine band auf .
SAC BLEU            ein begeisterter turner in blau-rosa kleidung 'dress' führt eine band auf .
SAC unsuper         ein aufgeregter turner in einem blau-rosa outfit führt eine band aus .

Table 5: Samples of ambiguous word translation for both EN-FR and EN-DE. In both cases more correct translations are provided by SAC unsuper. Bold highlights target words and their translations.

To get further insights into the general results, we also performed human evaluation of the outputs for MLE, SAC BLEU, and SAC unsuper using professional in-house expertise. This was done for COCO EN-FR and 2016 EN-DE as two sets with contrastive results in the lexical translation experiment.

For this human analysis, we randomly selected test samples (50 samples per language pair per group) with source words of different frequency in the training data: rare words (frequency 1) and other words (frequency ≥ 2). We found SAC BLEU to do well on the translation of rare source words, but not so well on the translation of words in the middle frequency range (this observation is confirmed by the analysis of the frequency of output words; see Appendix A.5 and Table 6). Our unsupervised reward tends to increase the performance on more frequent words ('Other' in Table 7) by promoting their less common translations in the distribution, hence better translations for ambiguous words from our previous experiment. These ambiguous words are quite frequent; they potentially have multiple possible translations but only one correct translation in a given context.

Freq. 1
source word         traveler
gold target word    reisender
source sentence     an oriental traveler awaits his turn at the currency exchange .
reference sentence  ein orientalischer reisender 'traveler' wartet am wechselschalter bis er dran ist .
MLE                 ein orientalisch aussehender behinderter 'disabled' wartet darauf , dass die kurve sich die glastür aufhebt .
SAC BLEU            ein orientalisch aussehender techniker 'technician' wartet auf die hecke seiner kurve .
SAC unsuper         ein orientalisch aussehender mann 'man' wartet darauf , dass seine kurve auf den fehenk die kurve ist .

Freq. 28
source word         check
gold target word    scheck
source sentence     a woman is holding a large check for kids food basket .
reference sentence  eine frau hält einen großen scheck 'check' für " kids' food basket " .
MLE                 eine frau hält ein großes überprüfen 'proof' für kinder .
SAC BLEU            eine frau hält einen großen informationen 'information' für kinder in den korb .
SAC unsuper         eine frau hält ein großes überprüfen 'proof' für kinder , die einen korb zu verkaufen ist .

Table 6: Samples of translations for words of different frequency on EN-DE. In both cases more correct translations are provided by SAC unsuper. Bold highlights target words and their translations.

Table 7: Human ranking results for the 2016 EN-DE and COCO EN-FR test sets. Bold highlights best results per group of word types. The first column indicates the groups of word types. Results are averaged for all words per word type group.

6 Conclusions

We propose and reformulate SAC reinforcement learning approaches to help machine translation through better exploration and less reliance on the reward function. To provide a good trade-off between exploration and quality, we devise two reward methods, in a supervised and a dynamic unsupervised manner. The maximum entropy off-policy SAC algorithm mitigates the overfitting problem when evaluated in the out-of-domain space; both rewards introduced in our SAC architecture can achieve better quality for the lexical translation of ambiguous words, particularly the rare senses of words. The formulation of the unsupervised reward
and its potential to influence translation quality open perspectives for future studies on the subject. We leave the exploration of how these supervised and unsupervised rewards could be combined to improve MT for future work.

Acknowledgments

The authors thank the anonymous reviewers for their useful feedback. This work was supported by the MultiMT (H2020 ERC Starting Grant No. 678017) and the Air Force Office of Scientific Research (under award number FA8655-20-1-7006) projects. Marina Fomicheva and Lucia Specia were supported by funding from the Bergamot project (EU H2020 grant no. 825303). We also thank the annotators for their valuable help.

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169-214, Copenhagen, Denmark. Association for Computational Linguistics.

Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, and Lucia Specia. 2020. Simultaneous machine translation with visual context. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2350-2361, Online. Association for Computational Linguistics.

Mauro Cettolo, C. Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. Proceedings of EAMT, pages 261-268.

Boxing Chen and Colin Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 362-367, Baltimore, Maryland, USA. Association for Computational Linguistics.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734, Doha, Qatar. Association for Computational Linguistics.

Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend. 2020. On the weaknesses of reinforcement learning for neural machine translation. In International Conference on Learning Representations.

Petros Christodoulou. 2019. Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176-181, Portland, Oregon, USA. Association for Computational Linguistics.

Zihang Dai, Qizhe Xie, and Eduard Hovy. 2018. From credit assignment to entropy regularization: Two new algorithms for neural sequence prediction. arXiv preprint arXiv:1804.10974.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215-233, Copenhagen, Denmark. Association for Computational Linguistics.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70-74, Berlin, Germany. Association for Computational Linguistics.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2018. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070.

Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347-1354, Online. Association for Computational Linguistics.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. CoRR, abs/1702.08165.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018a. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. 2018b. Soft actor-critic algorithms and applications. CoRR.

Hado V. Hasselt. 2010. Double Q-learning. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2613-2621. Curran Associates, Inc.

Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2017. Decoding with value networks for neural machine translation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 178-187. Curran Associates, Inc.
Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. 2018. Deep reinforcement learning for sequence to sequence models. arXiv preprint arXiv:1805.09461.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28-39, Vancouver. Association for Computational Linguistics.

Vijay R. Konda and John N. Tsitsiklis. 2000. Actor-critic algorithms. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1008-1014. MIT Press.

Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. 2018. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777-1788, Melbourne, Australia. Association for Computational Linguistics.

Chiraag Lala, Pranava Madhyastha, and Lucia Specia. 2019. Grounded word sense translation. In Proceedings of the Second Workshop on Shortcomings in Vision and Language, pages 78-85, Minneapolis, Minnesota. Association for Computational Linguistics.

Chiraag Lala and Lucia Specia. 2018. Multimodal Lexical Translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In ICLR.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310-1318, Atlanta, Georgia, USA. PMLR.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024-8035. Curran Associates, Inc.
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157-163, Valencia, Spain. Association for Computational Linguistics.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nădejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65-68, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725, Berlin, Germany. Association for Computational Linguistics.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683-1692, Berlin, Germany. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223-231.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104-3112. Curran Associates, Inc.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.

Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum entropy inverse reinforcement learning. Pages 1433-1438. AAAI Press.

A Training Details

A.1 Hyperparameters

For the NMT RNN agent, the dimensions of the embeddings and GRU hidden states are set to 200 and 320, respectively. The decoder's input and output embeddings are shared (Press and Wolf, 2017). We use Adam (Kingma and Ba, 2014) as the optimiser and set the learning rate and mini-batch size to 0.0004 and 64, respectively. A small weight decay is applied for regularisation. We clip the gradients if the norm of the full parameter vector exceeds a fixed threshold (Pascanu et al., 2013).
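These A.1 settings map directly onto a standard PyTorch setup; below is a small hedged sketch (the model is a placeholder module, and the weight-decay exponent and clipping threshold are left as labelled assumptions because their exact values are elided in the text above):

import torch

model = torch.nn.Linear(8, 8)  # placeholder for the NMT agent
WEIGHT_DECAY = 1e-5            # assumption: the exponent is garbled in the source
CLIP_NORM = 1.0                # assumption: the threshold value is missing in the source

optimiser = torch.optim.Adam(model.parameters(), lr=0.0004,
                             weight_decay=WEIGHT_DECAY)

def train_step(loss):
    optimiser.zero_grad()
    loss.backward()
    # Clip if the norm of the full parameter vector exceeds the threshold
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimiser.step()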
The four Q-networks are identical to the agent. For the unsupervised reward setting, we use a 2-layer feed-forward neural network (both dimensionalities equal to 100). We again use Adam as the optimiser and set the learning rate and mini-batch size to 0.0001 and 64, respectively.

Hyper-parameters
Pre-train Critic:
    optimiser                    Adam
    learning rate                0.0003
    batch size                   64
    τ (target net speed)         0.005
    α (entropy regularisation)   0.001
    buffer size                  1000
    length penalty               0.0001
Joint Training:
    optimiser                    Adam
    learning rate                0.0004
    batch size                   64
    τ (target net speed)         0.005
    α (entropy regularisation)   0.001
    buffer size                  1000
    length penalty               0.0001
    λ_MLE

Table 8: Hyper-parameters for SAC training.

A.2 Training

We use PyTorch (Paszke et al., 2019) (v1.4, CUDA 10.1) for our experiments. We early-stop the actor training if the validation loss does not improve for 10 epochs; we pretrain critics for 5 epochs for the Multi30K datasets and for 3 epochs for the larger IWSLT 2014. We early-stop the SAC training if the validation BLEU does not improve for 10 epochs. For all setups, we also halve the learning rate if no improvement is obtained for two epochs. On a single NVIDIA RTX2080-Ti GPU, it takes from around 5-6 hours up to 36 hours to train a model, depending on the data size and the language pair. The number of learnable parameters is about 7.89M for the smaller Multi30K models and about 15.64M for the bigger IWSLT model. All models were re-trained 3 times to ensure reproducibility.

A.3 Soft Actor-Critic Training Algorithm

We describe the main steps of SAC training in Algorithm 1.

Algorithm 1: Soft Actor-Critic
Initialise parameters: Q-function θ; policy φ; unsupervised reward δ; replay buffer D ← ∅
for each iteration do
    for each translation step do
        a_t ∼ π_φ(a_t | s_t)
        s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
        D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
    end
    for each gradient step do
        θ_i ← θ_i − λ_Q ∇_{θ_i} L(θ_i) for i ∈ {1, 2}
        φ ← φ − λ_π ∇_φ J(φ)
        α ← α − λ_π ∇_α J(α)
        θ̄_i ← τ θ_i + (1 − τ) θ̄_i for i ∈ {1, 2}
        if unsupervised reward then
            δ ← δ − λ_z ∇_δ r(δ)
        end
    end
end
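For readers who prefer code to pseudocode, here is a hedged Python sketch of the gradient-step block of Algorithm 1, reusing critic_loss, actor_loss and soft_update from the earlier sketch; all names, including the collate helper that stacks transitions into tensors, are our own illustrative assumptions rather than the released implementation.

import random

def sac_gradient_step(buffer, batch_size, policy, q1, q2, q1_tgt, q2_tgt,
                      opt_q, opt_pi, alpha=0.01, gamma=0.99, tau=0.005):
    # Uniformly sample off-policy transitions from the replay buffer (a list)
    batch = random.sample(buffer, batch_size)
    s_t, a_t, r_t, s_next = collate(batch)  # assumed helper: stack into tensors

    # Critic step: minimise the soft Bellman error (Eq. 9)
    opt_q.zero_grad()
    critic_loss(q1, q2, q1_tgt, q2_tgt, policy, s_t, a_t, r_t, s_next,
                alpha, gamma).backward()
    opt_q.step()

    # Actor step: pull the policy towards the soft-Q distribution (Eq. 10)
    opt_pi.zero_grad()
    actor_loss(q1, q2, policy, s_t, alpha).backward()
    opt_pi.step()

    # Target critics track the main critics via an exponential moving average
    soft_update(q1, q1_tgt, tau)
    soft_update(q2, q2_tgt, tau)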
Weobserve that although both MLE and SAC containmore frequent words than the reference, this ten-dency is less pronounced for SAC. We relate thisobservation to the fact that our SAC outperformsMLE for the ambiguous word translation (Table 4)where the most frequent translation is not alwaysthe correct one. Figure 1: Training frequency for COCO words astranslated by MLE and SAC BLEU . We also report ref-erence frequencies. 016 2017 COCO model BLEU METEOR TER BLEU METEOR TER BLEU METEOR TER E N - F R MLE 58.8 73.8 SAC BLEU SAC unsuper E N - D E MLE SAC BLEU SAC unsuper44.1 33.1 52.9* 48.7 28.3 48.6 51.5