Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation
Julia Ive, Andy Mingren Li, Yishu Miao, Ozan Caglayan, Pranava Madhyastha, Lucia Specia
Imperial College London · University of Sheffield · ADAPT, Dublin City University

Abstract
This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the textual input is produced. For that, we propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment. We provide an exploration of how different types of visual information and integration strategies affect the quality and latency of simultaneous translation models, and demonstrate that visual cues lead to higher quality while keeping the latency low.
1 Introduction

Research into automating real-time interpretation has explored deterministic and adaptive approaches to build policies that address the issue of translation delay (Ryu et al., 2006; Cho and Esipova, 2016; Gu et al., 2017). In another recent development, the availability of multimodal data (such as visual information) has driven the community towards multimodal approaches for machine translation (MMT) (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018). Although deterministic policies have recently been explored for simultaneous MMT (Caglayan et al., 2020; Imankulova et al., 2020), there are no studies regarding how multimodal information can be exploited to build flexible and adaptive policies for simultaneous machine translation (SiMT).

Applications of reinforcement learning (RL) for unimodal SiMT have highlighted the challenges for the agent to maintain good translation quality while learning an optimal translation path (i.e. a sequence of
READ/WRITE decisions at every time step) (Grissom II et al., 2016; Gu et al., 2017; Alinejad et al., 2018). Incomplete source information has a detrimental effect, especially in cases where significant restructuring is needed when translating from one language to another. In addition, the lack of information generally leads to high variance during training in the RL setup. We posit that multimodality in adaptive SiMT could help the agent by providing extra signals, which would in turn improve training stability and thus the quality of the estimator and the translation decoder.

In this paper, we present the first exploration of multimodal RL approaches for the task of SiMT. As visual signals, we explore both image classification features and visual concepts, which provide global image information and explicit object representations, respectively. For RL, we employ the Policy Gradient method with a pre-trained neural machine translation model acting as the environment. As the SiMT model is optimised for both translation quality and latency, we apply a combined reward function that consists of a decomposed smoothed BLEU score and a latency score. To integrate visual and textual information, we propose different strategies that operate both on the agent side (as prior information or at each step) and on the environment side.

In experiments on standard datasets for MMT, our models achieve the highest BLEU scores in most settings without significant loss in average latency, as compared to strong SiMT baselines. A qualitative analysis shows that the agent benefits from the multimodal information by grounding language signals on the images.

Our main contributions are as follows: (1) we propose the first multimodal approach to simultaneous machine translation based on adaptive policies with RL, introducing different strategies to integrate visual and textual information (Sections 3 and 4); (2) we show how different types of visual information and integration strategies affect the quality and latency of the models (Section 5); (3) we demonstrate that providing visual cues to both agent and environment is beneficial: models achieve high quality while keeping the latency low (Section 5).

2 Background

In this section, we first present background and related work on SiMT, and then discuss recent work in MMT and multimodal RL.
2.1 Simultaneous Machine Translation

In the context of neural machine translation (NMT), Cho and Esipova (2016) introduce a greedy decoding framework where simple heuristic waiting criteria are used to decide whether the model should read more source words or instead write a target word. Gu et al. (2017) utilise a pre-trained NMT model in conjunction with an RL agent whose goal is to learn a READ/WRITE policy by maximising quality and minimising latency. Alinejad et al. (2018) further extend the latter approach by adding a PREDICT action with the aim of capturing the anticipation of the next source word. Ma et al. (2019) propose an end-to-end, fixed-latency framework called 'wait-k' which allows prefix-to-prefix training using a deterministic policy: the agent starts by reading a specified number of source tokens (k), followed by alternating WRITE and READ actions.

Other approaches to SiMT include re-translation of previous outputs depending on new outputs (Arivazhagan et al., 2020; Niehues et al., 2018) or learning adaptive policies guided by heuristic or alignment-based approaches (Zheng et al., 2019; Arthur et al., 2020). A general theme in these approaches is their reliance on consecutive NMT models pre-trained on full sentences. However, Dalvi et al. (2018) discuss potential mismatches between the training and decoding regimens of these approaches and propose to fine-tune the models using chunked data or prefix pairs.
2.2 Multimodal Machine Translation

MMT aims at improving the quality of automatic translation using additional sources of information (Sulubacak et al., 2020). Different methods for fusing textual and visual information have been proposed. These include initialising the textual encoder or decoder with the visual information (Elliott and Kádár, 2017; Caglayan et al., 2017), combining the visual information through spatial feature maps using soft attention (Caglayan et al., 2016; Libovický and Helcl, 2017; Huang et al., 2016; Calixto et al., 2017), and projecting a summary of the visual representations to a common context space via a trained projection matrix (Calixto and Liu, 2017; Caglayan et al., 2017; Elliott and Kádár, 2017; Grönroos et al., 2018). Further, recent work has also explored multimodal pivots (Hitschler et al., 2016) and latent variable models (Calixto et al., 2019) in the context of multimodal machine translation. In this paper, we explore all these strategies, as well as the use of visual concepts, similar to the approach of Ive et al. (2019).
2.3 Multimodal Reinforcement Learning

Previous work has explored RL with language inputs (Andreas et al., 2017; Bahdanau et al., 2018; Goyal et al., 2019) by making use of language to improve the policy or reward function: for example, the task of navigating a grid-world environment using language instructions (Andreas et al., 2016). Alternatively, RL with language output can be framed as sequential decision making for language generation, while conditioning on other modalities. This includes image captioning (Ren et al., 2017), video captioning (Wang et al., 2018), question answering (Das et al., 2018), and text-based games (Côté et al., 2018). Our study sits in between these lines of work: we have both the source language and the respective images as input, and the target language as output. Our agent is focused only on learning the READ and WRITE actions, while the translation model is kept fixed for simplicity. The central aim of the agent is to learn to capture the relevant structures and relations between the modalities that can lead to a better SiMT system.
3 Approach

We first present the architectures for consecutive and baseline fixed-policy simultaneous MT (Section 3.1). Then we introduce our RL approaches, both the baseline and the proposed multimodal extension (Section 3.2), as well as the visual features used by all multimodal approaches (Section 3.3).
3.1 MT Architectures

Unimodal NMT. We implement a standard encoder-decoder baseline with attention (Bahdanau et al., 2014) which incorporates a two-layer encoder and a two-layer decoder with GRU (Cho et al., 2014) units. Given a source sequence of embeddings X = {x_1, ..., x_S} and a target sequence of embeddings Y = {y_1, ..., y_T}, the encoder first computes the sequence of hidden states H = {h_1, ..., h_S} unidirectionally. The attention layer receives H as keys and values, whereas the hidden states of the first decoder GRU provide the queries. The context vector c^T_t produced by the attention layer is given as input to the second GRU. Finally, the output token (y_t) probabilities are obtained by applying a softmax layer on top of the concatenation of the previous word embedding, the context vector and the second GRU's hidden state. For consecutive NMT, all source tokens are observed before the decoder begins generation.
Multimodal MT. We extend the unimodal MT model with multimodal attention (Calixto et al., 2016; Caglayan et al., 2016) in the decoder, in order to incorporate visual information into the baseline NMT. Let us denote the visual counterpart of the textual hidden states H by V. Multimodal attention simply applies another attention layer on top of V, which yields a visual context vector c^V_t at each decoding time step t. The final multimodal context vector that is given as input to the second GRU is simply the sum of both context vectors.
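To make the fusion concrete, the following is a minimal sketch of this dual attention; the simple dot-product scoring and the tensor names are illustrative assumptions, not the exact layers used in the paper.

```python
import torch
import torch.nn.functional as F

# A sketch of decoder-side multimodal attention: one attention layer over the
# textual states H and another over the visual features V, with the resulting
# context vectors summed. Dot-product scoring is an illustrative choice.
def attend(query, keys):
    # query: (B, d), keys: (B, N, d) -> context: (B, d)
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)   # (B, N)
    alpha = F.softmax(scores, dim=-1)
    return torch.bmm(alpha.unsqueeze(1), keys).squeeze(1)

def multimodal_context(query, H, V):
    c_text = attend(query, H)   # textual context c^T_t
    c_vis = attend(query, V)    # visual context c^V_t
    return c_text + c_vis       # summed multimodal context for the second GRU
```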
Unimodal wait-k NMT. We explore the deterministic wait-k approach (Ma et al., 2019) as a unimodal baseline for simultaneous NMT. The wait-k model starts by reading k source tokens and then writes the first target token. The model then reads and writes one token at a time to complete the translation process. This implies that the attention layer will now attend to a partial textual representation corresponding to k words. We use the decoding-only variant, which does not require re-training an NMT model, i.e. it re-uses the already trained consecutive NMT baselines. These baselines are equivalent to the deterministic approaches used in Caglayan et al. (2020).
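As a reference for the decoding schedule, below is a minimal sketch of the wait-k READ/WRITE pattern; the helper function and explicit action strings are illustrative, not part of pysimt.

```python
# A sketch of the deterministic wait-k schedule (Ma et al., 2019): read k
# source tokens, then alternate WRITE and READ until the target is complete.
def wait_k_policy(k, src_len, tgt_len):
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        # Read until k tokens beyond the current target position are available
        # (or the source is exhausted); otherwise write the next target token.
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# Example: wait-3 on a 6-token source and 6-token target yields
# READ READ READ WRITE READ WRITE READ WRITE READ WRITE WRITE WRITE
print(wait_k_policy(3, 6, 6))
```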
3.2 Simultaneous MT with RL

We closely follow Gu et al. (2017) and cast SiMT as the task of producing a sequence of READ or WRITE actions. We then devise an RL model that connects the MT system and these actions. The model is based on a reward function that takes into account both quality and latency. Following standard RL, the framework is composed of an environment and an agent. The agent decides either to read one more input token or to write a token into the output; hence two actions are possible: READ and WRITE. The environment is a pre-trained NMT system which is frozen during RL training.

The agent is a GRU that parameterises a stochastic policy which decides on the action a_t by receiving as input the observation o_t. In our setup, o_t is defined as [c^T_t; y_t; a_{t-1}], i.e. the concatenation of vectors coming from the environment, as well as the previously produced action. At each time step, the agent receives a reward r_t = r^Q_t + r^D_t, where r^Q_t is the quality reward (the difference of smoothed BLEU scores for partial hypotheses produced from one step to another) and r^D_t is the latency reward, formulated as:

r^D_t = α [sgn(C_t − C*) + 1] + β ⌊D_t − D*⌋₊

where C_t denotes the consecutive wait (CW) metric, which is added to avoid long consecutive waits (Gu et al., 2017). CW measures how many source tokens are consecutively read between committing two translations. D_t refers to the average proportion (AVP) (Cho and Esipova, 2016), which defines the average proportion of wait tokens when translating the words. D* and C* are hyper-parameters that determine the expected/target values. The optimal quality-latency trade-off is achieved by balancing the two reward terms. In our reward implementation we again closely follow Gu et al. (2017).
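A minimal sketch of this combined step reward is given below; alpha, beta and d_star are left as explicit arguments since their exact values are not reproduced here, and reading the floor-plus operator as the positive part max(x, 0), as in Gu et al. (2017), is an assumption.

```python
# A sketch of the combined step reward r_t = r^Q_t + r^D_t from Section 3.2.
# delta_bleu is the step-wise difference of smoothed partial BLEU (r^Q_t),
# c_t is the consecutive wait (CW) and d_t the average proportion (AVP).
def sgn(x):
    return (x > 0) - (x < 0)

def step_reward(delta_bleu, c_t, d_t, alpha, beta, d_star, c_star=2):
    r_quality = delta_bleu                               # r^Q_t
    # [sgn(C_t - C*) + 1] is 2 when the consecutive wait exceeds the target
    # C*, 0 when it is below, and 1 at the boundary.
    r_latency = alpha * (sgn(c_t - c_star) + 1) \
                + beta * max(d_t - d_star, 0.0)          # r^D_t
    return r_quality + r_latency
```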
Multimodal extension. Here we focus on integrating the visual information with the agent (see Figure 1). The basic premise is that the addition of multimodal information, especially in the context of MMT, can result in the agent learning better and more flexible policies. We explore several ways to integrate visual information into this framework (a code sketch follows the list below). (We note that the use of GRU cells is not critical for the multimodal components; they were chosen as they led to the best performance in our implementation.)
• Multimodal initialisation (RL-init): the agent network is initialised with the image vector V as its initial hidden state d_0. We expect this vector to give the agent some context w.r.t. the source sentence so it can potentially read fewer words before producing outputs.

• Multimodal attention (RL-att, Figure 1): applies another attention layer on top of V, which yields a visual context vector c^V_t at each agent time step t. This visual context vector is a dot-product attention c^V_t = Attention(V, query ← y_t) that computes the similarity between V and the embedding of the target word produced by the decoder at time step t. In this setting, we expect the agent to pay attention to the information in V that helps define whether y_t is good enough to be written to the output (potentially with a closer relationship to some part of the image information) or whether more source words need to be read to produce a better y_t. We concatenate c^V_t to o_t, which now becomes [c^T_t; y_t; a_{t-1}; c^V_t].

• As a control, we also study a multimodal environment (RL-env, Figure 1), where we use the MMT baseline as the environment. Here, we expect the initial translation quality of SiMT RL models to be closer to the quality of the respective consecutive multimodal baseline, as the image information is expected to compensate for partial source information. When combined with the RL-init and RL-att settings, we expect the agent to exploit different kinds of image information than the environment.
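The sketch below illustrates the RL-init and RL-att strategies on the agent side; the module names, the pooling in init_state and the dot-product attention are assumptions for illustration, not the authors' exact implementation.

```python
import torch

# A sketch of agent-side visual integration. V is the visual tensor of shape
# (B, N, d), assumed already projected so that its feature size matches the
# target embedding size; obs_dim must equal the size of the concatenated o_t.
class MultimodalAgent(torch.nn.Module):
    def __init__(self, obs_dim, hid_dim, d):
        super().__init__()
        self.vis_proj = torch.nn.Linear(d, hid_dim)      # used for RL-init
        self.gru = torch.nn.GRUCell(obs_dim, hid_dim)
        self.action_head = torch.nn.Linear(hid_dim, 2)   # READ/WRITE logits

    def init_state(self, V):
        # RL-init: initial agent state from a (pooled) projection of V.
        return torch.tanh(self.vis_proj(V.mean(dim=1)))

    def step(self, c_text, y_emb, prev_action, V, h):
        # RL-att: dot-product attention c^V_t = Attention(V, query <- y_t).
        scores = torch.bmm(V, y_emb.unsqueeze(-1)).squeeze(-1)    # (B, N)
        alpha = torch.softmax(scores, dim=-1)
        c_vis = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)       # (B, d)
        # Observation o_t = [c^T_t; y_t; a_{t-1}; c^V_t].
        o_t = torch.cat([c_text, y_emb, prev_action, c_vis], dim=-1)
        h = self.gru(o_t, h)
        return self.action_head(h), h
```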
Learning. To learn the multimodal agent, we introduce an additional neural network with the same structure as the agent GRU network to provide control variates (baselines) that improve the Monte-Carlo policy gradient (REINFORCE; Williams, 1992). Note that here we depart from previous work, where Gu et al. (2017) use a simple multilayer perceptron as the baseline. With the reward r_t at each time step, we obtain the estimate of the gradients by subtracting the baselines b(o_t):

∇_θ J(θ) = E[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | o_t) (r_t − b(o_t)) ]

To further reduce the variance of the gradient estimator, we also introduce a temperature τ for controlling the interpolation between discrete action samples and continuous categorical densities, which yields a Gumbel-Softmax reparameterisation (Jang et al., 2017) that smooths the learning. More precisely, we use the Gumbel-Softmax distribution instead of argmax while sampling, so the probability of the WRITE action is given to the agent network instead of the index of the action.
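The following is a minimal sketch of this learning step: REINFORCE with a learned baseline b(o_t) as a control variate, plus Gumbel-Softmax relaxation with temperature tau in place of hard argmax sampling. All tensor names are illustrative, and `returns` is assumed to hold the per-step reward-to-go.

```python
import torch
import torch.nn.functional as F

# A sketch of the policy-gradient update described above; not the authors'
# exact training loop.
def policy_gradient_step(agent, baseline_net, observations, actions, returns, tau=1.0):
    logits = agent(observations)                      # (T, 2) READ/WRITE logits
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (T,)
    baselines = baseline_net(observations).squeeze(-1)                # b(o_t)
    advantage = (returns - baselines).detach()        # (r_t - b(o_t)), no grad
    policy_loss = -(chosen * advantage).mean()
    # The baseline network is trained to regress the rewards with MSE.
    baseline_loss = F.mse_loss(baselines, returns)
    # Relaxed actions: the WRITE probability itself (rather than a hard action
    # index) can be fed back to the agent, smoothing learning.
    soft_actions = F.gumbel_softmax(logits, tau=tau, hard=False)
    return policy_loss, baseline_loss, soft_actions
```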
3.3 Visual Features

To represent the visual information, we explore two settings that differ in the organisation of the spatial structure. Regardless of the setting, the image features are linearly projected into the hidden space of the decoder to yield the tensor V.

Image classification features (OC) are global image information represented by convolutional feature maps, which are believed to capture spatial cues. These features are extracted from the final convolution layer of a ResNet-50 convolutional neural network (CNN) (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) for object classification. Since the final feature tensor is of size 8×8×2048, the visual attention is applied over a grid of 64 equally-sized regions.
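A sketch of this feature extraction is shown below, assuming the common practice of taking the output of the last convolutional block of a torchvision ResNet-50; the input size and variable names are illustrative.

```python
import torch
import torchvision

# A sketch of extracting the 8x8x2048 convolutional feature map from an
# ImageNet-pretrained ResNet-50, flattened into 64 attendable regions.
resnet = torchvision.models.resnet50(pretrained=True).eval()
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc

with torch.no_grad():
    image = torch.randn(1, 3, 256, 256)        # a dummy RGB image tensor
    fmap = feature_extractor(image)            # (1, 2048, 8, 8)
    V = fmap.flatten(2).transpose(1, 2)        # (1, 64, 2048): 64 regions
```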
Visual concepts (VC) are explicit object representations where local regions are detected as objects and subsequently encoded with 100-dimensional word representations. For a given image, the detector provides 36 object and 36 attribute region proposals, which are abstract concepts associated with the image. We represent each of the detected regions with its corresponding GloVe (Pennington et al., 2014) word vector. An image is thus represented by a feature tensor of size 72×100, and the visual attention is now applied over these visual concepts rather than the uniform grid of the first approach above. We hypothesise that this type of information can result in better referential grounding by using conceptually meaningful units rather than global features. The detector used here is a Faster R-CNN/ResNet-101 object detector (with 1600 object labels) (Anderson et al., 2018) pre-trained on the Visual Genome dataset (Krishna et al., 2017). (Detector available at https://hub.docker.com/r/airsplay/bottom-up-attention)
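Building the VC tensor then amounts to a simple embedding lookup, as in the sketch below; glove_lookup is an assumed label-to-vector mapping, not part of the paper's code.

```python
import numpy as np

# A sketch of assembling the 72x100 visual-concept tensor: 36 object and 36
# attribute labels per image, each mapped to its 100-d GloVe vector.
def visual_concept_tensor(object_labels, attribute_labels, glove_lookup, dim=100):
    labels = list(object_labels) + list(attribute_labels)   # 36 + 36 = 72
    vectors = [glove_lookup.get(label, np.zeros(dim)) for label in labels]
    return np.stack(vectors)                                # (72, 100)
```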
Figure 1: Our multimodal RL SiMT models: the agent interacts with the environment to receive new translations and at each time step produces a READ/WRITE action, receiving a reward for each action. The image information can be integrated into the agent by means of an attention mechanism (a, RL-att), or into the environment decoder (b, RL-env) producing the next translation.
4 Experimental Settings

4.1 Dataset

We perform experiments on the Multi30k dataset (Elliott et al., 2016), which extends the Flickr30k image captioning dataset (Young et al., 2014) with caption translations in German and French (Elliott et al., 2017). Multi30k is a standard MMT dataset that contains parallel sentences in two languages that describe the images. The training set for each language direction comprises 29,000 image-source-target triplets, whereas the development and test sets have around 1,000 samples each. We use the corresponding test sets from 2016, 2017 and 2018 for evaluation. (Dataset available at https://github.com/multi30k/dataset)
Pre-processing. We use Moses scripts (Koehn et al., 2007) to lowercase, normalise and tokenise the sentences. We then create word vocabularies on the training subset of the dataset. We did not use subword segmentation, both to avoid its potential side effects on fixed-policy SiMT and to be able to better analyse the grounding capability of the models. The resulting English, French and German vocabularies contain 9.8K, 11K and 18K tokens, respectively.
4.2 Evaluation Metrics

We use BLEU (Papineni et al., 2002) for quality, and perform significance testing via bootstrap resampling using the Multeval tool (Clark et al., 2011). For latency, we measure Average Proportion (AVP) (Cho and Esipova, 2016), the average proportion of source tokens required to commit a translation. This metric is sensitive to the difference in lengths between source and target. Hence, as our main latency metric we measure Average Lagging (AVL) (Ma et al., 2019), which estimates the number of tokens the "writer" is lagging behind the "reader", as a function of the number of input tokens read.
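For reference, a minimal sketch of AVL following the formulation of Ma et al. (2019) is given below; g[t], the number of source tokens read before writing target token t, is assumed to be logged during decoding.

```python
# A sketch of Average Lagging (Ma et al., 2019). g is a 1-indexed-by-position
# list where g[t-1] is the number of source tokens read before writing target
# token t; gamma = |y| / |x| normalises for the source/target length ratio.
def average_lagging(g, src_len, tgt_len):
    gamma = tgt_len / src_len
    # tau is the first target step at which the full source has been read.
    tau = next((t for t, gt in enumerate(g, start=1) if gt == src_len), len(g))
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```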
4.3 Model and Training Details

We set the embedding dimensionality and the GRU hidden state size to 200 and 320, respectively. We use the ADAM (Kingma and Ba, 2014) optimiser with a learning rate of 0.0004 and a batch size of 64. We use pysimt (Caglayan et al., 2020) with PyTorch (Paszke et al., 2019) v1.4 for our experiments (https://github.com/ImperialNLP/pysimt). We early-stop w.r.t. the validation BLEU with a patience of 10 epochs. On a single NVIDIA RTX2080-Ti GPU, training takes around 35 minutes for the unimodal model and around 1 hour for the multimodal model. The number of learnable parameters is between 6.9M and 9.3M, depending on the language pair and the type of multimodality.
For the RL systems, we follow Gu et al. (2017) (https://github.com/nyu-dl/dl4mt-simul-trans). The agent is implemented by a 320-dimensional GRU followed by a softmax layer, and the baseline network is similar to the agent except for a scalar output layer. (Note that Gu et al. (2017) use a 2-hidden-layer feed-forward network as the baseline network; in our implementation GRUs demonstrated better performance.) We use ADAM as the optimiser and set the learning rate and mini-batch size to 0.0004 and 6, respectively. For each sentence pair in a batch, 5 trajectories are sampled. Following best practices in RL, the baseline network is trained to reduce the MSE loss between the predictions and the rewards using a second optimiser. For inference, greedy sampling is used to pick action sequences. We set the hyper-parameters C* = 2, D* = 0. , α = 0. and β = − . To encourage exploration, the negative entropy policy term is weighted empirically with 0.001. Following Gu et al. (2017), we choose the model that maximises the quality-to-latency ratio (BLEU/AVP) on the validation set with a patience of 5 epochs. (We also attempted to choose the model that maximises BLEU or BLEU/AVL, but those stopping criteria resulted in instability of convergence.) On a single NVIDIA RTX2080-Ti GPU, training takes around 2 hours. The number of learnable parameters is around 6M.
Model configurations. We experiment with seven different configurations (below). We consider visual concepts (VC) as the main source of multimodal information. Visual concepts are more abstract forms of multimodal information: unlike spatial image representations or region-of-interest-based object representations, where the representation for the same concept can vary significantly across images, visual concepts remain constant. For example, the visual concept "dog" is the same regardless of the breed, colour, size or position of the concept in different images. Image classification (OC) features are used as a contrastive setting.

• Unimodal RL baseline (RL-base): this baseline follows Gu et al. (2017), where the environment is a text-only NMT model.

• Multimodal agent with VC initialisation (RL-init VC): we initialise the agent GRU using a projection of the flattened 72×100 matrix of visual concepts.

• Multimodal agent with attention over VC (RL-att VC): the agent attends over the set of visual concepts at each step.

• Multimodal agent with attention over OC (RL-att OC): the agent attends over the set of image classification-based spatial feature maps at each step.

• Visually initialised multimodal agent with attention over VC (RL-init-att VC): similar to RL-att VC, but the agent is also initialised with VC.

• Multimodal environment with unimodal RL agent (RL-env VC): the environment is an MMT model; the agent, however, is a standard RL agent akin to the baseline.

• Multimodal agent with multimodal environment (RL-env-init-att VC): this merges all the variants in that both the multimodal environment and the multimodal agent attend to visual concepts, and the latter is also initialised with visual information.
5 Results and Analysis

In this section, we first provide the results of our experiments (Section 5.1) and then analyse the behaviour of the (multimodal) agents (Section 5.2).
5.1 Main Results

We present the main results in Table 1. The top block for each language pair shows the textual Consecutive model and its multimodal counterpart (Consecutive+VC). These are our upper bounds since they have access to the entire source before translating. As expected, they have better BLEU but much larger AVL.

RL SiMT vs. Deterministic policy. The second block in Table 1 shows the deterministic-policy wait-k approaches. RL-base performs on par with the wait-k models for English→French and English→German. We however emphasise the flexibility of the stochastic policies learned with RL. These are particularly beneficial in the multimodal scenario and allow the image information to be exploited more efficiently, especially towards reducing the average lag. We expand on this later in Section 5.2.
Unimodal RL vs. Multimodal RL. The third block in Table 1 compares all multimodal RL variants against the text-only SiMT RL (RL-base). In general, the multimodal RL models produce translations that are significantly better than RL-base.

Across Multimodal RL Setups. With regard to the different configurations, we observe (1) an increase in quality for the RL-att models when compared to RL-base, which is consistent for both types of visual input (OC and VC), and (2) a decrease in lag for the RL-init models at a small decrease in quality (for VC RL-init in comparison to RL-base).

This observation suggests that an RL model whose agent explicitly attends over image information achieves higher quality, as the multimodal agent is more selective in its word choices. The RL-init configuration with prior image context, on the other hand, reduces the lag and seems to use WRITE actions more often than READ actions. It is interesting that OC and VC features result in similar quality translations; however, we see that on average the lag is lower with VC. We hypothesise that this could be due to the fact that the VC representations remain constant across images (see Section 4.3).

The RL-init-att configuration represents a middle ground: we see quality improvements similar to RL-att across setups (a gain of 2 BLEU points on average), but with a slightly lower latency. We however observe that RL-env-init-att has a slightly inferior performance with a more pronounced latency when compared to the RL-env model. We investigate this aspect in the next sections.

English→French            test 2016             test 2017             test 2018
                       BLEU↑  AVL↓  AVP↓     BLEU↑  AVL↓  AVP↓     BLEU↑  AVL↓  AVP↓
Consecutive             58.0  13.1   1.0      50.6  11.1   1.0      36.0  13.8   1.0
 +VC                    59.1  13.1   1.0      51.0  11.1   1.0      36.5  13.8   1.0
Wait-2                  48.1   2.6   0.7      42.9   2.6   0.7      32.1   2.7   0.7
Wait-3                  54.0   3.5   0.7      48.6   3.5   0.7      35.5   3.5   0.7
RL +att-OC              53.0*  4.1   0.8      46.4*  3.9   0.8      33.3*  4.4   0.8
 +att-VC                53.0*  4.0
 +init-VC               49.6
 +init-att-VC           52.6*  3.8
 +env-VC                54.0*  3.3
 +env-init-att-VC

English→German            test 2016             test 2017             test 2018
                       BLEU↑  AVL↓  AVP↓     BLEU↑  AVL↓  AVP↓     BLEU↑  AVL↓  AVP↓
Consecutive             35.5  13.1   1.0      27.7  11.1   1.0      25.8  13.8   1.0
 +VC                    35.9  13.1   1.0      27.0  11.1   1.0      25.4  13.8   1.0
Wait-2                  28.3   2.2   0.6      22.5   2.2   0.7      20.1   2.2   0.6
Wait-3                  32.6   3.0   0.7      25.4   3.0   0.7      24.1   3.0   0.7
RL +att-OC              33.9*  3.7   0.7
 +att-VC                33.3*  3.3   0.7      24.7*  3.0   0.7      23.0*  3.2   0.7
 +init-VC               29.7   2.8   0.7      21.3   2.4   0.7      20.5   2.5   0.6
 +init-att-VC
 +env-VC                30.0
 +env-init-att-VC       31.4   3.0   0.7      24.0*  2.9   0.7      22.4   3.0   0.7

Table 1: Results for the test sets 2016, 2017 and 2018 (averaged over 3 runs): * marks statistically significant increases in BLEU w.r.t. RL-base (p-value ≤ 0.05). Bold highlights best scores across the RL approaches.
5.2 Analysis

Investigating Average Lag. To further study the impact of our configurations on the sentence-level lag, in Figure 2 we present binned histograms of sentence lags over the English→German test 2016 set. Generally, the models which are initialised with image information have more mass towards the smaller delay bins. For the RL-init and RL-env-init-att setups, we also observe the presence of two modes: one around a lag value of 3, and one around negative values (around -0.25 and -1.25, respectively). These negative lag values are due to the difference in length between source and target sentences, which is typical for English→German. This also shows that an agent initialised with image information tends to prefer WRITE actions with fewer READ actions. Further, on manual inspection of some samples, we observed that in the cases with negative lag the model begins with a WRITE action straight after reading the first token (see Table 2). As the agent is a GRU model, this behaviour resembles that of an image captioning model. We also observe similar trends for English→French, with RL-init models predominantly having more mass towards smaller delay bins (see Figure 3).

Figure 2: Histogram of per-sentence lag values in test 2016 English→German. The Y axis shows mean values per bin. Bold highlights modes for each distribution.

Figure 3: Histogram of per-sentence lag values for test 2016 English→French. The Y axis shows mean values per bin. Bold highlights modes for each distribution.

SRC: the red car is ahead of the two cars in the background .
REF: das rote auto fährt vor den beiden autos im hintergrund . ('the red car goes before the both cars in the background')
RL-init: die person ist im begriff , die rote mannschaft auf dem roten auto versammelt . ('the person is in concept, that red manhood on the red car gathered')
Actions:
BLEU:
LAG: -1.875

Table 2: Example of a German VC RL-init setup sentence with a negative lag, where the model tends to write more before reading new words.
In Figure 4 we visualise the agent's attention at each time step. On average, the agent's actions correlate with the objects it attends to when producing the translation.

Figure 4: Visualisation of the agent attention and the corresponding actions over the source sentence from the test 2016 set: 'A man is grilling out in his backyard.'

We now examine the general pattern of agent attention over the visual concepts across the four configurations using the attention norm: a) RL-att-VC; b) RL-att-OC; c) RL-init-att; and d) RL-env-init-att. The attention norm is simply the average ℓ norm between two consecutive attention time steps (a small sketch follows below). This can help in measuring the average visual attention per time step for a given sentence. We then compare the attention-norm distributions over all the sentences in the English→German test 2016 set for the four different agent attention configurations. We present the results in Figure 5. Overall, the RL-init and RL-att models are significantly more peaky than RL-env-init-att. This suggests that the RL-env-init-att model generally spreads its attention across the 72 visual concepts more uniformly than the other two models. This is perhaps one of the causes for the slightly inferior performance of this model. We hypothesise that further regularisation of the attention distribution can ameliorate this behaviour and leave it as future work.

Figure 5: Distribution of attention norms for different agents with visual attention trained on the English→German dataset.
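A minimal sketch of this statistic is shown below, under two assumptions: that "between two consecutive attention time steps" means the norm of the difference of consecutive attention vectors, and that the norm is ℓ2 (the paper does not specify which ℓ norm is used).

```python
import torch

# A sketch of the attention-norm statistic described above.
def attention_norm(alphas):
    # alphas: (T, N) attention weights over N visual concepts for T agent steps
    diffs = alphas[1:] - alphas[:-1]               # (T-1, N)
    return diffs.norm(p=2, dim=-1).mean().item()   # mean l2 norm per step
```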
6 Conclusions

In this paper we presented the first thorough exposition of multimodal reinforcement learning strategies for simultaneous machine translation. We demonstrated the efficacy of visual information and showed that it leads to adaptive policies which substantially improve over the deterministic and unimodal RL baselines. Our empirical results indicate that both agent-side and environment-side visual information can be exploited to achieve higher-quality translations with lower latency.

Throughout the experimental journey, we observed that the optimisation of simultaneous machine translation for dynamic policies is non-trivial, due to the two competing objectives: translation quality versus latency. For unimodal simultaneous machine translation, RL approaches tend to achieve translation quality on par with that of the deterministic policies within the same average lag. We believe that the fundamental issue is related to the high variance of the estimator for sequence prediction, which increases sample complexity and impedes effective learning. On the other hand, approaches with deterministic policies are simple and effective, as they are positively biased for language pairs that are close to each other; but they suffer from poor generalisation.

In the multimodal simultaneous machine translation setting, however, the variance of the estimator for RL models can be substantially reduced due to the presence of additional (visual) information.
Acknowledgments
The authors thank the anonymous reviewers for their useful feedback. This work was supported by the MultiMT (H2020 ERC Starting Grant No. 678017) project. The work was also supported by the Air Force Office of Scientific Research (under award number FA8655-20-1-7006) project. Andy Mingren Li was supported by the Imperial College London UROP grant.
References
Ashkan Alinejad, Maryam Siahbani, and Anoop Sarkar. 2018. Prediction improves simultaneous neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3022–3027, Brussels, Belgium. Association for Computational Linguistics.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086.

Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pages 166–175.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster. 2020. Re-translation versus streaming for simultaneous translation. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 220–227, Online. Association for Computational Linguistics.

Philip Arthur, Trevor Cohn, and Gholamreza Haffari. 2020. Learning coupled policies for simultaneous machine translation. arXiv preprint arXiv:2002.04306.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Computing Research Repository, arXiv:1409.0473. Version 7.

Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, and Edward Grefenstette. 2018. Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946.

Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 304–323, Belgium, Brussels. Association for Computational Linguistics.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 432–439, Copenhagen, Denmark. Association for Computational Linguistics.

Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, and Joost van de Weijer. 2016. Does multimodality help human and machine for translation and image captioning? In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 627–633, Berlin, Germany. Association for Computational Linguistics.

Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, and Lucia Specia. 2020. Simultaneous machine translation with visual context. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2350–2361, Online. Association for Computational Linguistics.

Iacer Calixto, Desmond Elliott, and Stella Frank. 2016. DCU-UvA multimodal MT system report. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 634–638, Berlin, Germany. Association for Computational Linguistics.

Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924, Vancouver, Canada. Association for Computational Linguistics.

Iacer Calixto, Miguel Rios, and Wilker Aziz. 2019. Latent variable model for multi-modal translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6392–6405, Florence, Italy. Association for Computational Linguistics.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181, Portland, Oregon, USA. Association for Computational Linguistics.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. 2018. TextWorld: A learning environment for text-based games. In Workshop on Computer Games, pages 41–75. Springer.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 493–499, New Orleans, Louisiana. Association for Computational Linguistics.

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2054–2063.

J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130–141, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Prasoon Goyal, Scott Niekum, and Raymond J. Mooney. 2019. Using natural language for reward shaping in reinforcement learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 2385–2391. International Joint Conferences on Artificial Intelligence Organization.

Alvin Grissom II, Naho Orita, and Jordan Boyd-Graber. 2016. Incremental prediction of sentence-final verbs: Humans versus machines. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 95–104, Berlin, Germany. Association for Computational Linguistics.

Stig-Arne Grönroos, Benoit Huet, Mikko Kurimo, Jorma Laaksonen, Bernard Merialdo, Phu Pham, Mats Sjöberg, Umut Sulubacak, Jörg Tiedemann, Raphael Troncy, and Raúl Vázquez. 2018. The MeMAD submission to the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 603–611, Belgium, Brussels. Association for Computational Linguistics.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain. Association for Computational Linguistics.

K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Julian Hitschler, Shigehiko Schamoni, and Stefan Riezler. 2016. Multimodal pivots for image caption translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2399–2409, Berlin, Germany. Association for Computational Linguistics.

Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 639–645, Berlin, Germany. Association for Computational Linguistics.

Aizhan Imankulova, Masahiro Kaneko, Tosho Hirasawa, and Mamoru Komachi. 2020. Towards multimodal simultaneous neural machine translation. In Proceedings of the Fifth Conference on Machine Translation, pages 594–603, Online. Association for Computational Linguistics.

Julia Ive, Pranava Madhyastha, and Lucia Specia. 2019. Distilling translations with visual awareness. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6525–6538, Florence, Italy. Association for Computational Linguistics.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision, 123(1):32–73.

Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Jan Niehues, Ngoc-Quan Pham, Thanh-Le Ha, Matthias Sperber, and Alex Waibel. 2018. Low-latency neural speech translation. In Proc. Interspeech 2018, pages 1293–1297.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. 2017. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 290–298.

Koichiro Ryu, Shigeki Matsubara, and Yasuyoshi Inagaki. 2006. Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 683–690, Sydney, Australia. Association for Computational Linguistics.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 543–553, Berlin, Germany. Association for Computational Linguistics.

Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation.

Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. 2018. Video captioning via hierarchical reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4213–4222.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.