Towards Multimodal Simultaneous Neural Machine Translation
Aizhan Imankulova∗  Masahiro Kaneko∗  Tosho Hirasawa∗  Mamoru Komachi
Tokyo Metropolitan University
6-6 Asahigaoka, Hino, Tokyo 191-0065, Japan
{imankulova-aizhan, kaneko-masahiro, hirasawa-tosho}@ed.tmu.ac.jp, [email protected]

Abstract
Simultaneous translation involves translating a sentence before the speaker's utterance is completed in order to realize real-time understanding in multiple languages. This task is significantly harder than general full-sentence translation because of the shortage of input information during decoding. To alleviate this shortage, we propose multimodal simultaneous neural machine translation (MSNMT), which leverages visual information as an additional modality. Although the usefulness of images as an additional modality is moderate for full-sentence translation, we verified, for the first time, its importance for simultaneous translation. Our experiments with the Multi30k dataset showed that MSNMT in a simultaneous setting significantly outperforms its text-only counterpart in situations where 5 or fewer input tokens are needed to begin translation. We then verified the importance of visual information during decoding by (a) performing an adversarial evaluation of MSNMT, where we studied how models behave with incongruent input modalities, and (b) analyzing the image attention.
1 Introduction

Simultaneous translation is a natural language processing (NLP) task in which translation begins before receiving the whole source sentence. It is widely used in international summits and conferences, where real-time comprehension is one of the most important aspects. Simultaneous translation is already a difficult task for human interpreters because the message must be understood and translated while the input sentence is still incomplete (Seeber, 2015). Consequently, simultaneous translation is even more difficult for machines. Previous works attempt to solve this task by predicting the sentence-final verb (Grissom II et al., 2014) or unseen syntactic constituents (Oda et al., 2015).

∗ These authors contributed equally to this paper.
Figure 1: An overview of (a) vanilla NMT, (b) wait-k simultaneous NMT, and (c) multimodal simultaneous machine translation based on the wait-k approach, incorporating visual clues for better En→De translation (here k = 3).

Given the difficulty of predicting future inputs based on existing limited inputs, Ma et al. (2019) proposed a simple simultaneous neural machine translation (SNMT) approach, wait-k, which generates the target sentence concurrently with the source sentence, but always k tokens behind, for a given k satisfying latency requirements (we give a sketch of this policy at the end of this section). However, all existing approaches solve the given task using only the text modality, which may be insufficient to produce a reliable translation. Simultaneous interpreters often consider various additional information sources, such as visual clues or acoustic data, while translating (Seeber, 2015). Therefore, we hypothesize that using supplementary information, such as visual clues, can also be beneficial for simultaneous machine translation. To this end, we propose Multimodal Simultaneous Neural Machine Translation (MSNMT), which supplements the incomplete textual modality with a visual modality, in the form of an image, during the decoding process to predict still-missing information and improve translation quality. Our research can be applied in various situations where visual information is related to the content of speech, such as presentations that use slides (e.g., TED Talks) and news video broadcasts, etc. Our experiments show that the proposed MSNMT method achieves higher translation accuracy by leveraging image information than the SNMT model that does not use images. To the best of our knowledge, we are the first to propose the incorporation of visual information to solve the problem of incomplete text information in SNMT.

The main contributions of our research are:

• We propose to combine multimodal and simultaneous NMT and discover cases where such multimodal signals are beneficial for the end-task.

• We show that the MSNMT approach significantly improves the quality of simultaneous translation by enriching incomplete text input information using visual clues.

• By providing an adversarial evaluation for both text and image, and a quantitative attention analysis, we show that the models indeed depend on both textual and visual information.
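To make the wait-k policy referenced above concrete, here is a minimal sketch of the fixed read/write schedule (our illustration, not the authors' released code): the model first reads k source tokens, then alternates between writing one target token and reading one more source token until the source is exhausted, after which it finishes decoding. The `decode_step` callback is a hypothetical stand-in for one step of an arbitrary autoregressive NMT decoder.

```python
from typing import Callable, List

def wait_k_translate(
    source: List[str],
    k: int,
    decode_step: Callable[[List[str], List[str]], str],
    eos: str = "</s>",
    max_len: int = 100,
) -> List[str]:
    """Fixed wait-k policy: the t-th target token is generated from only
    the first min(k + t - 1, len(source)) source tokens (Ma et al., 2019)."""
    target: List[str] = []
    while len(target) < max_len:
        # Number of source tokens visible at this decoding step.
        visible = min(k + len(target), len(source))
        token = decode_step(source[:visible], target)  # hypothetical decoder step
        if token == eos:
            break
        target.append(token)
    return target
```

With k = 3, as in Figure 1 (c), the first target token is emitted after reading three source tokens; note that, unlike a hard truncation of the input, the model eventually conditions on the entire source sentence.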
2 Related Work

For simultaneous translation, it is crucial to predict the words that have not yet appeared in order to produce a translation. For example, it is important to distinguish nouns in SVO-SOV translation and verbs in SOV-SVO translation (Ma et al., 2019). SNMT can be realized with two types of policy: fixed and adaptive (Zheng et al., 2019). Most studies with an adaptive policy to predict upcoming tokens include explicit prediction of the sentence-final verb (Grissom II et al., 2014; Matsubara et al., 2000) and of unseen syntactic constituents (Oda et al., 2015). Dynamic SNMT models (Gu et al., 2017; Dalvi et al., 2018; Arivazhagan et al., 2019), which decide to READ/WRITE in one model, have the advantage of using input text information as effectively as possible, given the lack of such information in the first place. Meanwhile, Ma et al. (2019) proposed a simple wait-k method with a fixed policy, which generates the target sentence only from the source sentence that is delayed by k tokens. However, their models for simultaneous translation so far rely only on the source sentence.

https://interactio.io/

In this research, we concentrate on the wait-k approach with a fixed policy, so that the amount of input textual context can be controlled to better analyze whether multimodality is effective in SNMT.

Multimodal NMT (MNMT) for full-sentence machine translation has been developed to enrich the text modality by using visual information (Hitschler et al., 2016; Specia et al., 2016). While the improvement brought by visual features is moderate, their usefulness was proven by Caglayan et al. (2019). They showed that MNMT models are able to capture visual clues under limited textual context, where source sentences are synthetically degraded by color deprivation, entity masking, and progressive masking. However, theirs is an artificial setting in which the models are deliberately deprived of source-side textual context by masking, whereas our research identifies an actual end-task and shows the effectiveness of using multimodal data there. Also, in their progressive masking experiments, the model is exposed to only k words; in our case, a model eventually sees all the text, generating each target token after taking in every new source token once it has waited for k words to start translating.

In MNMT, visual features are incorporated into standard machine translation in many ways. Doubly-attentive models are used to capture the textual and visual context vectors independently and then combine these context vectors in a concatenation manner (Calixto et al., 2017) or a hierarchical manner (Libovický and Helcl, 2017). Some studies use visual features in a multitask learning scenario (Elliott and Kádár, 2017; Zhou et al., 2018). Also, recent work on MNMT has partly addressed lexical ambiguity by using visual information (Elliott et al., 2017; Lala and Specia, 2018; Gella et al., 2019), showing that using textual context together with visual features outperforms unimodal models.

In our study, visual features are extracted using image processing techniques and then integrated into an SNMT model as additional information, which is supposed to be useful for predicting missing words in a simultaneous translation scenario. To the best of our knowledge, this is the first work that incorporates external knowledge into an SNMT model.
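Since the remainder of this paper fixes the wait-k schedule discussed above, we restate its prefix-to-prefix formulation for reference; the equation below follows Ma et al. (2019) and is not reproduced from the present paper.

```latex
% Prefix-to-prefix factorization under a monotonic read schedule g(t),
% the number of source tokens visible when the t-th target token is emitted.
p_g(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|}
    p\bigl(y_t \mid x_{\le g(t)},\, y_{<t}\bigr),
\qquad
g_{\text{wait-}k}(t) = \min\bigl(k + t - 1,\, |\mathbf{x}|\bigr)
```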
3 Multimodal Simultaneous Neural Machine Translation Architecture

Our main goal in this paper is to investigate whether image information can improve SNMT; as a result, two separate tasks could benefit from being combined. To that end, we chose to keep our experiments as pure as possible, without using additional data or other types of models. (Involving other types of data for training is out of the scope of this paper; however, it is among the next steps of this research.) This allows us to control the amount of input textual context, so we can easily analyze the relationship between the amount of textual and visual information.

In this section, we describe our MSNMT model, which is composed by combining an SNMT framework (Ma et al., 2019) and a multimodal model (Libovický and Helcl, 2017) (Figure 1 (c)). We base our model on the RNN architecture (Libovický and Helcl, 2017; Caglayan et al., 2017a). The models take a sentence and its corresponding image as inputs. The decoder of the MSNMT model outputs the target-language sentence using a simultaneous translation mechanism, attending not only to the source sentence but also to the image related to the source sentence.

We first briefly review standard NMT to set up the notation (see also Figure 1, (a)). The encoder of a standard NMT model always takes the whole input sequence X = (x_1, ..., x_n) of length n, where each x_i is a word embedding, and produces source hidden states H = (h_1, ..., h_n). The decoder predicts the next output token y_t using H and the previously generated tokens, denoted Y_{<t}.

4 Experiments

We compare the following three models:

1. Captioning: We experimented with image captioning in order to examine the effect of using visual clues alone to produce adequate translations. In this setting, instead of an input sentence, we used only one image as input.

2. SNMT: We use only the text modality of the training data as a baseline for each wait-k model.

3. MSNMT: We use the image modality along with the text modality of the training data for each wait-k model.

We applied preprocessing using task1-tokenize.sh from https://github.com/multi30k/dataset. To train the above models, we utilize attentional NMT (Bahdanau et al., 2015) with a 2-layer unidirectional GRU encoder and a 2-layer conditional GRU decoder. We use the open-source implementation of the nmtpytorch toolkit v3.0.0 (Caglayan et al., 2017b). The hyperparameters not mentioned here were set to the default values in nmtpytorch; due to space constraints, we list them in Appendix A (Table 7). We incorporated early stopping: when the METEOR score (Denkowski and Lavie, 2011) did not increase on the development set for 10 epochs, training was stopped.
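To make the hierarchical fusion in our decoder concrete, below is a minimal PyTorch sketch in the style of Libovický and Helcl (2017): each modality first yields its own context vector through standard attention, and a second attention then weighs the textual and visual context vectors against the current decoder state. Module names and dimensions are illustrative assumptions, not the nmtpytorch implementation.

```python
import torch
import torch.nn as nn
from typing import Tuple

class HierarchicalFusion(nn.Module):
    """Second-level ("hierarchical") attention over per-modality context
    vectors, in the style of Libovicky and Helcl (2017). Illustrative only."""

    def __init__(self, txt_dim: int, img_dim: int, dec_dim: int):
        super().__init__()
        # Project each modality's context vector into a shared space.
        self.txt_proj = nn.Linear(txt_dim, dec_dim)
        self.img_proj = nn.Linear(img_dim, dec_dim)
        # Scores a projected context vector against the decoder state.
        self.energy = nn.Linear(2 * dec_dim, 1)

    def forward(self, c_txt: torch.Tensor, c_img: torch.Tensor,
                s_t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # c_txt: (batch, txt_dim), context from attention over source tokens
        # c_img: (batch, img_dim), context from attention over image regions
        # s_t:   (batch, dec_dim), current decoder hidden state
        contexts = torch.stack(
            [self.txt_proj(c_txt), self.img_proj(c_img)], dim=1
        )  # (batch, 2, dec_dim)
        query = s_t.unsqueeze(1).expand_as(contexts)
        scores = self.energy(torch.cat([contexts, query], dim=-1))
        beta = torch.softmax(scores, dim=1)   # (batch, 2, 1) modality weights
        fused = (beta * contexts).sum(dim=1)  # (batch, dec_dim)
        # beta[:, 1] is the weight on the visual context: the quantity we
        # later average per decoding step in the image-attention analysis.
        return fused, beta.squeeze(-1)
```

At each decoding step, the fused vector plays the role of the single context vector in a text-only decoder; the weight placed on the visual context by this second attention is what we analyze later (cf. Equation 8 and Figure 5).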
5 Results

In this section, we report METEOR scores, a widely used evaluation metric in MNMT, on our test sets for each wait-k model. Due to space constraints, we show results only for the test sets; we additionally report BLEU scores in Appendix B. Statistical significance (p < 0.05) of the difference in BLEU scores was tested with Moses's bootstrap-hypothesis-difference-significance.pl. "Full" means that the whole input sentence is used as input to the model. All reported results are the average of three runs using three different random seeds.

Tables 1-2 illustrate the METEOR scores of the MSNMT and SNMT models on test2016 and test2017, respectively. For all language pairs, MSNMT systems show significant improvements over SNMT systems when input textual information is scarce (k ≤ 5); with larger waits (k ≥ 7), the text information becomes sufficient in most cases.

Table 1: METEOR scores of SNMT (S) and MSNMT (M) models for six translation directions on test2016. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction. "†" indicates statistical significance of the improvement over SNMT.

Table 2: METEOR scores of SNMT (S) and MSNMT (M) models for four language pairs on test2017. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction. "†" indicates statistical significance of the improvement over SNMT.

The results of Captioning in Table 3, compared to those in Table 1, show that using only visual information is not enough for translation. The cause is that captioning does not consider the actual text and only describes the image itself.

→ En    → De    → Fr    → Cs
12.36   18.65   17.71   8.76

Table 3: METEOR scores of Captioning models into four target languages on test2016. Results are the average of three runs.

6 Analysis

In this section, we provide a thorough analysis to further investigate the effect of visual data on producing a simultaneous translation by (a) providing an adversarial evaluation and (b) visualizing attention.

In order to determine whether MSNMT systems are aware of the visual context (Elliott, 2018), we perform two different versions of adversarial evaluation on test2016:

Image Awareness. We present our system with the correct visual data for its source sentence (Congruent), as opposed to random visual data as input (Incongruent) (Elliott, 2018). For this purpose, we reversed the order of the 1,000 images of test2016, so that no congruent visual data overlaps. We then reconstruct image features for those images to use as input to a model.

Text Awareness. We present our system with incorrect source sentences but with the correct visual information, in order to determine the impact of visual data on producing correct translations for noisy text input. We used the same shuffling technique as above for the text data.

Table 4: Image Awareness results on test2016. METEOR scores of MSNMT Congruent (C) and Incongruent (I) settings for six translation directions. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction.

Table 5: Text Awareness results on test2016. METEOR scores of SNMT (S) and MSNMT (M) models for six translation directions. Results are the average of three runs. Bold indicates the best METEOR score for each wait-k for each translation direction.

Figure 2: Images presented in the translation examples (Table 6) and attention visualizations (Figures 3-4): (a) Dogs, (b) Players.

Results of the image awareness experiments are shown in Table 4. We can see a large difference in METEOR scores between the MSNMT congruent and incongruent settings when the input text information is incomplete, which implies that our proposed model learns to extract information from images for translation. The interesting part is full-sentence translation, where scores for the incongruent setting outperform or are very close to those of the congruent setting. The reason is that when textual information is sufficient, visual information becomes less relevant in some cases.
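The incongruent conditions above amount to breaking the sentence-image alignment while leaving both sides of the data intact. The following sketch shows how such an evaluation set could be built with the order-reversal described above; the file names are hypothetical, and we assume one precomputed feature tensor per test sentence, stored in corpus order.

```python
import numpy as np

def make_incongruent(features: np.ndarray) -> np.ndarray:
    """Reverse the order of image features along the corpus axis so that
    no sentence keeps its congruent image (1,000 items have no fixed
    point under reversal)."""
    return features[::-1].copy()

# Hypothetical usage with precomputed test2016 image features.
feats = np.load("test2016_feats.npy")        # congruent image features
incongruent_feats = make_incongruent(feats)  # adversarial image features

# Text awareness is the mirror case: keep the correct image features
# but pair them with reversed (i.e., wrong) source sentences.
with open("test2016.en") as f:
    sentences = [line.rstrip("\n") for line in f]
shuffled_sentences = sentences[::-1]
```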
From the results of the text awareness experiments (see Table 5), we can draw the following conclusions. The fact that MSNMT models handle noisy text input better than SNMT models implies that the proposed model can leverage visual information. For both SNMT and MSNMT, the METEOR score degrades as the number of available first k tokens increases. We assume that the more noise is given as input, the more a model gets confused. However, visual information makes a model more robust to the introduced noise. MSNMT models also consider textual information, as the models have lower performance when the input tokens are more restricted (as opposed to Table 1, columns M).

Example 1 (Figure 2a):
Source:      a black dog and a brown dog with a ball .
Target:      ein schwarzer und ein brauner hund mit einem ball .
Captioning:  zwei hunde spielen im gras . (Two dogs are playing in the grass .)
S wait-3:    ein schwarzer hund springt über einen zaun . (a black dog jumps over a fence .)
M wait-3:    ein schwarzer hund und ein brauner hund rennen auf einem Feld . (a black dog and a brown dog run on a field .)
S full:      ein schwarzer hund und ein brauner hund mit einem ball . (a black dog and a brown dog with a ball .)
M full:      ein schwarzer hund und ein brauner hund mit einem ball . (a black dog and a brown dog with a ball .)

Example 2 (Figure 2b):
Source:      a baseball player in a black shirt just tagged a player in a white shirt .
Target:      eine baseballspielerin in einem schwarzen shirt fängt eine spielerin in einem weißen shirt .
Captioning:  ein mann in einem weißen trikot macht einen trick auf dem boden und hält dabei einen anderen mann . (a man in a white jersey is doing a trick on the floor while holding another man .)
S wait-3:    ein baseballspieler in einem roten trikot versucht den ball zu fangen , während der schiedsrichter zuschaut . (a baseball player in a red jersey tries to catch the ball while the referee is watching .)
M wait-3:    ein baseballspieler versucht , einen ball zu fangen . (a baseball player is trying to catch a ball .)
S full:      ein baseballspieler in einem schwarzen hemd hat einen spieler in einem weißen hemd <unk> . (a baseball player in a black shirt has a player in a white shirt <unk> .)
M full:      ein baseballspieler in einem schwarzen hemd hat gerade ein spieler in einem weißen hemd <unk> . (a baseball player in a black shirt has just one player in a white shirt <unk> .)

Table 6: Examples of En→De translations from test2016 using SNMT (S) and MSNMT (M) models. English meanings are shown in parentheses. Italics show the correct translation outputs.

Figure 3: Attention visualization for MSNMT outputs for Figure 2a at each decoding step of En→De translation (see Table 6): (a) wait-3, (b) Full.

Figure 4: Attention visualization for MSNMT outputs for Figure 2b at each decoding step of En→De translation (see Table 6): (a) wait-3, (b) Full.

Figure 5: Hierarchical (second) attention scores for visual features on test2016 for six translation directions in different wait-k models. Scores are averaged over all sentences in the test2016 set.

As an example, we sampled sentences and their images from test2016 (Figure 2) to compare the outputs of our systems. Table 6 lists the translations generated by the Captioning, SNMT (S), and MSNMT (M) models. In the first example, Captioning did not capture "a ball" or "a black dog and a brown dog," which are present in the source sentence. The SNMT model with wait-3 predicted an erroneous "zaun (fence)," which is present neither in the source text nor in the corresponding image.
On the other hand, the MSNMT model was able to capture both the input text and the visual information and generated a richer output. When the full sentence is given as input, both MSNMT and SNMT translated it correctly. In the second example, none of the models generated a correct translation. For example, the Captioning and SNMT models generated words that are present in neither of the inputs, such as "schiedsrichter (referee)" or "trick (trick)." Also, our MSNMT models failed to capture the gender of the gender-neutral source word "player" and translated it into "spieler" instead of "spielerin," although it was obvious from the visual information.

For a more detailed analysis, we first visualized attention on the images of the above examples at each decoding step for the "k = 3" and "Full" input scenarios (see Figures 3-4). Given incomplete text information, the proposed MSNMT model attends to different parts of an image. For example, when decoding the token "brauner," MSNMT attends more to the brown dog, and when decoding "rennen," the model attends to the legs of the dogs (see Figure 3a). Also, in the other example, MSNMT focuses on a player while decoding "baseballspieler." We hypothesize that the MSNMT model is trying to find useful information in the image. In contrast, when the input text is fully given, MSNMT attends only to localized parts of the image. These results show us, once again, that visual data can enrich an incomplete input sentence and be used to produce more accurate translations with low latency in most cases.

Furthermore, we investigate how much attention is given to the visual information in each wait-k model. For that purpose, we simply calculate the average score of the second attention (Equation 8) over the visual features for each decoding step for all sentences. Figure 5 reports the averages of the second attention scores for visual features on test2016 for six translation directions. We can see that for lower k values, the MSNMT model utilizes image information more.

7 Conclusion

In this paper, we proposed a multimodal simultaneous neural machine translation approach that takes advantage of visual information as an additional modality to compensate for the shortage of input text information in simultaneous neural machine translation. We showed that in a wait-k setting our model significantly outperformed its text-only counterpart in situations where only a few input tokens are available to begin translation. Furthermore, we showed the importance of visual information for simultaneous translation, especially in small-k settings, by performing a thorough analysis on the Multi30k data. We hope that our proposed method can be explored even further for various tasks and datasets.

In this paper, we created a separate model for each value of wait-k. In future work, however, we plan to experiment with having a single model for all k values (Zheng et al., 2019). Furthermore, we acknowledge the importance of investigating MSNMT effects on more realistic data (e.g., TED), where the utterance does not necessarily match a shown image while speaking and/or where its context cannot be guessed from the shown image.

Acknowledgments

We are immensely grateful to Raj Dabre and Rob van der Goot, who provided expertise, support, and insightful comments that greatly improved the manuscript. We would also like to show our gratitude to Desmond Elliott for valuable feedback and discussions of the paper.

References

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel.
2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313-1323.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017a. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 432-439.

Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, and Loïc Barrault. 2017b. NMTPY: A flexible toolkit for advanced neural machine translation systems. The Prague Bulletin of Mathematical Linguistics, 109(1):15-28.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4159-4170.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913-1924.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 493-499.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85-91.

Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974-2978.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, pages 215-233.

Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30k: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70-74.

Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130-141.

Spandana Gella, Desmond Elliott, and Frank Keller. 2019. Cross-lingual visual verb sense disambiguation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1998-2004.

Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1342-1352.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1, Long Papers), pages 1053-1062.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.

Julian Hitschler, Shigehiko Schamoni, and Stefan Riezler. 2016. Multimodal pivots for image caption translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2399-2409.

Chiraag Lala and Lucia Specia. 2018. Multimodal lexical translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 3810-3817.

Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196-202.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025-3036.

Shigeki Matsubara, Kiyoshi Iwashima, Nobuo Kawaguchi, Katsuhiko Toyama, and Yoichi Inagaki. 2000. Simultaneous Japanese-English interpretation based on early prediction of English verb. In Proceedings of The Fourth Symposium on Natural Language Processing, pages 268-273.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Syntax-based simultaneous translation through prediction of unseen syntactic constituents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 198-207.

Kilian G. Seeber. 2015. Simultaneous interpreting. In The Routledge Handbook of Interpreting, pages 91-107.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation (Volume 2: Shared Task Papers), pages 543-553.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816-5822.

Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, and Zhou Yu. 2018. A visual attention grounding neural model for multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3643-3653.

Parameter          SNMT & MSNMT
Enc., Dec. dim.    320
Emb. dim.          200
Dropout            0.5
Dropout for emb.   0.4
Tied embedding     2-way
Max length         100
Optimizer          adam
Learning rate      0.0004
Batch size         64
wait-k             1, 3, 5, 7, 9, Full

Parameter          MSNMT
Sampler type       approximate
Dec. init          zero
Fusion type        hierarchical
Channels           1024

Table 7: Hyperparameter values of SNMT and MSNMT models.
A Hyperparameters

Table 7 lists the hyperparameters of the SNMT and MSNMT models used in our experiments. We use the same hyperparameters, except for model-specific ones, for SNMT and MSNMT for a fair comparison.

B BLEU scores

Tables 8-10 show the BLEU scores of the models used in our experiments (the corresponding METEOR scores are shown in Tables 1-3).

Table 8: BLEU scores of SNMT (S) and MSNMT (M) models for six translation directions on test2016. Results are the average of three runs. Bold indicates the best BLEU score for each wait-k for each translation direction. "†" indicates statistical significance of the improvement over SNMT.

Table 9: BLEU scores of SNMT (S) and MSNMT (M) models for four language pairs on test2017. Results are the average of three runs.