UMONS Submission for WMT18 Multimodal Translation Task
Jean-Benoit Delbrouck and Stéphane Dupont
TCTS Lab, University of Mons, Belgium
{jean-benoit.delbrouck, stephane.dupont}@umons.ac.be

Abstract
This paper describes the UMONS solution for the Multimodal Machine Translation Task presented at the third conference on machine translation (WMT18). We explore a novel architecture, called deepGRU, based on recent findings in the related task of Neural Image Captioning (NIC). The models presented in the following sections lead to the best METEOR translation score for both constrained (English, image) → German and (English, image) → French sub-tasks.
1 Introduction

In the field of Machine Translation (MT), the efficient integration of multimodal information remains a challenging task. It requires combining diverse modality vector representations with each other. These vector representations, also called context vectors, are computed so as to capture the most relevant information of each modality in order to output the best possible translation of a sentence. To investigate the effectiveness of information obtained from images, a multimodal neural machine translation (MNMT) shared task (Specia et al., 2016) was introduced to the community.

Even though soft attention models have been extensively studied in MNMT (Delbrouck and Dupont, 2017a; Caglayan et al., 2016; Calixto et al., 2017), the most successful recent work (Caglayan et al., 2017a) focused on using the max-pooled features extracted from a convolutional network to modulate some components of the system (i.e., the target embeddings). Convolutional features or attention maps have also recently shown some success (Delbrouck and Dupont, 2017b) in an encoder-based attention model conditioned on the source encoder representation. Both model types lead to similar results, the latter being slightly more complex and longer to train. One feature they share is that the proposed models remain relatively small. Indeed, the number of trainable parameters seems upper-bounded by the limited number of unique training examples (cf. Section 4). Heavy or complex attention models over visual features showed premature convergence and restricted scalability.

The model proposed by the University of Mons (UMONS) in 2018 is called DeepGRU, a novel architecture based on the previously investigated conditional GRU (cGRU). We enrich the architecture with three ideas borrowed from the closely related NIC task: a third GRU as bottleneck function, a multimodal projection, and the use of the gated tanh activation. We make sure to keep the overall model light, efficient, and fast to train. We start by describing the baseline model in Section 2, followed by the three aforementioned NIC upgrades which make up our deepGRU model in Section 3. Finally, we present the data made available by the Multimodal Machine Translation Task in Section 4 and the results in Section 5, then close with a brief discussion in Section 6.
2 Model

Given a source sentence X = (x_1, x_2, ..., x_M) and an image I, an attention-based encoder-decoder model (Bahdanau et al., 2014) outputs the translated sentence Y = (y_1, y_2, ..., y_N). If we denote the model parameters by θ, then θ is learned by maximizing the likelihood of the observed sequence Y, or in other words, by minimizing the cross-entropy loss. The objective function is given by:

    \mathcal{L}(\theta) = -\sum_{t=1}^{N} \log p_\theta(y_t \mid y_{<t}, X, I)    (1)

Encoder  At every time-step t, the encoder creates an annotation h_t according to the current embedded word x'_t and its previous internal state h_{t-1}:

    h_t = f_{enc}(x'_t, h_{t-1})    (2)

Every word x_t of the source sequence X is an index into the embedding matrix E_x, so that the following formula maps the word to the f_enc size S:

    x'_t = W_x E_x x_t    (3)

The total size of the embedding matrix E_x depends on the source vocabulary size |Y_s| and the embedding dimension d, such that E_x ∈ R^{|Y_s| × d}. The mapping matrix W_x also depends on the embedding dimension, because W_x ∈ R^{d × S}. The encoder function f_enc is a bi-directional GRU (Cho et al., 2014). The following equations define a single GRU block (called f_gru for future reference):

    z_t = \sigma(x'_t + W_z h_{t-1})
    r_t = \sigma(x'_t + W_r h_{t-1})
    \bar{h}_t = \tanh(x'_t + r_t \odot (W_h h_{t-1}))
    h'_t = (1 - z_t) \odot \bar{h}_t + z_t \odot h_{t-1}    (4)

where h'_t ∈ R^S. Our encoder consists of two GRUs: one reads the source sentence from 1 to M and the second from M to 1. The final encoder annotation h_t for time-step t is the concatenation of both GRU annotations h'_t. Therefore, the encoder's set of annotations H is of size M × 2S.

Decoder  At every time-step t, the decoder outputs probabilities p_t over the target vocabulary Y_d according to the previously generated word y_{t-1}, its internal state s_{t-1}, and the image I:

    y_t \sim p_t = f_{bot}(f_{dec}(y_{t-1}, s_{t-1}, I))    (5)

Every word y_t of the target sequence Y is an index into the embedding matrix E_y, so that the following formula maps the word to the f_dec size D:

    y'_t = W_y E_y y_{t-1}    (6)

The decoder function f_dec is a conditional GRU (cGRU, https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf). The following equations describe a cGRU cell:

    s'_t = f_{gru}(y'_t, s_{t-1})
    c_t = f_{att}(s'_t, I, H)
    s_t = f_{gru}(s'_t, c_t)    (7)

where f_att is the visual attention module over the set of source annotations H and the pooled vector v of ResNet-50 features extracted from image I. More precisely, our attention model is the product between the so-called soft attention over the M source annotations h_{0,...,M-1} and a linear transformation of the pooled vector v of image I:

    a'_t = W_a \tanh(W_s s'_t + W_H H)    (8)
    a_t = \mathrm{softmax}(a'_t)    (9)
    c'_t = \sum_{i=0}^{M-1} a_{t,i} h_i    (10)
    v_t = \tanh(W_{img} I)    (11)
    c_t = W_c c'_t \odot v_t    (12)

The bottleneck function f_bot projects the cGRU output into probabilities over the target vocabulary. It is defined as:

    b_t = \tanh(W_{bot} [y_{t-1}, s_t, c_t])    (13)
    y_t \sim p_t = \mathrm{softmax}(W_{proj} b_t)    (14)

where [·,·] denotes the concatenation operation.
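To make the encoder side concrete, the following is a minimal PyTorch sketch of the f_gru block of equation 4 and of the bi-directional reading that produces H. It is an illustrative reimplementation under our own assumptions, not the released code: all names are ours, and the input is assumed to be already projected to size S as in equation 3.

```python
import torch
import torch.nn as nn

class GRUBlock(nn.Module):
    """Sketch of the f_gru block of equation (4).

    The input x is assumed to be already projected to the hidden
    size S (equation 3), so no input weight matrices appear here.
    """
    def __init__(self, hidden_size):
        super().__init__()
        self.W_z = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_r = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h_prev):
        z = torch.sigmoid(x + self.W_z(h_prev))          # update gate
        r = torch.sigmoid(x + self.W_r(h_prev))          # reset gate
        h_bar = torch.tanh(x + r * self.W_h(h_prev))     # candidate state
        return (1.0 - z) * h_bar + z * h_prev


def encode(x_proj, fwd_gru, bwd_gru):
    """Bi-directional encoding: read the projected source sequence
    x_proj (M x S) from 1 to M and from M to 1, then concatenate the
    two annotations per time-step, giving H of size M x 2S."""
    M, S = x_proj.shape
    h_f = x_proj.new_zeros(S)
    h_b = x_proj.new_zeros(S)
    fwd, bwd = [], []
    for t in range(M):
        h_f = fwd_gru(x_proj[t], h_f)
        h_b = bwd_gru(x_proj[M - 1 - t], h_b)
        fwd.append(h_f)
        bwd.append(h_b)
    bwd.reverse()
    return torch.stack([torch.cat([f, b]) for f, b in zip(fwd, bwd)])
```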
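Likewise, here is a sketch of one baseline decoder step covering equations 7 to 14, reusing GRUBlock from the encoder sketch and operating on a single unbatched example. The attention inner size `att`, the output size of W_img, and the use of the raw target embedding y_{t-1} (size d) in the concatenation of equation 13 are our assumptions; the pooled ResNet-50 feature I is 2048-dimensional.

```python
class CGRUStep(nn.Module):
    """One cGRU decoder step with multimodal attention, eqs. (7)-(14),
    as we read them. GRUBlock is defined in the encoder sketch above."""
    def __init__(self, S=256, emb=128, img_dim=2048, att=256, vocab=5000):
        super().__init__()
        self.gru1 = GRUBlock(S)                           # first f_gru of eq. (7)
        self.gru2 = GRUBlock(S)                           # second f_gru of eq. (7)
        # attention parameters, eqs. (8)-(12)
        self.W_s = nn.Linear(S, att, bias=False)
        self.W_H = nn.Linear(2 * S, att, bias=False)
        self.W_a = nn.Linear(att, 1, bias=False)
        self.W_img = nn.Linear(img_dim, S, bias=False)
        self.W_c = nn.Linear(2 * S, S, bias=False)
        # bottleneck, eqs. (13)-(14)
        self.W_bot = nn.Linear(emb + S + S, emb, bias=False)
        self.W_proj = nn.Linear(emb, vocab, bias=False)

    def forward(self, y_proj, y_emb, s_prev, H, img):
        # y_proj is y'_t of eq. (6) projected to size S; H is M x 2S
        s_hat = self.gru1(y_proj, s_prev)                              # eq. (7)
        scores = self.W_a(torch.tanh(self.W_s(s_hat) + self.W_H(H)))  # eq. (8)
        alpha = torch.softmax(scores, dim=0)                           # eq. (9)
        c_txt = (alpha * H).sum(dim=0)                                 # eq. (10)
        v = torch.tanh(self.W_img(img))                                # eq. (11)
        c = self.W_c(c_txt) * v                                        # eq. (12)
        s = self.gru2(s_hat, c)                                        # eq. (7)
        b = torch.tanh(self.W_bot(torch.cat([y_emb, s, c])))           # eq. (13)
        return torch.softmax(self.W_proj(b), dim=-1), s                # eq. (14)
```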
3 DeepGRU

The deepGRU decoder (Delbrouck and Dupont, 2018) is a variant of the cGRU decoder, enriched with the following three components.

Gated hyperbolic tangent  First, we make use of the gated hyperbolic tangent activation (Teney et al., 2017) instead of tanh. This non-linear layer implements a function f_ght : x ∈ R^n → y ∈ R^m with parameters defined as follows:

    y' = \tanh(W x + b)
    g = \sigma(W_g x + b_g)
    y = y' \odot g    (15)

where W, W_g ∈ R^{n × m}. We apply this gating system to equations 11 and 13.

GRU bottleneck  When working with small dimensions, one can afford to replace the computation of b_t in equation 13 by a new GRU block f_gru:

    b^v_t = f_{ght}(W^v_{bot}(f_{gru}([y_{t-1}, s'_t, v_t], s_t)))    (16)

The GRU bottleneck can be seen as a new block f_gru encoding the visual information v_t together with its surrounding context (y_{t-1} and s'_t). Therefore, equation 12 is no longer computed with v_t, so that the second block f_gru of equation 7 encodes textual information only.

Multimodal projection  Because we now have a linguistic GRU block and a visual GRU block, we want both representations to have their own projection when computing the candidate probabilities. Equations 13 and 14 become:

    b^t_t = f_{ght}(W^t_{bot} s_t)    (17)
    y_t \sim p_t = \mathrm{softmax}(W^t_{proj} b^t_t + W^v_{proj} b^v_t)    (18)

where b^v_t comes from equation 16. Note that we use the gated hyperbolic tangent in equations 16 and 17.
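Putting the three modifications together, here is a minimal sketch of the deepGRU output head (equations 15 to 18), again reusing GRUBlock from the encoder sketch. Two caveats: feeding the concatenation [y_{t-1}, s'_t, v_t] into our simplified GRUBlock requires an input projection, so the `pack` layer below is our workaround rather than a component described by the paper, and folding W^v_bot and W^t_bot into the gated tanh's linear layers is likewise our shortcut.

```python
import torch
import torch.nn as nn
# GRUBlock is the block defined in the encoder sketch of Section 2.

class GatedTanh(nn.Module):
    """f_ght of equation (15): a tanh branch modulated by a sigmoid gate."""
    def __init__(self, n, m):
        super().__init__()
        self.lin = nn.Linear(n, m)     # W, b
        self.gate = nn.Linear(n, m)    # W_g, b_g

    def forward(self, x):
        return torch.tanh(self.lin(x)) * torch.sigmoid(self.gate(x))


class DeepGRUHead(nn.Module):
    """GRU bottleneck and multimodal projection, eqs. (16)-(18)."""
    def __init__(self, S=256, emb=128, vocab=5000):
        super().__init__()
        self.gru3 = GRUBlock(S)                             # third GRU, eq. (16)
        self.pack = nn.Linear(emb + S + S, S, bias=False)   # our workaround
        self.ght_v = GatedTanh(S, emb)                      # f_ght о W^v_bot
        self.ght_t = GatedTanh(S, emb)                      # f_ght о W^t_bot, eq. (17)
        self.proj_t = nn.Linear(emb, vocab, bias=False)     # W^t_proj
        self.proj_v = nn.Linear(emb, vocab, bias=False)     # W^v_proj

    def forward(self, y_emb, s_hat, v, s):
        # eq. (16): encode [y_{t-1}, s'_t, v_t] with the textual state s_t
        u = self.pack(torch.cat([y_emb, s_hat, v]))
        b_v = self.ght_v(self.gru3(u, s))
        b_t = self.ght_t(s)                                 # eq. (17)
        logits = self.proj_t(b_t) + self.proj_v(b_v)        # eq. (18)
        return torch.softmax(logits, dim=-1)
```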
4 Data and settings

The Multi30K dataset (Elliott et al., 2016) is provided by the challenge. For each image, one of the English descriptions was selected and manually translated into German and French by a professional translator. As training and development data, 29,000 and 1,014 triples are used respectively. We use the three available test sets to score our models: the Flickr Test2016 and Flickr Test2017 sets contain 1,000 image-caption pairs each, and the ambiguous MSCOCO test set (Elliott et al., 2017) contains 461 pairs. For the WMT18 challenge, a new Flickr Test2018 set of 1,071 sentences is released without the German and French gold translations.

Matrices of the model are initialized using the Xavier method (Glorot and Bengio, 2010) and the gradient norm is clipped to 5. We chose ADAM (Kingma and Ba, 2014) as the optimizer, with a learning rate of 0.0004 and a batch size of 32. To marginally reduce our vocabulary size, we use the byte pair encoding (BPE) algorithm on the training set to convert space-separated tokens into sub-words (Sennrich et al., 2016). With 10K merge operations, the resulting vocabulary sizes are 5,204 for English→German and 5,835 for English→French. We use the following regularization methods: we apply dropout of 0.3 on the source embeddings x', 0.5 on the source annotations H, and 0.5 on both bottlenecks b^t_t and b^v_t. We also stop training when the METEOR score does not improve for 10 evaluations on the validation set (one validation is performed every 1,000 model updates).

The dimensionality of the various layers is as follows. The embedding size d is 128 and the encoder and decoder GRU size S is 256, giving W_x, W_y ∈ R^{128 × 256}, H ∈ R^{M × 512}, E_x ∈ R^{|Y_s| × 128} and E_y ∈ R^{|Y_d| × 128}. The attention matrices W_s ∈ R^{256 × A}, W_H ∈ R^{512 × A} and W_a ∈ R^{A × 1} share an attention size A, while W_c ∈ R^{512 × 256} and W_img ∈ R^{2048 × 256} project the textual and visual contexts to the GRU size. The bottleneck matrices are W^t_bot, W^v_bot ∈ R^{256 × 128} and the projection matrices are W^t_proj, W^v_proj ∈ R^{128 × |Y_d|}. The weights E_y and W^t_proj are tied. The size of the gated hyperbolic tangent weights W, W_g depends on where the activation is applied.

5 Results

Our models' performance is evaluated according to the following automated metrics: BLEU-4 (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014). We decode with a beam search of size 12 and use model ensembling of size 5 for German and 6 for French. We used the nmtpytorch framework (Caglayan et al., 2017b) for all our experiments. We also release our code at https://github.com/jbdel/WMT18_MNMT.

    Test set         System        BLEU           METEOR
    Test2016 Flickr  FR-Baseline   59.08          74.73
                     FR-DeepGRU    62.49 (+3.41)  76.83 (+2.10)
                     DE-Baseline   38.43          58.37
                     DE-DeepGRU    40.34 (+1.91)  59.58 (+1.21)
    Test2017 Flickr  FR-Baseline   51.86          72.75
                     FR-DeepGRU    55.13 (+3.27)  74.73 (+1.98)
                     DE-Baseline   30.80          52.33
                     DE-DeepGRU    32.57 (+1.77)  53.60 (+1.27)
    Test2017 COCO    FR-Baseline   43.31          64.39
                     FR-DeepGRU    46.16 (+2.85)  65.79 (+1.40)
                     DE-Baseline   26.30          48.45
                     DE-DeepGRU    29.21 (+2.91)  49.45 (+1.00)
    Test2018 Flickr  FR-DeepGRU    39.40          60.17
                     DE-DeepGRU    31.10          51.64

6 Discussion

The full leaderboard (https://competitions.codalab.org/competitions/19917) shows close results, and it seems that all submissions converge towards the same translation quality. A few questions arise. Have we, to some extent, reached the full potential of the information images can provide? Should we try to add traditional machine translation techniques such as post-editing, now that images have been exploited successfully? Another major step forward would be to successfully develop strong and stable models using convolutional features, which contain 98 times more values than the max-pooled ones.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017a. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, pages 432-439. Association for Computational Linguistics.

Ozan Caglayan, Walid Aransa, Yaxing Wang, Marc Masana, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, and Joost van de Weijer. 2016. Does multimodality help human and machine for translation and image captioning? arXiv preprint arXiv:1605.09186.

Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, and Loïc Barrault. 2017b. Nmtpy: A flexible toolkit for advanced neural machine translation systems. Prague Bull. Math. Linguistics, 109:15-28.

Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913-1924. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734, Doha, Qatar. Association for Computational Linguistics.

Jean-Benoit Delbrouck and Stéphane Dupont. 2017a. An empirical study on the effectiveness of images in multimodal neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 910-919, Copenhagen, Denmark. Association for Computational Linguistics.

Jean-Benoit Delbrouck and Stéphane Dupont. 2017b. Modulating and attending the source image during encoding improves multimodal translation. CoRR, abs/1712.03449.

Jean-Benoit Delbrouck and Stéphane Dupont. 2018. Bringing back simplicity and lightliness into neural image captioning. CoRR.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

D. Elliott, S. Frank, K. Sima'an, and L. Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70-74.

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311-318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725. Association for Computational Linguistics.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, pages 543-553, Berlin, Germany. Association for Computational Linguistics.

Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. 2017. Tips and tricks for visual question answering: Learnings from the 2017 challenge.