Bringing back simplicity and lightliness into neural image captioning
Jean-Benoit Delbrouck and Stéphane Dupont
TCTS Lab, University of Mons, Belgium
{jean-benoit.delbrouck, stephane.dupont}@umons.ac.be

Abstract
Neural Image Captioning (NIC), or neural caption generation, has attracted a lot of attention over the last few years. Describing an image in natural language has been an emerging challenge in both computer vision and language processing. Therefore, a lot of research has focused on driving this task forward with new creative ideas. So far, the goal has been to maximize scores on automated metrics and, to do so, one has to come up with a plurality of new modules and techniques. Once these add up, the models become complex and resource-hungry. In this paper, we take a small step backwards in order to study an architecture with an interesting trade-off between performance and computational complexity. To do so, we tackle every component of a neural captioning model and propose one or more solutions that lighten the model overall. Our ideas are inspired by two related tasks: Multimodal and Monomodal Neural Machine Translation.
Introduction

Problems combining vision and natural language processing, such as image captioning (Chen et al. 2015), are viewed as extremely challenging. The task requires grasping and expressing low- to high-level aspects of local and global areas in an image as well as their relationships. Over the years, it has continued to inspire considerable research. Visual attention-based neural decoder models (Xu et al. 2015; Karpathy and Li 2015) have shown great success and are now widely adopted for the NIC task. These recent advances are inspired by the neural encoder-decoder framework (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2014), or sequence-to-sequence model (seq2seq), used for Neural Machine Translation (NMT). In that approach, Recurrent Neural Networks (RNN, Mikolov et al. 2010) map a source sequence of words (encoder) to a target sequence (decoder). An attention mechanism is learned to focus on different parts of the source sentence while decoding. The same mechanism applies to a visual input: the attention module learns to attend to the salient parts of an image while decoding the caption.
These two fields, NIC and NMT, led to a Multimodal Neural Machine Translation (MNMT, Specia et al. 2016) task where the sentence to be translated is supported by the information from an image.

Figure 1: A decoder time-step of the MNMT architecture. At time t, the decoder attends both a visual and a textual representation. In NIC, the decoder only attends an image. This shows how the two tasks are related.

Interestingly, NIC and MNMT share a very similar decoder: both are required to generate a meaningful natural language description or translation with the help of a visual input. However, the two tasks differ in the amount of annotated data. MNMT has roughly 19 times fewer unique training examples, reducing the amount of learnable parameters and the potential complexity of a model. Yet, over the years, the challenge has brought up very clever and elegant ideas that could be transferred to the NIC task. The aim of this paper is to propose such an architecture for NIC in a straightforward manner. Indeed, our proposed models work with less data and fewer parameters, and require less computation time. More precisely, this paper intends to:
• Work only with in-domain data. No additional data besides the proposed captioning datasets are involved in the learning process;
• Lighten as much as possible the training data used, i.e. the visual and linguistic inputs of the model;
• Propose a subjectively light and straightforward yet efficient NIC architecture with high training speed.

Captioning Model
As quickly mentioned in the introduction, a neural captioning model is an RNN decoder (Bahdanau, Cho, and Bengio 2014) that uses an attention mechanism over an image I to generate a word y_t of the caption at each time-step t. The following equations depict a baseline time-step t (Xu et al. 2015):

x_t = W_x E[y_{t−1}]                         (1)
c_t = f_att(h_{t−1}, I)                      (2)
h_t = f_rnn(x_t, h_{t−1}, c_t)               (3)
y_t ∼ p_t = W_y [y_{t−1}, h_t, c_t]          (4)

where equation 1 maps the previously generated embedded word to the RNN hidden state size with matrix W_x, equation 2 is the attention module over the image I, equation 3 is the RNN cell computation, and equation 4 is the probability distribution p_t over the vocabulary (matrix W_y is also called the projection matrix).
If we denote the model parameters by θ, then θ is learned by maximizing the likelihood of the observed sequence Y = (y_1, y_2, ..., y_n), or in other words by minimizing the cross-entropy loss. The objective function is given by:

L(θ) = − Σ_{t=1}^{n} log p_θ(y_t | y_{<t}, I)          (5)

The paper is structured so that each section tackles one equation (i.e. one main component of the captioning model) in the following manner: section 2.1 for equation 1 (embeddings), section 2.2 for equation 3 (f_rnn), section 2.3 for equation 2 (f_att), section 2.4 for equation 4 (projection) and section 2.5 for equation 5 (objective function).
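To make these equations concrete, here is a minimal PyTorch sketch of one such baseline decoding step. The class and argument names are illustrative, the attention is reduced to a single projection, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class BaselineDecoderStep(nn.Module):
    """One decoding time-step of the baseline captioning model (eqs. 1-4)."""
    def __init__(self, vocab_size, emb_dim, hid_dim, img_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                    # E
        self.w_x = nn.Linear(emb_dim, hid_dim, bias=False)              # W_x (eq. 1)
        self.att = nn.Linear(img_dim + hid_dim, img_dim)                # placeholder f_att (eq. 2)
        self.rnn = nn.GRUCell(hid_dim + img_dim, hid_dim)               # f_rnn (eq. 3)
        self.w_y = nn.Linear(hid_dim + hid_dim + img_dim, vocab_size)   # W_y (eq. 4)

    def forward(self, y_prev, h_prev, image):
        x_t = self.w_x(self.emb(y_prev))                                # eq. 1
        c_t = torch.tanh(self.att(torch.cat([image, h_prev], -1)))      # eq. 2 (simplified)
        h_t = self.rnn(torch.cat([x_t, c_t], -1), h_prev)               # eq. 3
        logits = self.w_y(torch.cat([x_t, h_t, c_t], -1))               # eq. 4, before softmax
        return logits, h_t
```

At training time the logits feed a cross-entropy loss against the ground-truth caption (eq. 5); at test time the next word is sampled or beam-searched from softmax(logits).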
Embeddings

Recall that: x_t = W_x E[y_{t−1}].
The total size of the embedding matrix E depends on the vocabulary size |Y| and the embedding dimension d, such that E ∈ R^{|Y|×d}. The mapping matrix W_x also depends on the embedding dimension because W_x ∈ R^{d×|x_t|}. Many previous works (Karpathy and Li 2015; You et al. 2016; Yao et al. 2017; Anderson et al. 2018) use pretrained embeddings such as GloVe and word2vec, or one-hot vectors. Both word2vec and GloVe provide distributed representations of words. These models are pre-trained on 30 and 42 billion words respectively (Mikolov et al. 2013; Pennington, Socher, and Manning 2014), weigh several gigabytes and work with d = 300.
For our experiments, each word y_i is a column index in an embedding matrix E learned along with the model and initialized from a random distribution. Whilst the usual allocation is 512 to 1024 dimensions per embedding (Xu et al. 2015; Lu et al. 2017; Mun, Cho, and Han 2017; Rennie et al. 2017), we show that a small embedding size of d = 128 is sufficient to learn a strong vocabulary representation. A jointly-learned embedding matrix also tackles the high dimensionality and sparsity of one-hot vectors. For example, (Anderson et al. 2018) works with a vocabulary of 10,010 words and a hidden size of 1000; as a result, the mapping matrix W_x of equation 1 alone has 10 million parameters.
Working with a small vocabulary, besides reducing the size of the embedding matrix E, presents two major advantages: it lightens the projection module (as explained further in section 2.4) and reduces the action space in a Reinforcement Learning setup (detailed in section 2.5). To reduce our vocabulary size by roughly 50%, we use the byte pair encoding (BPE) algorithm on the train set to convert space-separated tokens into subwords (Sennrich, Haddow, and Birch 2016). Originally applied to NMT, BPE is based on the intuition that various word classes are made of units smaller than words, such as compounds and loanwords. In addition to making the vocabulary smaller and the sentences shorter, the subword model is able to productively generate new words that were not seen at training time.
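As an illustration of the BPE idea, the following toy merge-learning loop follows the published algorithm of Sennrich, Haddow, and Birch (2016). It is not the tool used in our experiments (the paper applies the subword-nmt implementation with 5,000 merge operations); the corpus and merge count below are purely illustrative.

```python
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the most frequent symbol pair into a single symbol everywhere."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words are space-separated characters with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                      # number of merges controls the subword vocabulary size
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
print(vocab)                             # frequent fragments such as 'est</w>' become single symbols
```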
Recurrent function

Recall that: h_t = f_rnn(x_t, h_{t−1}, c_t).
Most previous research in captioning (Karpathy and Li 2015; You et al. 2016; Rennie et al. 2017; Mun, Cho, and Han 2017; Lu et al. 2017; Yao et al. 2017) used an LSTM (Hochreiter and Schmidhuber 1997) for the f_rnn function. Our recurrent model is a pair of two Gated Recurrent Units (GRU, Cho et al. 2014), called a conditional GRU (cGRU), as previously investigated in NMT (https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf). A GRU is a lighter gating mechanism than the LSTM since it does not use a forget gate, and it led to similar results in our experiments.
The cGRU also addresses an encoding problem of the f_att mechanism. As shown in equation 2, the context vector c_t takes the previous hidden state h_{t−1} as input, which is information from outside the current time-step. This could be tackled by using the current hidden state h_t, but then the context vector c_t is no longer an input of f_rnn. A conditional GRU is an efficient way to both build and encode the result of the f_att module.
Mathematically, a first independent GRU encodes an intermediate hidden state proposal h'_t based on the previous hidden state h_{t−1} and the input x_t at each time-step t:

z'_t = σ(x_t + U'_z h_{t−1})
r'_t = σ(x_t + U'_r h_{t−1})
ĥ'_t = tanh(x_t + r'_t ⊙ (U' h_{t−1}))
h'_t = (1 − z'_t) ⊙ ĥ'_t + z'_t ⊙ h_{t−1}          (6)

Then, the attention mechanism computes c_t using the image I and the intermediate hidden state proposal h'_t, similarly to equation 2:

c_t = f_att(h'_t, I)

Finally, a second independent GRU computes the hidden state h_t of the cGRU from the intermediate representation h'_t and the context vector c_t:

z_t = σ(W_z c_t + U_z h'_t)
r_t = σ(W_r c_t + U_r h'_t)
ĥ_t = tanh(W c_t + r_t ⊙ (U h'_t))
h_t = (1 − z_t) ⊙ ĥ_t + z_t ⊙ h'_t                  (7)

Both problems are thus addressed: the context vector c_t is computed from the intermediate representation h'_t, and the final hidden state h_t is computed from the context vector c_t. Again, while the size of the hidden state |h_t| in the literature varies between 512 and 1024, we pick |h_t| = 256.
The most similar approach to ours is the Top-Down Attention of (Anderson et al. 2018), which encodes the context vector the same way but with LSTMs and a different hidden-state layout.
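A minimal PyTorch sketch of this conditional GRU follows, built from two stock GRUCells with the attention step in between. Names are illustrative and the gate equations are those of the built-in cell rather than a literal transcription of equations 6 and 7.

```python
import torch
import torch.nn as nn

class ConditionalGRU(nn.Module):
    """cGRU: GRU1 proposes h'_t, attention is conditioned on h'_t, GRU2 folds c_t back in."""
    def __init__(self, input_dim, hid_dim, ctx_dim):
        super().__init__()
        self.gru1 = nn.GRUCell(input_dim, hid_dim)   # first independent GRU (eq. 6)
        self.gru2 = nn.GRUCell(ctx_dim, hid_dim)     # second independent GRU (eq. 7)

    def forward(self, x_t, h_prev, f_att, image):
        h_prop = self.gru1(x_t, h_prev)    # intermediate hidden state proposal h'_t
        c_t = f_att(h_prop, image)         # attention uses h'_t, not h_{t-1}
        h_t = self.gru2(c_t, h_prop)       # final hidden state encodes the context vector
        return h_t, c_t
```

Any attention callable of signature f_att(h_prop, image) can be plugged in, including the pooled-feature attention sketched in the next section.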
Attention model

Recall that: c_t = f_att(h_{t−1}, I).
Since the image is the only input to a captioning model, the attention module is crucial, but it is also very diverse across different works. For example, (You et al. 2016) use a semantic attention where, in addition to image features, they run a set of attribute detectors to get a list of visual attributes or concepts that are most likely to appear in the image. (Anderson et al. 2018) use the Visual Genome dataset to pre-train their bottom-up attention model; this dataset contains 47,000 images that are out-of-domain with respect to the captioning dataset, densely annotated with scene graphs containing objects, attributes and relationships. (Yang et al. 2016) propose a review network, an extension to the decoder: the review network performs a given number of review steps on the hidden states and outputs a compact vector representation available for the attention mechanism.
Yet everyone seems to agree on using a Convolutional Neural Network (CNN) to extract features of the image I. The trend is to select convolutional feature maps of size 14 × 14 × 1024 (ResNet, He et al. 2016, res4f layer) or the corresponding conv5 feature maps of VGGNet (Simonyan and Zisserman 2014). Other attributes can be extracted from the last fully connected layer of a CNN and have been shown to bring useful information (Yang et al. 2016; Yao et al. 2017; You et al. 2016). Some models also finetune the CNN during training (Yang et al. 2016; Mun, Cho, and Han 2017; Lu et al. 2017), stacking even more trainable parameters.
Our attention model f_att is guided by a unique vector: a global 2048-dimensional visual representation V_I of image I extracted at the pool5 layer of a ResNet-50. Our attention vector is computed as:

c_t = h'_t ⊙ tanh(W_img V_I)          (8)

Recall that, following the cGRU presented in section 2.2, we work with h'_t and not h_{t−1}. Even though pooled features carry less information than convolutional features (roughly 50 to 100 times less), pooled features have shown great success in combination with a cGRU in MNMT (Caglayan et al. 2017a). Hence, our attention model is only the single matrix W_img ∈ R^{2048×|h'_t|}.
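Equation 8 amounts to an element-wise modulation of the hidden proposal by a projected pooled image vector. A minimal sketch, with illustrative names, is given below; it can serve as the f_att callable of the ConditionalGRU sketch above.

```python
import torch
import torch.nn as nn

class PooledAttention(nn.Module):
    """c_t = h'_t ⊙ tanh(W_img V_I): the only learned weights are W_img (eq. 8)."""
    def __init__(self, img_dim=2048, hid_dim=256):
        super().__init__()
        self.w_img = nn.Linear(img_dim, hid_dim, bias=False)   # W_img ∈ R^{2048 x |h'_t|}

    def forward(self, h_prop, pooled_feats):
        # h_prop: (batch, hid_dim); pooled_feats: (batch, img_dim) from ResNet-50 pool5
        return h_prop * torch.tanh(self.w_img(pooled_feats))
```

The design choice is deliberate: because the image representation is a single pooled vector, there is nothing to attend over spatially, so the "attention" reduces to a cheap gating of the recurrent state by the image.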
Projection

Recall that: y_t ∼ p_t = W_y [y_{t−1}, h_t, c_t].
The projection also accounts for a lot of trainable parameters in the captioning model, especially if the vocabulary is large. Indeed, in equation 4 the projection matrix lies in R^{|[y_{t−1}, h_t, c_t]| × |Y|}. To reduce the number of parameters, we use a bottleneck function:

b_t = f_bot(y_{t−1}, h_t, c_t) = W_bot [y_{t−1}, h_t, c_t]          (9)
y_t ∼ p_t = W_ybot b_t                                              (10)

where |b_t| < |[y_{t−1}, h_t, c_t]| so that |W_bot| + |W_ybot| < |W_y|. Interestingly enough, if |b_t| = d (the embedding size), then |W_ybot| = |E|. We can then share the weights between the two matrices (i.e. W_ybot = E) to further reduce the number of learned parameters. Moreover, doing so does not negatively impact the captioning results.
We push our projection further and use a deep-GRU, used originally in MNMT (Delbrouck and Dupont 2018), so that our bottleneck function f_bot is now a third GRU as described by equations 7:

b_t = f_bot(y_{t−1}, h'_t, c_t) = cGRU([y_{t−1}, h'_t, c_t], h_t)          (11)

Because we work with small dimensions, adding a new GRU block on top barely increases the model size.
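The simpler bottleneck of equations 9-10 with weight tying can be sketched as follows. This is illustrative only (the deep-GRU variant replaces W_bot with a third GRU block); class and argument names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TiedBottleneckProjection(nn.Module):
    """Bottleneck to the embedding size d, with the output matrix tied to the embedding table."""
    def __init__(self, emb: nn.Embedding, hid_dim, ctx_dim):
        super().__init__()
        vocab_size, d = emb.weight.shape
        self.f_bot = nn.Linear(d + hid_dim + ctx_dim, d)     # W_bot (eq. 9)
        self.out = nn.Linear(d, vocab_size, bias=False)      # W_ybot (eq. 10)
        self.out.weight = emb.weight                          # weight sharing: W_ybot = E

    def forward(self, y_prev_emb, h_t, c_t):
        b_t = self.f_bot(torch.cat([y_prev_emb, h_t, c_t], dim=-1))   # eq. 9
        return self.out(b_t)                                           # eq. 10: logits over |Y|
```

Because b_t has dimension d = 128, the two small matrices replace one large projection of size |[y_{t−1}, h_t, c_t]| × |Y|, and tying removes the |Y| × d output matrix entirely.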
Objective function

To directly optimize an automated metric, we can cast caption generation as a Reinforcement Learning (RL) problem. The introduced f_rnn function is viewed as an agent that interacts with an environment composed of words and image features. The agent interacts with the environment by taking actions, namely predicting the next word of the caption. An action is the result of the policy p_θ, where θ are the parameters of the network. Whilst very effective at boosting the automatic metric scores, porting the captioning problem into an RL setup significantly reduces the training speed.
Ranzato et al. 2015 proposed a method (MIXER) based on REINFORCE combined with a baseline reward estimator. However, it implicitly assumes that each intermediate action (word) in a partial sequence has the same reward as the sequence-level reward, which is not true in general. To compensate, they introduce a form of training that mixes the MLE objective and the REINFORCE objective. Liu et al. 2017 also address the delayed-reward problem by estimating, at each time-step, the future rewards based on Monte Carlo rollouts. Rennie et al. 2017 utilize the output of their own test-time inference model to normalize the rewards; only samples from the model that outperform the current test-time system are given positive weight.
To keep it simple, and because our reduced vocabulary allows us to do so, we follow the work of Ranzato et al. 2015 and use the naive variant of the policy gradient with REINFORCE. The loss function in equation 5 is now given by:

L(θ) = − E_{Y∼p_θ}[r(Y)]          (12)

where r(Y) is the reward (here the score given by an automatic metric scorer) of the generated caption Y = (y_1, y_2, ..., y_n).
We use the REINFORCE algorithm, based on the observation that the expected gradient of a non-differentiable reward function can be computed as follows:

∇_θ L(θ) = − E_{Y∼p_θ}[r(Y) ∇_θ log p_θ(Y)]          (13)

The expected gradient can be approximated using N Monte-Carlo samples Y^i for each training example in the batch:

∇_θ L(θ) ≈ ∇_θ [ − (1/N) Σ_{i=1}^{N} r(Y^i) log p_θ(Y^i) ]          (14)

In practice, we approximate with a single sample:

∇_θ L(θ) ≈ − r(Y) ∇_θ log p_θ(Y)          (15)

The policy gradient can be generalized to compute the reward associated with an action relative to a baseline b. This baseline either encourages a word choice y_t if r_t > b_t or discourages it if r_t < b_t. If the baseline is an arbitrary function that does not depend on the actions y_1, y_2, ..., y_n ∈ Y, then it does not change the expected gradient but, importantly, reduces the variance of the gradient estimate. The final expression is given by:

∇_θ L(θ) ≈ − (r(Y) − b) ∇_θ log p_θ(Y)          (16)
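A minimal sketch of this single-sample REINFORCE loss with a baseline (eq. 16) is shown below. The reward function (e.g. a CIDEr-D scorer) is assumed to be external, and the baseline is simplified to a scalar estimate per sequence, whereas the paper uses a learned linear projection of h_t.

```python
import torch

def reinforce_loss(log_probs, reward, baseline):
    """
    Single-sample REINFORCE with a baseline (eq. 16).
    log_probs: (T,) tensor of log p_theta(y_t | y_<t, I) for the sampled caption tokens
    reward:    scalar metric score r(Y) of the sampled caption (e.g. CIDEr-D)
    baseline:  scalar tensor b estimating the expected reward
    """
    advantage = reward - baseline.detach()          # the policy gradient must not flow through b
    policy_loss = -(advantage * log_probs.sum())    # -(r(Y) - b) * log p_theta(Y)
    baseline_loss = (baseline - reward) ** 2        # regress b towards the observed reward
    return policy_loss + baseline_loss
```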
Experiments

Our decoder is a cGRU where each GRU has size |h_t| = 256. The word embedding matrix E allocates d = 128 features per word. To create the image annotations used by our decoder, we use a ResNet-50 and extract the features at the pool5 layer. As regularization, we apply dropout with a probability of 0.5 on the bottleneck b_t and we early-stop the training if the validation CIDER metric does not improve for 10 epochs. All variants of our models are trained with the ADAM optimizer (Kingma and Ba 2014) with a mini-batch size of 256. We decode with a beam search of size 3. In the RL setting, the baseline is a linear projection of h_t.
We evaluate our models on MSCOCO (Lin et al. 2014), the most popular benchmark for image captioning, which contains 82,783 training images and 40,504 validation images. There are 5 human-annotated descriptions per image. As the annotations of the official test set are not publicly available, we follow the settings of prior work (the "Karpathy splits", https://github.com/karpathy/neuraltalk2/tree/master/coco), which take 82,783 images for training, 5,000 for validation and 5,000 for testing. On the training captions, we use the byte pair encoding algorithm to convert space-separated tokens into subwords (Sennrich, Haddow, and Birch 2016, 5,000 symbols), reducing our vocabulary to 5,066 English tokens. For the online evaluation, all images are used for training except those of the validation set.
Our models are evaluated according to the following automated metrics: BLEU-4 (Papineni et al. 2002), METEOR (Lavie and Agarwal 2007) and CIDER-D (Vedantam, Lawrence Zitnick, and Parikh 2015). Results shown in Table 1 use the cross-entropy (XE) loss (cf. equation 5). Reinforcement learning optimization results are compared in Table 3.

Results

We sort the different works in Table 1 by CIDER score. For each of them, we detail the trainable weights involved in the learning process (Wt.), the number of visual features used for the attention module (Att. feat.), the amount of out-of-domain data (O.O.D.) and the convergence speed (epoch). As shown, our model has the third best METEOR and CIDER scores across the board. Yet our BLEU metric is quite low; we postulate two potential causes: either our model does not have enough parameters to learn the n-gram precision the metric requires, or it is a direct drawback of using subwords. Nevertheless, the CIDER and METEOR metrics show that the main concepts are present in our captions. Our models are also the lightest in terms of trainable parameters and number of attention features. As far as convergence in epochs has been reported in previous works, our cGRU model is by far the fastest to train.

Table 1: Models sorted by CIDER-D score, optimized with cross-entropy loss only (cf. equation 5). ◦ pool features, • conv features, ⋆ FC features, § GloVe or word2vec embeddings, † CNN finetuning in-domain, ‡ in-domain CNN, ◻ CNN finetuning out-of-domain. Reported columns: B4, M, C, Wt. (in M), Att. feat. (in K), O.O.D. (in M), epochs.
This work: cGRU ◦
Comparable work: Adaptive (Lu et al. 2017) •†; Boosting (Yao et al. 2017) ◦⋆
O.O.D. work: Top-down (Anderson et al. 2018) •◻; T-G att (Mun et al. 2017) •†; Semantic (You et al. 2016) ◦⋆§; NT (Karpathy and Li 2015) •§

Table 2 reports the online evaluation on the official MSCOCO 2014 test set. Scores of our model are an ensemble of 5 runs with different initializations.

Table 2: Published image captioning results on the online MSCOCO test server.

                        B4 (c5)   M (c5)   C (c5)
cGRU (ours)             0.326     0.253    0.973
Comparable work
(Lu et al. 2017)        0.336     0.264    1.042
(Yao et al. 2017)       0.330     0.256    0.984
(Yang et al. 2016)      0.313     0.256    0.965
(You et al. 2016)       0.316     0.250    0.943
(Wu et al. 2016)        0.306     0.246    0.911
(Xu et al. 2015)        0.277     0.251    0.865

We see that our model suffers a minor setback on this test set, especially in terms of CIDER score, whilst the adaptive (Lu et al. 2017) and boosting (Yao et al. 2017) methods yield stable results on both test sets.
Table 3 lists the works using direct metric optimization. Rennie et al. 2017 use the SCST method, the most effective one according to the metric boosts (+23, +3 and +123 points respectively) but also the most sophisticated. Liu et al. 2017 use an approach similar to ours (MIXER), but with Monte-Carlo roll-outs (i.e. sampling until the end at every time-step t). Without using this technique, two of our metric improvements (METEOR and CIDER) surpass the MC roll-outs variant (+0 against −2 and +63 against +48 respectively).

Table 3: All optimizations are on the CIDER metric (gains over the corresponding cross-entropy model, in points).

                               B4             M              C
Rennie et al. 2017  XE         0.296          0.252          0.940
                    RL-SCST    0.319 (+23)    0.255 (+3)     1.063 (+123)
Liu et al. 2017     RL (MC)                   0.249 (−2)     (+48)
This work           RL         (+13)          0.258 (+0)     1.071 (+63)
An interesting investigation would be to leverage the architecture with more parameters to see how it scales. We showed that our model performs well with few parameters, but we would also like to show that it can serve as a base for more complex posterior research. We propose two variants to do so:
• cGRUx2: The first intuition is to double the width of the model, i.e. the embedding size d and the hidden state size |h_t|. Unfortunately, this setup is not ideal with a deep-GRU because the recurrent matrices of equations 6 and 7 for the bottleneck GRU get large. We can still use the classic bottleneck function (equation 9).
• MHA: We trade our attention model described in section 2.3 for a standard multi-head attention (MHA), to see how convolutional features could improve our CIDER-D metric (see the sketch below).
Multi-head attention (Vaswani et al. 2017) computes a weighted sum of some values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. This process is repeated n times. The compatibility function is given by:

Attention(Q, K, V) = softmax(Q K^T / √d_q) V

where the query Q is h'_t, and the keys and values are a set of 196 vectors of dimension 1024 (from the res4f_relu layer of a ResNet-50 CNN). The authors found it beneficial to linearly project the queries, keys and values n times with different learned linear projections to dimension d_q. The output of the multi-head attention is the concatenation of the n resulting d_q-dimensional values, linearly projected to d_q again. We pick d_q = |h'_t| and n = 3. The multi-head attention adds 0.92M parameters if |h'_t| = 256 and 2.63M parameters if |h'_t| = 512 (in the case of cGRUx2).
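A generic re-implementation of such a multi-head attention over the 196 convolutional vectors is sketched below. It is an assumption about one reasonable parameterization, not the authors' code, and its parameter count need not match the figures quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadImageAttention(nn.Module):
    """n heads over conv features; each head projects Q/K/V to d_q, the concatenated
    head outputs are projected back to d_q (illustrative sketch)."""
    def __init__(self, d_q=256, d_kv=1024, n_heads=3):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                'q': nn.Linear(d_q, d_q, bias=False),
                'k': nn.Linear(d_kv, d_q, bias=False),
                'v': nn.Linear(d_kv, d_q, bias=False),
            }) for _ in range(n_heads)
        ])
        self.out = nn.Linear(n_heads * d_q, d_q, bias=False)
        self.scale = d_q ** 0.5

    def forward(self, h_prop, conv_feats):
        # h_prop: (batch, d_q) query h'_t; conv_feats: (batch, 196, d_kv) from res4f_relu
        outputs = []
        for head in self.heads:
            q = head['q'](h_prop).unsqueeze(1)                             # (batch, 1, d_q)
            k = head['k'](conv_feats)                                      # (batch, 196, d_q)
            v = head['v'](conv_feats)                                      # (batch, 196, d_q)
            attn = F.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)   # (batch, 1, 196)
            outputs.append((attn @ v).squeeze(1))                          # (batch, d_q)
        return self.out(torch.cat(outputs, dim=-1))                        # context c_t
```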
Figure 2: A new set of results on the online MSCOCO test server, put in perspective with previous work.

We have hence proposed an image captioning architecture that, compared to previous work, covers a different area of the performance-complexity trade-off plane. We hope these results will be of interest and will fuel more research in this direction.

Related Work

As mentioned in the introduction, our model is largely inspired by the work carried out in NMT and MNMT. Components such as attention models (multi-head, encoder-based and pooled attention; 2017a; 2017b; 2017), reinforcement learning in NMT and MNMT (2016; 2017a) and embeddings (2016; 2017) are well investigated there. In captioning, Anderson et al. used a very similar approach where two LSTMs build and encode the visual features. Two other works, Yao et al. and You et al., used pooled features as described in this paper; however, they both used an additional vector taken from the fully connected layer of a CNN.

Conclusion

We presented a novel and light architecture composed of a cGRU that shows interesting performance. The model builds and encodes the context vector from pooled features in an efficient manner. The attention model presented in section 2.3 is straightforward and seems to bring the necessary visual information to output complete captions. We also empirically showed that the model can easily scale with more sought-after modules or simply with more parameters. In the future, it would be interesting to use different attention features, like VGG or GoogLeNet (which have only 1024 dimensions), or different attention models, to see how far this architecture can go.
Acknowledgments

This work was partly supported by the Chist-Era project IGLU with contribution from the Belgian Fonds de la Recherche Scientifique (FNRS), contract no. R.50.11.15.F, and by the FSO project VCYCLE with contribution from the Belgian Walloon Region, contract no. 1510501. We also thank the authors of nmtpytorch (Caglayan et al. 2017b), which we used as the framework for our experiments (https://github.com/lium-lst/nmtpytorch). Our code is made available for posterior research at https://github.com/jbdel/light_captioning.

References

Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In
CVPR.
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv e-prints abs/1409.0473.
Caglayan, O.; Aransa, W.; Bardet, A.; García-Martínez, M.; Bougares, F.; Barrault, L.; Masana, M.; Herranz, L.; and van de Weijer, J. 2017a. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, 432–439. Association for Computational Linguistics.
Caglayan, O.; García-Martínez, M.; Bardet, A.; Aransa, W.; Bougares, F.; and Barrault, L. 2017b. NMTPY: A flexible toolkit for advanced neural machine translation systems. Prague Bull. Math. Linguistics.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325.
Cho, K.; van Merriënboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Doha, Qatar: Association for Computational Linguistics.
Delbrouck, J.-B., and Dupont, S. 2017a. An empirical study on the effectiveness of images in multimodal neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 910–919. Association for Computational Linguistics.
Delbrouck, J.-B., and Dupont, S. 2017b. Modulating and attending the source image during encoding improves multimodal translation. CoRR abs/1712.03449.
Delbrouck, J.-B., and Dupont, S. 2018. UMONS submission for WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation. Brussels, Belgium: Association for Computational Linguistics.
Delbrouck, J.-B.; Dupont, S.; and Seddati, O. 2017. Visually grounded word embeddings and richer visual features for improving multimodal neural machine translation. In Proc. GLU 2017 International Workshop on Grounding Language Understanding, 62–67.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Comput.
Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR, 3128–3137. IEEE Computer Society.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lavie, A., and Agarwal, A. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, 228–231. Stroudsburg, PA, USA: Association for Computational Linguistics.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In Fleet, D.; Pajdla, T.; Schiele, B.; and Tuytelaars, T., eds., Computer Vision – ECCV 2014, 740–755. Cham: Springer International Publishing.
Liu, S.; Zhu, Z.; Ye, N.; Guadarrama, S.; and Murphy, K. 2017. Improved image captioning via policy gradient optimization of SPIDEr. In ICCV.
Lu, J.; Xiong, C.; Parikh, D.; and Socher, R. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In INTERSPEECH, 1045–1048. ISCA.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 26. Curran Associates, Inc. 3111–3119.
Mun, J.; Cho, M.; and Han, B. 2017. Text-guided attention model for image captioning. In AAAI.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, 311–318. Stroudsburg, PA, USA: Association for Computational Linguistics.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, 1532–1543.
Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2015. Sequence level training with recurrent neural networks. CoRR abs/1511.06732.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725. Association for Computational Linguistics.
Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; and Liu, Y. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1683–1692. Association for Computational Linguistics.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
Specia, L.; Frank, S.; Sima'an, K.; and Elliott, D. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, 543–553. Berlin, Germany: Association for Computational Linguistics.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 5998–6008.
Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wu, Q.; Shen, C.; Liu, L.; Dick, A.; and van den Hengel, A. 2016. What value do explicit high level concepts have in vision to language problems? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Bach, F., and Blei, D., eds., Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 2048–2057. Lille, France: PMLR.
Yang, Z.; Yuan, Y.; Wu, Y.; Cohen, W. W.; and Salakhutdinov, R. R. 2016. Review networks for caption generation. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29. Curran Associates, Inc. 2361–2369.
Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; and Mei, T. 2017. Boosting image captioning with attributes. In ICCV.
You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016. Image captioning with semantic attention. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).