Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention
Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara
MARCELLA CORNIA,
University of Modena and Reggio Emilia
LORENZO BARALDI,
University of Modena and Reggio Emilia
GIUSEPPE SERRA,
University of Udine
RITA CUCCHIARA,
University of Modena and Reggio Emilia

Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction models, which can predict human eye fixations. Even though saliency information could be useful to condition an image captioning architecture, by providing an indication of what is salient and what is not, research is still struggling to incorporate these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image during the generation of the caption, by exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. We show, through extensive quantitative and qualitative experiments on large-scale datasets, that our model achieves superior performance with respect to captioning baselines with and without saliency, and to different state-of-the-art approaches combining saliency and captioning.

CCS Concepts: • Computing methodologies → Scene understanding; Natural language generation;

Additional Key Words and Phrases: saliency, visual saliency prediction, image captioning, deep learning.
ACM Reference format:
Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2018. Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention. ACM Trans. Multimedia Comput. Commun. Appl. 14, 2, Article 48 (May 2018), 21 pages.
A core problem in computer vision and artificial intelligence is that of building a system that can replicate the human ability to understand a visual stimulus and describe it in natural language. Indeed, this kind of system would have a great impact on society, opening the way to new progress in human-machine interaction and collaboration. Recent advancements in computer vision and machine translation, together with the availability of large datasets, have made it
This work is partially supported by “Città educante” (CTN01_00034_393801) of the National Technological Cluster on Smart Communities (cofunded by the Italian Ministry of Education, University and Research - MIUR) and by the project “JUMP - Una piattaforma sensoristica avanzata per rinnovare la pratica e la fruizione dello sport, del benessere, della riabilitazione e del gioco educativo”, funded by the Emilia-Romagna region within the POR-FESR 2014-2020 program. We acknowledge the CINECA award under the ISCRA initiative, for the availability of high performance computing resources and support. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. Author’s addresses: M. Cornia, L. Baraldi and R. Cucchiara, Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy; G. Serra is with the Department of Computer Science, Mathematics and Physics, University of Udine, Udine, Italy. © 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution.

possible to generate natural sentences describing images. In particular, deep image captioning architectures have shown impressive results in discovering the mapping between visual descriptors and words [24, 55, 56, 59]. They combine Convolutional Neural Networks (CNNs), to extract an image representation, and Recurrent Neural Networks (RNNs), to build the corresponding sentence. While the progress of these techniques is encouraging, the human ability in the construction and formulation of a sentence is still far from being adequately emulated in today's image captioning systems. When humans describe a scene, they look at an object before naming it in a sentence [14], and they do not focus on each region with the same intensity, as selective mechanisms attract their gaze to salient and relevant parts of the scene [43]. Also, they perceive the context using peripheral vision, so that the description of an image alludes not only to the main objects in the scene, and to how they relate to each other, but also to the context in which they are placed in the image.

An intensive research effort has been carried out in the computer vision community to predict where humans look in an image. This task, called saliency prediction, was tackled in early works by defining hand-crafted features that capture low-level cues such as color and texture or higher-level concepts such as faces, people and text [4, 19, 23]. Recently, with the advent of deep neural networks and large annotated datasets, saliency prediction techniques have obtained impressive results, generating maps that are very close to the ones computed with eye-tracking devices [8, 18, 20].

Despite the encouraging progress in image captioning and visual saliency, and their close connections, these two fields of research have remained almost separate. In fact, only a few attempts have been recently presented in this direction [48, 52]. In particular, Sugano et al. [48] presented a gaze-assisted attention mechanism for image captioning based on human eye fixations (i.e. the static states of gaze upon a specific location). Although this strategy confirms the importance of using eye fixations, it requires gaze information from a human operator.
Therefore, it cannot be applied to general visual data archives, in which this information is missing. To overcome this limit, Tavakoli et al. [52] presented an image captioning method based on saliency maps, which can be automatically predicted from the input image.

In this paper we present an approach which incorporates saliency prediction to effectively enhance the quality of image descriptions. We propose a generative Recurrent Neural Network architecture which can focus on different regions of the input image by means of an attentive mechanism. This attentive behaviour, differently from previous works [56], is conditioned by two different attention paths: the former focused on salient spatial regions, predicted by a saliency model, and the latter focused on contextual regions, which are computed as well from saliency maps. Experimental results on five public image captioning datasets (SALICON, COCO, Flickr8k, Flickr30k and PASCAL-50S) demonstrate that our solution is able to properly exploit saliency cues. Also, we show that this is done without losing the key properties of the generated captions, such as their diversity and the vocabulary size. By visualizing the states of both attentive paths, we finally show that the trained model has learned to attend to both salient and contextual regions during the generation of the caption, and that the attention focuses produced by the network effectively correspond, step by step, to the generated words.

To sum up, our contributions are as follows. First, we show that saliency can enhance image description, as it provides an indication of what is salient and what is context. Second, we propose a model in which the classic machine attention approach is extended to incorporate two attentive paths, one for salient regions and one for context. These two paths cooperate during the generation of the caption, and produce better captions according to automatic metrics, without loss of diversity and size of the dictionary. Third, we qualitatively show that the trained model has learned to attend to both salient and contextual regions in an appropriate way.
In this section, we review the literature related to saliency prediction and image captioning. We also report some recent works which investigate the contribution of saliency for generating natural language descriptions.
Saliency prediction has been extensively studied by the computer vision community and, in the last few years, has achieved considerable improvement thanks to the large spread of deep neural networks [8, 9, 18, 20, 28, 30, 39]. However, a very large variety of models had been proposed before the advent of deep learning, and almost all of them were inspired by the seminal work of Itti and Koch [19], in which multi-scale low-level features extracted from the input image were linearly combined and then processed by a dynamic neural network with a winner-takes-all strategy. The same idea of properly combining different low-level features was also explored by Harel et al. [15], who defined Markov chains over various image maps, and treated the equilibrium distribution over map locations as an activation. In addition to the exploitation of low-level features, several saliency models have also incorporated high-level concepts such as faces, people, and text [4, 23, 61]. In fact, Judd et al. [23] highlighted that, when humans look at images, their gazes are attracted not only by low-level cues typical of bottom-up attention, but also by top-down image semantics. To this end, they proposed a model in which low and medium level features were effectively combined, and exploited face and people detectors to capture important high-level concepts. Nonetheless, all these techniques failed to effectively capture the wide variety of causes that contribute to define visual saliency in images and, with the advent of deep learning, researchers have developed data-driven architectures capable of overcoming many of the limitations of hand-crafted models.

First attempts at computing saliency maps through a neural network suffered from the absence of sufficiently large training datasets [30, 35, 54]. Vig et al. [54] proposed the first deep architecture for saliency, which was composed of only three convolutional layers. Afterwards, Kümmerer et al. [30, 31] based their models on two popular convolutional networks (AlexNet [27] and VGG-19 [46]), obtaining adequate results even though the network parameters were not fine-tuned on a saliency dataset. Liu et al. [35] tried to overcome the absence of large scale datasets by training their model on image patches centered on fixation and non-fixation locations, thus increasing the amount of training data.

With the arrival of the SALICON dataset [21], which is still the largest publicly available dataset for saliency prediction, several deep architectures have moved beyond previous approaches, bringing consistent performance advances. The starting point of all these architectures is a pre-trained Convolutional Neural Network (CNN), such as VGG-16 [46], GoogleNet [50] and ResNet [16], to which different saliency-oriented components are added [8, 9], together with different training strategies [9, 18, 20].

In particular, Huang et al. [18] compared three standard CNNs by applying them at two different image scales. In addition, they were the first to train the network using a saliency evaluation metric as loss function. Jetley et al. [20] introduced a model which formulates a saliency map as a generalized Bernoulli distribution. Moreover, they trained their network by using different loss functions which pair the softmax activation function with measures designed to compute distances between probability distributions. Tavakoli et al. [51] investigated inter-image similarities
to estimate the saliency of a given image using an ensemble of extreme learners, each trained on an image similar to the input image. Kruthiventi et al. [28], instead, presented a unified framework to predict both eye fixations and salient objects.

Another saliency prediction model was recently presented by Pan et al. [38] who, following the large dissemination of Generative Adversarial Networks, trained their model by using adversarial examples. In particular, their architecture is composed of two agents: a generator, which is responsible for generating the saliency map of a given image, and a discriminator, which performs a binary classification task between generated and real saliency maps. Liu et al. [34], instead, proposed a model to learn long-term spatial interactions and scene contextual modulation to infer image saliency, showing promising results, also thanks to the use of the powerful ResNet-50 architecture [16].

In contrast to all these works, we presented two different deep saliency architectures. The first one, called
ML-Net [8], effectively combines features coming from different levels of a CNN and applies a matrix of learned weights to the predicted saliency map, thus taking into account the center bias present in human eye fixations. The second one, called
SAM [9], incorporates neural attentive mechanisms which focus on the most salient regions of the input image. The core component of the proposed model is an Attentive Convolutional LSTM that iteratively refines the predicted saliency map. Moreover, to tackle the human center bias, the network is able to learn multiple Gaussian prior maps without predefined information. Since this model achieved state-of-the-art performance, being at the top of different saliency prediction benchmarks, we use it in this work.
Recently, the automatic description of images and videos has been addressed by computer vision researchers with recurrent neural networks which, given a vectorial description of the visual content, can naturally deal with sequences of words [3, 24, 55]. Before deep learning models, the generation of sentences was mainly tackled by identifying visual concepts, objects and attributes, which were then combined into sentences using pre-defined templates [29, 57, 58]. Another strategy was that of posing image captioning as a retrieval problem, where the closest annotated sentence in the training set was transferred to a test image, or where training captions were split into parts and then reassembled to form new sentences [11, 17, 37, 47]. Obviously, all these approaches limited the variety of possible outputs and could not satisfy the richness of natural language. Recent captioning models, in fact, address the generation of sentences as a machine translation problem in which a visual representation of the image, coming from a convolutional network, is translated into a language counterpart through a recurrent neural network.

One of the first models based on this idea is that proposed by Karpathy et al. [24], in which sentence snippets are aligned to the visual regions that they describe through a multimodal embedding. After that, these correspondences are treated as training data for a multimodal recurrent neural network which learns to generate the corresponding sentences. Vinyals et al. [55], instead, developed an end-to-end model trained to maximize the likelihood of the target sentence given the input image. Xu et al. [56] introduced an approach to image captioning which incorporates a form of machine attention, by which a generative LSTM can focus on different regions of the image while generating the corresponding caption. They proposed two different versions of their model: the first one, called “Soft Attention”, is trained in a deterministic manner using standard backpropagation techniques, while the second one, called “Hard Attention”, is trained by maximizing a variational lower bound through the reinforcement learning paradigm.

Johnson et al. [22] addressed the task of dense captioning, which jointly localizes and describes in natural language salient image regions. This task generalizes the object detection
problem when the descriptions consist of a single word, and the image captioning task when one predicted region covers the full image. You et al. [59] proposed a semantic attention model in which, given an image, a convolutional neural network extracts top-down visual features and at the same time detects visual concepts such as regions, objects and attributes. The image features and the extracted visual concepts are combined through a recurrent neural network that finally generates the image caption. Differently from previous works which aim at predicting a single caption, Krause et al. [26] introduced the generation of entire paragraphs for describing images. Finally, Shetty et al. [45] employed adversarial training to change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human generated captions.

In this paper, we are interested in demonstrating the importance of using saliency along with contextual information during the generation of image descriptions. Our solution falls in the class of neural attentive captioning architectures and, in the experimental section, we compare it against a standard attentive model built upon the Soft Attention approach presented in [56].
Only a few other previous works have investigated the contribution of human eye fixations to generate image descriptions. The first work that explored this idea was proposed in [48], which presented an extension of a neural attentive captioning architecture. In particular, the proposed model incorporates human fixation points (obtained with eye-tracking devices) instead of computed saliency maps to generate image captions. This kind of strategy mainly suffers from the need of having both eye fixation and caption annotations. Currently, only the SALICON dataset [21], being a subset of the Microsoft COCO dataset [33], is available with both human descriptions and saliency maps.

Ramanishka et al. [41], instead, introduced an encoder-decoder captioning model in which spatiotemporal heatmaps are produced for predicted captions and arbitrary query sentences without explicit attention layers. They refer to these heatmaps as saliency maps, even though they are internal representations of the network, not related to human attention. Experiments showed that the gain in performance with respect to a standard attentive captioning model is not consistent, even though the computational overhead is lower.

A different approach, presented in [52], explores whether image descriptions, by humans or models, agree with saliency and whether saliency can benefit image captioning. To this end, the authors proposed a captioning model in which image features are boosted with the corresponding saliency map by exploiting a moving sliding window and mean pooling as aggregation strategies. Comparisons with respect to a no-saliency baseline did not show significant improvements (especially on the Microsoft COCO dataset).

In this paper, we instead aim at enhancing image captions by directly incorporating saliency maps in a neural attentive captioning architecture. Differently from previous models that exploit human fixation points, we obtain a more general architecture which can potentially be trained using any image captioning dataset, and can predict captions for any input image. In our model, the machine attentive process is split into two different and unrelated paths, one for salient regions and one for context. We demonstrate through extensive experiments that the incorporation of saliency and context can enhance image captioning on different state-of-the-art datasets.
Human gazes are attracted by both low-level cues such as color, contrast and texture, and high-level concepts such as faces and text [6, 23]. Current state of the art saliency prediction methods,
thanks to the use of deep networks and large-scale datasets, are able to effectively incorporate all these factors and predict saliency maps which are very close to those obtained from human eye fixations [9]. In this section we qualitatively investigate which parts of an image are actually hit or ignored by saliency models, by jointly analyzing saliency and semantic segmentation maps. This will motivate the need for using saliency predictions as an additional conditioning for captioning models.

To compute saliency maps, we employ the approach in [9], which has shown good results on popular saliency benchmarks, such as the MIT Saliency benchmark [5] and the SALICON dataset [21], and which also won the LSUN Challenge in 2017. It is worthwhile to mention, anyway, that the qualitative conclusions of this section can be applied to any state-of-the-art saliency model.

Since semantic segmentation algorithms are not always completely accurate, we perform the analysis on three semantic segmentation datasets, in which regions have been segmented by human annotators: Pascal-Context [36], Cityscapes [7] and the Look into Person (LIP) [13] dataset. While the first one contains natural images without a specific target, the other two are focused, respectively, on urban streets and human body parts. In particular, Pascal-Context provides additional annotations for the Pascal VOC 2010 dataset [10], which contains 10,103 training and validation images and 9,637 testing images. It goes beyond the original Pascal semantic segmentation task by providing annotations for the whole scene, and images are annotated by using more than 400 different labels. The Cityscapes dataset, instead, is composed of a set of video sequences recorded in street scenes from 50 different cities. It provides high quality pixel-level annotations for 5,000 frames and coarse annotations for 20,000 frames. The dataset is annotated with 30 street-specific classes, such as car, road, traffic sign, etc. Finally, the LIP dataset is focused on the semantic segmentation of people and provides more than 50,000 images annotated with 19 semantic human part labels. Images contain person instances cropped from the Microsoft COCO dataset [33] and split into training, validation and testing sets with roughly 30,000, 10,000 and 10,000 images respectively. For our analyses we only consider train and validation images for the Pascal-Context and LIP datasets, and the 5,000 pixel-level annotated frames for the Cityscapes dataset. Figure 1 shows, for some sample images, the predicted saliency map and the corresponding semantic segmentation on the three datasets.

Fig. 1. Ground-truth semantic segmentation and saliency predictions from our model [9] on sample images from Pascal-Context [36] (first row), Cityscapes [7] (second row) and LIP [13] (last row).

We firstly investigate which are the most and the least salient classes for each dataset. Since there are semantic classes with a low number of occurrences with respect to the total number of
images, we only consider relevant semantic classes (i.e. classes with at least N occurrences). Due to the different dataset sizes, we set N to 500 for the Pascal-Context and LIP datasets, and to 200 for the Cityscapes dataset. To collect the number of times that the predicted saliency hits a semantic class, we binarize each map by thresholding its pixel values. A low threshold value leads to a binarized map with dilated salient regions, while a high threshold creates small salient regions around the fixation points. For this reason, we use two different threshold values to analyze the most and the least salient classes: a threshold near 0 to find the least salient classes for each dataset, and a value near 255 to find the most salient ones.

Fig. 2. Most salient classes on Pascal-Context (cat, aeroplane, dog, bird, person, tv monitor, car), Cityscapes (car, building, license plate, person, bus, vegetation, truck) and LIP (face, upper-clothes, hat, hair, socks, skirt, dress).

Fig. 3. Least salient classes on Pascal-Context (ceiling, floor, light, cloth, rope, pole, sign), Cityscapes (ego vehicle, sky, ground, motorcycle, terrain, dynamic, parking) and LIP (right-shoe, left-shoe, right-leg, left-leg, pants, glove, jumpsuits).

Figures 2 and 3 show the most and the least salient classes in terms of the percentage of times that saliency hits a region belonging to a class. As can be seen, the distributions differ depending on the considered dataset. For example, on Pascal-Context, the most salient classes are animals (such as cats, dogs and birds), people and vehicles (such as airplanes and cars), while the least salient ones turn out to be ceiling, floor and light. As for the Cityscapes dataset, cars are by far the most salient class, being hit by saliency 70% of the time, while no other class reaches 40%. On the LIP dataset, the most salient classes are all human body parts in the upper body, while the least salient ones are all in the lower body. As expected, people's faces are those most hit by saliency, with a percentage of occurrences near 90%. As a general pattern, the most important or visible objects in the scene are hit by saliency, while objects in the background, and the context itself of the image, are usually ignored. This leads to the hypothesis that both salient and non-salient regions are important to generate the description of an image, given that we generally want the context to be included in the caption, and that the distinction between salient regions and context, given by a saliency prediction model, can improve captioning results.
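As a rough illustration, the per-class analysis described above can be sketched as follows. This is not the code used to produce Figures 2 and 3; it simply assumes that saliency maps (with values in [0, 255]) and integer-labeled segmentation masks of the same spatial size are already loaded.

```python
import numpy as np
from collections import defaultdict

def class_hit_percentages(saliency_maps, segmentation_masks, threshold, min_occurrences=500):
    """Percentage of images containing a class in which the binarized saliency map
    hits at least one pixel of that class."""
    occurrences = defaultdict(int)
    hits = defaultdict(int)
    for sal, seg in zip(saliency_maps, segmentation_masks):
        salient = sal >= threshold          # high threshold -> small regions around fixations
        for cls in np.unique(seg):
            occurrences[cls] += 1
            if np.any(salient & (seg == cls)):
                hits[cls] += 1
    # keep only relevant classes, i.e. classes with at least N occurrences
    return {cls: 100.0 * hits[cls] / occurrences[cls]
            for cls in occurrences if occurrences[cls] >= min_occurrences}

# A threshold near 255 highlights the most salient classes; with a threshold near 0
# (dilated salient regions), the classes with the lowest percentages are the least salient.
```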
Fig. 4. Joint distribution of object sizes (number of pixels per class, %) and saliency values on Pascal-Context [36], Cityscapes [7] and LIP [13] (best seen in color).

We also investigate the existence of a relation between the size of an object and its saliency values. In Figure 4, we plot the joint distribution of object sizes and saliency values on the three datasets, where the size of an object is simply computed as the number of its pixels normalized by the size of the image. As can be seen, most of the low saliency instances are small; however, high saliency values concentrate on small objects as well as on large ones. In summary, there is not always a proportionality between the size of an object and its saliency, so the importance of an object cannot be assessed by simply looking at its size. In the image captioning scenario that we want to tackle, larger objects correspond to larger activations in the last layers of a convolutional architecture, while smaller objects correspond to smaller activations. Since salient and non-salient regions can have comparable activations, the supervision given by a saliency prediction model on whether a pixel belongs or not to a salient region can be beneficial during the generation of the caption.

Following the qualitative findings of the previous section, we develop a model in which saliency is exploited to enhance image captioning. Here, a generative recurrent neural network is conditioned, step by step, on salient spatial regions, predicted by a saliency model, and on contextual features which account for the role of non-salient regions in the generation of the caption. In the following, we describe the overall model. An overview is presented in Figure 5.

Fig. 5. Overview of the proposed model. Two different attention paths are built for salient regions and contextual regions, to help the model build captions which describe both components (best seen in color).

Each input image I is firstly encoded through a Fully Convolutional Network, which provides a stack of high-level features on a spatial grid \{a_1, a_2, \ldots, a_L\}, each corresponding to a spatial location of the image. At the same time, we extract a saliency map for the input image using the model in [9], and downscale it to fit the spatial size of the convolutional features, so as to obtain a spatial grid \{s_1, s_2, \ldots, s_L\} of salient regions, where s_i \in [0, 1]. Correspondingly, we also define a spatial grid of contextual regions, \{z_1, z_2, \ldots, z_L\}, where z_i = 1 - s_i. Under the model, visual features at different locations will be selected or inhibited according to their saliency value.

The generation of the caption is carried out word-by-word by feeding and sampling words from an LSTM layer, which, at every timestep, is conditioned on features extracted from the input image and on the saliency map. Formally, the behaviour of the generative LSTM is driven by the following equations:
i_t = \sigma(W_{vi} \hat{v}_t + W_{wi} w_t + W_{hi} h_{t-1} + b_i) \quad (1)
f_t = \sigma(W_{vf} \hat{v}_t + W_{wf} w_t + W_{hf} h_{t-1} + b_f) \quad (2)
o_t = \sigma(W_{vo} \hat{v}_t + W_{wo} w_t + W_{ho} h_{t-1} + b_o) \quad (3)
g_t = \phi(W_{vg} \hat{v}_t + W_{wg} w_t + W_{hg} h_{t-1} + b_g) \quad (4)
c_t = f_t \odot c_{t-1} + i_t \odot g_t \quad (5)
h_t = o_t \odot \phi(c_t) \quad (6)

where, at each timestep, \hat{v}_t denotes the visual features extracted from I by considering the map of salient regions \{s_i\}_i and that of contextual regions \{z_i\}_i, w_t is the input word, and h_t and c_t are respectively the internal state and the memory cell of the LSTM. \odot denotes the element-wise Hadamard product, \sigma is the sigmoid function, \phi is the hyperbolic tangent \tanh, W_* are learned weight matrices and b_* are learned bias vectors.

To provide the generative network with visual features, we draw inspiration from the machine attention literature [56] and compute the fixed-length feature vector \hat{v}_t as a linear combination of spatial features \{a_1, a_2, \ldots, a_L\} with time-varying weights \alpha_{ti}, normalized over the spatial extent via a softmax operator:

\hat{v}_t = \sum_{i=1}^{L} \alpha_{ti} a_i \quad (7)

\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})} \quad (8)

At each timestep the attention mechanism selects a region of the image, based on the previous LSTM state, and feeds it to the LSTM, so that the generation of a word is conditioned on that specific region, instead of being driven by the entire image.

Ideally, we want the weights \alpha_{ti} to be aware of the saliency and contextual value of location a_i, and to be conditioned on the current status of the LSTM, which can be well encoded by its internal state h_t. In this way, the generative network can focus on different locations of the input image according to their belonging to a salient or contextual region, and to the current generation state. Of course, simply multiplying attention weights with saliency values would result in a loss of context, which is fundamental for caption generation. We instead split the attention weights e_{ti} into two contributions, one for saliency and one for context regions, and employ two different fully connected networks to learn the two contributions (Figure 5). Conceptually, this is equivalent to building two separate attention paths, one for salient regions and one for contextual regions, which are merged to produce the final attention. Overall, the model obeys the following equation:

e_{ti} = s_i \cdot e^{sal}_{ti} + z_i \cdot e^{ctx}_{ti} \quad (9)

where e^{sal}_{ti} and e^{ctx}_{ti} are, respectively, the attention weights for salient and context regions. Attention weights for saliency and context are computed as follows:

e^{sal}_{ti} = v_{e,sal}^T \cdot \phi(W_{ae,sal} \cdot a_i + W_{he,sal} \cdot h_{t-1}) \quad (10)

e^{ctx}_{ti} = v_{e,ctx}^T \cdot \phi(W_{ae,ctx} \cdot a_i + W_{he,ctx} \cdot h_{t-1}) \quad (11)

Notice that our model learns different weights for saliency and contextual regions, and combines them into a final attentive map in which the contributions of salient and non-salient regions are merged together.
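A minimal PyTorch sketch of the two-path attention in Eqs. (7)-(11) is given below. It is an illustrative re-implementation (tensor shapes, default dimensions and module layout are assumptions), not the exact code used in our experiments.

```python
import torch
import torch.nn as nn

class SaliencyContextAttention(nn.Module):
    """Two attention paths, one for salient and one for contextual regions (Eqs. 7-11)."""
    def __init__(self, feat_dim=512, hidden_dim=1024, att_dim=512):
        super().__init__()
        # separate parameters for the two paths (no weight sharing)
        self.W_a_sal = nn.Linear(feat_dim, att_dim, bias=False)
        self.W_h_sal = nn.Linear(hidden_dim, att_dim, bias=False)
        self.v_sal = nn.Linear(att_dim, 1, bias=False)
        self.W_a_ctx = nn.Linear(feat_dim, att_dim, bias=False)
        self.W_h_ctx = nn.Linear(hidden_dim, att_dim, bias=False)
        self.v_ctx = nn.Linear(att_dim, 1, bias=False)

    def forward(self, a, s, h_prev):
        # a: (B, L, feat_dim) spatial features; s: (B, L) saliency values in [0, 1];
        # h_prev: (B, hidden_dim) previous LSTM internal state.
        z = 1.0 - s                                   # contextual regions
        h = h_prev.unsqueeze(1)                       # (B, 1, hidden_dim) for broadcasting
        e_sal = self.v_sal(torch.tanh(self.W_a_sal(a) + self.W_h_sal(h))).squeeze(-1)  # Eq. (10)
        e_ctx = self.v_ctx(torch.tanh(self.W_a_ctx(a) + self.W_h_ctx(h))).squeeze(-1)  # Eq. (11)
        e = s * e_sal + z * e_ctx                     # Eq. (9): merge the two paths
        alpha = torch.softmax(e, dim=1)               # Eq. (8)
        v_hat = (alpha.unsqueeze(-1) * a).sum(dim=1)  # Eq. (7): attended visual feature
        return v_hat, alpha
```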
Similarly to the classical Soft Attention approach [56], the proposed generative LSTM can focus on every region of the image, but the attentive process is aware of the saliency of each location, so that the focus on salient and contextual regions is driven by the output of the saliency predictor.

Words are encoded with one-hot vectors having size equal to that of the vocabulary, and are then projected into an embedding space via a learned linear transformation. Because sentences have different lengths, they are also marked with special begin-of-string and end-of-string tokens, to keep the model aware of the beginning and end of a particular sentence.

Given an image and a sentence (y_1, y_2, \ldots, y_T), encoded with one-hot vectors, the generative LSTM is conditioned step by step on the first t words of the caption, and is trained to produce the next word of the caption. The objective function which we optimize is the log-likelihood of correct words over the sequence:

\max_{w} \sum_{t=1}^{T} \log \Pr(y_t \mid \hat{v}_t, y_{t-1}, y_{t-2}, \ldots, y_1) \quad (12)

where w are all the parameters of the model. The probability of a word is modeled via a softmax layer applied on the output of the LSTM. To reduce the dimensionality, a linear embedding transformation is used to project one-hot word vectors into the input space of the LSTM and, vice versa, to project the output of the LSTM to the dictionary space:

\Pr(y_t \mid \hat{v}_t, y_{t-1}, y_{t-2}, \ldots, y_1) \propto \exp(y_t^T W_p h_t) \quad (13)

where W_p is a matrix for transforming the LSTM output space to the word space and h_t is the output of the LSTM.

At test time, the LSTM is given a begin-of-string tag as input for the first timestep; then the most probable word according to the predicted distribution is sampled and given as input for the next timestep, until an end-of-string tag is predicted.
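The test-time generation loop can be sketched as follows. The `model` interface (its `init_state`, `attention`, `embedding`, `lstm_cell` and `W_p` members) and the concatenation of \hat{v}_t and w_t are illustrative assumptions; in the formulation above the two inputs are fed through separate weight matrices, which is an equivalent parametrization.

```python
import torch

def greedy_decode(model, a, s, bos_idx, eos_idx, max_len=20):
    # a: (1, L, feat_dim) image features; s: (1, L) saliency values in [0, 1].
    word = torch.tensor([bos_idx])                   # begin-of-string tag
    h, c = model.init_state(a)                       # initial LSTM state (assumed helper)
    caption = []
    for _ in range(max_len):
        v_hat, _ = model.attention(a, s, h)          # saliency+context attention (Eqs. 7-11)
        w = model.embedding(word)                    # project word to the embedding space
        h, c = model.lstm_cell(torch.cat([v_hat, w], dim=-1), (h, c))
        logits = model.W_p(h)                        # Eq. (13): scores over the dictionary
        word = logits.argmax(dim=-1)                 # most probable word
        if word.item() == eos_idx:                   # stop at the end-of-string tag
            break
        caption.append(word.item())
    return caption
```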
In this section we perform qualitative and quantitative experiments to validate the effectiveness of the proposed model with respect to different baselines and other saliency-boosted captioning methods. First, we describe the datasets and metrics used to evaluate our solution and provide implementation details.
To validate the effectiveness of the proposed Saliency and Context aware Attention, we perform experiments on five popular image captioning datasets: SALICON [21], Microsoft COCO [33], Flickr8k [17], Flickr30k [60], and PASCAL-50S [53]. Microsoft COCO is composed of more than 120,000 images divided into training and validation sets, where each image is provided with at least five sentences generated by using Amazon Mechanical Turk. SALICON is a subset of Microsoft COCO, created for the visual saliency prediction task. Since its images come from the Microsoft COCO dataset, at least five captions for each image are available. Overall, it contains 10,000 training images, 5,000 validation images and 5,000 testing images, where eye fixations for each image are simulated with mouse movements. In our experiments, we only use the train and validation sets for both datasets. The Flickr8k and the Flickr30k datasets are composed of 8,000 and 30,000 images respectively. Both of them come with five annotated sentences for each image. In our experiments, we randomly choose 1,000 validation images and 1,000 test images for each of these two datasets. The PASCAL-50S dataset provides additional annotations for the UIUC PASCAL sentences [42]. It is composed of 1,000 images from the PASCAL-VOC dataset, each of them annotated with 50 human-written sentences, instead of 5 as in the original dataset. Due to the limited number of samples and for a fair comparison with other captioning methods, we first pre-train the model on the Microsoft COCO dataset, then we test it on the images of this dataset without a specific fine-tuning.

For evaluation, we employ four automatic metrics which are usually employed in image captioning: BLEU [40], ROUGE_L [32], METEOR [2] and CIDEr [53]. BLEU is a modified form of precision between n-grams to compare a candidate translation against multiple reference translations. We evaluate our predictions with BLEU using mono-grams, bi-grams, three-grams and four-grams. ROUGE_L computes an F-measure considering the longest co-occurring in-sequence n-grams. METEOR, instead, is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. CIDEr, finally, computes the average cosine similarity between n-grams found in the generated caption and those found in reference sentences, weighting them using TF-IDF. To ensure a fair evaluation, we use the Microsoft COCO evaluation toolkit (https://github.com/tylin/coco-caption) to compute all scores.

Each image is encoded through a convolutional network, which computes a stack of high-level features. We employ the popular ResNet-50 [16], trained over the ImageNet dataset [44], to compute the feature maps over the input image. In particular, ResNet-50 is composed of 49 convolutional layers, divided into 5 convolutional blocks, and 1 fully connected layer. Since we want to maintain the spatial dimensions, we extract the feature maps from the last convolutional layer and ignore the fully connected layer. The output of the ResNet model is a tensor with 2048 channels. To limit the number of feature maps and the number of learned parameters, we feed this tensor into another convolutional layer with 512 filters and a kernel size of 1, followed by a ReLU activation function. Differently from the weights of the ResNet-50, which are kept fixed, the weights of this last convolutional layer are initialized according to [12] and fine-tuned over the considered datasets.
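The visual encoding just described can be sketched as follows; the torchvision ResNet-50 stands in for the ImageNet-trained network we employ, and the wrapper class and its names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class FeatureEncoder(nn.Module):
    """Frozen ResNet-50 convolutional features, reduced to 512 channels with a 1x1 conv."""
    def __init__(self, out_channels=512):
        super().__init__()
        resnet = models.resnet50(pretrained=True)   # flag name may differ across torchvision versions
        # keep all convolutional blocks, drop average pooling and the fully connected layer
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False                 # ResNet-50 weights are kept fixed
        # trainable 1x1 convolution with 512 filters, followed by ReLU
        self.reduce = nn.Sequential(nn.Conv2d(2048, out_channels, kernel_size=1), nn.ReLU())

    def forward(self, image):
        # image: (B, 3, 480, 480) -> (B, 2048, 15, 15) -> (B, 512, 15, 15) -> (B, L=225, 512)
        feat = self.reduce(self.backbone(image))
        return feat.flatten(2).transpose(1, 2)
```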
Table 1. Image captioning results. The conditioning on saliency and context (Saliency+Context Attention) enhances the generation of the caption with respect to the traditional machine attention mechanism. Soft Attention here indicates our re-implementation of [56], using the same visual features as our model.
(Table 1 reports B@1, B@2, B@3, B@4, METEOR, ROUGE_L and CIDEr for Soft Attention and Saliency+Context Attention on SALICON, COCO, Flickr8k (validation and test), Flickr30k (validation and test), and PASCAL-50S.)
In the LSTM, following the initialization proposed in [1], the weight matrices applied to the inputs are initialized by sampling each element from a zero-mean Gaussian distribution with small variance, while the weight matrices applied to the internal states are initialized by using the orthogonal initialization. The vectors v_{e,sal} and v_{e,ctx}, as well as all bias vectors b_*, are instead initialized to zero.

To predict the saliency map for each input image, we exploit our Saliency Attentive Model (SAM) [9], which is able to predict accurate saliency maps according to different saliency benchmarks. We note, however, that we do not expect a significant performance variation when using other state-of-the-art saliency methods.

As mentioned, we perform experiments over five different datasets. For the SALICON dataset, since its images all have the same size of 480 × 640, we keep the original resolution, obtaining L = 15 × 20 spatial locations, while for all other datasets we resize the images to 480 × 480, obtaining L = 15 × 15. We train the model with a learning rate of 0.001 and a batch size of 64. The hidden state dimension is set to 1024, while the embedding size is set to 512. For all datasets, we choose a vocabulary size equal to the number of words which appear at least 5 times in training and validation captions.
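The vocabulary construction just described can be sketched as follows (illustrative code with whitespace tokenization; the special tokens beyond begin- and end-of-string are an assumption).

```python
from collections import Counter

def build_vocabulary(captions, min_count=5):
    """Keep every word appearing at least `min_count` times in training and validation captions."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    words = [w for w, c in counts.items() if c >= min_count]
    vocab = ["<bos>", "<eos>", "<unk>"] + sorted(words)   # <unk> is an illustrative addition
    return {w: i for i, w in enumerate(vocab)}
```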
To assess the performance of our method, and to investigate the hypotheses behind it, we first compare with the classic Soft Attention approach, and we then build three baselines in which saliency is used to condition the generative process.
Soft Attention [56]: The visual input to the LSTM is computed via the Soft Attention mechanism to attend to different locations of the image, without considering salient and non-salient regions. A single feed-forward network is in charge of producing attention values, which can be obtained by
replacing Eq. (9) with

e_{ti} = v_e^T \cdot \phi(W_{ae} \cdot a_i + W_{he} \cdot h_{t-1}) \quad (14)

This approach is equivalent to the one proposed in [56], although some implementation details are different. In order to achieve a fair evaluation, we use activations from the ResNet-50 model instead of the VGG-19, and we do not include the doubly stochastic regularization trick. For this reason, the numerical results that we report are not directly comparable with those in the original paper (ours are in general higher than the original ones).

Saliency pooling: Visual features from the CNN are multiplied at each location by the corresponding saliency value, and then summed, without any attention mechanism. In this case the visual input of the LSTM is not time dependent, and salient regions are given more focus than non-salient ones. Comparing with Eq. (7), it can be seen as a variation of the Soft Attention in which the network always focuses on salient regions:

\hat{v}_t = \hat{v} = \sum_{i=1}^{L} s_i a_i \quad (15)

Attention on saliency: This is an extension of the Soft Attention approach in which saliency is used to modulate attention values at each location. The attention mechanism, therefore, is conditioned to attend salient regions with higher probability, and to ignore non-salient regions:

e_{ti} = s_i \cdot v_e^T \cdot \phi(W_{ae} \cdot a_i + W_{he} \cdot h_{t-1}) \quad (16)

Attention on saliency and context (with weight sharing): The attention mechanism is aware of salient and context regions, but the weights used to compute the attentive scores of salient and context regions are shared, excluding the v_e^T vectors. Notice that, if those were shared too, this baseline would be equivalent to the Soft Attention one:

e_{ti} = s_i \cdot e^{sal}_{ti} + (1 - s_i) \cdot e^{ctx}_{ti} \quad (17)

e^{sal}_{ti} = v_{e,sal}^T \cdot \phi(W_{ae} \cdot a_i + W_{he} \cdot h_{t-1}) \quad (18)

e^{ctx}_{ti} = v_{e,ctx}^T \cdot \phi(W_{ae} \cdot a_i + W_{he} \cdot h_{t-1}) \quad (19)

It is also straightforward to notice that our proposed approach is equivalent to this last baseline, without weight sharing.

In Table 1 we first compare the performance of our method with respect to the Soft Attention approach, to assess the superior performance of the proposal with respect to the published state of the art. We report results on all the datasets, both on validation and test sets, with respect to all the automatic metrics described in Section 5.1. As can be seen, the proposed approach always overcomes the Soft Attention approach by a significant margin, thus experimentally confirming the benefit of having two separate attention paths, one for salient and one for non-salient regions, and the role of saliency as a conditioning for captioning. In particular, on the METEOR metric, the relative improvement is smallest on the PASCAL-50S dataset and largest on the Flickr8k validation set.

In Table 2, instead, we compare our approach with the three baselines that incorporate saliency. Firstly, it can be observed that the Saliency pooling baseline usually performs worse than the Soft Attention, thus demonstrating that always attending to salient locations is not sufficient to achieve good captioning results. When plugging in attention, as in the Attention on saliency baseline, numerical results are somewhat higher, thanks to a time-dependent attention, but still far from the performance achieved by the complete model. It can also be noticed that, even though this baseline does not take into account the context, it sometimes achieves better results than the Soft Attention model (such as in the case of SALICON, with respect to the METEOR metric).
Table 2. Comparison with image captioning baselines that incorporate saliency. While the use of machine attention strategies is beneficial (see Saliency pooling vs. Attention on saliency), saliency and context are both important for captioning. The use of different attention paths for saliency and context also enhances the performance (see Saliency+Context Attention (with weight sharing) vs. Saliency+Context Attention).

(Table 2 reports B@1, B@2, B@3, B@4, METEOR, ROUGE_L and CIDEr for Saliency pooling, Attention on saliency, Saliency+Context Attention with weight sharing, and Saliency+Context Attention on SALICON, COCO, Flickr8k (validation and test), Flickr30k (validation and test), and PASCAL-50S.)
Finally, we notice that the baseline with attention on saliency and context with weight sharing is better than Attention on saliency, further confirming the benefit of including the context. Having two completely separate attention paths, as in our model, is nevertheless important, as demonstrated by the numerical results of this last baseline with respect to those of our method.
Table 3. Comparison with existing saliency-boosted captioning models.
(Table 3 reports B@4, METEOR, ROUGE_L and CIDEr for Sugano et al. [48] vs. our model on SALICON; Tavakoli et al. [52] (GBVS and iSEEL) vs. our model on COCO and on PASCAL-50S; and Ramanishka et al. [41] (METEOR 18.3) vs. our model on the Flickr30k test set.)
We also compare to existing captioning models that incorporate saliency during the generation of image descriptions. In particular, we compare to the model proposed in [48], which exploited human fixation points, to the work by Tavakoli et al. [52], which reports experiments on Microsoft COCO and on PASCAL-50S, and to the proposal by Ramanishka et al. [41], which used convolutional activations as a proxy for saliency.

Table 3 shows the results on the three considered datasets in terms of BLEU@4, METEOR, ROUGE_L and CIDEr. We compare our solution to both versions of the model presented in [52]. The GBVS version exploits saliency maps calculated by using a traditional bottom-up model [15], while the other one includes saliency maps extracted from a deep convolutional network [51].

Overall, results show that the proposed Saliency and Context Attention model can overcome the other methods on different metrics, thus confirming the strategy of including two attention paths. In particular, on the METEOR metric, we obtain a relative improvement of 4.57% on the SALICON dataset, 5.53% on Microsoft COCO and 8.94% on PASCAL-50S.
We further collect statistics on captions generated by our method and by the Soft Attention model, to quantitatively assess the quality of the generated captions. Firstly, we define three metrics which evaluate the vocabulary size and the difference between the corpus of captions generated by the two models and the ground-truth:
• Vocabulary size: number of unique words generated in all captions;
• Percentage of novel sentences: percentage of generated sentences which are not seen in the training set;
• Percentage of different sentences: percentage of images which are described differently by the two models.
Then, we measure the diversity of the set of captions generated by each of the two models, via the following two metrics [45]:
• Div-1: ratio of the number of unique unigrams in a set of captions to the number of words in the same set. Higher is more diverse.
• Div-2: ratio of the number of unique bigrams in a set of captions to the number of words in the same set. Higher is more diverse.
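A minimal sketch of the Div-1 and Div-2 statistics defined above is the following (whitespace tokenization is a simplifying assumption).

```python
def diversity(captions, n):
    """Ratio of unique n-grams in a set of captions to the total number of words in the set."""
    ngrams, total_words = set(), 0
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_words, 1)

# div1 = diversity(generated_captions, 1)   # Div-1
# div2 = diversity(generated_captions, 2)   # Div-2
```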
Table 4. Statistics on vocabulary size and diversity of the generated captions. Including saliency and context in two different machine attention paths (Saliency+Context Attention) produces different captions with respect to the traditional machine attention approach (Soft Attention), while preserving almost the same diversity statistics.
(Table 4 reports Div-1, Div-2, vocabulary size, percentage of novel sentences and percentage of different sentences for Soft Attention and Saliency+Context Attention on SALICON, COCO, Flickr8k (validation and test), Flickr30k (validation and test), and PASCAL-50S.)
In Table 4 we compare the set of captions generated by our model with that generated by the Soft Attention baseline. Although our model features a slight reduction of the vocabulary size on SALICON, COCO and PASCAL-50S, the captions generated by the two models are very often different, thus confirming that the two approaches have learned different captioning models. Moreover, the diversity and the number of novel sentences of the Soft Attention approach are entirely preserved.
The selection of a location in our model is based on the competition between the saliency attentive path and the context attentive path (see Eq. (9)). To investigate how the two paths interact and contribute to the generation of a word, in Figure 6 we report, for several images from the Microsoft COCO dataset, the changes in attention weights between the two paths. Specifically, for each image we report the average of the e^{sal}_{ti} and e^{ctx}_{ti} values at each timestep, along with a visualization of its saliency map. It is interesting to see how the model was able to correctly exploit the two attention paths for generating different parts of the caption, and how the generated words correspond in most cases to the attended regions. For example, in the case of the first image (“a group of zebras graze in a grassy field”), the saliency attentive path is more active than the context path during the generation of the words corresponding to the “group of zebras”, which are captured by saliency. Instead, when the model has to describe the context (“in a grassy field”), the saliency attentive path has lower weights with respect to the context attentive path. The same can be observed for all the reported images; it can also be noticed that the generated captions tend to describe both salient objects and the context, and that usually the salient part, which is also the most important, is described before the context.
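The per-timestep curves in Figure 6 can be produced with a simple sketch like the one below; `e_sal` and `e_ctx` are assumed to be arrays of shape (T, L) collected while generating the caption.

```python
import matplotlib.pyplot as plt

def plot_attention_paths(words, e_sal, e_ctx):
    """Plot the average saliency-path and context-path attention weight for each generated word."""
    t = range(len(words))
    plt.plot(t, e_sal.mean(axis=1), marker="o", label="Saliency")
    plt.plot(t, e_ctx.mean(axis=1), marker="o", label="Context")
    plt.xticks(list(t), words, rotation=45)
    plt.ylabel("Average attention weight")
    plt.legend()
    plt.tight_layout()
    plt.show()
```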
Fig. 6. Examples of attention weight changes between saliency and context during the generation of captions (best seen in color). Images are from the Microsoft COCO dataset [33].
Finally, in Figure 7 we report some sample results on images taken from the Microsoft COCO dataset. For each image we report the corresponding saliency map, and the captions generated by our model and by the Soft Attention baseline, compared to the ground-truth. It can be seen that, on average, captions generated by our model are more consistent with the corresponding image and the human-generated caption, and that, as also observed in the previous section, salient parts are described as well as the context. The incorporation of saliency and context also helps the model to avoid failures due to hallucination, such as in the case of the fourth image, in which the Soft Attention model predicts a remote control which is not depicted in the image. Other failure cases, which are avoided by our model, include the repetition of words (as in the fifth image) and the failure to describe the context (first image). We speculate that the presence of two separate attention paths, which the model has learned to attend during the generation of the caption, helps to avoid such failures more effectively than the classic machine attention approach.

For completeness, some failure cases of the proposed model are reported in Figure 8. The majority of failures occur when the salient regions of the image are not described in the corresponding ground-truth caption (as for example in the first row), thus causing a performance loss. Some problems arise also in the presence of complex scenes (such as in the fourth image). However, we observe that the Soft Attention baseline fails as well to predict correct and complete captions in these cases.
Ground-truth: A group of people that are standing around giraffes.
Saliency+Context Attention: A group of people standing around a giraffe.
Without saliency: A group of people standing around a stage with a group of people.

Ground-truth: A group of people at the park with some flying kites.
Saliency+Context Attention: A group of people flying kites in a park.
Without saliency: A group of people standing on top of a lush green field.

Ground-truth: A man is looking into a home refrigerator.
Saliency+Context Attention: A man is looking inside of a refrigerator.
Without saliency: A man is making a refrigerator in a kitchen.

Ground-truth: A women who is jumping on the bed.
Saliency+Context Attention: A woman is jumping up in a bed.
Without saliency: A woman is playing with a remote control.

Ground-truth: A man takes a profile picture of himself in a bathroom mirror.
Saliency+Context Attention: A person taking a picture of himself in a bathroom.
Without saliency: A bathroom with a sink and a sink.

Ground-truth: A double decker bus driving down a street.
Saliency+Context Attention: A double decker bus driving down a street.
Without saliency: A bus is parked on the side of the road.

Ground-truth: A teddy bear holding a cell phone in front of a window with a view of the city.
Saliency+Context Attention: A teddy bear sitting on a chair next to a window.
Without saliency: A brown dog is sitting on a laptop keyboard.

Ground-truth: A group of people riding down a snow covered slope.
Saliency+Context Attention: A group of people riding skis down a snow covered slope.
Without saliency: A group of people on skis in the snow.

Ground-truth: A laptop computer sitting on top of a table.
Saliency+Context Attention: A laptop computer sitting on a top of a desk.
Without saliency: A desk with a laptop computer and a laptop.

Ground-truth: A person on a motorcycle riding on a mountain.
Saliency+Context Attention: A person riding a motorcycle on a road.
Without saliency: A man on a bike with a bike in the background.

Ground-truth: A car is parked next to a parking meter.
Saliency+Context Attention: A car is parked in the street next to a parking meter.
Without saliency: A car parked next to a white fire hydrant.

Ground-truth: A plate of food and a cup of coffee.
Saliency+Context Attention: A plate of food with a sandwich and a cup of coffee.
Without saliency: A table with a variety of food on it.
Fig. 7. Example results on the Microsoft COCO dataset [33]. first one is focused on salient regions, and the second on contextual regions: the overall modelexploits the two paths during the generation of the caption, by giving more importance to salientor contextual regions as needed. The role of saliency with respect to context has been investigatedby collecting statistics on semantic segmentation datasets, while the captioning model has beenevaluated on large scale captioning datasets, using standard automatic metrics and by evaluatingthe diversity and the dictionary size of the generated corpora. Finally, the activations of the twoattentive paths have been investigated, and we have shown that they correspond, word by word, toa focus on salient objects or on the context in the generated caption; moreover, we qualitatively
Fig. 8. Failure cases on sample images of the Microsoft COCO dataset [33].
(1) Ground-truth: The yellow truck passes by two people on motorcycles from opposing directions. Saliency+Context Attention: A person on a motor bike in a city. Without saliency: A man in a red shirt on a horse.
(2) Ground-truth: A cityscape that is seen from the other side of the river. Saliency+Context Attention: A large building with a large clock tower in the background. Without saliency: A large building with a large clock in the water.
(3) Ground-truth: A large tree situated next to a large body of water. Saliency+Context Attention: A person is sitting under a red umbrella. Without saliency: A street sign with a large tree in the middle.
(4) Ground-truth: A busted fire hydrant spewing water out onto a street. Saliency+Context Attention: A person standing in a front of a large cruise ship. Without saliency: A man is standing in a dock near a large truck.
(5) Ground-truth: A small airplane flying over a field filled with people. Saliency+Context Attention: A group of people walking around a large jet. Without saliency: A large group of people standing on top of a lush green field.
(6) Ground-truth: The view of city buildings is seen from the river. Saliency+Context Attention: A large clock tower towering over the water. Without saliency: A large building with a large clock tower in the water.
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
[2] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
[3] Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2017. Hierarchical Boundary-Aware Neural Encoder for Video Captioning. In IEEE International Conference on Computer Vision and Pattern Recognition.
[4] Ali Borji. 2012. Boosting bottom-up and top-down visual features for saliency estimation. In IEEE International Conference on Computer Vision and Pattern Recognition.
[5] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. 2017. MIT Saliency Benchmark. http://saliency.mit.edu/ (2017).
[6] Zoya Bylinskii, Adrià Recasens, Ali Borji, Aude Oliva, Antonio Torralba, and Frédo Durand. 2016. Where should saliency models look next? In European Conference on Computer Vision.
[7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes Dataset for Semantic Urban Scene Understanding. In IEEE International Conference on Computer Vision and Pattern Recognition.
[8] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2016. A Deep Multi-Level Network for Saliency Prediction. In International Conference on Pattern Recognition.
[9] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. 2017. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. arXiv preprint arXiv:1611.09571 (2017).
[10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
[11] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision.
[12] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.
[13] Ke Gong, Xiaodan Liang, Xiaohui Shen, and Liang Lin. 2017. Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing. In IEEE International Conference on Computer Vision and Pattern Recognition.
[14] Zenzi M Griffin and Kathryn Bock. 2000. What the eyes say about speaking. Psychological Science 11, 4 (2000), 274–279.
[15] Jonathan Harel, Christof Koch, and Pietro Perona. 2006. Graph-based visual saliency. In Advances in Neural Information Processing Systems.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE International Conference on Computer Vision and Pattern Recognition.
[17] Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47 (2013), 853–899.
[18] Xun Huang, Chengyao Shen, Xavier Boix, and Qi Zhao. 2015. SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks. In IEEE International Conference on Computer Vision.
[19] Laurent Itti, Christof Koch, and Ernst Niebur. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 11 (1998), 1254–1259.
[20] Saumya Jetley, Naila Murray, and Eleonora Vig. 2016. End-to-End Saliency Mapping via Probability Distribution Prediction. In IEEE International Conference on Computer Vision and Pattern Recognition.
[21] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in context. In IEEE International Conference on Computer Vision and Pattern Recognition.
[22] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In IEEE International Conference on Computer Vision and Pattern Recognition.
[23] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. 2009. Learning to predict where humans look. In IEEE International Conference on Computer Vision.
[24] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In IEEE International Conference on Computer Vision and Pattern Recognition.
[25] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[26] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A Hierarchical Approach for Generating Descriptive Image Paragraphs. In IEEE International Conference on Computer Vision and Pattern Recognition.
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems.
[28] Srinivas SS Kruthiventi, Vennela Gudisa, Jaley H Dholakiya, and R Venkatesh Babu. 2016. Saliency Unified: A Deep Architecture for Simultaneous Eye Fixation Prediction and Salient Object Segmentation. In IEEE International Conference on Computer Vision and Pattern Recognition.
[29] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2891–2903.
[30] Matthias Kümmerer, Lucas Theis, and Matthias Bethge. 2015. DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet. In International Conference on Learning Representations Workshops.
[31] Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge. 2016. DeepGaze II: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563 (2016).
[32] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop on Text Summarization Branches Out.
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision.
[34] Nian Liu and Junwei Han. 2016. A Deep Spatial Contextual Long-term Recurrent Convolutional Network for Saliency Detection. arXiv preprint arXiv:1610.01708 (2016).
[35] Nian Liu, Junwei Han, Dingwen Zhang, Shifeng Wen, and Tianming Liu. 2015. Predicting eye fixations using convolutional neural networks. In IEEE International Conference on Computer Vision and Pattern Recognition.
[36] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. The Role of Context for Object Detection and Semantic Segmentation in the Wild. In IEEE International Conference on Computer Vision and Pattern Recognition.
[37] Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems. 1143–1151.
[38] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto. 2017. SalGAN: Visual Saliency Prediction with Generative Adversarial Networks. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops.
[39] Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giró-i Nieto. 2016. Shallow and Deep Convolutional Networks for Saliency Prediction. In IEEE International Conference on Computer Vision and Pattern Recognition.
[40] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics.
[41] Vasili Ramanishka, Abir Das, Jianming Zhang, and Kate Saenko. 2017. Top-down Visual Saliency Guided by Captions. In IEEE International Conference on Computer Vision and Pattern Recognition.
[42] Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshops.
[43] Ronald A. Rensink. 2000. The Dynamic Representation of Scenes. Visual Cognition 7, 1-3 (2000), 17–42.
[44] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (2015).
[45] arXiv preprint arXiv:1703.10476 (2017).
[46] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
[47] Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (2014).
[48] Yusuke Sugano and Andreas Bulling. 2016. Seeing with Humans: Gaze-Assisted Neural Image Captioning. arXiv preprint arXiv:1608.05203 (2016).
[49] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning.
[50] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper with Convolutions. In IEEE International Conference on Computer Vision and Pattern Recognition.
[51] Hamed R Tavakoli, Ali Borji, Jorma Laaksonen, and Esa Rahtu. 2017. Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features. Neurocomputing 244 (2017), 10–18.
[52] Hamed R Tavakoli, Rakshith Shetty, Ali Borji, and Jorma Laaksonen. 2017. Paying Attention to Descriptions Generated by Image Captioning Models. In IEEE International Conference on Computer Vision.
[53] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In IEEE International Conference on Computer Vision and Pattern Recognition.
[54] Eleonora Vig, Michael Dorr, and David Cox. 2014. Large-scale optimization of hierarchical features for saliency prediction in natural images. In IEEE International Conference on Computer Vision and Pattern Recognition.
[55] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In IEEE International Conference on Computer Vision and Pattern Recognition.
[56] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning.
[57] Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Conference on Empirical Methods in Natural Language Processing.
[58] Benjamin Z Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. I2t: Image parsing to text description. Proc. IEEE 98, 8 (2010), 1485–1508.
[59] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In IEEE International Conference on Computer Vision and Pattern Recognition.
[60] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (2014).