Integrating Image Captioning with Rule-based Entity Masking
Aditya Mogadala*, Xiaoyu Shen, Dietrich Klakow
Spoken Language Systems, Saarland Informatics Campus, Saarland University
Max Planck Institute for Informatics
[email protected], [email protected], [email protected]
*Contact Author
Abstract
Given an image, generating its natural language description (i.e., caption) is a well-studied problem. Approaches proposed to address this problem usually rely on image features that are difficult to interpret. Particularly, these image features are subdivided into global and local features, where global features are extracted from the global representation of the image, while local features are extracted from the objects detected locally in an image. Although local features extract rich visual information from the image, existing models generate captions in a black-box manner, and humans have difficulty interpreting which local objects a caption is aimed to represent. Hence, in this paper, we propose a novel framework for image captioning with an explicit object (e.g., knowledge graph entity) selection process that still maintains end-to-end training ability. The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, and then generates proper captions by attending to the selected entities. Experiments conducted on the MSCOCO dataset demonstrate that our method achieves good performance in terms of caption quality and diversity, with a more interpretable generation process than previous counterparts.
1 Introduction
Over the past few years, the task of generating descriptions for images (i.e., image captioning) [Vinyals et al., 2015; Anderson et al., 2017] has become popular, as it effectively brings together vision and natural language to serve various real-world applications. Most existing approaches are efficient in learning a correspondence between an image and a sequence of words with different techniques that improve either how visual information is captured with attention [Xu et al., 2015; Lu et al., 2016; Anderson et al., 2017] or the language model interactions [Shen et al., 2017].

Careful analysis of methods that aim to effectively capture visual information reveals that they either utilize global image features or attend to regions for local image features to generate captions. However, this makes them hard to interpret, as they do not select or control the objects in an image that may be prominent for caption generation. Such interpretability is especially important for understanding the caption generation process in case of failures in systems that cater to real-world applications such as autonomous driving, medical imaging, and surveillance. It was also observed previously [Wang et al., 2018] that rich entities and their interactions in some kind of layout can help to better understand image captioning.

Therefore, in this paper, we introduce our interpretable image caption generation model (henceforth, Interpret-IC) to address the limitations of previous approaches, as shown in Figure 1. Our proposed approach works with a human-interpretable mask which selects the set of local objects observed in an image based on human-proposed rules. These rules ensure that only those desirable objects are selected which a human wants to observe in the caption. For this to work, the local objects need to be represented with semantically enriched labels that humans can comprehend. As none of the current approaches provide such local object information, we leveraged the relational knowledge provided by knowledge graph entities to attain semantic labels by building a multi-label image classifier, and we replaced local object visual features with entity distributed representations [Bordes et al., 2011]. We show that these entity labels and their features are superior to detected local object features in terms of interpreting knowledge from the image. Closest to our approach is [Cornia et al., 2019], who consider the decomposition of a sentence into noun chunks and model the relationship between image regions and textual chunks. However, we dynamically select the number of objects prior to learning the model. Our main contributions are as follows:
• We propose a novel end-to-end caption model for interpretable image captioning.
• We use knowledge graph entities as image labels for grounding visual and factual knowledge.
• We show that interpretable image captioning can attain diversity in the generated captions with simple visual object masking.
Figure 1: Comparison between two proposed models with different visual features: (a) Base-IC (Section 3.1) and (b) Interpret-IC (Section 3.2). Interpret-IC has an extra process of highlighting which objects to cover in the generated caption from a shortlist of all detected objects in the image.
2 Related Work
In the related work, we explore deep neural network-based approaches which generate sentence-level natural language descriptions for images.
In recent years, monolingual image caption generation has been explored to incorporate diversity in the generated captions. Approaches [Li et al., 2018] have leveraged adversarial training using either generative adversarial networks [Shetty et al., 2017] or variational auto-encoders [Shen et al., 2019], while [Vijayakumar et al., 2016] used diverse beam search to decode diverse image captions in English. Approaches were also proposed to describe images from cross-domain data [Chen et al., 2017]. However, our goal in this research is to provide a better selection procedure for identifying preferable objects in images. Nevertheless, we show that interpretability can also assist diversity.
An approach closer to interpretable image captioning is to control local objects in images. [Cornia et al., 2019] used either a sequence or a set of local objects by explicitly grounding them with noun chunks observed in the captions to generate diverse captions. Further, instead of making captions only diverse, [Deshpande et al., 2019] made the captioning more accurate. Our work falls into this space; however, understanding the important entities that represent the image and controlling them is what we aim to achieve.
3 Approach
3.1 Base-IC
The base image caption model (Base-IC) is built without masking. Given an image $I$, its global representation $I_v \in \mathbb{R}^V$ denotes the encoding of the full image, while the spatial objects $a_v = \{a_{v1}, \ldots, a_{vL}\}$ encode local regions of the image, provided as $a_{vj} \in \mathbb{R}^D$. Similar to previous works [Lu et al., 2016; Anderson et al., 2017], our proposed image description model also leverages a soft attention mechanism to weigh spatial objects during description generation, using the partial output sequence as context. Figure 2 illustrates the architecture.

Initially, layer L-1 of the model receives input from the global visual context provided by $I_v$ and the textual sequence, where each word $w_t \in \mathbb{R}^T$ at time step $t$ is initialized with pretrained word embeddings, to produce hidden vectors $h_t^{(1)} \in \mathbb{R}^H$. Furthermore, $h_t^{(1)}$ is used in combination with $a_v$ to compute soft attention. Later, $h_t^{(1)}$ and the attended spatial features are added and provided as input to layer L-2 for attaining $h_t^{(2)} \in \mathbb{R}^H$. For convenience and to reduce the number of parameter names, we use $\Theta$ as the reference for the parameters of the LSTM.

To calculate the attended spatial features $\hat{a}_t$, we leverage $a_v$. The hidden state $h_t^{(1)}$ at each time step $t$ is used to generate a normalized attention weight $\alpha_{tj}$ for each of the spatial object features $a_{vj}$, given by Equation 1 and Equation 2:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{L} \exp(e_{tk})} \quad (1)$$

$$e_{tj} = \tanh(W_{ae} a_{vj} + W_{he} h_t^{(1)}) \quad (2)$$

where $L$ represents the cardinality of the set $a_v$, and $W_{ae} \in \mathbb{R}^{M \times D}$ and $W_{he} \in \mathbb{R}^{M \times H}$ are learnable parameters. Further, $\hat{a}_t$ is calculated with Equation 3 and is used as input, along with $h_t^{(1)}$, to L-2 at every time step $t$:

$$\hat{a}_t = \sum_{j=1}^{L} \alpha_{tj} a_{vj} \quad (3)$$

The final Base-IC, using $w_t$ and $I_v$ as input to L-1, is given by Equations 4 and 5. Further, $\hat{a}_t$ and $h_t^{(1)}$ are added in Equation 6 to provide the input to L-2 for generating $h_t^{(2)}$, as given by Equation 7, which is then used to predict the next word in the sequence, as given in Equation 8:

$$x_t = I_v \oplus w_t \quad (4)$$
$$h_t^{(1)} = \text{L-1}(x_t, h_{t-1}^{(1)}; \Theta) \quad (5)$$
$$x'_t = \hat{a}_t + h_t^{(1)} \quad (6)$$
$$h_t^{(2)} = \text{L-2}(x'_t, h_{t-1}^{(2)}; \Theta) \quad (7)$$
$$p_{t+1} = \text{softmax}(W_{vocab} h_t^{(2)}) \quad (8)$$

where $W_{vocab} \in \mathbb{R}^{vocab \times (V+H)}$, $\oplus$ represents concatenation, and $vocab$ refers to the vocabulary of the caption dataset.

Figure 2: Illustration of the Base-IC model.
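To make the two-layer computation concrete, the following is a minimal PyTorch sketch of one Base-IC decoding step (Equations 1-8). This is not the authors' code: the scalar projection $w_e$ (Equation 2 as printed yields an $M$-dimensional vector, so a scalar scoring projection is assumed) and the projection of $\hat{a}_t$ to $\mathbb{R}^H$ (the paper adds $\hat{a}_t$ and $h_t^{(1)}$ directly, which implies $D = H$) are our assumptions, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseICStep(nn.Module):
    """One decoding step of Base-IC (Eqs. 1-8); a sketch, not released code."""
    def __init__(self, V=2048, D=2048, T=300, H=512, M=512, vocab=9989):
        super().__init__()
        self.l1 = nn.LSTMCell(V + T, H)          # L-1 over [I_v ; w_t], Eqs. (4)-(5)
        self.l2 = nn.LSTMCell(H, H)              # L-2 over x'_t, Eq. (7)
        self.W_ae = nn.Linear(D, M, bias=False)  # Eq. (2)
        self.W_he = nn.Linear(H, M, bias=False)  # Eq. (2)
        self.w_e = nn.Linear(M, 1, bias=False)   # assumed scalar projection of e_tj
        self.proj = nn.Linear(D, H, bias=False)  # assumed so Eq. (6) type-checks
        self.W_vocab = nn.Linear(H, vocab)       # Eq. (8)

    def forward(self, I_v, w_t, a_v, state1, state2):
        # I_v: (B, V) global feature, w_t: (B, T) word embedding, a_v: (B, L, D)
        x_t = torch.cat([I_v, w_t], dim=-1)                         # Eq. (4)
        h1, c1 = self.l1(x_t, state1)                               # Eq. (5)
        e = self.w_e(torch.tanh(self.W_ae(a_v)                      # Eq. (2)
                                + self.W_he(h1).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                                # Eq. (1)
        a_hat = (alpha.unsqueeze(-1) * a_v).sum(dim=1)              # Eq. (3)
        h2, c2 = self.l2(self.proj(a_hat) + h1, state2)             # Eqs. (6)-(7)
        p_next = F.log_softmax(self.W_vocab(h2), dim=-1)            # Eq. (8)
        return p_next, (h1, c1), (h2, c2), alpha
```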
3.2 Interpret-IC
The main aim of the Interpret-IC model is to select objects present in the spatial object set $a_v$ with human-interpretable masking. This is in contrast with earlier approaches [Xu et al., 2015; Anderson et al., 2017], which decode the caption by attending to spatial objects only by ranking them according to their importance at each time step; these approaches provide no control for humans to select their desirable objects. The expectation from the Interpret-IC model is therefore that the selected objects receive more prominence in caption generation, while objects that are not selected are discarded.

Hence, we introduce masked attention to select those objects that a human wants to see in the generated captions. To achieve this, we leverage a ground truth mask, $mask_{gt}$, where each object in $a_v$ is masked with a binary parameter $\beta_1, \beta_2, \ldots, \beta_n$. We set $\beta_i = 1$ if selected and 0 otherwise. Each $\beta_i$ is assumed to be independent of the others and is sampled from a Bernoulli distribution. The predicted mask, $mask_{pred}$, is estimated during training with a multi-layer perceptron (MLP).

Further, the attention weights computed in Equation 1 are modified with the estimated $mask_{pred}$ as shown in Equation 9:

$$\alpha^{mask}_{tj} = \frac{\exp(e_{tj})\, mask_{pred,j}}{\sum_{k=1}^{L} \exp(e_{tk})\, mask_{pred,k}} \quad (9)$$

This is then used to calculate $\hat{a}^{mask}_t$, given by Equation 10, which is further used as input, along with $h_t^{(1)}$, to L-2 at every time step $t$. Figure 3 illustrates the overall architecture.

$$\hat{a}^{mask}_t = \sum_{j=1}^{L} \alpha^{mask}_{tj} a_{vj} \quad (10)$$
Figure 3: Illustration of the Interpret-IC model.
Note that our selection strategy is very different from that of [Cornia et al., 2019], who control spatial objects using fixed noun chunks extracted from the captions, which are not available during the testing phase. In contrast, we use human-designed rules to change our mask, so that we control the mask as we aim to use it.
3.3 Mask Creation
In the Interpret-IC model, $mask_{pred}$ needs to be optimized during the training phase to be close to the ground truth binary mask $mask_{gt}$, such that it can be utilized during the testing phase. However, we first need to create such a $mask_{gt}$ based on human-interpretable rules to influence the caption generation process.

There can be several ways to create $mask_{gt}$ by changing the rules. In this paper, we apply a visual-entity-to-caption-noun matching approach to build $mask_{gt}$. Our rule states that for each noun identified in the caption (using spaCy, https://spacy.io/), we find the closest visual entity by computing the cosine distance between the noun and visual entity vectors attained using pretrained fastText vectors (https://fasttext.cc/). For all nouns identified, the closest visual entities are set to 1, while the rest are set to 0. This rule ensures that the nouns observed in the caption that represent some kind of object present in the image are given higher preference during caption generation, while the remaining visual entities (e.g., actions) are put on the back burner. Algorithm 1 presents an overview of the selection process.

Input: Nouns (N), Visual Entities (VE), fastText Embeddings (FTE)
Output: mask_gt for each caption
  Initialize N_emb = FTE(N)
  Initialize VE_emb = FTE(VE)
  Initialize image visual-entity list I_velist
  Initialize caption list C_list
  Function mask_gt-Selection:
    for C, VE in C_list, I_velist do
      Extract N from caption C
      Initialize mask_gt = zeros[len(VE)]
      for n in N do
        if n is not EMPTY then
          dist = CosineDistance(n_emb, VE_emb)
          close_index = argmin(dist)
          mask_gt[close_index] = 1
        end
      end
      return mask_gt
    end
  end
Algorithm 1: mask_gt selection process.
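For concreteness, here is a minimal Python rendering of Algorithm 1. The embedding lookup `emb` (a dict from token to fastText vector) and the helper names are our own, and out-of-vocabulary nouns are simply skipped, which is one reading of the EMPTY check:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between one noun vector u and entity vectors v."""
    return 1.0 - (v @ u) / (np.linalg.norm(v, axis=-1) * np.linalg.norm(u) + 1e-12)

def mask_gt_selection(nouns, visual_entities, emb):
    """Algorithm 1: set mask_gt[j] = 1 for the visual entity closest
    to each caption noun under cosine distance in fastText space."""
    ve_emb = np.stack([emb[v] for v in visual_entities])   # (|VE|, d)
    mask_gt = np.zeros(len(visual_entities))
    for n in nouns:
        if n in emb:                                       # skip OOV nouns
            dist = cosine_distance(emb[n], ve_emb)
            mask_gt[np.argmin(dist)] = 1
    return mask_gt
```

In the paper, the nouns come from the caption via a POS tagger (spaCy) and the embeddings are pretrained fastText vectors.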
3.4 Training and Inference
Base-IC
The parameters $\theta$ of the Base-IC model are trained by optimizing the cost function $C$, which minimizes the sentence-level categorical cross-entropy loss, i.e., the negative log-likelihood of the appropriate ground-truth word $y^*_t$ at each time step $t$, as shown in Equation 11. Here, we leverage teacher forcing [Sutskever et al., 2014], where the ground truth $y^*_t$ is fed to the next step of layer L-1 instead of the word predicted at the previous step.

$$C(\theta) = -\sum_{t=0}^{T(n)} \log p_\theta(y^*_t) \quad (11)$$

Here $T(n)$ represents the length of the sentence of the $n$-th training sample. During inference, we leverage beam search, with the beam size set to 5 in our experiments.
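As a sketch, the per-sentence loss in Equation 11 under teacher forcing can be computed as below (PyTorch assumed; the padding index and batch layout are our assumptions):

```python
import torch.nn.functional as F

def caption_xe_loss(log_probs, targets, pad_idx=0):
    """Eq. (11): negative log-likelihood of the ground-truth words.

    log_probs: (B, T, vocab) per-step log-softmax outputs p_{t+1}
    targets:   (B, T) ground-truth word ids y*_t; teacher forcing means the
               inputs at step t+1 were the ground-truth words, not predictions
    """
    return F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                      targets.reshape(-1), ignore_index=pad_idx)
```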
Interpret-IC
Similar to the Base-IC model, the parameters $\theta'$ of Interpret-IC are trained by optimizing the cost function $C'$, which minimizes the sentence-level categorical cross-entropy loss along with a binary cross-entropy loss that pushes $mask_{pred}$ closer to the ground truth mask $mask_{gt}$, as shown in Equation 12:

$$C'(\theta') = -\left( \sum_{t=0}^{T(n)} \log p_\theta(y^*_t) + mask_{gt} \log(mask_{pred}) + (1 - mask_{gt}) \log(1 - mask_{pred}) \right) \quad (12)$$

During inference, similar to the Base-IC model, we leverage beam search with the beam size set to 5 in our experiments.
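A corresponding sketch of the joint objective in Equation 12 follows; the two terms are summed unweighted, as printed, and the names are illustrative:

```python
import torch.nn.functional as F

def interpret_ic_loss(log_probs, targets, mask_pred, mask_gt, pad_idx=0):
    """Eq. (12): caption cross-entropy plus binary cross-entropy that
    pulls mask_pred toward the rule-based ground-truth mask mask_gt."""
    xe = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                    targets.reshape(-1), ignore_index=pad_idx)
    # mask_pred must lie in (0, 1), e.g. the sigmoid output of the MLP;
    # mask_gt is the float 0/1 mask from Algorithm 1
    bce = F.binary_cross_entropy(mask_pred, mask_gt)
    return xe + bce
```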
4 Experimental Setup
Datasets
For experimental evaluation, we use the MSCOCO dataset with the splits of [Karpathy and Fei-Fei, 2015]. Table 1 summarizes the training, validation, and test splits.
Mean sentence length   11.3
Vocabulary             9,989
Sentences per image    5
Training images        113,287
Validation images      5,000
Test images            5,000

Table 1: Statistics of the MSCOCO dataset.
Local and Global Image Features
Spatial object ($a_v$) features are extracted in two different ways, and a global feature is derived from them:
• Faster R-CNN [Ren et al., 2015] in conjunction with ResNet-101 [He et al., 2016], trained on Visual Genome data by [Anderson et al., 2017], is used to extract the top 36 local object features ($a_{vj}$) of dimension 2048. These are pure visual features, and we refer to this set as Obj→RCNN.
• Since Obj→RCNN represents pure visual features without label information, following [Mogadala et al., 2018a; Mogadala et al., 2018b], we extracted semantically enriched labels denoting entities from the captions aligned to each image in the MSCOCO training set with a knowledge graph annotation tool, DBpedia Spotlight (https://github.com/dbpedia-spotlight/). In total, 812 unique, human-interpretable, already disambiguated labels are extracted. Further, a multi-label image classifier is trained with a sigmoid cross-entropy loss by fine-tuning VGG-16 [Simonyan and Zisserman, 2014], pre-trained on the training part of ILSVRC12, with the training images of MSCOCO. After training, we use the classifier to acquire the Top-15 entity labels for each image in the training, validation, and test sets of MSCOCO. To use entity labels similarly to Obj→RCNN features, we use knowledge graph embeddings [Ristoski and Paulheim, 2016] to generate 500-dimensional vectors for each entity label. (Note that these embeddings are different from the fastText vectors used to build $mask_{gt}$; they are analogous to pure visual features, but learned from the knowledge graph structure.) We refer to this set as Obj→VisualEntity, and sketch its construction after this list.
• The global visual features ($I_v$) of dimension 2048 are extracted by average pooling the Obj→RCNN features.
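The construction of Obj→VisualEntity features could look roughly as follows. This is a sketch under stated assumptions (torchvision's VGG-16 with its final layer replaced for the 812 labels, and a precomputed table of 500-d RDF2Vec entity vectors), not the authors' pipeline, and it omits the fine-tuning step:

```python
import torch
import torchvision.models as models

NUM_ENTITIES, EMB_DIM, TOP_K = 812, 500, 15

# VGG-16 backbone with a sigmoid multi-label head over the entity labels
vgg = models.vgg16(weights="IMAGENET1K_V1")
vgg.classifier[6] = torch.nn.Linear(4096, NUM_ENTITIES)
vgg.eval()  # assume fine-tuning on MSCOCO has already been done

def visual_entity_features(image_batch, entity_emb):
    """image_batch: (B, 3, 224, 224) normalized images.
    entity_emb: (812, 500) pretrained knowledge-graph (RDF2Vec) vectors.
    Returns Top-15 entity indices and their 500-d feature vectors per image."""
    with torch.no_grad():
        scores = torch.sigmoid(vgg(image_batch))   # (B, 812) label probabilities
    topk = scores.topk(TOP_K, dim=-1).indices      # (B, 15) entity-label ids
    feats = entity_emb[topk]                       # (B, 15, 500) a_v features
    return topk, feats
```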
Caption Model
Both the Base-IC and Interpret-IC models are built by initializing the input ($w_t$) word embeddings with GloVe vectors [Pennington et al., 2014] pretrained on the MSCOCO training captions corpus. The dimensions of the hidden units $h_t^{(1)}$, $h_t^{(2)}$ in L-1 and L-2 are set to 512, as are the hidden units of the shared layer $h_t^{(s)}$. All models are then trained with the Adam optimizer, with gradient clipping at a maximum norm of 1.0 and a mini-batch size of 50, for 25 epochs. The learning rate is initially set to 0.001 and is reduced by a factor of 10 if there is no improvement in the validation loss for 3 consecutive epochs.

Evaluation Measures
We first evaluate the generated captions based on correctness, which guarantees generation quality, using standard captioning metrics. Further, we check whether our proposed model with human-interpretable masking can generate diverse and interesting captions. For this, we leverage previously proposed metrics [Shetty et al., 2017; Deshpande et al., 2019], namely vocabulary size and novel captions, computed on the best (i.e., Top-1) generated caption. Vocabulary Size (VS) counts the unique words in the generated captions, and Novel Captions (NC) identifies the percentage of generated captions that are not seen in the training set.
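Both measures are straightforward to compute; a minimal sketch follows (whitespace tokenization and lowercasing are our assumptions, as the paper's exact normalization is not specified):

```python
def diversity_metrics(generated, train_captions):
    """Vocabulary Size (VS): unique words across generated captions.
    Novel Captions (NC): % of generated captions unseen in training."""
    vocab = {w for cap in generated for w in cap.lower().split()}
    train_set = {cap.lower().strip() for cap in train_captions}
    novel = sum(cap.lower().strip() not in train_set for cap in generated)
    return len(vocab), 100.0 * novel / max(len(generated), 1)
```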
5 Results
We compared our proposed Base-IC and Interpret-IC models along with other recent baselines. Table 2 shows the results obtained. It can be observed that the Interpret-IC model was able to improve over recent approaches while allowing better control over the caption generation process.

Model                                   BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
Adv-bs [Shetty et al., 2017]            -       23.9    -        -      16.7
CNN+CNN [Wang and Chan, 2018]           26.7    23.4    51.0     84.4   -
Convolutional-IC [Aneja et al., 2018]   31.6    …       …        …      …
[… et al., 2019]                        -       24.7    -        -      18.0
Base-IC + Obj→RCNN                      31.8    24.9    52.9     96.7   …
Base-IC + Obj→VisualEntity              32.1    24.8    53.6     96.9   18.0
Interpret-IC + Obj→VisualEntity         …       …       …        …      …

Table 2: Results with cross-entropy loss training on MSCOCO ("-" indicates a value not reported; "…" indicates a value not recoverable).
To understand the contribution made by the human-interpretable mask to caption generation, we qualitatively explored the captions generated by both the Base-IC and Interpret-IC models with visual entities, from two different perspectives. First, we observed the quality of the predicted mask in selecting the required visual entities for better coverage. Second, we checked whether the Interpret-IC model could overcome or correct mistakes made by the Base-IC model. In the following, we discuss each of these cases briefly by showing some examples.
Caption Coverage
We use visual entities that represent local objects in images so that they can be incorporated into the caption. However, this cannot simply be achieved with the Base-IC model. As seen in Figure 4, the Interpret-IC model weighs each of these objects differently based on the predicted mask, in contrast to the Base-IC model, which gives equal importance to each of them. Although the Base-IC model generated partially relevant captions, masking is shown to improve the coverage of local objects in the image: the selector is able to assign higher scores to prominent objects in the image, which increases the probability of covering them in the generated caption.

(a) Base-IC (RCNN): "A dog walking on a sidewalk next to a bike"
    Base-IC (Visual Entities): "A black bike on a street" (missing: Dog)
    Entity-Labels (Top-15): Bicycle (1.0), Dog (1.0), Cat (1.0), Tree (1.0), Base_on_balls (1.0), Water (1.0), Trail (1.0), Equestrianism (1.0), Hanging (1.0), City (1.0), Wood (1.0), Horse (1.0), Rock_music (1.0), Motorcycle (1.0), Sun (1.0)
    Interpret-IC (Visual Entities): "A white dog standing next to a bike"
    Selected Entity-Labels: Bicycle (0.85), Dog (0.78), Cat (0.4), Tree (0.3), Base_on_balls (0.002), Water (0.005), Trail (0.3), Equestrianism (0.33), Hanging (0.03), City (0.2), Wood (0.02), Horse (0.3), Rock_music (0.003), Motorcycle (0.45), Sun (0.21)

(b) Base-IC (RCNN): "A close up of a cake on a table"
    Base-IC (Visual Entities): "A birthday cake on a table" (missing: specificity)
    Entity-Labels (Top-15): Cake (1.0), Birthday_cake (1.0), Flower (1.0), Candle (1.0), Purple (1.0), Sprinkles (1.0), Chocolate_cake (1.0), Cupcake (1.0), Plastic (1.0), Red (1.0), Textile (1.0), Chocolate (1.0), Candy (1.0), Glass (1.0), Tablecloth (1.0)
    Interpret-IC (Visual Entities): "A birthday cake is decorated with pink and blue frosting"
    Selected Entity-Labels: Cake (0.9), Birthday_cake (0.95), Flower (0.3), Candle (0.4), Purple (0.55), Sprinkles (0.3), Chocolate_cake (0.65), Cupcake (0.35), Plastic (0.003), Red (0.1), Textile (0.25), Chocolate (0.4), Candy (0.45), Glass (0.05), Tablecloth (0.55)

Figure 4: Caption coverage examples (entities with mask_pred > 0.5 are highlighted in blue): (a) the local object (Dog) missing from the caption generated by Base-IC is included by Interpret-IC ("white dog"), providing better coverage; (b) where Base-IC misses details about the birthday cake, Interpret-IC generates a better and more interesting caption by highlighting the objects that need to be focused on.
Caption Correction
We also observe that, apart from providing better coverage of visual entities in the generated captions, masking plays a prominent role in caption correction. As seen in Figure 5, although the Base-IC model generated a partially relevant caption, Interpret-IC generated the most accurate caption through an effective selection of relevant visual entities. The selector is expected to assign lower scores to inappropriate (bird in Figure 5b) or wrongly detected objects (sheep in Figure 5a), thus encouraging the decoder to attend to more plausible entities.

(a) Base-IC (RCNN): "A dog laying on a lush green hillside"
    Base-IC (Visual Entities): "A herd of sheep grazing on a field"
    Entity-Labels (Top-15): Rock_music (1.0), Poaceae (1.0), Sheep (1.0), Grazing (1.0), Water (1.0), Grass (1.0), Landscape (1.0), Tree (1.0), Yellow_Sun (1.0), Goat (1.0), Meadow (1.0), Stream (1.0), Pasture (1.0), Single_(music) (1.0), Pond (1.0)
    Interpret-IC (Visual Entities): "A herd of cattle laying on a lush green hillside"
    Selected Entity-Labels: Rock_music (0.001), Poaceae (0.8), Sheep (0.2), Grazing (0.6), Water (0.54), Grass (0.58), Landscape (0.4), Tree (0.35), Yellow_Sun (0.1), Goat (0.1), Meadow (0.6), Stream (0.23), Pasture (0.41), Single_(music) (0.005), Pond (0.25)

(b) Base-IC (RCNN): "A group of ducks swim in the water"
    Base-IC (Visual Entities): "A group of birds swimming in the water"
    Entity-Labels (Top-15): Water (1.0), Duck (1.0), Bird (1.0), Pond (1.0), Swimming_(sport) (1.0), Stream (1.0), Bank (1.0), Drinking_water (1.0), Zebra (1.0), Equestrianism (1.0), Poaceae (1.0), Grass (1.0), Tree (1.0), Fish (1.0), Rock_music (1.0)
    Interpret-IC (Visual Entities): "A flock of ducks swimming in the water"
    Selected Entity-Labels: Water (0.85), Duck (0.9), Bird (0.45), Pond (0.56), Swimming_(sport) (0.61), Stream (0.63), Bank (0.34), Drinking_water (0.4), Zebra (0.001), Equestrianism (0.002), Poaceae (0.04), Grass (0.23), Tree (0.09), Fish (0.41), Rock_music (0.002)

Figure 5: Caption correction examples (entities with mask_pred > 0.5 are highlighted in blue): (a) Base-IC generates a caption with a wrongly detected object (sheep), while Interpret-IC includes "cattle" because a low weight (0.2) is assigned to "sheep", filtering out the wrong detection; (b) although Base-IC covers a correct object (birds), it is too general and fails to provide a more informative caption; Interpret-IC replaces it with the exact object by giving a large weight to the detected entity "duck".
Although our aim is not to achieve diverse captions, we examined whether our proposed Base-IC and Interpret-IC models generate a diverse and interesting best (i.e., Top-1) caption. We compared our models with other diverse caption generation baselines, comparing the best generated caption using the diversity measures described earlier. Table 3 shows the results attained, where NC is computed on the Top-1 generated caption and Base-bs and Adv-bs are the baselines of [Shetty et al., 2017]. We observe that our Interpret-IC model cannot exceed the scores of the baseline trained to generate diverse captions in an adversarial setting (i.e., Adv-bs). However, with less effort and simple masking, we see a significant jump over the standard caption model (i.e., Base-bs).

Metrics  Base-bs  Adv-bs  Base-IC  Interpret-IC
VS       756      …       443      862
NC       34.18    …       …        …

Table 3: Diversity comparison of the best (Top-1) generated caption; Base-IC and Interpret-IC use Obj→VisualEntity features ("…" as in Table 2).

Also, in Figure 6, we plot the unique unigrams and bigrams predicted at every word position. The plot shows that Interpret-IC has more unique unigrams at different word positions and is consistently higher for bigrams when compared against Base-IC with visual entities as features. This supports our hypothesis that Interpret-IC can produce more diverse captions, as it can alter the caption generation process.

Figure 6: Unique unigrams (a) and bigrams (b) observed at every word position of the generated captions.
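The per-position statistics behind Figure 6 can be computed with a simple counter; a minimal sketch follows (tokenization assumptions as before; n=1 gives the unigram curve):

```python
from collections import defaultdict

def unique_ngrams_per_position(captions, n=2):
    """Count unique n-grams starting at each word position (cf. Figure 6)."""
    seen = defaultdict(set)
    for cap in captions:
        tokens = cap.lower().split()
        for pos in range(len(tokens) - n + 1):
            seen[pos].add(tuple(tokens[pos:pos + n]))
    return {pos: len(grams) for pos, grams in sorted(seen.items())}
```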
6 Conclusion
In this paper, we aimed to address the problem of interpretable image captioning by leveraging knowledge graph entity features. Initially, we obtained local objects as visual entities in the image by grounding knowledge graph entities. Further, human-interpretable masking rules were developed to select those visual entities for generating desirable captions. Experimental results show that interpretability in caption generation can help to alter the caption generation process, hence allowing control and selection. In the future, we aim to improve the caption generation process by trying different masks and better sampling.
Acknowledgments
Aditya Mogadala was supported by the German Research Foundation (DFG) as part of SFB 1102 (Project-ID 232722074).
References
[Anderson et al., 2017] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998, 2017.
[Aneja et al., 2018] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5561–5570, 2018.
[Bordes et al., 2011] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured embeddings of knowledge bases. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[Chen et al., 2017] Tseng-Hung Chen, Yuan-Hong Liao, Ching-Yao Chuang, Wan-Ting Hsu, Jianlong Fu, and Min Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. arXiv preprint arXiv:1705.00930, 2017.
[Cornia et al., 2019] Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Show, control and tell: A framework for generating controllable and grounded captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8307–8316, 2019.
[Deshpande et al., 2019] Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander G. Schwing, and David Forsyth. Fast, diverse and accurate image captioning guided by part-of-speech. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10695–10704, 2019.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[Karpathy and Fei-Fei, 2015] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[Li et al., 2018] Dianqi Li, Qiuyuan Huang, Xiaodong He, Lei Zhang, and Ming-Ting Sun. Generating diverse and accurate visual captions by comparative adversarial learning. arXiv preprint arXiv:1804.00861, 2018.
[Lu et al., 2016] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. arXiv preprint arXiv:1612.01887, 2016.
[Mogadala et al., 2018a] Aditya Mogadala, Umanga Bista, Lexing Xie, and Achim Rettinger. Knowledge guided attention and inference for describing images containing unseen objects. In European Semantic Web Conference, pages 415–429. Springer, 2018.
[Mogadala et al., 2018b] Aditya Mogadala, Bhargav Kanuparthi, Achim Rettinger, and York Sure-Vetter. Discovering connotations as labels for weakly supervised image-sentence data. In Companion Proceedings of The Web Conference 2018, pages 379–386, 2018.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[Ristoski and Paulheim, 2016] Petar Ristoski and Heiko Paulheim. RDF2Vec: RDF graph embeddings for data mining. In International Semantic Web Conference, pages 498–514. Springer, 2016.
[Shen et al., 2017] Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, and Dietrich Klakow. Estimation of gap between current language models and human performance. Proc. Interspeech 2017, pages 553–557, 2017.
[Shen et al., 2019] Xiaoyu Shen, Jun Suzuki, Kentaro Inui, Hui Su, Dietrich Klakow, and Satoshi Sekine. Select and attend: Towards controllable content selection in text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 579–590, 2019.
[Shetty et al., 2017] Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. Speaking the same language: Matching machine to human captions by adversarial training. arXiv preprint arXiv:1703.10476, 2017.
[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[Vijayakumar et al., 2016] Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424, 2016.
[Vinyals et al., 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[Wang and Chan, 2018] Qingzhong Wang and Antoni B. Chan. CNN+CNN: Convolutional decoders for image captioning. arXiv preprint arXiv:1805.09019, 2018.
[Wang et al., 2018] Josiah Wang, Pranava Swaroop Madhyastha, and Lucia Specia. Object counts! Bringing explicit detections back into image captioning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2180–2193, 2018.
[Xu et al., 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.