Towards Diverse and Accurate Image Captions via Reinforcing Determinantal Point Process
Qingzhong Wang and Antoni B. Chan
Department of Computer Science, City University of Hong Kong [email protected], [email protected]
Abstract
Although significant progress has been made in the field of automatic image captioning, it is still a challenging task. Previous works normally pay much attention to improving the quality of the generated captions but ignore the diversity of captions. In this paper, we combine determinantal point process (DPP) and reinforcement learning (RL) and propose a novel reinforcing DPP (R-DPP) approach to generate a set of captions with high quality and diversity for an image. We show that R-DPP performs better on accuracy and diversity than using noise as a control signal (GANs, VAEs). Moreover, R-DPP is able to preserve the modes of the learned distribution. Hence, beam search algorithm can be applied to generate a single accurate caption, which performs better than other RL-based models.
Image captioning, which combines the fields of computer vision (CV) and natural language processing (NLP), is a challenging task that has drawn much attention from the two communities, and significant progress has been achieved. Earlier works (Kulkarni et al., 2013; Farhadi et al., 2010; Fang et al., 2015) generally directly employ vision and language models. However, these two-stage models cannot be trained in an end-to-end manner, which limits their performance.
Recently, CNN-LSTM models have become popular (Vinyals et al., 2015; Xu et al., 2015). CNN-LSTM models are typically composed of three modules: (1) a visual CNN, (2) a language LSTM, and (3) the connection module between them, which can be trained in an end-to-end manner. More powerful captioning models were later proposed (Anderson et al., 2017; Rennie et al., 2017; Wang et al., 2019) and trained using reinforcement learning (RL), where the evaluation metric (e.g., CIDEr) is used as the reward function. As a result, the generated captions obtain high quality according to the most popular metrics, such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016).
However, most of the above models do not focus on the diversity of captions. While directly maximizing the metrics using RL (Rennie et al., 2017) significantly improves the metric scores, the generated captions lack diversity even though they are randomly drawn from the learned distribution (Wang and Chan, 2019). The lack of diversity in the captions is further exacerbated when using beam search to find the mode of the learned distribution.
The main issue of RL-based methods that leads to generating less diverse captions is that they only consider the quality (as measured by BLEU or CIDEr) of samples during training.
To address this issue, in this paper, we propose a novel approach that combines RL and determinantal point processes (DPP) (Kulesza and Taskar, 2012) to generate both accurate and diverse image captions. Inspired by DPPs, which account for the quality and diversity of subsets, we first propose a new metric that is able to reflect the quality and diversity of a set of captions. We then maximize the proposed metric score using RL, which is equivalent to a DPP training process. We evaluate our model using the diversity metrics from (Wang and Chan, 2019), and our proposed R-DPP model achieves both high accuracy and high diversity scores. In addition, R-DPP preserves the modes of the learned distribution: applying the beam search algorithm to generate one high-quality caption yields better performance than the baseline captioning model. Moreover, R-DPP outperforms its counterparts on the oracle test (see Table 2).
Related Work
Diverse image captioning.
Recently, generating diverse captions has received much attention, and a variety of captioning models have been developed, such as CVAE (Wang et al., 2017), CGAN (Dai et al., 2017), GroupTalk (Wang et al., 2016), GroupCap (Chen et al., 2018a), POS (Deshpande et al., 2018) and SCT (Cornia et al., 2018). CVAE and CGAN employ random noise vectors to control the difference among the generated captions. However, the diversity is highly related to the variance of the noise, which makes it difficult to balance diversity and accuracy. GroupTalk employs multiple captioners and a classifier to generate diverse captions. Each captioner generates one caption and the classifier is used to control the diversity among the captions. However, the computational cost is high due to its use of multiple captioners. GroupCap considers the structure relevance and diversity constraint to generate both accurate and diverse captions, in which VP-trees are constructed. POS introduces part-of-speech (POS) tags to control the difference among captions, and contains two branches: 1) POS tag prediction, 2) word prediction. The same POS tag could result in using different words (synonyms), leading to diversity. Instead of employing POS tags as control signals, SCT applies noun chunks that are obtained by dependency parsing (Chen and Manning, 2014). Compared with the above captioning models, our proposed RL using DPP is much simpler and more efficient, does not require any other branches or control signals, and can be applied to any baseline captioning model.
Determinantal point process (DPP).
Given a discrete set $\mathcal{X} = \{x_1, x_2, \cdots, x_N\}$, a DPP $\mathcal{P}$ measures the probability of each subset $X$ of $\mathcal{X}$, which is defined as (Kulesza and Taskar, 2012):

$$P_L(X) = \frac{\det(L_X)}{\det(L + I)}, \tag{1}$$

where $L$ is a positive semidefinite matrix, representing an L-ensemble, $I$ denotes the $N \times N$ identity matrix and $\det(L + I) = \sum_{X \subseteq \mathcal{X}} \det(L_X)$. Generally, $L = [L_{ij}]$ can be decomposed as a Gram matrix with elements $L_{ij} = q_i \phi_i^T \phi_j q_j$, where $q_i$ denotes the quality of the $i$th element and $s_{ij} = \phi_i^T \phi_j$ denotes the similarity between the $i$th and $j$th elements, where $\|\phi_i\| = 1$.
(Footnote: A captioner could be a captioning model.)
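As an illustration of Eq. (1), the following is a minimal sketch on a toy ground set of three items. The quality scores and feature vectors are made-up values, and the helper names `l_ensemble` and `subset_probability` are hypothetical, introduced only for this example. It checks numerically that the subset probabilities sum to one and that a pair of dissimilar items is more probable than a pair of near-duplicates.

```python
import itertools

import numpy as np


def l_ensemble(quality, features):
    """Build L with L_ij = q_i * phi_i^T phi_j * q_j (phi_i is unit-normalised)."""
    phi = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = phi @ phi.T
    return np.outer(quality, quality) * similarity


def subset_probability(L, subset):
    """P_L(X) = det(L_X) / det(L + I); the empty subset has determinant 1."""
    idx = list(subset)
    numerator = np.linalg.det(L[np.ix_(idx, idx)]) if idx else 1.0
    return numerator / np.linalg.det(L + np.eye(L.shape[0]))


# Toy data (assumed values): items 0 and 1 are near-duplicates, item 2 differs.
quality = np.array([1.0, 1.0, 1.0])
features = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
L = l_ensemble(quality, features)

# Sum over all 2^3 subsets; the normaliser det(L + I) makes this equal to 1.
total = sum(
    subset_probability(L, s)
    for r in range(4)
    for s in itertools.combinations(range(3), r)
)
p_similar = subset_probability(L, (0, 1))
p_diverse = subset_probability(L, (0, 2))
print(total, p_similar, p_diverse)
```

With equal quality scores, the only difference between the two pairs is their similarity, so the gap between `p_similar` and `p_diverse` isolates the diversity term of the DPP.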
A DPP is trained by maximizing the log-likelihood $\log P_L(X)$, where the subset with larger $\det(L_X)$ will be assigned a higher probability. Inference involves finding the subset with highest posterior probability (MAP). DPP has been used in applications that require both quality and diversity: text summarization (Kulesza and Taskar, 2012), video summarization (Zhang et al., 2016), recommendation (Chen et al., 2018b) and neural conversation (Song et al., 2018).
We consider each caption as an item, and define the quality of a caption using CIDEr,

$$q_i = \mathrm{CIDEr}(c_i, C_{GT}), \tag{2}$$

where $c_i$ denotes the $i$th caption in a subset, $C_{GT}$ denotes human annotations and $\mathrm{CIDEr}(\cdot, \cdot)$ is the CIDEr score. We define the similarity between captions as (i.e., "self-CIDEr" in (Wang and Chan, 2019)),

$$s_{ij} = \mathrm{CIDEr}(c_i, c_j). \tag{3}$$

The $L$ matrix in DPP is then

$$L = q^T q \odot S, \tag{4}$$

where $q = [q_1, \cdots, q_N]$, $S = [s_{ij}]$, and $\odot$ denotes element-wise multiplication.
Let $M(\theta)$ be the captioning model and $C = \{c_1, c_2, \cdots, c_m\}$ a subset of $m$ captions sampled from $M(\theta)$. The probability of $C$ can be measured with (1), using the determinants of $L_C$ and $L + I$. Unfortunately, computing $L$ is intractable since the number of possible captions $N$ is huge, roughly $|D|^{l_m}$, where $|D|$ is the dictionary size (10,000) and $l_m$ is the caption length (16). Although computing $L$ is intractable, we note that $L$ is a constant w.r.t. $\theta$ for a fixed dictionary $D$ and caption length $l_m$. Thus, the denominator in (1) can be ignored when maximizing the likelihood of the generated captions $C$ w.r.t. $\theta$,

$$\theta^* = \arg\max_\theta P_L(C) = \arg\max_\theta \log(\det(L_C)).$$

To compute the quality scores and similarity matrix, we should sample a set of captions $C$ from $M(\theta)$, and thus we cannot directly calculate the gradient of $\log(\det(L_C))$ w.r.t. $\theta$.
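To make Eq. (4) and the objective $\log(\det(L_C))$ concrete, here is a small sketch for a sampled set of three captions. It does not compute CIDEr: the quality vector `q` and the two similarity matrices are assumed toy values standing in for Eqs. (2) and (3), and `dpp_kernel` and `log_det` are hypothetical helper names.

```python
import numpy as np


def dpp_kernel(q, S, eps=1e-6):
    """L_C = (q^T q) ⊙ S as in Eq. (4); eps * I is added for numerical stability."""
    q = np.asarray(q, dtype=float)
    S = np.asarray(S, dtype=float)
    return np.outer(q, q) * S + eps * np.eye(len(q))


def log_det(L_C):
    """log(det(L_C)), computed via slogdet to avoid under/overflow."""
    sign, value = np.linalg.slogdet(L_C)
    if sign <= 0:
        raise ValueError("L_C must be positive definite")
    return value


# Assumed toy values: per-caption quality and pairwise similarity.
q = np.array([1.0, 0.9, 0.8])
S_similar = np.array([[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]])
S_diverse = np.array([[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]])

print("near-duplicate set:", log_det(dpp_kernel(q, S_similar)))
print("diverse set:       ", log_det(dpp_kernel(q, S_diverse)))
```

Since $L_C = \mathrm{diag}(q)\, S\, \mathrm{diag}(q)$, the objective separates as $\log\det(L_C) = 2\sum_i \log q_i + \log\det(S)$ (ignoring the small `eps` term), so raising caption quality and lowering pairwise similarity both increase it.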
Alternatively, we can first compute the derivative

$$\frac{\partial \log(\det(L_C))}{\partial L_{C_{ij}}} = \hat{L}_{C_{ij}},$$

where $L_{C_{ij}}$ is the element of $L_C$ and $\hat{L}_{C_{ij}}$ is the element of $L_C^{-1}$. Considering the derivative $\hat{L}_{C_{ij}}$, its sign indicates whether we should reduce or increase $L_{C_{ij}}$ to enlarge $\log(\det(L_C))$.
Recall that the reward function in (Rennie et al., 2017) is the expectation of CIDEr,

$$R(\theta) = \sum_{i=1}^{m} q_i\, p_\theta(c_i), \tag{5}$$

and the corresponding policy gradient is

$$\nabla_\theta R(\theta) = \sum_{i=1}^{m} q_i \nabla_\theta \log(p_\theta(c_i)) \cdot p_\theta(c_i). \tag{6}$$

Eq. (6) shows that the probability of the high-quality captions will increase, and finally the model could tend to generate captions that have high quality but lack diversity.
The main issue of using (5) is that it only accounts for the quality of captions. To promote diversity, we employ a new reward function that considers each pair of captions in $C$,

$$R(\theta) = \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}}\, p_\theta(c_i)\, p_\theta(c_j), \tag{7}$$

where $\mathrm{sign}(x)$ is the sign of $x$, and $p_\theta(c_i)$ is the probability of the $i$th caption according to $M(\theta)$. Note that $p_\theta(c_i) p_\theta(c_j)$ is the joint probability of the $i$th and $j$th captions, since the captions are sampled independently. Our reward function considers both the quality of captions as well as the similarity among captions (see Eq. (4)), and thus is able to balance the quality and diversity. The corresponding policy gradient is (see supplemental for derivation):

$$\nabla_\theta R(\theta) = 2 \sum_{i=1}^{m} \nabla_\theta \log(p_\theta(c_i))\, p_\theta(c_i) \underbrace{\sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}}\, p_\theta(c_j)}_{E[\mathrm{sign}(\hat{L}_{C_{ij}}) L_{C_{ij}}]}, \tag{8}$$

which has the same form as (6), but here we consider both quality and similarity among captions.
Experimental setup.
We conduct our experiments on the MSCOCO dataset, which has 123,287 annotated images, each with at least 5 captions. Following (Rennie et al., 2017), we use 5k images for validation, 5k for testing and the remaining for training. Our baseline captioning model is based on Att2in (Rennie et al., 2017). We first train the model for 100 epochs using cross-entropy loss, and then refine it for another 100 epochs using our policy gradient in (8). During training, we apply Adam with learning rate 0.0004. For comparison, we also refine the baseline model for 100 epochs using the original policy gradient in (6). We also compare with CGAN, GMM-CVAE (Wang et al., 2017), SCST (Rennie et al., 2017), and XE+λCIDEr (Wang and Chan, 2019). The diversity metric is self-CIDEr diversity, which is shown to be more correlated to human judgment (Wang and Chan, 2019).
Figure 1: Performance on diversity and accuracy. The captions are generated via random sampling from the learned distribution. For each model we sample 10 captions to compute the self-CIDEr diversity scores (Wang and Chan, 2019), and the accuracy score is the average of CIDEr scores. The CGAN variants use standard deviations of 1 and 10 to train CGANs, and greedy search is used for inference. m is the number of samples used to train our R-DPP.
(Footnote: Adding a small constant $\epsilon I$ to $L_C$ ensures invertibility.)
(Footnote: Note that the expectation of $L_{C_{ij}}$ could be enlarged or reduced based on $\mathrm{sign}(\hat{L}_{C_{ij}})$, which is different from Eq. (5), where the expectation of $q_i$ is always enlarged.)
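The policy gradient in (8) used for the refinement stage can be illustrated with a toy sketch. The real captioning model $M(\theta)$ is not available here, so the example assumes a softmax over just $m = 3$ candidate captions as a stand-in for $p_\theta$, and uses made-up quality and similarity values. All function names are hypothetical, and this is not the authors' training code. The sketch computes the per-caption factor that multiplies $\nabla_\theta \log p_\theta(c_i)$ in (8) and compares the resulting gradient with a finite-difference estimate of the reward in (7).

```python
import numpy as np


def pairwise_coefficients(L_C, eps=1e-6):
    """A_ij = sign(L_hat_ij) * L_ij, with L_hat = (L_C + eps * I)^{-1}."""
    L_hat = np.linalg.inv(L_C + eps * np.eye(L_C.shape[0]))
    return np.sign(L_hat) * L_C


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def reward(p, A):
    """Eq. (7): R = sum_ij A_ij p_i p_j, with A treated as constant in theta."""
    return float(p @ A @ p)


def caption_weights(p, A):
    """Factor multiplying grad log p(c_i) in Eq. (8): 2 p_i sum_j A_ij p_j."""
    return 2.0 * p * (A @ p)


def policy_gradient(theta, A):
    """Eq. (8) for the toy softmax model, where grad log p_i = e_i - p."""
    p = softmax(theta)
    grad_log_p = np.eye(len(p)) - p  # row i holds e_i - p
    return caption_weights(p, A) @ grad_log_p


# Assumed toy values: captions 0 and 1 are similar, caption 2 is different.
q = np.array([1.2, 1.0, 0.8])
S = np.array([[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]])
L_C = np.outer(q, q) * S
A = pairwise_coefficients(L_C)
theta = np.array([0.3, -0.2, 0.1])

analytic = policy_gradient(theta, A)
h = 1e-6
numeric = np.array([
    (reward(softmax(theta + h * e), A) - reward(softmax(theta - h * e), A)) / (2 * h)
    for e in np.eye(3)
])
print(analytic, numeric)
```

In this toy setting the diagonal entries of `A` are positive (quality terms are pushed up) while the entry for the highly similar pair (0, 1) is negative (its similarity is pushed down), which is the balancing behaviour described for the reward in (7).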
Results.
Fig. 1 shows the performance of different models in the diversity-accuracy space. Human annotations achieve relatively high diversity and accuracy, and there is still a large gap between the proposed models and human annotations. Our R-DPP model slightly improves the accuracy of SCST and the baseline model (Att2in) when m = 2, but the diversity score roughly doubles (0.2 to 0.4). Our R-DPP achieves diversity scores comparable to XE+λCIDEr, but the captions generated by XE+λCIDEr have lower accuracy compared to R-DPP. By maximizing $\det(L_C)$, our R-DPP simultaneously improves the quality and suppresses the similarity among captions (improves diversity). Comparing GMM-CVAE and R-DPP, both methods can generate captions with similar diversity, while R-DPP (m = 6) has higher accuracy (0.95 for R-DPP vs 0.8 for GMM-CVAE), which indicates that R-DPP better approximates the modes of the ground-truth distribution. Finally, the R-DPP curve shows that the number of samples m used during training balances the diversity and accuracy of the model. A larger m leads to a more diverse set of captions, although it also incurs higher training computational cost.
Another advantage of R-DPP is that it can be used to generate a single high-quality caption for an image. Table 1 shows the comparison between R-DPP and the state-of-the-art models. Compared with SCST, R-DPP improves the CIDEr score from 1.114 to 1.222, and the other metric scores are also improved by around 5% or larger. Compared with Hieratt-RL (state-of-the-art), R-DPP obtains a similar CIDEr score; however, the Hieratt model cannot generate diverse captions.
Model                                bw   B-4     M       R       C       S
Adaptive-XE (Lu et al., 2017)        3    0.332   0.266   -       1.085   -
Updown-XE (Anderson et al., 2017)    5    0.362   0.270   0.564   1.135   0.203
Updown-RL (Anderson et al., 2017)    5    0.363   0.277   0.569   1.201   0.214
DISC-RL (Luo et al., 2018)           2    0.363   0.273   0.571   1.141   0.211
Hieratt-XE (Wang et al., 2019)       3    0.362   0.275   0.566   1.148   0.206
Hieratt-RL (Wang et al., 2019)       3    0.376   0.278   0.581   1.217   0.215
SCST (Rennie et al., 2017)           -    0.333   0.263   0.553   1.114   -
Att2in-XE (Rennie et al., 2017)      -    0.313   0.260   0.543   1.013   -
XE+5CIDEr (Wang and Chan, 2019)      3    0.382   0.277   0.579   1.172   0.206
XE+10CIDEr (Wang and Chan, 2019)     3    0.378   0.276   0.580   1.174   0.207
XE+20CIDEr (Wang and Chan, 2019)     3    0.375   0.276   0.579   1.173   0.209
Our R-DPP (m = 2)                    3    0.371   0.279   0.579   …       …
Our R-DPP (m = 3)                    3    0.369   0.278   0.577   1.216   0.214
Our R-DPP (m = 4)                    3    0.360   0.280   0.572   1.198   0.214
Our R-DPP (m = 5)                    3    0.357   0.278   0.568   1.179   0.212
Our R-DPP (m = 6)                    3    0.352   0.276   0.566   1.146   0.208
Our R-DPP (m = 7)                    3    0.347   0.272   0.562   1.124   0.206
Table 1: Performance on single caption generation. The caption is generated using beam search (bw is the beam width). m is the number of samples used during training of our R-DPP. The "-XE" suffix indicates training using cross-entropy loss, and "-RL" means fine-tuned with RL. {B, M, R, C, S} are abbreviations for BLEU, METEOR, ROUGE, CIDEr, and SPICE.
Fig. 1 and Table 1 show the effectiveness of R-DPP on generating both diverse and accurate captions, but they do not consider the optimal selection of m. Hence, we conduct experiments on the oracle test (see Table 2), i.e., the upper bound of each metric. R-DPP outperforms other methods, providing the highest-quality caption based on generating 20 captions. Even when sampling 10 captions, R-DPP obtains higher scores. With the increase of m, the scores increase in the beginning, but then fall, e.g., when we sample 20 captions, the CIDEr score rises from 1.563 to 1.700 when m increases from 2 to 5, after which it falls to 1.684. Also, when we sample 10 captions, R-DPP (m = 5) performs better. Thus, using m = 5 could be a better choice to balance diversity and accuracy, and it also obtains the highest-quality caption. We show more qualitative examples in the supplemental.
Model                # samples   B-4     M       R       C
Our R-DPP (m = 2)    10          0.407   0.349   0.659   1.495
                     20          0.443   0.365   0.677   1.563
Our R-DPP (m = 3)    10          0.442   0.367   0.677   1.567
                     20          0.494   0.388   0.702   1.656
Our R-DPP (m = 4)    10          0.455   0.374   0.686   1.585
                     20          0.518   0.400   0.713   1.691
Our R-DPP (m = 5)    10          0.463   0.375   0.688   1.585
                     20          0.527   …       …       …
Our R-DPP (m = 6)    10          0.458   0.374   0.686   1.585
                     20          0.528   0.403   …       …
Our R-DPP (m = 7)    10          0.452   0.373   0.683   1.545
                     20          …       …       …       …
Table 2: Oracle (upper bound) performance based on each metric. For XE+λCIDEr, SCST and our R-DPP models, we randomly sample captions from the trained model, and the results of other models are from their papers. The blue numbers are the highest scores when sampling 10 captions and the bold ones are the highest scores when sampling 20 captions.
We have presented the reinforcing DPP (R-DPP) model, which is a simple but efficient method for training a caption model to generate both diverse and accurate captions. Compared with other models, R-DPP obtains a similar diversity score, but a much higher accuracy score. In addition, the state-of-the-art oracle performance is significantly improved by R-DPP. In the future, we believe that more quality and diversity measurements should be introduced into R-DPP. It is also possible to extend R-DPP to other text generation tasks, such as dialog and machine translation, in order to provide diverse high-quality choices to the users.
(Footnote: We train CGAN without using rollout, which is different from (Dai et al., 2017).)
(Footnote: The accuracy score of human annotations is the leave-one-out CIDEr score as in (Wang and Chan, 2019).)
References
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In ECCV.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-up and top-down attention for image captioning and vqa. arXiv preprint arXiv:1707.07998.
Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.
Fuhai Chen, Rongrong Ji, Xiaoshuai Sun, Yongjian Wu, and Jinsong Su. 2018a. Groupcap: Group-based image captioning with structured relevance and diversity constraints. In CVPR.
Laming Chen, Guoxin Zhang, and Eric Zhou. 2018b. Fast greedy map inference for determinantal point process to improve recommendation diversity. In NIPS, pages 5622–5633.
Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2018. Show, control and tell: A framework for generating controllable and grounded captions. CoRR, abs/1811.10652.
Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. 2017. Towards diverse and natural image descriptions via a conditional gan. In ICCV.
M. Denkowski and A. Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In EACL Workshop on Statistical Machine Translation.
Aditya Deshpande, Jyoti Aneja, Liwei Wang, Alexander Schwing, and David A Forsyth. 2018. Diverse and controllable image captioning with part-of-speech guidance. arXiv preprint arXiv:1805.12589.
Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2015. From captions to visual concepts and back. In CVPR.
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In ECCV.
Alex Kulesza and Ben Taskar. 2012. Learning determinantal point processes. arXiv preprint arXiv:1202.3738.
G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.
C.-Y. Lin. 2004. Rouge: A package for automatic evaluation of summaries. In ACL Workshop.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In CVPR.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In CVPR.
Yiping Song, Rui Yan, Yansong Feng, Yaoyuan Zhang, Dongyan Zhao, and Ming Zhang. 2018. Towards a neural conversation model with diversity net using determinantal point processes. In AAAI.
R. Vedantam, C. Lawrence Zitnick, and D. Parikh. 2015. Cider: Consensus-based image description evaluation. In CVPR.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
Liwei Wang, Alexander Schwing, and Svetlana Lazebnik. 2017. Diverse and accurate image description using a variational auto-encoder with an additive gaussian encoding space. In NIPS.
Qingzhong Wang and Antoni B. Chan. 2019. Describing like humans: on diversity in image captioning. CoRR, abs/1903.12020.
Weixuan Wang, Zhihong Chen, and Haifeng Hu. 2019. Hierarchical attention network for image captioning. In AAAI.
Zhuhao Wang, Fei Wu, Weiming Lu, Jun Xiao, Xi Li, Zitong Zhang, and Yueting Zhuang. 2016. Diverse image captioning via grouptalk. In IJCAI, pages 2957–2964. AAAI Press.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. 2016. Video summarization with long short-term memory. In ECCV, pages 766–782. Springer.
The supplemental is arranged as follows:
• Details of the gradient computation.
• Qualitative examples of diverse image captions.
A Gradient Computation
We show how to compute the policy gradient in Eq. (8) in our paper. Recall that the reward function is defined as follows:

$$R(\theta) = \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}}\, p_\theta(c_i)\, p_\theta(c_j). \tag{9}$$

Note that only $p_\theta(\cdot)$ is a function of $\theta$, then we have

$$\nabla_\theta R(\theta) = \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}} \left( \nabla_\theta p_\theta(c_i)\, p_\theta(c_j) + \nabla_\theta p_\theta(c_j)\, p_\theta(c_i) \right) \tag{10}$$

$$= \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}} \nabla_\theta p_\theta(c_i)\, p_\theta(c_j) + \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}} \nabla_\theta p_\theta(c_j)\, p_\theta(c_i) \tag{11}$$

$$= \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}} \nabla_\theta \log(p_\theta(c_i))\, p_\theta(c_i)\, p_\theta(c_j) + \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}} \nabla_\theta \log(p_\theta(c_j))\, p_\theta(c_j)\, p_\theta(c_i) \tag{12}$$

$$= 2 \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}} \nabla_\theta \log(p_\theta(c_i))\, p_\theta(c_i)\, p_\theta(c_j) \tag{13}$$

$$= 2 \sum_{i=1}^{m} \nabla_\theta \log(p_\theta(c_i))\, p_\theta(c_i) \underbrace{\sum_{j=1}^{m} \mathrm{sign}(\hat{L}_{C_{ij}})\, L_{C_{ij}}\, p_\theta(c_j)}_{E[\mathrm{sign}(\hat{L}_{C_{ij}}) L_{C_{ij}}]}. \tag{14}$$

Since $L_C$ and $\hat{L}_C$ are symmetric matrices, we can derive Eq. (13) from Eq. (12). Note that $\frac{\partial \log p_\theta}{\partial \theta} \equiv \frac{1}{p_\theta} \frac{\partial p_\theta}{\partial \theta}$ for $p_\theta > 0$; hence, we obtain Eq. (12) from Eq. (11).
B Qualitative Examples
We show more qualitative results of R-DPP. Fig. 2 to 5 show the comparison between R-DPP and other models, and Fig. 6 to 9 show the generated captions by R-DPP with different numbers of samples during training. Compared with other methods, our R-DPP could generate more fluent and diverse captions. We find that R-DPP is able to generate captions with different sentence structures (syntactic diversity), such as using synonyms, redundant and concise descriptions.
Human annotations:
1. a man sits in a diner photographing his meal
2. photographer taking a picture of a meal in a small restaurant
3. a man taking a photo of food on a table
4. a man takes a picture of his food in a restaurant
5. a man taking a picture of his meal at a diner table
Att2in-XE:
1. the man takes a picture in front on a bowl
2. a man taking a selfie of a view of a white pizza
3. person with close photo of a personal reflection on the pizza
4. lady in a taking in picture in front of
5. a woman takes a camera of a very glass window
6. a man in the waiting taking while woman
7. a man sits in a chair holding a white and a pair of wine sitting around
8. a person looking at himself and sitting on a table
9. a guy looking at something grilled black topping pizza
10. a man taking a photo of a small slice of pizza
GMM-CVAE:
1. a man is eating a meal at a restaurant
2. a man taking a bite of a pizza with a fork
3. a man is eating a meal at a restaurant
4. a man is eating a piece of bread
5. a man is taking a bite of a pizza
6. a man is taking a picture of a man in a white shirt
7. a man is holding a bowl of food on a table
8. a man in a red shirt is looking at a pizza
9. a man sitting at a table with a plate of food
10. a large plate of food on a table
SCST:
1. a man taking a picture of a pizza
2. a woman taking a picture of a pizza
3. a man taking a picture of a pizza
4. a man taking a picture of a pizza
5. a woman taking a picture of a pizza
6. a man taking a picture of a pizza
7. a man taking a picture of a pizza
8. a man taking a picture of a pizza
9. a man taking a picture of a pizza
10. a man taking a picture of a pizza
XE+5CIDEr:
1. a man taking a picture of a pizza with a pizza
2. a man taking a picture of a pizza with a slice of pizza
3. a man taking a picture of a pizza with a pizza
4. a man taking a picture of pizza on a table
5. a man taking a picture of pizza on a plate
6. a man taking a picture of a pizza with a pizza
7. a man taking a picture of a pizza with a
8. a man taking a picture of a pizza on a table
9. a man taking a picture of a pizza with a pizza
10. a man taking a picture of a pizza with a pizza
XE+10CIDEr:
1. a man taking a picture of a pizza with a camera
2. a man taking a picture of pizza on a table with a camera
3. a man taking a picture of a pizza with a camera
4. a man taking a picture of pizza with a persons of
5. a man taking a picture of a pizza with a knife
6. a man taking a picture of a pizza with a salad
7. a person taking a picture of a pizza with a salad
8. a man taking a picture of a pizza with a salad
9. a person taking a picture of a pizza with a salad
10. a person taking a camera of a pizza with a salad
R-DPP(m=5):
1. a man taking a photo of a pizza in a restaurant
2. a man taking a photo of a pizza in a restaurant
3. a woman taking a selfie in front of a pizza
4. a close up of a person taking a picture of a pizza
5. a person taking a picture of a pizza in a restaurant
6. a man taking a picture of a pizza on a table
7. a woman taking a picture of a pizza in the camera
8. a man taking a photo of a pizza on a plate
9. a man taking a picture of a pizza on a plate
10. a woman taking a picture of a pizza in a restaurant
R-DPP(m=7):
1. a man taking a photo of a pizza with his side
2. a man taking a picture of a pizza
3. a man taking a picture of a pizza on a plate
4. a man taking a selfie at a restaurant
5. a man taking a picture of a pizza on a plate
6. a person holding a camera in front of a pizza on a plate
7. a man taking a photo of a pizza
8. a man taking a photo of a pizza on a plate
9. a man taking a photo of a pizza while
10. a man in a suit holding a camera in front of a pizza with wine
Figure 2:
Human annotations:
1. different assortment of noodles and vegetables sitting in a pot and tray
2. a pot of pasta cooking on the stove next to a tray
3. a large pot of water and pasta sits on top of a stove
4. an image of food items on top of the stove
5. a pot of noodles is being cooked with more ingredients beside it
Att2in-XE:
1. a trey of water and some fruit on it that
2. an food is cup sits on a table
3. the large meal during even a red mixture
4. hot plate displays a built into a bowl of sauces and a wood in top of
5. a piled next to a spoon and giving a fruit in a spoon
6. a countertop meal on a fork over in and sink
7. some bowl of stuff with pudding and man
8. a closeup is into a table that is left as has mustard and
9. a plate of food with some mexican
10. a measuring bowl of with front of a bowls and a
GMM-CVAE:
1. a pot of food is being cooked in a pan
2. a blender
3. a pot of food is cooking
4. bt a
Figure 3:
Human annotations:
1. a table with a sandwich and two cups of coffee
2. sliced sandwich with tomatoes on a plate on a table
3. theres a ham sandwich and coffee for lunch on the table
4. a sandwich on a lace table cloth with coffee cups
5. two mugs next to a white plate with a sandwich on it
Att2in-XE:
1. a cheese and ice cream sit on a table
2. a sandwich is cut at and a styrofoam cup of coffee the
3. two pieces of beer on on a plate with coffee and cake cup
4. cups of coffee with a cup of coffee a coffee mug in one and tiny cup
5. a plate of a pizza sitting next to a cup of coffee on a white table
6. a table with papers and cup of coffee
7. a slice of ice tea is and a beverage of coffee
8. a white sandwich with egg and eggs on and a cup
9. a baked with white plate top a cup a coffee cup of coffee
10. a coffee on a plate with some tea and and two cups
GMM-CVAE:
1. a sandwich and a cup of coffee on a table
2. a close up of a plate of food
3. a table with a sandwich and a cup of coffee
4. a meal of eggs and a sandwich
5. a sandwich and a cup of coffee on a table
6. a close up of a plate of food
7. a table with a plate of food and a cup of coffee
8. a couple of plates that are sitting on a table
9. a plate with a sandwich and a cup of coffee
10. a table with a sandwich and a cup of coffee
SCST:
1. a sandwich and sitting on a plate with a cup of coffee
2. a plate of food and sitting on a table
3. a cup of coffee and sitting on a table
4. a sandwich and sitting on a plate with a cup of coffee
5. a sandwich and sitting on a table with a cup
6. a plate of food and sitting on a table
7. a cup of coffee and sitting on a table
8. a cup of coffee and sitting on a table
9. a sandwich and sitting on a table with a cup of coffee
10. a cup of coffee and sitting on a table
XE+5CIDEr:
1. a sandwich and coffee on a table with a cup of coffee
2. a sandwich and coffee on a table with a cup of coffee
3. a sandwich and coffee on a table with a cup of coffee
4. a sandwich and coffee sitting on a table with a cup of coffee
5. a sandwich and coffee sitting on a table with a cup of coffee
6. a sandwich and coffee on a table with a cup of coffee
7. a sandwich and coffee on a table with a cup of coffee
8. a sandwich and coffee sitting on a table with a cup of coffee
9. a sandwich and coffee on a table with a cup of coffee
10. a sandwich and coffee on a table with a cup of coffee
XE+10CIDEr:
1. a plate of breakfast and breakfast coffee on a table with coffee
2. a breakfast and coffee breakfast on a table with a cup of coffee
3. a white plate with a sandwich and coffee coffee on a table
4. a plate of breakfast and coffee on a table with cups
5. a breakfast plate of breakfast and coffee coffee on a table
6. a plate with breakfast plates of coffee and coffee on a table
7. a breakfast of breakfast and coffee on a table with coffee coffee
8. a breakfast plate with breakfast and drinks on a table next coffee cups
9. a plate of breakfast and coffee coffee on a table with coffee
10. a breakfast with breakfast breakfast on a table with coffee cup and coffee
R-DPP(m=5):
1. a sandwich on a plate with a cup of coffee on it
2. a plate of food that is sitting on a table
3. a sandwich sitting on top of a table next to a cup of coffee
4. a sandwich sits on a table next to a cup of coffee
5. a sandwich sitting on a desk next to a cup of coffee
6. a sandwich and a cup of coffee on a table
7. a plate of food sitting next to a cup of coffee
8. a sandwich on a table with a cup of coffee on it
9. a plate of breakfast are sitting on a table with a cup of coffee
10. a plate of food on top of a table
R-DPP(m=7):
1. a plate of food is sitting on a table
2. a sandwich on a plate next to a coffee cup of coffee
3. a sandwich is sitting on a plate on a table
4. a plate with a sandwich and a cup of coffee on it
5. a sandwich that is sitting on a plate
6. a large plate of food on a table
7. a plate of food and a coffee mug on the table
8. a sandwich is on a plate on a table
9. a sandwich sitting on top of a table with a cup of coffee
10. an egg and a plate of food on a wooden table
Figure 4: Human annotations:
1. a man and a woman cross country skiing on a snow covered trail with mountain peaks in the background
2. a couple of skiers are going down a snowy mountain
3. a lady skiing looking back at a man skiing
4. two snow skiers coming down a snowy hill
5. a man and a woman cross country skiing in deep snow

Att2in-XE:
1. two people riding skis on a snowy surface
2. a couple of people standing on top of a snow covered slope slope
3. two skiers cross country on beneath a mountain
4. two skiers are carrying their snow on a snowy slope
5. the people are racing skiing in the snow
6. two skiers on poles a to finish the run down the mountain way
7. two a number of skiers in snow with umbrellas
8. three people are skiing across a mountain covered land
9. two people skiing skis down on snowy mountain
10. a group and skiers are gathered their the snow

GMM-CVAE:
1. a group of three men standing on top of a snow covered slope
2. a couple of men standing on top of a snow covered slope
3. a couple of people on a snowy mountain
4. users of a group of people on skis
5. a group of three men standing next to each other on a snow covered slope
6. three people standing on a snow covered slope
7. three men on skis are standing in the snow
8. a group of people standing on top of a snow covered slope
9. a group of three men standing next to each other on a snow covered slope
10. a group of people standing on a snowy surface

SCST:
1. a couple of people on skis in the snow
2. a couple of people on skis in the snow
3. a couple of people on skis in the snow
4. a couple of people on skis in the snow
5. two people are on skis in the snow
6. two people are skiing on skis in the snow
7. a couple of people on skis in the snow
8. a group of people on skis in the snow
9. a couple of people on skis in the snow
10. a couple of people on skis in the snow

XE+5CIDEr:
1. two people standing on skis in the snow covered mountain
2. two people standing on skis in the snow covered mountain
3. two people cross country skiing on a snow covered mountain
4. two people standing on skis in the snow covered mountain
5. two people cross country skiing on a snow covered mountain
6. two women standing on skis in the snow covered mountain
7. two people standing on skis in the snow covered mountain
8. two people standing on skis in the snow covered mountain
9. two people standing on skis in the snow covered mountain
10. two people standing on skis in the snow covered mountain

XE+10CIDEr:
1. two people standing country skiing on a mountain mountain with
2. two people standing country skiing in the snow covered mountain
3. two people standing country skiing in the snow mountain slope
4. two people standing country skiing in a mountain slope
5. two people standing country skiing on a mountain slope
6. two people cross country skiing on a snow mountain mountain
7. two people standing on skis in the snow mountain mountain
8. two people standing country skiing on a mountain mountain
9. two people standing country skiing on a mountain mountain
10. two people standing country skiing in the snow covered mountain

R-DPP(m=5):
1. two people cross country skiing in the snow
2. two people cross country skiing on a snowy mountain
3. two people on skis standing in the snow
4. two skiers are standing on skis in the snow
5. two people are riding skis on a snowy mountain
6. two people standing on skis in the snow
7. two people skiing skis on top of a snow covered mountain
8. two people on skis standing in the snow
9. two people are standing in the snow on skis
10. a couple of people riding skis down a snow covered ski slope

R-DPP(m=7):
1. two people are cross country skiing in the snow
2. two people in the back of a ski slope
3. two people are riding skis on a snowy slope
4. two people on skis in the snow
5. a couple of people riding skis down a snow covered slope
6. two skiers on skis in the snow
7. two people holding skis on a snowy slope
8. two skiers are on skis on top of a snow covered slope
9. two people that are standing on skis in the snow
10. two people on skis in the snow
Figure 5: Human annotations:
1. a lady that has a tennis racket in hand
2. a young female swinging at a tennis ball coming in low
3. the woman is playing tennis on the court
4. the female tennis player is wearing red and white
5. a girl reaches for a ball with her tennsi racket
R-DPP(m=2):
1. a woman hitting a tennis ball with a racket
2. a woman hitting a tennis ball with a racquet
3. a woman hitting a tennis ball with a racket
4. a woman hitting a tennis ball on a tennis court
5. a woman is swinging a tennis ball on a tennis court
6. a woman hitting a tennis ball with a racket
7. a woman hitting a tennis ball with a racket
8. a woman hitting a tennis ball with a racket
9. a woman hitting a tennis ball with a racket
10. a woman hitting a tennis ball with a racquet

R-DPP(m=3):
1. a woman hitting a tennis ball with a racket
2. a woman hitting a tennis ball with a racquet
3. a woman swinging a tennis racket at a ball
4. a woman swinging a tennis ball on a court
5. a woman hitting a tennis ball with a racquet
6. a woman hitting a tennis ball with a racket
7. a woman hitting a tennis ball with a racket
8. a woman swinging a tennis racquet on a tennis ball
9. a woman hitting a tennis ball with a racquet
10. a woman is playing with a tennis ball

R-DPP(m=4):
1. a woman swinging a tennis racquet at a tennis ball
2. a woman swinging a tennis racket at a ball
3. a woman is on a tennis ball with her racket
4. a woman holding a tennis racquet at a ball
5. a woman hitting a tennis ball with a racket
6. a woman is swinging a tennis racket at a ball
7. a woman hitting a tennis ball with a racket
8. a woman hitting a tennis ball with a racquet
9. a woman is playing tennis on a court
10. a woman swinging a tennis racket at a ball

R-DPP(m=5):
1. a woman swinging a tennis racket at a ball
2. a woman hitting a tennis ball with a racquet
3. a woman swinging a tennis racquet on a tennis ball
4. a woman getting ready to hit a tennis ball
5. a female tennis player getting ready to hit the ball
6. a woman is playing tennis on a tennis court
7. a woman swinging a tennis racquet on a court
8. a woman is playing tennis on a court
9. a woman hitting a tennis ball with her racket
10. a female tennis player getting ready to hit the ball

R-DPP(m=6):
1. a woman hitting a tennis ball with a racquet
2. a woman playing tennis on a tennis court
3. a woman swinging a tennis racket on a court
4. a woman is trying to hit a tennis ball
5. a woman is swinging a tennis racket at a ball
6. a woman swinging a tennis racket at a ball
7. a girl is trying to hit a tennis ball
8. a woman hitting a tennis ball with a racquet
9. a woman hitting a tennis ball with a racquet
10. a female tennis player swinging a racket at a ball

R-DPP(m=7):
1. a girl swinging a tennis racket at a ball
2. a woman is playing tennis on the court
3. a woman playing tennis on a clay court
4. a woman on a court holding a tennis racket
5. a woman hitting a tennis ball with a racket
6. a female tennis player hitting a ball with the ball
7. a woman in a red shirt is playing tennis
8. a woman hitting a tennis ball with a racquet
9. a woman is about to hit a tennis ball
10. a female tennis player getting ready to hit the ball
Figure 6: Human annotations:
1. a large pile of a variety of donuts seen from above
2. a large stack of a variety of donuts all set on top of each other to make a pyramid like design
3. a pile of different flavored donuts topped with chocolate and coconut
4. overhead shot of a pyramid of assorted cake donuts
5. various decorated donuts are stacked on top of each other
R-DPP(m=2):
1. a pile of donuts sitting on top of a table
2. a pile of donuts are sitting on a table
3. a pile of donuts are sitting on a table
4. a box of donuts are sitting on a table
5. a box of donuts sitting on a table
6. a box of donuts are on a table
7. a pile of donuts sitting on a table
8. a box of donuts are on a table
9. a box of donuts are sitting on a table
10. a pile of donuts sitting on a table

R-DPP(m=3):
1. a box of donuts sitting on top of a table
2. a pile of doughnuts on a table
3. a pile of donuts on a plate
4. a bunch of glazed donuts on a plate
5. a box of doughnuts on a table
6. a box of doughnuts on top of a table
7. a close up of a box filled with donuts
8. a plate of donuts sitting on a table
9. a pile of donuts on a plate
10. a box of donuts in a table

R-DPP(m=4):
1. a bunch of glazed donuts in a box
2. a box filled with lots of donuts in it
3. a bunch of doughnuts are sitting on a plate
4. a bunch of glazed donuts sitting on a table
5. a pile of glazed donuts on a white plate
6. a pile of donuts that are sitting on a table
7. a pile of glazed donuts sitting on a table
8. a bunch of donuts are sitting on a box
9. a bunch of doughnuts sitting on a table
10. a bunch of glazed donuts in a box

R-DPP(m=5):
1. a box filled with lots of donuts sitting in it
2. a bunch of donuts sitting on a glass
3. a close up of a pile of doughnuts
4. a box of different types of donuts
5. a pile of donuts with a on a plate
6. a pile of donuts sitting on a box
7. a close up of a white of doughnuts
8. a bunch of doughnuts that are on a table
9. a group of doughnuts are sitting on a plate
10. a variety of glazed donuts on a table

R-DPP(m=6):
1. a variety of donuts sitting on a table
2. a close up of a bunch of glazed donuts
3. a close up of a bunch of doughnuts
4. a bunch of doughnuts are on a plate
5. a close up of a bunch of doughnuts
6. a box filled with different types of doughnuts
7. a box filled with different types of doughnuts
8. a variety of donuts that are sitting on a table
9. a close up of a pile of donuts
10. a box filled with two donuts and doughnuts

R-DPP(m=7):
1. a close up of a bunch of donuts
2. a variety of doughnuts that are sitting on a table
3. a close up of three doughnuts on a plate
4. a box of donuts that are sitting on a table
5. a pile of donuts sitting on top of a stack of doughnuts
6. a box filled with donuts in the background
7. a close up of a group of doughnuts
8. a glazed donuts on a plate next to a UNK
9. a bunch of donuts are in a display
10. a bunch of glazed donuts on a plate
Figure 7: Human annotations:
1. a couple of people that are sitting on mopeds
2. a street that has some motorcycles going down it
3. a city street with multiple shops and people riding motorcycles
4. people on motorbikes driving down the road in a city
5. a few people on motor scooters riding down a street
R-DPP(m=2):
1. a group of people riding motorcycles down a city street
2. a group of people riding motorcycles down a city street
3. a group of people riding motorcycles down a street
4. a group of people riding motorcycles down a city street
5. a group of people riding motorcycles down a city street
6. a group of people riding motorcycles on a city street
7. a group of people riding motorcycles down a city street
8. a group of people riding motorcycles on a city street
9. a group of people riding motorcycles down the street
10. a group of people riding motorcycles down a city street

R-DPP(m=3):
1. a group of people on motorcycles riding down a city street
2. a group of people on motorcycles on a city street
3. a group of people riding motorcycles down a city street
4. a group of people riding on motorcycles on a city street
5. a group of people riding motorcycles down a city street
6. a group of people riding motorcycles down a city street
7. a group of people on motorcycles in the street
8. a group of people riding on motorcycles on a city street
9. a group of people on motorcycles on a city street
10. a group of people on motorcycles in a city street

R-DPP(m=4):
1. a busy street filled with cars and a city street
2. a group of people are riding down a city street
3. a group of people riding motorcycles down a city street
4. a busy city street with a group of people riding bikes
5. a group of people on motorcycles down a city street
6. a bunch of people on motorcycles in the street
7. a group of people are riding down motorcycles down a city street
8. a busy city street with lots of people on motorcycles
9. a bunch of motorcycles are on a city street
10. a group of people are riding down motorcycles in the street

R-DPP(m=5):
1. a group of people riding motorcycles driving down a city street
2. a group of people riding motorcycles down a street
3. a group of people on horses on a city street
4. a busy city street with people on motorcycles and a street
5. a group of people riding motorcycles down a street
6. a busy street filled with lots of people on motorcycles
7. a group of people riding motorcycles down a street
8. a group of people on motorcycles in a busy city street
9. several people riding motorcycles in a busy city street
10. a group of people riding motorcycles down a city street

R-DPP(m=6):
1. a group of people on motorcycles on a city street
2. a group of people riding motorcycles down a street in front of a building
3. a group of people on a motorcycle on a city street
4. a group of people riding motorcycles down a street
5. a person riding bikes on the side of a street
6. a busy city street with people on a motorcycle
7. a large group of people on motorcycles on a city street
8. a group of people with motorcycles on a city street
9. a large group of people riding motorcycles down a street
10. a group of people riding motorcycles on a city street

R-DPP(m=7):
1. a line of people on motorcycles riding down a road
2. a person riding on motorcycle down a street
3. a group of people on motorcycles on a city street
4. a group of people on motorcycles down the street
5. a group of people on motorcycles in a city street
6. a group of people riding motorcycles down a street
7. a group of people on motorcycles in a city street
8. a group of people on motorcycles on a city street
9. a busy street with people and a UNK in the street
10. a group of people on motorcycles down a street
Figure 8: Human annotations:
1. a young boy throwing up a blue card
2. a very blurry image of two small children in motion
3. a group of children looking up a the person taking the picture
4. two children playing with a kite on the sidewalk
5. a pair of children playing catch on the street