Creative Captioning: An AI Grand Challenge Based on the Dixit Board Game

Abstract

We propose a new class of "grand challenge" AI problems that we call creative captioning: generating clever, interesting, or abstract captions for images, as well as understanding such captions. Creative captioning draws on core AI research areas of vision, natural language processing, narrative reasoning, and social reasoning, and across all these areas, it requires sophisticated uses of common sense and cultural knowledge. In this paper, we analyze several specific research problems that fall under creative captioning, using the popular board game Dixit as both inspiration and proposed testing ground. We expect that Dixit could serve as an engaging and motivating benchmark for creative captioning across numerous AI research communities for the coming 1-2 decades.



Maithilee Kunda (Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA) and Irina Rabkina (Computer Science, Occidental College, Los Angeles, CA, USA)


Introduction

Consider the images in Figure 1. For each of the following phrases, which image do you think is being described?

1. A person on a tall ladder using a hammer and chisel to make a cloud pigeon in the sky, plus a cloud butterfly.
2. A difficult choice among three options.
3. Scary.
4. Monty Hall.
5. She got the little dog, too!

Phrase 1 is fairly easy to match; it is just a literal description of the contents of Image F.

Phrase 2 is a little bit harder; it does not use any specific nouns that refer to depicted objects, but it does refer to “three options,” which suggests the three doors in Image D. Nothing in Image D directly indicates that the choice is a difficult one, or really that there is a “choice” at all, but we can easily imagine that the knightly rabbit is facing a choice.

Phrase 3 is harder still: “scary” refers to an emotional quality, and so we have to consider the affect or mood conveyed by each image. (The authors find Image A to be the most scary, though E is admittedly also creepy, especially if one has botanophobia, or fear of plants!)

Phrase 4 requires knowing who (or what) Monty Hall is. Among AAAI readers, we expect that at least a few will recognize our reference to the famous Monty Hall problem of choosing a prize from behind three closed doors (Selvin 1975a,b), thus referring again to Image D.

We leave Phrase 5 as an exercise for the reader. (Hint: This involves a cultural reference to the classic American film The Wizard of Oz, and both a song and a dialogue from it.)

We define creative captioning as a class of problems that includes (a) generating clever, interesting, or abstract captions for images (as we did in creating our list of phrases), and (b) understanding such captions (as you did in matching each phrase to an image), and related variants thereof.

Our work is inspired by the popular board game Dixit (Roubira 2008), in which players both generate and try to match “interesting” captions to rather surreal-looking images, like those shown in Fig. 1.

Figure 1: Images (panels A-F) from the popular board game Dixit (Dixit 2019), which embodies several tasks in creative captioning.

Figure 2: An example of a round of Dixit. (a) The storyteller selects a card and utters a corresponding word/phrase. (b) Other players select a card from their hand and play it. (c) All cards are turned face up and non-storyteller players vote on the card they believe was played by the storyteller. (d) Finally, points for the round are calculated. In the points table in (d), we mark the storyteller, player 3, with an *. All images of cards were retrieved from the Dixit (2019) publisher website.

While Dixit provides one concrete instantiation of problems involved in creative captioning, other examples include the well-known New Yorker cartoon captioning contest (Prince and Radev 2017; Bogert, Watson, and Schecter 2020; Li 2020), or generating engaging titles for inventions (Senda, Sinohara, and Okumura 2004) or artwork.

Creative captioning is closely related to image captioning (Hossain et al. 2019), though it goes beyond it to incorporate more sophisticated aspects of vision (e.g., multiple and often abstract interpretations of an image), natural language processing (including idioms, cultural references, etc.), story or narrative reasoning (e.g., inferring, or imagining, narrative constructs related to a given image or phrase), and social reasoning (such as thinking about the intended audience for a caption).

Contributions of this paper include:

• We identify characteristics that make Dixit a fascinating window into human intelligence, by reviewing Dixit-related research in psychology and education.
• We describe how the Dixit board game could be set up as an AI challenge, including specific assumptions, game variants, and evaluation methods.
• We analyze the suite of problems that an artificial agent faces while playing Dixit, and we review the state of the art in AI research related to each problem.

The Dixit Game

Dixit (Roubira 2008) is a tabletop card game typically played with 4-6 players. The game contains 84 cards, 36 numbered tokens, 6 player markers, and a game board. Cards are storybook-style illustrations, often surreal (Figure 1). Game play takes place as follows (Figure 2).

Setup.

Each player starts with zero points. The deck is shuffled, and each player is dealt six cards. Each player also receives tokens to be used for voting. A player is selected at random to be the first storyteller.

Storyteller Turn.

At the beginning of each round, the storyteller secretly selects a card from their hand. They then utter a clue or label for the card, and place their card face down on the table. Per the Dixit instructions:

“[This] clue can take many different forms: it can be made up of one or more words or can even be a sound or group of sounds that represent the clue. It can be invented on the spot or it can take the form of already existing works (a part of a poem or song, a movie title, a proverb, etc...).” (Roubira 2008)

Remaining Players’ Turn.

Then, each remaining player picks a “decoy” card from their own hand that they associate with the storyteller’s utterance. These cards are placed face down on top of the storyteller’s card.

Voting.

After all players have added their cards to the pile, the storyteller shuffles and then reveals all of the cards. Each player, other than the storyteller, then votes for the card that they believe to be the storyteller’s by placing the corresponding token face down in front of them. (Tokens are placed face down so that votes are hidden while players are still making decisions.) Players cannot vote for their own card.

Scoring.

Each round’s point distribution is determined by the players’ votes. If all players or no players correctly guessed the storyteller’s card, the storyteller does not earn any points, and all other players earn 2 points. Otherwise, the storyteller earns 3 points, as does each player that guessed correctly. Each player, other than the storyteller, earns 1 additional point for each vote that their own card received.

Next Turn.

All players draw from the deck to replenish their hand back up to six cards. The storyteller role is passed clockwise.

Game ending.

The game ends once any player has reached 30 points or when the last card has been drawn from the deck, at which time the player with the most points wins.
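As a minimal sketch of the scoring rule described above, the following Python function computes the points for a single round. The function name `score_round` and the `votes` mapping (each non-storyteller player mapped to the player whose card they voted for) are illustrative assumptions, not part of the game's rules text. The sketch also assumes that the 1-point bonus per received vote is awarded in every case, as the description above states.

```python
from collections import Counter


def score_round(storyteller, votes):
    """Return the points each player earns in one round.

    `votes` maps every non-storyteller player to the player whose card
    they voted for (hypothetical representation chosen for this sketch).
    """
    voters = list(votes)
    correct = [p for p in voters if votes[p] == storyteller]

    points = {p: 0 for p in voters}
    points[storyteller] = 0

    if len(correct) == 0 or len(correct) == len(voters):
        # Everyone or no one found the storyteller's card:
        # the storyteller scores nothing, all other players score 2.
        for p in voters:
            points[p] += 2
    else:
        # Otherwise the storyteller and each correct guesser score 3.
        points[storyteller] += 3
        for p in correct:
            points[p] += 3

    # Each non-storyteller earns 1 point per vote their decoy card received.
    received = Counter(v for v in votes.values() if v != storyteller)
    for player, n_votes in received.items():
        points[player] += n_votes

    return points
```

For instance, with player 3 as storyteller and `votes = {1: 3, 2: 1, 4: 3}`, players 1 and 4 guess correctly and player 2 is lured by player 1's decoy, so player 1 ends the round with 4 points (3 for the correct guess plus 1 for the vote received).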

Dixit as creative captioning.

Dixit involves multiple forms of creative captioning, including when the storyteller chooses a card and generates a clue/phrase for it; when the other players select their own decoy cards; and when players vote on the card they think is the original storyteller’s card.

Table 1: Sampling of research studies using Dixit from psychology, education, and other social and cognitive science areas.

| Reference | Summary | Details |
| Piccolo & Guerra 2010 | Used Dixit to help students learn about design patterns in programming. | Asked 19 graduate computing students in Brazil to play Dixit but with allowable phrases only relating to design patterns and concepts in object-oriented design. Majority of students liked activity and felt that it enhanced their learning. |
| Chircop 2016 | Proposed a typology for board games including rules, randomness, theme, and interaction. | Mentions Dixit as an example of a game with low rule complexity. Also discusses how in storytelling games, player-generated input forms a significant part of gameplay, though this input is still constrained by game rules and also by social norms. |
| Mousnier et al. 2016 | (In French.) Description of using Dixit cards during therapy. | Use Dixit cards for therapy to elicit metaphoric language and thought. Describe several sessions with various patients, in which Dixit cards were used. Patients chose cards in response to a question (e.g., "How do you feel?") and then discussed, based on prompts from the examiners. |
| Bekesas et al. 2018 | Used a game similar to Dixit for eliciting people's sociocultural knowledge and identity. | Created a new game based off a popular Brazilian game, and inspired by Dixit and others, to elicit people's "cultural repertoire" and also the personal meanings or interpretations they bring. Successfully tested with 170 youth in Brazil aged 12-24 years. |
| Vitancol & Baria 2018 | Evaluated how group communication changed over the course of playing Dixit. | Had visitors to a boardgame cafe play Dixit. Rated players qualitatively on their level of Participation, Observation, Evaluation, and Adaptation. Found that communication increased throughout gameplay. |
| Musat & Faltings 2013 | Created a human computation game for AI evaluation, loosely based on Dixit. | Created a game, loosely based on Dixit rules, for human analysis of AI sentiment analysis. Present a pilot study, showing that people understand the game and that system's error type affects peoples' descriptions. |
| Vayanou et al. 2019 | Used Dixit-style game to help people engage with visual art. | Conducted a series of group sessions with adults playing a Dixit-like game to tell stories about visual artworks, either in museum exhibitions or at home. Participant observations indicated increased engagement, meaning-making, and interest in artworks. |

Research on Dixit with People

Dixit (or Dixit-inspired activities) represents a fascinating task format for eliciting human creative captioning, as evidenced by its widespread use in a variety of research studies across different fields. Table 1 lists a sampling of these studies. For example, one study used a card game similar to Dixit as a research tool for querying people’s cultural knowledge (Bekesas et al. 2018).

In another quite clever deployment of the game, Dixit was used to teach software design patterns to graduate CS students. Essentially, students played Dixit as usual except that their “clues” had to be drawn from the vocabulary of programming design patterns taught during the course. As an example from this study (Piccolo and Guerra 2010):

“In his turn, a player select in his hand a card that has a humanized rabbit looking for three different doors. He thinks that this card relates to the Strategy pattern, where you can choose different implementations for an algorithm. Then, he put his card backwards in the table, saying ”Strategy”. Each other player them should select the card in his hand that he thinks that is the most related to the Startegy pattern. For instance, another player should select a card with several water drops, relating that in the Strategy pattern there are several classes encapsulated in the same abstraction.”

Other studies include using Dixit as part of board game training to make people smarter (Bartolucci, Mattioli, and Batini 2019); analyzing Dixit within a taxonomy of narrative board and card games (Sullivan and Salter 2017); discussing Dixit as an adaptation of Rorschach tests (Rogerson and Cocks 2017); using Dixit cards to spur ideation for game design (Wetzel, Rodden, and Benford 2017); using Dixit cards as a method for language sampling in children that elicits more lexical diversity than traditional methods (Smith 2018); Dixit as edutainment (Novikova and Beskrovnaya 2015); how Dixit can be used to teach complex concepts like ethics (Mazurkiewicz 2013); discussions of the shared narrative experience among players of games like Dixit (Montanarini 2019); and using Dixit for therapy (Ikiz and Béziat 2020).

Dixit AI Challenge

While the rules of Dixit are easy to follow, we believe that the game will pose a significant challenge to modern AI agents. As such, we propose Dixit as a creative captioning AI challenge under the following parameters. (We use the abbreviation DA to refer to a Dixit Agent.)

We note that there are two ways to evaluate a DA: winning the game and playing it in a believably human-like way. However, omitting the latter requirement changes the game considerably, in that human players would then be voting and acting based on their own mental models of how they think an artificial agent would be playing the game. Thus, our proposed Dixit AI challenge assumes that human players do not know which is the artificial agent (or perhaps that there is an artificial agent at all).

1. The DA must be able to play a full Dixit game against human players. That is, it must be able to play both storyteller and non-storyteller roles.
2. The basic Dixit game is intended for 4-6 human players, and thus the DA must be able to play against anywhere from 3-5 human players.
3. While the official rules of Dixit allow for game phrases to include “noises” or, in some variants, physical gestures or charades, the proposed AI challenge will allow only text-based phrases.
4. Dixit calls for game phrases to be “short.” We propose imposing a hard limit K on phrase length as a game parameter. For example, K = 4 would be quite reasonable.
5. The game will be played over a virtual connection, such that the only communications among players consist of selections of cards, plain-text phrases, and votes. Thus, we eliminate the roles of facial expressions, body language, prosody, and other forms of nonverbal communication. (Perhaps we leave these for the next AI Dixit challenge!)
6. Table chatter is not permitted amongst players, i.e., there is no extraneous conversation allowed.
7. The game will take place using a specified language (e.g., English), cultural context (e.g., the United States), and player characteristics (e.g., adults, or college students, or 10-year-olds, or college computer science students, etc.). We expect that DAs will be designed initially for one specific set of these characteristics, but the core AI methods developed ought to (eventually) be able to generalize across these different game contexts.
8. The human players should be strangers to each other; this ensures that the players are not relying on personal knowledge, inside jokes, etc., that would be impossible for the DA to know or understand. However, the DA and human players may well observe and learn from each other’s behaviors during the course of the game.
9. The game will be played with cards (e.g., from an expansion pack) that neither the DA nor any of the human players have previously seen.
10. The DA must be able to explain all of its actions (i.e., why it selected a phrase or card). This is to prevent winning by way of the Eliza effect (O’Dell and Dickson 1984), and is a reasonable requirement to apply to human players as well.

We intend for these parameters to limit the difficulty of the game, while maintaining its spirit. For example, while table chatter adds entertainment for human players and may often affect the course of game play, we remove it as a simplifying assumption for our initial Dixit AI challenge. Explaining one’s choices, however, is common among human players (Piccolo and Guerra 2010; Vayanou et al. 2019) and serves to show that choices were not made completely at random.

Creative Captioning Problems in Dixit

Winning a full game of Dixit involves maximizing one’s score earned across both storyteller and non-storyteller rounds. Here, we formalize specific problems and subproblems that a Dixit Agent (DA) would have to solve in order to win a game against human players. Of course, there are many other possible ways to frame the problems that Dixit poses; what we present here is one possible starting point.

Storyteller round

In the storyteller round, the DA begins with a hand $H$ of six cards, each represented as a single color image: $H = \{C_1, \ldots, C_6\}$.

Then, the DA must select one card $C_{\text{target}}$ and produce a corresponding text-based phrase $X_{\text{target}}$ that goes with that card. The phrase can be anything from the space of all possible utterances $\mathcal{U}_k$ of length $k$ in the language in which the game is being played, i.e., $X_{\text{target}} \in \mathcal{U}_k$.

That is actually all that is required from the DA during the storytelling round. The rest of the round depends entirely on how the other players react.

So how does the DA produce a “winning” choice of $C_{\text{target}}$ and $X_{\text{target}}$? This is actually a somewhat bizarre and ambiguous optimization problem.

Let $n \in \{4, 5, 6\}$ be the total number of players in the game. Then, if the DA is the storyteller, there are $n - 1$ voting players in that round.

The score that the DA will receive as the storyteller depends on the number of players $n_V$ that vote for its card:

$$\text{score} = \begin{cases} 0 & n_V = 0 \\ 3 & 0 < n_V < n - 1 \\ 0 & n_V = n - 1 \end{cases}$$

For any choice of card $C_i$ and phrase $X_i$, the DA must essentially estimate the probability that the number of voting players $n_V$ will lie in the desired range. If $X_{\text{target}}$ is too specific to $C_{\text{target}}$, then it is likely that $n_V = n - 1$. If $X_{\text{target}}$ is not specific enough to $C_{\text{target}}$, then it is likely that $n_V = 0$.

Given the ability to estimate this probability $P_{\text{scoring}}$, the DA is then trying to choose some combination of $C_{\text{target}}$ from its hand $H$, and $X_{\text{target}}$ from the set of all possible utterances $\mathcal{U}_k$ of length $k$, that maximizes $P_{\text{scoring}}$. We call this Storyteller Strategy 1:

$$P_{\text{scoring}}(C_i, X_i) = P(0 < n_V < n - 1 \mid C_i, X_i) \tag{1}$$

$$\begin{bmatrix} C_{\text{target}} \\ X_{\text{target}} \end{bmatrix} = \operatorname*{argmax}_{C_i \in H} \left\{ \operatorname*{argmax}_{X_j \in \mathcal{U}_k} \left( P_{\text{scoring}} \right) \right\} \tag{2}$$

However, other players can also earn points during this round, based on whether they vote for the storyteller’s card (3 points), and also whether other players vote for their card (1 point per other player that has been deceived). So, while acting according to Eq. 2 can maximize the chances that the DA will earn 3 points, the DA’s net lead over the other players will vary, depending on how many other players vote for the storyteller’s card.

The net points earned by other players will be minimized if only one player votes for the DA’s card. Thus, a different strategy for the DA during the storyteller’s round is to minimize the expected number of players $E_{\text{votes}} = E[n_V]$ who might vote for the DA’s card, while keeping it above 0. We call this Storyteller Strategy 2:

$$E_{\text{votes}}(C_i, X_i) = E[n_V \mid C_i, X_i] \tag{3}$$

$$\begin{bmatrix} C_{\text{target}} \\ X_{\text{target}} \end{bmatrix} = \operatorname*{argmin}_{C_i \in H} \left\{ \operatorname*{argmin}_{X_j \in \mathcal{U}_k} \left( E_{\text{votes}} \mid E_{\text{votes}} > 0 \right) \right\} \tag{4}$$

How Many Votes subproblem.

Either of the above strategies can be roughly decomposed into two subproblems.

First, given any candidate pairing of a card and a phrase $(C_i, X_i)$, the DA needs to be able to estimate how many other players are likely to vote for it, as in Eq. 1 or Eq. 3. As mentioned above, we require that the deck of cards is not previously known to the DA before gameplay, and so these probabilities cannot be “precomputed” for given cards and pre-selected target phrases.

Solving this subproblem requires several different technical AI capabilities, including:

1. Vision: What does the card $C_i$ actually depict? What are the objects, characters, scene information, and affective and/or cultural implications?
2. NLP: What does the phrase $X_i$ actually mean? What is the proper parsing, word or phrase meanings, and affective and/or cultural implications?
3. Story reasoning: Given the card and phrase pairing, how can they be interpreted together to form a coherent visual+linguistic story?
4. Social reasoning: Given the card and phrase pairing, how many other players are likely to vote for this card, also relative to the other potential decoy cards that other players might produce in response to the phrase $X_i$?

Find a Phrase subproblem.

Eq. 2 and Eq. 4 require the DA to choose a target card $C_{\text{target}} \in H$ and a target phrase $X_{\text{target}} \in \mathcal{U}_k$ from all possible utterances of length $k$.

One way to solve this subproblem could be to iterate through all $C_i$ in the DA’s hand $H$, which only has six cards, and then for each $C_i$, have some search strategy that generates a series of candidate phrases $X_i$ and evaluates each one according to the How Many Votes subproblem.

Then, the core remaining subproblem becomes how to generate a series of candidate phrases $X_i$ for a given $C_i$. Solving this subproblem requires:

1. Vision: As above.
2. NLP and story reasoning: What are possible “creative story captions” that could be applied to describe $C_i$?

Non-storyteller round

When the DA is not the storyteller, the round begins when the storyteller (another player) produces a target phrase $X_{\text{target}}$ for that round. In these rounds, the DA has two somewhat separable goals.

First, the DA must choose a card $C_{\text{decoy}} \in H$ from its hand that best lures other players into voting for it. During the voting phase, the DA will receive 1 point for each player that votes for its card. Thus, the DA should choose a card from its hand that maximizes the number of players likely to vote for it. As above, we let $n_V$ denote the number of other players voting for the DA’s card:

$$C_{\text{decoy}} = \operatorname*{argmax}_{C_i \in H} \left\{ E_{\text{votes}} \mid C_i, X_{\text{target}} \right\} \tag{5}$$

Second, the DA will see a set $S$ of cards $C_i$ on the table (one from the storyteller, one from itself, $C_{\text{decoy}}$, and one from each other player), and it must vote for the card $C_{\text{vote}}$ that it thinks is the storyteller’s card:

$$P_{\text{target}}(C_i) = P(C_i = C_{\text{target}} \mid X_{\text{target}}) \tag{6}$$

$$C_{\text{vote}} = \operatorname*{argmax}_{C_i \in S} \left\{ P_{\text{target}}(C_i) \right\} \tag{7}$$

How Many Votes subproblem variants.

In the non-storyteller round, when the DA is selecting its decoy card, it is solving something very similar to the How Many Votes subproblem as described above, except now it is in the DA’s interest to get as many players as possible to vote for its card. And, because the DA is searching its hand $H$, there are only six possible cards $C_i$ to choose from. Thus, a simple approach would be for the DA to iterate through all six cards in its hand, compute the expected number of votes for each, and select the card giving the maximum estimate.

Of course, as noted above, solving the How Many Votes subproblem is quite difficult and requires vision, NLP, story reasoning, and social reasoning.

Finally, when the Dixit agent has to vote on the card that it believes is the original storyteller’s card, it is again solving a variant of the How Many Votes subproblem. It can perform exhaustive search through the $n - 1$ available card options (one card from each player except its own), and for each one, estimate the probability of its being the storyteller’s card.

End-Game Considerations

The above equations describe several strategies that the DA can use to essentially maximize its own score. However, there are also situations in Dixit when the DA might need to instead shift strategies to prevent other players from scoring, for instance towards the end of a game if another player is very close to winning.

For example, suppose the DA is not the storyteller, and the player who is the storyteller is within 3 points of winning. Then, instead of using Eq. 7 to choose its vote, which maximizes the probability in Eq. 6, the DA might instead want to minimize this Eq. 6 probability, in order to prevent the storyteller from winning.

Many games have strategies that shift as gameplay progresses. In Dixit, all players can see the scoreboard at all times, and so while reasoning about the scores of other players might not always be strictly necessary to win, it does play a potentially useful (and potentially game-changing) role.
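The selection rules in Eqs. 1-7, together with the end-game inversion just described, can be sketched in Python under strong simplifying assumptions. The sketch assumes a hypothetical estimator `vote_prob(card, phrase)` that returns the probability that any single other player votes for the DA's card, and it treats voters as independent, so that the vote count is binomial. Neither assumption comes from the formalization above, and building such an estimator is precisely the hard How Many Votes subproblem. The finite `phrases` list stands in for a search over the utterance space, and `p_target` is likewise a hypothetical estimator for Eq. 6.

```python
from itertools import product


def p_scoring(p, n):
    """Eq. 1 under the independence assumption: P(0 < n_V < n - 1)."""
    voters = n - 1
    return 1.0 - (1.0 - p) ** voters - p ** voters


def expected_votes(p, n):
    """Eq. 3 under the same assumption: E[n_V] = (n - 1) * p."""
    return (n - 1) * p


def storyteller_strategy_1(hand, phrases, vote_prob, n):
    """Eq. 2: choose the (card, phrase) pair that maximizes P_scoring."""
    return max(product(hand, phrases),
               key=lambda cx: p_scoring(vote_prob(*cx), n))


def storyteller_strategy_2(hand, phrases, vote_prob, n):
    """Eq. 4: minimize expected votes while keeping them above zero."""
    scored = [((c, x), expected_votes(vote_prob(c, x), n))
              for c, x in product(hand, phrases)]
    feasible = [(pair, e) for pair, e in scored if e > 0]
    if not feasible:
        return None  # no candidate is expected to attract any vote
    return min(feasible, key=lambda item: item[1])[0]


def choose_decoy(hand, phrase, vote_prob):
    """Eq. 5: play the card expected to attract the most votes."""
    return max(hand, key=lambda c: vote_prob(c, phrase))


def choose_vote(table, own_card, phrase, p_target, block_storyteller=False):
    """Eq. 7, or its end-game inversion when block_storyteller is True."""
    options = [c for c in table if c != own_card]
    pick = min if block_storyteller else max
    return pick(options, key=lambda c: p_target(c, phrase))
```

Under this toy model, a phrase that half of the other players would match to the card is the best storyteller choice in a four-player game (`p_scoring(0.5, 4)` is 0.75), which mirrors the observation that a clue should be neither too specific nor too vague.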

Towards Creative Captioning: The Current State of the Art

Creative captioning in general, and in particular the specific challenge we propose of winning a game of Dixit, touches on core problems for many subfields of AI, including (1) vision, (2) natural language processing, (3) story or narrative reasoning, and (4) social reasoning, among others. In this section, we discuss the current state of the art in each of these subfields individually and taken as an integrated whole.

Vision

Creative captioning requires several robust vision capabilities, as described (non-exhaustively) below.

Object recognition: What is in the image?

The last eight years have seen a revolution in approaches to object recognition, due in part to advances in dataset size (Deng et al. 2009) and deep learning methods (Krizhevsky, Sutskever, and Hinton 2012). However, generalized object recognition is still a difficult problem, for example when models are faced with new images that are more complex than training images (Recht et al. 2019) or that depict objects in unusual poses (Barbu et al. 2019) or sociocultural contexts (de Vries et al. 2019). Additional challenges emerge when considering not just photographic images but also artwork and other visual styles of depiction (Hall et al. 2015; Westlake, Cai, and Hall 2016). Dixit game images in particular, as shown in Figure 1, are especially challenging for artificial vision systems because they include surreal elements (Florea, Florea, and Badea 2016).

Scene analysis: How are the objects related?

In addition to identifying objects and entities in an image, creative captioning requires understanding the scene, i.e., relationships among objects. Going beyond just identifying objects and their relationships (Dai, Zhang, and Lin 2017), however, creative captioning also requires inferring how the scene relates to common scenes, commonsense interpretations such as physical interactions among objects (Battaglia, Hamrick, and Tenenbaum 2013), cultural contexts, etc.

Affective analysis: What is the emotional content of an image?

Creative captioning also requires inferring affective aspects of an image, including overall mood or tone (Machajdik and Hanbury 2010), facial expressions of characters (Zhao and Zhang 2016), etc.

Natural Language Processing/Understanding

In recent years, language models such as BERT (Devlin et al. 2018), GPT (Radford et al. 2018), and their derivatives (Liu et al. 2019; Radford et al. 2019; Brown et al. 2020, etc.) have made substantial progress on many natural language tasks, ranging from question answering to story completion. Of these, story completion (e.g., HellaSwag (Zellers et al. 2019) and StoryCloze (Mostafazadeh et al. 2016)) is the most similar to creative captioning.

Story completion tasks, sometimes referred to as cloze tasks, combine language understanding with language generation. The datasets contain short (3-5 sentence length) stories, the first few sentences of which are provided as input to the system being tested. The system must then generate the rest of the story. This requires understanding the contents of the story, along with any implied commonsense reasoning and storyteller intentions. Humans perform incredibly well on these tasks: 100% accuracy on StoryCloze (Mostafazadeh et al. 2016) and 95.6% accuracy on HellaSwag (Zellers et al. 2019). Yet, even the newest (at the time of writing) GPT model, GPT-3, performs significantly worse (Brown et al. 2020). This is likely because language models do not perform natural language understanding or commonsense reasoning, and instead find patterns and make connections across millions of training examples (Marcus and Davis 2020). We believe that this will cause language models to struggle on the creative captioning task as well, especially since generation must occur between modalities (i.e., generate language based on an image, or select an image based on language). On the other hand, language models have been successful on some creative tasks (e.g., poetry (Liao et al. 2019)), which suggests that creative captioning may not be entirely out of reach.

Other approaches combine knowledge and inference for language understanding. For example, Lin, Sun, and Han (2017) encoded three types of commonsense knowledge as inference rules: event narrative knowledge, entity semantic knowledge, and sentiment coherence knowledge. They then learned an attention model which selected appropriate rules for a given question. This inference-based model outperformed several others, including a Deep Structured Semantic Model (Mostafazadeh et al. 2016) and an LSTM-based Recurrent Neural Network (Pichotta and Mooney 2016), on a version of the StoryCloze task. Others (Botschen, Sorokin, and Gurevych 2018) have found that incorporating knowledge from sources like FrameNet (Baker, Fillmore, and Lowe 1998) and Wikidata (Vrandečić and Krötzsch 2014) similarly improves performance on another cloze task (Habernal et al. 2017). While, to the best of our knowledge, such knowledge-based approaches have not been tested on creative tasks, their strong performance on cloze tasks suggests that they may be able to handle the language interpretation component of creative captioning as well.

Story Reasoning/Narrative Understanding

The “creative” part of creative captioning goes beyond traditional image captioning tasks, and their emphasis on veridical image description, to include more sophisticated interpretations of images and phrases. A creative captioning agent must be able to generate multiple alternatives when interpreting images, phrases, or both; and also to consider such interpretations at multiple levels of abstraction.

This class of capabilities is strongly tied to AI research in story reasoning / narrative understanding, as human interpretations of images and phrases often revolve around story-like conceptual constructs, like the “Monty Hall” example in Figure 1.

AI research on narrative reasoning observes that such reasoning often relies on an agent’s prior knowledge base of stories or story prototypes, as well as rich analogical reasoning to build or reason about new stories (Finlayson 2009). There are many open research questions in modeling story structures (Riedl and Young 2006) as well as how to elicit data for training story reasoning systems (Li et al. 2013).

Moreover, creative captioning exemplifies narrative reasoning that bridges both linguistic and visual inputs, requiring at some level unified conceptual representations that can span both modalities. While much AI research in narrative reasoning has relied on linguistic representations, there is also work on such reasoning in images (Cohn 2020; Iyyer et al. 2017).

Social Reasoning

Creative captioning must be creative, but not so much so thatit is uninterpretable; people must be able to understand thereference. This requires sufﬁcient social reasoning to con-sider what connections others are likely to make, what cul-tural references they are likely to be aware of, and how theyare likely to approach creative captioning more broadly. Inthe context of the Dixit game, the Dixit Agent (DA) mustlso reason about the strategies other players are likely topursue.The most similar social reasoning problems to thoseposed by creative captioning and the Dixit game are in othergame domains. For example, Hanabi and Werewolf haveboth been proposed as AI challenges (Bard et al. 2020; To-riumi et al. 2016, respectively) speciﬁcally because they re-quire social reasoning for successful gameplay. In this sec-tion, we focus our discussion on these two games.Hanabi is a cooperative card game, in which players workto construct ordered decks of cards (1-5) according to theircolors (white, yellow, blue, green, red). Players are limitedin both information and in communication: players cannotsee their own cards, they have a limited number of hints togive each other, and those hints can only contain informationabout card color or number—not both. To succeed, playersmust consider not only the information explicitly given in ahint, but also the information implied by it. For example, ifthe red 1 has just been played, and a player is then given thehint that they have a red card, they might infer that the cardis, in fact, a playable red 2.Bard et al. (2020) present several baseline agents for theHanabi challenge. These include both rule-based and learn-ing agents. In a self-play setting, the rule-based agents out-perform the learning agents. However, only the learningagents are tested in ad-hoc teams (i.e., teams of differentagent types) because of the rigidity built into the rule-basedagents. 
None of the agents were tested while playing on mixed teams with humans, and none explicitly took social reasoning into account. Yet, Eger et al. (2020) found that giving agents the ability to reason about the intents of their human teammates led to improved scores on mixed human-AI teams. Similarly, Liang et al. (2019) found that human Hanabi players are more likely to believe they are playing with other people when AI agents perform explicit social reasoning (in this case by considering the possible interpretations of implied information communicated by hints). This is even more likely to be true of Dixit, where metaphorical interpretation of communication is important to game play.

Unlike Hanabi, Werewolf is a competitive game in which winning strategies require some level of deception. At the beginning of gameplay, players are privately assigned roles, corresponding to one of two teams: townspeople and werewolves. (Some roles have special abilities that, for clarity, we do not discuss here.) While each werewolf knows who the other werewolves are, the townspeople do not know any other player’s role.

Each round is separated into a day phase and a night phase. During the day, there is open discussion. Townspeople strategize and attempt to discern who the werewolves are. Werewolves, on the other hand, try to throw the townspeople off their trail. At the end of the day phase, players vote for who they believe to be a werewolf and that person is removed from the game. At night, the werewolves select a townsperson, a victim, to also remove from the game. If there are more werewolves than townspeople at any point, the werewolves win. If, however, all werewolves are voted out, the townspeople win.

The biggest challenge of the Werewolf game is the open conversation during the day, which requires complete conversational AI. To reduce this complexity, Toriumi et al. (2016) restrict both the number and type of utterances allowed for players in their challenge.
Nonetheless, most approaches to Werewolf-playing agents base behavior on the game logs of human players, including transcription of the conversations between them (Hirata et al. 2016; Hancock et al. 2017; Kondoh, Matsumoto, and Mori 2018; Shoji et al. 2019, etc.). These conversations encode the speakers’ social reasoning, often explicitly by lying or calling out presumed lies. We believe that a similar approach (i.e., learning from observation of human players) may be fitting for a DA as well.
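The Werewolf round structure and win conditions described earlier in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it omits the day-phase discussion and the special roles, it takes the vote outcome and the night victim as given inputs, and the function names `winner` and `play_round` are hypothetical.

```python
from typing import Dict, Optional, Tuple


def winner(roles: Dict[str, str]) -> Optional[str]:
    """Apply the win conditions as stated in the text.

    `roles` maps each remaining player to "werewolf" or "townsperson".
    Returns the winning team, or None if the game continues.
    """
    wolves = sum(1 for role in roles.values() if role == "werewolf")
    townspeople = len(roles) - wolves
    if wolves == 0:
        return "townspeople"  # all werewolves have been voted out
    if wolves > townspeople:
        return "werewolves"   # more werewolves than townspeople
    return None


def play_round(
    roles: Dict[str, str], voted_out: str, night_victim: str
) -> Tuple[Dict[str, str], Optional[str]]:
    """One round: the day vote removes a player, then the night removes a victim.

    The vote result and the victim are supplied by the caller (assumption:
    the discussion and voting themselves are out of scope for this sketch).
    """
    remaining = dict(roles)
    del remaining[voted_out]              # end of day phase
    result = winner(remaining)
    if result is not None:
        return remaining, result
    if remaining.get(night_victim) != "townsperson":
        raise ValueError("night victim must be a remaining townsperson")
    del remaining[night_victim]           # night phase
    return remaining, winner(remaining)


roles = {
    "A": "werewolf", "B": "werewolf",
    "C": "townsperson", "D": "townsperson", "E": "townsperson",
}
state, result = play_round(roles, voted_out="C", night_victim="D")
print(result)  # werewolves
```

In this example the townspeople mistakenly vote out one of their own, the werewolves remove another at night, and the two werewolves then outnumber the single remaining townsperson.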

Putting it All Together: Integrated Reasoning

A successful DA needs to have strong abilities in each of the subareas described above. Perhaps more importantly, however, it needs to integrate these abilities into a unified system. This requires being able to not only reason about multiple modalities (i.e., images and natural language) but also to unify multiple reasoning styles (i.e., story understanding and social reasoning).

The systems that are closest to such integration are cognitive architectures (Anderson 2005; Langley and Choi 2006; Laird 2012; Forbus and Hinrichs 2017, etc.). Because they are designed as a single system that performs multiple types of reasoning over multiple modalities, they are well-positioned for a task like creative captioning. However, the state of the art in most of the subproblems needed for creative captioning (e.g., visual perception, natural language processing) has been set by deep learning systems, rather than cognitive architectures. The team behind Watson found that a combination of statistical and knowledge-based approaches was necessary to beat humans at Jeopardy! (Ferrucci et al. 2010). Perhaps integrating approaches from deep learning and cognitive architectures will similarly lead to beating humans at Dixit, as well as success in creative captioning more broadly.

Conclusion

We have identified creative captioning as a novel challenge for AI systems. To be successful at creative captioning, we argue that an agent must, at the very least, integrate visual perception, natural language understanding, and social reasoning. To that end, we have proposed the game Dixit as a domain for creative captioning, and identified intermediate subproblems that must be solved along the path to both successful play in Dixit and creative captioning overall. In the future, we will work toward solving these subproblems. We hope our colleagues will join us.

References

Anderson, J. R. 2005. Human symbol manipulation within an integrated cognitive architecture. Cognitive science, 86–90.

Barbu, A.; Mayo, D.; Alverio, J.; Luo, W.; Wang, C.; Gutfreund, D.; Tenenbaum, J.; and Katz, B. 2019. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, 9453–9463.

Bard, N.; Foerster, J. N.; Chandar, S.; Burch, N.; Lanctot, M.; Song, H. F.; Parisotto, E.; Dumoulin, V.; Moitra, S.; Hughes, E.; et al. 2020. The hanabi challenge: A new frontier for ai research. Artificial Intelligence.

International Journal of Game-Based Learning (IJGBL)

Proceedings of the National Academy of Sciences

Open Library of Humanities, 90–96.

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Cohn, N. 2020. Your brain on comics: A cognitive model of visual narrative comprehension. Topics in cognitive science.

CVPR, 3076–3086.

de Vries, T.; Misra, I.; Wang, C.; and van der Maaten, L. 2019. Does object recognition work for everyone? In CVPR Workshops, 52–59.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

IEEE Transactions on Games.

Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A. A.; Lally, A.; Murdock, J. W.; Nyberg, E.; Prager, J.; et al. 2010. Building Watson: An overview of the DeepQA project. AI magazine.

New Frontiers in Analogy Research

Int. Conf. Communications (COMM), 73–76. IEEE.

Forbus, K. D.; and Hinrichs, T. 2017. Analogy and relational representations in the companion cognitive architecture. AI Magazine.

arXiv preprint arXiv:1708.01425.

Hall, P.; Cai, H.; Wu, Q.; and Corradi, T. 2015. Cross-depiction problem: Recognition and synthesis of photographs and artwork. Computational Visual Media.

FLAIRS, 388–393.

Hirata, Y.; Inaba, M.; Takahashi, K.; Toriumi, F.; Osawa, H.; Katagami, D.; and Shinoda, K. 2016. Werewolf Game Modeling Using Action Probabilities Based on Play Log Analysis. In Plaat, A.; Kosters, W.; and van den Herik, J., eds., Computers and Games, 103–114. Cham: Springer International Publishing. ISBN 978-3-319-50935-8.

Hossain, M. Z.; Sohel, F.; Shiratuddin, M. F.; and Laga, H. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR).

® Board game, a projective mediation for adolescents? Revue de psychotherapie psychanalytique de groupe.

CVPR, 7186–7195.

Kondoh, M.; Matsumoto, K.; and Mori, N. 2018. Development of Agent Predicting Werewolf with Deep Learning. In International Symposium on Distributed Computing and Artificial Intelligence, 18–26. Springer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.

Laird, J. E. 2012. The Soar cognitive architecture. MIT Press.

Langley, P.; and Choi, D. 2006. A unified cognitive architecture for physical agents. In AAAI, volume 21, 1469. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.

Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. 2013. Story Generation with Crowdsourced Plot Graphs. In AAAI. Citeseer.

Li, R. T. 2020. Learning humor through AI: A study on New Yorker’s Cartoon Caption Contests Using Deep Learning.

Liang, C.; Proft, J.; Andersen, E.; and Knepper, R. A. 2019. Implicit communication of actionable information in human-ai teams. In CHI, 1–13.

Liao, Y.; Wang, Y.; Liu, Q.; and Jiang, X. 2019. GPT-based Generation for Classical Chinese Poetry. arXiv preprint arXiv:1907.00151.

Lin, H.; Sun, L.; and Han, X. 2017. Reasoning with heterogeneous knowledge for commonsense machine comprehension. In EMNLP, 2032–2043.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Machajdik, J.; and Hanbury, A. 2010. Affective image classification using features inspired by psychology and art theory.

Ethics in Progress

Journal of Arts Writing by Students

arXiv preprint arXiv:1604.01696.

Novikova, V.; and Beskrovnaya, L. 2015. Smart Edutainment as a Way of Enhancing Students Motivation (on the Example of Board Games). In Smart Education and Smart e-Learning, 69–79. Springer.

O’Dell, J. W.; and Dickson, J. 1984. Eliza as a therapeutic tool. Journal of Clinical Psychology.

Pichotta, K.; and Mooney, R. J. 2016. Learning Statistical Scripts with LSTM Recurrent Neural Networks. In AAAI, 2800–2806.

Prince, R.; and Radev, D. 2017. Humorous Caption Generation for New Yorker Cartoons. Semantic Scholar (202549495).

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

arXiv preprint arXiv:1902.10811.

Riedl, M. O.; and Young, R. M. 2006. From linear story generation to branching story graphs. IEEE Computer Graphics and Applications.

DiGRA, 1–5. Digital Games Research Association.

Roubira, J. 2008. Dixit. [Board Game].

Selvin, S. 1975a. On the Monty Hall problem (Letter to the editor). The American Statistician.

COLING, 155–161.

Shoji, N.; Jotaro, A.; Kosuke, O.; Kotaro, S.; Hideyuki, S.; Tatsunori, M.; and Noriko, K. 2019. Strategies for an Autonomous Agent Playing the Werewolf game as a Stealth Werewolf. In AIWolfDial2019, 20–24.

Smith, T. 2018. The Dixit Method of Language Sampling in Early Adolescence. Masters Thesis, Western Kentucky University.

Sullivan, A.; and Salter, A. 2017. A taxonomy of narrative-centric board and card games, 1–10.

Toriumi, F.; Osawa, H.; Inaba, M.; Katagami, D.; Shinoda, K.; and Matsubara, H. 2016. AI Wolf Contest: Development of Game AI Using Collective Intelligence. In Computer Games, 101–115. Springer.

Vayanou, M.; Ioannidis, Y.; Loumos, G.; and Kargas, A. 2019. How to play storytelling games with masterpieces: From art galleries to hybrid board games. Journal of Computers in Education.

Communications of the ACM

ECCV, 825–841. Springer.

Wetzel, R.; Rodden, T.; and Benford, S. 2017. Developing ideation cards for mixed reality game design. Transactions of the Digital Games Research Association.

arXiv preprint arXiv:1905.07830.

Zhao, X.; and Zhang, S. 2016. A review on facial expression recognition: feature extraction and classification.