Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others
Kanishk Gandhi, Gala Stojnic, Brenden M. Lake, Moira R. Dillon
New York University
Abstract
To achieve human-like common sense about everyday life, machine learning systems must understand and reason about the goals, preferences, and actions of others. Human infants intuitively achieve such common sense by making inferences about the underlying causes of other agents' actions. Directly informed by research on infant cognition, our benchmark BIB challenges machines to achieve generalizable, common-sense reasoning about other agents like human infants do. As in studies on infant cognition, moreover, we use a violation of expectation paradigm in which machines must predict the plausibility of an agent's behavior given a video sequence, making this benchmark appropriate for direct validation with human infants in future studies. We show that recently proposed, deep-learning-based agency reasoning models fail to show infant-like reasoning, leaving BIB an open challenge.
1. Introduction
Humans have a rich capacity to infer the underlying intentions of others by observing their actions. For example, when we watch the simple animations from Heider and Simmel (1944)'s seminal study (see video and Figure 1), we attribute goals and dispositions to simple 2D figures moving around a flat world. Using behavioral experiments presenting both simple and complex visual displays, developmental cognitive scientists have found that even young infants also infer intentionality in the actions of other agents. Infants expect other agents: to have object-based goals (Gergely et al., 1995; Luo, 2011; Song et al., 2005; Woodward, 1998, 1999; Woodward and Sommerville, 2000); to have goals that reflect preferences (Buresh and Woodward, 2007; Kuhlmeier et al., 2003; Repacholi and Gopnik, 1997); to engage in instrumental actions to bring about goals (Carpenter et al., 2005; Elsner et al., 2007; Gerson et al., 2015; Hernik and Csibra, 2015; Saxe et al., 2007; Woodward and Sommerville, 2000); and to act efficiently towards goals (Colomer et al., 2020; Gergely and Csibra, 1997, 2003). This early-emerging reasoning about agents may thus be a critical difference between human and machine intelligence more generally (Lake et al., 2017). Addressing this difference is crucial if machine learning aims to approximate the flexibility of human common sense and reasoning.

Figure 1: A still from Heider and Simmel (1944). In this animation, the large triangle chases the small triangle and the circle, who cooperate to avoid it.
Understanding reasoning about agents has so far received substantially more attention from researchers in cognitive development than in AI. However, recent computational work has aimed to focus on such reasoning by adopting several approaches. Inverse reinforcement learning (Abbeel and Ng, 2004; Ng et al., 2000; Ziebart et al., 2008) and Bayesian approaches (Baker et al., 2011, 2017, 2009; Jara-Ettinger, 2019; Ullman et al., 2009) have modeled other agents as rational, yet noisy, planners. In these models, rationality serves as the tool by which to infer the underlying intentions that best explain an agent's observed behavior.¹ Game theoretic models have aimed to capture an opponent's thought processes in multi-agent interactive scenarios (see survey: Albrecht and Stone, 2018), and learning-based, neural network approaches have focused on learning predictive models of other agents' latent mental states, either through structured architectures that encourage mental-state representations (Rabinowitz et al., 2018) or through the explicit modeling of other agents' mental states using a different agent's forward model (Raileanu et al., 2018).

Despite the increasing sophistication of these computational models, they have not been evaluated or compared using a comprehensive benchmark that captures early emerging human competencies about agents. For example, some existing evaluations have provided fewer than 100 sample episodes (Baker et al., 2011, 2017, 2009), making it infeasible to evaluate learning-based approaches that require substantial training. Other evaluations have used largely the same distribution for both training and test episodes (Rabinowitz et al., 2018), making it difficult to measure how abstract or flexible a model's performance might be. Moreover, existing evaluations have not used or been translatable to the behavioral paradigms that test infant cognition. They therefore cannot be validated with infants, nor can their results be analyzed in terms of the representations and processes that support human performance. AGENT (Shu et al., 2021), a benchmark developed contemporaneously to the one presented here, is inspired by studies with infants and has been validated with behavioral data from adults. Moreover, it challenges machines to reason about the underlying intentions of agents as opposed to their actions. We see AGENT as largely complementary to our efforts, covering a distinct (yet overlapping) set of infant abilities. There are other differences, including the ease of evaluating new models: AGENT involves training on many different leave-out splits, where most splits have relatively minor differences between training and test. In contrast, BIB offers a single canonical split designed to evaluate the abstractness and flexibility of the underlying representations of other agents.

¹ Note that in the cognitive development literature, "theory of mind" typically refers to the attribution of mental states, such as phenomenological or epistemic states (e.g., perceptions or beliefs), to other intentional agents (Premack and Woodruff, 1978). In this paper, we address only one potential component of theory of mind, present from early infancy, which focuses on reasoning about the intentional states, not the phenomenological or epistemic states, of others (Spelke, 2016).
Ultimately, we hope that new models will be evaluated on both benchmarks, further probing their breadth and sensitivity to design choices.

In this paper, we present a comprehensive benchmark, the Baby Intuitions Benchmark (BIB), which is directly inspired by infant cognition. BIB adapts experimental stimuli from research in developmental cognitive science that has captured the abstract nature of infants' reasoning about agents (Baillargeon et al., 2016; Banaji and Gelman, 2013). Moreover, BIB adopts a "violation of expectation" (VOE) paradigm (similar to Riochet et al. (2018); Smith et al. (2019)), commonly used in behavioral research with infants, which both makes its direct validation with infants possible and also makes its results interpretable in terms of human performance. Finally, we design the BIB training and evaluation sets so that they test for flexible, generalizable common-sense reasoning. BIB thus serves as a key step in bridging machines' impoverished understanding of intentionality with humans' rich one.
2. Baby Intuitions Benchmark (BIB)
BIB presents a battery of agency-reasoning tasks, based on findings from developmental cognitive science and adopting its VOE paradigm, to evaluate computational models. We focus on the following five questions: 1) can an AI system represent an agent as having a particular object-based goal? 2) can it bind specific preferences for goal objects to specific agents? 3) can it understand that there may be obstacles that restrict an agent's actions, and that an agent will move to a previously nonpreferred object when their preferred object becomes inaccessible? 4) can it represent an agent's sequence of actions as instrumental, directed towards a higher-order goal object? 5) can it learn that an agent acts efficiently towards a goal object?

We also adopt the VOE paradigm, which involves presenting visual stimuli in two phases, a familiarization phase and a test phase. We refer to the two phases together as an "episode." The familiarization phase includes a succession of eight trials that introduce the main elements of the visual displays used in the test phase. This introduction also allows the observer to form expectations about the future behavior of those elements based on their prior knowledge or learning. The test phase includes an unexpected and an expected outcome, based on what was observed during familiarization. The unexpected outcome is typically perceptually similar to the events in the familiarization, while the expected outcome is typically more perceptually different. So, in order for the outcome to be unexpected, it must be so at the conceptual, rather than perceptual, level. When this paradigm is used with infants, their looking time to each event is measured, and infants tend to look longer at unexpected outcomes, i.e., outcomes that "violate their expectations" (Baillargeon et al., 1985; Oakes, 2010; Turk-Browne et al., 2008).
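For model builders, this episode structure can be summarized in a few lines of code. The sketch below is illustrative only: the class and function names are ours, not part of the released benchmark API, and `surprise` stands for any scalar expectedness measure (such as the prediction-error score defined in Section 4).

```python
from dataclasses import dataclass
from typing import List

Video = List["Frame"]  # a video is a sequence of frames; Frame is left abstract here

@dataclass
class VOEEpisode:
    """One BIB episode: eight familiarization trials plus a paired test outcome."""
    familiarization: List[Video]  # eight trials establishing the agent's behavior
    expected: Video               # test outcome consistent with familiarization
    unexpected: Video             # perceptually similar but conceptually inconsistent

def voe_correct(surprise, episode: VOEEpisode) -> bool:
    """A model passes an episode if it is more surprised by the unexpected outcome.

    `surprise(familiarization, test_video)` is any scalar expectedness measure.
    """
    return (surprise(episode.familiarization, episode.unexpected)
            > surprise(episode.familiarization, episode.expected))
```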
2.1. Preference Task

Infants attribute object-based—as opposed to location-based—goals to agents (Gergely et al., 1995; Luo, 2011; Song et al., 2005; Woodward, 1998, 1999; Woodward and Sommerville, 2000). As illustrated in Figure 2 (left), Woodward (1998, 1999)'s seminal study showed that when 5- and 9-month-old infants saw a hand repeatedly reaching to a ball on the left over a bear on the right, they then looked longer when the hand reached to the left for the bear, even though the direction of the reach was more similar in that event to the events in the previous trials. These results suggest that the infants expected that the hand would reach consistently to a particular goal object as opposed to a particular goal location. Other studies have shown that infants' interpretations are not restricted to reaching events. For example, infants attribute an object-based goal to a 3D box during a live puppet show when that box seemingly exhibits self-propelled motion (Luo, 2011; Luo and Baillargeon, 2005; Shimizu and Johnson, 2004). When shown an agent repeatedly moving to the same object at approximately the same location, do AIs, like infants, infer that the agent's goal is the object and not the location?

Figure 2: Evaluation of whether machines can represent preferences of agents. Inspired by Woodward (1998)'s original study with infants (left), our version of the task is rendered in both 2D (middle) and 3D (right). (a) Familiarization (8 trials); (b) Test: Expected; (c) Test: Unexpected. The familiarization trials establish the preference of the agent.
Familiarization Trials.
The familiarization shows an agent repeatedly moving towards a specific object in a world with two objects (Figure 2a, right). The agent's starting position is fixed across trials, and the locations of the objects are correlated with their identities such that the preferred object and nonpreferred object appear in generally the same location across trials (see appendix Figures 11 and 12).
Test Trials.
The test uses two object locations that had been used during one familiarization trial, but the identity of the objects at those locations has been switched. In the expected outcome (Figure 2b, right), the agent moves to the object that had been their goal during the familiarization, i.e., their preferred object, but the trajectory of their motion and the location of that object are different from familiarization. In contrast, in the unexpected outcome (Figure 2c), the agent moves to the nonpreferred object, but the trajectory of their motion and the location they move to are the same as familiarization. The model is successful if it expects the agent to go to the preferred object in a different location.
2.2. Multi-Agent Task

Infants are capable of attributing specific preferences to specific agents (Buresh and Woodward, 2007; Henderson and Woodward, 2012; Kuhlmeier et al., 2003; Repacholi and Gopnik, 1997). For example, while 9- and 13-month-old infants looked longer at test when an actor reached for a toy that they did not prefer during habituation, infants showed no expectations when the habituation and test trials featured different actors (Buresh and Woodward, 2007). When shown one agent repeatedly moving to the same object, do AIs, like infants, expect that that object is preferred by that specific agent?
Familiarization Trials.
The familiarization shows an agent consistently choosing one object over the other, as above, but the objects appear at widely varying locations in the grid world.
Test Trials.
The test includes two possible scenarios. One scenario presents an expected outcome, in which the familiar agent goes to the object it prefers, and another outcome, in which a new, unfamiliar agent goes to the object preferred by the familiar agent. While the latter outcome is not necessarily unexpected, the familiar agent going to the preferred object should be more expected given the familiarization (appendix Figure 14). The second scenario presents an unexpected outcome, in which the familiar agent goes to the nonpreferred object, and another outcome, in which the new agent goes to the object not preferred by the familiar agent. Here, the familiar agent going to the nonpreferred object should be more unexpected (Figure 3). The model is successful if it has weak or no expectations about the preferences of the new agent.
2.3. Inaccessible Goal Task

Infants understand the principle of solidity (e.g., that solid objects cannot pass through one another), and they apply this principle both to inanimate entities (Baillargeon, 1987; Baillargeon et al., 1992; Spelke et al., 1992) and to animate entities, such as human hands (Luo et al., 2009; Saxe et al., 2006). Infants' expectations about the objects agents might approach are also informed by object accessibility. Scott and Baillargeon (2013) demonstrate, for example, that 16-month-old infants expected an agent, facing two identical objects, to reach for the one in the container without a lid versus the one in the container with a lid.
Familiarization Trials.
The familiarization shows an agent consistently choosing one object over the other, as above, and the objects appear at widely varying locations in the grid world (Figure 4).
Test Trials.
The test presents two new object locations. In the expected outcome, the preferred object is now inaccessible, blocked on all sides by the fixed, black barriers, and the agent moves to the nonpreferred object. In the unexpected outcome, both of the objects remain accessible, and the agent moves to the nonpreferred object (Figure 4). The model is successful if it expects the agent to move to the nonpreferred object only when the preferred object is inaccessible.
Infants represent anagent’s sequence of actions as instrumental to achievinga higher-order goal (Carpenter et al., 2005; Elsner et al.,2007; Gerson et al., 2015; Hernik and Csibra, 2015; Saxeet al., 2007; Sommerville and Woodward, 2005; Wood-ward and Sommerville, 2000). For example, Sommervilleand Woodward (2005) showed that 12-month-old infantsunderstand an actor’s pulling a cloth as a means to get-ting the otherwise out-of-reach object placed on it. Whenshown an agent repeatedly taking the same action toeffect a change in the environment that enables them to (a) Familiarization (8 trials) (b) Test: No Expectation (c) Test: Unexpected
Figure 3: Evaluation of whether machines can bind specific goals tospecific agents. The familiarization trials establish the preferenceof the agent. (a) Familiarization (8 trials) (b) Test: Expected (c) Test: Unexpected
Figure 4: Evaluation of whether machines can understand thatobstacles restrict actions. The familiarization trials establish thepreference of the agent. move towards an object, do AIs, like infants, expect thatthat object is the goal, as opposed to the sequence ofactions?
Familiarization Trials.
The familiarization includes five main elements: an agent; a goal object; a key; a lock; and a green removable barrier (see Figure 5). The green barrier initially restricts the agent's access to the object. And so, the agent removes the barrier by collecting and then inserting the key into the lock. The agent then moves to the object.
Figure 5: The three types of trials that test machines' understanding of an agent's actions towards a higher-order goal: (a) no barriers; (b) inconsequential barriers; (c) blocking barriers. Each type shows familiarization (8 trials), an expected test outcome, and an unexpected test outcome. The goal is initially inaccessible (blocked by a green removable barrier). During familiarization, the agent removes the barrier by retrieving the key (triangle) and inserting it into the lock.

Figure 6: Inspired by Gergely et al. (1995) (left), we ask whether machines expect that agents move efficiently towards goal objects. (a) Familiarization (8 trials); (b) Test: Expected; (c) Test: Unexpected. At test, the agent moves along one of the same paths they moved along during familiarization, but unlike familiarization, there is no barrier between the agent and the object. So, this inefficient action is unexpected.
Test Trials.
The test includes three possible scenarios. One scenario presents no green barrier. In the expected outcome, the agent moves directly to the object, while in the unexpected outcome the agent moves to the key (Figure 5a). The second scenario presents a green barrier, but it does not restrict the agent's access to the object. In the expected outcome, the agent moves directly to the object, while in the unexpected outcome the agent moves to the key (Figure 5b). The third scenario presents an expected outcome, in which the barrier restricts the agent's access to the object and the agent moves to the key. In the unexpected outcome, the barrier does not block the object and the agent goes to the key (Figure 5c). Including these three scenarios allows us to test for simple heuristics that models might use to solve these tasks, as the sketch after this paragraph makes explicit. If the model uses the heuristic that the key should be visited first and then the object, it will fail on the no-barrier and inconsequential-barrier scenarios. If the model uses the heuristic that the key should be visited only when a removable barrier is present, then it will fail on the inconsequential-barrier scenario. Finally, the heuristic of always going to the object directly will fail on the blocking-barrier scenario. The model is successful if it expects the agent to go to the key only when the removable barrier is blocking that object.
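The heuristic analysis above can be made explicit in a few lines. This is a minimal sketch in which the scenario descriptions and heuristic names are our own encoding of the text, not part of the benchmark:

```python
# Each scenario is described by whether a removable barrier is present and
# whether it actually blocks the goal; the correct behavior is to visit the
# key only when the barrier blocks the goal.
scenarios = {
    "no_barrier":              {"barrier": False, "blocks_goal": False},
    "inconsequential_barrier": {"barrier": True,  "blocks_goal": False},
    "blocking_barrier":        {"barrier": True,  "blocks_goal": True},
}

# Candidate heuristics map a scenario to a prediction: visit the key or not.
heuristics = {
    "always_key_first":    lambda s: True,
    "key_if_barrier":      lambda s: s["barrier"],
    "always_go_to_object": lambda s: False,
    "key_if_blocking":     lambda s: s["blocks_goal"],  # the correct rule
}

for name, rule in heuristics.items():
    failures = [sc for sc, s in scenarios.items() if rule(s) != s["blocks_goal"]]
    print(f"{name}: fails on {failures or 'none'}")
```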
2.5. Efficient Action Task

Infants expect agents to move efficiently towards their goals (Baillargeon et al., 2015; Colomer et al., 2020; Gergely and Csibra, 1997, 2003; Gergely et al., 1995; Liu et al., 2019, 2017). In a seminal study by Gergely et al. (1995), for example, 12-month-old infants repeatedly saw a small circle jumping over an obstacle to get to a big circle (see Figure 6, left). At test, the obstacle was removed, and the small circle either performed the same, now inefficient, action to get to the big circle or performed the straight, now efficient, action. Infants were surprised when the agent performed the familiar but inefficient action. These findings have been replicated by instantiating the agent and object in different ways (as, e.g., humans, geometric shapes, or puppets) and by using different kinds of presentations (e.g., prerecorded or live) (Colomer et al., 2020; Liu et al., 2017; Phillips and Wellman, 2005; Sodian et al., 2004; Southgate et al., 2008). When infants see an irrational agent, i.e., one moving inefficiently to their goal from the start, however, they do not form any expectations about that agent's efficient action at test (Gergely et al., 1995; Liu and Spelke, 2017). When shown a rational agent repeatedly taking an efficient path around a barrier to its goal object, do AIs, like infants, expect that that agent will continue to take efficient paths, as opposed to similar-looking paths, once that barrier is removed?
Familiarization Trials.
The familiarization includes two different scenarios. In one scenario, a rational agent consistently moves along an efficient path to its goal object around a fixed, black barrier in the grid world (Figure 6a). In the other scenario, an irrational agent moves along these same paths, but there is no barrier in the way. So in this latter scenario, the irrational agent is acting inefficiently from the start (Figure 7).
Test Trials.
The test includes two possible scenarios. One scenario shows only the rational, efficient agent during familiarization, and at test, it presents one of the familiarization trials but with the barrier between the agent and the goal object removed. In the expected outcome, the agent moves along a straight, efficient path to its goal. In the unexpected outcome, the agent moves along a familiar but now inefficient path, matched to familiarization either in the path taken (path control) or in the time taken to reach the goal (time control; see appendix Figure 13). The second scenario shows the irrational agent during familiarization. Because this agent acted inefficiently from the start, a successful model should form weak or no expectations about its path at test.

Figure 7: Inspired by Gergely et al. (1995), we ask whether machines expect either rational or irrational agents to move efficiently towards their goals. (a) Familiarization: Irrational; (b) Familiarization: Rational; (c) Test: Inefficient.
Stimuli. Inspired by Heider and Simmel (1944), the primary set of visual stimuli presents "grid-world" animations, shown from an overhead perspective and populated with simple shapes that take on different roles (e.g., "agents", "objects", "tools"), and we assume the environment is fully observable to the agent (i.e., the agent can see over the walls) and the observer. We chose this type of environment as particularly suitable for testing AIs (e.g., Baker et al., 2017; Rabinowitz et al., 2018) because it allows for procedural generation of a large number of episodes, and the simple visuals focus the problem on reasoning about agents.

For each of the five evaluation tasks, we generated 1,000 episodes, each with one expected and one unexpected outcome (2,000 videos), by sampling the locations of barriers, agents, and objects in the 10 × 10 grid. The locations are controlled to account for the distances and obstacles between the agent and the objects so that, e.g., preferred objects are not consistently closer to or farther from agents (see the sketch after Figure 8). We provide two evaluation sets, one with objects and agents seen during background training and the other with new shapes for the objects and agents. Finally, as a means to vary the perceptual difficulty of the benchmark, we also include 3D versions of the stimuli rendered to match the 2D versions and presented from a three-quarters point of view (Figure 2). The 2D stimuli (except for the instrumental action tasks) are directly translated to 3D using the AI2THOR (Kolve et al., 2019) framework. For both 2D and 3D videos, we provide scene configuration files describing the objects and agents present in the scene.

Figure 8: The four tasks from the background training set: (a) single object; (b) preference; (c) instrumental action; (d) multi-agent. Only the test trials are shown here.
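As a rough illustration of the distance control described above, the following sketch samples a layout and rejects it when one object is systematically closer to the agent than the other. The rejection threshold and helper names are our own assumptions, not the benchmark's released generator.

```python
import random

GRID = 10  # the 10 x 10 grid world

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def sample_episode_layout(max_distance_gap=2):
    """Sample agent and object positions, rejecting layouts where one object
    is much closer to the agent than the other."""
    while True:
        cells = [(x, y) for x in range(GRID) for y in range(GRID)]
        agent, preferred, nonpreferred = random.sample(cells, 3)
        # Control distances so preference cannot be explained by proximity.
        gap = abs(manhattan(agent, preferred) - manhattan(agent, nonpreferred))
        if gap <= max_distance_gap:
            return {"agent": agent, "preferred": preferred,
                    "nonpreferred": nonpreferred}

print(sample_episode_layout())
```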
3. Background Training
We provide a set of background training tasks for the models to learn about agents and objects in our grid worlds and the structure of the trials. Although we provide a training set, we do not intend to limit models to just these data prior to being tested. Additional out-of-distribution training data is allowed, just as infants get varied experience with agents in the real world. Importantly, when participating in a lab study, infants can make meaningful inferences about novel stimuli/environments with only a relatively brief familiarization phase. We include tens of thousands of background episodes as a generous stand-in for this type of in-lab familiarization, so AI systems are not surprised merely by the various elements and dynamics used in the evaluation. Although learning-centric approaches will learn something about other agents if trained on the background set, we do not intend it to be sufficient for acquiring genuine, abstract agent representations. We intend that either supplemental pretraining or additional prior knowledge can be enriched by the background training to approach the benchmark successfully. The episodes in the background training are structured similarly to those in the evaluation, although the familiarization and test trials are now drawn from the same distribution within each episode. Similar to IntPhys (Riochet et al., 2018) and ADEPT (Smith et al., 2019), we only provide the expected outcomes during training. There are four training tasks:
Single Object Task.
The agent navigates to an object at some varied location in the scene (Figure 8a). This task is different from the evaluation task in that it presents only a single object. With this training, models can learn how agents start and end trials, how agents move, and how barriers influence agent motion. We provide 10,000 episodes of this type.

No-Navigation Preference Task.
Two objects are located very close to the agent's starting location, and the agent approaches one object consistently across trials (Figure 8b). The task allows the model to learn that agents have preferences. Critically, the navigation in these trials is trivial compared to the evaluation trials, so navigation to goal objects is not trained. We provide 10,000 episodes of this type.
No-Preference, Multiple-Agent Task.
One object is located very close to the agent's initial starting location (Figure 8d). At some point during the episode, a new agent takes the initial agent's place (for example, the initial agent could be replaced at the fourth trial, and all subsequent trials would have the new agent). The task allows the model to learn that multiple agents can appear across trials, but this task differs from the evaluations, in which the new agent appears only in the test trials. We provide 4,000 episodes of this type.
Agent-Blocked Instrumental Action Task.
The agent starts confined to a small region of the grid world, blocked by a removable green barrier (Figure 8c). The agent collects a key and inserts it into a lock to make the barrier disappear. The agent then navigates to the object. This task allows the model to learn that the green barrier obstructs navigation and how the key and lock remove that barrier. These trials differ from the evaluation in that the removable barriers are around the agent instead of the object. We provide 4,000 episodes of this type.

To be successful at the evaluations, models must acquire or enrich their representations of agents for flexible and systematic generalization. For example, models have to combine acquired knowledge of navigation (Single Object Task) and agent preferences (No-Navigation Preference Task) to be successful at the first evaluation, which tests the underlying preferences guiding agents' goal-directed actions (section 2.1).
4. Baseline Models
The baseline models are variants of a state-of-the-art, neural-network approach to reasoning about agents: the theory of mind net (ToMnet) model in Rabinowitz et al. (2018). These models are trained passively and through observation only. We use a self-supervised learning setup where the objective is to predict the future actions of the agent. During evaluation, the expectedness of a test trial, in the context of the previous familiarization trials, is defined by its error on the most 'unexpected' video frame (the frame with the highest prediction error).
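Concretely, this scoring rule reduces to taking the maximum per-frame prediction error over a test trial. A minimal sketch, assuming a `model` that returns a predicted next frame given the familiarization context (the function signature is illustrative):

```python
import torch

def surprise_score(model, familiarization, test_frames):
    """Expectedness of a test trial = error on the most 'unexpected' frame.

    `familiarization` conditions the model (e.g., via a characteristic
    embedding); `test_frames` is a (T, C, H, W) tensor of video frames.
    """
    errors = []
    with torch.no_grad():
        for t in range(len(test_frames) - 1):
            pred = model(familiarization, test_frames[t])       # predicted next frame
            err = torch.mean((pred - test_frames[t + 1]) ** 2)  # per-frame MSE
            errors.append(err.item())
    return max(errors)  # the single most surprising transition
```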
Figure 9: Architecture of the video baseline model inspired by Rabinowitz et al. (2018). An agent-characteristic embedding is inferred from the familiarization trials using a recurrent net. This embedding, with the state at test time, is used to predict the next frame of the video using a U-Net (Ronneberger et al., 2015).
We test two baseline models (see appendix B for full model specifications), one that operates directly on the videos and another that operates on mask representations of the elements (i.e., individual elements, such as agents and objects, in a scene are split into different channels). The objective of the mask model (see appendix Figure 17) is to predict the trajectory of the agent in the test trial (see appendix B.1).

The video model (see Figure 9) operates on videos sampled at 3 fps and resized to 64 × 64. Each frame in each familiarization trial is encoded using a convolutional neural network. The frame embeddings in a trial are passed to a bidirectional LSTM. The last output embedding of the LSTM represents the characteristic of the agent in the trial. These embeddings are averaged across familiarization to obtain a characteristic embedding for an agent. The characteristic embedding is tiled to a 64 × 64 spatial resolution, concatenated to a frame from the test trial, and passed through a U-Net to predict the next frame in the trial. A mean squared error loss is used to train the network.
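A minimal PyTorch sketch of this pipeline is given below, with a small convolutional stack standing in for the full U-Net and illustrative layer widths; the released baseline's exact specification (appendix B.2) may differ.

```python
import torch
import torch.nn as nn

class VideoToMnetSketch(nn.Module):
    """Frame encoder -> BiLSTM over frames -> average over trials -> tile -> decoder."""

    def __init__(self, embed_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(  # per-frame CNN encoder, 64x64 RGB in
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32, embed_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * embed_dim, embed_dim)
        self.decoder = nn.Sequential(  # stand-in for the U-Net next-frame predictor
            nn.Conv2d(3 + embed_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def characterize(self, trials):
        # trials: list of (T, 3, 64, 64) tensors, one per familiarization trial
        embs = []
        for frames in trials:
            feats = self.encoder(frames).unsqueeze(0)  # (1, T, 32)
            out, _ = self.lstm(feats)
            embs.append(self.proj(out[:, -1]))         # last LSTM output per trial
        return torch.stack(embs).mean(dim=0)           # average across the 8 trials

    def forward(self, trials, test_frame):
        char = self.characterize(trials)                        # (1, embed_dim)
        tiled = char[:, :, None, None].expand(-1, -1, 64, 64)   # tile spatially
        x = torch.cat([test_frame.unsqueeze(0), tiled], dim=1)  # concat with frame
        return self.decoder(x).squeeze(0)                       # predicted next frame

model = VideoToMnetSketch()
trials = [torch.rand(10, 3, 64, 64) for _ in range(8)]
pred = model(trials, torch.rand(3, 64, 64))
loss = nn.functional.mse_loss(pred, torch.rand(3, 64, 64))  # trained with MSE
```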
5. Results
The models were trained on 80% of the background training episodes (training set), and the rest of the episodes were used for validation (validation set). A comparison of the MSE loss on the training and validation sets and a qualitative evaluation of the video model's performance indicate that it learned the training tasks successfully (see appendix Figure 15).
BIB Agency Task                        | Mask        | Video       | Video (New Shapes) | 3D Video
                                       | Rel.  Abs.  | Rel.  Abs.  | Rel.  Abs.         | Rel.  Abs.
Preference                             | 69.0  69.0  | 47.8  47.6  | 47.4  47.8         | 49.2  48.3
Multi-Agent                            | 50.0  49.8  | 50.3  50.3  | 50.0  51.5         | 50.0  51.0
Inaccessible Goal                      | 50.7  52.4  | 66.0  61.4  | 61.7  60.9         | 40.0  43.2
Efficiency: Path control               | 95.6  94.3  | 99.8  92.0  | 98.5  92.1         | 66.3  57.9
Efficiency: Time control               | 94.8  91.4  | 99.9  90.1  | 96.9  90.3         | 75.4  61.8
Efficiency: Irrational agent           | 50.0  50.0  | 50.0  50.0  | 47.8  49.5         | 50.0  50.0
Efficient Action Average               | 72.6  69.9  | 74.9  70.3  | 72.7  70.0         | 62.9  55.0
Instrumental: No barrier               | 98.2  98.4  | 99.7  94.0  | 93.0  88.1         | -     -
Instrumental: Inconsequential barrier  | 89.5  83.0  | 76.7  57.8  | 66.0  56.0         | -     -
Instrumental: Blocking barrier         | 77.3  56.2  | 58.2  57.5  | 59.7  58.0         | -     -
Instrumental Action Average            | 85.6  71.8  | 73.0  56.9  | 69.6  55.8         | -     -
Table 1: Performance of the baseline models on BIB. Scores are shown for the mask model on 2D videos and for the video model on 2D videos, on 2D videos with new elements, and on 3D videos. Relative accuracy (Rel.) scores quantify pairwise VOE judgements. Absolute (Abs.) scores quantify VOE judgements on each video independently, requiring the prediction error to be lower on the expected videos; the absolute score is the Area Under the ROC Curve (AUC), where the true positive rate is plotted against the false positive rate for different threshold values.

For each evaluation episode, we first calculated the model's relative accuracy, i.e., whether the model found the expected video in each pair more expected than the unexpected video (chance is 50%). We also calculated the model's absolute score, i.e., the model's prediction of each video's plausibility independent of the pairing. This is measured by the AUC, which plots the true positive rate against the false positive rate for different threshold values.

The results of our baseline models are presented in Table 1. The video model performs at chance on the Preference Task (see Figure 10a for predictions made by the video model); it tends to predict that an agent will go to the closer object (this prediction is made in about 70% of trials). The model thus neglects the agent's preference, established during familiarization. This is particularly striking because the model does take into account the familiarization phase when succeeding in the No-Navigation Preference Task in the background training.

The video model also fails on the Multi-Agent Task, again tending to predict that an agent will go to the closer object regardless of any established preferences. Consistent with this failure, the model also fails to map specific preferences to specific agents.

This model does slightly better than chance on the Inaccessible Goal Task. As seen in Figure 10b, it nevertheless frequently predicts that the agent will go to the inaccessible goal.

The video model is proficient at finding the shortest path to the goal in the Efficiency Task (appendix Figure 19a), leading to high accuracy on both sub-evaluations that test for efficient action: Path Control and Time Control (Table 1).
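The two scores reported in Table 1 can be computed directly from per-video surprise scores. A sketch using scikit-learn, with toy values and variable names of our own choosing:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# surprise_expected[i], surprise_unexpected[i]: surprise scores for the i-th
# episode's paired expected and unexpected test videos (toy values below).
surprise_expected = np.array([0.10, 0.20, 0.15])
surprise_unexpected = np.array([0.30, 0.18, 0.40])

# Relative accuracy: fraction of pairs where the unexpected video is more surprising.
relative_accuracy = np.mean(surprise_unexpected > surprise_expected)

# Absolute score: AUC treating "unexpected" as the positive class, with the
# surprise score as the decision statistic for each video independently.
labels = np.concatenate([np.zeros_like(surprise_expected),
                         np.ones_like(surprise_unexpected)])
scores = np.concatenate([surprise_expected, surprise_unexpected])
absolute_score = roc_auc_score(labels, scores)

print(relative_accuracy, absolute_score)
```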
(a) Preference Task: The model predicts that the brown agent will go to the green object instead of to the established preference, the grey object.

(b) Inaccessible Goal Task: The model predicts that the blue agent will head to the inaccessible cyan object.

(c) Instrumental Action Task (c): The model predicts that the blue agent will go directly to the inaccessible orange goal object instead of performing the instrumental action by first collecting the triangular key.

Figure 10: The most surprising frame (the frame with the highest prediction error) from the test trial for the video model, taken from the evaluation tasks. Each row shows the input frame, the model prediction, and the target frame. Failure cases are shown here.
However, the model fails to modulate its predictions based on whether the agent was rational or irrational during familiarization (Table 1).

Finally, the video model performs above chance on the Instrumental Action Task, but performance on the sub-evaluations (Table 1) indicates that it relies on the simple heuristic of going directly to the goal object rather than understanding the nature of the instrumental action (Figure 10c). This leads to higher scores on sub-evaluations with no barrier and an inconsequential barrier (Table 1) but lower ones on the sub-evaluation with a blocking barrier. This poor performance may be due to the difference between the agent and barrier configurations in the background training (where the agent is confined; Figure 8c) and the evaluation (where the object is confined; Figure 5).

The mask model shows similar performance to the video model across the tasks (see appendix B for a detailed analysis). Moreover, when we replace the elements in the evaluation set with new ones, the video model's scores fall slightly, but the trends remain the same (Table 1). Finally, the video model performs similarly on the 3D videos of the tasks, although performance is generally worse overall with 3D videos. This is likely because perceiving the trajectories of agents in 3D is more difficult for a predictive model in pixel space; predictive networks trained with MSE find it challenging to model trajectories in depth.
6. General Discussion
In this paper we introduced the Baby Intuitions Benchmark (BIB), which tests machines on their ability to reason about the underlying intentionality of other agents by observing only agents' actions. BIB is directly inspired by the abstract reasoning about agents that emerges early in human development, as revealed by behavioral studies with infants. BIB's adoption of the VOE paradigm, moreover, means its results can be interpreted in terms of human performance and makes it appropriate for direct validation with human infants in future studies.

While baseline, deep-learning models successfully generalize to BIB's training tasks, they fail to systematically generalize to the evaluation tasks, even though the models incorporate theory-of-mind-inspired architectures (Rabinowitz et al., 2018). In particular, the baseline models performed at about chance when required to reason that agents have preferred goal objects, that preferences are tied to specific agents, and that goal objects can be physically inaccessible. When presented with instrumental actions, moreover, the models succeeded only by relying on a simple heuristic of going directly to the goal object, rather than on a more sophisticated understanding of an agent's sequence of actions. Finally, the models failed to modulate their predictions about efficient action for irrational versus rational agents. These results suggest that state-of-the-art AI models do not have a common-sense understanding of agents the way human infants do.

BIB is rooted in the findings and methods of developmental cognitive science, but there are still critical differences between its stimuli and the stimuli used with infants, and its particular tasks have not yet been validated with infants. First, while the simplicity of the grid-world environment makes it ideal for procedural generation to test AIs, such displays may not be compelling enough to engage infants' intuitions about agents, and overhead, object-directed navigation events may not be the most intuitive context in which to engage infants' representations of other agents (in contrast to, e.g., perspectival reaching events). Can infants reason about agents' actions when viewing them from an overhead perspective? Can infants recognize simple shapes with simple movements and minimal cues to animacy (e.g., no eyes/gaze direction, no distinctive sounds, and no emotional expressions) as agents with intentionality? Most of the existing infant literature off of which BIB is based presents infants with richer cues to animacy, in the form of live-action or animated displays from a frontal or three-quarters point of view. Second, some of the variability introduced across the evaluation videos may make it difficult for infants to track and stably represent the different elements. For example, the location of the preferred object varies greatly during the familiarization phase in the evaluation that links specific agents to specific preferences. No study with infants, to our knowledge, has shown that infants succeed in predicting an agent's goal-directed actions under these conditions. Third, some inferences about agents included in this benchmark are yet to be tested with infants. For example, no study to our knowledge has examined whether infants expect agents to move towards a nonpreferred object, versus not move at all, when a preferred object is inaccessible.
And, no study has examined whether infants expect a goal object in a two-alternative forced-choice scenario to generalize across agents when infants are familiarized to both agents moving to the same object when there is only that one object present. Finally, the "extended familiarization" needed for training AI models (i.e., the background training) reveals a striking difference between how BIB might challenge minds versus machines. While both infants and AIs may have built-in knowledge and/or pretraining (e.g., from infants' everyday experience or from AIs' simulated experience), infants may need to watch only eight, as opposed to thousands, of videos of shapes moving around grid worlds to successfully apply their reasoning about agents to new, test events presented in that medium.

The origins and development of humans' intuitive understanding of agents and their intentional actions have been studied extensively in developmental cognitive science. The representations and computations underlying such understanding, however, are not yet understood. BIB serves as a test for computational models with different priors and learning-based approaches to achieve the common-sense reasoning about agents that human infants have. A computational description of how we reason about agents could ultimately help us build machines that better understand us and that we better understand.

Finally, BIB serves as a key step in bridging machines' impoverished understanding of intentionality with humans' rich one, since intentionality is one key component to understanding and reasoning about others in terms of their underlying mental states, including their beliefs and desires. A benchmark that focuses on reasoning about agents' intentional states, as well as their phenomenological and epistemic states, such as false beliefs (a litmus test of human theory of mind; e.g., Baron-Cohen et al. (1985); Leslie (1987)), is thus a natural extension of BIB and could further advance our understanding of both human and artificial intelligence.
Acknowledgements
This work was supported by the DARPA Machine Common Sense program (HR001119S0005). We thank Victoria Romero, Koleen McCrink, David Moore, Lisa Oakes, Clark Dorman, and Amir Tamrakar for their generous feedback. We are especially grateful to Thomas Schellenberg, Dean Wetherby, and Brian Pippin for their development effort in porting the benchmark to 3D.
References
Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, page 1.

Albrecht, S. V. and Stone, P. (2018). Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95.

Baillargeon, R. (1987). Object permanence in 3½- and 4½-month-old infants. Developmental Psychology, 23(5):655.

Baillargeon, R., Needham, A., and DeVos, J. (1992). The development of young infants' intuitions about support. Early Development and Parenting, 1(2):69–78.

Baillargeon, R., Scott, R. M., and Bian, L. (2016). Psychological reasoning in infancy. Annual Review of Psychology, 67:159–186.

Baillargeon, R., Scott, R. M., He, Z., Sloane, S., Setoh, P., Jin, K.-s., Wu, D., and Bian, L. (2015). Psychological and sociomoral reasoning in infancy. American Psychological Association.

Baillargeon, R., Spelke, E. S., and Wasserman, S. (1985). Object permanence in five-month-old infants. Cognition, 20(3):191–208.

Baker, C., Saxe, R., and Tenenbaum, J. (2011). Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.

Baker, C. L., Jara-Ettinger, J., Saxe, R., and Tenenbaum, J. B. (2017). Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):1–10.

Baker, C. L., Saxe, R., and Tenenbaum, J. B. (2009). Action understanding as inverse planning. Cognition, 113(3):329–349.

Banaji, M. R. and Gelman, S. A. (2013). Navigating the Social World: What Infants, Children, and Other Species Can Teach Us. Oxford University Press.

Baron-Cohen, S., Leslie, A. M., and Frith, U. (1985). Does the autistic child have a "theory of mind"? Cognition, 21(1):37–46.

Buresh, J. S. and Woodward, A. L. (2007). Infants track action goals within and across agents. Cognition, 104(2):287–314.

Carpenter, M., Call, J., and Tomasello, M. (2005). Twelve- and 18-month-olds copy actions in terms of goals. Developmental Science, 8(1):F13–F20.

Colomer, M., Bas, J., and Sebastian-Galles, N. (2020). Efficiency as a principle for social preferences in infancy. Journal of Experimental Child Psychology, 194:104823.

Elsner, B., Hauf, P., and Aschersleben, G. (2007). Imitating step by step: A detailed analysis of 9- to 15-month-olds' reproduction of a three-step action sequence. Infant Behavior and Development, 30(2):325–335.

Gergely, G. and Csibra, G. (1997). Teleological reasoning in infancy: The infant's naive theory of rational action: A reply to Premack and Premack. Cognition, 63(2):227–233.

Gergely, G. and Csibra, G. (2003). Teleological reasoning in infancy: The naïve theory of rational action. Trends in Cognitive Sciences, 7(7):287–292.

Gergely, G., Nádasdy, Z., Csibra, G., and Bíró, S. (1995). Taking the intentional stance at 12 months of age. Cognition, 56(2):165–193.

Gerson, S. A., Mahajan, N., Sommerville, J. A., Matz, L., and Woodward, A. L. (2015). Shifting goals: Effects of active and observational experience on infants' understanding of higher order goals. Frontiers in Psychology, 6:310.

Heider, F. and Simmel, M. (1944). An experimental study of apparent behavior. The American Journal of Psychology, 57(2):243–259.

Henderson, A. M. and Woodward, A. L. (2012). Nine-month-old infants generalize object labels, but not object preferences across individuals. Developmental Science, 15(5):641–652.

Hernik, M. and Csibra, G. (2015). Infants learn enduring functions of novel tools from action demonstrations. Journal of Experimental Child Psychology, 130:176–192.

Jara-Ettinger, J. (2019). Theory of mind as inverse reinforcement learning. Current Opinion in Behavioral Sciences, 29:105–110.

Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., and Farhadi, A. (2019). AI2-THOR: An interactive 3D environment for visual AI.

Kuhlmeier, V., Wynn, K., and Bloom, P. (2003). Attribution of dispositional states by 12-month-olds. Psychological Science, 14(5):402–408.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Leslie, A. M. (1987). Pretense and representation: The origins of "theory of mind." Psychological Review, 94(4):412.

Liu, S., Brooks, N. B., and Spelke, E. S. (2019). Origins of the concepts cause, cost, and goal in prereaching infants. Proceedings of the National Academy of Sciences, 116(36):17747–17752.

Liu, S. and Spelke, E. S. (2017). Six-month-old infants expect agents to minimize the cost of their actions. Cognition, 160:35–42.

Liu, S., Ullman, T. D., Tenenbaum, J. B., and Spelke, E. S. (2017). Ten-month-old infants infer the value of goals from the costs of actions. Science, 358(6366):1038–1041.

Luo, Y. (2011). Three-month-old infants attribute goals to a non-human agent. Developmental Science, 14(2):453–460.

Luo, Y. and Baillargeon, R. (2005). Can a self-propelled box have a goal? Psychological reasoning in 5-month-old infants. Psychological Science, 16(8):601–608.

Luo, Y., Kaufman, L., and Baillargeon, R. (2009). Young infants' reasoning about physical events involving inert and self-propelled objects. Cognitive Psychology, 58(4):441–486.

Ng, A. Y., Russell, S. J., et al. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, volume 1, page 2.

Oakes, L. M. (2010). Using habituation of looking time to assess mental processes in infancy. Journal of Cognition and Development, 11(3):255–268.

Phillips, A. T. and Wellman, H. M. (2005). Infants' understanding of object-directed action. Cognition, 98(2):137–155.

Premack, D. and Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4):515–526.

Rabinowitz, N., Perbet, F., Song, F., Zhang, C., Eslami, S. M. A., and Botvinick, M. (2018). Machine theory of mind. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4218–4227, Stockholm, Sweden. PMLR.

Raileanu, R., Denton, E., Szlam, A., and Fergus, R. (2018). Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640.

Repacholi, B. M. and Gopnik, A. (1997). Early reasoning about desires: Evidence from 14- and 18-month-olds. Developmental Psychology, 33(1):12.

Riochet, R., Castro, M. Y., Bernard, M., Lerer, A., Fergus, R., Izard, V., and Dupoux, E. (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. CoRR, abs/1803.07616.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer.

Saxe, R., Tzelnic, T., and Carey, S. (2006). Five-month-old infants know humans are solid, like inanimate objects. Cognition, 101(1):B1–B8.

Saxe, R., Tzelnic, T., and Carey, S. (2007). Knowing who dunnit: Infants identify the causal agent in an unseen causal interaction. Developmental Psychology, 43(1):149.

Scott, R. M. and Baillargeon, R. (2013). Do infants really expect agents to act efficiently? A critical test of the rationality principle. Psychological Science, 24(4):466–474.

Shimizu, Y. A. and Johnson, S. C. (2004). Infants' attribution of a goal to a morphologically unfamiliar agent. Developmental Science, 7(4):425–430.

Shu, T., Bhandwaldar, A., Gan, C., Smith, K., Liu, S., Gutfreund, D., Spelke, E., Tenenbaum, J. B., and Ullman, T. D. (2021). AGENT: A benchmark for core psychological reasoning. arXiv preprint arXiv:2102.

Smith, K., Mei, L., Yao, S., Wu, J., Spelke, E., Tenenbaum, J., and Ullman, T. (2019). Modeling expectation violation in intuitive physics with coarse probabilistic object representations. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8985–8995. Curran Associates, Inc.

Sodian, B., Schoeppner, B., and Metz, U. (2004). Do infants apply the principle of rational action to human agents? Infant Behavior and Development, 27(1):31–41.

Sommerville, J. A. and Woodward, A. L. (2005). Pulling out the intentional structure of action: The relation between action processing and action production in infancy. Cognition, 95(1):1–30.

Song, H.-j., Baillargeon, R., and Fisher, C. (2005). Can infants attribute to an agent a disposition to perform a particular action? Cognition, 98(2):B45–B55.

Southgate, V., Johnson, M., and Csibra, G. (2008). Infants attribute goals to biomechanically impossible actions. Cognition, 107(3):1059–1069.

Spelke, E. S. (2016). Core knowledge and conceptual change. Core Knowledge and Conceptual Change, 279:279–300.

Spelke, E. S., Breinlinger, K., Macomber, J., and Jacobson, K. (1992). Origins of knowledge. Psychological Review, 99(4):605.

Turk-Browne, N. B., Scholl, B. J., and Chun, M. M. (2008). Babies and brains: Habituation in infant cognition and functional neuroimaging. Frontiers in Human Neuroscience, 2:16.

Ullman, T., Baker, C., Macindoe, O., Evans, O., Goodman, N., and Tenenbaum, J. B. (2009). Help or hinder: Bayesian models of social goal inference. In Advances in Neural Information Processing Systems, pages 1874–1882.

Woodward, A. L. (1998). Infants selectively encode the goal object of an actor's reach. Cognition, 69(1):1–34.

Woodward, A. L. (1999). Infants' ability to distinguish between purposeful and non-purposeful behaviors. Infant Behavior and Development, 22(2):145–160.

Woodward, A. L. and Sommerville, J. A. (2000). Twelve-month-old infants interpret action in context. Psychological Science, 11(1):73–77.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.

A. Data Specifications
Each video has a resolution of 200 × 200 at 25 fps (the videos can be converted to a higher resolution if required). In addition to the videos, we provide metadata in the form of JSON files describing every frame in the video. This description contains information about the layout of the scene and the objects present.

Each video has a JSON file associated with it. A video has 9 trials, which correspond to the 9 items in the JSON file. These 9 trials have a variable number of frames. Each frame is described by the objects contained in it. These include:

• The 'size' attribute specifies the resolution of the frame.

• The 'walls' attribute has a list of [bottomleft, extent] attributes describing the barriers. The bottomleft attribute is 2-dimensional and is defined by an x and y coordinate. Similarly, the extent for each wall is 2-dimensional and describes the width and height of the wall.

• The 'objects' attribute is defined as a list of attributes [bottomleft, size, image, color]. The bottomleft attribute is 2-dimensional and is defined by an x and y coordinate. The size is half of the side of the square shape that the image of the object would be resized to. So, if the size is 10, an object image of size 100 × 100 would be resized to 20 × 20. The image attribute gives the path of the object image. The color attribute gives the color of the object in RGB format in the range [0, 255].

• The 'home', 'agents', 'key' and 'lock' attributes have a similar structure to the objects attribute.

• The 'fuse' attribute corresponds to the removable barrier and has a similar structure to the 'walls' attribute.
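A minimal sketch of reading this per-frame metadata, assuming the file is a list of 9 trials, each a list of frame descriptions as specified above (the file name is a placeholder):

```python
import json

# Placeholder path; each video in the benchmark ships with one such JSON file.
with open("episode_0001.json") as f:
    trials = json.load(f)  # 9 items, one per trial

for trial_idx, frames in enumerate(trials):
    first = frames[0]
    print(f"trial {trial_idx}: {len(frames)} frames, "
          f"frame size {first['size']}, {len(first['walls'])} walls")
    for obj in first["objects"]:
        # bottomleft: [x, y]; size: half the rendered square's side; color: RGB 0-255
        print("  object at", obj["bottomleft"], "color", obj["color"])
```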
B. Baseline Details
B.1. Mask Model

Model Description. Each trial is represented in the form of its initial state and the trajectory taken by the agent (see Figure 17). The states and trajectories are approximated to a grid of size 10 × 10. The initial state is approximated from the frame in the form of a downsampled representation of size 10 × 10 × |O|, where |O| represents the number of possible elements in the scene. These include target objects (14), agents (5), walls (1), home (1), key (1), lock (1), and removable barriers (1), for a total of 24 possible elements in the environment. The trajectory of the agent for a trial is provided in the form of a flat 10 × 10 grid where the cells visited by the agent have a value of 1 while the rest are 0.

The objective of the model is to predict the trajectory of the agent in the test trial conditioned on the initial state of that trial and the eight familiarization trials, presented in the form of initial state and agent trajectory pairs. To encode a trial, the trajectory is concatenated with every channel of the state representation and passed through two convolutional layers (3 × 3, 2 output channels, with batchnorm (BN) and residual connections). The outputs of this network are concatenated and passed through another convolutional neural network (1 × 1, 24 output channels, BN → 3 × 3, 24, BN → 3 × 3, 24, BN, with residual connections), flattened, and passed through a fully connected layer to get an agent characteristic embedding for the trial. This embedding is tiled across the 10 × 10 grid, concatenated with the state representation of the test trial, and passed through a convolutional decoder (3 × 3 convolutions, 32 channels, BN) to predict the agent's trajectory over the grid.
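A sketch of the input representation this model consumes; the channel indices assigned to particular elements are our own illustration, as the released ordering is not specified here:

```python
import numpy as np

GRID = 10
NUM_ELEMENTS = 24  # 14 objects + 5 agents + walls + home + key + lock + removable barrier

def encode_state(elements):
    """Downsampled scene state: one 10 x 10 binary plane per possible element.

    `elements` maps a channel index (0..23) to a list of occupied (x, y) cells.
    """
    state = np.zeros((GRID, GRID, NUM_ELEMENTS), dtype=np.float32)
    for channel, cells in elements.items():
        for x, y in cells:
            state[y, x, channel] = 1.0
    return state

def encode_trajectory(path):
    """Flat 10 x 10 grid: 1 where the agent stepped, 0 elsewhere."""
    traj = np.zeros((GRID, GRID), dtype=np.float32)
    for x, y in path:
        traj[y, x] = 1.0
    return traj

# One familiarization trial = (initial state, agent trajectory) pair.
state = encode_state({0: [(2, 3)], 14: [(0, 0)]})  # e.g., an object and an agent
trajectory = encode_trajectory([(0, 0), (1, 1), (2, 2), (2, 3)])
```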
Training Tasks.
The performance of the mask model on the background training tasks is shown in appendix Table 2 and appendix Figure 15. For the mask model, each grid cell is treated as a separate binary classification problem (whether the agent will visit the cell or not). We compute the precision and recall for these binary classification problems. For the preference task, we also analyse whether the model predicts that the agent will visit the cell of the goal object. The model predicts the cell of the preferred object 83.2% of the time, the cell of the less preferred object 6.9%, both objects 8.4%, and neither object 2.4% of the time. We see that the model successfully generalizes to the training tasks.
Evaluation Results.
The performance of the mask model can be seen in Table 1 and appendix Figure 16. The mask model quickly learns to find the shortest path between the agent and the object. It fails on the multi-agent task, the inaccessible goal task, and the efficient action task with an irrational agent. The model does not have different expectations for the preferences of the new agent and makes the same predictions as those for the familiar agent.
Figure 11: Evaluation task to test whether machines can represent preferences of agents. (a) Familiarization trials; (b) Test: Expected; (c) Test: Unexpected. 2D versions of the stimuli are shown here.

Figure 12: Evaluation task to test whether machines can represent preferences of agents. (a) Familiarization trials; (b) Test: Expected; (c) Test: Unexpected. 3D versions of the stimuli are shown here.

Figure 13: We draw inspiration from Gergely et al. (1995) to design an equivalent task to test whether machines understand that agents act efficiently towards their goals. (a) Familiarization (8 trials); (b) Test: Expected; (c) Test: Unexpected. In this task, the time taken by the agent to reach the goal in the expected and unexpected cases is the same.

Figure 14: Evaluation of binding specific preferences to specific agents. (a) Familiarization (8 trials); (b) Test: Expected; (c) Test: No Expectation. The familiarization trials establish the preference of the agent.
(a) Background Single Object Task: The model correctly predicts that the orange agent will go around the barriers to reach the beige object goal.

(b) Background No-Navigation Preference Task: The model correctly predicts that the blue agent will go to the preferred green object goal.

(c) Background No-Preference Multi-Agent Task: The model predicts that the blue agent will go to the object goal in the trial.

(d) Background Agent-Blocked Instrumental Action Task: The model correctly predicts the locations visited by the agent to perform the instrumental action and visit the object goal (with the caveat that the model does not have the capacity to represent the sequence in which the cells in the grid are visited).
Figure 15: Agent trajectory predictions in the test trial on the background training set, made by the model working on abstract mask representations. Each row shows the input frame, the model prediction, and the target frame. Test trials are shown here.

For the inaccessible goal task, the model predicts that the agent will go to both objects in the test trial (with the trajectory blocked by the obstacle around the goal) (appendix Figure 16d). The model performs better than chance on the preference task but frequently predicts that the agent will go to both objects in the scene (see appendix Figure 16b). As the mask model tries to predict the complete trajectory of the agent in a trial (ignoring the sequence of the actions), it solves a weaker proxy of the instrumental action task, achieving a score higher than the video model.
B.2. Video Model

Model Description. In the video model, the frames of a familiarization trial are encoded using a residual convolutional network with 4 blocks, each with two convolutional layers, and the frame embeddings are passed to a bidirectional LSTM to obtain a characteristic embedding of size 16 for the trial (see Figure 9). The characteristic embedding across the 8 familiarization trials is averaged to get a final agent characteristic embedding. This embedding is tiled to get a vector of size 64 × 64 × 16 and concatenated to the current frame from the test trial. This vector of size 64 × 64 × 19 is passed to a U-Net (Ronneberger et al., 2015) to predict the next frame. We train the model with an Adam optimizer with a learning rate of 1e-4 (betas = (0.9, 0.999)). We train the 2D video model for 11 epochs and the 3D model for 10 epochs.

(a) Preference task: The model correctly predicts that the dark grey agent will go to the preferred cyan object (established in the familiarization).

(b) Preference task: The model predicts a trajectory going to the wrong magenta object but also highlights the blue preferred object. This shows a failure case.

(c) Efficient action task: A successful case is shown here, where the model predicts that the agent will take the shortest path to the beige object goal. The target frame here is from the unexpected episode.

(d) Inaccessible goal task: A failure case is shown here, where the model predicts that the orange agent will go to the less preferred blue object and also to the preferred yellow object, but the trajectory is blocked by the walls.

Figure 16: Agent trajectory predictions on the evaluation set in the test trial made by the model working on mask representations. Each row shows the input frame, the model prediction, and the target frame.

BIB Task            | Precision | Recall
Single object       | 0.88      | 0.67
Preference          | 0.92      | 0.57
Multi-agent         | 0.97      | 0.56
Instrumental action | 0.89      | 0.74

Table 2: The performance of the mask model on the background training tasks.
Background Training.
The errors on the validation set for the model are shown in appendix Table 3. Some of the predictions made by the model can be seen in Figure 18. Only the preference task requires the model to take the familiarization phase into consideration.
Evaluation Tasks.
The model fails to reliably understand the preference of the agent. This could be a result of differences in the distances at which the objects are placed in the scene. In the background training, the objects are placed close to the agent (section 3), making the familiarization trials short. The characteristic encoder LSTM might find it difficult to extract characteristics from the longer sequences seen in the evaluation tasks.

The model learns the simple heuristic of always going to the object in the instrumental action task. This could be caused by a difference in the distribution of the background training and evaluation tasks. In the background training task (Figure 8c), the agent is confined in a small space within the green removable barriers with the key and the lock. The number of samples where the model has to predict that the agent goes to the key or the lock is relatively small compared to that of the barriers disappearing and the agent moving towards the object goal. In the evaluation tasks (Figure 5c), the number of steps needed to reach the key and the lock is significantly higher (as the object goal is confined within the removable barriers). The model thus has trouble generalizing to this case (Table 1, Instrumental: Blocking barrier).

Figure 17: Architecture of our baseline model working on abstract mask representations, inspired by Rabinowitz et al. (2018). The objective of the model is to predict the trajectory of the agent.
Table 3: The performance (MSE) of the video model on the 2D background training tasks: single object, preference, multi-agent, and instrumental action.

(a) A trial from the training set where the model predicts that the brown agent will go to the preferred grey object (established in the familiarization).

(b) A trial from the training set where the model predicts that the blue agent will go to the preferred magenta object (established in the familiarization). We see that there is a blurred blue prediction close to the yellow object, but the model thinks that it is more likely that the agent will go to the magenta one.
(c) The model correctly predicts that the agent will take the shortest path to go to the object goal.
(d) The model correctly predicts that, in the instrumental action task, when the key is inserted into the lock, the removable barriers will slowly disappear.
Figure 18: Predictions of the video model on the background training tasks. (a) and (b) show model predictions for two preference trials, where the model splits its predictions between the two objects but thinks that going to the preferred object (established during the familiarization phase) is more likely. (c) shows model predictions for the single object task, where the model predicts that the agent will take the shortest path to the object. (d) shows the instrumental action task, where the model predicts the disappearance of the removable barriers. Each row shows the input frame, the model prediction, and the target frame. Test trials are shown here.
(a) Preference Task: The model correctly predicts that the brown agent will go to the preferred object that has been established during the familiarization (gray heart).