AGENT: A Benchmark for Core Psychological Reasoning
Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan, Kevin A. Smith, Shari Liu, Dan Gutfreund, Elizabeth Spelke, Joshua B. Tenenbaum, Tomer D. Ullman
Abstract
For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
1. Introduction
In recent years, there has been a growing interest in building socially-aware agents that can interact with humans in the real world (Dautenhahn, 2007; Sheridan, 2016; Puig et al., 2020).

Affiliations: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; Harvard University. Correspondence to: Tianmin Shu <[email protected]>. Example trials and the supplementary material are available at .
Figure 1.
Schematic of the four key scenarios of core intuitive psychology evaluated in AGENT. Each scenario is color coded. Solid arrows show the typical behavior of the agent in the familiarization video(s) or in the expected test video. Dashed arrows show agent behavior in the surprising test video. In Unobserved Constraints trials (C), a surprising test video shows an unexpected outcome (e.g. no barrier) behind the occluder.

This requires agents that understand the motivations and actions of their human counterparts, an ability that comes naturally to people. Humans have an early-developing intuitive psychology, the ability to reason about other people's mental states from observed actions. From infancy, we can easily differentiate agents from objects, expecting agents to not only follow physical constraints, but also to act efficiently to achieve their goals given constraints. Even pre-verbal infants can recognize other people's costs and rewards, infer unobserved constraints given partially observed actions, and predict future actions (Baillargeon et al., 2016; Gergely & Csibra, 2003; Liu et al., 2017; Woodward, 1998). This early core psychological reasoning develops with limited experience, yet generalizes to novel agents and situations, and forms the basis for commonsense psychological reasoning later in life. Like human infants, it is critical for machine agents to develop an adequate capacity for understanding human minds in order to successfully engage in social interactions. Recent work has demonstrated promising results towards building agents that can infer the mental states of others (Baker et al., 2017; Rabinowitz et al., 2018), predict people's future actions (Kong & Fu, 2018), and even work with human partners (Rozo et al., 2016; Carroll et al., 2019). However, to date there has been a lack of rigorous evaluation benchmarks
for assessing how much artificial agents learn about core psychological reasoning, and how well their learned representations generalize to novel agents and environments. In this paper, we present AGENT (Action, Goal, Efficiency, coNstraint, uTility), a benchmark for core psychological reasoning inspired by experiments in cognitive development that probe young children's understanding of intuitive psychology. AGENT consists of a large-scale dataset of 3D animations of an agent moving under various physical constraints and interacting with various objects. These animations are organized into four categories of trials, designed to probe a machine learning model's understanding of key situations that have served to reveal infants' intuitive psychology, testing their attributions of goal preferences (Figure 1A; Woodward 1998), action efficiency (Figure 1B; Gergely et al. 1995), unobserved constraints (Figure 1C; Csibra et al. 2003), and cost-reward trade-offs (Figure 1D; Liu et al. 2017). As we detail in Section 3.1, each scenario is based on previous developmental studies, and is meant to test a combination of underlying key concepts in human core psychology.
These scenarios cover the early understanding of agents as self-propelled physical entities that value some states of the world over others, and act to maximize their rewards and minimize costs subject to constraints. In addition to this minimal set of concepts, a model may also need to understand other concepts to pass a full battery of core intuitive psychology, including perceptual access and intuitive physics. Like experiments in many infant studies, each trial has two phases: in the familiarization phase, we show a model one or more videos of a particular agent's behavior in certain physical environments; then in the test phase, we show the model a video of the behavior of the same agent in a new environment, which is either 'expected' or 'surprising,' given the behavior of the agent in familiarization. The model's task is to judge how surprising the agent's behaviors in the test videos are, based on what the model has learned or inferred about the agent's actions, utilities, and physical constraints from watching the familiarization video(s). We validate AGENT with large-scale human-rating trials, showing that on average, adult human observers rate the 'surprising' test videos as more surprising than the 'expected' test videos. Unlike typical evaluations for Theory of Mind reasoning (Rabinowitz et al., 2018), we propose an evaluation protocol focusing on generalization. We expect models to perform well not only in test trials similar to those from training, but also in test trials that require generalization to different physical configurations within the same scenario, or to other scenarios. We compare two strong baselines for Theory of Mind reasoning: (i) Bayesian Inverse Planning and Core Knowledge (BIPaCK), which combines Bayesian inverse planning (Baker et al., 2017) with physical simulation (Battaglia et al., 2013), and (ii) ToMnet-G, which extends the Theory of Mind neural network (Rabinowitz et al., 2018).
Our experimental results show that ToMnet-G can achieve reasonably high accuracy when trained and tested on trials of similar configurations or of the same scenario, but faces a strong challenge in generalizing to different physical situations, or to a different but related scenario. In contrast, due to built-in representations of planning, objects, and physics, BIPaCK achieves stronger performance on generalization both within and across scenarios. This demonstrates that AGENT poses a useful challenge for building models that achieve core psychological reasoning via learned or built-in representations of agent behaviors that integrate utility computations, object representations, and intuitive physics. In summary, our contributions are: (i) a new benchmark on core psychological reasoning consisting of a large-scale dataset inspired by infant cognition and validated by human trials, (ii) a comprehensive comparison of two strong baseline models that extend prior approaches to mental state reasoning, and (iii) a generalization-focused evaluation protocol. We plan to release the dataset and the code for data generation.
2. Related Work
Machine Social Perception.
While there has been a long and rich history in machine learning concerning human behavior recognition (Aggarwal & Ryoo, 2011; Caba Heilbron et al., 2015; Poppe, 2010; Choi & Savarese, 2013; Shu et al., 2015; Ibrahim et al., 2016; Sigurdsson et al., 2018; Fouhey et al., 2018) and forecasting (Kitani et al., 2012; Koppula & Saxena, 2013; Alahi et al., 2016; Kong & Fu, 2018; Liang et al., 2019), prior work has typically focused on classifying and/or predicting motion patterns. However, the kind of core psychological reasoning evaluated in AGENT emphasizes mental state reasoning. This objective is loosely aligned with agent modeling in work on multi-agent cooperation or competition (Albrecht & Stone, 2018), where a machine agent attempts to model another agent's type, defined by factors such as intentions (Mordatch & Abbeel, 2018; Puig et al., 2020), rewards (Abbeel & Ng, 2004; Ziebart et al., 2008; Hadfield-Menell et al., 2016; Shu & Tian, 2018), or policies (Sadigh et al., 2016; Kleiman-Weiner et al., 2016; Nikolaidis et al., 2017; Lowe et al., 2017; Wang et al., 2020; Xie et al., 2020). Here, we present a rigorously designed and human-validated dataset for benchmarking a machine agent's ability to model aspects of other agents' mental states that are core to human intuitive psychology. These protocols can be used in future work to build and test models that reason and learn about other minds the way that humans do.
Synthetic Datasets for Machine Perception.
Empowered
by graphics and physics simulation engines, there have been synthetic datasets for various problems in machine scene understanding (Zitnick et al., 2014; Ros et al., 2016; Johnson et al., 2017; Song et al., 2017; Xia et al., 2018; Riochet et al., 2018; Jiang et al., 2018; Groth et al., 2018; Yi et al., 2019; Bakhtin et al., 2019; Nan et al., 2020; Netanyahu et al., 2021). Many of these datasets focusing on social perception are either built using simple 2D cartoons (Zitnick et al., 2014; Gordon, 2016; Netanyahu et al., 2021), or focus on simpler reasoning tasks (Cao et al., 2020). Concurrent with this paper, Gandhi et al. 2021 have proposed a benchmark, BIB (Baby Intuitions Benchmark), for probing a model's understanding of other agents' goals, preferences, and actions in maze-like environments. The tests proposed in AGENT have conceptual overlap with BIB, with three key differences. First, in addition to the common concepts tested in both benchmarks (goals, preferences, and actions), the scenarios in AGENT probe concepts such as unobserved constraints and cost-reward trade-offs, whereas BIB focuses on the instrumentality of actions (e.g., using a sequence of actions to make an object reachable before getting it). Second, trials in AGENT simulate diverse physical situations, including ramps, platforms, doors, and bridges, while BIB contains scenes that require more limited knowledge of physical constraints: mazes with walls. Third, the evaluation protocol for AGENT emphasizes generalization across different scenarios and types of trials, while BIB focuses on whether intuitive psychology concepts can be learned and utilized from a single large training set in the first place. BIB also provides baseline models that build on raw pixels or object masks, while our baseline models address the separate challenges presented by AGENT and focus more on incorporating the core knowledge of objects and physics into the psychological reasoning.
We see that AGENT and BIB provide complementary tools for benchmarking machine agents' core psychological reasoning, and relevant models could make use of both.
Few-shot Imitation Learning.
The two-phase setup of the trials in AGENT resembles few-shot imitation learning (Duan et al., 2017; Finn et al., 2017; Yu et al., 2018; James et al., 2018; Huang et al., 2019; Silver et al., 2020), where the objective is to imitate expert policies on multiple tasks based on a set of demonstrations. This is critically different from the objective of our benchmark, which is to assess how well models infer the mental states of a particular agent from a single or a few familiarization videos, and predict the same agent's behavior in a different physical situation.
3. AGENT Dataset
Figure 2 summarizes the design of trials in AGENT, which groups trials into four scenarios. All trials have two phases: (i) a familiarization phase showing one or multiple videos of the typical behaviors of a particular agent, and (ii) a test phase showing a single video of the same agent either in a new physical situation (the Goal Preferences, Action Efficiency, and Cost-Reward Trade-offs scenarios) or the same video as familiarization but revealing a portion of the scene that was previously occluded (Unobserved Constraints). Each test video is either expected or surprising. In an expected test video, the agent behaves consistently with its actions from the familiarization video(s) (e.g. pursues the same goal, acts efficiently with respect to its constraints, and maximizes rewards), whereas in a surprising test video, the agent aims for a goal inconsistent with its actions from the familiarization videos, achieves its goal inefficiently, or violates physics. Each scenario has several variants, including both basic versions replicating stimuli used in infant studies, and additional types with new setups of the physical scenes, creating more diverse scenarios and enabling harder tests of generalization. We next explain the designs. Supplementary material includes example videos.

Scenario 1: Goal Preferences.
This subset of trials probes whether a model understands that an agent chooses to pursue a particular goal object based on its preferences, and that pursuing the same goal could lead to different actions in new physical situations, following Woodward (1998). Each trial includes one familiarization video and a test video, where two distinct objects (with different shapes and colors) are placed on either side of an agent. For half of the test videos, the positions of the objects change from familiarization to test. During familiarization, the agent prefers one object over the other, and always goes to the preferred object. In an expected test video, the agent goes to the preferred object regardless of where it is, whereas in a surprising test video, the agent goes to the less preferred object. A good model should expect a rational agent to pursue its preferred object at test, despite the varying physical conditions. To show a variety of configurations and thus control for low-level heuristics, we define four types of trials for the Goal Preferences scenario (Figure 2), which vary the relative cost of pursuing either one of the goal objects in the familiarization video and the test video. In Types 1.1 and 1.2, reaching either one of the objects requires the same effort as during familiarization, whereas in Types 1.3 and 1.4, the agent needs to overcome a harder obstacle to reach its preferred object. In Types 1.1 and 1.3, the agent needs to overcome the same obstacle to reach either object in the test video, but reaching the less desired object in the test video of Types 1.2 and 1.4 requires a higher effort for the agent than reaching the preferred object does.
Scenario 2: Action Efficiency.
This task evaluates whether a model understands that a rational agent is physically constrained by the environment and tends to take the most efficient action to reach its goal given its particular physical
GENT: A Benchmark for Core Psychological Reasoning F a m ili a r i z a t i on E x p ec t e d S u r p r i s i ng Type 2.1 Type 2.2 Type 2.3 Type 2.4
No obstacle in test Obstacle out of the way in test A smaller obstacle in test A different type of obstacle in test
Type 2.5
Path in the fam. violates solidity in test F a m ili a r i z a t i on E x p ec t e d S u r p r i s i ng Type 3.1 Type 3.2
No barrier in the surprising video Inefficient path in the surprising situation B Scenario
2: Action Efficiency
C Scenario 3: Unobserved constraints F a m ili a r i z a t i on E x p ec t e d S u r p r i s i ng Type 1.1 Type 1.2 Type 1.3 Type 1.4 F a m ili a r i z a t i on Type 4.1 Type 4.2
A Scenario 1: Goal Preferences D Scenario
4: Cost-Reward Trade-offs
Equal cost in fam.Equal cost in test Equal cost in fam. Low goal cost in test High goal cost in fam.Equal cost in test High goal cost in fam.Low goal cost in test
Equal cost in test
SurprisingExpected SurprisingExpected
Low cost for the preferred object in test
Figure 2.
Overview of trial types of the four scenarios in AGENT. Each scenario is inspired by infant cognition and meant to test a different facet of intuitive psychology. Each type controls for the possibility of learning simpler heuristics. Example videos are included in the supplementary material.

constraints (e.g., walls or gaps in the floor). This means that an agent may not follow the same path for the same goal if the physical environment is no longer the same as before. In the familiarization video, we show an agent taking an efficient path to reach a goal object given the constraints. In Type 2.1, that constraint is removed, and at test, the agent takes a more efficient path (expected), or takes the same path as it had with the constraint in place (surprising). Types 2.2-2.4 further extend this scenario by ensuring that a model cannot use the presence of the obstacle to infer that an agent should jump, by placing the obstacle out of the way (2.2), using a smaller obstacle (2.3), or introducing a door or a bridge into the obstacle (2.4). By introducing a surprising path in which the agent moves through the wall, Type 2.5 ensures that the model is not simply ignoring constraints and predicting that the closest path to a straight line is the most reasonable.
Scenario 3: Unobserved Constraints.
By assuming that agents tend to take the most efficient action to reach their goals (Scenarios 1-2), infants are also able to infer hidden obstacles based on agents' actions. Specifically, after seeing an agent perform a costly action (e.g. jumping up and landing behind an occluder), infants can infer that there must be an unobserved physical constraint (e.g. an obstacle behind the occluder) that explains this action (Csibra et al., 2003). To evaluate whether a model can reason about hidden constraints in this way, we designed two types of trials for Scenario 3. In both types of trials, we show an agent taking curved paths to reach a goal object (either by jumping vertically or moving horizontally), but the middle of the agent's path is hidden behind an occluder (the wall appearing in the middle of the familiarization video in Figure 2C). In these videos, the occluder partially hides the agent from view, and it is clear that the agent is deviating from a straight path towards its goal. In the test videos, the occluder falls after the agent reaches the goal object, potentially revealing the unseen physical constraints. Similar to Csibra et al. (2003), in the expected video, the occluder falls to reveal an obstacle that justifies the action that the agent took as efficient; in the surprising video, the occluder falls to reveal an obstacle that makes the observed actions appear inefficient. The videos of Type 3.2 control for the absence of an object behind the occluder being a signal for surprise by revealing an obstacle that nonetheless makes the agent's actions inefficient (a
smaller wall that the agent could have leapt over or moved around with less effort, or a wall with a doorway that the agent could have passed through).
Scenario 4: Cost-Reward Trade-offs.
Scenario 1 requires reasoning about preferences over different goal states, and Scenarios 2 and 3 require reasoning about cost functions and physical constraints. However, infants can do more than reason about agents' goals and physically grounded costs in isolation. They can also infer which goal objects agents prefer from observing the level of cost they willingly expend for their goals (Liu et al., 2017). To succeed here, infants need to understand that agents plan actions based on utility, which can be decomposed into positive rewards and negative costs (Jara-Ettinger et al., 2016). Rational action under this framework thus requires agents (and observers of their actions) to trade off the rewards of goal states against the costs of reaching those goal states. Following experiments designed to probe infants' understanding of rewards and costs (Liu et al., 2017), we construct two types of trials for Scenario 4. Here we show the agent acting towards each of two goal objects under two different physical situations (four familiarization videos in total). In the first two familiarization videos, the agent overcomes an obstacle of medium difficulty (a wall/platform/ramp with a medium height, or a chasm with a medium width) to reach the object that it likes more, but gives up when the obstacle becomes too difficult (e.g., the maximum height or width). In the remaining two familiarization videos, the agent overcomes an easy obstacle to reach the less preferred object, but decides not to pursue the same object when there is a medium-difficulty obstacle. During the testing phase, both objects are present in the scene for the first time. The agent goes to the more preferred object in the expected video, but goes to the less preferred object in the surprising video. Type 4.1 shows no obstacles, or obstacles of the same difficulty, between the agent and the two objects in the test videos. In Type 4.2, a more difficult obstacle is placed between the agent and the less preferred object at test.
In both cases, a rational agent will tend to choose the object it likes more, which requires either the same amount of action cost to reach as the less preferred object (Type 4.1) or even less action cost than the less preferred object (Type 4.2). The key question is whether the model can infer this preference from the familiarization videos, and generalize it to the test video.
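The inference this scenario demands can be illustrated with a toy sketch. Assuming, as a crude simplification of the utility calculus described above, that an agent acts on a goal if and only if its reward exceeds the action cost, the familiarization decisions constrain which reward assignments are possible. The function name, the integer reward grid, and the deterministic decision rule are all illustrative assumptions; the actual models are probabilistic.

```python
def consistent_rewards(observations, reward_grid):
    """Enumerate reward hypotheses consistent with observed accept/give-up
    decisions, under a toy rule: the agent acts iff reward > cost.

    observations: list of (object_id, cost, acted) tuples.
    reward_grid: list of (reward_A, reward_B) hypotheses to test.
    """
    consistent = []
    for r_a, r_b in reward_grid:
        ok = True
        for obj, cost, acted in observations:
            r = r_a if obj == "A" else r_b
            if acted != (r > cost):  # decision contradicts this hypothesis
                ok = False
                break
        if ok:
            consistent.append((r_a, r_b))
    return consistent

# Familiarization: agent accepts medium cost (2) but not high cost (3) for A;
# accepts low cost (1) but not medium cost (2) for B.
obs = [("A", 2, True), ("A", 3, False), ("B", 1, True), ("B", 2, False)]
grid = [(ra, rb) for ra in range(5) for rb in range(5)]
print(consistent_rewards(obs, grid))  # only hypotheses with reward_A > reward_B survive
```

Under these observations, every surviving hypothesis ranks A above B, so a rational agent facing equal costs at test should choose A; going to B is the surprising outcome.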
To generate each trial, we first sample a physical scene graph for each familiarization and test video that satisfies the constraints specified for each trial type. In this scene graph, we define the number, types, and sizes of obstacles (e.g., walls, ramps, etc.), the texture of the floor (out of 8 types), the texture of the background wall (out of 3 types), as well as the shapes, colors, sizes, and the initial positions
Figure 3.
Object shapes and obstacles used in AGENT.

of the agent and all objects. We then instantiate the scene graph in an open-sourced 3D simulation environment, TDW (Gan et al., 2020). We define the goal of the agent in each trial by randomly assigning preferences over objects to the agent, and simulate the agent's path through the environment using (i) hand-crafted motion heuristics such as predefined way points and corresponding actions (i.e., walking, jumping, climbing) to reach each way point in order to overcome an obstacle of a certain type and size, and (ii) a gaze-turning motion that is naturally aligned with behaviors such as looking at the surroundings at the beginning and looking forward while moving. We sample object shapes and obstacles from the set depicted in Figure 3. Note that agent shapes are always sampled from the sphere, cone, and cube subset.
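The scene-graph sampling can be sketched as follows. The obstacle types, texture counts, and the agent-shape subset come from the description above; the size and position ranges, the extra object shapes, and all field names are illustrative assumptions, not the actual generation code (which the authors plan to release).

```python
import random

OBSTACLE_TYPES = ["wall", "door", "ramp", "platform", "chasm", "bridge"]
FLOOR_TEXTURES = [f"floor_{i}" for i in range(8)]   # 8 floor textures
WALL_TEXTURES = [f"wall_{i}" for i in range(3)]     # 3 background-wall textures
AGENT_SHAPES = ["sphere", "cone", "cube"]           # agents use only this subset
OBJECT_SHAPES = AGENT_SHAPES + ["bowl", "ring"]     # extra shapes are hypothetical

def sample_scene_graph(num_obstacles=1, num_objects=2, rng=random):
    """Sample one scene graph for a familiarization or test video."""
    return {
        "floor": rng.choice(FLOOR_TEXTURES),
        "background": rng.choice(WALL_TEXTURES),
        "obstacles": [
            {"type": rng.choice(OBSTACLE_TYPES),
             "size": rng.uniform(0.5, 2.0)}          # illustrative size range
            for _ in range(num_obstacles)
        ],
        "agent": {"shape": rng.choice(AGENT_SHAPES),
                  "position": (rng.uniform(-3.0, 3.0), 0.0)},
        "objects": [
            {"shape": rng.choice(OBJECT_SHAPES),
             "color": i,                              # distinct color code per object
             "position": (rng.uniform(-3.0, 3.0), 0.0)}
            for i in range(num_objects)
        ],
    }
```

In the actual pipeline, a sampled graph like this would then be instantiated in TDW and paired with the heuristic motion synthesis described above.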
There are 9240 videos in AGENT. Each video lasts from 5.6 s to 25.2 s, with a frame rate of 35 fps. With these videos, we constructed 3360 trials in total, divided into 1920 training trials, 480 validation trials, and 960 testing trials (or 480 pairs of expected and surprising testing trials, where each pair shares the same familiarization video(s)). All training and validation trials only contain expected test videos. In the dataset, we provide RGB-D frames, instance segmentation maps, and the camera parameters of the videos, as well as the 3D bounding boxes of all entities recorded from the TDW simulator. We categorize entities into three classes: agent, object, and obstacle; these class labels are also available. To create consistent identities for the objects in a trial, we define 8 distinct colors and include the corresponding color codes of the objects in the ground-truth information as well.
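One way the per-trial annotations described above might be organized in code is sketched below; the class and field names are hypothetical, since the paper does not specify a file format, but the entity classes and color codes match the description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ENTITY_CLASSES = ("agent", "object", "obstacle")

@dataclass
class EntityFrame:
    """Per-frame annotation for one entity (field names are assumptions)."""
    entity_class: str                  # one of ENTITY_CLASSES
    bbox_3d: Tuple[float, ...]         # 3D bounding box from the TDW simulator
    color_code: int = -1               # 0-7 for objects; -1 when not applicable

@dataclass
class Trial:
    """One trial: familiarization video(s) plus a single test video."""
    familiarization_videos: List[List[List[EntityFrame]]] = field(default_factory=list)
    test_video: List[List[EntityFrame]] = field(default_factory=list)  # frames -> entities
    label: str = "expected"            # "expected" or "surprising" (test set only)
```

A loader for the released dataset would populate these structures from the ground-truth files, or from a perception model when ground-truth states are withheld.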
4. Baseline Methods
We propose two strong baseline methods for the benchmark, built on well-known approaches to Theory of Mind reasoning. We provide a sketch of both methods here, and discuss implementation details in the supplementary material.
The core idea of Bayesian inverse planning is to infer hidden mental states (such as goals, preferences, and beliefs)
Figure 4. Overview of the generative model for BIPaCK. The dashed arrow indicates extracting states via the ground truth or a perception model.

through a generative model of an agent's plans (Baker et al., 2017). Combined with core knowledge of physics (Baillargeon, 1996; Spelke et al., 1992), powered by simulation (Battaglia et al., 2013), we propose the Bayesian Inverse Planning and Core Knowledge (BIPaCK) model. We first devise a generative model that integrates physics simulation and planning (Figure 4). Given the frame of the current step, we extract the entities (the agent, objects, and obstacles) and their rough state information (3D bounding boxes and color codes), either based on the ground truth provided in AGENT, or on results from a perception model. We then recreate an approximated physical scene in a physics engine that is different from TDW (here we use PyBullet; Coumans & Bai 2016-2019). In particular, all obstacle entities are represented by cubes, and all objects and the agent are recreated as spheres. As the model has no access to the ground-truth parameters of the physical simulation used in the procedural generation, nor any prior knowledge about the mental states of the agents, it has to propose a hypothesis of the physics parameters (coordinate transformation, global forces such as gravity and friction, and densities of entities), and a hypothesis of the agent parameters (the rewards of objects and the cost function of the agent). Given these inferred parameters, the planner (based on RRT*; Karaman et al. 2011) samples a trajectory accordingly. We define the generative model as $G(S, \Phi, \Theta)$, where $S = \{s_i\}_{i=1}^{N}$ is the initial state of a set of $N$ entities, and $\Phi$ and $\Theta$ are the parameters for the physics engine and the agent respectively.
In particular, $\Theta = (R, w)$, where $R = \{r_g\}_{g \in \mathcal{G}}$ indicates the agent's reward placed over a goal object $g \in \mathcal{G}$, and $C(s_a, s'_a) = w^\top f$ is the cost function for the agent, parameterized as the weighted sum of the force $f$ needed to move the agent from its current state $s_a$ to the next state $s'_a$. The generative model samples a trajectory over the next $T$ steps from $S$, $\hat{\Gamma} = \{s_a^t\}_{t=1}^{T}$, to jointly maximize the reward and minimize the cost, i.e.,

$$\hat{\Gamma} = G(S, \Phi, \Theta) = \arg\max_{\Gamma = \{s_a^t\}_{t=1}^{T}} \sum_{g \in \mathcal{G}} r_g \, \delta(s_a^T, s_g) - \sum_{t=0}^{T-1} C(s_a^t, s_a^{t+1}), \quad (1)$$

where $\delta(s_a^T, s_g) = 1$ if the final state of the agent ($s_a^T$) reaches goal object $g$ whose state is $s_g$, and $\delta(s_a^T, s_g) = 0$ otherwise. Note that we assume object-oriented goals for all agents as a built-in inductive bias. Based on Eq. (1), we can define the likelihood of observing an agent trajectory given the parameters and the initial state as

$$P(\Gamma \mid S, \Phi, \Theta) = e^{-\beta D(\Gamma, \hat{\Gamma})} = e^{-\beta D(\Gamma, G(S, \Phi, \Theta))}, \quad (2)$$

where $D$ is the Euclidean distance between two trajectories, and the constant $\beta$ adjusts the assumed optimality of an agent's behavior. The training data is used to calibrate the parameters in BIPaCK. Given all $N_\text{train}$ trajectories and the corresponding initial states in the training set (from both familiarization videos and test videos), $X_\text{train} = \{(\Gamma_i, S_i)\}_{i \in N_\text{train}}$, we can compute the posterior probability of the parameters:

$$P(\Phi, \Theta \mid X_\text{train}) \propto \sum_{i \in N_\text{train}} P(\Gamma_i \mid S_i, \Phi, \Theta) P(\Phi) P(\Theta), \quad (3)$$

where $P(\Phi)$ and $P(\Theta)$ are uniform priors over the parameters. For brevity, we define $P_\text{train}(\Phi, \Theta) = P(\Phi, \Theta \mid X_\text{train})$. Note that the trajectories and the initial states in the videos of Unobserved Constraints are partially occluded. To obtain $X_\text{train}$, we need to reconstruct the videos.
For this, we (i) first remove the occluder from the states, and (ii) reconstruct the full trajectories by applying a 2nd-order curve fit to fill in the occluded portion. For a test trial with familiarization video(s), $X_\text{fam} = \{(\Gamma_i, S_i)\}_{i \in N_\text{fam}}$, and a test video, $(\Gamma_\text{test}, S_\text{test})$, we adjust the posterior probability of the parameters from Eq. (3):

$$P(\Phi, \Theta \mid X_\text{fam}, X_\text{train}) \propto \sum_{i \in N_\text{fam}} P(\Gamma_i \mid S_i, \Phi, \Theta) P_\text{train}(\Phi, \Theta). \quad (4)$$

We then define the surprise rating of a test video by computing the expected distance between the predicted agent trajectory and the one observed from the test video: $\mathbb{E}_{P(\Phi, \Theta \mid X_\text{fam}, X_\text{train})}[D(\Gamma_\text{test}, G(S_\text{test}, \Phi, \Theta))]$. (As two trajectories may have different lengths, we adopt dynamic time warping (Berndt & Clifford, 1994) for computing the distance.)

We extend ToMnet (Rabinowitz et al., 2018) to tackle the more challenging setting of AGENT, creating the second baseline model, ToMnet-G (see Figure 5). Like the original ToMnet, the network encodes the familiarization video(s) to obtain a character embedding for a particular agent, which is then combined with the embedding of the initial state to predict the expected trajectory of the agent. The surprise rating of a given test video is defined by the deviation between
Figure 5. Architecture of ToMnet-G. The scene graphs are constructed based on the ground truth or a separately trained perception model (hence the dashed arrows).

the predicted trajectory $\hat{\Gamma}$ and the observed trajectory $\Gamma$ in the test video. We extended ToMnet by using a graph neural network (GNN) to encode the states, where we represent all entities (including obstacles) as nodes. The input of a node includes its entity class (agent, object, obstacle), bounding box, and color code. We pass the embedding of the agent node to the downstream modules to obtain the character embedding $e_\text{char}$ and the mental state embedding $e_\text{mental}$. We train the network using a mean squared error loss on the trajectory prediction: $\mathcal{L}(\hat{\Gamma}, \Gamma) = \frac{1}{T} \sum_{t=1}^{T} \| \hat{x}^t - x^t \|^2$. To ensure that ToMnet-G can be applied to trials in Unobserved Constraints consistently with how it is applied to trials in other scenarios, we reconstruct the familiarization video and the initial state of the test video, using the same reconstruction method as in Section 4.1. After the reconstruction, we can use the network to predict the expected trajectory for computing the surprise rating. Here, we use the reconstructed trajectory for calculating the surprise rating.
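Both baselines ultimately score a test video by a distance between a predicted and an observed trajectory. The sketch below is a minimal, self-contained illustration of that shared structure: a discrete hypothesis grid and a black-box `planner` stand in for BIPaCK's continuous parameters and RRT* planner, and a black-box `predict` function stands in for ToMnet-G's GNN/LSTM network. None of this is the authors' released code.

```python
import math

def dtw_distance(traj_a, traj_b):
    """Dynamic time warping distance between two trajectories
    (lists of [x, y] positions, possibly of different lengths)."""
    n, m = len(traj_a), len(traj_b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(traj_a[i - 1], traj_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def bipack_surprise(hypotheses, fam, planner, test_state, test_traj, beta=0.5):
    """BIPaCK-style rating: expected distance between the planner's prediction
    and the observed test trajectory, under the posterior of Eq. (4).
    hypotheses: list of (phi, theta, prior); fam: list of (state, traj) pairs."""
    log_ws = []
    for phi, theta, prior in hypotheses:
        log_w = math.log(prior)  # stands in for P_train(phi, theta)
        for state, traj in fam:  # familiarization likelihood, Eq. (2)
            log_w += -beta * dtw_distance(traj, planner(state, phi, theta))
        log_ws.append(log_w)
    mx = max(log_ws)
    ws = [math.exp(lw - mx) for lw in log_ws]  # unnormalized posterior weights
    dists = [dtw_distance(test_traj, planner(test_state, phi, theta))
             for phi, theta, _ in hypotheses]
    return sum(w * d for w, d in zip(ws, dists)) / sum(ws)

def tomnet_g_surprise(predict, fam, test_state, test_traj):
    """ToMnet-G-style rating: mean squared deviation between the predicted
    and observed test trajectories; `predict` stands in for the network."""
    pred = predict(fam, test_state)
    return sum(math.dist(p, o) ** 2 for p, o in zip(pred, test_traj)) / len(test_traj)
```

In this toy setting, a familiarization video showing the agent heading toward x = +1 makes a test trajectory toward x = -1 score as more surprising than one toward x = +1, which is exactly the ordering the benchmark's accuracy metric checks.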
5. Experiments
Following Riochet et al. (2018), we define a metric based on relative surprise ratings. For a paired set of $N^+$ surprising test videos and $N^-$ expected test videos (which share the same familiarization video(s)), we obtain two sets of surprise ratings, $\{r_i^+\}_{i=1}^{N^+}$ and $\{r_j^-\}_{j=1}^{N^-}$ respectively. Accuracy is then defined as the percentage of correctly ordered pairs of ratings: $\frac{1}{N^+ N^-} \sum_{i,j} \mathbb{1}(r_i^+ > r_j^-)$.

To validate the trials in AGENT and to estimate human baseline performance for the AGENT benchmark, we conducted an experiment in which people watched familiarization videos and then rated the relevant test videos on a sliding scale for surprise (from 0, 'not at all surprising,' to 100, 'extremely surprising'). We randomly sampled 240 test trials (i.e., 25% of the test set in AGENT) covering all types of trials and obstacles. We recruited 300 participants from Amazon Mechanical Turk, and each trial was rated by 10 participants. The participants gave informed consent, and the experiment was approved by an institutional review board. Participants only viewed one of either the 'expected' or 'surprising' variants of a scene. We found that the average human rating of each surprising video was always significantly higher than that of the corresponding expected video, resulting in 100% accuracy when using ratings from an ensemble of human observers. To estimate the accuracy of a single human observer, we adopted the same metric defined in Section 5.1, where we first standardized the ratings of each participant so that they are directly comparable to the ratings from other participants. We report the human performance in Table 1.
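The pairwise accuracy metric and the per-participant standardization can be sketched as follows (function and variable names are ours):

```python
import statistics

def pairwise_accuracy(surprising_ratings, expected_ratings):
    """Accuracy = (1 / (N+ N-)) * sum_{i,j} 1[r+_i > r-_j]: the fraction of
    (surprising, expected) rating pairs that are correctly ordered."""
    n_plus, n_minus = len(surprising_ratings), len(expected_ratings)
    correct = sum(rp > rm
                  for rp in surprising_ratings
                  for rm in expected_ratings)
    return correct / (n_plus * n_minus)

def standardize(ratings):
    """Z-score one participant's ratings so they are comparable to other
    participants' ratings (used for the single-observer human baseline)."""
    mu = statistics.mean(ratings)
    sd = statistics.pstdev(ratings) or 1.0  # guard against a constant rater
    return [(r - mu) / sd for r in ratings]
```

For example, `pairwise_accuracy([3, 2], [1, 4])` yields 0.5: of the four pairs, only the two comparisons against the expected rating of 1 are ordered correctly.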
Table 1 summarizes human performance and the performance of the two methods when the models are trained and tested on all types of trials within all four scenarios. Note that all results reported in the main paper are based on the ground-truth state information; we report the model performance based on the states extracted from a perception model in the supplementary material. When given ground-truth state information, BIPaCK performs well on all types of trials, on par with or even better than the human baseline. ToMnet-G also has a high accuracy on Action Efficiency when tested on all trial types it has seen during training, but performs worse than the human baseline and BIPaCK on the other three scenarios. ToMnet-G also performs less evenly across types within a scenario compared to BIPaCK, mostly due to overfitting certain patterns in some types. For example, in Types 2.2 and 2.4, the agent always moves away from the object when it needs to overcome a high-cost obstacle during the test phase, so ToMnet-G uses that cue to predict the agent's behavior, rather than reasoning about the agent's costs and preferences given the familiarization videos (these are the kind of heuristics that controls are designed to rule out in infant studies). The correlation between BIPaCK's accuracy and the human performance on different types is 0.55, versus a correlation of 0.23 between ToMnet-G and the human performance.
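The type-level agreement with humans reported above is a Pearson correlation over per-type accuracies; a minimal sketch (the arrays in the usage example are illustrative, not the paper's numbers):

```python
import numpy as np

def type_level_correlation(model_acc, human_acc):
    """Pearson correlation between a model's per-type accuracies and the
    human per-type accuracies. Inputs are equal-length sequences with one
    entry per trial type."""
    return float(np.corrcoef(model_acc, human_acc)[0, 1])

# Illustrative values only (not from the paper):
model = [0.90, 0.80, 0.95, 0.70]
human = [0.92, 0.85, 0.90, 0.75]
```

A correlation near 1 means the model finds hard the same trial types that people find hard.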
We conduct four types of generalization tests. The first trains a separate model for each scenario using all but one type of trials in that scenario, and evaluates it on the held-out type ('G1: leave one type out'). The second trains a single model on all but one scenario and evaluates it on the held-out scenario ('G2: leave one scenario out'). The third
Table 1. Human and model performance. The 'All' block reports results based on models trained on all scenarios, whereas 'G1' and 'G2' report model performance on the 'G1: leave one type out' and 'G2: leave one scenario out' generalization tests. Here, G1 trains a separate model for each scenario using all but one type of trials in that scenario and evaluates it on the held-out type; G2 trains a single model on all but one scenario and evaluates it on the held-out scenario. Blue numbers show where ToMnet-G generalizes well (performance > .8); red numbers show where it performs at or below chance (performance ≤ .5).

Condition  Method     Goal Preferences               Action Efficiency                     Unobs. Constraints    Cost-Reward Trade-offs   All
                      1.1   1.2   1.3   1.4   Avg    2.1   2.2   2.3   2.4   2.5   Avg     3.1   3.2   Avg       4.1   4.2   Avg
All        ToMnet-G   .73   1.0   .53   1.0   .84    .95   1.0   .95   .88   1.0   .94     .95   .78   .85       .63   1.0   .82          .86
           BIPaCK     .97   1.0   1.0   1.0   .99    1.0   1.0   .85   1.0   1.0   .97     .93   .88   .90       .90   1.0   .95          .96
G1         ToMnet-G   .63   .95   .53   1.0   .81    .95   .80   .45   .77   .05   .63     .45   .87   .70       .28   .42   .35          .63
           BIPaCK     .93   1.0   1.0   1.0   .98    1.0   1.0   .80   1.0   1.0   .97     .93   .82   .86       .88   1.0   .94          .94
G2         ToMnet-G   .50   .93   .50   .88   .73    .70   .60   .75   .75   1.0   .76     .60   .73   .68       .62   .98   .80          .74
           BIPaCK     .93   1.0   1.0   1.0   .98    1.0   1.0   .75   1.0   .95   .95     .88   .85   .87       .83   1.0   .92          .94
[Figure 6 shows two accuracy heatmaps (A: ToMnet-G, B: BIPaCK), with training type on one axis and testing type on the other, grouped by the four scenarios.]

Figure 6. Performance of ToMnet-G (A) and BIPaCK (B) on the 'G3: single type' test. This test trains a model on a single trial type within a scenario and evaluates it on the remaining types of the same scenario. Blue boxes show good generalization from ToMnet-G (off-diagonal performance > .8), whereas red boxes show where it performs at or below chance (off-diagonal performance ≤ .5); magenta boxes show failures of BIPaCK (off-diagonal performance < .8).

[Figure 7 shows two accuracy heatmaps (A: ToMnet-G, B: BIPaCK), with training scenario on one axis and testing scenario on the other.]

Figure 7.
Performance of ToMnet-G (A) and BIPaCK (B) on the 'G4: single scenario' test. This test trains a model on a single scenario and evaluates it on the other three scenarios. GP, AE, UC, and CT represent Goal Preferences, Action Efficiency, Unobserved Constraints, and Cost-Reward Trade-offs, respectively. Blue boxes show good generalization from ToMnet-G (off-diagonal performance > .8, comparable to the performance when trained on the full training set), whereas red boxes show where it performs at or below chance (off-diagonal performance ≤ .5).

trains a model on a single trial type within a scenario and evaluates it on the remaining types of the same scenario ('G3: single type'). The fourth trains a model on a single scenario and evaluates it on the other three scenarios ('G4: single scenario').

We compare the performance of the two models on these four generalization tests in Table 1 (G1 and G2), Figure 6 (G3), and Figure 7 (G4). In general, we find little change in BIPaCK's performance across the generalization conditions. The largest performance drop for BIPaCK comes from Type 2.3 (highlighted in magenta boxes in Figure 6B), where the distribution of the parameters estimated from the training trials has a significant effect on the trajectory prediction (e.g., the model mistakenly predicts going around the wall, instead of the ground-truth trajectory of jumping over the wall, due to an inaccurately learned cost function). In cases where this cost function was mis-estimated, BIPaCK still adjusts its beliefs in the correct direction with familiarization: if it did not adjust its posterior using the familiarization video(s) (Eq. 4), there would be a further 10-15% performance drop. ToMnet-G, on the other hand, performs well in only a few generalization conditions (e.g., results highlighted in blue in Table 1, Figure 6A, and Figure 7A).
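The four generalization protocols can be sketched as split-generating code; the data layout (a list of (scenario, type) labels) and the function name are illustrative assumptions, not the paper's implementation:

```python
def generalization_splits(trials):
    """Build the G1-G4 train/test splits over a list of (scenario, type)
    labels. Returns a dict mapping split names to (train, test) label sets.
    The naming scheme here is an illustrative convention."""
    scenarios = {s for s, _ in trials}
    types = {(s, t) for s, t in trials}
    splits = {}
    for s, t in sorted(types):
        # G1: leave one type out -- train on the scenario's other types
        others = {(s2, t2) for s2, t2 in types if s2 == s and t2 != t}
        splits[f"G1-holdout-{s}-{t}"] = (others, {(s, t)})
        # G3: single type -- train on one type, test on the rest of the scenario
        splits[f"G3-single-{s}-{t}"] = ({(s, t)}, others)
    for s in sorted(scenarios):
        inside = {x for x in types if x[0] == s}
        outside = types - inside
        # G2: leave one scenario out
        splits[f"G2-holdout-{s}"] = (outside, inside)
        # G4: single scenario -- train on one scenario, test on the other three
        splits[f"G4-single-{s}"] = (inside, outside)
    return splits
```

Each split yields disjoint train and test label sets, so a model evaluated under G1-G4 never sees its test types (or scenarios) during training.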
There are two main challenges that ToMnet-G faces (highlighted in red in Table 1, Figure 6A, and Figure 7A): (i) predicting trajectories in unfamiliar physical situations; and (ii) reliably computing costs and rewards that are grounded in objects and physics. These results complement the findings on ToMnet-based models reported in Gandhi et al. (2021), suggesting that current model-free methods like ToMnet have a limited capacity for (i) inferring agents' mental states from a small number of familiarization videos, and (ii) generalizing knowledge of agents to novel situations. We report comprehensive results in the supplementary material.
6. Conclusion
We propose AGENT, a benchmark for core psychological reasoning, which consists of a large-scale dataset of cognitively inspired tasks designed to probe machine agents' understanding of key concepts of intuitive psychology in four scenarios: Goal Preferences, Action Efficiency, Unobserved Constraints, and Cost-Reward Trade-offs. We validate our tasks with a large-scale set of empirical ratings from human observers, and propose several evaluation procedures that require generalization both within and across scenarios. For the proposed tasks in the benchmark, we build two baseline models (BIPaCK and ToMnet-G) based on existing approaches, and compare their performance on AGENT to human performance. Overall, we find that BIPaCK achieves better performance than ToMnet-G, especially in tests of strong generalization.

Our benchmark presents exciting opportunities for future research on machine common sense in intuitive psychology. For instance, while BIPaCK outperforms ToMnet-G in almost all conditions, it also requires an accurate reconstruction of the 3D state and a built-in model of the physical dynamics, which will not necessarily be available in real-world scenes. It is an open question whether we can learn the generalizable inverse graphics and physics simulators on which BIPaCK rests. There has been work on this front (e.g., Piloto et al. 2018; Riochet et al. 2020; Wu et al. 2017), from which probabilistic models built on human core knowledge of physics and psychology could potentially benefit. On the other hand, without many built-in priors, ToMnet-G demonstrates promising results when trained and tested on similar scenarios, but it still lacks a strong generalization capacity both within scenarios and across them. Generalization could potentially be improved with more advanced architectures, or by pre-training on a wider variety of physical scenes to learn a more general-purpose simulator.
These open areas for improvement suggest that AGENT is a well-structured diagnostic tool for developing better models of intuitive psychology.
Acknowledgements
This work was supported by the DARPA Machine Common Sense program, the MIT-IBM Watson AI Lab, and NSF STC award CCF-1231216.
References
Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1, 2004.

Aggarwal, J. K. and Ryoo, M. S. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):1–43, 2011.

Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., and Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–971, 2016.

Albrecht, S. V. and Stone, P. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.

Baillargeon, R. Infants' understanding of the physical world. Journal of the Neurological Sciences, 143(1-2):199–199, 1996.

Baillargeon, R., Scott, R. M., and Bian, L. Psychological reasoning in infancy. Annu. Rev. Psychol., 67(1):159–186, 2016.

Baker, C. L., Jara-Ettinger, J., Saxe, R., and Tenenbaum, J. B. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):1–10, 2017.

Bakhtin, A., van der Maaten, L., Johnson, J., Gustafson, L., and Girshick, R. PHYRE: A new benchmark for physical reasoning. Advances in Neural Information Processing Systems, 32:5082–5093, 2019.

Battaglia, P. W., Hamrick, J. B., and Tenenbaum, J. B. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.

Berndt, D. J. and Clifford, J. Using dynamic time warping to find patterns in time series. In KDD Workshop, pp. 359–370. Seattle, WA, USA, 1994.

Caba Heilbron, F., Escorcia, V., Ghanem, B., and Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970, 2015.
Cao, Z., Gao, H., Mangalam, K., Cai, Q.-Z., Vo, M., and Malik, J. Long-term human motion prediction with scene context. In European Conference on Computer Vision, pp. 387–404. Springer, 2020.

Carroll, M., Shah, R., Ho, M. K., Griffiths, T. L., Seshia, S. A., Abbeel, P., and Dragan, A. On the utility of learning about humans for human-AI coordination. arXiv preprint arXiv:1910.05789, 2019.

Choi, W. and Savarese, S. Understanding collective activities of people from videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1242–1257, 2013.

Coumans, E. and Bai, Y. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.

Csibra, G., Bíró, Z., Koós, O., and Gergely, G. One-year-old infants use teleological representations of actions productively. Cogn. Sci., 27(1):111–133, 2003.

Dautenhahn, K. Socially intelligent robots: dimensions of human–robot interaction. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1480):679–704, 2007.

Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.

Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pp. 357–368. PMLR, 2017.

Fouhey, D. F., Kuo, W.-C., Efros, A. A., and Malik, J. From lifestyle vlogs to everyday interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4991–5000, 2018.

Gan, C., Schwartz, J., Alter, S., Schrimpf, M., Traer, J., De Freitas, J., Kubilius, J., Bhandwaldar, A., Haber, N., Sano, M., et al. ThreeDWorld: A platform for interactive multi-modal physical simulation. arXiv preprint arXiv:2007.04954, 2020.

Gandhi, K., Stojnic, G., Lake, B. M., and Dillon, M. R. Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others. arXiv preprint arXiv:2102.11938, 2021.

Gergely, G. and Csibra, G. Teleological reasoning in infancy: The naïve theory of rational action. Trends Cogn. Sci., 7(7):287–292, 2003.

Gergely, G., Nádasdy, Z., Csibra, G., and Bíró, S. Taking the intentional stance at 12 months of age. Cognition, 56(2):165–193, 1995.

Gordon, A. Commonsense interpretation of triangle behavior. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Groth, O., Fuchs, F. B., Posner, I., and Vedaldi, A. ShapeStacks: Learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 702–717, 2018.

Hadfield-Menell, D., Dragan, A., Abbeel, P., and Russell, S. Cooperative inverse reinforcement learning. arXiv preprint arXiv:1606.03137, 2016.

Huang, D.-A., Xu, D., Zhu, Y., Garg, A., Savarese, S., Fei-Fei, L., and Niebles, J. C. Continuous relaxation of symbolic planner for one-shot imitation learning. arXiv preprint arXiv:1908.06769, 2019.

Ibrahim, M. S., Muralidharan, S., Deng, Z., Vahdat, A., and Mori, G. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1980, 2016.

James, S., Bloesch, M., and Davison, A. J. Task-embedded control networks for few-shot imitation learning. In Conference on Robot Learning, pp. 783–795. PMLR, 2018.

Jara-Ettinger, J., Gweon, H., Schulz, L. E., and Tenenbaum, J. B. The naïve utility calculus: Computational principles underlying commonsense psychology. Trends Cogn. Sci., 20(8):589–604, 2016.

Jiang, C., Qi, S., Zhu, Y., Huang, S., Lin, J., Yu, L.-F., Terzopoulos, D., and Zhu, S.-C. Configurable 3D scene synthesis and 2D image rendering with per-pixel ground truth using stochastic grammars. International Journal of Computer Vision, 126(9):920–941, 2018.

Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.

Karaman, S., Walter, M. R., Perez, A., Frazzoli, E., and Teller, S. Anytime motion planning using the RRT*. In , pp. 1478–1483. IEEE, 2011.

Kitani, K. M., Ziebart, B. D., Bagnell, J. A., and Hebert, M. Activity forecasting. In European Conference on Computer Vision, pp. 201–214. Springer, 2012.
Kleiman-Weiner, M., Ho, M. K., Austerweil, J. L., Littman, M. L., and Tenenbaum, J. B. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In CogSci, 2016.

Kong, Y. and Fu, Y. Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230, 2018.

Koppula, H. and Saxena, A. Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation. In International Conference on Machine Learning, pp. 792–800. PMLR, 2013.

Liang, J., Jiang, L., Niebles, J. C., Hauptmann, A. G., and Fei-Fei, L. Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5725–5734, 2019.

Liu, S., Ullman, T. D., Tenenbaum, J. B., and Spelke, E. S. Ten-month-old infants infer the value of goals from the costs of actions. Science, 358(6366):1038–1041, November 2017.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.

Mordatch, I. and Abbeel, P. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Nan, Z., Shu, T., Gong, R., Wang, S., Wei, P., Zhu, S.-C., and Zheng, N. Learning to infer human attention in daily activities. Pattern Recognition, pp. 107314, 2020.

Netanyahu, A., Shu, T., Katz, B., Barbu, A., and Tenenbaum, J. B. PHASE: PHysically-grounded Abstract Social Events for machine social perception. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021.

Nikolaidis, S., Hsu, D., and Srinivasa, S. Human-robot mutual adaptation in collaborative tasks: Models and experiments. The International Journal of Robotics Research, 36(5-7):618–634, 2017.

Piloto, L., Weinstein, A., TB, D., Ahuja, A., Mirza, M., Wayne, G., Amos, D., Hung, C.-C., and Botvinick, M. Probing physics knowledge using tools from developmental psychology. arXiv:1804.01128 [cs], 2018.

Poppe, R. A survey on vision-based human action recognition. Image and Vision Computing, 28(6):976–990, 2010.

Puig, X., Shu, T., Li, S., Wang, Z., Tenenbaum, J. B., Fidler, S., and Torralba, A. Watch-And-Help: A challenge for social perception and human-AI collaboration. arXiv preprint arXiv:2010.09890, 2020.

Rabinowitz, N., Perbet, F., Song, F., Zhang, C., Eslami, S. A., and Botvinick, M. Machine theory of mind. In International Conference on Machine Learning, pp. 4218–4227. PMLR, 2018.

Riochet, R., Castro, M. Y., Bernard, M., Lerer, A., Fergus, R., Izard, V., and Dupoux, E. IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv:1803.07616 [cs], 2018.

Riochet, R., Sivic, J., Laptev, I., and Dupoux, E. Occlusion resistant learning of intuitive physics from videos. arXiv:2005.00069 [cs, eess], 2020.

Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3234–3243, 2016.

Rozo, L., Calinon, S., Caldwell, D. G., Jimenez, P., and Torras, C. Learning physical collaborative robot behaviors from human demonstrations. IEEE Transactions on Robotics, 32(3):513–527, 2016.

Sadigh, D., Sastry, S., Seshia, S. A., and Dragan, A. D. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and Systems, volume 2. Ann Arbor, MI, USA, 2016.

Sheridan, T. B. Human–robot interaction: status and challenges. Human Factors, 58(4):525–532, 2016.

Shu, T. and Tian, Y. M³RL: Mind-aware multi-agent management reinforcement learning. arXiv preprint arXiv:1810.00147, 2018.

Shu, T., Xie, D., Rothrock, B., Todorovic, S., and Zhu, S.-C. Joint inference of groups, events and human roles in aerial videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4576–4584, 2015.

Sigurdsson, G. A., Gupta, A., Schmid, C., Farhadi, A., and Alahari, K. Charades-Ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626, 2018.

Silver, T., Allen, K. R., Lew, A. K., Kaelbling, L. P., and Tenenbaum, J. Few-shot Bayesian imitation learning with logical program policies. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10251–10258, 2020.
Song, S., Yu, F., Zeng, A., Chang, A. X., Savva, M., and Funkhouser, T. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1746–1754, 2017.

Spelke, E. S., Breinlinger, K., Macomber, J., and Jacobson, K. Origins of knowledge. Psychol. Rev., 99(4):605–632, October 1992.

Wang, R. E., Wu, S. A., Evans, J. A., Tenenbaum, J. B., Parkes, D. C., and Kleiman-Weiner, M. Too many cooks: Bayesian inference for coordinating multi-agent collaboration. arXiv e-prints, pp. arXiv–2003, 2020.

Woodward, A. L. Infants selectively encode the goal object of an actor's reach. Cognition, 69(1):1–34, 1998.

Wu, J., Lu, E., Kohli, P., Freeman, W. T., and Tenenbaum, J. B. Learning to see physics via visual de-animation. In Neural Information Processing Systems, pp. 12, 2017.

Xia, F., R. Zamir, A., He, Z.-Y., Sax, A., Malik, J., and Savarese, S. Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.

Xie, A., Losey, D. P., Tolsma, R., Finn, C., and Sadigh, D. Learning latent representations to influence multi-agent interaction. arXiv preprint arXiv:2011.06619, 2020.

Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., and Tenenbaum, J. B. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019.

Yu, T., Finn, C., Xie, A., Dasari, S., Zhang, T., Abbeel, P., and Levine, S. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.

Zitnick, C. L., Vedantam, R., and Parikh, D. Adopting abstract images for semantic scene understanding.