Grid-to-Graph: Flexible Spatial Relational Inductive Biases for Reinforcement Learning
Zhengyao Jiang, Pasquale Minervini, Minqi Jiang, Tim Rocktaschel
Zhengyao Jiang
Centre for Artificial Intelligence, University College London

Pasquale Minervini
Centre for Artificial Intelligence, University College London

Minqi Jiang
Centre for Artificial Intelligence, University College London

Tim Rocktäschel
Centre for Artificial Intelligence, University College London
ABSTRACT
Although reinforcement learning has been successfully applied in many domains in recent years, we still lack agents that can systematically generalize. While relational inductive biases that fit a task can improve generalization of RL agents, these biases are commonly hard-coded directly in the agent's neural architecture. In this work, we show that we can incorporate relational inductive biases, encoded in the form of relational graphs, into agents. Based on this insight, we propose Grid-to-Graph (GTG), a mapping from grid structures to relational graphs that carry useful spatial relational inductive biases when processed through a Relational Graph Convolution Network (R-GCN). We show that, with GTG, R-GCNs generalize better both in-distribution and out-of-distribution compared to baselines based on Convolutional Neural Networks and Neural Logic Machines on challenging procedurally generated environments and MinAtar. Furthermore, we show that GTG produces agents that can jointly reason over observations and environment dynamics encoded in knowledge bases.
KEYWORDS
Relational Inductive Bias; Reinforcement Learning; Graph Neural Network
ACM Reference Format:
Zhengyao Jiang, Pasquale Minervini, Minqi Jiang, and Tim Rocktäschel. 2021. Grid-to-Graph: Flexible Spatial Relational Inductive Biases for Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 11 pages.
Reinforcement Learning (RL) has seen many successful applications in recent years. However, developing agents that can systematically generalize to out-of-distribution observations remains an open challenge [3, 9, 35]. Relational inductive biases are considered important in promoting both in-distribution and systematic generalization, in supervised learning and RL settings [28, 34, 35].
Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online
Battaglia et al. [1] define relational inductive biases as constraints on the relationships and interactions among entities in a learning process. Traditionally, relational inductive biases have been hard-coded in an agent's neural network architecture. By tailoring the connections between neurons and applying different parameter sharing schemes, architectures can embody various useful inductive biases. For example, convolutional layers [15] exhibit locality and spatial translation equivariance [21], a particularly useful inductive bias for computer vision, as the features of an object should not depend on its coordinates in an input image. Similarly, recurrent layers [13] and Deep Sets respectively exhibit time translation and permutation equivariance [1, 32].

In this work, we introduce a unified graph-based framework that allows us to express several useful relational inductive biases using the same formalism. Specifically, we frame the computation graph underlying a neural architecture as a directed multigraph, with parameter sharing groups denoted by common edge labels connecting shared parameters in each group. This formalization allows us to define specific inductive biases as comprising rules that generate edges and edge labels. The computation is then implemented by Relational Graph Convolutional Networks [R-GCNs, 22], a type of Graph Neural Network [GNNs, 37] that dynamically constructs a computation graph based on a relational graph.

We make use of this formalism to introduce Grid-to-Graph (GTG), a mapping from grid structures of discrete 2D observations to relational graphs, based on a set of relation determination rules that generate effective spatial relational inductive biases. Given a feature map with entities (nodes) arranged in a lattice, where each entity corresponds to a feature vector, the relations encoded by GTG constrain the flow of information between these entity feature vectors when they are processed by R-GCNs.
We refer to the resulting approach as R-GCN-GTG. We evaluate R-GCN-GTG in eight tasks: five MinAtar games [30], a procedurally-generated LavaCrossing environment [2], a Box-World environment [33] requiring complex relational reasoning, and a symbolic variant of Read to Fight Monsters [RTFM, 36], an environment that provides knowledge bases (KBs) describing environment dynamics that change in every episode. On RTFM, we demonstrate that R-GCN-GTG can exploit not only the spatial information in a feature map, but also relational information, without modifying the neural architecture. Our experiments show that R-GCN-GTG produces better policies than Convolutional
Figure 1: A high-level overview of R-GCN-GTG. We abstract away the grid structure of the feature map and turn observations into a spatial relational graph with GTG. The vectors of the feature map are attached to nodes in the relational graph. We then use an R-GCN to reason over the relational graph and node feature vectors to produce an action distribution.
Neural Networks (CNNs) or Neural Logic Machines [NLMs, 6], a state-of-the-art neuro-symbolic model for relational reinforcement learning.

In summary, our main contributions are: i) we propose a principled approach for expressing relational inductive biases for neural networks in terms of relational graphs; ii) we introduce GTG to transform grid structures represented by a feature map into relational graphs that carry spatial relational inductive biases; iii) we empirically demonstrate that, in comparison to CNNs and NLMs, R-GCN-GTG generalizes better both in- and out-of-distribution in a diverse set of challenging procedurally-generated grid-world RL tasks; and finally iv) we show that GTG is able to incorporate external knowledge, enabling R-GCN-GTG to jointly reason over spatial information and relational information about novel environment dynamics without any additional architectural modifications.

Graph Neural Networks in RL.
To our knowledge, NerveNet [28] is the first work that used GNNs to represent an RL policy. Their model follows a similar message-passing scheme to GCNs, maintaining only node feature vectors. NerveNet has been benchmarked on MuJoCo environments, Snake and Centipede, and achieves better in-distribution performance and generalization than Multi-Layer Perceptrons (MLPs). In these environments, NerveNet controls multi-joint robotic avatars. State information and output actions are represented as a graph structure, where each node corresponds to a movable part of the agent, and local states and actions are attached to each node. Their generalization tests cover size, disability transfer, and multi-task learning. While achieving good systematic generalization performance, Wang et al. [28] focus on the continuous control setting and abstract the graph structure from the morphological information of the avatar. Kurin et al. [17] propose Amorpheus, a transformer architecture for continuous control that does not rely on morphological information, outperforming NerveNet. Our work focuses on a different class of tasks, for which the input can be represented as a feature map.
Neuro-Logic Models for RL.
A separate branch of relational neuro-symbolic models builds directly upon first-order logic, for example ∂ILP [8] and Neural Logic Machines (NLMs) [6]. Neural Logic Reinforcement Learning [NLRL, 14] applies a modified version of ∂ILP on a varied set of block-world tasks and grid-world cliff-walking tasks, displaying robust generalization properties. While ∂ILP has useful strong inductive biases that allow it to generalize well once it has learned a good policy, it suffers from poor scalability and proves difficult to train for more complex logical mappings. Therefore, we use Neural Logic Machines [NLMs, 6], a more expressive and scalable neuro-symbolic architecture, as our baseline. Besides supervised concept-learning tasks, NLMs have also been applied to simple reinforcement learning tasks [6]. In Dong et al. [6], an NLM-based agent is trained to generalize to procedurally generated block-world environments and algorithmic tasks. In these tasks, the NLM-based agent surpasses a Memory-Augmented Neural Network baseline [24] both in terms of in-distribution generalization and out-of-distribution generalization to larger problem sizes. NLMs treat relations as inputs and reason about them in a soft, differentiable manner. However, inductive biases for NLMs remain hard-coded in the architecture. In contrast, R-GCN uses learned relations to express relational inductive biases that constrain message passing.
Works Focusing on Symmetries.
Symmetries, especially equivariances, form an important class of relational inductive biases. Some previous works [4, 27] focus on how prior knowledge about symmetries can be incorporated into a model. Group equivariant convolutional networks can represent arbitrary group symmetries, such as rotation and flipping [4]. In RL, MDP Homomorphic Networks [27] express symmetries in the joint state-action space. For these methods, certain relational inductive biases outside of symmetries, like locality in CNNs, still must be hard-coded into the architecture. R-GCN-GTG cannot express equivariance in the state-action space, but it can recover some symmetries like translation equivariance. However, it is unclear whether it can represent all group symmetries.
Relational Graphs.
We define a relational graph as a labeled, directed multigraph, denoted as G = (V, E, R), where V is the set of nodes (representing entities), R is the set of relation labels (representing relation types), and E ⊆ V × R × V is the set of relations (labeled directed edges). Each relation is represented by a tuple (a, r, b) with a ∈ V, r ∈ R, and b ∈ V, and represents a relationship of type r between the source entity a and the target entity b of the edge.

Relational Graph Convolutional Networks.
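As an illustration, the relational graph G = (V, E, R) defined above can be encoded directly as a set of labeled directed triples. The class and method names below are our own and not from the paper's released code; this is a minimal sketch:

```python
# An illustrative encoding of the relational graph G = (V, E, R) as a set of
# labeled directed triples. Class and method names are our own, not the
# paper's released code.

class RelationalGraph:
    def __init__(self):
        self.nodes = set()    # V: entities
        self.labels = set()   # R: relation labels
        self.edges = set()    # E ⊆ V × R × V, as (source, label, target) triples

    def add_relation(self, a, r, b):
        """Record the relation r(a, b) from source entity a to target entity b."""
        self.nodes.update((a, b))
        self.labels.add(r)
        self.edges.add((a, r, b))

    def neighbors(self, a, r):
        """N_a^r: entities b with an edge (b, r, a) pointing into a."""
        return {b for (b, rel, tgt) in self.edges if rel == r and tgt == a}
```

With this representation, the relation determination rules introduced later simply decide which triples to add for each pair of grid cells.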
R-GCNs [22] extend GNNs to model relational graphs. R-GCNs represent a map from the set of node feature vectors to a new set of feature vectors, conditioned on the relational graph G and the parameters $W_r$ attached to each relation label. The update rule for the feature vector $\mathbf{x}_a$ of node $a$ is given by:

$\mathbf{x}'_a := \sigma\Big(\sum_{r \in \mathcal{R}} \sum_{b \in \mathcal{N}_a^r} c_{a,r}\, \mathbf{W}_r \mathbf{x}_b + \mathbf{W}_0 \mathbf{x}_a\Big)$, (1)

where $\sigma$ is a non-linearity (such as the ReLU function), $\mathcal{N}_a^r$ denotes the neighboring nodes of $a$ under relation type $r$, $\mathbf{W}_r$ is the weight matrix associated with $r$, and $c_{a,r}$ is a normalization constant. In this work, we use $c_{a,r} = 1/|\mathcal{N}_a^r|$. R-GCNs were introduced to deal with graph-structured data [22]. Our work presents a new perspective on R-GCNs: we view relational graphs as representing the connectivity and parameter sharing scheme of the model, thereby encoding a prior relational inductive bias.

In this section, we describe how the relational graphs used by R-GCNs can be adopted to formalize two constraints commonly used in neural architecture design: sparse connectivity and parameter sharing. We introduce GTG, a set of relation determination rules for representing spatial relational inductive biases. GTG strictly generalizes the inductive bias underlying convolutional layers. Finally, we propose two ways of enabling R-GCNs to jointly reason with visual information restructured according to GTG and, potentially, additional external relational knowledge.
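The update in Eq. (1) can be sketched in pure Python. This is a minimal illustration, assuming the normalization $c_{a,r} = 1/|\mathcal{N}_a^r|$ and a ReLU non-linearity; the function names and the dict-based graph encoding are our own (the paper's implementation builds on PyTorch Geometric):

```python
# A minimal pure-Python sketch of the R-GCN node update (Eq. 1), assuming
# c_{a,r} = 1 / |N_a^r| and a ReLU non-linearity. Names are illustrative.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rgcn_update(a, features, neighbors, W_rel, W_self):
    """Updated feature vector x'_a for node `a`.

    features:  node -> feature vector x_b
    neighbors: relation label r -> set N_a^r of neighbors of `a` under r
    W_rel:     relation label r -> weight matrix W_r
    W_self:    self-connection weight matrix W_0
    """
    out = matvec(W_self, features[a])            # W_0 x_a
    for r, nbrs in neighbors.items():
        if not nbrs:
            continue
        c = 1.0 / len(nbrs)                      # c_{a,r}
        for b in nbrs:
            msg = matvec(W_rel[r], features[b])  # W_r x_b
            out = [o + c * m for o, m in zip(out, msg)]
    return [max(0.0, v) for v in out]            # ReLU
```

An R-GCN layer applies this update to every node; stacking layers lets messages propagate further through the relational graph.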
In R-GCNs, message passing is explicitly directed by the relational graph rather than implicitly by the model architecture. Removing the non-linearity and representing the self-connection term $\mathbf{W}_0 \mathbf{x}_a$ as a self-loop edge, we obtain the following simplified R-GCN update rule:

$\mathbf{y}_a = \sum_{r \in \mathcal{R} \cup \{\mathrm{self}\}} \sum_{b \in \mathcal{N}_a^r} c_{a,r}\, \mathbf{W}_r \mathbf{x}_b$. (2)

(All self-loop edges share the same relation label, and the term $\mathbf{W}_0 \mathbf{x}_a$ is included in the summation.)

The set of edges determines whether there is message passing between each pair of entities, thereby encoding the connectivity of the model. The relation labels indicate the specific pattern of parameters to be used by the message-passing functions. By making use of different relational graphs, R-GCNs can represent many common neural architectures, including MLPs, CNNs, and Deep Sets. We provide a formal description of the neural architectures that R-GCNs can represent in Appendix A.3.

To construct the relational graph, we make use of relation determination rules, each defined in the following form:

r(a, b) ← condition,

where r(a, b) is a relation from entity a to entity b with label r, and condition is a logic statement. If the condition holds true, the relation label r is appended to R_ab, the set of all relation labels of relations between entities a and b. The relation determination rules then express the relational inductive bias by controlling the sparsity and parameter sharing patterns of a feed-forward neural network.

Figure 2: Three subsets of GTG relation determination rules: (a) local (all), (b) remote left, (c) aligned. The black tile is the target node, and the other colored tiles are source nodes. Tiles with the same relationship to the target node share the same color.

We now introduce a set of relation determination rules that can be used to construct spatial relational inductive biases. We start by replicating the relational inductive biases of CNNs, and then introduce new biases to address the limitations of CNNs.
The number of possible definitions of spatial relationships between objects is very large, and it may not be feasible to enumerate each of them, let alone empirically evaluate them all. We therefore start by mimicking the inductive biases encoded by CNNs, which have been shown to be effective in computer vision tasks and deep reinforcement learning tasks with visual inputs [16, 20, 23]. This provides us with a set of local directional relations. Each local directional relation specifies the relative position of two adjacent entities. A graphical illustration is shown in Fig. 2 (a), where a selected target node is painted black and source nodes are painted with different colors, each corresponding to a different relation label. For clarity, we picked a single node as the target node, though this may not be generally the case. Visualizing all the local directional relations would result in a mesh of edges connecting all nodes to each other.

Consider two entities a and b, and their coordinates x_a, y_a, x_b, y_b. The determination rules for local directional relations are as follows:

rightAdj(a, b) ← (x_a = x_b + 1) ∧ (y_a = y_b),
leftAdj(a, b) ← (x_a = x_b − 1) ∧ (y_a = y_b),
topAdj(a, b) ← (y_a = y_b + 1) ∧ (x_a = x_b),
bottomAdj(a, b) ← (y_a = y_b − 1) ∧ (x_a = x_b),
topRightAdj(a, b) ← (x_a = x_b + 1) ∧ (y_a = y_b + 1),
topLeftAdj(a, b) ← (x_a = x_b − 1) ∧ (y_a = y_b + 1),
bottomRightAdj(a, b) ← (x_a = x_b + 1) ∧ (y_a = y_b − 1),
bottomLeftAdj(a, b) ← (x_a = x_b − 1) ∧ (y_a = y_b − 1).

By only applying these local directional relations, the computation of the associated R-GCN model would be equivalent to that of a convolutional layer with a 3 × 3 kernel. One limitation of convolutional layers is the difficulty of message passing among remote entities.
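The local directional rules above reduce to a lookup over coordinate offsets. This is an illustrative sketch (the dictionary and function names are our own), with x growing rightward and y growing upward so that, e.g., rightAdj(a, b) holds iff x_a = x_b + 1 and y_a = y_b:

```python
# A sketch of the local directional relation determination rules as
# coordinate offsets. Names are illustrative.

LOCAL_DIRECTIONAL = {
    "rightAdj":       (1, 0),
    "leftAdj":        (-1, 0),
    "topAdj":         (0, 1),
    "bottomAdj":      (0, -1),
    "topRightAdj":    (1, 1),
    "topLeftAdj":     (-1, 1),
    "bottomRightAdj": (1, -1),
    "bottomLeftAdj":  (-1, -1),
}

def local_relations(a, b):
    """Labels r with r(a, b), for grid coordinates a = (x_a, y_a), b = (x_b, y_b)."""
    (xa, ya), (xb, yb) = a, b
    return {r for r, (dx, dy) in LOCAL_DIRECTIONAL.items()
            if xa == xb + dx and ya == yb + dy}
```

Applying these rules to every ordered pair of cells yields exactly one labeled edge per pair of adjacent cells, mirroring a 3 × 3 convolution's connectivity.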
In order to pass information to another node N blocks away, N layers are needed for a CNN with a stride length of 1. This problem can be alleviated with larger strides, pooling layers, or dilated convolutions [11, 31], but the model will still require a large number of layers. In shallower CNNs, such as the baseline model used in this work, long-distance message passing is accomplished by dense layers following the convolutional layers, as there is no message passing between distant entities. However, dense layers exhibit only a weak relational inductive bias, which can hurt generalization performance. With only local directional relations, R-GCNs inherit the same long-distance message passing problem as convolutional layers. We therefore introduce remote directional relations, which capture the notion of relative positions between objects. We visualize one such remote directional relation, left, in Fig. 2 (b). We express remote directional relations using the following rules:

right(a, b) ← x_a > x_b,
left(a, b) ← x_a < x_b,
top(a, b) ← y_a > y_b,
bottom(a, b) ← y_a < y_b.

Besides these directional relations, we also add two auxiliary relations, aligned and adjacent:

aligned(a, b) ← (x_a = x_b) ∨ (y_a = y_b),
adjacent(a, b) ← (|x_a − x_b| ≤ 1) ∧ (|y_a − y_b| ≤ 1).

Aligned relations indicate whether two entities are on the same horizontal or vertical line, visualized in Fig. 2 (c).
Adjacent relations indicate whether two objects are adjacent to each other, which, unlike local directional relations, carries no directional information.
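The remote directional and auxiliary rules admit an equally direct sketch (again with an illustrative function name, following the rule definitions above):

```python
# A sketch of the remote directional and auxiliary relation determination
# rules; the function name is illustrative.

def remote_and_auxiliary_relations(a, b):
    """Labels r with r(a, b), for a = (x_a, y_a), b = (x_b, y_b)."""
    (xa, ya), (xb, yb) = a, b
    rels = set()
    if xa > xb:
        rels.add("right")
    if xa < xb:
        rels.add("left")
    if ya > yb:
        rels.add("top")
    if ya < yb:
        rels.add("bottom")
    if xa == xb or ya == yb:
        rels.add("aligned")                      # same row or same column
    if abs(xa - xb) <= 1 and abs(ya - yb) <= 1:
        rels.add("adjacent")                     # within the 3 x 3 neighborhood
    return rels
```

Note that a single pair of cells can satisfy several rules at once (e.g., a diagonal neighbor is right, top, and adjacent), so the relational graph is a multigraph.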
GTG expresses spatial relational inductive biases in the form of a relational graph. As R-GCN was originally designed to reason over knowledge graphs, it may be tempting to let R-GCN jointly reason over spatial inputs and a task-relevant external relational knowledge graph by simply merging some graph representation of each, without further architecture changes. We introduce two ways of incorporating external relational knowledge: one-hop relations between physical entities, and grounding relations with a knowledge graph. Examples of these two approaches applied to RTFM can be found in Section 5.4 and Fig. 5.

Figure 3: Reasoning with an external knowledge base. Knowledge-base relations connect conceptual entities, spatial relations connect physical entities, and grounding relations link the two.
We refer to each cell in the feature map as a physical entity. We can then straightforwardly introduce relational knowledge by adding relations between physical entities. However, this limits the knowledge that can be expressed, as this approach cannot represent more abstract knowledge that describes relations between concepts rather than specific entities, e.g., a shining weapon can kill fire monsters.
To enable the inclusion of external knowledge in our model, we maintain two sets of entities: conceptual entities, which exist in the knowledge base (e.g., the class of an object), and physical entities, which exist in the environment (e.g., a specific object in the environment, such as a monster). We can then link the two graphs corresponding to these two sets of entities with grounding relations, so that information can flow between them. A graphical illustration of this approach is presented in Fig. 3. The grounding relations assign conceptual counterparts to the physical entities. In this work, these relations are handcrafted.
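The merge can be sketched as a union of edge sets plus handcrafted grounding edges. The "groundedIn"/"grounds" labels and the function name below are hypothetical, chosen only to illustrate the idea; the actual grounding relations are handcrafted per task:

```python
# A hypothetical sketch of merging a knowledge base into the spatial
# relational graph via grounding relations. The "groundedIn"/"grounds"
# labels and the function name are our own.

def merge_with_kb(spatial_edges, kb_edges, grounding):
    """Union spatial edges, KB edges over conceptual entities, and grounding
    edges linking each physical entity to its conceptual counterpart."""
    edges = set(spatial_edges) | set(kb_edges)
    for physical, concept in grounding.items():
        edges.add((physical, "groundedIn", concept))
        edges.add((concept, "grounds", physical))  # information flows both ways
    return edges
```

Because the R-GCN treats every labeled edge uniformly, the merged multigraph needs no architectural changes: messages simply flow between physical and conceptual entities along the grounding edges.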
MinAtar [29] is a collection of miniature Atari games. Each observation is a 10 × 10 feature map, where each cell represents a single object in the game. Unlike conventional Atari games, the environments in MinAtar are stochastic; for instance, in Breakout, the ball starts in a random position. MinAtar environments also set a 10% sticky-action [12] probability by default. Sticky actions force the agent to repeat the action taken in the last step. We enforce a 5,000-step limit, as some agents can play indefinitely on some of the MinAtar games.
Box-World is a grid-world navigation game introduced by Zambaldi et al. [34]. To solve the game, the agent must collect the gem using the correct key. However, the key is in a locked box, which needs to be unlocked with a separate key. There are also distractor branches, which will consume the current key and produce a key that cannot unlock the gem box. As the game is combinatorially complex, the chance of hitting the correct solution by random walk is low. Zambaldi et al. [34] demonstrated that their RL models required between 2 × and 14 × steps to converge in this environment. Due to limited computational resources, we only train models for 10 steps in each environment and further reduce the difficulty of the Box-World environment: (1) the field size is reduced to 10 × 10, (2) the number of distractor branches is set to 1, (3) the length of distractor branches is set to 1, and (4) the goal length is set to 2. Although we reduced the difficulty, we still preserve all core elements of the Box-World environment. We expect this simplified experimental setup to still allow us to compare the relational reasoning capabilities of different models while reducing the total steps needed for training to convergence.

LavaCrossing is a standard environment from MiniGrid [Minimalistic Gridworld, 2]. The agent must navigate to the goal position without falling into the lava river. The game is procedurally generated, and there are 3 difficulty levels available for each map size, making this an ideal RL environment to test combinatorial and out-of-distribution generalization of learned policies. MiniGrid environments are by default partially observable, but we configure our instances to be fully observable. Also, by default, an agent can turn left, turn right, and move forward, which requires the agent to know its direction when navigating to a particular position.
We adjust the action space, making the agent able to move in all four directions without turning left or right. The number of lava rivers generated equals the level number. To test out-of-distribution generalization, we train the agent on difficulty level 2 and test on difficulty levels 1 and 3.

We also design a Portal-LavaCrossing task, illustrated in Fig. 4 (a), to test whether R-GCN-GTG can generalize to non-Euclidean spaces without retraining. After training on difficulty level 2, we transfer the agent to Portal-LavaCrossing, where there are no gaps in the lava river. For each side of the lava river, a teleportation portal is placed in a random position: when the agent moves into the portal, it is placed on the other side of the lava river, and this is the only way to cross it. As no such portals exist in training levels, the agent must be able to generalize to a new environment with non-Euclidean space, leveraging novel test-time spatial relations, or resort to moving around randomly until stepping into the portal. In this task, the CNN baseline is not aware of the portal at all, and can only reach the goal if it moves into the portal by chance. For R-GCNs and NLMs, we append new spatial relations between portals and other cells: incoming relations to one portal all connect to the paired portal on the other side of the lava river, while outgoing relations from one portal are kept the same. These relations are shown in Fig. 4 (b) and (c).
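The portal rewiring can be sketched as follows, assuming edges are (source, label, target) triples and `portal_pairs` maps each portal node to its partner (both names are illustrative, not from the paper's code):

```python
# A sketch of the Portal-LavaCrossing rewiring: every edge that points INTO
# a portal is redirected to the paired portal, while outgoing edges from the
# portal are kept unchanged. Names are illustrative.

def rewire_portals(edges, portal_pairs):
    """portal_pairs maps each portal node to its partner on the other side."""
    rewired = set()
    for src, rel, tgt in edges:
        if tgt in portal_pairs:
            tgt = portal_pairs[tgt]  # incoming edge now lands on the paired portal
        rewired.add((src, rel, tgt))
    return rewired
```

Only the relational graph changes under this rewiring; the trained R-GCN weights are reused as-is, which is what makes the zero-shot transfer possible.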
Read to Fight Monsters [36] is a grid-world game, where each level includes a text document providing information about per-episode game dynamics. Each map contains two monsters and two weapons, each randomly generated and positioned.

Figure 4: Portal-LavaCrossing tasks. (a) One possible initial state, where blue circles are portals, the red triangle is the agent, and the green block is the goal position. (b) How incoming edges to the portal are attached to the paired portal. Dashed arrows represent the original edges and solid arrows, new edges. (c) Outgoing edges from the portal remain the same.

Each weapon has a modifier and each monster has an element property. The agent must defeat the monster of a specific element, which can only be defeated with weapons with a specific modifier. The relations describing which modifiers defeat which elements are procedurally generated at the start of each episode and described by the document. Furthermore, each monster belongs to a team. The text document also describes which team must be defeated. Without the text document, the agent can only pick a weapon and attack an arbitrary monster, which leads to an overall win probability of less than 50%. We construct a knowledge base which contains the same information as the original RTFM document, based on the grammatical rules that RTFM uses to generate the text document.

We test two approaches of introducing external knowledge to RTFM, shown in Fig. 5. The simpler physical-entities-only approach uses two relation labels, target and beat, ignoring the concepts of modifiers, elements, and teams. target(a) is a unary atom indicating that monster a is the one the agent must defeat, and is appended to feature vectors. The relation beat(a, b) (a binary atom) means weapon a defeats monster b. If the agent carries the weapon, entity a corresponds to the agent itself.
A more complex approach introduces conceptual entities and uses multi-hop reasoning to solve the problem. Along with the physical objects in the environment, this approach considers the conceptual entities of teams, modifiers, and elements. It introduces additional grounding relations: the relation assign(a, b) assigns modifier a to weapon b or element a to monster b. The relation belong(a, b) indicates that monster a belongs to team b. The relation beat(a, b) states that modifier a defeats monsters of element b. The relation target(a) means that the agent must defeat team a. Finally, hold(a) indicates the agent currently holds a weapon with modifier a.

In this subsection, we describe how GTG and R-GCN can be incorporated into an RL policy. First, the state of the environment is rendered as a feature map. Specifically, each tile in a grid world is represented as a binary-valued feature vector x. These feature vectors X are attached to nodes, and GTG generates edges between these nodes, forming a relational graph G that represents particular relational inductive biases. If required, extra knowledge about
Figure 5: RTFM tasks with KB encodings of the text document. The red triangle represents the agent; the blue block, the monster; and the green block, the weapon. Figure (a) shows the one-hop reasoning version of the KB encoding (RTFM-onehop-KB), where arrows indicate beat relations and the red frame indicates the target monster. Figure (b) illustrates the multi-hop reasoning version (RTFM-KB), with conceptual entities for teams, modifiers, and elements; arrows indicate a relation between two entities, and the red frame indicates the opponent team. Relation labels are not represented explicitly in the graph.

game dynamics expressed as a knowledge base can be merged into this multigraph. Subsequently, the R-GCN acts on this multigraph and the associated feature vectors. After processing with the R-GCN, we apply feature-wise max-pooling to all node feature vectors. The outputs are then fed to dense layers that output per-action logits. A graphical illustration of the whole process can be found in Fig. 1. The probability of actions can thus be written as:

$P(\mathbf{a} \mid \mathbf{X}) = \mathrm{softmax}(\mathrm{MLP}(\mathrm{maxpool}(g(\mathbf{X}, \mathcal{G}; \boldsymbol{\theta})); \boldsymbol{\theta}))$, (3)

where $g$ is the stack of R-GCN layers and the $\boldsymbol{\theta}$s are neural network parameters. A separate head performs value estimation:

$\hat{v}(\mathbf{X}) = \mathrm{MLP}(\mathrm{maxpool}(g(\mathbf{X}, \mathcal{G}; \boldsymbol{\theta})); \boldsymbol{\theta})$. (4)

During training, the agent samples actions according to the resulting action distribution. During testing, we take the action with maximum probability. In our experiments, we make use of IMPALA [7], a policy-gradient algorithm, to train our RL models. We base our IMPALA implementation on TorchBeast [18] and our R-GCN implementation on PyTorch Geometric [10]. Our implementation is available at https://github.com/ZhengyaoJiang/GTG.

In this section, we report our empirical results comparing R-GCN to baseline methods regarding in-distribution performance, out-of-distribution combinatorial generalization, and the ability to incorporate external knowledge.
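The action-distribution computation of Eq. (3) (feature-wise max-pooling over R-GCN node outputs, a dense head producing per-action logits, and a softmax) can be sketched in pure Python. Here `mlp` stands in for the dense layers, and all names are illustrative:

```python
# A pure-Python sketch of Eq. (3): max-pool over the R-GCN node outputs,
# apply an MLP head (passed in as `mlp`), and normalize with a softmax.
import math

def maxpool(node_features):
    """Feature-wise max over all node feature vectors."""
    return [max(col) for col in zip(*node_features)]

def softmax(logits):
    m = max(logits)                          # subtract the max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy(node_features, mlp):
    """P(a | X) = softmax(MLP(maxpool(g(X, G)))), with node_features = g(X, G)."""
    return softmax(mlp(maxpool(node_features)))
```

The value head of Eq. (4) reuses the same pooled representation, replacing the softmax with a scalar output.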
We also report ablations probing the effectiveness of different relation determination rules and components of R-GCN-GTG.
Fig. 6 shows the in-distribution performance (training curves) of CNNs and relational models on MinAtar and LavaCrossing tasks. For each model, we run 5 trials, each with a different random seed. The thin, opaque lines in the plot represent training curves corresponding to each run, and the bolded lines represent the mean episodic return averaged over all five runs. Here, NLM and R-GCN use GTG with all three classes of relationships, namely local directional, remote directional, and auxiliary relations. We can see that R-GCN-GTG models consistently perform either better than or on par with CNNs and NLM-GTGs across all eight environments. For Asterix, Seaquest, Box-World, and Breakout, R-GCN-GTG outperforms CNNs by a significant margin. As the CNN baseline in the RTFM environment is unable to access information in the knowledge base, it acts as an informative baseline for which only visual information is available. NLM-GTGs also achieve good performance on Seaquest and Breakout, but they are inferior to CNNs on other MinAtar tasks. We also mark the best performance of the original MinAtar baselines [30] with a red horizontal dashed line. It is worth noting that we use a deeper network than these baselines, which include a deep Q-network [19] and actor-critic with eligibility traces [5, 25], trained on twice as many steps. Thus, our CNN agent acts as a much stronger baseline than the original MinAtar models.

There are two potential reasons for the large performance gain of R-GCN-GTG over CNNs. Firstly, R-GCN-GTG does not make use of the absolute position of objects, instead only taking into account the relative positions between objects. In contrast, CNNs employ dense layers to reason globally, and these dense layers have a weak relational inductive bias, which can negatively impact sample efficiency and generalization. Secondly, GTG provides more flexible message passing than conventional CNN layers in terms of long-range dependencies.
We further study the roles of various relation determination rules and of max-pooling after convolutions in our ablation studies.

Although the NLM-GTG models in our experiments make use of relational graphs determined by GTG, they use this information in a less structured way. Specifically, NLM-GTGs encode such relational information as dense vector representations, whereas R-GCN-GTG uses this information to construct a GNN, thus directly determining the computation graph and flow of messages. The performance gain of R-GCN-GTG over NLM-GTG suggests that using GTG to determine the specific relational inductive bias, and therefore guide message passing in a structured way, results in better in-distribution performance.
In Table 1 and Table 2, we show how policies learned by our relational models generalize to environments outside of the training distribution. For our LavaCrossing experiments, we train the agent on difficulty level 2 and test the policy on difficulty levels 1 and 3. Table 1 shows average returns over five training runs. Each model is evaluated on 200 test episodes. The relative performance change with respect to the training environment is shown in parentheses. All the models generalize and perform optimally on difficulty level 1. However, when generalizing to difficulty level 3, the relational models perform significantly better than CNNs. R-GCN-GTG generalizes best among all the models we tested.

Table 2 shows the win rate for each model on the symbolic variants of the RTFM tasks. Again, we report mean returns averaged over five training runs. Each section of the table represents a different task variation. We transform the text document into a symbolic knowledge base of triples for RTFM-KB and RTFM-onehop-KB.

(a) Seaquest (b) Asterix (c) Freeway (d) Breakout (e) Space Invaders (f) Box-World (g) LavaCrossing (h) RTFM-KB
Figure 6: Training curves of CNN and relational models. The opaque lines represent the returns of individual runs, while the bolded lines represent the average of 5 runs. Red dashed lines mark the final performance of the best model reported by the original MinAtar baselines.
Model      Level 2   Level 1   Level 3          Portal
CNN        0.958     0.960     0.790 (-17.5%)   0.040 (-95.8%)
NLM-GTG    0.955     0.960     0.918 (-3.9%)    0.158 (-83.5%)
R-GCN-GTG  0.958     0.960     0.942 (-1.7%)    0.096 (-90.0%)
Table 1: Out-of-distribution generalization performance on LavaCrossing. The agent is trained on difficulty level 2.
This makes the task easier compared to the original RTFM-text task (last row), as models do not have to learn to encode information presented as textual inputs. Further, RTFM-onehop-KB is easier than RTFM-KB, as RTFM-KB requires multi-hop reasoning. We also include the performance of the model proposed in the RTFM paper [36], txt2π, on the RTFM-text environment in the table. We train on environments with a grid size of 6 × 6 and test on environments with a grid size of 10 × 10. An optimal policy in the 10 × 10 environments should achieve better performance compared to that on the smaller environments, as the agent has more space to evade monsters. We observe that NLM-GTG and R-GCN-GTG generalize well to the larger environments. However, R-GCN-GTG performs much better than NLM-GTG in the harder RTFM-KB environment, both in terms of in-distribution (6 × 6 grid) and out-of-distribution (10 × 10 grid) performance.

Model     room size 6 × 6   room size 10 × 10
txt2π     55%               55% (+0%)
Table 2: In-distribution and out-of-distribution generalization in RTFM variants. Figures report the win rate and the increment between 6 × 6 environments and 10 × 10 environments.

The Portal-LavaCrossing and RTFM experiments demonstrate the flexibility of GTG in incorporating different kinds of external knowledge. In Table 1, we can see that, with spatial information provided by the KB, the NLM-GTG and R-GCN-GTG agents managed to generalize to Portal-LavaCrossing in a zero-shot manner. The RTFM results in Table 2 show how GTG enables the relational model to jointly reason over both spatial information and environment dynamics information when represented as a KB. The R-GCN-GTG agent performs well in both the multi-hop and one-hop reasoning variants of RTFM-KB, but NLM-GTG only performs well in the easier one-hop variant.
Figure 7: Ablation of R-GCN-GTG on MinAtar tasks: (a) Seaquest, (b) Asterix, (c) Freeway, (d) Breakout, (e) Space Invaders.
Fig. 7 presents the results of an ablation study of the three relational inductive biases encoded using GTG (local, remote, and auxiliary relations). Each line shows the smoothed training curve averaged over five training runs. The red lines represent the training curve of R-GCN using the full set (local, remote, and auxiliary relations) of relational inductive biases. The green lines show the performance of R-GCNs using local directional relations only, whose convolution computation is equivalent to that of the image convolution in Section 4.2. A notable difference among these models is the information aggregation method between convolution layers and dense layers: the CNN model concatenates all output feature vectors (i.e., flattens them), while the R-GCN model applies a max-pooling layer. The flattened vector in the CNN model tends to have high dimensionality, thereby increasing the total number of parameters in the adjacent dense layer. Therefore, when constraining the number of parameters of the two architectures to be approximately equal, most of the parameters of the R-GCN model reside in convolution layers, whereas more parameters of the CNN model reside in the dense layers. This explains the performance difference between CNN and R-GCN-GTG with local directional relations only. In Seaquest and Breakout, R-GCNs yield better performance than CNNs, while in Space Invaders CNNs outperform R-GCN-GTG. The two models achieve similar performance in Asterix and Freeway.

To further investigate the role of max-pooling, we evaluated a wider CNN with the same number of parameters as the convolution layer in the local-only R-GCN-GTG model. This wider model flattens features before dense layers rather than applying max-pooling, resulting in 876k parameters (the full R-GCN-GTG model has 149k). We use this wider CNN model to isolate the performance improvement that results from max-pooling. Comparing CNN-wide and R-GCN-GTG local-only, we see mixed results: the two models achieve comparable performance on Seaquest, Asterix, and Freeway; local-only R-GCN-GTG performs better in Breakout, while the wider CNN model performs better in Space Invaders. This shows that max-pooling by itself does not outperform flattening if we do not care about the number of parameters.

Unsurprisingly, removing local directional relations undermines the performance of R-GCN-GTG in almost all of the tasks, which shows the importance of the relational inductive bias of locality. We also assessed the impact of remote and auxiliary relations, observing that using both sets of relations improves performance on Seaquest, Asterix, and Space Invaders. The benefits of introducing these additional relational inductive biases are robust in the sense that they do not degrade performance in any environment. In contrast, this is not true for max-pooling. These improvements demonstrate that we can go beyond the relational inductive bias of CNNs by using the relation determination rules of GTG, which provide a flexible framework for expressing many useful connectivity and parameter sharing constraints.
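The flatten-versus-max-pool parameter asymmetry can be made concrete with a back-of-the-envelope count. The dimensions below (a 10 × 10 grid of 64-channel features feeding a 128-unit dense layer) are hypothetical round numbers, not the exact ones used in our experiments:

```python
# Hypothetical dimensions: H x W grid, C channels per cell, D dense units.
H, W, C, D = 10, 10, 64, 128

# CNN-style aggregation: flatten all cells, then a dense layer.
flatten_params = (H * W * C) * D    # weight matrix of the adjacent dense layer

# R-GCN-GTG-style aggregation: max-pool over cells, then a dense layer.
maxpool_params = C * D

assert flatten_params == 819_200
assert maxpool_params == 8_192
# Flattening inflates the dense layer by a factor of H * W.
assert flatten_params // maxpool_params == H * W
```

Under an equal total budget, the flattened variant must spend most of its parameters in the dense layers, while the pooled variant can spend them in the relational convolution layers.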
This paper introduced Grid-to-Graph, a principled framework for representing relational inductive biases. GTG is based on a set of relation determination rules, which act on inputs in the form of a feature map corresponding to discrete 2D state observations. Using these relation determination rules, GTG transforms 2D observations into a multigraph input for an R-GCN model. The resulting architecture, R-GCN-GTG, outperforms both CNNs and Neural Logic Machines, the previous state of the art in deep relational RL, on MinAtar and a series of challenging procedurally-generated grid world environments, both in terms of in-distribution performance and out-of-distribution systematic generalization. Our results further show that GTG provides an effective and straightforward interface for incorporating various forms of external knowledge without any architectural modifications.
ACKNOWLEDGMENTS
This research was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement no. 875160. We thank Edward Grefenstette and the anonymous reviewers for their insightful feedback.
REFERENCES

[1] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261 (2018).
[2] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid.
[3] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. 2020. Leveraging Procedural Generation to Benchmark Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020) (Proceedings of Machine Learning Research, Vol. 119). PMLR, 2048–2056. http://proceedings.mlr.press/v119/cobbe20a.html
[4] Taco Cohen and Max Welling. 2016. Group Equivariant Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016) (JMLR Workshop and Conference Proceedings, Vol. 48). JMLR.org, 2990–2999. http://proceedings.mlr.press/v48/cohenc16.html
[5] Thomas Degris, Patrick M. Pilarski, and Richard S. Sutton. 2012. Model-free reinforcement learning with continuous action in practice. IEEE, 2177–2182.
[6] Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. 2019. Neural Logic Machines. In ICLR (Poster). OpenReview.net.
[7] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. 2018. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. Proceedings of Machine Learning Research 80 (2018), 1406–1415.
[8] Richard Evans and Edward Grefenstette. 2018. Learning Explanatory Rules from Noisy Data. Journal of Artificial Intelligence Research 61 (Jan 2018), 1–64. https://doi.org/10.1613/jair.5714
[9] Jesse Farebrother, Marlos C. Machado, and Michael Bowling. 2018. Generalization and Regularization in DQN. CoRR abs/1810.00123 (2018). http://arxiv.org/abs/1810.00123
[10] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Vol. 1. MIT Press, Cambridge.
[12] Matthew J. Hausknecht and Peter Stone. 2015. The Impact of Determinism on Learning Atari 2600 Games. In AAAI Workshop: Learning for General Competency in Video Games.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[14] Zhengyao Jiang and Shan Luo. 2019. Neural Logic Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97). PMLR, Long Beach, California, USA, 3110–3119. http://proceedings.mlr.press/v97/jiang19a.html
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[17] Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. 2020. My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control. In International Conference on Learning Representations (ICLR).
[18] Heinrich Küttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rocktäschel, and Edward Grefenstette. 2019. TorchBeast: A PyTorch Platform for Distributed RL. arXiv preprint arXiv:1910.03552 (2019). https://github.com/facebookresearch/torchbeast
[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[21] IEEE Trans. Signal Process. 56, 8-1 (2008), 3572–3585. https://doi.org/10.1109/TSP.2008.925261
[22] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
[23] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (2018), 1140–1144.
[24] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In NIPS. 2440–2448.
[25] Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
[26] T. Tieleman and G. Hinton. 2012. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
[27] Elise van der Pol, Daniel E. Worrall, Herke van Hoof, Frans A. Oliehoek, and Max Welling. 2020. MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/hash/2be5f9c2e3620eb73c2972d7552b6cb5-Abstract.html
[28] Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. 2018. NerveNet: Learning Structured Policy with Graph Neural Networks. In ICLR (Poster). OpenReview.net.
[29] Kenny Young and Tian Tian. 2019. MinAtar: An Atari-inspired Testbed for More Efficient Reinforcement Learning Experiments. arXiv preprint arXiv:1903.03176 (2019).
[30] Kenny Young and Tian Tian. 2019. MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176 (2019).
[31] Fisher Yu and Vladlen Koltun. 2016. Multi-Scale Context Aggregation by Dilated Convolutions. (2016).
[32] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan Salakhutdinov, and Alexander J. Smola. 2017. Deep Sets. In NIPS. 3391–3401.
[33] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. 2018. Relational Deep Reinforcement Learning. arXiv preprint abs/1806.01830 (June 2018).
[34] Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter W. Battaglia. 2019. Deep reinforcement learning with relational inductive biases. In ICLR (Poster). OpenReview.net.
[35] Chiyuan Zhang, Oriol Vinyals, Rémi Munos, and Samy Bengio. 2018. A Study on Overfitting in Deep Reinforcement Learning. CoRR abs/1804.06893 (2018). http://arxiv.org/abs/1804.06893
[36] Victor Zhong, Tim Rocktäschel, and Edward Grefenstette. 2020. RTFM: Generalising to New Environment Dynamics via Reading. (2020).
[37] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2018. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434 (2018).
A APPENDIX

A.1 Hyperparameters
All models used in our experiments, including relational models and the baseline CNN, have 4 hidden layers. We train all models using RMSprop [26] with a learning rate of 0.001.

In the RL setting, an agent can, in principle, produce an infinite amount of training data given enough computational power and time, implying that increasing the model size can almost always be helpful for in-distribution performance. For a fair comparison, we aimed to ensure similar parameter counts for all relational models tested. Taking LavaCrossing as an example, the CNN model contains 149k total parameters; R-GCN, 149k; and NLM, 139k. We also evaluated even larger CNN models but did not observe significant improvements to in-distribution performance. We only tested NLM with a maximum arity of 2, because of the large computational cost of higher arities. Each layer and each arity had an output dimension of 64, resulting in 192 intensional predicates per layer. R-GCN models used 2 relational convolution layers, each outputting 64-dimensional feature vectors. After max-pooling, we then apply 2 dense layers, each with 128 hidden units. CNN models used 2 convolution layers, each with a 3 × 3 kernel.

A.2 Relational Inductive Bias and Generalization
Prior work [1] defines relational inductive bias as constraints on relationships and interactions among entities in a learning process. Usually, the relational inductive bias is implemented by a specific pattern of parameter sharing and neural connectivity in the neural architecture. For example, CNNs exploit local connections, constraining information processing to a limited, local receptive field, and share parameters between different local kernels to introduce spatial translation invariance. Such relational inductive biases have been argued to be critical for promoting combinatorial generalization [1].

Combinatorial generalization is the process of exploiting compositional structure underlying a problem to successfully perform inference, prediction, or useful behaviours on previously unseen examples or scenarios [1]. This concept is closely related to systematic generalization, a defining capability of human intelligence, which modern deep learning methods have yet to attain. Combinatorial generalization can help the model improve sample efficiency and generalize to new tasks. In practice, it is important to disentangle combinatorial generalization from memorization. One way to explicitly test for combinatorial generalization is to test the model on out-of-distribution held-out data (sampled from a different distribution than the training data). This data should be generated by a process mirroring the compositional rules governing the generation of the training data.

Unlike in-distribution generalization, which has been well studied by learning theory and statistics, out-of-distribution generalization cannot, in general, be achieved by simply increasing the amount of data. However, successful combinatorial generalization would allow out-of-distribution generalization on held-out data that follows similar composition rules as the training data. In the RL setting, given a fixed policy π, the sampled data consists of trajectories of states, actions, and rewards.
We call an RL environment out-of-distribution if, at test time, either the initial state, the dynamics, or the task distribution in multi-task learning is varied such that the distribution of trajectories differs from that at training time. We say an environment or collection of environments (with some appropriate sampling distribution over these environments) is in-distribution if it generates trajectories with the same likelihood as under the training distribution. For example, consider the case in which the initial state of an environment is procedurally generated and, during testing, we keep all the level generation logic the same, only using different random seeds. Though the agent may see initial states it never saw in training, we would not call these test-time configurations out-of-distribution, as the probability that the agent encounters these initial states remains the same at test time as at training time.

A.3 Matrix Representation of Message Passing
Here we provide a block matrix formalization of a class of neural network layers, shedding light on how different neural architectures can be represented by their respective relational inductive biases. Suppose we have $n$ $m$-dimensional input feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^m$, and want to map them to $n$ output feature vectors $\mathbf{y}_1, \ldots, \mathbf{y}_n \in \mathbb{R}^m$ of the same dimensionality. If we consider the case where this mapping is linear, we can concatenate all the input feature vectors $\mathbf{x}_i$ and express the transformation as a block matrix product:

$$\begin{pmatrix} \mathbf{A}_{11} & \cdots & \mathbf{A}_{1b} & \cdots & \mathbf{A}_{1n} \\ \vdots & \ddots & & & \vdots \\ \mathbf{A}_{a1} & \cdots & \mathbf{A}_{ab} & \cdots & \mathbf{A}_{an} \\ \vdots & & & \ddots & \vdots \\ \mathbf{A}_{n1} & \cdots & \mathbf{A}_{nb} & \cdots & \mathbf{A}_{nn} \end{pmatrix} \begin{pmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_b \\ \vdots \\ \mathbf{x}_n \end{pmatrix} = \begin{pmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_a \\ \vdots \\ \mathbf{y}_n \end{pmatrix}$$

We see that the update rule for $\mathbf{y}_a$ is:

$$\mathbf{y}_a = \sum_{b=1}^{n} \mathbf{A}_{ab} \mathbf{x}_b \quad (5)$$

The submatrix $\mathbf{A}_{ab} \in \mathbb{R}^{m \times m}$, which we refer to as the message passing matrix, dictates the message passing from entity $b$ to entity $a$. When no constraints are applied to $\mathbf{A}$, the overall mapping corresponds to a standard dense linear layer. In the following sections, we refer to the constraints over the message passing matrices $\mathbf{A}_{ab}$ as the relational inductive bias. This formalization provides a more concrete definition of relational inductive bias than the conceptual one proposed by Battaglia et al. [1], i.e., constraints on relationships and interactions among entities in a learning process.

Neural architectures typically encode two types of constraints: sparse connectivity and parameter sharing. Sparse connectivity can be achieved by setting some of the $\mathbf{A}_{ab}$ components to zero. For example, $\mathbf{A}_{ab} = \mathbf{0}$ means no message can be passed from entity $b$ to entity $a$. Parameter sharing simply sets some of the message passing matrices to correspond to the same matrix.
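To make the constrained block-matrix view concrete, the sketch below checks numerically that a linear 2D convolution (locality plus translation equivariance, i.e., one shared weight matrix per relative offset) computes the same map as per-entity message passing of the form in Eq. (5). The grid size, channel counts, and zero-padding convention are assumptions of this sketch:

```python
import numpy as np

def conv2d_linear(x, K):
    # Plain linear 3x3 convolution with zero padding; K: (3, 3, m_out, m_in).
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    y = np.zeros((H, W, K.shape[2]))
    for i in range(H):
        for j in range(W):
            # y[i, j] = sum_{u,v} K[u, v] @ xp[i + u, j + v]
            y[i, j] = np.einsum('uvoc,uvc->o', K, xp[i:i + 3, j:j + 3])
    return y

def message_passing(x, K):
    # The same map written as y_a = sum_b A_ab x_b, with one shared weight
    # matrix W_r = K[u, v] per relative offset r = (u, v) and all
    # normalization constants set to 1.
    H, W, m = x.shape
    feats = x.reshape(H * W, m)
    y = np.zeros((H * W, K.shape[2]))
    for i in range(H):
        for j in range(W):
            a = i * W + j                            # target entity
            for u in range(3):
                for v in range(3):
                    bi, bj = i + u - 1, j + v - 1    # source entity
                    if 0 <= bi < H and 0 <= bj < W:
                        y[a] += K[u, v] @ feats[bi * W + bj]
    return y.reshape(H, W, K.shape[2])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 2))        # 4x5 grid, 2 channels per cell
K = rng.normal(size=(3, 3, 3, 2))     # 3x3 kernel mapping 2 -> 3 channels
assert np.allclose(conv2d_linear(x, K), message_passing(x, K))
```

The only free choice is which offsets carry a nonzero weight matrix and which matrices are shared; relaxing those two constraints recovers the unconstrained dense layer.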
Specific patterns of connectivity and parameter sharing encode different relational inductive biases, ensuring specific, desirable properties of the computation, represented by message passing operations. For instance, the locality of a convolutional layer can be accomplished by keeping connectivity from $b$ to $a$ only if entity $b$ is in the receptive field of entity $a$. Leveraging [32, Lemma 3], we can also implement permutation equivariance (the inductive bias of Deep Sets [32]) by sharing parameters across the diagonal submatrices and sharing parameters among the off-diagonal submatrices.

The relational graph of R-GCN provides a natural way to describe arbitrary connectivity and parameter sharing constraints. By rearranging terms and applying the distributive law, we can express the message passing matrix as:

$$\mathbf{A}_{ab} = \sum_{r \in \mathcal{R}_{ba}} c_{a,r} \mathbf{W}_r, \quad (6)$$

where $\mathcal{R}_{ba}$ is the set of all relation labels for relations from $b$ to $a$, and $c_{a,r}$, $\mathbf{W}_r$ are the normalization constant and weight matrix of R-GCN in Eq. (2).

We now provide an example demonstrating the equivalence of R-GCN with local directional relations and 2D convolution: the update rule for a single feature vector of a linear convolution layer can be written as $\mathbf{y}_a = \sum_{b \in K_a} \mathbf{A}_{ab} \mathbf{x}_b$, where $K_a$ is the set of entities in the receptive field around $a$, and $\mathbf{A}_{ab}$ is the same for all $a$ and $b$ with the same relative position to each other. Given Eq. (6), the equivalence between R-GCN with local directional encoding and image convolution becomes clear. The only difference is that R-GCN introduces an extra normalization factor, which is a constant in this case, as most of the nodes have the same number of incoming edges.

A.4 Relational Inductive Biases of Convolutional Layers
We now consider the case where entities are associated with a 2D feature map and analyze the relational inductive biases of convolutional layers. For all tasks we consider, each environment observation is (or contains) a feature map, defined as a mapping $m : W \times H \to P$ with $W = \{1, \ldots, w\}$, $H = \{1, \ldots, h\}$, and $P = \mathbb{R}^m$, where $w$ and $h$ are the width and height of the input, and $m$ is the dimensionality of the feature vector. Each feature vector corresponds to an entity and, therefore, a node in the relation graph. When dealing with a feature map, a single 2D convolution layer has the following relational inductive biases: i) Locality: information is only transmitted from nodes in a kernel-sized receptive field to the central node. Formally, $\mathbf{A}_{ab}$ is not a zero matrix only if $b$ is in the receptive field around $a$. ii) Anisotropy: the propagation of information in different directions follows different rules. Using our block matrix formalization: in a receptive field around entity $a$, the value of $\mathbf{A}_{ab}$ for each distinct $b$ is distinct. iii) Spatial translation equivariance: node pairs with the same relative position share the same message passing rules. Namely, $\mathbf{A}_{ab}$ and $\mathbf{A}_{cd}$ should be the same if the relative position between $a, b$ and that between $c, d$ are the same.

Locality is a useful relational inductive bias for most environments where feature maps abstract (or are sampled from) the physical world, because objects that are close to each other are more likely to inform or interact with each other. For example, in the grid world tasks we considered, most interactions happen locally, and even remote interactions typically rely on local, intermediary objects (e.g., an intermediary object thrown by entity $a$ to a remote entity $b$).
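Of the three biases, translation equivariance is straightforward to verify numerically: shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below uses a circular 3 × 3 convolution, an assumption made to sidestep border effects (with zero padding the property holds only away from the borders):

```python
import numpy as np

def circ_conv(x, k):
    # Circular (wrap-around) 3x3 cross-correlation of a single-channel map:
    # y[i, j] = sum_{u,v} k[u, v] * x[(i + u - 1) mod H, (j + v - 1) mod W]
    y = np.zeros_like(x)
    for u in range(3):
        for v in range(3):
            y += k[u, v] * np.roll(x, shift=(1 - u, 1 - v), axis=(0, 1))
    return y

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 6))
k = rng.normal(size=(3, 3))

shifted_then_conv = circ_conv(np.roll(x, (2, 3), axis=(0, 1)), k)
conv_then_shifted = np.roll(circ_conv(x, k), (2, 3), axis=(0, 1))
assert np.allclose(shifted_then_conv, conv_then_shifted)
```

In the block-matrix picture, the same check amounts to the constraint that every pair of cells at the same relative offset shares the same message passing matrix.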
Note that spatially adjacent objects may be further abstracted as a single larger entity.

Anisotropy can be used to judge directionality, which is useful in the RL environments considered in this work, as actions are bound to directions, e.g., "move left" and "move right". We emphasize this inductive bias to highlight why relations should be directional in GTG. Note that MLPs are anisotropic by default.

Translation equivariance is an important inductive bias which captures the intuition that interactions between two entities likely depend on the relative positions between these two entities rather than their absolute positions. Translation equivariance relates to an assumption commonly made by feature engineering in image processing, where the rules for extracting features should be independent of the feature position. Here, each filter corresponds to a high-level feature vector.

A.5 Computational Costs
The computational cost of our model is proportional to the number of ground atoms (i.e., edges in the KB and thus in the computational graph) induced by the underlying relations. Let $N$ be the number of entities (e.g., pixels). Our model's time complexity ranges from $O(N)$ for convolutional layers to $O(N^2)$ for fully-connected graphs, e.g., Transformer layers. For GTG's relation determination rules, the time complexity of local directional relations is $O(N)$, that of alignment relations (defined in Section 4.2.3) is $O(N^{3/2})$, and that of remote relations is $O(N^2)$.
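These growth rates can be checked by counting edges directly. The sketch below works on an s × s grid (so N = s²), under the assumptions that local directional relations link each cell to its 8 in-bounds neighbours and that alignment relations link cells sharing a row or a column; the exact rule set in the paper may differ:

```python
# Illustrative edge counts on an s x s grid with N = s * s cells.
def local_edges(s):
    # Directed edges to the 8 in-bounds neighbours of every cell: O(N).
    dirs = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)]
    return sum(1 for i in range(s) for j in range(s)
               for dy, dx in dirs if 0 <= i + dy < s and 0 <= j + dx < s)

def alignment_edges(s):
    # Each cell aligns with (s - 1) cells in its row and (s - 1) in its
    # column: 2 * N * (sqrt(N) - 1) directed edges, i.e. O(N^{3/2}).
    return sum(2 * (s - 1) for _ in range(s * s))

for s in (4, 8, 16):
    N = s * s
    assert local_edges(s) <= 8 * N                  # O(N)
    assert alignment_edges(s) == 2 * N * (s - 1)    # O(N^{3/2})
    assert N * (N - 1) == N * N - N                 # fully connected: O(N^2)
```

The edge count directly bounds the number of messages an R-GCN layer must compute, so sparser relation classes translate into proportionally cheaper layers.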