Efficient and Interpretable Robot Manipulation with Graph Neural Networks
Yixin Lin, Austin S. Wang, and Akshara Rai

Abstract — Many manipulation tasks can be naturally cast as a sequence of spatial relationships and constraints between objects. We aim to discover and scale these task-specific spatial relationships by representing manipulation tasks as operations over graphs. To do this, we pose manipulating a large, variable number of objects as a probabilistic classification problem over actions, objects and goals, learned using graph neural networks (GNNs). Our formulation first transforms the environment into a graph representation, then applies a trained GNN policy to predict which object to manipulate towards which goal state. Our GNN policies are trained using very few expert demonstrations on simple tasks, exhibit generalization over the number and configurations of objects in the environment and even to new, more complex tasks, and provide interpretable explanations for their decision-making. We present experiments which show that a single learned GNN policy can solve a variety of blockstacking tasks in both simulation and on real hardware.
I. INTRODUCTION
Everyday manipulation tasks deal with complex relationships and constraints between objects and environments. For example, a simple task like pouring water assumes pre-conditions, like an open bottle and proximity of the bottle to a cup. These relationships can be spatial, such as the location of the bottle should be close to the cup, or temporal, such as the bottle should be open before the pouring begins. Specifying pre-conditions and constraints to complete a particular manipulation task can be tedious and error-prone. Discovering and satisfying task-specific relationships in a generalizable and scalable manner are significant roadblocks for research in long-term task planning in manipulation.

In this work, we present a graphical representation of manipulation tasks that naturally encompasses a large range of manipulation tasks, enables learning the temporal and spatial relationships between the objects involved, and can be used for sim-to-real transfer of high-level manipulation strategies. Similar to [1] and [2], we transform the task scene into a graph whose nodes represent the task-relevant entities (objects, robot hand, goals). Unlike [1], we represent our policy as a graph neural network (GNN) operating on this graph, and learn its parameters from data, enabling zero-shot generalization to scenarios with a different number of objects. Specifically, we train two GNNs – one that selects the most relevant object in the scene, and another that selects a suitable goal state for the selected object. Next, the robot executes an action that transforms the selected object from its current state to the goal state.

The authors are with Facebook AI Research, Menlo Park, CA. yixinlin,wangaustin,[email protected]
Fig. 1: We train a policy on small instances of the problem (top row: (a) 3-block stack, (b) 4-block stack, (c) 5-block stack) and test generalization on larger problem instances ((d) 9-block stack), new tasks ((e) 6-block pyramid), and on real hardware ((f) 2 stacks of 3 blocks).

We use imitation learning to train the GNN policies, which learn from demonstrations to encode the relationships between objects, goals and the environment. Once trained, the GNN can be used to solve tasks of significantly increased complexity, as long as the learned spatial relations and constraints from the simpler tasks hold on the complex tasks. For example, for block stacking, we train the GNN on stacking 3, 4, 5 blocks and study zero-shot generalization at stacking 6 to 9 blocks.

At its core, our approach depends on a hierarchical decomposition of long-term manipulation tasks – a common paradigm in the task and motion planning literature [3]. The GNN policy is essentially a high-level policy, operating on a space of actions, objects and goals. Our framework assumes that the task decomposition can reproduce the expert demonstrations well. This hierarchical setup has several advantages: (1) It allows the training to be very sample-efficient; our GNN policy can train from as few as 15 expert demonstrations, while generalizing to tasks beyond the training samples. (2) Minimizing the supervised learning loss on expert demonstrations can be used to directly infer the high-level constraints and spatial relationships of the task. (3) The hierarchical decomposition enables transfer of learned policies across robots. Since the high-level constraints of the task do not change between different robots, the learned high-level GNN policy also transfers to new systems with the same high-level action space. In our experiments, we learn a high-level block stacking policy in simulation on a Kuka iiwa robot in PyBullet [4], and generalize it to a real Franka Panda robot arm.

The main contributions of this work are presenting (1) GNNs as a promising policy architecture for long-term manipulation tasks in a hierarchical setting, (2) imitation learning as a well-suited training scheme for such a policy choice in this setting, and (3) a modified GNNExplainer [5] to interpret the decisions made by our learned policy. Our experiments are designed to highlight two features of our approach – the generalization abilities of GNNs to large numbers of objects in the environment, and zero-shot generalization to new, unseen tasks. We conduct experiments in a blockstacking environment, where the main task at hand is to pick blocks from different initial configurations and place them in different goal configurations. The initial configurations are blocks in stacks, while the goal configurations include stacks of 2-9 blocks, pyramids and multiple stacks (Figure 1). We train a single high-level GNN policy that can achieve all these variations of this task, starting from a dataset of 150 expert trajectories stacking 3-5 blocks. We also compare our approach against reinforcement learning (RL) with both feedforward and GNN policies. Our comparisons further reinforce the generalizability of the GNN architecture over traditional feed-forward policies, as well as the sample-efficiency and robustness of imitation learning from experts over RL in long-horizon tasks.

II. RELATED WORK
A. Graphical approaches to manipulation
Graphical representations of scenes have been used for learning high-dimensional dynamics models [6], [7] using visual data, learning object-relevance in problems with large numbers of object instances [8], visual imitation learning [1], [9], and high-level policies that take the graphical representation of state as input [2]. Graph neural networks (GNNs) [10] are effective mechanisms for learning the relational inductive biases present in graph datasets. [8] train a GNN to predict if a particular object in a scene is relevant to the planning problem at hand. Using an iterative approach over the number of allowed objects, they are able to solve planning problems with a large number of objects in the scene in much less time than solving the full problem requires. [2] train a GNN policy using RL to solve block stacking tasks and show generalization of the trained policy to unseen block configurations without any additional training. We use a GNN policy for blockstacking, similar to [2], but in a hierarchical setting – instead of directly predicting low-level actions, we decompose the task into a high-level GNN policy and low-level motion primitives. This hierarchical setting allows us to use imitation learning on a very small number of expert trajectories, instead of running sample-inefficient deep RL as in [2], and still generalize to a wide range of unseen tasks. Our policy acts at every high-level timestep of the task, in contrast to [8], which only outputs a fixed plan at the beginning of the task rather than a reactive policy that can recover from unexpected perturbations.
B. Task and motion planning
Task and motion planning (TAMP) is a common approach for solving long-horizon manipulation tasks. It brings together discrete symbolic task planning over a set of discrete states and actions with continuous-space motion planning over a robot's configuration space. We refer readers to [11] for an overview of classical TAMP in manipulation. However, traditional TAMP algorithms rely on pre-defined symbolic rules, or planning domains, which determine the validity of a state, or of a particular action in a state, as well as transition models that are used by symbolic planners [12], [13], [14], [15]. In contrast, reinforcement learning (RL) approaches can be domain independent, solving complex manipulation problems in an end-to-end fashion [16], [17], though they are limited to short-horizon tasks. As a result, there has been a lot of interest in bringing together TAMP and RL approaches for scalable long-horizon manipulation.

One area where learning has been especially successful in TAMP problems is at speeding up the planning step, given the symbolic decomposition and transition models of a task. This can be done by limiting search to feasible plans [18], [19], or by learning heuristics that guide the search towards high-performing plans [20]. There has also been work on learning the transition models over symbolic states and actions [21], [22], which eliminates the need for hand-crafted transition tables, which can be error-prone and hard to scale. Our work draws inspiration from both these directions; however, we take a different approach – instead of learning transition dynamics, or heuristics for planners, we use expert demonstrations of a task to directly learn a policy which inherently learns both aspects. Specifically, we assume a discrete state and action representation of the task, and use imitation learning to train a high-level policy that operates on the pre-defined decomposition to achieve new, unseen tasks. Our policy implicitly learns about the feasibility domain (e.g. only picking the top block in a stack) while generalizing to solve unseen tasks (e.g. stacking multiple towers).

III. BACKGROUND
A. Reinforcement learning and imitation learning
We consider a Markov Decision Process (MDP) with a continuous state space $S$ and a high-level discrete action space $A$. Starting from state $s_t$, executing high-level action $a_t$ incurs a reward $r_t$ and leads to state $s_{t+1} \sim p(s_{t+1} | s_t, a_t)$ following the transition probability distribution $p$. Given this problem setup, we aim to learn a policy $\pi_\theta(s_t) = a_t$ that maximizes the expected return $V^\pi(s_t) = \mathbb{E}_{s_t \sim p}[\sum_t^T r_t | s_t, a_t]$ of a trajectory of length $T$. Reinforcement learning optimizes $\theta$ by interacting with the environment and learning from experience. Another approach is to approximate the policy parameters $\theta$ from expert demonstrations using imitation learning. Given an expert dataset of $N$ trajectories $D = \{\tau_i\}_{i=1}^N$, $\tau_i = \{s_{i,1}, a^{exp}_{i,1}, s_{i,2}, a^{exp}_{i,2}, \ldots, s_{i,T}, a^{exp}_{i,T}\}$, we minimize the supervised learning loss $\min_\theta \mathbb{E}[\sum_{i=1}^N \sum_{t=1}^T \| a^{exp}_{i,t} - a^{pred}_t \|]$, where $a^{pred}_t = \pi_\theta(s_{i,t})$. Typically, imitation learned policies do not generalize well outside of the training distribution, and need iterative learning, such as in DAgger [23]. However, we present results where structured graphical state representations and learned relational inductive biases generalize outside of the training distribution.

B. Graph Neural Networks
Graph neural networks (GNNs) [10] are deep networks designed to operate on graphs. Let $G$ be a graph with nodes $V$ and undirected edges $E$, where each node $v \in V$ is associated with a $d$-dimensional feature vector $\phi(v)$. A single message-passing GNN layer applies a message-passing function on every node, which updates its feature as a function of its own and its neighbors' features; a GNN model commonly stacks multiple layers. At each layer $l$ and for every node $v_i \in V$, we update the node's feature vector $h_i^l = f_\theta^l(h_i^{l-1}, \{h_j^{l-1}\}_{j \in \mathcal{N}_i})$, where $h_i^l$ is the updated node representation and $h_i^0 = \phi(v_i)$. $f_\theta$ is a parametrized function whose weights $\theta$ are learned using gradient descent during training. The same function $f$ and its parameters $\theta$ are shared across all nodes, which means that once the parameters $\theta$ are learned, the GNN can be applied to a new graph with any number of nodes. GNNs are highly parallelizable and efficient to compute; we use PyTorch Geometric [24], [25] for all our computations.

Different GNN architectures make different choices of $f_\theta$ that induce different inductive biases on the problem at hand. We experiment with four different kinds of GNN architectures:

Graph Convolution Networks (GCN):
GCNs [26] are isotropic graph networks where each neighbour's contribution is weighted by the edge weight of the connecting edge: $h_i^l = \sigma(\theta_1 h_i^{l-1} + \theta_2 \sum_{j \in \mathcal{N}(i)} e_{j,i} \cdot h_j^{l-1})$. $\theta_1$ and $\theta_2$ constitute the learnable parameters, and $\sigma$ is the activation function, such as the ReLU activation.

GraphSage (Sage):
GraphSage [27] is also an isotropic network like GCNs; it takes the mean of the features of its neighbors without taking edge weights into account: $h_i^l = \sigma(\theta_1 h_i^{l-1} + \frac{\theta_2}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} h_j^{l-1})$.

GatedGCN (Gated):
GatedGCN [28] is an anisotropic graph convolution network, where the weights on the neighbors are learned using a Gated Recurrent Unit (GRU): $h_i^l = \mathrm{GRU}(h_i^{l-1}, \sum_{j \in \mathcal{N}(i)} \theta h_j^{l-1})$.

Graph Attention Networks (Attention):
Graph attention networks [29] are also anisotropic graph convolution networks; they learn relative weights between neighbors using an attention mechanism: $h_i^l = \sigma(\theta_1 h_i^{l-1} + \sum_{j \in \mathcal{N}(i)} a_{i,j} \theta_2 h_j^{l-1})$, where a learned self-attention weight $a_{i,j}$ measures the connection between nodes $v_i$ and $v_j$.

Using the above GNN architectures, we study whether adding complexity such as memory or attention mechanisms helps when dealing with manipulation tasks.
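For concreteness, the following is a minimal sketch of how these four layer types could be instantiated in PyTorch Geometric [24], which we use for all our computations. The class name, constructor arguments, and the use of `GatedGraphConv` as a simplified stand-in for the GatedGCN layer are our illustrative choices, not the authors' released code; the 3-layer, 64-unit, ReLU configuration follows the experimental setup in Section V.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGEConv, GATConv, GatedGraphConv

class GNNPolicy(torch.nn.Module):
    """Stacks three message-passing layers and outputs one logit per node."""
    def __init__(self, in_dim=5, hidden=64, arch="sage"):
        super().__init__()
        def make_layer(i, o):
            if arch == "gcn":
                return GCNConv(i, o)           # isotropic, edge-weighted sum
            if arch == "sage":
                return SAGEConv(i, o)          # isotropic, mean over neighbors
            if arch == "attention":
                return GATConv(i, o, heads=1)  # anisotropic, learned attention
            return GatedGraphConv(o, num_layers=1)  # anisotropic, GRU update
        dims = [in_dim, hidden, hidden]
        self.layers = torch.nn.ModuleList(
            [make_layer(dims[l], hidden) for l in range(3)])
        self.readout = torch.nn.Linear(hidden, 1)  # per-node score

    def forward(self, x, edge_index):
        for conv in self.layers:
            x = F.relu(conv(x, edge_index))
        return self.readout(x).squeeze(-1)  # shape: (num_nodes,)
```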
IV. GNN POLICIES FOR MANIPULATION

In this section, we explain our formulation, which casts manipulation tasks as operations over a graph. We assume the existence of a low-level PickAndPlace primitive which, given an object and a goal, grasps the chosen object and places it in the desired goal. We then train a high-level GNN policy that takes a graph representation of the environment state as input and selects the input to PickAndPlace. Our approach is outlined in Figure 2 and in Algorithm 1. The implementation of the low-level PickAndPlace primitive is robot-specific, and we do not address the design of low-level primitives here. Instead, we focus on learning policies that can provide the correct high-level input to these primitives to accomplish a long-horizon manipulation task.

Fig. 2: An overview of our approach. We train a high-level GNN policy that takes a graph representation of state as input and selects the next object to pick, and the next goal to place it in. A low-level PickAndPlace primitive then picks the chosen object and places it in the desired goal. Summary in Algorithm 1.

Algorithm 1: Long-horizon manipulation with GNNs
  Given a graph dataset D of N expert demonstrations
  Randomly initialize a GNN policy π_θ
  for each gradient step do
      Update θ* = argmin_θ L(θ, D), where L is the cross-entropy loss in Eq. 2
  for each time step do
      Create graph G from the environment state
      Choose o, g = π_θ(G)
      Execute PickAndPlace(o → g)
A. Problem formulation: Graphical representation of state
We encode the environment scene as a graph whose nodes consist of the task-relevant entities, such as objects and goals. Let there be $K$ objects and $K$ goals in the scene. We create a graph $G = (V, E)$, where the vertices $V = \{v_k^o\}_{k=1}^K \cup \{v_k^g\}_{k=1}^K$ represent the objects and goals in the scene, giving us a total of $2K$ nodes. We create a dense, fully-connected graph, where all nodes are connected to all other nodes: $E = \{e_{i,j}\}$ for $i, j = 1, \ldots, 2K$.

Each node $v \in V$ in the graph has a feature vector $\phi(v)$, which contains node-specific information. An initial feature list is provided as input to the GNN; in our setting, the input features of each node are 5-dimensional: a categorical feature $\{0, 1\}$ denoting if a node is an object or a goal, the 3-dimensional position of the object or goal in the frame of the robot, and a binary feature which is 1 if a goal is filled or an object is in a goal, and 0 for empty goals or objects.

For $K$ objects, we allow a robot trajectory of $K$ high-level steps. The current state graph $G_k$ is input to the GNN policy, which outputs a graph of the same topology. The GNN outputs hidden features for every node, and these are sorted into object and goal nodes, then passed through a softmax function to generate two categorical probability distributions: one over blocks and one over goals. The highest-probability object and goal are selected as inputs to the low-level PickAndPlace. Our graph representation for a $K = 3$ block stacking trajectory is illustrated in Figure 3.

Fig. 3: Overview of our algorithm at a timestep. Our method takes in an observation, transforms it into a graph with a 5-dimensional feature per node, and passes it to the GNN policy, which selects an object and goal to input to PickAndPlace.
In this work, we deal with planning problems with an underlying structure – for example, pick the highest block from a stack, and place it in the lowest free goal. We use expert demonstrations to train a GNN policy which learns this underlying structure, in contrast to traditional TAMP, where such constraints are assumed to be pre-defined. Once this structure is learned, the policy automatically generalizes to new unseen problems, as long as the underlying structure holds.
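The state-to-graph encoding above can be made concrete with a short sketch. This is an illustrative reconstruction, not the authors' code: the function name and argument layout are assumptions, but the 5-dimensional node features (object/goal indicator, 3-D position, filled flag) and the dense connectivity over $2K$ nodes follow the description above.

```python
import itertools
import torch
from torch_geometric.data import Data

def build_state_graph(object_pos, goal_pos, object_in_goal, goal_filled):
    """object_pos, goal_pos: (K, 3) positions in the robot frame.
    object_in_goal, goal_filled: (K,) binary flags."""
    K = object_pos.shape[0]
    # 5-d feature per node: [is_goal, x, y, z, filled / in-goal flag]
    obj_feats = torch.cat(
        [torch.zeros(K, 1), object_pos, object_in_goal.float().view(K, 1)], dim=1)
    goal_feats = torch.cat(
        [torch.ones(K, 1), goal_pos, goal_filled.float().view(K, 1)], dim=1)
    x = torch.cat([obj_feats, goal_feats], dim=0)  # (2K, 5): objects, then goals
    # Dense graph: every node connected to every other node.
    edges = [(i, j) for i, j in itertools.product(range(2 * K), repeat=2) if i != j]
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index)
```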
B. Training the GNN from demonstrations
Intuitively, we are posing a long-horizon manipulation problem as a classification problem at each high-level step, where a decision is made over which object to move to where using what action. Assuming a fixed decomposition of a task in terms of objects, goals, and actions, the task of learning manipulation skills from expert demonstrations is a probabilistic classification problem.

In detail: the output of the GNN policy is $2K$-dimensional, corresponding to the object and goal nodes of the original graph. This is reshaped into two $K$-dimensional outputs $V^{out}_g = \{v_k^g\}_{k=1}^K$ and $V^{out}_o = \{v_k^o\}_{k=1}^K$. $V^{out}_o$ is then passed through a softmax function to generate a $K$-dimensional categorical distribution $P^o_{pred} = \{p^o_1, p^o_2, \cdots, p^o_K\}$ over the probability of each object being picked. The object with the highest predicted probability is the output of the GNN policy:

$o^* = \arg\max_j p(o_j)$, where $p(o_j) = \frac{\exp(v^o_j)}{\sum_{k=1}^K \exp(v^o_k)}$  (1)

The same transformation is applied to the goals, resulting in a probability distribution $P^g_{pred} = \{p^g_1, p^g_2, \cdots, p^g_K\}$ over the goals, and the goal with the highest probability is chosen as the next desired goal. Given target distributions $P^o_{tgt}$ for the objects and $P^g_{tgt}$ for the goals from expert data, the GNN policy parameters $\theta$ are trained to minimize the cross-entropy loss:

$\arg\min_\theta \big[ -\sum_{k=1}^K \big( [P^o_{tgt}]_k \log(p^o_k) + [P^g_{tgt}]_k \log(p^g_k) \big) \big]$  (2)

Equation 2 is the standard cross-entropy loss used in classification problems.

The expert demonstrations used for training the GNN policy are also cast as graphs, with target output distributions coming from the expert action. Given an expert policy $\pi^{exp}$ which predicts the next object $o^{exp}$ to be moved to the next goal $g^{exp}$ for the current state, we collect $N$ demonstrations of the expert solving the task. For $K$ objects in the environment, each expert demonstration consists of $T = K$ steps. At each step $t$, we extract input-output pairs $\{(s_t = (o_{1:K}, g_{1:K}), a_t)\}$, where $o_i$ and $g_i$ are the objects and goals in the scene, and $a_t = \{o^{exp}_t, g^{exp}_t\}$ is the action taken by the expert at step $t$. Note that the training dataset at each step collects information about all the objects and goals in the scene, along with the goal and object chosen by the expert policy at that step. This generates a training dataset $D$ of $N$ expert demonstrations solving multiple tasks:

$D = \{\tau_n\}_{n=1}^N$, $\tau_n = \{s_0, a_0, s_1, a_1, \ldots, s_T\}$  (3)

For training the GNN policy, we sample a batch of state-action pairs from the dataset $D$ and convert each sampled state $s_b = (o_{1:K}, g_{1:K})$ into a graph $G_b = (V_b, E_b)$, as described in Section IV-A. The object and goal states stored in $o_i$ and $g_i$ are used to create the graph nodes $V_b$, corresponding node features $\phi_b$, and edges $E_b$. The expert action $a_b = \{o^{exp}_b, g^{exp}_b\}$ is converted into two $K$-dimensional target distributions $P^o_{tgt}$ and $P^g_{tgt}$ for object and goal prediction, respectively. $P^o_{tgt} = \mathbb{1}[o_k = o^{exp}_b]$ is a one-hot vector which is 1 for the object chosen by the expert, and 0 for all others. Similarly, $P^g_{tgt} = \mathbb{1}[g_k = g^{exp}_b]$ is a one-hot vector which is 1 for the goal chosen by the expert, and 0 for all others.
The parameters $\theta$ of the GNN are updated to minimize the cross-entropy loss in Equation 2 between the distributions predicted by the GNN policy given $G_b$ as input, and the target distributions $P^o_{tgt}$ and $P^g_{tgt}$.

We note that this high-level policy could be learned in many ways, and one does not need to use a GNN. For example, we could learn a feed-forward multilayer perceptron (MLP) that takes as input the features of the blocks and goals, and predicts the next block and goal. However, if the MLP policy is trained on $K = 3$ objects and goals, it does not automatically generalize to $K = 4$, since the number of inputs, and hence the architecture of the policy, are different for different $K$. On the other hand, GNNs generalize to different numbers of nodes in the graph, and hence can be used on a variable number of objects. Our GNN policy trained on $K = 3, 4, 5$ exhibits zero-shot generalization on $K = 2, 3, \cdots, 9$ (Section V-A), reinforcing that GNNs are an appropriate policy choice for manipulation.
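A minimal sketch of this training step, under the same assumptions as the earlier `GNNPolicy` and `build_state_graph` sketches (per-node scalar logits, with object nodes first and goal nodes second), might look as follows; `expert_obj` and `expert_goal` are the indices chosen by the expert, and the names are ours.

```python
import torch
import torch.nn.functional as F

def imitation_loss(policy, graph, expert_obj, expert_goal, K):
    """Cross-entropy of Eq. 2 against the expert's one-hot choices."""
    scores = policy(graph.x, graph.edge_index)      # (2K,) per-node logits
    obj_logits, goal_logits = scores[:K], scores[K:]
    # softmax + cross-entropy over objects (Eq. 1) and, likewise, over goals
    return (F.cross_entropy(obj_logits.unsqueeze(0), torch.tensor([expert_obj])) +
            F.cross_entropy(goal_logits.unsqueeze(0), torch.tensor([expert_goal])))
```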
C. Interpreting the learned GNN policy

One of the challenges of working with learned neural network policies is the lack of insight into their decision making; hence the interest in adding transparency to the decisions made by deep networks [30]. For GNNs specifically, [5] propose the GNNExplainer algorithm for extracting explanations out of a trained GNN – determining the importance of neighbouring nodes, as well as of input features, for decision making. Intuitively, [5] find a subgraph and subset of input features that result in the smallest shift in the output distribution of the GNN. We modify GNNExplainer to suit our problem setting, while keeping the main idea of finding the "closest" subgraph the same.

The output of our trained GNN policy $\pi_\theta$, given an input graph $G = (V, E)$, is two categorical probability distributions $P^o_{pred}, P^g_{pred}$, predicting which object and which goal will be picked next. We aim to find a mutated graph $G_S$ and feature mask $F$, such that the output of $\pi_\theta$ given $G_S$ and masked features $\phi_S = \phi \odot F$ results in a distribution close to $P^o_{pred}, P^g_{pred}$. This setup is different from [5], where a categorical distribution predicts the class of every node in a graph; our model instead predicts two distributions over all object and goal nodes. As a result, the number of nodes $V$ in our mutated graph $G_S$ is fixed to be the same as in the original graph $G$, to maintain a valid probability distribution. In our analysis, we aim to identify which spatial relationship, or neighbouring object or goal, led to the selection of a particular object or goal by the policy. To do this, we create subsets of edges $E_S \subset E$, and a feature mask $F$ that sets certain node features $\phi$ to 0. Intuitively, removing an edge severs the message-passing channel between the two previously-connected nodes. If the edge was important, then removing it would significantly change the output prediction. This ablation method helps us systematically understand which edges and features were actually important for choosing a particular object to pick or goal to place the object in.

Formally, given a trained GNN policy $\pi_\theta$ and input graph $G = (V, E)$, we aim to find a mutated graph $G_S = (V, E_S)$, $E_S \subset E$, and a feature mask $F$, such that the mutual information between the prediction $Y = \pi_\theta(G)$ and the predictions given the subgraph $G_S$ and masked features $\phi_S = \phi \odot F$ as input is maximized:

$G_S, F = \arg\max_{G_S, F} MI(Y, (G_S, \phi_S))$  (4)
$= H(Y) - H(Y | G = G_S, \phi = \phi_S)$  (5)

Inspecting Equation 4, we see that $H(Y)$ does not depend on $G_S$ or $F$, as $Y$ is obtained by applying the trained policy to the original graph $G$. Hence, maximizing the mutual information between the predicted distribution $Y$ and the mutated graph $(G_S, \phi_S)$ is the same as minimizing the conditional entropy $H(Y | G = G_S, \phi = \phi_S)$. Our explanation for a decision $Y$ is hence a mutated graph $G_S$ consisting of a subset of edges $E_S$ from the original graph, and input features masked by $F$, that minimize the uncertainty over $Y$ given this limited information:

$G_S, F = \arg\min_{G_S, F} H(Y | G = G_S, \phi = \phi_S)$  (6)

Since $Y$ consists of two output distributions, we re-frame the objective from Eq. 6 to minimize the sum of the conditional entropies of the two predicted distributions $P^o_{pred}, P^g_{pred}$ given the predicted distributions $P^o_{s,pred}, P^g_{s,pred}$ over $(G_S, \phi_S)$:

$\min_{G_S, F} H(P^o_{pred} | P^o_{s,pred}) + H(P^g_{pred} | P^g_{s,pred})$  (7)

Fig. 4: Visualizing the most important edges and features for choosing each block over a 3-step trajectory.
The circled object and goal are the ones selected by the policy. The most important edge is bolded; the most important feature is listed by time step.
We limit the total number of alive edges, $|E_S| \le c_E$, and alive features, $\sum_j F_j \le c_F$, in our mutated graph $G_S$. Next, we exhaustively enumerate all graphs $G_S = (V, E_S)$ and features $\phi_S = F \odot \phi$, and search over all possible combinations for the minimum of Eq. 7, to find the subset of edges and features that reduce uncertainty over the predictions of the original graph $G$.

As an example, we apply this interpretation technique to a trajectory generated by a learned GNN policy applied to the 3-block environment in Figure 4, visualizing the environment state as a graph. To produce the explanation, we choose $c_E = 3, c_F = 1$, i.e. we want to visualize the 3 most important edges and the most important (spatial) feature.

The visualization procedure automatically outputs interpretable explanations of the form "node $i$ was chosen because of its relationship with nodes $j, k, l$; the most important feature was spatial dimension $z$." As an intuitive check on the validity of the explanation, we note that the important edges always include edges between the selected object and other objects and goals, implying that the policy's decision was informed by how the selected block related to its neighbours. $z$ is an extremely important feature in the blockstacking task, as the heights of the current block positions determine which block is "safe" to pick up.
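The exhaustive search can be sketched as below, again assuming the earlier `GNNPolicy` interface. As an illustrative stand-in for the conditional entropy of Eq. 7, this version scores a mutation by the cross-entropy between the original and mutated predicted distributions; the function and variable names are ours, not the paper's.

```python
import itertools
import torch

def explain(policy, graph, K, c_E=3, num_features=5):
    """Find the c_E edges and single alive feature (c_F = 1) whose mutated
    graph best reproduces the policy's original predictions (Eq. 7)."""
    with torch.no_grad():
        base = policy(graph.x, graph.edge_index)
        p_obj, p_goal = base[:K].softmax(0), base[K:].softmax(0)
        best_score, best = float("inf"), None
        for keep in itertools.combinations(range(graph.edge_index.size(1)), c_E):
            sub_edges = graph.edge_index[:, list(keep)]   # edge subset E_S
            for f in range(num_features):                 # feature mask F
                x_masked = torch.zeros_like(graph.x)
                x_masked[:, f] = graph.x[:, f]            # keep one feature alive
                out = policy(x_masked, sub_edges)
                q_obj = out[:K].log_softmax(0)
                q_goal = out[K:].log_softmax(0)
                # cross-entropy between original and mutated predictions
                score = (-(p_obj * q_obj).sum() - (p_goal * q_goal).sum()).item()
                if score < best_score:
                    best_score, best = score, (keep, f)
    return best  # (most explanatory edges, most important feature)
```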
V. EXPERIMENTS

In simulation, we use a set of PyBullet [4] simulation environments centered around controlling a 7-DoF robot manipulator (KUKA iiwa7) to perform a variety of blockstacking tasks. On hardware, we use a 7-DoF Franka Panda manipulator equipped with a Robotiq 2F-85 two-finger gripper, and solve a subset of the blockstacking tasks. For detecting blocks on hardware, we utilize a RealSense depth camera with the ArUco ARTags library [31], [32] for estimating relative positions between objects and gripper (Fig. 8).

Each environment varies by the number of blocks $K$, their initial positions, and their goal positions. Success is measured by recording the percentage of perfect stacks of size $K$ at the end of each trial. Our test environments are meant to study the generalization of the trained GNN policy across increased numbers of blocks for single stacks, as well as generalization to new tasks like multiple stacks or pyramids. Our test environments are:

1. K-block: $K$ blocks are initialized in a random location. The goal is to invert this stack at another random location. Fig. 5a visualizes $K = 6$. This environment studies the sensitivity of the policy to the number of objects.
2. K-pyramid: same as K-block, but the goal positions are in a pyramid configuration, not a stack (Figure 5b), analyzing robustness to new goal configurations for the blocks.
3. K-block s-stack: There are $s$ stacks of $K$ blocks in the environment, all of which have different random goal positions (Figure 5c), studying robustness to both initial and goal configurations.
4. K-block perturbed: same as K-block, but after $K$ time steps the stack is disturbed, and all blocks fall randomly to new positions. The policy is given an additional $K$ steps to recover from this state. The goal of this experiment is to demonstrate the policy's reactive nature.

For all experiments, we consider 4 variants of our approach (IL-GNN), consisting of the different GNN policy architectures described in Section III-B. Specifically, we compare GCN, Sage, Gated and Attention architectures on our test environments. All policies consist of 3 hidden layers, with 64 hidden units each and ReLU activation. For Attention policies, the number of attention heads was set to 1.
Fig. 5: Test environments: (a) 6-block, (b) 6-pyramid, (c) 3-block 3-stack.
A. Comparisons on K-block stacking
We compare our trained GNN policy (IL-GNN) against a set of competitive baselines on blockstacking environments, designed to highlight the generalization abilities of a GNN policy trained with imitation learning (IL) over other approaches from the literature. We compare the GNN policy architecture with a feedforward network architecture (MLP) for both RL and IL. All baselines operate on our hierarchical experimental setup.

1. RL-MLP: This baseline consists of an MLP policy instead of a GNN. Since feedforward networks have fixed input sizes, we have to retrain the policy for each stack of size 2 to 9 using RL. This comparison highlights the generalizability of the GNN architecture over MLP architectures.
2. RL-GNN: In this baseline, our GNN policy is trained using RL on stacks of size 2 to 9, and its performance is compared to training with imitation learning.
3. RL-GNN-Seq: We design this baseline using the sequential training curriculum described in [2]. The curriculum starts by training our GNN policy for $K_{base} = 2$ blocks and initializes the policy for $K$ blocks with the policy trained in the $K - 1$ environment, until $K = 9$. This comparison highlights the advantage of imitation learning even over tuned RL training approaches.

Fig. 6: Generalization over block numbers. A successful trajectory is one in which all goals are filled at the end.

For all RL baselines, we use Proximal Policy Optimization (PPO) [33] as our training method of choice, modified from the Stable Baselines open-source project [34]. We give a large environment interaction budget to the RL policies: 2000 environment interactions per stack, resulting in 16,000 interactions in total across $K = 2, \ldots, 9$. In comparison, our approach IL-GNN is trained on only 600 environment interactions from 150 expert trajectories on 3, 4, 5 blocks.

As can be seen in Figure 6, RL-MLP performs the worst on the 3-block stack, and both RL-GNN and RL-GNN-Seq perform better at smaller problems. Hence, providing a strong spatially-oriented inductive bias by introducing a graphical state representation improves learning, and the effect is strongest on environments with lower numbers of blocks. While RL-MLP and RL-GNN do not share data across tasks, RL-GNN-Seq does, and hence performs better than both RL-MLP and RL-GNN. However, the performance of all RL baselines gets significantly worse as the number of blocks goes beyond 5. For $K \ge 6$, the complexity of the task is too high for RL to learn high-performing policies in 2000 interactions per new environment. In comparison, IL-GNN is trained on expert data of $K = 3, 4, 5$ blocks, but successfully generalizes to the out-of-distribution 9-block environment (success rates of 0.85 to 1.0, depending on the GNN architecture). To our knowledge, this is the first time a learned policy has been demonstrated to perform zero-shot generalization on a blockstacking task up to 9 blocks, while training on a dataset of at most 5 blocks. For comparison, the zero-shot generalization behavior in [2] for a reinforcement learning policy trained on $K$-block blockstacking fails to generalize beyond $K + 1$, while we can generalize to $K + 4$.

B. Generalization to diverse goal configurations
Once the GNN policy has been trained on the expert dataset of stacking $K = 3, 4, 5$ blocks, it can be tested on new task configurations to study generalization to new, unseen tasks.

The K-pyramid experiment tests the policy's ability to achieve different goal configurations outside of its training distribution, given the same initial condition. The policy has only been trained on single-stack goals, and has never seen goals in a pyramid configuration. As can be seen in Figure 7a, all GNN architectures achieve near-perfect performance at stacking blocks in a pyramid, highlighting that the policies can generalize to new goal configurations.

The K-block s-stack experiment tests robustness to both different initial and goal states, and is the most difficult of the test environments. The policies have been trained on data with blocks in single stacks, going to goals in single stacks, but now need to generalize to multiple stacks of both. Figure 7a shows that Sage and Attention policies are able to solve this task perfectly, but Gated GNN policies suffer. Analyzing this effect, we observed that the Gated GNN architecture tends to overfit to small datasets, and this can result in poor performance in settings that are significantly out of the training distribution, such as the multi-stack environment.

Fig. 7: Generalization and sensitivity experiments with different GNN architectures. (a) Generalization experiments of the GNN policies to new, unseen tasks. (b) Experiments testing sensitivity to training dataset size.
The K-block perturbed experiments primarily test the policy's ability to reach the same goal state despite observing a disturbance mid-way through its execution. This test highlights the reactive nature of our approach, and its robustness to block positions out of its training distribution. After a perturbation, previously stacked blocks are strewn across the workspace, which is a configuration not seen during the expert demonstrations. Success in this environment demonstrates that the GNN policy learns to achieve a desired spatial configuration in a way that is robust to perturbations of the current spatial configuration. As shown in Figure 7a, all GNN architectures are able to solve the perturbed environments almost perfectly, showing that they are all robust to disturbances and out-of-distribution initial conditions.
C. Explaining the learned GNN policies
Upon observing the difference in behaviors of the trained GNN policies in the multi-stack environment, we ran an ablation experiment over the number of expert demonstrations used in training the policies, shown in Figure 7b. Though there are more demonstrations, the dataset only covered stacking 3, 4, 5 blocks, and hence the multi-stack task was still out-of-distribution. As seen in Fig. 7b, the performance of the Sage and GCN architectures improves as the amount of data increases, while the Attention architecture performs the same between 500 and 15k expert demonstrations. Interestingly, the Gated architecture does not improve in performance as we give more training data.

Num. traj | Attention   | Gated | GCN  | Sage
5         | z, unfilled | y, z  | z, z | z, unfilled
15,000    | z, unfilled | y, z  | z, z | z, z

TABLE I: We compare the top two most important features over different numbers of trajectories used for training and network architectures during a task.

To explain this difference in behavior between the GNN architectures, we run the GNNExplainer from Section IV-C on the GNN policies when trained using 5 and 15,000 expert trajectories. For the Gated GNN, for a state in the middle of the execution, the important features between 5 and 15,000 demonstrations stay the same – the $y$ position of the block, and the $z$ height of the goals. On the other hand, for the Sage and Attention GNNs, the most important features for both 5 and 15,000 demonstrations are (1) the $z$ height of the block and (2) whether a goal is free. As a result, Sage always picks an empty goal, while Gated tries to put a block in a filled goal, collapsing a half-formed tower. For GCN policies, when using only 5 expert demonstrations, the most important features are the $z$ height of the block and the $z$ height of the goal, which causes failure when the policy tries to place blocks in the lowest goal, even if it is occupied. When GCN is trained on 15k demonstrations, the most important features become (1) the $z$ height of the block and (2) whether a goal is unfilled, which improves performance. A comparison of the most salient features by model is listed in Table I.

Overall, we observe that given very few demonstrations, GCN policies might overfit, but Sage and Attention are still able to learn generalizable features. On the other hand, even with 15k demonstrations, Gated policies overfit to the single-stack settings, and do not learn to generalize to out-of-distribution environments. In the future, exploring richer training datasets for Gated policies may help with overfitting and therefore improve generalization.

D. Generalization to hardware
Fig. 8: Experimental setup for hardware tasks: (a) 3 blocks, (b) 4 blocks, (c) 3-block 2-stack.
Finally, we validate our approach by training GNN policies in simulation and applying them to hardware (Fig. 8). We directly transfer a GCN policy trained in simulation on the 3-, 4-, 5-block dataset of 150 expert trajectories and deploy it on hardware. The policy picks a block and goal, and a low-level PickAndPlace policy executes the motion. Since the block sizes, robot frame, etc. differ between simulation and hardware, we apply a simple affine coordinate transform to map the hardware coordinates to ones the policy has seen during training. For generalization, it is important that the scale and distribution of block and goal positions are similar between simulation and hardware, even if their actual locations are different.

We execute 5 runs of stacking 3 and 4 blocks on hardware, and achieve zero-shot success on 3-block and on 4-block stacks. We also succeeded at running a 3-block 2-stack task. Errors on hardware arise from noisy block position measurements and differences in input distributions between simulation and hardware. Training the GNN policy with domain randomization and a wider variety of training data might help improve accuracy on hardware. To the best of our knowledge, this is the first demonstration of zero-shot transfer of a GNN policy trained in simulation to hardware for blockstacking.
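As an illustrative sketch (the function name and calibration values are placeholders, not the paper's calibration), the sim-to-real alignment described above amounts to a per-axis affine map:

```python
import numpy as np

def to_sim_frame(p_hw, scale=(1.0, 1.0, 1.0), offset=(0.0, 0.0, 0.0)):
    """Affine map from hardware coordinates (robot frame) to the coordinate
    range the policy saw in simulation; scale/offset are calibrated per setup."""
    return np.asarray(scale) * np.asarray(p_hw) + np.asarray(offset)
```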
VI. CONCLUSION AND FUTURE WORK
In this work, we present a graphical policy architecture for manipulation tasks that can be learned from expert demonstrations, and is extremely sample-efficient to train. Once the graph neural network policies are trained, they demonstrate zero-shot generalization behavior across unseen and larger problem instances, along with interpretable explanations for policy decisions. We test 4 GNN architectures, finding several that are extremely sample-efficient at learning the underlying structure of the task and generalizing to new tasks. We transfer a GNN policy learned in simulation to a real Franka robot and show that such a high-level policy can generalize to hardware. This work opens exciting avenues for combining research on GNNs with TAMP problems, especially for learning manipulation tasks from visual input.
VII. ACKNOWLEDGEMENTS
We thank Sarah Maria Elisabeth Bechtle and Franziska Meier for helpful discussions and feedback on the paper.

REFERENCES

[1] M. Sieb, Z. Xian, A. Huang, O. Kroemer, and K. Fragkiadaki, "Graph-structured visual imitation," in Conference on Robot Learning. PMLR, 2020, pp. 979–989.
[2] R. Li, A. Jabri, T. Darrell, and P. Agrawal, "Towards practical multi-object manipulation using relational reinforcement learning," in ICRA 2020. IEEE, 2020, pp. 4051–4058.
[3] L. P. Kaelbling and T. Lozano-Pérez, "Hierarchical task and motion planning in the now," in ICRA. IEEE, 2011, pp. 1470–1477.
[4] E. Coumans and Y. Bai, "Pybullet," https://pybullet.org/, 2016.
[5] R. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec, "Gnnexplainer: Generating explanations for graph neural networks," Advances in Neural Information Processing Systems, vol. 32, p. 9240, 2019.
[6] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, "Learning visual predictive models of physics for playing billiards," 2016.
[7] Y. Ye, D. Gandhi, A. Gupta, and S. Tulsiani, "Object-centric forward modeling for model predictive control," 2019.
[8] T. Silver, R. Chitnis, A. Curtis, J. Tenenbaum, T. Lozano-Perez, and L. P. Kaelbling, "Planning with learned object importance in large problem instances using graph neural networks," 2020.
[9] D.-A. Huang, S. Nair, D. Xu, Y. Zhu, A. Garg, L. Fei-Fei, S. Savarese, and J. C. Niebles, "Neural task graphs: Generalizing to unseen tasks from a single video demonstration," in CVPR, 2019, pp. 8565–8574.
[10] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261, 2018.
[11] C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez, "Integrated task and motion planning," arXiv preprint arXiv:2010.01083, 2020.
[12] G. Konidaris, L. P. Kaelbling, and T. Lozano-Perez, "From skills to symbols: Learning symbolic representations for abstract high-level planning," Journal of Artificial Intelligence Research, 2018.
[13] M. Fox and D. Long, "PDDL2.1: An extension to PDDL for expressing temporal planning domains," Journal of Artificial Intelligence Research, vol. 20, pp. 61–124, 2003.
[14] D. Höller, G. Behnke, P. Bercher, S. Biundo, H. Fiorino, D. Pellier, and R. Alford, "HDDL: An extension to PDDL for expressing hierarchical planning problems," in AAAI, vol. 34, 2020.
[15] M. Gharbi, R. Lallement, and R. Alami, "Combining symbolic and geometric planning to synthesize human-aware plans: toward more efficient combined search," in IROS. IEEE, 2015, pp. 6360–6365.
[16] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[17] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., "Solving Rubik's cube with a robot hand," arXiv preprint arXiv:1910.07113, 2019.
[18] B. Kim, Z. Wang, L. P. Kaelbling, and T. Lozano-Perez, "Learning to guide task and motion planning using score-space representation," 2018.
[19] A. M. Wells, N. T. Dantam, A. Shrivastava, and L. E. Kavraki, "Learning feasibility for task and motion planning in tabletop environments," IEEE RAL, vol. 4, no. 2, pp. 1255–1262, 2019.
[20] R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel, "Guided search for task and motion plans using learned heuristics," in ICRA. IEEE, 2016, pp. 447–454.
[21] Z. Wang, C. R. Garrett, L. P. Kaelbling, and T. Lozano-Pérez, "Active model learning and diverse action sampling for task and motion planning," 2018.
[22] L. P. Kaelbling and T. Lozano-Pérez, "Learning composable models of parameterized skills," in ICRA, 2017, pp. 886–893.
[23] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," JMLR Workshop and Conference Proceedings, 2011, pp. 627–635.
[24] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," arXiv preprint arXiv:1903.02428, 2019.
[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
[26] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, "Weisfeiler and Leman go neural: Higher-order graph neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4602–4609.
[27] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," arXiv preprint arXiv:1706.02216, 2017.
[28] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
[29] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[30] C. Molnar, Interpretable Machine Learning. Lulu.com, 2020.
[31] S. Garrido-Jurado, R. Munoz-Salinas, F. J. Madrid-Cuevas, and R. Medina-Carnicer, "Generation of fiducial marker dictionaries using mixed integer linear programming," Pattern Recognition, vol. 51, pp. 481–491, 2016.
[32] F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer, "Speeded up detection of squared fiducial markers," Image and Vision Computing, vol. 76, pp. 38–47, 2018.
[33] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[34] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu, "Stable baselines," https://github.com/hill-a/stable-baselines, 2018.