Object and Relation Centric Representations for Push Effect Prediction
Ahmet E. Tekden, Aykut Erdem, Erkut Erdem, Tamim Asfour, Emre Ugur
Abstract—Pushing is an essential non-prehensile manipulation skill used for tasks ranging from pre-grasp manipulation to scene rearrangement and reasoning about object relations in the scene, and pushing actions have thus been widely studied in robotics. The effective use of pushing actions often requires an understanding of the dynamics of the manipulated objects and adaptation to the discrepancies between prediction and reality. For this reason, effect prediction and parameter estimation with pushing actions have been heavily investigated in the literature. However, current approaches are limited because they either model systems with a fixed number of objects or use image-based representations whose outputs are not very interpretable and quickly accumulate errors. In this paper, we propose a graph neural network based framework for effect prediction and parameter estimation of pushing actions by modeling object relations based on contacts or articulations. Our framework is validated both in real and simulated environments containing different shaped multi-part objects connected via different types of joints and objects with different masses. Our approach enables the robot to predict and adapt the effect of a pushing action as it observes the scene. Further, we demonstrate 6D effect prediction in the lever-up action in the context of robot-based hard-disk disassembly.
Index Terms—Push Manipulation, Effect Prediction, Parameter Estimation, Graph Neural Networks, Interactive Perception, Articulation Prediction
I. INTRODUCTION
Pushing is a fundamental non-prehensile (manipulation without grasping) motion primitive that gives robots great flexibility in manipulating objects [1], [2]. Using push actions, a robot can navigate objects to goal configurations even when the objects are not graspable [3]; it can manipulate objects under uncertainty [4], or bring an object into the graspable area [5]. Compared to grasping actions, pushing is less restrictive; however, the robot does not have direct control over the state of the manipulated objects. This results in greater complexity in planning and control, as the dynamics of the manipulated objects often need to be taken into consideration [1]. Effect prediction of pushing actions has many applications [2], [6], including scene rearrangement [7], object segmentation [8], object singulation [9], [10], and pre-grasp manipulation [10]–[13]. However, action-effect prediction of pushing actions depends on many factors [14] and requires adaptation when mispredictions occur. Figure 1 shows an example illustration. The initial prediction of the robot will be that the objects get scattered. However, after seeing some of the objects moving together, the robot will understand that their future motion will continue to reflect this dynamic.

Ahmet E. Tekden and Emre Ugur are with the Computer Engineering Department, Bogazici University, Turkey. Aykut Erdem is with the Computer Engineering Department, Koç University, Turkey. Erkut Erdem is with the Computer Engineering Department, Hacettepe University, Turkey. Tamim Asfour is with the Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany. *Corresponding author: Ahmet E. Tekden, email: [email protected]

Fig. 1. We would normally expect the action of the robot in the left image to scatter the contacted objects. However, seeing the contacted objects moving together, the robot should correct its belief to capture this dynamic.

In many environments, robots work with object clutters containing objects of different shapes and weights, with possible articulations between them. A robot should be able to reason about the influence of the shape and mass of objects, physical connections such as contacts or different types of articulations between objects, the propagation of motion between objects, and the correction of unknown or partially known objects or object parts in the environment. Current approaches model environments with a fixed number of objects or use image data, an object-independent representation. While there has been great progress on effect prediction using raw sensory data [15]–[18], using such predictions at the decision-making level has been difficult and has required tasks to be defined at the pixel level. While such approaches have certain advantages, many tasks require more interpretable representations for the task to be defined. Humans decompose environments into objects and use their interactions for physical reasoning [19]–[21], so there is certainly value in using such representations for effect prediction. We propose using graph neural networks (GNNs) for push effect prediction. Graph neural networks [22] can exploit the graph structure of multi-object systems through object- and relation-centric representations, and they are heavily used in modeling physics [21], [23]–[29].

In this paper, we propose a general-purpose learnable physics engine in which object- and relation-centric representations are learned via a shared propagation network and used for physics prediction and parameter estimation in push manipulation tasks.
We use articulation-based graph representations that use cylinder- and cuboid-shaped objects and their possible interactions via contacts or joints for modeling multi-part object systems. We resort to a two-step training scheme where our framework is first trained for effect prediction; then, using the learned object and relation representations, it is trained for parameter estimation. Our framework can predict low-level trajectories of groups of articulated objects given robot actions, and it can estimate the mass of observed objects and the joint relations between them based on their interaction history. Using the articulation-based representation, novel tools that were not encountered during training can be built by connecting multiple cuboids via fixed joints, and they can be used for planning in tool manipulation tasks.

Project page: https://fzaero.github.io/push_learning/

An early version of this work was published in [30]. However, this paper significantly extends that work in several important directions. In [30], physics prediction and parameter estimation required two independent networks. By employing a new weight-sharing mechanism that allows these tasks to share object- and relation-centric representations, the number of learnable parameters is decreased by about thirty percent. Previously, our framework was only able to model cylindrical objects. We extend the input representations of objects and their relations, allowing our network to handle objects with different shapes, predict the mass of objects, and represent complex shaped objects that are built by connecting multiple cuboids and cylinders, which even allows our framework to work with tools that were not previously encountered. In addition, we show that our framework can make 6D effect predictions. Furthermore, the training of the network has been improved by the use of scheduled sampling [31] and a broader data distribution. These novel contributions decrease the errors for long-horizon prediction tasks, and in Section V-E, our new results are shown to surpass the ones in [30]. More specifically, the contributions of our framework can be listed as follows:

• We develop a graph neural network based framework for parameter estimation and physics prediction in push manipulation tasks.
• We utilize a weight-sharing mechanism to transfer learned representations to new tasks.
• We show the feasibility of articulation-based graph representations for modeling multi-part objects.
• We design a novel 6D action-effect prediction in a lever-up task in the context of hard-disk drive disassembly.
• Through simulated and real-world experiments, we verify our framework in joint relation and mass prediction, physics prediction, and tool manipulation and planning tasks.

II. RELATED WORK
a) Learning Dynamics / Modelling Physics:
Modeling intuitive physics has attracted considerable interest in recent years [32]. For instance, Battaglia et al. [33] proposed a Bayesian model called the Intuitive Physics Engine and showed that the physics of stacked cuboids could be modeled with it. Similarly, Hamrick et al. [34] showed that humans can reason about object masses from their interactions and modeled this ability with Bayesian models. Smith et al. [35] modeled expectation violation in intuitive physics. They discuss how humans are surprised when their physical expectations mismatch reality, and they modeled this with deep learning methods. Deisenroth et al. [36] suggested a probabilistic dynamics model based on Gaussian Processes that is capable of predicting the next state of a robot given the current state and the action. Recently, these studies have been extended through the use of deep learning methods. Lerer et al. [37] trained a deep network to predict the stability of block towers given their raw images obtained from a simulator. Groth et al. [38] extended this idea by allowing the stacking of objects with different geometries. They showed that their proposed network could predict the stability of given towers in this more difficult setup. The tower stacking task has continued to be an important environment for intuitive physics problems [39].

A specific topic of interest within modeling physics with deep learning is motion prediction from images, which has gained increasing attention over the last few years. Mottaghi et al. [40] trained a Convolutional Neural Network (CNN) for motion prediction on static images by casting the problem as classification. Mottaghi et al. [41] employed CNNs to predict movements of objects in static images in response to applied external forces. Fragkiadaki et al. [42] suggested a deep architecture in which the outputs of a CNN are used as inputs to Long Short-Term Memory (LSTM) cells [43] to predict the movements of balls in simulated environments.

b) Graph Neural Networks (GNNs) for Learning Physics:
As deep structured models, GNNs allow learning useful representations of entities and the relations among them, providing a reasoning tool for solving structured learning problems. Hence, they have found extensive use in physics prediction. The Interaction Network by Battaglia et al. [23] and the Neural Physics Engine by Chang et al. [24] are the earliest examples of general-purpose physics engines that depend on GNNs. These models perform object-centric and relation-centric reasoning to predict the movements of objects in a scene. While they were successful in modeling the dynamics of several systems such as n-body simulation and billiard balls, these models had certain shortcomings, especially when the movements of objects have a chain effect on other objects (e.g., a pushed object pushes a group/sequence of objects it is in contact with) or when the objects are composed of complex shapes. These shortcomings can be partly handled by including a message passing structure within GNNs, as done in recent works such as [21], [25], [26]. Most of these networks used simple neural networks for encoding object and relation information. Kipf et al. [44] showed that variational autoencoders could be used for encoding object and relation information, where their network was shown to encode object information directly from the trajectories of the objects in an unsupervised way.

Another approach is to acquire object information directly from images. Ye et al. [45] used an image and the detected locations of objects to predict the latent representation of the next time step. This latent representation was then decoded to create the image expected to be observed at the next time step. Watters et al. [27] and van Steenkiste et al. [28] proposed hybrid network models which encode object information directly from images via CNNs and predict the next states of the objects via GNNs. Lately, these networks have been extended to handle even more complex environments. Sanchez-Gonzalez et al. [29] showed that GNNs can be used for learning particle-based simulations that consist of more than 1000 particles.

c) Effect Prediction in Robotics:
Action-effect prediction has been investigated using model-based approaches that use analytical models [14], [46], data-driven methods that use machine learning, and hybrid methods that incorporate machine learning into analytical modeling [47], [48]. Effect prediction methods can be further divided into two categories depending on the number of involved objects. In order to predict action effects on single objects, object masks have been heavily used [11], [49]–[51]. Recently, Kopicki et al. [52] proposed learning multiple motion predictor models for different shaped single objects, where a vision system selects a predictor depending on the context. Seker et al. [53] investigated how changing object shapes affects low-level object motion trajectories and modeled this using CNNs and LSTMs.

In the context of end-to-end learning, Agrawal et al. [54] trained forward and inverse models for learning how to poke an object to move it into a target position. This network uses the latent vectors of a CNN to train predictive models. The forward model tries to predict the latent representation of the final image using the current image, and the inverse model takes the latent representations of both the final and initial images to find the parameters of the poke action. Finn et al. [15] proposed a convolutional recurrent neural network [55] to predict future image frames using only the current image frame and the actions of the robot. Byravan et al. [17] presented an encoder-decoder like architecture to predict SE(3) motions of rigid bodies in depth data. However, the output images get blurry over time, or the predictions tend to drift away from the actual data due to accumulated errors, making such methods not straightforward to use for long-term predictions in robotics.

Previous data-driven methods that directly used object-centric representations cannot deal with multiple (any number of) objects and relations, as the predictors generally have fixed input and output dimensions. End-to-end approaches can handle multiple objects since their inputs and outputs are images; however, pixel-based prediction errors quickly accumulate, resulting in blurry long-term predictions. Recently, GNNs that can represent multiple objects in an object-centric way have started being employed in robotics research as well. Janner et al. [56] used GNNs to jointly learn object representations from perception and physics prediction. Ye et al. [57] learned object-centric forward models for planning and control. Their model takes object bounding boxes as input and learns future state prediction from object embeddings generated by CNNs. Tung et al. [58] similarly used object bounding boxes with GNNs for effect prediction and control. Paus et al. [6] used GNNs for action-effect prediction. Sanchez-Gonzalez et al. [59] used graph networks as learnable physics engines in robotic setups. While previous GNN based robotic effect prediction models were successful in modeling physics, they largely overlook unknown or partial information. Our model can also handle more complex shaped objects by modeling them as a group of articulated simple shaped objects.

d) Parameter Estimation: Wu et al. [60] proposed a deep approach for finding the parameters of a simulation engine that predicts the future positions of objects that slide on various tilted surfaces. Zheng et al.
[61] used perception prediction networks, a type of graph neural network, for learning latent object properties from interaction experience to simulate system dynamics.

In many scenarios, simply observing the scene may not yield enough information, and the robot may need to actively act on the environment to perceive more. In these cases, the robot can improve its perception through actions [62]. Li et al. [63] used recurrent neural networks to predict the center of mass from an object mask and interaction experience. Xu et al. [64] used a deep learning architecture for learning object properties. In their setting, a robot slides an object down an inclined surface and causes it to collide with another. Using a sequence of dynamic interactions, they showed that their model could learn to predict object representations. Kumar et al. [65] trained policy and predictor networks to estimate the mass distribution of articulated objects. They showed that their policy network improves the mass prediction capacity of the predictor network compared to a random policy. However, their approach was limited to articulated objects with a fixed number of parts.

In [66]–[69], researchers also studied estimating the joint relations between objects for real-time tracking and prediction of articulated motions in challenging interactive perceptual settings. These works, however, assume expert knowledge about the joint types and hard-code the corresponding transformation matrices [67], candidate template models [66], or specific measurement models [68], [69] to detect kinematic structures. Our system assumes no prior knowledge about joint dynamics, and the robot learns the dynamics of joint categories purely from observations. Therefore, learning the dynamics of completely novel relation types is possible with our system. Exceptionally, in [66], Sturm et al. proposed to learn articulation dynamics from data; however, this was only realized on a single pair of objects from a single articulation observation (garage door motion). Furthermore, these studies do not learn or predict how pairs or chains of non-articulated touching objects would propagate applied forces along the cluster/chain. In contrast, our system can predict the propagated effect on groups of touching non-articulated objects.

In our work, we verified the prediction and reasoning capability of the robot in the use of tools that are composed of basic primitive shapes. While our main focus is not on decomposing objects into primitives, it should be noted that this topic has been studied in the literature. For example, Deng et al. [70] showed that objects can be decomposed into convex hulls from input images. In addition, they showed that these convex hulls can be used for physics simulation. Similarly, Pashevich et al. [71] proposed a framework that can propose different part sets into which objects can be divided, and then reconstruct the divided object in the real world with a robot using the primitives available in the workspace.

Fig. 2. Our framework extracts object- and relation-centric latent representations from the current physical scene. The latent representations are initially used to update unknown parameters of the scene graph; then, with the planned motor commands, they are used for predicting the future motion of the manipulated objects.
III. PROPOSED FRAMEWORK
We propose methods and a framework that are capable of learning object- and relation-centric representations for different physical scenes. These representations can be used in a variety of tasks. In this work, we designed our framework around solving two complementary tasks, namely belief regulation and physics prediction. Figure 2 shows a graphical illustration of our framework. First, object- and relation-centric representations for each object and each object-object relation are learned using a propagation network. By feeding these representations to RNNs, our framework finds unknown object and relation parameters and acquires an updated graph of the scene. By passing the updated scene graph and future robot actions to the same propagation network, our framework predicts the future motion of the manipulated objects by chaining the effect predictions. In the rest of this section, more technical details are provided.
A. Preliminaries

Physical System as a Graph: From a physical system with multiple interacting objects, we form a graph G = ⟨O, R⟩, where each object is represented by a node, O = {o_i}_{i=1:N_o} (of cardinality N_o), and the relations between objects, such as a contact or a joint, are represented by the edges of the graph, R = {r_k}_{k=1:N_r} (of cardinality N_r).
Representing Push Manipulation Tasks: We are interested in representing the push manipulation task as a robot interacting with an object clutter. The clutter can contain many objects that may have different parts with different mass distributions, objects with possible articulations, etc. We represent such a system with the aforementioned graphs G = ⟨O, R⟩. Each node o_i = ⟨x_i, a^o_i⟩ stores an object or part vector, where x_i = ⟨q_i, q̇_i⟩ is the state of object i, with its pose q_i and velocity q̇_i, and a^o_i stands for object properties such as shape or mass. Between each node pair (i, j), there is an edge r_k = ⟨d_k, s_k, a^r_k⟩ that represents the object-object relation, where d_k = q_i − q_j is the displacement vector, s_k = q̇_i − q̇_j is the velocity difference, and a^r_k corresponds to the properties of relation k between objects i and j.
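To make the representation concrete, the following sketch shows how such a scene graph could be assembled in code. This is a minimal illustration only: the class names, feature layouts, and the contact threshold are our assumptions, not the authors' implementation.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class ObjectNode:                        # o_i = <x_i, a_i^o>
        pose: np.ndarray                     # q_i, e.g., [x, y, cos(th), sin(th)]
        velocity: np.ndarray                 # q_i_dot
        attributes: np.ndarray               # a_i^o: shape dims, mass, robot flag, ...

    @dataclass
    class RelationEdge:                      # r_k = <d_k, s_k, a_k^r>
        sender: int
        receiver: int
        attributes: np.ndarray               # a_k^r, e.g., one-hot joint type

        def features(self, objects):
            d = objects[self.sender].pose - objects[self.receiver].pose          # d_k
            s = objects[self.sender].velocity - objects[self.receiver].velocity  # s_k
            return np.concatenate([d, s, self.attributes])

    def build_graph(objects, contact_threshold=0.1):
        """Create two directed edges (sender/receiver) per close-by object pair."""
        edges = []
        for i in range(len(objects)):
            for j in range(len(objects)):
                dist = np.linalg.norm(objects[i].pose[:2] - objects[j].pose[:2])
                if i != j and dist < contact_threshold:
                    edges.append(RelationEdge(i, j, np.zeros(4)))
        return objects, edges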
Representation of Robot: We propose representing the end-effector of the robot as a part of the graph. For this, a robot flag and a control vector that shows how the end-effector will move in the next step are used.
Leveraging Graph Representation: For this work, our representation covers cylinders, cuboids, and objects that can be represented as a combination of the two. Objects in the scene are represented by their shape, state, and other object features such as mass. The shapes of objects are represented by their dimensions (the radius for a cylinder and the edge lengths for a cuboid) and their orientations. Orientations of objects are represented with the vector [cos(θ), sin(θ)] for 2D cases and with quaternions for 3D cases. Unlike previous work [66]–[69], the system has no prior information about how joints behave, and the articulation dynamics are left for the network to learn.

B. Physics Prediction
Propagation Network: We use a propagation network as the basis for learning object- and relation-centric representations. In this network, the state of each object and the relations between objects are first encoded separately. This step is shown in Figure 2 (Encoding Step). The encoding is performed by the encoders f^{enc}_R and f^{enc}_O, where the former processes the relation features r_{k,t} and the latter processes the object features o_{i,t}; c^r_{k,t} and c^o_{i,t} are the latent encodings of the relations and the objects:

    c^{r}_{k,t} = f^{enc}_{R}(r_{k,t}), \quad k = 1 \ldots N_r    (1)
    c^{o}_{i,t} = f^{enc}_{O}(o_{i,t}), \quad i = 1 \ldots N_o    (2)

Next, the network incorporates interactions between objects and the propagation of these interactions to non-neighboring objects (e.g., force transmission between non-contacting objects) into the object and relation latent vectors. This step is shown in Figure 2 (Propagation Step). For this, c^r_{k,t} and c^o_{i,t} are passed to the propagator functions f^l_R and f^l_O, respectively, to estimate the propagation latent vectors e^l_{k,t} for relation k and p^l_{i,t} for object i, for each propagation step l at time t. Applying these functions over subsequent propagation steps allows nodes and edges to accumulate, in e^l_{k,t} and p^l_{i,t}, information propagated from the nodes and edges connected to them:

    e^{l}_{k,t} = f^{l}_{R}\big(c^{r}_{k,t},\, p^{l-1}_{i,t},\, p^{l-1}_{j,t}\big), \quad k = 1 \ldots N_r    (3)
    p^{l}_{i,t} = f^{l}_{O}\Big(c^{o}_{i,t},\, p^{l-1}_{i,t},\, \sum_{k \in \mathcal{N}_i} e^{l-1}_{k,t}\Big), \quad i = 1 \ldots N_o    (4)

where \mathcal{N}_i stands for the set of relations that object i is part of.

Effect propagation allows the network to pass information between non-connected objects, and it benefits our framework in two important ways. First, it allows force transmission when the robot pushes objects towards one another, effectively pushing both objects while contacting only one of them. Second, it allows mass and friction feedback between objects (e.g., when a light object is pushed towards a heavy object, the light object will not move the heavy object in the push direction; instead, its own motion will be deflected to the side). Figure 3 shows a simple illustration of how the robot initiates a chain of interactions and how the force applied by the robot end-effector propagates. In the initial propagation step, the force that emerges from the motion of the robot is passed to the contacted objects, and in the second propagation step, this force propagates to objects that are not directly interacted with. The number of subsequent propagation steps to apply can be chosen based on the difficulty of the task.

The resulting e^l_{k,t} and p^l_{i,t} represent the objects and their relations in the graph well and can be further passed to other networks for physics prediction and belief regulation.
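A minimal PyTorch sketch of one propagation pass implementing Eqs. (3)-(4) is given below; the MLP sizes follow those reported in Section IV-B, but the exact module layout is our assumption rather than the trained architecture.

    import torch
    import torch.nn as nn

    class PropagationStep(nn.Module):
        """One message-passing pass over the scene graph (Eqs. (3)-(4))."""
        def __init__(self, c_dim=256, p_dim=256):
            super().__init__()
            # f_R^l and f_O^l: single-hidden-layer MLPs, as in Section IV-B.
            self.f_R = nn.Sequential(nn.Linear(c_dim + 2 * p_dim, 256), nn.ReLU(),
                                     nn.Linear(256, p_dim))
            self.f_O = nn.Sequential(nn.Linear(c_dim + 2 * p_dim, 256), nn.ReLU(),
                                     nn.Linear(256, p_dim))

        def forward(self, c_r, c_o, e_prev, p_prev, senders, receivers):
            # Eq. (3): update each edge from its encoding and its endpoint latents.
            e = self.f_R(torch.cat([c_r, p_prev[senders], p_prev[receivers]], dim=-1))
            # Eq. (4): sum the previous-step edge latents arriving at each node.
            agg = torch.zeros_like(p_prev).index_add_(0, receivers, e_prev)
            p = self.f_O(torch.cat([c_o, p_prev, agg], dim=-1))
            return e, p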
Physics Prediction: For each object, the latent vector p^l_{i,t} can be used to predict the next state of the object, x_{i,t+1}. Given the states of the objects at time t, our framework can predict the trajectory rollout of the objects between time t and t + T by chaining its estimates, using each prediction as an input for estimating the subsequent states of the objects.
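The chaining can be sketched as follows; the model interface and the unit integration step are illustrative assumptions, not the authors' API.

    def rollout(model, graph, robot_controls, horizon):
        """Chain one-step effect predictions into a trajectory roll-out."""
        trajectory = []
        for t in range(horizon):
            q_dot = model(graph, robot_controls[t])      # predicted object velocities
            for obj, v in zip(graph.objects, q_dot):
                obj.velocity = v
                obj.pose = obj.pose + v                  # integrate one time step
            trajectory.append([obj.pose.copy() for obj in graph.objects])
        return trajectory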
C. Belief Regulation

Temporal Propagation Network: We propose a temporal propagation network to estimate and correct object and relation properties over time. The propagation network is augmented with long short-term memory (LSTM) networks to regulate object and relation beliefs. An illustration of the network is shown in Figure 2 (Temporal Propagation Network). In the temporal propagation network, the sequences of propagation latent vectors p^L_{i,t} and e^L_{k,t} are passed to the LSTM-based encoder functions f^{blf}_O and f^{blf}_R. In this way, the temporal propagation network estimates and corrects object and relation properties by considering their overall state history during the robot execution:

    o'_{i,t} = f^{blf}_{O}\big(p^{L}_{i,t},\, o'_{i,t-1}\big), \quad i = 1 \ldots N_o    (5)
    r'_{k,t} = f^{blf}_{R}\big(e^{L}_{k,t},\, r'_{k,t-1}\big), \quad k = 1 \ldots N_r    (6)
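A sketch of the LSTM-based belief heads of Eqs. (5)-(6), with the mass and joint prediction heads described in Section IV-B; the layer sizes match the reported 256 units, while the exact head layout is our assumption.

    import torch.nn as nn

    class BeliefRegulator(nn.Module):
        """LSTM encoders for object/relation beliefs (Eqs. (5)-(6))."""
        def __init__(self, p_dim=256, hidden=256, n_joint_types=4):
            super().__init__()
            self.f_blf_O = nn.LSTM(p_dim, hidden, batch_first=True)   # objects
            self.f_blf_R = nn.LSTM(p_dim, hidden, batch_first=True)   # relations
            self.mass_head = nn.Linear(hidden, 1)                     # object mass
            self.joint_head = nn.Linear(hidden, n_joint_types)        # joint logits

        def forward(self, p_seq, e_seq):
            # p_seq: (N_o, T, p_dim) object latents; e_seq: (N_r, T, p_dim) edge latents.
            o_belief, _ = self.f_blf_O(p_seq)
            r_belief, _ = self.f_blf_R(e_seq)
            return self.mass_head(o_belief), self.joint_head(r_belief)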
Belief Regulation: The belief regulation module can continuously regulate the beliefs regarding object and relation states (o_{i,t} and r_{k,t}). These beliefs can then be used in physics prediction to compensate for errors that arise from unknown or partial information regarding the scene. This allows our network to close the gap between its physics predictions and reality.
Weight Sharing: After training the propagation network for physics prediction, the learned weights can be reused in belief regulation, preventing the framework from having to learn two separate networks. This decreases the number of parameters by about thirty percent. As we show in our experimental analysis, the representation learned for physics prediction represents the environment well and can be used in transfer learning without affecting system performance. (In more complex tasks, fine-tuning the propagation network may be required, but physics prediction pre-training will still hasten the learning process.)
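The two-step scheme can be expressed in a few lines; propnet and belief_regulator are placeholder names, and the optimizer settings follow Section IV-B.

    import torch

    def freeze_and_prepare(propnet, belief_regulator):
        """Freeze the shared propagation weights; optimize only the belief heads."""
        for param in propnet.parameters():
            param.requires_grad = False
        return torch.optim.Adam(belief_regulator.parameters(),
                                lr=3e-4, amsgrad=True)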
IV. EXPERIMENTAL SETUPS

In this section, we explain the details of the experimental setups that are designed to evaluate how our model can be used for predicting object properties, relations between objects, and future object trajectories.
A. Robotic Setup
Experiments are conducted with a 6-DoF UR10 robot arm with a cylinder-shaped object attached to its end-effector, both in simulation and in the real world. For the simulation experiments, CoppeliaSim [72] with the PyRep toolkit [73] is used. For demonstrating the prediction capacity of our framework, two different object setups, namely the Multiple Parts Setup and the Different Masses Setup, are defined. The former setup includes a diverse set of interactions in the form of joints and is designed with the aim of showing the full capacity of our framework. As the physical effects of object parameters are limited in the former setup, the latter is designed with the aim of showing the performance of our framework in setups where effect variation results from object parameters. In these setups, edges between objects are dynamically created as objects approach each other.
Fig. 3. This illustration shows how the graph of the scene is constructed and how the force emerging from the robot end-effector motion is passed to faraway objects. After graph construction, each node holds the state information of its corresponding object, including the robot. In the first propagation step, the state information of the robot is passed to the nodes of objects that contact the robot end-effector. In the second propagation step, this state information is passed, via the nodes of the objects that the robot initially contacts, to the nodes of non-contacted objects.

TABLE I
EXPLANATIONS OF THE JOINT TYPES AND THEIR EFFECTS
No joint: The objects would move independently of each other as they are separated by the gripper.

Fixed joint: The objects would move together with the end-effector of the robot.

Prismatic joint: The object below would move in a linear line along the direction from the above object to the below object.

Revolute joint: Both of the objects would move, but as the end-effector mainly contacts the object below, the robot will rotate the object below around the object above.

Note: Objects and the robot are shown with single-edged and double-edged circles, respectively, and the lines between objects represent different joint types. The arrow shows how the robot end-effector will move.
As the robot interacts with the objects in the environment, only a certain subset of objects will be in the same sub-graph as the robot (this can be seen in the graph view of Figure 3); accordingly, this allows the system to encounter sub-graphs with different numbers of objects and relations.
Multiple Parts Setup:
This setup consists of a group of articulated objects whose dynamics our framework should learn, including cylinders and cuboids with complex spatial relations between them. The objects may be connected to each other through three different joint relation types, namely fixed, revolute, and prismatic joints, or they may have no joint connections between them (no-joint). The illustration of these joint relations and their explanations are given in Table I.
Different Masses Setup: This setup consists of differently massed cylindrical objects, where the masses of the objects have an effect on their future motion. From the motion trajectories of the objects, our framework should be able to predict their masses. The masses are sampled from three disjoint intervals representing light, normal, and heavy objects, respectively.

For both of these setups, we generated datasets containing 30,000 training and 1000 validation trajectories with 9 objects. Since it is hard to exactly tune the end-effector velocity to match the real world, the end-effector velocity of the robot is varied between trajectories so that the network can generalize to different values. For testing the generalization capacity of the network to a changing number of objects, we used test sets consisting of 9, 6, and 12 objects, each with 1000 trajectories.

B. Implementation Details
Generation of Graph: For each object in the scene and each close-by object pair, a node and two directed edges (a receiver and a sender) are created. To make the system position and orientation invariant, object positions and orientations are not included in the node features. Instead, for each object-object relation, the pose of the object on the sender side of the relation is encoded with respect to the frame of the object on the receiver side of the relation. After the motion of an object is predicted in its own frame, it is transformed back to the global frame.
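For the 2D case, this relative encoding can be sketched as follows; poses are assumed to be (x, y, θ), and the exact feature layout is our assumption.

    import numpy as np

    def relative_pose_2d(sender_pose, receiver_pose):
        """Express the sender's 2D pose in the receiver's frame."""
        dx, dy = sender_pose[:2] - receiver_pose[:2]
        c, s = np.cos(-receiver_pose[2]), np.sin(-receiver_pose[2])
        local = np.array([c * dx - s * dy, s * dx + c * dy])   # rotate into frame
        rel_theta = sender_pose[2] - receiver_pose[2]
        return np.array([local[0], local[1], np.cos(rel_theta), np.sin(rel_theta)])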
Network information: f^{enc}_O is an MLP with two 256-dim hidden layers, and f^{enc}_R is an MLP with three 256-dim hidden layers. f^l_O and f^l_R are MLPs with a single 256-dim hidden layer. f^l_O and f^l_R are chosen to have a low number of layers since these networks are called multiple times in succession and are therefore more costly to use than f^{enc}_O and f^{enc}_R. Finally, f^{blf}_O and f^{blf}_R are LSTMs with 256 neurons. For physics prediction, the outputs of f^l_O are given to an MLP with one hidden layer and one linear layer to predict the velocity (q̇_i) of each object; for belief regulation, the outputs of f^{blf}_O and f^{blf}_R are given to an MLP with a single linear layer to predict object masses and joint relations.

In the belief regulation module, as more interaction experience is acquired, the framework is expected to have higher accuracy in identifying the initially unknown parameters of the environment. For this reason, the loss function is scaled such that later time-steps have a higher loss weight than earlier time-steps. Besides, to make the network predictions smooth and prevent them from oscillating between different outcomes, the outputs of f^{blf}_O and f^{blf}_R are regularized by applying an MSE loss between the latent vectors of successive time-steps.

The network is trained with a batch size of 16 and a learning rate of 3e-4 using the Adam optimizer [74] with AMSGrad [75]. The learning rate is reduced by a factor of 0.8 when the validation error stops decreasing for a window of 20 epochs. Networks are trained for 1000 epochs. The physics prediction module is trained with epochs of 10,000 batches of randomly sampled time-steps, and for training the belief regulation module, 200 batches of randomly sampled trajectories from the training scenes are used.

First, our network is trained on physics prediction. After this training is complete, the weights of the shared part of the network are frozen, and then the belief regulation module is trained. To increase the performance of physics prediction, we use scheduled sampling [31]. Using an Nvidia P100 GPU, the physics prediction and belief regulation modules are trained for two days and one day, respectively.
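A sketch of scheduled sampling [31] over one training trajectory is shown below; the linear schedule is an illustrative choice, not necessarily the one used in this work.

    import random

    def rollout_training_loss(model, states_gt, controls, loss_fn, epoch, total_epochs):
        """With probability epoch/total_epochs, feed the model its own previous
        prediction instead of the ground-truth state (scheduled sampling)."""
        loss, prev_pred = 0.0, None
        for t in range(len(controls)):
            teacher_force = prev_pred is None or random.random() > epoch / total_epochs
            state_in = states_gt[t] if teacher_force else prev_pred
            prev_pred = model(state_in, controls[t])
            loss = loss + loss_fn(prev_pred, states_gt[t + 1])
        return loss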
V. RESULTS

For quantitative analysis, our framework is evaluated in joint prediction and mass prediction tasks. For the relation prediction case, our results are compared with PropNets using three different relation assignment strategies:
1) Oracle: This relation assignment strategy utilizes ground-truth relations. In the ideal case, as more interactions are observed, the performance of our framework should approach the oracle.
2) No-Joint: This relation assignment strategy assumes there are no joints in the scene.
3) All-Fixed: This relation assignment strategy assumes a fixed joint between every contacting object pair.
A. Quantitative Analysis in Multiple Parts Setup
For evaluating the physics prediction module, our framework is tested with the oracle relation assignment strategy in the multiple parts setup. In this setup, while collecting each trajectory, the robot executes 9 linear pushes of 30 cm, contacting a most diverse set of objects. The outputs of the physics prediction module are chained to predict multi-time-step trajectory roll-outs (i.e., essentially simulating the environment with network predictions), and these trajectory roll-outs are used in the evaluation. Figure 4 presents the performance in scenarios with different numbers of objects. As the length of the predicted trajectory roll-outs increases, the errors accumulate in a larger number of trajectories. This causes trajectories to drift away from the ground truth. In Figure 4, on the left, where the roll-out length is shorter, more than 600 trajectories in each environment setup have a lower mean error than . cm, and most of the remaining trajectories have a lower mean error than . cm. On the right, where the roll-out length is longer, fewer than 400 trajectories have a lower mean error than . cm, and for the rest of the trajectories, more trajectories fall in the high mean-error bins.

Fig. 4. Physics prediction results on articulated object environments.

Fig. 5. Belief regulation results on articulated object environments.

Next, the belief regulation module is evaluated on the prediction of joint relations. As shown in Figure 5, as the robot interacts with the objects and more observation data is acquired, our network becomes better at predicting the joint relation types accurately. The joint prediction plot in Figure 5 shows that our method performs similarly independent of the number of objects used, due to the underlying graph structure. In the same figure, on the precision plot, the no-joint (blue) and prismatic-joint (green) lines show that the network is good at identifying whether there is a joint between two objects and whether this joint is prismatic. Compared to the prismatic joint, the model is more likely to make erroneous predictions on whether a joint is fixed or revolute. This is likely because, without interaction experience, it is easier for the network to mix up these two joint types. Nonetheless, from the recall plot, we can see that the model corrects its predictions on fixed and revolute joints as it observes more robot interactions. From both the precision and recall plots, the model abstains from predicting a joint as prismatic unless it is certain. This may be because prismatic joint dynamics are similar to no-joint dynamics unless the robot gains enough observations about the objects the joint connects.

Fig. 6. Results of the coupled system on articulated object environments.

Fig. 7. Belief regulation results on mass prediction. As more motion is observed in the scene, the mass prediction error decreases, but it eventually converges to about . kg mean error.

Finally, the coupled results of the physics prediction and belief regulation modules can be seen in Figure 6. The lines show the mean errors, and the shaded regions show the standard error. As expected, physics prediction with the no-joint and all-fixed relation assignment strategies performed poorly. This is because these relation assignment strategies do not learn from interactions. As the number of observed time-steps increases, the mean error of the coupled modules decreases, and eventually, within 40 time-steps, it reaches the mean error of the physics prediction of the oracle system that has access to ground-truth joint relations.

B. Quantitative Analysis of Belief Regulation for Mass Prediction
We designed the different masses experimental setup for further testing the object-centric prediction capacity of our framework. In this setup, in each trajectory, the robot executes a total of 3 linear pushes of 30 cm, scattering the objects as much as possible. In this experiment, our framework should predict object masses, and as the robot acquires more observations, it should improve its mass prediction accuracy further. The mean errors for mass prediction are shown in Figure 7. Considering the distribution of masses, our model manages to decrease the mass errors over time as it acquires more observations. However, the predictions do not seem to go below a certain value. This may be because the robot has limited interaction with the objects in the scene, which limits the capacity of the model to predict the masses of objects correctly.

To further analyze the performance of our system in mass prediction, we prepared two controlled environment test setups to examine why the mass error does not decrease below a certain value. These setups can be seen in Figure 8.
Fig. 8. Visualization of the controlled environment setups for mass prediction. In these configurations, object masses are changed between different runs while keeping the robot motion and object shapes the same.

Fig. 9. Mass prediction results in controlled environments. In many cases, our model achieves low error; however, there are still many cases with high error.

In these setups, we only change the masses of the objects while keeping the robot action, the initial positions of the objects, and the shapes of the objects the same. The robot manipulates each object, so it should be possible for the network to predict the mass if it is predictable. The results obtained in these controlled settings are provided in Figure 9. Considering the mass distribution of the objects, the first two bars of both plots show that our framework predicts light and medium objects within their cluster correctly half of the time. The third and fourth bins show that our framework sometimes mixes light and medium objects, and medium and heavy objects. For three objects, the fifth bin shows that our framework mixes light and heavy objects in rare cases. A number of representative correct and incorrect predictions are provided in Figure 10. We investigated the setups where the network made high-error mass predictions and observed that there are cases where different object mass configurations produce the same object motions. In Figure 10C and Figure 10D, the robot observes very similar trajectories, with a . cm difference between them, despite the interacted objects having very different masses. In these scenes, the network makes very similar predictions; however, only in the former scene is it correct.

C. Qualitative Analysis - Tool Usage
We designed a tool manipulation and planning experiment. Given a goal position, the aim is to select the best tool and action sequence to bring a given object to the goal position using the corresponding tool. In addition, this experiment aims to show the generalization capacity of our framework by transferring the representation and the network trained in the multiple parts setup to model novel tools that are not encountered in the training distribution. Videos of the results are available on the project page.
Fig. 10. Mass predictions for two very close observations. The same observations are acquired from scenes with two different mass configurations, and our framework could not differentiate between the two. Our framework makes the same mass prediction for both; one of them is correctly predicted, while the other is not.

Fig. 11. Tools used in the tool selection and planning experiments.
In this experiment, a stick, an L-shaped tool, an inv-L-shaped tool, and their various configurations are used, as shown in Figure 11. These tools are represented as multi-part objects composed of cuboids and fixed joints, and they are attached to the robot end-effector. The robot uses linear pushes in the principal directions to manipulate the object on the table. In these actions, the tool motion is modeled kinematically and is not updated from the network prediction. Please note that a new network is not trained; the results obtained by the previously trained network are reported.

Fig. 12. Tool usage results. As the robot is allowed to use a wider variety of tools, the success rate increases and the error amount decreases.

Fig. 13. Snapshots of 6D effect prediction. The ground-truth pose of the object is shown with a transparent cuboid. As can be seen, the prediction is very close to the ground truth.

Fig. 14. Snapshots of a robot interaction in the real world. Our framework continuously updates its joint predictions as it observes the motion of objects and predicts their future positions.

In each test case, the robot should select one of the available tools and apply three pushes of 20 cm in the principal directions to move an object to a given goal position. To make all test cases feasible, goal points are generated through simulation. More specifically, 24 uniform initial positions are generated from − . ≤ x ≤ − . and − . ≤ y ≤ . . Then, at each initial position, a cylindrical object is generated, and all possible action sequences are applied using each of the tools. The final positions of the objects are recorded. These final positions are then filtered: if a final position of an object is less than 5 cm away from its initial position, it is removed. Besides, if the difference between any two initial and final position pairs is lower than 5 cm, one of them is removed as well. In this way, a dataset for tool and action selection that contains 166 completely diverse solvable initial and final position pairs is generated.

The task is defined as the selection of the best tool and best action sequence from all possible tools and action sequences. The network is run for all initial-target position pairs for each possible tool and action sequence. For each of these pairs, the tool and action sequence that gives the lowest mean error is selected (a code sketch of this selection loop is given at the end of this subsection). Besides, for comparison, to see whether our framework can utilize each of the tools, the best action sequences for each individual tool are found as well. Then, each solution is transferred to simulation to test its correctness.

The results can be seen in Figure 12. The left column shows the prediction errors of the selected action sequences, and the right column shows the actual errors of the selected actions when they are run in simulation. The first row shows the mean error between the final positions of the manipulated objects and the goal positions. The second row shows the number of successful action sequences (i.e., action sequences where the final position of the object is less than 5 cm away from its target position). Each bar corresponds to the result for action selection with a particular tool, and in the last one, the tool can be selected as well. From the figure, it can be seen that our framework managed to utilize each tool for solving about 40 of the tasks, and when all tools are allowed to be used, about 130 of the tasks are solvable. Comparing the prediction and simulation results shows that the predictions made by our framework are plausible, and there is only a marginal loss of performance when the found action sequences are transferred to simulation. Our framework is successful in tool manipulation and action selection despite not being designed for such a task.
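The exhaustive selection described above can be sketched as follows; attach_tool is a hypothetical helper, and rollout refers to the chaining sketch in Section III-B.

    import numpy as np

    def select_tool_and_actions(model, scene, tools, action_sequences, goal, obj_idx):
        """Score every (tool, action sequence) pair with the learned model and
        return the pair whose predicted final position of object obj_idx is
        closest to the goal."""
        best, best_err = None, float("inf")
        for tool in tools:
            graph = attach_tool(scene, tool)   # tool = cuboids joined by fixed joints
            for actions in action_sequences:
                final = rollout(model, graph, actions, horizon=len(actions))[-1]
                err = np.linalg.norm(final[obj_idx][:2] - goal)
                if err < best_err:
                    best, best_err = (tool, actions), err
        return best, best_err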
D. Qualitative Analysis in Simulation - 6D Motion Prediction
Finally, we designed an experimental setup in which we can test our framework on 6D rigid body motion prediction. In this setup, the robot is tasked with levering up a printed circuit board (PCB) from a hard disk drive (HDD) with a screwdriver tool. The PCB is on top of the HDD, and at each side of the HDD, there may be a ledge that the PCB may contact while being levered up. The PCB and HDD are represented as a set of boxes, and their sizes change between runs. Note that some sides of the HDD may have no ledge in different scenes; therefore, while representing a scene as a graph, the number of nodes changes between runs.

For scene generation, the lengths of both sides of the HDD are set to cm. At each side of the HDD, there is either a ledge of size between to cm, or no ledge. In the middle of the HDD, a PCB with side lengths between to cm is generated. The network is trained using 500 lever-up interactions on scenes with 125 different procedurally generated hard disks (one lever-up action from each side of the HDD).

A sample prediction can be seen in Figure 13. Further results on this setup can be found on the project page. In this setup, our network makes plausible predictions that match well with the ground truth.

E. Analysis of Our Framework in the Real World
In this section, our framework is evaluated with a real-world dataset presented in [30]. In this dataset, a UR10 robot arm holds a hammer and uses it to push objects. The dataset contains cylinder-shaped objects and possible fixed joints between them. The effect of a fixed joint between objects is mimicked by placing customized cardboards under them. A sample created scene and how the robot manipulates the objects can be seen in Figure 14. As the dataset does
Fig. 15. Average errors (in cm) in the real world as the robot makes its first contact with the objects.

Fig. 16. Average errors (in cm) in the real world as our framework acquires more object tracking information.

not have angle information, our network is retrained with the angles of the cylinders removed (unlike [30], we do not retrain our network with only cylindrical objects and fixed joints; we only remove the angle information of the cylindrical objects). Since it is also possible for our network to predict revolute or prismatic joints, predictions are limited to the no-joint and fixed-joint relations (by selecting the joint relation with the maximum probability between the no-joint and fixed-joint relations).

The dataset contains scenes with 2 to 5 cylindrical objects and 1 to 3 fixed joint relations between them. In total, there are 102 different test setups in the dataset. On average, objects move . cm, and our physics prediction network achieves . cm error in predicting final object positions, where [30] achieved . cm in the same test. Our coupled framework is further analyzed with the same dataset in Figure 15. Similar to [30], we tested our framework at the exact time-steps where the first contact between the robot and the objects occurs. Our network manages to acquire better results than the one in [30] for both physics prediction with ground-truth relations and with predicted relations (in [30], prediction with ground-truth and predicted relations achieves . cm and . cm at time t, and cm and . cm at time t + 4). In Figure 16, the performance of our framework at different time-steps is shown, where the predictions of our framework catch up to the ground truth as more observations are acquired.

VI. CONCLUSION
We presented methods and a framework for learning action effects in object- and relation-centric push manipulation tasks. Our framework allows the robot to correct its belief about object and relation parameters as it interacts with the scene and observes the effects of its actions. It can then continuously predict the future dynamics of the objects. We tested the belief regulation and physics prediction performance in multiple experiments, including a real-world one. We showed that our framework can predict joint types in articulated object settings with different object and relation types, the masses of objects, and their future motion. We showed that our framework can be extended to 6D trajectory prediction. Furthermore, we also validated our framework on action selection in a tool manipulation task. Although we did not train a new network that includes situations that are not present in our articulated object setting, our network was successfully transferred to this new domain and succeeded in finding action sequences that complete the given tasks.

As our framework is very generic, we believe it can be further refined and extended. First, our framework can benefit from intelligent exploration strategies that generalize to a changing number of objects. In addition, learning unsupervised representations for objects via interactions can be very powerful for the visual grounding of objects. In future work, we plan to extend our framework with these adaptations.

REFERENCES

[1] F. Ruggiero, V. Lippiello, and B. Siciliano, “Nonprehensile dynamic manipulation: A survey,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1711–1718, 2018.
IEEE Robotics and Automation Letters , vol. 3,no. 3, pp. 1711–1718, 2018.[2] J. St¨uber, C. Zito, and R. Stolkin, “Let’s push things forward: A surveyon robot pushing,”
Frontiers in Robotics and AI , vol. 7, p. 8, 2020.[3] J. St¨uber, M. Kopicki, and C. Zito, “Feature-based transfer learning forrobotic push manipulation,” in . IEEE, 2018, pp. 1–5.[4] M. R. Dogar and S. S. Srinivasa, “Push-grasping with dexterous hands:Mechanics and a method,” in . IEEE, 2010, pp. 2123–2130.[5] J. E. King, M. Klingensmith, C. M. Dellin, M. R. Dogar, P. Velagapudi,N. S. Pollard, and S. S. Srinivasa, “Pregrasp manipulation as trajectoryoptimization.” in
Robotics: Science and Systems . Berlin, 2013.[6] F. Paus, T. Huang, and T. Asfour, “Predicting pushing action effects onspatial object relations by learning internal prediction models,” in .IEEE, 2020, pp. 10 584–10 590.[7] T. Meric¸li, M. Veloso, and H. L. Akın, “Push-manipulation of complexpassive mobile objects using experimentally acquired motion models,”
Autonomous Robots , vol. 38, no. 3, pp. 317–329, 2015.[8] H. Van Hoof, O. Kroemer, H. B. Amor, and J. Peters, “Maximally in-formative interaction learning for scene exploration,” in . IEEE,2012, pp. 5152–5158.[9] A. Eitel, N. Hauff, and W. Burgard, “Learning to singulate objects usinga push proposal network,” in
Robotics Research . Springer, 2020, pp.405–419.[10] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser,“Learning synergies between pushing and grasping with self-superviseddeep reinforcement learning,” in . IEEE, 2018, pp. 4238–4245.[11] D. Omrˇcen, C. B¨oge, T. Asfour, A. Ude, and R. Dillmann, “Autonomousacquisition of pushing actions to support object grasping with a hu-manoid robot,” in . IEEE, 2009, pp. 277–283.[12] D. Kappler, L. Y. Chang, N. S. Pollard, T. Asfour, and R. Dillmann,“Templates for pre-grasp sliding interactions,”
Robotics and AutonomousSystems , vol. 60, no. 3, pp. 411–423, 2012.[13] S. Elliott, M. Valente, and M. Cakmak, “Making objects graspablein confined environments through push and pull manipulation with atool,” in . IEEE, 2016, pp. 4851–4858. [14] K.-T. Yu, M. Bauza, N. Fazeli, and A. Rodriguez, “More than amillion ways to be pushed. a high-fidelity experimental dataset of planarpushing,” in . IEEE, 2016, pp. 30–37.[15] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning forphysical interaction through video prediction,” in
Advances in neuralinformation processing systems , 2016, pp. 64–72.[16] C. Finn and S. Levine, “Deep visual foresight for planning robotmotion,” in , 2017, pp. 2786–2793.[17] A. Byravan and D. Fox, “SE3-Nets: Learning rigid body motion usingdeep neural networks,” in
International Conference on Robotics andAutomation , 2017, pp. 173–180.[18] I. Nematollahi, O. Mees, L. Hermann, and W. Burgard, “Hindsightfor foresight: Unsupervised structured dynamics models from physicalinteraction,” arXiv preprint arXiv:2008.00456 , 2020.[19] E. S. Spelke, K. Breinlinger, J. Macomber, and K. Jacobson, “Originsof knowledge.”
Psychological review , vol. 99, no. 4, p. 605, 1992.[20] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman, “Howto grow a mind: Statistics, structure, and abstraction,” science , vol. 331,no. 6022, pp. 1279–1285, 2011.[21] D. Mrowca, C. Zhuang, E. Wang, N. Haber, L. F. Fei-Fei, J. Tenenbaum,and D. L. Yamins, “Flexible neural representation for physics predic-tion,” in
Advances in neural information processing systems , 2018, pp.8813–8824.[22] P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez, V. Zambaldi,M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner,C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani,K. Allen, C. Nash, V. J. Langston, C. Dyer, N. Heess, D. Wierstra,P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu, “Relationalinductive biases, deep learning, and graph networks,” arXiv , 2018.[Online]. Available: https://arxiv.org/pdf/1806.01261.pdf[23] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al. , “Interactionnetworks for learning about objects, relations and physics,” in
Advancesin neural information processing systems , 2016, pp. 4502–4510.[24] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum, “Acompositional object-based approach to learning physical dynamics,” arXiv preprint arXiv:1612.00341 , 2016.[25] Y. Li, J. Wu, J.-Y. Zhu, J. B. Tenenbaum, A. Torralba, and R. Tedrake,“Propagation networks for model-based control under partial observa-tion,” in
International Conference on Robotics and Automation , 2019,pp. 1205–1211.[26] Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba, “Learningparticle dynamics for manipulating rigid bodies, deformable objects, andfluids,” in
International Conference on Learning Representations , 2019.[27] N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, and A. Tac-chetti, “Visual interaction networks: Learning a physics simulator fromvideo,” in
Advances in neural information processing systems , 2017, pp.4539–4547.[28] S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber, “Relationalneural expectation maximization: Unsupervised discovery of objects andtheir interactions,” in
International Conference on Learning Represen-tations , 2018.[29] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec,and P. Battaglia, “Learning to simulate complex physics with graphnetworks,” in
International Conference on Machine Learning . PMLR,2020, pp. 8459–8468.[30] A. E. Tekden, A. Erdem, E. Erdem, M. Imre, M. Y. Seker, and E. Ugur,“Belief regulated dual propagation nets for learning action effects ongroups of articulated objects,” in . IEEE, 2020, pp. 10 556–10 562.[31] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled samplingfor sequence prediction with recurrent neural networks,” in
[31] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
[32] J. R. Kubricht, K. J. Holyoak, and H. Lu, “Intuitive physics: Current research and controversies,” Trends in Cognitive Sciences, vol. 21, no. 10, pp. 749–759, 2017.
[33] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum, “Simulation as an engine of physical scene understanding,” Proceedings of the National Academy of Sciences, vol. 110, no. 45, pp. 18327–18332, 2013.
[34] J. B. Hamrick, P. W. Battaglia, T. L. Griffiths, and J. B. Tenenbaum, “Inferring mass in complex scenes by mental simulation,” Cognition, vol. 157, pp. 61–76, 2016.
[35] K. Smith, L. Mei, S. Yao, J. Wu, E. Spelke, J. Tenenbaum, and T. Ullman, “Modeling expectation violation in intuitive physics with coarse probabilistic object representations,” in Advances in Neural Information Processing Systems, 2019, pp. 8983–8993.
[36] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 465–472.
[37] A. Lerer, S. Gross, and R. Fergus, “Learning physical intuition of block towers by example,” in International Conference on Machine Learning, 2016, pp. 430–438.
[38] O. Groth, F. B. Fuchs, I. Posner, and A. Vedaldi, “ShapeStacks: Learning vision-based physical intuition for generalised object stacking,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 702–717.
[39] W. Li, S. Azimi, A. Leonardis, and M. Fritz, “To fall or not to fall: A visual approach to physical stability prediction,” arXiv preprint arXiv:1604.00066, 2016.
[40] R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi, “Newtonian scene understanding: Unfolding the dynamics of objects in static images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3521–3529.
[41] R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi, “‘What happens if...’: Learning to predict the effect of forces in images,” in European Conference on Computer Vision. Springer, 2016, pp. 269–285.
[42] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, “Learning visual predictive models of physics for playing billiards,” in International Conference on Learning Representations, 2016.
[43] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[44] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference for interacting systems,” in International Conference on Machine Learning. PMLR, 2018, pp. 2688–2697.
[45] Y. Ye, M. Singh, A. Gupta, and S. Tulsiani, “Compositional video prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 10353–10362.
[46] F. R. Hogan and A. Rodriguez, “Feedback control of the pusher-slider system: A story of hybrid and underactuated contact dynamics,” in Algorithmic Foundations of Robotics XII. Springer, 2020, pp. 800–815.
[47] J. Zhou, M. T. Mason, R. Paolini, and D. Bagnell, “A convex polynomial model for planar sliding mechanics: Theory, application, and experimental validation,” The International Journal of Robotics Research, vol. 37, no. 2-3, pp. 249–265, 2018.
[48] A. Kloss, S. Schaal, and J. Bohg, “Combining learned and analytical models for predicting action effects,” arXiv preprint arXiv:1710.04102, 2017.
[49] J. King, J. A. Haustein, S. S. Srinivasa, and T. Asfour, “Nonprehensile whole arm rearrangement planning with physics manifolds,” in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 2508–2515.
[50] J. A. Haustein, J. King, S. S. Srinivasa, and T. Asfour, “Kinodynamic randomized rearrangement planning via dynamic transitions between statically stable states,” in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 3075–3082.
[51] M. Kopicki, S. Zurek, R. Stolkin, T. Mörwald, and J. Wyatt, “Learning to predict how rigid objects behave under simple manipulation,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 5722–5729.
[52] M. Kopicki, S. Zurek, R. Stolkin, T. Moerwald, and J. L. Wyatt, “Learning modular and transferable forward models of the motions of push manipulated objects,” Autonomous Robots, vol. 41, no. 5, pp. 1061–1082, 2017.
[53] M. Y. Seker, A. E. Tekden, and E. Ugur, “Deep effect trajectory prediction in robot manipulation,” Robotics and Autonomous Systems, 2019.
[54] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” in Advances in Neural Information Processing Systems, 2016, pp. 5074–5082.
[55] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[56] M. Janner, S. Levine, W. T. Freeman, J. B. Tenenbaum, C. Finn, and J. Wu, “Reasoning about physical interactions with object-centric models,” in International Conference on Learning Representations, 2019.
[57] Y. Ye, D. Gandhi, A. Gupta, and S. Tulsiani, “Object-centric forward modeling for model predictive control,” in Conference on Robot Learning. PMLR, 2020, pp. 100–109.
[58] H.-Y. F. Tung, Z. Xian, M. Prabhudesai, S. Lal, and K. Fragkiadaki, “3D-OES: Viewpoint-invariant object-factorized environment simulators,” arXiv preprint arXiv:2011.06464, 2020.
[59] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia, “Graph networks as learnable physics engines for inference and control,” in International Conference on Machine Learning. PMLR, 2018, pp. 4470–4479.
[60] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, “Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,” in Advances in Neural Information Processing Systems, 2015, pp. 127–135.
[61] D. Zheng, V. Luo, J. Wu, and J. B. Tenenbaum, “Unsupervised learning of latent physical properties using perception-prediction networks,” arXiv preprint arXiv:1807.09244, 2018.
[62] J. Bohg, K. Hausman, B. Sankaran, O. Brock, D. Kragic, S. Schaal, and G. S. Sukhatme, “Interactive perception: Leveraging action in perception and perception in action,” IEEE Transactions on Robotics, vol. 33, no. 6, pp. 1273–1291, 2017.
[63] J. K. Li, W. S. Lee, and D. Hsu, “Push-Net: Deep planar pushing for objects with unknown physical properties,” in Robotics: Science and Systems, vol. 14, Pittsburgh, Pennsylvania, June 2018.
[64] Z. Xu, J. Wu, A. Zeng, J. B. Tenenbaum, and S. Song, “DensePhysNet: Learning dense physical object representations via multi-step dynamic interactions,” in Robotics: Science and Systems (RSS), 2019.
[65] N. K. Kannabiran, I. Essa, and C. K. Liu, “Estimating mass distribution of articulated objects through physical interaction,” arXiv preprint arXiv:1907.03964, 2019.
[66] J. Sturm, V. Pradeep, C. Stachniss, C. Plagemann, K. Konolige, and W. Burgard, “Learning kinematic models for articulated objects,” in Twenty-First International Joint Conference on Artificial Intelligence, 2009.
[67] T. Schmidt, R. A. Newcombe, and D. Fox, “DART: Dense articulated real-time tracking,” in Robotics: Science and Systems, vol. 2, no. 1, Berkeley, CA, 2014.
[68] R. Martín-Martín, S. Höfer, and O. Brock, “An integrated approach to visual perception of articulated objects,” in International Conference on Robotics and Automation. IEEE, 2016, pp. 5091–5097.
[69] R. Martín-Martín and O. Brock, “Coupled recursive estimation for online interactive perception of articulated objects,” The International Journal of Robotics Research, 2019.
[70] B. Deng, K. Genova, S. Yazdani, S. Bouaziz, G. Hinton, and A. Tagliasacchi, “CvxNet: Learnable convex decomposition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 31–44.
[71] A. Pashevich, I. Kalevatykh, I. Laptev, and C. Schmid, “Learning visual policies for building 3D shape categories,” arXiv preprint arXiv:2004.07950, 2020.
[72] E. Rohmer, S. P. Singh, and M. Freese, “V-REP: A versatile and scalable robot simulation framework,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2013, pp. 1321–1326.
[73] S. James, M. Freese, and A. J. Davison, “PyRep: Bringing V-REP to deep robot learning,” arXiv preprint arXiv:1906.11176, 2019.
[74] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[75] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in International Conference on Learning Representations, 2018.