Physical Reasoning Using Dynamics-Aware Models
Eltayeb Ahmed, Anton Bakhtin, Laurens van der Maaten, Rohit Girdhar
Facebook AI Research, New York
Abstract
A common approach to solving physical-reasoning tasks is to train a value learner on example tasks. A limitation of such an approach is that it requires learning about object dynamics solely from reward values assigned to the final state of a rollout of the environment. This study aims to address this limitation by augmenting the reward value with additional supervisory signals about object dynamics. Specifically, we define a distance measure between the trajectories of two target objects, and use this distance measure to characterize the similarity of two environment rollouts. We train the model to correctly rank rollouts according to this measure in addition to predicting the correct reward. Empirically, we find that this approach leads to substantial performance improvements on the PHYRE benchmark for physical reasoning [2]: our approach obtains a new state-of-the-art on that benchmark.
Many open problems in artificial intelligence require agents to reason about physical interactions between objects. Spurred by the release of benchmarks such as Tools [1] and PHYRE [2], such physical-reasoning tasks have become a popular subject of study [3, 9, 11]. Specifically, the tasks define an initial state and a goal state of the world, and require selecting an action that comprises placing one or more additional objects in the world. After the action is performed, the world simulator is unrolled to determine whether or not the goal state is attained. Despite their simplicity, benchmarks like PHYRE are surprisingly difficult to solve due to the chaotic nature of the dynamics of physical objects. Current approaches for physical-reasoning problems can be subdivided into two main types:

1. Dynamics-agnostic approaches treat the problem as a "standard" contextual bandit that tries to learn the value of taking a particular action given an initial state, without using the simulator rollout in any way [2]. An advantage of such approaches is that they facilitate the use of popular learning algorithms for this setting, such as deep Q-networks (DQNs; [7]). However, these approaches do not use information from the simulator rollout as learning signal, which limits their efficacy.

2. Dynamics-modeling approaches learn models that explicitly aim to capture the dynamics of objects in the world, and use those models to perform forward prediction [3, 9, 11]. Such forward predictions can then be used, for example, in a search algorithm to find an action that is likely to be successful. An advantage of such approaches is that they use learning signal obtained from the simulator rollout. However, despite recent progress [10], high-fidelity dynamics prediction in environments like PHYRE remains an unsolved problem [3]. Moreover, current approaches do not use the uncertainty in the dynamics model to select actions that are most likely to solve the task.

In this paper, we develop a dynamics-aware approach for physical reasoning that is designed to combine the strengths of the two current approaches. Our approach incorporates information on the simulator rollout into the learning signal used to train DQNs. We show that the resulting models outperform prior models on the PHYRE benchmark, achieving a new state-of-the-art score of 85.2 on the 1B, within-template tranche of that benchmark (compared to 80.0 in prior work [3]).

Preprint: Under review.

Dynamics-Aware Deep Q-Networks
The basis of the model we develop for physical reasoning is a standard deep Q-network (DQN; [2, 7]). We augment the loss function used to train this model with a dynamics-aware loss function. This allows the model-free DQN learner to explicitly incorporate dynamics of the environment at training time, without having to do accurate dynamics prediction at inference time.

Our backbone model is a ResNet [4] that takes an image depicting the initial scene for task s as input. The action, a, is parameterized as an (x, y, r)-vector that is processed by a multilayer perceptron with one hidden layer to construct an action embedding. The action embedding is fused with the output of the third ResNet block using FiLM modulation [8]. This fused representation is input into the fourth block of the ResNet to obtain a scene-action embedding, e_{s,a}. We score action a by applying a linear layer with weights w and bias b on e_{s,a}. At training time, we evaluate this score using a logistic loss that compares it against a label, y_{s,a}, that indicates whether or not action a solves task s:

ŷ_{s,a} = w^⊤ e_{s,a} + b;   L = −( y_{s,a} log ŷ_{s,a} + (1 − y_{s,a}) log(1 − ŷ_{s,a}) ).   (1)
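A minimal NumPy sketch of this scoring head and the logistic loss of Eq. (1) follows. The shapes and weights here are illustrative (the real embedding e_{s,a} comes from the FiLM-fused ResNet), and we apply a sigmoid to map the raw score to a probability before taking the log loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_and_loss(e_sa, w, b, y_sa):
    """Linear scoring head on the scene-action embedding e_{s,a},
    followed by the logistic loss of Eq. (1).

    e_sa : (d,) scene-action embedding (output of the fourth ResNet block)
    w, b : weights and bias of the linear scoring layer
    y_sa : 1 if action a solves task s, else 0
    """
    logit = w @ e_sa + b          # ŷ_{s,a} = w^T e_{s,a} + b
    p = sigmoid(logit)            # predicted success probability
    loss = -(y_sa * np.log(p) + (1 - y_sa) * np.log(1 - p))
    return p, loss

# Toy usage with random values (illustrative only).
rng = np.random.default_rng(0)
e = rng.normal(size=256)
w = rng.normal(size=256) * 0.01
p, loss = score_and_loss(e, w, 0.0, y_sa=1)
```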
Dynamics-aware loss. We develop an auxiliary loss function that encourages the embeddings of actions that lead to similar rollouts in a given scene to be similar. Given a pair of actions (a, a') for task s, we compute a joint embedding of the two actions, j_{s,a,a'}, for that task as follows:

p_{s,a} = MLP(e_{s,a});   p_{s,a'} = MLP(e_{s,a'});   j_{s,a,a'} = p_{s,a} ⊙ p_{s,a'}.   (2)

Herein, ⊙ refers to a combination function: we use the element-wise product by default, but we also experiment with outer products and concatenation in Section 3.1. We pass j_{s,a,a'} through another linear layer to predict the similarity of the two actions in task s. The model is trained to minimize a loss that compares the predicted similarity to a "ground-truth" similarity. Specifically, we bin the ground-truth similarity into K bins and minimize the cross-entropy loss of predicting the right bin:

u_{s,a,a'} = W^⊤ j_{s,a,a'} + b;   L_aux = −( y_s^⊤ u_{s,a,a'} − log Σ_{y'_s} exp( y'_s^⊤ u_{s,a,a'} ) ).   (3)

Herein, y_s is a one-hot vector of length K indicating the bin in which the ground-truth similarity falls. The model is trained to minimize L + L_aux, assigning equal weight to both losses.
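As a sketch of Eqs. (2)-(3): both scene-action embeddings are projected by a shared MLP, combined element-wise, and scored over the K similarity bins with a softmax cross-entropy loss. The layer sizes, initialization, and function names below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def mlp_project(x, W1, W2):
    """Two-layer projection MLP with a ReLU hidden layer
    (hypothetical layer sizes)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def aux_loss(e_a, e_b, W1, W2, W_cls, b_cls, bin_index):
    """Auxiliary dynamics-aware loss of Eqs. (2)-(3).

    e_a, e_b  : scene-action embeddings for actions a, a' on the same task
    bin_index : ground-truth similarity bin in {0, ..., K-1}
    """
    p_a = mlp_project(e_a, W1, W2)
    p_b = mlp_project(e_b, W1, W2)
    j = p_a * p_b                      # element-wise combination (default)
    u = W_cls @ j + b_cls              # K logits over similarity bins
    log_z = np.log(np.sum(np.exp(u - u.max()))) + u.max()  # stable log-sum-exp
    return log_z - u[bin_index]        # cross-entropy against the one-hot bin

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d, h, K = 256, 128, 5
e_a, e_b = rng.normal(size=d), rng.normal(size=d)
W1 = rng.normal(size=(h, d)) * 0.05
W2 = rng.normal(size=(h, h)) * 0.05
W_cls, b_cls = rng.normal(size=(K, h)) * 0.05, np.zeros(K)
loss = aux_loss(e_a, e_b, W1, W2, W_cls, b_cls, bin_index=3)
```

Because the combination is symmetric in p_a and p_b, the loss is invariant to the order of the two actions, which matches the symmetry of the ground-truth similarity.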
Measuring action similarity. To measure the ground-truth similarity, v_{a,a'}, between two actions a and a' on task s, we run the simulator on the two scenes obtained after applying the actions. We track all objects throughout the simulator roll-outs, and measure the Euclidean distance between each object in one roll-out and its counterpart in the other roll-out. This results in distance functions, d_{a,a'}(o, t), for all objects o ∈ O (where t represents time). We convert each distance function into a similarity function and aggregate all similarities over time and over all objects:

q_{a,a'}(o, t) = 1 − min( d_{a,a'}(o, t), α ) / α;   v_{a,a'} = ( Σ_{o∈O} Σ_{t=1}^{T} q_{a,a'}(o, t) ) / ( T · |O| ),   (4)

where α is a hyperparameter that clips the distance at a maximum value, and T is the number of time steps in the roll-out. The similarity v_{a,a'} is binned to construct y_s. See Appendix A for details.
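Eq. (4) can be sketched directly in NumPy. We assume here that the two rollouts have already been truncated to a common length T and are given as arrays of per-object (x, y) positions (the trajectory format is an assumption for illustration; see Appendix A for the exact definition):

```python
import numpy as np

def action_similarity(traj_a, traj_b, alpha):
    """Ground-truth similarity v_{a,a'} between two rollouts (Eq. 4).

    traj_a, traj_b : arrays of shape (T, num_objects, 2) holding (x, y)
                     positions of each tracked object over T time steps.
    alpha          : clipping threshold on the Euclidean distance.
    """
    d = np.linalg.norm(traj_a - traj_b, axis=-1)   # (T, num_objects) distances
    q = 1.0 - np.minimum(d, alpha) / alpha         # per-object, per-step similarity
    return q.mean()                                # average over time and objects

# Identical rollouts have similarity 1; far-apart rollouts approach 0.
T, n_obj = 10, 3
traj = np.zeros((T, n_obj, 2))
v_same = action_similarity(traj, traj, alpha=np.sqrt(2))
v_far = action_similarity(traj, traj + 10.0, alpha=np.sqrt(2))
```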
Training. We follow [2] and train the model using mini-batch SGD. We balance the training batches to contain an equal number of positive and negative task-action pairs. To facilitate computation of L_aux, we further constrain the batch composition. First, we sample t tasks uniformly at random in a batch. For each task, we sample n actions that solve the task and n actions that do not solve the task. We compute the similarity, v_{a,a'}, for the action pairs of each task. To evaluate L_aux, we average over all of these action pairs; simultaneously, we average L over all task-action pairs in the batch. Additional details on our training procedure as well as hyperparameter settings are presented in the appendix.
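The batch-composition constraint can be sketched as follows. The data structures (`tasks`, `solving`, `failing`) and function name are hypothetical stand-ins for the PHYRE task/action data; the point is only that each batch holds t tasks with n positive and n negative actions each:

```python
import random

def sample_batch(tasks, solving, failing, t, n, seed=0):
    """Compose a training batch: t tasks, each contributing n positive
    (solving) and n negative (non-solving) actions.

    tasks   : list of task ids
    solving : dict mapping task -> actions that solve it
    failing : dict mapping task -> actions that do not
    (Illustrative interface; the real sampler operates on PHYRE data.)
    """
    rng = random.Random(seed)
    chosen = rng.sample(tasks, t)
    batch = []
    for task in chosen:
        pos = rng.sample(solving[task], n)
        neg = rng.sample(failing[task], n)
        batch.append((task, pos, neg))
    return batch

# Toy usage: actions < 20 "solve" their task, the rest do not.
tasks = [f"task{i}" for i in range(8)]
solving = {s: list(range(20)) for s in tasks}
failing = {s: list(range(20, 60)) for s in tasks}
batch = sample_batch(tasks, solving, failing, t=4, n=8)
```

Grouping several actions of the same task in one batch is what makes the within-task action pairs needed by L_aux available without extra forward passes.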
Inference. At inference time, the agent scores a set of A randomly selected actions using the scoring function ŷ_{s,a}. The agent proposes the highest-scoring action as a solution. If that action does not solve the task, the agent submits the subsequent highest-scoring action, until the task is solved or until the agent has exhausted its attempts (whichever happens first). For the PHYRE-2B tier, the action is parametrized as an (x₁, y₁, r₁, x₂, y₂, r₂)-vector.

Table 1: AUCCESS and success percentage @10 on PHYRE-1B and PHYRE-2B, in the within-template and cross-template settings.

                   |              AUCCESS              |      Success Percentage @10
Method             | 1B With. | 1B Cross | 2B With. | 2B Cross | 1B With. | 1B Cross | 2B With. | 2B Cross
RAND [2]           | 13.7     |          |          |          |          |          |          |
MEM [2]            | 2.4      |          |          |          |          |          |          |
DQN [2]            | 77.6     |          |          |          |          |          |          |
Dec [Joint] 1f [3] | 80.0     |          | –        | –        | 84.1     |          | –        | –
Ours               | 85.2     |          |          |          |          |          |          |
DQN (Online) [2]   | –        | 56.2     | –        | 39.6     | –        | 58.1     | –        | 41.6
Ours (Online)      | –        |          | –        |          | –        |          | –        |
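The ranked-proposal inference procedure described above can be sketched as a small loop. `score_fn` and `attempt_fn` are hypothetical stand-ins for the trained scorer ŷ_{s,a} and the simulator's success check; only the ranking-and-retry logic is the point here:

```python
def solve_task(score_fn, actions, attempt_fn, max_attempts=100):
    """Rank candidate actions by their learned score and try them in
    descending order until one solves the task or the attempt budget
    is exhausted. Returns (solving_action_or_None, attempts_used)."""
    ranked = sorted(actions, key=score_fn, reverse=True)
    for attempts, action in enumerate(ranked[:max_attempts], start=1):
        if attempt_fn(action):       # simulator reports success
            return action, attempts
    return None, min(len(ranked), max_attempts)

# Toy usage: the scorer happens to rank the solving action first.
actions = list(range(10))
solution, n_tries = solve_task(
    score_fn=lambda a: -abs(a - 7),   # action 7 receives the highest score
    actions=actions,
    attempt_fn=lambda a: a == 7,      # pretend only action 7 solves the task
)
```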
We test our dynamics-aware deep Q-network (DQN) on both tiers of the PHYRE benchmark (1B and 2B) and in both generalization settings: within-template and cross-template. Following [2], we use all 10 folds and evaluate on the test splits in our final experiments; the results are reported in Table 1. For all ablation studies, we use 4 folds and evaluate on the validation splits; results are in Table 2.
Implementation Details.
We train our models as described in Section 2, using both ResNet-18 and ResNet-50 backbones. We use 100,000 batches with 512 samples per batch. Each batch contains 64 unique tasks with 8 actions per task, such that half of them solve the task (positives) and half do not (negatives). Training is performed using Adam [5] with a cosine learning rate schedule [6]. We set K = 5, we set α to √2 (the maximum possible distance in a PHYRE scene), and we set the dimensionality of p_{s,a} to 256. We train and test all models in both the within-template and the cross-template generalization settings. Following [2], we also study the effect of online updates during inference time in the cross-template setting.

The results of our ablation study are presented in Table 2. In each subtable, we vary only one component of the model and keep all other components at their default values. The default values that we used to produce our final results in Table 1 are underlined in the subtables.
Effect of network depth.
We evaluate the effect of backbone depth in Table 2a. We observe that ResNet-50 performs a little better than ResNet-18 in the within-template setting, but not in the cross-template setting. These results hold for both "vanilla" DQNs and our dynamics-aware DQNs.
Effect of projection layer.
In Equation 2, we described an MLP to project the embedding from the backbone network. In Table 2b, we test replacing this module with a linear layer or using the backbone embeddings directly. We find that a two-layer MLP works slightly better and adopt it in our final model.
Effect of number of bins, K. We evaluate the effect of the number of bins, K, used for classifying action-similarity values in L_aux. The results in Table 2c show that K = 5 performs well, but using fewer bins works fine, too. We also compare the bin-classification approach to a regression approach that minimizes the mean-squared error (MSE) on the action similarities, and find it to perform worse.
Effect of combination function. We compare different combination functions ⊙ for computing the representations j_{s,a,a'}. Table 2d presents the results of this comparison. We find that element-wise multiplication works substantially better than concatenation and matches the performance of combination via a bilinear layer. We opt for element-wise multiplication over bilinear combination in our final model, as it is computationally cheaper and uses fewer parameters.
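The three combination functions compared in Table 2d can be sketched as follows. This is an illustrative NumPy implementation with a small toy dimensionality; note that the bilinear variant needs an order-3 weight tensor (many more parameters), which is why element-wise multiplication is the cheaper choice at equal accuracy:

```python
import numpy as np

def combine(p_a, p_b, mode="mult", W_bilinear=None):
    """Candidate combination functions for j = p_a (.) p_b."""
    if mode == "mult":        # element-wise product (default); d outputs, no params
        return p_a * p_b
    if mode == "concat":      # concatenation; 2d outputs
        return np.concatenate([p_a, p_b])
    if mode == "bilinear":    # bilinear layer: j_k = p_a^T W_k p_b; d^3 params
        return np.einsum("i,kij,j->k", p_a, W_bilinear, p_b)
    raise ValueError(mode)

rng = np.random.default_rng(0)
d = 8
p_a, p_b = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d, d))
j_mult = combine(p_a, p_b, "mult")
j_cat = combine(p_a, p_b, "concat")
j_bil = combine(p_a, p_b, "bilinear", W)
```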
Effect of frames considered in action similarity measure. We evaluate the effect of changing the frames of the simulator roll-out used to compute the action similarity, v_{a,a'}. The results of this evaluation in Table 2e show that using the first 10 frames or the last 5 frames works best. Although the differences are small, using only the first or the last frame is clearly worse. In our final model, we average the action similarity over the last 5 frames of the roll-out.

Table 2: AUCCESS on PHYRE-1B (averaged over 4 validation folds) observed in our ablation studies. Unless otherwise noted, results are for the within-template setting.

(a) ResNet backbone depth.
Model | Depth | Within | Cross
DQN   | 18    | 81.2   |
DQN   | 50    | 82.4   |
Ours  | 18    | 83.2   |
Ours  | 50    | 83.5   |

(b) Projection layer.
Projection | AUCCESS
None       | 82.7
Linear     | 82.7
MLP        |

(c) Number of bins.
K   | AUCCESS
5   |
10  | 83.2
20  | 82.8
MSE | 82.8

(d) Combination function.
Combination    | AUCCESS
Multiplication | 83.2
Concatenation  | 82.2
Bilinear       | 83.2

(e) Frames used in v_{a,a'}.
Frames         | AUCCESS
First 1        | 81.7
First 3        | 82.2
First 5        | 82.7
First 10       | 83.2
Last 1         | 82.8
Last 3         | 82.9
Last 5         | 83.6
Last 10        | 82.5
Entire rollout | 82.7

Figure 1: In (a) and (b), we visualize the positions (x, y) of all ground-truth (GT) and top-10 predicted actions that solve the two tasks shown, with darker colors representing higher confidence. On Task A, our method performs similarly to a dynamics-agnostic baseline. On Task B, where the incline is slanted the other way, the baseline model is confused between two possible sets of action positions; by contrast, our dynamics-aware DQN model is able to solve this task correctly. Finally, in (c), we visualize all actions whose cosine similarity to any action that solves the task exceeds a threshold, with denser color indicating higher similarity. The illustration suggests that our dynamics-aware DQN model is able to rule out incorrect actions much more effectively than the baseline DQN model.

Table 1 presents the AUCCESS and success percentage of our best dynamics-aware DQNs, and compares them to results reported in prior work. The results show the strong performance of our models: in the within-template setting, our models improve the prior state-of-the-art AUCCESS by 5 to 8 points. In the cross-template setting, our dynamics-aware DQN also outperforms its dynamics-agnostic counterpart, but does not outperform a model that makes full dynamics predictions [3]. Nevertheless, the results suggest that dynamics-aware DQNs have the potential to improve physical-reasoning models. Figure 1 illustrates these results by visualizing how our dynamics-aware DQN model more effectively rules out parts of the action space that cannot lead to a successful solution.
We have presented a dynamics-aware DQN model for efficient physical reasoning that is better at capturing object dynamics than baseline models, without having to do explicit forward prediction. Our best models substantially outperform prior work on the challenging PHYRE benchmark.

References

[1] Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. The Tools challenge: Rapid trial-and-error learning in physical problem solving. In CogSci, 2020.
[2] Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. PHYRE: A new benchmark for physical reasoning. In NeurIPS, 2019.
[3] Rohit Girdhar, Laura Gustafson, Aaron Adcock, and Laurens van der Maaten. Forward prediction for physical reasoning. arXiv preprint arXiv:2006.10734, 2020.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[5] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[6] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[7] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
[8] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
[9] Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, and Jitendra Malik. Learning long-term visual dynamics with region proposal interaction networks. arXiv preprint arXiv:2008.02265, 2020.
[10] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter W. Battaglia. Learning to simulate complex physics with graph networks. arXiv preprint arXiv:2002.09405, 2020.
[11] William F. Whitney, Rajat Agarwal, Kyunghyun Cho, and Abhinav Gupta. Dynamics-aware embeddings. In
ICLR, 2020.

Table 3: Maximum distance at which distances are clipped. We find stable performance across different clipping distances, and use 0.1 for the final results.

Table 4: DQN trained with our modified batch-composition approach used for training our dynamics-aware models. We find that the batching by itself does not lead to any gains.
Batching approach | AUCCESS
Standard [2]      | 81.1
Ours              | 81.6

A Action Similarity Metric
The similarity between two actions is computed from the object-feature representation of the actions' rollouts provided by the PHYRE API. For two rollouts of two actions (a, a'), we use the notation that (x(o, t), y(o, t)) and (x'(o, t), y'(o, t)) are the locations of object o at time step t in the rollouts of a and a', respectively. Then:

T = min(t₁, t₂),   (5)

q_{a,a'}(o, t) = 1 − min( d_{a,a'}(o, t), α ) / α,   (6)

v_{a,a'} = ( Σ_{o∈O} Σ_{t=1}^{T} q_{a,a'}(o, t) ) / ( T · |O| ),   (7)

where O is the set of moving objects in the scene, t₁ and t₂ are the lengths of the first and second rollouts, respectively, and α is a hyperparameter that clips the distance at a maximum value. When computing the metric using only the "last n" frames, the frames we consider are those from time (T − n + 1) to T. The similarity v_{a,a'} is binned to construct y_s as follows:

y_s = ⌊ v_{a,a'} · (K − 1) ⌉,   (8)

where ⌊·⌉ is an operator that rounds continuous numbers to the nearest integer and K is the number of bins used.

In Table 1, we take α = √2. This value is suggested by the PHYRE environment: the coordinates of locations in the scene fall in the square delimited by the corner points (0, 0) and (1, 1), so the maximum possible Euclidean distance between two objects is √2. In other environments, we might not have easy access to the maximum possible Euclidean distance, or it might not be finite. To study the sensitivity of our method with respect to this parameter choice, we train a group of models with arbitrary values of α < √2. Using a value α > √2 corresponds to disabling distance thresholding altogether. We show the results in Table 3 and find the effect of using an arbitrary distance threshold to be negligible.
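The binning rule of Eq. (8) and the "last n frames" selection can be written out directly; this is a small self-contained sketch of those two operations:

```python
import numpy as np

def similarity_bin(v, K):
    """Bin a similarity v in [0, 1] into one of K bins (Eq. 8):
    y_s = round(v * (K - 1)), giving bin indices 0 .. K-1."""
    return int(np.rint(v * (K - 1)))

def last_n_frames(q, n):
    """Restrict a length-T sequence of per-frame similarities q to the
    'last n' frames, i.e. time steps (T - n + 1) through T."""
    return q[max(len(q) - n, 0):]

# With K = 5 bins, similarities in [0, 1] map to bins 0..4.
bins = [similarity_bin(v, K=5) for v in (0.0, 0.2, 0.5, 0.9, 1.0)]
```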
Here we conduct one further ablation in addition to those carried out in Section 3.1: we examine the effect of our batch composition on the models. In [2], task-action pairs are sampled uniformly at random, with the sole constraint that the number of negative examples equals the number of positive examples in the batch. This label balancing is in line with standard practices such as over-sampling, which are beneficial when training on datasets with heavy label imbalance. In our method, we keep this label balancing and, to facilitate the auxiliary losses, we impose the additional constraint that each batch contains multiple actions per task, as described in Section 2. We examine the effect of this modified batch composition in Table 4 and find that this change alone does not lead to a significant performance improvement.
Figure 2: Here we break down AUCCESS by template in 2a and by number of moving objects in 2b. We see that our agent's biggest gains are on templates where the baseline performs worst, while the baseline marginally outperforms our models on the templates where it was already performing well. When we aggregate the templates by the number of moving objects, we see our model outperforming the baseline across all numbers of moving objects.
Figure 3: Here we show baseline AUCCESS vs. ∆AUCCESS from our method and find a statistically significant correlation, with a Pearson correlation coefficient of -0.55. This shows we get the highest gains on templates where the baseline performs poorly.
Figure 4: Here we show an extended version of Figure 1(c), showing action-space embeddings color-coded by similarity to GT actions at different similarity thresholds. We observe that our method is able to rule out actions incapable of solving the task at all thresholds, and at a threshold of 0.98 the selected actions are almost indistinguishable from GT.