Prospection: Interpretable Plans From Language By Predicting the Future
Chris Paxton, Yonatan Bisk, Jesse Thomason, Arunkumar Byravan, Dieter Fox
Abstract — High-level human instructions often correspond to behaviors with multiple implicit steps. In order for robots to be useful in the real world, they must be able to reason over both motions and intermediate goals implied by human instructions. In this work, we propose a framework for learning representations that convert a natural-language command to a sequence of intermediate goals for execution on a robot. A key feature of this framework is prospection: training an agent not just to correctly execute the prescribed command, but to predict a horizon of consequences of an action before taking it. We demonstrate the fidelity of plans generated by our framework when interpreting real, crowd-sourced natural language commands for a robot in simulated scenes.
I. INTRODUCTION
A robot agent executing natural language commands must solve a series of problems. First, human language must be translated to an understanding of intent. For example, the command pick up the yellow block and place it on top of the red block corresponds to an intended change in world state that results in a yellow block on top of a red one. Given that understanding, an agent must plan a sequence of actions it can take to reach the target world state. In the above example, this could be (move(yellow); grasp(yellow); move(yellow, red); release(yellow)). Finally, these high-level controls have to be executed in the world by servoing an arm to appropriate positions and controlling the gripper.

Each of these problems is challenging and has been investigated by existing research. Commonly, a pipeline approach is used, where each problem is addressed sequentially and the outputs of one are fed to the next. In the example above, the semantic understanding that the goal is on(yellow, red) is extracted from natural language and passed to a high-level controller. To simplify high-level control prediction, robot perception is often augmented with visual semantic information, such as oracle object detections, bounding boxes, or 6D poses [1], [2]. In this work, we instead train a single model end-to-end that takes natural language and raw pixels, depths, and joint states, and produces low-level controls to accomplish a goal.

Additionally, rather than treat the pipeline problems above independently, we introduce a prospection component to an agent's training and inference, which facilitates "dreaming" about the consequences of chosen high-level actions in the current scene. Prospection is the ability to reason about the consequences of future actions without executing them [3]. Our approach allows the agent to predict whether its high-level actions will lead to undesirable world states.

The authors are with NVIDIA, USA, and the Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.
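The pipeline in the example above can be made concrete with a small sketch; the function name and string formats here are illustrative, not the paper's actual implementation:

```python
# Illustrative sketch (not the paper's code): expand the intended goal
# on(top, bottom) into the four-step symbolic plan from the example above.
def plan_stack(top, bottom):
    return [
        f"move({top})",            # servo the arm to the block to grasp
        f"grasp({top})",           # close the gripper on it
        f"move({top}, {bottom})",  # carry the held block over the target
        f"release({top})",         # open the gripper to place it
    ]

plan_stack("yellow", "red")
# ['move(yellow)', 'grasp(yellow)', 'move(yellow, red)', 'release(yellow)']
```

A pipeline system hands such a plan to a low-level controller one step at a time; the approach in this paper instead learns the whole mapping end-to-end.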
Fig. 1: DREAM CELLS convert instructions (e.g., "take the yellow object from the table and place it on top of the red object") to interpretable subgoals which can be visualized (Ŵ) and executed (θ).

We consider a simple pick-and-place task, where the goal is to stack one block on top of another (Figure 1). This is limited in that there are only a few high-level actions the robot can take in a given world. Still, it proves challenging when specified with real, crowd-sourced natural language. In short, our contributions are:
• A language embedding rich enough to specify a sequence of actions to achieve a task-level goal.
• A DREAM CELL architecture to predict future world states from language and raw sensor observations.
• An approach to convert a task plan output from these DREAM CELLS into low-level executions.

II. RELATED WORK
Our work draws inspiration from recent efforts on learning abstract representations for planning [4]. We build primarily on work in planning and natural language processing, with important future work in manipulation.

Communicating control and goals has traditionally been accomplished by specifying high-level operations [5], [1], [6], via formal languages like the Planning Domain Definition Language (PDDL) [7], or as a Hierarchical Task Network [8]. Such systems provide a straightforward way to compose black-box operations to solve problems. While we maintain an interpretable intermediate layer, our interface is natural language, most akin to [9], [10], though we work in a fully end-to-end differentiable paradigm where embedded language representations are learned alongside visual encodings of the world.

Fig. 2: At each timestep, the model receives an LSTM-encoded language vector L, the initial world state W_0, and the current world state W_t. Using these, it predicts the next world state Ŵ_{t+1} and sub-goal G_t.

Our work is also motivated by Universal Planning Networks, which learn an embedding for images and a world state vector used to generate motion plans to goals specified via a target image [11]. That work learns a distance metric from the current state image to the target image, which is used to perform rollouts for training and inference. While learned generic representations have notions of agency and planning, the produced plans lack human interpretability, which may be important for modularity and generalization [12]. Neural approaches and scaling robotic learning within simulation have become common, as they allow for end-to-end training and can easily acquire more data (often from multiple domains) than is otherwise possible on a physical device [2]. This has proven particularly important for RL-based approaches [13] and interpretability [14]. More generally, there is a growing literature on learning deep representations that can be used to accomplish local control tasks or simple object manipulations [15], [16], [11], [17].

Core to our contribution is simulating the future actions and dynamics of our system. High-level process simulation has been used in the NLP literature [18] without sensor data. Simultaneously, prospection has been used before in RL, often as a means of model-based control [19], [17], [20], and for fine control tasks like cutting [19]. In addition, our approach is compatible with work in Visual Robot Task Planning [4], which shows that prospective subgoal predictions can be used to generate task plans.

Complementary to our work is the growing literature in NLP focusing on grounding complex instructions with esoteric references and other long-tail linguistic phenomena [21], [22]. Natural language communication with robots also allows for learning joint multimodal representations [23], [24] which harness the unique perceptual and manipulation capabilities of robots.

While accurate grasping and placement is not a focus of our work, it has been explored in the literature [25], [26], [27]. In particular, high-precision grasping with deep neural networks generally takes the form of predicting a grasp success classifier [25], [26].

III. PROBLEM DESCRIPTION
Given a natural language command and raw sensor readings for a scene, the task is to issue a sequence of low-level controls to a robotic arm that accomplish the intended task.
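The low-level command format can be sketched as follows; the class and field names are our own, but the contents (Cartesian position, unit quaternion, gripper command) mirror the end effector goal θ described in the next section:

```python
import math
from dataclasses import dataclass

# Hypothetical sketch of a single low-level arm command: a Cartesian
# position, an orientation quaternion (normalized on construction), and
# a gripper command in (0, 1). Names are ours, not the paper's.
@dataclass
class EndEffectorGoal:
    position: tuple    # (x, y, z) in meters
    quaternion: tuple  # (a, b, c, d); stored as a unit quaternion
    gripper: float     # 0 = closed, 1 = open

    def __post_init__(self):
        # Normalize the quaternion so downstream controllers always
        # receive a valid rotation.
        a, b, c, d = self.quaternion
        n = math.sqrt(a * a + b * b + c * c + d * d)
        self.quaternion = (a / n, b / n, c / n, d / n)

goal = EndEffectorGoal((0.4, 0.0, 0.1), (2.0, 0.0, 0.0, 0.0), 1.0)
# goal.quaternion == (1.0, 0.0, 0.0, 0.0)
```

A sequence of such goals, one per high-level action, is what the agent must produce from language and raw observations.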
Specification. A DREAM CELL takes in the initial (W_0) and current (W_t) simulation-based state observations and a natural-language sentence s, and produces a sequence of intermediate, latent-space goals z_i, ..., z_{i+h} for this pick-and-place task up to horizon h. Each goal is a semantically meaningful break point in the execution, e.g., a completed grasp on the target. DREAM CELLS make three predictions at every time step:
1) A sequence of subgoal predictions, representing the next high-level actions out to some planning horizon;
2) A sequence of hidden state representations z_i, ..., z_{i+h} representing the results of these subgoals; and
3) The end effector command θ that parameterizes the low-level controller for task execution, consisting of a 6-DOF pose and gripper command.
These predictions allow us to learn an interpretable, executable representation for hallucinating future world states.

Metrics.
We measure both extrinsic performance on the task and intrinsic performance of DREAM CELL components. Extrinsically, we evaluate how well the robot agent completes the pick-and-place task. We record binary task success/failure as whether the target block to be moved is dropped within a threshold of its intended position based on the language command. We also record the average mean-squared error of the predicted end effector goal at each step of the task. This metric penalizes moving the wrong block while giving partial credit for moving the right block towards the right place, even if it never arrives there or is placed unstably (e.g., if it falls off the target block after release). Intrinsically, we evaluate the language-to-action component by how closely the predicted sequence of subgoals matches the ground truth execution.

IV. APPROACH
We train the system end-to-end using simulated data. This allows us to automatically generate training sequences for both images and high-level subgoals. Subgoals take the form of semantic predicates like grasp and move with block arguments. To simplify notation, throughout the paper we refer to the union of RGB, pose, and depth images with the single world state variable W.

Fig. 3: Diagram of a single prediction cell. The prediction cell predicts a change in hidden state Δz, and is an important component of the DREAM CELL, used both for visualizing possible futures and for predicting the goal of a particular motion for execution on our robot.

At every timestep, the model is provided the current world observation (W_t) and a description of the goal configuration in natural language (encoded as L). In practice, W_t also includes the initial state W_0 to capture changes over time. All aspects of the model (including the encoders for both language and the world) are trained together. The model is trained on supervised demonstration data collected from an expert policy, as in previous work [15], [4].

The basic unit in our model is the DREAM CELL (Fig. 2), which produces a sub-goal and a corresponding predicted image of the arm's position at the next time step. This formulation allows for recurrent chaining of cells to roll out future goals and states (Fig. 1). Specifically, because the output of the DREAM CELL includes a deconvolved hallucination of the next world state (Ŵ_{t+1}), we can simply continue to run the network forward, with true observations replaced by the network's predictions. Our cell has two outputs at every timestep t: (1) subgoals (G_t), and (2) predicted worlds (Ŵ_{t+1}). We provide intrinsic evaluations of future prediction performance in Section VI.

At inference time, we generate a task plan by rolling out multiple DREAM CELL timesteps into a possible future given state observations z. The core of our approach is that these subgoals are converted into estimated world states Ŵ and associated end effector goals θ, which are fed into a lower-level controller π that converts them into trajectories.

A. DREAM CELL Subgoal Module
The subgoal module predicts the next subgoal from the current world state and the language instruction. It is formulated similarly to image captioning and sequence-to-sequence prediction. First, we use an LSTM [28] to encode the goal as expressed in language. Words are embedded as 64-dimensional vectors initialized randomly. We concatenate the final hidden state (L) with the output of our world encoder (z_t) as the initial hidden state of a new LSTM cell for decoding.

We generate an output of G_t = (verb, to_obj, with_obj) tuples at each timestep, to a horizon of length five. We divide the subgoal G_t into: verb, the action to be taken; to_obj, the object to servo towards; and with_obj, the optional object in hand (e.g., move(red, yellow) moves the yellow block in hand to the target red block).

During training, we use a cross-entropy loss on all 5×3 generations. The first timestep in the RNN is passed the current hidden state; at each subsequent timestep, the RNN cell is passed the prediction from the previous timestep. As is common practice in the language modeling literature [29], we tie the emission and embedding matrix parameters. Traditionally, this is achieved by simply transposing a single matrix. Our model produces tuples by passing the hidden vector through three different feed-forward layers, so, to re-embed predictions, we multiply by the three transposed embedding matrices and average the outputs to reconstruct an embedding.

B. DREAM CELL World Predictor Module
The prediction cell, shown in Fig. 3, takes in the current hidden state z_t and the predicted subgoal G_t. Each prediction model outputs a predicted latent-state subgoal ẑ_{t+1}, such that P(z_t, G_t) = ẑ_{t+1}. In effect, we learn a many-to-one mapping across multiple timesteps, all of which need to produce the same goal. We should also be able to roll this simulation forward in time in order to visualize future actions. Training this prediction space is a difficult problem and requires a complex loss function involving multiple components.

The prediction cell is a simple autoencoder mapping inputs W to and from a learned latent space, as shown in Fig. 2. World observations W_t and W_0 are combined into a single estimated latent state z_t. The vector containing the predicted subgoal G_t is tiled onto this state. We use a bottleneck within each prediction cell to force information to propagate across the entire predicted image, and then estimate a change in latent state Δz such that ẑ_{t+1} = z_t + Δz. When visualizing the predicted image Ŵ_{t+1}, we use a decoder consisting of a series of 5×5 convolutions and bilinear interpolation for upsampling.

Fig. 4: The actor module takes in a hidden state and associated subgoal and converts this to a motion goal, which is represented as a Cartesian (x, y, z) position, a unit quaternion q = (a, b, c, d), and a gripper command g ∈ (0, 1). This motion goal can then be sent to the control module for execution.

C. Actor Module
The actor module, shown in Fig. 4, predicts the parameters of an action that can be executed on the robot. Specifically, it takes in z_t and the current high-level action, and predicts a destination end effector pose that corresponds to the robot's position at z_t.

The architecture is a simple set of convolutions: the high-level action is concatenated with the current z_t as in the prediction module, followed by a set of three 3×3 convolutions with 64, 128, and 256 filters, each followed by a 2×2 max pool. This is followed by a dropout layer and a single 512-dimensional fully connected layer, and then by one output set per high-level action verb, predicting a gripper command g ∈ (0, 1), a Cartesian end effector position, and a unit quaternion for each verb. The gripper command uses a sigmoid activation, where 0 represents closed and 1 represents open, and the Cartesian end effector position uses a tanh activation function. All pose values are normalized to be in (−1, 1). A one-hot attention over action verbs chooses which pose and gripper command should be executed. In effect, the actor learns to compute a set of pose features for predicting the next manipulation goal, and learns a simple perceptron model for each action verb in order to choose where the arm should go and whether the gripper should be opened or closed after the motion is complete.

D. Training
We train the encoder and decoder jointly when training the Prediction and Actor modules and optimize with Adam [30] using a fixed initial learning rate. We fix the latent state encoder and decoder functions after this step, then use the learned hidden space to train the Subgoal module.

Image Reconstruction Loss. This determines how well our model can reconstruct an image from a given hidden state z_t, and is trained on the output of our visualization module. We use an L2 loss on pixels for both RGB and depth. Depth values were capped at 2 meters and normalized to be between 0 and 1.

Subgoal Recovery Loss. Image reconstruction losses are often insufficient for capturing fine details. This issue has motivated recent work on GANs [31]. These are often unstable, so we propose an alternative solution specialized to our problem. Since each successful high-level action has a predictable result, we jointly train a classifier that will recover the subgoal associated with each successive high-level action Ĝ_t. We use C_G(z_t) as the classifier loss, minimizing cross entropy between the recovered estimate Ĝ_t and the ground truth G_t.

Actor Loss. Instead of estimating the full joint state of the robot as the result of a high-level action, our Actor module estimates the end-effector pose θ_t associated with subgoal G_t. These poses are represented as θ = (p̂, q̂, ĝ), where p̂ is the Cartesian position, q̂ is a quaternion describing the orientation, and ĝ ∈ (0, 1) is the gripper command. When regressing to poses, we use a mixture of the L2 loss on Cartesian position and a loss derived from the quaternion angular distance (to capture both spatial and rotational error). The angle between two quaternions q̂_1 and q̂_2 is given as:

ω = cos⁻¹(2⟨q̂_1, q̂_2⟩² − 1).

To avoid computing the inverse cosine as part of the loss, we use a squared distance metric. In addition, we normalize gripper commands to be between 0 and 1, where 0 is closed and 1 is open, and train with an additional L2 loss on predicted gripper commands. Given the estimated pose θ̂ and final pose θ, we calculate the pose estimation loss:

C_actor(θ̂, θ) = λ_actor { ‖p̂ − p‖² + (1 − ⟨q̂, q⟩²) + ‖ĝ − g‖² }.

Object Pose Estimation Loss. It is important to ensure that our learned latent states z_t capture all the necessary information to perform the task. As such, we use an augmented loss C_obj(z_t) that predicts the position of each of the four blocks in the scene at the observed frame. This information is not used at test time, but is structurally identical to the pose estimation loss C_pose.

Combined Prediction Loss.
The final loss function for predicting the effects of performing a sequence of high-level actions is then:

C(Ẑ) = λ_W ‖Ŵ_t − W_t‖² + C_obj(z_t) + Σ_{i ∈ h} ( λ_W ‖Ŵ_{t+i} − W_{t+i}‖² + C_actor(θ̂_{t+i}, θ_{t+i}) + C_G(Ĝ_{t+i}, G_{t+i}) ).

Fig. 5: Human participants on Mechanical Turk gave two commands for how to create the target image (right) from the initial image (left), e.g., "put the blue cube onto the yellow cube" and "stack the top most cube onto the second highest cube".

E. Execution
When executing in a new environment, the robot agent takes in the current world state W and a natural language instruction L. The agent computes a future prediction using a DREAM CELL, by rolling out predicted goals G_t, ..., G_{t+H}, which generate latent-space subgoals. These subgoals can then be visualized to provide insight into how the robot expects the task to progress. This also illuminates misunderstandings and limitations of the system (see the analysis in Section VI).

The system generates new prospective plans out to a given planning horizon. After predicting the next subgoal z_{t+1}, it will then use the actor to estimate the next motion goal θ_{t+1}. This goal is sent to the low-level execution system, which in our case is a traditional motion planner that does not have knowledge of object positions. In our case, the planner used was RRT-Connect [32], via MoveIt.

In the future, these subgoals could be shown to the user, who can give final confirmation on whether or not to execute this hallucinated task plan if it accomplishes what they requested. Alternatively, the user could input a new L, or the agent could sample a new sequence of goals.

V. EXPERIMENT SETUP
We performed a number of variations on a simple block stacking task. All experiments were performed in simulation. We collected 5015 trials using a sub-optimal expert policy, of which 2370 were successes. Our model was trained only on successful examples.

We generated trials using a simple simulation of an ABB YuMi robot picking up 3.5 cm cubes. There were four cubes, one of each color: red, green, yellow, and blue. When collecting data, we first randomly compute a table position within a 50 cm box centered in front of the robot. Blocks were randomly placed in non-intersecting positions on this table. The arm's initial position was also randomized to an area off the right side of the table.

We selected manipulation goals at random, and provided a simple expert policy which moved to pick up each object using an RRT motion planner. The plan has five steps: align with the top of a random block, grasp that block and close the gripper, lift the block off the table, move the block to atop another block, and then release the block.

TABLE I: L2 distances and accuracy when executing plans generated from either ambiguous natural language instructions or unambiguous template language.

                 L2 distance in cm ↓                  Success ↑
            Align  Grasp  Lift   Move   Release       Rate
Oracle      0.04   0.03   0.04   0.04   0.04          98.4%
GT Action   0.32   0.31   0.48   0.63   0.63          90.4%
Template    0.32   0.39   0.47   0.65   0.65          87.8%
Real Lang   0.51   1.23   1.50   2.39   2.40          77.1%

We collected natural language commands from human annotators through the Mechanical Turk crowd-sourcing platform. Annotators were shown two scene images: one before and one after a block had been stacked on another block. They were instructed to give two distinct commands that would let someone create the second scene from the first (Figure 5), and were paid $0.25 per such annotation. For each of our 2370 successful trials, we obtained two language commands describing the high-level pick-and-place goal. On average, commands are 11 words long. We compare Turk data to unambiguous templated language that was procedurally generated from the manipulated blocks.

VI. RESULTS
We ran a set of experiments in our simulation and computed the task execution success rate. We analyze the performance of the Subgoal and Predictor modules given different classes of language input.
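The binary success criterion used below (block within 1.5 cm of the target in x and y, and within 0.5 cm in z of its final drop position) can be sketched as a simple threshold check; the function name and argument layout are our own:

```python
# Sketch of the binary task-success check: compare the moved block's final
# position against the target using the tolerances reported in this section
# (1.5 cm in x and y, 0.5 cm in z). Units are meters; the z reference is
# the final position from which the block was dropped.
def task_success(block_xyz, target_xyz, xy_tol=0.015, z_tol=0.005):
    dx, dy, dz = (abs(b - t) for b, t in zip(block_xyz, target_xyz))
    return dx <= xy_tol and dy <= xy_tol and dz <= z_tol

task_success((0.10, 0.20, 0.035), (0.11, 0.20, 0.035))  # True: 1 cm off in x
task_success((0.10, 0.20, 0.035), (0.13, 0.20, 0.035))  # False: 3 cm off in x
```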
A. Execution Results
Finally, we test our model on a set of held-out scenarios and compare to ground-truth execution. We compared accuracy of the estimated motion under each of three conditions: with oracle subgoals G_t from the test data, with unambiguous templated language, and with natural language. Position accuracy results are shown in Table I.

In all cases, we compute an execution plan Ĝ_t, ..., Ĝ_{t+H} at the beginning and use our Predictor and Actor networks to follow this execution plan until all steps have been executed. Results are shown in Table I. We count successes when the block was moved to within 1.5 cm of the target in the x and y directions, and within 0.5 cm in z of the final position from which it was dropped.

We see only a handful of failures when the robot was sent to ground-truth "oracle" poses, due to stochastic interactions between the objects and gripper and randomness at the control level. In the large majority of executions with ground truth actions, with unambiguous templated language, and with natural language, the system successfully picked up a block and put it in the right area, indicating a high level of precision independent of the task specification. Grasp success rates tended to be very high. The most dramatic failures we observed were situations where one or more necessary blocks was out of the camera's viewpoint, in which case our vision-based system fails by default.

The similar performance between templated language and ground truth actions suggests that unambiguous, templated language is insufficient to demonstrate the language learning capabilities of our system. We find our method is robust to real natural language from Mechanical Turk workers (participants used 389 unique words after lowercasing and tokenization), achieving 95% of the success rate seen on unambiguous templates.

Fig. 6: Comparison of generated subgoal predictions. Top two rows: RGB and depth images generated from predicted subgoal Ĝ. Bottom two rows: RGB and depth images generated from ground truth G from training data.

TABLE II: Subgoal prediction accuracy (↑) at different horizons with templated versus natural language. We see higher accuracy as we move closer to the end of the task, when the space of possible remaining plans is less ambiguous.

Horizon        h = 1   h = 2   h = 3   h = 4   h = 5
Template
  Verb         84.0%   87.9%   93.0%   97.4%   100.0%
  To Object    91.5%   90.4%   93.0%   97.4%   100.0%
  With Object  93.7%   91.8%   94.4%   96.8%   100.0%
  Overall      82.9%   86.8%   92.5%   97.3%   100.0%
Real Lang
  Verb         84.2%   87.8%   92.9%   97.2%   100.0%
  To Object    87.8%   87.4%   89.8%   93.9%   98.2%
  With Object  91.4%   90.1%   92.5%   96.4%   100.0%
  Overall      79.6%   83.8%   89.1%   93.7%   98.2%

B. Subgoal Module
We analyze the language learning component of our model. A full breakdown of subgoal prediction accuracy is given in Table II. Performance was comparable between templated language and natural language data collected from Amazon Mechanical Turk, though it is more difficult to make accurate predictions on real language data. Additionally, accuracy is remarkably consistent over time, meaning that the model properly learned the correct sequence of actions that should be executed. Accuracy farther out into the future is stable because the network knows when and how a sequence should end.

There are two major sources of error we observe in these examples. First, the difference in accuracy between Mechanical Turk language and templated language is largely explained by the ambiguity and underspecificity of Turk commands (e.g., not specifying a destination after a grasp). Second, the overall error is largely due to sequence error at transition points, where multiple possible actions are reasonable depending on whether or not the low-level control has arrived at its destination. This further supports our hypothesis that we need to reason about all three sub-problems jointly.
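The per-slot and overall subgoal accuracies reported in Table II can be tallied as below; this sketch scores aligned (verb, to_obj, with_obj) tuples against ground truth, with variable names of our own choosing:

```python
# Sketch of the subgoal-accuracy tally behind Table II: compare predicted
# (verb, to_obj, with_obj) tuples against ground truth, per slot and overall.
# Table II averages this comparison over the test set at each horizon step.
def subgoal_accuracy(predicted, ground_truth):
    n = len(ground_truth)
    acc = {}
    for i, slot in enumerate(("verb", "to_obj", "with_obj")):
        acc[slot] = sum(p[i] == g[i] for p, g in zip(predicted, ground_truth)) / n
    # "Overall" requires the full tuple to match.
    acc["overall"] = sum(p == g for p, g in zip(predicted, ground_truth)) / n
    return acc

pred = [("grasp", "yellow", None), ("move", "red", "yellow")]
gold = [("grasp", "yellow", None), ("move", "red", "blue")]
subgoal_accuracy(pred, gold)
# {'verb': 1.0, 'to_obj': 1.0, 'with_obj': 0.5, 'overall': 0.5}
```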
C. Prediction Model
The role of the prediction model is to generate subgoal predictions ẑ_{t+1}, ..., ẑ_{t+h} representing the h next actions that the robot can take in order to perform the task. Fig. 6 shows an example of one source of error we see during these prediction rollouts. The top two rows show a sequence of predictions coming from the sequence-to-sequence model, while the bottom rows show predictions using ground truth actions from our data. In this case, we see that the sequence-to-sequence model started the grasp verb earlier than the ground truth execution, but both models generate good image predictions.

As we can see in Table I, there is persistently some error in the low-level predictions from our actor module, even when given oracle arguments. Average placement error increases as we move away from the ground-truth arguments. Often failures occur because the object is not clearly visible in the first frame.

Fig. 7: Results of different simulated executions. Successful grasps (top row) can be undermined by small errors in the low-level actor network that compound to create accuracy issues at execution time (bottom row).

Our reconstruction results have another advantage, however, as seen in Fig. 6: they are clearly interpretable, which means that the robot can readily justify its decisions even when it does make a mistake, facilitating a human user providing a new instruction that takes this mistake into account. Overall, these results show that we can learn representations for a task that are sufficient for planning and execution purely from language and raw sensory data.

We performed an ablation analysis on the best prediction models to determine how much they use information from different layers. In particular, we see similar performance when training without the image loss, and when training without the image and object losses, on held-out test data. This suggests that our image reconstruction loss may help, and certainly does not have a negative impact.

VII. CONCLUSIONS
We present an approach for inferring interpretable plans from natural language and raw sensor input using prospection. Our DREAM CELL architecture predicts future world states from language and raw sensor observations, facilitating high-level plan inference that can be converted into low-level execution. Prospection enables end-to-end plan inference that is agnostic to the nature of the sensory input and low-level controller modules. In the future, using this architecture to bootstrap language understanding for execution on a real robot via sim-to-real transfer techniques could facilitate end-to-end control on a physical platform.

VIII. ACKNOWLEDGEMENTS
This work was funded in part by the National Science Foundation under contract no. NSF-NRI-1637479, and the DARPA CwC program through ARO (W911NF-15-1-0543). We would like to thank Jonathan Tremblay for valuable discussions.

REFERENCES

[1] C. Paxton, A. Hundt, F. Jonathan, K. Guerin, and G. D. Hager, "CoSTAR: Instructing collaborative robots with behavior trees and vision," in Robotics and Automation (ICRA), 2017 IEEE International Conference on, 2017.
[2] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese, "Neural task programming: Learning to generalize across hierarchical tasks," in International Conference on Robotics and Automation (ICRA), 2018.
[3] D. T. Gilbert and T. D. Wilson, "Prospection: Experiencing the future," Science, vol. 317, no. 5843, pp. 1351–1354, 2007.
[4] C. Paxton, Y. Barnoy, K. D. Katyal, R. Arora, and G. D. Hager, "Visual robot task planning," in International Conference on Robotics and Automation (ICRA), 2019.
[5] E. Guizzo. (2017) Rethink's robots get massive software upgrade, Rodney Brooks "so excited." [Online]. Available: https://spectrum.ieee.org/automaton/robotics/industrial-robots/rethink-robots-get-massive-software-upgrade
[6] C. Paxton, F. Jonathan, A. Hundt, B. Mutlu, and G. D. Hager, "Evaluating methods for end-user creation of robot task plans," in Intelligent Robots and Systems (IROS), 2018 IEEE International Conference on, 2018.
[7] M. Ghallab, C. Knoblock, D. Wilkins, A. Barrett, D. Christianson, M. Friedman, C. Kwok, K. Golden, S. Penberthy, D. E. Smith et al., "PDDL: The Planning Domain Definition Language," 1998.
[8] K. Erol, J. Hendler, and D. S. Nau, "HTN planning: Complexity and expressivity," in AAAI, vol. 94, 1994, pp. 1123–1128.
[9] R. Paul, J. Arkin, N. Roy, and T. Howard, "Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators," in Proceedings of the 2016 Robotics: Science and Systems Conference, June 2016.
[10] D. Arumugam, S. Karamcheti, N. Gopalan, L. L. Wong, and S. Tellex, "Accurately and efficiently interpreting human-robot instructions of varying granularities," in Proceedings of the 2017 Robotics: Science and Systems Conference, 2017.
[11] A. Srinivas, A. Jabri, P. Abbeel, and S. Levine, "Universal planning networks," in Proceedings of the International Conference on Machine Learning (ICML), 2018.
[12] M. Garnelo, K. Arulkumaran, and M. Shanahan, "Towards deep symbolic reinforcement learning," in Deep Reinforcement Learning Workshop at NIPS, 2016.
[13] O. Nachum, S. Gu, H. Lee, and S. Levine, "Data-efficient hierarchical reinforcement learning," 2018.
[14] J. Tremblay, T. To, A. Molchanov, S. Tyree, J. Kautz, and S. Birchfield, "Synthetically trained networks for learning human-readable plans from real-world demonstrations," in International Conference on Robotics and Automation (ICRA), 2018.
[15] A. Byravan and D. Fox, "SE3-Nets: Learning rigid body motion using deep neural networks," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 173–180.
[16] C. Finn and S. Levine, "Deep visual foresight for planning robot motion," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2786–2793.
[17] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li et al., "Imagination-augmented agents for deep reinforcement learning," arXiv preprint arXiv:1707.06203, 2017.
[18] A. Bosselut, O. Levy, A. Holtzman, C. Ennis, D. Fox, and Y. Choi, "Simulating action dynamics with neural process networks," in International Conference on Learning Representations, 2018.
[19] I. Lenz, R. A. Knepper, and A. Saxena, "DeepMPC: Learning deep latent features for model predictive control," in Robotics: Science and Systems, 2015.
[20] R. Pascanu, Y. Li, O. Vinyals, N. Heess, L. Buesing, S. Racanière, D. Reichert, T. Weber, D. Wierstra, and P. Battaglia, "Learning model-based planning from scratch," arXiv preprint arXiv:1707.06170, 2017.
[21] Y. Bisk, D. Yuret, and D. Marcu, "Natural language communication with robots," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 751–761.
[22] Y. Bisk, K. J. Shih, Y. Choi, and D. Marcu, "Learning interpretable spatial operations in a rich 3D blocks world," in Proceedings of the Thirty-Second Conference on Artificial Intelligence (AAAI-18), 2017.
[23] J. Thomason, J. Sinapov, M. Svetlik, P. Stone, and R. Mooney, "Learning multi-modal grounded linguistic semantics by playing 'I spy'," in Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), July 2016, pp. 3477–3483.
[24] J. Thomason, J. Sinapov, R. J. Mooney, and P. Stone, "Guiding exploratory behaviors for multi-modal grounding of linguistic descriptions," in Proceedings of the Thirty-Second Conference on Artificial Intelligence (AAAI-18), 2018.
[25] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," in Proceedings of the International Symposium on Experimental Robotics, 2016.
[26] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, "Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics," 2017.
[27] D. Morrison, P. Corke, and J. Leitner, "Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,"
Proceed-ings of the 2018 Robotics: Science and Systems Conference , 2018.[28] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
NeuralComputation , vol. 9, no. 8, pp. 1735–1780, 1997.[29] O. Press and L. Wolf, “Using the output embedding to improvelanguage models,” in
Proceedings of the 15th Conference of theEuropean Chapter of the Association for Computational Linguistics:Volume 2, Short Papers . Association for Computational Linguistics,April 2017, pp. 157–163.[30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-tion,” in ,2015.[31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Proceedings of the 2014 Conference on Neural Information ProcessingSystems , 2014, pp. 2672–2680.[32] J. J. Kuffner Jr and S. M. LaValle, “Rrt-connect: An efficient approachto single-query path planning,” in
ICRA , vol. 2, 2000. - I . T EMPLATED VS R EAL L ANGUAGE
Table A1 contains examples of the different types of language that occur when asking humans to describe the action versus using templated language.
TABLE A1 (the Before and After image columns are not reproduced here):

Unknown concepts
  Template: place yellow block on the red block
  Human: stack warm colors

Coordinate system
  Template: stack the red block on the green one
  Human: move red right to same x and y axis as green

Spatial language
  Template: place green on the yellow one
  Human: move the green box forward three spaces

Latent details about hand movement
  Template: stack the blue one on yellow
  Human: take the blue block in your hand and raise it above the table. move the block back and to the right until it is directly above the yellow block. lower the blue block down onto the yellow block and release it

Denotes specific nuance
  Template: put the yellow one on the green block
  Human: move the yellow cube to the right until it is on top of the green cube with the front half of the yellow cube touching the far half of the top of the green cube
TABLE A1: The initial and final visual frames for each task, next to the template language and human descriptions for examples from our training set. These examples illustrate why it can be so difficult for a model to predict specific motions that correspond to a particular natural language command, and further justify our approach of visualizing robot actions before execution. The specific reason why each description is difficult to ground is indicated in bold.

A-II. PREDICTION RESULTS
One advantage of the proposed DREAMCELL system is that it allows us to generate multiple hallucinations of possible futures. Here, we show example plans generated from four unseen test environments, given a natural-language prompt. We show predictions for the first four high-level actions: align, grasp, lift, and move to. Environments and trials were chosen at random, and should be indicative of performance on the prospection problem.

Prompts: "put red on blue"; "put blue on the other one"; "take the red block in your hand and lift it off the table and move it to the blue block and lower it and open your hand"; "stack warm colors".
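The sampling procedure above can be illustrated with a minimal, hypothetical sketch. This is not the paper's actual architecture or API: the names `predict_next` and `hallucinate` and the toy Gaussian-perturbation transition are assumptions standing in for a learned predictive model that decodes images of future scenes. The structure it shows is only the outer loop: drawing several stochastic rollouts over the fixed high-level action sequence, one predicted state per action.

```python
import random

# The four high-level actions visualized in the prediction results.
ACTIONS = ["align", "grasp", "lift", "move_to"]

def predict_next(state, action, rng):
    """Toy stand-in for a learned transition model: perturb a latent state.

    A DREAMCELL-style model would instead predict (and decode) the scene
    resulting from executing `action`; here we just add Gaussian noise so
    that different samples yield different futures.
    """
    return [s + rng.gauss(0.0, 0.1) for s in state]

def hallucinate(state, actions, n_samples, seed=0):
    """Sample n_samples prospective rollouts over a high-level action plan.

    Each rollout is a list of (action, predicted_state) pairs, one per
    high-level action, so a user can inspect the imagined consequences
    before any action is executed.
    """
    rollouts = []
    for i in range(n_samples):
        rng = random.Random(seed + i)  # independent noise per sample
        s, frames = state, []
        for a in actions:
            s = predict_next(s, a, rng)
            frames.append((a, s))
        rollouts.append(frames)
    return rollouts

# Three hallucinated futures for one (toy) initial latent state.
futures = hallucinate([0.0, 0.0], ACTIONS, n_samples=3)
```

Varying the per-sample seed is what produces multiple distinct hallucinations from the same prompt, mirroring the N sampled futures shown per environment.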