Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
Peter A. Jansen
School of Information, University of Arizona, Tucson, AZ [email protected]
Abstract
The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as “put a hot piece of bread on a plate”. Currently, the best-performing models are able to complete less than 5% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone, without any visual input, in 26% of unseen cases. When a small amount of visual information is incorporated, namely the starting location in the virtual environment, our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases. Our results suggest that contextualized language models may provide strong visual semantic planning modules for grounded virtual agents.
Simulated virtual environments with steadily increasing fidelity are allowing virtual agents to learn to perform high-level tasks that couple language understanding, visual planning, and embodied reasoning through sensorimotor grounded representations (Gordon et al., 2018; Puig et al., 2018; Wijmans et al., 2019). The ALFRED challenge task recently proposed by Shridhar et al. (2020) requires a virtual robotic agent to complete everyday tasks (such as “put cold apple slices on the table”) in one of 120 interactive virtual home environments by generating and executing complex visually-grounded semantic plans that involve movable objects, irreversible state changes, and an egocentric viewpoint.
Figure 1: An example of the ALFRED grounded language task. In this work, we focus on visual semantic planning: from the textual directive alone (top), e.g. “Wash the fork and put it away”, our model predicts a visual semantic plan of {command, argument} tuples, e.g. {goto, countertop}, {pick up, fork}, {goto, sink basin}, {clean, fork}, {goto, drawer}, {put, fork, drawer}, that matches the gold plan without requiring visual input (images).

Integrating natural language task directives with one of the most complex interactive virtual agent environments to date is challenging, with the current best-performing systems successfully completing less than 5% of ALFRED tasks in unseen environments (https://leaderboard.allenai.org/alfred/), while common baseline models generally complete less than 1% of tasks successfully.

In this work we explore the visual semantic planning task in ALFRED, where the high-level natural language task directive is converted into a detailed sequence of actions in the AI2-THOR 2.0 virtual environment (Kolve et al., 2017) that will accomplish that goal (see Figure 1). In contrast to previous approaches to visual semantic planning (e.g. Zhu et al., 2017; Fried et al., 2018; Fang et al., 2019), we explore the performance limits of this task solely using goals expressed in natural language as input, that is, without visual input from the virtual environment. The contributions of this work are:

1. We model visual semantic planning as a sequence-to-sequence translation problem, and demonstrate that our best-performing GPT-2 model can translate between natural language directives and sequences of gold visual semantic plans in 26% of cases without visual input.

2. We show that when a small amount of visual input is available, namely the starting location in the virtual environment, our best model can successfully predict 58% of unseen visual semantic plans.

3. Our detailed error analysis suggests that repairing predicted plans with correct locations and fixing artifacts in the ALFRED dataset could substantially increase the performance of this and future models.

Models for completing multi-modal tasks can achieve surprising performance using information from only a single modality. The Room-to-Room (R2R) visual language navigation task (Anderson et al., 2018) requires agents to traverse a discrete scene graph and arrive at a destination described using natural language. In ablation studies, Thomason et al. (2019) found that models using input from a single modality (either vision or language) often performed nearly as well as, or better than, their multi-modal counterparts on R2R and other visual QA tasks. Similarly, Hu et al. (2019) found that two state-of-the-art multi-modal agents performed significantly worse on R2R when using both linguistic and visual input instead of a single modality, while also showing that performance can improve by combining separate-modality models into mixture-of-expert ensembles.

Where R2R requires traversing a static scene graph using locomotive actions, ALFRED is a dynamic environment requiring object interaction for task completion, and has a substantially richer action sequence space that includes 8 high-level actions. This work extends these past comparisons of unimodal vs. multimodal performance by demonstrating that strong performance on visual semantic planning is possible in a vastly more complex virtual environment using language input alone, through the use of generative language models.
We approach the task of converting a natural language directive into a visual semantic plan (a series of commands that achieve that directive in a virtual environment) as a purely textual sequence-to-sequence translation problem, similar to conversion from Text-to-SQL (e.g. Yu et al., 2018; Guo et al., 2019). Here we examine two embedding methods that encode language directives and decode command sequences.
RNN:
A baseline encoder-decoder network for sequence-to-sequence translation tasks (e.g. Bahdanau et al., 2015), implemented using recurrent neural networks (RNNs). One RNN serves as an encoder for the input sequence, here the tokens representing the natural language directive. A decoder RNN with attention uses the context vector of the encoder network to translate into output sequences of command triples representing the visual semantic plan. Both encoder and decoder networks are pre-initialized with 300-dimensional GLoVE embeddings (Pennington et al., 2014).
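For concreteness, a minimal sketch of such an encoder-decoder (PyTorch, GRU units, dot-product attention) is shown below; the hidden size, attention variant, and GLoVE initialization hook are illustrative assumptions, not the exact architecture used.

```python
# Minimal sketch of an encoder-decoder with attention for directive-to-plan
# translation (PyTorch). Hidden sizes, the dot-product attention variant, and
# the GLoVE initialization hook are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256, glove_weights=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if glove_weights is not None:                        # pre-initialize with GLoVE vectors
            self.embedding.weight.data.copy_(glove_weights)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                                  # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embedding(src))      # outputs: (batch, src_len, hidden)
        return outputs, hidden

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256, glove_weights=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if glove_weights is not None:
            self.embedding.weight.data.copy_(glove_weights)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim * 2, vocab_size)     # [decoder state; context] -> vocab

    def forward(self, tgt_step, hidden, encoder_outputs):
        # tgt_step: (batch, 1), the current target token; one decoding step.
        emb = self.embedding(tgt_step)
        output, hidden = self.rnn(emb, hidden)                # output: (batch, 1, hidden)
        # Dot-product attention over the encoder states.
        scores = torch.bmm(output, encoder_outputs.transpose(1, 2))  # (batch, 1, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, encoder_outputs)         # (batch, 1, hidden)
        logits = self.out(torch.cat([output, context], dim=-1))
        return logits, hidden
```

At inference time the decoder would be run one step at a time from the encoder's final hidden state, feeding back its own predictions until an end-of-sequence token is produced.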
GPT-2:
The OpenAI GPT-2 transformer model (Radford et al., 2019), used in a text generation capacity. We fine-tune the model on sequences of natural language directives paired with gold command sequences separated by delimiters (i.e. “<Directive> [SEP] <CommandTuple_1> [CSEP] <CommandTuple_2> [CSEP] ... [CSEP] <CommandTuple_N> [EOS]”). During evaluation we provide the prompt “<Directive> [SEP]”, and the model generates a command sequence until producing the end-of-sequence (EOS) marker. We make use of nucleus sampling (Holtzman et al., 2020) to select only tokens from the set of most likely tokens during generation, with p = 0., but do not make use of top-k filtering (Fan et al., 2018) or penalize repetitive n-grams, which are commonly used in text generation tasks but are inappropriate here for converting to the often repetitive (at the scale of bigrams) command sequences. For tractability we make use of the GPT-2 Medium pre-trained model, which contains 24 layers, 16 attention heads, and 325M parameters. During evaluation, task directives are sorted into same-length batches to prevent generation artifacts from padding and to maintain high generation quality.

Negative results not reported for space: We hypothesized that separating visual semantic plans into variablized action-sequence templates and variable-value assignments, represented as separate decoders, would help models learn to separate the general formula of action sequences from specific instances of objects in action sequences, which has been shown to help in Text-to-SQL translation (Guo et al., 2019). Pilot experiments with both RNNs and transformer models yielded slightly lower results than vanilla models. Language modeling: In addition to GPT-2 we also piloted XLNet, but perplexity remained high even after significant fine-tuning.
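As a concrete illustration of this fine-tuning format and decoding setup, a minimal sketch using the Hugging Face transformers library is shown below; the delimiter tokens follow the description above, while the nucleus threshold value, helper names, and generation length are assumptions made for illustration rather than the exact configuration used.

```python
# Illustrative sketch (not the paper's released code): delimited formatting and
# nucleus-sampled decoding of ALFRED command sequences with GPT-2 Medium.
# Assumes the Hugging Face `transformers` library; the nucleus threshold and
# helper names are assumptions for illustration.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

SEP, CSEP, EOS = "[SEP]", "[CSEP]", "[EOS]"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
tokenizer.add_special_tokens({"additional_special_tokens": [SEP, CSEP, EOS]})
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.resize_token_embeddings(len(tokenizer))  # account for the new delimiter tokens

def to_training_string(directive, command_tuples):
    """Serialize one directive/plan pair into the delimited fine-tuning format
    '<Directive> [SEP] <tuple> [CSEP] ... <tuple> [EOS]'."""
    plan = f" {CSEP} ".join(" ".join(t) for t in command_tuples)
    return f"{directive} {SEP} {plan} {EOS}"

def generate_plan(directive, p=0.9, max_length=256):
    """Generate a command sequence for a directive with nucleus sampling
    (top-k filtering disabled), stopping at the [EOS] delimiter."""
    prompt = f"{directive} {SEP}"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        do_sample=True,
        top_p=p,                  # nucleus sampling threshold (assumed value)
        top_k=0,                  # no top-k filtering
        max_length=max_length,
        eos_token_id=tokenizer.convert_tokens_to_ids(EOS),
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=False)
    # Everything after [SEP] and before [EOS] is the predicted plan.
    plan = text.split(SEP, 1)[1].split(EOS, 1)[0]
    return [t.strip() for t in plan.split(CSEP)]

example = to_training_string(
    "Wash the fork and put it away",
    [("goto", "countertop"), ("pick up", "fork"), ("goto", "sink basin"),
     ("clean", "fork"), ("goto", "drawer"), ("put", "fork", "drawer")])
print(example)
```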
Model     Command   Arg1     Arg2     Full Triples   Full Sequence   Full Minus First

Strict Scoring
RNN        89.6%    64.8%    58.4%       60.2%           17.1%            43.6%
GPT-2      90.8%    69.9%    63.8%       65.8%           22.2%            53.4%

Permissive Scoring
RNN        89.6%    70.6%    61.4%       65.9%           23.6%            51.6%
GPT-2      90.8%    73.8%    65.1%       69.4%           26.1%            58.2%

Table 1: Average prediction accuracy on the unseen test set, broken down by triple components (Command, Arg1, Arg2), full triples, and entire visual semantic plans. Full Sequence accuracy represents the proportion of predicted visual semantic plans that perfectly match gold plans. Full Minus First represents the same, but omitting the first tuple, typically a {goto, location} that moves the agent to the starting location in the virtual environment (see description in text).

Per-command triple accuracy (Goto, Pickup, Put, Cool, Heat, Clean, Slice, Toggle, Avg.):
RNN:    59, 81, 60, ..., 67, 91, 66
GPT-2:  63, 84, 66, ..., 70, 94, 69

Table 2: Average triple prediction accuracy on the test set, broken down into each of the 8 possible ALFRED commands. Values represent percentages. Goto has an N of 24k, Pickup an N of 11k, and Put an N of 10k; all other commands occur approximately 1,000 times in the test dataset.

Dataset:
The ALFRED dataset contains 6,574 gold command sequences representing visual semantic plans, each paired with 3 natural language directives describing the goal of those command sequences (e.g. “put a cold slice of lettuce on the table”) authored by mechanical turkers. High-level command sequences range from 3 to 20 commands (average 7.5), and are divided into 7 high-level categories (such as examine object in light, pick two objects then place, and pick then cool then place). Commands are represented as triples that pair one of 8 actions (goto, pickup, put, cool, heat, clean, slice, and toggle) with up to two arguments, typically the object of the action (such as slicing a “lettuce”) and an optional receptacle (such as putting a spoon in a “mug”). Arguments can reference 58 possible objects (e.g. butter knife, chair, or apple) and 26 receptacles (e.g. fridge, microwave, or bowl). To prevent knowledge of the small unseen test set for the full task, here we redivide the large training set into three smaller train, development, and test sets of 7,793, 5,661, and 7,571 gold directive/command-sequence pairs, respectively.

Processing Pipeline:
Command sequences are read in as sequences of {command, arg1, arg2} triples, converted into natural language using completion heuristics (e.g. “{put, spoon, mug}” → “put the spoon in the mug”), and augmented with argument delimiters to aid parsing (e.g. “put <arg1> the spoon <arg2> in the mug”). Input directives are tokenized, but receive no other preprocessing. Generated strings from all models are post-processed for common errors in sequence-to-sequence models, including token doubling, completing missing bigrams (e.g. “pick <arg1>” → “pick up <arg1>”), and heuristics for adding missing argument tags. Post-processed output sequences are then parsed and converted back into {command, arg1, arg2} tuples for evaluation.
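A minimal sketch of this round trip (triples to delimiter-tagged text, and generated text back to triples) is shown below; the completion templates and the parsing regular expression are illustrative assumptions rather than the exact heuristics used.

```python
# Sketch of the command-sequence (de)serialization described above: triples are
# verbalized with argument delimiters, and generated text is parsed back into
# {command, arg1, arg2} tuples. The completion templates and the regex are
# illustrative assumptions, not the exact heuristics used in the paper.
import re

# Hypothetical completion templates keyed by command.
TEMPLATES = {
    "put":    "put <arg1> the {arg1} <arg2> in the {arg2}",
    "goto":   "go to <arg1> the {arg1}",
    "pickup": "pick up <arg1> the {arg1}",
}

def triple_to_text(command, arg1, arg2=None):
    """Verbalize a {command, arg1, arg2} triple with <argN> delimiters."""
    template = TEMPLATES.get(command, command + " <arg1> the {arg1}")
    return template.format(arg1=arg1, arg2=arg2)

ARG_PATTERN = re.compile(
    r"<arg1>\s*(?:the\s+)?(.*?)(?:\s*<arg2>\s*(?:in|on)?\s*(?:the\s+)?(.*))?$")

def text_to_triple(text):
    """Parse a generated, delimiter-tagged string back into a triple.
    Mapping verbalized commands (e.g. 'go to' -> 'goto') back to canonical
    command names would additionally need an inverse lookup."""
    command_part, _, args = text.partition("<arg1>")
    match = ARG_PATTERN.search("<arg1>" + args)
    arg1, arg2 = (match.group(1), match.group(2)) if match else (None, None)
    return (command_part.strip(),
            arg1.strip() if arg1 else None,
            arg2.strip() if arg2 else None)

print(triple_to_text("put", "spoon", "mug"))
# -> "put <arg1> the spoon <arg2> in the mug"
print(text_to_triple("put <arg1> the spoon <arg2> in the mug"))
# -> ("put", "spoon", "mug")
```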
Evaluation Metrics:

Performance in translating between natural language directives and sequences of command triples is evaluated in terms of accuracy at the command-element (command, argument1, argument2), triple, and full-sequence level. Because our generation includes only textual input and no visual input for a given virtual environment, commands may be generated that reference objects that do not exist in a scene (such as generating an action to toggle a “lamp” to examine an object, when the environment specifically contains a “desk lamp”). As such we include two scoring metrics: a strict metric that requires exact matching of each token in an argument to be counted as correct, and a permissive metric that requires matching only a single token within an argument to be correct.
Strict scoring: butter knife ≠ knife.  Permissive scoring: desk lamp = lamp.

All accuracy scoring is binary. Triples receive a score of one if all elements in a given gold and predicted triple are identical, and zero otherwise. Full-sequence scoring directly compares <CommandTuple_i> for each i in the gold and predicted sequences, and receives a score of one only if all triples are identical and in identical locations i, and zero otherwise.
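These scoring rules can be summarized in a short sketch; whitespace tokenization of arguments and the handling of empty arguments are assumptions.

```python
# Sketch of the strict and permissive scoring rules described above. Argument
# tokenization (whitespace splitting) and the handling of empty arguments are
# illustrative assumptions.

def args_match(gold_arg, pred_arg, permissive=False):
    """Strict: arguments must match exactly. Permissive: sharing any single
    token (e.g. 'lamp' vs. 'desk lamp') counts as a match."""
    if gold_arg == pred_arg:
        return True
    if permissive and gold_arg and pred_arg:
        return bool(set(gold_arg.split()) & set(pred_arg.split()))
    return False

def triple_correct(gold, pred, permissive=False):
    """A triple scores 1 only if the command and both arguments match."""
    g_cmd, g_a1, g_a2 = gold
    p_cmd, p_a1, p_a2 = pred
    return (g_cmd == p_cmd
            and args_match(g_a1, p_a1, permissive)
            and args_match(g_a2, p_a2, permissive))

def full_sequence_correct(gold_seq, pred_seq, permissive=False):
    """A plan scores 1 only if every triple matches at the same position i."""
    return (len(gold_seq) == len(pred_seq)
            and all(triple_correct(g, p, permissive)
                    for g, p in zip(gold_seq, pred_seq)))

gold = [("goto", "countertop", None), ("pickup", "butter knife", None)]
pred = [("goto", "countertop", None), ("pickup", "knife", None)]
print(full_sequence_correct(gold, pred, permissive=False))  # False: 'butter knife' != 'knife'
print(full_sequence_correct(gold, pred, permissive=True))   # True: shares the token 'knife'
```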
Incorrect Arguments
  45%  Predicted wrong location
   4%  Predicted wrong object
  Example (predicted wrong location):
    (G) ... slice lettuce, put knife on countertop, put lettuce in fridge, ...
    (P) ... slice lettuce, put knife in microwave, put lettuce in fridge, ...

Incorrect Triples
  22%  Offset due to extra/missing actions
  22%  Predicted extra (incorrect) actions
  12%  Predicted missed actions
   7%  Predicted extra (not harmful) actions
   5%  Order of actions swapped
  Example (predicted extra, not harmful, action †, which introduced an offset error ‡):
    Instructions: Put a mug with a spoon in the sink.
    (G) ... pick up mug, put mug in sink basin ‡
    (P) ... pick up mug, go to sink basin †, put mug in sink basin ‡

Instruction Errors
  17%  Gold instructions incorrect
  13%  Gold instructions incomplete
  Example:
    Instructions: Put a heated mug in the microwave.
    (G) ... go to microwave, heat mug, go to cabinet, put mug in cabinet
Table 3: Common classes of prediction errors in the GPT-2 model, and their proportions in 100 predictions from the development set, with example errors, where (G) and (P) represent subsets of gold and predicted visual semantic plans, respectively.

Performance of the embedding models is reported in Table 1, broken down by triple components, full triples, and full sequences. Both models achieve approximately 90% accuracy in predicting the correct commands, in the correct location i in the sequence. Arguments are predicted less accurately, with the RNN model predicting 65% and 58% of first and second arguments correctly, respectively. The GPT-2 model increases performance on argument prediction by approximately +5%, reaching 70% and 64% under strict match scoring. Permissive scoring, allowing for partial matches between arguments (e.g. “lamp” and “desk lamp” are considered equivalent), further increases argument scoring to approximately 74% and 65% in the best model. Scoring by complete triples in the correct location i shows a similar pattern of performance, with the best-scoring GPT-2 model achieving 66% accuracy using strict scoring and 69% under permissive scoring, with triple accuracy broken down by command shown in Table 2.

Fully-correct predicted sequences of commands that perfectly match gold visual semantic plans using only the text directives as input (i.e. without visual input from the virtual environment) occur in 17% of unseen test cases with the RNN model, and 22% of cases with the GPT-2 model, highlighting how detailed and accurate visual plans can be constructed from text input alone in a large subset of cases. In analyzing the visual semantic plans, the first command is typically to move the virtual agent to a starting location that contains the first object it must interact with (for example, moving to the countertop, where a potato is resting in the initialized virtual environment, to begin a directive about slicing, washing, and heating a potato slice). If we supply the model with this single piece of visual information from the environment, full-sequence prediction accuracy for all models more than doubles, increasing to 53% in the strict condition, and 58% with permissive scoring, for the best-performing GPT-2 model.

Tuning and Computational Resources: RNN models required approximately 100k epochs of training to reach convergence over 12 hours, requiring 8GB of GPU RAM. GPT-2 models asymptoted in performance at 25 epochs, requiring 6 hours of training and 16GB of GPU RAM. All experiments were conducted using an NVIDIA Titan RTX.
Table 3 shows an analysis of common categories of errors in 100 directive/visual semantic plan pairs randomly drawn from the development set that were not answered correctly by the best-performing GPT-2 model that includes the starting location for the first step. As expected, a primary source of error is the lack of visual input in generating the visual plans, with the most common error, predicting the wrong location in an argument, occurring in 45% of errors. Conversely, predicting the wrong object to interact with occurred in only 4% of errors, as this information is often implicitly or explicitly supplied in the text directive. This suggests that augmenting the model with object locations from the environment could mend prediction errors in nearly half of all errorful plans.

The GPT-2 model predicted additional (incorrect) actions in 22% of errorful predictions, while missing key actions in 12% of errors, causing offset errors in sequence matching that reduced overall performance in nearly a quarter of cases. In a small number of cases, the model predicted extra actions that were not harmful to completing the goal, or switched the order of sets of actions that could be completed independently (such as picking up and moving two different objects to a single location). In both cases the virtual agent would likely have been successful in completing the directive if following these plans.

A final significant source of error includes inconsistencies in the crowdsourced text directives or gold visual semantic plans themselves. In 17% of errors, the gold task directive had a mismatch with the objects referenced in the gold commands (e.g. the directive referenced a watering can, where the gold annotation references a tea pot), and automated scoring marked the predicted sequence as incorrect. Similarly, in 13% of cases, the task directive failed to mention one or more subtasks (e.g. the directive is “turn on a light”, but the gold command sequence also includes first retrieving a specific object to examine in the light). This suggests that nearly one-third of errors may be due to issues in the evaluation data, and that overall visual semantic plan generation performance may be significantly higher.

An unexpected source of error is that our GPT-2 planner frequently prefers to store used cutlery in either the fridge or microwave, creating a moderate fire hazard. Interestingly, this behavior appears learned from the training data, which frequently stores cutlery in unusual locations. Disagreements on discarded cutlery locations occurred in 15% of all errors.
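As a rough illustration of the location-grounding suggestion above (not part of the models evaluated here), a post-hoc repair step that substitutes environment-provided object locations into predicted goto commands might look like the following; the object-location lookup and the repair policy are assumptions.

```python
# Illustrative sketch of the location-repair idea discussed above: if a
# predicted {goto, location} disagrees with where the object to be picked up
# actually starts in the environment, substitute the environment-provided
# location. The object_locations mapping and repair policy are assumptions.

def repair_locations(plan, object_locations):
    """plan: list of (command, arg1, arg2) triples.
    object_locations: mapping from object name to its starting receptacle or
    location in the initialized environment (available from the simulator)."""
    repaired = list(plan)
    for i, (cmd, arg1, arg2) in enumerate(plan[:-1]):
        nxt_cmd, nxt_arg1, _ = plan[i + 1]
        # Only repair a goto whose purpose is to reach an object to pick up;
        # that object's true starting location is known from the environment.
        if cmd == "goto" and nxt_cmd == "pickup":
            true_loc = object_locations.get(nxt_arg1)
            if true_loc is not None and true_loc != arg1:
                repaired[i] = (cmd, true_loc, arg2)  # replace hallucinated location
    return repaired

plan = [("goto", "countertop", None), ("pickup", "mug", None),
        ("goto", "sink basin", None), ("put", "mug", "sink basin")]
print(repair_locations(plan, {"mug": "coffee machine"}))
# The first goto is redirected to the coffee machine, where the mug actually starts.
```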
To examine how performance varies with the amount of training data available, we randomly downsampled the training data to a range of fractions of its original size. This analysis, shown in Figure 2, demonstrates that relatively high performance on the visual semantic prediction task is still possible with comparatively little training data. When only 10% of the original training data is used, average prediction accuracy reduces by 24%, but still reaches 44%. In the few-shot case (1% downsampling), where each of the 7 ALFRED tasks observes only 4 gold command sequences each (for a total of 12 natural language directives per task) during training, the GPT-2 model is still able to generate an accurate visual semantic plan in 8% of cases. Given that large pre-trained language models have been shown to encode a variety of commonsense knowledge as-is, without fine-tuning (Petroni et al., 2019), it is possible that some of the model's few-shot performance on ALFRED may be due to an existing knowledge of similar common everyday tasks.

Figure 2 (line plot; x-axis: number of training epochs; y-axis: prediction accuracy, %): Average prediction accuracy as a function of training set size (as a proportion of the full training set) for the GPT-2 model on the test set. Even with a large reduction in training data, the model is still able to accurately predict a large number of visual semantic plans. Performance represents the permissive scoring metric in the “full minus first” condition in Table 1.
We empirically demonstrate that detailed gold visual semantic plans can be generated for 26% of unseen task directives in the ALFRED challenge using a large pre-trained language model without visual input from the simulated environment, where 58% can be generated if starting locations are known. We envision these plans may be used either as-is, or as an initial “hypothetical” plan of how the model believes the task might be solved in a generic environment, that is then modified based on visual or other input from a specific environment to further increase overall accuracy. We release our planner code, data, predictions, and analyses for incorporation into end-to-end systems at: http://github.com/cognitiveailab/alfred-gpt2/.

References
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. 2019. Scene memory transformer for embodied agents in long-horizon tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 538–547.

Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 3314–3325.

Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. 2018. IQA: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4089–4098.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524–4535.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations (ICLR).

Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, and Kate Saenko. 2019. Are you looking? Grounding to multiple modalities in vision-and-language navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6551–6557, Florence, Italy. Association for Computational Linguistics.

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. 2017. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. 2018. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Computer Vision and Pattern Recognition (CVPR).

Jesse Thomason, Daniel Gordon, and Yonatan Bisk. 2019. Shifting the baseline: Single modality performance on visual navigation & QA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1977–1983, Minneapolis, Minnesota. Association for Computational Linguistics.

Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, and Dhruv Batra. 2019. Embodied question answering in photorealistic environments with point cloud perception. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, and Ali Farhadi. 2017. Visual semantic planning using deep successor representations. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).