Are We There Yet? Learning to Localize in Embodied Instruction Following
Shane Storks, Qiaozi Gao, Govind Thattai, Gokhan Tur
University of Michigan; Amazon Alexa
[email protected], {qzgao, thattg, gokhatur}@amazon.com

Abstract
Embodied instruction following is a challenging problem requiring an agent to infer a sequence of primitive actions to achieve a goal environment state from complex language and visual inputs. Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem consisting of step-by-step natural language instructions to achieve subgoals which compose to an ultimate high-level goal. Key challenges for this task include localizing target locations and navigating to them through visual inputs, and grounding language instructions to the visual appearance of objects. To address these challenges, in this study, we augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep. We also improve language grounding by introducing a pre-trained object detection module to the model pipeline. Empirical studies show that our approach exceeds the baseline model performance.
Introduction
As in-home robots become a reality, there are still fundamental problems along the intersection of vision, language, and robotics to be solved. A particularly challenging problem is embodied instruction following, where an agent must complete a task by following a human teacher's language instructions. The language instructions may require the agent to navigate through a space, and manipulate objects in the space. Thus, navigation skills and language grounding are essential capabilities for such agents, and this research area has received significant attention in recent years.

While much progress has been made toward this in graph-structured, discrete navigation environments (Anderson et al. 2018; Zhu et al. 2020; Majumdar et al. 2020), the problem remains challenging in continuous navigation environments. Action Learning From Realistic Environments and Directives (ALFRED) is a recently introduced benchmark dataset for embodied task learning, providing human-written natural language instructions for an agent to perform tasks in a virtual environment with continuous navigation (Shridhar et al. 2020). We focus our work on this challenging benchmark.

Figure 1: To improve an embodied agent's navigation capability, we use enhanced inputs from the environment to track the direction of the goal location from the agent's perspective at every timestep (colored dotted lines). This signal guides the agent during navigation. (Example instructions: "Walk to the counter. Pick up the knife. ..."; predicted actions over timesteps: 1. RotateLeft, 2. MoveAhead, ..., 10. STOP.)

To address key weaknesses of existing approaches for ALFRED, we implement several enhancements in this work. First, we train the model on a step-by-step granularity of language instructions. Next, we augment the inputs for navigation by adding panoramic visual observations at each timestep of navigation, and object segmentation masks of all objects in the environment. We use these inputs to train an object detection module that can be used at inference time. Lastly, we propose a novel, transformer-based localization module which we train to predict the agent's orientation angle with respect to the goal location. This module is then used at inference time to guide the agent during navigation, as shown in Figure 1. From these modifications, we show some improvements over a strong baseline approach.
Related Work
Language, vision, and robotics.
Within the intersection of language, vision, and robotics, there are several threads of work toward embodied instruction following, navigation, and localization within a virtual environment. As such problems rely on the availability of virtual environments and large-scale data, many of these threads have spawned from the release of benchmarks and challenge datasets. Chang et al. (2017), Kolve et al. (2017), Yan et al. (2018), Savva et al. (2019), and Xia et al. (2019) propose various virtual environments for embodied AI research, each with their own considerations and features. Das et al. (2018) framed the traditional task of QA in a virtual environment, posing questions that require an agent to explore an environment to find the answer. Qi et al. (2020) propose a remote object grounding challenge, where an embodied agent must find a particular object in a virtual environment. Xia et al. (2020) frame robotic motion planning in a virtual environment with movable objects as a reinforcement learning problem. Anderson et al. (2018) propose the vision-and-language navigation (VLN) task, where an agent must use human-written language instructions to reach a goal location in a photorealistic virtual environment, and Thomason et al. (2019) expand this task to a dialog-based formulation.

Vision-and-language navigation.
A particularly substantial body of work toward navigation in embodied instruction following exists in the VLN thread. Fried et al. (2018) introduced a pragmatic reasoning module to the problem that learned to re-tell candidate navigation routes in natural language, and used this output to re-rank the routes. Jain et al. (2019) introduced a new graph-based metric for navigation fidelity, and used this as a new training objective to improve the fidelity of agents to navigation instructions. Ma et al. (2019a) introduced another auxiliary training objective for the agent to estimate its progress toward the goal location, and Ma et al. (2019b) expanded upon this to give the agent the ability to backtrack during navigation. Hu et al. (2019) improved language grounding in navigation agents by using a pre-trained object detection system to augment visual inputs with localized representations of objects in the scene. Zhu et al. (2020) combined several previously proposed and novel self-supervised auxiliary training objectives for VLN to achieve state-of-the-art performance on the task. Majumdar et al. (2020) pre-trained a transformer on a large-scale language grounding task to further improve performance.
ALFRED Benchmark and Baseline
In this work, we focus on the ALFRED benchmark. ALFRED is a recent embodied task learning benchmark where an agent must complete household tasks to achieve a specific goal state in a virtual environment powered by AI2-THOR (Kolve et al. 2017). First, we introduce the problem and the existing baseline approach formally.
Problem Formulation
An instance of ALFRED consists of a high-level goal G, which consists of a sequence of N subgoals g_i ∈ G. Each subgoal may be a navigation subgoal or a manipulation subgoal, e.g., to pick up, put down, heat, or cool an object. Navigation subgoals consist of primitive navigation actions, e.g., to turn, move forward, or change the vertical heading angle, while manipulation subgoals consist mostly of primitive manipulation actions, e.g., to pick up, slice, or toggle an object. A typical ALFRED instance consists of alternating navigation and manipulation subgoals, and all primitive actions are performed sequentially.

These subgoals, when completed in order, achieve a goal state S* in the virtual environment. The internal representation of goals and subgoals is not provided for inference. Instead, the agent needs to reason from the associated human-annotated natural language instructions. From language instructions L_G and L_i, i ∈ {1, 2, ..., N}, describing the goal and subgoals respectively, as well as visual observations v_t at each timestep t, the agent must predict an action a_t and (if applicable) a binary mask m_t over v_t which highlights an object to interact with. After an episode length of T timesteps, an inferred sequence of predicted actions a_1, a_2, ..., a_T and masks m_1, m_2, ..., m_T is considered successful if and only if the final state S_T is equal to S*.

Evaluation
ALFRED employs three primary modes of evaluation, which we call action-by-action, subgoal-by-subgoal, and goal-by-goal. Action-by-action evaluation is used while training the model to select the best model instance during hyperparameter search. It aims to score only the action sequence predicted by the agent by comparing it to the ground truth action sequence. To facilitate this evaluation, at each timestep, we simply supply the model with the ground truth action sequence up until that timestep, and compare the model's output action with the ground truth action. This gives us some idea of how well the model can choose the next primitive action based on language instructions and the ground truth episode history.

To evaluate the composition of both an agent's action and object interaction mask prediction in achieving particular subgoals, we use subgoal-by-subgoal evaluation. Here, we run the agent and underlying model through the ground truth episode up until a particular subgoal, then allow the agent to attempt to complete the subgoal. The episode continues until the agent successfully completes the subgoal, predicts stop, or reaches a time limit. The agent is considered successful if all subgoal conditions are satisfied, e.g., if the instructions direct the agent to pick up an object, that the object is now held by the agent.

To measure how well the agent's sequential predictions for all subgoals achieve the overall goal in each ALFRED task, we use goal-by-goal evaluation, the primary mode used for ranking on the ALFRED leaderboard. It places the agent into the virtual environment under the initial conditions, and then requires the agent to execute all steps to achieve all subgoals toward a goal. The episode ends when the agent predicts a stop action, reaches some limit of timesteps, or encounters some maximum number of API errors when interacting with the AI2-THOR environment. The agent is then evaluated based on whether the goal state holds in the environment.
An agent is considered successful if all goal conditions are satisfied, but we can also score the agent based on the proportion of goal conditions it achieved. For example, if a goal is given as "Rinse off a mug and place it in the coffee maker," the goal conditions may be that there exists a mug in the coffee maker, and that the mug is clean. (ALFRED leaderboard: https://leaderboard.allenai.org/alfred/submissions/public; example from Shridhar et al. (2020).)
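Goal-condition scoring can be sketched as follows. The dictionary-based state and predicate-style conditions are illustrative assumptions for this sketch; ALFRED represents goal conditions in its own internal format.

```python
def goal_condition_success(final_state, goal_conditions):
    """Score an episode against its goal conditions. Full success requires
    every condition to hold in the final state; the goal condition success
    rate gives partial credit as the proportion of satisfied conditions.
    Each condition is modeled here as a predicate over the state."""
    satisfied = [condition(final_state) for condition in goal_conditions]
    return all(satisfied), sum(satisfied) / len(satisfied)
```

For the mug example above, a rinsed mug left on the counter would satisfy one of the two conditions, scoring 0.5 without counting as a full success.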
Figure 2: At each timestep, the Seq2Seq model takes several encoded inputs: the visual observation v_t, all language instructions L_G + L_1 + ... + L_N (reweighted by an attention mechanism at every timestep), and the previous predicted action a_{t-1}. These encoded inputs are passed into an LSTM, which generates a history-dependent representation. The original inputs and this representation can be passed to a linear layer to predict a primitive action to take and (if a manipulation action is predicted) to a deconvolutional layer to predict a binary mask over the target object.

Baseline Approach
The baseline Seq2Seq approach is shown in Figure 2. First, we generate encoded representations of all language instructions for the example, the current visual observation image, and the previously predicted action. Language is encoded by a bidirectional LSTM (Hochreiter and Schmidhuber 1997), images are encoded by a frozen pre-trained ResNet (He et al. 2016) model, and an embedding is learned for actions. At each timestep, these inputs are passed to a decoder LSTM. The LSTM's hidden state along with these task inputs are used to predict a primitive action to take (through a linear layer), and a binary mask to be used if a manipulation action is predicted (through a deconvolutional layer). More detailed information on the implementation of the baseline model is provided by Shridhar et al. (2020).

Key Limitations
We identify several key opportunities for improvement of the baseline approach. First, the required sequence of actions is typically quite long before the goal state is achieved, requiring the completion of several subgoals, which each consist of several primitive actions. Such long-distance dependency is very challenging for current neural networks.

Second, the agent's performance in navigation is a bottleneck to the overall performance. Nearly every other subgoal in ALFRED task instances requires the agent to navigate to a new location; if the agent fails to arrive at the correct location, it will not find the objects it must manipulate afterward. Using the pre-trained weights of the baseline agent, we found that navigation subgoals achieved a subgoal-by-subgoal success rate of 61.9%, relatively low compared to many manipulation subgoal types, such as to cool or toggle an object, which both achieved over 90%. As navigation occurs before every manipulation subgoal and thus is a prerequisite to all other subgoal types, if we can improve the navigation performance, then we expect to improve the overall performance of the model.

We suspect that the navigation performance is so low for two reasons. First, because the baseline model is trained by imitation learning, there are no opportunities to explore during training. During navigation, a task that relies heavily on exploration, this becomes a challenge. The agent is trained only on expert demonstrations where the agent knows exactly where to go, even if the initial location is far away or out of view of the destination. However, if the agent encounters these conditions at inference time, it is unclear whether it could have learned any exploratory behavior to be able to find the destination.

Another possible factor is that the agent is not given any explicit training toward language grounding during navigation subgoals.
While the agent is trained to generate binary masks over the target objects of manipulation actions during manipulation subgoals, there is no such supervision available during navigation. Human navigation relies on first identifying landmarks, then finding routes between them (Siegel and White 1975). Navigation instructions in ALFRED often point out landmarks along the way, for example, "Turn around and walk towards the bed, then hang a left and walk up to the wall, turn left again and walk over to the right side of the wooden desk." Therefore, it is advantageous for the agent to identify landmarks, especially when a landmark refers to the destination.

Proposed Improvements
Considering the key limitations of the baseline model, we propose three main improvements. First, in order to lighten the load on the LSTM used to decode the sequence of actions, we train the model to execute only one subgoal at a time rather than the entire sequence of subgoals. Second, to enable better visual understanding and language grounding, we augment the task inputs with additional visual observations for panoramic view angles at each timestep, and apply an object detection module to all of the agent's visual observations. Lastly, to guide navigation, we use these augmented inputs to predict an additional input to the model at every timestep: the angle between the agent's view and the goal location. This input is predicted using a new localizer module before being passed into the LSTM. (ALFRED code: https://github.com/askforalfred/alfred)
Figure 3: For a navigation subgoal g_k, the multimodal BERT-based localizer takes text and spatial tokens as input. The text input consists of language instructions L_k and L_{k+1} for the current and next subgoal, while the spatial input is comprised of the coordinates (projected into panoramic space) and BERT-embedded object class labels of panoramic bounding boxes at timestep t. The pre-trained BERT model generates a spatial representation of the model's percepts and language instructions, and we input this representation for the special [CLS] token into a linear layer to predict d_t, the angle from the agent's orientation to the goal location.

Granular Training
The LSTM-based Seq2Seq baseline for ALFRED is trained to predict a sequence of primitive actions from a long sequence of text (consisting of language descriptions of the overall goal and all subgoals), visual observations of the agent, and previously executed actions. However, most subgoals are disentangled from one another with minimal cross-referencing. As such, we reduce the strain on the LSTM-based model by breaking the execution down into one subgoal at a time: for each instance of inference, the model receives the language instructions for one subgoal, and outputs the actions and interaction masks required to complete the subgoal. As the language instructions are delimited by subgoal at both training and test time, we can simply iterate through them to facilitate this.

Augmented Navigation
We augment the input data for ALFRED in two ways. First, for navigation subgoals, we add panoramic image observations to the visual input at each timestep. Second, we train an object detection model to identify objects in all visual input images.
Panoramic Visual Observations
As the baseline agent is trained by imitation learning, it does not learn any exploratory behavior. During navigation, this becomes problematic. The agent can only see directly in front of itself, so if the referred landmarks or destination are not in view, it may have difficulty linking language to visual perception.

To mitigate this, for all navigation subgoals in the training data, we use AI2-THOR (Kolve et al. 2017) to generate the agent's visual observations at eight rotation angles at each timestep. These observations approximate a panoramic view for the agent at each timestep during navigation, which has proven beneficial in the related vision-and-language navigation (VLN) task (Anderson et al. 2018; Fried et al. 2018). In training, this allows the agent to effectively "look around" before taking a step. During inference, the additional observations can be generated by forcing the agent to take eight consecutive rotation actions at each timestep.
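The inference-time procedure can be sketched with a minimal environment wrapper. The `rotate`/`frame` interface here is a hypothetical stand-in for the underlying simulator calls, not the actual AI2-THOR API.

```python
def collect_panorama(env, n_views=8):
    """Approximate a panoramic observation by rotating the agent in place.
    With eight views separated by 45 degrees, the agent captures a frame
    at each heading and ends the loop back at its original orientation."""
    step = 360 // n_views  # 45 degrees for the eight views used here
    views = []
    for _ in range(n_views):
        views.append(env.frame())  # observation at the current heading
        env.rotate(step)           # forced rotation action
    return views
```

Because the rotations sum to 360 degrees, the extra observations leave the agent's heading unchanged before its next predicted action.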
Object Detection Model Training
While the baseline agent is trained to explicitly identify objects from language instructions during object manipulation subgoals, there is no such training during navigation subgoals. This inhibits the agent's ability to ground navigation instructions to its surroundings.

To enable better language grounding, especially during navigation, we introduce YOLOv4 (Bochkovskiy, Wang, and Liao 2020), an object detection model, to the pipeline. The model was trained on robot view images, with bounding boxes supplied by AI2-THOR. Once trained, this object detection model enables the agent to identify 203 categories of objects in view during inference.

Figure 4: Spatial representation of the bounding box for a knife. At each timestep, the agent collects visual inputs for eight panoramic view angles, and applies the trained object detection model to them. Each resulting bounding box can be parameterized by the ordinal view angle index p of its image and the centroid coordinates (c_x, c_y) of the bounding box within the image. We project the parametric representation for each bounding box into panoramic space by using it to calculate the horizontal angle θ and vertical angle φ.

Localizer Module
Lastly, we combine these new inputs to introduce a new module to the pipeline: the localizer. At each timestep during navigation subgoals, this module predicts a spatial vector d_t representing the angle between the agent's orientation and the direction to the goal location. d_t is then appended to the Seq2Seq model's inputs at each timestep to guide navigation toward the goal location. During non-navigation subgoals, a zero vector is passed instead.

Inputs
The agent must predict the angle toward the goal location using task inputs that are available at inference time. These include only the language instructions and any visual observations resulting from taken actions.

In order to characterize the goal location of a navigation subgoal and subsequently predict its direction, it is critical to consider the target objects of both the navigation subgoal and the following object manipulation subgoal. Consider two consecutive instructions: "Walk to the counter" and "Pick up the knife". The instruction for the navigation subgoal refers to a large object, the counter, as the destination. If we only look at this instruction, anywhere around the counter could be a correct location. However, from the next instruction, we know that the agent should stop at a location where it can reach for the knife.

As such, given the language instructions L_k and L_{k+1} for consecutive navigation and object manipulation subgoals g_k and g_{k+1}, along with the panoramic visual observations and bounding boxes for all visible objects as described earlier, we aim to predict the direction from the agent's view to the goal location. As ALFRED provides the ground truth sequence of agent movements and the final goal location, it is straightforward to calculate the ground truth for d_t. To successfully perform this task, the localizer must learn how the position and size of bounding boxes correspond to the direction and distance of objects around the agent, and how these properties may be affected by object type and context. For example, at the same distance, the expected sizes of a knife and of a dining table in the agent's field of view are different. Likewise, if two objects of the same type are visible at once, only the context given by the language instructions can disambiguate the correct destination, e.g., "The knife on the left."
Implementation
The full architecture of the localizer is shown in Figure 3. It is implemented using a multimodal variant of BERT (Devlin et al. 2019) which encodes the panoramic bounding boxes available at each timestep along with the current and next subgoal language instructions to produce a spatial representation of the agent's observations at each timestep. We perform a regression on the model output to estimate the current angle to the goal location. This model incorporates techniques proposed by Li et al. (2020), who input image region features into BERT alongside language features for impressive results on various multimodal tasks, and Miyazawa et al. (2020), who encode spatial information as input to a similar multimodal BERT model.
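The localizer's interface can be sketched as below. Tokens are shown as placeholder strings rather than BERT embeddings, and `angle_from_prediction` illustrates why the regression targets the sine and cosine of the angle rather than the raw angle; both helpers are assumptions of this sketch, not the paper's actual implementation.

```python
import math

def localizer_inputs(spatial_tokens, instr_current, instr_next):
    """Assemble the localizer's input sequence: [CLS], one spatial token
    per detected bounding box, [SEP], the current and next subgoal
    instruction tokens, and a closing [SEP] (cf. Figure 3)."""
    return (["[CLS]"] + spatial_tokens + ["[SEP]"]
            + instr_current + instr_next + ["[SEP]"])

def angle_from_prediction(sin_theta, cos_theta):
    """Recover the predicted goal direction in degrees from the regressed
    (sin, cos) pair; this parameterization keeps the regression target
    continuous across the 360-degree wraparound."""
    return math.degrees(math.atan2(sin_theta, cos_theta)) % 360
```

Regressing the raw angle would penalize a prediction of 359 degrees against a target of 1 degree heavily, while the (sin, cos) targets for those two angles are nearly identical.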
Bounding box coordinates in panoramic space.
As shown in Figure 4, the augmented navigation inputs give us labeled bounding boxes for images from eight panoramic view angles. In order to uniquely represent the position of each bounding box with respect to the agent, we project it into three-dimensional, panoramic space through two polar coordinates. In this space, we represent a bounding box by its horizontal angle θ and vertical angle φ. θ is calculated relative to the agent's body orientation, while φ is relative to the center of the agent's vertical range of head motion.

To calculate these angles, we first parameterize each bounding box by three values: the ordinal index p of the panoramic view angle for the image the bounding box belongs in, and the centroid coordinates (c_x, c_y) of the bounding box within its image, normalized by the image width and height. Note that two consecutive panoramic views have an angle of 45 degrees between them. Given these parameters and the horizontal field of view F_x, we can calculate the horizontal angle θ of a particular bounding box (in degrees) by

θ = arctan[(c_x − 0.5) tan F_x] + 45p    (1)

Additionally, given the vertical field of view F_y and the agent's current vertical heading angle δ, we can calculate the vertical angle φ of a particular bounding box (in degrees) by

φ = arctan[(0.5 − c_y) tan F_y] + δ    (2)

Table 1: Action-by-Action F1 (%), Navigation Subgoal Success Rate (%), and Goal Condition Success Rate (%) for each compared model on the validation seen and unseen splits (baseline Action-by-Action F1: 84.5 seen, 75.6 unseen).

Input and output details.
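The projections in Equations (1) and (2) can be sketched directly. This is a minimal sketch assuming normalized centroid coordinates and fields of view given in degrees, with the trigonometry done in degrees to match the equations.

```python
import math

def bbox_angles(p, c_x, c_y, F_x, F_y, delta):
    """Project a bounding box into panoramic space (Eqs. 1 and 2).
    p: ordinal index of the 45-degree panoramic view containing the box,
    (c_x, c_y): box centroid normalized by image width and height,
    F_x, F_y: horizontal and vertical fields of view in degrees,
    delta: the agent's current vertical heading angle in degrees."""
    tan = lambda deg: math.tan(math.radians(deg))
    arctan = lambda x: math.degrees(math.atan(x))
    theta = arctan((c_x - 0.5) * tan(F_x)) + 45 * p  # Eq. (1)
    phi = arctan((0.5 - c_y) * tan(F_y)) + delta     # Eq. (2)
    return theta, phi
```

A box centered in view p = 2 under a level head angle yields θ = 90 and φ = δ, i.e., directly along that view's heading.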
Once we have calculated θ and φ, we encode the spatial representation as a 5-dimensional vector of sin θ, cos θ, sin φ, and the width w and height h of the bounding box, respectively. By incorporating sines and cosines, we account for the circular nature of angles and restrict the range of the values. This encoding is repeated to be of comparable size to a BERT embedding, then added to the BERT embedding for the bounding box class label, e.g., knife, to create a spatial token ready for input to BERT. As shown in Figure 3, the full input to the localizer is a concatenation of the special [CLS] token, the spatial tokens, the [SEP] token, the BERT embeddings for the current and next subgoal language instructions L_k and L_{k+1}, and another [SEP] token.

After being processed by BERT into a spatial representation of the agent's inputs, we use a linear layer to predict d_t, the sine and cosine of the angle toward the goal location. Specifically, we use the generated representation for the special [CLS] token. Again, sine and cosine are used to account for the circular nature of angles.

Empirical Results
Due to prohibitive training time, we conduct experiments on examples from one task type, stack and place, where each instance consists of both navigation actions and manipulation actions on multiple objects. We compare the performance of the following approaches:

• Seq2Seq baseline (baseline)
• Seq2Seq baseline + granular training (step-by-step)
• Seq2Seq baseline + granular training + oracle spatial tracking (oracle)
• Seq2Seq baseline + granular training + localizer spatial tracking (localizer)

For the remainder of this section, we introduce the oracle approach, then present the results.

Oracle spatial tracking.
To gauge the effectiveness of this spatial tracking approach and compare our model to a perfect model, we introduce an oracle spatial tracking model. At each timestep during navigation subgoals, rather than predict the angle toward the goal location d_t, we feed the ground truth value directly into the model. This gives us an upper bound on the performance of this spatial tracking approach.

Metrics.
At inference time, we evaluate the model with metrics in the three increasingly strict modes of evaluation derived from Shridhar et al. (2020): action-by-action, subgoal-by-subgoal, and goal-by-goal. To judge the agent's prediction of primitive actions, we use the action-by-action F-measure of the predicted sequences of primitive actions for entire goal trajectories compared to the ground truth. To judge navigation performance, we also calculate the navigation subgoal success rate, i.e., the percentage of navigation subgoals the agent successfully completes. As we trained our models on a relatively small subset of the data, no models achieve a viable goal completion success rate in this evaluation. Consequently, to judge the ability of the agent to achieve the overall goal of ALFRED tasks, we only report the goal condition success rate, the percentage of all goal conditions achieved by the model.

All metrics are calculated for ALFRED's two validation sets: validation seen and unseen. The seen dataset uses virtual rooms from AI2-THOR which were seen in training, while the unseen dataset uses rooms that were not seen in training.
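One simplified way to compute an action-by-action F-measure is to treat the predicted and ground-truth trajectories as multisets of primitive actions. The evaluation described above scores each prediction against the ground-truth history per timestep, so this multiset version is an approximation for illustration only.

```python
from collections import Counter

def action_f_measure(predicted, ground_truth):
    """F1 between predicted and ground-truth primitive action sequences,
    counting per-action-type overlap (multiset intersection)."""
    pred, gold = Counter(predicted), Counter(ground_truth)
    overlap = sum(min(pred[a], gold[a]) for a in pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```

Predicting only one of four ground-truth actions gives perfect precision but 0.25 recall, for an F-measure of 0.4, reflecting how truncated trajectories are penalized.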
Results interpretation.
The evaluation results for the compared models are listed in Table 1. We see that introducing granular training in the step-by-step model gives us a significant improvement in the prediction of individual actions, but achieves comparable results on other metrics. While the oracle spatial tracking approach offers drastic improvements in navigation subgoal and goal performance, the non-oracle BERT-based localizer approach does not come close to this upper bound. Nonetheless, it provides an overall net improvement in action-by-action F-measure of 9.3% and 3.1% over the baseline for seen and unseen rooms, respectively, and provides a slight improvement in navigation subgoal success rate over the baseline. This may indicate that in the task of predicting the spatial direction of targets, the transformer-based model still has much room for improvement.
Conclusion and Future Work
In this work, we investigated several methods to improve the ALFRED baseline Seq2Seq model, including subgoal-by-subgoal granular training, augmenting navigation inputs with panoramic visual observation images and full coverage of object segmentation masks, and combining these new inputs to propose a novel BERT-based spatial tracking module.

While granular training considerably improves the precision of predicted actions, we show through an oracle approach that if the agent is given the angle toward the goal location as input at every timestep, we can achieve astounding performance improvements in navigation and goal completion. This suggests that improving navigation performance can indeed be a key to solving this problem. Our fair, learning-based spatial tracking approach makes some progress toward this grand goal of closing the gap to oracle performance, but we expect there is much more work to be done. Beyond this, future work may include extending BERT or similar transformer-based models to handle the entire task end-to-end, including action and object interaction mask prediction.

References
Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.; Gould, S.; and van den Hengel, A. 2018. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Bochkovskiy, A.; Wang, C.-Y.; and Liao, H.-Y. M. 2020. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934.

Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments. In International Conference on 3D Vision (3DV).

Das, A.; Datta, S.; Gkioxari, G.; Lee, S.; Parikh, D.; and Batra, D. 2018. Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Fried, D.; Hu, R.; Cirik, V.; Rohrbach, A.; Andreas, J.; Morency, L.-P.; Berg-Kirkpatrick, T.; Saenko, K.; Klein, D.; and Darrell, T. 2018. Speaker-Follower Models for Vision-and-Language Navigation. In Advances in Neural Information Processing Systems (NeurIPS).

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation.

Hu, R.; Fried, D.; Rohrbach, A.; Klein, D.; Darrell, T.; and Saenko, K. 2019. Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Jain, V.; Magalhaes, G.; Ku, A.; Vaswani, A.; Ie, E.; and Baldridge, J. 2019. Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Gordon, D.; Zhu, Y.; Gupta, A.; and Farhadi, A. 2017. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv preprint arXiv:1712.05474.

Li, X.; Yin, X.; Li, C.; Hu, X.; Zhang, P.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; Choi, Y.; and Gao, J. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. arXiv preprint arXiv:2004.06165.

Ma, C.-Y.; Lu, J.; Wu, Z.; AlRegib, G.; Kira, Z.; Socher, R.; and Xiong, C. 2019a. Self-Monitoring Navigation Agent via Auxiliary Progress Estimation. In Proceedings of the International Conference on Learning Representations (ICLR).

Ma, C.-Y.; Wu, Z.; AlRegib, G.; Xiong, C.; and Kira, Z. 2019b. The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Majumdar, A.; Shrivastava, A.; Lee, S.; Anderson, P.; Parikh, D.; and Batra, D. 2020. Improving Vision-and-Language Navigation with Image-Text Pairs from the Web. arXiv preprint arXiv:2004.14973.

Miyazawa, K.; Aoki, T.; Horii, T.; and Nagai, T. 2020. lamBERT: Language and Action Learning Using Multimodal BERT. arXiv preprint arXiv:2004.07093.

Qi, Y.; Wu, Q.; Anderson, P.; Wang, X.; Wang, W. Y.; Shen, C.; and van den Hengel, A. 2020. REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; Parikh, D.; and Batra, D. 2019. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

Shridhar, M.; Thomason, J.; Gordon, D.; Bisk, Y.; Han, W.; Mottaghi, R.; Zettlemoyer, L.; and Fox, D. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In Computer Vision and Pattern Recognition (CVPR).

Siegel, A. W.; and White, S. H. 1975. The Development of Spatial Representations of Large-Scale Environments. In Advances in Child Development and Behavior, volume 10, 9–55. Elsevier.

Thomason, J.; Murray, M.; Cakmak, M.; and Zettlemoyer, L. 2019. Vision-and-Dialog Navigation. In Proceedings of the Conference on Robot Learning (CoRL).

Xia, F.; Li, C.; Chen, K.; Shen, W. B.; Martín-Martín, R.; Hirose, N.; Zamir, A. R.; Fei-Fei, L.; and Savarese, S. 2019. Gibson Env V2: Embodied Simulation Environments for Interactive Navigation. Technical report, Stanford University.

Xia, F.; Shen, W. B.; Li, C.; Kasimbeg, P.; Tchapmi, M. E.; Toshev, A.; Martín-Martín, R.; and Savarese, S. 2020. Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments. IEEE Robotics and Automation Letters.

Yan, C.; Misra, D.; Bennett, A.; Walsman, A.; Bisk, Y.; and Artzi, Y. 2018. CHALET: Cornell House Agent Learning Environment. arXiv preprint arXiv:1801.07357.

Zhu, F.; Zhu, Y.; Chang, X.; and Liang, X. 2020. Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).