The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation

Chih-Yao Ma*†, Zuxuan Wu‡, Ghassan AlRegib†, Caiming Xiong§, Zsolt Kira†
†Georgia Institute of Technology, ‡University of Maryland, College Park, §Salesforce Research
*Work partially done while the author was a research intern at Salesforce Research.
Figure 1: Vision-and-Language Navigation task and our proposed regretful navigation agent. The agent leverages the self-monitoring mechanism [14] through time to decide when to roll back to a previous location and resume the instruction-following task. Example instruction: "Exit the room. Walk past the display case and into the kitchen. Stop by the table." In the illustration, the agent reasons: "I know I came from there. Where should I go next?" and later "My estimated confidence decreased. Something went wrong. Let's learn this lesson and go back." Our code is available at https://github.com/chihyaoma/regretful-agent.
Abstract
As deep learning continues to make progress on challenging perception tasks, there is increased interest in combining vision, language, and decision-making. Specifically, the Vision-and-Language Navigation (VLN) task involves navigating to a goal purely from language instructions and visual information, without explicit knowledge of the goal. Recent successful approaches have made in-roads toward good success rates on this task but rely on beam search, which thoroughly explores a large number of trajectories and is unrealistic for applications such as robotics. In this paper, inspired by the intuition of viewing the problem as search on a navigation graph, we propose to use a progress monitor developed in prior work as a learnable heuristic for search. We then propose two modules incorporated into an end-to-end architecture: 1) a learned mechanism to perform backtracking, which decides whether to continue moving forward or roll back to a previous state (Regret Module), and 2) a mechanism to help the agent decide which direction to go next by showing directions that have been visited and their associated progress estimates (Progress Marker). Combined, the proposed approach significantly outperforms current state-of-the-art methods using greedy action selection, with 5% absolute improvement in success rate on the test server and, more importantly, 8% in success rate normalized by path length.
1. Introduction
Building on the success of deep learning in solving various computer vision tasks, several new tasks and corresponding benchmarks have been proposed to combine visual perception and decision-making [2, 24, 7, 13, 19, 25, 5]. One such task is the Vision-and-Language Navigation (VLN) task, where an agent must navigate to a goal purely from language instructions and visual input, without explicit knowledge of the goal. This task has a number of applications, including service robotics, where it would be preferable if humans interacted naturally with the robot by instructing it to perform various tasks.

Recently, there have been several approaches proposed to solve this task. The dominant approaches frame the navigation task as a sequence-to-sequence problem [2]. Several enhancements such as synthetic data augmentation [12], pragmatic inference [12], and combinations of model-free and model-based reinforcement learning techniques [21] have also been proposed. However, current methods are separated into two regimes: those that use beam search and obtain good success rates (with longer trajectory lengths), and those that use greedy action selection (and hence produce very short trajectories) but obtain much lower success rates. In fact, new metrics have recently been proposed that balance these two objectives [1]. Intuitively, the agent should perform intelligent action selection (akin to best-first search) without exhaustively exploring the search space. For robotics applications, for example, the use of beam search is unrealistic, as it would require the robot to physically explore a large number of possible trajectories.

In this paper, we view the process of navigation as graph search across the navigation graph and employ two strategies, encoded within the neural network architecture, to enable navigation without the use of beam search. Specifically, we develop: 1) a Regret Module that provides a mechanism allowing the agent to learn when to backtrack [11, 3], and 2) a Progress Marker mechanism that allows the agent to incorporate information from previous visits and reason about those visits and their associated progress estimates for better action selection.

Specifically, in graph search a heuristic is used to make meaningful progress towards the goal in a manner that avoids exhaustive search but is more effective than naïve greedy search. We therefore build on recent work [14] that developed a progress monitor, a learned mechanism for estimating the progress made towards the goal (low values meaning progress has not been made, high values meaning the agent is closer to the goal). In that work, however, the focus was on the regularizing effect of the progress monitor and on its use in beam search. Instead, we use this progress monitor effectively as a learned heuristic that can determine, during inference, which directions are more likely to lead towards the goal.

We use the progress monitor in two ways. First, we leverage the notion of backtracking, which is prevalent in graph search, by developing a learned rollback mechanism that decides whether or not to go back to the previous location (Regret Module). Second, we incorporate a mechanism that allows the agent to use the estimated progress it computed when visiting viewpoints to choose the next action after it has rolled back (Progress Marker). This lets the agent know when particular directions have already been visited and the progress they resulted in, which can bias it against re-visiting states unless warranted. We do this by augmenting the visual state vectors with the progress estimates so that the agent can reduce the probability of revisiting such states (again, in a learned manner).

We demonstrate that these learned mechanisms are superior to greedy decoding. Our agent achieves state-of-the-art results among published works both in terms of success rate (when beam search is not used) and, more importantly, the SPL [1] metric, which incorporates path length, owing to our short trajectory lengths. In summary, our contributions include: 1) a graph-search perspective on the instruction-based navigation problem and the use of a learned heuristic, in the form of a progress monitor, to effectively explore the navigation graph; 2) an end-to-end trainable Regret Module that learns to decide when to roll back to the previous location given the history of textual and visual grounding observed; 3) a Progress Marker that enables effective backtracking and reduces the probability of going to an already-visited location; and 4) state-of-the-art results on the VLN task.
2. Related Work
Vision and language navigation.
There are a number of benchmarks and environments for investigating the combination of vision, language, and decision-making. These include House3D [24], Embodied QA [7], AI2-THOR [13], navigation-based agents [17, 22, 15] (including with communication [9]), and the VLN task that we focus on [2]. For tasks that contain only sparse rewards, reinforcement learning approaches exist [21, 27, 8], for example focusing on language grounding through guided feature transformation [27] and the development of a neural module approach [8]. Our work, in contrast, focuses on tasks that contain language instructions to guide the navigation process, with applications such as service robotics. Approaches to this task are dominated by a sequence-to-sequence formulation, beginning with the initial work introducing the task [2]. Subsequent methods have used a Speaker-Follower technique to generate synthetic instructions for data augmentation and pragmatic inference [12], as well as combinations of supervised and RL-based approaches [21, 20]. Recently, the Self-Monitoring navigation agent was introduced, which learns to estimate progress made towards the goal using visual and language co-grounding [14]. Prior work employs beam-search-type techniques, though, optimizing for success rate at the expense of trajectory length and reduced applicability to robotics and other domains. Inspired by the latter work, we view the progress monitor as a learned heuristic and combine it with other techniques from graph search, namely backtracking, to use it for action selection, which was not a focus of the prior work.
Navigation and learned heuristics.
Several works in vision and robotics have explored the intersection of learning and planning. In robotics, planning systems must often explore large search trees to get from start to goal, and selection of the next state to expand must be done intelligently to reduce computation. Often fixed heuristics (e.g., distance to goal) are used, but these are static, require known goal locations, and are used for optimal A*-style algorithms rather than greedy best-first search, which is what can be employed on robots when maps are not available [18]. Recently, several learning-based approaches have been developed for such heuristics, including older works that learn residuals for existing heuristics [26], heuristic ranking methods that enable refinement of new ones [23], and learning a heuristic policy in a Markov Decision Process (MDP) formulation to directly optimize search effort by taking into account history and contextual information [4]. In our work, we similarly learn to estimate a heuristic (the progress monitor) and use it for action selection, showing that the resulting estimates generalize to unseen environments. We also develop an architecture that explicitly learns when to backtrack based on this progress monitor (with a Progress Marker to reduce the chance of choosing the same action again after backtracking, unless warranted), which further improves navigation performance.
Modern Reinforcement Learning.
Modern reinforcement learning methods such as Asynchronous Advantage Actor-Critic (A3C) [16] and Advantage Actor-Critic (A2C) are related to the baseline Self-Monitoring agent [14] and the proposed regretful agent. Specifically, the progress monitor in the Self-Monitoring agent (our baseline) is similar to the value function in RL, and the difference between the progress marker of a viewpoint and the current progress estimate (denoted $\Delta v^{marker}_{t,k}$; see Sec. 4.2) is conceptually similar to the advantage function. However, the advantage function in RL serves as a way to regularize and improve the training of the policy network. We instead associate $\Delta v^{marker}_{t,k}$ directly with all navigable states, and it has a direct impact on the agent's next action even during inference. While an accurate value estimate for VLN with dynamic and implicit goals might reduce the need for this formulation, we believe this is hardly achievable given the lack of training data. Relating to the proposed end-to-end learned Regret Module, Leave No Trace [10] learns a forward and a reset policy to reset the environment, preventing the policy from entering a non-reversible state. Instead of learning to reset, we learn to roll back to a previous state and continue the navigation task with a policy network that learns to decide a better next step.
3. Baseline
Given natural language instructions, our task is to train an agent to follow these instructions and reach an (unspecified) goal in the environment (see Figure 1 for an example). This requires processing both the instructions and the visual inputs, along with attentional mechanisms to ground them to the current situation. We adapt the recently introduced Self-Monitoring Visual-Textual Co-grounding agent [14] as our baseline. The Self-Monitoring agent consists of two primary components: (1) a visual-textual co-grounding module that grounds to the completed instruction, the next instruction, and the subsequent navigable directions represented as visual features, and (2) a progress monitor that takes the attention weights of the grounded instructions as input and estimates the agent's progress towards completing the instruction. It was shown that such a progress monitor can regularize the attentional mechanism (via an additional loss), but the authors did not focus on using the progress estimates for action selection itself. In the following, we briefly introduce the Self-Monitoring agent.

Specifically, a language instruction with $L$ words is represented via embeddings denoted as $X = \{x_1, x_2, \dots, x_L\}$, where $x_l$ is the feature vector for the $l$-th word encoded by a Long Short-Term Memory (LSTM) language encoder. Following [14, 12], we use a panoramic view as visual input. At the $t$-th time step, the agent perceives a set of images at each viewpoint $v_t = \{v_{t,1}, v_{t,2}, \dots, v_{t,K}\}$, where $K$ is the maximum number of navigable directions and $v_{t,k}$ represents the image feature of direction $k$ obtained from an ImageNet pre-trained ResNet-152. The agent first obtains the visually and textually grounded features, $\hat{v}_t$ and $\hat{x}_t$ respectively, using soft-attention with the hidden state from the last time step $h_{t-1}$ (see [14] for details). Conditioned on these grounded features and historical context, it then produces the hidden context of the current step $h_t$:

$$h_t, c_t = \mathrm{LSTM}([\hat{x}_t, \hat{v}_t, a_{t-1}], h_{t-1}, c_{t-1}),$$

where $[\cdot,\cdot]$ denotes concatenation and $c_{t-1}$ denotes the cell state from the last time step. To decide where to go next, the current hidden state $h_t$ is concatenated with the grounded instruction $\hat{x}_t$, yielding a representation that contains historical context and the relevant parts of the instruction (for example, corresponding to parts that have just been carried out and those that have to be carried out next), to compute correlations with the visual features of each viewpoint $k$ ($v_{t,k}$). Formally, action selection is calculated as:

$$o_{t,k} = (W_a [h_t, \hat{x}_t])^\top g(v_{t,k}) \quad \text{and} \quad p_t = \mathrm{softmax}(o_t),$$

where $W_a$ are learned parameters and $g(\cdot)$ is a Multi-Layer Perceptron (MLP).

Furthermore, we also equip the agent with a progress monitor following [14] to enforce that the attention weights of the textual grounding align with the progress made towards the goal, further regularizing the grounded instructions to be relevant. The progress monitor is optimized such that the agent must use the attention distribution of the textual grounding to predict the distance from the goal. The output of the progress monitor, $p^{pm}_t$, represents the agent's estimated completeness of instruction-following:

$$h^{pm}_t = \sigma(W_h([h_{t-1}, \hat{v}_t]) \otimes \tanh(c_t)),$$
$$p^{pm}_t = \tanh(W_{pm}([\alpha_t, h^{pm}_t])),$$

where $W_h$ and $W_{pm}$ are learned parameters, $c_t$ is the cell state of the LSTM, $\otimes$ denotes the element-wise product, $\alpha_t$ is the attention weights of the textual grounding, and $\sigma$ is the sigmoid function. Please refer to [14] for further details on the baseline architecture.

Figure 2: Illustration of the proposed regretful navigation agent: feature extraction, textual and visual grounding via soft-attention, the LSTM, the progress monitor with Progress Marker, the Regret Module, and action selection over forward/rollback movements. Note that the progress monitor is based on [14].
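To make the baseline's progress monitor concrete, the following is a minimal PyTorch sketch of the two equations above. The 512-d hidden size, the fixed instruction length, and all module and argument names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ProgressMonitor(nn.Module):
    """Sketch of the progress-monitor head:
    h_t^pm = sigma(W_h([h_{t-1}, v_hat_t])) * tanh(c_t)
    p_t^pm = tanh(W_pm([alpha_t, h_t^pm]))."""

    def __init__(self, hidden_size=512, instr_len=80):
        super().__init__()
        self.w_h = nn.Linear(2 * hidden_size, hidden_size)   # W_h on [h_{t-1}, v_hat_t]
        self.w_pm = nn.Linear(instr_len + hidden_size, 1)    # W_pm on [alpha_t, h_t^pm]

    def forward(self, h_prev, v_hat, c_t, alpha_t):
        # Element-wise gate between grounded context and the LSTM cell state.
        h_pm = torch.sigmoid(self.w_h(torch.cat([h_prev, v_hat], dim=-1))) * torch.tanh(c_t)
        # Progress estimate in [-1, 1]; higher means closer to completing the instruction.
        return torch.tanh(self.w_pm(torch.cat([alpha_t, h_pm], dim=-1)))
```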
4. Regretful Navigation Agent
The progress monitor described above reflects the agent's progress towards the goal: its output decreases or fluctuates if the agent selects an action leading to deviation from the goal, and increases if the agent moves closer to the goal by completing the instruction. We posit that this property, while conceptually simple, provides critical feedback for action selection. To this end, we leverage the outputs of the progress monitor to allow the agent to regret and backtrack, using a Regret Module and a Progress Marker (see Figure 2). In particular, the Regret Module examines the progress made from the last step to the current step to decide whether to take a forward or a rollback action. Once the agent regrets and rolls back to the previous location, the Progress Marker informs it of which location(s) have been visited before and rates the visited location(s) according to the agent's confidence in completing the instruction-following task. Combining the two proposed methods, we show that the agent is able to perform a local search on the navigation graph by (1) assessing the current progress, (2) deciding when to roll back, and (3) selecting the next location after a rollback occurs. In the following, we elaborate on these two components in detail.

4.1. Regret Module

The Regret Module takes in the outputs of the progress monitor at different time steps and decides whether to go forward or to roll back. In particular, we use the concatenation of the hidden state $h_t$ and the grounded instruction $\hat{x}_t$ as our forward embedding $m^f_t$, and, more importantly, we introduce a rollback embedding $m^r_t$ as the projection of the visual features of the action that leads to the previously visited location. The two vector representations are:

$$m^f_t = W_a [h_t, \hat{x}_t] \quad \text{and} \quad m^r_t = g(v_{t,r}),$$

where $W_a$ are learned parameters, $\hat{x}_t$ is the grounded instruction obtained from the textual grounding module, and $v_{t,r}$ is the image feature vector of the direction that points to the previously visited location.

To decide whether to go forward or roll back, the Regret Module leverages the difference between the progress monitor outputs at the current and previous time steps, $\Delta p^{pm}_t = p^{pm}_t - p^{pm}_{t-1}$. Intuitively, if the difference is larger than a certain threshold $\sigma$, i.e., $\Delta p^{pm}_t > \sigma$, the agent should take a forward action, and vice versa. Since it is hard to decide an optimal value for $\sigma$, we instead compute attention weights $\alpha^{fr}_t$ and perform a weighted sum over the forward and rollback embeddings. If the weight on rollback is larger, the agent is likely to be biased towards an action leading to the last visited location. Formally, the weights are computed as:

$$\alpha^{fr}_t = \mathrm{softmax}(W_r(\Delta p^{pm}_t)) \quad \text{and} \quad m^{fr}_t = (\alpha^{fr}_t)^\top [m^f_t, m^r_t],$$

where $W_r$ are learned parameters, $[\cdot,\cdot]$ denotes concatenation of feature vectors, and $m^{fr}_t$ is the weighted sum of the forward and rollback embeddings. Note that, to ensure the progress monitor remains focused on estimating the agent's progress and regularizing the textual grounding module, we detach the progress monitor output fed into the Regret Module, setting it as a leaf in the computational graph.
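A minimal sketch of the Regret Module logic follows, under the same illustrative assumptions (names and sizes are ours, not from the released code); `m_forward` and `m_rollback` correspond to $m^f_t$ and $m^r_t$ above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegretModule(nn.Module):
    """Sketch: weighs the forward embedding against the rollback embedding
    using the change in estimated progress, Delta p_t^pm."""

    def __init__(self):
        super().__init__()
        self.w_r = nn.Linear(1, 2)  # W_r maps Delta p_t^pm to [forward, rollback] logits

    def forward(self, m_forward, m_rollback, p_pm, p_pm_prev):
        # Detach so that action-selection gradients do not flow into the
        # progress monitor (kept as a leaf node, as described above).
        delta = (p_pm - p_pm_prev).detach()                 # Delta p_t^pm, (B, 1)
        alpha_fr = F.softmax(self.w_r(delta), dim=-1)       # alpha_t^fr, (B, 2)
        stacked = torch.stack([m_forward, m_rollback], 1)   # (B, 2, D)
        # m_t^fr: weighted sum of the forward and rollback embeddings.
        return (alpha_fr.unsqueeze(-1) * stacked).sum(dim=1)
```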
Action selection. Similar to existing work, the agent determines which image features from the navigable directions have the highest correlation with the movement vector $m^{fr}_t$ by computing the inner product, and the probability of each navigable direction is then computed as:

$$o_{t,k} = (W_{fr}\, m^{fr}_t)^\top g(v_{t,k}) \quad \text{and} \quad p_t = \mathrm{softmax}(o_t),$$

where $W_{fr}$ are learned parameters and $p_t$ is the probability distribution over navigable directions at time $t$. In practice, once the agent takes a rollback action, we block the action that leads to oscillation (see the masking sketch below).
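The oscillation blocking can be as simple as masking out the logit of the direction that returns to the viewpoint the agent just left. This sketch, with hypothetical argument names, shows one plausible implementation:

```python
import torch

def block_oscillation(logits, navigable_viewpoints, prev_viewpoint, just_rolled_back):
    """After a rollback action, remove the direction returning to the viewpoint
    just left, so the agent cannot bounce between two locations indefinitely."""
    if just_rolled_back:
        for k, vp in enumerate(navigable_viewpoints):
            if vp == prev_viewpoint:
                logits[k] = float("-inf")  # zero probability after softmax
    return logits
```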
The Regret Module provides a mechanism for the agent to decide, according to the progress monitor outputs, when to roll back to a previous location and when to move forward. Once the agent rolls back, it must select the next direction in which to go forward. It is thus essential for the agent to (1) know which directions it has already visited (and rolled back from) and (2) estimate whether the visited locations can lead to a path that completes the given instruction.

4.2. Progress Marker

Toward this end, we propose the Progress Marker to mark each visited location with the agent's confidence in completing the instruction (see Figure 3). More specifically, we maintain a memory $M$ and store the output of the progress monitor associated with each visited location; if a location has not yet been visited, its marker is filled with the value 1:

$$v^{marker}_{t,k} = \begin{cases} p^{pm}_i, & \text{if } k \text{ leads to a location } i \in M, \\ 1, & \text{otherwise}, \end{cases}$$

where $i$ is a unique viewpoint ID for each location. We allow the marker on each location to be updated every time the agent visits it.

The marker value on each navigable direction indicates the estimated confidence that the location leads to the goal. We assign the value 1 to unvisited directions to encourage the agent to explore the environment; the navigation probabilities among unvisited directions then depend only on the action probabilities $p_t$, since their marker values are the same.
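A minimal sketch of the marker memory $M$ (the class and method names are hypothetical):

```python
class ProgressMarker:
    """Maps visited viewpoint IDs to the progress estimate recorded there;
    unvisited directions default to 1 to encourage exploration."""

    def __init__(self):
        self.memory = {}  # viewpoint ID -> latest progress-monitor output

    def update(self, viewpoint_id, p_pm):
        # Overwrite the marker on every visit, as described above.
        self.memory[viewpoint_id] = p_pm

    def marker(self, viewpoint_id):
        # v_{t,k}^marker = p_i^pm if direction k leads to a visited i, else 1.
        return self.memory.get(viewpoint_id, 1.0)
```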
Action selection with Progress Marker. During action selection, in addition to the movement vector $m^{fr}_t$ that the agent relies on to decide which direction to go, we propose to attach the marker value to each navigable direction as an indication of whether that direction is likely to lead to the goal or to unexplored (and potentially better) paths. To achieve this, we leverage the difference between the current estimated progress and the marker of each navigable direction, $\Delta v^{marker}_{t,k} = p^{pm}_t - v^{marker}_{t,k}$, and concatenate it to the visual feature representation of each navigable direction before action selection:

$$v^{marked}_{t,k} = [g(v_{t,k}), \Delta v^{marker}_{t,k}].$$

The difference $\Delta v^{marker}_{t,k}$ indicates the chance of a navigable direction leading to the goal and further informs the agent which direction to select. In our design, a lower $\Delta v^{marker}_{t,k}$ corresponds to a higher chance of being selected. For instance, at step 4 in Figure 3, the $\Delta v^{marker}_{t,k}$ for the starting location and the last visited location are 0.08 and -0.02 respectively, whereas an unvisited location has -0.71, which eventually leads to an estimated progress of 0.52. When using the Progress Marker, the final action selection is formulated as:

$$o_{t,k} = (W_{fr}\, m^{fr}_t)^\top v^{marked}_{t,k} \quad \text{and} \quad p_t = \mathrm{softmax}(o_t).$$

In practice, we tile the difference $n$ times before concatenating it with the projected image feature of $v_{t,k}$ in order to account for the imbalance in dimensionality. The marker value for the stop action is set to 0.

Figure 3: Concept of the proposed Progress Marker (red flags). The agent marks each visited location with the estimated progress made towards the goal. The change in estimated progress determines whether the agent should roll back or go forward, and the difference between the current estimated progress and the markers on the next navigable directions helps the agent decide which direction to go.

4.3. Training

We train the proposed agent with a cross-entropy loss for action selection and a Mean Squared Error (MSE) loss for the progress monitor. In addition to these losses, we introduce an entropy loss to encourage the agent to explore other actions, so that it is not biased towards actions that already have very high confidence. The motivation is that, after training for a period of time, the agent starts to overfit and performs fairly well on the training set. As a result, the agent rarely learns to roll back properly during training, since the majority of training samples do not require it. Introducing the entropy loss increases the chance of exploration, and of making incorrect actions, during training:

$$\mathcal{L} = -\lambda \underbrace{\sum_{t=1}^{T} y^{nv}_t \log(p_{t,k})}_{\text{action selection}} + (1-\lambda) \underbrace{\sum_{t=1}^{T} \left(y^{pm}_t - p^{pm}_t\right)^2}_{\text{progress monitor}} - \beta \sum_{t=1}^{T} \sum_{k=1}^{K} \underbrace{\left(-p_{t,k} \log(p_{t,k})\right)}_{\text{entropy}},$$

where $p_{t,k}$ is the action probability of each navigable direction, $y^{nv}_t$ is the ground-truth navigable direction at step $t$, $\lambda$ is the weight balancing the cross-entropy and MSE losses, and $\beta$ is the weight of the entropy term.

Following existing approaches [14, 12, 2], we perform categorical sampling during training for action selection. During inference, the agent greedily selects the action with the highest action probability.
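A sketch of the combined objective is below; since the extracted text does not preserve the paper's exact values of $\lambda$ and $\beta$, the defaults shown are placeholders.

```python
import torch
import torch.nn.functional as F

def regretful_loss(logits, target_action, p_pm, target_progress, lam=0.5, beta=0.01):
    """Action cross-entropy + progress-monitor MSE - entropy bonus (the bonus
    encourages exploration so rollback situations actually occur in training)."""
    ce = F.cross_entropy(logits, target_action)        # action-selection term
    mse = F.mse_loss(p_pm, target_progress)            # progress-monitor term
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    return lam * ce + (1.0 - lam) * mse - beta * entropy
```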
Table 1: Comparison with the state of the art with greedy decoding for action selection. *: with data augmentation. (Rows for which numeric values did not survive extraction are left blank; the "Regretful (ours)" validation-unseen figures are those reported in Sec. 6.)

Method                   Validation-Seen               Validation-Unseen             Test (unseen)
                         NE↓   SR↑   OSR↑  SPL↑        NE↓   SR↑   OSR↑  SPL↑        NE↓   SR↑   OSR↑  SPL↑
Random                   9.45  0.16  0.21  -           9.23  0.16  0.22  -           9.77  0.13  0.18  0.12
Student-forcing [2]      6.01  0.39  0.53  -           7.81  0.22  0.28  -           7.85  0.20  0.27  0.18
RPA [21]                 5.56  0.43  0.53  -           7.65  0.25  0.32  -           7.53  0.25  0.33  0.23
Speaker-Follower [12]*   3.36  0.66  0.74  -           6.62  0.36  0.45  -           6.62  0.35  0.44  0.28
RCM [20]*                3.37  0.67  0.77  -           5.88  0.43  0.52  -           6.01  0.43  0.51  0.35
Self-Monitoring [14]*
Regretful (ours)                                             0.48        0.37
Regretful (ours)*                                            0.50        0.41
5. Dataset and Implementations
Room-to-Room dataset.
We use the Room-to-Room (R2R) dataset [2] to evaluate our proposed approach. The R2R dataset is built upon the Matterport3D dataset [6]. It consists of 10,800 panoramic views constructed from 194,400 RGB-D images in 90 buildings, and has 7,189 paths sampled from its navigation graphs. Each path has three ground-truth navigation instructions written by humans. Of the 90 scenes, 61 are used for training and validation-seen, 11 for validation-unseen, and 18 for test-unseen.
Evaluation metrics.
To compare with existing work, we report the same evaluation metrics used in those works: (1) Navigation Error (NE), the mean of the shortest-path distance in meters between the agent's final position and the goal location; (2) Success Rate (SR), the percentage of final positions less than 3 m away from the goal location; and (3) Oracle Success Rate (OSR), the success rate if the agent could stop at the closest point to the goal along its trajectory. We also note the importance of a recently added metric that emphasizes the trade-off between success rate and trajectory length: Success rate weighted by (normalized inverse) Path Length (SPL) [1], which incorporates trajectory length and is an important consideration for real-world applications such as robotics.
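For reference, SR and SPL can be computed as follows; the episode dictionary keys are our own naming, and SPL follows the definition in [1].

```python
def sr_and_spl(episodes, success_dist=3.0):
    """Success Rate and Success weighted by Path Length over evaluated episodes.
    Each episode supplies: nav_error (m), path_length (m), shortest_path (m)."""
    sr = spl = 0.0
    for ep in episodes:
        success = 1.0 if ep["nav_error"] < success_dist else 0.0
        sr += success
        # SPL [1]: success * shortest path / max(taken path, shortest path).
        spl += success * ep["shortest_path"] / max(ep["path_length"], ep["shortest_path"])
    return sr / len(episodes), spl / len(episodes)
```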
Implementation Details.
For fair comparison with existing work, we use a ResNet-152 pre-trained on ImageNet to extract image features. Following the Self-Monitoring [14] and Speaker-Follower [12] works, the embedded feature vector for each navigable direction is obtained by concatenating an appearance feature with a 4-d orientation feature $[\sin\phi; \cos\phi; \sin\theta; \cos\theta]$, where $\phi$ and $\theta$ are the heading and elevation angles. Please refer to the Appendix for further implementation details. Note that both Speaker-Follower [12] and Self-Monitoring [14] were originally designed to optimize the success rate (SR) via beam search; concurrently to our work, RCM [20] proposed a new setting that allows the agent to explore unseen environments prior to the navigation task via Self-Supervised Imitation Learning (SIL).
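A sketch of how the per-direction embedding is assembled; tiling the 4-d orientation feature 32 times is our assumption, consistent with the 2176-d MLP input noted in the Appendix (2048 + 4 × 32):

```python
import numpy as np

def direction_feature(resnet_feat, heading, elevation, tile=32):
    """Concatenate a 2048-d ResNet-152 appearance feature with the tiled 4-d
    orientation feature [sin(phi); cos(phi); sin(theta); cos(theta)]."""
    orient = np.array([np.sin(heading), np.cos(heading),
                       np.sin(elevation), np.cos(elevation)], dtype=np.float32)
    return np.concatenate([resnet_feat, np.tile(orient, tile)])  # 2048 + 128 = 2176-d
```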
6. Evaluation
We first compare the proposed regretful navigation agent with the state-of-the-art methods [14, 12, 20]. As shown in Table 1, our method achieves significant performance improvements over existing approaches. We achieve 37% SPL and 48% SR on the validation-unseen set, outperforming all existing work. Our best-performing model achieves 41% SPL and 50% SR on the validation-unseen set when trained with the synthetic data from the Speaker [12]. On the test server, we demonstrate an absolute 8% SPL improvement and 5% SR improvement over the current state-of-the-art method. Note also that our regretful navigation agent without data augmentation already outperforms existing work on both the SR and SPL metrics.
Table 2 shows an ablation study analyzing the effect of each component. The first thing to note is that our method is significantly better than the Self-Monitoring agent with greedy decoding, even though the latter still has a progress monitor loss (the progress monitor is simply not used for action selection). A second interesting point is that when the Progress Marker is attached to the features of previously visited navigable directions but the Regret Module is absent, performance does not increase significantly (44% SR). We also tested another condition, in which the progress monitor estimates were attached to the forward embedding, so that the network could use that information to improve action selection. That condition again achieved only modest gains (45% SR), compared to our Regret Module, which achieves 47% SR (and 48% when the Progress Marker is added). In all, this shows that the key improvement stems from the design of the Regret Module, which allows the agent to intelligently backtrack after making mistakes.
Table 2: Ablation study showing the effect of each proposed component compared to prior art. All methods here are trained without data augmentation. (Cells whose values did not survive extraction are left blank; the "Ours" validation-unseen SR figures are those quoted in the text above.)

Method                   Regret  Marker   Validation-Seen              Validation-Unseen
                                          NE↓   SR↑   OSR↑  SPL↑       NE↓   SR↑   OSR↑  SPL↑
Speaker-Follower [12]                     4.86  0.52  0.63  -          7.07  0.31  0.41  -
Self-Monitoring [14]                      3.72  0.63
Ours (Regret only)         ✓                                                 0.47
Ours (Marker only)                 ✓                                         0.44
Ours (Regret + Marker)     ✓       ✓                                         0.48
Table 3: Sanity check verifying that the source of the performance improvement is the agent's ability to decide when to roll back. (Values lost in extraction are left blank.)

Method                   Blocking    Validation-Seen              Validation-Unseen
                         Rollback    NE↓   SR↑   OSR↑  SPL↑       NE↓   SR↑   OSR↑  SPL↑
Self-Monitoring [14]                 3.72  0.63
Self-Monitoring [14]        ✓
Regretful (ours)
Regretful (ours)            ✓
We now further analyze the behavior of the agent to verify that the source of improvement is indeed the ability to learn when to roll back.
Does rollback lead to the performance improvement?
Our proposed regretful agent relies on the ability to regret and roll back to a previous location, further exploring the unknown environment to increase the success rate. As a sanity check, we manually block all actions leading to rollback (except when there is only one navigable direction to go) for both the state-of-the-art Self-Monitoring agent and our regretful agent. The result is shown in Table 3. Blocking rollback for the Self-Monitoring agent produces mixed results, with worse NE but better metrics such as OSR; the SR, however, is unchanged. On the other hand, blocking rollback for our agent significantly degrades most metrics, including NE, SR, and OSR, especially in unseen environments. This shows that blocking the ability to learn when to roll back removes a large source of the performance increase, and this is especially true for unseen environments.

Number of unsuccessful examples reduced.
We calculate the percentage of unsuccessful examples involving a rollback action for both the Self-Monitoring agent and our proposed agent. As demonstrated in Figure 4, our regretful agent significantly reduces these unsuccessful examples, from around 43% to 38%, which correlates with the 4-5% improvement in SR in Tables 1 and 2.
Regretful agent in unfamiliar environments.
Figure 4: Percentage of unsuccessful examples involving rollback, reduced by our proposed regretful agent (Regretful vs. Self-Monitoring).

The key to the performance increase of an agent focusing on the rollback ability is not that the agent learns better textual or visual grounding, but that the agent learns to search, especially when it is uncertain which direction to go. To demonstrate this, we train both the Self-Monitoring agent and our proposed regretful agent only on synthetic data and test them on the unseen validation set (real data). We expect the regretful agent to outperform the Self-Monitoring agent across all metrics, since our agent is designed to operate in environments where it is likely to be uncertain about action selection. As shown in Table 4, when trained using only the synthetic data, our method significantly outperforms the Self-Monitoring agent. Interestingly, when compared with the Self-Monitoring agent trained on real data, our agent trained on synthetic data is slightly better on ONE, the same on OSR, and marginally lower on SR. We achieve slightly better performance on oracle metrics because stopping at the correct location is not a hard constraint there. This indicates that even though our regretful agent has not yet learned how to properly stop at the goal (due to training on synthetic data only), the chance that it passes or reaches the goal is slightly higher than for the Self-Monitoring agent trained on real data. Further, when the regretful agent is trained on real data, performance improves across all metrics.

Table 4: Ablation study when trained using only the synthetic or the real training data. Oracle Navigation Error (ONE): the navigation error if the agent could stop at the closest point to the goal along its trajectory. (Values lost in extraction are left blank.)

Method                   Synthetic  Real    Validation-Unseen
                                            ONE↓  SR↑   OSR↑
Self-Monitoring [14]        ✓
Self-Monitoring [14]                  ✓
Regretful (ours)            ✓
Regretful (ours)                      ✓
Figure 5 shows qualitative outputs of our model during successful navigation in unseen environments. In Figure 5 (a), the agent makes a mistake at the first step, and the estimated progress at the second step slightly decreases. The agent then decides to roll back, after which the progress monitor output significantly increases. Finally, the agent stops correctly as instructed. Figure 5 (b) shows an example where the agent correctly goes up the stairs but incorrectly does so again rather than turning and finding the TV as instructed. Note that the progress monitor increases, but only by a small amount; this demonstrates the need for learned mechanisms that can reason about the textual and visual grounding and context, as well as the resulting level of change in progress. In this case the agent then correctly decides to roll back and successfully walks into the TV room. Similarly, in Figure 5 (c), the agent misses the stairs, resulting in a very small progress increase, and decides to roll back as a result. Upon reaching the goal, the agent's progress estimate is 99%. Please refer to the Appendix for the full trajectories and for unsuccessful examples.
7. Conclusion
In this paper, we have proposed an end-to-end trainable regretful navigation agent for the VLN task. Inspired by the intuition of viewing this task as graph search over the navigation graph, we use a progress monitor as a learned heuristic that can be trained and employed during inference to greedily select the next best action (best-first search). The progress monitor incorporates information from grounded language instructions and visual information, integrated across time with LSTMs. We then propose a Regret Module that learns to decide when to backtrack, depending on the progress made and the state of the agent. Finally, a Progress Marker allows the agent to reason about previous visits and unvisited directions, so that it can choose a better navigable direction by reducing the action probabilities of visited locations with lower progress estimates.

The resulting framework achieves state-of-the-art success rates compared to existing published methods on the public leaderboard, without using beam search. We show through extensive analyses that the source of the performance improvement is the design of the learned rollback mechanism: when it is blocked, performance decreases, and the learning can occur even on purely synthetic data and generalize to real data. We also demonstrate that the total number of unsuccessful examples involving rollback is reduced with our regretful agent. There is a great deal of possible future work extending our method. For example, other aspects of graph search, such as elements of exploration (e.g., a search-space frontier), could be incorporated in a manner more efficient than beam search. Finally, a combination of goal-driven perception and reinforcement learning would be interesting to explore as tasks contain less and less structured information (e.g., embodied QA).
Acknowledgments
This research was partially supported by DARPA's Lifelong Learning Machines (L2M) program, under Cooperative Agreement HR0011-18-2-001. We thank Chia-Jung Hsu for her valuable and artistic suggestions on the figures.
Appendix

A. Network Architecture
The embedding dimension of the instruction encoder is 256, followed by a dropout layer with ratio 0.5. We encode the instruction using a regular LSTM with a 512-dimensional hidden state. The MLP $g$ used for projecting the raw image features is $\mathrm{BN} \rightarrow \mathrm{FC} \rightarrow \mathrm{BN} \rightarrow \mathrm{Dropout} \rightarrow \mathrm{ReLU}$; the FC layer projects the 2176-d input vector to a 1024-d vector, and the dropout ratio is set to 0.5. The hidden state of the LSTM that integrates information across time is 512-dimensional. When using the Progress Marker, the markers are tiled $n = 32$ times. The learnable matrices $W_x$, $W_v$, $W_a$, $W_r$, and $W_{fr}$ have dimensions determined by these feature sizes; $W_{fr}$ differs depending on whether the Progress Marker is used, since the marker difference is concatenated to the projected image feature.
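The projection MLP $g$ described above can be written directly in PyTorch; this sketch assumes batched 2176-d inputs, and the function name is ours:

```python
import torch.nn as nn

def make_g(in_dim=2176, out_dim=1024, dropout=0.5):
    """g: BN -> FC -> BN -> Dropout -> ReLU, projecting image features to 1024-d."""
    return nn.Sequential(
        nn.BatchNorm1d(in_dim),
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.Dropout(p=dropout),
        nn.ReLU(),
    )
```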
Figure 5: A successful regretful agent navigating in unseen environments. (a) The agent makes a mistake at the first step, but is able to roll back to the previous location because the output of the progress monitor did not significantly increase. It then follows the rest of the instruction correctly. (b) The agent correctly follows the instruction at the beginning but makes a mistake by walking up the stairs again. The agent realizes that the progress monitor output decreased and that the next action, "take a right," is not feasible, and decides to roll back at step 4. It is then able to follow the rest of the instruction and stops with estimated progress 0.95. (c) The agent makes a mistake by missing the stairs at step 1. It is however able to decide to roll back at step 2, moves down the stairs as instructed, and successfully stops near the bamboo plant with estimated progress 0.99. Please see the rest of this Appendix for the full trajectories.

B. Comparison with Beam Search Methods

We compare our method, which uses greedy action selection, with existing beam search approaches: pragmatic inference in Speaker-Follower [12] and the beam search integrated with the progress monitor in the Self-Monitoring agent [14]. As Table 5 shows, while beam search methods perform well on success rate (SR), their trajectory lengths are significantly longer, yielding low success rate weighted by Path Length (SPL) scores; they are therefore impractical for real-world applications. Our proposed method, in contrast, significantly improves both SR and SPL without using beam search.

Table 5: Comparison of our regretful agent, using greedy action selection, with beam search methods. (Values lost in extraction are left blank.)

Method                   Beam     Test set (leaderboard)
                         search   NE↓   SR↑   Length↓  SPL↑
Speaker-Follower [12]             6.62  0.35  14.82    0.28
Speaker-Follower [12]      ✓
Self-Monitoring [14]       ✓
Regretful (ours)
C. Qualitative Analysis
C.1. Successful examples
We show the complete trajectories of the agent successfully deciding when to roll back and reaching the goal in unseen environments in Figures 6, 7, 8, and 9.

In Figure 6, we demonstrate that the agent is capable of performing a local search on the navigation graph. Specifically, from step 0 to step 3, the agent searches two possible directions and commits to one particular direction at step 4. Once it reaches step 5, the agent decides to continue moving forward, and we observe that the progress estimate significantly increases to 45% at step 7. Interestingly, unlike in the other examples we have shown, the agent does not decide to roll back despite the progress estimate slightly decreasing from 45% to 40%. We regard this as one of the advantages of using a learning-based Regret Module, where a learned and dynamically changing threshold decides when to roll back. Finally, the agent successfully stops in front of the microwave.

In Figure 7, the agent is instructed to "walk across living room". This is ambiguous, since both directions appear to lead to a living room. Our agent first moves in the direction that leads to a room with a kitchen and living room. It then decides to roll back when the progress monitor output slightly decreases. The agent then follows the rest of the instruction successfully, with the progress monitor output steadily increasing at each step. Finally, the agent decides to stop with a progress estimate of 99%.

In Figure 8, the agent first moves out of the room and walks up the stairs as instructed, but a second set of stairs makes the instruction ambiguous. The agent continues to walk up the stairs for one more step and then decides to go down the stairs at step 4. As the agent decides to turn right at step 6, the progress estimate significantly increases from 51% to 66%. Once the agent enters the TV room, the progress estimate increases again, to 82%. Finally, the agent successfully stops with a progress monitor output of 95%.

In Figure 9, the agent fails to walk down the stairs at step 1. Thanks to the proposed Regret Module and Progress Marker, the agent is able to discover the correct path and go downstairs. Once it walks down, the progress estimate increases to 39% immediately, and as the agent goes further down, the progress estimate reaches 98% by the time it reaches the bottom of the stairs. Finally, the agent decides to wait by the bamboo plant with a progress estimate of 99%.
C.2. Failed examples
We have shown how the agent can successfully utilize the rollback mechanism to reach the goal, even though it is unfamiliar with the environment and likely to be uncertain about some of the actions it takes. Intuitively, the rollback mechanism increases the chance that the agent reaches the goal, as long as the agent can correctly decide when to stop.

We now discuss two failed examples of our proposed regretful agent in unseen environments that closely resemble the successful examples in terms of the given instruction and ground-truth path. Both examples demonstrate that the agent successfully rolled back onto the correct path towards the goal but failed to stop at the goal.

Specifically, in Figure 10, the agent reaches the room with the white cabinet as instructed but decides to move one step forward. The agent then correctly rolls back to the room at step 5. However, this does not help the agent stop at the goal, resulting in a failed run.

In Figure 11, the progress estimate at step 5 drops significantly, by 21%, and the agent correctly decides to roll back. The agent then successfully reaches the refrigerator but does not stop immediately; it continues to move forward after step 8, resulting in an unsuccessful run.

Lastly, we discuss a failed example where the agent incorrectly decided when to roll back. In Figure 12, the agent first follows the instruction to go down the hallway and tries to find the second door to turn right. As the agent reaches the end of the hallway at step 4, it decides to roll back, since no available navigable direction leads to a right turn. The agent then goes down the hallway again in the completely opposite direction. However, the agent decides to roll back again at step 7, with the progress estimate dropping to 18%. Although the agent eventually escapes from the hallway leading to the dead end, the run ends up unsuccessful.

Figure 6: The first part of the instruction, "walk past the glass doors", is ambiguous, since multiple directions lead to glass doors, and the agent is naturally confused and uncertain where to go. Our agent is able to perform a local search on the navigation graph and decides to roll back multiple times at the beginning of the navigation. At step 6, the agent performs the action "turn right"; consequently, the progress estimate at step 7 significantly increases to 45%. Interestingly, the agent continues to move forward even though the progress estimate slightly decreases from step 7 to step 8. We regard this as one of the advantages of using a learning-based Regret Module, as opposed to a hard-coded threshold. The agent then successfully follows the instruction and stops in front of the microwave with a progress estimate of 89%.

Figure 7: The agent is instructed to "walk across living room" but moves in the direction that leads to the kitchen and dining room. At step 1, the agent decides to roll back due to a decrease in the progress monitor output. The agent then follows the rest of the instruction successfully, with the progress monitor output steadily increasing at each step. Finally, the agent decides to stop with a progress estimate of 99%.

Figure 8: The agent walks up the stairs as instructed at step 1, but a second set of stairs makes the instruction ambiguous. The agent continues to walk up the stairs but soon realizes that it needs to go down the stairs and turn right, which it does from steps 4-6. When the agent decides to turn right, the progress estimate significantly increases from 51% to 66%. As the agent turns right into the TV room, the progress estimate increases again, to 82%. Finally, the agent stops with a progress monitor output of 95%.

Figure 9: The agent walks down the hallway to the stairs but fails to walk down the stairs at step 1. With only a small increase in the progress monitor output, the agent decides to roll back and takes the action to walk down the stairs. Once it walks down, the progress estimate increases to 39%, and as the agent goes further down, the progress estimate reaches 98% at the bottom of the stairs. Finally, the agent decides to stop near the bamboo plant with a progress estimate of 99%.
Figure 10: Failed example. The agent starts to navigate through the unseen environment by following the given instruction. It successfully follows the instruction and correctly reaches the goal at step 4. The agent then moves forward towards the kitchen and correctly decides to roll back to the goal. However, the agent does not stop; it continues to explore the environment and eventually stops a bit farther from the goal.

Figure 11: The agent correctly follows the first parts of the instruction until step 4, but then moves forward towards the hall. At step 5, the agent correctly decides to roll back, with the progress estimate decreasing from 56% to 35%. The agent is then able to follow the rest of the instruction successfully and reaches the refrigerator at step 8. However, the agent does not stop near the refrigerator and takes another two forward steps.

Figure 12: The agent follows the first part of the instruction to go down the hallway. When the agent reaches the end of the hallway, it is not able to find the second door to turn left. The agent then decides to roll back at step 4, with the progress estimate decreasing from 65% to 61%. The agent continues back down the hallway but decides to roll back again at step 7. Although the agent is able to correct the errors made in the first few steps and escape from the hallway leading to the dead end, the run ends up unsuccessful.

References

[1] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
[2] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[3] J. Barraquand and J.-C. Latombe. Robot motion planning: A distributed representation approach. The International Journal of Robotics Research, 10(6):628-649, 1991.
[4] M. Bhardwaj, S. Choudhury, and S. Scherer. Learning heuristic search via imitation. arXiv preprint arXiv:1707.03034, 2017.
[5] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.
[6] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV), 2017.
[7] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[8] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra. Neural modular control for embodied question answering. In Proceedings of the Conference on Robot Learning (CoRL), 2018.
[9] H. de Vries, K. Shuster, D. Batra, D. Parikh, J. Weston, and D. Kiela. Talk the walk: Navigating New York City through grounded dialogue. arXiv preprint arXiv:1807.03367, 2018.
[10] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[11] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3:251-288, 1972.
[12] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems (NIPS), 2018.
[13] E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv, 2017.
[14] C.-Y. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[15] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), pages 1928-1937, 2016.
[17] A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, and J. Davidson. Visual representations for semantic target driven navigation. arXiv preprint arXiv:1805.06066, 2018.
[18] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
[19] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931, 2017.
[20] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W. Y. Wang, and L. Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[21] X. Wang, W. Xiong, H. Wang, and W. Y. Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In European Conference on Computer Vision (ECCV), 2018.
[22] G. Wayne, C.-C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, et al. Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.
[23] C. M. Wilt and W. Ruml. Building a heuristic for greedy search. In Eighth Annual Symposium on Combinatorial Search, 2015.
[24] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv preprint arXiv:1801.02209, 2018.
[25] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese. Gibson Env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9068-9079, 2018.
[26] Y. Xu, A. Fern, and S. W. Yoon. Discriminative learning of beam-search heuristics for planning. In IJCAI, pages 2041-2046, 2007.
[27] H. Yu, X. Lian, H. Zhang, and W. Xu. Guided feature transformation (GFT): A neural language grounding module for embodied agents. arXiv preprint arXiv:1805.08329, 2018.