Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation
Xiaoxue Zang∗, Ashwini Pokle∗, Marynel Vázquez, Kevin Chen, Juan Carlos Niebles, Alvaro Soto, Silvio Savarese
Stanford University, Yale University, P. Universidad Católica de Chile
{xzang, ashwinipokle, kchen92, jniebles, ssilvio}@stanford.edu, [email protected], [email protected]

Abstract
We propose an end-to-end deep learning model for translating free-form natural language instructions to a high-level plan for behavioral robot navigation. The proposed model uses attention mechanisms to connect information from user instructions with a topological representation of the environment. To evaluate this model, we collected a new dataset for the translation problem containing 11,051 pairs of user instructions and navigation plans. Our results show that the proposed model outperforms baseline approaches on the new dataset. Overall, our work suggests that a topological map of the environment can serve as a relevant knowledge base for translating natural language instructions into a sequence of navigation behaviors.
Enabling robots to follow navigation instructions in natural language can facilitate human-robot interaction across a variety of applications. For instance, within the service robotics domain, robots can follow navigation instructions to help with mobile manipulation (Tellex et al., 2011) and delivery tasks (Veloso et al., 2015).

Interpreting navigation instructions in natural language is difficult due to the high variability in the way people describe routes (Chen and Mooney, 2011). For example, there are a variety of ways to describe the route in Fig. 1(a):

– "Exit the room, turn right, follow the corridor until you pass a vase on your left, and enter the next room on your left"; or
– "Turn right after you exit the room, and enter the room on the left right before the end of the corridor"; or
– "Advance forward to the right after going out of the door. Enter the room which is in the middle of two vases on your left."

∗ Both authors contributed equally to this work.
Each fragment of a sentence within these instructions can be mapped to one or more navigation behaviors. For instance, assume that a robot is equipped with a number of primitive navigation behaviors, such as "enter the room on the left (or on the right)", "follow the corridor", "cross the intersection", etc. Then, the fragment "advance forward" in a navigation instruction could be interpreted as a "follow the corridor" behavior, or as a sequence of "follow the corridor" behaviors interspersed with "cross the intersection" behaviors, depending on the topology of the environment. Resolving such ambiguities often requires reasoning about "common-sense" concepts, as well as interpreting spatial information and landmarks, e.g., in sentences such as "the room on the left right before the end of the corridor" and "the room which is in the middle of two vases".

In this work, we pose the problem of interpreting navigation instructions as finding a mapping (or grounding) of the commands into an executable navigation plan. While the plan is typically modeled as a formal specification of low-level motions (Chen and Mooney, 2011) or a grammar (Artzi and Zettlemoyer, 2013; Matuszek et al., 2010), we focus specifically on translating instructions to a high-level navigation plan based on a topological representation of the environment. This representation is a behavioral navigation graph, as recently proposed by (Sepúlveda et al., 2018), designed to take advantage of the semantic structure typical of human environments. The nodes of the graph correspond to semantically meaningful locations for the navigation task, such as kitchens or entrances to rooms in corridors. The edges are parameterized, visuo-motor behaviors that allow a robot to navigate between neighboring nodes, as illustrated in Fig. 1(b). Under this framework, complex navigation routes can be achieved by sequencing behaviors without an explicit metric representation of the world.
Figure 1: Map of an environment (a), its (partial) behavioral navigation graph (b), and the problem setting of interest (c). The red part of (b) corresponds to the representation of the route highlighted in blue in (a). The codes "oo-left", "oo-right", "cf", "left-io", and "right-io" correspond to the behaviors "go out and turn left", "go out and turn right", "follow the corridor", "enter the room on left", and "enter office on right", respectively.

We formulate the problem of following instructions under the framework of (Sepúlveda et al., 2018) as finding a path in the behavioral navigation graph that follows the desired route, given a known starting location. The edges (behaviors) along this path serve to reach the – sometimes implicit – destination requested by the user. As in (Zang et al., 2018), our focus is on the problem of interpreting navigation directions. We assume that a robot can realize valid navigation plans according to the graph.

We contribute a new end-to-end model for following directions in natural language under the behavioral navigation framework. Inspired by the information retrieval and question answering literature (Lewis and Jones, 1996; Seo et al., 2017; Xiong et al., 2016; Palangi et al., 2016), we propose to leverage the behavioral graph as a knowledge base to facilitate the interpretation of navigation commands. More specifically, the proposed model takes as input user directions in text form, the behavioral graph of the environment encoded as ⟨node; edge; node⟩ triplets, and the initial location of the robot in the graph. The model then predicts a set of behaviors to reach the desired destination according to the instructions and the map (Fig. 1(c)).
Our main insight is that using attention mechanisms to correlate navigation instructions with the topological map of the environment can facilitate predicting correct navigation plans. This work also contributes a new dataset of 11,051 pairs of free-form natural language instructions and high-level navigation plans. This dataset was collected through Mechanical Turk using 100 simulated environments with a corresponding topological map and, to the best of our knowledge, it is the first of its kind for behavioral navigation. The dataset opens up opportunities to explore data-driven methods for grounding navigation commands into high-level motion plans.

We conduct extensive experiments to study the generalization capabilities of the proposed model for following natural language instructions. We investigate generalization both to new instructions in known environments and to new environments. We conclude this paper by discussing the benefits of the proposed approach as well as opportunities for future research based on our findings.

This section reviews relevant prior work on following navigation instructions. Readers interested in an in-depth review of methods to interpret spatial natural language for robotics are encouraged to refer to (Landsiedel et al., 2017).

Typical approaches to follow navigation commands deal with the complexity of natural language by manually parsing commands, constraining language descriptions, or using statistical machine translation methods. While manually parsing commands is often impractical, the first type of approach is foundational: it showed that it is possible to leverage the compositionality of semantic units to interpret spatial language (Bugmann et al., 2004; Levit and Roy, 2007).

Constraining language descriptions can reduce the size of the input space to facilitate the interpretation of user commands. For example, (Talbot et al., 2016) explored using structured, symbolic language phrases for navigation.
As in this earlier work, we are also interested in navigation with a topological map of the environment. However, we do not process symbolic phrases. Our aim is to translate free-form natural language instructions to a navigation plan using information from a high-level representation of the environment. This translation problem requires dealing with missing actions in navigation instructions and actions with preconditions, such as "at the end of the corridor, turn right" (MacMahon et al., 2006).

Statistical machine translation (Koehn, 2009) is at the core of recent approaches to enable robots to follow navigation instructions. These methods aim to automatically discover translation rules from a corpus of data, and often leverage the fact that navigation directions are composed of sequential commands. For instance, (Wong and Mooney, 2006; Matuszek et al., 2010; Chen and Mooney, 2011) used statistical machine translation to map instructions to a formal language defined by a grammar. Likewise, (Kollar et al., 2010; Tellex et al., 2011) mapped commands to spatial description clauses based on the hierarchical structure of language in the navigation problem. Our approach to machine translation builds on insights from these prior efforts. In particular, we focus on end-to-end learning for statistical machine translation due to the recent success of neural networks in natural language processing (Goodfellow et al., 2016).

Our work is inspired by methods that reduce the task of interpreting user commands to a sequential prediction problem (Shimizu and Haas, 2009; Mei et al., 2016; Anderson et al., 2018). Similar to Mei et al. and Anderson et al., we use a sequence-to-sequence model to enable a mobile agent to follow routes. But instead of leveraging visual information to output low-level navigation commands, we focus on using a topological map of the environment to output a high-level navigation plan.
This plan is a sequence of behaviors that can be executed by a robot to reach a desired destination (Sepúlveda et al., 2018; Zang et al., 2018).

We explore machine translation from the perspective of automatic question answering. Following (Seo et al., 2017; Xiong et al., 2016), our approach uses attention mechanisms to learn alignments between different input modalities. In our case, the inputs to our model are navigation instructions, a topological environment map, and the start location of the robot (Fig. 1(c)). Our results show that the map can serve as an effective source of contextual information for the translation task. Additionally, it is possible to leverage this kind of information in an end-to-end fashion.
Our goal is to translate navigation instructions in text form into a sequence of behaviors that a robot can execute to reach a desired destination from a known start location. We frame this problem under a behavioral approach to indoor autonomous navigation (Sepúlveda et al., 2018) and assume that prior knowledge about the environment is available for the translation task. This prior knowledge is a topological map, in the form of a behavioral navigation graph (Fig. 1(b)). The nodes of the graph correspond to semantically meaningful locations for the navigation task, and its directed edges are visuo-motor behaviors that a robot can use to move between nodes. This formulation takes advantage of the rich semantic structure behind man-made environments, resulting in a compact route representation for robot navigation.

Fig. 1(c) provides a schematic view of the problem setting. The inputs are: (1) a navigation graph m, (2) the starting node s of the robot in m, and (3) a set of free-form navigation instructions I in natural language. The instructions describe a path in the graph to reach from s to a – potentially implicit – destination node g. Using this information, the objective is to predict a suitable sequence of robot behaviors b_1, ..., b_T to navigate from s to g according to I. From a supervised learning perspective, the goal is then to estimate:

argmax_{b_1,...,b_T} P(b_1, ..., b_T | m, s, I)    (1)

based on a dataset of input-target pairs {(x_i, y_i) | 1 ≤ i ≤ N}, where x_i = (m, s, I)_i and y_i = (b_1, ..., b_T)_i, respectively. The sequential execution of the behaviors b_1, ..., b_T should replicate the route intended by the instructions I.

We assume no prior linguistic knowledge. Thus, translation approaches have to cope with the semantics and syntax of the language by discovering corresponding patterns in the data.
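As a concrete illustration of the input-target pairs behind Eq. (1), the sketch below builds one hypothetical training example. All node names, behavior codes, and instruction text are invented for illustration and do not come from the released dataset.

```python
# Hypothetical sketch of one training pair (x_i, y_i) for Eq. (1);
# node and behavior names are illustrative, not from the released dataset.

# Behavioral navigation graph m as <node; behavior; node> triplets.
m = [
    ("room-1", "oo-right", "corridor-1"),   # go out of room-1, turn right
    ("corridor-1", "cf", "corridor-1b"),    # follow the corridor
    ("corridor-1b", "left-io", "room-2"),   # enter the room on the left
]

s = "room-1"                                # known start node
I = ("exit the room, turn right, follow the corridor, "
     "and enter the next room on your left")  # free-form instruction

x = (m, s, I)                               # model input
y = ["oo-right", "cf", "left-io"]           # target behaviors b_1, ..., b_T

# Executing y edge-by-edge from s should reach the (implicit) goal g:
node = s
for behavior, (p_i, b, p_j) in zip(y, m):
    assert node == p_i and behavior == b
    node = p_j
print(node)  # -> room-2
```

Executing the target behavior sequence on the graph recovers the route that the instruction describes, which is exactly the supervision signal used for training.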
We view the behavioral graph m as a knowledge base that encodes a set of navigational rules as triplets ⟨p_i; b_l[attr]; p_j⟩, where p_i and p_j are adjacent nodes in the graph, and the edge b_l is an executable behavior to navigate from p_i to p_j. In general, each behavior includes a list of relevant navigational attributes attr that the robot might encounter when moving between nodes.

Behavior   Description
oo<d>      Go out of the current place and turn <d>
io<d>      Turn <d> and enter the place straight ahead
oio        Exit the current place and enter straight ahead
<d>t       Turn <d> at the intersection
cf         Follow (or go straight down) the corridor
sp         Go straight at a T intersection
ch<d>      Cross the hall and turn <d>

Table 1: Behaviors (edges) of the navigation graphs considered in this work. The direction <d> can be left or right.

We consider 7 types of semantic locations, 11 types of behaviors, and 20 different types of landmarks. A location in the navigation graph can be a room, a lab, an office, a kitchen, a hall, a corridor, or a bathroom. These places are labeled with unique tags, such as "room-1" or "lab-2", except for bathrooms and kitchens, which people do not typically refer to by unique names when describing navigation routes.

Table 1 lists the navigation behaviors that we consider in this work. These behaviors can be described in reference to visual landmarks or objects, such as paintings, bookshelves, tables, etc. As in Fig. 1, maps might contain multiple landmarks of the same type. Please see the supplementary material (Appendix A) for more details.
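The expansion of a behavioral graph into knowledge-base triplets can be sketched as follows. The adjacency structure, behavior codes, and landmark attributes below are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch: expanding a behavioral graph into the
# <p_i; b_l[attr]; p_j> triplets used as the knowledge base.
# Edge attributes list landmarks the robot may pass (names illustrative).

graph = {
    "room-1":     [("oo-right", [], "corridor-1")],
    "corridor-1": [("cf", ["vase", "vase"], "corridor-2"),
                   ("left-io", [], "room-1")],
}

def to_triplets(graph):
    """Flatten the adjacency structure into knowledge-base triplets."""
    triplets = []
    for p_i, edges in sorted(graph.items()):
        for behavior, attrs, p_j in edges:
            label = behavior + ("[" + ",".join(attrs) + "]" if attrs else "")
            triplets.append((p_i, label, p_j))
    return triplets

for t in to_triplets(graph):
    print(t)
```

Sorting the nodes alphabetically mirrors the default triplet ordering that is later compared against the "Ordered Triplets" variant in the experiments.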
We leverage recent advances in deep learning to translate natural language instructions to a sequence of navigation behaviors in an end-to-end fashion. Our proposed model builds on the sequence-to-sequence translation model of (Bahdanau et al., 2015), which computes a soft alignment between a source sequence (natural language instructions in our case) and the corresponding target sequence (navigation behaviors).

As one of our main contributions, we augment the neural machine translation approach of Bahdanau et al. to take as input not only natural language instructions, but also the corresponding behavioral navigation graph m of the environment where navigation should take place. Specifically, at each step, the graph m operates as a knowledge base that the model can access to obtain information about path connectivity, facilitating the grounding of navigation commands.

Figure 2 shows the structure of the proposed model for interpreting navigation instructions. The model consists of six layers:

Embed layer: The model first encodes each word and symbol in the input sequences I and m into fixed-length representations. The instructions I are embedded into 100-dimensional pre-trained GloVe vectors (Pennington et al., 2014). Each of the triplet components, p_i, b_l[attr], and p_j of the graph m, is one-hot encoded into a vector of dimensionality N + E, where N and E are the number of nodes and edges in m, respectively.

Encoder layer: The model then uses two bidirectional Gated Recurrent Units (GRUs) (Cho et al., 2014) to independently process the information from I and m, and incorporate contextual cues from the surrounding embeddings in each sequence. The outputs of the encoder layer are the matrix Ī ∈ R^{T×H} for the navigational commands and the matrix Ḡ ∈ R^{L×H} for the behavioral graph, where H is the hidden size of each GRU, T is the number of words in the instruction I, and L is the number of triplets in the graph m.
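The one-hot encoding of the triplet components in the embed layer can be sketched as follows, assuming a toy vocabulary of N = 3 nodes and E = 3 behaviors sharing one index space of size N + E.

```python
import numpy as np

# Sketch of the embed layer for graph triplets: each component of
# <p_i; b_l; p_j> is one-hot encoded into a vector of size N + E,
# where N = number of nodes and E = number of edge (behavior) types.
# The vocabulary below is illustrative.

nodes = ["room-1", "corridor-1", "room-2"]           # N = 3
behaviors = ["oo-right", "cf", "left-io"]            # E = 3
vocab = nodes + behaviors                            # shared index space

def one_hot(symbol):
    v = np.zeros(len(vocab))
    v[vocab.index(symbol)] = 1.0
    return v

triplet = ("room-1", "oo-right", "corridor-1")
encoded = np.stack([one_hot(s) for s in triplet])    # shape (3, N + E)
print(encoded.shape)  # -> (3, 6)
```

Each triplet thus becomes a short sequence of three sparse vectors, which the graph-side GRU then consumes one component at a time.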
Attention layer: The matrices Ī and Ḡ generated by the encoder layer are combined using an attention mechanism. We use one-way attention because the graph contains information about the whole environment, while the instruction has (potentially incomplete) local information about the route of interest. The use of attention provides our model with a two-step strategy to interpret commands. This resembles the way people find paths on a map: first, relevant parts of the map are selected according to their affinity to each of the words in the input instruction (attention layer); second, the selected parts are connected to assemble a valid path (decoder layer).

More formally, let Ḡ_i (i ∈ [1, L]) be the i-th row of Ḡ, and Ī_j (j ∈ [1, T]) the j-th row of Ī. We use each encoded triplet Ḡ_i in Ḡ to calculate its associated attention distribution a_i ∈ R^T over all the atomic instructions Ī_j:

e_i = [Ḡ_i W Ī_1^T, ..., Ḡ_i W Ī_T^T]    (2)
a_i = softmax(e_i)    (3)

where the matrix W ∈ R^{H×H} serves to combine the different sources of information Ḡ and Ī. Each component a_{ij} of the attention distribution a_i quantifies the affinity between the i-th triplet in Ḡ and the j-th word in the corresponding input I.

The model then uses each attention distribution a_i to obtain a weighted sum of the encodings of the words in Ī, according to their relevance to the corresponding triplet Ḡ_i. This results in L attention vectors R_i ∈ R^H, R_i = Σ_{j=1}^T a_{ij} Ī_j.

The final step in the attention layer concatenates each R_i with Ḡ_i to generate the outputs F_i = [R_i; Ḡ_i], i ∈ [1, L].

Figure 2: Model overview. The model contains six layers, takes as input the behavioral graph representation, the free-form instruction, and the start location (yellow block marked as START in the decoder layer), and outputs a sequence of behaviors.
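Eqs. (2)-(3) and the construction of F_i can be sketched with NumPy as follows. The sizes and random values stand in for the trained encoder outputs and learned weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sketch of Eqs. (2)-(3): one-way attention from each encoded graph
# triplet G_bar[i] over the encoded instruction words I_bar[j].
# Sizes are illustrative; H is the GRU hidden size.
L, T, H = 4, 6, 8                       # triplets, words, hidden size
rng = np.random.default_rng(0)
G_bar = rng.standard_normal((L, H))     # encoded graph triplets
I_bar = rng.standard_normal((T, H))     # encoded instruction words
W = rng.standard_normal((H, H))         # trainable combination matrix

F = []
for i in range(L):
    e_i = G_bar[i] @ W @ I_bar.T        # affinity scores, Eq. (2)
    a_i = softmax(e_i)                  # attention distribution, Eq. (3)
    R_i = a_i @ I_bar                   # weighted sum of word encodings
    F.append(np.concatenate([R_i, G_bar[i]]))  # F_i = [R_i; G_bar_i]

F = np.stack(F)
print(F.shape)  # -> (4, 16), i.e. (L, 2H)
```

Concatenating Ḡ_i back into F_i, as discussed next, keeps the raw triplet encoding available to later layers alongside its instruction-weighted summary.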
Following (Seo et al., 2017), we include the encoded triplet Ḡ_i in the output tensor F_i of this layer to prevent early summaries of relevant map information.

FC layer: The model reduces the dimensionality of each individual vector F_i from 2H to H with a fully-connected (FC) layer. The resulting L vectors are output to the next layer as columns of a context matrix C ∈ R^{H×L}.

Decoder layer: After the FC layer, the model predicts likelihoods over the sequence of behaviors that correspond to the input instructions with a GRU network. Without loss of generality, consider the t-th recurrent cell in the GRU network. This cell takes two inputs: a hidden state vector h_{t−1} from the prior cell, and a one-hot embedding of the previous behavior b_{t−1} that was predicted by the model. Based on these inputs, the GRU cell outputs a new hidden state h_t to compute likelihoods for the next behavior. These likelihoods are estimated by combining the output state h_t with relevant information from the context C:

d̂_{ts} = v_a^T tanh(W_1 h_t + W_2 C_s)    (4)
d_t = softmax(d̂_{t1}, ..., d̂_{tL})    (5)

where W_1, W_2, and v_a are trainable parameters. The attention vector d_t ∈ R^L in Eq. (5) quantifies the affinity of h_t with respect to each of the columns C_s of C, where s ∈ [1, L]. The attention vector also helps to estimate a dynamic contextual vector S_t = Σ_{s=1}^L d_{ts} C_s that the t-th GRU cell uses to compute logits for the next behavior:

o_t = W_3 [S_t; h_t]    (6)

with W_3 trainable parameters. Note that o_t includes a value for each of the pre-defined behaviors in the graph m, as well as for a special "stop" symbol to identify the end of the output sequence.

Output layer: The final layer of the model searches for a valid sequence of robot behaviors based on the robot's initial node, the connectivity of the graph m, and the output logits from the previous decoder layer.
Again, without loss of generality, consider the t-th behavior b_t that is finally predicted by the model. The search for this behavior is implemented as:

b_t = argmax(softmax(o_t + mask(m, n_t)))    (7)

with mask(m, n_t) a masking function that takes as input the graph m and the node n_t that the robot reaches after following the sequence of behaviors b_1, ..., b_{t−1} previously predicted by the model. The mask function returns a vector of the same dimensionality as the logits o_t, but with zeros for the valid behaviors after the last location n_t and for the special stop symbol, and −∞ for any invalid predictions according to the connectivity of the behavioral navigation graph.

We created a new dataset for the problem of following navigation instructions under the behavioral navigation framework of (Sepúlveda et al., 2018). This dataset was created using Amazon Mechanical Turk and 100 maps of simulated indoor environments, each with 6 to 65 rooms. To the best of our knowledge, this is the first benchmark for comparing translation models in the context of behavioral robot navigation.

The dataset is publicly available through the website: follow-nav-directions.stanford.edu.

Dataset         Single-instr. plans   Double-instr. plans   Total pairs
Training        4062                  2002                  8066
Test-Repeated   944                   34                    1012
Test-New        962                   0                     962

Table 2: Dataset statistics.

As shown in Table 2, the dataset consists of 8066 pairs of free-form natural language instructions and navigation plans for training. This training data was collected from 88 unique simulated environments, totaling 6064 distinct navigation plans (2002 plans have two different navigation instructions each; the rest have one). The dataset contains two test set variants:
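Returning to the model, the decoder-side attention of Eqs. (4)-(6) and the masking of Eq. (7) can be sketched as follows. All shapes, weight values, and the toy connectivity mask are illustrative placeholders, not the trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sketch of the decoder-side attention and masking, Eqs. (4)-(7).
L, H, B = 5, 8, 4                 # triplets, hidden size, behaviors (+stop)
rng = np.random.default_rng(1)
C = rng.standard_normal((H, L))   # context matrix from the FC layer
h_t = rng.standard_normal(H)      # decoder hidden state at step t
W1 = rng.standard_normal((H, H))
W2 = rng.standard_normal((H, H))
v_a = rng.standard_normal(H)
W3 = rng.standard_normal((B, 2 * H))

# Eq. (4): affinity of h_t with each context column C_s.
scores = np.array([v_a @ np.tanh(W1 @ h_t + W2 @ C[:, s]) for s in range(L)])
d_t = softmax(scores)             # Eq. (5): attention over context columns
S_t = C @ d_t                     # dynamic contextual vector
o_t = W3 @ np.concatenate([S_t, h_t])   # Eq. (6): behavior logits

# Eq. (7): mask behaviors that are invalid at the current node n_t.
valid = np.array([True, False, True, True])   # toy connectivity check
mask = np.where(valid, 0.0, -np.inf)
b_t = int(np.argmax(softmax(o_t + mask)))
assert valid[b_t]                 # prediction is always an executable behavior
print(b_t)
```

Adding −∞ before the softmax drives the probability of invalid behaviors to zero, so the argmax can only select an edge that actually exists at the robot's current node.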
1) Test-Repeated:
Contains 1012 pairs of instructions and navigation plans. These routes are not part of the training set; however, they are collected using environments that are part of the training set.
2) Test-New:
Contains 962 pairs of instructions and navigation plans. This test set is more challenging than the Test-Repeated set because it contains new routes in 12 new indoor environments not included in the training set.

While the dataset was collected with simulated environments, no structure was imposed on the navigation instructions while crowd-sourcing data. Thus, many instructions in our dataset are ambiguous. Moreover, the order of the behaviors in the instructions is not always the same. For instance, one person said "turn right and advance" to describe part of a route, while another person said "go straight after turning right" in a similar situation. The high variability present in the natural language descriptions of our dataset makes the problem of decoding instructions into behaviors non-trivial. See Appendix A of the supplementary material for additional details on our data collection effort.
This section describes our evaluation of the proposed approach for interpreting navigation commands in natural language. We provide both quantitative and qualitative results.
While computing evaluation metrics, we only consider the behaviors present in the route because they are sufficient to recover the high-level navigation plan from the graph. Our metrics treat each behavior as a single token. For example, the sample plan "R-1 oor C-1 cf C-1 lt C-0 cf C-0 iol O-3" is considered to have 5 tokens, each corresponding to one of its behaviors ("oor", "cf", "lt", "cf", "iol"). In this plan, "R-1", "C-1", "C-0", and "O-3" are symbols for locations (nodes) in the graph.

We compare the performance of translation approaches based on four metrics:

- Exact Match (EM).
As in (Shimizu and Haas, 2009), EM is 1 if a predicted plan matches the ground truth exactly; otherwise it is 0.

- F1 score (F1). The harmonic mean of the precision and recall over all the test set (Chinchor and Sundheim, 1993).

- Edit Distance (ED). The minimum number of insertions, deletions, or swap operations required to transform a predicted sequence of behaviors into the ground truth sequence (Navarro, 2001).

- Goal Match (GM). GM is 1 if a predicted plan reaches the ground truth destination (even if the full sequence of behaviors does not match the ground truth exactly). Otherwise, GM is 0.
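The metrics above can be sketched as follows. The F1 computation here is a simplified token-level variant, and GM is omitted since it requires executing the predicted plan on the graph; neither is claimed to be the paper's exact implementation.

```python
# Sketch of evaluation metrics over behavior-token sequences.

def exact_match(pred, gold):
    """EM: 1 if the predicted plan matches the ground truth exactly."""
    return int(pred == gold)

def f1_score(pred, gold):
    """Simplified token-level F1 over predicted vs. gold behaviors."""
    common = sum(min(pred.count(b), gold.count(b)) for b in set(pred))
    if common == 0:
        return 0.0
    p, r = common / len(pred), common / len(gold)
    return 2 * p * r / (p + r)

def edit_distance(pred, gold):
    """ED: Levenshtein distance (insertions, deletions, substitutions)."""
    d = list(range(len(gold) + 1))
    for i, a in enumerate(pred, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(gold, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a != b))
    return d[-1]

pred = ["oor", "cf", "lt", "cf", "iol"]
gold = ["oor", "cf", "cf", "iol"]
print(exact_match(pred, gold), edit_distance(pred, gold))  # -> 0 1
```

In the example, the prediction contains one spurious "lt" behavior, so EM is 0 while the edit distance is only 1 (a single deletion).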
We compare the proposed approach for translating natural language instructions into a navigation plan against alternative deep-learning models:
Baseline model.
The baseline approach is based on (Shimizu and Haas, 2009). It divides the task of interpreting commands for behavioral navigation into two steps: path generation and path verification. For path generation, this baseline uses a standard sequence-to-sequence model augmented with an attention mechanism, similar to (Bahdanau et al., 2015; Zang et al., 2018). For path verification, the baseline uses depth-first search to find a route in the graph that matches the sequence of predicted behaviors. If no route matches perfectly, the baseline changes up to three behaviors in the predicted sequence to try to turn it into a valid path.
Ablation model.
To test the impact of using the behavioral graphs as an extra input to our translation model, we implemented a version of our approach that only takes natural language instructions as input. In this ablation model, the output of the bidirectional GRU that encodes the input instruction I is directly fed to the decoder layer. This model does not have the attention and FC layers described in Sec. 4, nor does it use the masking function in the output layer.

Ablation with mask model.
This model is the same as the previous Ablation model, but with the masking function in the output layer.
We pre-processed the inputs to the various models that are considered in our experiment. In particular, we lowercased, tokenized, spell-checked, and lemmatized the input instructions in text form using WordNet (Miller, 1995). We also truncated the graphs to a maximum of 300 triplets, and the navigational instructions to a maximum of 150 words. Only 6.4% (5.4%) of the unique graphs in the training (validation) set had more than 300 triplets, and less than 0.15% of the natural language instructions in these sets had more than 150 tokens.

The dimensionality of the hidden state of the GRU networks was set to 128 in all the experiments. In general, we used 12.5% of the training set as validation for choosing the models' hyper-parameters. In particular, we used dropout after the encoder and the fully-connected layers of the proposed model to reduce overfitting. Best performance was achieved with a dropout rate of 0.5 and a batch size equal to 256. We also used scheduled sampling (Bengio et al., 2015) at training time for all models except the baseline.

We input the triplets from the graph to our proposed model in alphabetical order, and consider a modification where the triplets that surround the start location of the robot are provided first in the input graph sequence. We hypothesized that such a rearrangement would help identify the starting location (node) of the robot in the graph. In turn, this could facilitate the prediction of correct output sequences. In the remainder of the paper, we refer to models that were provided a rearranged graph, beginning with the starting location of the robot, as models with "Ordered Triplets".
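A sketch of the "Ordered Triplets" rearrangement, under the assumption that "surrounding" means triplets whose source or target node is the start location:

```python
# Sketch of the "Ordered Triplets" rearrangement: triplets touching the
# robot's start node are moved to the front of the otherwise
# alphabetically-ordered input sequence. Node names are illustrative.

def order_triplets(triplets, start):
    """Put triplets surrounding the start node first, rest alphabetical."""
    triplets = sorted(triplets)
    near = [t for t in triplets if start in (t[0], t[2])]
    far = [t for t in triplets if start not in (t[0], t[2])]
    return near + far

triplets = [
    ("corridor-1", "cf", "corridor-2"),
    ("room-1", "oo-right", "corridor-1"),
    ("room-2", "oo-left", "corridor-2"),
]
print(order_triplets(triplets, "room-1")[0])
# -> ('room-1', 'oo-right', 'corridor-1')
```

Because the rearrangement is a deterministic pre-processing step, it can be applied identically at training and test time without changing the model itself.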
Table 3 shows the performance of the models considered in our evaluation on both test sets. The next two sections discuss the results in detail.
First, we can observe that the final model "Ours with Mask and Ordered Triplets" outperforms the Baseline and Ablation models on all metrics in previously seen environments. The difference in performance is particularly evident for the Exact Match and Goal Match metrics, with our model increasing accuracy by 35% and 25% in comparison to the Baseline and Ablation models, respectively. These results suggest that providing the behavioral navigation graph to the model and allowing it to process this information as a knowledge base in an end-to-end fashion is beneficial.

We can also observe from Table 3 that the masking function of Eq. (7) tends to increase performance in the Test-Repeated Set by constraining the output sequence to a valid set of navigation behaviors. For the Ablation model, using the masking function leads to a noticeable increase in EM and GM accuracy. For the proposed model (with or without reordering the graph triplets), the masking function also increases accuracy. Note that the impact of the masking function is less evident in terms of the F1 score because this metric considers whether a predicted behavior exists in the ground truth navigation plan, irrespective of its specific position in the output sequence.

The results in the last four rows of Table 3 suggest that ordering the graph triplets can facilitate predicting correct navigation plans in previously seen environments. Providing the triplets that surround the starting location of the robot first to the model leads to a boost in EM and GM performance. The rearrangement of the graph triplets also helps to reduce ED and increase F1.

Lastly, it is worth noting that our proposed model (last row of Table 3) outperforms all other models in previously seen environments. In particular, we obtain a clear increase in EM and GM between our model and the next best two models.
The previous section evaluated model performance on new instructions (and corresponding navigation plans) for environments that were previously seen at training time. Here, we examine whether the trained models succeed on environments that are completely new.

The evaluation on the Test-New Set helps understand the generalization capabilities of the models under consideration. This experiment is more challenging than the one in the previous section, as can be seen in the performance drops in Table 3 for the new environments. Nonetheless, the insights from the previous section still hold: masking in the output layer and reordering the graph triplets tend to increase performance.

Model                                         Test-Repeated Set             Test-New Set
                                              EM↑    F1↑    ED↓   GM↑      EM↑    F1↑    ED↓   GM↑
Baseline                                      25.30  79.83  2.53  26.28    25.44  81.38  2.39  25.44
Ablation                                      36.36  90.28  1.36  36.36    24.82  88.65  1.71  24.92
Ablation with Mask                            45.95  90.08  1.20  46.05    36.45  88.31  1.45  36.56
Ours without Mask                             52.47  91.74  0.95  53.95    21.94  87.50  1.78  22.65
Ours with Mask                                57.31  91.91  0.91  57.31    38.52  88.98  1.32  38.52
Ours without Mask and with Ordered Triplets   57.21  93.37  0.79  57.71    33.36  —      —     —

Table 3: Performance of different models on the test datasets. EM and GM report percentages, and ED corresponds to average edit distance. The symbol ↑ indicates that higher results are better in the corresponding column; ↓ indicates that lower is better.

Even though the results in Table 3 suggest that there is room for future work on decoding natural language instructions, our model still outperforms the baselines by a clear margin in new environments. For instance, there is a sizable difference in EM and GM between our model and the second best model in the Test-New set. Note that the average number of actions in the ground truth output sequences is 7.07 for the Test-New set, and our model's predictions are, on average, only a small number of edits away from the correct navigation plans.

This section discusses qualitative results to better understand how the proposed model uses the navigation graph.
We analyze the evolution of the attention weights d_t in Eq. (5) to assess whether the decoder layer of the proposed model is attending to the correct parts of the behavioral graph when making predictions. Fig. 3(b) shows an example of the resulting attention map for the case of a correct prediction. In the figure, the attention map is depicted as a scaled and normalized 2D array of color codes. Each column in the array shows the attention distribution d_t used to generate the predicted output at step t. Consequently, each row in the array represents a triplet in the corresponding behavioral graph. This graph consists of 72 triplets for Fig. 3(b).

We observe a locality effect associated with the attention coefficients corresponding to high values (bright areas) in each column of Fig. 3(b). This suggests that the decoder is paying attention to graph triplets associated with particular neighborhoods of the environment in each prediction step.

Figure 3: Visualization of the attention weights of the decoder layer. The color-coded and numbered regions on the map (left) correspond to the triplets that are highlighted with the corresponding color in the attention map (right).

We include additional attention visualizations in the supplementary Appendix, including cases where the dynamics of the attention distribution are harder to interpret.
All the routes in our dataset are the shortest paths from a start location to a given destination. Thus, we collected a few additional natural language instructions to check whether our model was able to follow navigation instructions describing sub-optimal paths. One such example is shown in Fig. 4, where the blue route (shortest path) and the red route (alternative path) are described by:

– Blue route: "Go out the office and make a left. Turn right at the corner and go down the hall. Make a right at the next corner and enter the kitchen in front of table."

– Red route: "Exit the room 0 and turn right, go to the end of the corridor and turn left, go straight to the end of the corridor and turn left again. After passing bookshelf on your left and table on your right, Enter the kitchen on your right."

For both routes, the proposed model was able to predict the correct sequence of navigation behaviors. This result suggests that the model is indeed using the input instructions and is not just approximating shortest paths in the behavioral graph.

Figure 4: An example of two different navigation paths between the same pair of start and goal locations.
Other examples of predictions for sub-optimal paths are described in the Appendix.
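One way to make the sub-optimality of a predicted plan concrete is to execute the behavior sequence on the triplet graph and compare its length against the shortest behavioral path between the same endpoints. The helper below is an illustrative sketch under the triplet-graph view, not the paper's evaluation code; the behavior labels in the usage example are hypothetical.

```python
from collections import deque

def execute(triplets, start, plan):
    """Follow a behavior sequence through the graph.

    Returns the node reached, or None if some behavior in the plan is not
    available at the current node (assumes at most one edge per behavior).
    """
    step = {(u, b): v for u, b, v in triplets}
    node = start
    for behavior in plan:
        node = step.get((node, behavior))
        if node is None:
            return None
    return node

def shortest_plan_length(triplets, start, goal):
    """BFS over nodes: minimum number of behaviors from start to goal."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, dist = frontier.popleft()
        if node == goal:
            return dist
        for u, b, v in triplets:
            if u == node and v not in seen:
                seen.add(v)
                frontier.append((v, dist + 1))
    return None  # goal unreachable
```

For the example in Fig. 4, a red-route plan executed this way would reach the kitchen in more steps than `shortest_plan_length` reports for the blue route, confirming that the model followed the instructions rather than the shortest path.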
Conclusion

This work introduced behavioral navigation through free-form natural language instructions as a challenging and novel task that falls at the intersection of natural language processing and robotics. This problem has a range of interesting cross-domain applications, including information retrieval.

We proposed an end-to-end system to translate user instructions into a high-level navigation plan. Our model uses an attention mechanism to merge relevant information from the navigation instructions with a behavioral graph of the environment. The model then uses a decoder to predict a sequence of navigation behaviors that matches the input commands.

As part of this effort, we contributed a new dataset of 11,051 pairs of user instructions and navigation plans from 100 different environments. Our model achieved the best performance on this dataset in comparison to a two-step baseline approach for interpreting navigation instructions and a sequence-to-sequence model that does not consider the behavioral graph. Our quantitative and qualitative results suggest that attention mechanisms can help leverage the behavioral graph as a relevant knowledge base to facilitate the translation of free-form navigation instructions. Overall, our approach demonstrates a practical form of learning for a complex and useful task.

In future work, we are interested in investigating mechanisms to improve generalization to new environments. For example, pointer and graph networks (Vinyals et al., 2015; Defferrard et al., 2016) are a promising direction to help supervise translation models and predict motion behaviors.
Acknowledgments
The Toyota Research Institute (TRI) provided funds to assist with this research, but this paper solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This work is also partially funded by Fondecyt grant 1181739, Conicyt, Chile. The authors would also like to thank Gabriel Sepúlveda for his assistance with parts of this project.
References
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.

Guido Bugmann, Ewan Klein, Stanislao Lauria, and Theocharis Kyriacou. 2004. Corpus-based robotics: A route instruction example. In Proceedings of Intelligent Autonomous Systems (IAS), pages 96–103. Citeseer.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In AAAI Conference on Artificial Intelligence, pages 859–865.

Nancy Chinchor and Beth Sundheim. 1993. MUC-5 evaluation metrics. In Proceedings of the 5th Conference on Message Understanding, pages 69–78. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. CoRR, abs/1606.09375.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

P. Koehn. 2009. Statistical Machine Translation. Cambridge University Press.

T. Kollar, S. Tellex, D. Roy, and N. Roy. 2010. Toward understanding natural language directions. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 259–266.

Christian Landsiedel, Verena Rieser, Matthew Walter, and Dirk Wollherr. 2017. A review of spatial reasoning and interaction for real-world robotics. Advanced Robotics, 31(5):222–242.

Michael Levit and Deb Roy. 2007. Interpretation of spatial language in a map navigation task. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(3):667–679.

David D. Lewis and Karen Spärck Jones. 1996. Natural language processing for information retrieval. Communications of the ACM, 39(1):92–101.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. In National Conference on Artificial Intelligence (AAAI).

C. Matuszek, D. Fox, and K. Koscher. 2010. Following directions using statistical machine translation. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 251–258.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In National Conference on Artificial Intelligence (AAAI), pages 2772–2778.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM.

Gonzalo Navarro. 2001. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88.

Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In International Conference on Learning Representations (ICLR).

G. Sepúlveda, J. C. Niebles, and A. Soto. 2018. A deep learning based behavioral approach to indoor autonomous navigation. In International Conference on Robotics and Automation (ICRA).

Nobuyuki Shimizu and Andrew R. Haas. 2009. Learning to follow navigational route instructions. In International Joint Conferences on Artificial Intelligence (IJCAI).

Ben Talbot, Obadiah Lam, Ruth Schulz, Feras Dayoub, Ben Upcroft, and Gordon Wyeth. 2016. Find my office: Navigating real space from semantic descriptions. IEEE International Conference on Robotics and Automation (ICRA), pages 5782–5787.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In National Conference on Artificial Intelligence (AAAI), volume 1, page 2.

Manuela M. Veloso, Joydeep Biswas, Brian Coltin, and Stephanie Rosenthal. 2015. CoBots: Robust symbiotic autonomous mobile service robots. In International Joint Conferences on Artificial Intelligence (IJCAI), page 4423.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700.

Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 439–446.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning (ICML), pages 2397–2406.

Xiaoxue Zang, Marynel Vázquez, Juan Carlos Niebles, Alvaro Soto, and Silvio Savarese. 2018. Behavioral indoor navigation with natural language directions. In