A Tale of Two DRAGGNs: A Hybrid Approach for Interpreting Action-Oriented and Goal-Oriented Instructions
Siddharth Karamcheti, Edward C. Williams, Dilip Arumugam, Mina Rhee, Nakul Gopalan, Lawson L. S. Wong, Stefanie Tellex
Department of Computer Science, Brown University, Providence, RI 02912
{siddharth_karamcheti@, edward_c_williams@, dilip_arumugam@, mina_rhee@, ngopalan@cs., lsw@, stefie10@cs.}brown.edu

Abstract
Robots operating alongside humans in diverse, stochastic environments must be able to accurately interpret natural language commands. These instructions often fall into one of two categories: those that specify a goal condition or target state, and those that specify explicit actions, or how to perform a given task. Recent approaches have used reward functions as a semantic representation of goal-based commands, which allows for the use of a state-of-the-art planner to find a policy for the given task. However, these reward functions cannot be directly used to represent action-oriented commands. We introduce a new hybrid approach, the Deep Recurrent Action-Goal Grounding Network (DRAGGN), for task grounding and execution that handles natural language from either category as input, and generalizes to unseen environments. Our robot-simulation results demonstrate that a system successfully interpreting both goal-oriented and action-oriented task specifications brings us closer to robust natural language understanding for human-robot interaction.
1 Introduction

Natural language affords a convenient choice for delivering instructions to robots, as it offers flexibility, familiarity, and does not require users to have knowledge of low-level programming. In the context of grounding natural language instructions to tasks, human-robot instructions can be interpreted as either high-level goal specifications or low-level instructions for the robot to execute.

Figure 1: Sample configuration of the Cleanup World mobile-manipulator domain (MacGlashan et al., 2015), used throughout this work. A possible goal-based instruction could be "Take the chair to the green room," while a possible action-based instruction could be "Go three steps south, then two steps west."

Goal-oriented commands define a particular target state specifying where a robot should end up, whereas action-oriented commands specify a particular sequence of actions to be executed. For example, a human instructing a robot to "go to the kitchen" outlines a goal condition to check if the robot is in the kitchen. Alternatively, a human providing the command "take three steps to the left" defines a trajectory for the robot to execute. We need to consider both forms of commands to understand the full space of natural language that humans may use to communicate their intent to robots. While humans also combine commands of both types into a single instruction, we make the simplifying assumption that a command belongs entirely to a single type and leave the task of handling mixtures and compositions to future work.

Figure 2: System for grounding both action-oriented (left branch) and goal-oriented (right branch) natural language instructions to executable robot tasks. Our main contribution is the hybrid interpretation system (blue box), for which we present two novel models based on the DRAGGN framework (J-DRAGGN and I-DRAGGN) in Section 4.

Existing approaches can be broadly divided into one of two regimes. Goal-based approaches like MacGlashan et al. (2015) and Arumugam et al. (2017) leverage some intermediate task representation and then automatically find a low-level trajectory to achieve the goal using a planner. Other approaches, in the action-oriented regime, directly infer action sequences (Tellex et al., 2011; Matuszek et al., 2012; Artzi and Zettlemoyer, 2013; Andreas and Klein, 2015) from the syntactic or semantic parse structure of natural language. However, these approaches can be computationally intractable for large state-action spaces, or use ad-hoc methods to execute high-level language rather than relying on a planner. Furthermore, these methods are unable to adapt to dynamic changes in the environment; for example, consider an environment in which the wind, or some other force, moves an object that a robot has been tasked with picking up. Action-sequence-based approaches would fail to handle this without additional user input, while goal-based approaches would be able to re-plan on the fly and complete the task.

To address the issue of dealing with both goal-oriented and action-oriented commands, we present a new language grounding framework that, given a natural language command, is capable of inferring the latent command type. Recent approaches leveraging deep neural networks have formulated the language grounding problem as sequence-to-sequence learning or multi-label classification (Mei et al., 2016; Arumugam et al., 2017).
Inspired by the recent success of neural networks in modeling programs that are highly compositional and sequential in nature, we present the Deep Recurrent Action/Goal Grounding Network (DRAGGN) framework, derived from the Neural Programmer-Interpreter (NPI) of Reed and de Freitas (2016) and outlined in Section 4.2. We introduce two instances of DRAGGN models, each with slightly different architectures. The first, the Joint-DRAGGN (J-DRAGGN), is defined in Section 4.3, while the second, the Independent-DRAGGN (I-DRAGGN), is defined in Section 4.4.

2 Related Work

There has been a broad and diverse set of work examining how best to interpret and execute natural language instructions on a robot platform (Vogel and Jurafsky, 2010; Tellex et al., 2011; Artzi and Zettlemoyer, 2013; Howard et al., 2014; Andreas and Klein, 2015; Hemachandra et al., 2015; MacGlashan et al., 2015; Paul et al., 2016; Mei et al., 2016; Arumugam et al., 2017). Vogel and Jurafsky (2010) produce policies using language and rewards based on expert trajectories, which allows for planning within a stochastic environment along with re-planning in case of failure. Tellex et al. (2011) instead ground language to trajectories satisfying the language specification. Howard et al. (2014) ground language to constraints given to an external planner, which form a much smaller space to perform inference over than trajectories. MacGlashan et al. (2015) formulate language grounding as a machine translation problem, treating propositional logic functions as both a machine language and reward function. Reward functions or cost functions allow richer descriptions of trajectories than plain constraints, as they can describe preferential paths. Additionally, Arumugam et al. (2017) simplify the problem from one of machine translation to multi-class classification, learning a deep neural network to map arbitrary natural language instructions to the corresponding reward function.

Informing our distinction between action sequences and goal state representations is the division presented by Dzifcak et al. (2009), who posited that natural language can be interpreted as both a goal state specification and an action specification. Rather than producing both from each language command, our DRAGGN framework makes the simplifying assumption that only one representation captures the semantics of the language; additionally, our framework does not require a manually pre-specified grammar.

Recently, deep neural networks have found widespread success and application in a wide array of problems dealing with natural language (Bengio et al., 2000; Mikolov et al., 2010, 2011; Cho et al., 2014; Chung et al., 2014; Iyyer et al., 2015). Unsurprisingly, there have been some initial steps taken towards applying neural networks to language grounding problems. Mei et al. (2016) use a recurrent neural network (RNN) with long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) to learn sequence-to-sequence mappings between natural language and robot actions. This model augments the standard sequence-to-sequence architecture by learning parameters that represent latent alignments between natural language tokens and robot actions. Arumugam et al.
(2017) use an RNN-based model to produce grounded reward functions at multiple levels of an Abstract Markov Decision Process hierarchy (Gopalan et al., 2017), varying the level of the predicted reward function with the level of abstraction used in the natural language.

Our DRAGGN framework is closely related to the Neural Programmer-Interpreter (NPI) (Reed and de Freitas, 2016). The original NPI model is a controller trained via supervised learning to interpret and learn when to call specific programs/subprograms, which arguments to pass into the currently active program, and when to terminate execution of the current program. We draw a parallel between inferred NPI programs and our method of predicting either lifted reward functions or action trajectories.
3 Background

We consider the problem of mapping from natural language to robot actions within the context of Markov decision processes. A Markov decision process (MDP) is a five-tuple ⟨S, A, T, R, γ⟩ defining a state space S, action space A, state transition probabilities T, reward function R, and discount factor γ (Bellman, 1957; Puterman, 1994). An MDP solver produces a policy that maps from states to actions in order to maximize the total expected discounted reward.

While reward functions are flexible and expressive enough for a wide variety of task specifications, they are a brittle choice for specifying an exact sequence of actions, as enumerating every possible action sequence as a reward function (i.e., a specific reward function for the sequence Up 3, Down 2) can quickly become intractable. This paper introduces models that can produce desired behavior by inferring either reward functions or primitive actions. We assume that all available actions A and the full space of potential reward functions (i.e., the full space of possible tasks) are known a priori. When a reward function is predicted by the model, an MDP planner is applied to derive the resultant policy (see the system pipeline in Figure 2).

We focus our evaluation of all models on the Cleanup World mobile-manipulator domain (MacGlashan et al., 2015; Arumugam et al., 2017). The Cleanup World domain consists of an agent in a 2-D world with uniquely colored rooms and movable objects. A domain instance is shown in Figure 1. The domain itself is implemented as an object-oriented Markov decision process (OO-MDP), where states are denoted entirely by collections of objects, with each object having its own identifier, type, and set of attributes (Diuk et al., 2008). Domain objects include rooms and interactable objects (e.g., a chair, basket, etc.), all of which have location and color attributes. Propositional logic functions can be used to identify relevant pieces of an OO-MDP state and their attributes; as in MacGlashan et al. (2015) and Arumugam et al. (2017), we treat these propositional functions as reward functions. In Figure 1, the goal-oriented command "take the chair to the green room" may be represented with the reward function blockInRoom block0 room1, where the blockInRoom propositional function checks if the location attribute of block0 is contained in room1.

4 Approach

We now outline the pipeline that converts natural language input to robot behavior. We begin by defining the semantic task representation used by our grounding models, which comes directly from the OO-MDP propositional functions of the domain. Next, we examine our novel DRAGGN framework for language grounding and, in particular, address the separate paths taken by action-oriented and goal-oriented commands through the system, as seen in Figure 2. Finally, we discuss two different implementations of the DRAGGN framework that make different assumptions about the relationship between tasks and constraints. Specifically, we introduce the Joint-DRAGGN (J-DRAGGN), which assumes a probabilistic dependence between tasks (e.g., goUp) and the corresponding arguments (e.g., number of steps) given a natural language instruction, and the Independent-DRAGGN (I-DRAGGN), which treats tasks and arguments as independent given a natural language instruction.
4.1 Callable Units

In order to map arbitrary natural language instructions to either action trajectories or goal conditions, we require a compact but sufficiently expressive semantic representation for both. To this end, we define the callable unit, which takes the form of a single-argument function. These functions are paired with binding arguments whose possible values depend on the callable unit type. As in MacGlashan et al. (2015) and Arumugam et al. (2017), our approach generates reward function templates, or lifted reward functions, for goal-oriented tasks along with environment-specific constraints. Once these templates and constraints are resolved to get a grounded reward function, the associated goal-oriented tasks can be solved by an off-the-shelf planner, thereby improving transfer and generalization capabilities.

Goal-oriented callable units (lifted reward functions) are paired with binding arguments that specify properties of environment entities that must be satisfied in order to achieve the goal. These binding arguments are later resolved by the Grounding Module (see Section 4.5) to produce grounded reward functions (OO-MDP propositional logic functions) that are handled by an MDP planner. Action-oriented callable units directly correspond to the primitive actions available to the robot and are paired with binding arguments defining the number of sequential executions of that action. The full set of callable units along with requisite binding arguments is shown in Table 1.

Action-Oriented         Goal-Oriented
goUp(numSteps)          agentInRoom(room)
goDown(numSteps)        blockInRoom(room)
goLeft(numSteps)
goRight(numSteps)

Table 1: Set of action-oriented and goal-oriented callable units that can be generated by our DRAGGN models in the Cleanup World domain.
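To make the representation concrete, the following is a minimal Python sketch (our illustration, not the authors' released code) of the callable-unit inventory from Table 1; the `CallableUnit` class, the primitive-action names, and the expansion helper are assumptions for exposition.

```python
from dataclasses import dataclass

# Callable-unit inventory for Cleanup World (Table 1).
ACTION_UNITS = {"goUp", "goDown", "goLeft", "goRight"}  # argument: numSteps
GOAL_UNITS = {"agentInRoom", "blockInRoom"}             # argument: room

@dataclass(frozen=True)
class CallableUnit:
    name: str      # e.g., "goUp" or "blockInRoom"
    argument: str  # e.g., "3" or "roomIsGreen"

def to_action_sequence(unit: CallableUnit) -> list:
    """Expand an action-oriented unit into primitive robot actions.
    The primitive-action names below are hypothetical."""
    assert unit.name in ACTION_UNITS
    primitive = {"goUp": "north", "goDown": "south",
                 "goLeft": "west", "goRight": "east"}[unit.name]
    return [primitive] * int(unit.argument)

print(to_action_sequence(CallableUnit("goUp", "3")))  # ['north', 'north', 'north']
```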
4.2 DRAGGN

While the Single-RNN model of Arumugam et al. (2017) is effective, it cannot model the compositional argument structure of language. A unit-argument pair not observed at training time will not be predicted from input data, even if the constituent pieces were observed separately. Additionally, the Single-RNN model requires every possible unit-argument pair to be enumerated to form the output space. As the environment grows to include more objects with richer attributes, this output space becomes intractable.

To resolve this, we introduce the Deep Recurrent Action/Goal Grounding Network (DRAGGN) framework. Unlike previous approaches, the DRAGGN framework maps natural language instructions to separate distributions over callable units and (possibly multiple) binding constraints, generating either action sequences or goal conditions. By treating callable units and binding arguments as separate entities, we circumvent the combinatorial dependence on the size of the domain. This unit-argument separation is inspired by the Neural Programmer-Interpreter (NPI) of Reed and de Freitas (2016). The callable units output by DRAGGN are analogous to the subprograms output by NPI. Additionally, both NPI and DRAGGN allow for subprograms/callable units with an arbitrary number of arguments (by adding a corresponding number of Binding Argument Networks, as shown at the top right of Figure 3a, each with its own output space).

We assume that each natural language instruction can be represented by a single unit-argument pair with only one argument. Consequently, in our experiments, we assume that sentences specifying sequences of commands have been segmented, and each segment is given to the model one at a time. The limitation to a single argument only arises because of the domain's simplicity; as mentioned above, it is straightforward to extend our models to handle extra arguments by adding extra Binding Argument Networks.

To formalize the DRAGGN objective, consider a natural language instruction l. Our goal is to find the callable unit ĉ and binding arguments â that maximize the following joint probability:

    ĉ, â = argmax_{c,a} Pr(c, a | l)    (1)

Figure 3: Architecture diagrams for the two Deep Recurrent Action/Goal Grounding Network (DRAGGN) models, (a) the Joint DRAGGN and (b) the Independent DRAGGN, introduced in Sections 4.3 and 4.4. Both architectures ground arbitrary natural language instructions to callable units (either actions or lifted reward functions) and binding arguments.

Depending on the assumptions made about the relationship between callable units c and binding arguments a, we can decompose the above objective in two ways: preserving the dependence between the two and learning the relationship between units and arguments jointly, or treating the two as independent. These two decompositions result in the Joint-DRAGGN and Independent-DRAGGN models, respectively.

Given the training dataset of natural language and the space of unit-argument pairs, we train our DRAGGN models end-to-end by minimizing the sum of the cross-entropy losses between the predicted distributions and true labels for each separate distribution (i.e., over callable units and binding arguments). At inference time, we first choose the callable unit with the highest probability given the natural language instruction. We then choose the binding argument(s) with highest probability from the set of valid arguments. The validity of a binding argument given a callable unit is given a priori, by the specific environment, rather than being learned at training time.
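Written out, the training loss just described is, under our reading, a sum of per-head cross-entropy terms for a labeled example (l, c*, a*):

```latex
\mathcal{L}(l, c^{*}, a^{*}) \;=\; -\log \Pr(c^{*} \mid l) \;-\; \sum_{k} \log \Pr(a^{*}_{k} \mid l)
```

where k ranges over the binding-argument slots (a single slot in our domain).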
Our models were trained using Adam (Kingma and Ba, 2014) for 125 epochs, with a batch size of 16 and a learning rate of 0.0001.

4.3 Joint DRAGGN (J-DRAGGN)

The Joint DRAGGN (J-DRAGGN) models the joint probability in Equation 1, coupled via the shared RNN state in the DRAGGN Core (as depicted in Figure 3a), but performs the maximization sequentially, as follows:

    ĉ, â = argmax_{c,a} Pr(c, a | l)
         ≈ argmax_a [ argmax_c Pr(c, a | l) ]    (2)

We first encode the constituent words of our natural language segment into fixed-size embedding vectors. From there, the sequence of word embeddings is fed through an RNN denoted the DRAGGN Core. We use the gated recurrent unit (GRU) as our RNN cell, because of its effectiveness in natural language processing tasks such as machine translation (Cho et al., 2014), while requiring fewer parameters than the LSTM cell (Hochreiter and Schmidhuber, 1997). After processing the entire segment, the current GRU hidden state is treated as a representative vector for the entire natural language segment. This single hidden core vector is then passed to both the Callable Unit Network and the Binding Argument Network, allowing both networks to be trained jointly and enforcing a dependence between the two.

The Callable Unit Network is a two-layer feed-forward network using rectified linear unit (ReLU) activations. It takes the DRAGGN Core output vector as input to produce a softmax probability distribution over all possible callable units. The Binding Argument Network is a separate network with an identical architecture that takes the same input, but instead produces a probability distribution over all possible binding arguments. The two networks need not share the same architecture; for example, callable units with multiple arguments require multiple different argument networks, one for each possible binding constraint.
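A minimal PyTorch sketch of the J-DRAGGN as described above: a shared embedding and GRU core feeding two two-layer ReLU heads, trained with the summed cross-entropy loss and the optimizer settings from Section 4.2. The vocabulary, embedding, and hidden sizes are placeholders, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JDRAGGN(nn.Module):
    """Joint DRAGGN: one shared GRU core feeds both output heads."""
    def __init__(self, vocab_size, n_units, n_args, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.core = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # DRAGGN Core
        # Callable Unit Network: two-layer feed-forward with ReLU.
        self.unit_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_units))
        # Binding Argument Network: identical architecture, separate weights.
        self.arg_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_args))

    def forward(self, tokens):                # tokens: (batch, seq_len) word IDs
        _, h = self.core(self.embed(tokens))  # h: (1, batch, hidden_dim)
        core_vec = h.squeeze(0)               # final GRU state = segment vector
        return self.unit_head(core_vec), self.arg_head(core_vec)

# One training step: sum of cross-entropies over the two heads.
model = JDRAGGN(vocab_size=1000, n_units=6, n_args=12)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (16, 8))  # a dummy batch of 16 segments
unit_gold = torch.randint(0, 6, (16,))
arg_gold = torch.randint(0, 12, (16,))
unit_logits, arg_logits = model(tokens)
loss = F.cross_entropy(unit_logits, unit_gold) + F.cross_entropy(arg_logits, arg_gold)
loss.backward()
optimizer.step()
```

At inference time, invalid binding arguments for the predicted callable unit can be masked out of `arg_logits` before taking the argmax, matching the validity constraint described above.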
4.4 Independent DRAGGN (I-DRAGGN)

The Independent DRAGGN (I-DRAGGN), contrary to the Joint DRAGGN, decomposes the objective from Equation 1 by treating callable units and binding arguments as independent given the original natural language instruction. More precisely, the I-DRAGGN objective is:

    ĉ, â = argmax_{c,a} Pr(c | l) Pr(a | l)    (3)

The I-DRAGGN network architecture is shown in Figure 3b. Beyond the difference in objective functions, there is another key difference between the I-DRAGGN and J-DRAGGN architectures. Rather than encoding the constituent words of the natural language instruction once and feeding the resulting embeddings through a DRAGGN Core to generate a shared core vector, the I-DRAGGN model embeds and encodes the natural language instruction twice, using two separate embedding matrices and GRUs, one each for the callable unit and the binding argument. In this way, the I-DRAGGN model encapsulates two disjoint neural networks, each with its own individual parameter set, trained independently. The latter half of each individual network (the Callable Unit Network and Binding Argument Network) remains the same as that of the J-DRAGGN.
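For contrast, a sketch of the I-DRAGGN's defining difference: two fully disjoint embedding-plus-GRU encoders, one per output head (sizes again assumed):

```python
import torch
import torch.nn as nn

class IDRAGGN(nn.Module):
    """Independent DRAGGN: a disjoint encoder and head per output."""
    def __init__(self, vocab_size, n_units, n_args, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.unit_branch = self._branch(vocab_size, embed_dim, hidden_dim, n_units)
        self.arg_branch = self._branch(vocab_size, embed_dim, hidden_dim, n_args)

    @staticmethod
    def _branch(vocab_size, embed_dim, hidden_dim, n_out):
        # Separate embedding matrix, GRU, and two-layer ReLU head per output.
        return nn.ModuleDict({
            "embed": nn.Embedding(vocab_size, embed_dim),
            "gru": nn.GRU(embed_dim, hidden_dim, batch_first=True),
            "head": nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, n_out))})

    @staticmethod
    def _run(branch, tokens):
        _, h = branch["gru"](branch["embed"](tokens))
        return branch["head"](h.squeeze(0))

    def forward(self, tokens):
        # The instruction is embedded and encoded twice, fully independently.
        return self._run(self.unit_branch, tokens), self._run(self.arg_branch, tokens)
```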
4.5 Grounding Module

If a goal-oriented callable unit is returned (i.e., a lifted reward function), we require an additional step of completing the reward function with environment-specific variables. As described in Arumugam et al. (2017), we use a Grounding Module to perform this step. The Grounding Module maps the inferred callable unit and binding argument(s) to a final grounded reward function that can be passed to an MDP planner. In our implementation, the Grounding Module is a lookup table mapping specific binding arguments to room ID tokens. A more advanced implementation of the Grounding Module would be required in order to handle domains with non-unique binding arguments (e.g., resolving between multiple objects with overlapping attributes).

Natural Language                    Callable Unit    Argument
Go to the red room.                 agentInRoom      roomIsRed
Put the block in the green room.    blockInRoom      roomIsGreen
Go up three spaces.                 goUp             3

Table 2: Examples of natural language phrases and their corresponding callable units and arguments.
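A minimal sketch of the lookup-table Grounding Module described above; the color-to-room-ID assignments are hypothetical, and block0 is the single movable block in this domain instance (as in the blockInRoom block0 room1 example of Section 3):

```python
# Hypothetical lookup table for this single Cleanup World instance.
ROOM_LOOKUP = {"roomIsRed": "room0", "roomIsGreen": "room1", "roomIsBlue": "room2"}

def ground(callable_unit: str, binding_arg: str) -> str:
    """Resolve a lifted reward function to a grounded OO-MDP proposition."""
    if callable_unit == "agentInRoom":
        return f"agentInRoom {ROOM_LOOKUP[binding_arg]}"
    if callable_unit == "blockInRoom":
        return f"blockInRoom block0 {ROOM_LOOKUP[binding_arg]}"
    return f"{callable_unit} {binding_arg}"  # action units need no grounding

print(ground("blockInRoom", "roomIsGreen"))  # -> "blockInRoom block0 room1"
```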
5 Experiments

We assess the effectiveness of both our J-DRAGGN and I-DRAGGN models via instruction grounding accuracy for robot navigation and mobile-manipulation tasks. As a baseline, we compare against the state-of-the-art Single-RNN model introduced by Arumugam et al. (2017).
5.1 Dataset

To conduct our evaluation, we use the dataset of natural language commands from Arumugam et al. (2017), collected for the single instance of the Cleanup World domain seen in Figure 1. In the underlying user study, Amazon Mechanical Turk users were presented with trajectory demonstrations of a robot completing various navigation and object manipulation tasks. Users were prompted to provide natural language commands that they believed would have generated the observed behavior. Since the original dataset was compiled for analyzing the hierarchical nature of language, we were easily able to filter the commands down to only those using high-level goal specifications and low-level trajectory specifications, yielding our final dataset of natural language commands.

To produce a dataset of action-specifying callable units, experts annotated low-level trajectory specifications from the Arumugam et al. (2017) dataset. For example, the command "Down three paces, then up two paces, finally left four paces" was segmented into "down three paces," "then up two paces," "finally left four paces," and was given a corresponding execution trace of goDown 3, goUp 2, goLeft 4. The existing set of grounded reward functions in the dataset were converted to callable units and binding arguments. Examples of both types of language are presented in Table 2 with their corresponding callable units and binding arguments.
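For illustration, the segmented training pairs just described might be stored as follows (format assumed; drawn from the worked example above and Table 2):

```python
# Each expert-segmented command maps to one (callable unit, binding argument).
SEGMENTED_DATA = [
    ("down three paces", ("goDown", "3")),
    ("then up two paces", ("goUp", "2")),
    ("finally left four paces", ("goLeft", "4")),
    ("take the chair to the green room", ("blockInRoom", "roomIsGreen")),
]
```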
[Table 3 rows: Single-RNN, J-DRAGGN, I-DRAGGN; columns: Action-Oriented, Goal-Oriented, Action-Oriented (Unseen), Overall.]

Table 3: Action-oriented and goal-oriented accuracy results (mean and standard deviation across 3 random initializations) on both the standard and unseen datasets. Bold indicates the singular model that performed the best on the given task, whereas italics denote the best models that were within the margin of error of each other for the given task. The overall column was computed by taking an average of individual task accuracies, weighted by the number of test examples per task.

To fully show the capabilities of our models, we tested on two separate versions of the dataset. The first is the standard dataset, consisting of a 90-10 train-test split of the collected action-oriented and goal-oriented commands. We also evaluated our models on an "unseen" dataset, which consists of a specific train-test split that evaluates how well models can predict previously unseen action sequence combinations. For example, in this dataset the training data might consist only of action sequences of the form goUp 3 and goDown 4, while the test data would consist only of the "unseen" action sequence goUp 4. Note that in both datasets, we assume that the test environment is configured the same as the train environment.
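A sketch of how such an "unseen" split could be constructed from pairs in the `SEGMENTED_DATA` format above (the held-out combination here is illustrative; the paper does not specify the exact procedure):

```python
def unseen_split(examples, held_out=("goUp", "4")):
    """Hold out one unit-argument combination entirely for testing, while its
    unit and argument each still appear (in other combinations) in training."""
    train = [ex for ex in examples if ex[1] != held_out]
    test = [ex for ex in examples if ex[1] == held_out]
    return train, test
```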
6 Results

Language grounding accuracies for our two DRAGGN models, as well as the baseline Single-RNN, are presented in Table 3. All three models received the same set of training data, consisting of low-level action-oriented segments and high-level goal-based sentences, covering the unique combinations of action-oriented callable units and respective binding arguments and the unique combinations of goal-oriented callable units and binding arguments present in the data. We then evaluated all three models on the same set of held-out data, consisting of low-level segments and high-level sentences.

In aggregate, the models that use callable units for both action-based and goal-based language grounding demonstrate superior performance to the Single-RNN baseline, largely due to their ability to generalize and output combinations unseen at train time. We break down the performance on each task in the following three sections.

6.1 Action-Oriented Results

We evaluate the performance of our models on low-level language that directly specifies an action trajectory. An instruction is correctly grounded if the output trajectory specification corresponds to the ground-truth action sequence. To ensure fairness, we augment the output space of Single-RNN to include all distinct action trajectories found in the training data (an additional 17 classes, as mentioned previously). All models perform generally well on this task: Single-RNN identifies the correct action callable unit on a large majority of test samples, while both DRAGGN models slightly outperform it (see Table 3).

6.2 Goal-Oriented Results

In addition to the action-oriented results, we evaluate the ability of each model to ground goal-based commands. An instruction is correctly grounded if the output of the grounding module corresponds to the ground-truth (grounded) reward function. In our domain, all models predict the correct grounded reward function with high accuracy, with the Single-RNN and J-DRAGGN models being too close to call.

6.3 Action-Oriented (Unseen) Results

The Single-RNN baseline model is completely unable to produce unit-argument pairs that were never seen during training, whereas both DRAGGN models demonstrate some capacity for generalization. The I-DRAGGN model in particular demonstrates a strong understanding of each token within the original natural language utterances, which, in large part, comes from the separate embedding spaces maintained for callable units and binding constraints respectively.
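The accuracy criterion used throughout this section reduces to exact match on the grounded output; a minimal sketch:

```python
def grounding_accuracy(predicted, gold):
    """Fraction of test items whose predicted grounding (action trajectory or
    grounded reward function) exactly matches the ground truth."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```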
7 Discussion
Our experiments show that the DRAGGN models have a clear advantage over the existing state of the art in grounding action-oriented language. Furthermore, due to the factored nature of its output, I-DRAGGN generalizes well to unseen combinations of callable units and binding arguments.

Nevertheless, I-DRAGGN did not perform as well as Single-RNN and J-DRAGGN on goal-oriented language. This is possibly due to the small number of goal types in the dataset and the strong overlap in goal-oriented language. Whereas the Single-RNN and J-DRAGGN architectures may experience some positive transfer of information (due to the shared parameters in each of the two models), the I-DRAGGN model does not, because of its assumed independence between callable units and binding arguments. This ability to allow for positive information transfer suggests that J-DRAGGN would perform best in environments where there is a strong overlap in the instructional language, with a relatively smaller but complex set of possible action sequences and goal conditions.

On action-oriented language, J-DRAGGN achieves high grounding accuracy, while I-DRAGGN achieves near-perfect accuracy (Table 3). Since J-DRAGGN only encodes the input language instruction once, the resulting vector representation is forced to characterize both callable unit and binding argument features. While this can result in positive information transfer and improve grounding accuracy in some cases (e.g., goal-based language), this enforced correlation heavily biases the model towards predicting combinations it has seen before. By learning separate representations for callable units and binding arguments, I-DRAGGN is able to generalize significantly better. This suggests that I-DRAGGN would perform best in situations where the instructional language consists of many disjoint words and phrases.

While our results demonstrate that the DRAGGN framework is effective, more experimentation is needed to fully explore the possibilities and weaknesses of such models. One of the shortcomings of the DRAGGN models is the need for segmented data. We found that all evaluated models were unable to handle long, compositional instructions, such as "Go up three steps, then down two steps, then left five steps." Handling conjunctions of low-level commands requires extending our model to learn how to perform segmentation, or to produce sequences of callable units and arguments.

8 Conclusion

In this paper, we presented the Deep Recurrent Action/Goal Grounding Network (DRAGGN), a hybrid approach that grounds natural language commands to either action sequences or goal conditions, depending on the language. We presented two separate neural network architectures that accomplish this task, both of which factor the output space according to the compositional structure of our semantic representation.

We showed that, overall, the DRAGGN models significantly outperform the existing state of the art. Most notably, we showed that the DRAGGN models are capable of generalizing to action sequences unseen during training.

Despite these successes, there are still open challenges in grounding language to novel, unseen environment configurations. Furthermore, we hope to extend our models to handle instructions that mix goal-oriented and action-oriented language, as well as long, sequential commands. An instruction such as "go to the blue room, but avoid going through the red hallway" does not map to either an action sequence or a traditional, Markovian reward function.
We believe new tools and approaches will need to be developed to handle such instructions and, more broadly, the diversity and complexity of human natural language.
Acknowledgments

This material is based upon work supported by the National Science Foundation under grant number IIS-1637614 and the National Aeronautics and Space Administration under grant number NNX16AR61G. Lawson L.S. Wong was supported by a Croucher Foundation Fellowship.
References
Jacob Andreas and Dan Klein. 2015. Alignment-based compositional semantics for instruction following. In Conference on Empirical Methods in Natural Language Processing.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. In Annual Meeting of the Association for Computational Linguistics.

Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Lawson L.S. Wong, and Stefanie Tellex. 2017. Accurately and efficiently interpreting human-robot instructions of varying granularities. CoRR abs/1704.06616.

R. Bellman. 1957. A Markovian decision process. Indiana University Mathematics Journal.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Journal of Machine Learning Research.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.

Junyoung Chung, Çağlar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.

Carlos Diuk, Andre Cohen, and Michael L. Littman. 2008. An object-oriented representation for efficient reinforcement learning. In International Conference on Machine Learning.

Juraj Dzifcak, Matthias Scheutz, Chitta Baral, and Paul Schermerhorn. 2009. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In IEEE International Conference on Robotics and Automation.

Nakul Gopalan, Marie desJardins, Michael L. Littman, James MacGlashan, Shawn Squire, Stefanie Tellex, John Winder, and Lawson L.S. Wong. 2017. Planning with abstract Markov decision processes. In International Conference on Automated Planning and Scheduling.

Sachithra Hemachandra, Felix Duvallet, Thomas M. Howard, Nicholas Roy, Anthony Stentz, and Matthew R. Walter. 2015. Learning models for following natural language directions in unknown environments. In IEEE International Conference on Robotics and Automation.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Thomas M. Howard, Stefanie Tellex, and Nicholas Roy. 2014. A natural language planner interface for mobile manipulators. In IEEE International Conference on Robotics and Automation.

Mohit Iyyer, Varun Manjunatha, Jordan L. Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Conference of the Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

James MacGlashan, Monica Babeş-Vroman, Marie desJardins, Michael L. Littman, Smaranda Muresan, Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. 2015. Grounding English commands to reward functions. In Robotics: Science and Systems.

Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. 2012. Learning to parse natural language commands to a robot control system. In International Symposium on Experimental Robotics.

Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In AAAI Conference on Artificial Intelligence.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.

Tomas Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In IEEE International Conference on Acoustics, Speech, and Signal Processing.

Rohan Paul, Jacob Arkin, Nicholas Roy, and Thomas M. Howard. 2016. Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In Robotics: Science and Systems.

Martin L. Puterman. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming.

Scott E. Reed and Nando de Freitas. 2016. Neural programmer-interpreters. In International Conference on Learning Representations.

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI Conference on Artificial Intelligence.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Annual Meeting of the Association for Computational Linguistics.