Guiding Reinforcement Learning Exploration Using Natural Language
Brent Harrison
Department of Computer Science, University of Kentucky, Lexington, Kentucky, [email protected]
Upol Ehsan
School of Interactive Computing, Georgia Institute of Technology, Atlanta, Georgia, [email protected]
Mark O. Riedl
School of Interactive Computing, Georgia Institute of Technology, Atlanta, Georgia, [email protected]
Abstract
In this work we present a technique that uses natural language to help reinforcement learning generalize to unseen environments. This technique uses neural machine translation, specifically encoder-decoder networks, to learn associations between natural language behavior descriptions and state-action information. We then use this learned model to guide agent exploration using a modified version of policy shaping, making it more effective at learning in unseen environments. We evaluate this technique using the popular arcade game, Frogger, under ideal and non-ideal conditions. This evaluation shows that our modified policy shaping algorithm improves over a Q-learning agent as well as a baseline version of policy shaping.
Introduction
Interactive machine learning (IML) algorithms seek to augment machine learning with human knowledge in order to enable intelligent systems to make better decisions in complex environments. These algorithms allow human teachers to directly interact with machine learning algorithms to train them to learn tasks faster than they would be able to on their own. Typically, humans interact with these systems by either providing demonstrations of positive behavior that an intelligent agent can learn from, or by providing online critique of an agent while it explores its environment. While these techniques have proven to be effective, it can sometimes be difficult for trainers to provide the required demonstrations or critique. Demonstrations may require that the trainer possess in-depth prior knowledge about a system or its environment, and trainers may have to provide hundreds of instances of feedback before the agent begins to utilize it. This issue is compounded when one considers that this training must occur for each new environment that the agent finds itself in.

In this work, we seek to reduce the burden on human trainers by using natural language to enable interactive machine learning algorithms to better generalize to unseen environments. Since language is one of the primary ways that humans communicate, using language to train intelligent agents should come more naturally to human teachers than using demonstrations or critique. In our proposed approach, natural language instruction also need not be given online while the agent is learning. Allowing instruction to be given offline greatly reduces the time and effort required on the part of the human teacher to train these intelligent agents.

Humans are also extremely proficient at generalizing over many states, and language often aids in this endeavor. With this work, we aim to use human language to learn these human-like state abstractions and use them to enhance reinforcement learning in unseen environments. To do that, we use neural machine translation techniques, specifically encoder-decoder networks, to learn generalized associations between natural language behavior descriptions and state/action information. We then use this model, which can be thought of as a model of generalized action advice, to augment a state-of-the-art interactive machine learning algorithm to make it more effective in unseen environments. For this work, we choose to modify policy shaping, an interactive machine learning algorithm that learns from human critique (Griffith et al. 2013).

We evaluate this technique using the arcade game, Frogger. Specifically, we evaluate how our technique performs against a base Q-learning algorithm and a version of policy shaping that uses only demonstrations as examples of policy critique on the task of learning on a set of unseen Frogger maps in a variety of situations.

To summarize, the main contributions of this paper are as follows: 1) we show how neural machine translation can be used to create a generalized model of action advice, 2) we show how this model can be used to augment policy shaping to enable reinforcement learning agents to better learn in unseen environments, and 3) we perform an evaluation of our method in the arcade game, Frogger, on several previously unseen maps using unreliable synthetic oracles meant to simulate human trainers.
Related Work
This work is primarily related to two bodies of artificial intelligence research: interactive machine learning and knowledge transfer in reinforcement learning. Interactive machine learning (IML) (Chernova and Thomaz 2014) algorithms use knowledge provided by human teachers to help train machine learning models. This allows human experts to help train intelligent agents, thus enabling these agents to learn faster than they would if left to learn on their own. Typically, human teachers interact with the agent by providing either demonstrations of correct behavior (Argall et al. 2009) or by directly critiquing the agent's behavior (Loftin et al. 2014; Griffith et al. 2013; Cederborg et al. 2015). Our work seeks to improve upon these methods by enabling these algorithms to learn from natural language in addition to demonstrations or critique.

There has been other work on using natural language to augment machine learning algorithms. Much work has been done on using natural language instructions to help reinforcement learning agents complete tasks more efficiently. Early works in this area focused on learning mappings between these instructions and specific control sequences in learning environments (Branavan et al. 2009; Branavan, Zettlemoyer, and Barzilay 2010; Matuszek et al. 2013). In this previous work, language is mainly used to instruct how to complete a specific task in a specific environment. In other words, language and state are tightly coupled. The main way that our work differs is that we are seeking to use language as an abstraction tool. Our work focuses on exploring how language can be used to help reinforcement learning agents transfer knowledge to unseen environments.

More recent work has examined how language can help reinforcement learning agents in a more environment-agnostic way. For example, work has been done on using high-level task specifications to engineer environment-agnostic reward functions to improve learning (MacGlashan et al. 2015). Also, techniques such as sentiment analysis have been used to bias agent exploration to improve learning in unseen environments (Krening et al. 2017). Most of these techniques, however, require additional information about the environment, such as descriptions of object types in the environment, that may not always be readily available. Our technique relaxes this requirement by using neural machine translation to learn relationships between natural language state/action descriptions and parts of the state space.

The work most closely related to our own involves using deep Q-learning to identify language representations that can help reinforcement learning agents learn in unseen environments (Narasimhan, Kulkarni, and Barzilay 2015). This technique, however, also requires some knowledge about the environment to be provided in order to learn these representations. Our technique does not require additional information to be provided by the domain author as all state annotations are generated by human teachers.

Background
In this section, we will discuss three concepts that are critical to our work: reinforcement learning, policy shaping, and encoder-decoder networks.
Reinforcement Learning
Reinforcement learning (Sutton and Barto 1998) is a technique that is used to solve a Markov decision process (MDP). An MDP is a tuple $M = \langle S, A, T, R, \gamma \rangle$ where $S$ is the set of possible world states, $A$ is the set of possible actions, $T$ is a transition function $T : S \times A \rightarrow P(S)$, $R$ is the reward function $R : S \times A \rightarrow \mathbb{R}$, and $\gamma$ is a discount factor $0 \leq \gamma \leq 1$.

Reinforcement learning learns a policy $\pi : S \rightarrow A$, which defines which actions should be taken in each state. In this work, we use Q-learning (Watkins and Dayan 1992), which uses a Q-value $Q(s, a)$ to estimate the expected future discounted reward for taking action $a$ in state $s$. As an agent explores its environment, this Q-value is updated to take into account the reward that the agent receives in each state. In this paper, we use Boltzmann exploration (Watkins 1989) to select the actions that a reinforcement learning agent will take during training. When using Boltzmann exploration, the probability of the agent choosing a particular action during training is calculated as

$$\Pr_q(a) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$

where $\tau$ is a temperature constant that controls whether the agent will prefer more random exploration or exploration based on current Q-values.
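For concreteness, the following is a minimal sketch of tabular Q-learning with Boltzmann action selection as described above. The action set, learning rate, discount factor, temperature, and environment interface shown here are illustrative assumptions rather than the exact settings used in our experiments.

```python
import math
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right", "noop"]  # illustrative Frogger-style action set

def boltzmann_probs(q, state, actions, tau):
    """Pr_q(a) = exp(Q(s,a)/tau) / sum_a' exp(Q(s,a')/tau)."""
    prefs = [math.exp(q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    return [p / total for p in prefs]

def boltzmann_select(q, state, actions, tau=0.5):
    """Sample an action according to the Boltzmann distribution over Q-values."""
    return random.choices(actions, weights=boltzmann_probs(q, state, actions, tau), k=1)[0]

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

# Usage against a hypothetical environment exposing reset()/step():
# q = defaultdict(float)
# s = env.reset()
# a = boltzmann_select(q, s, ACTIONS)
# s2, r, done = env.step(a)
# q_update(q, s, a, r, s2, ACTIONS)
```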
Policy Shaping

In this paper, we build upon the policy shaping framework (Griffith et al. 2013), which is a technique that incorporates human critique into reinforcement learning. Unlike other techniques such as reward shaping, policy shaping considers critique to be a signal that evaluates whether the action taken in a state was desirable rather than whether the resulting state was desirable. Policy shaping utilizes human feedback by maintaining a critique policy to calculate the probability,
$\Pr_c(a)$, that an action $a \in A$ should be taken in a given state according to the human feedback signal. During learning, the probability that an agent takes an action is calculated by combining both $\Pr_c(a)$ and $\Pr_q(a)$:

$$\Pr(a) = \frac{\Pr_q(a) \Pr_c(a)}{\sum_{a' \in A} \Pr_q(a') \Pr_c(a')} \quad (1)$$

Thus, the ultimate decision on which action to explore during learning is a combination of knowledge from the agent's experience as well as knowledge from a human teacher.

The critique policy used in policy shaping is generated by examining how consistent the feedback for certain actions is. If an action receives primarily positive or negative critique, then the critique policy will reflect this with a greater or lower probability, respectively, of exploring that action during learning.
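The combination in Equation 1 can be sketched as follows; the probability values in the example are hypothetical and chosen only to show the renormalization step.

```python
def shaped_action_probs(pr_q, pr_c):
    """Combine the Q-value-derived distribution Pr_q and the critique-derived
    distribution Pr_c per Equation 1: Pr(a) is proportional to Pr_q(a) * Pr_c(a)."""
    combined = {a: pr_q[a] * pr_c[a] for a in pr_q}
    norm = sum(combined.values())
    return {a: p / norm for a, p in combined.items()}

# Hypothetical values for a four-action agent:
pr_q = {"up": 0.4, "down": 0.1, "left": 0.25, "right": 0.25}
pr_c = {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1}
print(shaped_action_probs(pr_q, pr_c))  # "up" dominates once both sources agree
```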
Encoder-Decoder Networks

Encoder-decoder networks have been used frequently in other areas, such as machine translation (Luong, Pham, and Manning 2015), to learn how to convert sets of input sequences into desired output sequences. In this work we use encoder-decoder networks to translate state/action descriptions written in natural language into the machine-understandable state/action information that the natural language describes. For example, the input to this network could be natural language describing the layout of a grid environment and an action taken in that specific state, while the desired output of the network would be the specific state and action representation used by the learning agent.

This translation task, sometimes known as sequence-to-sequence learning, involves training two recurrent neural networks (RNNs): an encoder network and a decoder network. In this generative architecture, these component neural networks work in conjunction to learn how to translate an input sequence $X = (x_1, \ldots, x_T)$ into an output sequence $Y = (y_1, \ldots, y_{T'})$. To do this, the encoder network first learns to encode the input sequence $X$ into a fixed-length context vector $v$. This context vector is meant to encode important aspects of the input sequence to aid the decoder in producing the desired output sequence. This vector is then used as input to the second component network, the decoder, which is an RNN that learns how to iteratively decode this vector into the target output $Y$. By setting up the learning problem in this particular way, this thought vector encodes high-level concept information that can help the decoder construct general state representations for each input sequence.
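A minimal sequence-to-sequence sketch of this architecture is shown below in PyTorch (an assumed library choice for illustration). The network described in the Training section additionally uses attention, which this sketch omits for brevity; layer sizes here simply mirror the hyperparameters reported later.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=layers, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, src_len) word indices of the natural language description
        _, (h, c) = self.rnn(self.embed(tokens))
        return h, c  # fixed-size context handed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=300, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, context):
        # tokens: (batch, tgt_len) tokens of the state/action target sequence
        output, _ = self.rnn(self.embed(tokens), context)
        return self.out(output)  # per-step logits over the output vocabulary
```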
Using Language to Generalize Human Critique

Figure 1: High-level flowchart of our technique.

As mentioned previously, one of the primary disadvantages of interactive machine learning is that humans must retrain the agent whenever it encounters a new environment. To address this issue, we show how an encoder-decoder network (Luong, Pham, and Manning 2015) can be used to learn a language-based critique policy. A high-level overview of our technique can be seen in Figure 1. Our technique works by first having humans generate a set of annotated states and actions by interacting with a single learning environment offline and thinking aloud about the actions they are performing. These annotations are then used to train an encoder-decoder network to create the language-based critique model. This model can then be queried while the agent is exploring new environments to receive action advice that guides it towards states with potentially high rewards. This can be done even if the learning agent encounters states that have not been explicitly seen before and have not been used to train the language-based critique model. Each of these steps will be discussed in greater detail below.
Acquiring Human Feedback
Typically, training an agent using critique requires a large amount of consistent online feedback in order to build up a critique policy, a model of human feedback for a specific problem. In other words, a human trainer would normally be required to watch an agent as it is learning and provide feedback that is used in real time to improve the agent's performance. This is because critique normally comes in the form of a discrete positive or negative feedback signal that is then associated with a given state or action. This provides the agent with little opportunity to generalize to unseen environments since this feedback is tightly coupled with state information. To address this, our technique uses natural language as a means to generalize feedback across many, possibly unknown, states. We do this by training an encoder-decoder model to act as a more general critique policy that we refer to as a language-based critique policy, which enables an agent to receive action advice for any potential state it finds itself in.

We use two types of data in order to create this policy: examples of actions taken in the environment and natural language describing the action taken. This information can be gathered by having humans interact with the agent's learning environment while providing natural language descriptions of their behavior. There are many ways that humans can potentially provide these state and action annotations. For instance, humans could provide a full episode of behavior along with behavior annotations. It is also possible for humans to provide incomplete trajectories, or even simply examples of single actions, along with natural language annotations. Regardless of how they were collected, the state/action demonstrations provided by the human can be stored and later used as a positive feedback signal while the language can be used to help generalize this feedback signal over many states.

For an example of this process, consider an environment where an agent is tasked with dodging obstacles. Assume in this environment that the learning agent can only move in the four cardinal directions: up, down, left, and right. In order to gather the necessary data to learn the feedback policy, a human trainer is presented with different obstacle initializations and tasked with providing short behavior examples of navigating them. The trainer could then provide feedback on each action that they took after the fact. An example of this feedback might be the description,
"I am dodging the obstacle that is coming up beside me," if describing the action to move up when an obstacle is approaching from the side of the agent.
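A single annotated example might be stored as a paired record like the following sketch; the field names and grid encoding are illustrative rather than the exact format used to collect the data.

```python
# Hypothetical record for one annotated step; "." is empty, "C" is a car, "A" is the agent.
annotation = {
    "state": [
        [".", "C", "."],
        [".", "A", "."],
        [".", ".", "."],
    ],
    "action": "up",
    "utterance": "I am dodging the obstacle that is coming up beside me.",
}
```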
Training the Encoder-Decoder Network
Due to how these annotations are generated, it is possible to directly associate language information with state and action information. This paired data is used to train an encoder-decoder network. Specifically, the natural language descriptions are used as inputs to the network, and the network is then tasked with reconstructing the state and action that are associated with that description. By asking the network to reconstruct the state and action, the network learns to identify common elements shared between similar inputs and, importantly, how these elements in the input natural language sequences relate to certain regions of the output sequence of state and action information. This enables the network to learn high-level concept information that allows it to generalize natural language advice to unseen states.
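Concretely, each paired example can be flattened into an input token sequence (the description) and an output token sequence (the state and action). The serialization below is a hedged sketch, assuming the 3x3 local grid used later in our experiments; the exact tokenization is an implementation choice.

```python
def to_seq2seq_pair(utterance, grid, action):
    """Flatten one annotated example into (source tokens, target tokens):
    the natural language goes in, the state/action encoding comes out."""
    src = utterance.lower().replace(".", "").replace(",", "").split()
    # Serialize the grid row by row, then append the action as the final token.
    tgt = [cell for row in grid for cell in row] + [action]
    return src, tgt

grid = [[".", "C", "."],
        [".", "A", "."],
        [".", ".", "."]]
src, tgt = to_seq2seq_pair("I am dodging the obstacle that is coming up beside me.", grid, "up")
# src: ['i', 'am', 'dodging', 'the', 'obstacle', 'that', 'is', 'coming', 'up', 'beside', 'me']
# tgt: ['.', 'C', '.', '.', 'A', '.', '.', '.', '.', 'up']
```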
Utilizing the Language-Based Critique Policy
The ultimate goal of this work is to use the language-based critique policy to speed up learning in unseen environments. To do that, we use this language-based critique policy in conjunction with policy shaping. Recall that a reinforcement learning agent using policy shaping makes decisions using two distinct pieces of information:
$\Pr_q(a)$, the probability of performing an action based on that action's current Q-value, and $\Pr_c(a)$, the probability of performing an action based on the human critique policy. Here, we will use the language-based critique policy learned earlier to take the place of the standard critique policy normally used by policy shaping.

For the language-based critique policy to be used in this framework, we must be able to calculate the probability of performing an action in each state, even if it has never been seen before. Whenever the agent encounters a state, we can query the language-based critique policy to get the probability of performing each action in that state for a given natural language input; however, for this to be of any use we must first determine which piece of feedback out of all of the feedback used in the training set is most applicable to the current state. For each natural language utterance in our training set and for each action the agent can perform in its current state, we calculate the log probability of the model reconstructing the agent's current state and performing said action. One can think of this log probability as how well an utterance describes performing that specific action in that specific state. Whichever utterance is associated with the action that results in the overall largest log probability is then used as the network input to create the action distribution. To create the action distribution, we calculate the following:

$$\Pr_{lc}(a) = \frac{e^{\Pr_l(s,a,i)/\tau}}{\sum_{a'} e^{\Pr_l(s,a',i)/\tau}} \quad (2)$$

where $\Pr_l(s, a, i)$ is the log probability of performing action $a$ in state $s$ according to the language-based critique policy using sequence $i$ as input.

In the original policy shaping algorithm, the critique policy is constantly updated while the agent is learning. Since the language-based critique policy is trained offline, it does not have an opportunity to update itself, which can lead to an agent blindly following poor feedback. To address this, we make use of the $\tau$ parameter in Equation 2 to control how much weight we place on the knowledge extracted from the language-based critique policy. In practice, we have found that the algorithm performs well when $\tau$ is initialized to a small value that increases over the course of learning. This will cause the RL agent to begin learning by trusting the language-based critique policy and then shift towards relying on its own experience as time goes on. This allows the agent to disregard feedback that results in poor Q-values over time.

Having done this, the RL agent now explores its environment as it normally would using policy shaping; however, the probability of the agent performing an action in a given state is defined as:

$$\Pr(a) = \frac{\Pr_q(a) \Pr_{lc}(a)}{\sum_{a' \in A} \Pr_q(a') \Pr_{lc}(a')} \quad (3)$$

where the original probability obtained from human feedback, $\Pr_c(a)$, is replaced with the probability obtained from the language-based critique policy.
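The sketch below shows how Equations 2 and 3 might be computed at decision time. The call `model.log_prob(utterance, state, action)` is a hypothetical interface to the trained encoder-decoder, not an actual API, and the overall loop is an illustration of the procedure described above rather than the exact implementation.

```python
import math

def language_critique_probs(model, utterances, state, actions, tau):
    """Pick the training utterance that best explains some action in this state,
    then turn its per-action log probabilities into a distribution (Equation 2)."""
    best_i, best_score = None, float("-inf")
    for i in utterances:
        score = max(model.log_prob(i, state, a) for a in actions)
        if score > best_score:
            best_i, best_score = i, score
    prefs = {a: math.exp(model.log_prob(best_i, state, a) / tau) for a in actions}
    norm = sum(prefs.values())
    return {a: p / norm for a, p in prefs.items()}

def combined_probs(pr_q, pr_lc):
    """Equation 3: combine the Q-value and language-critique distributions."""
    joint = {a: pr_q[a] * pr_lc[a] for a in pr_q}
    norm = sum(joint.values())
    return {a: p / norm for a, p in joint.items()}
```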
Evaluation

To evaluate this technique, we examine how it performs in training virtual agents to play the arcade game, Frogger. Specifically, we seek to show that using natural language to augment policy shaping enables reinforcement learning agents to speed up learning in unknown environments. In this section we discuss our evaluation, in which we compare agents trained with our technique against agents trained using a Q-learning algorithm as well as an agent trained using a baseline policy shaping algorithm with access only to behavior observations.
Frogger
In these experiments we use the arcade game, Frogger (see Figure 2), as a test domain. We chose Frogger because it is a discrete environment that can still be quite complex due to the specific mechanics of the environment. The learning agent's goal in this environment is to move from the bottom of the environment to the top while navigating the obstacles in the world. In this world, the obstacles move following a set pattern. Obstacles on alternating rows will move one space to either the left or the right every time step. Moving outside of the bounds of the map, getting hit by a car, or falling into the water will result in the agent's death, which imposes a reward penalty of -10; reaching the goal earns the agent a reward of +100; and any other move taken in this environment results in a small reward penalty of -1. In this environment, the agent can take actions to move up, down, left, or right, or choose to do nothing.

We test our technique on three different Frogger environments. These environments, shown in Figures 2 (b), (c), and (d), differ based on obstacle density. Specifically, we evaluate performance in maps in which spaces have a 25% chance of containing an obstacle, a 50% chance of containing an obstacle, and a 75% chance of containing an obstacle.

In addition, we evaluate agent performance in these environments under two different conditions: 1) a deterministic condition in which all actions execute normally, and 2) a stochastic condition in which actions have an 80% chance of executing normally and a 20% chance that the agent's action fails and it executes a different action instead.
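For reference, the environment parameters described above can be collected in one place as follows; the constant names are illustrative, not drawn from our implementation.

```python
# Frogger environment summary (illustrative constant names).
ACTIONS = ["up", "down", "left", "right", "noop"]
REWARD_GOAL = 100       # reaching the top row
REWARD_DEATH = -10      # hit by a car, falling into water, or leaving the map
REWARD_STEP = -1        # any other move
OBSTACLE_DENSITIES = [0.25, 0.50, 0.75]                       # test map variants
ACTION_SUCCESS_PROB = {"deterministic": 1.0, "stochastic": 0.8}
```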
Methodology
In this section, we will discuss the experimental methodology used to both create and evaluate the language-based critique policy in Frogger.

Figure 2: (a) The Frogger map used for training. (b) The 25% map used for testing. (c) The 50% map used for testing. (d) The 75% map used for testing.
Data Collection
Since human teachers generate the natural language that is used for training, it is possible that mistakes will be made. Therefore, it is important to examine the effect that imperfect teachers will have on our technique. It is difficult to control for this type of error using actual humans, so for these experiments we use simulated human oracles to generate the required training observations and natural language. To create the behavior traces required for training, we trained reinforcement learning agents with random starting positions to move one row forward while dodging obstacles on the training map in Figure 2 (a). We chose this specific task to help eliminate any map-specific strategies that may be learned by using an agent trained to navigate the complete environment. To further help eliminate map-specific strategies, the states recorded for these training examples and then used in the remainder of these experiments encompass only a 3x3 grid surrounding the agent. This was done to help prevent the encoder-decoder network from making spurious associations between the natural language annotations and potentially unrelated regions of the state space.

Since we are using simulated humans to generate state and action traces, we use a grammar to create the natural language annotations that our system requires. This grammar was constructed following the technique used in (Harrison, Ehsan, and Riedl 2017). This technique uses natural language utterances generated by humans to create a grammar in such a way that variances in human language are preserved and codified. In order to produce a sentence, the grammar must first be provided with state and action information; the grammar then identifies the most appropriate grammar rule to construct a natural language sentence. The grammar is constructed such that each grammar rule can produce a large number of unique sentences. Using this information, the grammar can then produce a natural language sentence describing the states and actions.

Since human trainers are likely to make mistakes, we test our technique's ability to deal with imperfect human trainers by introducing uncertainty into the natural language annotations generated by this grammar, as sketched below. Specifically, we create two different language-based critique models using the following simulated teachers: 1) teachers that use the correct grammar rule to provide feedback 80% of the time and a random rule to provide feedback 20% of the time (referred to from now on as the 80% training set), and 2) teachers that use the correct grammar rule to provide feedback 60% of the time and a random rule to provide feedback 40% of the time (referred to from now on as the 60% training set). This will allow us to evaluate how robust our technique is to potential errors contained in the natural language feedback provided by human trainers. We feel that this also helps mitigate the regularity that is often present when using synthetic grammars. It is important to note that each of these training sets was generated using the same set of behavior traces. This makes the performance of the resulting critique policies directly comparable. After error was injected into each training set, duplicate training examples were removed.

To provide further evidence on the variability of the grammar with respect to these training sets, we also examined how often sentences repeated themselves in each training set. In the 80% training set, the most common sentence comprised 2.2% of the total training set, which contained a total of 1433 training examples. On average, a sentence was repeated 0.19% of the time. For the 60% training set, the most common sentence comprised 1.7% of the training set of 1497 examples. On average, a sentence was repeated 0.17% of the time. The disparity in training set size is due to duplicate training examples being removed after errors were introduced.
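The noisy simulated teacher can be sketched as follows. The `grammar` object and its methods are a hypothetical interface, not the actual implementation from (Harrison, Ehsan, and Riedl 2017).

```python
import random

def simulated_annotation(grammar, state, action, accuracy=0.8):
    """Simulated teacher: with probability `accuracy`, describe the state/action
    with the matching grammar rule; otherwise apply a randomly chosen rule.
    `grammar.matching_rule`, `grammar.rules`, and `rule.generate` are assumed."""
    if random.random() < accuracy:
        rule = grammar.matching_rule(state, action)
    else:
        rule = random.choice(grammar.rules)
    return rule.generate(state, action)
```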
Training
Using this dataset, we train an encoder-decoder network for 100 epochs. Specifically, we use an embedding encoder-decoder network with attention that is comprised of two-layer recurrent neural networks composed of long short-term memory cells containing 300 hidden units each and an embedding size of 300. As mentioned previously, the network learns to translate between natural language descriptions and state and action information. For these experiments, the network learns to translate natural language generated by the grammar into the provided state and action information, which in this case is the 3x3 grid surrounding the agent as well as the action performed in that state.

Figure 3: Learning rates for agents on deterministic versions of (a) the 25% map, (b) the 50% map, and (c) the 75% map.

Figure 4: Learning rates for agents on stochastic versions of (a) the 25% map, (b) the 50% map, and (c) the 75% map.
Evaluation
We ran experiments using four intelligent agents. The first is a baseline Q-learning agent with no access to human feedback that we will refer to as the Q-learning agent. The second is an agent trained using policy shaping that we will refer to as the observation-based critique agent. This agent has access to the state and action information that was used to train the language-based critique policy, which it uses as a positive action feedback signal. To help guard against poor examples in the training set, this agent also uses the parameter $\tau$ to control how much weight is given to the action examples and to the agent's own experience. The final two agents are trained using our technique, and we refer to them as the language-based critique 80% accuracy agent and the language-based critique 60% accuracy agent depending on which training set was used to generate the feedback oracle that the agent used during learning. Each of these agents was evaluated on each of the three unseen Frogger maps in the deterministic and stochastic conditions described previously. These test cases are meant to simulate how each agent performs in a simple (deterministic) environment as well as a more complex (stochastic) environment.

For the deterministic test case, each agent was trained for 5000 episodes. For the stochastic test case, each agent was trained for 25,000 episodes. In all test cases, the learned policy was evaluated every 100 episodes and the total cumulative reward earned during each episode was averaged over 100 total runs.

Results
The results for each agent on the deterministic Frogger maps can be seen in Figure 3 and the results for each agent on the stochastic Frogger maps can be seen in Figure 4. For both the language-based critique agents and the observation-based critique agent we tested several initial values and schedules for increasing the $\tau$ parameter. The graphs show the best results achieved in these experiments.

As can be seen from Figure 3, both language-based critique agents converge much faster than the Q-learning agent and the observation-based critique agent. Interestingly, the 60% accurate language-based critique agent outperformed the 80% accurate language-based critique agent on the 25% map and the 75% map. It is important to note that the observation-based critique agent also consistently outperforms the Q-learning agent on each map, meaning that using observations still provides some benefit during training in unseen environments.

Figure 4 shows the results that each agent obtained on the stochastic versions of the test Frogger maps. For this set of experiments, both language-based critique agents outperform the other two agents on each map used for testing. Contrary to the performance in the deterministic environments, the 60% accurate language-based critique agent and the 80% accurate language-based critique agent performed similarly in these environments. It is also interesting to note that on the 50% map, the language-based critique agents are the only ones to converge after the 25,000 training episodes. In the 25% map and the 75% map, the observation-based critique agent outperforms the Q-learning agent. Similar to its performance in the deterministic environments, this shows that access to behavior observations still provides some amount of benefit in generalizing behavior to unseen environments.

Discussion
The first thing to note is that across all test cases the language-based critique agents outperformed both the observation-based critique agent and the baseline Q-learning agent. This shows that natural language provides knowledge useful for generalizing to unseen environments that cannot be obtained by simply looking at past observations. In addition, this shows that our technique is robust to complexity in the learning environment as well as to language error that may be present in human trainers. This is a significant result as it provides evidence that the encoder-decoder network can identify relevant features in the set of natural language annotations used for training even when the dataset contains a large amount of noise. These differences in performance were especially pronounced on the deterministic and stochastic 50% maps, as well as the deterministic and stochastic 75% maps. In these cases, both language-based critique agents drastically outperformed both baselines. The consistently positive results across both deterministic and stochastic environments further show what a powerful tool language can be with respect to generalizing knowledge across many types of environments.

One result that needs to be discussed is the performance of the 60% accurate language-based critique agent with respect to the 80% accurate language-based critique agent on the deterministic 25% and 75% maps. In each of these cases, the 60% accurate agent outperformed the 80% accurate agent, contrary to our intuition that the 80% accurate agent should consistently outperform the 60% accurate agent due to the additional error contained in the latter training set. We hypothesize that this behavior can be explained by model overfitting causing erratic behavior in these two cases. If the encoder-decoder network was overfitting part of the training set, then it is possible that the increased error introduced in the 60% accurate agent's training set had a regularizing effect on the network, which allowed it to better generalize to the states found in these two maps.
Limitations
While the results of our experiments provide strong evidence that our technique is effective at utilizing language to help learn generalizable knowledge, our technique is not without its limitations. First, our experiments made several simplifying assumptions that were necessary in order to control for the variance that accompanies human teachers. By using a grammar, we were able to control the amount of variance present in the annotations used for training. While we attempted to mitigate this by encoding our grammar with a large amount of variance and training the language-based critique model using training sets containing natural language error, naturally occurring language is likely to contain more variation than is present in our grammar.

In addition, the natural language explanations used to train the language-based critique model were used to annotate single actions. Typically humans provide explanations of actions in the context of a larger goal-based behavior trajectory and not on the level of individual actions. One way to improve this system is to enable it to learn from state/action explanations at varying levels of granularity.

We also only explored how this technique can be applied in discrete environments. Using a discrete environment makes it easy to associate natural language annotations with state and action information. If this were done in a continuous environment, then it would be much more difficult to determine what state or action should be associated with certain natural language annotations.

Finally, we have only explored how this technique can be used to generalize to unseen environments within the same domain. It is unclear, however, whether this technique could be used to aid in transferring knowledge to agents learning similar tasks in different domains.
Conclusions
Language is a powerful tool that humans use to generalize knowledge across a large number of states. In this work, we explore how language can be used to augment machine intelligence and give intelligent agents an expanded ability to generalize knowledge to unknown environments. Specifically, we show how neural machine translation techniques can be used to give action advice to reinforcement learning agents that generalizes across many different states, even if they have not been seen before. As our experiments have shown, this generalized model of advice enables reinforcement learning agents to quickly learn in unseen environments.

In addition, this technique gives human teachers another way to train intelligent agents. The ability to augment human demonstration or critique with natural language feedback has the potential to significantly reduce the amount of effort required to train intelligent agents. This makes the task of training intelligent agents more approachable to potential human trainers. It is even possible that this task could be crowdsourced in the future, drastically reducing the effort on the part of an individual trainer and making these types of agent training methods more appealing. Through this work, we hope to help bring down the language barrier that exists between humans and intelligent agents. By removing this barrier, we hope to enable humans to transfer more complex knowledge to intelligent agents, which should allow them to learn even more complex tasks in complex, unknown environments.

References
[Argall et al. 2009] Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems.
[Branavan et al. 2009] Branavan, S. R. K.; Chen, H.; Zettlemoyer, L. S.; and Barzilay, R. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, 82–90. Association for Computational Linguistics.
[Branavan, Zettlemoyer, and Barzilay 2010] Branavan, S.; Zettlemoyer, L. S.; and Barzilay, R. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 1268–1277. Association for Computational Linguistics.
[Cederborg et al. 2015] Cederborg, T.; Grover, I.; Isbell, C. L.; and Thomaz, A. L. 2015. Policy shaping with human teachers. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[Chernova and Thomaz 2014] Chernova, S., and Thomaz, A. L. 2014. Robot learning from human teachers. Synthesis Lectures on Artificial Intelligence and Machine Learning.
[Griffith et al. 2013] Griffith, S.; Subramanian, K.; Scholz, J.; Isbell, C. L.; and Thomaz, A. L. 2013. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, 2625–2633.
[Harrison, Ehsan, and Riedl 2017] Harrison, B.; Ehsan, U.; and Riedl, M. O. 2017. Rationalization: A neural machine translation approach to generating natural language explanations. arXiv preprint arXiv:1702.07826.
[Krening et al. 2017] Krening, S.; Harrison, B.; Feigh, K. M.; Isbell, C. L.; Riedl, M.; and Thomaz, A. 2017. Learning from explanations using sentiment and advice in RL. IEEE Transactions on Cognitive and Developmental Systems.
[Loftin et al. 2014] Loftin, R.; Peng, B.; MacGlashan, J.; Littman, M. L.; Taylor, M. E.; Huang, J.; and Roberts, D. L. 2014. Learning something from nothing: Leveraging implicit human feedback strategies. In Robot and Human Interactive Communication, 2014 RO-MAN: The 23rd IEEE International Symposium on, 607–612. IEEE.
[Luong, Pham, and Manning 2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1412–1421. Lisbon, Portugal: Association for Computational Linguistics.
[MacGlashan et al. 2015] MacGlashan, J.; Babes-Vroman, M.; desJardins, M.; Littman, M. L.; Muresan, S.; Squire, S.; Tellex, S.; Arumugam, D.; and Yang, L. 2015. Grounding English commands to reward functions. In Robotics: Science and Systems.
[Matuszek et al. 2013] Matuszek, C.; Herbst, E.; Zettlemoyer, L.; and Fox, D. 2013. Learning to parse natural language commands to a robot control system. In Experimental Robotics, 403–415. Springer.
[Narasimhan, Kulkarni, and Barzilay 2015] Narasimhan, K.; Kulkarni, T. D.; and Barzilay, R. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
[Sutton and Barto 1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction. MIT Press.
[Watkins and Dayan 1992] Watkins, C., and Dayan, P. 1992. Q-learning. Machine Learning.