Student/Teacher Advising through Reward Augmentation
Cameron Reid
Abstract
Transfer learning is an important new subfield of multiagent reinforcement learning that aims to help an agent learn about a problem by using knowledge that it has gained solving another problem, or by using knowledge that is communicated to it by an agent who already knows the problem. This is useful when one wishes to change the architecture or learning algorithm of an agent (so that the new knowledge need not be built "from scratch"), when new agents are frequently introduced to the environment with no knowledge, or when an agent must adapt to similar but different problems. Great progress has been made in the agent-to-agent case using the Teacher/Student framework proposed by (Torrey and Taylor 2013). However, that approach requires that learning from a teacher be treated differently from learning in every other reinforcement learning context. In this paper, I propose a method which allows the teacher/student framework to be applied in a way that fits directly and naturally into the more general reinforcement learning framework by integrating the teacher feedback into the reward signal received by the learning agent. I show that this approach can significantly improve the rate of learning for an agent playing a one-player stochastic game; I give examples of potential pitfalls of the approach; and I propose further areas of research building on this framework.
Introduction
Reinforcement Learning
Reinforcement learning describes a variety of methods that are used to solve sequential decision problems in which some agent is interacting with and receiving feedback from an environment. Generally, the problem is formulated such that the feedback is in the form of a cost (which the agent should minimize) or a reward (to be maximized). (Sutton and Barto 2018) provide an excellent primer on the discipline of reinforcement learning.
Tabular Q-Learning
The agent discussed in this paper uses the tabular Q-learning approach to reinforcement learning described by (Watkins 1989). This is a simple but effective approach that allows an agent to eventually learn an exact optimal policy for every state. Put simply, we keep a large, relatively high-dimensional table with one entry per state-action pair; at every step, we observe a state S and take an action a based on a policy π(s, a | Q). The agent then receives its new state S′ and a reward R, and makes an update to the Q-table as follows:

Q(S, a) ← Q(S, a) + α [ R + γ max_{a′} Q(S′, a′) − Q(S, a) ]

(Sutton and Barto 2018), where γ is a discount factor (lower values make the resulting policy more "short-sighted") and α is some step size. Under some simple assumptions, this simple algorithm will eventually converge to an optimal policy with probability 1 (Watkins and Dayan 1992).
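For concreteness, the update can be written as a short function. The following is a minimal sketch, assuming a NumPy table indexed by integer states and actions; the table dimensions are illustrative and not taken from the paper's experiments.

```python
import numpy as np

# Illustrative sizes; the hunter/prey experiments below need a larger table.
n_states, n_actions = 100, 4
alpha, gamma = 0.1, 1.0  # step size and discount factor

Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One tabular Q-learning backup for the transition (s, a, r, s_next)."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```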
Transfer Learning

Even the best reinforcement learning methods can be quite slow to converge to an optimal solution.
Transfer learning describes one attempt to solve that problem: given an agent which has been trained on a problem, how can we transfer the knowledge it has gained to another agent, or generalize that knowledge to another problem?

One proposed solution in this paradigm is the Teacher-Student framework proposed by (Torrey and Taylor 2013), in which an agent who is an "expert" in the problem provides advice to a learning agent to help speed its training. This advice is provided by essentially telling the learning agent which action to take at certain states. (Da Silva, Glatt, and Costa 2017) expanded on this framework by proposing a system in which agents learning simultaneously can all be either advisor or advisee (or both) during training.

While those methods show impressive results, they make a few assumptions that I believe are unrealistic and exhibit a departure from the fundamental reinforcement learning problem: for example, the advice is treated as a special case, rather than as an additional environmental signal. Here, I attempt to begin to bring those methods into harmony with the generic reinforcement learning problem formulation.

To that end, I propose an approach for shaping the reward of a learning agent: the reward received from the environment is modified with an additional punishment based upon the knowledge of the teaching agent, which guides the learning agent as it learns.
Methods
Hunter/Prey Game
To examine this problem, I've used a simple gridworld hunter/prey game. The learning agent is a hunter whose job is to catch the prey by moving into the space occupied by the prey. The prey, in turn, moves in a random direction at each step as long as it has not been captured. The agent is given a reward of -1 on each step on which the prey has not been captured, and a reward of 0 once it has.

The environment of the game is fully observable to the agent; at each step, the agent receives a tuple ⟨h_x, h_y, p_x, p_y⟩, where h_{x,y} and p_{x,y} are the hunter's and prey's x- and y-coordinates, respectively. The action space in this game consists of integers from 0 to 3 which encode cardinal directions in the gridworld: 0 moves the agent down, 1 moves the agent to the right, etc.
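A minimal sketch of such an environment is below. The grid size, the random initial positions, and the move encoding for actions 2 and 3 are assumptions for illustration; the text above fixes only the observation tuple, the reward, and the meaning of actions 0 and 1.

```python
import random

# Assumed action encoding: 0 = down, 1 = right, 2 = up, 3 = left.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

class HunterPreyEnv:
    """A small gridworld hunter/prey game (a sketch, not the paper's code)."""

    def __init__(self, size=10):
        self.size = size

    def reset(self):
        self.hunter = (random.randrange(self.size), random.randrange(self.size))
        self.prey = (random.randrange(self.size), random.randrange(self.size))
        return self._obs()

    def _obs(self):
        # The fully observable state <h_x, h_y, p_x, p_y>.
        return (*self.hunter, *self.prey)

    def _move(self, pos, action):
        dx, dy = MOVES[action]
        x = min(max(pos[0] + dx, 0), self.size - 1)
        y = min(max(pos[1] + dy, 0), self.size - 1)
        return (x, y)

    def step(self, action):
        self.hunter = self._move(self.hunter, action)
        done = self.hunter == self.prey
        if not done:
            # The prey moves randomly as long as it has not been captured.
            self.prey = self._move(self.prey, random.randrange(4))
        # Reward is -1 on every step until capture, 0 once captured.
        return self._obs(), (0 if done else -1), done
```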
Building the Advisor

The advisor and agent use the same parameters for learning the policy. First, the advisor learns a policy by playing 20,000 episodes of the game (i.e., restarting every time the prey is captured) with the Q-learning algorithm described above. The learning rate α was set to 0.1; the discount factor γ was set to 1.0 (so, undiscounted learning). The learned policy was then kept to help inform the teaching policies described below.
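Putting the two sketches together, the advisor's training run might look like the following. The ε-greedy exploration scheme and the state-indexing helper are assumptions; the paper specifies only α = 0.1, γ = 1.0, and 20,000 episodes.

```python
import numpy as np
import random

env = HunterPreyEnv(size=10)           # from the sketch above
Q_teacher = np.zeros((env.size ** 4, 4))
alpha, gamma, epsilon = 0.1, 1.0, 0.1  # epsilon is an assumed exploration rate

def idx(obs, size=10):
    # Flatten <h_x, h_y, p_x, p_y> into a single table index.
    h_x, h_y, p_x, p_y = obs
    return ((h_x * size + h_y) * size + p_x) * size + p_y

for episode in range(20_000):
    s, done = idx(env.reset()), False
    while not done:
        # Epsilon-greedy action selection over the teacher's current table.
        a = random.randrange(4) if random.random() < epsilon else int(Q_teacher[s].argmax())
        obs, r, done = env.step(a)
        s_next = idx(obs)
        Q_teacher[s, a] += alpha * (r + gamma * Q_teacher[s_next].max() - Q_teacher[s, a])
        s = s_next
```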
Teaching the Student

Once the advisor is trained, its policy is used to provide feedback to the learning agent via a few hand-coded policies. In the first policy, the advisor augments the reward the agent receives by a tunable, fixed value (set to -10 in these experiments) if the action chosen by the agent is not the optimal action as determined by the Q function learned by the advisor. In another policy, the teaching agent only augments the reward signal if the chosen action is the worst among the four available actions. The third policy augments the reward signal by an amount proportional to the difference between the value of the chosen action and the value of the optimal action (again, as determined by the advisor's Q function).

Note that, while these policies are fixed and provided here, the problem of learning these policies is easily translated to a reinforcement learning problem, and an optimal policy could easily be learned.
Defining Punishment
As mentioned above, I address three punishment schedules in this paper. For simplicity's sake, I define a function pun(s, a) whose value is the amount by which the reward signal is augmented before being provided to the learning agent who has taken action a in state s. This function is defined in a few different ways, corresponding to each of the punishment schedules.

Punishing Sub-optimal Actions
The first punishment schedule imposes an additional cost on agents who choose an action which is not optimal according to the teacher's Q values for that state. More formally, the reward is augmented by

pun_sub(s, a) = C · 1_sub(s, a)    (1)

where 1_sub is the indicator function whose value is 1 when a ≠ argmax_b Q_teacher(s, b), and 0 otherwise.

Punishing Anti-Optimal Actions
The next punishment schedule examined is similar, but only imposes an additional cost when the learning agent chooses the action that would be the worst among all actions. In other words,

pun_anti(s, a) = C · 1_anti(s, a)    (2)

where 1_anti(s, a) = 1 when a = argmin_b Q_teacher(s, b), and 0 otherwise.

Continuous Punishment By Severity
The final punishment schedule imposes an additional cost on the learning agent that is proportional to the difference between the value of the optimal action and the expected value of the chosen action, according to the teacher's action-value function. That is,

pun_cont(s, a) = C · ( max_b Q_teacher(s, b) − Q_teacher(s, a) )    (3)
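The three schedules are straightforward to implement. Below is a minimal sketch, assuming Q_teacher is the advisor's table from the earlier sketch and C is the tunable scale (C = 10 in the experiments reported here); the function signatures are illustrative.

```python
C = 10.0  # punishment scale used in the experiments

def pun_sub(Q_teacher, s, a):
    # Eq. 1: fixed cost whenever the chosen action is not the teacher's optimum.
    return C if a != int(Q_teacher[s].argmax()) else 0.0

def pun_anti(Q_teacher, s, a):
    # Eq. 2: fixed cost only when the chosen action is the teacher's worst.
    return C if a == int(Q_teacher[s].argmin()) else 0.0

def pun_cont(Q_teacher, s, a):
    # Eq. 3: cost proportional to how far the action's value falls below
    # the teacher's optimal value (zero for the optimal action).
    return C * (Q_teacher[s].max() - Q_teacher[s, a])
```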
Results

Figure 1: Comparison of Q-Learning and Suboptimal Action Punishment

Figure 1 shows the results of augmenting the learning agent's reward by Eq. 1 with C = 10; that is, the reward received by the agent at each step was

R̂_agent = R_agent − pun_sub(s, a)    (4)

Notice that the guidance from the teacher causes impressive improvements in training speed; however, learning quickly levels off after only a couple thousand episodes, at a level of performance that is inferior to what simple Q-learning achieves by the final episode. This is likely the result of the teacher continuing to punish slight variations on its policy which might actually be improvements.

Figure 2: Comparison of Q-Learning and Anti-Optimal Action Punishment

Figure 2 shows the results of augmenting the learning agent's reward by Eq. 2 with C = 10. In this case, the reward became

R̂_agent = R_agent − pun_anti(s, a)    (5)

Here, the increase in convergence speed was less significant; however, because the teacher only punishes if the student chooses the worst possible action, the negative effects are diminished, and the student manages to perform better than the teacher at every episode.

Figure 3: Comparison of Q-Learning and Continuous Proportional Punishment

Figure 3 shows the result of augmenting the reward signal by Eq. 3 with C = 10, i.e.,

R̂_agent = R_agent − pun_cont(s, a)    (6)

Similarly to above, it appears that, once the student has learned a similar-enough version of the teacher's policy, the differences between the values of the chosen action and the teacher's estimate of the optimal action are too small to make much difference, and so the student's learning curve remains below the teacher's throughout every episode.

In this paper, I have ignored the problem of budgeting advice, which is prominent elsewhere in the literature. Because the continuous punishment schedule requires advice at every step, it would certainly require a great deal of interaction with the learning agent, which most likely makes it intractable for problems where there is a cost to interaction. Due to this fact, and because the suboptimal schedule leads to poor overall convergence, I consider the anti-optimal schedule to be the most useful.
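The point of this construction is that teacher feedback enters only through the reward, so the student's learning rule itself is untouched. A sketch of one augmented learning step (Eqs. 4-6) follows, where pun is any of the schedule functions sketched above:

```python
def augmented_step(Q_student, Q_teacher, pun, s, a, r, s_next,
                   alpha=0.1, gamma=1.0):
    # Fold the teacher's feedback into the environmental reward
    # (R-hat = R - pun(s, a)), then apply the ordinary Q-learning backup.
    r_hat = r - pun(Q_teacher, s, a)
    td_target = r_hat + gamma * Q_student[s_next].max()
    Q_student[s, a] += alpha * (td_target - Q_student[s, a])
```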
Importance of Designing the Feedback Policy

Despite the promising results discussed above, I encountered one instance where the reward augmentation scheme caused severe problems for learning. Figure 4 shows the results of attempting to augment the reward signal by providing positive feedback when the learning agent chooses the action that is optimal according to the teacher's Q function, and negative feedback otherwise.
Tuning the C Parameter
Here we examine the effect of the C parameter on convergence.

Figure 5 shows the effect of tuning the C parameter in the "suboptimal" schedule. Of note is that small C values lead to a less impressive speedup in convergence, but result in less harmful negative effects at later episodes. Meanwhile, a larger value of C leads to a more impressive initial speedup, but to a policy which is not as good. Also interesting is that performance seems to flip for every schedule at around the 15,000-episode mark: before that point, higher values of C produce better performance, but after it, higher values of C seem to prevent further improvement.

Figure 5: Various C-values under the "suboptimal" schedule

Figure 6 shows the effect of the C parameter on the "anti-optimal" schedule. Notably, there doesn't appear to be any negative impact on later training episodes like was apparent in Figure 5. In fact, higher values of C seem to have purely positive effects. Presumably, this is because the feedback is mostly applied early in training; once the policy becomes relatively good, it is unlikely that the agent will take the worst possible action, so feedback becomes sparse and training continues as normal.

Figure 6: Various C-values under the "anti-optimal" schedule

Figure 7 shows the effect of tuning the C parameter for the "continuous" schedule. The effect is similar to what was seen in Figure 6: higher values of C allow the agent to learn more quickly, but learning is not negatively impacted at later episodes, once the agent has learned a reasonable policy and the difference between its chosen action and the teacher's best action is probably small.

Figure 7: Various C-values under the "continuous" schedule

Conclusions & Future Work
In this paper I've proposed an extension of the teacher/student framework initially developed by (Torrey and Taylor 2013) which allows the teacher to provide advice to the student via the existing structure of reinforcement learning problems, by augmenting the reward signal that the learning agent receives from the environment. I've shown that using this approach can significantly speed up learning. Furthermore, this approach sidesteps some of the shortfalls of approaches like that of (Da Silva, Glatt, and Costa 2017); namely, my approach extends naturally to agents which make use of function approximation in their learning algorithms, while the most effective approaches in (Da Silva, Glatt, and Costa 2017) require a record of visits to a state. In problems with a large state space where function approximation is necessary, it is unreasonable to expect that an agent will visit any given state even twice, so the assumption that we can count visits to a state does not hold.

The obvious next step for this approach would be to replicate the results of (Da Silva, Glatt, and Costa 2017) with co-learning agents. This would require developing some metric of an agent's confidence in a state to avoid negative impact, but that is a tractable problem.

Another obvious path forward would be to remove the hand-coded punishment policies in favor of a trained meta-agent that learns how to guide learning agents. This would allow not only the punishment schedule to be optimized, but would also lead to a formulation of the advice budget that allows the training agent to optimize the feedback it provides based on an actual cost of providing that feedback, again bringing the concept of a "budget" into the general framework of the reinforcement learning problem.
References
Da Silva, Felipe Leno, Ruben Glatt, and Anna Helena Reali Costa. 2017. "Simultaneously Learning and Advising in Multiagent Reinforcement Learning." In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 1100–1108. International Foundation for Autonomous Agents and Multiagent Systems.

Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.

Torrey, Lisa, and Matthew Taylor. 2013. "Teaching on a Budget: Agents Advising Agents in Reinforcement Learning." In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems.

Watkins, Christopher J. C. H. 1989. "Learning from Delayed Rewards." PhD thesis, University of Cambridge.

Watkins, Christopher J. C. H., and Peter Dayan. 1992. "Q-Learning." Machine Learning 8 (3–4): 279–292.