Persistent Rule-based Interactive Reinforcement Learning
Adam Bignold, Francisco Cruz, Richard Dazeley, Peter Vamplew, Cameron Foale
Adam Bignold*, Francisco Cruz*, Richard Dazeley, Peter Vamplew, Cameron Foale
School of Science, Engineering and Information Technology, Federation University, Ballarat, Australia.
School of Information Technology, Deakin University, Geelong, Australia.
Escuela de Ingeniería, Universidad Central de Chile, Santiago, Chile.
*Both authors contributed equally to this manuscript.
Corresponding e-mails: {a.bignold, p.vamplew, c.foale}@federation.edu.au, {francisco.cruz, richard.dazeley}@deakin.edu.au

Abstract
Interactive reinforcement learning has allowed speeding up the learning process in autonomous agents by including a human trainer who provides extra information to the agent in real-time. Current interactive reinforcement learning research has been limited to interactions that offer advice relevant to the current state only. Additionally, the information provided by each interaction is not retained; instead, it is discarded by the agent after a single use. In this work, we propose a persistent rule-based interactive reinforcement learning approach, i.e., a method for retaining and reusing provided knowledge, allowing trainers to give general advice relevant to more than just the current state. Our experimental results show that persistent advice substantially improves the performance of the agent while reducing the number of interactions required from the trainer. Moreover, rule-based advice shows a similar performance impact to state-based advice, but with a substantially reduced interaction count.
Keywords:
Reinforcement learning; Interactive reinforcement learning; Persistent advice; Rule-based advice.
Interactive reinforcement learning (IntRL) allows a trainer to guide or evaluate a learning agent's behaviour [1, 2]. The assistance provided by the trainer reinforces the behaviour the agent is learning and shapes the exploration policy, resulting in a reduced search space [3]. Current IntRL techniques discard the advice sourced from the human shortly after it has been used [4, 5], increasing the dependency on the advisor to repeatedly provide the same advice to maximise the agent's use of it.

Moreover, current IntRL approaches allow trainers to evaluate or recommend actions based only on the current state of the environment [6, 7]. This constraint restricts the trainer to providing advice relevant to the current state and no other, even when such advice may be applicable to multiple states [8]. Restricting the timing and utility of advice negatively affects the interactive approach, either by creating an increasing demand on the user's time or by withholding potentially useful information from the agent [2].

This work introduces persistence to IntRL, a method for information retention and reuse. Persistent agents attempt to maximise the value extracted from the advice by replaying interactions that occurred in the past, rather than relying on the advisor to repeat an interaction. Agents that retain advice require fewer interactions than non-persistent counterparts to achieve similar or improved performance, thus reducing the burden on the advisor to provide advice on an ongoing basis.

Additionally, this work introduces a persistent rule-based IntRL approach. Allowing users to provide information in the form of rules, rather than per-state action recommendations, increases the information per interaction and does not limit the information to the current state. By not constraining the advice to the current state, users can give advice pre-emptively, no longer needing the current state to match the criteria for the user's assistance. This more informationally rich interaction method improves the performance of the agent compared to existing methods and reduces the number of interactions between the agent and the advisor.

Therefore, the contribution of this work is twofold. First, the introduction of a state-based method for the retention and reuse of advice, named persistence. Second, a persistent rule-based IntRL method obtaining the same performance as state-based advice, but with a substantially reduced interaction count. In this regard, rules allow advice to be provided that generalises over multiple states.

Interactive reinforcement learning (IntRL) is a field of RL research in which a trainer interacts with an RL agent in real-time [9, 10, 11]. The trainer can provide extra information to the agent regarding its behaviour, the environment, or future actions it should perform [12]. This approach has been shown to considerably improve the agent's learning speed and allows RL to scale to larger or more complex problems [13, 14, 15]. Current IntRL methods limit guidance and evaluation to the current state of the agent, regardless of whether the conditions for the information are shared among multiple states [16, 17]. This constraint requires the advising user to monitor the current state of the agent and wait until conditions are met that suit the advice they wish to provide. Additionally, the user is required to repeat the advice for each state where the conditions are met. This lack of generalisation increases the number of interactions and the demand on the user [18, 19].
In computer science, a rule is a statement consisting of a condition and a conclusion. A simple example of a rule is 'IF p THEN q', dictating that if the condition p is met, then the conclusion is q. Additional qualifiers may supplement rules, allowing a rule's condition or conclusion to be constructed to meet specific demands.

When teaching or conveying information between people, one form of knowledge transfer is rules. While the syntax of the rule is not necessarily formal, the relation of condition and conclusion is maintained [20]. For example, the sentence 'Don't touch the stove when hot' can be represented formally as 'IF stove==hot THEN don't touch'.

Moreover, conditions and conclusions are quickly identifiable by humans when natural language is used. Recent advances in speech-to-text systems have demonstrated the ability to identify the condition and conclusion in human speech [21, 20]. Assistant systems such as Google Assistant and Apple Siri have demonstrated this functionality through conditional reminders such as 'When I am at the supermarket remind me to get milk' or 'Remind me at 3 PM to call the restaurant' [22]. The ease with which humans can identify rules for knowledge transfer, and the ability of machines to translate speech to rules, means that rules are an increasingly viable option for knowledge transfer for non-technical users [23, 24].

A user may create multiple rules over the duration of their assistance to an agent [25]. As a result, a single state may have multiple rules, each with conflicting advice for the current state. In this regard, binary decision trees offer a method of structuring rules in such a way that only one conclusion is given for each state. A binary decision tree is made up of nodes, each containing either a condition, a conclusion, or both. The decision on which child to evaluate next depends on whether the condition of the current node is met or not [26].

The philosophy of early decision tree systems was that knowledge was static and, if captured in its entirety, would not need updating or correction [27]. Traditionally, the approach to designing a decision tree involved employing an expert and a knowledge engineer. The expert held the knowledge to be captured in their head [27]. Through exhaustive deliberation, the knowledge engineer and expert would transform the knowledge into a decision tree. Algorithms such as ID3 [28] and CART [29] allow this process to be automated, provided that large amounts of labelled data are available.

An additional concern with traditional decision tree methodologies is maintenance. Maintenance is the task of modifying an existing tree and is often performed with the intention of expanding or correcting the knowledge it captures. Maintenance is a difficult task, as it requires a deep understanding of the entire tree and the reasons why each rule was added in the past. For trees generated using algorithms such as ID3 [28] and CART [29], this understanding of the tree is lost, as the rules were generated by the algorithm, not by a knowledge engineer.

The usual methods for building decision trees do not meet the IntRL constraints. IntRL does not have access to large amounts of labelled data and aims to be within the skill level of non-expert users, not specialised knowledge engineers [14].
Rule-based IntRL requires a method for generating binary decision trees without the need for expert skills in knowledge engineering, without large amounts of labelled data, and that can be built iteratively without the need for the user to know the full context of the tree.
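To make the condition and conclusion structure above concrete, the informal stove rule can be held as a small data structure. The following Python sketch is purely illustrative and assumes dictionary-style states; the Rule class is an assumption for this example, not part of the approach described here.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        condition: Callable[[dict], bool]  # the IF-part, evaluated against a state
        conclusion: str                    # the THEN-part

    # 'IF stove==hot THEN don't touch' expressed as a Rule over a dictionary state
    dont_touch = Rule(condition=lambda s: s.get("stove") == "hot",
                      conclusion="don't touch")

    state = {"stove": "hot"}
    if dont_touch.condition(state):
        print(dont_touch.conclusion)       # the conclusion applies in this state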
Ripple-down rules (RDR) is a well-known iterative technique for building and maintaining binary decision trees [30, 31]. RDR was proposed in response to the concerns raised with traditional decision tree knowledge systems. While early decision tree systems consider knowledge to be static, RDR considers knowledge to be fluid and ever-changing. Therefore, knowledge is incrementally acquired, occurring naturally as domain experts interact with the system over time.

RDR is a combination of decision trees and case-based reasoning [32]. A case is a collection of potentially relevant material that the system uses to make a classification and is equivalent to the concept of a state in RL. Each node in an RDR system contains a rule, a classification, and a case. The case paired with each node is referred to as the 'cornerstone case' and provides the justification for the node's creation [27]. RDR systems require the user to only consider the difference between the current case and the cornerstone case [27]. Using this methodology, the user does not need to know the context of the entire system, or how new rules will impact its structure. There is no need for a knowledge engineer for such a system, since the rules are elicited directly from the user and no oversight of the construction of the tree is needed. The iterative nature of RDR also negates the need for large amounts of labelled data. Instead, the tree is built using the gradual flow of cases that any decision tree system is subject to. These features make RDR suitable for structuring rule-based advice in IntRL scenarios. Figure 1 depicts an RDR model example represented as both a tree and a text representation using conditional clauses.

Figure 1: An example of an RDR tree model, and a text representation of the same model. The root node contains a rule that will always evaluate to true; if no other rule in the tree evaluates to true, the classification of the root node is chosen.
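The RDR evaluation described above (and illustrated in Figure 1) can be sketched as follows, assuming cases are dictionaries of feature values; the class and function names are ours, chosen for illustration, and do not come from the original implementation.

    from dataclasses import dataclass
    from typing import Callable, Optional

    Case = dict  # a case is a collection of feature values, analogous to an RL state

    @dataclass
    class RDRNode:
        condition: Callable[[Case], bool]      # the node's rule (IF-part)
        conclusion: str                        # the node's classification
        cornerstone: Optional[Case] = None     # case that justified creating this node
        true_child: Optional["RDRNode"] = None
        false_child: Optional["RDRNode"] = None

    def classify(root: RDRNode, case: Case):
        """Evaluate a case against the tree. Returns the classification node
        (the last node whose rule fired) and the insertion node (the last node
        visited), as described in the text."""
        classification = root                  # the root rule always evaluates to true
        node, last_visited = root, root
        while node is not None:
            last_visited = node
            if node.condition(case):
                classification = node
                node = node.true_child
            else:
                node = node.false_child
        return classification, last_visited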
The method for retention and reuse of advice proposed here combines the concept of modelling demonstrations with the evaluative and informative interaction methodology from IntRL [3]. This combination, resulting in retained advice, allows an agent to maximise the utility of each interaction. Additionally, a rule-based IntRL approach further minimises the demand on the advisor. Rule-structured advice allows information to be generalised over multiple states. This reduces the interactions required with the human advisor while simultaneously increasing the potential benefit each interaction has on the agent's behaviour. The generalisation occurs because the user can specify the conditions in which the information is applicable, allowing the advice to be generalised beyond the current state. The agent can then check each state it encounters to see if the conditions are met, at which point the recommendation or evaluation can be utilised.
As introduced, we propose a persistent agent that keeps a record of each interaction and the conditions in which it occurred. When the conditions are met in the future, the interaction is replayed. This results in improved utilisation of the advice and, consequently, improved performance of the agent. Additionally, fewer interactions with the trainer are required, as there is no need for advice to be repeatedly provided for each state.

However, a naive implementation of persistence can introduce flaws into the reward-shaping process. These flaws, if unaddressed, may cause the agent to never learn an optimal policy. Prior work on reward shaping [33] has shown that while reward shaping can accelerate learning, it can also result in the optimal policy under the shaping reward differing from that which would be optimal without shaping. Ng et al. [34] demonstrated that this issue can be avoided by using a potential-based approach to constructing the shaping reward signal. This guarantees that the rewards obtained along any path from a state back to itself are zero-sum, so that the agent will not find a loop in the environment that provides infinitely accumulating rewards without termination [35]. For non-persistent IntRL agents, the reward given as part of the evaluation is temporary, as the human has to provide the supplemental reward upon revisiting the state. Assuming that the human will eventually stop providing advice, the reward signal will become zero-sum [36].

For IntRL agents that use policy shaping, i.e., recommendations on which action to perform next, a straightforward implementation of persistence will work if the advice is correct. However, human advice is rarely 100% accurate [3]. Inaccuracy can result from negligence, misunderstanding, latency, maliciousness, and noise introduced when interpreting advice. Furthermore, if the agent always performs the recommended action, then it is not given the opportunity to explore and discover the optimal action. An agent that retains and reuses inaccurate advice will not learn an optimal policy. Therefore, it is important that the agent be able to discard or ignore retained knowledge.

These two issues with persistence, non-potential reward shaping and incorrect policy-shaping advice, result in persistent agents being unable to learn the optimal policy. The issue of inaccurate advice with persistence has two possible solutions: either identify the incorrect advice and discard it, or discard all advice after a period regardless of its accuracy. To know the accuracy of a piece of advice, a full solution to the problem must be known, and if this were achievable then an RL agent would not be needed. Instead, a policy of discarding or ignoring advice after a period allows a persistent agent to function with potentially inaccurate advice, while still maximising the utility of each interaction. This method also solves the issue of non-potential evaluative advice, as the frequency of the supplemental reward is reduced over time until it reaches zero. Once the supplemental reward is reduced to zero, the cumulative shaping reward function becomes zero-sum once again.

Therefore, to solve the issue of incorrect advice in persistent IntRL, a method for discarding or ignoring advice after a period of time is needed.
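For reference, the potential-based construction mentioned above defines the shaping reward from a potential function over states; the symbols below (Φ for the potential, F for the shaping reward) are standard notation assumed here rather than taken from the text:

    F(s, a, s') = γ Φ(s') - Φ(s)

Summed along a trajectory the Φ terms telescope, so in the undiscounted case (γ = 1) the shaping rewards accumulated around any loop that returns to its starting state cancel to zero, which is the zero-sum property referred to above; Ng et al. [34] show that the optimal policy is unchanged for any shaping reward of this form.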
Probabilistic policy reuse (PPR) is a technique that aims to improve RL agents that use guidance [37]. PPR relies on using a probabilistic bias to determine which exploration policy to use when multiple options are available; the goal is to balance random exploration, the use of a guidance policy, and the use of the current policy.

For the persistent agent scenario, there are three action selection options available: random exploration, the use of retained advice from the trainer, or the best action currently known. PPR assigns each of the three options a probability and priority of selection [37]. Over time, the probability of using guidance or retained information decreases, and trust in the agent's own policy increases. Using PPR, the guidance provided by the trainer is used for more than a single time step, with a decreasing probability over time, until the value of the advice is captured by the agent's own policy. Once encapsulated by the agent, self-guided exploration and exploitation of the environment continue.
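A minimal sketch of this three-way selection in Python is shown below; the dictionary-based advice store, the function signature, and the decay values in the comments are illustrative assumptions rather than the authors' implementation.

    import random

    def ppr_select(state, q_values, advice, actions, reuse_prob, epsilon=0.1):
        """Probabilistic policy reuse over three options: retained advice,
        the current learned policy, and random exploration."""
        recommended = advice.get(state)              # retained trainer advice, if any
        if recommended is not None and random.random() < reuse_prob:
            return recommended                       # reuse the guidance policy
        if random.random() < epsilon:
            return random.choice(actions)            # random exploration
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploit

    # reuse_prob typically starts high (e.g. 0.8) and is decayed each episode
    # (e.g. by 0.05), shifting control from retained advice to the agent's policy.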
In the following, we provide details about the rule-based interactive agent implemented. As in the previous case, issues of conflicting and incorrect advice need to be mitigated. Therefore, a method for managing and correcting retained information is required. In this regard, ripple-down rules (RDR) offer a methodology for iteratively building knowledge-based systems without the need for engineering skills.

While current IntRL agents accept advice pertaining to the current state only, ripple-down rule reinforcement learning (RDR-RL) accepts rule-based advice that can apply to multiple states. Each interaction contains a recommendation or evaluation from the user and the conditions for its application. For example, the user may provide the following rule to an agent learning to drive a car: "IF obstacle on left==TRUE THEN action=turn right". In this example, the advice is to turn right, and the condition for its use is that there is an obstacle on the left-hand side of the car. While rule-based IntRL assumes that all interactions contain a rule, this rule does not have to be sourced directly from the user. The method by which the user interacts with the agent can be any means, as long as the advice collected results in a set of conditions and a recommendation. The user may provide the set of conditions for the applicability of the advice directly, or, optionally, the conditions may be discovered using assistive technologies such as case-based reasoning or speech-to-text.

An RDR-RL agent has three aspects to be considered during its construction, each of which is described in the following sections. These aspects are advice gathering, advice modelling, and advice utility.
The RDR-RL agent has the same foundation as any RL agent. The ability to retain and use the advice provided by the user is an addition to the RL agent, built around the existing algorithm. Like existing IntRL agents, when no advice has been provided to the agent, it operates exactly the same as a standard RL agent.

For instance, at any point during the agent's learning, a user may assist the agent by recommending an action to take. When the user begins an interaction, they are provided with the agent's current state and, if available, the current intended action. If the user agrees with the intended action the agent presented, or if the user is no longer available, the agent continues learning on its own.

If the user disagrees with the action the agent is proposing, or if there is no action proposed, then the interaction continues. The user is provided with a cornerstone case. The cornerstone case is the state in which the user recommended the action that the agent is intending to take. The differences between the cornerstone case and the current state are presented to the user. If there is no cornerstone case, for example, when it is the first time the user is providing advice to the agent, then only the current state is provided. The user recommends an action for the agent to take and creates a rule that distinguishes the two cases, setting the conditions for their recommended action. Once the recommended action has been provided, and the rule setting the conditions for its use determined, they are passed to the agent. The agent then uses the rule and recommendation to update its model of advice.
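The difference between the current state and the cornerstone case can be computed directly when cases are represented as dictionaries of feature values (an assumption made for this sketch only):

    def case_difference(current_case, cornerstone_case):
        """Return the features whose values differ between the current case and
        the cornerstone case; this difference is presented to the user so they
        can write a rule that distinguishes the two."""
        return {feature: (cornerstone_case.get(feature), value)
                for feature, value in current_case.items()
                if cornerstone_case.get(feature) != value}

    # Example: only 'velocity' separates the two cases shown to the user.
    diff = case_difference({"position": -0.5, "velocity": 0.02},
                           {"position": -0.5, "velocity": -0.01})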
Advice modelling is the process of storing the information received from the user. The agent receives a rule and a recommendation from the user each time an interaction occurs. The rule dictates the conditions that must be met for the recommendation to be provided to the agent.

For instance, a persistent state-based IntRL agent may maintain a lookup table mapping each state to the corresponding recommendation or evaluation that has been provided. As we will describe along with the experimental results, this simple method for advice modelling improves performance compared to agents that do not retain advice. However, this lookup model does not generalise advice across multiple states and may present difficulties with incorrect advice.

For rule-based advice, a ripple-down rules decision tree is used to model the advice provided by the user. This system allows a model of advice to be iteratively built over time, as the user provides more information to the agent. The RDR model is part of the learning agent but is independent of the Q-value policy. It is used to assist in action selection.

When an interaction with the user occurs, the agent is provided with an action recommendation and a rule governing its use. To update the model of advice, the agent provides the current state as a case to the RDR system. The system returns a classification node and an insertion node. The classification node contains the recommended action based on the advice collected prior to the current state; that is, the recommendation the user disagrees with given the current state. The insertion node is the last node in the branch of the RDR tree that evaluated the current state and is the point at which the new rule will be inserted. A new node is created using the rule and recommendation from the user, along with the current state as the cornerstone case. If the rule in the insertion node evaluates to TRUE using the information in the current state, then the new node will be inserted as a TRUE child; otherwise, it will be inserted as a FALSE child.
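Continuing the RDRNode/classify sketch from the ripple-down rules section, the update described above can be written as follows; this is an illustrative sketch rather than the authors' implementation.

    def update_advice_model(root, current_state, user_rule, recommended_action):
        """Insert the user's new rule at the insertion node returned by classify().
        The current state becomes the new node's cornerstone case."""
        _, insertion = classify(root, current_state)
        new_node = RDRNode(condition=user_rule,
                           conclusion=recommended_action,
                           cornerstone=current_state)
        if insertion.condition(current_state):
            insertion.true_child = new_node      # inserted as a TRUE child
        else:
            insertion.false_child = new_node     # inserted as a FALSE child
        return new_node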
The last aspect of the agent's construction details when the advice gathered from the user is used by the agent. In the previous section, the concept of persistence was discussed. There, decreasing agent performance was identified as an issue when incorrect advice was provided, or when recommended actions were always followed, neglecting exploration. To mitigate this issue, PPR was proposed.

For the RDR-RL agent, the guidance policy is the model of advice. The trade-off between exploration and the exploitation of the learned expected-rewards policy continues to be managed by whichever action selection method is preferred by the agent designer. For instance, an ε-greedy action selection method is used for the experiments in this work. In this regard, PPR manages the switch between the action recommended by the advice model and the ε-greedy action selection method.

At each time step, the advising user has a chance to interact with the agent. If an interaction occurs, the model is updated. When a user first recommends an action, it is expected that the agent will perform it. For this reason, the recommended action is always performed on the time step at which it was recommended, regardless of the probabilities currently set by PPR.

When an agent is selecting an action in a time step where the user has not recommended an action, then PPR is used. First, the agent's model of advice is checked to see if any advice pertains to the current state. If the model recommends an action, then that action is taken with a probability determined by the PPR selection policy. If no action is recommended, then the agent's default action selection policy is used, e.g., ε-greedy. Figure 2 summarises the whole process flow for the persistent rule-based IntRL agent.

Figure 2: Process flow of a rule-based interactive reinforcement learning agent.
The mountain car is a control problem in which a car is located on a one-dimensional track between two steep hills. This environment is a well-known benchmark in the RL community and is, therefore, a good candidate to initially test our proposed approach.

The car starts at a random position at the bottom of the valley (−0.6 < x < −0.4) with no velocity (v = 0). The aim is to reach the top of the right hill. However, the car engine does not have enough power to climb to the top directly and, therefore, needs to build momentum by moving toward the left hill first.

An RL agent controlling the car movements observes two state variables, namely, the position x and the velocity v. The position x varies between −1.2 and 0.6 on the x-axis and the velocity v between −0.07 and 0.07. The agent can take three different actions: accelerate the car to the left, accelerate the car to the right, and do nothing.

The agent receives a negative reward of r = −1 at each time step until the goal is reached, where r = 0. The learning episode finishes when the top of the right hill is climbed (x = 0.6) or after 1,000 iterations, in which case the episode is forcibly terminated.
The self-driving car environment is a control problem in which a simulated car, controlled by the agent, must navigate an environment while avoiding collisions and maximising speed. The car has collision sensors positioned around it which can detect whether an obstacle is in that position, but not the distance to that position. Additionally, the car can observe its current velocity. All observations made by the agent come from its reference point; this includes the obstacles (e.g., there is an obstacle on my left) and the car's current speed. The agent cannot observe its position in the environment.

At each step, the environment provides the agent a reward equal to its current velocity. A penalty of -100 is awarded each time the agent collides with an obstacle. Along with the penalty reward, the agent's position resets to a safe position within the map, its velocity resets to the lower limit, and the direction of travel is set to face the direction with the longest distance to an obstacle.

Figure 3a shows the map used for the self-driving car experiments. This map challenges the agent to learn a behaviour that maximises velocity while avoiding collisions, using a layout that prohibits turning at high speed in the narrow corridors on the top, right, and bottom of the map. The only two sections of the map that allow for high-velocity turning are the large empty sections on the left side.

The collision sensors return a Boolean response as to whether there is an obstacle at that position, though not the distance to that obstacle. Additionally, the agent does not know the position of its sensors in reference to itself. The only information the agent has regarding the sensors is whether each is currently colliding with an obstacle. As stated, the agent also knows its current velocity. The possible velocity of the agent is capped at 1 m/s at the lower end and 5 m/s at the higher end. A lower cap above zero velocity prevents the agent from moving in reverse or standing still. This lower limit reduces the state space and prevents an unintended solution, e.g., standing still is an excellent method for avoiding collisions. The upper limit of 5 m/s is set so that velocity is not limitless and further reduces the state space, while still being high enough that it exceeds the limit for a safe turn anywhere in the environment. An action that attempts to exceed the velocity thresholds set by the environment will return the respective limit.
Figure 3: A graphical representation of the simulated self-driving car: (a) the simulated self-driving car, (b) the optimal path. The blue square at the top left is the car. A yellow line within the car indicates the current direction and the number below (in yellow) is the current velocity. The small green squares surrounding the car are collision sensors and will always align with the car's current direction. The large white rectangles are obstacles.

There are five possible actions for the agent to take within the self-driving car environment. These actions are:
i. Accelerate: the car increases its velocity by 0.5 meters per second.
ii. Decelerate: the car's velocity decreases by 0.5 meters per second.
iii. Turn left: the car alters its direction of travel by 5 degrees to the left.
iv. Turn right: the car alters its direction of travel by 5 degrees to the right.
v. Do nothing: neither the car's velocity nor its direction of travel is altered. When performing this action the only change is to the car's position, based on its current velocity, position, and direction of travel.

The self-driving car environment has nine state features: one for each of the collision sensors on the car, plus the current velocity of the car. The collision sensor state features are Boolean, representing whether they detect an obstacle at their position. The velocity of the agent has nine possible values: the upper and lower limits, plus every 0.5 increment in between. With the inclusion of the five possible actions, this environment has 11520 state-action pairs.

The reward function defined by the environment encourages the agent to learn a behaviour that avoids obstacles while attempting to achieve the highest velocity the environment allows. The most natural solution that achieves these conditions is to drive in a circle, assuming that the path of the circle does not intersect with an obstacle. The map chosen for use in these experiments allows an unobstructed circular path to be found, but only at low velocities. If the agent is to meet both conditions that achieve the highest reward, a more complex behaviour must be learned (see Figure 3b).
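As a quick consistency check on the state-action count above (the per-sensor breakdown is an inference from the numbers given, since the exact sensor count is not restated here), eight Boolean collision sensors, nine velocity values, and five actions give

    2^8 × 9 × 5 = 256 × 45 = 11520 state-action pairs.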
To compare agent performance and interaction, metrics for agent steps, agent reward, and interactions are recorded. A number of different agents and simulated users have been designed and applied to the mountain car and self-driving car environments. Simulated users have been chosen over actual human trials, as they allow rapid and controlled experiments [38]. When employing simulated users, interaction characteristics such as knowledge level, accuracy, and availability can be set to specific and measurable levels. In the following, we describe all the agents used during the experiments.
Next, we demonstrate the use of persistent advice with probabilistic policy reuse (PPR), and the impact its use has on agent performance and user reliance. The experiments have been designed to test several levels of human advice accuracy and availability, with and without retention of received advice.

The mountain car environment is used in these experiments since it is a common benchmark problem in RL, with sufficient complexity to effectively test agents and simple enough for human observers to intuitively calculate the correct policy. Additionally, the mountain car environment has been previously used in a human trial evaluating different advice delivery styles [3] and with simulated users [38]. We use the results reported in the human trial to set a realistic level of interaction for evaluative and informative advice agents. Five agents have been designed for the following experiments. The expected-reward values have been initialized to zero, an optimistic value for the environment. All the agents are given a learning rate α = 0.25, a discounting factor γ = 0.9, and use an ε-greedy action selection strategy with ε = 0.1. For the agent to represent the continuous two-dimensional state space of the environment, it has been discretized into 20 bins for each state feature, creating a total of 400 states, each with three actions. The learning agents are listed below:
i. Unassisted Q-Learning Agent: A Q-learning agent used for capturing a baseline for performance on the mountain car environment. This agent is unassisted, receiving no guidance or evaluation from the trainer, and is used as a benchmark.

ii. Non-Persistent Evaluative Advice Agent: This agent is assisted by a user. The user may provide an additional reward at each time step to evaluate the agent's last choice of action. For this non-persistent agent, the supplemental reward is used in the current learning step and then discarded.

iii. Persistent Evaluative Advice Agent: This agent is assisted by a user. The user may provide an additional reward at each time step to evaluate the agent's last choice of action. For this persistent agent, the evaluation provided is retained, and upon performing the same state-action pair in the future, the evaluation may be automatically provided to the agent, with a probability defined by the PPR action selection policy.

iv. Non-Persistent Informative Advice Agent: This agent is assisted by a user. The user may recommend an action for the agent to perform on the current time step. When the agent is recommended an action, that action is taken on that time step, and then the advice is discarded. This non-persistent agent, when visiting the same state again in the future, will not recall the recommended action and will perform ε-greedy action selection.

v. Persistent Informative Advice Agent: This agent is assisted by a user. The user may recommend an action each time step for the agent to perform. If recommended, the learning agent will take the advice on that time step and retain the recommendation for use when it visits the same state in the future. When the agent visits a state in which it was previously given a recommendation, it will take that action with the probability defined by the PPR action selection policy.

The agents adopting a persistent model employ PPR for action selection. As depicted in Figure 4, the PPR action selection begins with an 80% chance of reusing advice provided to the agent in the past. The probability of reusing advice decreases by 5% each episode. For the remaining 20% of the time, or if no advice has been provided for the current state, an ε-greedy action selection policy is used.

Figure 4: Probabilistic policy reuse (PPR) for an IntRL agent using informative advice. If the user recommends an action on the current time step, then the agent's advice model updates and the action is performed. If the user does not provide advice on the current time step, then the agent will follow previously obtained advice 80% of the time (decaying over time) and its default exploration policy the remaining time.

For each agent, one hundred experiments are run. At the beginning of each experiment, the environment, the agent, and the agent's model of provided advice are reset. Each experiment runs with a maximum of one thousand steps before it terminates. The number of steps performed, interactions performed, and reward received are recorded. An interaction is recorded if the user provides advice to the agent, not when the agent uses advice it has stored from a previous interaction.

Six different simulated users have been created as trainers: three providing evaluative advice and three providing informative advice. Evaluative advice-giving users provide either a positive or negative supplemental reward corresponding to the agent's choice of action on the last time step. Informative advice-giving users provide a recommended action for the agent to perform on the current time step. Simulated users that are advising a persistent agent will not provide advice for a state, or state-action pair, that they have previously advised on, as it is assumed that if the agent is retaining information it should not need repeated advice for the same conditions. This does not apply to non-persistent agents.

Additionally, each simulated user has either optimistic, realistic, or pessimistic values for advice accuracy and availability. Accuracy is a measure of how correct the advice provided by the user is. The accuracy of interaction is altered by replacing, with a specified probability, the recommended action with an action that is not optimal for the current state. Availability is a measure of how frequently the user provides advice. The availability of the simulated user is altered by specifying a probability that the user will interact with the agent on any given time step. Optimistic simulated users have 100% accurate advice and will provide advice on every time step for which the agent does not have retained knowledge. Realistic simulated users have accuracy and availability modelled from the previously observed values of human advisors in the mountain car environment [3]. The recorded accuracy and availability of human advice-givers differ depending on the type of advice being provided, i.e., evaluative or informative. Previous work has compared evaluative and informative advice/agents [3], and as such this comparison is not in the scope of this study. Lastly, pessimistic simulated users are given accuracy and availability values half those of the realistic users. Table 1 shows the accuracy and availability values for each of the six simulated users (3 evaluative users, 3 informative users).

Table 1: The six simulated users designed for these experiments. The advice delivery methods, evaluative and informative, are not intended to be compared against each other; rather, they are to be compared against their persistent counterparts. This allows simulated users to be individually set for each advice delivery style that more accurately represent the type of user that would be advising each type of agent.
Name                     Accuracy    Availability
Evaluative Optimistic    100%        100%
Evaluative Realistic     48.470%     26.860%
Evaluative Pessimistic   24.235%     13.430%
Informative Optimistic   100%        100%
Informative Realistic    94.870%     47.316%
Informative Pessimistic  47.435%     23.658%
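To illustrate how the accuracy and availability values in Table 1 act as probabilities during an experiment, a simulated trainer can be sketched as follows. This is a hypothetical helper for illustration only; the function name and arguments are assumptions, not the authors' implementation.

    import random

    def simulated_user_advice(optimal_action, other_actions, accuracy, availability):
        """With probability `availability` the user interacts at this time step;
        the advice given is the optimal action with probability `accuracy`,
        otherwise a non-optimal action for the current state."""
        if random.random() > availability:
            return None                      # no interaction on this time step
        if random.random() < accuracy:
            return optimal_action
        return random.choice(other_actions)

    # Example: a realistic informative user (values taken from Table 1).
    advice = simulated_user_advice("accelerate right",
                                   ["accelerate left", "do nothing"],
                                   accuracy=0.9487, availability=0.47316)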
In this case, three learning agents have been designed, which include an unassisted Q-Learning agent, a persistent state-based informative agent, and a rule-based assisted agent using ripple-down rules. No evaluative assisted agents are tested in these experiments, as they cannot be suitably compared to the rule-assisted agent, which uses informative advice. The three learning agents used are described below:
i. Unassisted Q-Learning Agent: A Q-learning agent used for capturing a baseline for performance on each environment. This agent is unassisted, receiving no guidance or evaluation from the trainer, and is used as a benchmark. The agent represents each environment as described in the previous section. The expected-reward values have been initialized to zero. This agent uses ε-greedy action selection.

Table 2: Agent/User combinations for persistent agent testing, including short names for reference.

Short Name   Agent                        Simulated User
UQL          Unassisted Q-Learning        NONE
NPE-O        Non-Persistent Eval.         EVAL. OPTIMISTIC
NPE-R        Non-Persistent Eval.         EVAL. REALISTIC
NPE-P        Non-Persistent Eval.         EVAL. PESSIMISTIC
NPI-O        Non-Persistent Info.         INFO. OPTIMISTIC
NPI-R        Non-Persistent Info.         INFO. REALISTIC
NPI-P        Non-Persistent Info.         INFO. PESSIMISTIC
PE-O         Persistent Evaluative        EVAL. OPTIMISTIC
PE-R         Persistent Evaluative        EVAL. REALISTIC
PE-P         Persistent Evaluative        EVAL. PESSIMISTIC
PI-O         Persistent Informative       INFO. OPTIMISTIC
PI-R         Persistent Informative       INFO. REALISTIC
PI-P         Persistent Informative       INFO. PESSIMISTIC
ii. State-based Persistent Agent: This agent is assisted by a user. The user may recommend an action each time step for the agent to perform. If an action is recommended by the user, the agent will take it on that time step and retain the recommendation for use when it visits the same state in the future. When the agent visits a state in which it was previously given a recommendation, it will take that action with the probability defined by the PPR action selection strategy. The persistent informative agent uses the same parameter settings as the unassisted Q-Learning agent for each environment.
iii. Rule-Assisted Persistent Agent: This agent is assisted by a user. The user may provide a rule and a recommended action at each time step. The rule-assisted learning agent uses ripple-down rules to model the advice received from the trainer. If the user provides advice, and the rule provided evaluates to true for the current state, then the agent will take the recommended action during that time step. If the provided rule evaluates to false, then the agent will use its default action selection strategy. When a rule is provided, the agent will retain the rule for use in future states. Each time the agent visits a state, it will query its retained model of rules. If a rule is found that evaluates to true for the current state, then that action is taken with a probability defined by the agent's PPR action selection strategy. All rule-assisted agents used in this experiment begin with an 80% chance of taking the action recommended by the advice model. This 80% chance is decayed each episode, until the point at which the agent relies solely on its secondary action selection strategy. The agent's secondary action selection strategy is the same strategy used by the unassisted Q-Learning agent, i.e., ε-greedy. The rule-assisted agent uses the same parameter settings as the unassisted Q-Learning agent for each environment.

The mountain car environment is a good candidate for the rule-based advice method, as the optimal solution can be captured in very few rules while still remaining understandable by humans. The rule-based and state-based agents are tested against the mountain car environment, employing simulated users with varying levels of knowledge of the environment. The aim is to compare the performance of the agents and the number of interactions performed to achieve that performance. The learning parameters used are the same as in the previous experiments.

Additionally, in these experiments, the self-driving car environment is also used. The state and action spaces for this environment are larger than for the mountain car environment, but still remain understandable by human observers.
The self-driving car agents are given a learning rate α = 0.1, a discounting factor γ = 0.9, and use an ε-greedy action selection strategy with ε = 0.1.

To allow quick, bias-reduced, repeatable testing of the agents, simulated users are used as trainers in place of humans. Simulated users offer a method for performing indicative evaluations of RL agents that require human input, with controlled parameters [38]. There are two types of simulated users required for the following experiments: one must provide state-based advice, and the other must provide rule-based advice. Both types of simulated users provide the same information and the same amount of it.

The first type, an informative state-based advice user, is the same user employed for the previous experiments. This user may provide a recommended action on each time step.

Table 3: State-based simulated user knowledge bases for the mountain car and the self-driving car environments.
Environment / User Name: Limits

Mountain Car / MC-FULL: User will provide advice for all states.
Mountain Car / MC-HALF: User will only provide advice for states in which the agent is on the left slope of the valley. (IF position < -0.53)
Mountain Car / MC-QUARTER: User will only provide advice for states in which the agent is on the bottom half of the left slope of the valley. (IF position < -0.53 AND position > -0.865)
Mountain Car / MC-MIDDLE: User will only provide advice for the few states at the bottom of the valley. (IF position < -0.43 AND position > -0.63)
Self-driving Car / SC-AVOID: User will only provide advice for states where the agent has an obstacle on the left side OR the right side, but not both. (IF right = true OR right-front-close = true) OR (IF left = true OR left-front-close = true)

The agent that the user is assisting will retain any recommendations provided by the user, and will not give the user an opportunity to provide advice for a state for which advice has already been received, capping the number of interactions at the number of states. As in the previous experiments, each informative state-based user has an accuracy and an availability score. Accuracy is the probability that the advice the user provides is optimal for the current state. Availability is the probability that the user will provide advice at any given opportunity. Additionally, the states for which the user can provide advice are limited to parts of the environment, simulating a limited or incomplete knowledge level of the environment. Table 3 shows the knowledge limitations of the various state-based users built for the rule-based experiments in order to allow a fair comparison.

The advice that the state-based simulated users provide for the mountain car environment is optimal (before accuracy, availability, and knowledge level are applied). However, the same may not be true for the self-driving car environment. The reward function for this environment reinforces behaviour that avoids collisions and maximises speed. The advice that the simulated user provides for the self-driving car environment only attempts to avoid collisions. While this advice should generally be optimal, there may be situations where the agent will want to stay close to an obstacle to maximise speed. In these situations, the advice provided would be considered incorrect, and the agent will need to learn to ignore it to learn the optimal behaviour.

The second type of simulated user is a rule-based advice user. Simulated users are a common methodology for the creation and evaluation of ripple-down rule systems in research [39, 30, 40, 41]. These simulated users return a rule and a recommended action for each interaction with the agent. The simulated users employed for these experiments have been built with their own ripple-down rules model, populated with a set of rules that they will, over time, provide to the agent. In reality, users do not have their own rule model; rather, they generate rules themselves. We therefore use the rule model for simulated users as a means to replicate the interaction process of a real user.

The learning agent begins each experiment with an empty model of advice, and the simulated user begins with a full model. Over time, the learning agent provides the trainer user with opportunities to provide advice. When an opportunity occurs, the learning agent provides the current state observation, the current action it will take, and details about how it chose that action (either from the retained user model or from an exploration strategy). This information is the same information that would be made available to an actual human advisor.
Now that the simulated user has this information, it may choose whether to respond and what advice to provide. The simulated user will respond if it has a rule that applies to the current state and it disagrees with the agent's choice of action. The simulated user will continue to provide advice for as long as it is given opportunities, it has new rules to provide, and the new rules disagree with the agent's current behaviour. The full process flow for choosing an action using rule-based advice is depicted in Figure 5.

Figure 5: Process flow for a rule-based simulated user using a ripple-down rules model assisting a rule-based RL agent.

Multiple rule-based simulated users have been created to provide a range of different knowledge levels for the various environments (equivalent to the knowledge levels of the state-based simulated users shown in Table 3). Table 4 describes and provides the knowledge bases in use for each of the environments. A short description of each knowledge base is provided.

Table 4: Rule-based simulated user knowledge bases for the mountain car and self-driving car environments. The model is shown using ripple-down rules with a text representation.
Environment / User Name: Limits

Mountain Car / MC-FULL:
IF 1==1: EXPLORE
IF velocity > 0: GO RIGHT
NO TRUE NODE
IF velocity <= 0: GO LEFT

Mountain Car / MC-HALF:
IF 1==1: EXPLORE
IF position < -0.53: GO RIGHT
IF velocity >= 0: GO RIGHT
IF velocity < 0: GO LEFT
NO FALSE NODE

Mountain Car / MC-QUARTER:
IF 1==1: EXPLORE
IF position < -0.53 AND position > -0.86: GO RIGHT
IF velocity >= 0: GO RIGHT
IF velocity < 0: GO LEFT
NO FALSE NODE

Mountain Car / MC-MIDDLE:
IF 1==1: EXPLORE
IF position < -0.43 AND position > -0.63: GO RIGHT
IF velocity >= 0: GO RIGHT
IF velocity < 0: GO LEFT
NO FALSE NODE

Self-driving Car / SC-AVOID:
IF 1==1: EXPLORE
IF right OR right-front-close: TURN LEFT
NO TRUE NODE
IF left OR left-front-close: TURN RIGHT
NO FALSE NODE

The first experiment performed tests the use of probabilistic policy reuse (PPR) as an action selection method, compared to always using advice when available, within the mountain car environment. As aforementioned, the use of persistence in RL introduces a critical flaw. Specifically, if provided advice is retained and reused, and that advice is incorrect, then the agent will not be able to learn a solution to the current problem. Figure 6 shows the performance of three RL agents: an unassisted Q-learning (UQL) agent for benchmarking and two persistent IntRL agents using informative advice. These two interactive agents are identical except that one uses PPR for action selection, called the persistent informative reuse (PPR) agent
or PI-R (PPR), while the other will always take a recommended action if one exists for the current state, called the persistent informative reuse (No-PPR) agent or PI-R (No-PPR). Both interactive agents are assisted by a simulated user created with realistic values of accuracy and availability.

Figure 6: Probabilistic policy reuse and direct-use action selection for IntRL using retained informative advice (steps per episode). Both assisted agents use simulated users with realistic values for accuracy and availability, and both retain advice provided to them.

Figure 6 shows that both assisted agents immediately outperform the unassisted agent (UQL, in blue). Both agents retain the recommended actions from the user and cannot differentiate between correct and incorrect advice. The No-PPR agent (in red) will always take the recommended action for the current state, if available. This works well for the first few episodes, as small amounts of correct advice can have a large positive impact on agent performance and small amounts of incorrect advice can be ignored because of the momentum the agent builds in the mountain car environment. However, as the amount of incorrect recommended actions retained increases, the effect on the agent's performance increases. Eventually, the impact of taking the wrong action will have such an effect that the agent cannot build the required momentum to solve the task. Without the required momentum, the agent will get stuck in local minima.

The agent using probabilistic policy reuse (PPR) continues to outperform both the unassisted agent (UQL) and the other assisted agent (PI-R No-PPR). The PPR agent initially holds the user's advice in high regard, taking recommended actions 80% of the time. Over time, the agent pays less attention to the retained advice of the user and more to its own learned policy. This allows the agent to disregard incorrect advice, as its own value estimations will show the correct action to take, while correct advice will accelerate the discovery of the true value estimation of the correct action in advised states.

If human-sourced advice is 100% accurate for the problem being tested, the use of PPR may lower the potential performance of the agent. This is due to the PPR action selection policy disregarding accurate information and instead taking exploratory or local-minima actions. However, previous work [3] has shown that human-sourced information is not likely to be 100% correct, and as such, the use of PPR mitigates the risk of inaccurate information.
The second experiment tests non-persistent advice and state-based persistent advice, i.e., the provided advice is used and retained with PPR only in the given state. An unassisted Q-learning (UQL) agent is used for benchmarking. Simulated users are used for providing advice in the mountain car environment with three different initialisations, namely, optimistic, realistic, and pessimistic (denoted with the suffixes -O, -R, and -P respectively). Figures 7a and 7c show the performance over time for both non-persistent evaluative and informative agents, at varying levels of user accuracy and availability. These figures do not compare the two advice delivery styles against each other (i.e., evaluative and informative); rather, each is compared against its persistent counterpart. As evaluative advice evaluates actions that have already been taken, there is a short delay between the action being taken and the application of the advice to the agent. This delay causes latency in the effect of the advice on the agent's learned policy. Figure 7a shows this delay for the evaluative agent, where most of the advice is given in the first few episodes, but it takes around twenty episodes before the agent has fully utilised the advice and converges to an optimal path. The agent using informative advice, on the other hand, does not suffer from this delay (shown in Figure 7c). This agent receives recommendations on which action to take next, and if a recommendation
is provided, then the action is taken.

Figure 7: Steps per episode for four different agents using advice: (a) non-persistent evaluative, (b) persistent evaluative, (c) non-persistent informative, (d) persistent informative. The agents are assisted by three different simulated users, initialised with either Optimistic, Realistic, or Pessimistic values for accuracy and availability. The figure shows that the persistent agents learn in fewer steps in comparison to the non-persistent agents when assisted by sufficiently accurate users.

The first column in Table 5 shows the number of interactions that occurred, and the percentage of interactions compared to the number of steps performed, for all the non-persistent agent/simulated user combinations. These values align with the percentage availability of each of the simulated users. This is expected, as the agents are not retaining any advice received from the user, so users are required to continually give the same advice if the agent is to maximise its utility. An observation can be made that as the availability of the simulated user decreases, the number of interactions increases. Availability is the likelihood that the user will interact with the agent on any given time step, so it would be assumed that the higher the availability, the higher the number of interactions. However, this is not what is observed, because a more available advisor allows for faster convergence.

The simulated users that have been created have a direct correlation between the availability of the user and its accuracy. Simulated users that are highly accurate allow the agent to learn the optimal policy faster, which results in the agent taking fewer steps and the simulated user having fewer opportunities to provide advice. Simulated users with lower accuracy, such as the pessimistic users, cause the agent to take longer to learn the policy, resulting in the agent taking more steps and allowing the simulated user more opportunities to provide advice. This is what creates the inverse correlation between the user advice availability and the number of interactions recorded.

Figures 7a and 7b show evaluative agents, both non-persistent (NPE-*) and persistent (PE-*), using advice from three different users. The persistent agent (shown in Figure 7b) uses PPR to manage the trade-off between the advice received from the user, its own learned policy, and its exploration strategy. The persistent agent is limited to receiving only one interaction from the user per state-action pair. If the agent has already received some advice for the state-action pair in question, then the user is not given the opportunity to provide additional advice. The agent instead relies on the stored advice from the first interaction, regardless of its accuracy. Both agents will always utilise advice received directly from the user on the current time step. However, the persistent agent keeps it and follows a PPR strategy, which allows the agent to diminish the probability of using the advice for a state-action pair over time. This results in the persistent agent receiving one interaction per state-action pair, maximising the utility of that interaction, then eventually relying only on its own policy.

The agents being assisted by optimistically initialised users perform almost the same.
The optimistically-assisted persistent agent (PE-O) takes slightly longer to learn than its non-persistent counterpart (NPE-O), because after the initial interaction with the user its advice is only followed 80% of the time (diminishing over time), compared to the non-persistent agent, whose user will continually interact with the agent and whose advice is always followed. The agents being assisted by realistically initialised users differ greatly in performance. The non-persistent evaluative agent using a realistically initialised user (NPE-R), while able to solve the mountain car problem in fewer steps than the 1,000-step cut-off limit, was not able to find the optimal solution. However, the persistent evaluative agent (PE-R) was not only able to solve the problem, but also learned the solution faster than the benchmark unassisted agent (UQL), just like the NPE-O and PE-O agents. The difference in performance is not only due to the persistent agent remembering the advice, but also because it can eventually disregard incorrect advice, as the likelihood that the PPR algorithm will choose to take the recommended action diminishes over time, while the agent's value estimation of the recommended action remains the same. What is particularly notable from these results is that the persistent agent (PE-R) still outperformed unassisted Q-learning despite more than half of all interactions giving incorrect advice. Regardless of whether the agent is persistent or not, neither agent that was advised by a pessimistically-initialised user (NPE-P, PE-P) was able to solve the mountain car problem. This is likely due to the accuracy of the pessimistic user being less than 25%.

Figures 7c and 7d show the performance of informative agents, both non-persistent (NPI-*) and persistent (PI-*), using advice from three users with different levels of advice accuracy and availability. These agents can receive informative advice from a user. The advice that they receive is an action recommendation, informing them of which action to take in the current state. When either agent, persistent or non-persistent, receives an action recommendation directly from the user on the current time step, that action will be taken by the agent. The persistent agent will remember that action for the state it was received in, and use the PPR algorithm to continue to take that action in the future. Once the persistent agent has received an action recommendation from the user for a particular state, the user will not interact with the agent for that state in the future.

The informative agents (NPI-O, PI-O), regardless of persistence, learned the solution in the same amount of time when being advised by an optimistically initialised user. This is not surprising, as the agent is receiving 100% accurate advice for every time step, making this essentially a supervised learning task.

Table 5: Average number of interactions performed per experiment, and the percentage of interactions compared to the steps taken, for each non-persistent and persistent agent/user combination.
Agent | Non-persistent | Persistent
Unassisted Q-learning agent (UQL) | 0.00%
Evaluative agent / Optimistic user (NPE-O/PE-O) | 58,355 (100.00%) | 668 (0.91%)
Evaluative agent / Realistic user (NPE-R/PE-R) | 486,503 (26.86%) | 117 (0.01%)
Evaluative agent / Pessimistic user (NPE-P/PE-P) | 500,499 (13.43%) | 47 (<…%)

Using a rule-based persistent advice technique, we expect to reduce even further the number of interactions needed between the learning agent and the trainer, in comparison to state-based persistent advice. In this regard, we perform experiments using two different domains, namely, the mountain car domain and the self-driving car domain described in Section 4.
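To illustrate why a rule can replace many per-state interactions, the sketch below contrasts a state-based advice store, where one interaction advises exactly one state, with a rule-based store, where one interaction advises every state satisfying a condition. The predicate representation is a deliberately simplified assumption for this example and is not the exact rule formalism used by the agents in this work.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[float, ...]   # e.g. (position, velocity) in the mountain car domain
Action = int


class StateBasedAdvice:
    """One interaction advises exactly one state."""

    def __init__(self) -> None:
        self.advice: Dict[State, Action] = {}

    def add(self, state: State, action: Action) -> None:
        self.advice[state] = action

    def lookup(self, state: State) -> Optional[Action]:
        return self.advice.get(state)


class RuleBasedAdvice:
    """One interaction advises every state that satisfies the rule's condition."""

    def __init__(self) -> None:
        self.rules: List[Tuple[Callable[[State], bool], Action]] = []

    def add(self, condition: Callable[[State], bool], action: Action) -> None:
        self.rules.append((condition, action))

    def lookup(self, state: State) -> Optional[Action]:
        for condition, action in self.rules:
            if condition(state):
                return action
        return None


# Hypothetical example: a single rule such as "if velocity is negative, push left"
# covers the many individual states a state-based store would need to be advised
# about one interaction at a time (action 0 meaning "push left" is an assumption).
rules = RuleBasedAdvice()
rules.add(lambda s: s[1] < 0.0, 0)
```

Under such a representation, a single interaction that adds a rule can influence the agent's behaviour in states the trainer has never observed, which is consistent with the reduced interaction counts reported in the experiments that follow.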
First, we employ the mountain car domain using both state-based and rule-based advice. A total of eight different simulated users are created: four are used for state-based advice and four for rule-based advice. The users differ in their level of knowledge and their availability to deliver advice, i.e., full, half, quarter, and bottom availability (as shown in Table 4). Figure 8 shows the number of steps each agent performed per episode in this environment. Figure 8a shows the results for the state-based agents, and Figure 8b shows the rule-based agents. A comparison of the two graphs shows that the agents performed similarly, regardless of the advice delivery method. This was expected, as neither the way in which the agent uses the advice nor the total amount of advice the agent receives differs between the two types of agents. The agents using minimal advice (MCP-MID and MCRDR-MID) end up learning a worse behaviour than the unassisted Q-learning agent. This is likely an indication that the decay rate for the PPR action selection method is too low, and that the agent has not yet learned to ignore the user advice after its initial benefit and to focus on its own learning.

Figure 8: Step performance for state-based (a) and rule-based (b) IntRL agents for the mountain car domain. Despite using considerably less interaction from the trainer, results show no significant difference in performance between the two types of agents.

Table 6 shows the number of interactions, and the percentage of interactions over opportunities for interaction, for each agent. These results show that the number of interactions is much lower for the rule-based agents than for the state-based agents, allowing similar performance with much less effort from the trainer. In the previous experiment, the number of interactions was not a useful measure for comparing agents against each other, because the advice provided to the agent affects the number of steps the agent takes, which in turn changes the number of opportunities for interaction. However, Figure 8 shows that the performance of agents assisted by the same simulated user is the same regardless of the advice type. Therefore, in this context, the number of interactions is a useful measure for comparing the corresponding state-based and rule-based agents.
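The interaction counts in Tables 5 and 6 follow directly from how the simulated users behave. A minimal sketch of such a user is given below, under the assumption that availability is the probability of offering advice on any given time step and accuracy is the probability that offered advice recommends the correct action; the class name, parameter values, and the uniform choice of an incorrect action are illustrative only.

```python
import random
from typing import List, Optional


class SimulatedUser:
    """Sketch of a simulated trainer characterised by availability and accuracy."""

    def __init__(self, availability: float, accuracy: float, actions: List[int]):
        self.availability = availability  # probability of interacting on a given step
        self.accuracy = accuracy          # probability that given advice is correct
        self.actions = actions

    def advise(self, correct_action: int) -> Optional[int]:
        """Return an action recommendation, or None if the user stays silent."""
        if random.random() >= self.availability:
            return None                   # no interaction on this time step
        if random.random() < self.accuracy:
            return correct_action         # accurate advice
        # Inaccurate advice: recommend some other action (illustrative choice).
        return random.choice([a for a in self.actions if a != correct_action])


# Illustrative instantiations; the optimistic user matches the 100% accuracy and
# availability described in the text, while the other values are assumptions.
optimistic = SimulatedUser(availability=1.0, accuracy=1.0, actions=[0, 1, 2])
pessimistic = SimulatedUser(availability=0.15, accuracy=0.2, actions=[0, 1, 2])
```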
Table 6: Interaction percentage for state- and rule-based agents in the mountain car domain. Average number of interactions performed per experiment, and the percentage of interactions compared to the steps taken, for each state-based/rule-based agent/user combination.

Agent / User | Interactions (% of steps)
Unassisted Q-learning benchmark | 0 (0.00%)
State-based / Full (MCP-FULL) | 254 (<…%)

The aim of the agent in the self-driving car environment is to avoid collisions and maximise speed. In the experiments, we created two simulated users to provide state-based and rule-based advice. Both assisted agents outperformed the unassisted Q-learning agent, achieving both a higher step count and a higher reward. The obtained steps and reward are shown in Figure 9a and Figure 9b, respectively. Although the agent was forcibly terminated when it reached 3000 steps, Figure 9a shows that the agents never reached the 3000-step limit. This is because the agents are given a random starting position and velocity at the beginning of each episode, some of which result in scenarios where the agent cannot avoid a crash. Although both assisted agents outperformed the unassisted agent, there is no considerable difference between the state-based and rule-based methods, since both run a similar number of steps and collect a similar reward.

Figure 9: Steps per episode (a) and reward per episode (b) for the benchmark, state-based, and rule-based IntRL agents in the self-driving car domain. The advice required from the trainer is considerably less, with no significant difference in performance between the two types of agents.

Table 7 shows the number of interactions, and the percentage of interactions compared to the number of opportunities for interaction (equal to steps), for each agent. These results show that the number of interactions is much lower for the rule-based agents than for the state-based agents.
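The interaction percentage reported in Tables 5 to 7 is simply the number of interactions divided by the number of opportunities to interact, which in these experiments equals the number of steps taken. A trivial helper makes the metric explicit (the function name and example figures are illustrative, not values from the experiments):

```python
def interaction_percentage(num_interactions: int, total_steps: int) -> float:
    """Percentage of time steps on which the trainer actually interacted."""
    if total_steps == 0:
        return 0.0
    return 100.0 * num_interactions / total_steps


# Hypothetical usage: 254 interactions over 60,000 steps is roughly 0.42%.
print(f"{interaction_percentage(254, 60_000):.2f}%")
```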
Table 7: Average number of interactions performed per experiment, and the percentage of interactions compared to the steps taken, for each state-based/rule-based agent/user combination in the self-driving car domain.

Agent | Interactions (% of steps)
Unassisted Q-learning benchmark | 0 (0.00%)
State-based self-driving car agent (STATE-ADVICE) | 232 (<…%)

In this work, we have introduced the concept of persistence in interactive reinforcement learning. Current methods do not allow the agent to retain the advice provided by assisting users. This may be due to the effect that incorrect advice has on an agent's performance. To mitigate the risk that inaccurate information poses to agent learning, probabilistic policy reuse was employed to manage the trade-off between the advice received from the user, the agent's own learned policy, and its exploration strategy.

Acknowledgments
This work has been partially supported by the Australian Government Research Training Program (RTP) and the RTP Fee-Offset Scholarship through Federation University Australia.
References

[1] C. Arzate Cruz and T. Igarashi, “A survey on interactive reinforcement learning: Design principles and open challenges,” in Proceedings of the 2020 ACM Designing Interactive Systems Conference, pp. 1195–1209, 2020.

[2] J. Lin, Z. Ma, R. Gomez, K. Nakamura, B. He, and G. Li, “A review on interactive reinforcement learning from human social feedback,” IEEE Access, vol. 8, pp. 120757–120765, 2020.

[3] A. Bignold, F. Cruz, R. Dazeley, P. Vamplew, and C. Foale, “Human engagement providing evaluative and informative advice for interactive reinforcement learning,” arXiv preprint arXiv:2009.09575, 2020.

[4] W. B. Knox and P. Stone, “Interactively shaping agents via human reinforcement: The TAMER framework,” in Proceedings of the Fifth International Conference on Knowledge Capture, pp. 9–16, ACM, 2009.

[5] F. Cruz, S. Magg, Y. Nagai, and S. Wermter, “Improving interactive reinforcement learning: What makes a good teacher?,” Connection Science, vol. 30, no. 3, pp. 306–325, 2018.

[6] S. Griffith, K. Subramanian, J. Scholz, C. Isbell, and A. L. Thomaz, “Policy shaping: Integrating human feedback with reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 2625–2633, 2013.

[7] W. B. Knox and P. Stone, “Combining manual feedback with subsequent MDP reward signals for reinforcement learning,” in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1, pp. 5–12, International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[8] M. E. Taylor, N. Carboni, A. Fachantidis, I. Vlahavas, and L. Torrey, “Reinforcement learning agents providing advice in complex video games,” Connection Science, vol. 26, no. 1, pp. 45–63, 2014.

[9] A. L. Thomaz, G. Hoffman, and C. Breazeal, “Real-time interactive reinforcement learning for robots,” in AAAI 2005 Workshop on Human Comprehensible Machine Learning, 2005.

[10] A. Ayala, C. Henríquez, and F. Cruz, “Reinforcement learning using continuous states and interactive feedback,” in Proceedings of the International Conference on Applications of Intelligent Systems, pp. 1–5, 2019.

[11] C. Millán, B. Fernandes, and F. Cruz, “Human feedback in continuous actor-critic reinforcement learning,” in Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning ESANN, pp. 661–666, ESANN, 2019.

[12] P. M. Pilarski and R. S. Sutton, “Between instruction and reward: human-prompted switching,” in AAAI Fall Symposium Series: Robots Learning Interactively from Human Teachers, pp. 45–52, 2012.

[13] K. Subramanian, C. L. Isbell Jr., and A. L. Thomaz, “Exploration from demonstration for interactive reinforcement learning,” in Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 447–456, International Foundation for Autonomous Agents and Multiagent Systems, 2016.

[14] I. Moreira, J. Rivas, F. Cruz, R. Dazeley, A. Ayala, and B. Fernandes, “Deep reinforcement learning with interactive feedback in a human-robot environment,” Submitted to MDPI Electronics, 2020.

[15] C. Millán-Arias, B. Fernandes, F. Cruz, R. Dazeley, and S. Fernandes, “A robust approach for continuous interactive reinforcement learning,” in Proceedings of the 8th International Conference on Human-Agent Interaction, pp. 278–280, 2020.

[16] A. Bignold, F. Cruz, M. E. Taylor, T. Brys, R. Dazeley, P. Vamplew, and C. Foale, “A conceptual framework for externally-influenced agents: An assisted reinforcement learning review,” arXiv preprint arXiv:2007.01544, 2020.

[17] F. Cruz, S. Magg, C. Weber, and S. Wermter, “Improving reinforcement learning with interactive feedback and affordances,” in Proceedings of the Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics ICDL-EpiRob, pp. 165–170, IEEE, 2014.

[18] L. Torrey and M. E. Taylor, “Teaching on a budget: Agents advising agents in reinforcement learning,” in Proceedings of the International Conference on Autonomous Agents and Multiagent Systems AAMAS, 2013.

[19] F. Cruz, P. Wüppen, S. Magg, A. Fazrie, and S. Wermter, “Agent-advising approaches in an interactive reinforcement learning scenario,” in Proceedings of the Joint IEEE International Conference on Development and Learning and Epigenetic Robotics ICDL-EpiRob, pp. 209–214, IEEE, 2017.

[20] R. Contreras, A. Ayala, and F. Cruz, “Unmanned aerial vehicle control through domain-based automatic speech recognition,” Computers, vol. 9, no. 3, p. 75, 2020.

[21] F. Cruz, J. Twiefel, S. Magg, C. Weber, and S. Wermter, “Interactive reinforcement learning through speech guidance in a domestic scenario,” in Neural Networks (IJCNN), 2015 International Joint Conference on, pp. 1–8, IEEE, 2015.

[22] G. López, L. Quesada, and L. A. Guerrero, “Alexa vs. Siri vs. Cortana vs. Google Assistant: a comparison of speech-based natural user interfaces,” in International Conference on Applied Human Factors and Ergonomics, pp. 241–250, Springer, 2017.

[23] F. Cruz, G. I. Parisi, and S. Wermter, “Multi-modal feedback for affordance-driven interactive reinforcement learning,” in Proceedings of the International Joint Conference on Neural Networks IJCNN, pp. 5515–5122, IEEE, 2018.

[24] N. Churamani, F. Cruz, S. Griffiths, and P. Barros, “iCub: learning emotion expressions using human reward,” arXiv preprint arXiv:2003.13483, 2020.

[25] S. W. Kwok and C. Carter, “Multiple decision trees,” in Machine Intelligence and Pattern Recognition, vol. 9, pp. 327–335, Elsevier, 1990.

[26] L. Rokach and O. Maimon, “Decision trees,” in Data Mining and Knowledge Discovery Handbook, pp. 165–192, Springer, 2005.

[27] D. Richards, “Two decades of ripple down rules research,” The Knowledge Engineering Review, vol. 24, no. 2, pp. 159–184, 2009.

[28] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[29] L. Breiman, Classification and Regression Trees. Routledge, 2017.

[30] B. Kang, P. Compton, and P. Preston, “Multiple classification ripple down rules: evaluation and possibilities,” in Proceedings 9th Banff Knowledge Acquisition for Knowledge Based Systems Workshop, vol. 1, pp. 17–1, 1995.

[31] P. Compton, G. Edwards, B. Kang, L. Lazarus, R. Malor, T. Menzies, P. Preston, A. Srinivasan, and C. Sammut, “Ripple down rules: possibilities and limitations,” in Proceedings of the Sixth AAAI Knowledge Acquisition for Knowledge-Based Systems Workshop, Calgary, Canada, University of Calgary, pp. 6–1, 1991.

[32] D. Herbert and B. H. Kang, “Intelligent conversation system using multiple classification ripple down rules and conversational context,” Expert Systems with Applications, vol. 112, pp. 342–352, 2018.

[33] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping,” in ICML, vol. 98, pp. 463–471, Citeseer, 1998.

[34] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proceedings of the International Conference on Machine Learning ICML, vol. 99, pp. 278–287, 1999.

[35] S. Devlin and D. Kudenko, “Theoretical considerations of potential-based reward shaping for multi-agent systems,” in The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pp. 225–232, International Foundation for Autonomous Agents and Multiagent Systems, 2011.

[36] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé, “Expressing arbitrary reward functions as potential-based advice,” in AAAI, pp. 2652–2658, 2015.

[37] F. Fernández and M. Veloso, “Probabilistic policy reuse in a reinforcement learning agent,” in Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multi-Agent Systems, pp. 720–727, ACM, 2006.

[38] A. Bignold, F. Cruz, R. Dazeley, P. Vamplew, and C. Foale, “An evaluation methodology for interactive reinforcement learning with simulated users,” arXiv preprint, 2021.

[39] R. Dazeley and B. Kang, “An online classification and prediction hybrid system for knowledge discovery in databases,” in AISAT 2004 The 2nd International Conference on Artificial Intelligence in Science and Technology, vol. 1, pp. 114–119, 2004.

[40] B. H. Kang, P. Preston, and P. Compton, “Simulated expert evaluation of multiple classification ripple down rules,” in Proceedings of the 11th Workshop on Knowledge Acquisition, Modeling and Management, 1998.

[41] P. Compton, P. Preston, and B. Kang, “The use of simulated experts in evaluating knowledge acquisition,” University of Calgary, 1995.

[42] B. R. Gaines and P. Compton, “Induction of ripple-down rules applied to modeling large databases,” Journal of Intelligent Information Systems, vol. 5, no. 3, pp. 211–228, 1995.

[43] P. Compton, L. Peters, G. Edwards, and T. G. Lavers, “Experience with ripple-down rules,” in