Human Engagement Providing Evaluative and Informative Advice for Interactive Reinforcement Learning
Adam Bignold, Francisco Cruz, Richard Dazeley, Peter Vamplew, Cameron Foale
School of Engineering, Information Technology and Physical Sciences, Federation University, Ballarat, Australia.
School of Information Technology, Deakin University, Geelong, Australia.
Escuela de Ingeniería, Universidad Central de Chile, Santiago, Chile.
Corresponding e-mails: {a.bignold, p.vamplew, c.foale}@federation.edu.au, {francisco.cruz, richard.dazeley}@deakin.edu.au
† Adam Bignold and Francisco Cruz contributed equally to this manuscript.
Abstract
Reinforcement learning is an approach used by intelligent agents to autonomously learn new skills. Although reinforcement learning has been demonstrated to be an effective learning approach in several different contexts, a common drawback is the time needed to satisfactorily learn a task, especially in large state-action spaces. To address this issue, interactive reinforcement learning proposes the use of externally-sourced information in order to speed up the learning process. Up to now, different information sources have been used to give advice to the learner agent, among them human-sourced advice. When interacting with a learner agent, humans may provide either evaluative or informative advice. From the agent's perspective, these styles of interaction are commonly referred to as reward-shaping and policy-shaping respectively. Evaluative advice (reward-shaping) requires the human to provide feedback on the prior action performed, while with informative advice (policy-shaping) the human suggests the best action to select in a given situation. Prior research has focused on the effect of human-sourced advice on the interactive reinforcement learning process, specifically aiming to improve the learning speed of the agent while reducing the engagement with the human. This work focuses on answering which of the two approaches, evaluative or informative, is the preferred instructional approach for humans. Moreover, this work presents an experimental setup for a human trial designed to compare the methods people use to deliver advice in terms of human engagement. The obtained results show that users giving informative advice to the learner agents provide more accurate advice, are willing to assist the learner agent for a longer time, and provide more advice per episode. Additionally, self-evaluations from participants using the informative approach indicate that they perceive the agent's ability to follow the advice as higher, and therefore they consider their own advice to be of higher accuracy when compared to people providing evaluative advice.
Keywords: interactive reinforcement learning, assisted reinforcement learning, evaluative and informative advice, reward-shaping, policy-shaping, user study.
1 Introduction

Reinforcement Learning (RL) aims at the creation of agents and systems that are capable of functioning in real-world environments [1]. A common RL task involves decision-making and control: given some information about the current state of the environment, the agent must determine the best action to take in order to maximise long-term success. In this regard, RL allows the agent to improve its decision-making while operating, to learn without supervision, and to adapt to changing circumstances [2]. In classical, autonomous RL [1] the agent interacts with its environment, learning by trial-and-error. The agent explores the environment and learns solely from the rewards it receives (see the grey box within Figure 1). RL has shown success in different domains such as inventory management [3], robot scenarios [4, 5], and game environments [6, 7], among others. However, RL has difficulty learning in large state spaces. As environments become larger, the agent's training time increases and finding a solution can become impractical [8, 9].

Interactive Reinforcement Learning (IntRL) is an alternative to RL in which an advisor interacts with an RL agent in real-time [10]. The advisor can provide extra information to the agent regarding its behaviour or future actions it should perform. In this regard, the advice can be either evaluative or informative [11]. The former is an evaluation the advisor gives to the agent indicating how good or bad the last action performed was. The latter is a suggestion given to the agent indicating what action to perform next from the current state. Human advisors are usually used in IntRL since they achieve good performance in areas such as problem-solving, forward planning, and teaching. Moreover, they have a large collection of knowledge and experiences to draw upon when encountering new environments and problems [12]. IntRL utilises these skills of humans to assist the agent with its own learning and decision-making. This approach has been shown to considerably improve the agent's learning speed and can allow RL to scale to larger or more complex problems [13]. Figure 1 shows the IntRL approach with a human advisor included, providing either evaluative or informative advice to the learner agent.

Figure 1: Interactive reinforcement learning approach. In classical, autonomous reinforcement learning, the learner agent performs an action $a_t$ from a state $s_t$ and the environment produces a response leading the agent to a new state $s_{t+1}$ and receiving a reward $r_{t+1}$. Interactive reinforcement learning adds a human advisor for assistance. Whereas the advisor also observes the environment's response, they can provide either evaluative or informative advice to the learner agent.

There are two major barriers to humans providing information to RL agents. The first is the time required by the human. In this regard, it is important that the mechanisms used to provide advice to the agent serve to reduce the number of interactions required. The second barrier is the skill needed by the human to provide the information. Humans usually need both programming skills and knowledge of the problem dynamics to encode information relevant to the agent's learning [14, 15]. A principle of IntRL is that the method to provide information to the agent should be understandable and usable by people without programming skills or deep problem domain expertise [10, 16]. Therefore, the time required by a human advisor should remain as low as possible to reduce the burden on the human, and methods for providing information to an agent should be accessible to users without programming or machine learning expertise.

In this work, we aim to reduce the obligation of the human advisor while improving the learning speed of the agent. We address the question of which of the approaches, evaluative or informative, is the preferred instructional approach for humans. To this aim, we carry out an analysis of human engagement with twenty participants with no prior knowledge of machine learning techniques. In our experiments, ten users give evaluative advice to the RL agent while ten users give informative advice in a simulated scenario. From the performed interactions, we analyse the advice accuracy and the advice availability of each assistive approach. We also present an analysis of how evaluative advice may be affected by reward bias when teaching the RL agent.

Therefore, this work studies the distinction between advice delivery styles, i.e., evaluative and informative (also known as reward-shaping and policy-shaping respectively), and how humans engage with and prefer to teach artificial agents. While the evaluative and informative approaches are about the method used to instruct the agent, the reward-shaping and policy-shaping methods are about how the agent incorporates the provided advice, thus considering the agent's viewpoint.

This work is organised in the following sections. Section 2 presents an overview of prior research on evaluative and informative advice, including a discussion of how they compare; it also discusses prior studies of human engagement involving these two interactive approaches. Section 3 introduces the experimental methodology used in this work, including further details of IntRL framed within the assisted RL taxonomy and how human-sourced advice has been obtained. Section 4 describes the IntRL scenario used during the experiments; this includes the key features of the environment along with the interactions related to the particular scenario with the participants in the experiment. Section 5 presents the results, including the users' self-evaluation of the experience and the characteristics of the interactive steps in terms of the frequency, accuracy, and availability of the advice. Finally, in Section 6 the main conclusions obtained from this work are presented.

2 Evaluative and Informative Advice

Learning from the ground up can be a challenging task.
While humans and artificial agents using RL are both capable of learning new tasks, it is evident that any extra information regarding the task can significantly reduce the learning time [17, 18, 19]. Humans can get advice from peers, teachers, the Internet, books, or videos, among other sources. By incorporating advice, humans can learn what the correct behaviour looks like, build upon existing knowledge, evaluate current behaviour, and ultimately reduce the amount of time spent performing the wrong actions [20]. For artificial agents, the benefits of advice are the same. For instance, advice may be used to construct or supplement the reward function, resulting in an improved evaluation of the agent's actions or an increased utility of the reward function, requiring fewer experiences to learn a behaviour [21, 22]. The advice can also be used to influence the agent's policy, either directly or through the action selection method, in order to reduce the search space.

There are many possible information sources for agents to use. For instance, external information can come from databases [23], labelled sets [24, 25], cases [26, 27], past experiences [28], other agents [29, 30], contextual perception [31], and from humans [32]. Human-supplied advice is contextually relevant information that comes from a human as a result of observation or awareness of the agent's current behaviour or goal. This information is commonly used to supplement, construct, or alter the RL process. Human-sourced advice can be noisier, less accurate, and less consistent than other information sources. However, the critical benefit is that the advice is contextually relevant and can be applied to aid the agent in its current situation or goal.

IntRL may use human-sourced advice [33] or simulated users [34] to directly interact with the agent while it is learning/operating [10]. The focus of IntRL is limited to the use of advice during the learning process, not before or after. This limitation requires interactive techniques to be easy for an agent to get information from, and for humans to add information to, so that the learning process is not slowed down. This limitation also means that the agent or policy should not be reset when new information is provided, as that is conceptually similar to creating a new agent rather than interacting with an existing one. When humans interact with the agent, they may either provide additional rewards in response to the agent's performance [35] or recommend actions to the agent to guide the exploration process [36].
2.1 Evaluative Advice

Evaluative advice is information that critiques the current or past behaviour of an agent [37, 38]. Advice that supplements, improves, or creates a reward function is considered to be evaluative, as it is a reaction to an agent's behaviour rather than a direct influence on an agent's decision-making. The source of the advice is what separates evaluative advice from the reward function. A typical reward function is defined for a specific environment, whereas evaluative advice originates from an observer of the agent or other external sources [39, 15]. Figure 2 shows, in green, evaluative advisors supplementing the reward received from the environment.

Humans providing evaluative advice do not need to know the solution to a problem [40]; it is enough for them to be able to assess the result of an action and then decide whether it was the correct action to take. For instance, in the training an agent manually via evaluative reinforcement (TAMER) framework [41, 42], a human user continually critiques the RL agent's actions. The human observes the agent and, in response to the agent's actions, provides a simple yes/no evaluation of its choice of action. This Boolean evaluation acts as an additional reward signal, supplementing the reward function from the environment. This bare minimum of human influence is enough to significantly decrease the time required by the agent to learn the required task [41].

Another example of evaluative advice is the convergent actor-critic by humans (COACH) approach [43]. In this approach, a human trainer may give positive or negative feedback to a virtual dog learning to reach a goal position. The human feedback is divided into punishment and reward and labelled with different levels such as 'mild electroshock', 'bad dog', 'good dog', and 'treats'. Using COACH, the agent was able to learn the task under multiple feedback strategies. Recently, this approach has been extended as Deep COACH [44], in which the agent policy is represented by deep neural networks.
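To make the reward-shaping mechanism concrete, the sketch below shows one minimal way evaluative advice can be folded into a tabular Q-learning update. It is an illustration only, not the TAMER implementation; the helper `get_human_evaluation` and the simple additive combination of environment reward and human evaluation are our own assumptions.

```python
import random
from collections import defaultdict

# Minimal sketch of evaluative advice as reward shaping in tabular
# Q-learning. The additive reward combination and the helper
# get_human_evaluation() are illustrative assumptions, not the exact
# TAMER formulation.

ALPHA, GAMMA = 0.25, 0.9      # learning rate and discount factor
q_table = defaultdict(float)  # maps (state, action) -> value

def get_human_evaluation():
    """Placeholder for the advisor's binary critique of the last action.
    Returns +1 (good), -1 (bad), or 0 when no advice is given."""
    return random.choice([+1, -1, 0])

def shaped_update(state, action, env_reward, next_state, actions):
    # The human's evaluation supplements the environment reward.
    reward = env_reward + get_human_evaluation()
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])
```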
Figure 2: Interactive reinforcement learning approach using evaluative and informative advice. While the informative advisor may suggest an action $a_s$ to be performed by the agent, the evaluative advisor may suggest a reward $r_s$ to supplement the reward obtained from the environment.
2.2 Informative Advice

Informative advice is information that aids an agent in its decision-making [45, 46]. Advice that recommends actions to take or avoid, suggests exploration strategies, provides information about the environment, or proactively alters what action an agent may take is considered to be informative. Informative methods primarily focus on transferring information from the human and encoding it into the agent's policy, either directly, by altering the policy, or indirectly, by influencing the agent's decision-making process [47]. Figure 2 shows, in brown, informative advisors suggesting an action to be taken.

Providing informative advice can be challenging for two reasons, the first of which is the human factor. Informative advice typically requires the human to know what the correct action is for a given state ahead of time. Not only does this require a greater understanding of the environment and the agent's position within it, but it also requires a more substantial commitment of time and effort to provide the advice. The time and effort required increase as the size of the environment and the number of available actions increase [48]. The second reason utilising informative advice is challenging is that encoding information sourced from a human into a form an agent can understand can be a complicated process, as informative advice is more informationally dense than evaluative advice [49].

For instance, an implementation of informative advice in IntRL is the ADVISE algorithm [50]. In ADVISE, a human observing an agent in operation can recommend actions to take at any given step, which the agent may choose to follow. This methodology allows the human to guide the agent through parts of the environment with which they are familiar. This can result in a significant improvement over existing IntRL methods and a reduced need for exploration. Another example of informative advice was presented in [17], in which a robot learned a cleaning task using human-provided interactive feedback. In this domestic scenario, seven actions could be advised to the agent using multi-modal audiovisual feedback. The provided advice was integrated into the learning process with an affordance-driven [31] IntRL approach. In the experiments, the robot collected more reward, and collected it faster, when tested against different minimal confidence level thresholds and different levels of affordance availability.
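As a counterpart to the reward-shaping sketch above, the following sketch illustrates policy shaping: informative advice overrides the agent's own action selection whenever a recommendation is available. The helper `get_recommended_action` and the rule of unconditionally following the advice are illustrative assumptions, not the ADVISE algorithm itself.

```python
import random

# Minimal policy-shaping sketch: when the advisor recommends an action,
# the agent follows it; otherwise it falls back to epsilon-greedy
# selection over its own Q-values. Unconditionally following the advice
# is a simplifying assumption.

EPSILON = 0.1

def get_recommended_action(state):
    """Placeholder for the advisor's recommendation; None when the
    human chooses not to advise in this state."""
    return None

def select_action(state, actions, q_table):
    advice = get_recommended_action(state)
    if advice is not None:
        return advice  # informative advice directly shapes the policy
    if random.random() < EPSILON:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: q_table[(state, a)])  # exploit
```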
2.3 Comparing Evaluative and Informative Advice

Evaluative advice has been more widely utilised in prior research, as implementations are simpler to encode given that the focus tends to be on the result of a decision rather than on what decision should be made [51]. It is easier to determine whether an action was correct or incorrect once the result of the action is available. Most implementations of evaluative advice alter or supplement the reward function of the environment. Encoding information to alter the reward function is generally straightforward, as the primary focus is on whether to increase or decrease the reward given to the agent, as opposed to informative implementations, which attempt to alter the decision-making policy [52]. Additionally, providing an evaluation requires less human effort than determining what information or action is relevant for a given state, as the information sought is typically a Boolean or a scalar measurement. Overall, evaluative advice is more direct to obtain, implement, and encode than its informative counterpart.

Informative advice tends to be more informationally dense than evaluative advice. While this does make sourcing and encoding the information difficult, it does provide more benefit to the agent [51]. Evaluative advice only reinforces behaviour after that behaviour has been exhibited, whereas informative advice can promote or discourage behaviour before it is exhibited. Advice that recommends taking or avoiding actions will reduce the search space for the agent, resulting in improved learning time. The downside of this is that if the agent never performs actions that are preemptively discouraged, and the advice is not optimal, then the optimal policy may not be found [53].

A direct comparison of the two styles is difficult, as implementations of human-sourced advice vary. Griffith et al. [50] compared the effects of informative versus evaluative advice on artificial agents using their informative algorithm ADVISE against the evaluative algorithm TAMER. Both algorithms utilise IntRL agents, and advice is given on a step-by-step basis. The ADVISE algorithm prompts the advisor for a recommended action which the agent can then follow, while TAMER prompts the advisor for a binary evaluation of the previously taken action. In the experiments, each agent is assisted by a simulated human, making the advice comparable.

The ADVISE algorithm allows the advisor to recommend an action, and therefore the number of bits of information provided is equal to $\log_2(n_a)$, where $n_a$ is the number of possible actions (e.g., if there are eight possible actions, $n_a = 8$, then each piece of informative advice provides three bits of information). In contrast, TAMER allows the human to provide a binary evaluation (i.e., correct/incorrect), which provides only a single bit of information. Therefore the information gain from ADVISE is greater than from TAMER and may bias the results. However, the experiments show that informative advice is more beneficial to the agent regardless of advice accuracy in the majority of cases. The use of a simulated human as an oracle in these experiments allowed for the provision of consistent advice that does not suffer from biases introduced by real humans. However, if the behaviour of actual human advice-givers differs from that of the simulated human in terms of either accuracy and/or engagement, then the impact on agent behaviour may not reflect that observed in this study. Therefore it is important to develop an understanding of the properties of actual human advice.
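The information-content comparison above is easy to verify numerically; the short sketch below is our own illustration of how the bits carried by one piece of informative advice compare with the single bit of a binary evaluation.

```python
import math

# One informative recommendation among n_a equally likely actions
# carries log2(n_a) bits; a binary evaluation carries 1 bit.
for n_a in (2, 3, 8):
    informative_bits = math.log2(n_a)
    print(f"n_a = {n_a}: informative = {informative_bits:.2f} bits, "
          f"evaluative = 1 bit")
# n_a = 8 reproduces the three-bit example given in the text.
```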
2.4 Human Engagement

Human engagement and teaching styles when interacting with machine learning agents have previously been studied [16, 54]; however, these studies have mainly focused on assessing human commitment independently of the type of advice. For instance, Amershi et al. [16] presented a comprehensive study looking at the engagement between humans and interactive machine learning. The study included some case studies demonstrating the use of humans as information sources in machine learning. This work highlighted the need for an increased understanding of how humans engage with machine learning algorithms, and of what teaching styles the users prefer.

A study by Thomaz and Breazeal [55], later confirmed by Knox and Stone [56], found that human tutors tend to have a positive bias when teaching machines, opting to reward rather than punish RL agents. This bias leads to agents favouring the rewards provided by the human over the reward function of the environment. The positive bias was observed in humans providing evaluative advice, as it tends to be provided as a reward [55]. Due to its characteristics, no such bias has yet been tested for or observed in informative-assisted agents. Knox and Stone [57] later mitigated the consequences of the positive bias in RL agents by developing an agent that valued human reward gained in the long term rather than the short term.

Another study, performed by Cakmak and Thomaz [58], investigated the strategy of teachers when tutoring machine learning agents. The study found that humans providing advice to a system over an extended period experienced frustration and boredom when bombarded with questions from the agent. The stream of questions to the teachers caused some participants to "turn their brain off" or "lose track of what they were teaching" according to self-reports [59]. Similar results were obtained using a movie recommendation system developed for Netflix, where participants were repeatedly asked to state whether the system was right or wrong [60, 61].

The previous studies suggest that participants do not like being prompted for input repeatedly, particularly when the input can be repetitive. Current IntRL systems do not prompt the user for information, instead allowing the advisor to step in whenever they wish. Nevertheless, input into these systems is repetitive and requires the users to provide advice on a state-by-state basis [36], leaving current systems susceptible to the same issues of frustration and interruption as the active learning systems reported. Regardless, it is still not clear whether these issues translate to IntRL scenarios. Therefore, the remainder of this paper reports the details and results of an experiment carried out to establish the characteristics of advice provided by humans interacting with an IntRL agent, and to assess whether these properties alter depending on whether evaluative or informative advice is being provided.
3 Methodology

In this section, we describe the IntRL methodology used during the experiments and frame the approach within an assisted RL framework. Moreover, we outline the method used to collect human advice, including the participants' characteristics, the induction process, experiment details, and the after-experience questionnaire.
3.1 Assisted Reinforcement Learning

Assisted reinforcement learning (ARL) [15] is a general framework proposed to incorporate external information into traditional RL. The framework uses a conceptual taxonomy including processing components and communication links to describe the transmission, modification, and modality of the sourced information. The processing components comprise the information source, advice interpretation, external model, and assisted agent, whereas the communication links are temporality, advice structure, and agent modification. ARL agents aim to gather as much information from an external source as possible, as this can lead to improved performance within the environment. A concrete example of an ARL agent is an IntRL agent. As previously mentioned, an IntRL agent can be advised with externally-sourced information to support the learning process at any time during training.
Figure 3: Interactive reinforcement learning method used to compare human engagement with evaluative and informative advice. The method is presented using the assisted reinforcement learning taxonomy, defining processing components (dotted red squares) and communication links (underlined green parallelograms) for each advice delivery style. The evaluative and informative methods differ in advice interpretation, advice structure, and agent modification. All the other processing components and communication links are common to both and located at the centre.

In this work, two different learner agents attempt to solve the Mountain Car problem [1] using IntRL (more details about the experimental problem are given in the next section); the first agent accepts evaluative advice and the other receives informative advice. Figure 3 shows the IntRL approach framed within the ARL framework [15] using both evaluative and informative advice. The figure shows the processing components using dotted red squares and the communication links using green parallelograms with underlined text. Using the ARL taxonomy, some processing components and communication links are adopted similarly by both approaches. The common elements are the information source, temporality, external model, and assisted agent, which are adopted by the ARL framework as human-sourced advice, interactive assistance, an immediate model, and a Q-learning agent. All the other processing components and communication links differ between evaluative and informative advice. For the evaluative approach, advice interpretation, advice structure, and agent modification are adopted by the ARL framework as binary advice to reward conversion, state-action pair value, and reward-shaping respectively. For the informative approach, they are adopted as advice to action selection conversion, state-action lookup, and policy-shaping respectively. As this approach relies on human trainers as an external information source, the greater the human engagement, the greater the opportunity to transfer knowledge to the agent. The accuracy of the advice and the information gained as a result of the advice provided are also important, as they contribute to the policy being learned by the agent [53].

We aim to measure the human engagement, advice accuracy, and information gain for evaluative and informative advice in IntRL. To this aim, we perform experiments using two IntRL agents implemented with the temporal-difference learning method Q-learning. The performance of the agents, i.e., their ability to solve the problem, is not the main focus of this paper. A comparison of evaluative and informative advice in terms of the performance of the agents has been investigated in a prior study [50].

In the context of this work, human engagement is a measure of the number of interactions, the total time spent constructing interactions, and the distribution of interactions over the time the agent is operating. The observing human is given an opportunity to provide information once per step of the agent, and if the human does provide some advice during that step, then the interaction is recorded. However, a measure of the number of interactions is not sufficient, as the time and effort required to provide an interaction may differ between informative and evaluative advice methods. As a result, the interaction time is also recorded. Moreover, the accuracy of the information provided to the agent affects its performance within the environment [53].
In this regard, advice accuracy is a measure of how accurate the information provided by the human is, compared to the optimal action to take in each state the agent encounters. It can be calculated by comparing the advice provided by the human against the known optimal policy for the task.
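The two measures just defined reduce to simple ratios; the sketch below shows one way they could be computed from a log of interactions. The record format and the `optimal_action` helper are illustrative assumptions rather than the authors' actual analysis code.

```python
# Minimal sketch of the engagement and accuracy measures. Each logged
# interaction is assumed to be a (state, advised_action) pair, and
# optimal_action(state) is assumed to return the known optimal action.

def engagement_rate(num_interactions, num_steps):
    """Fraction of steps in which the advisor chose to give advice."""
    return num_interactions / num_steps

def advice_accuracy(interactions, optimal_action):
    """Fraction of interactions whose advice matched the optimal policy."""
    correct = sum(1 for state, advised in interactions
                  if advised == optimal_action(state))
    return correct / len(interactions)
```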
3.2 Collecting Human Advice

Twenty people participated in the experiments, ten for each advice delivery style. Each participant was able to communicate with an RL agent while observing its current state and performance. A participant interacting with the evaluative agent had the option of providing an agreement or disagreement (yes/no) with the agent's choice of action for the last time step. This binary evaluation was then used by the agent to supplement the reward it received from the environment. A positive evaluation added +1 to the reward, while a negative evaluation subtracted 1 from it. A participant interacting with the informative agent could instead recommend the action for the agent to take next, i.e., to accelerate left or right.

4 Experimental Scenario

In this section, we describe the key features of the experimental environment, including the agent's representation, the state and action representation, and the reward function. Furthermore, we complement the human-agent interaction methodology described in the previous section by indicating the script given to the participants.
4.1 The Mountain Car Environment

The Mountain Car environment is a standard continuous-state testing domain for RL [1, 62]. In the environment, an underpowered car must drive from the bottom of a valley to the top of a steep hill. Since the gravity in the environment is stronger than the engine of the car, the car cannot drive straight up the side of the mountain. In order for the car to reach the top of the mountain, it must build up enough inertia and velocity. Figure 4 illustrates the Mountain Car environment and its key features.

In our experiments, an RL agent controls the actions of the car. The car begins at a random position, with a low velocity, somewhere within the starting region. In order to reach the goal position, the agent must build up enough momentum. To do so, the agent accelerates towards the goal until its velocity is reduced to zero by gravity. At this point, the agent turns and accelerates in the other direction toward the highest possible position, again until its velocity is reduced to zero. Finally, the agent accelerates down the hill again, building up velocity to reach the goal state. Should the agent not reach high enough up the mountain to reach the goal position, it should repeat the process of accelerating in the opposite direction until a zero velocity is reached and then turning around.

The key to the agent solving the Mountain Car problem is to increase its own velocity ($v$). The agent's mass ($m$), the magnitude of acceleration ($a$), and the force of gravity ($G$) are constant. As the agent's acceleration is lower than the gravity acting upon it, pulling the agent to the lowest point of the environment, the agent must accelerate at the correct moments, and in the correct direction, to increase its velocity. The optimal solution to the Mountain Car problem is to accelerate in the current direction of travel and to take a random action when the velocity is zero. A rule formulation denoting this behaviour is shown in Eq. (1). The policy states that the agent's next action $A_t$ should be to accelerate right if its velocity is greater than 0, i.e., keep moving right; to accelerate left if its velocity is less than 0, i.e., keep moving left; and to take a random action if the velocity is 0:

$$A_t = \begin{cases} +1 & \text{if } v > 0 \\ -1 & \text{if } v < 0 \\ a \in \{-1, +1\} & \text{if } v = 0 \end{cases} \qquad (1)$$
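Since Eq. (1) is also the rule against which advice accuracy is later measured, it is worth seeing how compactly it can be implemented. The sketch below is our own illustration of such a simulated optimal advisor; the action encoding (-1 for accelerate left, +1 for accelerate right) follows the equation above.

```python
import random

def optimal_action(velocity: float) -> int:
    """Optimal Mountain Car policy from Eq. (1): accelerate in the
    current direction of travel, and pick a random direction when
    the velocity is exactly zero."""
    if velocity > 0:
        return +1   # keep moving right
    if velocity < 0:
        return -1   # keep moving left
    return random.choice([-1, +1])
```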
Figure 4: A detailed graphical representation of the Mountain Car environment. The agent begins on the line at a random position within the yellow box and must travel to the green goal state. To do so, the agent accelerates towards the first (1) key position until its velocity is reduced to zero by gravity. At this point, the agent turns and accelerates towards the second (2) key position, again until its velocity is reduced to zero. Finally, the agent accelerates down the hill again, building up velocity to reach the goal state.

The agent controlling the car has three actions to choose from in any state: to accelerate left, to accelerate right, or not to accelerate at all. A graphical representation of these possible actions is shown in Figure 5. At each step, the agent receives a reward of $-1$, and no reward when reaching the goal state. This reward encourages the agent to reach the goal in as few steps as possible in order to maximise the reward.

Figure 5: A graphical representation of the Mountain Car agent. The entire rectangle (blue and red) represents the car. The blue box indicates which action the agent has chosen to perform: either to accelerate left, to accelerate right, or not to accelerate at all and continue moving in its current direction of travel.

The agent's state consists of two state variables, position and velocity, which are represented as real numbers. The position variable $p$ represents the agent's position within the environment and ranges linearly from $-1.2$ to $0.6$, i.e., $p \in [-1.2, 0.6]$, with the lowest point of the valley located at approximately $p = -0.53$. The velocity of the agent $v$ has a range of $-0.07$ to $0.07$, i.e., $v \in [-0.07, 0.07]$. If the agent reaches the left boundary of the environment ($p = -1.2$), then the agent's velocity is set to 0.

In this work, the RL agent utilises discrete state variables. Therefore, twenty bins for each state variable have been used, creating a total of 400 (20 × 20) states. Of these 400 states, there are some that may never be visited by the RL agent; for example, it is impossible for the agent to be at the top of the left mountain ($p = -1.2$) and have a high positive velocity ($v = 0.07$).
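A hedged sketch of this discretisation step follows, assuming simple uniform binning over the two state variables; the bin edges and the clipping behaviour at the boundaries are our own choices.

```python
# Uniform 20x20 discretisation of the continuous Mountain Car state,
# assuming the ranges given in the text: p in [-1.2, 0.6] and
# v in [-0.07, 0.07]. Returns a (row, column) index into the
# 400-entry discrete state space.

N_BINS = 20
P_MIN, P_MAX = -1.2, 0.6
V_MIN, V_MAX = -0.07, 0.07

def discretise(position: float, velocity: float) -> tuple[int, int]:
    p_bin = int((position - P_MIN) / (P_MAX - P_MIN) * N_BINS)
    v_bin = int((velocity - V_MIN) / (V_MAX - V_MIN) * N_BINS)
    # Clip so the boundary values fall into the last bin.
    p_bin = min(max(p_bin, 0), N_BINS - 1)
    v_bin = min(max(v_bin, 0), N_BINS - 1)
    return p_bin, v_bin
```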
4.2 Experimental Setup

As indicated in the previous section, twenty persons with no experience in machine learning participated as trainers. During the experiments, the agents were given a low learning rate, manually tuned to extend the time which the agent would take to learn a suitable behaviour on its own. This was chosen so that the focus would be on the human's input rather than on the agent's capabilities. Both the evaluative and informative agents were given a learning rate of $\alpha = 0.25$ and a discount factor of $\gamma = 0.9$, and used an $\epsilon$-greedy action selection strategy.
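Putting the pieces together, the following sketch shows how one training episode might look with these settings, reusing the `q_table`, `discretise`, `select_action`, and `shaped_update` helpers sketched earlier. The `env` object and its `reset`/`step` interface are assumptions (any Mountain Car implementation with this shape would do), as the exact environment code is not given in the text.

```python
# Illustrative training episode combining the earlier sketches. `env`
# is assumed to expose reset() -> (position, velocity) and
# step(action) -> ((position, velocity), reward, done).

ACTIONS = (-1, 0, +1)  # accelerate left, no acceleration, accelerate right

def run_episode(env):
    state = discretise(*env.reset())
    done = False
    while not done:
        # Action selection may be overridden by informative advice,
        # as in the policy-shaping sketch above.
        action = select_action(state, ACTIONS, q_table)
        (pos, vel), reward, done = env.step(action)
        next_state = discretise(pos, vel)
        # The update may fold in evaluative advice, as in the
        # reward-shaping sketch above.
        shaped_update(state, action, reward, next_state, ACTIONS)
        state = next_state
```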
Figure 6: Participants' self-reported level of understanding of the solution and dynamics of the Mountain Car environment. The participants rated their understanding on a scale of 0 to 10 before and after assisting the agent. The standard deviation is shown over the bars for each approach and group.
5 Results

In this section, we analyse the main results obtained from the experimental scenario. First, we present the users' self-evaluations in terms of their level of task understanding, engagement with the interactive agent, self-reported accuracy, and the agent's ability to follow advice. Thereafter, we discuss the characteristics of the given advice, such as frequency, accuracy, and availability.
5.1 Participants' Self-Evaluation

As previously mentioned, before each participant began interacting with the agent, they were asked to answer two questions from the questionnaire (see Appendix A). The purpose of the questionnaire is to assess the participants' understanding of the problem environment and their interactions with the agent. The first question asked was whether the participant had previously been involved in a machine learning study. None of the twenty participants reported having been involved in a machine learning study previously.

Participants were then provided with a brief explanation of the dynamics of the environment and of what the optimal behaviour would be. Subsequently, before starting the experiment, they were asked to rate their level of understanding of the environment on a scale of 0 to 10. After interacting with the agent, the participants were asked the same question again. Figure 6 shows the average self-reported level of understanding from the two groups of participants, i.e., the evaluative and informative groups, both before and after the experiments.
Figure 7: Participants' self-reported level of engagement with the agent. Participants reported that they (a) could have spent more time with the agent, (b) were happy with how much time they provided, or (c) spent too much time with the agent. No significant differences are shown between the two groups.

Interestingly, there is a small difference in the participants' self-reported understanding of the environment before they began interacting with the agent. The only difference in the explanation given to the two groups was the details of how they could interact with the agent. The participants giving evaluative advice were asked to rate the agent's choice of action as good or bad, while the participants giving informative advice were asked to recommend an action, either left or right. The difference in reported understanding before the experiment may indicate that evaluative advice delivery is easier to understand.

Additionally, a change in the participants' self-reported level of understanding is observed after the experiment. Although the informative group shows a greater change in understanding than the evaluative group, this is due to the difference in initial self-reported understanding. After assisting the agent, the two groups reported a greater understanding of the environment, with no significant difference between them.

Moreover, after finishing the experiment, participants were also asked to report how they felt about their level of engagement with the agent. They were given three different options to answer:

(a) I could have spent more time interacting with the agent.
(b) I am happy with how much time I interacted with the agent.
(c) I spent too much time interacting with the agent.
Figure 8: Participants' self-reported level of advice accuracy. Participants rated the accuracy of the advice they provided to the agent from 'Always incorrect' to 'Always correct'. The informative group shows more confidence in the advice they give to the agent.

Figure 7 shows the participants' reported level of engagement with the agent, indicating no significant difference between the two groups. In both cases, the majority of participants were content with the level of engagement they had with the agent.

The participants were asked to report what they thought their level of accuracy was throughout the experiment. Participants were given six different options to answer, ranging from always incorrect to always correct. Figure 8 shows the self-reported results. The results obtained indicate that participants in the informative group were more confident in the advice they provided to the agent.

Finally, participants were asked to rate how well they thought the agent followed their advice. On a scale from 0 (never) to 10 (always), participants scored the agent's ability to follow the given advice. The obtained results, summarised in Figure 9, show that participants using informative advice perceived the agent as better able to follow advice when compared to participants using evaluative advice.

We have computed Student's t-test to test the statistical difference between the self-reported results of the two groups, as shown in Table 1.
Figure 9: Average of participants' self-reported feeling of how well the agent followed the advice provided. Participants scored the agent's ability to follow advice on a scale from 0 (never) to 10 (always). The informative group perceives the agent to better follow the provided advice. The standard deviation is shown over the bars for each approach.

Table 1: Student's t-test for comparison of self-reported results for evaluative and informative advisors.
Evaluation | t-score | p-value
Understanding of the environment (before) | t = 2.… | p = 0.…
… | t = 0.… | p = 0.…
… | t = 1.… | p = 0.…

5.2 Characteristics of the Provided Advice

From the assistance provided to the agent, we kept a record of the number of interactive steps and the percentage of interactive steps relative to the total number of steps. Figure 10 displays the number of steps in which each set of participants interacted with the agent to provide assistance. In the boxplot, the cross marker represents the mean, dots are outliers, and the quartile calculation uses the exclusive median. Overall, both groups provided advice in 9.15 steps on average; however, the data collected show a large variation in engagement between the two types of advice. Participants providing informative advice assisted in over twice as many steps as participants providing evaluative advice.

Figure 10: Number of steps in which participants provided advice to the learner agent in the Mountain Car environment. The number of interactive steps is over two times higher for participants providing informative advice in comparison to evaluative advice.

As demonstrated in previous work [50], agents assisted by informative advice learn more quickly than agents assisted by evaluative advice. The increase in learning speed results in fewer steps per episode for environments with a termination condition. This decrease in steps per episode for informative-assisted agents gives fewer opportunities for the user to provide advice, as only one interaction may occur each step. As a result, the number of interactions per episode is not necessarily a suitable measure of engagement. Therefore, the number of steps in which interaction occurred relative to the total number of steps is used to measure engagement. Figure 11 shows the interaction rate as a percentage for the two sets of participants. As before, the boxplot uses cross markers to represent the mean and the exclusive median for quartile calculation. The interaction percentage is the ratio of interactions to interaction opportunities. Using this measurement, the length of the episode is disregarded. The results show that participants using an informative advice delivery method interact almost twice as often as their evaluative counterparts. Despite the higher rate of interaction shown by participants using informative advice, both groups self-reported that they were happy with their level of engagement with the agent, as shown in Figure 7.

While training the agent, the availability and accuracy of the assistance provided by the advisors were recorded. Figure 12 displays the accuracy percentage of the advice provided by each group of participants. Cross markers represent the mean and the exclusive median is used for quartile calculation. An accurate interaction is one that provided the optimal advice for the agent in that state. Therefore, accuracy is a measurement of the number of correct interactions compared to the total interactions.

Figure 11: Percentage of steps in which participants provided advice to the learner agent on the Mountain Car problem. The percentage is computed as the ratio of interactions to interaction opportunities. The informative advice rate is almost twice as high in comparison to evaluative advice.

Figure 12: The percentage of interactions in which the advice provided was optimal for the state-action pair. Participants providing informative advice were around two times more accurate and presented less variability in comparison to participants using evaluative advice.

Table 2: Student's t-test for comparison of the provided advice from evaluative and informative advisors.
Evaluation | t-score | p-value
Average interactive steps | t = 2.… | p = 0.…
Average interactive rate | t = 1.… | p = 0.…
Accuracy of the advice provided | t = 14.… | p = 2.… × 10^−…

Informative interactions are almost twice as accurate as evaluative interactions and also show much less variability. These results also reflect the self-reported level of advice accuracy shown in Figure 8. We have also computed Student's t-test to test the statistical difference between the obtained results in terms of the advice provided by the two groups. Table 2 shows the obtained t-scores along with the p-values for the average interactive steps, the average interactive rate, and the accuracy of the advice provided. Although there exist statistical differences between the two groups for the average interactive steps and the average interactive rate, the difference is much clearer for the accuracy of the advice provided, given the low p-value.

One hypothesis for the large difference in accuracy is latency. In this context, latency is the time it takes for the human to decide on the advice to provide and then input it into the agent. It is possible that if the human is too late in providing advice, then the advice will inadvertently be applied to the state after the one intended. For the Mountain Car environment, a late interaction is more likely to remain accurate in the next state for informative advice than for evaluative advice. This is due to the layout of the state space and the nature of untrained agents. The optimal action for a state in the Mountain Car environment is likely to be the same as for its neighbouring states, because the optimal behaviour is to accelerate in a single direction until velocity reaches 0. Therefore, a recommended action that is received in the state after the one intended is likely to still be the correct action, regardless of latency. This does not apply to evaluative advice. The participants assisting the evaluative agent do not provide a recommended action; instead, they critique the agent's last choice. An untrained agent has a largely random action selection policy and is therefore not likely to choose the same action twice in a row. As the agent's chosen action may have changed by the time it receives advice from the user, the accuracy is more affected.

This hypothesis is supported by the state breakdown of the advice accuracy. Figure 13 shows the accuracy of participants' advice for each state in the environment for (a) informative and (b) evaluative advice respectively. The darker the colour, the more accurate the advice supplied by the participants for that state. The comparison of the two heatmaps supports the earlier observations of accuracy shown in Figure 12: informative advice is much more accurate than evaluative advice. The informative advice method (Figure 13a) shows that the states with the most inaccuracy are in the middle of the environment, where the optimal action changes.
Figure 13: State-based accuracy of (a) informative and (b) evaluative participants for the Mountain Car environment. Informative advice is in general more accurate than evaluative advice, except in states in the middle of the environment, where the optimal action changes. Latency affects evaluative advice more, since there is a low probability that delayed advice is still useful in the next state.
Figure 14: State-based availability of (a) informative and (b) evaluative participants for the Mountain Car environment. Participants using informative advice achieved higher velocities in the environment, and as a consequence, more states were visited in comparison to the evaluative advice approach.
This inaccuracy is likely not due to poor participant knowledge, but rather to delayed advice, provided after the agent has moved beyond the centre threshold.

The evaluative advice method (Figure 13b) shows that accuracy differs wildly across the environment and does not have an obvious pattern. The poor accuracy of evaluative advice is likely due to the latency of advice delivery, coupled with the lower probability that the advice will still be accurate in the following state compared to informative advice. Additionally, evaluative advice may have lower accuracy as it requires the human to assess each state-action pair. On the other hand, informative advice may require less time to assess each state, as the human may be following a set of rules for action recommendation, and the next state is easier to predict than the agent's next action choice.

Figure 14 shows the availability of participants' advice for each state in the environment for (a) informative and (b) evaluative advice respectively. Availability in this context is a measure of how often the user provides advice in a state compared to the number of times the agent visited the state. The darker a state is on the heatmaps, the more often the user provided advice for that state. The agent that was assisted by informative advice (Figure 14a) was able to achieve higher velocities in the environment and, as a result, visited more states in comparison to the evaluative advice method (Figure 14b). One pattern that can be observed in the results is that the states on the edges show higher advice availability than those in the centre. These edge states are visited once the agent has learned a suitable behaviour, making the evaluation and recommendation of actions easier for the user and increasing engagement. The edge states tend to be the last states for which the users provided advice before voluntarily ending the experiment.

Figure 15: Reward bias of evaluative advice. A value above 50% means that the advisor provided more positive than negative evaluations. In general, participants provided much more positive advice, confirming prior findings that people are more likely to provide feedback on actions they view as correct than on incorrect actions.

Finally, we tested for the presence of reward bias among the participants providing evaluative advice, as it has been reported in the existing literature [16]. In this regard, a deviation from fifty percent indicates reward bias, i.e., above 50% means that the advisor provided more positive than negative evaluations. On average, participants provided 66.22% positive advice, with a minimum rate of 57.14% and a maximum rate of 100.00% positive evaluations. Figure 15 shows the reward bias of the participants providing evaluative advice. The results collected show that all participants provided more positive than negative evaluations.
6 Conclusions

The human trial performed in this work has investigated the engagement of human advice-givers when assisting interactive reinforcement learning agents. The assessment was performed using two methods for providing assistance, namely evaluative and informative advice. Evaluative advice assesses the past performance of an agent, while informative advice supplements future decision-making. Previous work in the field has compared the performance of interactive reinforcement learning agents under the influence of each assistance method, finding that informative-assisted agents learn faster. However, to the best of our knowledge, studies on human engagement when providing advice using each assistance method have not previously been performed.

The results obtained from the human trial show that advice-givers providing informative advice outperformed those using evaluative advice. Humans using an informative advice-giving method provided more accurate advice, assisted the agent for longer, and provided advice more frequently. Additionally, informative advice-givers rated the ability of the agent to follow advice higher, perceived their own advice to be of higher accuracy, and were similarly content with their engagement with the agent as the evaluative advice-giving participants.

Future work will consider the use of simulated users as a method of replicating the general behaviour of the participants in this experiment. Simulated users would allow experiments to run faster, keep experimental conditions under control, and be repeated as many times as needed. The findings from this study can be used to create simulated users which more closely reflect the behaviour of actual human advisers.
Acknowledgments
This work has been partially supported by the Australian Government Research Training Program (RTP) and the RTP Fee-Offset Scholarship through Federation University Australia.
References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, pp. 237-285, 1996.

[3] I. Giannoccaro and P. Pontrandolfo, "Inventory management in supply chains: a reinforcement learning approach," International Journal of Production Economics, vol. 78, no. 2, pp. 153-161, 2002.

[4] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, E. Osawa, and H. Matsubara, "RoboCup: A challenge problem for AI," AI Magazine, vol. 18, no. 1, p. 73, 1997.

[5] N. Churamani, F. Cruz, S. Griffiths, and P. Barros, "iCub: learning emotion expressions using human reward," arXiv preprint arXiv:2003.13483, 2020.

[6] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215-219, 1994.

[7] P. Barros, A. Tanevska, and A. Sciutti, "Learning from learners: Adapting reinforcement learning agents to be competitive in a card game," arXiv preprint arXiv:2004.04000, 2020.

[8] D. J. Mankowitz, G. Dulac-Arnold, and T. Hester, "Challenges of real-world reinforcement learning," in ICML Workshop on Real-Life Reinforcement Learning, p. 14, 2019.

[9] F. Cruz, P. Wüppen, A. Fazrie, C. Weber, and S. Wermter, "Action selection methods in a robotic reinforcement learning scenario," pp. 13-18, IEEE, 2018.

[10] A. L. Thomaz, G. Hoffman, and C. Breazeal, "Real-time interactive reinforcement learning for robots," in Proceedings of Association for the Advancement of Artificial Intelligence Conference AAAI, Workshop on Human Comprehensible Machine Learning, pp. 9-13, 2005.

[11] G. Li, R. Gomez, K. Nakamura, and B. He, "Human-centered reinforcement learning: a survey," IEEE Transactions on Human-Machine Systems, vol. 49, no. 4, pp. 337-349, 2019.

[12] C. Thimmesh, Team Moon. Houghton Mifflin Company, 2006.

[13] K. Subramanian, C. L. Isbell Jr, and A. L. Thomaz, "Exploration from demonstration for interactive reinforcement learning," in Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 447-456, International Foundation for Autonomous Agents and Multiagent Systems, 2016.

[14] C. Arzate Cruz and T. Igarashi, "A survey on interactive reinforcement learning: Design principles and open challenges," in Proceedings of the 2020 ACM Designing Interactive Systems Conference, pp. 1195-1209, 2020.

[15] A. Bignold, F. Cruz, M. E. Taylor, T. Brys, R. Dazeley, P. Vamplew, and C. Foale, "A conceptual framework for externally-influenced agents: An assisted reinforcement learning review," arXiv preprint arXiv:2007.01544, 2020.

[16] S. Amershi, M. Cakmak, W. B. Knox, and T. Kulesza, "Power to the people: The role of humans in interactive machine learning," AI Magazine, vol. 35, no. 4, pp. 105-120, 2014.

[17] F. Cruz, G. I. Parisi, and S. Wermter, "Multi-modal feedback for affordance-driven interactive reinforcement learning," pp. 1-8, IEEE, 2018.

[18] M. Sharma, M. P. Holmes, J. C. Santamaría, A. Irani, C. L. Isbell Jr, and A. Ram, "Transfer learning in real-time strategy games using hybrid CBR/RL," in IJCAI, vol. 7, pp. 1041-1046, 2007.

[19] M. E. Taylor, P. Stone, and Y. Liu, "Transfer learning via inter-task mappings for temporal difference learning," Journal of Machine Learning Research, vol. 8, no. Sep, pp. 2125-2167, 2007.

[20] Y. S. Shin and Y. Niv, "Biased evaluations emerge from inferring hidden causes," PsyArXiv preprint psyarxiv:10.31234, Apr 2020.

[21] M. Grzes, "Reward shaping in episodic reinforcement learning," in Proceedings of the Sixteenth International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2017), pp. 565-573, ACM, 2017.

[22] O. Marom and B. S. Rosman, "Belief reward shaping in reinforcement learning," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3762-3769, AAAI, 2018.

[23] P. Shah, D. Hakkani-Tur, and L. Heck, "Interactive reinforcement learning for task-oriented dialogue management," in Workshop on Deep Learning for Action and Interaction, Advances in Neural Information Processing Systems 2016, pp. 1-11, 2016.

[24] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, vol. 1. MIT Press, Cambridge, 2016.

[25] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.

[26] B. Kang, P. Compton, and P. Preston, "Multiple classification ripple down rules: evaluation and possibilities," in Proceedings 9th Banff Knowledge Acquisition for Knowledge Based Systems Workshop, vol. 1, pp. 17-1, 1995.

[27] P. Compton, G. Edwards, B. Kang, L. Lazarus, R. Malor, T. Menzies, P. Preston, A. Srinivasan, and C. Sammut, "Ripple down rules: possibilities and limitations," in Proceedings of the Sixth AAAI Knowledge Acquisition for Knowledge-Based Systems Workshop, Calgary, Canada, University of Calgary, pp. 6-1, 1991.

[28] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1633-1685, 2009.

[29] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Machine Learning Proceedings 1994, pp. 157-163, Elsevier, 1994.

[30] M. Tan, "Multi-agent reinforcement learning: Independent versus cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.

[31] F. Cruz, G. I. Parisi, and S. Wermter, "Learning contextual affordances with an associative neural architecture," in Proceedings of the European Symposium on Artificial Neural Networks. Computational Intelligence and Machine Learning ESANN, pp. 665-670, UCLouvain, 2016.

[32] B. Argall, B. Browning, and M. Veloso, "Learning by demonstration with critique from a human teacher," in Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pp. 57-64, ACM, 2007.

[33] C. Millán, B. Fernandes, and F. Cruz, "Human feedback in continuous actor-critic reinforcement learning," in Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning ESANN, pp. 661-666, ESANN, 2019.

[34] A. Ayala, C. Henríquez, and F. Cruz, "Reinforcement learning using continuous states and interactive feedback," in Proceedings of the International Conference on Applications of Intelligent Systems, pp. 1-5, 2019.

[35] A. L. Thomaz and C. Breazeal, "Asymmetric interpretations of positive and negative human feedback for a social learning agent," in Robot and Human Interactive Communication, 2007. RO-MAN 2007. The 16th IEEE International Symposium on, pp. 720-725, IEEE, 2007.

[36] I. Moreira, J. Rivas, F. Cruz, R. Dazeley, A. Ayala, and B. Fernandes, "Deep reinforcement learning with interactive feedback in a human-robot environment," Applied Sciences, vol. 10, no. 16, p. 5574, 2020.

[37] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in Proceedings of the International Conference on Machine Learning ICML, vol. 99, pp. 278-287, 1999.

[38] T. Brys, A. Nowé, D. Kudenko, and M. E. Taylor, "Combining multiple correlated reward and shaping signals by measuring confidence," in Proceedings of the Association for the Advancement of Artificial Intelligence conference AAAI, pp. 1687-1693, 2014.

[39] B. Marthi, "Automatic shaping and decomposition of reward functions," in Proceedings of the International Conference on Machine Learning ICML, pp. 601-608, ACM, 2007.

[40] B. Rosman and S. Ramamoorthy, "Giving advice to agents with hidden goals," pp. 1959-1964, IEEE, 2014.

[41] W. B. Knox and P. Stone, "Combining manual feedback with subsequent MDP reward signals for reinforcement learning," in Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1, pp. 5-12, International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[42] W. B. Knox and P. Stone, "Interactively shaping agents via human reinforcement: The TAMER framework," in Proceedings of the Fifth International Conference on Knowledge Capture, pp. 9-16, ACM, 2009.

[43] J. MacGlashan, M. K. Ho, R. Loftin, B. Peng, G. Wang, D. L. Roberts, M. E. Taylor, and M. L. Littman, "Interactive learning from policy-dependent human feedback," in
Proceedings of the 34th International Con-ference on Machine Learning-Volume 70 , pp. 2285–2294, 2017.[44] D. Arumugam, J. K. Lee, S. Saskin, and M. L. Littman, “Deep rein-forcement learning from policy-dependent human feedback,” arXiv preprintarXiv:1902.04257 , 2019.[45] T. Kessler Faulkner, R. A. Gutierrez, E. S. Short, G. Hoffman, andA. L. Thomaz, “Active attention-modified policy shaping: socially inter-active agents track,” in
Proceedings of the International Conference onAutonomous Agents and Multiagent Systems AAMAS , pp. 728–736, In-ternational Foundation for Autonomous Agents and Multiagent Systems,2019.[46] P. Behboudian, Y. Satsangi, M. E. Taylor, A. Harutyunyan, and M. Bowl-ing, “Useful policy invariant shaping from arbitrary advice,” in
AAMASAdaptive and Learning Agents Workshop ALA 2020 , p. 9, 2020.[47] J. Lin, Z. Ma, R. Gomez, K. Nakamura, B. He, and G. Li, “A reviewon interactive reinforcement learning from human social feedback,”
IEEEAccess , vol. 8, pp. 120757–120765, 2020.[48] F. Cruz, P. W¨uppen, S. Magg, A. Fazrie, and S. Wermter, “Agent-advisingapproaches in an interactive reinforcement learning scenario,” in age 30
IEEE International Conference on Development and Learning and Epige-netic Robotics (ICDL-EpiRob) , pp. 209–214, IEEE, 2017.[49] J. Grizou, M. Lopes, and P.-Y. Oudeyer, “Robot learning simultaneouslya task and how to interpret human instructions,” in
Proceedings of theJoint IEEE International Conference on Development and Learning andEpigenetic Robotics ICDL-EpiRob , pp. 1–8, IEEE, 2013.[50] S. Griffith, K. Subramanian, J. Scholz, C. Isbell, and A. L. Thomaz, “Pol-icy shaping: Integrating human feedback with reinforcement learning,” in
Advances in Neural Information Processing Systems , pp. 2625–2633, 2013.[51] P. M. Pilarski and R. S. Sutton, “Between instruction and reward: human-prompted switching,” in
AAAI Fall Symposium Series: Robots LearningInteractively from Human Teachers , pp. 45–52, 2012.[52] O. Amir, E. Kamar, A. Kolobov, and B. Grosz, “Interactive teaching strate-gies for agent training,” in
Proceedings of the International Joint Confer-ence on Artificial Intelligence IJCAI , pp. 804–811, 2016.[53] F. Cruz, S. Magg, Y. Nagai, and S. Wermter, “Improving interactive re-inforcement learning: What makes a good teacher?,”
Connection Science ,vol. 30, no. 3, pp. 306–325, 2018.[54] E. Kamar, S. Hacker, and E. Horvitz, “Combining human and machineintelligence in large-scale crowdsourcing,” in
Proceedings of the 11th In-ternational Conference on Autonomous Agents and Multiagent Systems-Volume 1 , pp. 467–474, International Foundation for Autonomous Agentsand Multiagent Systems, 2012.[55] A. L. Thomaz and C. Breazeal, “Teachable robots: Understanding humanteaching behavior to build more effective robot learners,”
Artificial Intelli-gence , vol. 172, no. 6-7, pp. 716–737, 2008.[56] W. B. Knox and P. Stone, “Reinforcement learning from human reward:Discounting in episodic tasks,” in ,pp. 878–885, IEEE, 2012.[57] W. B. Knox and P. Stone, “Learning non-myopically from human-generatedreward,” in
Proceedings of the 2013 International Conference on IntelligentUser Interfaces , pp. 191–202, ACM, 2013.[58] M. Cakmak and A. L. Thomaz, “Optimality of human teachers for robotlearners,” in
Development and Learning (ICDL), 2010 IEEE 9th Interna-tional Conference on , pp. 64–69, IEEE, 2010.[59] M. Cakmak, C. Chao, and A. L. Thomaz, “Designing interactions for robotactive learners,”
IEEE Transactions on Autonomous Mental Development ,vol. 2, no. 2, pp. 108–118, 2010.age 31[60] A. Guillory and J. A. Bilmes, “Simultaneous learning and covering withadversarial noise.,” in
ICML , vol. 11, pp. 369–376, 2011.[61] A. Guillory and J. A. Bilmes, “Online submodular set cover, ranking, andrepeated active learning,” in
Advances in Neural Information ProcessingSystems , pp. 1107–1115, 2011.[62] A. W. Moore, L. Birnbaum, and G. Collins, “Variable resolution dynamicprogramming: Efficiently learning action maps in multivariate real-valuedstate-spaces,” in
Proceedings of the Eighth International Conference on Ma-chine Learning , pp. 333–337, 1991. uestionnaire Form
Participant Code: ___________________
No identifying information is collected. The participant code is used only to match your questionnaire responses to your experiment responses. Once the study is complete, neither your name nor any other identifying information will be kept. See the Plain Language Information Statement for more details.
Have you participated in a machine learning study in the past?
On a scale of 0 – 10, how would you rate your level of knowledge about the Mountain Car experiment?
Stop the questionnaire now and perform the experiment. After completing the experiment, turn the page over and answer the remaining questions.
Do not complete this side until you have completed the experiment.
Now that you have completed the experiment, on a scale of 1 – 10, how would you rate your level of knowledge about the Mountain Car experiment?
How do you feel about the level of engagement you had with the agent?
A. I could have spent more time interacting with the agent.
B. I’m happy with how much time I interacted with the agent.
C. I spent too much time interacting with the agent.
How accurate do you feel your advice to the agent was?
A. Always Incorrect
B. Mostly Incorrect
C. Sometimes Incorrect
D. Sometimes Correct
E. Mostly Correct
F. Always Correct
How well do you think the agent followed your advice?