Accelerating Reinforcement Learning Agent with EEG-based Implicit Human Feedback
Duo Xu, Mohit Agarwal, Ekansh Gupta, Faramarz Fekri, Raghupathy Sivakumar
Georgia Institute of Technology, Atlanta, GA, 30332
Emails: {dxu301, mohit, fekri, siva}@ece.gatech.edu
*Equal contribution.

Abstract—Providing Reinforcement Learning (RL) agents with human feedback can dramatically improve various aspects of learning. However, previous methods require the human observer to give inputs explicitly (e.g., press buttons, voice interface), burdening the human in the loop of the RL agent's learning process. Further, it is sometimes difficult or impossible to obtain explicit human advice (feedback), e.g., in autonomous driving, rehabilitation for the disabled, etc. In this work, we investigate capturing a human's intrinsic reactions as implicit (and natural) feedback through EEG in the form of error-related potentials (ErrPs), providing a natural and direct way for humans to improve the RL agent's learning. As such, human intelligence can be integrated via implicit feedback with RL algorithms to accelerate the learning of the RL agent. We develop three reasonably complex 2D discrete navigational games to experimentally evaluate the overall performance of the proposed work. The major contributions of our work are as follows: (i) we propose and experimentally validate the zero-shot learning of ErrPs, where the ErrPs can be learned for one game and transferred to other unseen games, (ii) we propose a novel RL framework for integrating implicit human feedback via ErrPs with the RL agent, improving label efficiency and robustness to human mistakes, and (iii) compared to prior works, we scale the application of ErrPs to reasonably complex environments, and demonstrate the significance of our approach for accelerated learning through real user experiments.
I. INTRODUCTION
AI systems are increasingly applied to real-world tasks that involve interaction with humans, and humans are often in the loop of the RL agent's learning process. Self-driving cars learn with humans ready to intervene in dangerous situations. Facebook's algorithm for recommending trending news stories has humans filtering out inappropriate content. RL with human-in-the-loop has therefore inspired several research efforts where either an alternative (or supplementary) feedback is obtained from the human participant, such as human rankings or ratings [18], human-robot interaction and rehabilitation engineering for the disabled [32], [36], or the learning is performed through human demonstrations [41]. Such approaches with explicit human input, despite being highly effective, severely burden the human interacting with the RL agent. Further, it is difficult or even impossible to obtain explicit human feedback in various situations, e.g., autonomous driving, disabled users, etc.

In this work, we investigate an alternative paradigm that obtains the human feedback in an implicit manner (by tapping directly into intrinsic brainwaves), substantially increasing the richness of the reward functions while not severely burdening the human-in-the-loop. We study the use of electroencephalogram (EEG) based brain waves of the human-in-the-loop to generate auxiliary reward functions that augment the learning of the RL agent. Such a model benefits from the natural, rich activity of a powerful sensor (the human brain), but at the same time does not burden the human, since the activity being relied upon is intrinsic. This paradigm is inspired by a high-level error-processing system in humans that generates the error-related potential (ErrP) [50], [6], a negative deflection in the ongoing EEG signals. When a human recognizes an error made by an agent, the elicited ErrP can be captured through EEG to inform the agent about the sub-optimality of the action taken in the particular state. Human feedback obtained in this manner is direct and fast while being natural and easy for humans. This widens the applicability of such RL-human interactive systems where the RL agents are deployed in real-world environments, and increased latency of human feedback could create unwanted situations. Further, obtaining a large amount of explicit feedback is infeasible due to the increased cognitive load [45]. Additionally, EEG-based feedback allows disabled users, for whom an explicit communication pathway is not available, to provide feedback.

Previous works [14], [48] demonstrated the benefit of error-potentials in a very simple setting (i.e., very small state-space and two actions), and used ErrPs as the only reward. As a baseline contribution, we scale the feasibility of capturing error-potentials (of a human observer watching an agent learning to play games) to reasonably complex environments, and then experimentally show that decoded ErrPs can be appropriately used as an auxiliary reward function for an RL agent. Specifically, we show that the full access approach, inquiring human feedback on every state-action pair visited by the RL agent, can significantly speed up the learning of the RL agent.
However, we argue that while obtaining such implicit human feedback through EEG is less burdensome, it is still a time-intensive task for the subject and the experimenter alike. This, combined with the noisy EEG signals and the stochasticity in inferring error-potentials, raises significant challenges in terms of the practicality of the solution.

In this context, we first argue that the definition of ErrPs can be learned in a zero-shot manner across different environments. We experimentally validate that an observer's ErrPs can be learned and decoded for a specific game, and the definition can be used as-is for another game without requiring re-learning of the ErrP. This is notably different from previous approaches [14], [48], where the labeled ErrPs are obtained in the same environment the RL agent is trying to solve. We contend that previous approaches are not practical, since the ErrP decoder cannot be trained and tested in the same environment.

We develop a framework to integrate a deep RL (DRL) model with the implicit human feedback mechanism (via ErrP) in a practical, sample-efficient manner. Our proposed framework allows humans to provide their feedback implicitly prior to the agent training, reducing the cognitive load on humans, and hence the cost of human supervision. In the presented framework, prior to the training of the RL agent, randomly generated demonstrations are presented to the human for feedback (implicitly via ErrP), and an auxiliary reward function is learned to reflect the human decision and intelligence hidden behind the ErrP labels. This auxiliary reward is then passed to the RL agent to accelerate the learning process in sparse-reward environments. Similar previous work lies in the line of human-agent interaction via reward shaping [8], [10], [34], [53], [57], [60]. However, these methods do not specifically treat errors in the human feedback, so they are not robust to wrong ErrP labels, which arise from the randomness in collecting and decoding ErrP data. Thus, we learn an auxiliary reward from human feedback in a way that is robust to wrong ErrP labels. We first model the human policy as a soft-Q policy [28] and learn the human Q function via maximum likelihood based on the collected ErrP labels. Then, in order to make the learned Q function more compatible with the state space, a baseline function is introduced to smoothen it. Finally, at the RL agent side, the received reward is the combination of the (sparse) environmental reward and the auxiliary reward learned from human feedback.

We present results of real ErrP experiments to evaluate the acceleration in learning and the sample efficiency of the proposed frameworks. We show that such an implicit feedback approach can accelerate the training of the RL agent by 2.25x, while reducing the number of queries required by 75.56%. In summary, the novel contributions of our work are:
1) We demonstrate the zero-shot learning of error-potentials over various visual-based RL problems (the discrete grid-based navigation games studied in this work), enabling the estimation of implicit human feedback in new and unseen environments without re-training of the ErrP decoder.
2) In order to reduce the sample complexity of ErrP labels, we propose a new framework for integrating human feedback into RL via reward shaping. It is a novel approach that specifically considers robustness against mistakes in human feedback. We first generate a set of random trajectories by Monte Carlo Tree Search (MCTS), balancing exploration and exploitation.
Then ErrP labels are collected in experiments by demonstrating these trajectories to human observers. By learning the human Q function with the decoded labels, we derive an auxiliary reward function to augment the learning of the subsequent RL agent.
3) We scale implicit human feedback (via ErrP) based RL to reasonably complex environments and demonstrate the significance of our approach through experiments on various human subjects.

Our work demonstrates the potential of intuitive human-robot interaction, facilitating robotic control by implicit human feedback in the form of ErrPs. We believe the contributions presented in this work, i.e., zero-shot learning of ErrPs and an RL framework that reduces the human cognitive load, will inspire such implicit human feedback systems to be deployed in practical robotic applications, such as autonomous driving or end-user applications for the disabled, where explicit human feedback is not available.

II. RELATED WORK
The impact of feedback provided by a human to an agent in RL settings has been investigated by multiple researchers. A survey of recent research in using human guidance for deep RL tasks is presented in [62]. We summarize related work in the techniques that are most relevant to us. In addition to rewards from the environment, reward shaping learns an auxiliary reward function to accelerate the learning process of the agent [12], [13], [54]. A framework called TAMER (Training an Agent Manually via Evaluative Reinforcement) that enabled shaping (interactively training an agent via an external signal provided by a human) was presented in [34]. The authors then extended this work to enable human feedback to augment an RL agent that learned using an MDP reward signal [35], [36]. Recently, an architecture called Deep-TAMER [57] extended the TAMER framework to environments with high-dimensional state spaces. DQN-TAMER [1] modeled other characteristics of human observers, such as facial expressions, from which human reward was inferred.

Human preference [15], [59] is another approach to communicate complex goals, allowing systems to interact with real-world environments in a meaningful way. It allows the RL agent to directly learn from expert preferences. However, this approach is limited by the assumption that a (total) order exists among the set of trajectories. The authors of [53] proposed a framework called Human-Agent Transfer (HAT). It directly uses demonstrations provided by a human operator to synthesize a baseline policy, which guides the learning of the agent. CHAT [56] extended HAT to consider uncertainty in summarizing demonstrations and further improve performance.

Potential functions are also used in potential-based reward shaping (PBRS) methods to accelerate the learning process while preserving the identity of the optimal policies [41], [58], [61]. The potential function is designed to encode 'rules' of the RL agent's environment. However, potential functions typically need to be pre-specified, which has restricted the use of PBRS to tabular / low-dimensional state spaces.

In the previous reward-shaping work mentioned above, human feedback is explicit, requiring active human labeling or attention, and mistakes in the human feedback are not tackled. Here we propose to read implicit human feedback from the error-potential hidden in human brain waves, and deal with wrong feedback in a robust manner. Recently, there has been a long line of papers studying reinforcement learning from human feedback, such as [17], [18], [55], [59], [15]. However, they only consider explicit human feedback or labeling, and they all assume the human feedback is noiseless. In this work, we use a reward function learned by imitation learning to augment the subsequent RL agent.

Numerous works [11], [31], [30] have studied a high-level error-processing system in humans that generates the error-related potential/negativity (ErrP or ERN). Interaction, response, and feedback ErrPs have been heavily investigated in the domain of choice reaction tasks, where the human actively interacts with the system [49], [7], [44], [22], [23] and the error is made either by the human or by the machine. [33] demonstrated the use of ErrP signals in an interactive RL task, where the human actively interacts with the machine system. [22] explored ErrPs when the human silently observes the machine's actions (and does not actively interact).
Works at the intersection of ErrP and RL [14], [48] demonstrate the benefit of ErrPs in a very simple setting (i.e., very small state-space), and use ErrP-based feedback as the only reward. Moreover, in all of these works, the ErrP decoder is trained on a similar game (or robotic task), essentially using knowledge that is supposed to be unknown in the RL task. In our work, we use labeled ErrP examples from very simple and known environments to train the ErrP decoder, and integrate ErrPs with DRL in a sample-efficient manner for reasonably complex environments.

III. PRELIMINARIES AND SETUP
Definitions:
Consider a Markov Decision Process (MDP) problem M, given as a tuple $\langle \mathcal{X}, \mathcal{A}, P, P_0, R, \gamma \rangle$, with state-space $\mathcal{X}$, action-space $\mathcal{A}$, transition kernel $P$, initial state distribution $P_0$, reward function $R$, and discount factor $0 \le \gamma \le 1$. In this work, we only consider MDPs with discrete actions and states. In model-free RL, the central idea of the most prominent approaches is to learn the Q-function by minimizing the Bellman residual, i.e.,
$$L(Q) = \mathbb{E}_{\pi}\big[\big(Q(x,a) - r - \gamma Q(x',\hat{a})\big)^2\big],$$
with temporal difference (TD) updates, where the transition tuple $(x, a, r, x')$ consists of a consecutive experience under behavior policy $\pi$.

Bayesian Deep Q Network: The Q-function model adopted in this paper is the Bayesian DQN [2]. It is a neural architecture where the Q-function is approximated as a linear function, weighted by $\omega_a$, $a \in \mathcal{A}$, of the feature representation of states $\phi_\theta(x) \in \mathbb{R}^d$, parameterized by a neural network with weights $\theta$. The weights $\omega_a$ follow a Gaussian distribution obtained from Bayesian linear regression.
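To make the Bayesian linear regression head concrete, the following is a minimal NumPy sketch of the Gaussian posterior over the per-action weights $\omega_a$ and a Thompson-sampling action choice on top of it. The prior/noise variances, function names, and the exact target construction are our own illustrative assumptions, not the implementation of [2].

```python
import numpy as np

def blr_posterior(Phi, y, sigma2=1.0, sigma_w2=10.0):
    """Posterior N(mu, Sigma) over weights w_a for one action, given
    Phi: (n, d) matrix of features phi_theta(x) where action a was taken,
    y:   (n,) regression targets, e.g. r + gamma * max_a' Q(x', a')."""
    d = Phi.shape[1]
    # Posterior precision = prior precision + data precision (standard BLR)
    Sigma_inv = np.eye(d) / sigma_w2 + Phi.T @ Phi / sigma2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ Phi.T @ y / sigma2
    return mu, Sigma

def thompson_action(phi_x, posteriors, rng):
    """Draw w_a ~ N(mu_a, Sigma_a) for each action and act greedily."""
    q = [phi_x @ rng.multivariate_normal(mu, Sigma) for mu, Sigma in posteriors]
    return int(np.argmax(q))

# Usage sketch: posteriors = [blr_posterior(Phi_a, y_a) for each action a];
# action = thompson_action(phi_theta_x, posteriors, np.random.default_rng(0))
```

Sampling the weights (rather than taking the posterior mean) is what gives this architecture its exploration behavior.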
We designed and developed an experimental protocol where a machine agent plays a computer game while a human silently observes (and assesses) the actions taken by the machine agent. These implicit human reactions are captured in the form of EEG potentials by placing raw electrodes on the scalp of the human subject. The electrode cap (BIOPAC CAP-100C) was attached to the OpenBCI Cyton platform (http://openbci.com), which was further connected to a desktop machine over a wireless channel. In the game design (developed on OpenAI Gym), we open a TCP port and continuously transmit the current state-action pair using the TCP/IP protocol. We used the OpenViBE software [46] to record the human EEG data. OpenViBE continuously listens to the TCP port (for state-action pairs) and timestamps the EEG data in a synchronized manner. A total of five human subjects were recruited (mean age 26.8 with a standard deviation of 1.92, 1 female) using standard procedures with their consent. For each subject-game pair, the experimental duration was less than 15 minutes. The agent took an action every 1.5 seconds during the experiment. All the research protocols for the user data collection were reviewed and approved by the University Institutional Review Board.
1) Game Environments:
We have developed three discrete grid-based navigational games in the OpenAI Gym Atari framework, namely Wobble, Catch, and Maze (Fig. 1(a)).
Wobble:
Wobble is a simple 1-D cursor-target game, where the middle horizontal plane is divided into 20 discrete blocks. At the beginning of the game, the cursor appears at the center of the screen, and the target appears no more than three blocks away from the cursor position. The action space consists of moving left or right. The game is finished when the cursor reaches the target. Once the game is finished, a new game is started with the cursor in place.
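As an illustration of the mechanics just described, here is a minimal Gym-style sketch of Wobble; the class name, reward convention (+1 on reaching the target, 0 otherwise), and episode logic are our own illustrative choices, not the authors' released implementation.

```python
import numpy as np

class WobbleEnv:
    """Minimal 1-D cursor-target game on 20 blocks (illustrative sketch)."""
    N_BLOCKS = 20
    MOVES = (-1, +1)  # action 0: move left, action 1: move right

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.reset()

    def _new_target(self):
        # Target appears no more than three blocks away from the cursor.
        offset = self.rng.choice([-3, -2, -1, 1, 2, 3])
        self.target = int(np.clip(self.cursor + offset, 0, self.N_BLOCKS - 1))

    def reset(self):
        self.cursor = self.N_BLOCKS // 2  # cursor starts at the center
        self._new_target()
        return (self.cursor, self.target)

    def step(self, action):
        self.cursor = int(np.clip(self.cursor + self.MOVES[action], 0,
                                  self.N_BLOCKS - 1))
        done = self.cursor == self.target
        if done:
            self._new_target()  # a new game starts with the cursor in place
        return (self.cursor, self.target), (1.0 if done else 0.0), done, {}
```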
Catch:
Catch is a simplistic version of Eggomania (an Atari 2600 benchmark; https://en.wikipedia.org/wiki/Eggomania), where we display a single egg on the screen at a time. The screen dimensions are divided into a 10x10 grid space, where the egg and the cart each occupy one block. The action space of the agent consists of NOOP (no operation), left, and right. At the start of the game, the horizontal position of the egg is chosen randomly. At each time step, the egg falls one block in the vertical direction.
Maze:
Maze is a 2-D navigational game, where the agent has to reach a fixed target (shown with a plus symbol). The screen is divided into 10x10 square blocks. The action space consists of four directional movements. The only reward is the result of the episode, i.e., win or lose. If an agent moves but hits a wall, a quick blinking of the agent is displayed to render the action taken by the agent.
B. Advantages of using error-potentials
In our work, relying on error-potentials provides two primary benefits:
(a) Provides a generalized notion of error-detection:
Error-potentials are elicited by incorrect feedback in a diverse set of tasks [25], implying that the error-processing system is sensitive to a generalized notion of error detection. Error-potentials are observed across a wide variety of input modalities (e.g., audio [19], visual [21], somatosensory [40], etc.). This generalized mechanism is one of the characteristic advantages of error-potentials, unlike other brain potentials that are specific to a stimulus or modality. For instance, the P600, N300, P300, and N200 are elicited when a subject is presented with syntactic anomalies in sentences [43], semantically inconsistent word and picture pairs [39], interruption of a stimulus with another divergent stimulus [52], and detection of a mismatch in a stimulus [24], respectively.

Fig. 1: Experimental framework: (a) Game Environments, (b) Experiment Bench.

(b)
Evolutionary Significance:
Error-potentials in primates are strong and universal (exhibiting similar behavior across individuals), as they have an evolutionary significance due to their importance in cognition, learning, and survival. Error-potentials enable the learning process via the administration of rewards and punishments in the Anterior Cingulate Cortex (ACC) [29]. [42] found error-recognition units in monkeys' anterior cingulate sulcus that were activated when the animals received negative feedback in the form of the absence of an expected reward. Similarly, [26] found that when monkeys made errors in a simple response task, error-related potentials were generated in the anterior cingulate sulcus, thereby advocating that ErrPs link human and non-human primates on the basis of error monitoring.
C. Motivation: Using intrinsic error-potentials over manual labeling
Experimental Methodology:
We conducted an experiment in which subjects were asked to label the actions of an AI agent in a maze. If the agent took a correct action, they needed to press a certain key, and if the agent performed a wrong action, they were supposed to press another key. We conducted this experiment to find differences between manual labeling and labeling using EEG experiments in terms of user comfort and labeling accuracy. The methodology of this experiment was simple. We designed a maze game and generated 3 instances of it, where each instance got progressively faster (to study the impact of time pressure on mental comfort and accuracy). The first instance had a time delay of 1.5 seconds between successive actions of the agent, while the second and the third had delays of 1.0 and 0.5 seconds, respectively. Subjects were required to play 3 trials in each instance (thus totaling 9 trials overall), where the order of instances was randomized to avoid biasing the users toward a particular order of the game. Once subjects finished playing 3 trials of an instance, they were redirected to a Qualtrics survey where they had to provide their subjective feedback about the experiment. Thus, there were 3 forms that each subject had to fill out (one per instance). We used Amazon's Mechanical Turk to request anonymous workers to complete this task. The study protocol was approved by the Institutional Review Board.
Results:
We obtained a total of 281 responses (87, 91, and 103 unique user responses for the 1.5s, 1.0s, and 0.5s instances of the game, respectively). On average, for the 1.5s instance of the game, we obtained true positive rates of 56.6% and 41.5% for correct and incorrect actions of the maze agent, respectively, and feedback latencies of 376ms and 540ms for correct and incorrect actions, respectively. Note that the correct and incorrect actions of the agent correspond to non-ErrP and ErrP events, respectively, during EEG experiments. For the 1.0s instance of the game, we obtained true positive rates of 49.8% and 38.8%, and feedback latencies of 288ms and 456ms, for correct and incorrect actions, respectively. For the 0.5s instance of the game, we obtained true positive rates of 34.9% and 14.1%, and feedback latencies of 179ms and 207ms, for correct and incorrect actions, respectively. However, some trials in these experiments had a very poor labeling rate (some subjects labeled less than 50% of the available actions). In order to prevent the results from being swayed by inert participants, we decided to separate them from the operational participants.

We removed the trials of users with less than a 50% feedback rate, i.e., the trials where participants failed to provide feedback for at least 50% of all the actions. After this filtering, on average, for the 1.5s instance of the game, we obtained true positive rates of 74.1% and 53.4%, and feedback latencies of 364ms and 539ms, for correct and incorrect actions, respectively. For the 1.0s instance, we obtained true positive rates of 69.8% and 52.6%, and feedback latencies of 290ms and 451ms, for correct and incorrect actions, respectively. For the 0.5s instance, we obtained true positive rates of 56.4% and 21.6%, and feedback latencies of 177ms and 203ms, for correct and incorrect actions, respectively. These values are summarized in Table I.
Insights:
As we can clearly see, the accuracy values for correct and incorrect actions both decrease as the time interval decreases. The labeling accuracy for correct actions is higher than that for incorrect actions, and both decrease as the time latency is decreased (thereby increasing time pressure). Even the best accuracy for incorrect actions is about 53.4% (only marginally better than random labeling). This compares rather poorly with the labeling accuracy obtained using ErrPs.

TABLE I: Accuracy and latency for maze game manual labeling

Time Interval (s) | Subjects | Non-ErrP TPR (%) | ErrP TPR (%) | Non-ErrP latency (ms) | ErrP latency (ms)
1.5 | 87  | 74.1 | 53.4 | 364 | 539
1.0 | 91  | 69.8 | 52.6 | 290 | 451
0.5 | 103 | 56.4 | 21.6 | 177 | 203

Based on the qualitative survey responses, on a scale of 1 to 7, the users gave the 1.5s instance of the game an average comfort rating of 5.4, which declined to 4.9 and 3.9 for the 1.0s and 0.5s instances, respectively. On being asked whether they were able to mark all actions correctly, 40% of the subjects answered in the affirmative in the 1.5s instance of the game, which declined to 26% and 14% in the 1.0s and the 0.5s instances. Across the board, the majority of the participants reported that the ideal time interval for them to correctly label all actions of the agent would be between 1.5s and 3.0s or larger. 64% of the participants in the 1.5s instance reported that reducing the time interval of the game to 1.0s would decrease their labeling accuracy, and 69% reported that it would increase their mental burden. 52% of the participants in the 1.0s instance reported that reducing the time interval of the game to 0.5s would decrease their labeling accuracy, and 60% reported that it would increase their mental burden. In contrast, 64% of the participants in the 1.0s instance reported that increasing the time interval from 1.0s to 1.5s would increase their labeling accuracy and decrease their mental burden. 49% of the participants in the 0.5s instance reported that reducing the time interval of the game further would decrease their labeling accuracy, and 53% reported that it would increase their mental burden. To summarize, the users felt increasing discomfort and cognitive burden as the time latency reduced from 1.5s to 1.0s and further to 0.5s. They also reported that the optimal time latency for comfortable manual labeling would be between 1.5s and 3s. This was also evident from the fact that more than 60% of the participants anticipated a reduction in their accuracy if the time latency were to be decreased from 1.5s.
IV. INTEGRATING RL WITH IMPLICIT HUMAN FEEDBACK: A NAIVE APPROACH
In this section, we provide our baseline contribution: (i) we demonstrate the feasibility of capturing the error-potentials of a human subject watching an RL agent learning to play several different games, and then decoding the human feedback (judgment) on the observed state-action pair appropriately, and (ii) we use the decoded feedback as an auxiliary reward function to accelerate the learning of the RL agent.
A. Obtaining the Implicit Human Feedback: Decoding ErrPs
We rely on the Riemannian geometry framework for the classification of the human's intrinsic reaction (captured in the form of ErrPs) [5]. We treat the classification of error-related potentials as a binary classification task indicating the presence (i.e., the action taken by the agent is incorrect) or absence of error (i.e., the action taken by the agent is correct). The raw EEG data is bandpass filtered in [0.5, 40] Hz. Epochs of 800ms were extracted relative to a pre-stimulus 200ms baseline, and were subjected to spatial filtering. In spatial filtering, prototype responses of each class, i.e., "correct" and "erroneous", are computed by averaging all training trials in the corresponding classes ("xDAWN spatial filter" [47], [16]). xDAWN filtering projects the EEG signals from the sensor space (i.e., electrode space) to the source space (i.e., a low-dimensional space constituted by the actual neuronal ensembles in the brain firing coherently). The covariance matrix of each epoch is computed and concatenated with the prototype responses of the class. Further, dimensionality reduction is achieved by selecting relevant channels through backward elimination [3]. The filtered signals are projected to the tangent space [4] for feature extraction. The obtained feature vector is first normalized (using the L1 norm) and fed to a regularized regression model. A threshold value is selected for the final decision by maximizing accuracy offline on the training set. We present the algorithm to decode the ErrP signals in Algorithm 1.
Algorithm 1:
Riemannian geometry based ErrP classification algorithm
Input: raw EEG signals
1. Pre-process the raw EEG signals;
2. Spatial filtering: xDAWN spatial filter (nfilter);
3. Electrode selection: ElectrodeSelect (nelec, metric='riemann');
4. Tangent space projection: TangentSpace (metric='logeuclid');
5. Normalize using the L1 norm;
6. Regression: ElasticNet;
7. Select the decision threshold by maximizing accuracy on the training set.
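For concreteness, the pipeline of Algorithm 1 can be approximated with off-the-shelf components from pyriemann and scikit-learn, as in the sketch below. The epoching is assumed done upstream, the parameter values (n_filter, n_elec, alpha) are illustrative, and the final thresholding on training-set accuracy is shown only schematically.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import ElasticNet
from pyriemann.estimation import XdawnCovariances
from pyriemann.channelselection import ElectrodeSelection
from pyriemann.tangentspace import TangentSpace

def build_errp_decoder(n_filter=4, n_elec=16, alpha=1e-2):
    """ErrP decoder mirroring Algorithm 1 (parameter values are illustrative)."""
    return make_pipeline(
        # Covariances of xDAWN-filtered epochs, augmented with class prototypes
        XdawnCovariances(nfilter=n_filter),
        # Channel selection based on the Riemannian distance between class means
        ElectrodeSelection(nelec=n_elec, metric='riemann'),
        # Project covariance matrices to the tangent space (log-Euclidean metric)
        TangentSpace(metric='logeuclid'),
        # L1-normalize the feature vector, then regularized regression
        Normalizer(norm='l1'),
        ElasticNet(alpha=alpha),
    )

# X: epochs of shape (n_trials, n_channels, n_samples), already bandpass
# filtered in [0.5, 40] Hz and epoched to 800 ms; y: 1 = ErrP, 0 = non-ErrP.
# decoder = build_errp_decoder(); decoder.fit(X_train, y_train)
# scores = decoder.predict(X_train)
# thr = max(np.unique(scores), key=lambda t: np.mean((scores >= t) == y_train))
# y_hat = (decoder.predict(X_test) >= thr).astype(int)
```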
B. The Full Access Method
A naive approach to integrate the human feedback with RL models is reward shaping with full access. Human feedback is obtained on every visited state-action pair while the RL agent is learning, and a negative penalty is added to the environmental reward whenever an ErrP is detected. The evaluation results of this method based on real ErrP data are presented later in the evaluation section (Section VI-A), validating that the full access method can significantly accelerate the learning of the RL agent. However, obtaining human feedback for every state-action pair is time-intensive and not practically feasible. In the next section, we provide our novel contributions to practically obtain and integrate the implicit feedback with the learning of the RL agent.
V. TOWARDS PRACTICAL INTEGRATION OF RL WITH IMPLICIT HUMAN FEEDBACK
In this section, we propose two approaches to enable the deployment of ErrP-augmented RL in practical systems. First, we show that the ErrPs of an observer can be learned in a zero-shot manner, i.e., the ErrP decoder can be trained for a specific game, and the trained decoder can be used as-is for another game without re-training. To combat the practical issues of obtaining ErrP labels for every state-action pair, we propose an RL framework (motivated by imitation learning approaches) that allows humans to provide their feedback on a few trajectories prior to the learning of the RL agent. This dramatically reduces the number of feedback labels required from the human observer.
A. Zero-shot learning of ErrPs
Error-potentials in EEG signals are studied under two major paradigms in human-machine interaction tasks: (i) feedback and response ErrPs, i.e., errors made by the human [11], [20], [7], [44], and (ii) interaction ErrPs, i.e., errors made by the machine in interpreting human intent [22]. Another interesting paradigm is observation ErrPs, where the human is watching (and silently assessing) the machine performing a specific task [14]. The manifestation of these potentials across these paradigms was found to be quite similar in terms of their general shape, negative and positive peak latencies, and frequency characteristics [22], [14]. This prompts us to explore the consistency of the error-potentials across different environments (i.e., games, in our case) within the observation ErrPs paradigm. In Figure 2, we plot the grand average waveforms across the three environments (Maze, Catch, and Wobble) to visually validate the consistency of the potentials. We can see that the shape of the negativity and the peak latency are quite consistent across the three game environments. Further, in evaluation Section VI-B1, we experimentally demonstrate the zero-shot learning of error-potentials.
B. Robust Reward Shaping using Human Feedback
RL algorithms deployed in environments with sparse rewards demand heavy exploration (requiring a large number of trial-and-errors) during the initial stages of training. Previous work on reward shaping with human feedback [10], [34], [53], [57], [60] builds a specific model to generalize human feedback over the state space, without tackling wrong feedback. Inspired by the soft-Q policy [28], we develop a novel framework for learning an auxiliary reward from human feedback to accelerate the training of the RL agent, with robustness to mistakes in ErrP labeling.

In this framework, implicit human feedback is required on all state-action pairs along trajectories (demonstrations) that are randomly generated initially. Before the RL agent starts learning, human subjects are asked to observe a number of trajectories, and their implicit feedback in the form of ErrPs on the corresponding state-action pairs is recorded in a dataset. Then the auxiliary reward function $r_a(\cdot,\cdot)$ is learned from these trajectories labeled by human feedback. It also discovers the human decision or intention hidden behind the implicit feedback, in the form of ErrPs. During RL training, the learned reward function acts as a proxy for the human feedback, compensating for the very sparse reward of the environment. The flowchart of the proposed learning framework is shown in Figure 3. Differently from the naive baseline full access method, in this approach queries for human feedback (ErrP labeling) are required only on the trajectories generated initially, instead of at every learning step during training. Hence, the total number of ErrP queries can be reduced significantly, further reducing the cognitive load of the implicit feedback on the human-in-the-loop.

Trajectory Generation:
Constrained by the coherence requirement in EEG experiments, the trajectories for ErrP labeling have to be complete, containing every state-action pair from the beginning to the end of the game. However, the selected trajectories also have to cover the state space as much as possible, while not being too far away from the optimal solutions. This is essentially the trade-off between exploitation and exploration. We therefore propose to use Monte Carlo Tree Search (MCTS) [9], [37], [51] to generate random trajectories for the ErrP experiments. It tackles the exploration-exploitation trade-off via the Upper Confidence Bound (UCB) method [37], and does not require knowing the optimal solution a priori. MCTS is a general game-playing technique with recent success in discrete, turn-based, and non-deterministic game domains. We choose MCTS as the trajectory sampling algorithm for its proven high-level performance, domain generality, and variable computational bounds.

There is one node in the tree for each state $s$, containing a value $Q(s,a)$ and a visit count $N(s,a)$ for each action $a$, and an overall count $N(s) = \sum_a N(s,a)$. Each node is initialized to $Q(s,a) = 0$, $N(s,a) = 0$. The value is estimated by the mean return from $s$ in all simulations where action $a$ was selected from state $s$, and the only reward here is the result of the game, win or lose. At each state $s$ of the trajectory, the action is selected to be the maximizer of the objective
$$Q(s,a) + c\sqrt{\frac{\log N(s)}{N(s,a)}},$$
where $c$ trades off between reaching the target and exploring more of the state space [37]. At the end of generating each trajectory, the return is back-propagated into the Q values along the trajectory, i.e., $Q(s_t, a_t) := r + \gamma Q(s_{t+1}, a_{t+1})$. Only the first $K$ generated trajectories are used in the ErrP experiments.
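A minimal sketch of the UCB action selection and return backup described above; the dictionary-based tree representation, the exploration constant, and the incremental-mean backup are our own illustrative choices.

```python
import math
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)]: mean-return estimate
N = defaultdict(int)    # N[(s, a)]: visit count

def ucb_action(s, actions, c=1.0):
    """Pick argmax_a Q(s,a) + c*sqrt(log N(s) / N(s,a)); try unvisited first."""
    untried = [a for a in actions if N[(s, a)] == 0]
    if untried:
        return untried[0]
    n_s = sum(N[(s, a)] for a in actions)  # N(s) = sum_a N(s, a)
    return max(actions,
               key=lambda a: Q[(s, a)] + c * math.sqrt(math.log(n_s) / N[(s, a)]))

def backup(trajectory, game_result, gamma=0.99):
    """Propagate the game result (win/lose) back along the visited pairs."""
    g = game_result
    for s, a in reversed(trajectory):
        N[(s, a)] += 1
        Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]  # incremental mean update
        g = gamma * g
```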
Fig. 2: Manifestation of error-potentials in the time domain: grand average potentials (error-minus-correct conditions) are shown for the Maze, Catch, and Wobble game environments; per-subject traces (S01-S05) are shown, and the thick black line denotes the average over all subjects.

Fig. 3: Robust Reward Shaping with Human Feedback. The dashed arrow shows that trajectories in D ∪ D_R are all used in reward learning.
In the experiments for collecting ErrPs, the human subject provides implicit feedback (via ErrP) over all the generated trajectories, labeling every state-action pair as a positive or negative sample according to its correctness as judged by human intelligence. With the decoded ErrP labels over the trajectories as input, we propose a novel reward shaping method to incorporate the ErrP labels into the subsequent reinforcement learning process. It specifically tackles the problem of robustness against wrong ErrP labels, with details explained in the following section.

Reward Learning:
Since implicit human feedback via ErrP is noisy, unlike previous work, we do not use a neural network to approximate the human feedback directly. In order to improve robustness to errors, we first model the human policy as a soft-Q policy, i.e., a popular energy-based policy [27], [28], and instead learn the human Q function by solving a classification problem. Following the principle of maximum entropy [63], given the human Q function $Q_h(\cdot,\cdot)$, the human policy distribution and value function can be expressed as
$$\pi_h(a|s) = \exp\big((Q_h(s,a) - V_h(s))/\alpha\big), \quad V_h(s) = \alpha \log \sum_a \exp\big(Q_h(s,a)/\alpha\big), \qquad (1)$$
where $\alpha$ is a free parameter, tuned empirically. The likelihoods of a positive and a negative state-action pair are $\pi_h(a|s)$ and $1 - \pi_h(a|s)$, respectively. When the trajectories and the corresponding human feedback (ErrP labels) are ready, we learn the human Q function by maximizing the likelihood of both the positive and negative state-action pairs in the trajectories, which is to maximize the objective (3), where the binary variable $\mathrm{ErrP}(s,a)$ denotes the ErrP label obtained in the previous step. This is essentially a classification problem on states with ErrPs as labels and the Q function as logits. Hence, a naive choice of auxiliary reward for the subsequent RL agent is the Bellman difference of the human Q function, i.e., $Q_h(s,a) - \gamma \max_{a'} Q_h(s',a')$. However, due to the noise in ErrP decoding and the scarcity of ErrP labels, the function $Q_h$ learned by maximum likelihood does not have a shape compatible with the state dynamics of the target MDP (the environments in our experiments), and the resulting reward function can make the learning process of the RL agent unstable or even divergent.

In order to refine the reward shape and attenuate the gradient variance, we introduce a baseline function $t(s)$, defined only in terms of the state, to incorporate the state-transition information. Hence, the Q function becomes $Q_B(s,a) := Q_h(s,a) + t(s)$. It can be proved that $Q_B(\cdot,\cdot)$ and $Q_h(\cdot,\cdot)$ induce the same optimal policy [41]. The baseline function $t^*(\cdot)$ can be learned by optimizing $t^* = \arg\min_t J(t)$, whose objective is defined in (4), where the loss function $l(\cdot)$ is chosen to be a norm via empirical evaluations.

As shown in Figure 3, for learning the reward function, in addition to the demonstrations $D$, we incorporate another set of demonstrations $D_R$, containing transitions randomly sampled from the environment without reward information. The set $D_R$ helps the function $t(\cdot)$ learn the state dynamics efficiently, and does not require any human labeling. After reward learning, i.e., learning both $Q_h(\cdot,\cdot)$ and $t(\cdot)$, for any transition tuple $(s, a, s')$, the learned auxiliary reward function can be represented as
$$r_a(s,a) = Q_h(s,a) + t(s) - \gamma \max_{a' \in \mathcal{A}} \big[Q_h(s',a') + t(s')\big]. \qquad (2)$$
We then use this $r_a$ to augment the subsequent RL agent. In order to further attenuate the negative influence of wrong ErrP labels, when combining the environmental reward $r_e$ and the auxiliary reward $r_a$, we introduce a coefficient $\beta(e)$ that decreases exponentially with the learning episode $e$, i.e., $\beta(e) := a e^{-e/b}$. Finally, the reward received by the RL agent is $r_e(s_t, a_t) + \beta(e) r_a(s_t, a_t)$. Empirically, $a = 3$ gave the best coefficient function in our experiments.

VI. EVALUATION
A. Baseline results: Naive Approach
We first validate the feasibility of decoding error-potentials using a 10-fold cross-validation scheme for each game. In this scheme, we split the state-action pairs of a game into 10 folds for training and testing of the ErrP decoder. In Figure 4(a), we show the performance on the three games in terms of the Area Under Curve (AUC) score, sensitivity, and specificity, averaged over the 5 subjects. The Maze game has the highest AUC score (0.89). The full access method is the most preliminary approach to incorporate implicit human feedback (in the form of decoded error-potentials) into the DRL model. It asks the external oracle (human) for implicit feedback at every training step, reaching the maximum number of possible queries. Hence, it has the fastest training convergence rate.

Fig. 4: Baseline results of the Naive Approach: (a) 10-fold CV performance for each game without any zero-shot learning, (b) and (c) RL with full access to ErrP feedback.
Algorithm 2:
Robust Reward Shaping with Human ErrP
Input: trajectories given initially.
1. Conduct EEG experiments to label the correctness along the trajectories via human ErrPs;
2. With the ErrP data collected in the experiments, use Algorithm 1 to decode the ErrP labels, i.e., $\mathrm{ErrP}(\cdot,\cdot)$;
3. Learn the human Q function $Q_h(\cdot,\cdot)$ by maximizing
$$J(Q_h) := \sum_{(s,a) \in D} \pi_h(a|s)\big(1 - \mathrm{ErrP}(s,a)\big) + \big(1 - \pi_h(a|s)\big)\mathrm{ErrP}(s,a), \qquad (3)$$
and learn the reward shaping function $t(\cdot)$ by minimizing
$$J(t) := \sum_{(s,a,s') \in D \cup D_R} l\Big(Q_h(s,a) - t(s) - \gamma \max_{a' \in \mathcal{A}} \big(Q_h(s',a') - t(s')\big)\Big); \qquad (4)$$
then we have $Q_B(s,a) := Q_h(s,a) - t(s)$;
4. Pass the auxiliary reward function $r_a$ (2) to the RL agent;
5. The RL agent solves the problem with reward function $r_e(s,a) + \beta(e) r_a(s,a)$ using any RL algorithm.

We use this method as a benchmark for comparing the sample efficiency of the proposed RL framework. The evaluation metric adopted here is the success rate, which is the ratio of successful plays over the last 32 episodes. Training converges and terminates at the complete episode, when the success rate reaches 1. The results with real ErrP data from 5 subjects are shown in Figure 4(b,c). We can see a significant improvement in training convergence for all subjects. Here, No ErrP refers to the BDQN performance without integrating the human feedback. In all plots of this paper, solid lines are average values over 10 random seeds, and shaded regions correspond to one standard deviation. We use BDQN (as introduced in Section III) as the DRL model for all experiments conducted in this paper. However, the ErrP feedback here can be used to augment any RL algorithm.
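The reward-learning step of Algorithm 2 can be sketched in PyTorch as follows. The network architectures, dimensions, the squared loss for $l(\cdot)$, and the decay constant b are our own assumptions (the paper leaves them unspecified), and the sign convention for $t(s)$ follows Algorithm 2 ($Q_B = Q_h - t$); since $t$ is an arbitrary learned baseline, the sign is only a convention.

```python
import math
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 100, 4  # e.g., a one-hot 10x10 grid, 4 moves (illustrative)
ALPHA, GAMMA = 1.0, 0.99       # soft-Q temperature (tuned empirically), discount

def mlp(din, dout, hidden=64):
    return nn.Sequential(nn.Linear(din, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, dout))

q_h = mlp(STATE_DIM, N_ACTIONS)  # human Q function Q_h(s, .)
t_fn = mlp(STATE_DIM, 1)         # state-only baseline t(s)

def loss_q(s, a, errp):
    """Eq. (3): maximize likelihood of labeled pairs; pi_h = softmax(Q_h/alpha).
    s: (B, STATE_DIM) floats; a: (B,) long; errp: (B,) floats in {0, 1}."""
    p = torch.softmax(q_h(s) / ALPHA, dim=1).gather(1, a[:, None]).squeeze(1)
    return -(p * (1 - errp) + (1 - p) * errp).mean()

def loss_t(s, a, s_next):
    """Eq. (4) over D u D_R, with l(.) taken to be the squared error."""
    q_sa = q_h(s).gather(1, a[:, None]).squeeze(1)
    shaped_next = (q_h(s_next) - t_fn(s_next)).max(dim=1).values
    resid = q_sa - t_fn(s).squeeze(1) - GAMMA * shaped_next
    return (resid ** 2).mean()

@torch.no_grad()
def r_aux(s, a, s_next):
    """Eq. (2): r_a(s,a) = Q_B(s,a) - gamma * max_a' Q_B(s',a')."""
    q_b = lambda x: q_h(x) - t_fn(x)  # broadcast t(s) over actions
    return (q_b(s).gather(1, a[:, None]).squeeze(1)
            - GAMMA * q_b(s_next).max(dim=1).values)

def shaped_reward(r_env, r_a, episode, a_coef=3.0, b_coef=100.0):
    """Total reward r_e + beta(e)*r_a with beta(e) = a*exp(-e/b) (b assumed)."""
    return r_env + a_coef * math.exp(-episode / b_coef) * r_a
```

Training alternates (or jointly runs) gradient steps on loss_q over the ErrP-labeled set D and on loss_t over D ∪ D_R, after which r_aux is frozen and handed to the RL agent.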
B. Evaluation of the Proposed Solution
In this subsection, we evaluate the performance of the proposed approaches to practically integrate the implicit human feedback (via EEG) into the DRL algorithms.
1) Zero-shot learning of ErrPs:
To evaluate the zero-shot learning capability of the error-potentials and the decoding algorithm, we train on the samples collected from the Catch game and test on the Maze game. As Catch is a simple game, we assume the optimal action for each state is already known (providing the labeled examples to train the ErrP decoder). However, the Maze game needs to be solved; hence, we do not make any assumptions about the optimality of the actions. In Figure 5(a), we provide the zero-shot learning performance and compare it against the 10-fold Cross-Validation (CV) scheme of Section VI-A. Further, we present the AUC scores of the zero-shot learning performance over all training and testing combinations in Figure 5(b). We use the Area Under Curve (AUC) as the performance metric for the decoding of error-potentials. We can see that the ErrPs recorded for the Catch game are able to capture more than 80% of the variability in the ErrPs for the Maze game. Averaged over the 5 subjects, the decoder performs with an AUC score of 0.8078 when trained on the Catch game, compared with a performance of 0.693 when trained using Wobble labels. Similarly, Catch and Wobble perform with average AUC scores of 0.790 and 0.680, respectively, when trained on labels obtained through the Maze environment. These experiments validate that the error-potentials can be learned in a zero-shot manner, avoiding re-training of the human feedback (via EEG) decoder.

Fig. 5: Zero-shot learning of ErrPs: (a) from Catch to Maze over subjects, compared with 10-fold CV, (b) over all combinations of the three games, compared with 10-fold CV.
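Using the decoder pipeline sketched earlier (after Algorithm 1), the zero-shot protocol amounts to fitting on one game's epochs and scoring another's; a minimal sketch, assuming epoch arrays X_catch/X_maze and labels y_catch/y_maze have already been extracted:

```python
from sklearn.metrics import roc_auc_score

# Train the ErrP decoder on the simple, known game (Catch) ...
decoder = build_errp_decoder()
decoder.fit(X_catch, y_catch)

# ... and evaluate it as-is on the unseen game (Maze): zero-shot transfer.
auc = roc_auc_score(y_maze, decoder.predict(X_maze))
print(f"zero-shot AUC (Catch -> Maze): {auc:.3f}")
```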
2) Evaluation of Robust Reward Shaping with Human ErrP:
For the evaluation of Algorithm 2, stochastic trajectories for the Maze game were generated by random walks in the state space, such that most of the state space was covered. Before training the RL agent, each human subject provided implicit feedback (via ErrP), as explained in the experimental protocol (Section III-A), on every state-action pair along these trajectories. The performance of the proposed approach was evaluated with 10 and 20 initial trajectories, each for the 5 subjects. We use the Bayesian DQN as the DRL model. The acceleration due to human feedback is shown in Figure 6(a) for 10 trajectories, where the base model is the Bayesian DQN. We can see the significant acceleration of training convergence in Figure 6(a) in terms of the success rate for the 5 subjects, compared against the case of
No ErrP, i.e., no human feedback. Subject 01 has the highest fidelity of error-potentials; hence, the RL algorithm converges at a much faster rate when it relies upon the feedback obtained from Subject 01. It is evident from the results that the error-potential decoding performance is sufficient to achieve around a 2x improvement in training time (in terms of the number of episodes required). Similarly, Figure 6(b) shows the success rate and convergence curve for training to completion with 20 trajectories. Comparing Figures 6(a) and (b), we can see that training converges at a much faster rate when the number of initial trajectories is increased. Further, the learning variance also decreases with more trajectories. The comparison between Figure 6 and Figure 4(c) shows that the proposed framework learns faster than the No-ErrP case, while outperforming the full access case, even though full access requires a significantly larger number of queries. We also compare the number of ErrP queries for full access and the proposed method in Table II, based on the statistics of the experiments with 20 trajectories. On average over the 5 subjects, the proposed approach makes 75.56% fewer queries compared to full access. Since full access queries for a feedback label at every learning step, while the proposed framework queries only on the trajectories given initially, the total number of queries is reduced significantly.

Fig. 6: Evaluation of the proposed reward shaping method: (a) 10 trajectories, (b) 20 trajectories.

TABLE II: Average Number of Queries on Maze Game (full access vs. the proposed method, subjects 01-05).
3) Analysis of the dependence and subjectivity of errors:
In this section, we analyze the detection accuracy of error-potentials for the Maze game, to develop insights into the characteristics of error-potentials based on the users and the provided stimulations. The EEG samples recorded for the Maze experiment can be organized along two independent dimensions: (i) users and (ii) state-action pairs of the agent (i.e., stimulations). Within the state-action pairs, if the action is correct, it is called non-ErrP, otherwise ErrP.

TABLE III: Accuracy and standard deviations per subject

Subject | ErrP mean | ErrP std dev | non-ErrP mean | non-ErrP std dev
S12 | 0.79 | 0.27 | 0.75 | 0.17
S07 | 0.8  | 0.3  | 0.85 | 0.16
S02 | 0.73 | 0.29 | 0.77 | 0.15
S08 | 0.6  | 0.25 | 0.56 | 0.14
S01 | 0.8  | 0.25 | 0.77 | 0.16
S04 | 0.78 | 0.25 | 0.63 | 0.16
S16 | 0.73 | 0.3  | 0.78 | 0.14
S03 | 0.65 | 0.25 | 0.61 | 0.13
S06 | 0.73 | 0.3  | 0.64 | 0.17
S05 | 0.75 | 0.3  | 0.72 | 0.13
S09 | 0.71 | 0.27 | 0.66 | 0.13
S15 | 0.67 | 0.31 | 0.65 | 0.1
Average | 0.73 | 0.28 | 0.70 | 0.14

• Experiment 1: Subjectivity over correct and incorrect actions.
For each user, we divide the EEG trials into two categories: (a) ErrP and (b) non-ErrP. For each user and category, we compute the mean and standard deviation of the classification accuracy of the EEG trials, presented in Table III. We observe that the per-user standard deviations for ErrPs are roughly double the standard deviations for non-ErrPs. The aggregate per-user standard deviation across SAPs is 0.28 for ErrPs and 0.14 for non-ErrPs. This difference in per-user standard deviations is statistically significant.
• Experiment 2: Subjectivity over users.
In this experiment, for each unique state-action pair, we average the performance of the EEG trials of all users. We obtained a mean and standard deviation of 0.75 and 0.13 for ErrP, and 0.75 and 0.07 for non-ErrP, respectively. We use Levene's test [38] to conclude that the difference in variance between these two population samples is statistically significant (see the code sketch after this list).
• Experiment 3: Subjectivity over states.
For each unique state in the Maze game, we plot the mean and standard deviation of the EEG trial performance in Fig. 7. We plot the classifier accuracy for ErrP and non-ErrP SAPs based on their initial state on the maze. We can see that the plot corresponding to the standard deviation for non-ErrPs is darker (indicating lower standard deviation) compared to the plot corresponding to the deviations for ErrPs. We can also see that within a plot there is a gradation in the accuracy (indicated by different shades of green), implying that there is some dissimilarity among erroneous states, and hence subjectivity on the user's part, diminishing the argument that erroneous vs. non-erroneous scenarios are purely binary.
• Experiment 4: Errors of commission and omission.
In this experiment, we consider only the erroneous actions and split the EEG trials into two categories: (i) commission errors and (ii) omission errors. A commission error is defined as the agent making an incorrect move to a new cell, while an omission error refers to the incorrect action of the agent staying in the same grid cell. The state-action pairs for commission and omission are distributed fairly evenly (out of 71 unique state-action pairs, 34 correspond to errors of omission and the remaining 37 correspond to errors of commission). However, we observe that among the state-action pairs with very high accuracies, those corresponding to errors of commission are disproportionately represented. All of the top 5 state-action pairs with the highest accuracy represent errors of commission, and 9 of the top 10 signify errors of commission. This is also reflected in the fact that errors of omission had a mean accuracy of 72%, whereas errors of commission had a much higher mean accuracy of 77%. This implies that the error scenarios that are the easiest to detect are likely to be errors of commission, which bolsters the hypothesis that certain errors are indeed more "valuable" to a user than others and hence generate a far more noticeable response in the brain.

These 4 experiments collectively lead us to 2 main insights:
(a) Per subject, owing to the differences in variances, there is less variation in the non-ErrP accuracies compared to the ErrP accuracies, implying that erroneous scenarios lead to more variation in the classifier accuracy, and by extension in the brain's response, than non-erroneous scenarios. This further implies that there is a gradation in error detection, rather than it being a binary phenomenon, which makes certain errors easier to detect and others more difficult to detect.
(b) The difference in the variation of classifier accuracy between ErrP and non-ErrP SAPs diminishes when we average the accuracies over the SAPs and represent them as a function of users. This implies that the variation in ErrP vs. non-ErrP accuracy is impacted more by differences in SAPs than by differences in users.
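The variance comparisons in Experiments 1 and 2 can be reproduced with a standard Levene's test, as in the sketch below; the arrays of per-user or per-SAP accuracies (errp_acc, non_errp_acc) are assumed to be available from the analysis above.

```python
from scipy.stats import levene

# errp_acc, non_errp_acc: accuracy samples for the ErrP and non-ErrP
# populations (e.g., per-SAP accuracies averaged over users, Experiment 2).
stat, p_value = levene(errp_acc, non_errp_acc)
if p_value < 0.05:
    print(f"variances differ significantly (W={stat:.3f}, p={p_value:.4f})")
```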
4) Robustness Evaluation:
Because the generation process and the decoding of brain signals are stochastic, robustness to wrong ErrP labels is important when incorporating human feedback (via EEG) into the reward shaping method. We show that modeling the human policy as a soft-Q policy, as in (1), makes the learned auxiliary function $r_a$ resistant to wrong human feedback. In the robustness comparison, the baseline method, called "simple", simply uses a bootstrapped neural network to generalize the binary ErrP labels across the state space, as in [60]. Both the simple benchmark and the proposed robust reward shaping are trained on the same set of trajectories and human labels. The neural network in both methods is an MLP with two hidden layers of 64 units, and the number of bootstrap heads in the "simple" benchmark is set to 5. We evaluate both the simple and the proposed methods on subjects 02 and 07. The comparison results are shown in Figure 9. We can see that the proposed method performs better for both subjects with different numbers of initial trajectories. This is because the proposed method treats the human feedback in a probabilistic way, and the baseline function $t$ can incorporate the state-transition information to attenuate the influence of wrong human feedback. Moreover, the comparison across all cases shows that the performance gain of the simple benchmark over the no-ErrP method decreases as the error probability of the human labels increases.

Fig. 7: Differences between ErrP and non-ErrP accuracies for each initial state over all users: (a) ErrP accuracy mean, (b) ErrP accuracy std deviation, (c) non-ErrP accuracy mean, (d) non-ErrP accuracy std deviation.

Fig. 8: Accuracy vectors per SAP: (a) ErrP vector per SAP, (b) non-ErrP vector per SAP.
5) Ablation Study:
In this section, we conduct an ablation study of the proposed robust reward shaping with human feedback. We first specifically evaluate the effect of the baseline function $t$ learned from (4). Because the human feedback labels in the initial trajectories cannot cover the whole state space, and some labels are wrong, the learned human Q function $Q_h(\cdot,\cdot)$ may not be compatible with the state dynamics of the environment. Thus, we introduce a baseline function, defined only in terms of the state, to smoothen the learned Q function. The ablation evaluation of the baseline function again uses subjects 02 and 07, as in the section above. The ablated variant is the proposed reward shaping method without $t$ in (2), where the auxiliary reward function is only the Bellman difference between adjacent states. The comparison results are shown in Figure 9. We can see that the baseline function improves the convergence speed in all cases, and it can even do better than the simple method in some cases, showing the importance of the baseline function.

In addition, we conduct an ablation study on the combining coefficient $\beta(\cdot)$. The benchmark method directly sums the auxiliary reward $r_a$ and the environmental reward $r_e$; the proposed method uses the exponentially decreasing $\beta(e)$ described above. The comparison results are shown in Figure 9. We can see that this exponentially decreasing coefficient stabilizes the training process significantly, and hence improves the convergence speed.

Fig. 9: Ablation study of the proposed reward shaping method: (a) Subject 02, 10 trajectories, (b) Subject 02, 20 trajectories, (c) Subject 07, 10 trajectories, (d) Subject 07, 20 trajectories.

VII. CONCLUSIONS AND FUTURE WORK
In this work, we investigate an interesting paradigm for obtaining and integrating implicit human feedback with RL algorithms. We first demonstrate the feasibility of obtaining implicit human feedback by capturing the error-potentials of a human observer watching an agent learning to play several different visual-based games, decoding the signals appropriately, and using them as an auxiliary reward function to help an RL agent. We then argue that the definition of ErrPs can be learned in a zero-shot manner across different environments, eliminating the need for re-training over new and unseen environments. We validate the acceleration in the learning of games through augmenting the RL agent with ErrP feedback using a naive approach, i.e., the full access method. We then propose a novel RL framework, improving the label efficiency and reducing the human cognitive load. We experimentally show that the proposed RL framework can accelerate the training of the RL agent by 2.25x, while reducing the number of queries required by 75.56%.
Scope and Future Work:

The scope of our work is limited to visually based RL problems with discrete state and action spaces. Moreover, the demonstration of zero-shot learning of error-potentials is limited to the environments presented in this paper; we considered reasonably complex, discrete, grid-based navigational games. Further studies are needed to explore whether the approach extends to Atari and robotic environments with very large state spaces and continuous action spaces. We plan to test our framework in robotic environments and to evaluate the zero-shot learning capability of error-potentials between virtual and physical worlds.
REFERENCES

[1] Riku Arakawa, Sosuke Kobayashi, Yuya Unno, Yuta Tsuboi, and Shin-ichi Maeda. DQN-TAMER: Human-in-the-loop reinforcement learning with intractable feedback. arXiv preprint arXiv:1810.11748, 2018.
[2] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. pages 1–9. IEEE, 2018.
[3] Alexandre Barachant and Stéphane Bonnet. Channel selection procedure using Riemannian distance for BCI applications. pages 348–351. IEEE, 2011.
[4] Alexandre Barachant, Stéphane Bonnet, Marco Congedo, and Christian Jutten. Classification of covariance matrices using a Riemannian-based kernel for BCI applications. Neurocomputing, 112:172–178, 2013.
[5] Alexandre Barachant and Marco Congedo. A plug&play P300 BCI using information geometry. arXiv preprint arXiv:1409.0107, 2014.
[6] Shlomo Bentin, Gregory McCarthy, and Charles C. Wood. Event-related potentials, lexical decision and semantic priming. Electroencephalography and Clinical Neurophysiology, 60(4):343–355, 1985.
[7] Benjamin Blankertz, Guido Dornhege, Christin Schäfer, Roman Krepki, Jens Kohlmorgen, K.-R. Müller, Volker Kunzmann, Florian Losch, and Gabriel Curio. Boosting bit rates and error detection for the classification of fast-paced motor commands based on single-trial EEG analysis. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11(2):127–131, 2003.
[8] Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. arXiv preprint arXiv:1904.06387, 2019.
[9] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[10] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[11] Cameron S. Carter, Todd S. Braver, Deanna M. Barch, Matthew M. Botvinick, Douglas Noll, and Jonathan D. Cohen. Anterior cingulate cortex, error detection, and the online monitoring of performance. Science, 280(5364):747–749, 1998.
[12] Hyeong Soo Chang. Reinforcement learning with supervision by combining multiple learnings and expert advices. IEEE, 2006.
[13] L. I. Charles and R. S. Christian. A social reinforcement learning agent. In Proceedings of the Fifth International Conference on Autonomous Agents, 2001.
[14] Ricardo Chavarriaga and José del R. Millán. Learning from EEG error-related potentials in noninvasive brain-computer interfaces. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 18(4):381–388, 2010.
[15] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
[16] Marco Congedo, Alexandre Barachant, and Anton Andreev. A new generation of brain-computer interface based on Riemannian geometry. arXiv preprint arXiv:1310.8115, 2013.
[17] Christian Daniel, Oliver Kroemer, Malte Viering, Jan Metz, and Jan Peters. Active reward learning with a novel acquisition function. Autonomous Robots, 39(3):389–405, 2015.
[18] Layla El Asri, Bilal Piot, Matthieu Geist, Romain Laroche, and Olivier Pietquin. Score-based inverse reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 457–465. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
[19] Michael Falkenstein, Joachim Hohnsbein, Joerg Hoormann, and L. Blanke. Effects of crossmodal divided attention on late ERP components. II. Error processing in choice reaction tasks. Electroencephalography and Clinical Neurophysiology, 78(6):447–455, 1991.
[20] Michael Falkenstein, Jörg Hoormann, Stefan Christ, and Joachim Hohnsbein. ERP components on reaction errors and their functional significance: A tutorial. Biological Psychology, 51(2-3):87–107, 2000.
[21] Michael Falkenstein, Jörg Hoormann, Stefan Christ, and Joachim Hohnsbein. ERP components on reaction errors and their functional significance: A tutorial. Biological Psychology, 51:87–107, 2000.
[22] Pierre W. Ferrez and José del R. Millán. You are wrong!—Automatic detection of interaction errors from brain waves. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 2005.
[23] Pierre W. Ferrez and José del R. Millán. Error-related EEG potentials generated during simulated brain–computer interaction. IEEE Transactions on Biomedical Engineering, 55(3):923–929, 2008.
[24] J. R. Folstein and C. Van Petten. Influence of cognitive control and mismatch on the N2 component of the ERP: A review. Psychophysiology, 2008.
[25] William Gehring, Michael Coles, David Meyer, and Emanuel Donchin. A brain potential manifestation of error-related processing [supplement]. Electroencephalography and Clinical Neurophysiology. Supplement, 44:261–272, 1995.
[26] H. Gemba, K. Sasaki, and V. B. Brooks. 'Error' potentials in limbic cortex (anterior cingulate area 24) of monkeys during motor learning. Neuroscience Letters, 70:223–227, 1986.
[27] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1352–1361. JMLR.org, 2017.
[28] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[29] Clay B. Holroyd and Michael G. H. Coles. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109(4):679–709, 2002.
[30] Clay B. Holroyd and Michael G. H. Coles. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109(4):679, 2002.
[31] Clay B. Holroyd, Sander Nieuwenhuis, Nick Yeung, and Jonathan D. Cohen. Errors in reward prediction are reflected in the event-related brain potential. Neuroreport, 14(18):2481–2484, 2003.
[32] Iñaki Iturrate, Luis Montesano, and Javier Minguez. Robot reinforcement learning using EEG-based reward signals. pages 4822–4829. IEEE, 2010.
[33] Su Kyoung Kim, Elsa Andrea Kirchner, Arne Stefes, and Frank Kirchner. Intrinsic interactive reinforcement learning—using error-related potentials for real world human-robot interaction. Scientific Reports, 7(1):17562, 2017.
[34] W. Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 9–16. ACM, 2009.
[35] W. Bradley Knox and Peter Stone. Augmenting reinforcement learning with human feedback. In ICML 2011 Workshop on New Developments in Imitation Learning, volume 855, page 3, 2011.
[36] William Bradley Knox. Learning from human-generated reward. 2012.
[37] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
[38] Howard Levene, I. Olkin, and H. Hotelling. Robust tests for equality of variances. Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pages 78–92, 1960.
[39] Mandy J. Maguire, Grant Magnon, Diane A. Ogiela, Rebecca Egbert, and Lynda Sides. The N300 ERP component reveals developmental changes in object and action identification. Developmental Cognitive Neuroscience, 5:1–9, 2013.
[40] Wolfgang H. R. Miltner, Christoph H. Braun, and Michael G. H. Coles. Event-related brain potentials following incorrect feedback in a time-estimation task: Evidence for a "generic" neural system for error detection. Journal of Cognitive Neuroscience, 9(6):788–798, November 1997.
[41] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
[42] Hiroaki Niki and Masataka Watanabe. Prefrontal and cingulate activity during timing behavior in the monkey. Brain Research, 171:213–224, 1979.
[43] Lee Osterhout, Mark D. Allen, Judith McLaughlin, and Kayo Inoue. Brain potentials elicited by prose-embedded linguistic anomalies. Memory & Cognition, 30(8):1304–1312, December 2002.
[44] Lucas C. Parra, Clay D. Spence, Adam D. Gerson, and Paul Sajda. Response error correction—a demonstration of improved human-machine performance using real-time EEG monitoring. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11(2):173–177, 2003.
[45] Filip Radlinski and Thorsten Joachims. Evaluating the robustness of learning from implicit feedback. arXiv preprint cs/0605036, 2006.
[46] Yann Renard, Fabien Lotte, Guillaume Gibert, Marco Congedo, Emmanuel Maby, Vincent Delannoy, Olivier Bertrand, and Anatole Lécuyer. OpenViBE: An open-source software platform to design, test, and use brain–computer interfaces in real and virtual environments. Presence: Teleoperators and Virtual Environments, 19(1):35–53, 2010.
[47] Bertrand Rivet, Antoine Souloumiac, Virginie Attina, and Guillaume Gibert. xDAWN algorithm to enhance evoked potentials: Application to brain–computer interface. IEEE Transactions on Biomedical Engineering, 56(8):2035–2043, 2009.
[48] Andres F. Salazar-Gomez, Joseph DelPreto, Stephanie Gil, Frank H. Guenther, and Daniela Rus. Correcting robot mistakes in real time using EEG signals. pages 6570–6577. IEEE, 2017.
[49] Gerwin Schalk, Jonathan R. Wolpaw, Dennis J. McFarland, and Gert Pfurtscheller. EEG-based communication: Presence of an error potential. Clinical Neurophysiology, 111(12):2138–2144, 2000.
[50] Marten K. Scheffers, Michael G. H. Coles, Peter Bernstein, William J. Gehring, and Emanuel Donchin. Event-related brain potentials and error-related processing: An analysis of incorrect responses to go and no-go stimuli. Psychophysiology, 33(1):42–53, 1996.
[51] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.
[52] N. K. Squires, K. C. Squires, and S. A. Hillyard. Two varieties of long-latency positive waves evoked by unpredictable auditory stimuli in man. Electroencephalography and Clinical Neurophysiology, 1975.
[53] Matthew E. Taylor, Halit Bener Suay, and Sonia Chernova. Integrating reinforcement learning with human demonstrations of varying ability. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 617–624. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
[54] Andrea Lockerd Thomaz, Cynthia Breazeal, et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In AAAI, volume 6, pages 1000–1005, Boston, MA, 2006.
[55] Sida I. Wang, Percy Liang, and Christopher D. Manning. Learning language games through interaction. arXiv preprint arXiv:1606.02447, 2016.
[56] Zhaodong Wang and Matthew E. Taylor. Improving reinforcement learning with confidence-based demonstrations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3027–3033, 2017.
[57] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[58] Eric Wiewiora, Garrison W. Cottrell, and Charles Elkan. Principled methods for advising reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 792–799, 2003.
[59] Christian Wirth, Johannes Fürnkranz, and Gerhard Neumann. Model-free preference-based reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[60] Baicen Xiao, Qifan Lu, Bhaskar Ramasubramanian, Andrew Clark, Linda Bushnell, and Radha Poovendran. FRESH: Interactive reward shaping in high-dimensional state spaces using human feedback. arXiv preprint arXiv:2001.06781, 2020.
[61] Baicen Xiao, Bhaskar Ramasubramanian, Andrew Clark, Hannaneh Hajishirzi, Linda Bushnell, and Radha Poovendran. Potential-based advice for stochastic policy learning. pages 1842–1849. IEEE, 2019.
[62] Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H. Ballard, and Peter Stone. Leveraging human guidance for deep reinforcement learning tasks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 6339–6346. AAAI Press, 2019.
[63] Brian D. Ziebart.