Face valuing: Training user interfaces with facial expressions and reinforcement learning
Vivek Veeriah, Patrick M. Pilarski, Richard S. Sutton
Reinforcement Learning & Artificial Intelligence Lab, Department of Computing Science, University of Alberta
{vivekveeriah, pilarski, rsutton}@ualberta.ca

Abstract
An important application of interactive machine learning is extending or amplifying the cognitive and physical capabilities of a human. To accomplish this, machines need to learn about their human users' intentions and adapt to their preferences. In most current research, a user has conveyed preferences to a machine using explicit corrective or instructive feedback; explicit feedback imposes a cognitive load on the user and is expensive in terms of human effort. The primary objective of the current work is to demonstrate that a learning agent can reduce the amount of explicit feedback required for adapting to the user's preferences pertaining to a task by learning to perceive a value of its behavior from the human user, particularly from the user's facial expressions; we call this face valuing. We empirically evaluate face valuing on a grip selection task. Our preliminary results suggest that an agent can quickly adapt to a user's changing preferences with minimal explicit feedback by learning a value function that maps facial features extracted from a camera image to expected future reward. We believe that an agent learning to perceive a value from the body language of its human user is complementary to existing interactive machine learning approaches and will help in creating successful human-machine interactive applications.
One important objective of human-machine interaction is to augment existing human capabilities, which requires machines and their human users to closely collaborate and form a productive partnership. To achieve this, it is crucial for the machines to learn interactively from their users, specifically their intents and preferences. In current research trends, the user's preferences are conveyed via explicit instructions (Kuhlmann et al., 2004) or expensive corrective feedback (Knox & Stone, 2015), which can take the form of predefined words or sentences, push buttons, mouse clicks, etc. In many real-world, ongoing scenarios, these methods of feedback impose a cognitive load on human users. Moreover, in complex domains like prosthetic limbs, it is demanding for the user to provide these kinds of explicit feedback. It is important to have an alternative approach that is scalable and that allows machines to learn their human users' intents and preferences via ongoing interactions.

In this paper, we explore the idea that a reinforcement learning agent can learn a value function that relates a user's body language, specifically the user's facial expressions, to expectations of future reward. The agent can use this value function to adapt its actions to a user's preferences quickly and with minimal explicit feedback. This approach is analogous to an agent learning to understand the body language of its human user. It could also be imagined as building a form of communicative capital between a human user and a learning agent (c.f., Pilarski et al., 2015). Learning from interactions with a human user tends to be continual; reinforcement learning methods are therefore naturally suited for this purpose.

To the best of our knowledge, our system is the first to learn a value function in real time for a user's body language, specifically a value function that relates future reward to the facial features of the user. Additionally, this work is the first example of how such a value function can be used to guide the learning process of an agent interacting with a human user. Importantly, our approach does not utilize explicit reward channels, for example those discussed by Thomaz & Breazeal (2012) and by Knox et al. (2009). As it operates in real time, we believe that our approach is well suited for realistic human-machine interaction tasks and complements existing interactive machine learning approaches. Learning a language between an agent and its user in the form of value functions represents a new and powerful class of human-machine interaction technologies, and we expect the approaches discussed in this work will have broad applicability in many different real-world domains.
Significant research effort has been directed toward creating successful human-robot partnerships (e.g., as summarized in Knox & Stone, 2015; Mead et al., 2013; Breazeal et al., 2012; Pilarski & Sutton, 2012; Edwards et al., 2015). A natural approach is for an agent to learn from ongoing interactions with a human user via human-delivered rewards. Research by Thomaz & Breazeal (2008), Knox & Stone (2009), Breazeal et al. (2012), Loftin et al. (2015), and Peng et al. (2016) adopts this perspective, and it has been extensively studied in recent work by Knox & Stone (2015). In the TAMER approach of Knox & Stone (2015), a system was able to learn a predictive model of a human user's shaping rewards, such that the model could be used to successfully train an agent even in the presence of human-related delays and inconsistencies. A potential drawback of learning a reward model is that, when the user needs to modify the agent's behavior, the model has to be changed (e.g., via additional shaping rewards from the user). We desire a method for delivering feedback that does not require a large number of costly interactions from the human, and that transfers well to new or changed situations without the need for retraining.

Another interesting approach to the interactive instruction of a machine learner involves a human teaching a robot to perform a task through demonstrations, a process aptly named learning from demonstration. This approach can also be called programming by demonstration. There are numerous works exploring learning from demonstration (e.g., Koenig & Mataric, 2012; Schulman et al., 2013; Alizadeh et al., 2014). One noted downside is that this form of learning is reported to be at times a tiring experience for a human user. Many approaches are also limited in their ability to scale up to a full range of real-world tasks (e.g., it is impossible to tractably provide demonstrations or instructions covering all possible situations).

A key difference between many existing methods and our approach is that we are concerned with designing a general, scalable approach that would allow an agent to adapt its behavior to a user's changing preferences with minimal explicit human-generated feedback. This is in contrast to approaches that seek to use body language like facial features as an input channel to directly control a robot or other machine's operation (e.g., Breazeal (1998) and Liu & Picard (2003)). As a significant contribution of the present work, we describe the use of facial features not as a channel of control but as a means of valuation. To this end, we propose to learn a value function that is grounded in the user's body language, independent of the features of a task, and to use this value function to help influence an agent's real-time decision making in a way that spans multiple tasks and settings of use.

In a reinforcement learning setting (Sutton & Barto, 1998), a learning agent interacts with an unknown environment in order to achieve a goal. In this setting, the goal is to maximize the cumulative reward accumulated by adapting its behavior within the environment. Markov Decision Processes (MDPs) are the mathematical formalism used to specify a reinforcement learning problem.
An MDP consists of a tuple ⟨S, A, p, r, γ⟩: S, the set of all states; A, the set of all actions; p(s'|s, a), a transition probability function, which gives the probability of transitioning to state s' ∈ S at the next time step given the current state s ∈ S and action a ∈ A; r(s, a, s'), the reward function, which gives the expected reward for a transition from state s ∈ S to s' ∈ S by taking action a ∈ A; and γ, the discount factor, which specifies the relative importance of immediate and long-term rewards. In episodic problems, the MDP can be viewed as having special states, called terminal states, which terminate an episode. Such states ease the mathematical notation, as they can be viewed as a single state with a single action that results in a reward of 0 and a transition back to itself. The return at time t is defined as the discounted sum of rewards after time t:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i},$$

where R_{t+1} denotes the reward received after taking action A_t in state S_t.

Actions are taken at discrete time steps t = 0, 1, 2, ... according to a policy π : S × A → [0, 1], which defines a selection probability for each action conditioned on the state. Each policy π has a corresponding state-value function v_π(s) that maps each state s ∈ S to the expected return G_t obtained from that state by following the policy π:

$$v_\pi(s) = \mathbb{E}\{ G_t \mid S_t = s, \pi \}.$$

State-value functions suffice when the given task requires prediction. If the given task requires control, however, it is important to use the action-value function q_π(s, a), which gives the expected return G_t obtained by taking action a from state s and then following the policy π:

$$q_\pi(s, a) = \mathbb{E}\{ G_t \mid S_t = s, A_t = a, \pi \}.$$

To evaluate the face valuing approach, we introduce a grip selection task inspired by a natural problem in the prosthetic arm domain, where the agent needs to select an appropriate grip pattern for grasping a given object. The task consists of a set of n grips and m objects. Depending on the experiment, there could be many possible grips for a given object, with the correct grip being defined according to the user's preference. The task could also involve an arbitrarily large number of objects, making grip selection by pure trial and error a non-trivial problem.

This task was formulated as an undiscounted episodic MDP with a reward of 0 on every time step, including the step that completes an episode. At the beginning of each episode, a single object was randomly picked from a large set of objects and the agent was tasked with choosing a grip from a limited set of grips; once the agent selected a grip, it needed to move a fixed number of steps towards the object to finish the grasping motion, thereby completing an episode. A human user provided reward to the agent by pressing a single button, which delivered a reward of -5 for the corresponding time step. Moreover, pressing the button forced the agent to return to its initial position regardless of its current position. The experimental setup is shown in Fig. 1.
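To make these dynamics concrete, the following is a minimal sketch of the grip selection task as an episodic MDP in Python. It is an illustration under stated assumptions, not the authors' implementation: all names are hypothetical, and the fixed step count of 10 is chosen only because it is consistent with the 11-step perfect episode reported in the results (one grip choice plus ten forward moves). The forced return after a button push is handled by the action-availability rules sketched later, in the action space section.

    import random

    class GripSelectionTask:
        """Sketch of the grip selection task as an undiscounted episodic MDP."""

        def __init__(self, n_grips, n_objects, steps_to_object=10):
            self.n_grips = n_grips                   # size of the limited grip set
            self.n_objects = n_objects               # size of the object set
            self.steps_to_object = steps_to_object   # fixed forward steps to finish a grasp

        def reset(self):
            # A single object is randomly picked at the start of every episode.
            self.object_id = random.randrange(self.n_objects)
            self.grip = None
            self.position = 0                        # 0 denotes the grip-changing station
            return (self.object_id, self.grip, self.position)

        def step(self, action, button_pushed=False):
            # The button push delivers -5 on the corresponding time step;
            # all other time steps, including episode completion, give 0.
            reward = -5.0 if button_pushed else 0.0
            if self.position == 0 and action.startswith('grip'):
                self.grip = action                   # grips can only be changed at the station
            elif action == 'forward':
                self.position += 1
            elif action == 'back':
                self.position = max(0, self.position - 1)
            done = (self.position == self.steps_to_object)  # grasp motion completed
            return (self.object_id, self.grip, self.position), reward, done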
Figure 1: (a) Overview of the face valuing agent: the user observes the simulated task and can provide a negative reward to the Sarsa agent during the task. The agent can observe the user's face via a webcam, and learns to behave based on the user's preferences. (b) Grip selection task: the solid colored box denotes the object; the thin blue-lined object denotes the grip selected by an agent. The initial position of the agent is called the "grip-changing station". The agent has to learn to select, based on the user's preference, an appropriate grip: the grip's width needs to be equal to the object's width. The figure on the left denotes one correct combination, while the figure in the center denotes an incorrect grip for the given object. The figure on the right denotes a scenario where the agent is forced to return to its initial position, as the user has pushed the reward button. To complete an episode, the agent has to select one of the correct grips, move forward, and grasp the object.

As in real-world grasping tasks, a user could have personal preferences over which grip was suitable for grasping a given object. These preferences could change from episode to episode and from experiment to experiment. Further, these preferences were hidden from the learning agent, and the only way the agent could infer them was from the changing facial expressions of the human user. Therefore, in this work, we asked the human user to be as expressive as possible so as to provide clear cues for the learning agent to begin forming its behavioral choices.

For our experiments, two Sarsa(λ) (Rummery & Niranjan, 1994; Sutton & Barto, 1998) agents (one that uses face valuing and one without face valuing) are compared on the above-described task. All the experiments in this paper were performed by a well-trained user in a blind setting, i.e., the user did not know which of the two methods was currently being evaluated. The user provided the same form of rewards to both learning agents via button pushes. The two main instructions we gave to the human user were 1) to express their pleasure or displeasure with the agent via any simple, repeatable, and distinguishable facial expressions, and 2) to push the button whenever the learning agent was not behaving as per the user's expectations.

Agent               State space
w/o face valuing    current grip and object ids + bias term
w/ face valuing     23 feature points + bias term
Table 1: State spaces of the agents compared in the experiments.
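Both agents learn action values with linear function approximation over these state spaces. For reference, here is a minimal sketch of one linear Sarsa(λ) update with accumulating traces; the per-action weight layout, step size, and trace-decay value are illustrative assumptions rather than the authors' reported settings (the task itself is undiscounted, hence γ = 1).

    import numpy as np

    def sarsa_lambda_update(w, z, phi, a, r, phi_next, a_next,
                            alpha=0.1, gamma=1.0, lam=0.9):
        """One linear Sarsa(lambda) update with accumulating traces.

        w, z : (n_actions, n_features) weight and eligibility-trace arrays
        phi  : binary feature vector for the current state
        For a terminal transition, pass phi_next as a zero vector."""
        delta = r + gamma * w[a_next].dot(phi_next) - w[a].dot(phi)  # TD error
        z *= gamma * lam         # decay every trace
        z[a] += phi              # accumulate the trace for the taken action
        w += alpha * delta * z   # move all weights toward the Sarsa target
        return w, z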
The state spaces for both agents are briefly described in Table 1. The agent without face valuing has the id of the current grip and the id of the current object in its state space, along with a bias term. The id of the current grip chosen by the agent is one-hot encoded to form a vector of length n. Similarly, the id of the object is one-hot encoded to a vector of length m and concatenated with the vector corresponding to the current grip. The entire state space for this method is therefore of length m + n + 1, where m is the total number of objects during the entire experiment and n is the total number of grips available to the agent during an experiment.

For the face valuing agent, 68 key points are detected from a frame containing the human's face using a popular facial landmark detection algorithm (Kazemi et al., 2014). These key points are simple two-dimensional coordinates that denote the positions of certain special locations on a human's face. The points from each frame were normalized, and the 23 points corresponding to the positions of the eyebrows and mouth were selected. Each of these 23 points was tile-coded using several overlapping tilings, resulting in a feature vector of size 9200. These key points seem to produce sufficient variation between different facial expressions.

Figure 2: Features extracted from the face: facial landmarks corresponding to the eyebrows and mouth of a user's face are the only features extracted from the face. These are marked as yellow circles.

Figure 3: Total time steps taken for different grip and object settings (panels (a)-(i): all combinations of 2, 4, or 8 objects with 2, 4, or 8 grips; panels (j)-(l): the 2-, 4-, and 8-grip settings): the face-valuing agent learned to adapt faster than a conventional agent in most settings. Moreover, in difficult settings with more appropriate object-grip combinations, the face-valuing agent required fewer human-generated rewards to achieve this level of performance. Each plot was generated by averaging data obtained from 5 independent runs, and each run consisted of 15 episodes.
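For reference, the per-point feature construction described above can be sketched as follows. The tiling count and resolution are assumptions (4 tilings of 10 × 10 tiles per landmark), chosen only because they are one combination consistent with the 9200-dimensional feature vector stated above (23 × 4 × 100 = 9200); the bias term listed in Table 1 is omitted for brevity.

    import numpy as np

    def tile_code_point(x, y, n_tilings=4, tiles_per_dim=10):
        """Return the active tile indices for one normalized 2-D point in [0, 1]^2."""
        active = []
        for t in range(n_tilings):
            offset = t / (n_tilings * tiles_per_dim)    # shift each tiling slightly
            ix = min(int((x + offset) * tiles_per_dim), tiles_per_dim - 1)
            iy = min(int((y + offset) * tiles_per_dim), tiles_per_dim - 1)
            active.append(t * tiles_per_dim**2 + ix * tiles_per_dim + iy)
        return active

    def face_features(points):
        """Binary feature vector for 23 (x, y) landmark points (eyebrows and mouth)."""
        per_point = 4 * 10 * 10                         # features contributed by each landmark
        phi = np.zeros(23 * per_point)                  # 9200 features in total
        for i, (x, y) in enumerate(points):
            for idx in tile_code_point(x, y):
                phi[i * per_point + idx] = 1.0
        return phi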
Action space

The complete action space for both agents consisted of the actions {grip_1, grip_2, ..., grip_n, ↑, ↓}, where the first n actions correspond to choosing that particular grip. The remaining two actions move the agent one step forward towards the object and one step back towards the grip-changing station, respectively. The actions available to the agent depended on its position relative to the object and the grip-changing station. When the agent was in the grip-changing station, the available actions were {grip_1, grip_2, ..., grip_n, ↑}, whereas once the agent had left the grip-changing station, only the {↑, ↓} actions were available. When the user pushed the reward button, the agent lost all its actions except {↓} until it reached the grip-changing station (these availability rules are sketched in code at the end of this section).

The agent observed the state space once every one-tenth of a second and had to take an action on every time step. The agent, however, had the freedom of choosing the same action for many consecutive time steps, which allowed the human user to expressively respond to the learning agent.

The first experiment compared the two agents across multiple grip and object settings. The plots of total time steps and total human-generated rewards accumulated by the agents are shown in Fig. 3. The plots in Fig. 3 (a)-(i) represent the total time taken by a learning agent to complete a successful grasp across episodes. The plots in Fig. 3 (j)-(l) display the total number of human-generated rewards given to an agent while it adapted to the user's preferences. These graphs (Fig. 3) were generated from the same user experiments, conducted in a blind manner. A perfect agent would receive no human-generated reward in any of these settings and would take only 11 time steps to complete an episode.

From the plots in Fig. 3 (a)-(i), it can be observed that the agent with face valuing quickly adapted to the user's preferences in all the different object and grip settings. Also, from the plots in Fig. 3 (j)-(l), the total number of human-generated rewards for the face valuing agent was comparatively lower than for the agent without face valuing in the difficult settings of this experiment.

During the initial phase of the experiment, the face valuing agent utilized the human-generated reward to learn to perceive a value of its actions from the human user's face. This learned value was leveraged to adapt the agent's actions in later phases of the experiment. In simple experimental settings with fewer object-grip combinations, like the 2-grip setup, the agent without face valuing could quickly learn the appropriate behavior from the button pushes provided by the user, and this resulted in better performance compared to the agent with face valuing. However, in setups with a large number of possible combinations of grips and objects, the agent without face valuing lost this advantage and failed to scale up. The face-valuing agent performed comparatively better in these scenarios, as it learned to perceive the "goodness" of its actions, which guided the agent in choosing the correct action at a given time instance.
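As mentioned earlier in this section, the position-dependent action availability can be captured in a few lines. The following sketch is illustrative only; the action names and the returning flag are hypothetical, not the authors' code.

    def available_actions(position, returning_after_button, n_grips):
        """Position-dependent action availability for the grip selection task."""
        if returning_after_button and position > 0:
            return ['back']   # after a button push, only the move-back action remains
        if position == 0:     # at the grip-changing station
            return ['grip_%d' % i for i in range(1, n_grips + 1)] + ['forward']
        return ['forward', 'back']  # between the station and the object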
Figure 4: Experiment with infinite objects and 5 grips: plot of total time steps for the limited-grips, infinite-objects setting. In this experimental setting, a new object is introduced every episode and the agent needs to grasp it using one of its grips. Each bar represents the total time steps taken by the agent to complete 15 episodes of this task. The plot was generated from data obtained from 5 independent runs. This is a key result of our approach, as it clearly shows the performance improvement obtained through face valuing.
For the agent without face valuing, the user's preferences could be communicated only through the reward channel. Naturally, this approach was more expensive in terms of the number of manual rewards compared to the face-valuing approach. On the other hand, the face-valuing approach utilized the user's facial features, which conveyed the preferences over the grips: a simple strategy observed in the user was to hold a neutral or sad expression when the agent had not selected the correct grip and to display a positive expression when the agent selected the correct grip. Interestingly, by learning a value function over the face, this agent learned to wait for an affirmative expression from the human user before moving forward to grasp the object. When there were no such expressions from the user, the agent switched from one grip to another until the user gave a "go ahead" expression.
A second experiment (Fig. 4) showed the performance improvement obtained through our approach in a different and more difficult setting: one where a new object was generated for each episode and no object was ever seen more than once by the agent throughout the experiment. This experiment therefore explored the ability of face valuing to address new or changed tasks, and highlighted the importance of adapting quickly to a user's preferences.

Fig. 4 shows the total time steps taken by the agents to complete this experiment, whereas Table 2 shows the total number of instances of explicit human feedback required by each agent to successfully complete this task. Data was generated from experiments with a single user.

From the plot in Fig. 4, it can be observed that the face-valuing agent was much quicker in adapting to the user's preferences. It learned to complete the task quicker than the agent without face valuing. From Table 2, it is clear that the total number of instances of human-generated feedback needed to complete this task was lower for the face-valuing agent.

Since a new object was introduced in every episode, the agent without face valuing could not learn the possible grip-object combinations from human-generated rewards alone. This was the cause of it requiring more human feedback to complete this task. In the face-valuing approach, as the learning agent relied on values related to facial features, it could adapt easily in these situations. Effectively, the agent with face valuing learned to keep switching grips periodically until the user gave a "go ahead" expression. The agent without face valuing did not have this advantage and could not perform effectively in this setting.
Agent               Total no. of button pushes
w/o face valuing    623.6
w/ face valuing     137.4
Table 2: Total number of button pushes provided for adapting an agent to the user's preference in experiment 2. The values represent the total number of button pushes given by the human user to an agent over the complete experiment, which lasted 15 episodes. Averages were obtained from 5 independent runs.
In our experiments, the learning agent with face valuing had the ability to perceive a human user's face and, eventually, learned to perceive a value of its behavior from its user's facial expressions. Our preliminary results therefore suggest that, by learning to value a human user's facial expressions, the agent could adapt quickly to its user's preferences with minimal explicit corrective feedback. This learning occurred as follows: during the initial phase of the experiment, the agent used the explicit corrective feedback to learn a value function from the user's facial gestures; these gestures served as useful clues about future rewards based on the agent's current behavior, and guided its behavior.

Several studies have shown that users, to a certain extent, are willing to teach machines to perform tasks automatically. For example, in medical domains, it is already common for people with amputations to extend their capabilities or overcome their limitations through a partnership with machines (Williams, 2011). However, currently available technology does not identify and adapt quickly to the different preferences of its users; this is a serious bottleneck to the amplification of intelligence or physical ability in human-machine partnerships. Our work helps begin to address these limitations.

For evaluating our approach, we introduced a grip selection task wherein the learning agent had to figure out the goal through its interaction with the user; this agent can be readily termed a goal-seeking agent (Pilarski et al., 2015). To demonstrate the significance of our approach, we performed two sets of experiments: the first involved multiple object-grip settings on the grip selection task; we termed the second experiment the infinite objects setting, because one new object was generated for every episode and the agent needed to grasp this object by selecting one grip from its limited set of grips. This infinite objects setting is pertinent to real-world scenarios, where there are countless objects that must be grasped using a limited set of grip patterns.

The results from the first user experiment (Fig. 3 (a)-(i)) suggest that the face-valuing agent learns to adapt more quickly to its user's preferences in this task. From the plots in Fig. 3 (j)-(l), it can be observed that the face-valuing agent learns to adapt to its user's preferences with significantly fewer explicit human-generated feedback signals, particularly in difficult experimental settings. With the second user experiment (Fig. 4), we empirically show a scenario where conventional methods can fail. From both experiments, it can be observed that the face-valuing agent successfully learns to adapt and completes one episode after another by relying only on facial expressions, specifically the value learned from facial expressions. On the other hand, the agent without face valuing could rely only on human-generated feedback for identifying the correct grip for a given object, which resulted in a greater number of button pushes being given by the user. Moreover, we observed that the face valuing agent learned to wait for an affirmative facial expression before moving towards the object.
Otherwise, the agent would switch from one grip to another at the grip-changing station until the user provided an affirmative expression.

Though our experiments were simulated, we believe that our approach can be much more valuable in a realistic robot setting: we expect a robot's behavior would elicit more expressive facial feedback from the user than our simple simulated domain, and thus more powerful features for a face-valuing agent. In a robot setting, the user can observe the robot's actions and their consequences in a real-world environment, where it is natural for the user to implicitly emote through various facial cues. Robotic experiments are needed to help quantify the advantage of a face-valuing approach to human-machine interaction.
We introduced a new and promising approach called face valuing for adapting an agent to a user's preferences, and showed that it can produce large performance improvements over a conventional agent that learns only from human-generated rewards. By allowing the agent to perceive a value from a user's facial expressions, the total number of expensive human-generated rewards delivered during a task was substantially reduced and the agent was able to adapt quickly to its user's preferences. Face valuing learns a mapping from facial expression to user satisfaction; it formalizes satisfaction as a value function and learns this value function through temporal-difference methods. Most work on the use of facial features in human-machine interaction uses facial features as control signals for an agent; surprisingly, our work seems to be the first to use facial expressions to instead train a learning system. Face valuing is general and largely task agnostic, and we believe it will therefore extend well to other settings and other forms of human body language.