Potential Impacts of Smart Homes on Human Behavior: A Reinforcement Learning Approach
Shashi Suman, Ali Etemad, Francois Rivest
Dept. ECE and Ingenuity Labs Research Institute, Queen's University
Dept. of Mathematics and Computer Science, Royal Military College of Canada
{shashi.suman, ali.etemad}@queensu.ca, francois.rivest@{mail.mcgill.ca, rmc.ca}

Abstract
We aim to investigate the potential impacts of smart homes on human behavior. To this end, we simulate a series of human models capable of performing various activities inside a reinforcement learning-based smart home. We then investigate the possibility of human behavior being altered as a result of the smart home and the human model adapting to one another. We design a semi-Markov decision process human task-interleaving model, based on hierarchical reinforcement learning, that learns to make decisions to either pursue or leave an activity. We then integrate our human model in the smart home, which is based on Q-learning. We show that a smart home trained on a generic human model is able to anticipate and learn the thermal preferences of human models with intrinsic rewards similar to the generic model. The hierarchical human model learns to complete each activity and set optimal thermal settings for maximum comfort. With the smart home, the number of time steps required to change the thermal settings is reduced for the human models. Interestingly, we observe that small variations in the human model reward structures can lead to the opposite behavior in the form of unexpected switching between activities, which signals changes in human behavior due to the presence of the smart home.
Smart Home Systems (SHS) have evolved drastically in recent years due to advances in AI, enhanced connectivity, affordability, and the Internet of Things (IoT). For instance, smart thermostats [3] and smart device schedulers [14] are increasing in popularity every day. These devices, and the ecosystems they create, aim to enhance human quality of life by saving time and costs, and by increasing comfort. On the other hand, the increased use of AI and automation in our daily lives has begun to impact human performance [13]. For instance, it was shown in [18] that with increased efficiency in human performance, which can result from automation and the use of AI, happiness also increases.

Reinforcement learning (RL) has been widely used to enable agents to learn from feedback [22], and RL agents have consequently been widely deployed as intelligent or automated agents [12]. While RL-based agents generally learn to optimize their performance for maximizing received rewards, such reward maximization may only be optimal from the perspective of the agent rather than the environment as a whole. For example, it was shown in [7] that a smart agent can gain significant rewards by exploiting an edge case in the rules of the environment.

In recent years, SHS have also begun to utilize RL to learn from human interactions and feedback in order to provide a customized and personalized user experience [20, 3]. While intelligent agents such as SHS generally provide value and increase quality of life, we question whether it is possible that, in particular cases, an SHS powered by RL could learn to exploit the intricacies of human behaviors to maximize its own reward regardless of the implications for the humans.

In this paper, we tackle this problem by investigating whether it is possible for an adaptive RL-based SHS to exploit assumptions in the environment and change human behaviors while trying to maximize human comfort.
To do so, we run a set of experiments in which the SHS sets the temperature and humidity (TH) of an environment to the desired comfort level of a user, while the user performs a set of activities. For the smart home to learn to adapt to human preferences given each activity, we use Q-learning, where the SHS learns through a pre-trained human model's feedback [27]. For the human model, we utilize hierarchical reinforcement learning (HRL), which is capable of efficiently modeling human interleaving capabilities, i.e., switching between activities [6]. To represent the human's ability to continue or leave a particular activity, we implement the human model as hierarchically optimal [2] (see [10], which shows relatively high accuracy for activity continuation and termination when compared to real humans). Our simulations show that unintended consequences can indeed arise in certain scenarios, as we observe behavioral changes exhibited by the human agent.

Our contributions in this paper can be summarized as follows: (1) We model a smart home with RL capable of learning to optimize ambient parameters, namely temperature and humidity, for maximizing human comfort. (2) We model a human agent capable of pursuing and switching between various activities; moreover, the agent can control the temperature and humidity levels in order to manage its own comfort. (3) We find that in an environment that follows well-established physics laws of heating and humidity, the co-existence of the RL-based SHS and the human agent can lead to changes in human behavior that would otherwise not occur without the SHS.

AI in Smart Environments.
Recent advances in machine learning have enabled smart devices to learn user behaviors and preferences for enhanced user experience and comfort. For instance, in [3], smart thermostats were capable of learning user thermal preferences in real-time using Bayesian networks. In [17], neural networks were used to predict user behaviors in smart homes to adjust the ambient thermal conditions.

RL algorithms have also begun to make their way into SHS. In [20], value iteration, a reward-based algorithm, was used to learn user preferences about lighting, conditional on the number of occupants, time of day, and surrounding illumination. Multi-agent deep Q-networks have been used in [26] to learn the energy cost of smart home appliances. Deep RL was employed in [9] for energy minimization, along with a neural network that predicted user thermal comfort. This was further extended in [24], where air quality at a particular temperature was controlled using the Double DQN algorithm while also minimizing energy consumption. Similarly, in [4], Q-learning was used to control indoor ventilation to maintain air quality by learning the rate of CO2 generation from individuals.

Human-RL Interaction.
Numerous studies have explored the interaction between humans and RL systems. For instance, [21] discusses the notion of placing humans in the loop to provide feedback while an agent plays Atari games. With this approach, the learning stability of the agent increased, as the number of iterations required for convergence was reduced. Similarly, in [16], bots learned to converse through feedback from humans using the REINFORCE algorithm, resulting in an improved learning curve.

Based on the above, it can be observed that including humans in RL loops generally results in better learning. Nonetheless, some studies have interestingly discovered potential negative or unintended consequences of human-RL interactions. For instance, it was shown in [21] that in complex environments where the human is unsure about the nature of the feedback to provide to the agents, learning frequently fails. Moreover, other studies have shown that RL agents can exploit certain rules in the environment to maximize their reward regardless of the implications for the overall system. For example, in [7] the agent exploited the environment to obtain indefinite rewards, while in [5] the agent exploited the physics engine of the environment to win the game. Given these findings, in this paper we aim to study the notion of including humans in an RL-based SHS with the goal of exploring whether scenarios might be encountered that would be deemed unintended or negative.
Problem Setup.
We aim to investigate whether, for a human agent H capable of interleaving a set of activities through a policy K, there exists an RL-based SHS M that can change the human behavior in an attempt to maximize comfort. Specifically, for a policy K which H learns in the absence of M, we will obtain K′ ≠ K when M is integrated into the environment. To do so, we will assume M controls the TH, combined with an existing HRL model for H. We design H to be capable of carrying out the activities rest, leisure, and physical workout, three common categories of activities in normal human life, through policy K. M will learn to anticipate the TH preferences of H, such that H would feel more comfortable while completing the activities through a different policy K′. In the next two sections, we describe M and four variations of H to evaluate whether our hypothesis on the possibility of obtaining K′ ≠ K is correct.

Smart Home System.
We model the SHS using a Markov Decision Process (MDP) defined by the tuple ⟨S, A, P, R, γ⟩, where S is the set of discretized states, A is the set of actions, P(s′ | s, a) is the probability of going to state s′ when taking action a in state s, R(s, a) is the expected reward for taking action a in state s, and γ is the discount factor devaluing rewards received later in time. The objective is to learn a policy π(s), which returns the action to take in state s that maximizes the expected sum of future rewards ∑_{t=1}^{∞} γ^t r_t, where r_t is the reward received after taking the t-th action. The expected sum of future rewards for taking action a in state s and then following policy π is given by the Bellman equation:

Q^π(s, a) = ∑_{s′ ∈ S} P(s′ | s, a) [R(s, a) + γ Q^π(s′, π(s′))],   (1)

where Q^π is the discounted value for the optimal policy in state s for action a for the SHS, obtained using Q-learning [25]. Given a transition from state s to state s′ using action a and reward r, the update rule is given by:

Q(s, a) ← (1 − α) Q(s, a) + α [r + γ max_{a′ ∈ A} Q(s′, a′)],   (2)

where α is the learning rate. The ε-greedy policy is given by:

π(s) = argmax_{a ∈ A} Q(s, a) with probability 1 − ε; a uniformly random action in A otherwise (each with probability ε/|A|).   (3)

While learning, actions are selected using a decaying ε-greedy policy to keep a balance between exploration and exploitation. Based on the ε value, the agent decides between a random and a greedy action, where the value of ε decays with each iteration.

The smart home receives a state s that consists of temperature and humidity as a discrete state.
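The update rule (Eq. (2)) and the decaying ε-greedy policy (Eq. (3)) can be sketched as tabular Q-learning. The state encoding and action set below are illustrative placeholders, not the paper's implementation:

```python
import random
from collections import defaultdict

# Hypothetical action set for the SHS; the paper's agent similarly
# increases, decreases, or maintains temperature and humidity.
ACTIONS = ["temp_up", "temp_down", "hum_up", "hum_down", "hold"]

class QLearner:
    def __init__(self, alpha=0.1, gamma=0.95, eps=1.0, eps_decay=0.995):
        self.Q = defaultdict(float)          # Q[(state, action)] -> value
        self.alpha, self.gamma = alpha, gamma
        self.eps, self.eps_decay = eps, eps_decay

    def act(self, s):
        # Decaying epsilon-greedy (Eq. 3): random action w.p. eps,
        # otherwise the greedy action.
        if random.random() < self.eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s_next):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
        target = r + self.gamma * max(self.Q[(s_next, a2)] for a2 in ACTIONS)
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + self.alpha * target
        self.eps *= self.eps_decay           # decay exploration over time
```

Here the −1 reward described above would be passed as `r` whenever the human overrides the TH settings.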
Additionally, given that many recent wearable devices that track human activity can now integrate into smart-home applications, we further include the human activity in the received state, so that the smart home can account for the activity when setting the temperature. Based on this observation, the smart home may take an action to increase, maintain, or decrease the temperature or humidity, respectively. The smart home receives a negative reward (−1) every time the human changes the TH, thus encouraging the smart home to set the TH to the human preference.

To simulate real-life changes in temperature and humidity in the environment, we use thermal calculations based on Newton's law of cooling [19], assuming the dew point to be constant because the deviation in temperatures is low enough for Newton's law to remain valid. If we consider the human temperature preference to be T_h and the surrounding temperature to be T_s, then the thermal change can be expressed as:

T(t) = T_s + (T_h − T_s) e^{−kt},   (4)

where t is time and k is the decay constant. For simplicity, we ignore the heat generated by the human body's metabolism, which varies due to sweat evaporation, radiation, convection, and conduction. With regular change in the surroundings with respect to the skin temperature, radiant heat can be gained or lost; similarly, via breathing and conduction, the body loses heat. Hence, we kept the mean radiant temperature at 22 °C to reduce the complexity.

To simulate the changes in humidity within the environment, we use the August-Roche-Magnus approximation [15]. By definition, relative humidity can be expressed as:

R_h = 100 × exp(17.625 T_d / (243.04 + T_d)) / exp(17.625 T / (243.04 + T)),   (5)

where T is the room temperature (Eq. (4)) and T_d is the dew point temperature. Similar to the temperature, we ignore the humidity produced by the humans, because the amount of moisture released by humans remains independent of the indoor humidity [23] and stays mostly constant at about 50 grams/hr.

Human Model.
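The environment dynamics above can be sketched as two small functions, a minimal sketch assuming the standard Magnus coefficients (17.625 and 243.04, as in Lawrence [15]) and a hypothetical decay constant k:

```python
import math

def temperature(t, T_s, T_h, k=0.1):
    """Newton's law of cooling (Eq. 4): room temperature after t steps,
    relaxing from the human's set point T_h toward the surroundings T_s."""
    return T_s + (T_h - T_s) * math.exp(-k * t)

def relative_humidity(T, T_d):
    """August-Roche-Magnus approximation (Eq. 5): relative humidity (%)
    from room temperature T and dew point T_d, both in degrees Celsius."""
    gamma = lambda x: math.exp(17.625 * x / (243.04 + x))
    return 100.0 * gamma(T_d) / gamma(T)
```

With a constant dew point, warming the room (raising T) lowers the relative humidity and cooling it raises the relative humidity, which is the coupling the SHS has to learn around.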
We need a human model H that can: (i) represent how humans adapt based on reward; (ii) naturally complete and switch between tasks; and (iii) convert TH to comfort and reward. Accordingly, we use reinforcement learning for the human model, as such approaches have frequently been used for modeling human behaviors and adaptation based on rewards [11]. In particular, an HRL model capable of capturing human behaviors in selecting, completing, and switching among different tasks has been proposed in [10].

We design H such that, within a limited time window N, it must learn to select or leave the available or ongoing activities to maximize the reward. We model this using HRL as implemented in [10], where the similarity between an HRL-based human model and real individuals was shown to be high when simulating task interleaving. In this formulation, the goal can be broken down into a hierarchy of sub-problems, where each sub-problem is a separate Semi-Markov Decision Process (SMDP) that the human model has to learn. This allows us to combine primitive actions to create more composite ones. We select three possible types of activity classes based on activity intensity for the human agent: 'rest' for low intensity, 'leisure' for medium intensity, and 'physical workout' for high intensity. Each activity class has a different cost function, state-space, and PMV preferences. The entire hierarchy has four composite actions and six primitive (non-SMDP) actions, as shown in Figure 1. An SMDP is similar to an MDP, with the exception that some actions can take multiple time steps to complete, and hence the agent receives delayed rewards. This is because some actions are represented as macros consisting of sequences of actions.

Figure 1: Hierarchical model representation for our simulated human (the root; the composite tasks rest, leisure, and physical workout; and the primitive actions continue, leave, and change TH). Rectangles are composite SMDP subroutines and ovals are primitive actions.

We divide the hierarchy for our H into three distinct levels. The first level has an SMDP root node. The root subroutine chooses an activity that returns a maximum cumulative reward. The second level consists of tasks that the human must perform. Each task is an SMDP subroutine that models the progress speed based on the time spent. This helps us generalize the interleaving ability of humans within activities. In each task, the human can either continue or leave the task, or decide to change either the temperature or humidity as the third level in the hierarchy.

For the human state representation, we use a similar representation as the SHS, except that we further encode the TH for the human as a comfort state using a predicted mean vote (PMV) model [1].
PMV is an empirical model based on human sensation which predicts the average vote of a large group of people regarding thermal comfort on a 7-point scale spanning from −3 to 3, with 3 being the warmest feeling, −3 the coldest feeling, and 0 optimal comfort.

To model the human, we use a three-part value function decomposition similar to [2], given by the following three elements:

Q_r(i, s) = ∑_{s′} P(s′ | s, a) R(a, s′)   if i is primitive;
Q_r(i, s) = Q_r(child(i), s) + Q_c(child(i), s, π_child(s))   if i is composite,   (6)

Q_c(i, s, a) = ∑_{s′ ∈ SS} P′(s′, t | s, a) α γ^t Q_task(i, s′, π_task(s′)),   (7)

Q_e(i, s, a) = ∑_{s′ ∈ EX} P′(s′, t | s, a) α γ^t Q_root(i, s′, π_root(s′))   if s is an exit state;
Q_e(i, s, a) = ∑_{s′ ∈ SS} P′(s′, t | s, a) α γ^t Q_e(i, s′, a′)   if s is not an exit state,   (8)

where Q_r is the expected discounted reward for taking action a, Q_c is the expected discounted reward for completing the rest of the subroutine after the agent takes a in state s, and Q_e refers to all the rewards external to the current subroutine of the agent. SS denotes the states within the same subroutine as s, and EX denotes the exit states of the current subroutine, where the human can leave the task. Our human model can exit a task at any given state; hence, every state is considered an exit state. i indicates the node of each task in the hierarchy: 0 is assigned to the root node at the first level, and 1, 2, 3, ...
are assigned to the composite tasks at the consecutive levels.

The full Bellman equations for the root and task levels in the hierarchy are given by:

Q_{π,root}(i, s, a) = Q_{r,root}(i, s, a) + Q_{c,root}(i, s, a),   (9)

Q_{π,task}(i, s, a) = Q_{r,task}(i, s, a) + Q_{c,task}(i, s, a) + Q_{e,task}(i, s, a),   (10)

where π_root and π_task(i) are the optimal policies at the root and task levels, respectively, and t is the number of time steps spent on each task. The Bellman equation for the root does not contain Q_e because there is no external subroutine at the root level that provides external rewards to the root node.
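The recursive part of the decomposition (Eq. (6)) can be sketched as follows. The toy hierarchy, values, and policy below are hypothetical, not the paper's learned quantities; they only illustrate how a composite node's reward value sums its chosen child's reward and completion values:

```python
def q_r(node, s, q_r_prim, q_c, policy, children):
    """Eq. (6): primitive nodes return their one-step value; composite
    nodes return the value of the child selected by the policy plus
    that child's completion value Q_c."""
    if node not in children:                       # primitive action
        return q_r_prim[(node, s)]
    child = policy[(node, s)]                      # child chosen by policy
    return q_r(child, s, q_r_prim, q_c, policy, children) + q_c[(child, s)]

# Hypothetical two-level hierarchy: root -> rest -> continue
children = {"root": ["rest"], "rest": ["continue"]}
policy = {("root", "s"): "rest", ("rest", "s"): "continue"}
q_r_prim = {("continue", "s"): 1.0}                # expected one-step reward
q_c = {("rest", "s"): 0.5, ("continue", "s"): 0.2}
```

Evaluating the root then accumulates 1.0 (primitive) + 0.2 (completing 'continue') + 0.5 (completing 'rest'), mirroring how value propagates up the hierarchy of Figure 1.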
Figure 2: Rewards r(s) and penalties c(s) for continuing or leaving the activity 'physical workout' (light, medium, and heavy exercise) versus time steps.
Optimality.
Hierarchical reinforcement learning admits two notions of an optimal solution: (i) recursive optimality and (ii) hierarchical optimality. In a recursively optimal solution, such as MAXQ [8], the expected return for performing a subtask a is Q(i, s, a) = Q_r(a, s) + Q_c(i, s, a). The expected reward after the completion of subtask i is not part of the value function, which makes MAXQ recursively optimal. In a hierarchically optimal solution [2], when solving the entire task, i.e., the root, the policy is hierarchically optimal given the rewards from the external subroutines (Eq. (8)); each individual task may or may not be locally optimal. The Bellman equation for the recursively optimal case can be expressed as:

Q^π(i, s, a) = Q_r(a, s) + Q_c(i, s, a),   (11)

while the Bellman equation for the hierarchically optimal case can be expressed as:

Q^π(i, s, a) = Q_r(a, s) + Q_c(i, s, a) + Q_e(i, s, a).   (12)

For this paper, we take H to be a hierarchically optimal agent to imitate the switching ability of humans, who have context about the external rewards from other available tasks.

We perform the following experiments to evaluate four different assumptions regarding the human model H while interacting with the SHS. We train each human model for 350 episodes, which was empirically enough for convergence. Next, we train the human models with the SHS for an additional 150 episodes. We repeat our experiments
50 times to evaluate how the SHS performs with different combinations of TH settings when initializing the environment. In the following, we describe the different human models and experiments along with the obtained results.
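The training protocol above (350 episodes of pretraining, 150 joint episodes, 50 randomized trials) can be outlined as a simple loop. Everything here is an illustrative stub, not the paper's implementation; only the episode counts and the random TH initialization come from the text:

```python
import random

class Agent:
    """Placeholder for either the HRL human model or the Q-learning SHS."""
    def __init__(self):
        self.episodes_trained = 0

    def train_episode(self, env_state):
        self.episodes_trained += 1     # stands in for a learning update

def run_trial(seed, pretrain=350, joint=150):
    rng = random.Random(seed)
    # Random initial temperature/humidity combination for this trial.
    env_state = {"temp": rng.uniform(15, 30), "humidity": rng.uniform(30, 70)}
    human, shs = Agent(), Agent()
    for _ in range(pretrain):          # human model trained alone
        human.train_episode(env_state)
    for _ in range(joint):             # human model and SHS trained together
        human.train_episode(env_state)
        shs.train_episode(env_state)
    return human.episodes_trained, shs.episodes_trained

# 50 independent trials with different TH initializations, e.g.:
# results = [run_trial(seed) for seed in range(50)]
```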
Experiment 1: Baseline.
In this experiment, we explore a baseline scenario where an SHS is trained for an average human model using the existing thermal comfort sensational model [1]. Here, we aim to show that 1) our human model can indeed learn to complete each available activity, 2) the SHS can learn to anticipate the human TH preferences, and 3) the human performance does not exhibit any unintended behaviors when integrating the SHS.

Let Model A be a generic human model used to train the SHS, with the optimal comfort PMV indices (i.e., the SHS initial settings) between −0.5 and +0.5 and a PPD of 5% as per ASHRAE-55 [1]. The metabolic rates used by the PMV function for the three activities are set as defined in [1]. The human model reward function is given by:

R_{A,B}(s, a) = r(s) + d(s) if a is continue; −c(s) if a is leave,   (13)

where d(s) = −|pmv(s)| is the intrinsic comfort reward given the PMV score pmv(s) for the current state s, and r(s) and c(s) are the reward and penalty functions for pursuing and leaving the task, respectively (Figure 2). This helps the human learn when to switch tasks given the current task and TH. This experiment shows that Model A is capable of completing each activity along with setting the optimal TH settings (see Figure 3 (a) and (b)). As expected, with the SHS, the number of time steps required for Model A to correct the TH was reduced, as shown in Figure 4 (a).

Experiment 2: Control.
In this experiment, our goal is to determine whether the SHS can learn to adapt to a particular human model with a thermal preference different from that of an average human model (Model A). Here, we define Model B as a human with different activity-specific TH preferences compared to Model A, but with the same comfort reward functions. The metabolic values within the PMV for the three activities of Model B are set to different values based on [1], which also changes the TH preference.
Figure 3: Plot of activities over time for each model (A, B, C, D) with and without the SHS. In (a), (c), (e), each model learns to complete the tasks without interruption. In (b) and (f), the SHS anticipates human preferences, speeding up the time for human models A, B, and D to complete the activities. In (d), Model C (different internal reward structure) behaves erratically in the presence of the SHS. Set i denotes the action of setting TH for the corresponding Task i.

In Figure 3 (a), we can see that both human models A and B, once trained, converge to the expected behaviors by completing all 3 activities and spending a few time steps tuning the TH at the beginning of each activity. Figure 3 (b) then shows that once the SHS is integrated into the environment, the amount of time spent by the human model in adjusting the TH is reduced, without negatively affecting the completion of activities by the human. As expected, with the SHS, the human model takes fewer time steps to correct the TH. Figures 4 (a) and 4 (b) show that with the SHS, the time steps (green) are reduced compared to the human model without the SHS (red) for both Models A and B. This shows that the SHS learns to adapt to Model B, which has a different thermal preference. This works well for the TH preferences of an average Model A as well as for Model B with different preferences, showing that the SHS model can adapt to both generic and specific human model preferences. From Experiments 1 and 2, we can conclude that the human models in an RL-based smart home can successfully complete the activities without abruptly leaving any of them, while requiring fewer time steps to set the TH.

Experiment 3: Demonstration.
In this experiment, we look into the prospect of the SHS learning the preference of a human that has a different intrinsic thermal reward structure than the ones in Models A and B. Models C and D differ from A and B by having an extra reward term for leaving uncomfortable thermal conditions [6]. We define the reward structures of Models C and D by:

R_C(s, a) = r(s) + d(s) if a is continue; −c(s) + l(s) if a is leave,   (14)

where the term l(s) = |pmv(s)| is the intrinsic leaving reward that depends on the TH, s is the current state, and pmv returns the discrete PMV score of the state. This helps the model imitate the sense of relief when humans leave an uncomfortable environment.
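The two reward structures, Eq. (13) for Models A/B and Eq. (14) for Models C/D, can be sketched side by side. The arguments r_s, c_s, and pmv_s stand in for the paper's activity-specific r(s), c(s), and the PMV score of the current state:

```python
def reward_ab(action, r_s, c_s, pmv_s):
    """Eq. (13): comfort penalty d(s) = -|pmv(s)| while continuing;
    leaving only incurs the penalty c(s)."""
    if action == "continue":
        return r_s - abs(pmv_s)
    return -c_s                              # action == "leave"

def reward_cd(action, r_s, c_s, pmv_s):
    """Eq. (14): same as Eq. (13) while continuing, but leaving adds
    the relief term l(s) = |pmv(s)|."""
    if action == "continue":
        return r_s - abs(pmv_s)
    return -c_s + abs(pmv_s)                 # leave: relief l(s)
```

The comparison makes the mechanism behind Experiment 3 concrete: the more uncomfortable the state (the larger |pmv(s)|), the more leaving pays off for Models C/D relative to Models A/B.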
Figure 4: Comparison of human Models A, B, C, and D while setting TH for comfort, with and without the SHS. Each colored plot denotes PMV variations versus the required time steps to reach the optimal PMV range while pursuing the activities. Red: without the SHS; Blue: minimum time steps without the SHS; Green: with the SHS. The shaded area shows the comfortable PMV range within 5% PPD. Models C and D have a comfort range between a PMV of −0.1 and +0.1, along with a different intrinsic reward system.

To learn about the factors that may impact human behavior in the presence of the SHS, we assigned Model C metabolism indices different from Model A's, and Model D the same metabolism indices defined for Model A. This changes the TH preference during any specific activity for Model C but not for Model D. For this experiment, however, to represent a sensitive human with a different reward structure (Eq. 14), we set the comfortable range to be between −0.1 and +0.1.

Figure 3 (c) shows the experiment results, where Model C converges to the exact same behaviors as Models A and B without the smart home. However, with the smart home, Model C deviates from the expected behavior (completing the activities without abruptly leaving any of them, as well as setting the TH with reduced time steps). In Figure 4 (c), the green plot shows that with the SHS, Model C takes more steps to set the TH than without the SHS (red plot). While trying to complete the tasks, we also observed Model C occasionally leaving the activities before completing them (Figure 3 (d)).

To study this deviation in behavior, we look into the Q-values of Model C, as shown in Figure 5. We observe that the overall reward available in external subroutines (Q_e) and the reward within the same subroutine (Q_c) both increase in the presence of the SHS.
With the SHS, the mean of the Q_c-values increases from 1.1 to 1.34 (see Figure 5, top row), indicating an eagerness to complete the current activity (subroutine), while the mean of the Q_e-values increases from 0.012 to 0.098 (see Figure 5, bottom row), indicating the availability of higher rewards from external activities (subroutines). For any state, if the Q_e-value is higher than the Q_c-value, Model C decides to leave the current activity to maximize the rewards from external activities. The number of such states with higher Q_e-values increases in the presence of the SHS, which explains the switching between activities. With the occasional switching between activities, the SHS fails to learn the TH preference of Model C during a particular activity, which inherently increases the overall number of time steps needed to learn the TH preference of the human model. Evaluating Model C (see Table 1), we can see that over 50 trials, the mean reward for Model C decreased from 266 to 197, showing a decrease in the ability of the SHS to learn the preference of Model C with different intrinsic rewards and metabolism indices. Meanwhile, the standard deviation increased from 3.3 to 18.1, similar to our observation in Figure 4 (c), which may suggest additional instability in the overall system.

Next, we compare Model C with Model D, which has the same metabolism indices for the activities as the generic Model A. Figure 3 (e) shows that without the SHS, Model D is able to complete each task and set the optimal TH settings. Figure 3 (f) shows Model D successfully completing each activity in the presence of the SHS without occasional switching.
Figure 5: Model C completion and exit rewards (Q_c is the expected reward for completing the current task, and the external reward Q_e is the expected reward from tasks other than the current subroutine), without and with the SHS.

Table 1: Mean time steps (MTS) required to change TH and mean reward (MR) ± standard deviation for each model, with and without the SHS.

            Without SHS          With SHS
            MTS    MR            MTS    MR
Model A     22     253 ± –       –      –
Model B     –      –             –      –
Model C     –      266 ± 3.3     –      197 ± 18.1
Model D     –      –             –      –

In this paper, we investigated the potential impacts of RL-based SHS on human behavior. We showed that a hierarchy-based human model is capable of completing different activities and setting TH preferences for optimal comfort during the activities. Our human model is based on HRL, where each activity is an SMDP that the agent learns to complete. In order to maximize comfort, the human also learns to set TH settings specific to the activity being performed. As an activity progresses, the agent also anticipates rewards from external subroutines using the three-part value function and then decides to continue or leave it. Similarly, our SHS model is based on Q-learning and learns the human TH preferences based on the activity and the current TH. The SHS model can learn and adapt to human models with similar or different TH preferences, successfully reducing the number of time steps required by the human model to adjust the TH. Our experiments showed that under certain conditions, atypical human behaviors can arise in the presence of the SHS. This alternate behavior can be described as occasional switching between activities rather than completing each activity to the end, an increase in the overall time steps, or both. On the other hand, in our control experiment (Experiment 2), the SHS resulted in an improvement in human performance, such as a decrease in the time steps required to set the TH, without any occasional switching between activities.
Our simulations and findings present open questions for future work, which may involve improving the SHS to anticipate the behavioral state space of more complex humans.
References

[1] American Society of Heating, Refrigerating and Air-Conditioning Engineers. Standard 55: Thermal environmental conditions for human occupancy, 2010.
[2] D. Andre and S. J. Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.
[3] F. Auffenberg, S. Stein, and A. Rogers. A personalised thermal comfort model using a Bayesian network. International Joint Conference on Artificial Intelligence, 2015.
[4] S. Baghaee and I. Ulusoy. User comfort and energy efficiency in HVAC systems by Q-learning. Pages 1–4. IEEE, 2018.
[5] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.
[6] M. M. Botvinick. Hierarchical reinforcement learning and decision making. Current Opinion in Neurobiology, 22(6):956–962, 2012.
[7] P. Chrabaszcz, I. Loshchilov, and F. Hutter. Back to basics: Benchmarking canonical evolution strategies for playing Atari. arXiv preprint arXiv:1802.08842, 2018.
[8] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[9] G. Gao, J. Li, and Y. Wen. Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. arXiv preprint arXiv:1901.04693, 2019.
[10] C. Gebhardt, A. Oulasvirta, and O. Hilliges. Hierarchical reinforcement learning as a model of human task interleaving. arXiv preprint arXiv:2001.02122, 2020.
[11] P. W. Glimcher. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654, 2011.
[12] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627, 2018.
[13] D. B. Kaber and M. R. Endsley. The effects of level of automation and adaptive automation on human performance, situation awareness and workload in a dynamic control task. Theoretical Issues in Ergonomics Science, 5(2):113–153, 2004.
[14] A. Khalili, C. Wu, and H. Aghajan. Autonomous learning of user's preference of music and light services in smart home applications. In Behavior Monitoring and Interpretation Workshop at German AI Conf., page 12, 2009.
[15] M. G. Lawrence. The relationship between relative humidity and the dewpoint temperature in moist air: A simple conversion and applications. Bulletin of the American Meteorological Society, 86(2):225–234, 2005.
[16] J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823, 2016.
[17] T. Liang, B. Zeng, J. Liu, L. Ye, and C. Zou. An unsupervised user behavior prediction algorithm based on machine learning and neural network for smart home. IEEE Access, 6:49237–49247, 2018.
[18] A. J. Oswald, E. Proto, and D. Sgroi. Happiness and productivity. Journal of Labor Economics, 33(4):789–822, 2015.
[19] C. T. O'Sullivan. Newton's law of cooling—a critical assessment. American Journal of Physics, 58(10):956–960, 1990.
[20] J. Y. Park, T. Dougherty, H. Fritz, and Z. Nagy. LightLearn: An adaptive and occupant centered controller for lighting based on reinforcement learning. Building and Environment, 147:397–414, 2019.
[21] W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017.
[22] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998.
[23] A. TenWolde and C. L. Pilon. The effect of indoor humidity on water vapor release in homes. 2007.
[24] W. Valladares, M. Galindo, J. Gutiérrez, W.-C. Wu, K.-K. Liao, J.-C. Liao, K.-C. Lu, and C.-C. Wang. Energy optimization associated with thermal comfort and indoor air control via a deep reinforcement learning algorithm. Building and Environment, 155:105–117, 2019.
[25] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[26] Y. Yang, J. Hao, Y. Zheng, and C. Yu. Large-scale home energy management using entropy-based collective multiagent deep reinforcement learning framework. In IJCAI, pages 630–636, 2019.
[27] R. Zhang, F. Torabi, L. Guan, D. H. Ballard, and P. Stone. Leveraging human guidance for deep reinforcement learning tasks. International Joint Conference on Artificial Intelligence, 2019.