Beyond Winning and Losing: Modeling Human Motivations and Behaviors Using Inverse Reinforcement Learning
Baoxiang Wang [email protected]
Tongfang Sun [email protected]
Xianjun Sam Zheng [email protected]
ABSTRACT
In recent years, reinforcement learning (RL) methods have been applied to model gameplay with great success, achieving super-human performance in various environments, such as Atari, Go, and Poker. However, those studies mostly focus on winning the game and have largely ignored the rich and complex human motivations, which are essential for understanding different players' diverse behaviors. In this paper, we present a novel method called Multi-Motivation Behavior Modeling (MMBM) that takes the multifaceted human motivations into consideration and models the underlying value structure of the players using inverse RL. Our approach does not require access to the dynamics of the system, making it feasible to model complex interactive environments such as massively multiplayer online games. MMBM is tested on the World of Warcraft Avatar History dataset, which recorded over 70,000 users' gameplay spanning a three-year period. Our model reveals significant differences in value structures among different player groups. Using the results of motivation modeling, we also predict and explain their diverse gameplay behaviors and provide a quantitative assessment of how the redesign of the game environment impacts players' behaviors.
INTRODUCTION
In recent years, reinforcement learning (RL) methods have been applied to model gameplay with great success, achieving super-human performance in various environments, such as Atari, Go, and Texas hold'em poker [10, 19, 11]. Those studies, however, primarily focus on winning the game, and the goal of the computer agent is to take actions that maximize the cumulative scalar reward, such as achieving high scores or beating the opponents. They have mostly ignored the rich and complex human motivations, which are essential for understanding different players' reward mechanisms as well as their complex and diverse behaviors. In fact, numerous behavioral and psychology studies [17, 2] have shown that when people are playing games, apart from competing and winning, they also try to connect with others, or they just want to have some fun or enjoyment by themselves. An extensive survey of game motivation [22, 21] with 30,000 players of Massively-Multiplayer Online Games (MMOGs) confirms that human players have complex and multifaceted motivations. As shown in Tbl. 1, the study categorizes the complex motivations of gameplay into ten different types and three different groups, namely, Achievement, Social, and Immersion.

In this paper, we propose a novel method called Multi-Motivation Behavior Modeling (MMBM) that is based on RL and takes into consideration the multifaceted human motivations. The objective of MMBM is to model the underlying
value structure of the players from the observed human behavior. By incorporating the motivation theory of gameplay in [22], we extend the standard RL framework to cover multiple-reward situations. In MMBM, the goal of the agents (or players) is not simply to maximize one scalar reward under one single motivation such as achieving high scores, but instead to maximize the combination of multiple rewards based on the multifaceted motivations. Fig. 1 illustrates the difference between typical RL and our proposed MMBM. The challenge in discovering human motivations is that they are not explicitly observable. Instead, we have to infer them from the players' behaviors, which can be achieved by using inverse reinforcement learning (IRL). In MMBM, we extend IRL to uncover the complex, multi-dimensional reward mechanism. Our model first quantifies each dimension of the reward signal individually based on the motivation theory. The individual signals are subsequently combined under the assumption that each player appearing in the trajectory acts to optimize their objectives. In this way, the decomposition of the full reward signal is reduced to a linear program, which can be solved efficiently, and subsequently the value structures of the players can be computed.

A significant advantage of MMBM is that it utilizes only off-policy learning: each of the individual reward signals is estimated by Q-learning with deep Q-networks (DQN), and MMBM's IRL algorithm takes only the trajectories as its input. In this way, MMBM does not require a simulator of the environment, nor does it require human players' counterfactual actions that do not exist in the dataset. This is beneficial since most of the existing IRL methods need access to either the simulation environment or the actual human policies, which are usually costly to obtain or simply do not exist. For large and complex games, MMBM provides a feasible way to analyze the historical data.

We apply MMBM to model the players' behaviors and motivations in World of Warcraft, one of the most successful massively multiplayer online role-playing games with millions of subscribers worldwide. We test MMBM on the World of Warcraft Avatar History (WoWAH) dataset [8] with 70,000 users' gameplay spanning over a three-year period. Our method outputs the value structure, which is the most succinct description of the game environment from the perspective of the human players. On top of the value structure, it also predicts the players' behaviors accurately, outperforming existing approaches such as large-margin Q-learning [15] and policy imitation via a classifier. Moreover, it reveals the different reward functions and diverse value structures among different player groups, which interestingly agrees with previous knowledge-based studies on WoW [6, 12].

Figure 1. In the typical RL model (left), an agent or player has only one single motivation and maximizes one scalar reward. In MMBM (right), an agent or player has multiple motives and the goal is to optimize the combination of different rewards based on each agent's value structure.

Table 1. Components of game motivation
Components     Sub-components
Achievement    Advancement, Mechanics, Competition
Social         Socializing, Relationship, Teamwork
Immersion      Discovery, Role Playing, Customization, Escapism

PRELIMINARIES
Inverse Reinforcement Learning
The process of recovering the reward function from observed trajectories is inverse reinforcement learning (IRL). It reverses the input and output pairs of RL algorithms, computing the reward function according to the policies or actions of the agents. The basic assumption of IRL is that, though the reward function is unknown, it exists and the agents' actions are conducted to maximize the cumulative reward. The assumption has been formalized in a few different mathematical forms, including linear IRL [1, 13], max-entropy IRL [16], and large-margin Q-learning [15]. Most of the existing IRL algorithms have one of two requirements: they need access to the dynamics of the environment [1, 10], which is usually provided by a simulator of the game, or access to the policy function [23], which requires the agent to retroactively compute the counterfactual action at a historical decision point. Such requirements are expensive in complex and massive games where human players are involved. Hence, an IRL algorithm without those needs is desired. Approaches such as large-margin Q-learning [14, 7] and our proposed MMBM do not have these requirements and are suitable for complex, real-world game environments.
Deep Q-Learning and Large Margin Q-Learning
We first define some notions in RL. In a game environment, at each round t, the player conducts an action a_t according to their own policy π(·) and the current game state s_t. The player may not be able to obtain the full game state (such as events that happen out of the player's vision) and uses the observation x_t as a substitute for s_t. The player subsequently receives feedback from the environment, including a scalar reward r_t and the observation x_{t+1} of the next round. The player's intention is to maximize his/her discounted cumulative reward, also known as the action-value function,

$$Q^{\pi}(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right], \quad (1)$$

where $R_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$. Deep Q-learning uses a deep Q-network (DQN) to estimate the value $Q^{\pi^*}(s, a)$, where π* is the policy that maximizes the Q-value over all policies. DQN uses the recursive relation of $Q^{\pi^*}(s, a)$, known as the Bellman equation,

$$Q^{\pi^*}(s_t, a_t) = r_t + \gamma \max_{a'} Q^{\pi^*}(s_{t+1}, a'). \quad (2)$$

The estimation is conducted by minimizing the Bellman error

$$L = \mathbb{E}\left[ \left( Q^{\pi^*}(s_t, a_t \mid \theta) - y' \right)^2 \right], \quad (3)$$

where $y' = r_t + \gamma \max_{a'} Q^{\pi^*}(s_{t+1}, a' \mid \theta')$ is the target value and θ' denotes the parameters of the target network.

While DQN estimates the Q(s, a) function of π* from the reward signals, which is subsequently used to retrieve the optimal policy, large-margin Q-learning approximates the action-value function corresponding to the observed behavior directly. Suppose the policy and the reward signals of a player or a group of players are unknown and we have observed a set of state-action pairs generated by such a policy. Large-margin Q-learning [14, 7] assumes (as most IRL algorithms do) that the players' actions are intended to maximize their action-value, namely, that

$$Q^*(s, a) \geq \max_{a' \in A(s)} Q^*(s, a') \quad (4)$$

is satisfied for all observed state-action pairs with a margin. Note that A(s) is the set of all feasible actions under state s. Adding a large margin to the difference in inequality (4) results in the error term

$$L = \left( Q^{\pi^*}(s_t, a_t) - \left( l_{s,a} + \max_{a'} Q^{\pi^*}(s_t, a') \right) \right)^2, \quad (5)$$

where $l_{s,a}$ are the margins, which could be either pre-defined parameters or trainable parameters.
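To make these preliminaries concrete, the following is a minimal sketch (not the authors' released code) of how the Bellman error of Eq. (3) and a large-margin term in the spirit of Eq. (5), written here in the hinge form popularized by deep Q-learning from demonstrations [7], could be computed in PyTorch. The tensor shapes, the margin value, and the discount factor are illustrative assumptions.

```python
import torch

def dqn_bellman_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    """Squared Bellman error of Eq. (3): (Q(s_t, a_t | theta) - y')^2.

    a is a LongTensor [batch] of taken actions; r is a FloatTensor [batch].
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s_t, a_t | theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values     # target y' from the target network
    return ((q_sa - y) ** 2).mean()

def large_margin_term(q_net, s, a, margin=0.8):
    """Large-margin term in the spirit of Eq. (5): the observed action's
    Q-value should exceed every alternative's by at least the margin."""
    q = q_net(s)                                                 # shape [batch, num_actions]
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)
    l = torch.full_like(q, margin)
    l.scatter_(1, a.unsqueeze(1), 0.0)                           # zero margin for the observed action
    return ((q + l).max(dim=1).values - q_sa).mean()
```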
METHODS
Reward Mechanism Modeling in MMBM
We present our MMBM algorithm to compute the underlying reward function of the agents. Our method can be viewed as a two-step workflow. The first step, Q-learning, estimates the reward functions of the players at different states of the gameplay environment; the second step, a variant of inverse reinforcement learning (IRL), estimates the combination, or weights, of the different rewards learned in the first step. In essence, the two-step methodology decomposes the complex interactions between players and a game environment into multiple quantitative metrics and solves them separately. An intuitive illustration of the two-step framework on WoWAH is shown in Fig. 2, while the formal algorithm is described in Algorithm 1.

The fundamental idea behind the first step is that in a complex environment, given the same situation or state, different players perform their respective optimal actions and exhibit diverse behaviors. For example, a player who values the relationship with his/her teammates more would spend more time on team-based activities than players who focus more on their advancements or achievements, because he/she receives more overall reward by collecting more of the teamwork-based rewards that he/she values. Hence, the combination of multiple reward signals is essential to model the users' behavior and their underlying value structure.

MMBM learns the weights of the combination of multiple motivations from the user behavior data using IRL techniques. Formally, let T be the set of state-action pairs of a user or a group of users. It consists of the choices of the users (corresponding to the actions a_t in RL, where t is the time step index) under various situations or scenarios (corresponding to the states s_t in RL). We can also infer from the states that the agents are receiving feedback on multiple rewards $f_t = (f_t^1, \ldots, f_t^n)$ simultaneously. Assume that the players are optimal in processing any information available to them and display optimal trajectories towards their objectives. (This assumption is reasonable as we take multiple dimensions of reward signals into consideration.) Then, given the same environment state, the fact that different players perform diverse actions or display complex behaviors must result from their different motivations or value structures over $f_t^1, \ldots, f_t^n$. Under this assumption, the problem reduces to finding a valid combination of rewards such that, under that combination, every observed action is optimal, that is, there does not exist another feasible action that yields a higher total reward. Let $\phi \in \mathbb{R}^n$ be the combination weights, subject to
$\|\phi\|_1 = 1$ and $\phi \geq 0$, and define the reward as

$$r_t = \phi^{\mathsf{T}} f_t. \quad (6)$$

The action-value function (or Q-function) describes the objective of the user,

$$Q^*(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi^* \right], \quad (7)$$

where $R_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}$. The Q-function gives the expected cumulative reward the user obtains if the user chooses action a under state s and follows the best policy thereafter. It is the function that should satisfy the previously discussed action optimality, which can be formulated as

$$Q^*(s, a) \geq \max_{a' \in A(s)} Q^*(s, a'), \quad (8)$$

where $A(s)$ is the set of all possible actions the user can take at state s. Since $Q^*(s, a)$ is a function of φ, solving inequality (8) will yield the combination weights φ we want. Though (8) itself could be infeasible and hard to solve, we apply two approximations to find the solution. First, let $Q^i(s, a)$ be the action-value function as if $f^i$ were the only existing reward signal,

$$Q^i(s, a) = \mathbb{E}\left[ \sum_{t' \geq t} \gamma^{t'-t} f^i_{t'} \;\middle|\; s_t = s, a_t = a, \pi^i \right]. \quad (9)$$

For the moment we assume that such $Q^i$ functions can be accurately estimated. We then apply linear scalarization [15] from
IRL to explicitly separate out the weights φ:

$$Q^*(s, a) = \phi^{\mathsf{T}} \tilde{Q}(s, a), \quad (10)$$

where the vector of functions $\tilde{Q}(s, a) = (Q^1(\cdot), \ldots, Q^n(\cdot))$. Second, we introduce the slack variables $\xi_{s,a}$, which model the cases in which users behave less than optimally, such as making mistakes or just playing randomly. $\xi_{s,a}$ sets the threshold of the difference between the actual action-value $Q^*(s, a)$ and the largest possible action-value $\max_{a' \in A(s)} Q^*(s, a')$ over all feasible actions. The value of $\xi_{s,a}$ is positive whenever inequality (8) is not satisfied, and zero otherwise. Solving inequality (8) is then reduced to minimizing the summation of the slack variables $\xi_{s,a}$ over all observed pairs, which is

$$- \sum_{s, a} \left[ \min\left( 0, \; Q^*(s, a) - \max_{a' \in A(s) \setminus a} Q^*(s, a') \right) \right]. \quad (11)$$

After the two approximation steps, minimizing the total slack becomes a linear program (LP) as follows. Given the action-value functions $\tilde{Q}(s, a)$ for each of the reward signals, and letting T be the set of observed state-action pairs (that is, our dataset), the minimization of the summation of the slack variables (11) is formulated as the following LP:

$$\begin{aligned}
\underset{\phi, \, \xi}{\text{minimize}} \quad & \sum_{s, a} \xi_{s, a} \\
\text{subject to} \quad & \phi^{\mathsf{T}} \left( \tilde{Q}(s, a) - \tilde{Q}(s, a') \right) \geq - \xi_{s, a}, \quad \forall (s, a) \in T, \; a' \in A(s) \setminus a \\
& \phi \geq 0, \quad \|\phi\|_1 = 1 \\
& \xi_{s, a} \geq 0, \quad \forall (s, a) \in T.
\end{aligned} \quad (12)$$

As LPs can be solved efficiently, MMBM finds the composition of the rewards by solving for the weights φ of the different reward signals in Eq. (12). The remaining problem is to estimate the action-value functions $\tilde{Q}(s, a)$, which is solved by Q-learning via DQN. Referring to Algorithm 1, the output is φ, which is a quantitative description of the human players' motivations and value structure.
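For illustration, LP (12) can be assembled and handed to an off-the-shelf solver. The sketch below uses scipy.optimize.linprog; the variable layout (φ followed by one slack ξ per observed pair) and the input format are our own assumptions, not necessarily how the authors implemented it.

```python
import numpy as np
from scipy.optimize import linprog

def solve_weights(pairs, n):
    """Solve LP (12) for the motivation weights phi.

    pairs: list of (q_obs, q_alts), where q_obs is the length-n vector
           Q_tilde(s, a) of the observed action and q_alts stacks the
           vectors Q_tilde(s, a') of the alternative actions a' in A(s)\\a.
    """
    m = len(pairs)                                    # one slack xi per observed pair
    c = np.concatenate([np.zeros(n), np.ones(m)])     # minimize the sum of slacks
    A_ub, b_ub = [], []
    for i, (q_obs, q_alts) in enumerate(pairs):
        for q_alt in np.atleast_2d(q_alts):
            # phi . (q_obs - q_alt) >= -xi_i   <=>   -(q_obs - q_alt) . phi - xi_i <= 0
            row = np.zeros(n + m)
            row[:n] = -(np.asarray(q_obs) - np.asarray(q_alt))
            row[n + i] = -1.0
            A_ub.append(row)
            b_ub.append(0.0)
    A_eq = np.concatenate([np.ones(n), np.zeros(m)])[None, :]   # ||phi||_1 = 1 together with phi >= 0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=(0, None), method="highs")
    return res.x[:n]                                  # the recovered weight vector phi
```

Because both the objective and the constraints are linear in φ and ξ, the problem size grows only with the number of observed state-action pairs and their feasible alternatives, which is what makes this step efficient.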
Off-Policy Action-value Function Approximation
In Alg. 1, MMBM requires an approximation of the action-value function $Q^i(s, a)$ for each component of the reward. Such an approximation should be a fair estimate of the cumulative reward the user would receive if the user chose action a at state s and maximized the i-th reward thereafter. DQN uses the recursive property (i.e., the Bellman equation) that the action-value estimator should satisfy: the cumulative reward from the current step onwards should be the immediate reward plus the cumulative reward from the next step onwards. Using this property, DQN updates the action-value function iteratively, moving the $Q^i(s, a)$ value (by updating the network parameters) towards its target $r_t + \gamma Q^i(s', a^*)$. By the time $Q^i(\cdot)$ converges to satisfy the Bellman equation, it estimates the action-value for any state-action pair.

Figure 2. Illustrative execution of Alg. 1 on WoWAH

Algorithm 1 MMBM
Parameters: learning rate α, discount factor γ
Initialization: initialize network parameters w_i randomly
Input: set T of trajectories
for i = 1 to n do
    for t = 1 to size of T do
        Calculate f^i_t
    end for
    repeat
        Compute L_i = E[(Q^i(s_t, a_t | w_i) − y')^2]
        Update w_i = w_i − α ∇_{w_i} L_i
    until convergence of Q^i(s, a)
end for
for t = 1 to size of T do
    Compute Q̃(s, a) = (Q^1(·), ..., Q^n(·))
end for
Find φ by solving linear program (12)

An advantage of Q-learning is its off-policy property, which implies that the action-value function approximation does not rely on real-time data or any game simulator. To understand this, observe that the (s, a, s', a*) tuples used in the iterations of the update can be fed into the model in an arbitrary order and can involve any a, without requiring that a be generated by a certain policy. This is very important to our algorithm because the interaction records between the agent and the environment, such as computer-human interaction logs, are usually only available offline. This means that MMBM does not require the dynamics of the environment for training the model. Using gameplay history or player behavior log data as the input, MMBM can model complex game environments such as massively multiplayer online games.

The function approximator that parametrizes the action-value functions largely depends on the environment. Taking our experiments on the WoWAH dataset as an example, the DQN architecture is designed according to the available observations and is applied to all reward signals i = 1, ..., n. As shown in Fig. 3, the categorical elements of the input (e.g., race, class) are first processed by an embedding layer [9], while the numerical elements (e.g., session length, current level) are first fed into a fully connected (FC) layer with a rectifier non-linearity. The outputs of the embedding layer and the FC layer are then concatenated and fed into another FC layer with a rectifier non-linearity. A final FC layer computes the Q(s, a) value for each action a ∈ ∪_s A(s). A detailed introduction of the environment and the details of each of the input variables are included in the experiment section.
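As an illustration of the layout described for Fig. 3, the following PyTorch module mirrors it: embeddings for categorical fields, an FC layer for numerical fields, concatenation, and two further FC layers. The vocabulary sizes, embedding width, and hidden sizes are illustrative assumptions, not the values used in the paper; one such network would be trained per reward signal i with the Bellman loss sketched earlier.

```python
import torch
import torch.nn as nn

class WoWAHQNetwork(nn.Module):
    """Sketch of the Fig. 3 Q-network; one network per reward signal i."""

    def __init__(self, cat_cardinalities, num_numeric, num_actions,
                 embed_dim=8, hidden=128):
        super().__init__()
        # one embedding table per categorical field (e.g. race, class)
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, embed_dim) for card in cat_cardinalities])
        # numerical inputs (e.g. session length, current level) go through FC + ReLU
        self.numeric_fc = nn.Sequential(nn.Linear(num_numeric, hidden), nn.ReLU())
        concat_dim = embed_dim * len(cat_cardinalities) + hidden
        self.head = nn.Sequential(
            nn.Linear(concat_dim, hidden), nn.ReLU(),   # FC layer with rectifier
            nn.Linear(hidden, num_actions))             # final FC layer: one Q(s, a) per action

    def forward(self, categorical, numeric):
        # categorical: LongTensor [batch, num_cat_fields]; numeric: FloatTensor [batch, num_numeric]
        embedded = [emb(categorical[:, j]) for j, emb in enumerate(self.embeddings)]
        x = torch.cat(embedded + [self.numeric_fc(numeric)], dim=1)
        return self.head(x)
```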
Figure 3. DQN architecture for Q^i training, i = 1, ..., n, on WoWAH

Imitation Learning and Predictions
An immediate use of the action-value function is to derive the optimal policy that imitates the gameplay in the samples. That is, letting π*(s) denote the action at state s,

$$\pi^*(s) = \operatorname*{argmax}_{a'} Q^*(s, a') \quad (13)$$

is the policy function that predicts the player's move. The intuition behind this predictive power is that if the players' reward system is available, we can easily predict the players' behavior. Moreover, the predictions of user behavior come with a reason behind them: while most classification models are black boxes whose outputs may not correspond to a clear intuition, MMBM instead reveals the underlying system that drives the behavior before making predictions.

To make predictions, MMBM first solves LP (12) and obtains the combination of the reward signals r_t. As the action-value functions $Q^i$ have already been learned, the action-value function $Q^*(s, a)$ becomes known by applying Eq. (10). To avoid the bias involved in the scalarization process, we learn $Q^*(s, a)$ using the combination weights and the original dataset once more. This retraining of the action-value function can be generalized to more complex combinations, for example, a Pareto combination of the individual reward signals. Note that $Q^*(s, a)$ and π*(s) do not correspond to just advancement or the fastest leveling up in the game. MMBM is beyond winning and losing: it models and predicts the actual actions that the humans would have conducted when present in such a state.
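A minimal sketch of how Eq. (13) could be used to predict a player's next move from the learned per-reward Q-values and the recovered weights φ follows; masking the actions outside A(s) with negative infinity is our own assumption about how the feasible set would be handled.

```python
import numpy as np

def predict_action(phi, q_components, feasible_mask):
    """Eq. (13): pick the feasible action with the largest combined Q-value.

    q_components: array [n_rewards, n_actions] holding Q^i(s, a) for one state s.
    feasible_mask: boolean array [n_actions], True for actions in A(s).
    """
    q_star = phi @ q_components                        # Eq. (10): Q*(s, a) = phi^T Q_tilde(s, a)
    q_star = np.where(feasible_mask, q_star, -np.inf)  # rule out actions outside A(s)
    return int(np.argmax(q_star))

def prediction_accuracy(phi, samples):
    """samples: iterable of (q_components, feasible_mask, actual_action) tuples."""
    hits = [predict_action(phi, q, m) == a for q, m, a in samples]
    return float(np.mean(hits))
```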
EXPERIMENTAL ANALYSIS
WoWAH Player Behavior Dataset
We tested our MMBM on WoW, one of the most successful MMORPGs in the world with millions of active players. As in other MMORPGs, each player chooses a character avatar and controls the avatar in a third- or first-person view throughout the game. Players can explore the landscape, fight various monsters, complete quests individually or cooperatively, communicate and interact with other players, or build their guilds (groups). As shown by Yee [22, 21], players' motivations are distinct, and their actions and behaviors are complicated. The WoWAH dataset [8] is an interesting dataset for investigating this behavior. It records a significant amount of gameplay data, with over 70,000 players' movements (regarded as actions) from the realm
TW-Light's Hope every 10 minutes, spanning a three-year period. Previous studies on this dataset are either based on descriptive statistics [8] or use simple classifiers or clustering [20, 5, 3, 4], and these methods fail to capture the rich and complex motivations of the players.

From the reinforcement learning perspective, we treat each player as a human agent who conducts an action at each time interval. All available data, such as the current level or joining a guild, are regarded as observations. The players' trajectories are composed of a sequence of locations and observations, which partially reflect their playing strategies [4, 18]. Even though Yee theorized ten different motivations for gameplay, we apply the five of them that are frequently observed in the WoWAH dataset. Therefore, we compute five different kinds of motivations using the WoWAH dataset and let n = 5. The motivations f^1, ..., f^5 are illustrated in Tbl. 2 and are based on Yee's research and other WoW case studies [6, 12]. With those values, we model the final reward function that each player tries to maximize during gameplay.

Table 2. Different types of motivations in WoWAH and corresponding definitions
Motv.  Category & Definition
f^1    Advancement describes how fast the player levels up in the game. It is the speed at which the user levels up, divided by the average speed over the entire WoWAH.
f^2    Competition describes whether the player joins Battleground or Arena and competes with human opponents. It equals the number of visits.
f^3    Relationship is linear in the duration that the player has been in the current guild.
f^4    Teamwork describes the intention of conducting teamwork, measured as the number of recently visited zones with teamwork features. Zones with teamwork features include Battleground, Arena, Dungeon, Raid, or a zone controlled by The Alliance.
f^5    Escapism is a linear combination of the duration of the recent game session and the number of days the player has recently logged in to the game continuously.

Player Motivation Modeling
We present our experimental results on recovering the multi-motivation mechanism, which is the solution of LP (12). We use Tbl. 2 and solve LP (12) on trajectories that are randomly drawn from the WoWAH dataset. The underlying reward mechanism and value structure of the whole player community is

$$\phi = (0.40, 0.10, 0.21, 0.16, 0.12)^{\mathsf{T}}. \quad (14)$$

In other words, when choosing an action or conducting a behavior, the players' total motivation is composed, on average, of 40% their advancement, 10% their competition, 21% their relationship, 16% their teamwork, and 12% their escapism. The results are illustrated as a spider map in Fig. 4. Note that the above φ is calculated based on the entire player database, and MMBM can calculate the respective φ vector for an individual player or a player group.

We now show some comparison results for different player groups. A significant difference in value structures is observed
between the players at a higher level (≥ 50) and the players at a lower level. We also compare players of the classes Warrior, Hunter, and Priest, where the Warrior players value Advancement more and the Priest players value Relationship more. This agrees with the common knowledge in WoW that the spells of Priest focus on benefiting (healing, buffing, etc.) the team, whereas the spells of Hunter and Warrior are more related to damage, and damage/tank, respectively. Lastly, the results also show that players in a guild value the Teamwork and Relationship motivations more than players who are not in a guild; the difference in weights is distributed to Advancement and Escapism instead. Interestingly, these quantitative results agree with previous knowledge-based studies on WoW [6, 12].
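Referring back to Tbl. 2, the five reward signals could be assembled per player and per time step along the following lines. This is only a sketch: every field name below is a hypothetical placeholder for whatever the WoWAH preprocessing produces, and the equal weighting inside f^5 is our own assumption.

```python
def motivation_features(record):
    """Compute (f^1, ..., f^5) of Tbl. 2 for one time step of one player.

    `record` is assumed to be a dict produced by a WoWAH preprocessing step;
    all field names are hypothetical placeholders.
    """
    f1 = record["level_speed"] / record["avg_level_speed"]   # Advancement: leveling speed relative to the WoWAH average
    f2 = record["battleground_arena_visits"]                 # Competition: number of Battleground/Arena visits
    f3 = record["guild_tenure"]                              # Relationship: duration in the current guild
    f4 = record["recent_teamwork_zones"]                     # Teamwork: recently visited zones with teamwork features
    f5 = record["session_length"] + record["consecutive_login_days"]  # Escapism: linear combination (equal weights assumed)
    return [f1, f2, f3, f4, f5]
```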
Predicting Players’ Behavior
Once MMBM models the humans' motivations and value structure, it can straightforwardly predict complex user behaviors. The prediction is made by the policy stated in Eq. (13). In WoW, at any given state, the player chooses its movement: to stay in the same zone or to move to another feasible zone. The action space is discrete, and depending on the player's level its size ranges from a few zones to over one hundred zones. Therefore, the chance of randomly guessing a player's next action is quite low, and our predictions are quite accurate considering this difficulty and the action space size.

We evaluate the accuracy of the prediction by checking whether the predicted action π*(s) agrees with the actual action a for the (s, a) pairs in the dataset. Experiments show that policies induced by a biased reward function underperform our π*; the biased policy is computed by adding a normally distributed disturbance factor ε to the solution φ of LP (12), as shown in Tbl. 3. We also compare our result with the policy that focuses only on advancement, obtained by setting φ = (1, 0, 0, 0, 0)^T, and the results show that our approach predicts players' actions significantly better. This finding indicates that taking into consideration the multiple motivations of the user or player not only reveals each player's different value structure but also predicts the complex user behavior more accurately. We also test large-margin Q-learning and policy imitation, and the results show that both of these methods are less accurate than our MMBM. Note that policy imitation via supervised learning is implemented as a multi-class support vector machine mapping the state s to the action a.

A close examination of the errors made during prediction yields some interesting insights. Our MMBM model assumes that every player tries to maximize their cumulative reward as in Eq. (4), i.e., everyone is regarded as a rational and optimal player. Unfortunately, our model has trouble distinguishing whether a particular action that deviates from the average one is caused by the player's actual intention or by the player's sub-optimality during gameplay. For instance, some players could spend hundreds of hours on solo quests but fail to level up quickly. This could be due to their intention to enjoy doing the quests repeatedly, or to the players not knowing the optimal strategy for leveling up. We will address this limitation of our method in future work by considering humans' different abilities or skill levels.

Figure 4. Spider maps representing player reward mechanism or value structure. Top-left: the weights of different motivations for the entire WoW player community; top-right: different value structures between the players at a higher level (≥ 50) and the players at a lower level; bottom-left: different value structures between the players in the classes Warrior, Hunter, and Priest; bottom-right: comparison of the value structures of the players who are in a guild and those who are not in a guild.
Table 3. Accuracy of different approaches
Approach    Accuracy   Notes
π*
π*                     φ + ε instead
π^1                    φ = (1, 0, 0, 0, 0)^T
LMQL        47.2%      Large margin Q-learning
PI          31.0%      Policy imitation via SL
Linear Q    29.5%      Replace DQN w/ linear

Dynamics of the Human Motivation
The motivation of gameplay may evolve. It can also be impacted by new designs or new versions of the game environment. How would a design update affect the users' motivations and behaviors, and how can we quantify this impact? This is a very interesting question for every game designer to consider. We conduct an analysis of the dynamics of player motivations on the WoWAH dataset, i.e., how the underlying reward mechanism of the players changes over time. To achieve this, we note that the set T in LP (12) may contain any number of trajectories. Randomly drawing (s_t, a_t) from the dataset, where t is restricted to a specific time range, yields a set T that illustrates the players' motivations during that time range. Taking the time ranges chronologically, we show the evolution of game motivation, characterized by the elements of φ. (Note that at any time the weights of those elements sum to 1, representing how players value those satisfactions relative to each other.) Fig. 5 illustrates the trends of Advancement, Competition, Relationship, Teamwork, and
Escapism.

First, we observe a dramatic increase in Advancement and Competition during the mid-to-late period on the graph. It happens at around the 150,000th time interval, which coincides with the release of the patch Wrath of the Lich King in November 2008. Analyzing the game update patch, two primary reasons can explain the increased level of motivation on Advancement. First, the patch increased the maximum player level from 70 to 80. As a result, the players at level 70, the previous maximum level, were rushing to complete the remaining ten level-ups to reach the new maximum level. Second, the patch introduced two new classes in the game, namely Death Knight and Shaman, and this gave many players an incentive to open secondary accounts, and leveling those up is the first thing to do afterward. Meanwhile, the reason for more Competition is that many players tend to join player-versus-player (PvP) matches to compete with other human players and become more familiar with the mechanics of their new avatar. It is also noticeable that the satisfactions are not independent of each other: players who spend more time on advancement usually have insufficient time to complete tasks that require teamwork but provide no experience for leveling up. This is shown in Fig. 5: the weight for Teamwork decreases each time the weight for Advancement increases, and vice versa.

We also analyze the overall trend of the game during the three years when WoWAH was collected. It turns out that the game emphasizes teamwork and relationship more over this period, partially because the data collection started only two years after the game's release, and the players were getting more and more involved in the game during that time. Apart from that, the weights of the different kinds of motivations are influenced both by game patches and updates and by the game user community. Overall, our MMBM model and the analysis in Fig. 5 provide useful insights for game designers and researchers.
Figure 5. Top: trends of the different kinds of motivations from Mar 2006 to Jan 2009; Bottom: enlargement of the top figure around the release of the patch Wrath of the Lich King.
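The sliding-window procedure behind Fig. 5 (restricting T to a time range and re-solving LP (12)) could be sketched as follows, reusing the hypothetical solve_weights helper from the LP sketch earlier; the window length, step size, and data layout are illustrative assumptions.

```python
import numpy as np

def motivation_trend(pairs_with_time, n_rewards, window, step, solve_weights):
    """Estimate phi on sliding time windows to chart its evolution over time.

    pairs_with_time: list of (t, q_obs, q_alts) tuples, where q_obs and q_alts
    are the per-reward Q-value vectors used as input to LP (12).
    solve_weights: the hypothetical LP helper sketched after Eq. (12).
    Returns a list of (window_start, phi) pairs, one per window with data.
    """
    times = np.array([t for t, _, _ in pairs_with_time])
    trend = []
    for start in range(int(times.min()), int(times.max()) - window + 1, step):
        in_window = [(q, qa) for t, q, qa in pairs_with_time
                     if start <= t < start + window]
        if in_window:
            trend.append((start, solve_weights(in_window, n_rewards)))
    return trend
```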
CONCLUSIONS AND FUTURE WORK
We present MMBM, a general RL model that takes multifaceted human motivations into consideration. MMBM conducts the IRL task without relying on access to the policy function or the dynamics of the environment. Hence, MMBM can be applied to study complex, interactive environments using their historical datasets. Our experimental results on the WoWAH dataset show that MMBM recovers a reasonable reward mechanism for the players. On top of that, it predicts human players' behaviors accurately, shows how different groups of players have their respective value structures, and provides a quantitative assessment of how the redesign of the game environment impacts players' behaviors.

We view our work as one of the first to combine the richness of psychological and game research theories with the rigor of RL models. Our goal is beyond winning and losing: not simply to create software agents that beat humans in various games or competitions, but to propose methods that help us understand the intricacy and complexity of human motivations and behaviors. We hope to inspire more researchers to investigate this topic further.
REFERENCES
1. Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 1.
2. James Tyrone Orpilla Alvarado. 2005. Playing with Power: An Examination of a Massive Multiplayer Online Role Playing Game. Ph.D. Dissertation. Alliant International University.
3. Christian Bauckhage, Anders Drachen, and Rafet Sifa. 2015. Clustering game behavior data. IEEE Transactions on Computational Intelligence and AI in Games 7, 3 (2015), 266–278.
4. Jonathan Bell, Swapneel Sheth, and Gail Kaiser. 2013. A large-scale, longitudinal study of user profiles in World of Warcraft. In Proceedings of the 22nd International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 1175–1184.
5. Anders Drachen, Christian Thurau, Rafet Sifa, and Christian Bauckhage. 2014. A comparison of methods for player clustering via behavioral telemetry. arXiv preprint arXiv:1407.3950 (2014).
6. Nicolas Ducheneaut, Nicholas Yee, Eric Nickell, and Robert J Moore. 2006. "Alone together?": Exploring the social dynamics of massively multiplayer online games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 407–416.
7. Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, and others. 2018. Deep Q-learning from demonstrations. In Association for the Advancement of Artificial Intelligence (AAAI) (2018).
8. Yeng-Ting Lee, Kuan-Ta Chen, Yun-Maw Cheng, and Chin-Laung Lei. 2011. World of Warcraft avatar history dataset. In Proceedings of the Second Annual ACM Conference on Multimedia Systems. ACM, 123–128.
9. T Mikolov and J Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013).
10. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, and others. 2015. Human-level control through deep reinforcement learning. Nature 518 (2015), 529–533.
11. Matej Moravčík, Martin Schmid, and others. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. arXiv preprint arXiv:1701.01724 (2017).
12. Bonnie Nardi and Justin Harris. 2006. Strangers and friends: Collaborative play in World of Warcraft. In Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work. ACM, 149–158.
13. Andrew Y Ng, Stuart J Russell, and others. 2000. Algorithms for inverse reinforcement learning. In ICML. 663–670.
14. Shibin Parameswaran and Kilian Q Weinberger. 2010. Large margin multi-task metric learning. In Advances in Neural Information Processing Systems. 1867–1875.
15. Bilal Piot, Matthieu Geist, and Olivier Pietquin. 2013. Learning from demonstrations: Is it worth estimating a reward function? In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 17–32.
16. Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. 2006. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 729–736.
17. Daniel Schultheiss. 2007. Long-term motivations to play MMOGs: A longitudinal study on motivations, experience and behavior. In DiGRA. 344–348.
18. Siqi Shen, Niels Brouwers, Alexandru Iosup, and Dick Epema. 2014. Characterization of human mobility in networked virtual environments. In Proceedings of the Network and Operating System Support on Digital Audio and Video Workshop. ACM, 13.
19. David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, and others. 2017. Mastering the game of Go without human knowledge. Nature 550 (2017), 354–359.
20. In Proceedings of the 10th Annual Workshop on Network and Systems Support for Games. IEEE Press, 6.
21. Nick Yee. 2006a. The demographics, motivations, and derived experiences of users of massively multi-user online graphical environments. Presence: Teleoperators and Virtual Environments 15, 3 (2006), 309–329.
22. Nick Yee. 2006b. Motivations for play in online games. CyberPsychology & Behavior 9, 6 (2006), 772–775.
23. Martin Zinkevich, Michael Johanson, Michael H Bowling, and Carmelo Piccione. 2007. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems.