Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior
Baihan Lin
Columbia University [email protected]
Djallel Bouneffouf
IBM Research [email protected]
Guillermo Cecchi
IBM Research [email protected]
Abstract
Studies of the Prisoner's Dilemma mainly treat the choice to cooperate or defect as an atomic action. We propose to study the behavior of online learning algorithms in the Iterated Prisoner's Dilemma (IPD) game, where we explore the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits and reinforcement learning. We evaluate them in a tournament of the iterated prisoner's dilemma where multiple agents compete in a sequential fashion. This allows us to analyze the dynamics of the policies learned by multiple self-interested, independent, reward-driven agents, and also to study the capacity of these algorithms to fit human behavior. Results suggest that attending only to the current situation when making a decision is the worst strategy in this kind of social dilemma game. Multiple discoveries on online learning behaviors and clinical validations are presented.

Introduction

Social dilemmas expose tensions between cooperation and defection. Understanding the best way of playing the iterated prisoner's dilemma (IPD) has been of interest to the scientific community since the formulation of the game seventy years ago [1]. To evaluate algorithms, a round-robin computer tournament was proposed, where algorithms compete against each other [2]; the winner is decided by the average score a strategy achieves. Using this framework, we propose here to focus on studying reward-driven online learning algorithms with different types of attention mechanisms, where we define attention "as the behavioral and cognitive process of selectively concentrating on a discrete stimulus while ignoring other perceivable stimuli" [3]. Following this definition, we analyze three algorithm classes: the no-attention-to-context online learning agents (the multi-armed bandit algorithms) output an action but do not use any information about the state of the environment (context); the contextual bandit algorithms extend this model by making the decision conditional on the current state of the environment; and reinforcement learning extends contextual bandits by making decisions conditional on both the current state of the environment and the next states of the unknown environment. This paper mainly focuses on answering the following questions:

• Does attending to the context help an online learning algorithm maximize its rewards in an IPD tournament, and how do different attention biases shape behavior?

• Does attending to the context help an online learning algorithm mimic human behavior?
To answer these questions, we performed two experiments. (1) In the first, we ran a tournament of the iterated prisoner's dilemma: since the seminal tournament in 1980 [1], a number of IPD tournaments have been undertaken [2, 4, 5, 6, 7, 8, 9, 10, 11]. In this work, we adopt a similar tournament setting, but also extend it to cases with more than two players. Empirically, we evaluated the algorithms in several settings of the Iterated Prisoner's Dilemma: a pairwise-agent tournament, a three-agent tournament, and a "mental"-agent tournament. (2) The second is a behavioral cloning prediction task, in which we train the three types of algorithms to mimic human behavior on a training set and then test them on a held-out test set. The data and code can be accessed at https://github.com/doerlbh/dilemmaRL.

Our main results are the following:

• We observed that contextual bandits do not perform well in the tournament, which means that attending only to the current situation is the worst strategy in this kind of social dilemma game: one should either ignore the current situation or consider more situations, not just the current one.

• We observed that bandit algorithms (without context) fit the human data best, which implies that humans may not consider the context when they play the iterated prisoner's dilemma.
Related Work

There is much computational work focused on understanding the strategy space and finding winning strategies in the iterated prisoner's dilemma. The authors in [12] present and discuss several improvements to the Q-Learning algorithm, allowing for an easy numerical measure of the exploitability of a given strategy. [13] proposes a mechanism for achieving cooperation and communication in multi-agent reinforcement learning settings by intrinsically rewarding agents for obeying the commands of other agents. We are interested in investigating how algorithms behave and how they model human decisions in the IPD, with the larger goal of understanding human decision-making. For instance, in [14] the authors propose an active modeling technique to predict the behavior of IPD players. The proposed method can model the opponent player's behavior while taking advantage of interactive game environments. The data showed that the observer was able to build, through direct actions, a more accurate model of an opponent's behavior than when the data were collected through random actions. [15] proposes the first predictive model of human cooperation able to organize a number of different experimental findings that are not explained by the standard model, and shows that the model makes satisfactorily accurate quantitative predictions of population-average behavior in one-shot social dilemmas. To the best of our knowledge, no study has explored the full spectrum of reinforcement learning agents (multi-armed bandits, contextual bandits and reinforcement learning) in social dilemmas.
Problem Setting

Here, we describe the two main experiments we have run: the Iterated Prisoner's Dilemma (IPD) and the Behavioral Cloning with Demonstration Rewards (BCDR).

Iterated Prisoner's Dilemma (IPD)
Table 1: IPD Payoff
        C      D
C      R,R    S,T
D      T,S    P,P
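For illustration, the payoff rule of Table 1, together with its multi-agent generalization formalized in the next paragraph, can be written as a small reward function. Below is a minimal sketch; the function name and the default payoff values are illustrative choices rather than part of our implementation.

```python
def ipd_payoffs(actions, R=3, T=5, P=1, S=0):
    """One round of the (multi-agent) IPD: `actions` maps each agent to "C" or "D"."""
    if all(a == "C" for a in actions.values()):
        return {agent: R for agent in actions}        # all cooperate -> Reward
    if all(a == "D" for a in actions.values()):
        return {agent: P for agent in actions}        # all defect -> Penalty
    return {agent: (S if a == "C" else T)             # mixed round: Sucker vs. Temptation
            for agent, a in actions.items()}

# Two-player example reproducing Table 1: a cooperator meets a defector.
print(ipd_payoffs({"agent_1": "C", "agent_2": "D"}))  # {'agent_1': 0, 'agent_2': 5}
```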
The Iterated Prisoner's Dilemma (IPD) can be defined as a matrix game G = [N, {A_i}_{i∈N}, {R_i}_{i∈N}], where N is the set of agents, A_i is the set of actions available to agent i, with A = A_1 × · · · × A_n being the joint action space, and R_i is the reward function for agent i. A special case of this generic multi-agent IPD is the classical two-agent case (Table 1). In this game, each agent has two actions: cooperate (C) and defect (D), and can receive one of four possible rewards: R (Reward), P (Penalty), S (Sucker), and T (Temptation). In the multi-agent setting, if all agents cooperate (C), they all receive Reward (R); if all agents defect (D), they all receive Penalty (P); if some agents cooperate (C) and some defect (D), the cooperators receive Sucker (S) and the defectors receive Temptation (T). The four payoffs satisfy the inequalities T > R > P > S and 2R > T + S. The PD is a one-round game, but it is commonly studied in an iterated manner where prior outcomes matter, in order to understand the evolution of cooperative behaviour from complex dynamics [16].

Behavioral Cloning with Demonstration Rewards (BCDR)

Here we define a new type of multi-agent online learning setting, Behavioral Cloning with Demonstration Rewards (BCDR), and present a novel training procedure and agent for solving this problem. In this setting, similar to [17, 18, 19], the agent first goes through a constraint-learning phase where it is allowed to query the actions and receives feedback r^e_k(t) ∈ [0, 1] about whether or not the chosen decision matches the teacher's action (from demonstration). During the deployment (testing) phase, the goal of the agent is to maximize both r_k(t) ∈ [0, 1], the reward of action k at time t, and the (unobserved) r^e_k(t) ∈ [0, 1], which models whether or not taking action k matches the action the teacher would have taken. During the deployment phase, the agent receives no feedback on the value of r^e_k(t); we would like to observe how well the behavior captures the teacher's policy profile. In our specific problem, the human data plays the role of the teacher, and behavioral cloning aims to train our agents to mimic the human behaviors. We briefly outline the different types of online learning algorithms we have used:
Multi-Armed Bandit (MAB):
The multi-armed bandit algorithm models a sequential decision-making process, where at each time point the algorithm selects an action from a given finite set of possible actions, attempting to maximize the cumulative reward over time [20, 21, 22]. In the multi-armed bandit agent pool, we have Thompson Sampling (TS) [23], Upper Confidence Bound (UCB) [21], epsilon Greedy (eGreedy) [24], EXP3 [25] and Human-Based Thompson Sampling (HBTS) [26].
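As a concrete illustration of this class, here is a minimal UCB1-style sketch (variable names are ours) that treats cooperate and defect as two context-free arms:

```python
import math

class UCB1:
    """Context-free bandit over the two IPD actions, following UCB1 [21]."""

    def __init__(self, actions=("C", "D")):
        self.actions = actions
        self.counts = {a: 0 for a in actions}
        self.values = {a: 0.0 for a in actions}   # running mean reward per arm
        self.t = 0

    def act(self):
        self.t += 1
        for a in self.actions:                     # play each arm once first
            if self.counts[a] == 0:
                return a
        return max(self.actions,
                   key=lambda a: self.values[a] +
                   math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, action, reward):
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n
```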
Contextual Bandit (CB).
Following [27], this problem is defined as follows: at each time point (iteration), an agent is presented with a context (feature vector) before choosing an arm. In the contextual bandit agent pool, we have Contextual Thompson Sampling (CTS) [28], LinUCB [29], EXP4 [30] and Split Contextual Thompson Sampling (SCTS) [31].
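A minimal sketch of this decision loop, in the spirit of LinUCB [29] with one linear model per arm (variable names are ours; the context here would be the encoded history described later in the text):

```python
import numpy as np

class LinUCBArm:
    """One arm's ridge-regression model, in the spirit of LinUCB [29]."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(dim)          # X^T X + I
        self.b = np.zeros(dim)        # X^T y

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose_action(arms, context):
    # Pick the arm (cooperate or defect) with the highest upper confidence
    # bound given the current context vector.
    return max(arms, key=lambda a: arms[a].ucb(context))

# Usage: arms = {"C": LinUCBArm(dim=10), "D": LinUCBArm(dim=10)}
```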
Reinforcement Learning (RL).

Reinforcement learning defines a class of algorithms for solving problems modeled as Markov decision processes (MDP) [32]. An MDP is defined by a tuple consisting of a set of possible states, a set of actions, a transition function and a reward function. In the reinforcement learning agent pool, we have Q-Learning (QL), Double Q-Learning (DQL) [33], State-action-reward-state-action (SARSA) [34] and Split Q-Learning (SQL) [35, 36]. We also selected the three most popular handcrafted policies for the Iterated Prisoner's Dilemma: "Coop" stands for always cooperating, "Dfct" stands for always defecting, and "Tit4Tat" stands for repeating what the opponent chose the last time (which was the winning approach in the 1980 IPD tournament [1]).

Table 2: Parameter settings (λ+, w+, λ−, w−) for the different reward biases in the neuropsychiatry-inspired split models. The disorder-specific settings for "Addiction" (ADD), "ADHD", "Alzheimer's" (AD), "Chronic pain" (CP), "Dementia" (bvFTD), "Parkinson's" (PD) and "moderate" (M) follow the parametrization in [31]; the baseline variants are: Standard split models (λ+ = 1, w+ = 1, λ− = 1, w− = 1), Positive split models (1, 1, 0, 0), and Negative split models (0, 0, 1, 1).
Agents with Mental Disorder Properties.
To simulate behavior trajectories, we used three sets of "split" algorithms designed to model human reward biases in different neurological and psychiatric conditions. We now outline the split models evaluated in our three settings: the multi-armed bandit case with Human-Based Thompson Sampling (HBTS) [26], the contextual bandit case with Split Contextual Thompson Sampling (SCTS) [31], and the reinforcement learning case with Split Q-Learning [35, 36]. All three split models are standardized in their parametric notation (see Table 2 for the complete parametrization and [31] for a review of these clinically-inspired reward-processing biases). For each agent, we set four parameters: λ+ and λ− are the weights on the previously accumulated positive and negative rewards, respectively, and w+ and w− are the weights on the positive and negative rewards at the current iteration.
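To make the roles of the four parameters concrete, here is a minimal two-stream Q-learning sketch. This is one plausible reading of the split update rather than the exact implementation, which follows [35, 36]; the variable names are ours.

```python
import random
from collections import defaultdict

class SplitQAgent:
    """Two-stream Q-learning sketch with split reward weights.

    lambda_pos / lambda_neg scale the previously accumulated positive /
    negative value streams; w_pos / w_neg scale the positive / negative
    part of the reward received at the current step (see Table 2).
    """

    def __init__(self, actions=("C", "D"), alpha=0.1, gamma=0.95,
                 lambda_pos=1.0, w_pos=1.0, lambda_neg=1.0, w_neg=1.0):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.lp, self.wp, self.ln, self.wn = lambda_pos, w_pos, lambda_neg, w_neg
        self.q_pos = defaultdict(float)  # stream for positive rewards
        self.q_neg = defaultdict(float)  # stream for negative rewards

    def value(self, state, action):
        # The agent acts on the combined value of the two streams.
        return self.q_pos[(state, action)] + self.q_neg[(state, action)]

    def act(self, state, epsilon=0.05):
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value(state, a))

    def update(self, state, action, reward, next_state):
        r_pos, r_neg = max(reward, 0.0), min(reward, 0.0)
        # Bootstrap on the combined value of the next state (one possible choice).
        best_next = max(self.value(next_state, a) for a in self.actions)
        # Each stream discounts its own history (lambda) and weights the
        # current reward component (w) before a TD-style update.
        self.q_pos[(state, action)] = self.lp * self.q_pos[(state, action)] + \
            self.alpha * (self.wp * r_pos + self.gamma * best_next - self.q_pos[(state, action)])
        self.q_neg[(state, action)] = self.ln * self.q_neg[(state, action)] + \
            self.alpha * (self.wn * r_neg + self.gamma * best_next - self.q_neg[(state, action)])
```

Setting λ+ = w+ = λ− = w− = 1 recovers the standard split model in Table 2, while the mental-variant settings bias how strongly each stream remembers and weights its rewards.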
Game settings.

The payoffs are set as in the classical IPD game: T = 5, R = 3, P = 1, S = 0. Following [37], we created standardized payoff measures from the R, S, T, P values using two differences between payoffs associated with important game outcomes, both normalized by the difference between the temptation to defect and being a sucker when cooperating while the other defects.

Figure 1: Success, teamwork, cooperation and competition in the two-agent tournament.
Figure 2: Cumulative reward and cooperation rate averaged by class in the two- and three-player settings.

State representations.
In most of the IPD literature, the state is defined as the pair of previous actions of self and opponent. Studies suggest that only a single previous state is needed to define any prisoner's dilemma strategy [38]. However, as we are interested in understanding the role of three levels of information (no information, context but no state, and both context and state), we expand the state representation to account for the past n pairs of actions as the history (or memory) of the agents. For contextual bandit algorithms, this history is their context. For reinforcement learning algorithms, this history is their state representation. In the following sections, we present results in which the memory is set to the past 5 action pairs (denoted Mem = 5); more results for Mem = 1 are in Appendix A.
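As an illustration, the history-based state and context described above can be encoded as the flattened last Mem action pairs; a minimal sketch (the cooperative padding before the first move is our own assumption):

```python
from collections import deque

ACTION_CODE = {"C": 0, "D": 1}

def make_history(mem=5):
    # Fixed-length buffer of the past `mem` (self, opponent) action pairs,
    # padded with cooperation before any moves have been played.
    return deque([("C", "C")] * mem, maxlen=mem)

def encode_state(history):
    # Flatten the buffer into a tuple usable as an RL state, or into a
    # binary feature vector usable as a contextual-bandit context.
    state = tuple(history)
    context = [ACTION_CODE[a] for pair in history for a in pair]
    return state, context

history = make_history(mem=5)
history.append(("D", "C"))          # self defected, opponent cooperated
state, context = encode_state(history)
```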
Learning settings.
In all experiments, the discount factor γ was set to 0.95. Exploration is included via an ε-greedy scheme with ε set to 0.05 (except for the algorithms that already have an exploration mechanism). The learning rate was polynomial, α_t(s, a) = 1/n_t(s, a)^0.8, which was shown in previous work to be better in theory and in practice [39]. All experiments were performed and averaged over at least 100 runs, and over 50 or 60 steps of dueling actions from the initial state.
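These settings can be summarized in a small configuration sketch (names are ours; the learning-rate exponent follows the polynomial schedule cited above):

```python
LEARNING_CONFIG = {
    "discount_gamma": 0.95,   # discount factor for the RL agents
    "epsilon": 0.05,          # epsilon-greedy exploration rate
    "n_runs": 100,            # independent runs averaged per game
}

def learning_rate(visit_count):
    # Polynomial learning rate alpha_t(s, a) = 1 / n_t(s, a)^0.8, following [39].
    return 1.0 / (visit_count ** 0.8)
```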
Reported measures.

To capture the behavior of the algorithms, we report five measures: individual normalized rewards, collective normalized rewards, the difference of normalized rewards, the cooperation rate, and the normalized reward feedback at each round. We are interested in the individual rewards because that is what online learning agents effectively maximize: their expected cumulative discounted reward. We are interested in the collective rewards because they might offer important insights on the teamwork of the participating agents. We are interested in the difference between each individual player's reward and the average reward of all participating players because it might capture the internal competition within a team. We record the cooperation rate as the percentage of cooperation over all rounds, since it is not only a probe for the emergence of strategies, but also the standard measure in behavioral modeling to compare human data and models [40]. Lastly, we provide the reward feedback at each round as a diagnostic tool to understand the specific strategy that emerged from each game. (The color codes throughout this paper are kept constant for each of the 14 agents, such that all handcrafted agents have green-ish colors, multi-armed bandit agents red-ish, contextual bandit agents blue-ish, and reinforcement learning agents purple-ish.)
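For concreteness, a minimal sketch (our own helper names, omitting the normalization details) of how the cooperation rate and the reward-based teamwork/competition measures can be computed from recorded trajectories:

```python
def cooperation_rate(actions):
    # Fraction of rounds in which the agent cooperated ("C").
    return sum(a == "C" for a in actions) / len(actions)

def reward_measures(rewards_per_agent):
    # rewards_per_agent: {agent_name: [reward at each round]}
    totals = {k: sum(v) for k, v in rewards_per_agent.items()}
    collective = sum(totals.values())                         # teamwork proxy
    mean_total = collective / len(totals)
    relative = {k: v - mean_total for k, v in totals.items()} # competition proxy
    return totals, collective, relative
```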
Results for the two-agent tournament.

In the two-agent tournament, we recorded the behaviors of the 14 agents playing against each other (and against themselves). Figure 1 summarizes the reward and behavior patterns of the tournament.

Figure 3: Reward feedbacks and cooperation rates in some two-player and three-player settings.
Figure 4 (panels: Bandits, Contextual bandits, Reinforcement learning): Mental variants in the three agent pools: reward, cooperation, teamwork, competition (Mem = 5).

We first noticed that the multi-armed bandit and reinforcement learning algorithms learned to cooperate when their opponent is Coop, yielding high mutual rewards, while the contextual bandit algorithms mostly decided to defect on Coop to exploit its trust. From the cooperation heatmap, we also observed that reinforcement learning algorithms appeared to be more defective when facing a multi-armed bandit or contextual bandit algorithm than when facing another reinforcement learning algorithm. The multi-armed bandit algorithms are more defective when facing a contextual bandit algorithm than when facing a reinforcement learning algorithm or another multi-armed bandit algorithm. The adversarial algorithms EXP3 and EXP4 failed to learn any distinctive policy in the IPD environment. We also discovered interesting teamwork and competition behaviors in the heatmaps of collective rewards and relative rewards. In general, the contextual bandit algorithms are the best team players, yielding the overall highest collective rewards, followed by the reinforcement learning algorithms. The reinforcement learning algorithms are the most competitive opponents, yielding the overall highest relative rewards, followed by the multi-armed bandit algorithms.

Figure 2 summarizes the averaged reward and cooperation for each of the three classes, where we observed the handcrafted algorithms performing best, followed by the reinforcement learning algorithms and then the multi-armed bandit algorithms. The contextual bandit algorithms received the lowest final rewards among the four classes of agents. Surprisingly, the cooperation-rate figure suggests that a lower cooperation rate did not imply a higher reward. The most cooperative learning-algorithm class is the contextual bandits, followed by the reinforcement learning algorithms. The most defective algorithms, the multi-armed bandits, did not yield the highest score.

More detailed probing into the specific games (Figure 3) revealed more diverse strategies than those suggested by the cooperation rates. For instance, in the game of QL vs. CTS, we observe that CTS converged to a fixed cooperation rate within the first few rounds and stayed constant since then, while QL gradually decayed its cooperation rate. In the game of UCB1 vs. DQL, UCB1 seemed to oscillate between a high and a low cooperation rate within the first few rounds (because it was built to explore all actions first), while DQL gradually decayed its cooperation rate. In the game of DQL vs. Tit4Tat, we observed a seemingly mimicking effect of DQL towards a tit-for-tat-like behavior. In the game of SARSA vs. LinUCB, LinUCB converged to a fixed cooperation rate within the first few rounds and stayed constant since then, while SARSA gradually decayed its cooperation rate. There seemed to be a universality of the three classes of agents within the first few rounds.
Cognitive interpretations of these learning systems.
The main distinctions between the three classes of algorithms are the complexity of the learning mechanism and the cognitive system they adopt. In the multi-armed bandit setting, there is no attention to any context, and the agents aim to most efficiently allocate a fixed, limited set of cognitive resources between competing (alternative) choices in a way that maximizes their expected gain. In the contextual bandit setting, the agents apply an attention mechanism to the current context, and aim to collect enough information about how the context vectors and rewards relate to each other, so that they can predict the next best action to play by looking at the feature vectors. In the reinforcement learning setting, the agents not only pay attention to the current context, but also apply the attention mechanism to multiple contexts related to different states, and aim to use past experience to find out which actions lead to higher cumulative rewards. Our results suggest that in the Iterated Prisoner's Dilemma between two learning systems, an optimal learning policy should hold memory for different state representations and allocate attention to different contexts across the states, which explains the overall best performance by the reinforcement learning algorithms. This further suggests that in repeated games like the Iterated Prisoner's Dilemma, participating learning systems tend to undergo multiple states. The overall underperformance of the contextual bandits suggests that attention to only the current context is not sufficient without a state representation, because the learning system might mix the context-dependent reward mappings of multiple states, which can oversimplify the policy and potentially mislead the learning as an interfering effect. On the other hand, the multi-armed bandits ignore the context information entirely, so they are not susceptible to the interfering effect from the representations of different contexts. Their learned policies, however, did not exhibit any interesting flexibility to account for major changes in the state (for instance, the opponent might just have finished a major learning episode and decided on a different strategy).
Results for the three-agent tournament.
In the three-agent tournament, we wish to understand how all three classes of algorithms interact in the same arena. For each game, we picked one algorithm from each class (one multi-armed bandit, one contextual bandit and one reinforcement learning agent) to form the player pool. We observed in Figure 2 a very similar pattern to the two-player case: reinforcement learning agents demonstrated the best performance (highest final rewards), followed by the multi-armed bandits, and the contextual bandits performed the worst. In the three-agent setting, however, although the contextual bandits are still the most cooperative, reinforcement learning became the most defective. More detailed probing into the specific games (Figure 3) revealed more diverse strategies than those suggested by the cooperation rates. Taking the game eGreedy vs. LinUCB vs. DQL and the game UCB1 vs. LinUCB vs. QL as examples, the multi-armed bandit algorithms started off as the most defective but began to cooperate more in later rounds, while the reinforcement learning algorithms became more and more defective. The contextual bandit algorithms in both cases stayed cooperative at a relatively high rate.
Results for the mental-agent tournament.

In this experiment, we wish to understand which online learning algorithm does the best job of simulating different variants of mental profiles. We adopted the same parameter settings as the unified models of human behavioral agents in [31], which consist of multi-armed bandit, contextual bandit and reinforcement learning agents simulating neuropsychiatric conditions such as Alzheimer's disease (AD), addiction (ADD), attention-deficit/hyperactivity disorder (ADHD), Parkinson's disease (PD), chronic pain (CP) and dementia (bvFTD). To better understand these three unified models, we performed the tournament within each of the three agent pools.

Figure 5: Trajectories of cooperation rate for the mental variants in the three agent pools.

As shown in Figure 5, the trajectories of these mental variants demonstrated different levels of diversity in the three classes of algorithms. The split multi-armed bandit models all appeared to follow decaying cooperation rates, but with different decay rates. The split contextual bandit models demonstrated a more diverse range of cooperation rates while maintaining a constant rate for many reward biases. The split reinforcement learning models exhibited two types of behaviors: within the first few rounds, an agent either gradually decayed its cooperation rate, or first became more cooperative for a few rounds before decaying.

Figure 4 offers a more comprehensive spectrum of these behaviors. The first thing we noticed is that the three classes of models do not capture the same profile of the mental variants. For instance, "addiction" (ADD) appeared to be the most cooperative in the split multi-armed bandit framework, but was relatively defective in the split reinforcement learning framework. "Parkinson's disease" (PD) appeared to have the lowest collective rewards (a bad team player) in the split multi-armed bandit framework, but it contributed to the collective rewards very positively in the split contextual bandit and split reinforcement learning frameworks. This suggests that there is more subtlety involved in these multi-agent interactions, such that the "unified" split models do not fully capture the universality within each mental condition. Comparing the three agent classes across all four patterns (individual rewards, cooperation rates, collective rewards and relative rewards), we do observe a more diverse pattern in the reinforcement learning pool than in the other two classes of algorithms.

Our simulations match the behavioral observations in several clinical studies. [41] studied the cooperative behaviors of subjects playing the Iterated Prisoner's Dilemma after receiving different dosages of alprazolam, and discovered that in addiction-related test blocks, cooperative choices were significantly decreased as a function of dose, consistent with our reinforcement learning group results and previous reports showing that high acute doses of GABA-modulating drugs are associated with violence and other antisocial behavior [42, 43]. [44] studied children playing the Iterated Prisoner's Dilemma in a neuroimaging setting and reported that, compared with children with ADHD, the control participants exhibited greater brain error-monitoring signals for non-social options (i.e., betrayal) that did not imply more losses for them and instead generated more gains, while the ADHD children exhibited no differential modulation between these options.
The absence of neural modulation during the IPD task in the ADHD children suggested a general reward deficit in value-related brain mechanisms, matching our observation that, in the split contextual bandit and reinforcement learning groups, "ADHD" exhibited an overall high cooperation rate and comparatively low relative rewards. Among all the mental variants, the "behavioral variant of fronto-temporal dementia" (bvFTD) appeared to be the most defective in all three split frameworks and obtained the lowest collective reward, matching its clinical symptoms, which include inappropriate behavior, lack of empathy for others, lack of insight, and impaired decision-making in affective scenarios [45, 46].
Results for behavioral cloning of human data.

We collated human data comprising 168,386 individual decisions from many human-subject experiments [2, 4, 5, 6, 7, 8, 9, 10] that used real financial incentives and transparently conveyed the rules of the game to the subjects. As is standard procedure in experimental economics, subjects anonymously interact with each other, and their decisions to cooperate or defect at each time period of each interaction are recorded. They receive payoffs proportional to the outcomes, with the same or similar payoff structure as the one we used in Table 1. Following preprocessing steps similar to [40], we constructed a comprehensive collection of game structures and individual decisions from the descriptions of the experiments in the published papers and the publicly available data sets. This comprehensive dataset consists of behavioral trajectories of different time horizons, ranging from 2 to 30 rounds, but most of these experimental data only contain full historical information for at most the past 9 actions. We further selected only those trajectories with full historical information, which comprised 8,257 behavioral trajectories. We randomly selected 8,000 of them as the training set and the remaining 257 trajectories as the test set.

In the training phase, all agents were trained sequentially with the demonstration rewards as feedback on the trajectories in the training set. In the testing phase, we paused all learning and tested on the 257 trajectories independently, recording their cooperation rates. In each test trajectory, we compared the agents' evolution of cooperation rate to that of the human data and computed a prediction error.

Figure 6: Behavioral cloning: bandits modeled the human data best, with the lowest prediction error.

Figure 6 summarizes the testing results of all the agents in predicting the actions and cooperation rates from the human data. From the heatmap of the cooperation rates, we observe that the behavioral policy that each agent cloned from the data varies by class. The reinforcement learning algorithms all seemed to learn to defect at all costs ("tragedy of the commons"). The contextual bandit algorithms mostly converged to a policy with a fixed cooperation rate. Compared with the other two, the multi-armed bandit algorithms learned more diverse cooperation rates across test cases. The line plot on the right confirms this picture: the cooperation rate of the real humans (the black curve) tends to decline slowly from around 70% to around 40%. UCB1 and epsilon Greedy both captured this decaying property, mimicking the strategy of the human actions. The prediction-error analysis matches this intuition: UCB1 and epsilon Greedy (and the multi-armed bandit algorithms in general) appeared to best capture the human cooperation behaviors.

As a side note, we would also like to point out the importance of the sanity check from the line plot (the cooperation rate vs. round). In the prediction-error figures, EXP3 and EXP4 seemed to have an overall low error, but this can be misleading: from the cooperation-rate figures, we note that EXP3 and EXP4 did not seem to learn any policy at all (choosing randomly at 50% over the entire horizon), while the other agents all appeared to have adopted non-random strategies.
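For reference, here is a minimal sketch of the two-phase BCDR procedure described above, assuming a generic agent interface with hypothetical `act` and `learn` methods, and using the mean absolute cooperation-rate difference as one plausible choice of prediction error:

```python
def bcdr_train(agent, train_trajectories):
    # Constraint-learning phase: the agent queries actions and is rewarded
    # for matching the teacher's (human) action at each step.
    for trajectory in train_trajectories:          # trajectory: [(state, human_action), ...]
        for state, human_action in trajectory:
            chosen = agent.act(state)
            match_reward = 1.0 if chosen == human_action else 0.0
            agent.learn(state, chosen, match_reward)

def bcdr_test(agent, test_trajectories):
    # Deployment phase: learning is frozen; we only record cooperation rates
    # and compare them with the human ones.
    errors = []
    for trajectory in test_trajectories:
        agent_coop = [agent.act(state) == "C" for state, _ in trajectory]
        human_coop = [human_action == "C" for _, human_action in trajectory]
        agent_rate = sum(agent_coop) / len(agent_coop)
        human_rate = sum(human_coop) / len(human_coop)
        errors.append(abs(agent_rate - human_rate))
    return sum(errors) / len(errors)
```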
Conclusion

We have explored the full spectrum of online learning agents: multi-armed bandits, contextual bandits and reinforcement learning. We evaluated them in a tournament of the iterated prisoner's dilemma, which allowed us to analyze the dynamics of the policies learned by multiple self-interested, independent, reward-driven agents. We observed that the contextual bandits do not perform well in the tournament, which means that attending only to the current situation when making a decision is the worst strategy in this kind of game: one should either ignore the current situation or consider more situations, not just the current one. We have also studied the capacity of these algorithms to fit human behavior, and observed that bandit algorithms (without context) fit the human data best, which raises the hypothesis that humans do not consider the context when playing the IPD. Next steps include extending our evaluations to other sequential social dilemma environments with more complicated and mixed incentive structures, such as the fruit Gathering game and the Wolfpack hunting game [47, 48].
Broader Impact
The broader motivation of this work is to increase the two-way traffic between artificial intelligence and neuropsychiatry, in the hope that a deeper understanding of brain mechanisms, revealed by how they function ("neuro") and dysfunction ("psychiatry"), can provide for better AI models, and conversely that AI can help to conceptualize the otherwise bewildering complexity of the brain.

Evidence has linked dopamine function to reinforcement learning via midbrain neurons and connections to the basal ganglia, limbic regions, and cortex. Neuron firing rates computationally represent reward magnitude, expectancy, and violations (prediction error) and other value-based signals [49], allowing an animal to update and maintain value expectations associated with particular states and actions. When functioning properly, this helps an animal develop a policy to maximize outcomes by approaching or choosing cues with higher expected value and avoiding cues associated with loss or punishment. This is similar to reinforcement learning widely used in computing and robotics [32], suggesting mechanistic overlap in humans and AI. Evidence of Q-learning and actor-critic models has been observed in spiking activity in midbrain dopamine neurons in primates [50] and in the human striatum using the blood-oxygen-level-dependent (BOLD) imaging signal [51].

The behavioral cloning results suggest that bandit algorithms (without context) are the best in terms of fitting the human data, which raises the hypothesis that humans do not consider the context when they play the iterated prisoner's dilemma. This discovery motivates new modeling efforts for human studies in the bandit framework, and points to future experimental designs which incorporate these new parametric settings and control conditions. In particular, we propose that our approach may be relevant to studying reward processing in different mental disorders, for which some mechanistic insights are available. A body of recent literature has demonstrated that a spectrum of neurological and psychiatric disease symptoms are related to biases in learning from positive and negative feedback [52]. Studies in humans have shown that when reward signaling in the direct pathway is over-expressed, this may enhance state value and incur pathological reward-seeking behavior, like gambling or substance use. Conversely, enhanced aversive error signals result in dampened reward experience, thereby causing symptoms like apathy, social withdrawal, fatigue, and depression. Both genetic predispositions and experiences during critical periods of development can predispose an individual to learn from positive or negative outcomes, making them more or less at risk for brain-based illnesses [53]. This highlights the need to understand how intelligent systems learn from rewards and punishments, and how experience sampling may impact reinforcement learning during influential training periods. The simulation results of the mental variants match many of the clinical implications presented here, but also point to other complications from the social setting that deserve future investigation.

The literature on reward-processing abnormalities in particular neurological and psychiatric disorders is quite extensive; below we summarize some of the recent developments in this fast-growing field. It is well known that the neuromodulator dopamine plays a key role in reinforcement learning processes.
Parkinson's disease (PD) patients, who have depleted dopamine in the basal ganglia, tend to have impaired performance on tasks that require learning from trial and error. For example, [54] demonstrate that off-medication PD patients are better at learning to avoid choices that lead to negative outcomes than they are at learning from positive outcomes, while dopamine medication typically used to treat PD symptoms reverses this bias. Alzheimer's disease (AD) is the most common cause of dementia in the elderly and, besides memory impairment, it is associated with a variable degree of executive function impairment and visuospatial impairment. As discussed in [55], AD patients have decreased pursuit of rewarding behaviors, including loss of appetite; these changes are often secondary to apathy, associated with diminished reward system activity. Moreover, poor performance on certain tasks is associated with memory impairments. Frontotemporal dementia (bvFTD) usually involves a progressive change in personality and behavior including disinhibition, apathy, eating changes, repetitive or compulsive behaviors, and loss of empathy [55], and it is hypothesized that these changes are associated with abnormalities in reward processing. For instance, alterations in eating habits, with a preference for sweet, carbohydrate-rich foods and overeating in bvFTD patients, can be associated with abnormally increased reward representation for food, or with impairment in the negative (punishment) signal associated with fullness. The authors in [56] suggest that the strength of the association between a stimulus and the corresponding response is more susceptible to degradation in attention-deficit/hyperactivity disorder (ADHD) patients, which suggests problems with storing stimulus-response associations. Among other functions, storing these associations requires working memory capacity, which is often impaired in ADHD patients. In [57], it is demonstrated that patients suffering from addictive behavior have heightened stimulus-response associations, resulting in enhanced reward-seeking behavior for the stimulus which generated such an association. In [58], it is suggested that chronic pain can elicit a hypodopaminergic (low dopamine) state that impairs motivated behavior, resulting in a reduced drive in chronic pain patients to pursue rewards. Reduced reward response might underlie a key system mediating the anhedonia and depression that are common in chronic pain. A variety of computational models have been proposed for studying the disorders of reward processing in specific conditions, including, among others, [54, 59, 60, 61, 57, 62].

The approach proposed in the present manuscript, we hope, will contribute to expanding and deepening the dialogue between AI and neuropsychiatry.

References
[1] Robert Axelrod. Effective choice in the prisoner's dilemma. Journal of Conflict Resolution, 24(1):3–25, 1980.
[2] James Andreoni and John H Miller. Rational cooperation in the finitely repeated prisoner's dilemma: Experimental evidence. The Economic Journal, 103(418):570–585, 1993.
[3] Addie Johnson and Robert W Proctor. Attention: Theory and Practice. Sage, 2004.
[4] Pedro Dal Bó. Cooperation under the shadow of the future: experimental evidence from infinitely repeated games. American Economic Review, 95(5):1591–1604, 2005.
[5] Yoella Bereby-Meyer and Alvin E Roth. The speed of learning in noisy games: Partial reinforcement and the sustainability of cooperation. American Economic Review, 96(4):1029–1042, 2006.
[6] John Duffy, Jack Ochs, et al. Cooperative behavior and the frequency of social interaction. Games and Economic Behavior, 66(2):785–812, 2009.
[7] Howard Kunreuther, Gabriel Silvasi, Eric Bradlow, Dylan Small, et al. Bayesian analysis of deterministic and stochastic prisoner's dilemma games. Judgment and Decision Making, 4(5):363, 2009.
[8] Pedro Dal Bó and Guillaume R Fréchette. The evolution of cooperation in infinitely repeated games: Experimental evidence. American Economic Review, 101(1):411–29, 2011.
[9] Daniel Friedman and Ryan Oprea. A continuous dilemma. American Economic Review, 102(1):337–63, 2012.
[10] Drew Fudenberg, David G Rand, and Anna Dreber. Slow to anger and fast to forgive: Cooperation in an uncertain world. American Economic Review, 102(2):720–49, 2012.
[11] Marc Harper, Vincent Knight, Martin Jones, Georgios Koutsovoulos, Nikoleta E Glynatsi, and Owen Campbell. Reinforcement learning produces dominant strategies for the iterated prisoner's dilemma. PLoS ONE, 12(12), 2017.
[12] Martin Kies. Finding best answers for the iterated prisoner's dilemma using improved Q-learning. Available at SSRN 3556714, 2020.
[13] Gaurav Gupta. Obedience-based multi-agent cooperation for sequential social dilemmas, 2020.
[14] Hyunsoo Park and Kyung-Joong Kim. Active player modeling in the iterated prisoner's dilemma. Computational Intelligence and Neuroscience, 2016, 2016.
[15] Valerio Capraro. A model of human cooperation in social dilemmas. PLoS ONE, 8(8), 2013.
[16] Robert Axelrod and William Donald Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.
[17] Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, and Francesca Rossi. Using multi-armed bandits to learn ethical priorities for online AI systems. IBM Journal of Research and Development, 63(4/5):1–1, 2019.
[18] Avinash Balakrishnan, Djallel Bouneffouf, Nicholas Mattei, and Francesca Rossi. Incorporating behavioral constraints in online AI systems. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 3–11. AAAI Press, 2019.
[19] Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush R. Varshney, Murray Campbell, Moninder Singh, and Francesca Rossi. Teaching AI agents ethical values using reinforcement learning and policy orchestration. In Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 6377–6381. ijcai.org, 2019.
[20] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[21] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[22] Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. CoRR, abs/1904.10040, 2019.
[23] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[24] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[25] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[26] Djallel Bouneffouf, Irina Rish, and Guillermo A Cecchi. Bandit models of human behavior: Reward processing in mental disorders. In International Conference on Artificial General Intelligence, pages 237–248. Springer, 2017.
[27] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
[28] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pages 127–135, 2013.
[29] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Irwin King, Wolfgang Nejdl, and Hang Li, editors, WSDM, pages 297–306. ACM, 2011.
[30] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.
[31] Baihan Lin, Djallel Bouneffouf, and Guillermo Cecchi. Unified models of human behavioral agents in bandits, contextual bandits, and RL. arXiv preprint arXiv:2005.04544, 2020.
[32] Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[33] Hado V Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
[34] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning Using Connectionist Systems, volume 37. University of Cambridge, Department of Engineering, Cambridge, England, 1994.
[35] Baihan Lin, Djallel Bouneffouf, and Guillermo Cecchi. Split Q learning: reinforcement learning with two-stream rewards. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 6448–6449. AAAI Press, 2019.
[36] Baihan Lin, Djallel Bouneffouf, Jenna Reinen, Irina Rish, and Guillermo Cecchi. A story of two streams: Reinforcement learning models from human behavior and neuropsychiatry. In Proceedings of the Nineteenth International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS-20, pages 744–752. International Foundation for Autonomous Agents and Multiagent Systems, 2020.
[37] Anatol Rapoport, Albert M Chammah, and Carol J Orwant. Prisoner's Dilemma: A Study in Conflict and Cooperation, volume 165. University of Michigan Press, 1965.
[38] William H Press and Freeman J Dyson. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409–10413, 2012.
[39] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.
[40] John J Nay and Yevgeniy Vorobeychik. Predicting human cooperation. PLoS ONE, 11(5), 2016.
[41] Scott D Lane and Joshua L Gowin. GABAergic modulation of human social interaction in a prisoner's dilemma model via acute administration of alprazolam. Behavioural Pharmacology, 20(7):657, 2009.
[42] Richard M Wood, James K Rilling, Alan G Sanfey, Zubin Bhagwagar, and Robert D Rogers. Effects of tryptophan depletion on the performance of an iterated prisoner's dilemma game in healthy adults. Neuropsychopharmacology, 31(5):1075–1084, 2006.
[43] Alyson J Bond. Drug-induced behavioural disinhibition. CNS Drugs, 9(1):41–57, 1998.
[44] Maria Luz Gonzalez-Gadea, Mariano Sigman, Alexia Rattazzi, Claudio Lavin, Alvaro Rivera-Rei, Julian Marino, Facundo Manes, and Agustin Ibanez. Neural markers of social and monetary rewards in children with attention-deficit/hyperactivity disorder and autism spectrum disorder. Scientific Reports, 6(1):1–11, 2016.
[45] Teresa Torralva, Christopher M Kipps, John R Hodges, Luke Clark, Tristán Bekinschtein, María Roca, María Lujan Calcagno, and Facundo Manes. The relationship between affective decision-making and theory of mind in the frontal variant of fronto-temporal dementia. Neuropsychologia, 45(2):342–349, 2007.
[46] Sinclair Lough and John R Hodges. Measuring and modifying abnormal social cognition in frontal variant frontotemporal dementia. Journal of Psychosomatic Research, 53(2):639–646, 2002.
[47] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037, 2017.
[48] Weixun Wang, Jianye Hao, Yixi Wang, and Matthew Taylor. Towards cooperation in sequential prisoner's dilemmas: a deep multiagent reinforcement learning approach. arXiv preprint arXiv:1803.00162, 2018.
[49] W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
[50] Hannah M. Bayer and Paul W. Glimcher. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47(1):129–141, 2005.
[51] John O'Doherty, Peter Dayan, Johannes Schultz, Ralf Deichmann, Karl Friston, and Raymond J. Dolan. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304:452–454, 2004.
[52] Tiago V Maia and Michael J Frank. From reinforcement learning models to psychiatric and neurological disorders. Nature Neuroscience, 14(2):154–162, 2011.
[53] Avram J. Holmes and Lauren M. Patrick. The myth of optimality in clinical neuroscience. Trends in Cognitive Sciences, 22(3):241–257, 2018.
[54] Michael J Frank, Lauren C Seeberger, and Randall C O'Reilly. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703):1940–1943, 2004.
[55] David C Perry and Joel H Kramer. Reward processing in neurodegenerative disease. Neurocase, 21(1):120–133, 2015.
[56] Marjolein Luman, Catharina S Van Meel, Jaap Oosterlaan, Joseph A Sergeant, and Hilde M Geurts. Does reward frequency or magnitude drive reinforcement-learning in attention-deficit/hyperactivity disorder? Psychiatry Research, 168(3):222–229, 2009.
[57] A David Redish, Steve Jensen, Adam Johnson, and Zeb Kurth-Nelson. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychological Review, 114(3):784, 2007.
[58] Anna MW Taylor, Susanne Becker, Petra Schweinhardt, and Catherine Cahill. Mesolimbic dopamine signaling in acute and chronic pain: implications for motivation, analgesia, and addiction. Pain, 157(6):1194, 2016.
[59] William W Seeley, Juan Zhou, and Eun-Joo Kim. Frontotemporal dementia: what can the behavioral variant teach us about human brain organization? The Neuroscientist, 18(4):373–385, 2012.
[60] Tobias U Hauser, Vincenzo G Fiore, Michael Moutoussis, and Raymond J Dolan. Computational psychiatry of ADHD: neural gain impairments across Marrian levels of analysis. Trends in Neurosciences, 39(2):63–73, 2016.
[61] Amir Dezfouli, Payam Piray, Mohammad Mahdi Keramati, Hamed Ekhtiari, Caro Lucas, and Azarakhsh Mokri. A neurocomputational model for cocaine addiction. Neural Computation, 21(10):2869–2893, 2009.
[62] Leonardo Emanuel Hess, Ariel Haimovici, Miguel Angel Muñoz, and Pedro Montoya. Beyond pain: modeling decision-making deficits in chronic pain. Frontiers in Behavioral Neuroscience, 8, 2014.