Cooperation and Reputation Dynamics with Reinforcement Learning
Nicolas Anastassacos, Julian García, Stephen Hailes, Mirco Musolesi
Nicolas Anastassacos
University College London; The Alan Turing Institute
Julian García
Monash University
Stephen Hailes
University College London
Mirco Musolesi
University College London; The Alan Turing Institute; University of Bologna
ABSTRACT
Creating incentives for cooperation is a challenge in natural and artificial systems. One potential answer is reputation, whereby agents trade the immediate cost of cooperation for the future benefits of having a good reputation. Game theoretical models have shown that specific social norms can make cooperation stable, but how agents can independently learn to establish effective reputation mechanisms on their own is less understood. We use a simple model of reinforcement learning to show that reputation mechanisms generate two coordination problems: agents need to learn how to coordinate on the meaning of existing reputations and collectively agree on a social norm to assign reputations to others based on their behavior. These coordination problems exhibit multiple equilibria, some of which effectively establish cooperation. When we train agents with a standard Q-learning algorithm in an environment with the presence of reputation mechanisms, convergence to undesirable equilibria is widespread. We propose two mechanisms to alleviate this: (i) seeding a proportion of the system with fixed agents that steer others towards good equilibria; and (ii) intrinsic rewards based on the idea of introspection, i.e., augmenting agents' rewards by an amount proportionate to the performance of their own strategy against themselves. A combination of these simple mechanisms is successful in stabilizing cooperation, even in a fully decentralized version of the problem where agents learn to use and assign reputations simultaneously. We show how our results relate to the literature in Evolutionary Game Theory, and discuss implications for artificial, human and hybrid systems, where reputations can be used as a way to establish trust and cooperation.
KEYWORDS
Reputation; Cooperation; Evolutionary game theory; Reinforcement Learning
ACM Reference Format:
Nicolas Anastassacos, Julian García, Stephen Hailes, and Mirco Musolesi. 2021. Cooperation and Reputation Dynamics with Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 9 pages.
1 INTRODUCTION
Cooperation is important in natural and artificial systems [2]. It allows agents with individual goals to reach beneficial group outcomes, even when group and individual incentives are not perfectly aligned [34]. If cooperation is costly but its benefits can be enjoyed by all agents, paying no cost is a dominant strategy, and cooperation is hard to establish and maintain unless a specific mechanism is in place to foster it [21]. One popular mechanism is direct reciprocity, which relies on agents meeting repeatedly [10, 37], thereby creating incentives to punish past defections and making cooperation viable via reciprocal strategies like Tit-for-Tat [3]. When agents are anonymous or cannot interact repeatedly, they can use a reputation mechanism to condition their cooperative actions, e.g., cooperating only with those that have a good reputation. This is known as indirect reciprocity [30, 32], and allows agents to trade the immediate cost of cooperation for the future benefits of keeping their reputation [20]. Reputation systems are common in computing, including multiagent systems [11, 38], but their application is not always straightforward [13, 17, 31]. Models of indirect reciprocity can be used as tools to understand reputation-based systems [18, 24, 32, 41]. These models generally provide a mathematical description of the incentives, coupled with a dynamic account of how groups of agents respond to these incentives in simple but illustrative scenarios. The dynamic features of these models are crucial, given that static models might be insufficient in the presence of multiple equilibria [20].
The framework of indirect reciprocity typically relies on the toolset of Evolutionary Game Theory (EGT). In EGT, agents do not solve for equilibrium, but copy other agents that are successful, in a dynamic process inspired by evolution [34]. Agents that perform well in a population are more likely to pass on their strategies or policies to subsequent generations. Despite their simplicity, these models have been successful in predicting human behavior [14] and in contributing to our understanding of complex dynamics associated with learning algorithms in multiagent settings [12, 33, 35, 36].
In models of reputation dynamics, agents learn how to take into consideration the reputations of others from rewards derived from a series of interactions with randomly chosen partners [22]. A strategy determines whether a co-player with a particular reputation is worth the cost of cooperation. A coupled dynamic process emerges from the changing reputations in the system. Social norms determine how agents judge interactions, updating the reputations of other agents after each encounter. Therefore, a norm assigns a reputation value given a combination of factors. Game theoretical models show that norms that reward justified defection, as well as cooperation with reputable agents, are particularly good at maintaining cooperation [22, 24, 30].
Crucially, EGT models assume that the set of strategies is predefined and explored by agents in a random fashion. Instead, we use Reinforcement Learning (RL) to model the process whereby agents discover strategies from simpler states and actions.
Agents explore and learn strategies (i.e., policies) responding directly to the rewards obtained during game play. Our purpose is twofold. First, we ask how the predictions of models of reputation dynamics change when the learning process is driven by individual experience, as in RL, instead of social learning, as in EGT models. Second, we provide a simple environment where we test whether RL algorithms reliably learn the equilibria that lead to sustained cooperation on the basis of public reputations.
Studies of RL and cooperation abound (see, for example, [1, 8, 9, 15, 16, 19]), but the interaction between cooperation and reputation in this context has not been explored yet. The existing EGT literature on reputation can be insightful in analyzing and proposing solutions to this particular problem [35]. This is the first work that bridges the RL and EGT literature in the area of cooperation and reputation dynamics. In particular, we find that in the presence of reputation mechanisms, agents need to solve two coordination problems: learning how to coordinate actions on the basis of existing reputation indicators, and collectively agreeing on a social norm to assign reputations to others based on their behavior. These coordination problems exhibit multiple equilibria, some of which effectively establish cooperation. Agents driven by standard RL algorithms will generally fail to coordinate, converging to non-efficient outcomes. This can be resolved by seeding a proportion of the system with fixed agents that steer others towards good equilibria, or by providing intrinsic rewards based on the idea of introspection, i.e., augmenting agents' rewards by an amount proportionate to the performance of their own strategy against themselves.
More generally, our results can also inform the work of AI researchers concerned with designing agents that can cooperate amongst themselves [8], and with humans [7, 29]. The problem of cooperation is also present in scenarios where artificial agents respond to individual rewards [1, 9, 26, 27]. Because individual incentives are often not sufficient, intrinsic rewards [8, 15, 16] or more complex agent architectures, in which the choice of the next action is based on prediction models of the behavior of the other agents, have been proposed [9, 19]. We show that, fundamentally, a reputation mechanism alone is not sufficient to steer agents towards cooperation, but it can be combined with intrinsic rewards to achieve cooperation.
The rest of this paper is organized as follows. Section 2 describes the problem of cooperation in the presence of reputation mechanisms and the existing results based on stability solution concepts. Section 3 shows that naive Q-learning agents converge to undesirable equilibria. Section 4 discusses how to steer the system towards desirable outcomes, while Section 5 deals with how agents can learn to also assign reputations in a completely decentralized fashion. In Section 6, we discuss how our results relate to the existing literature in game theory, as well as implications for artificial, human and hybrid systems, where reputation can be used as a way to establish trust and cooperation.
Our investigation is based on the classic Prisoner's Dilemma (PD) game [28]. Agents can cooperate, paying a cost c to help their opponent by an amount b; or defect, bypassing the cost and potentially reaping the benefit bestowed by cooperative co-players. The corresponding payoff matrix (for the row player) is

$$\begin{pmatrix} 0 & b \\ -c & b - c \end{pmatrix},$$

with the first action being defect, and the second action cooperate.
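To make the payoff structure concrete, the following minimal Python sketch encodes this donation-game payoff. It is an illustration only, not part of our experimental code, and it uses the convention, introduced below, that defect is action 0 and cooperate is action 1.

```python
def pd_payoffs(a_focal: int, a_opponent: int, b: float = 5.0, c: float = 1.0) -> float:
    """Payoff to the focal agent in the donation-game Prisoner's Dilemma.

    Actions: 0 = defect, 1 = cooperate. Cooperating pays cost c and gives
    benefit b to the opponent, so the payoff matrix (row player) is
    [[0, b], [-c, b - c]].
    """
    return b * a_opponent - c * a_focal
```

For example, pd_payoffs(1, 0) returns -c: a cooperator exploited by a defector pays the cost and receives no benefit.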
In the following section we denote defect as action 0 and cooperate as action 1. With b > c > 0, selfish agents will play defect, an outcome that is not Pareto optimal given that everyone would be better off with mutual cooperation. We will set c = 1 and vary b to adjust the benefit-to-cost ratio of cooperation. The amount of cooperation that emerges will depend on the benefit-to-cost ratio of the game, b/c.
We investigate a setup typical of the EGT literature [22, 24]. We consider N agents that are randomly matched with another agent, each round, to play a game. The game is repeated for a number of rounds, with agents being rematched in every round. An episode lasts for a pre-determined number of rounds M.
In each round agents play the PD game described above. Agents decide whether to cooperate or not based on the reputations of their co-players. In our simplest model, reputations can be 0 or 1. For the sake of clarity we will sometimes refer to 1 as Good, and 0 as Bad; however, no meaning is ascribed to reputation values a priori. Following [23], an agent decides whether they cooperate or not based on their own reputation and the reputation of the opponent. The reputation assignment is based on a social norm, as discussed in Section 2.3. Thus, we can encode how agents react to the reputations of others with 4 bits, as shown in Table 1.
Table 1: Action rules are encoded as bitstrings of size 4. Each bit encodes the action of the focal agent, with {D = 0, C = 1}. This action is a function of her reputation and the reputation of the opponent. There are 16 possible action rules.

If the focal reputation is:        0      0      1      1
and the opponent reputation is:    0      1      0      1
The focal action is given by:      Bit 3  Bit 2  Bit 1  Bit 0

For example, action rule 5 is as follows: Bit 3 is 0, therefore an agent using this rule will defect if their reputation is 0 and the opponent's reputation is 0; Bit 2 is 1, thus the agent will cooperate if their reputation is 0 and the opponent's is 1; they defect if their own reputation is 1 and the other's is 0, as given by Bit 1; and they cooperate when their own reputation and the co-player's reputation are both 1, as given by Bit 0. The resulting bitwise representation is 0101. In a similar fashion, action rule 0 is always defect, action rule 15 is always cooperate, and so on. As a result, we have 16 possible action rules in total.
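The bitwise encoding of action rules can be implemented in a few lines. The sketch below is our own illustration (the function name is arbitrary); it assumes the bit layout of Table 1, with defect = 0 and cooperate = 1.

```python
def action_from_rule(rule: int, own_rep: int, opp_rep: int) -> int:
    """Action prescribed by a 4-bit action rule (an integer in 0..15).

    Bit layout follows Table 1: bit 3 -> (own 0, opp 0), bit 2 -> (own 0, opp 1),
    bit 1 -> (own 1, opp 0), bit 0 -> (own 1, opp 1).
    Returns 0 (defect) or 1 (cooperate).
    """
    bit_index = 3 - (2 * own_rep + opp_rep)
    return (rule >> bit_index) & 1

# Action rule 5 (binary 0101) cooperates exactly when the opponent's reputation is 1.
assert [action_from_rule(5, o, p) for o in (0, 1) for p in (0, 1)] == [0, 1, 0, 1]
```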
2.3 Social Norms and Reputation Assignment
Now that we have described the basic setup of the game and the concept of action rule, we can define how reputations are assigned to agents after observing social interactions. A social norm [40] is used to determine how agents assign reputations to others [23]. A social norm is a function that translates the actions of the parties involved in an interaction into their future reputations [24]. An observer, sometimes centralized, changes the reputations of both parties after each interaction, following the social norm.

Table 2: Bitwise interpretation of norms. Each bit encodes the new focal reputation given her action towards an opponent with a particular reputation. For example, norm 0001 assigns a reputation 1 only to agents that cooperate when facing an agent with reputation 1.

If the focal action is:               D      D      C      C
The opponent's reputation is:         0      1      0      1
New focal reputation is given by:     Bit 3  Bit 2  Bit 1  Bit 0

Following [23], the new reputation for an agent depends on her action and the reputation of the co-player. We can encode a social norm as four bits, for a total of 2^4 = 16 norms (see Table 2). These are known as second-order norms [22], and can be extended to depend as well on the reputation of the focal agent (third order), or even on the basis of previous interactions [32]. Table 3 gives some social norm examples. We consider a small reputation assignment error χ, which occasionally flips a reputation from the original intention on assignment. This small error allows the reputation dynamics to be stationary, in the sense that in the long run the effect of initial reputations vanishes.
We consider two scenarios of reputation assignment that reflect different levels of centralization. In a semi-centralized system, all agents use a fixed exogenous norm (top-down reputation). In this case, the system's state is given by N action rules p_i, where each action rule is in {0, ..., 15}. Instead, in a fully decentralized system each agent can use a different norm (bottom-up reputation) [41]. In this case, the system's state is given by N tuples (p_i, d_i), where each action rule p_i is in {0, ..., 15} and each social norm d_i is in {0, ..., 15}. Sections 3 and 4 will deal with top-down dynamics, and Section 5 will discuss the bottom-up case.
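A social norm admits the same kind of bitwise lookup as an action rule. The sketch below is our own illustration of a second-order norm under the layout of Table 2, including the small assignment error χ; the helper name and the use of Python's random module are our choices.

```python
import random

def new_reputation(norm: int, focal_action: int, opp_rep: int, chi: float = 0.0) -> int:
    """Reputation assigned to the focal agent by a 4-bit second-order norm (0..15).

    Bit layout follows Table 2: bit 3 -> (D, opp 0), bit 2 -> (D, opp 1),
    bit 1 -> (C, opp 0), bit 0 -> (C, opp 1). With probability chi the assigned
    reputation is flipped, modelling the small assignment error in the text.
    """
    bit_index = 3 - (2 * focal_action + opp_rep)
    rep = (norm >> bit_index) & 1
    if random.random() < chi:
        rep = 1 - rep
    return rep

# Norm 9 (binary 1001, "Stern Judging"): good if you cooperate with the good or defect against the bad.
assert [new_reputation(9, a, r) for a in (0, 1) for r in (0, 1)] == [1, 0, 0, 1]
```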
In EGT, pairs (p, d) have been analyzed for stability [24]. Technically speaking, a norm d yields a dynamical system that determines the proportion of "Good" individuals in the population in the long run. This proportion is then used to compute the payoffs of specific action rules. A norm d stabilizes cooperation if a monomorphous population using action rule p and norm d can resist invasions by any mutant action rule p'. Social norms that reward justified defection, as well as cooperation with reputable agents, are particularly good at maintaining cooperation. These results have been extended to more realistic stochastic systems with finite populations [30].
Stability predictions show that the social norms most successful at cooperation share two characteristics: (i) cooperation with "good" agents begets a good reputation, and (ii) defection against "bad" agents begets a good reputation [24]. A particularly salient norm is known as "Stern Judging" [25]. Its binary representation is norm 9, given in Table 3. More specifically, the combination of action rule 5 and norm 9 cannot be invaded once established (see Tables 1 and 3).
Importantly, even with a good norm in place, defection is still an equilibrium. Thus, it can be argued that reputation systems transform the problem of cooperation into a simpler stag-hunt-like problem [26], with good and bad equilibria, leading to cooperation and defection respectively. This literature assumes that the set of strategies is predefined, and explored by agents in a completely random fashion. We are interested in relaxing this assumption by using RL.
RL algorithms learn a policy from repeated interactions with the environment and attempt to balance exploration and exploitation to maximize rewards. Unlike in EGT, RL agents do not choose from a fixed set of existing strategies, but learn instead to take actions given environment observations. Here, we train our agents with tabular Q-learning [39]. The policy π_i of agent i is represented by a table of state-action values Q_i(s, a). While learning how to react to the reputations of others, states correspond to opponents' reputations. Actions are naturally cooperate or defect.
A single episode consists of a sufficiently large number of rounds K, where agents are randomly matched to play a PD game using their reputations. The policy of agent i is ε-greedy, defined by

$$\pi_i(s) = \begin{cases} \arg\max_{a \in \mathcal{A}} Q_i(s, a) & \text{with probability } 1 - \epsilon \\ \mathcal{U}(\mathcal{A}) & \text{with probability } \epsilon \end{cases} \qquad (1)$$

where U(A) denotes a uniform distribution over actions. Agents collect a set of trajectories {(s, a, r, s')_k : k = 1, ..., K} by interacting with the environment and store them in a memory buffer M_i. When learning, an agent updates its policy using these experiences according to

$$Q_i(s, a) \leftarrow Q_i(s, a) + \beta \left[ r + \gamma \max_{a' \in \mathcal{A}} Q_i(s', a') - Q_i(s, a) \right] \qquad (2)$$

where s is the current state, a is the current action, r is the reward and s' is the next state. As agents learn over time, the environment appears non-stationary from any agent's perspective. To account for other agents learning, the information stored in each agent's memory buffer is refreshed every episode to ensure it remains relevant to the current transition dynamics of the environment.
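For concreteness, a minimal tabular Q-learning agent implementing Equations (1) and (2) could look as follows. This is an illustrative sketch rather than our experimental code; the class layout, the buffer handling and the default hyperparameter values are placeholder assumptions.

```python
import random
from collections import defaultdict

class QAgent:
    """Tabular Q-learning agent whose state is the opponent's reputation (0 or 1)."""

    def __init__(self, beta: float = 0.05, gamma: float = 0.99, epsilon: float = 0.1):
        self.q = defaultdict(lambda: [0.0, 0.0])  # q[state] -> [value of defect, value of cooperate]
        self.beta, self.gamma, self.epsilon = beta, gamma, epsilon
        self.buffer = []  # trajectories; refreshed every episode to track the changing environment

    def act(self, state: int) -> int:
        # Equation (1): epsilon-greedy over the two actions.
        if random.random() < self.epsilon:
            return random.randint(0, 1)
        values = self.q[state]
        return int(values[1] > values[0])

    def learn(self) -> None:
        # Equation (2): one Q-learning update per stored (s, a, r, s') transition.
        for state, action, reward, next_state in self.buffer:
            target = reward + self.gamma * max(self.q[next_state])
            self.q[state][action] += self.beta * (target - self.q[state][action])
        self.buffer.clear()
```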
In this environment, agents are matched in rounds against other agents and must choose to cooperate or defect, but must also judge the interactions of others and label agents as "good" or "bad". Agents do not accrue rewards for passing judgment and must coordinate how they assign reputations to others purely from the rewards received when cooperating or defecting with other agents. It might be difficult for agents to learn independently, as they must coordinate on how to interpret reputations, as well as how to assign them to others based on their behavior, while simultaneously learning to cooperate or defect.

Table 3: Examples of social norms. The two norms marked with * are the effective and ineffective norms compared in Figure 1.
Norm  Binary representation  Meaning
0*    0000                   Actions and reputations play no role; always assigns a "bad" reputation.
3     0011                   Cooperating with others is always "good" and defecting is always "bad".
9*    1001                   Someone who cooperates with others that are "good" and defects against others that are "bad" is good.
11    1011                   Someone is "bad" only if they refuse to cooperate with a good individual.

Figure 1: Standard Q-learning achieves substantially less cooperation than what is predicted with stability analysis.

We consider two scenarios. In our first scenario, agents learn how to react to the reputations of others, with an effective social norm being enforced by a central party; this is discussed next. By analyzing the Q-values learned by the agents, we can represent each agent's policy π_i as an equivalent strategy using the bitwise interpretation in Table 1. In the second scenario, we consider mechanisms for learning from individual experience.
This effective norm (norm 9), combined with an action rule that reacts to reputation (e.g., action rule 5), makes cooperation stable (see Section 2.4). Stability predictions expect this norm to maintain cooperation and, with social learning and stochasticity, a system with about 10 agents will reach as much as 75% cooperation for a benefit-to-cost ratio of 5 [30].
Can agents using RL learn to play the strategies that combine with stern judging to maintain cooperation? We fix norm 9 in a centralized system and allow agents to adjust their policies following the algorithm described above. We start with a population of 10 agents.
The reputation dynamics simulation is run for 10,000 episodes, each consisting of K = 200 random encounters in the population. We set the reputation assignment error χ to a small value and, fixing the cost of cooperation to c = 1, vary the benefit. The learning rate β and exploration rate ε are set to small fixed values, and the discount factor to γ = 0.99; ε is kept fixed (not annealed) to account for the changing behavior of other agents. Results are averaged over 20 different random seeds for each parameter set. We measure the average reward in the whole population during the last half of the episodes, taking the average, weighted by b, as the level of cooperation achieved. The results are shown in Figure 1.
Strikingly, RL agents fail to reliably achieve cooperation, even in the presence of an effective social norm. Only when cooperation is very cheap, at a benefit-to-cost ratio of 10, is a modest 40% cooperation maintained. At a benefit-to-cost ratio of 5, the differences between an effective social norm and an ineffective social norm are negligible. This is in stark contrast to what is expected from stability analysis [24], or even from stochastic predictions relying on social learning. For a system with 10 agents, at a benefit-to-cost ratio of 5, a model based on EGT techniques predicts as much as 70% cooperation based on the same social norm.
This result can be explained by the fact that defection is still an equilibrium, even with an effective social norm. Agents in this setup reliably fail to use the reputation information, converging on a purely defecting strategy that ignores reputation. Effectively, the reputation system transforms a difficult prisoner's dilemma into a stag-hunt-like game with efficient (cooperation) and inefficient equilibria. It has been reported before that in these situations, RL algorithms can fail to converge to desirable equilibria in the absence of intrinsic rewards or changes to the environment [26].
This outcome also reflects what typically happens when RL agents are trained to play with one another in a social dilemma: (i) learning does not account for potential changes in other agents' strategies, so defection is valued as if the environment were to remain unchanged; and (ii) agent behaviors are initially volatile, and so observations are not fully representative of agent strategies. This remains the case even with a coordination device such as that introduced by the reputation mechanism, and even in a simple game with binary reputations.
More generally, the fact that agents trained with Q-learning cannot learn cooperation even with an effective social norm indicates a gap between social and individual learning. We propose two solutions to this problem: one improves agent coordination on the meaning of reputation labels, and the other encourages cooperation through introspective rewards. These are discussed next.
We now set out to design a mechanism to steer agents towards the efficient equilibrium, while retaining the main feature of learning from individual experience. To do this, we take cues from the cooperation literature in EGT [21], where cooperation is enabled by correlated interactions, allowing cooperative types to meet each other more frequently. This guarantees that the benefits of cooperation are shared disproportionately among those cooperating. We propose two implementable ways to achieve this in the case of reputations. These ideas are discussed in Sections 4.1 and 4.2, respectively. In the remaining experiments we analyse the outcomes using the challenging value b/c = 5.
Seeding agents with reciprocal strategies has the potential to increase the reward to autonomous cooperative ones. This approach has been studied before in norm-emergence scenarios [6, 33]. In particular, [33] considers a simpler case where norms are equivalent to strategies and only two norms are available.
In our case, norms are coordination devices through reputation. This setup is richer because more than two norms are available, which clearly makes the problem more difficult.
To alleviate the burden of coordination, we encourage agents to coordinate around a specific equilibrium by introducing fixed agents into the population that play action rule 5 (i.e., 0101): cooperate only when the co-player has a good reputation. Combined with social norm 9, an agent that plays action rule 5 is guaranteed to always have a good reputation. By specifically rewarding agents with good reputations and penalizing agents with bad reputations, the seeded agents encourage others to play strategies that, in turn, give them good reputations, thereby facilitating coordination.
We now vary the number of seeds k, i.e., we fix k agents using action rule 5 (i.e., 0101), in an environment where N − k agents are learning. For k = N − 1 we recover a simple single-agent learning problem. We will show that a small proportion of seeds k is enough to successfully promote cooperation. We run experiments with N = 10 agents that learn for 50,000 episodes. We focus on the challenging problem where b/c = 5, and vary the number of seed agents k, expressed as a proportion of the whole population. To benchmark the effects of seeding agents, we consider the effective social norm (9), as well as norm 0, which completely disregards the value of reputations. Other RL parameters are kept as above. The results are summarized in Figure 2.

Figure 2: Seeding agents to promote cooperation. Panel A shows the average cooperation in the last half of the episodes, as measured by the proportion achieved of the maximum cooperation payoff. Panel B shows typical learning trajectories for the population of agents using an efficient social norm, highlighting the average trajectory in bold. Panel C shows the results for norm 0.

Figure 2A shows how k/N affects cooperation in the long run. It measures the average cooperation in the last half of the episodes, as the proportion achieved of the maximum cooperation payoff. There is a sudden shift when 20% of the agents are fixed reciprocators, which reliably steers the population towards cooperation. Typical learning trajectories are shown in Figures 2B and 2C, for the effective social norm and for the norm that disregards reputations, respectively. The results for the norm that ignores reputation show that seeding helps coordination only when an effective norm is in place. We note that the role of fixed reciprocators is both to regulate the amount of defection that learning agents can "get away with", and to stabilize the learning process by reducing the variance in the outcomes. We need about 20% of the population to be reciprocators in order to converge to cooperative outcomes when the other 80% of the agents are learning using Q-learning; this result holds for different population sizes, at the same proportion of fixed agents. This is intuitive, since the threshold that biases the random matching to guarantee reciprocation needs to stay constant when the system size is increased.
Many policies can represent strategies resulting in widespread cooperation. An obvious outcome is that agents just cooperate all the time regardless of what their opponent's strategy is (action rule 15). This would ultimately maximize the total reward if all agents stuck to unconditional cooperation. This is crucially not the case when seeding reciprocators. With an effective social norm, cooperating with agents who have bad reputations begets a negative reputation. Thus, unconditional cooperators are not stable. Instead, agents learn to play in a way that warrants them a positive reputation, in turn maximizing the reward, i.e., using a reciprocal strategy. Figure 3 shows how, in the long term, in the presence of fixed agents, RL agents converge to a policy equivalent to the reciprocal strategy that matches the fixed agents being seeded. Without seeding, unconditional defection is the prevalent outcome.

Figure 3: Learned policies with seeding – counting the frequency of strategies equivalent to the policies the algorithm has converged to.

In summary, assuming an exogenous effective social norm, seeding reciprocal agents (with action rule 5) helps those learning to coordinate on the "good" label. Evidence of this is that the mirror action rule 10, which coordinates by limiting cooperation to those with label 0, is not present in the long run. While 20% may not be a high demand in the proportion of seeded agents in certain circumstances, there may be scenarios where the level of decentralization does not allow the system to have a predetermined number of fixed good agents, e.g., when they are required to be physically present. However, there are many scenarios where seeding may be cheap given its impact, such as instantiating software bots in a virtual environment.
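As an illustration of how seeding can be wired into the population, consider the following sketch. It is our own simplification: the class and function names are hypothetical, and the learner factory is meant to stand for a Q-learning agent such as the one sketched in the previous section.

```python
class FixedReciprocator:
    """Non-learning agent playing action rule 5 (0101): cooperate iff the co-player's reputation is 1."""

    def act(self, opp_rep: int) -> int:
        return opp_rep  # cooperate (1) only with reputable co-players

    def learn(self) -> None:
        pass  # fixed strategy, nothing to update

def build_population(n_agents, seed_fraction, learner_factory):
    """Mix k fixed reciprocators with N - k learning agents."""
    k = int(round(seed_fraction * n_agents))
    return [FixedReciprocator() for _ in range(k)] + [learner_factory() for _ in range(n_agents - k)]

# Example: 10 agents, 20% of which are seeded reciprocators.
# population = build_population(10, 0.2, QAgent)
```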
We next discuss an alternative solution based on intrinsic rewards. The idea of intrinsic rewards incorporates psychological insights from motivation into learning, by considering not only the external rewards provided by the environment, but also rewards that are intrinsic to agents [4]. This idea has been used before in cooperation problems with reinforcement learning, by endowing agents with a taste for curiosity [16] or with other-regarding preferences [15].
Here, we use a simple principle for intrinsic rewards: agents care about what their policy would do to an agent like themselves. Therefore, we consider this a form of introspection. The extrinsic reward and the value of introspection are weighted in a linear combination with parameter α. Agent i's reward is then

$$R_i = \alpha U_i + (1 - \alpha) S_i$$

where U_i is their payoff in a particular encounter, and S_i refers to the payoff they would get facing themselves. The intuition is that agents would prime policies that would be effective when playing against agents like themselves. The parameter α, in [0, 1], is used to represent the level of introspection.
While the self-encounter leading to the intrinsic reward S_i still uses the agent's reputation as an input, the actions the agents take during self-play do not affect their reputations. This "simulated self-encounter" does not contribute to the state of the game; it is only used to generate the intrinsic reward.
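The introspective reward can be computed in one line once the simulated self-encounter is defined. The sketch below is our own illustration of the linear combination above; the function name, the policy argument and the donation-game payoff used for the self-encounter are assumptions, not part of the original implementation.

```python
from typing import Callable

def introspective_reward(policy: Callable[[int], int], own_rep: int, extrinsic_payoff: float,
                         alpha: float, b: float = 5.0, c: float = 1.0) -> float:
    """Blend the payoff of the real encounter with a simulated self-encounter.

    The self-encounter asks the agent's own policy what it would play against an
    agent with its own reputation; it changes no reputations and only feeds the
    learning signal.
    """
    a_self = policy(own_rep)        # action taken against a copy of oneself
    self_payoff = (b - c) * a_self  # both copies play a_self in the donation game
    return alpha * extrinsic_payoff + (1.0 - alpha) * self_payoff
```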
Aside from the intuition of introspection, this mechanism also has a justification previously explored in the EGT literature. The parameter α can also be conceived as regulating the matching and priming interactions among alike types, i.e., assortative matching [5]. If meetings between agents are random, those defecting will on average get higher rewards than cooperators; but when matching is assortative, cooperators are more likely to meet cooperators than defectors, and therefore the cost of cooperation is repaid by a higher probability of playing against a cooperating opponent [37].
The results are summarized in Figure 4. Parameters are as discussed above, but, noting a larger variance in the reward distribution, we allow the learning to run for a larger number of episodes. Panel A shows how different levels of introspection affect cooperation in the long run, for an effective and an ineffective social norm. For b = 5, the benefits of the intrinsic reward kick in, raising the level of cooperation. Crucially, cooperation is highest around α = 0.8; we observe a decline in cooperation for larger values, driven by the noise associated with overemphasizing the intrinsic component of the reward. Panels B, C and D show typical learning trajectories for agents in an environment with the effective social norm. The average trajectory is shown as a dotted line.
Reputation mechanisms turn a PD with a single non-efficient equilibrium (i.e., defection) into a game where potentially many equilibria arise, depending on how individuals use the reputations for coordination. In particular, agents need to coordinate on reacting to reputation signals. While the introspective reward encourages cooperation, it does not help the agents to solve the signal coordination problem. Action rule 5, prominent in the results of Section 4.1, has a mirror action rule whereby agents defect with those labeled "good" (1), and cooperate with those labeled "bad" (0). When all agents agree on a label, both action rules can engender cooperation when combined with norm 9. Figure 5 shows that agents are sometimes divided between these two action rules, failing to cooperate consistently, and occasionally opening the door to unconditional defection.
Figure 4: Introspective rewards to promote cooperation. Panel A shows the average cooperation in the last half of the episodes, as measured by the proportion achieved of the maximum cooperation payoff, as a function of the level of introspection. Panels B, C and D show typical learning trajectories for the population of agents using an efficient social norm with different levels of introspection α. We highlight the average trajectory as a dotted line.

Figure 5: Learned policies with introspective rewards – counting the frequency of strategies equivalent to the policies the algorithm has converged to.

While conditional strategies are used almost 90% of the time for the optimal introspection level, full coordination is not guaranteed. Moreover, the analysis so far has either set aside the corresponding coordination problem of reputation assignment, or assumed levels of centralization where social norms can be enforced. We next discuss a fully decentralized scenario, where a central enforcer is not required.
The problem of reputation becomes harder when social norms are no longer centralized [41]. Instead, it is assumed that each agent can judge the reputations of others after each encounter. For every encounter, a third party is chosen randomly from the population to judge the reputations of the two parties involved in the interaction. This reflects a completely decentralized system where enforcement of the social norm is not possible. Thus, the next step is learning to coordinate not just how to react to others' reputations, but also how to assign reputations to other agents. Maintaining cooperation in this case requires agents to learn to assign reputations in a meaningful way, preserving information about other agents' strategies in the reputation labels.
Once again, agents do not choose from 16 available norms to judge encounters between agents, but learn to assign a reputation 0 or 1 based on the actions agents take and the reputation of their co-player. The dimensionality of the Q-tables increases to accommodate the new states and actions, but the learning rate β is kept fixed at the same value as before, and ε = 0.1. Agents do not accrue rewards for judging the interactions of others and must coordinate how they assign reputations purely from the rewards received when cooperating or defecting with other agents. This is hard because agents must rely on others to assign informative reputations and do not directly receive rewards for doing so.
Figure 6 shows results for the decentralized problem. Without introspection or seeded agents to improve agent coordination, agents do not learn to cooperate under these circumstances and converge to inefficient equilibria. We can immediately see the impact of lacking an effective norm, as agents quickly learn to defect unconditionally. This further complicates the problem, as it renders reputation meaningless and learning an appropriate social norm becomes difficult.

Figure 6: Introspective reward and seeded agents recover cooperation for fully decentralized reputations.

As mentioned previously, seeded agents encourage coordination on the reputation signal, while introspection encourages cooperation. Introducing our two mechanisms independently has middling success, as neither results in cooperation consistently emerging. The added complexity of coordinating on how to assign reputations still results in defecting strategies being overall more rewarding for agents, and even seeding 50% of the agents only results in 30% cooperation between RL agents. While increasing α appears to work significantly better than seeding agents when norms are not fixed, the strategies that agents learn are not coordinated and are dominated by the noise in the introspective reward signal as α approaches 1.
By combining these two mechanisms we can significantly improve the performance of the RL agents and have them coordinate around a social norm. Notably, with 50% seeded agents and a suitable level of introspection, agents can achieve 80% cooperation. Although a large proportion of agents need to be seeded, this serves to curb agents from defaulting to defecting strategies enough that they can learn to assign meaningful reputations that are representative of agent strategies. This is shown in Figure 7, panel B, where we can see that 75% of the time, agents successfully coordinate around social norm 3. Unlike norm 9, norm 3 identifies agents according to their most recent action, labeling agents who have cooperated as "good" and agents who have defected as "bad" (see Table 3). In conjunction with norm 3, agents also learn action rule 5: cooperating with good agents and defecting against bad agents. Similarly to when norm 9 is used, the combination of norm 3 and action rule 5 results in a stable equilibrium that rewards cooperation while defectors are identified and punished. Unlike norm 9, agents who cooperate with defectors are not seen as bad agents and are not held responsible for their opponent's reputation. Instead, an agent's reputation is determined exclusively by its own actions, independent of the opponent's reputation, and is sufficient to represent each agent's behavior and to guide RL agents towards a positive equilibrium. The right combination of seeding and introspective rewards can recover up to 80% cooperation in the fully decentralized case. Full coordination remains a challenge.
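To illustrate the enlarged learning problem, the following sketch, again our own and with hypothetical names, shows the extra table a decentralized judge needs: it maps an observed (focal action, co-player reputation) pair to an assigned reputation, and it receives no direct reward for judging.

```python
import random
from collections import defaultdict

class LearnedJudge:
    """Third-party judge that learns which reputation to assign after observing an interaction.

    There is no direct reward for judging: any update of judge_q has to be driven
    indirectly by the payoffs obtained when playing, which is what makes the fully
    decentralized setting hard.
    """

    def __init__(self, epsilon: float = 0.1):
        # judge_q[(focal_action, coplayer_rep)] -> [value of assigning 0, value of assigning 1]
        self.judge_q = defaultdict(lambda: [0.0, 0.0])
        self.epsilon = epsilon

    def judge(self, focal_action: int, coplayer_rep: int) -> int:
        if random.random() < self.epsilon:
            return random.randint(0, 1)
        values = self.judge_q[(focal_action, coplayer_rep)]
        return int(values[1] > values[0])
```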
Reputation dynamics create difficult coordination and cooperation problems for independent learners. Agents trained using reinforcement learning fail to converge to efficient equilibria, even if an effective social norm is imposed with a reputation system. Just like in existing reputation systems with human actors, artificial agents have problems coordinating on the effective use of reputations.
We have proposed two solutions to this problem. Our first solution is to seed fixed agents whose task is to steer others towards coordination on the meaning of reputation labels. Specifically, a mass of reciprocal fixed agents effectively helps RL agents to coordinate on a single label, in turn encouraging the conditional actions that foster cooperation while protecting it from defectors.
Introspection, via intrinsic rewards, entices agents to be more cooperative. This mechanism is theoretically grounded by models of assortative matching in EGT. The optimal balance of introspection and external rewards can recover a great deal of cooperation, but does not protect agents from failing to coordinate on reputation labels.
Figure 7: Introspective reward and seeded agents coordinate around social norm 3 and action rule 5, resulting in a cooperative, stable equilibrium.
These mechanisms show great potential for synergistic interactions. When combined, they successfully recover substantial levels of cooperation in fully decentralized scenarios. Our results also show that stability analysis and stochastic models of social learning, common in EGT, tend to over-estimate how much cooperation can be expected from the presence of reputation mechanisms. EGT models, and more generally models of human behavior, can learn from RL methods. Specifically, RL grounds the exploration process whereby agents discover strategies. This process is often assumed ex ante in evolutionary games. As shown here, this can have an effect on model predictions. Our results further underscore the differences between social and individual learning that are notable in the EGT and RL communities. This work has been based on a rather basic scenario. Future work may further explore how these solutions apply to more complex cooperation scenarios, including those beyond binary reputations. Grid-like worlds that are popular benchmarks in the RL community may also be interesting testbeds for understanding reputation.
Intelligent artificial agents are expected to be able to navigate social interactions and recognise efficient outcomes where multiple parties can benefit. Although a standard RL algorithm fails to converge to desirable equilibria, this can be amended by introducing mechanisms of the kind extensively investigated in the EGT literature. We believe that this paper demonstrates the largely unexplored potential of combining techniques and methodologies from the RL and EGT communities in order to investigate open problems around cooperation and reputation dynamics in artificial, human, and hybrid societies.
REFERENCES
[1] Nicolas Anastassacos, Stephen Hailes, and Mirco Musolesi. 2020. Partner Selection for the Emergence of Cooperation in Multi-Agent Systems Using Reinforcement Learning. Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI'20) 34, 05 (April 2020), 7047–7054.
[2] Robert Axelrod. 1997. The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press.
[3] R. Axelrod and W. D. Hamilton. 1981. The Evolution of Cooperation. Science.
[4] In Intrinsically Motivated Learning in Natural and Artificial Systems, Gianluca Baldassarre and Marco Mirolli (Eds.). Springer, Berlin, Heidelberg, 17–47.
[5] Theodore C. Bergstrom. 2003. The Algebra of Assortative Encounters and the Evolution of Cooperation. International Game Theory Review 05, 03 (Sept. 2003), 211–228.
[6] Tim Borglund, Shuyue Hu, and Ho-Fung Leung. 2018. The Effects of Fixed-Strategy Agents on Local Convention Emergence in Multi-agent Systems. In International Conference on Intelligent Information Processing. Springer, 99–108.
[7] Filipa Correia, Shruti Chandra, Samuel Mascarenhas, Julien Charles-Nicolas, Justin Gally, Diana Lopes, Fernando P. Santos, Francisco C. Santos, Francisco S. Melo, and Ana Paiva. 2019. Walk the Talk! Exploring (Mis)Alignment of Words and Deeds by Robotic Teammates in a Public Goods Game. In Proceedings of the 8th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN'19). 1–7.
[8] Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, and Joel Z. Leibo. 2019. Learning Reciprocity in Complex Sequential Social Dilemmas. (March 2019). arXiv:1903.08082 [cs]
[9] Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. 2018. Learning with Opponent-Learning Awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS'18). Richland, SC, 122–130.
[10] Julián García and Matthijs van Veelen. 2018. No Strategy Can Win in the Repeated Prisoner's Dilemma: Linking Game Theory and Computer Simulations. Frontiers in Robotics and AI.
[11] Comput. Surveys 48, 2 (Oct. 2015), 1–42.
[12] Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Rémi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duéñez-Guzmán, and Karl Tuyls. 2020. Neural Replicator Dynamics: Multiagent Learning via Hedging Policy Gradients. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS'20). 492–501.
[13] Kevin Hoffman, David Zage, and Cristina Nita-Rotaru. 2009. A Survey of Attack and Defense Techniques for Reputation Systems. Comput. Surveys 42, 1 (Dec. 2009), 1–31.
[14] Moshe Hoffman, Sigrid Suetens, Uri Gneezy, and Martin A. Nowak. 2015. An Experimental Investigation of Evolutionary Dynamics in the Rock-Paper-Scissors Game. Scientific Reports 5, 1 (Aug. 2015).
[15] Edward Hughes, Joel Z. Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, Heather Roff, and Thore Graepel. 2018. Inequity Aversion Improves Cooperation in Intertemporal Social Dilemmas. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). 3330–3340.
[16] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, Dj Strouse, Joel Z. Leibo, and Nando De Freitas. 2019. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (ICML'19). PMLR, 3040–3049.
[17] R. Jurca and B. Faltings. 2003. An Incentive Compatible Reputation Mechanism. In Proceedings of IEEE International Conference on E-Commerce 2003 (CEC'03). 285–292.
[18] Marcus Krellner and The Anh Han. 2020. Putting Oneself in Everybody's Shoes - Pleasing Enables Indirect Reciprocity under Private Assessments. Artificial Life Conference Proceedings 32 (July 2020), 402–410.
[19] Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. 2018. Stable Opponent Shaping in Differentiable Games. In Proceedings of the 7th International Conference on Learning Representations (ICLR'19).
[20] George J. Mailath and Larry Samuelson. 2006. Repeated Games and Reputations: Long-Run Relationships. Oxford University Press, USA.
[21] Martin A. Nowak. 2006. Five Rules for the Evolution of Cooperation. Science.
[22] Nature.
[23] Journal of Theoretical Biology.
[24] Journal of Theoretical Biology.
[25] PLoS Computational Biology 2, 12 (2006), e178.
[26] Alexander Peysakhovich and Adam Lerer. 2018. Prosocial Learning Agents Solve Generalized Stag Hunts Better than Selfish Ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS'18). Stockholm, Sweden, 2043–2044.
[27] Iyad Rahwan, Jacob W. Crandall, and Jean-François Bonnefon. 2020. Intelligent Machines as Social Catalysts. Proceedings of the National Academy of Sciences (March 2020).
[28] Anatol Rapoport, Albert M. Chammah, and Carol J. Orwant. 1965. Prisoner's Dilemma: A Study in Conflict and Cooperation. University of Michigan Press.
[29] Fernando P. Santos, Jorge M. Pacheco, Ana Paiva, and Francisco C. Santos. 2019. Evolution of Collective Fairness in Hybrid Populations of Humans and Agents. Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI'19) 33, 01 (July 2019), 6146–6153.
[30] Fernando P. Santos, Francisco C. Santos, and Jorge M. Pacheco. 2016. Social Norms of Cooperation in Small-Scale Societies. PLOS Computational Biology.
[31] Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI'18) (2018), 2.
[32] Fernando P. Santos, Francisco C. Santos, and Jorge M. Pacheco. 2018. Social Norm Complexity and Past Reputations in the Evolution of Cooperation. Nature.
[33] In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07). 1507–1512.
[34] Karl Sigmund. 2016. The Calculus of Selfishness. Princeton University Press.
[35] Karl Tuyls and Ann Nowé. 2005. Evolutionary Game Theory and Multi-Agent Reinforcement Learning. The Knowledge Engineering Review 20, 01 (Dec. 2005), 63.
[36] Karl Tuyls, Katja Verbeeck, and Tom Lenaerts. 2003. A Selection-Mutation Model for q-Learning in Multi-Agent Systems. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'03). Association for Computing Machinery, New York, NY, USA, 693–700.
[37] M. van Veelen, J. Garcia, D. G. Rand, and M. A. Nowak. 2012. Direct Reciprocity in Structured Populations. Proceedings of the National Academy of Sciences.
[38] In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS'10). Richland, SC, 225–232.
[39] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-Learning. Machine Learning 8, 3 (May 1992), 279–292.
[40] Michael Wooldridge. 2009. An Introduction to MultiAgent Systems. John Wiley & Sons.
[41] Jason Xu, Julian Garcia, and Toby Hanfield. 2019. Cooperation with Bottom-up Reputation Dynamics. In