Accumulating Risk Capital Through Investing in Cooperation
Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell
Center for Human-Compatible AI, University of California, Berkeley
[email protected], {michael_dennis,critch,russell}@cs.berkeley.edu
ABSTRACT
Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors. We show that this is an unavoidable trade-off and propose an objective which balances these concerns, promoting both safety and long-term cooperation. Moreover, the trade-off between safety and cooperation is not severe, and one can receive exponentially large returns through cooperation from a small amount of risk. We study an exact solution method and propose a method for training policies that targets this objective, Accumulating Risk Capital Through Investing in Cooperation (ARCTIC), and evaluate both in iterated Prisoner's Dilemma and Stag Hunt.
KEYWORDS
Social Dilemmas; Game Theory; Multi-agent Reinforcement Learning; Safety; Cooperation
ACM Reference Format:
Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell. 2021. Accumulating Risk Capital Through Investing in Cooperation. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3-7, 2021, IFAAMAS, 9 pages.
Sequential social dilemmas (SSDs) are games where short-term individual incentives conflict with the long-term social good. Such games are pervasive and describe situations in which we would deploy automated multi-agent systems. In many of these situations, we only control one of the agents involved, as is the case with self-driving cars, where the other agent could be a human or a self-driving car from another company. Many methods have been proposed for training policies that would better optimize social good [5, 12, 13, 15]. However, many of these methods do so at a greater risk of being exploited. A policy that optimizes for social welfare could allow other policies to be selfish without long-term consequences, so the welfare-maximizing policy may fare worse in these settings than if it had only optimized for its own self-interest. To some extent, this is unavoidable, since every choice to cooperate leaves room to be exploited. However, though this trade-off is inevitable, it is not as stark as it first appears.

To study this trade-off formally, we present two objectives. The first objective is the well-studied notion of $\epsilon$-safety. Previous work shows that $\epsilon$-safe policies in sequential settings can risk what the policy has won in expectation [6], allowing them in the long run to take much larger risks. The second objective is to perform well with cooperation-promoting policies. We define cooperation-promoting policies to be those that are more cooperative when faced with cooperative policies. Since perfect safety would require only defection and perfect performance with cooperation-promoting policies would require only cooperation, these two objectives are in clear conflict.

Throughout this work, we find it convenient to conceive of $\epsilon$-safety using the concept of risk capital. Risk capital is the amount of capital an investor is willing to lose. In our case, we use risk capital to refer to the amount of utility we are willing to risk, which for $\epsilon$-safe policies is $\epsilon$. In this framing, we can think of prior work, showing that $\epsilon$-safe policies risk what the policy won in expectation, as noticing that a policy does not need more risk capital for reinvesting its unexpected winnings. Since cooperation could always be met with defection, cooperating necessitates some amount of willingness to lose utility. However, like an investment, cooperation often results in returning more utility than was risked. This returned utility can then be reinvested without risking worse outcomes, leading to growing risk capital over time, and thus more cooperation over time, even for a small initial amount of risk capital.

This argument shows an interesting fact about the trade-off between safety and cooperation: though cooperation with total safety is impossible, giving away even a small amount of safety will lead to nearly optimal cooperation in the long term. We formalize this argument as a trade-off between safe beliefs and cooperation-promoting beliefs. We propose a method to train policies that satisfy this objective, which we call Accumulating Risk Capital Through Investing in Cooperation (ARCTIC), based on the idea of investing risk capital to achieve long-term cooperation. We test this method in the simple domains of Prisoner's Dilemma and Stag Hunt. However, we expect investing in cooperation to be a broadly applicable principle for developing safe and cooperative agents.
To achieve full safety, we begin with fully safe beliefs and adapt the safety levels according to opponent play. A positive initial risk capital can also be used, with minimal effect on long-term safety, while achieving cooperation more easily.

The problem of cooperation in sequential settings has long been studied in game theory. One of the most famous strategies is Rapoport's tit-for-tat [20], which achieved great success in Axelrod's tournament [1]. In part, Axelrod attributed its success to its ability to promote cooperation while not being exploitable after the first move. Human experiments on sequential social dilemmas often find cooperation to be the most popular strategy. Gardner et al. [7] use bounded rationality to explain the high levels of cooperation in lab experiments where participants played a common pool resource game and could communicate with each other. Human cooperation seen in social dilemmas is often founded in indirect reciprocity.

We begin in Section 2 by outlining the necessary mathematical preliminaries. In Section 3, we establish the conditions for the $\epsilon$-safety of beliefs. Then we define a policy-conditioned belief that incentivizes cooperation in SSDs in Section 4. In Section 5, we show that the trade-off between safety and cooperativeness decreases with the length of the interaction. In Section 6, we equip this cooperation-inducing belief with the safety property and propose the ARCTIC algorithm. Additionally, we show ARCTIC has desirable properties in matrix SSDs: Prisoner's Dilemma and Stag Hunt. Finally, in Section 7 we demonstrate its success at defending against adversaries whilst maintaining cooperation in RL agents.

A game is a tuple $(N, A, u)$ where $N = \{1, 2, \dots, n\}$ is the set of agents; $A = A_1 \times \dots \times A_n$ is the action space of all players, with $n \geq 2$; and $u = (u_1, \dots, u_n)$, where $u_i : A \to \mathbb{R}$ is a convex utility function for player $i$. The expected utility of player $i$ is denoted $E[u_i(\sigma_i, \sigma_{-i})]$. Denote the mixed strategy space of player $i$ as $\Sigma_i$ and the opponent strategy space as $\Sigma_{-i} = \prod_{j \neq i} \Sigma_j$. Then the minimax value of the game $v_i$ is the highest value player $i$ can guarantee without knowing their opponents' actions: $v_i = \max_{\sigma_i \in \Sigma_i} \min_{\sigma_{-i} \in \Sigma_{-i}} E[u_i(\sigma_i, \sigma_{-i})]$. A strategy $\sigma_i$ is safe if it guarantees at least the minimax value on average: $E[u_i(\sigma_i, \sigma_{-i})] \geq v_i$ for any $\sigma_{-i} \in \Sigma_{-i}$. A strategy $\sigma_i$ is $\epsilon$-safe if, and only if, for all $\sigma_{-i} \in \Sigma_{-i}$ we have $v_i - \epsilon \leq E[u_i(\sigma_i, \sigma_{-i})]$. The best response to a strategy $\sigma_{-i}$ is $BR(\sigma_{-i}) := \arg\max_{\sigma_i} E[u_i(\sigma_i, \sigma_{-i})]$. A Nash equilibrium is a strategy profile $\sigma$ such that $\sigma_i \in BR(\sigma_{-i})$ for all $i \in N$. The Risk What You've Won in Expectation algorithm [6] plays an $\epsilon$-safe best response to a model of an opponent's strategy and achieves safety.

Social dilemmas are games in which agents can either cooperate or defect, where joint cooperation gains the highest total rewards but there is an incentive to deviate from this. We will explore the conditions for beliefs in a sequential social dilemma (SSD) to incentivize cooperation.
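Before turning to social dilemmas, the following is a minimal sketch of the safety notions defined above for a two-player matrix game: it computes the minimax value $v_i$ by a small linear program and checks whether a mixed strategy is $\epsilon$-safe against any opponent strategy. This is illustrative code rather than the authors' implementation; the function names and the example payoff values are our own assumptions.

```python
# Illustrative sketch (not from the paper): minimax value and eps-safety check
# for a two-player matrix game, given player i's payoff matrix U[a_i, a_-i].
import numpy as np
from scipy.optimize import linprog

def minimax_value(U):
    """v_i = max over sigma_i of min over opponent actions of E[u_i], via an LP."""
    n_rows, n_cols = U.shape
    # Variables: sigma (n_rows entries) and v. Maximize v  <=>  minimize -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every opponent action j:  v - sum_k sigma_k * U[k, j] <= 0.
    A_ub = np.hstack([-U.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # sigma must be a probability distribution.
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0.0, 1.0)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def is_eps_safe(sigma_i, U, eps):
    """A strategy is eps-safe if its worst-case expected payoff is >= v_i - eps."""
    worst_case = min(sigma_i @ U[:, j] for j in range(U.shape[1]))
    return worst_case >= minimax_value(U) - eps

# Example with illustrative Prisoner's Dilemma-style payoffs for the row player.
U_pd = np.array([[2/3, 0.0],    # row: cooperate
                 [1.0, 1/3]])   # row: defect
print(minimax_value(U_pd))                              # always-defect is the safe strategy
print(is_eps_safe(np.array([0.1, 0.9]), U_pd, eps=0.1)) # a slightly cooperative strategy
```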
The possible payoffs of each stage game are reward $R$ for mutual cooperation, punishment $P$ for mutual defection, sucker $S$ if exploited, and temptation $T$ for exploiting. Table 1 shows the form a two-player matrix game must take to be a social dilemma, such as Prisoner's Dilemma and Stag Hunt.

Reinforcement learning (RL) problems are those which comprise an agent learning how to behave in an environment where it receives rewards for its actions. The environment is represented by a state variable, $s \in S$, and the principal task of the agent is to estimate the value of choosing an action, $a \in A$, given the current state. The goal is then to find an optimal policy stating the action to be taken in a given state to achieve the highest rewards. In multi-agent RL, multiple agents act in the same environment.

A Markov game is a tuple $(N, S, (A_i), T, (R_i), \gamma)$ for $i \in N$: a set of agents $N$ where $|N| > 1$, state space $S$, action spaces $A_i$ with joint action space $A := A_1 \times \dots \times A_{|N|}$, Markovian transition model $T : S \times A \times S \mapsto [0, 1]$, reward functions $R_i : S \times A \times S \mapsto \mathbb{R}$, and a discount factor $\gamma \in [0, 1)$. The goal of an agent $i$ is to select a sequence of actions, or policy $\pi_i$, to maximize the cumulative discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. Such a sequence of actions is called the optimal policy $\pi_i^*$. The joint policy $\pi : S \to \Delta(A)$ is defined as $\pi(a \mid s) := \prod_{i \in N} \pi_i(a_i \mid s)$. The state-value function for agent $i$, or value function, $V_i : S \to \mathbb{R}$ describes the expected value of following policy $\pi_i$ when your opponent follows policy $\pi_{-i}$ from state $s$: $V_i^{\pi_i, \pi_{-i}}(s) = E[\sum_{t \geq 0} \gamma^t r_{i,t} \mid a_{i,t} \sim \pi_i(\cdot \mid s_t), s_0 = s]$, where $-i$ represents all other agents in $N$.

As a mathematical convenience to represent both our safety criteria and our beliefs about the distribution of opponent strategies, we introduce policy-conditioned beliefs. A policy-conditioned belief is a function $\beta_i : \Sigma_i \to \Sigma_{-i}$. Note that these beliefs are only a mathematical convenience and should not be thought of as a literal reflection of our setting: since players move simultaneously, the strategy of player $i$ does not causally affect the other players' strategies on the same time step. However, policy-conditioned beliefs allow us to represent, in a single formalism, both the idea that our opponent is drawn from some fixed distribution and the idea that our opponent might be adversarial. The set of best responses to a policy-conditioned belief $\beta_i$ is defined as $BR(\beta_i) := \arg\max_{\sigma_i \in \Sigma_i} E[u_i(\sigma_i, \beta_i(\sigma_i))]$.

Table 1: Social dilemma payoffs are of the form shown, where $R > P$, $R > S$, $2R > T + S$, and either $T > R$ or $P > S$. Examples of payoff matrices for two-player SSDs (payoffs normalized on $[0, 1]$) are Stag Hunt and Prisoner's Dilemma.

          C          D
  C     (R, R)     (S, T)
  D     (T, S)     (P, P)

A policy-conditioned belief is $\epsilon$-safe if, and only if, for all $\sigma_i \in BR(\beta_i)$, $\sigma_i$ is $\epsilon$-safe. An example of a safe belief is the adversarial policy-conditioned belief; we define this policy-conditioned belief as $\beta^A_i(\sigma_i) := \arg\min_{\sigma_{-i} \in \Sigma_{-i}} E[u_i(\sigma_i, \sigma_{-i})]$.

The safety property is desirable to minimize risk in environments where an adversarial opponent can exploit a policy. However, playing a safe strategy may end up with suboptimal outcomes, such as in Prisoner's Dilemma and Stag Hunt. Thus, we shall classify $\epsilon$-safe strategies that can safely cooperate with other agents in any multi-agent game. The payoff matrix for players is assumed to be normalized on $[0, 1]$ for the purpose of simplicity; otherwise, there is an additional coefficient of the range of payoffs.

Proposition 1. For any two-player game with a Nash equilibrium $\sigma^*$, $\forall \epsilon \in [0, 1]$ and $\forall i \in N$, $\exists \sigma_i \in \Sigma_i$ such that $\|\sigma_i - \sigma^*_i\|_\infty \leq \epsilon$ and $\sigma_i$ is $\epsilon$-safe.

Proof. Since we assumed the payoffs are normalized, we have $\max_{\sigma_i} E[u_i(\sigma_i, BR(\sigma_i))] \leq 1$ and $\min_{\sigma_i} E[u_i(\sigma_i, BR(\sigma_i))] \geq 0$. Let $\sigma_i^{\min}$ be such that $E[u_i(\sigma_i^{\min}, BR(\sigma_i^{\min}))]$ attains this minimum. For any $\sigma_i$ such that $\|\sigma_i - \sigma^*_i\|_\infty \leq \epsilon$, we can bound the expected utility loss by the worst case: $E[u_i(\sigma^*_i, BR(\sigma^*_i))] - E[u_i(\sigma_i, BR(\sigma_i))] \leq E[u_i(\sigma^*_i, BR(\sigma^*_i))] - E[u_i(\sigma_i^{\epsilon m}, BR(\sigma_i^{\epsilon m}))]$, where $\sigma_i^{\epsilon m} = (1 - \epsilon)\sigma^*_i + \epsilon \sigma_i^{\min}$. We can bound the right-hand side of the inequality using $E[u_i(\sigma^*_i, BR(\sigma^*_i))] \leq 1$ and $E[u_i(\sigma_i^{\epsilon m}, BR(\sigma_i^{\epsilon m}))] \geq 1 - \epsilon$. Thus, we have $\epsilon$-safety, as $E[u_i(\sigma^*_i, BR(\sigma^*_i))] - E[u_i(\sigma_i, BR(\sigma_i))] \leq \epsilon$.

Now that we have found conditions for the existence of $\epsilon$-safe strategies, we will prove similar properties for policy-conditioned beliefs. Define a policy-conditioned belief $\beta_i$ to be $\epsilon$-close to a policy-conditioned belief $\beta'_i$ if, and only if, $\max_{\sigma_i \in \Sigma_i} \|\beta_i(\sigma_i) - \beta'_i(\sigma_i)\|_\infty \leq \epsilon$.

Proposition 2.
For any belief $\beta_i$ that is $\epsilon$-close to $\beta^A_i$, $\forall \sigma_i \in \Sigma_i$, $E[u_i(\sigma_i, \beta_i(\sigma_i))] - E[u_i(\sigma_i, \beta^A_i(\sigma_i))] \leq \epsilon$.

Proof. For all $\sigma_i \in \Sigma_i$ and any policy-conditioned belief $\beta_i$ such that $\beta_i$ is $\epsilon$-close to $\beta^A_i$, the belief can be bounded by $\beta_i(\sigma_i) = (1 - \epsilon)\beta^A_i(\sigma_i) + \epsilon \beta^{\max}_i(\sigma_i)$, where $\beta^{\max}_i(\sigma_i) = \arg\max_{\sigma_{-i}} E[u_i(\sigma_i, \sigma_{-i})]$. We can write the expected utility of belief $\beta_i$ as $E[u_i(\sigma_i, \beta_i(\sigma_i))] = E[u_i(\sigma_i, (1 - \epsilon)\beta^A_i(\sigma_i) + \epsilon \beta^{\max}_i(\sigma_i))]$. Since the utility function is convex, we can find an upper bound: $E[u_i(\sigma_i, (1 - \epsilon)\beta^A_i(\sigma_i) + \epsilon \beta^{\max}_i(\sigma_i))] \leq (1 - \epsilon)E[u_i(\sigma_i, \beta^A_i(\sigma_i))] + \epsilon E[u_i(\sigma_i, \beta^{\max}_i(\sigma_i))]$. Rearranging the difference in expected utility of $\beta_i$ and $\beta^A_i$:

$E[u_i(\sigma_i, \beta_i(\sigma_i))] - E[u_i(\sigma_i, \beta^A_i(\sigma_i))] \leq (1 - \epsilon)E[u_i(\sigma_i, \beta^A_i(\sigma_i))] + \epsilon E[u_i(\sigma_i, \beta^{\max}_i(\sigma_i))] - E[u_i(\sigma_i, \beta^A_i(\sigma_i))] \leq \epsilon \left( E[u_i(\sigma_i, \beta^{\max}_i(\sigma_i))] - E[u_i(\sigma_i, \beta^A_i(\sigma_i))] \right) \leq \epsilon (1 - 0) = \epsilon$.

Therefore, the inequality holds.

(For bounded utilities where the greatest range in payoff a player can have is $K$, replace $\epsilon$-safe with $K\epsilon$-safe in Propositions 1 and 3. For $x = (x_1, \dots, x_n)$, the supremum norm is $\|x\|_\infty := \sup\{|x_1|, \dots, |x_n|\}$.)

Thus, beliefs close to $\beta^A_i$ have similar expected utilities. Now we can address the safety property in the same context.

Proposition 3. If the policy-conditioned belief $\beta_i$ is $\epsilon$-close to the adversarial policy-conditioned belief $\beta^A_i$, and utilities are bounded on $[0, 1]$, then $\beta_i$ is $\epsilon$-safe.

Proof. Since $\beta_i$ is $\epsilon$-close to the adversarial policy-conditioned belief $\beta^A_i$, by Proposition 2, $\forall \sigma_i \in \Sigma_i$ we have $E[u_i(\sigma_i, \beta_i(\sigma_i))] - E[u_i(\sigma_i, \beta^A_i(\sigma_i))] \leq \epsilon$. Take $\sigma_i \in BR(\beta_i)$; then $E[u_i(\sigma_i, \beta_i(\sigma_i))] = \max_{\sigma'_i \in \Sigma_i} E[u_i(\sigma'_i, \beta_i(\sigma'_i))] \geq \max_{\sigma'_i \in \Sigma_i} E[u_i(\sigma'_i, \beta^A_i(\sigma'_i))] = v_i$. Thus, $v_i - \epsilon \leq E[u_i(\sigma_i, \beta^A_i(\sigma_i))]$, so every $\sigma_i \in BR(\beta_i)$ is $\epsilon$-safe. Hence $\beta_i$ is $\epsilon$-safe.

Proposition 3 shows that we can mix any policy-conditioned belief with some level of the adversarial policy-conditioned belief and it will be $\epsilon$-safe for some $\epsilon$. Consequently, we can take a policy-conditioned belief that naively cooperates with any opponent and bound its safety by creating an uncertainty: either this belief is true, or we face an adversarial opponent.

In the previous section, we proved that we could take a cooperation-inducing policy-conditioned belief and still maintain the safety property through uncertainty about whether the opponent faced is, in fact, an adversary. Now we must develop a policy-conditioned belief that we know will incentivize cooperative behavior in sequential social dilemmas. First, we consider two-player matrix SSD games.

Let the strategy of player $i$ be $(\alpha, \bar{\alpha})$, where $\alpha \in [0, 1]$ is the intended probability of cooperating in the next round and $\bar{\alpha} \in [0, 1]$ is the probability of cooperating in all subsequent rounds. This format is chosen for simplicity. Similarly, the current and future strategies of their opponent are denoted by $(\beta, \bar{\beta})$. For discounted future returns, define the expected returns $F_i : \Sigma_i \times \Sigma_i \times \Sigma_{-i} \times \Sigma_{-i} \to \mathbb{R}$ as

$F_i((\alpha, \bar{\alpha}), (\beta, \bar{\beta})) := E[u_i(\alpha, \beta)] + \sum_{t=1}^{n-1} \gamma^t E[u_i(\bar{\alpha}, \bar{\beta})]$,

where $\gamma \in (0, 1]$ is the discount factor.
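To make this quantity concrete, the short sketch below evaluates $F_i$ for a matrix SSD with payoffs $R$, $S$, $T$, $P$ as in Table 1. It is an illustrative sketch only, not the paper's code; the function names and the example payoff values are our own assumptions.

```python
# Illustrative sketch of the discounted return F_i((alpha, alpha_bar), (beta, beta_bar)):
# alpha/beta are this-round cooperation probabilities, alpha_bar/beta_bar the levels
# assumed for all later rounds. Payoff symbols R, S, T, P follow Table 1.
def expected_stage_utility(p_coop_i, p_coop_j, R, S, T, P):
    """E[u_i] when i cooperates w.p. p_coop_i and the opponent w.p. p_coop_j."""
    return (p_coop_i * p_coop_j * R
            + p_coop_i * (1 - p_coop_j) * S
            + (1 - p_coop_i) * p_coop_j * T
            + (1 - p_coop_i) * (1 - p_coop_j) * P)

def discounted_return(alpha, alpha_bar, beta, beta_bar, R, S, T, P, gamma, n):
    """F_i: this round's expected utility plus gamma-discounted future rounds."""
    future_weight = sum(gamma ** t for t in range(1, n))
    return (expected_stage_utility(alpha, beta, R, S, T, P)
            + future_weight * expected_stage_utility(alpha_bar, beta_bar, R, S, T, P))

# E.g. always-cooperate against an opponent who defects now but cooperates later
# (payoff values here are illustrative, not the paper's normalized matrices).
print(discounted_return(1.0, 1.0, 0.0, 1.0, R=2/3, S=0.0, T=1.0, P=1/3, gamma=0.9, n=100))
```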
A policy-conditioned belief in the sequential game is a function $\beta_i : \Sigma_i \times \Sigma_i \to \Sigma_{-i} \times \Sigma_{-i}$. Although opponent strategies are unseen, suppose that a player believes that their chosen strategy for the next round will change the strategy of their opponent for all subsequent rounds, depending on their level of cooperation. For some $x \in (0, 1]$, if player $i$ chooses to cooperate with at least proportion $x$, then their opponent's future cooperation level will not decrease, and for cooperation less than $x$, their opponent's level of cooperation will not increase in future rounds. Let player $i$ hold such a policy-conditioned belief $\beta^C_i$, where $C$ stands for cooperation-promoting, formally defined as

$\beta^C_i(\alpha) := (\beta, \beta^+)$ if $\alpha \geq x$, and $(\beta, \beta^-)$ if $\alpha < x$,

for some $x \in (0, 1]$, where $\beta \leq \beta^+$ and $\beta \geq \beta^-$. This belief is similar to a belief that the opponent plays tit-for-tat, with the additional condition that the opponent can view intended mixed strategies and is willing to invest or reject risk capital in the long term.

Let us find the necessary conditions on $\beta^C_i$ for cooperation to be an equilibrium strategy against similar agents, and hence for it to be a cooperation-inducing policy-conditioned belief as required. For cooperation to occur, we require that defection is not a best response, i.e. $\exists \alpha > 0$ such that $F_i((\alpha, \bar{\alpha}), \beta^C_i(\alpha)) \geq F_i((0, \bar{\alpha}), \beta^C_i(0))$.

Proposition 4. Any matrix SSD with a policy-conditioned belief $\beta^C_i$ whose $\beta, \beta^+, \beta^-$ satisfy

$\alpha\beta(R + P - S - T) + \alpha(S - P) + \sum_{t=1}^{n-1} \gamma^t (\beta^+ - \beta^-)\left[\bar{\alpha}(R + P - S - T) + T - P\right] \geq 0$

for some $\bar{\alpha} \in [0, 1]$ has cooperation as a best response.

Proof. For cooperation to be a best response, we must have $F_i((\alpha, \bar{\alpha}), \beta^C_i(\alpha)) \geq F_i((0, \bar{\alpha}), \beta^C_i(0))$. We can write the expected returns of cooperating with positive probability in terms of payoffs as

$F_i((\alpha, \bar{\alpha}), \beta^C_i(\alpha)) = \alpha\beta R + \alpha(1 - \beta)S + \beta(1 - \alpha)T + (1 - \alpha)(1 - \beta)P + \sum_{t=1}^{n-1} \gamma^t \left[\bar{\alpha}\beta^+ R + \bar{\alpha}(1 - \beta^+)S + \beta^+(1 - \bar{\alpha})T + (1 - \bar{\alpha})(1 - \beta^+)P\right]$.

The expected return of defecting ($\alpha = 0$) is

$F_i((0, \bar{\alpha}), \beta^C_i(0)) = \beta T + (1 - \beta)P + \sum_{t=1}^{n-1} \gamma^t \left[\bar{\alpha}\beta^- R + \bar{\alpha}(1 - \beta^-)S + \beta^-(1 - \bar{\alpha})T + (1 - \bar{\alpha})(1 - \beta^-)P\right]$.

Substituting these into the inequality, we get

$\alpha\beta(R + P - S - T) + \alpha(S - P) + \sum_{t=1}^{n-1} \gamma^t (\beta^+ - \beta^-)\left[\bar{\alpha}(R + P - S - T) + T - P\right] \geq 0$,

as given.

If $\beta(R + P - S - T) + (S - P) > 0$, then $E[u_i(\alpha, \bar{\alpha}, \beta, \bar{\beta})]$ is increasing in $\alpha$. As such, full cooperation will be dominant, so $i$ will play $\alpha = 1$. Otherwise, they will play $\alpha = x$, in which case $x$ should be set to 1 to induce fully cooperative behavior. Here $\beta^+ - \beta^-$ is the change in cooperation that the next strategy will induce in the opponent's future strategy.

These beliefs can be summarized as those of players who believe their next action will affect their opponents' future strategies such that they prefer to cooperate now to avoid a reduction in future utility. We call a strategy that follows this system $\beta^C$. Other beliefs from the literature that are cooperation-promoting could also be substituted here, such as translucency [2].

In the previous sections, we described how to use policy-conditioned beliefs to promote both safety and cooperation. In this section, we will use these ideas to show how the two concepts are necessarily in tension, but how this tension disappears in the long run. To make this concrete, we will define the value of cooperation to be the value expected against the policy-conditioned belief $\beta^C_i$, that is, $V_C(\sigma_i) = E[u_i(\sigma_i, \beta^C_i(\sigma_i))]$. Similarly, we can think of the value of safety as the value expected against the policy-conditioned belief $\beta^A_i$, that is, $V_A(\sigma_i) = E[u_i(\sigma_i, \beta^A_i(\sigma_i))]$. We will define $V_C = \max_{\sigma_i} V_C(\sigma_i)$ to be the optimal cooperative value and $V_A = \max_{\sigma_i} V_A(\sigma_i)$ to be the optimal safe value.

For the sake of simplicity, we will assume that $\beta = \bar{\beta}^- = 0$ and $x = \bar{\beta}^+ = 1$. Under this belief, the cooperation-promoting policy would defect on the first step. Thus, the safety that your policy loses over the first two steps is proportional to the probability that you cooperate, and the amount that you lose against the cooperation-promoting policy over the first two steps for defecting is proportional to the probability that you defect.

However, when we look at the third step, the trade-off is less severe. If $\sigma_i$ cooperated with proportion $\alpha$ on the first time step and $\alpha$ is high enough that the cooperation-promoting policy cooperates with it on the second time step, this increases the budget of safety for $\sigma_i$, which is now able to cooperate at $\alpha + E[u_i(\sigma_i, \sigma_{-i})] - v_i$ without exceeding the safety budget. If the policy were any more cooperative, the adversary would be motivated to cooperate on the second round in order to exploit it in future rounds.

This pattern continues until it is safe for our policy to cooperate deterministically, at which point our policy will continue to cooperate for the rest of the game. This means that the trade-off between safety and performance with cooperation-promoting policies can be characterized by the probability of not cooperating early on, which over the long run is a negligible fraction of overall performance. This is captured formally in Proposition 5.

Proposition 5. Suppose that defection is a safe strategy and cooperation is socially optimal. Let $\sigma_i$ be $\epsilon$-safe. Let $\alpha_t$ be the probability that $\sigma_i$ cooperates on round $t$ against a cooperation-promoting belief, and assume $E[R_t] \leq c\alpha_{t-1} + v_i$ for some constant $c > 0$, as is the case when $\beta = \bar{\beta}^- = 0$. Then

$V_C - V_C(\sigma_i) \geq \frac{I}{n} V_C - c\epsilon \frac{1 - \Phi_C^{I+1}}{1 - \Phi_C}$,

where $C = \frac{c}{P - S}$, $\Phi_x = \frac{1 + \sqrt{1 + 4x}}{2}$ (notated as such because $\Phi_1$ is the golden ratio), and $I = \min\{\lceil -\log_{\Phi_C}(\epsilon) \rceil, n\}$. Moreover, when $E[R_t] = c\alpha_{t-1} + v_i$, there is an $\epsilon$-safe policy $\sigma_i$ which makes this bound tight.

Proof. Let $\sigma_i$ be $\epsilon$-safe, so $\epsilon = V_A - V_A(\sigma_i)$. The result follows from the fact that cooperation at each round gives a bound for safety, which we define to be $\hat{\alpha}_t$, as follows:

$\alpha_t \leq \frac{\epsilon + \sum_{s=0}^{t-1} E[R_s] - t v_i}{P - S} \leq \frac{\epsilon + \sum_{s=0}^{t-1} (c\alpha_{s-1} + v_i) - t v_i}{P - S} = \hat{\alpha}_t$.

If we take these $\hat{\alpha}_t$ to inductively define a policy $\sigma_i$, then when $E[R_t] = c\alpha_{t-1} + v_i$, $\sigma_i$ can be proven to achieve the desired equality by changing all of the inequalities in the following proof to equalities. By substitution we have:

$\hat{\alpha}_t = \frac{\epsilon + \sum_{s=0}^{t-1} (c\alpha_{s-1} + v_i) - t v_i}{P - S} = \frac{\epsilon + c\sum_{s=0}^{t-1} \alpha_{s-1}}{P - S} = \frac{\epsilon + c\alpha_{t-2} + c\sum_{s=0}^{t-2} \alpha_{s-1}}{P - S} = \hat{\alpha}_{t-1} + \frac{c}{P - S}\alpha_{t-2} \leq \hat{\alpha}_{t-1} + C\hat{\alpha}_{t-2}$.

We can then show by induction that $\hat{\alpha}_t \leq \epsilon\Phi_C^t$. In the base case, $\hat{\alpha}_0 = \epsilon = \epsilon\Phi_C^0$, and in the inductive case:

$\hat{\alpha}_t \leq \hat{\alpha}_{t-1} + C\hat{\alpha}_{t-2} \leq \epsilon\Phi_C^{t-1} + C\epsilon\Phi_C^{t-2} = \epsilon(\Phi_C + C)\Phi_C^{t-2} = \epsilon\Phi_C^t$,

where the last equality follows from the fact that $\Phi_C + C = \Phi_C^2$. This inequality, together with $\alpha_t \leq 1$, allows us to bound $V_C - V_C(\sigma_i)$ by simply summing the expected rewards:

$V_C(\sigma_i) = \sum_{t=0}^{n} E[R_t] \leq c\sum_{t=0}^{n} \alpha_{t-1} + n v_i \leq c\sum_{t=0}^{n} \min\{\epsilon\Phi_C^t, 1\} + V_A$.

Since the socially optimal behavior is deterministic cooperation, the difference from optimal behavior only occurs when $\epsilon\Phi_C^t < 1$, i.e. $t < -\log_{\Phi_C}(\epsilon)$. If $-\log_{\Phi_C}(\epsilon)$ is larger than $n$, the second term of the min never occurs in the summation. Thus, the point where the summation starts using the second term is $I = \min\{\lceil -\log_{\Phi_C}(\epsilon) \rceil, n\}$. We plug this into the final inequality and use the geometric series formula to get:

$V_C(\sigma_i) \leq c\sum_{t=0}^{I} \epsilon\Phi_C^t + V_C\left(1 - \frac{I}{n}\right) = c\epsilon\frac{1 - \Phi_C^{I+1}}{1 - \Phi_C} + V_C\left(1 - \frac{I}{n}\right)$.

Subtracting both sides from $V_C$ gives the desired result.

It is important to note that this trade-off, after a certain point, does not grow with $n$. Moreover, the return on cooperation value for small reductions in the optimal safety value grows very quickly, as the small loss in safety can effectively be reinvested at each successive time step as the policy receives the gains from cooperation. Thus, in long iterated games, the optimal policy for this objective is nearly optimal for the designer to deploy into either fully adversarial settings or fully cooperative settings. In the next section, we take this core insight as the motivation for an algorithm that uses the idea of reinvesting this risk capital in order to achieve high degrees of both safety and cooperation.

In SSDs, if an opponent is cooperative, surplus payoff above the value of the game can be reinvested for higher long-term gain. However, we also need to be able to invest safely to avoid exploitation. To begin with, we give agents an adversarial policy-conditioned belief to avoid exploitation and maintain safety. For any surplus capital gained, $\epsilon$, we can deviate our safe beliefs to be $\epsilon$-safe and begin to reinvest to build trust with opponents. One such belief is $\beta^{\epsilon C}_i := (1 - \epsilon)\beta^A_i + \epsilon\beta^C_i$, where the sequential adversarial belief is $\beta^A_i(\sigma) = \arg\min_{\sigma_{-i} \in \Sigma_{-i}} E[u_i(\sigma_i, \sigma_{-i})]$, and $\beta, \beta^+, \beta^-$ satisfy the conditions in Proposition 4. By Proposition 3, $\beta^{\epsilon C}_i$ is $\epsilon$-safe.

Again, consider the two-player matrix SSDs of the form in Table 1. If $\beta^A_i(\sigma_i) = (0, 0, 0, 0)$, which is true for both Stag Hunt and Prisoner's Dilemma, then $\forall \sigma_i \in \Sigma_i$ we can write the belief $\beta^{\epsilon C}_i$ as

$\beta^{\epsilon C}_i(\alpha) = (\epsilon\beta, \epsilon\beta^+)$ if $\alpha \geq x$, and $(\epsilon\beta, \epsilon\beta^-)$ if $\alpha < x$.

The condition for cooperation to occur is now:

$\alpha(P - S) - \epsilon\alpha\beta(R + P - S - T) \leq \sum_{t=1}^{n-1} \gamma^t \epsilon(\beta^+ - \beta^-)\left[\bar{\alpha}(R + P - S - T) + T - P\right]$.   (1)

Naturally, we choose $\alpha, \beta, \beta^+, \beta^-$ such that this holds for the smallest possible $\epsilon$, so that we need the least amount of risk to allow cooperation to be a best response to these beliefs. Pseudocode for Accumulating Risk Capital Through Investing in Cooperation is given in Algorithm 1. Following from results in [6], this algorithm is safe.

We can think of $\epsilon$ as representing the amount of risk capital we are willing to invest in an opponent. For safe play in a sequential game, we begin with no risk capital and play the minimax strategy, since we assume our opponent is adversarial. (We could forgo perfect safety in the first round to increase cooperation between agents initially, for better long-term rewards with only a small amount of risk.)
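The update loop just described can be sketched in a few lines of Python for the matrix-SSD case (the formal pseudocode is given in Algorithm 1 below). This is an illustrative sketch under simplifying assumptions rather than the authors' implementation: the grid search over $\alpha$, the simplification $\bar{\alpha} = \alpha$, the opponent interface, and the example payoff values are all our own choices.

```python
# Illustrative ARCTIC loop for a two-player matrix SSD (see Algorithm 1 below).
import numpy as np

def stage_utility(p_i, p_j, R, S, T, P):
    # Expected payoff to i when i cooperates w.p. p_i and the opponent w.p. p_j.
    return p_i*p_j*R + p_i*(1-p_j)*S + (1-p_i)*p_j*T + (1-p_i)*(1-p_j)*P

def arctic_play(opponent_action_fn, R, S, T, P, v_i, x, beta, beta_plus, beta_minus,
                gamma=0.9, horizon=100, rng=np.random.default_rng(0)):
    eps = 0.0                                   # risk capital: start fully safe
    history = []
    for t in range(horizon):
        def belief_value(alpha):
            # Belief beta^{eps C}: opponent cooperates at eps*beta now, and at
            # eps*beta_plus (resp. eps*beta_minus) later if alpha >= x (resp. < x).
            future_beta = eps * (beta_plus if alpha >= x else beta_minus)
            future_weight = sum(gamma**k for k in range(1, horizon - t))
            return (stage_utility(alpha, eps*beta, R, S, T, P)
                    + future_weight * stage_utility(alpha, future_beta, R, S, T, P))
        # Best response to the belief, found by scanning candidate cooperation levels.
        alpha = max(np.linspace(0, 1, 101), key=belief_value)
        a_i = bool(rng.random() < alpha)        # True = cooperate
        a_j = opponent_action_fn(t, history)    # opponent's (unknown) policy
        payoff = {(True, True): R, (True, False): S,
                  (False, True): T, (False, False): P}[(a_i, a_j)]
        eps = min(eps + payoff - v_i, 1.0)      # reinvest winnings as risk capital
        history.append((a_i, a_j))
    return history

# Example: play against tit-for-tat with illustrative normalized PD payoffs.
tit_for_tat = lambda t, hist: True if t == 0 else hist[-1][0]
arctic_play(tit_for_tat, R=2/3, S=0.0, T=1.0, P=1/3, v_i=1/3,
            x=0.9, beta=1.0, beta_plus=1.0, beta_minus=0.0)
```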
Figure 1: Simulations of ARCTIC playing different opponents in 100 rounds of Prisoner's Dilemma with various $x$. Against the adversarial strategy, ARCTIC does not learn to cooperate, as there is not enough risk capital gained from their interactions. When two ARCTIC players interact, the risk capital slowly builds over time for all $x$ as they both play cautiously.

Algorithm 1: ARCTIC
  Initialize $x \in (0, 1]$, $\beta \in [0, 1]$, $\beta^+$, $\beta^-$; $\epsilon \leftarrow 0$; $v_i \leftarrow$ minimax value
  for $t = 1$ to $T$ do
    $\beta_i \leftarrow (1 - \epsilon)\beta^A_i + \epsilon\beta^C_i$
    $\sigma_i \leftarrow \arg\max E[u_i(\sigma_i, \beta_i(\sigma_i))]$
    $i$ plays $a_i$ from $\sigma_i$; $-i$ plays $a_{-i}$ from unknown $\sigma_{-i}$
    $\epsilon \leftarrow \min(\epsilon + E[u_i(a_i, a_{-i})] - v_i, 1)$
  end

As the game goes on, the amount of risk capital will increase against non-adversarial opponents, and we can safely invest such risk capital with the expectation of a return on the investment. Gradually, we build trust with similar opponents that reciprocate collaborative behaviors. If our opponent has been cooperating in the past, then there is enough risk capital for ARCTIC to cooperate with such an opponent. However, if they then defect, the amount of risk capital drops and ARCTIC is more likely to defect in the next round, similar to how a tit-for-tat strategy punishes defections. If the risk capital is high enough, then punishment of a defection is less common, since ARCTIC has learned to trust the good behavior of its opponent. This mechanism stops the strategy from being exploited by adversaries whilst maximizing cooperation with allies.

To test the algorithm's performance, we simulated an ARCTIC agent for 100 rounds of Prisoner's Dilemma and Stag Hunt, where its opponent either followed a simple strategy or best responded to its own policy-conditioned beliefs. The simple strategies played against were tit-for-tat (T4T) and pure defector (Adv), and the policy-conditioned belief opponents followed either ARCTIC
or the cooperation-inducing belief $\beta^C$. Figures 1 and 2 show the cooperation levels and risk capital $\epsilon$ for these experiments. Experiment parameters were: random action noise 5%, $\bar{\alpha} = \alpha$, and discount factor $\gamma = 0.9$. Results were then averaged over 200 runs.

Figure 2: Simulations of 100 rounds of Stag Hunt whilst playing the ARCTIC strategy against 5 different opponent strategies. For cooperation to occur here, we need $x$ and $\beta$ to be large enough to satisfy condition (1).

In Prisoner's Dilemma (Figure 1), the ARCTIC agent quickly learns to cooperate at rate $x$ against the cooperation-incentivized T4T and $\beta^C$ players. When playing itself, the amount of risk capital, $\epsilon$, increases more gradually, since both agents are learning to trust more cautiously than a T4T or $\beta^C$ player. Against the adversarial player, the cooperation levels are very low, and thus the ARCTIC agent maintains the safety property.

Similarly, in Stag Hunt (Figure 2) the ARCTIC agent learns to cooperate with all but the adversary. For $x$ and $\beta$ large enough to satisfy condition (1), ARCTIC learns to fully cooperate by the end of the 100 rounds of play. Cooperation is much easier to achieve in Stag Hunt than in Prisoner's Dilemma, since defect is not a dominant strategy. Against the adversary, not enough risk capital is collected for the best response to be cooperation, which preserves the safety of the strategy.

Here, we have considered ARCTIC for payoffs normalized on $[0, 1]$, as seen in Table 1. For games where payoffs have a range greater than 1, the $\epsilon$ update step should be $\epsilon \leftarrow \min(\epsilon + K(E[u_i(a_i, a_{-i})] - v_i), 1)$, where $K$ is the greatest difference between possible payoffs, in order to normalize risk capital to be in $[0, 1]$.

Figure 3: Cooperation of ARCTIC against different opponents over 100 rounds of Prisoner's Dilemma with initial risk capital of 0 (left) and 1 (right). Against the adversarial and baseline opponents, ARCTIC learns not to cooperate, whereas when played against $\beta^C$ it cooperates around $x$. Against itself, it cooperates at lower but increasing levels.

SSDs are often subgames of more complex multiplayer games. These types of games are of particular interest in multi-agent RL due to the difficulty of learning cooperative policies. In more complex games where teamwork is required, it is only sequences of actions that create cooperative or selfish behaviors. Thus, we consider long-term behaviors in order to capture the nature of the agents.

We assume that the minimax value, $v$, of the game can be determined. An opponent can therefore detect whether the agent is cooperating by measuring whether its rewards are at least the value of the game. The level of cooperation is now $x_{i,t} := \sum_{s=0}^{t} \gamma^{t-s} \mathbb{1}[r_{i,s} > v]$. Let player $i$'s cooperation-inducing policy-conditioned belief $\beta^C_i$ be defined as

$\beta^C_i(\pi_i) := (\pi_{-i}, \pi^+_{-i})$ if $x_{i,t} \geq x$, and $(\pi_{-i}, \pi^-_{-i})$ otherwise,

where $V_i^{\pi_i, \pi^+_{-i}}(s) > V_i^{\pi_i, \pi^-_{-i}}(s)$, for some threshold $x \in (0, 1]$. If the cooperation level $x_{i,t}$ is above a certain level, then the opponent will behave cooperatively. Otherwise, it will act in its own self-interest. To train agents with these cooperative beliefs, we can adapt the reward functions of their opponents as follows:

$\tilde{r}_{-i,t} \leftarrow r_{i,t} + r_{-i,t}$ if $x_{i,t} \geq x$, and $\tilde{r}_{-i,t} \leftarrow r_{-i,t}$ otherwise.

To train an agent with a policy-conditioned belief, it can be trained in an environment where those beliefs are true and then transferred into the standard environment for deployment.

We trained distributed asynchronous advantage actor-critic (A3C) [17] agents on Prisoner's Dilemma and Stag Hunt environments. Agent policies were trained with the policy-conditioned beliefs $\beta^C$ and ARCTIC, with $\bar{\alpha} = \alpha$. Agents without beliefs were the Baseline agent (unmodified A3C) and the Adv agent. The neural network consists of two fully connected layers of size 32 and a Long Short-Term Memory (LSTM) recurrent layer [8].
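The reward modification above can be sketched as a small wrapper around the opponent's reward signal. This is illustrative only, not the authors' training code: the function names and the exact form of the discounted cooperation level (an unnormalized sum of indicators) are our own assumptions.

```python
# Illustrative sketch of the opponent reward shaping used to instantiate the
# cooperation-promoting belief during training.
def cooperation_level(rewards_i, game_value, gamma):
    """Discounted count of rounds in which agent i's reward exceeded the game value
    (the normalization of this quantity is an assumption)."""
    t = len(rewards_i) - 1
    return sum(gamma ** (t - s) * (r > game_value) for s, r in enumerate(rewards_i))

def shaped_opponent_reward(r_i, r_minus_i, rewards_i_history, game_value,
                           x_threshold, gamma):
    """Opponent trains on the joint reward only while agent i keeps cooperating."""
    if cooperation_level(rewards_i_history, game_value, gamma) >= x_threshold:
        return r_i + r_minus_i   # cooperative belief holds: share agent i's reward
    return r_minus_i             # otherwise the opponent is purely self-interested
```

A learner trained against an opponent whose rewards are shaped this way experiences an environment in which the cooperation-promoting belief is true, and can then be deployed in the unmodified environment, as described above.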
The learning rate for the Baseline, Adv, and $\beta^C$ agents was 0.001. For ARCTIC, the learning rates were 0.00007 and 0.0001 for Prisoner's Dilemma and Stag Hunt respectively. The entropy coefficient was 0.01. The state space for the ARCTIC agents was a one-hot encoded $\epsilon$ value. Each agent was trained on 3 different random seeds, and results are averaged across these policies over 300 rollouts.

The difficulty of playing Prisoner's Dilemma with a generic multi-agent RL algorithm is that defection is a strictly dominant strategy, so agents usually converge to defecting. This means that a mechanism for agents to cooperate must be used to promote cooperation, which leaves them open to exploitation. By using ARCTIC here, the agent still acts optimally, but it acts optimally with respect to its policy-conditioned belief.

Table 2 shows the cumulative rewards for trained agents after 100 rounds of playing Prisoner's Dilemma. The baseline agent performs poorly against itself due to its inability to cooperate, whereas the ARCTIC agent cooperates with itself some of the time and achieves a more socially optimal outcome. With the $\beta^C$ player, ARCTIC cooperates at higher levels than with itself. Against the adversary, ARCTIC achieves close to the value of the game on average.

Table 2: Agent scores for 100 rounds of Prisoner's Dilemma.

            Baseline        ARCTIC          $\beta^C$       Adv
Baseline    25.01, 25.01    25.54, 24.83    70.29, 9.91     25.00, 25.01
ARCTIC      24.83, 25.54    34.12, 34.12    57.84, 46.72    24.82, 25.56

Table 3: Agent scores for 100 rounds of Prisoner's Dilemma with $\epsilon = 1$.

            Baseline        ARCTIC          $\beta^C$       Adv
ARCTIC      24.65, 26.05    45.35, 45.35    56.23, 50.46    24.64, 26.06

Figure 4: Cooperation of ARCTIC against different opponents over 100 rounds of Stag Hunt when starting with risk capital of 0 (left) or 1 (right). In both cases, ARCTIC cooperates the most with $\beta^C$ and baseline agents and the least with adversaries.
For an RL agent playing Stag Hunt, multi-agent RL algorithms learn to cooperate more effectively than in Prisoner's Dilemma, but this leaves them exploitable by adversaries. The scores for agents in the Stag Hunt tournament can be found in Table 4. Here, the baseline agent performs well against itself but achieves a poor score against adversaries. On the other hand, ARCTIC agents achieve the value of the game against adversaries.
Table 4: Agent scores for 100 rounds of Stag Hunt.
            Baseline        ARCTIC          $\beta^C$       Adv
Baseline    99.53, 99.53    78.61, 94.31    99.78, 99.34    0.13, 74.84
ARCTIC      94.31, 78.61    27.02, 27.02    93.52, 78.17    24.85, 25.31

With $\epsilon = 1$, the ARCTIC agent is able to achieve more socially optimal outcomes against all players except for the adversary, where it achieves marginally less than when $\epsilon = 0$. See the results in Table 5.

Table 5: Agent scores for 100 rounds of Stag Hunt with $\epsilon = 1$.

            Baseline        ARCTIC          $\beta^C$       Adv
ARCTIC      95.19, 81.92    73.36, 73.36    94.56, 82.03    24.60, 25.82
In summary, we studied the trade-off between cooperation and safety, first showing how to unify these two objectives in the formalism of policy-conditioned beliefs and then characterizing a trade-off between them. We find that small risks to safety can lead to large returns in cooperation. We made this trade-off more intuitive through the idea of risk capital, viewing cooperation as the compounding returns on its investment. We use this intuition to build Accumulating Risk Capital Through Investing in Cooperation (ARCTIC), which enacts this trade-off, achieving safe cooperation in iterated Prisoner's Dilemma and Stag Hunt.

Cooperating while maintaining approximate safety allows us to design agents that individual developers would want to use out of their own self-interest. This is a promising development but leaves open questions that will be important in more complex environments. For instance, when there are many styles of successful cooperative strategies, agent designers would need to coordinate on a particular style of cooperation or build their agents to be adaptive to other agents' techniques. Moreover, although our method protects the agent against adversaries, it does not protect the agent against exploitative agents, who want to maximize their own reward, which happens to come at the cost of our reward. Although an ARCTIC agent will not achieve less than $v_i - \epsilon$ in expectation, it could be exploited into accepting less than its fair share of the reward as long as it receives more than $v_i$. This becomes more complex when combined with the coordination problems, as different coordination solutions could have different payouts, which must somehow be distinguished from exploitative strategies. Extending the ideas of risk capital to these settings is left to future work.

There are also interesting challenges in scaling ARCTIC to larger environments. Our method is currently reliant on both knowing an expected minimax value and a clear notion of cooperation. In larger environments, these are both less accessible. To extend ARCTIC to these settings, either environment features would have to be estimated or the reliance on these features would have to be removed. Ultimately, addressing these issues could lead to general algorithms for safe multi-agent cooperation.

ACKNOWLEDGMENTS
This work was supported by the Center for Human-Compatible AI and the Open Philanthropy Project. We are grateful for funding of this work as a gift from the Berkeley Existential Risk Initiative.
REFERENCES
[1] Robert Axelrod and William Donald Hamilton. 1981. The evolution of cooperation. Science.
[2] Rationality and Society 31, 4 (2019), 371-408. https://doi.org/10.1177/1043463119885102
[3] Celso M. de Melo, Peter Khooshabeh, Ori Amir, and Jonathan Gratch. 2018. Shaping Cooperation between Humans and Agents with Emotion Expressions and Framing. In Proc. of the 17th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2018). 2224-2226.
[4] Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, and Joel Z. Leibo. 2019. Learning Reciprocity in Complex Sequential Social Dilemmas. arXiv preprint arXiv:1903.08082 (2019).
[5] Tom Eccles, Edward Hughes, Janos Kramar, Steven Wheelwright, and Joel Z. Leibo. 2019. The Imitation Game: Learned Reciprocity in Markov games. 1934-1936. https://doi.org/10.1007/s00182-013-0370-1
[6] Sam Ganzfried and Tuomas Sandholm. 2012. Safe opponent exploitation. Proceedings of the Adaptive and Learning Agents Workshop 2012, ALA 2012, held in conjunction with the 11th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2012 1, 212 (2012), 119-126.
[7] Roy Gardner, Elinor Ostrom, and James Walker. 1984. Social capital and cooperation: Communication, bounded rationality, and behavioral heuristics. In Social Dilemmas and Cooperation. Springer, 375-411. https://doi.org/10.1007/978-3-642-78860-4_20
[8] Felix A. Gers, Jurgen Schmidhuber, and Fred Cummins. 1999. Learning to Forget: Continual Prediction with LSTM. (1999), 850-855. https://doi.org/10.1385/ABAB:94:2:127
[9] Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. 2019. Adversarial Policies: Attacking Deep Reinforcement Learning. (2019), 1-16. http://arxiv.org/abs/1905.10615
[10] Nikoleta E. Glynatsi and Vincent A. Knight. 2020. Using a theory of mind to find best responses to memory-one strategies. Scientific Reports 10, 1 (2020), 1-9. https://doi.org/10.1038/s41598-020-74181-y
[11] Rens Hoegen, Giota Stratou, and Jonathan Gratch. 2017. Incorporating emotion perception into opponent modeling for social dilemmas. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS. 801-809.
[12] Edward Hughes, Joel Z. Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, Heather Roff, and Thore Graepel. 2018. Inequity aversion improves cooperation in intertemporal social dilemmas. Advances in Neural Information Processing Systems.
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS.
Bulletin of Mathematical Biology 71, 8 (2009), 1818-1850. https://doi.org/10.1007/s11538-009-9424-8
[17] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning.
Proceedings of the National Academy of Sciences of the United States of America.
Trends in Cognitive Sciences 17, 8 (2013), 413-425. https://doi.org/10.1016/j.tics.2013.06.003
[20] Anatol Rapoport, Albert M. Chammah, and Carol J. Orwant. 1965. Prisoner's Dilemma: A Study in Conflict and Cooperation. University of Michigan Press.
[21] Sven Van Segbroeck, Steven de Jong, Ann Nowe, Francisco C. Santos, and Tom Lenaerts. 2010. Learning to coordinate in complex networks.