Modelling Cooperation in Network Games with Spatio-Temporal Complexity
Michiel A. Bakker*, Richard Everett*, Laura Weidinger, Iason Gabriel, William S. Isaac, Joel Z. Leibo, Edward Hughes
*These authors contributed equally.
DeepMind. [email protected], {reverett,lweidinger,iason,jzl,williamis,edwardhughes}@google.com
ABSTRACT
The real world is awash with multi-agent problems that require collective action by self-interested agents, from the routing of packets across a computer network to the management of irrigation systems. Such systems have local incentives for individuals, whose behavior has an impact on the global outcome for the group. Given appropriate mechanisms describing agent interaction, groups may achieve socially beneficial outcomes, even in the face of short-term selfish incentives. In many cases, collective action problems possess an underlying graph structure, whose topology crucially determines the relationship between local decisions and emergent global effects. Such scenarios have received great attention through the lens of network games. However, this abstraction typically collapses important dimensions, such as geometry and time, relevant to the design of mechanisms promoting cooperation. In parallel work, multi-agent deep reinforcement learning has shown great promise in modelling the emergence of self-organized cooperation in complex gridworld domains. Here we apply this paradigm in graph-structured collective action problems. Using multi-agent deep reinforcement learning, we simulate an agent society for a variety of plausible mechanisms, finding clear transitions between different equilibria over time. We define analytic tools inspired by related literatures to measure the social outcomes, and use these to draw conclusions about the efficacy of different environmental interventions. Our methods have implications for mechanism design in both human and artificial agent systems.
1 INTRODUCTION
Collective action problems are ubiquitous in the modern world and represent an important challenge for artificial intelligence. Routing packets across computer networks [53], irrigating large tracts of land with multiple owners [38] and coordinating autonomous vehicles [44] are all vitally important mixed-motive problems, where individual incentives are not automatically aligned yet multi-agent cooperation is necessary to achieve beneficial outcomes for all. Often, such problems have a latent graph which structures the strategic interaction of the participants, such as the network cables linking computers, the irrigation channels supplying water, and the road system for vehicles. The field of network games [22] studies interactions between rational agents in the context of different topologies. However, network games abstract away crucial details informing the decision-making of agents, such as the geometry of the world and the temporal extent of strategic choices.
Multi-agent deep reinforcement learning provides precisely the right tools to fill this gap in the literature, allowing us to simulate rational learned behavior as we vary properties of a spatio-temporally complex world. One approach to solving these problems is to invoke a single, centralized objective. However, this approach suffers from the "lazy agent" problem [47], where some agents learn to do nothing due to incorrect credit assignment. Moreover, it does not generalize to situations involving interaction with humans. We must therefore look to uncover methods that incentivise cooperative behavior among rational self-interested agents.

Game theory provides a precise framework for modelling the strategic choices of self-interested agents. There are two broad classes of interventions that promote cooperation: behavioural economics models of agent incentives (e.g. inequity aversion [15]), and mechanism design for the game itself (e.g. auction design [27]). Traditional game-theoretic models do not capture the rich cocktail of social-ecological factors that govern success or failure in real-world collective action problems, such as social preferences, geography, opportunity cost and learnability [39]. With the advent of deep reinforcement learning, it has become possible to study these processes in detail [28, 40]. Thus far, this work has mostly focused on the emergence of cooperation inspired by behavioral psychology [11, 20, 24] and algorithmic game theory [3, 16, 29]. We instead take a perspective motivated by mechanism design.

In the most general sense, mechanism design examines how institutions can be constructed that result in better social behavior by groups of agents [21, 33]. Classically, a mechanism designer chooses a payoff structure to optimize some social objective. The payoff structure is known as a mechanism. Network games in particular offer a rich space of design possibilities, by virtue of the underlying graph topology; see for example [1]. In spatio-temporally extended settings, we can naturally generalize the notion of a mechanism to generate a richer space, comprising not only interventions on the utility functions experienced by agents, but also subtle changes to the geographical layout and to the behavior of environmental components over time. Traditional analytic tools cannot simulate the effects of such mechanisms, for they reduce the complex environment to a more tractable abstract network game. Here we apply multi-agent deep reinforcement learning techniques to perform a simulation in a complex gridworld, and explore the effect of a wide range of novel manipulations. This opens the question of automatically selecting mechanisms for particular objectives, which we leave for further study.
Our contribution. The aims of this paper are (1) to introduce a new method for modelling the behavior of self-interested agents in collective action problems with topological, geometrical and temporal structure, and (2) to use this method to draw conclusions relevant for mechanism design that promotes cooperative behavior. Our method comprises a rigorous protocol for intervening on a spatio-temporally complex environment, and modelling the effects on social outcomes for rational agents via multi-agent deep reinforcement learning. To illustrate this general method, we introduce a new gridworld domain called Supply Chain, in which agents are rewarded for processing goods according to a given network structure. The actions of agents are interdependent: downstream agents rely on upstream agents to deliver goods to process. Furthermore, processing centres require periodic maintenance from more than one agent. This yields a collective action problem with latent graph structure: each agent obtains individual reward by processing items, but the public good is best served when the agents occasionally cooperate to keep the supply chain intact.

For environmental interventions, we take the perspective of a system designer and ask: what mechanisms might we introduce to the world, and how do these affect the cooperation of agents? We not only vary the topology of the world, as in traditional network games, but also the geometry, maintenance cost, and agent specialization. In all cases we find an intricate interplay between incentive structure, multi-agent interaction and learnability which affects the nature of emergent cooperation. More precisely, we introduce a metric of care in order to understand these dynamics [51]. We find that reciprocal care is diminished when the maintenance burden is lower, and that reciprocity is promoted by training generalist agents that can operate any station in the supply chain, rather than specialists. We do not expect the conclusions we draw to have general applicability; rather we argue that this case study demonstrates the power and insight provided by our new method.
1.1 Related work
We are not the first to apply machine learning techniques to problems related to mechanism design. In auctions, the simulation and optimization of mechanisms has been explicitly performed via a variety of learning methods [6, 9, 10, 48, 49]. In the context of matrix game social dilemmas, [7] used multi-agent reinforcement learning to adaptively learn reward-giving mechanisms that promote cooperative behavior. [55] also takes the perspective of a social planner to directly optimize taxation policies for agents in a gridworld setting. Multi-agent reinforcement learning in particular has been applied to design mechanisms for abstract social networks [2]. Our work builds on this literature, extending the notion of a mechanism to include a wide variety of interventions that inform decision-making under learning: geometrical, topological and temporal. By virtue of this added complexity, we focus here on the simulation of learning agents under different mechanistic interventions, rather than the optimization of the mechanisms themselves.

There is also an extensive literature on multi-agent learning in supply chain problems. The archetypal abstraction is provided by the Beer game [17], an extensive form game that captures some dynamics of supply chain coordination. In [37], the authors treated the Beer game as a fully cooperative problem, and simulated the behavior of deep Q learners. Similarly, [42] and [26] investigate how reinforcement learning agents perform in network games that approximate supply chain dynamics. Our work is very much in the same spirit, albeit with three important differences. Firstly, we treat the supply chain as a collective action problem rather than as a fully cooperative one. As such, the emergence of cooperation is not a given; rather it is heavily influenced by the environmental and learning dynamics, as we shall see in Section 3. Secondly, our supply chain is embedded in a complex gridworld, allowing us to study the emergence of cooperation in great detail. Finally, our methods are sufficiently general to apply to other situations with underlying graph structure; the supply chain is a first example, rather than the primary application.
2 THE SUPPLY CHAIN ENVIRONMENT
The Supply Chain environment (Figure 1) is a 2-dimensional gridworld in which agents must maintain their own individual processing centers with the help of other agents, in order to process units passing through the supply chain. These units enter the environment via the source tile and travel through the supply chain until they reach the sink tile, where they are removed. Importantly, units stop next to each processing center and do not continue along the supply chain until they have been processed by that processing center's owner. This is achieved by the owning agent standing on their associated processing center tile, thereby processing the unit, which allows it to continue and yields reward for the owner. Processing centers occasionally break; a broken center can no longer process units until it is repaired. Repair can happen in one of two ways: (1) automatically, after a fixed number of timesteps known as the repair time (referred to as "self-repair"), or (2) manually, by two agents standing on both the processing center tile and the associated repair tile (referred to as "two-agent repair"). With the exception of the experiments performed in Section 3.2, we disable self-repair and consider only two-agent repair.

An episode in this environment lasts 1000 steps. On each step, there is a 10% chance of a unit entering the environment at each source tile, and all other units in the supply chain move once if the next space in the supply chain is unoccupied. If the next space is occupied by another unit, the moving unit is permanently discarded from the supply chain and therefore cannot be processed by any agent, leading to a lost opportunity to obtain reward.

This environment is a directed graph-structured collective action problem. The supply chain on which items flow is the spatio-temporal realization of an abstract network game. Collective action is required to maintain the processing centers. In particular, each agent would prefer that others took on the responsibility for fixing broken processing centers, since this comes at the opportunity cost of processing units themselves. However, if all agents refuse to cooperate, then they receive low group reward, since the processing centers break, and no units can be processed.

The spatio-temporal nature of the environment and the underlying topology of the supply chain admit a wide range of mechanistic manipulations. These manipulations interact with agent learning in intricate ways, significantly altering the equilibria at convergence. More precisely, depending on the environment properties we see different patterns of "care" between agents, understood in terms of help provided to others when repairing broken processing centers. In particular, under certain circumstances reciprocity [4, 8, 50] may arise naturally, promoted by the underlying graph. In Section 3, we manipulate the auto-repair of processing centers, the geometry of the processing center layout, and the topology of the supply chain, drawing conclusions about which mechanisms promote and suppress the emergence of care.

Figure 1: The Supply Chain environment, visualized with a circular layout. (a) The state of the environment at the start of an episode. (b) An example state mid-episode. The sprite representations in the figure are for clarity of human interpretation; agents instead receive 13x13 pixel observations where each type of object has a unique RGB color.
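To make the unit dynamics concrete, the following is a minimal sketch of one environment step for a linear chain, collapsing the tiles between centers into a single slot per processing center; the class and parameter names are our own rather than the paper's implementation, and agent movement, breakage and repair are omitted.

```python
import random

class SupplyChainUnits:
    """Simplified sketch of unit flow along a linear supply chain.

    slots[k] is True when a unit is waiting at position k; position 0 is
    adjacent to the source tile and the last position feeds the sink.
    """

    def __init__(self, num_positions, spawn_prob=0.1):
        self.slots = [False] * num_positions
        self.spawn_prob = spawn_prob  # 10% chance per step that a unit enters

    def step(self, processed):
        """Advance the chain one timestep.

        processed[k] is True when the unit at position k was processed
        this step (its owner stood on the processing center tile) and
        may therefore move on.
        """
        # Iterate from the sink backwards so that a unit can step into a
        # slot vacated on the same timestep.
        for k in reversed(range(len(self.slots))):
            if not self.slots[k] or not processed[k]:
                continue
            if k == len(self.slots) - 1:
                self.slots[k] = False  # unit leaves via the sink tile
            elif not self.slots[k + 1]:
                self.slots[k], self.slots[k + 1] = False, True
            else:
                self.slots[k] = False  # next space occupied: unit discarded
        # A new unit enters at the source with fixed probability
        # (simplification: spawning is skipped when the entry slot is full).
        if random.random() < self.spawn_prob and not self.slots[0]:
            self.slots[0] = True
```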
2.1 Social outcome metrics
Consider an instance of the Supply Chain game with episode length $T$ and $N$ agents, uniquely assigned to $N$ processing centers such that agent $i$ always processes units at center $i$. Each supply chain has an underlying directed graph structure $G = (V, E)$ with a set $V$ of $N$ vertices, and a set $E$ of ordered pairs of centers which are the directed edges that make up the supply chain itself. Each vertex without incoming edges is a center that receives units from a source tile, while each vertex without outgoing edges is a center from which units flow towards a sink tile. For each center, we use $\mathrm{UP}_G(i)$ to denote the set of vertices that are upstream from $i$, i.e. from which a unit could flow to $i$, while $\mathrm{DO}_G(i)$ denotes the set of centers in $G$ that are downstream from $i$. Let $p_{it}$, $b_{it}$ and $r_{ijt}$ be binary variables that specify if, at time $t$, agent $i$ respectively processes a unit, breaks its processing center, or repairs the processing center of agent $j$. In turn, $R_i$, $B_i$, and $C_{ij}$ are the number of processed units (reward), breakages, and repairs aggregated over one episode.

Building on these definitions, we introduce a set of metrics that help analyze social outcomes:

• The care matrix ($C$) with elements $C_{ij}$ tracks the care (repairs) each agent has received from each other agent, relative to the total number of breakages $\sum_j B_j$.

• The care reciprocity ($\rho$) measures how symmetric the care matrix is:
$$\rho = \frac{\|C_{sym}\| - \|C_{anti}\|}{\|C_{sym}\| + \|C_{anti}\|}, \quad (1)$$
where $C_{sym} = \frac{1}{2}(C + C^T)$ and $C_{anti} = \frac{1}{2}(C - C^T)$, so that $\rho = 1$ for fully symmetric care. Since $C$ is non-negative, $\rho \geq 0$.

• The average care direction ($D$) measures whether care is directed entirely upstream ($D = 1$), entirely downstream ($D = -1$), or in between ($-1 < D < 1$):
$$D = \frac{1}{\sum_{ij} C_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( [j \in \mathrm{UP}_G(i)] - [j \in \mathrm{DO}_G(i)] \right) C_{ij}. \quad (2)$$
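These metrics are simple functions of the episode-aggregated care matrix and the graph. Below is a minimal sketch, assuming NumPy, a care matrix C whose entry C[i, j] counts repairs given by agent i to agent j, and the supply chain given as a list of directed edges; the function names are our own. Note that both metrics are invariant to the overall normalization of C, so it does not matter whether C is expressed in raw repair counts or relative to the total number of breakages.

```python
import numpy as np

def care_reciprocity(C):
    """Care reciprocity rho (Eq. 1): 1 for fully symmetric care.
    Since C is elementwise non-negative, rho >= 0."""
    C_sym = 0.5 * (C + C.T)
    C_anti = 0.5 * (C - C.T)
    s, a = np.linalg.norm(C_sym), np.linalg.norm(C_anti)
    return (s - a) / (s + a)

def upstream_sets(edges, n):
    """UP_G(i): all centers from which a unit could flow to center i.
    Computed by propagating direct predecessors to a fixed point."""
    up = [set() for _ in range(n)]
    changed = True
    while changed:
        changed = False
        for (u, v) in edges:  # edge u -> v: units flow from u to v
            new = {u} | up[u]
            if not new <= up[v]:
                up[v] |= new
                changed = True
    return up

def care_direction(C, edges):
    """Average care direction D (Eq. 2): +1 if all care flows upstream,
    -1 if all care flows downstream."""
    n = C.shape[0]
    up = upstream_sets(edges, n)
    # j is downstream of i exactly when i is upstream of j.
    down = [set(j for j in range(n) if i in up[j]) for i in range(n)]
    signed = sum(((j in up[i]) - (j in down[i])) * C[i, j]
                 for i in range(n) for j in range(n))
    return signed / C.sum()
```

For example, for a four-center chain with edges [(0, 1), (1, 2), (2, 3)], a care matrix in which agent 1 repairs only center 0 gives D = 1 (purely upstream care) and rho = 0, while a symmetric care matrix between agents 0 and 1 gives rho = 1 and D = 0.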
2.2 Agents
We train agents using the advantage actor-critic (A3C) [34] learning algorithm, using 400 parallel environments to generate experience for each learner. Episodes contain 4 agents, which are sampled without replacement from a population of 8 and assigned to random processing centers in the environment [23]. (We also run an experiment where agents are instead assigned to the same processing center for the course of training; see Section 3.3.) Every agent uses their own neural network and is trained by receiving importance-weighted policy updates [12].

For each observation, consisting of 13x13 RGB pixels, the neural network architecture computes a policy (probability per action) and a value (for each observation). It consists of a visual encoder, which projects to a 2-layer fully connected network, followed by an LSTM, which in turn projects via another linear map to the policy and value outputs. The visual encoder is a 2D convolutional neural network with one layer of 6 channels, with a kernel and stride size of 1. The fully connected network has 64 ReLU neurons in each layer. The LSTM has 128 units. The action space corresponds to five actions: moving up, moving down, moving left, moving right, or waiting one timestep. The actions themselves naturally generalize to other gridworld environments.
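As a concrete illustration, here is a minimal PyTorch sketch of the described network; the module is our own reconstruction rather than the paper's implementation, initialisation details are left at library defaults, and the auxiliary-loss head described below is omitted.

```python
import torch
import torch.nn as nn

class AgentNet(nn.Module):
    """Sketch of the described actor-critic network: a 1-layer conv
    encoder, two 64-unit ReLU layers, a 128-unit LSTM, and linear
    policy/value heads."""

    def __init__(self, num_actions=5):
        super().__init__()
        # 13x13 RGB observation in, 6 channels out, kernel and stride 1.
        self.encoder = nn.Conv2d(3, 6, kernel_size=1, stride=1)
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(6 * 13 * 13, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.policy = nn.Linear(128, num_actions)  # up/down/left/right/wait
        self.value = nn.Linear(128, 1)

    def forward(self, obs, state=None):
        # obs: [batch, time, 3, 13, 13]
        b, t = obs.shape[:2]
        z = torch.relu(self.encoder(obs.flatten(0, 1)))
        z = self.mlp(z).view(b, t, -1)
        h, state = self.lstm(z, state)
        return self.policy(h), self.value(h).squeeze(-1), state
```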
For agent learning, we use a discount factor of 0.99, a batch size of 16, and an unroll length of 100 for backpropagation through time, and the policy logits are regularised with an entropy bonus. The optimiser uses momentum 0.0 and decay 0.99. The agent also minimizes a contrastive predictive coding loss in the manner of an auxiliary objective.
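Gathered into one place, the stated training configuration might look as follows; this is a sketch under our own naming, and values not given above (such as the learning rate and entropy weight) are deliberately omitted rather than guessed.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Hyperparameters as stated in the text; unspecified values
    (e.g. learning rate, entropy weight) are intentionally left out."""
    discount: float = 0.99         # reward discount factor
    batch_size: int = 16
    unroll_length: int = 100       # steps for backpropagation through time
    optimizer_momentum: float = 0.0
    optimizer_decay: float = 0.99
    num_parallel_envs: int = 400   # environments feeding each learner
    population_size: int = 8       # agents in the population
    agents_per_episode: int = 4    # sampled without replacement
    episode_length: int = 1000
```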
3 RESULTS
In this section, we study how different environmental interventions influence social outcomes. In the Supply Chain environment, we analyze the learning dynamics and the emergence of reciprocal care. At convergence, we study the effect of changing if and how fast processing centers can repair themselves autonomously, and we increase the inter-center distance to study the influence of geometry. Finally, we discuss how subtle changes in the environment's underlying graph structure can drastically change social outcomes. Unless specified otherwise, the experiments are performed using the environment in Figure 1.
3.1 Emergence of care
We first consider the case in which self-repair has been turned off and processing centers can only be repaired by other agents. Agents need to cooperate with other agents to ensure their processing centers are repaired. However, learning to care for others only benefits agents indirectly and is thus a more complex behavior to learn than simply processing units. Learning to care happens in distinct phases, each characterized by different behaviors and social outcomes. Each phase features a rather abrupt change in the individual reward received: once an agent has found an improved strategy via exploration, it is quickly able to exploit this, shifting the equilibrium dramatically. Accordingly, each phase can last shorter or longer depending on when the agents "discover" the new behavior. We therefore analyze the outcomes after each phase, similarly to [5, 40], and first analyze one typical run before analyzing the averages across multiple runs. A second typical run can be found in the Appendix. The behaviors observed in these two runs are archetypal for the behaviors that are observed in all eight runs.

In Figure 2, we observe the individual rewards, the total (unnormalized) care per agent $C_i = \sum_j C_{ij}$, and the care reciprocity. The dashed lines represent the phase transitions between 4 distinct training phases until convergence. The full care matrix after each phase can be found in Figure 3. Phase 1 begins at the start of training. It is characterized by agents learning how to navigate the environment, process units and explore upstream caring (indicated by the non-zero values in the upper-right triangle of Figure 3a). Agents have yet to learn when repairing is beneficial, and we thus observe the highest average reward for the agent that is closest to the source and a progressively lower reward for agents further downstream, since units can only reach these agents if all centers upstream are not broken. Phase 2 is characterized by the first examples of consistent care. The second agent (and to a lesser extent the third and fourth) learns that it can earn more reward by repairing the first processing center. In phase 3, agent 2 learns that repairing center 1 only results in more reward when it can process units at its own center. To keep this incentive for agent 2, agent 1 thus learns to reciprocate the care received by agent 2. This leads to a sudden increase in care reciprocity and a drastic increase in reward for agent 2. At the same time, agent 4 learns to help agent 3 in order to receive more reward, similar to what we observed for the first two agents in phase 2. Finally, during phase 4, reciprocity emerges between agents 3 and 4 as agent 3 learns that agent 4 only repairs when its own processing center is fixed. Only in phase 4 are there an appreciable number of units delivered to the sink: note that this quantity is equal to the reward received by agent 4.

In Figure 4, we show the same social outcome metrics but now averaged across 8 runs. Due to the abrupt nature of the learning process, we no longer observe the clear phase transitions that we observed in a single run, and we observe a high relative variance for all the metrics. The latter is especially clear for the care reciprocity, as the amount of reciprocal care differs not only during learning but also at convergence.
Naturally, as receiving care is necessary to earn any additional reward beyond the trivial reward that can be earned before the first processing centers break, we observe a strong correlation between the group reward and the total care.

In a sense, it is surprising that reciprocity emerges in this environment. Naïvely, one might expect that the collective action problem is dominated by selfish incentives, and that an explicit model-based intervention would be required to solve the social dilemma, as in [11, 41]. However, in our case, the underlying graph structure organizes agent interactions in such a way as to promote the emergence of reciprocity. This is exactly in line with previous work in abstract network games [35], extending it to a setting where we can examine mechanism design in detail.

Reciprocity, in the sense of tit-for-tat, requires that I repair your processing center if and only if you do the same for me. This is a complex, conditional and fundamentally temporal strategy. We perform an experiment to check that our agents have learned reciprocity in this strong sense. In Figure 5b, we examine the care matrix while evaluating, after convergence, three "caring" agents and one "selfish" agent in a single environment. The caring agents, assigned to centers 1, 2 and 4, were trained in an environment where self-repair is turned off, and thus learned to care, while agent 3 was trained in an environment with fast self-repair (repair time = 10) and thus has never learned to care for other agents (see Section 3.2). Interestingly, not only does selfish agent 3 stop caring for agent 4 but, without the care being reciprocated, agent 4 also stops caring for agent 3. We find that, instead of agents learning to care unconditionally, they learn that giving care is only beneficial if it is reciprocated and thus if their own processing center is also being cared for.
3.2 Self-repair
Our first mechanistic intervention is to enable self-repair. Naïvely, one might believe that this will cause a uniform uplift in the group reward of the agents, without diminishing the cooperative behavior. Indeed, a supply chain manager, faced with the decision whether to invest in such technology, may well perceive this to be without downsides. However, the dynamics of emergent reciprocity under learning are clearly intricate. As we can see in Figure 6, the introduction of self-repair undermines these dynamics, leading to a significant reduction in care reciprocity, with no significant improvement in group reward. Moreover, we observe that the care that does still occur is predominantly directed upstream, rather than the more balanced behavior we observe when self-repair is disabled (Figure 6c).
Figure 2: Evolution of social outcome metrics over the course of training on a circular chain with four agents and self-repair disabled: (a) individual reward per agent, stacked for group reward; (b) care given per agent, stacked for total care given; (c) care reciprocity. Please see Figure 3 for the care matrices at the end of each phase.
Figure 3: Care matrices at the end of the four distinct learning phases: (a) end of phase 1; (b) end of phase 2; (c) end of phase 3; (d) end of phase 4. For improved readability, values below 0.01 are omitted.
Figure 4: Evolution of different social outcome metrics over the course of training, averaged across 8 runs with different random seeds: (a) group reward; (b) total care; (c) care reciprocity. The shaded regions represent confidence intervals.

This provides a first illustration of the practical benefits of our approach. It is not obvious how to use a network game model to predict the effect of introducing self-repair. By modelling the emergence of care with deep reinforcement learning in a more concrete environment, we can draw nuanced conclusions from grounded interventions.
3.3 Agent specialization
At the start of each episode, the supply chain is populated by four agents. There are several possible strategies for assigning agents from a finite-sized population to each of the four processing centers. So far, we have randomly assigned agents to positions at the start of each episode.
Table 1: Individual rewards $R_1$ to $R_4$, group reward $\sum_i R_i$, and system efficiency for the environments visualized in Figure 9, with one row per environment (Figures 9a, 9b and 9c). The efficiency corresponds to the percentage of units that leave the environment through a sink node relative to the number of units that enter the environment through a source node. [The numerical entries of this table are not recoverable from the extracted text.]
Figure 5: Reciprocal care for agent 4 is not observed when agent 3 is replaced by a selfish agent at test time. Compare the care matrix with Figure 3d for an evaluation without replacing agent 3 with a "selfish" agent.

Thus the agents must learn policies which generalize across positions. Alternatively, we could encourage specialization by always assigning the same agents to the same positions in the supply chain, as sketched below. This would eliminate the need to learn a single policy that generalizes across positions, and may speed up learning. As such, it is an appealing intervention from the perspective of a mechanism designer.

However, as we see in Figure 8, agents only learn to care when they are assigned to different positions during training. In other words, specialization undermines the learning dynamics that beget caring. In effect, specialization leads to overfitting. Agents always appear in the same positions and learn that they can receive some reward by standing on their activation tile. This is a local optimum, for leaving the activation tile has an immediate opportunity cost for processing further units. What's more, there is little benefit in repairing another agent's processing center, unless someone is likely to repair yours: this is the root of the collective action problem. Thus discovering the benefits of care is a hard joint exploration problem. Diversity in role experience helps to solve this problem, at the cost of longer learning times. The link between random role assignment and the selection of just policies has also been studied in political theory, most famously through the "veil of ignorance" thought experiment developed by John Rawls [43]. This connection represents a fruitful area for future research [18].
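The two assignment schemes compared in this section differ only in how agents are mapped to centers at the start of each episode; a sketch under our own naming follows, where `specialize=True` corresponds to one possible fixed-assignment intervention.

```python
import random

def assign_roles(population_ids, num_centers=4, specialize=False, rng=random):
    """Map agents to processing centers at the start of an episode.

    With specialize=False (the scheme used in the main experiments),
    num_centers of the agents are sampled without replacement from the
    population and placed at random centers, so every agent must learn a
    policy that generalizes across positions. With specialize=True, the
    same agents always occupy the same centers (one fixed scheme; the
    paper does not specify the exact mapping).
    """
    if specialize:
        return {center: population_ids[center] for center in range(num_centers)}
    sampled = rng.sample(population_ids, num_centers)
    return {center: agent for center, agent in enumerate(sampled)}

# Example: 4 of 8 agents placed at random centers for one episode.
roles = assign_roles(list(range(8)), num_centers=4)
```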
3.4 Geometry
In contrast to a game-theoretic analysis of the network structure, multi-agent reinforcement learning provides tools for investigating the impact of geometric changes to the environment that are not captured in the abstract graph.

To demonstrate this, we vary the distance between processing centers and inspect the group reward, care reciprocity and care direction at convergence (Figure 7). To have more fine-grained control over the inter-center distance, we use a linear chain rather than the circular chain presented in Figure 1 (see the Appendix for a visualization of the environment). First, we observe that the group reward decreases as the inter-center distance increases (Figure 7a). Interestingly, however, we observe that changing the geometry not only affects the overall reward but also changes the dynamics of care. The longer distance increases the effective "cost" of caring, which makes the reciprocal nature of care more important (Figure 7b), while no significant impact is observed on the care direction (Figure 7c).

In summary, the dynamics of care and the geometry of the environment are intertwined in a non-trivial way. Our model therefore highlights a tension which is invisible in standard network games: namely, the trade-off between encouraging caring dynamics and increasing group reward. In real-world systems involving a mixture of humans and agents, this trade-off is vitally important: we would not want to simply optimize for group reward at the expense of cooperation [19]. Using our approach, one can design mechanisms for the geographical layout of network systems to balance social goods.
3.5 Topology
Thus far, we have analyzed the social outcomes in supply chains where the underlying abstract graph structure is linear. In practice, networks often have more complex topologies. To investigate the effect of topology on social outcomes, we introduce three environments with the same number of agents but different underlying graphs that govern how units flow in the supply chain; see Figure 9. In Table 1, we show the individual and cumulative rewards in each environment. In all three environments, agent 1 accumulates the most reward on average, because all other agents depend on this processing center for reward. Rewards for the other agents depend strongly on the topology. In the first environment (Figure 9a), the supply chain branches out after center 1, with each branch randomly receiving half of the units. In the top branch, agent 2 earns relatively little reward, as no other agent has a strong incentive to care for its processing center. In the bottom branch, however, agents 3 and 4 both earn close to half of the reward of the first agent, as they each have an incentive to care for each other. Agent 4 does so because agent 3 is upstream, while agent 3 has an incentive to reciprocate that care to ensure that agent 4 keeps repairing.
Figure 6: Social outcome metrics for different values of the repair time, averaged over different runs, where repair time = ∞ denotes that self-repair has been disabled: (a) group reward; (b) care reciprocity; (c) care direction.

Figure 7: Influence of geometry on social outcome metrics: (a) group reward; (b) care reciprocity; (c) care direction. Increasing the inter-center distance results in lower group reward (a) and higher care reciprocity (b). Metrics are averaged over 8 runs with confidence intervals.

Figure 8: Group rewards during training, averaged across random seeds, for agents that are assigned to random processing centers during training versus agents that are assigned to fixed centers.

The importance of reciprocity is also observed in the social outcomes of the second supply chain (Figure 9b). Here the chain branches out after two agents, and those two agents thus depend on each other for care. By contrast, the last two agents are in different branches of the tree, so they are not interdependent and thus fail to form a reciprocal pair.

Finally, in the third supply chain (Figure 9c), the chain branches out after the first agent but reconnects again before the last agent. Here learning is more unstable. Agents 2, 3 and 4 all earn some reward, but the amount of collected reward varies strongly between different runs (see Table 1). This is because there are multiple stable social outcomes that the system randomly converges to (see the Appendix for care matrices of the individual runs). Averaged across all 8 runs, we observe the highest care from agents 2 and 3 to agent 1, which is intuitively sensible since they both depend directly on agent 1 for reward. In turn, we observe that some of this help is reciprocated by agent 1. Finally, we see that though the last agent is able to establish some reciprocal care with agents 2 and 3, it receives the least reward.

A mechanism designer may be interested in creating a supply chain that maximizes the number of units that are successfully processed by the system as a whole, rather than maximizing the sum of individual rewards. Hence, we compute the efficiency of the system (see Table 1), defined as the number of units that leave the system at any of the sink nodes relative to the number of units that enter the system. Interestingly, in these complex topologies, the system efficiency does not precisely correlate with group reward. Indeed, we observe the best overall system efficiency in environment 1. Though environment 2 yields the highest group reward, most units are discarded before they reach the sink tiles, resulting in a low number of units processed by the system overall. This highlights the value of conducting detailed simulations admitting multiple metrics.

Figure 9: Three environments with 4 agents and the corresponding care matrices: (a) environment 1; (b) environment 2; (c) environment 3. The care matrices are averages over 8 different seeds. For improved readability, values below 0.01 are omitted.
4 DISCUSSION
Games with similar topological structures have been considered in the cooperative game theory literature [30, 32]. Cost-tree games, also called irrigation games, are a class of transferable utility network games where each edge of a directed tree has an associated cost (e.g. for maintenance) that should be shared across the users (e.g. farmers in an irrigation system). The Shapley solution [45] predicts that the costs corresponding to each edge are shared equally by all users that depend on that edge. This is at odds with the care reciprocity we observe in our experiments. In our spatio-temporally complex setting, the assumption of transferable utility is too reductive. Our method allows us to extend beyond such assumptions by way of empirical game theory [52], reaching new conclusions.

There are several natural avenues for extending our contribution. Firstly, it would be valuable to extend the dynamics of our environment to incorporate more detail of supply chains, along the lines of the Beer game. Secondly, it would be fruitful to explore the consequences of our methods in other complex environments with graph structure, such as 2- or 3-dimensional realizations of an irrigation system. This may pave the way for a wider class of metrics, and allow us to investigate a broader range of mechanisms that promote cooperative behavior. Along the same lines, it would be useful to extend our analysis to larger numbers of agents in interaction, where one might be able to observe phase transition-like effects under learning more clearly.

More broadly, our work raises several questions that may spur future investigation in this field. How can we automate the design of mechanisms that depend on structural changes to the world? This question intersects with the lively field of procedural generation for reinforcement learning environments [13, 23, 25]. It also relates to recent work on learning to teach [14, 36, 46] and learning to incentivize [31, 54, 55]. A complementary direction would involve validating our model by comparison with experiments involving human participants: to what extent are the conclusions we draw about mechanistic interventions borne out in the behavior of humans? Tackling such questions in progressively greater detail may yield important insights into the management of real-world collective action problems in the years to come.
ACKNOWLEDGEMENTS
We would like to thank Theophane Weber, Kevin McKee and many other colleagues at DeepMind for useful discussions and feedback on this work.
REFERENCES
[1] Richa Agarwal and Özlem Ergun. 2008. Mechanism design for a multicommodity flow game in service network alliances. Operations Research Letters 36, 5 (2008), 520–524. https://doi.org/10.1016/j.orl.2008.04.007
[2] John Ahlgren, Maria Eugenia Berezin, Kinga Bojarczuk, Elena Dulskyte, Inna Dvortsova, Johann George, Natalija Gucevska, Mark Harman, Ralf Lämmel, Erik Meijer, Silvia Sapora, and Justin Spahr-Summers. 2020. WES: Agent-based User Interaction Simulation on Real Infrastructure. arXiv:2004.05363 [cs.SE]
[3] Nicolas Anastassacos, Steve Hailes, and Mirco Musolesi. 2019. Understanding The Impact of Partner Choice on Cooperation and Social Norms by means of Multi-agent Reinforcement Learning. CoRR abs/1902.03185 (2019). arXiv:1902.03185 http://arxiv.org/abs/1902.03185
[4] Robert Axelrod and William D. Hamilton. 1981. The evolution of cooperation. Science 211, 4489 (1981), 1390–1396.
[5] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. 2019. Emergent Tool Use From Multi-Agent Autocurricula. arXiv preprint arXiv:1909.07528 (2019).
[6] Maria-Florina Balcan, Avrim Blum, Jason D. Hartline, and Yishay Mansour. 2005. Mechanism design via machine learning. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05). 605–614.
[7] Tobias Baumann, Thore Graepel, and John Shawe-Taylor. 2018. Adaptive Mechanism Design: Learning to Promote Cooperation. CoRR abs/1806.04067 (2018). arXiv:1806.04067 http://arxiv.org/abs/1806.04067
[8] Anatol Rapoport, Albert M. Chammah, and Carol J. Orwant. 1965. Prisoner's Dilemma: A Study in Conflict and Cooperation. University of Michigan Press.
[9] Vincent Conitzer and Tuomas Sandholm. 2002. Complexity of mechanism design. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. 103–110.
[10] Paul Dütting, Zhe Feng, Harikrishna Narasimhan, and David C. Parkes. 2017. Optimal Auctions through Deep Learning. CoRR abs/1706.03459 (2017). arXiv:1706.03459 http://arxiv.org/abs/1706.03459
[11] Tom Eccles, Edward Hughes, János Kramár, Steven Wheelwright, and Joel Z. Leibo. 2019. Learning Reciprocity in Complex Sequential Social Dilemmas. arXiv:1903.08082 [cs.MA]
[12] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. 2018. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In International Conference on Machine Learning. 1407–1416.
[13] Richard Everett, Adam Cobb, Andrew Markham, and Stephen Roberts. 2019. Optimising Worlds to Evaluate and Influence Reinforcement Learning Agents. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (Montreal QC, Canada) (AAMAS '19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1943–1945.
[14] Anestis Fachantidis, Matthew E. Taylor, and Ioannis P. Vlahavas. 2017. Learning to Teach Reinforcement Learning Agents. CoRR abs/1707.09079 (2017). arXiv:1707.09079 http://arxiv.org/abs/1707.09079
[15] Ernst Fehr and Klaus M. Schmidt. 1999. A Theory of Fairness, Competition, and Cooperation. The Quarterly Journal of Economics 114, 3 (1999), 817–868.
[16] Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. 2017. Learning with Opponent-Learning Awareness. CoRR abs/1709.04326 (2017). arXiv:1709.04326 http://arxiv.org/abs/1709.04326
[17] Jay W. Forrester. 1958. Industrial Dynamics: A major breakthrough for decision makers. Harvard Business Review 36, 4 (1958), 37–66.
[18] Iason Gabriel. 2020. Artificial Intelligence, Values, and Alignment. Minds and Machines 30, 3 (2020), 411–437. https://doi.org/10.1007/s11023-020-09539-2
[19] Virginia Held. 2006. The Ethics of Care: Personal, Political, and Global. Oxford University Press.
[20] Edward Hughes, Joel Z. Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. 2018. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems. 3326–3336.
[21] Matthew O. Jackson. 2014. Mechanism theory. Available at SSRN 2542983 (2014).
[22] Matthew O. Jackson and Yves Zenou. 2015. Chapter 3 - Games on Networks. In Handbook of Game Theory with Economic Applications, Vol. 4. Elsevier, 95–163. https://doi.org/10.1016/B978-0-444-53766-9.00003-3
[23] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, et al. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364, 6443 (2019), 859–865.
[24] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z. Leibo, and Nando De Freitas. 2019. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning. In International Conference on Machine Learning. PMLR, 3040–3049.
[25] Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. 2018. Procedural Level Generation Improves Generality of Deep Reinforcement Learning. NeurIPS Deep RL Workshop 2018 (2018). arXiv:1806.10729 http://arxiv.org/abs/1806.10729
[26] Lukas Kemmer, Henrik von Kleist, Diego de Rochebouët, Nikolaos Tziortziotis, and Jesse Read. 2018. Reinforcement learning for supply chain optimization. In European Workshop on Reinforcement Learning 14. 1–9.
[27] Paul Klemperer. 2002. What Really Matters in Auction Design. The Journal of Economic Perspectives 16, 1 (2002), 169–189.
[28] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. 464–473.
[29] Adam Lerer and Alexander Peysakhovich. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. CoRR abs/1707.01068 (2017). arXiv:1707.01068 http://arxiv.org/abs/1707.01068
[30] Stephen C. Littlechild and Guillermo Owen. 1973. A simple expression for the Shapley value in a special case. Management Science 20, 3 (1973), 370–372.
[31] Andrei Lupu and Doina Precup. 2020. Gifting in Multi-Agent Reinforcement Learning. In Proceedings of the 19th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 789–797.
[32] Judit Márkus, Péter Miklós Pintér, and Anna Radványi. 2012. The Shapley value for airport and irrigation games. Technical Report. IEHAS Discussion Papers.
[33] Eric Maskin. 2019. Introduction to mechanism design and implementation. Transnational Corporations Review 11, 1 (2019), 1–6. https://doi.org/10.1080/19186444.2019.1591087
[34] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
[35] Hisashi Ohtsuki and Martin A. Nowak. 2007. Direct reciprocity on graphs. Journal of Theoretical Biology 247, 3 (2007), 462–470.
[36] Shayegan Omidshafiei, Dong-Ki Kim, Miao Liu, Gerald Tesauro, Matthew Riemer, Christopher Amato, Murray Campbell, and Jonathan P. How. 2019. Learning to Teach in Cooperative Multiagent Reinforcement Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6128–6136.
[37] Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence V. Snyder, and Martin Takác. 2017. A Deep Q-Network for the Beer Game with Partial Information. CoRR abs/1708.05924 (2017). arXiv:1708.05924 http://arxiv.org/abs/1708.05924
[38] Elinor Ostrom. 1990. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press. https://doi.org/10.1017/CBO9780511807763
[39] Elinor Ostrom. 2009. A General Framework for Analyzing Sustainability of Social-Ecological Systems. Science 325, 5939 (2009), 419–422.
[40] Julien Pérolat, Joel Z. Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. 2017. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems. 3643–3652.
[41] Alexander Peysakhovich and Adam Lerer. 2017. Consequentialist conditional cooperation in social dilemmas with imperfect information. (10 2017).
[42] Pierpaolo Pontrandolfo, Abhijit Gosavi, O. Okogbaa, and Tapas Das. 2002. Global supply chain management: A reinforcement learning approach. International Journal of Production Research 40 (2002). https://doi.org/10.1080/00207540110118640
[43] John Rawls. 2009. A Theory of Justice. Harvard University Press.
[44] Wilko Schwarting, Alyssa Pierson, Javier Alonso-Mora, Sertac Karaman, and Daniela Rus. 2019. Social behavior for autonomous vehicles. Proceedings of the National Academy of Sciences 116, 50 (2019), 24972–24978.
[45] Lloyd S. Shapley. 1953. A value for n-person games. Contributions to the Theory of Games 2, 28 (1953), 307–317.
[46] CoRR abs/1810.00147 (2018). arXiv:1810.00147 http://arxiv.org/abs/1810.00147
[47] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. 2018. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 2085–2087.
[48] Andrea Tacchetti, DJ Strouse, Marta Garnelo, Thore Graepel, and Yoram Bachrach. 2019. A Neural Architecture for Designing Truthful and Efficient Auctions. CoRR abs/1907.05181 (2019). arXiv:1907.05181 http://arxiv.org/abs/1907.05181
[49] Pingzhong Tang. 2017. Reinforcement mechanism design. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 5146–5150.
[50] Robert Trivers. 1971. The Evolution of Reciprocal Altruism. Quarterly Review of Biology 46 (1971), 35–57. https://doi.org/10.1086/406755
[51] Joan C. Tronto. 1993. Moral Boundaries: A Political Argument for an Ethic of Care. Routledge.
[52] Michael P. Wellman. 2006. Methods for empirical game-theoretic analysis. In Proceedings of the 21st National Conference on Artificial Intelligence, Volume 2. 1552–1555.
[53] David H. Wolpert and Kagan Tumer. 2002. Collective intelligence, data routing and Braess' paradox. Journal of Artificial Intelligence Research 16 (2002), 359–387.
[54] Jiachen Yang, Ang Li, Mehrdad Farajtabar, Peter Sunehag, Edward Hughes, and Hongyuan Zha. 2020. Learning to Incentivize Other Learning Agents. In AAMAS 2020 - Adaptive and Learning Agents (ALA) Workshop.
[55] Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard Socher. 2020. The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies. arXiv:2004.13332 [econ.GN]
A ENVIRONMENT
A.1 Linear supply chains
In Section 3.4 in the main paper, where we vary the inter-center distance, we use a linear version of the supply chain instead of the canonical circular one shown in Figure 1. Examples of this type of environment for three different inter-center distances can be found in Figure A1. Note that only the geometry differs between the linear and circular environments, while the topology remains the same (a chain with four nodes).
B ADDITIONAL RESULTS
B.1 Emergence of care
In Section 3.1 in the main text, we discuss the learning dynamics and the emergence of care for a single training run. However, due to the abrupt nature of learning, there are large differences between learning dynamics across individual runs with different random seeds. In Figure A2, we observe the learning curves for a second typical run, while the care matrices at the end of each learning phase can be found in Figure A3. In phase 1, we observe that, in contrast to the run highlighted in Section 3.1, it is now the last agent that first learns to care for the first agent. In the second phase, this behavior is also adopted by the second and the third agent, resulting in a similar care matrix to what we observed at the end of the second phase in the other example run. A large difference between both runs, however, emerges during the third phase. While we previously observed a large increase in reciprocity during this phase because the first agent learned to reciprocate the care given by the second agent, we now observe that the second agent is instead receiving care from the third agent (and, to a lesser extent, the third agent from the last agent). Consistent reciprocal care in this run only emerges during the last phase, when the third agent learns to reciprocate the care of the fourth agent. From the perspective of a mechanism designer, it is important to understand which different stable outcomes exist and, if necessary, implement geometric, topological or functional changes to the environment to ensure that the system converges to a desired outcome.

B.2 Care matrices of individual runs for one T-junction supply chain
In the third tree-like supply chain that we discussed in Section 3.5 in the main paper, we observed a high variance in the individual rewards for agents 2, 3 and 4 (see Table 1) because there are multiple stable outcomes that the system randomly converges to. We show these outcomes in terms of care matrices for eight individual runs (random seeds) in Figure A4. Note first that there are indeed large differences between the care matrices across individual runs. Moreover, we see in most care matrices (Figures A4d to A4h) that, despite the fact that different agents are involved, similar patterns of reciprocal care emerge as we observed in other experiments. Finally, we observe that only in the last run (Figure A4h) is a pair formed between the first agent and both agents 2 and 3, while for the other runs pairing happens with only one of these center agents.
Figure A1: Examples of the linear layout at varying inter-center distances: (a) d = 2; (b) d = 5; (c) d = 7.
Figure A2: Evolution of social outcome metrics over the course of training on a circular chain with four agents and self-repair disabled: (a) individual reward per agent, stacked for group reward; (b) care given per agent, stacked for total care given; (c) care reciprocity. Please see Figure A3 for the care matrices at the end of each phase.
Figure A3: Care matrices at the end of the four distinct learning phases corresponding to Figure A2: (a) end of phase 1; (b) end of phase 2; (c) end of phase 3; (d) end of phase 4. For improved readability, values below 0.01 are omitted.

Figure A4: Care matrices of the eight individual runs (random seeds) for the T-junction supply chain: (a) through (h) show the care matrices for runs 1 to 8.